This project analyzes a sample from the Hacker News data set. Hacker News is a website where users submit posts that are voted and commented upon, similar to Reddit.
The table below provides a brief description of each of the columns.
Column Name | Description |
---|---|
id | The unique identifier from Hacker News for a post |
title | The title of the post |
url | The URL that the post links to, if it has one |
num_points | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
num_comments | The number of comments on the post |
author | The username of the author of the post |
created_at | The date and time at which the post was submitted |
# Reading the file as a list of lists
from csv import reader
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
# Function to display rows (stops at end_index - 1)
def explore_data(dataset, start_index, end_index):
    for index in range(start_index, end_index):
        print(dataset[index])
        print("\n")
# First 6 rows, including the header
explore_data(hn, 0, 6)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
# Separating the column headers from the data set
headers = hn[0]
# Removing the headers from the data set
hn = hn[1:]
print("The headers are \n", headers, "\n")
print("Checking if the header row was removed\n")
explore_data(hn, 0, 5)
The headers are
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
Checking if the header row was removed
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
We will focus on posts whose titles start with either `Ask HN` or `Show HN`. `Ask HN` posts are those submitted by users to ask the Hacker News community a specific question. Likewise, `Show HN` posts are those where users want to show a project or a product to the community.
We will compare these two types of posts to answer the following questions:
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
**Filtering by the type of posts**
For this purpose, we first filter the rows in the data set according to the type of posts. The categories will be `ask_posts` and `show_posts`. We will assign `other_posts` to the remaining types of posts.
ask_posts = []
show_posts = []
other_posts = []

for each_row in hn:
    title = each_row[1]
    # startswith() is case-sensitive,
    # so convert the title to lowercase first
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(each_row)
    elif title.startswith('show hn'):
        show_posts.append(each_row)
    else:
        other_posts.append(each_row)
# Function to print the number of posts in a list
def n_posts(str_posts, list_name):
    print("The number of posts in", str_posts, "is", len(list_name))

n_posts("ask_posts", ask_posts)
n_posts("show_posts", show_posts)
n_posts("other_posts", other_posts)

explore_data(ask_posts, 0, 5)
The number of posts in ask_posts is 1744
The number of posts in show_posts is 1162
The number of posts in other_posts is 17194
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']
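As the comment in the filtering cell notes, `str.startswith` is case-sensitive, which is why the titles are lowercased before matching. A quick sketch of the difference, using one of the titles above:

```python
# str.startswith is case-sensitive, so the lowercase prefix
# "ask hn" only matches after the title itself is lowercased
title = "Ask HN: How to improve my personal website?"
print(title.startswith("ask hn"))          # False
print(title.lower().startswith("ask hn"))  # True
```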
According to the type of posts
As previously mentioned, our analysis is linked to the number of comments. First, we will determine which of the two post types we are considering receives more comments on average.
# Function to compute the average number of comments for a type of post
def average_comments(type_name):
    total_comments = 0
    for each_row in type_name:
        total_comments += int(each_row[4])
    return total_comments / len(type_name)

avg_ask = average_comments(ask_posts)
avg_show = average_comments(show_posts)
print(avg_ask, avg_show)
14.038417431192661 10.31669535283993
Since `Ask HN` posts receive around 14 comments on average, while `Show HN` posts receive approximately 10, we conclude that ask posts receive more comments on average. Thus, the second part of our analysis will focus on ask posts.
According to time
We are going to analyze the average number of comments by hour. So, we need to know the number of comments by hour as well as the number of posts by hour.
# Calculating the number of comments and posts by hour
import datetime as dt

result_list = []  # Stores created_at and number of comments
for each_row in ask_posts:
    created_at = each_row[6]
    n_comments = int(each_row[4])
    result_list.append([created_at, n_comments])

counts_by_hour = {}    # Number of posts by hour
comments_by_hour = {}  # Number of comments by hour
date_format = "%m/%d/%Y %H:%M"
for each_row in result_list:
    dt_object = dt.datetime.strptime(each_row[0], date_format)
    hour = dt_object.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = each_row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += each_row[1]
# Calculating the average number of comments by hour
avg_by_hour = []  # List of lists storing [hour, average]
for each_hour in comments_by_hour:
    avg_by_hour.append([each_hour, comments_by_hour[each_hour] / counts_by_hour[each_hour]])
avg_by_hour
[['17', 11.46], ['10', 13.440677966101696], ['06', 9.022727272727273], ['23', 7.985294117647059], ['15', 38.5948275862069], ['16', 16.796296296296298], ['11', 11.051724137931034], ['04', 7.170212765957447], ['02', 23.810344827586206], ['07', 7.852941176470588], ['12', 9.41095890410959], ['14', 13.233644859813085], ['13', 14.741176470588234], ['05', 10.08695652173913], ['21', 16.009174311926607], ['00', 8.127272727272727], ['18', 13.20183486238532], ['19', 10.8], ['22', 6.746478873239437], ['09', 5.5777777777777775], ['01', 11.383333333333333], ['20', 21.525], ['03', 7.796296296296297], ['08', 10.25]]
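As a side note, the two-dictionary tally above could also be written with `dict.get`, which avoids the explicit membership test. A minimal sketch, using a hypothetical miniature `result_list` with the same `[created_at, n_comments]` structure:

```python
import datetime as dt

# Hypothetical sample rows in the same shape as result_list
result_list = [["8/4/2016 11:52", 52], ["8/4/2016 11:10", 3], ["1/26/2016 19:30", 10]]

counts_by_hour = {}
comments_by_hour = {}
for created_at, n_comments in result_list:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    # dict.get(key, 0) returns 0 for unseen hours, so no membership test is needed
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + n_comments

print(counts_by_hour)    # {'11': 2, '19': 1}
print(comments_by_hour)  # {'11': 55, '19': 10}
```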
It would be hard to identify the hours with the highest values from this list, so we need to sort `avg_by_hour`.
# sorted() compares the first element of each inner list,
# so we swap the values in avg_by_hour to sort by average
swap_avg_by_hour = []
for each_hour in avg_by_hour:
    swap_avg_by_hour.append([each_hour[1], each_hour[0]])
swap_avg_by_hour
[[11.46, '17'], [13.440677966101696, '10'], [9.022727272727273, '06'], [7.985294117647059, '23'], [38.5948275862069, '15'], [16.796296296296298, '16'], [11.051724137931034, '11'], [7.170212765957447, '04'], [23.810344827586206, '02'], [7.852941176470588, '07'], [9.41095890410959, '12'], [13.233644859813085, '14'], [14.741176470588234, '13'], [10.08695652173913, '05'], [16.009174311926607, '21'], [8.127272727272727, '00'], [13.20183486238532, '18'], [10.8, '19'], [6.746478873239437, '22'], [5.5777777777777775, '09'], [11.383333333333333, '01'], [21.525, '20'], [7.796296296296297, '03'], [10.25, '08']]
# Sorting in descending order with sorted()
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
string_format = "{}: {:.2f} average comments per post"
for average, hour in sorted_swap[:5]:
    dt_object = dt.datetime.strptime(hour, "%H")
    time = dt_object.strftime("%H:%M")
    print(string_format.format(time, average))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
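The swap step can also be avoided entirely by passing a `key` function to `sorted()`, which sorts `avg_by_hour` directly by its second element. A sketch with hypothetical `[hour, average]` pairs:

```python
# Hypothetical pairs in the same shape as avg_by_hour
avg_by_hour = [["17", 11.46], ["15", 38.59], ["02", 23.81]]

# Sort by the average (index 1), descending, without swapping
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg)  # [['15', 38.59], ['02', 23.81], ['17', 11.46]]
```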
The hour that receives the highest number of comments on average is 15:00, with an average of around 39 comments per post.
According to the data set documentation, the time zone is US Eastern Time. Thus, in Mauritius (UTC+4) the peak time would be around 23:00 (during US daylight saving time, when Eastern Time is UTC-4).
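The conversion can be checked with Python's `zoneinfo` module (available since Python 3.9). A sketch using a hypothetical date within the data set's range, chosen to fall in US daylight saving time:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical date; Eastern Time observes daylight saving (UTC-4) on it
peak_et = datetime(2016, 8, 4, 15, 0, tzinfo=ZoneInfo("America/New_York"))
peak_mu = peak_et.astimezone(ZoneInfo("Indian/Mauritius"))
print(peak_mu.strftime("%H:%M"))  # 23:00
```

Outside daylight saving time, Eastern Time is UTC-5 and the same 15:00 peak would correspond to 00:00 in Mauritius.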
The sample we used excluded posts that received no comments. So, it would be more precise to say that, among posts that received comments, `Ask HN` posts seem to do better, with the peak time being 15:00.
Thus, to maximize the number of comments a post receives, it is recommended to submit an `Ask HN` post around 15:00 Eastern Time.