Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:
Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform'
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
We'll compare these two types of posts, using Python built-ins, to determine the following:
Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?
This data set contains Hacker News posts from September 26, 2015 to September 26, 2016. You can find the data set here, and below are descriptions of the columns:
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted (the time zone is Eastern Time in the US)
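To make the `created_at` column concrete, here is a minimal sketch of how one of its values can be parsed with `datetime.strptime`. The sample string has the same shape as the values in the data set, and the format string assumes every row follows the month/day/year hour:minute pattern.

```python
import datetime as dt

# Sample value shaped like the `created_at` column
# (assumed format: month/day/year hour:minute, 24-hour clock)
sample_created_at = '9/26/2016 3:13'

# Parse the string into a datetime object
parsed = dt.datetime.strptime(sample_created_at, '%m/%d/%Y %H:%M')

# The hour of day (0-23) is what the hourly analysis relies on
print(parsed.hour)
```

Once the string is a `datetime` object, the `hour` attribute gives the hour of the day directly, without any string slicing.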
We'll read in the Hacker News CSV file. Because we are analyzing posts with comments, we'll remove posts that have none and look at the first five rows.
import datetime as dt
from csv import reader


def csv_to_list(file, head=True):
    '''
    Transforms a csv file to a list of lists, and returns both the
    header and data - unless head parameter is set to False.

    Arg:
        file (str): name or path for csv file
        head (bool): True is default; False for no header

    Returns:
        tuple or list: Returns tuple with header as the first element
        and data as the second. If head is False, returns data as a list.
    '''
    with open(file) as openfile:
        readfile = reader(openfile)
        hn = list(readfile)
    if head:
        return hn[0], hn[1:]
    else:
        return hn


def remove_nocomments(dataset):
    '''
    Remove posts with no comments.

    Arg:
        dataset (list): dataset as a list of lists

    Returns:
        list: dataset as a list of lists
    '''
    clean_dataset = []
    for row in dataset:
        num_comments = row[4]
        if num_comments != '0':
            clean_dataset.append(row)
    return clean_dataset


def create_ask_show(dataset):
    '''
    Create three different datasets: one for posts starting
    with Ask HN, and one for posts starting with Show HN, and
    one for other posts.

    Arg:
        dataset (list): dataset as a list of lists

    Returns:
        tuple: datasets as a list of lists with posts starting
        with Ask HN, posts starting with Show HN, and other posts
    '''
    ask_posts = []
    show_posts = []
    other_posts = []
    for row in dataset:
        title = row[1].lower()
        if title.startswith('ask hn'):
            ask_posts.append(row)
        elif title.startswith('show hn'):
            show_posts.append(row)
        else:
            other_posts.append(row)
    return ask_posts, show_posts, other_posts


def avg_num_comments(dataset):
    '''
    Calculate the average number of comments for posts in the dataset.

    Arg:
        dataset (list): dataset as a list of lists

    Returns:
        float: average number of comments
    '''
    total_comments = 0
    for row in dataset:
        num_comments = int(row[4])
        total_comments += num_comments
    avg_comments = total_comments/len(dataset)
    return avg_comments

def counts_comments_hr(dataset):
    '''
    Creates frequency tables that count both the number of
    posts per hour and the number of comments per hour.

    Arg:
        dataset (list): dataset as a list of lists

    Returns:
        tuple: dict key=hour: value=number of posts, dict key=hour: value=number of comments
    '''
    # will contain `created_at` datetime object and number of comments for each post in `dataset`
    result_list = []
    for row in dataset:
        created_at = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
        num_comments = int(row[4])
        result_list.append([created_at, num_comments])

    # will contain number of posts from `dataset` per hour
    counts_by_hour = {}
    # will contain total number of comments from `dataset` per hour
    comments_by_hour = {}
    for row in result_list:
        hour = row[0].hour
        num_comments = row[1]
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += num_comments
        else:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = num_comments
    return counts_by_hour, comments_by_hour


def calc_avg_by_hr(comments, counts):
    '''
    Calculate the average number of comments for posts by hour.

    Arg:
        comments (dict): comments per hour - key=hour: value=comments per hour
        counts (dict): number of posts per hour - key=hour: value=posts per hour

    Returns:
        list: list of lists with each row having an hour as first element and
        average number of comments per post as second element.
    '''
    avg_by_hr = []
    for hour in comments:
        hourly = [hour, comments[hour]/counts[hour]]
        avg_by_hr.append(hourly)
    return avg_by_hr

# read in file
header, hn = csv_to_list('HN_posts_year_to_Sep_26_2016.csv')

# remove posts with no comments
hn = remove_nocomments(hn)

# display header and the first five rows
print(header)
for row in hn[:5]:
    print('\n', row, '\n')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']

['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']

['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']

['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']
Because we're specifically interested in posts whose titles begin with either Ask HN or Show HN and will be comparing the two, we'll place those posts in two different datasets. In addition, we will keep other posts to check that we captured all posts.
ask_posts, show_posts, other_posts = create_ask_show(hn)
print('\nThe number of posts in each data set\n'.title())
print(f'ask_posts: {len(ask_posts)}\n')
print(f'show_posts: {len(show_posts)}\n')
print(f'other_posts: {len(other_posts)}')
print('---------------------\n')
print(f'total posts: {len(hn)}')
The Number Of Posts In Each Data Set

ask_posts: 6911
show_posts: 5059
other_posts: 68431
---------------------
total posts: 80401
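The check that we captured every post can also be done in code rather than by eye. The sketch below is one possible way to do it; `partition_is_complete` is a hypothetical helper that is not part of the functions defined earlier, and the counts are copied from the output above.

```python
# Counts copied from the printed output above
ask_count = 6911
show_count = 5059
other_count = 68431
total_count = 80401


def partition_is_complete(part_sizes, total):
    """Return True when the subset sizes add up to the total (hypothetical helper)."""
    return sum(part_sizes) == total


# True means no post was dropped or counted twice when splitting
print(partition_is_complete([ask_count, show_count, other_count], total_count))
```

In the notebook itself, the same idea could be written with `len(ask_posts)`, `len(show_posts)`, `len(other_posts)`, and `len(hn)` in place of the hard-coded numbers.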
# call `avg_num_comments` to calculate the average number of comments per post in `ask_posts`
avg_ask_comments = avg_num_comments(ask_posts)
# call `avg_num_comments` to calculate the average number of comments per post in `show_posts`
avg_show_comments = avg_num_comments(show_posts)
print(f'\nAverage number of comments per post in `ask_posts`: {avg_ask_comments}\n')
print(f'Average number of comments per post in `show_posts`: {avg_show_comments}\n')
Average number of comments per post in `ask_posts`: 13.744175951381855
Average number of comments per post in `show_posts`: 9.810832180272781
On average, ask posts receive more comments than show posts. Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.
In the code block below, we'll tackle the first step: calculating the number of ask posts and comments by hour created.
counts_by_hour, comments_by_hour = counts_comments_hr(ask_posts)
counts_by_hour
{2: 227, 1: 223, 22: 287, 21: 407, 19: 420, 17: 404, 15: 467, 14: 378, 13: 326, 11: 251, 10: 219, 9: 176, 7: 157, 3: 212, 16: 415, 8: 190, 0: 231, 23: 276, 20: 392, 18: 452, 12: 274, 4: 186, 6: 176, 5: 165}
The table above shows the number of ask posts by hour created, with the first number representing the hour and the second representing the number of posts. Below is the number of comments by hour created.
comments_by_hour
{2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 16: 4466, 8: 2362, 0: 2277, 23: 2297, 20: 4462, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
Now, we'll tackle the second step by using the two dictionaries above to create a list of lists containing the hours during which posts were created and the average number of comments those posts received.
avg_by_hr = calc_avg_by_hr(comments_by_hour, counts_by_hour)
avg_by_hr
[[2, 13.198237885462555], [1, 9.367713004484305], [22, 11.749128919860627], [21, 11.056511056511056], [19, 9.414285714285715], [17, 13.73019801980198], [15, 39.66809421841542], [14, 13.153439153439153], [13, 22.2239263803681], [11, 11.143426294820717], [10, 13.757990867579908], [9, 8.392045454545455], [7, 10.095541401273886], [3, 10.160377358490566], [16, 10.76144578313253], [8, 12.43157894736842], [0, 9.857142857142858], [23, 8.322463768115941], [20, 11.38265306122449], [18, 10.789823008849558], [12, 15.452554744525548], [4, 12.688172043010752], [6, 9.017045454545455], [5, 11.139393939393939]]
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists in descending order by average comments and printing the top five rows in a format that's easier to read.
# sort avg_by_hr by average comments and in descending order
sorted_avg_by_hr = sorted(avg_by_hr, key=lambda row: row[1], reverse=True)

print('\nTop 5 Hours for Ask Posts Comments\n')
for row in sorted_avg_by_hr[:5]:
    hr = dt.time(hour=row[0])
    string = f'{hr:%H:%M}: {row[1]:.2f} average comments per post\n'
    print(string)
Top 5 Hours for Ask Posts Comments

15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
Which hours should you create a post during to have a higher chance of receiving comments?
When working with time series data, it is helpful to group hours into periods, such as morning, afternoon, evening, and night. According to the Britannica Dictionary, many people would agree on morning as 05:00 - 11:59, afternoon as 12:00 - 16:59, evening as 17:00 - 20:59, and night as 21:00 - 04:59. Using this framework, the afternoon is the best time to create a post to have a higher chance of receiving comments.
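The grouping described above can be sketched in code. The two functions below are hypothetical helpers, not part of the analysis as it was run; they use the period boundaries quoted above and expect dictionaries shaped like `counts_by_hour` and `comments_by_hour`. The small example dictionaries are made up purely for illustration.

```python
def period_of_day(hour):
    """Map an hour (0-23) to a period of the day (hypothetical helper)."""
    if 5 <= hour <= 11:
        return 'morning'
    if 12 <= hour <= 16:
        return 'afternoon'
    if 17 <= hour <= 20:
        return 'evening'
    return 'night'  # 21:00 - 04:59


def avg_comments_by_period(counts, comments):
    """Combine per-hour post counts and comment totals into per-period averages."""
    posts_by_period = {}
    comments_by_period = {}
    for hour in counts:
        period = period_of_day(hour)
        posts_by_period[period] = posts_by_period.get(period, 0) + counts[hour]
        comments_by_period[period] = comments_by_period.get(period, 0) + comments[hour]
    return {
        period: comments_by_period[period] / posts_by_period[period]
        for period in posts_by_period
    }


# Made-up numbers for illustration only (not taken from the data set)
example_counts = {9: 2, 13: 1, 15: 1, 22: 4}
example_comments = {9: 10, 13: 20, 15: 40, 22: 8}
print(avg_comments_by_period(example_counts, example_comments))
```

Passing `counts_by_hour` and `comments_by_hour` instead of the example dictionaries would give the average comments per post for each period. Summing counts and comments before dividing weights each hour by its number of posts, which differs from averaging the hourly averages.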
In the table above, Top 5 Hours for Ask Posts Comments, the top three hours are all in the afternoon. I am located in the Central Time Zone, and the times are in the Eastern Time Zone. Even after adjusting for the time zone difference, three of the top five hours are still in the afternoon.
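The time zone adjustment can be checked with a few lines of arithmetic. This is a minimal sketch that assumes Central Time is always exactly one hour behind Eastern Time; `eastern_to_central_hour` is a hypothetical helper, and the afternoon range is the 12:00 - 16:59 window used above.

```python
def eastern_to_central_hour(hour):
    """Shift an Eastern Time hour to Central Time, wrapping around midnight.

    Assumes a fixed one-hour offset between the two zones.
    """
    return (hour - 1) % 24


# Top five hours (Eastern Time) from the table above
top_five_eastern = [15, 13, 12, 10, 17]

top_five_central = [eastern_to_central_hour(h) for h in top_five_eastern]
afternoon_central = [h for h in top_five_central if 12 <= h <= 16]

print(top_five_central)
print(len(afternoon_central))
```

The modulo keeps the result in the 0-23 range, so an Eastern hour of 0 maps to 23 rather than -1.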
To narrow the hours down even more, 15:00 and 13:00 are the two best times. 15:00 has the highest average, approximately 17 comments per post more than 13:00, which has the second highest. 13:00 in turn averages almost seven comments per post more than third place.
Lastly, to be even more specific, 15:00 is the best hour. Still referencing the table above, its average comments per post is approximately 78 percent higher than the next best option.
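The comparisons between the top hours come down to simple arithmetic on the averages already computed. The sketch below redoes that arithmetic with the values from `avg_by_hr`; the variable names are chosen here for illustration.

```python
# Average comments per post for the top three hours, copied from avg_by_hr
top_three = {
    15: 39.66809421841542,
    13: 22.2239263803681,
    12: 15.452554744525548,
}

# Absolute gaps between neighbouring places
gap_first_second = top_three[15] - top_three[13]
gap_second_third = top_three[13] - top_three[12]

# Relative gap: how much higher first place is than second place, in percent
pct_first_over_second = (top_three[15] / top_three[13] - 1) * 100

print(f'15:00 vs 13:00: {gap_first_second:.2f} more comments per post')
print(f'13:00 vs 12:00: {gap_second_third:.2f} more comments per post')
print(f'15:00 is {pct_first_over_second:.1f}% higher than 13:00')
```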
After comparing posts that begin with Ask HN and Show HN, using data on Hacker News from September 26, 2015 to September 26, 2016, we can answer the two questions that we posed. First, do posts starting with Ask HN or Show HN receive more comments on average? After separating the posts into two groups and dividing the number of comments by the number of posts, we find that posts beginning with Ask HN receive more comments per post on average. Second, do posts created at a certain time receive more comments on average? Looking at just the Ask HN posts, because those receive more comments per post, we find after calculating the number of comments per post by hour that the afternoon, and 15:00 specifically, is the best time to create posts on Hacker News to receive more comments.