Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
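You can find the data set here. Note that this analysis loads the full file of almost 300,000 rows, which includes submissions that didn't receive any comments. Below are descriptions of the columns:
id: the unique identifier of the post
title: the title of the post
url: the URL that the post links to, if any
num_points: the number of points the post received
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time the post was submitted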
We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:
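We'll compare these two types of posts to determine the following: whether Ask HN or Show HN posts receive more comments on average, and whether posts created at a certain time receive more comments on average.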
First, we'll import the data and check its structure:
from csv import reader
# Open the file in a context manager so that it is closed once the rows are read
with open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
Then, we'll store the header row in a separate variable and remove it from the list of lists, so that only data rows remain:
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
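The read-then-split pattern above can be tried without the full file. The following is a minimal sketch that assumes a made-up two-row sample in the same column layout; the rows, names, and URL in it are hypothetical and are not taken from the data set.

```python
import csv
import io

# Hypothetical two-row sample using the same column layout as the data set
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,5,3,alice,9/26/2016 3:26\n"
    "2,Show HN: Example project,http://example.com,7,1,bob,9/26/2016 15:04\n"
)

# csv.reader accepts any iterable of lines, so an in-memory buffer works too
rows = list(csv.reader(io.StringIO(sample_csv)))

# Split the header row from the data rows
headers, data = rows[0], rows[1:]

print(headers)
print(len(data), 'data rows')
```

Every value comes back as a string, which is why the later steps convert num_comments with int() before doing arithmetic.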
Next, we'll separate posts whose titles begin with Ask HN or Show HN (in any letter case) into two different lists, and put all other posts in a third list.
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print('There are', len(ask_posts), 'ask HN posts')
print('There are', len(show_posts), 'show HN posts')
print('There are', len(other_posts), 'other posts')
There are 9139 ask HN posts
There are 10158 show HN posts
There are 273822 other posts
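The case-insensitive prefix check can also be isolated in a small function, which makes it easy to test on a few titles. This is only a sketch: the function name classify_title and the sample titles are hypothetical and do not come from the data set.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the start of the title."""
    lowered = title.lower()  # lowercase once so the check ignores letter case
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles, used only to exercise the function
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: A tiny static site generator',
    'Why I ask HN for advice',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

The third title contains "ask HN" but does not start with it, so it is classified as 'other', which matches the behavior of startswith in the loop above.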
Next, let's determine if ask posts or show posts receive more comments on average.
total_ask_comments = 0
for post in ask_posts:
    comments = int(post[4])
    total_ask_comments += comments
avg_ask_comments = total_ask_comments / len(ask_posts)
total_show_comments = 0
for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments
avg_show_comments = total_show_comments / len(show_posts)
print('The average number of comments in Ask HN posts is', round(avg_ask_comments, 1))
print('The average number of comments in Show HN posts is', round(avg_show_comments, 1))
The average number of comments in Ask HN posts is 10.4
The average number of comments in Show HN posts is 4.9
Therefore, we can conclude that, on average, Ask HN posts receive more comments than Show HN posts (10.4 vs. 4.9).
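Because the same averaging loop runs once per list, it could be factored into a helper. The sketch below assumes the row layout shown earlier, with num_comments at index 4; the function name average_comments and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Hypothetical rows in the data set's column order
sample_posts = [
    ['1', 'Ask HN: Example one?', '', '5', '3', 'alice', '9/26/2016 3:26'],
    ['2', 'Ask HN: Example two?', '', '2', '1', 'bob', '9/26/2016 4:10'],
]

sample_average = average_comments(sample_posts)
print(round(sample_average, 1))
```

With such a helper, the two averages above would be average_comments(ask_posts) and average_comments(show_posts).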
Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
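Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis: first, count the ask posts created during each hour of the day, along with the comments they received; then, calculate the average number of comments per post for each hour.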
import datetime as dt
result_list = []
for post in ask_posts:
    comments = int(post[4])
    created = post[6]
    result_list.append([created, comments])
counts_by_hour = {}
comments_by_hour = {}
for item in result_list:
    created = dt.datetime.strptime(item[0], '%m/%d/%Y %H:%M')
    hour_created = created.hour
    comments = int(item[1])
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments
print('Posts count per hour:', counts_by_hour)
print('Comments count per hour:', comments_by_hour)
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour], 1)])
print('Average number of comments by hour:', avg_by_hour)
Posts count per hour: {2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
Comments count per hour: {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
Average number of comments by hour: [[2, 11.1], [1, 7.4], [22, 8.8], [21, 8.7], [19, 7.2], [17, 9.4], [15, 28.7], [14, 9.7], [13, 16.3], [11, 9.0], [10, 10.7], [9, 6.7], [7, 7.0], [3, 7.9], [23, 6.7], [20, 8.7], [16, 7.7], [8, 9.2], [0, 7.6], [18, 7.9], [12, 12.4], [4, 9.7], [6, 6.8], [5, 8.8]]
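The if/else bookkeeping in the hourly loop can be shortened with collections.defaultdict, which starts every missing key at zero. This is an alternative sketch rather than the method used above, and the three sample entries are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs in the data set's date format
sample_results = [
    ['9/26/2016 3:26', 4],
    ['9/26/2016 3:50', 6],
    ['9/25/2016 15:04', 10],
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by those posts

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for each hour that has at least one post
sample_avg_by_hour = {hour: comments[hour] / counts[hour] for hour in counts}
print(sample_avg_by_hour)
```

Both approaches produce the same totals; the defaultdict version only removes the need to check whether an hour has been seen before.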
Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    time_hour = dt.datetime.strptime(str(row[1]), '%H')
    format_hour = time_hour.strftime("%H:%M")
    avg_comments = row[0]
    print('{hour}: {comments} average comments per post'.format(hour = format_hour, comments = avg_comments))
Top 5 Hours for Ask Posts Comments
15:00: 28.7 average comments per post
13:00: 16.3 average comments per post
12:00: 12.4 average comments per post
02:00: 11.1 average comments per post
10:00: 10.7 average comments per post
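Swapping the columns is one way to sort by the average; passing a key function to sorted() is another that leaves the [hour, average] order untouched. The sketch below reuses six of the [hour, average] pairs printed earlier, and the variable names are hypothetical.

```python
# A subset of the [hour, average] pairs from the output above
avg_by_hour_sample = [
    [2, 11.1], [15, 28.7], [13, 16.3], [12, 12.4], [10, 10.7], [9, 6.7],
]

# Sort on the second element (the average), highest first, and keep five rows
top_five = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)[:5]

for hour, avg_comments in top_five:
    # {:02d} pads the hour to two digits, e.g. 2 becomes 02
    print('{:02d}:00: {} average comments per post'.format(hour, avg_comments))
```

One difference to keep in mind: with a key function, rows with equal averages keep their original relative order, whereas the swapped lists break ties by hour.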
Therefore, we can conclude that the best time of day to publish an Ask HN post and attract comments is 15:00 Eastern Time in the US. As we live in Spain, which is normally six hours ahead of US Eastern Time, this corresponds to 21:00 Spanish time.
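The conversion to Spanish time can be checked with fixed UTC offsets. The sketch below assumes the usual offsets (UTC-4 or UTC-5 for US Eastern Time, UTC+2 or UTC+1 for mainland Spain) and ignores the few weeks each year when the two regions change clocks on different dates; the function name eastern_to_spain is hypothetical.

```python
import datetime as dt

def eastern_to_spain(hour, eastern_offset, spain_offset):
    """Convert an hour from US Eastern Time to Spanish time using fixed UTC offsets."""
    eastern = dt.timezone(dt.timedelta(hours=eastern_offset))
    spain = dt.timezone(dt.timedelta(hours=spain_offset))
    # The date is only a placeholder; the fixed offsets determine the result
    eastern_time = dt.datetime(2016, 9, 26, hour, 0, tzinfo=eastern)
    return eastern_time.astimezone(spain).strftime('%H:%M')

summer_time = eastern_to_spain(15, -4, 2)  # assumed EDT (UTC-4) to CEST (UTC+2)
winter_time = eastern_to_spain(15, -5, 1)  # assumed EST (UTC-5) to CET (UTC+1)
print(summer_time, winter_time)
```

For date-aware conversions that follow the real daylight saving rules, the standard library's zoneinfo module (Python 3.9 and later) could be used instead of fixed offsets.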