In this study we explore a sample of posts from Hacker News. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
Some posts easily attract a lot of views and comments. In this study we explore two aspects that may affect the number of comments a post receives.
Post title: when creating posts, users can optionally add Ask HN or Show HN to the title of the post. They do so to explicitly 'ask' or 'show' something to the Hacker News community. We'll analyze whether posts with these tags receive more comments on average.
Post timing: we will also explore whether posts published at certain times of day receive more comments on average.
Data
The source data for this study can be found here. It contains almost 300,000 rows, each row representing a post. The data is described as being from 2016, although the sample output below also includes posts from late 2015. For this study we use a version that has been reduced to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. This file was prepared by Dataquest and can be downloaded from here.
Let us start by reading in the data and displaying the header row and a small sample.
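from csv import reader

# Read the whole file into a list of rows (each row is a list of strings)
with open('inputdata/hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Print the header row and the first three data rows
for row in hn[:4]:
    print(row, '\n')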
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
Let's split off the header row into headers and keep the data itself in hn, printing the results as a check.
headers = hn[0]
print ('Number of records before removing the header: ', len(hn))
hn = hn[1:]
print ('Number of records after removing the header: ', len(hn))
print ('\n','The first three rows of the data:', '\n')
for row in hn[:3]:
    print(row, '\n')
Number of records before removing the header: 20101
Number of records after removing the header: 20100

The first three rows of the data:

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
Next, let us split the data into three new lists:
ask_posts (posts whose title starts with 'ask hn', ignoring case)
show_posts (posts whose title starts with 'show hn', ignoring case)
other_posts (the remainder)
# Create empty lists
ask_posts = []
show_posts = []
other_posts = []
# Fill the lists
for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
# Print some samples
print('Sample posts for "ask":', '\n')
for post in ask_posts[:2]:
    print(post)
print('\n', 'Sample posts for "show":', '\n')
for post in show_posts[:2]:
    print(post)
print('\n', 'Sample posts for other:', '\n')
for post in other_posts[:2]:
    print(post)
# Check the totals
print ('\n')
print ('Number of posts in the original list is', len(hn))
print ('Number of posts in "ask" is', len(ask_posts))
print ('Number of posts in "show" is', len(show_posts))
print ('Number of posts in "other" is', len(other_posts))
print ('Sum of the three new lists is', len(ask_posts)+len(show_posts)+len(other_posts))
Sample posts for "ask":

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']

Sample posts for "show":

['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']

Sample posts for other:

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

Number of posts in the original list is 20100
Number of posts in "ask" is 1744
Number of posts in "show" is 1162
Number of posts in "other" is 17194
Sum of the three new lists is 20100
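The matching rule used for the split can also be written as a small helper function, which makes it easy to check how edge cases are handled. The sketch below is only an illustration and not part of the original analysis: the function name classify_title is an assumption, and the sample titles are partly adapted from the output above and partly made up.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Illustrative titles (hypothetical variations on the sample output)
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',
    'Interactive Dynamic Video',
    'Why I never ask HN for advice',
]

labels = [classify_title(title) for title in sample_titles]
for title, label in zip(sample_titles, labels):
    print('{:5s} | {}'.format(label, title))
```

Note that the rule only looks at the start of the title, so a title that mentions "ask HN" somewhere in the middle ends up in the "other" group.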
Next, let's determine if "ask posts" or "show posts" receive more comments on average.
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments/len(ask_posts)
print ('Average number of comments for "ask" posts is {:.2f}'.format(avg_ask_comments))
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments/len(show_posts)
print ('Average number of comments for "show" posts is {:.2f}'.format(avg_show_comments))
Average number of comments for "ask" posts is 14.04
Average number of comments for "show" posts is 10.32
It appears that 'ask' posts receive more comments on average than 'show' posts.
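Since the same calculation is repeated for each group, it could be wrapped in a helper that works for any list of posts, including other_posts. The following is a minimal sketch under stated assumptions: the function name average_comments is hypothetical, and the two rows are copied from the sample output shown earlier purely for illustration, so the result says nothing about the full dataset.

```python
def average_comments(posts, comments_index=4):
    """Average number of comments over a list of rows; None for an empty list."""
    if not posts:
        return None
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Two rows copied from the sample output (illustration only)
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]

sample_average = average_comments(sample_posts)
print('Average number of comments in the sample is {:.2f}'.format(sample_average))
```

Returning None for an empty list avoids a division by zero if one of the groups happens to be empty.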
To analyze whether particular times of the day attract more comments, we will continue with these "ask" posts.
import datetime as dt
# Create a list that contains the creation times and number of comments (ask-posts only)
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
#print (result_list[:3])
# Build frequency tables for the number of posts and for the number of comments, per hour of the day
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_at = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = created_at.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
# Create a table that contains the hours of day and the average number of comments per post
avg_by_hour = []
for hour in counts_by_hour:
    num_posts = counts_by_hour[hour]
    num_comments = comments_by_hour[hour]
    average = num_comments / num_posts
    avg_by_hour.append([hour, average])
# Sort the list (on its first element, being the hour of day)
avg_by_hour.sort()
# Print the result
output = "For hour {:02d} the average number of comments per post is {:.2f}"
for row in avg_by_hour:
    print (output.format(row[0], row[1]))
For hour 00 the average number of comments per post is 8.13
For hour 01 the average number of comments per post is 11.38
For hour 02 the average number of comments per post is 23.81
For hour 03 the average number of comments per post is 7.80
For hour 04 the average number of comments per post is 7.17
For hour 05 the average number of comments per post is 10.09
For hour 06 the average number of comments per post is 9.02
For hour 07 the average number of comments per post is 7.85
For hour 08 the average number of comments per post is 10.25
For hour 09 the average number of comments per post is 5.58
For hour 10 the average number of comments per post is 13.44
For hour 11 the average number of comments per post is 11.05
For hour 12 the average number of comments per post is 9.41
For hour 13 the average number of comments per post is 14.74
For hour 14 the average number of comments per post is 13.23
For hour 15 the average number of comments per post is 38.59
For hour 16 the average number of comments per post is 16.80
For hour 17 the average number of comments per post is 11.46
For hour 18 the average number of comments per post is 13.20
For hour 19 the average number of comments per post is 10.80
For hour 20 the average number of comments per post is 21.52
For hour 21 the average number of comments per post is 16.01
For hour 22 the average number of comments per post is 6.75
For hour 23 the average number of comments per post is 7.99
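An hourly average can be pulled up by a single post with very many comments, especially for hours with few posts, so it can help to report the number of posts next to each average. The sketch below shows one possible way to do that with collections.defaultdict. It is not the code used for the results above: the function name hourly_stats is an assumption, the first two sample rows reuse creation times and comment counts from the "ask" sample output, and the third row is hypothetical.

```python
import datetime as dt
from collections import defaultdict


def hourly_stats(rows, date_format='%m/%d/%Y %H:%M'):
    """Map hour of day -> (number of posts, average comments per post)."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [post count, comment count]
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).hour
        totals[hour][0] += 1
        totals[hour][1] += num_comments
    return {hour: (count, comments / count)
            for hour, (count, comments) in sorted(totals.items())}


# [created_at, num_comments] pairs; the last row is made up for illustration
sample_rows = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['8/17/2016 9:05', 10],
]

stats = hourly_stats(sample_rows)
for hour, (count, average) in stats.items():
    print('Hour {:02d}: {} posts, {:.2f} comments per post'.format(hour, count, average))
```

With the post count next to each average, it becomes easier to judge how much weight a particular hour's figure deserves.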
It appears there are indeed substantial differences between hours. Let's present this more clearly by listing the hours of day at which posts (on average) attract the most comments.
# Create a list with the average first, so that it can be sorted on the average number of comments
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
# Create a sorted version of this list (highest average first)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
# Display the results
print('Top 5 Hours for Ask Posts Comments', '\n')
output = "{}: {:.2f} average comments per post"
for row in sorted_swap[:5]:
    thetime = dt.datetime.strptime(str(row[1]), '%H')
    thetime = thetime.strftime('%H:%M')
    print(output.format(thetime, row[0]))
Top 5 Hours for Ask Posts Comments

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
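As an alternative to swapping the columns before sorting, Python's sorted function accepts a key argument that selects the value to sort on. The sketch below illustrates this on a subset of the hour and average pairs printed earlier; the variable name avg_by_hour_sample is an assumption introduced for this example.

```python
# [hour, average] pairs copied from the per-hour output (a subset, for brevity)
avg_by_hour_sample = [
    [0, 8.13], [2, 23.81], [15, 38.59], [16, 16.80],
    [20, 21.52], [21, 16.01], [22, 6.75],
]

# Sort on the second element (the average), highest first, and keep the top 5
top_hours = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)[:5]

for hour, average in top_hours:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, average))
```

This keeps each row in its original [hour, average] order, so no second list is needed.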
In this sample, those are the hours of day at which Ask HN posts attracted the most comments on average. It is interesting that the top 5 hours are spread widely across the day. One possible explanation is that commenters are located across the globe, and that these different hours represent peak times for different time zones. (That would require further study, though.)
Note that the times above are in US Eastern Time, as per the dataset documentation.
For our time zone (Central European Time), add six hours.
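The six-hour shift is simple arithmetic, but hours late in the evening wrap around past midnight. A minimal sketch of the conversion, assuming a fixed six-hour offset as stated above (the function name shift_hour is hypothetical):

```python
def shift_hour(hour, offset_hours=6):
    """Shift an hour of day by a fixed offset, wrapping around midnight."""
    return (hour + offset_hours) % 24


# Top 5 hours in US Eastern Time, taken from the output above
top_hours_eastern = [15, 2, 20, 16, 21]
top_hours_cet = [shift_hour(hour) for hour in top_hours_eastern]

for eastern, cet in zip(top_hours_eastern, top_hours_cet):
    print('{:02d}:00 Eastern -> {:02d}:00 Central European'.format(eastern, cet))
```

A fixed offset is an approximation: the United States and Europe switch to and from daylight saving time on different dates, so for a few weeks each year the difference is not six hours.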
Referring back to the goal of this study, let's summarize the conclusions.
Post title: in this sample, posts with Ask HN in the title attracted more comments on average than posts with Show HN (14.04 versus 10.32 comments per post).
(Posts without either tag were not included in this comparison.)
Post timing: the time of day of posting appears to have a substantial impact on the number of comments that a post attracts. Based on an analysis of the Ask HN posts, the top hours (in Central European Time) are: