Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
The aim of this project is to compare two types of posts submitted on Hacker News, Ask HN (questions) and Show HN (showcases), in order to determine which type receives more comments.
The dataset can be found here.
It contains data about posts submitted in the community over a 12-month period (up to September 2016) and was originally put together by Hacker News in 2016. It has not been updated since.
It originally had 300,000 rows. We have reduced it to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.
Below are descriptions of the columns:
Column Name in Dataset | Description |
---|---|
"id" | the unique identifier from Hacker News for the post |
"title" | the title of the post |
"url" | the URL that the post links to, if the post has a URL |
"num_points" | the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
"num_comments" | the number of comments on the post |
"author" | the username of the person who submitted the post |
"created_at" | the date and time of the post's submission |
For the purposes of this project, the columns of interest to us are num_comments and created_at.
Let's open the dataset and display the first 5 rows.
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)  # each row becomes a list of strings
hn[:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
headers = hn[0]  # the first row holds the column names
hn = hn[1:]      # keep only the data rows
print(headers)
hn[:5]
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
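The loading and header-splitting steps above can also be wrapped in a small helper. The sketch below is an illustration rather than the notebook's own code: `load_rows` is a hypothetical name, and a one-row in-memory sample stands in for `hacker_news.csv` so that the example runs on its own.

```python
import csv
import io


def load_rows(file_obj):
    """Return (headers, rows) read from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# One data row copied from the output above stands in for the real file.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

headers, rows = load_rows(sample)
print(headers)
print(rows[0][1], rows[0][4])
```

With the real file, the same helper could be called inside `with open('hacker_news.csv', newline='') as f:`, which also closes the file once the rows have been read.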
As we are interested in comparing the Question and Showcase type submissions, we will split them into two lists using the string method startswith. To ensure that we take all posts into account regardless of capitalization, we will use the lower method to convert all title strings to lower case.
# create an empty list for each post type
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    # compare in lower case so that capitalization differences are ignored
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('Ask posts:', len(ask_posts))
print('Show Posts:', len(show_posts))
print('Other Posts:', len(other_posts))
Ask posts: 1744
Show Posts: 1162
Other Posts: 17194
Let's ensure we have collected the right posts in each list:
ask_posts [:3]
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]
show_posts [:3]
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]
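The prefix test used to split the posts can be isolated in a small function, which makes it easy to check on individual titles. This is a minimal sketch: `classify_title` is a hypothetical helper, not part of the original notebook, and the last title is invented to show that capitalization does not matter.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: How to improve my personal website?',   # from the dataset
    'Show HN: Something pointless I made',           # from the dataset
    'Interactive Dynamic Video',                     # from the dataset
    'ASK HN: a made-up title in capitals',           # invented for illustration
]

labels = [classify_title(t) for t in titles]
print(labels)
```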
Which posts receive more comments? Let's compare the averages for these two types of posts. To do this, we will sum the number of comments for each post type and divide the total by the number of posts of that type.
# average number of comments per Ask post
total_ask_comments = 0
for post in ask_posts:
    comments = int(post[4])
    total_ask_comments += comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average comments per "Ask" post:', avg_ask_comments)
# average number of comments per Show post
total_show_comments = 0
for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments
avg_show_comments = total_show_comments / len(show_posts)
print('Average comments per "Show" post:', avg_show_comments)
Average comments per "Ask" post: 14.038417431192661
Average comments per "Show" post: 10.31669535283993
We can observe that Ask posts receive more comments on average (about 14.0) than Show posts (about 10.3).
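Because the same averaging logic is applied to both lists, it could be factored into one function. The sketch below assumes the row layout shown earlier (num_comments at index 4); `average_comments` is a hypothetical name, and the three rows are copied from the Ask posts printed above, so the result applies to this tiny sample only.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column over a list of rows."""
    if not posts:
        return 0.0  # avoid ZeroDivisionError on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three Ask HN rows copied from the output above.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
]

sample_average = average_comments(sample_ask)
print(sample_average)  # (6 + 29 + 1) / 3 = 12.0
```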
Since Ask posts receive more comments on average, we'll focus our remaining analysis on these posts only.
We will use the created_at column to extract the time of day (in 24h format) at which each post was created.
import datetime as dt
results_list = []
for post in ask_posts:
    created_at = post[6]
    comments = int(post[4])
    results_list.append([created_at, comments])  # pair each timestamp with its comment count
results_list[:5]
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]
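Before aggregating, it can help to see what the parsing step does to a single timestamp. The sketch below uses the same format string as the notebook; `hour_of` is a hypothetical helper name, and the timestamps are taken from the output above.

```python
import datetime as dt

DATE_FORMAT = '%m/%d/%Y %H:%M'  # month/day/year hour:minute, as in created_at


def hour_of(timestamp):
    """Parse a created_at string and return the hour as a two-digit string."""
    parsed = dt.datetime.strptime(timestamp, DATE_FORMAT)
    return parsed.strftime('%H')


print(hour_of('8/16/2016 9:55'))     # 09
print(hour_of('11/22/2015 13:43'))   # 13
```

Returning the hour as a zero-padded string means that sorting the hour labels alphabetically also sorts them chronologically.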
counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'  # format string passed as the second argument to strptime
for row in results_list:
    date = row[0]
    hour = dt.datetime.strptime(date, date_format).strftime('%H')  # keep only the hour (24h format)
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
# present the total comments by hour, in order of the hour of the day
sorted_comments = sorted(comments_by_hour.items(), key=lambda x: x[0])  # sort by key (the hour), hence index 0
for item in sorted_comments:
    print(item)
('00', 447)
('01', 683)
('02', 1381)
('03', 421)
('04', 337)
('05', 464)
('06', 397)
('07', 267)
('08', 492)
('09', 251)
('10', 793)
('11', 641)
('12', 687)
('13', 1253)
('14', 1416)
('15', 4477)
('16', 1814)
('17', 1146)
('18', 1439)
('19', 1188)
('20', 1722)
('21', 1745)
('22', 479)
('23', 543)
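The two dictionaries built above follow a common "tally by key" pattern. As an alternative sketch (not the notebook's own code), `collections.defaultdict` removes the need for the explicit membership check. `tally_by_hour` is a hypothetical name; the first five records come from the output shown earlier, and the last one is invented so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(records, date_format='%m/%d/%Y %H:%M'):
    """Return (posts per hour, comments per hour) as two plain dicts."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in records:
        hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
        counts[hour] += 1             # missing keys start at 0
        comments[hour] += n_comments
    return dict(counts), dict(comments)


records = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['1/1/2016 9:05', 4],   # invented record, added to show accumulation
]

counts, comments = tally_by_hour(records)
print(counts['09'], comments['09'])  # 2 posts and 6 + 4 = 10 comments
```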
Now let's find out the average number of comments per post for each hour of the day.
avg_by_hour = []
for hour in comments_by_hour:
    # total comments in this hour divided by the number of posts in this hour
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
Let's print it in a more eye-friendly way to see which hours of the day have the most comments per post for the Ask posts.
sorted_list = sorted(avg_by_hour)  # each item is [hour, average], so this sorts by hour
sorted_list
[['00', 8.127272727272727], ['01', 11.383333333333333], ['02', 23.810344827586206], ['03', 7.796296296296297], ['04', 7.170212765957447], ['05', 10.08695652173913], ['06', 9.022727272727273], ['07', 7.852941176470588], ['08', 10.25], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['11', 11.051724137931034], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['15', 38.5948275862069], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532], ['19', 10.8], ['20', 21.525], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.985294117647059]]
This list gives us a clear understanding of how comments per Ask post vary throughout the day, but not of which hours are the best for receiving the most comments.
Let's print a list in order of highest average comments per post to find out.
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1], hour[0]])  # put the average first so that sorting uses it
swap_avg_by_hour
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
sorted_swap = sorted(swap_avg_by_hour, reverse=True)  # highest average first
print('Top 5 Hours for Ask Posts Comments')
for avg, hr in sorted_swap[:5]:
    print(
        '{}: {:.2f} average comments per post'
        .format(dt.datetime.strptime(hr, '%H').strftime('%H:%M'), avg)
    )
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
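The ranking above was produced by swapping the columns before sorting. An alternative sketch, shown here as a different technique rather than the notebook's method, is to pass a `key` function to `sorted` so that the [hour, average] pairs can be ranked without building a second list. The five pairs are copied from the averages computed earlier.

```python
# A subset of the [hour, average] pairs computed earlier.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first, and keep three.
top_three = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_three:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```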
We can see that the top time is 15:00, 3 o'clock in the afternoon. It is interesting to see how spread out these times are, though two two-hour intervals, 15:00 & 16:00 and 20:00 & 21:00, feature among the top 5, indicating an increase in activity at those hours.
One could argue that the first interval is when students finish college or school. The second interval falls, on average, after dinner, which is when professionals would most likely be on the platform. This second time slot could also correspond to people submitting after their after-school or after-work activities.
There is also a surge at 2am (02:00), which could be related to some members of the community posting at unusual hours, or to users posting in the morning in Europe.
As specified in the documentation:
created_at: the date and time the post was made (the time zone is Eastern Time in the US)
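To relate the Eastern Time hours to a European morning, a timestamp can be shifted between time zones. The sketch below is a simplification under stated assumptions: it uses fixed offsets (UTC-5 for Eastern, UTC+1 for Central European) and ignores daylight saving time, so it is only exact for winter dates. `eastern_to_cet` is a hypothetical helper, and the 2:00 timestamp is invented to match the hour discussed above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed standard-time offsets, daylight saving time is ignored.
EASTERN_STANDARD = timezone(timedelta(hours=-5), 'EST')
CENTRAL_EUROPEAN = timezone(timedelta(hours=1), 'CET')


def eastern_to_cet(timestamp, date_format='%m/%d/%Y %H:%M'):
    """Interpret a created_at string as EST and convert it to CET."""
    naive = datetime.strptime(timestamp, date_format)
    return naive.replace(tzinfo=EASTERN_STANDARD).astimezone(CENTRAL_EUROPEAN)


# A winter timestamp from the dataset: 13:43 in New York is 19:43 in Paris.
converted = eastern_to_cet('11/22/2015 13:43')
print(converted.strftime('%H:%M %Z'))

# An invented 2am timestamp, matching the early-morning surge discussed above.
early = eastern_to_cet('1/15/2016 2:00')
print(early.strftime('%H:%M %Z'))
```

Under these assumptions, a post made at 02:00 ET corresponds to 08:00 in Central Europe, which is consistent with the idea of European users posting in the morning.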
Conclusion
In conclusion, if you want high engagement on your question-related posts, you should post between 3pm and 5pm ET and abstain from posting before the afternoon (US Eastern Time). This is evidenced by an overall higher average number of comments per post starting from 2pm ET in the hourly averages above. The average drops off from 10pm ET, with a surge between 1am and 3am ET.
With additional data, such as the geography of the user, it could be possible to see how these hourly trends relate to different regions. Hacker News being a predominantly English-speaking website, I will assume for the time being that these variations are due to user activity in North America and Europe, where English is more widespread as a language for news and blogging.
Furthermore, these conclusions are drawn from a sample of the dataset, and with more data at our disposal, the results may vary.