In this project we will explore posts submitted to Hacker News. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
The data can be found here. It contains almost 300,000 rows, each row representing a post. However, we use a version that has been reduced to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The columns are described below:
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
In this project, we are mainly interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a question. Below are some examples of Ask HN posts:
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
Users submit Show HN posts to show the community a project, product, or something interesting. Below are some examples:
Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
Our goal is to compare the two types of posts to determine:
Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?
import pprint
from csv import reader
pp = pprint.PrettyPrinter()
# Read the whole file into a list of lists (one inner list per row).
with open('hacker_news.csv') as f:
    read_file = reader(f)
    hn = list(read_file)
pp.pprint(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
headers = hn[0]
hn = hn[1:]
pp.pprint(headers)
pp.pprint(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
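The header row shown above can also be used to access fields by name rather than by position. The following is a minimal sketch, assuming the same column layout as hacker_news.csv; the two data rows are copied from the outputs in this project and are read from an in-memory string (sample_csv is a name introduced here for illustration) so that the example runs without the file.

```python
import csv
import io

# Two rows in the same layout as hacker_news.csv, copied from this project's output.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

# DictReader consumes the header row and yields one dict per data row.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    # Values are still strings, so numeric columns need an explicit conversion.
    print(row['title'], int(row['num_comments']))
```

The rest of this project keeps the list-of-lists representation, where num_comments is at index 4 and created_at is at index 6.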
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print("Number of ask hn post {}".format(len(ask_posts)))
print("Number of show hn post {}".format(len(show_posts)))
print("Number of other post {}".format(len(other_posts)))
Number of ask hn post 1744
Number of show hn post 1162
Number of other post 17194
We separated the ask posts, show posts, and other posts into three lists of lists. There are 1744 ask posts, 1162 show posts, and 17194 other posts. Below are the first five rows of each post type.
print('ASK POSTS\n=====================')
pp.pprint(ask_posts[:5])
print('SHOW POSTS\n=====================')
pp.pprint(show_posts[:5])
print('OTHER POSTS\n=====================')
pp.pprint(other_posts[:5])
ASK POSTS ===================== [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']] SHOW POSTS ===================== [['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development ' 'Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with ' 'Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']] OTHER POSTS ===================== [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], 
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
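The prefix test used to split the posts can also be isolated in a small helper, which makes the rule easy to check on individual titles. This is only a sketch: classify_title is a hypothetical name introduced here, and it applies the same case-insensitive startswith check as the loop above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Titles copied from the output above.
examples = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
for title in examples:
    print(classify_title(title), '-', title)
```

Because the title is lowercased first, variants such as "ASK HN" or "ask hn" are grouped with the ask posts as well.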
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of comments for ask posts: {:.2f}'.format(avg_ask_comments))
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of comments for show posts: {:.2f}'.format(avg_show_comments))
Average number of comments for ask posts: 14.04
Average number of comments for show posts: 10.32
On average, ask posts receive more comments than show posts: about 14 comments per ask post compared with about 10 per show post.
One plausible explanation is that people are more inclined to answer a question than to comment on a showcased project. Since ask posts are more likely to receive comments, the rest of the analysis focuses on them.
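The two loops above follow the same pattern, so the calculation could be factored into a helper. The sketch below is one possible way to do that with the standard statistics module; comment_counts is a hypothetical helper name, and the three sample rows are Ask HN rows copied from the earlier output. It also prints the median, which is less sensitive than the mean to a few heavily commented posts.

```python
from statistics import mean, median


def comment_counts(posts, column=4):
    """Extract the comment counts (column index 4) as integers."""
    return [int(post[column]) for post in posts]


# Three Ask HN rows copied from the output earlier in this project.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
]

counts = comment_counts(sample_posts)
print('Mean: {:.2f}'.format(mean(counts)))
print('Median: {}'.format(median(counts)))
```

With the full lists, the same helper could be called as comment_counts(ask_posts) and comment_counts(show_posts).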
import datetime as dt
# Pair each ask post's creation time with its comment count.
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments received by those posts
date_format = '%m/%d/%Y %H:%M'
for row in result_list:
    created_at = dt.datetime.strptime(row[0], date_format)
    hour = created_at.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
print('Posts created by hour:')
pp.pprint(counts_by_hour)
print('======================================')
print('Comments posted by hour:')
pp.pprint(comments_by_hour)
Posts created by hour: {'00': 55, '01': 60, '02': 58, '03': 54, '04': 47, '05': 46, '06': 44, '07': 34, '08': 48, '09': 45, '10': 59, '11': 58, '12': 73, '13': 85, '14': 107, '15': 116, '16': 108, '17': 100, '18': 109, '19': 110, '20': 80, '21': 109, '22': 71, '23': 68} ====================================== Comments posted by hour: {'00': 447, '01': 683, '02': 1381, '03': 421, '04': 337, '05': 464, '06': 397, '07': 267, '08': 492, '09': 251, '10': 793, '11': 641, '12': 687, '13': 1253, '14': 1416, '15': 4477, '16': 1814, '17': 1146, '18': 1439, '19': 1188, '20': 1722, '21': 1745, '22': 479, '23': 543}
Above, we created two dictionaries: counts_by_hour for the number of ask posts created in each hour, and comments_by_hour for the total number of comments received by the posts created in each hour. The hours are in 24-hour format. For example, 100 posts were created at hour 17 (5 p.m.), and together they received 1146 comments.
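The same tallies can be built without the explicit "if hour not in" check by using collections.defaultdict. The following is a minimal sketch, not the approach used in this project: tally_by_hour is a hypothetical helper, and the sample pairs are taken from Ask HN rows shown earlier, apart from one made-up entry that is marked in the code.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'


def tally_by_hour(pairs):
    """Return (counts, comments) dictionaries keyed by hour ('00' to '23').

    `pairs` is a list of [created_at, num_comments] entries, like result_list.
    """
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


sample = [
    ['8/16/2016 9:55', 6],   # from the Ask HN output above
    ['8/2/2016 14:20', 3],   # from the Ask HN output above
    ['8/3/2016 14:45', 5],   # hypothetical entry, added to show accumulation
]
sample_counts, sample_comments = tally_by_hour(sample)
print(sample_counts)
print(sample_comments)
```

A defaultdict(int) returns 0 for a key it has not seen yet, so both dictionaries can be updated with a plain += in every iteration.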
Now let's calculate the average number of comments for posts created during each hour of the day. We'll use the counts_by_hour and comments_by_hour dictionaries.
avg_by_hour = []
# Each key is an hour; divide total comments by the number of posts for that hour.
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print("Average no's of comments per post:")
pp.pprint(avg_by_hour)
Average no's of comments per post: [['00', 8.127272727272727], ['11', 11.051724137931034], ['22', 6.746478873239437], ['06', 9.022727272727273], ['18', 13.20183486238532], ['14', 13.233644859813085], ['05', 10.08695652173913], ['07', 7.852941176470588], ['15', 38.5948275862069], ['23', 7.985294117647059], ['04', 7.170212765957447], ['20', 21.525], ['19', 10.8], ['16', 16.796296296296298], ['01', 11.383333333333333], ['12', 9.41095890410959], ['10', 13.440677966101696], ['02', 23.810344827586206], ['21', 16.009174311926607], ['03', 7.796296296296297], ['17', 11.46], ['08', 10.25], ['13', 14.741176470588234], ['09', 5.5777777777777775]]
swap_avg_by_hour = []
for h, c in avg_by_hour:
    swap_avg_by_hour.append([c, h])
pp.pprint(swap_avg_by_hour)
[[8.127272727272727, '00'], [11.051724137931034, '11'], [6.746478873239437, '22'], [9.022727272727273, '06'], [13.20183486238532, '18'], [13.233644859813085, '14'], [10.08695652173913, '05'], [7.852941176470588, '07'], [38.5948275862069, '15'], [7.985294117647059, '23'], [7.170212765957447, '04'], [21.525, '20'], [10.8, '19'], [16.796296296296298, '16'], [11.383333333333333, '01'], [9.41095890410959, '12'], [13.440677966101696, '10'], [23.810344827586206, '02'], [16.009174311926607, '21'], [7.796296296296297, '03'], [11.46, '17'], [10.25, '08'], [14.741176470588234, '13'], [5.5777777777777775, '09']]
# sort by the average number of comments
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
pp.pprint(sorted_swap[:5])
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
As you can see above, we sorted the swapped list in descending order and printed the top five hours for Ask HN comments. Posts created at 15:00 (3 p.m.) receive the most comments per post on average, about 38.59, followed by posts created at 02:00 (2 a.m.) with about 23.81.
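Swapping the columns is one way to sort by the average. An alternative, sketched below, is to pass a key function to sorted so that the original [hour, average] order is kept. The pairs in avg_by_hour_sample (a name introduced here) are a subset of the averages printed above.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['00', 8.127272727272727],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
    ['09', 5.5777777777777775],
]

# Sort on the second element (the average) without building a swapped list.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, average))
```

The key function only changes which value is compared, so the hours and averages stay paired exactly as in the original list.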
print('Top 5 Hours for Ask Posts Comments', '\n')
for comment, hour in sorted_swap[:5]:
    each_hour = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    comment_per_hour = '{h}: {c:.2f} average comments per post'.format(h=each_hour, c=comment)
    print(comment_per_hour)
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
Let's summarize the project.
Post title: in this sample, posts whose titles begin with Ask HN attract more comments on average than posts whose titles begin with Show HN:
Ask HN: 14.04 average comments per post
Show HN: 10.32 average comments per post
Post timing: the hour at which a post is created appears to be related to the number of comments it attracts. Based on an analysis of the Ask HN posts, the top hours are:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post