We will be reviewing a dataset of submissions (user-submitted stories known as "posts") to the popular technology site Hacker News. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can draw hundreds of thousands of visitors.
We are specifically interested in posts whose titles begin with either Ask HN or Show HN. Ask HN posts put specific questions to the Hacker News community, while Show HN posts showcase a project, a product, or just something generally interesting to the same community. Our aim is to compare the two types of posts to determine which receives more comments and points on average, and whether posts created at certain hours of the day attract more engagement.
from csv import reader
# Read every row into a list; the context manager closes the file afterwards.
with open('hacker_news.csv', encoding='utf8') as file:
    hn = list(reader(file))
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
Header and Columns Description
headers = hn[0]
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
hn = hn[1:]
print(hn[:5])
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
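As an alternative to slicing the header off by position, the rows can be read as dictionaries keyed by column name. The sketch below is only an illustration: it parses two rows copied from the output above out of an in-memory string rather than from hacker_news.csv, and the names `SAMPLE_CSV` and `read_rows` are hypothetical.

```python
import csv
import io

# Two rows copied from the printed output, in the same column layout as the dataset.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)


def read_rows(handle):
    """Return (headers, rows), where each row is a dict keyed by column name."""
    dict_reader = csv.DictReader(handle)
    rows = list(dict_reader)
    return dict_reader.fieldnames, rows


headers, rows = read_rows(io.StringIO(SAMPLE_CSV))
print(headers)
print(rows[0]["title"], rows[0]["num_comments"])
```

With this layout a column is addressed as `row["num_comments"]` instead of `row[4]`, which makes it harder to pick the wrong index by accident.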
Separating Ask HN and Show HN posts
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1]
    # Lowercase the title so the prefix match is case-insensitive.
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
9139 10158 273822
print(ask_posts[:5])
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]
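The prefix test used to split the posts can also be written as a small function, which makes it easy to check on a few titles before running it over the whole dataset. This is a minimal sketch; `classify_title` is a hypothetical helper name, and the third title is made up to show the case-insensitive match.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


titles = [
    "Ask HN: What TLD do you use for local development?",
    "SQLAR the SQLite Archiver",
    "show hn: a hypothetical project title",  # made-up title, lowercase prefix
]
print([classify_title(title) for title in titles])
```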
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
10.393478498741656
total_show_comments = 0
for post in show_posts:
    num_comments = int(float(post[4]))
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
4.886099625910612
On average, Ask HN posts have more comments than Show HN posts.
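Because the same total-then-divide pattern is repeated for each group and each column, it may help to wrap it in one function so that the divisor always comes from the same list as the total. The sketch below assumes the row layout shown earlier (column 3 is num_points, column 4 is num_comments); `average_field` is a hypothetical name, and the sample uses the first two Ask HN rows from the output above.

```python
def average_field(posts, index):
    """Mean of the integer column at `index`, dividing by the length of the same list."""
    if not posts:
        raise ValueError("cannot average an empty list of posts")
    total = sum(int(post[index]) for post in posts)
    return total / len(posts)


# First two Ask HN rows from the printed output.
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7',
     'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3',
     'PascLeRasc', '9/26/2016 1:17'],
]
print(average_field(sample_ask, 4))  # comments: (7 + 3) / 2
print(average_field(sample_ask, 3))  # points: (4 + 6) / 2
```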
import datetime as dt
# Pair each Ask HN post's creation time with its comment count.
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])
# Tally the number of posts and comments for each hour of the day ("00" to "23").
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
# Average comments per post for each hour.
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(sorted(avg_by_hour))
[['00', 7.5647840531561465], ['01', 7.407801418439717], ['02', 11.137546468401487], ['03', 7.948339483394834], ['04', 9.7119341563786], ['05', 8.794258373205741], ['06', 6.782051282051282], ['07', 7.013274336283186], ['08', 9.190661478599221], ['09', 6.653153153153153], ['10', 10.684397163120567], ['11', 8.96474358974359], ['12', 12.380116959064328], ['13', 16.31756756756757], ['14', 9.692007797270955], ['15', 28.676470588235293], ['16', 7.713298791018998], ['17', 9.449744463373083], ['18', 7.94299674267101], ['19', 7.163043478260869], ['20', 8.749019607843136], ['21', 8.687258687258687], ['22', 8.804177545691905], ['23', 6.696793002915452]]
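The hourly tally can be expressed more compactly with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. This is a sketch under the assumption that timestamps use the same "%m/%d/%Y %H:%M" layout as the dataset; `average_by_hour` is a hypothetical name, and the sample pairs are the creation times and comment counts of four Ask HN rows shown earlier.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows):
    """rows: iterable of [created_at, value] pairs; returns {hour_string: mean value}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for created_at, value in rows:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        totals[hour] += value
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in counts}


sample = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 1:17", 3],
    ["9/25/2016 22:57", 0],
    ["9/25/2016 22:48", 3],
]
print(average_by_hour(sample))
```

On the sample, the two posts created during the 22:00 hour are averaged together, while the 01:00 and 02:00 hours each hold a single post.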
# Swap each pair to [average, hour] so that sorting orders by average.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments")
# Only the first five entries are the top five hours.
for row in sorted_swap[:5]:
    time = dt.time(hour = int(row[1])).strftime("%H:%M")
    print("{time}: {avg:.2f} average comments per post".format(time=time, avg=row[0]))
[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]
Top 5 Hours for Ask Posts Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
The ranking above suggests that Ask HN posts attract the most comments when created in the early afternoon: the 15:00 hour peaks at 28.68 average comments per post, followed by 13:00 (16.32) and 12:00 (12.38). For night owls, 02:00 is the next best hour (11.14).
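The swap-then-sort step is one way to rank the hours. A possible alternative, sketched below, sorts the original [hour, average] pairs directly with a `key` function and slices off the first few entries. `top_hours` is a hypothetical helper, and the sample values are a subset of the hourly averages reported above, rounded to two decimals.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest average, best first."""
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]


# Subset of the Ask HN hourly averages, rounded to two decimals.
sample_avg = [
    ["02", 11.14],
    ["09", 6.65],
    ["12", 12.38],
    ["13", 16.32],
    ["15", 28.68],
    ["23", 6.70],
]
for hour, avg in top_hours(sample_avg, n=3):
    print(f"{hour}:00 {avg:.2f} average comments per post")
```

Sorting with a key leaves the pairs in their original [hour, average] order, so no second list is needed.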
total_ask_points = 0
for post in ask_posts:
    num_points = int(post[3])
    total_ask_points += num_points
avg_ask_points = total_ask_points / len(ask_posts)
print(round(avg_ask_points, 2))
total_show_points = 0
for post in show_posts:
    num_points = int(post[3])
    total_show_points += num_points
# The Show HN total must be divided by the number of Show HN posts, not Ask HN posts.
avg_show_points = total_show_points / len(show_posts)
print(round(avg_show_points, 2))
11.31 14.84
On average, Show HN posts have more points than Ask HN posts.
avg_show_points > avg_ask_points
Determining which hours of the day other posts earn the most points
import datetime as dt
# Pair each remaining post's creation time with its points.
point_list = []
for post in other_posts:
    created_at = post[6]
    num_points = int(post[3])
    point_list.append([created_at, num_points])
# Tally the number of posts and points for each hour of the day ("00" to "23").
counts_by_hour = {}
points_by_hour = {}
for row in point_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        points_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        points_by_hour[hour] += row[1]
# Average points per post for each hour.
avg_points_by_hour = []
for hour in counts_by_hour:
    avg_points_by_hour.append([hour, points_by_hour[hour] / counts_by_hour[hour]])
print(sorted(avg_points_by_hour))
[['00', 16.12263139077583], ['01', 15.911919902584224], ['02', 16.712053891357318], ['03', 15.379154760114304], ['04', 15.609360936093609], ['05', 15.69731925264013], ['06', 14.920477423065861], ['07', 14.939901880621422], ['08', 15.08953341740227], ['09', 14.742026266416511], ['10', 15.144578313253012], ['11', 16.292903091927787], ['12', 16.699393735264398], ['13', 16.017749092375958], ['14', 14.11666371315494], ['15', 14.549354320234993], ['16', 14.646248004257584], ['17', 15.122583455862332], ['18', 15.430311386878088], ['19', 15.570531734572164], ['20', 13.785120643431636], ['21', 14.785966981132075], ['22', 14.308998884790254], ['23', 14.702674897119342]]
# Swap each pair to [average, hour] so that sorting orders by average.
swap_avg_points_by_hour = []
for row in avg_points_by_hour:
    swap_avg_points_by_hour.append([row[1], row[0]])
print(swap_avg_points_by_hour)
sorted_swap = sorted(swap_avg_points_by_hour, reverse = True)
print("Top 5 Hours for Other Posts by Average Points")
# Only the first five entries are the top five hours.
for row in sorted_swap[:5]:
    time = dt.time(hour = int(row[1])).strftime("%H:%M")
    print("{time}: {avg:.2f} average points per post".format(time=time, avg=row[0]))
[[15.379154760114304, '03'], [16.712053891357318, '02'], [15.911919902584224, '01'], [16.12263139077583, '00'], [14.702674897119342, '23'], [14.308998884790254, '22'], [14.785966981132075, '21'], [13.785120643431636, '20'], [15.570531734572164, '19'], [15.430311386878088, '18'], [15.122583455862332, '17'], [14.646248004257584, '16'], [14.549354320234993, '15'], [14.11666371315494, '14'], [16.017749092375958, '13'], [16.699393735264398, '12'], [16.292903091927787, '11'], [15.144578313253012, '10'], [14.742026266416511, '09'], [15.08953341740227, '08'], [14.939901880621422, '07'], [14.920477423065861, '06'], [15.69731925264013, '05'], [15.609360936093609, '04']]
Top 5 Hours for Other Posts by Average Points
02:00: 16.71 average points per post
12:00: 16.70 average points per post
11:00: 16.29 average points per post
00:00: 16.12 average points per post
13:00: 16.02 average points per post
total_other_comments = 0
for post in other_posts:
    num_comments = int(post[4])
    total_other_comments += num_comments
avg_other_posts_comments = total_other_comments / len(other_posts)
print(avg_other_posts_comments)
total_other_points = 0
for post in other_posts:
    num_points = int(post[3])
    total_other_points += num_points
avg_other_posts_points = total_other_points / len(other_posts)
print(avg_other_posts_points)
6.4572678601427205 15.156010108756783
print(avg_ask_comments)
print(avg_show_comments)
print(avg_other_posts_comments)
10.393478498741656 4.886099625910612 6.4572678601427205
print(round(avg_ask_points, 2))
# Show HN average: total points divided by the number of Show HN posts.
print(round(total_show_points / len(show_posts), 2))
print(round(avg_other_posts_points, 2))
11.31 14.84 15.16
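Ask HN posts stand out on Hacker News for engagement through discussion: they average 10.39 comments per post, compared with 4.89 for Show HN posts and 6.46 for all other posts. They do not lead on upvotes, however, averaging 11.31 points per post, fewer than either Show HN or other posts. The hourly breakdown also suggests, rather than proves, that Ask HN posts created around 15:00 receive the most comments on average.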