Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
You can find the data set here. A reduced version of this data set also exists (approximately 20,000 rows, produced by removing all submissions that did not receive any comments and then randomly sampling from the remainder); in this project, however, we work with the full data set of almost 300,000 rows.
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, a product, or just generally something interesting.
In this project, we'll aim to find out what kind of posts are more likely to receive attention on Hacker News. To do so, we will answer two questions: Do Ask HN or Show HN posts receive more comments on average? And do posts created at a certain time of day receive more comments on average?
The conclusion from this data analysis is that Ask HN posts on Hacker News received more comments on average than Show HN posts. In addition, Ask HN posts created at 15:00 EST / 22:00 CET received the highest average number of comments.
First, we open and read the hacker_news.csv data set as a list of lists and assign it to the variable hn.
from csv import reader
opened_file = open("hacker_news.csv", encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
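As a side note, a more defensive way to read the file is a with block, which closes the file handle automatically even if an error occurs. The sketch below substitutes a small in-memory sample (a hypothetical header plus one row) so it runs without hacker_news.csv on disk:

```python
import csv
import io

# Hypothetical in-memory stand-in for hacker_news.csv (header + one row),
# so this sketch runs even without the file present.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,Ask HN: Example question,,1,0,altstar,9/26/2016 3:26\n"
)

# The with block closes the handle automatically when the block exits.
with io.StringIO(sample_csv) as f:
    rows = list(csv.reader(f))

print(rows[0])   # the header row
print(len(rows)) # 2
```

With the real file, `io.StringIO(sample_csv)` would simply be replaced by `open("hacker_news.csv", encoding="utf8")`.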
Then, we print the first few rows of the data set.
print(hn[0])
print('\n')
print(hn[1:4])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]
headers = hn[0]
hn = hn[1:]
Because the first row (the first list in the list of lists) is the header row, we have separated it from the rest of the data.
print(headers)
print('\n')
print(hn[0:4])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
9139
10158
273822
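Lowercasing each title before calling startswith() is what makes the match case-insensitive, which matters because users capitalize these prefixes inconsistently. A small sketch with hypothetical titles:

```python
# Hypothetical titles illustrating why we lowercase before matching.
titles = [
    "Ask HN: How should I learn to code?",
    "ASK HN: Salary negotiation tips?",
    "Show HN: My weekend project",
    "Tell HN: Something else entirely",
]

# Both "Ask HN" variants match, regardless of capitalization.
ask = [t for t in titles if t.lower().startswith("ask hn")]
print(len(ask))  # 2
```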
There are 9,139 Ask HN posts and 10,158 Show HN posts in the data set. Next, we will determine whether Ask HN or Show HN posts receive more comments on average.
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments, 2))
10.39
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments, 2))
4.89
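The loop-and-accumulator pattern above can also be condensed with sum() and a generator expression. A sketch using hypothetical rows that mirror the data set's layout (num_comments at index 4):

```python
# Hypothetical rows shaped like the data set: num_comments is at index 4.
posts = [
    ["id1", "Ask HN: a", "", "5", "10", "user1", "9/26/2016 3:26"],
    ["id2", "Ask HN: b", "", "3", "4",  "user2", "9/26/2016 3:24"],
]

# sum() over a generator expression replaces the explicit accumulator loop.
avg = sum(int(row[4]) for row in posts) / len(posts)
print(avg)  # (10 + 4) / 2 = 7.0
```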
It seems that Ask HN posts receive more than twice as many comments as Show HN posts (10.39 compared to 4.89). Since Ask HN posts are more likely to receive comments, we'll focus the remainder of our analysis on these posts alone.
Next, we'll determine whether Ask HN posts created at a certain time are more likely to attract comments. To perform this analysis, we'll calculate the number of ask posts and the number of comments created in each hour of the day, and then compute the average number of comments per post for each hour.
First, we create a list of lists containing the number of comments and the date and hour at which each ask post was created.
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

print(result_list[0:2])
[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3]]
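Each created_at value is a plain string like '9/26/2016 2:53', so before we can group posts by hour we need to parse it. A minimal sketch of the parsing step, using one timestamp from the output above:

```python
import datetime as dt

# Parse one created_at string and extract just the hour.
stamp = "9/26/2016 2:53"
parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")
hour = parsed.strftime("%H")
print(hour)  # '02' -- zero-padded, which keeps dictionary keys uniform
```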
Next, we calculate the number of comments on ask posts created during each hour of the day, using dictionary frequency tables and the datetime module.
import datetime as dt

counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total number of comments on ask posts created in each hour
date_format = "%m/%d/%Y %H:%M"

for result in result_list:
    creation_time_str = result[0]
    comment_result = result[1]
    time = dt.datetime.strptime(creation_time_str, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment_result
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment_result
comments_by_hour
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
counts_by_hour
{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
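As an alternative to the explicit membership check above, collections.defaultdict initializes missing keys to zero automatically. A sketch with a few hypothetical [created_at, num_comments] pairs shaped like result_list:

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs, like those in result_list.
results = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 2:10", 3],
    ["9/25/2016 15:30", 20],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, n in results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1    # missing keys start at 0, so no membership check needed
    comments[hour] += n

print(dict(counts))    # {'02': 2, '15': 1}
print(dict(comments))  # {'02': 10, '15': 20}
```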
avg_by_hour = []

for time in comments_by_hour:
    avg = comments_by_hour[time] / counts_by_hour[time]
    avg_by_hour.append([time, avg])
avg_by_hour
[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
Although we now have the results we need (the average number of comments for ask posts per hour of the post creation), this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append((row[1], row[0]))
swap_avg_by_hour
[(11.137546468401487, '02'), (7.407801418439717, '01'), (8.804177545691905, '22'), (8.687258687258687, '21'), (7.163043478260869, '19'), (9.449744463373083, '17'), (28.676470588235293, '15'), (9.692007797270955, '14'), (16.31756756756757, '13'), (8.96474358974359, '11'), (10.684397163120567, '10'), (6.653153153153153, '09'), (7.013274336283186, '07'), (7.948339483394834, '03'), (6.696793002915452, '23'), (8.749019607843136, '20'), (7.713298791018998, '16'), (9.190661478599221, '08'), (7.5647840531561465, '00'), (7.94299674267101, '18'), (12.380116959064328, '12'), (9.7119341563786, '04'), (6.782051282051282, '06'), (8.794258373205741, '05')]
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap
[(28.676470588235293, '15'), (16.31756756756757, '13'), (12.380116959064328, '12'), (11.137546468401487, '02'), (10.684397163120567, '10'), (9.7119341563786, '04'), (9.692007797270955, '14'), (9.449744463373083, '17'), (9.190661478599221, '08'), (8.96474358974359, '11'), (8.804177545691905, '22'), (8.794258373205741, '05'), (8.749019607843136, '20'), (8.687258687258687, '21'), (7.948339483394834, '03'), (7.94299674267101, '18'), (7.713298791018998, '16'), (7.5647840531561465, '00'), (7.407801418439717, '01'), (7.163043478260869, '19'), (7.013274336283186, '07'), (6.782051282051282, '06'), (6.696793002915452, '23'), (6.653153153153153, '09')]
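The swap step is one way to sort by the average; passing a key function to sorted() achieves the same ordering without building a second list. A sketch with hypothetical [hour, average] pairs shaped like avg_by_hour:

```python
# Hypothetical [hour, average] pairs shaped like avg_by_hour.
avg_by_hour_sample = [["02", 11.1], ["15", 28.7], ["13", 16.3]]

# The key function sorts by the average (index 1) directly,
# so no element-swapping pass is needed.
top = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)
print(top)  # [['15', 28.7], ['13', 16.3], ['02', 11.1]]
```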
print("Top 5 Hours for Ask Posts Comments:")

for row in sorted_swap[0:5]:
    us_hour_dt = dt.datetime.strptime(row[1], '%H')
    us_hour_str = us_hour_dt.strftime('%H:%M')
    my_time_dt = us_hour_dt + dt.timedelta(hours=7)  # shift from EST to CET
    my_time_str = my_time_dt.strftime('%H:%M')
    print('{time} EST / {my_time} CET: {avg:.2f} average comments per post'.format(
        time=us_hour_str, my_time=my_time_str, avg=row[0]))

Top 5 Hours for Ask Posts Comments:
15:00 EST / 22:00 CET: 28.68 average comments per post
13:00 EST / 20:00 CET: 16.32 average comments per post
12:00 EST / 19:00 CET: 12.38 average comments per post
02:00 EST / 09:00 CET: 11.14 average comments per post
10:00 EST / 17:00 CET: 10.68 average comments per post
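The fixed 7-hour shift above corresponds to converting EST (UTC-5) to Central European Summer Time (UTC+2); datetime's timezone objects make those offsets explicit. A sketch of the same conversion for the peak hour, with the zone names written out as assumptions:

```python
import datetime as dt

# Fixed-offset zones: EST is UTC-5; CEST (Central European Summer Time)
# is UTC+2, which is where the 7-hour shift comes from.
est = dt.timezone(dt.timedelta(hours=-5), name="EST")
cest = dt.timezone(dt.timedelta(hours=2), name="CEST")

peak = dt.datetime(2016, 9, 26, 15, 0, tzinfo=est)   # 15:00 EST
print(peak.astimezone(cest).strftime("%H:%M"))       # 22:00
```

Note that these are fixed offsets; a daylight-saving-aware conversion would use named IANA zones instead.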
In this data set, the Ask HN posts created at 15:00 EST / 22:00 CET received the highest average number of comments.
There is also a large gap between posts created at 15:00 EST / 22:00 CET and the hour with the second-highest average (13:00 EST / 20:00 CET): 28.68 versus 16.32 average comments per post, i.e. roughly 1.8 times as many for the top hour.
A fair conclusion from this data analysis is that, historically, among Ask HN posts on Hacker News, those created at 15:00 EST / 22:00 CET received the highest average number of comments.
As next steps, we could extend this analysis further:
Determine if show or ask posts receive more points on average.
Determine if posts created at a certain time are more likely to receive more points.
Compare these results to the average number of comments and points that other posts receive.