Hacker News is a popular site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented on. It is extremely popular in technology and startup circles, and posts that reach the top of its listings can receive hundreds of thousands of visitors as a result.
You can find the data set here. The data has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
Users submit Ask HN posts to ask the Hacker News community a specific question. Users submit Show HN posts to show the Hacker News community a project, product, or something interesting. For this analysis, we will compare Ask HN posts and Show HN posts to determine which type receives more comments on average and whether posts created at a certain time receive more comments on average.
We begin by reading in the data set and removing the headers.
# Read in the data set.
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
hn[:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
# Remove the headers.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
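As a variation on the cells above, the file can be read inside a `with` block so that it is closed automatically. The sketch below is only an illustration: it reads from an in-memory excerpt (`sample_csv`, built from two rows shown above) rather than `hacker_news.csv`, and the helper name `read_rows` is an assumption, not part of the original notebook.

```python
import io
from csv import reader

# Two-row excerpt standing in for hacker_news.csv (rows copied from the output above).
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
    "11964716,\"Florida DJs May Face Felony for April Fools' Water Joke\","
    "http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/,"
    "2,1,vezycash,6/23/2016 22:20\n"
)


def read_rows(file_obj):
    """Return (headers, rows) from an open CSV file object (hypothetical helper)."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# With the real file this would be:
#     with open('hacker_news.csv') as f:
#         headers, hn = read_rows(f)
with io.StringIO(sample_csv) as f:
    sample_headers, sample_rows = read_rows(f)

print(sample_headers)
print(len(sample_rows))
```

Returning the header row separately keeps the later loops from having to skip it.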
Among other columns, the data set contains the title of each post, the number of comments it received, and the date and time it was created.
We will create a list each for Ask HN, Show HN, and Other HN posts, then extract the corresponding posts and place them in the appropriate list. The Other HN list will contain the posts with comments that are neither Ask HN nor Show HN posts. Then we will determine the length of each list. Filtering the data this way makes the rest of the analysis more specific and easier.
# Extract posts whose titles begin with 'Ask HN' or 'Show HN' and append them to the appropriate list.
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
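The prefix test used above can also be wrapped in a small function, which makes it easy to try on individual titles. This is a minimal sketch: the name `classify_title` and the example titles are hypothetical and are not taken from the data set, apart from the last title.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical titles, used only to exercise the helper.
example_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My weekend project',
    'Interactive Dynamic Video',
]
for example in example_titles:
    print(classify_title(example), '-', example)
```

As a sanity check on the counts printed above, 1744 + 1162 + 17194 = 20100, which matches the roughly 20,000 rows in the reduced data set.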
We now have the Ask HN and Show HN posts in separate lists. We will calculate the average number of comments received by each type of post.
# Calculate the total and average number of comments on 'Ask HN' posts.
total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
print(total_ask_comments)
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
24483
14.038417431192661
# Calculate the total and average number of comments on 'Show HN' posts.
total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
print(total_show_comments)
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
11988
10.31669535283993
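Because the same calculation is repeated for both lists, it could be factored into one function. The following is a sketch under stated assumptions: `average_comments` is a hypothetical helper name, and the sample rows are three rows from the output shown earlier (not Ask HN or Show HN posts), used only to exercise the function.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column over a list of rows.

    Returns 0.0 for an empty list to avoid dividing by zero.
    """
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Three rows copied from the data set preview, with 52, 10 and 1 comments.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke",
     'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
     '2', '1', 'vezycash', '6/23/2016 22:20'],
]
print(average_comments(sample_posts))
```

With the lists built earlier, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.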
There were about 1.5 times as many Ask HN posts (1,744) as Show HN posts (1,162). On average, Ask HN posts received approximately 14 comments, while Show HN posts received approximately 10. Because Ask HN posts receive more comments on average, they will be the focus of the remainder of the analysis.
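The comparison can be checked directly from the counts and totals printed above. The snippet below is a small arithmetic sketch; the variable names are chosen here for illustration only.

```python
# Counts and comment totals copied from the output above.
ask_count, show_count = 1744, 1162
total_ask, total_show = 24483, 11988

count_ratio = ask_count / show_count
avg_ask = total_ask / ask_count
avg_show = total_show / show_count
avg_ratio = avg_ask / avg_show

print("Ask HN posts per Show HN post: {:.2f}".format(count_ratio))
print("Average comments, Ask HN vs Show HN: {:.2f} vs {:.2f}".format(avg_ask, avg_show))
print("Ratio of averages: {:.2f}".format(avg_ratio))
```

In this sample there are roughly one and a half times as many Ask HN posts as Show HN posts, and the average Ask HN post receives roughly a third more comments.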
# Calculate the number of Ask HN posts created during each hour of the day and the number of comments they received.
import datetime as dt
result_list = []
for post in ask_posts:
    result_list.append(
        [post[6], int(post[4])]
    )
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comments = row[1]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")
    if hour in counts_by_hour:
        comments_by_hour[hour] += comments
        counts_by_hour[hour] += 1
    else:
        comments_by_hour[hour] = comments
        counts_by_hour[hour] = 1
comments_by_hour
{'00': 447, '01': 683, '02': 1381, '03': 421, '04': 337, '05': 464, '06': 397, '07': 267, '08': 492, '09': 251, '10': 793, '11': 641, '12': 687, '13': 1253, '14': 1416, '15': 4477, '16': 1814, '17': 1146, '18': 1439, '19': 1188, '20': 1722, '21': 1745, '22': 479, '23': 543}
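The `if`/`else` bookkeeping above can be shortened with `collections.defaultdict`, which creates a missing key with a starting value of 0. This is an alternative sketch rather than the notebook's method; the sample pairs reuse `created_at` and `num_comments` values from the data set preview, plus one hypothetical pair added so that two posts share an hour.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# (created_at, num_comments) pairs. The last pair is hypothetical.
sample_pairs = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("6/23/2016 22:20", 1),
    ("9/30/2015 4:12", 2),
    ("8/5/2016 11:05", 8),
]

sample_counts = defaultdict(int)
sample_comments = defaultdict(int)
for created_at, num_comments in sample_pairs:
    # Extract the zero-padded hour, e.g. "04" or "11".
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```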
# Calculate the average number of comments per 'Ask HN' post for each hour of the day in which posts were created.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour
[['19', 10.8], ['15', 38.5948275862069], ['06', 9.022727272727273], ['10', 13.440677966101696], ['05', 10.08695652173913], ['01', 11.383333333333333], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['18', 13.20183486238532], ['16', 16.796296296296298], ['23', 7.985294117647059], ['07', 7.852941176470588], ['03', 7.796296296296297], ['22', 6.746478873239437], ['11', 11.051724137931034], ['08', 10.25], ['09', 5.5777777777777775], ['20', 21.525], ['17', 11.46], ['02', 23.810344827586206], ['21', 16.009174311926607], ['00', 8.127272727272727], ['04', 7.170212765957447]]
# Swap the columns so that the average comes first, then sort from highest to lowest average number of comments.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap
[[10.8, '19'], [38.5948275862069, '15'], [9.022727272727273, '06'], [13.440677966101696, '10'], [10.08695652173913, '05'], [11.383333333333333, '01'], [9.41095890410959, '12'], [14.741176470588234, '13'], [13.233644859813085, '14'], [13.20183486238532, '18'], [16.796296296296298, '16'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [6.746478873239437, '22'], [11.051724137931034, '11'], [10.25, '08'], [5.5777777777777775, '09'], [21.525, '20'], [11.46, '17'], [23.810344827586206, '02'], [16.009174311926607, '21'], [8.127272727272727, '00'], [7.170212765957447, '04']]
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
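Swapping the columns is one way to sort by the average. Another option, shown here only as a sketch, is to pass a `key` function to `sorted()` so that the original `[hour, average]` order is kept. The list below is a four-row excerpt of `avg_by_hour` copied from the output above.

```python
# Four [hour, average] rows copied from avg_by_hour above.
sample_avg_by_hour = [
    ['19', 10.8],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort on the second element (the average), highest first.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)
for hour, avg in sorted_by_avg:
    print(hour, round(avg, 2))
```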
# Print the 5 hours with the highest average number of comments per post.
print("Top 5 Hours for Ask HN Post Comments")
for avg, hr in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )
Top 5 Hours for Ask HN Post Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
With an average of 38.59 comments per post, 15:00 (3:00 pm US EST) has the highest average number of comments per post. That is more than 60% greater than the hour with the second-highest average.
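The size of the gap between the top two hours follows from the two averages in the sorted output. A short arithmetic sketch, with variable names chosen here for illustration:

```python
top_avg = 38.5948275862069       # 15:00, from the sorted output above
second_avg = 23.810344827586206  # 02:00, from the sorted output above

# Relative difference of the top hour over the runner-up.
relative_gap = (top_avg - second_avg) / second_avg
print("{:.1%}".format(relative_gap))
```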
The goal of this project was to analyze Ask HN posts and Show HN posts to ascertain which type of post and which time of day received the most comments on average. From our analysis, we determined that Ask HN posts receive more comments on average than Show HN posts, and that Ask HN posts created between 15:00 and 16:00 (3:00 pm to 4:00 pm EST) received the most comments on average.
However, posts that did not receive any comments were excluded from the data set, and Ask HN and Show HN posts together make up only a small share of the commented posts compared with Other HN posts. Our conclusion therefore applies only to posts that received comments: among those, Ask HN posts received more comments on average than Show HN posts, and Ask HN posts created between 15:00 and 16:00 received the most comments on average.