Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
You can find the data set here. The data set has already been initially cleaned by removing all rows with zero comments, and then pulling a random sampling of the remaining posts. This had the effect of bringing the data set from nearly 300,000 rows to 20,000 rows. Examples of this data are listed below.
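As a hypothetical sketch of the cleaning step described above (it is not part of this analysis, which starts from the already-cleaned file), one could drop rows with zero comments and then take a random sample of the remainder. The function name `clean_posts` and the fixed seed are illustrative assumptions; column index 4 (`num_comments`) matches the data set used here.

```python
import random

def clean_posts(rows, sample_size=20000, seed=1):
    """Drop posts with zero comments, then randomly sample the rest."""
    with_comments = [row for row in rows if int(row[4]) > 0]
    random.seed(seed)  # fixed seed so the sample is reproducible
    return random.sample(with_comments, min(sample_size, len(with_comments)))
```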
Specifically, this analysis is focused on two post categories, "Ask HN" and "Show HN". "Ask HN" posts are used to ask the Hacker News community questions, whereas "Show HN" posts are submitted to show the Hacker News community something of interest, e.g. a project or product. Ultimately, the goal of this analysis is to discover which post type receives more comments, and whether there is a "golden window" of time in which to post to receive more comments.
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
headers = hn[0]
hn = hn[1:]
print(headers)
print('***') # separate header from hn
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
***
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Above, the header row was separated out of the initial data set.
Below, the posts are sorted into three lists: "Ask HN" posts, "Show HN" posts, and everything else. This serves as a means to separate the two focus categories and filter out everything else, and the printed counts also give a little insight into the total number of posts of each type.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Total Ask HN Posts: ', len(ask_posts))
print('Total Show HN Posts: ', len(show_posts))
print('Total Other Posts: ', len(other_posts))
Total Ask HN Posts:  1744
Total Show HN Posts:  1162
Total Other Posts:  17194
The next step in the process is to determine the total number of comments for each of the two categories being analysed, and then calculate the average comments per post for each category.
total_ask_comments = 0

for row in ask_posts:
    ask_comments = row[4]
    total_ask_comments = total_ask_comments + int(ask_comments)

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average Ask HN Comments: ', round(avg_ask_comments, 2))
total_show_comments = 0

for row in show_posts:
    show_comments = row[4]
    total_show_comments = total_show_comments + int(show_comments)

avg_show_comments = total_show_comments / len(show_posts)
print('Average Show HN Comments: ', round(avg_show_comments, 2))
Average Ask HN Comments:  14.04
Average Show HN Comments:  10.32
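Since the two averaging loops above differ only in the list they iterate over, the same result could be obtained with a single reusable helper. This is a sketch of an equivalent refactoring, not part of the original analysis; the name `avg_comments` is introduced here.

```python
def avg_comments(posts, comment_index=4):
    """Return the mean comment count for a list of posts.

    Each post is a row whose comment count sits at comment_index
    (index 4, num_comments, in this data set) as a string.
    """
    total = sum(int(row[comment_index]) for row in posts)
    return total / len(posts)
```

With this helper, both averages reduce to `round(avg_comments(ask_posts), 2)` and `round(avg_comments(show_posts), 2)`.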
From the above, it is seen that "Ask HN" posts receive roughly four more comments per post. This might be attributed to more users commenting on various ways to perform the task being asked about, or to the question itself creating more dialog around solving the issue. Questions asking for clarification of the problem would also count as comments.
This answers the first part of the analysis. Now that it has been shown that "Ask HN" posts receive more comments, the focus of the analysis will shift solely to this category.
To identify whether there is a certain creation time at which a post is likely to attract the most comments, two more calculations must be completed: the number of "Ask HN" posts created in each hour of the day, and the total number of comments those posts received.
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
The above created two dictionaries, "counts_by_hour" and "comments_by_hour", which hold the per-hour post and comment totals respectively. Next, the average comments per post for each hour will be calculated.
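As a brief aside, the same per-hour tallies could also be built with `collections.defaultdict` from the standard library, which removes the explicit membership check. This is an equivalent sketch, not part of the original analysis; the function name `tally_by_hour` is introduced here.

```python
from collections import defaultdict
import datetime as dt

def tally_by_hour(results):
    """Tally posts and comments per hour from [created_at, n_comments] pairs."""
    counts = defaultdict(int)    # missing hours start at 0 automatically
    comments = defaultdict(int)
    for created_at, n_comments in results:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)
```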
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)
[['01', 11.383333333333333], ['15', 38.5948275862069], ['07', 7.852941176470588], ['13', 14.741176470588234], ['03', 7.796296296296297], ['22', 6.746478873239437], ['14', 13.233644859813085], ['08', 10.25], ['05', 10.08695652173913], ['02', 23.810344827586206], ['09', 5.5777777777777775], ['04', 7.170212765957447], ['10', 13.440677966101696], ['20', 21.525], ['16', 16.796296296296298], ['17', 11.46], ['00', 8.127272727272727], ['06', 9.022727272727273], ['12', 9.41095890410959], ['23', 7.985294117647059], ['21', 16.009174311926607], ['18', 13.20183486238532], ['11', 11.051724137931034], ['19', 10.8]]
The data is now calculated, but the formatting leaves a little to be desired. The final step is to display the data in a manner that is easier to read.
swap_avg_by_hour = []

for row in avg_by_hour:
    first_element = row[1]
    second_element = row[0]
    swap_avg_by_hour.append([first_element, second_element])

print(swap_avg_by_hour)
print('\n')

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')

date_format = '%H'

for element in sorted_swap[:5]:
    avg = "{:.2f}".format(element[0])
    hours = element[1]
    hour = dt.datetime.strptime(hours, date_format).strftime('%H')
    print("{time}:00 EST : {value} average comments per post".format(time=hour, value=avg))
[[11.383333333333333, '01'], [38.5948275862069, '15'], [7.852941176470588, '07'], [14.741176470588234, '13'], [7.796296296296297, '03'], [6.746478873239437, '22'], [13.233644859813085, '14'], [10.25, '08'], [10.08695652173913, '05'], [23.810344827586206, '02'], [5.5777777777777775, '09'], [7.170212765957447, '04'], [13.440677966101696, '10'], [21.525, '20'], [16.796296296296298, '16'], [11.46, '17'], [8.127272727272727, '00'], [9.022727272727273, '06'], [9.41095890410959, '12'], [7.985294117647059, '23'], [16.009174311926607, '21'], [13.20183486238532, '18'], [11.051724137931034, '11'], [10.8, '19']]

Top 5 Hours for Ask Posts Comments
15:00 EST : 38.59 average comments per post
02:00 EST : 23.81 average comments per post
20:00 EST : 21.52 average comments per post
16:00 EST : 16.80 average comments per post
21:00 EST : 16.01 average comments per post
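The swap-then-sort approach works because sorted() compares the first element of each pair; the same ordering can also be obtained without the intermediate swapped list by passing a key function. This is a sketch of an equivalent alternative, not the approach used above; the name `top_hours` is introduced here.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest averages."""
    # Sort by the average (index 1) rather than the hour, highest first.
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]
```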
Per the above, the best hours to post (in EST) for increased chances of comments are 15:00, 02:00, 20:00, 16:00, and 21:00.
To convert to CST (Central Standard Time) which is the time zone this report is being written in, one would add a final step.
print('Top 5 Hours for Ask Posts Comments')

date_format = '%H'

for element in sorted_swap[:5]:
    avg = "{:.2f}".format(element[0])
    hours = element[1]
    hour = dt.datetime.strptime(hours, date_format)
    hour_cst = hour - dt.timedelta(hours=1)
    time = hour_cst.strftime('%H')
    print("{time}:00 CST : {value} average comments per post".format(time=time, value=avg))
Top 5 Hours for Ask Posts Comments
14:00 CST : 38.59 average comments per post
01:00 CST : 23.81 average comments per post
19:00 CST : 21.52 average comments per post
15:00 CST : 16.80 average comments per post
20:00 CST : 16.01 average comments per post
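Subtracting a fixed hour works for an EST-to-CST shift, but Python 3.9+'s `zoneinfo` module can handle arbitrary zone conversions (including daylight saving) directly. This is a sketch of an alternative, assuming the source hours are US/Eastern; the function name `convert_hour` and the arbitrary anchor date are assumptions introduced here.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def convert_hour(hour_str, src="US/Eastern", dst="US/Central"):
    """Interpret an 'HH' string as a wall-clock hour in src, express it in dst."""
    # An arbitrary winter date anchors the hour so the zone offset is defined.
    naive = datetime.strptime(hour_str, "%H").replace(year=2016, month=1, day=1)
    aware = naive.replace(tzinfo=ZoneInfo(src))
    return aware.astimezone(ZoneInfo(dst)).strftime("%H")
```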
Converting to CST moves each optimum time back by one hour, giving revised optimum posting times of 14:00, 01:00, 19:00, 15:00, and 20:00 CST.