Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
You can find the data set here. The data set has already been initially cleaned by removing all rows with zero comments, and then pulling a random sampling of the remaining posts. This had the effect of bringing the data set from nearly 300,000 rows to 20,000 rows. Examples of this data are listed below.
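As a hypothetical sketch of the cleaning step described above (it is not part of this analysis, which starts from the already-cleaned file), one could drop rows with zero comments and then take a random sample of the remainder. The function name `clean_posts` and the fixed seed are illustrative assumptions; column index 4 (`num_comments`) matches the data set used here.

```python
import random

def clean_posts(rows, sample_size=20000, seed=1):
    """Drop posts with zero comments, then randomly sample the rest."""
    with_comments = [row for row in rows if int(row[4]) > 0]
    random.seed(seed)  # fixed seed so the sample is reproducible
    return random.sample(with_comments, min(sample_size, len(with_comments)))
```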
Specifically, this analysis is focused on two post categories, "Ask HN" and "Show HN". "Ask HN" posts are used to ask the Hacker News community questions, whereas "Show HN" posts are submitted to show the Hacker News community something of interest, e.g. a project or product. Ultimately, the goal of this analysis is to discover which post type receives more comments, and whether there is a "golden window" of time in which to post to receive more comments.
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
headers = hn[0]
hn = hn[1:]
print(headers)
print('***') # separate header from hn
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
***
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Above, the header row was separated out of the initial data set.
Below, the posts are sorted into three lists: "Ask HN" posts, "Show HN" posts, and everything else. This serves as a means to separate the two focus categories and filter out everything else, and the printed counts also give a little insight into the total number of posts of each type.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Total Ask HN Posts: ', len(ask_posts))
print('Total Show HN Posts: ', len(show_posts))
print('Total Other Posts: ', len(other_posts))
Total Ask HN Posts:  1744
Total Show HN Posts:  1162
Total Other Posts:  17194
The next step in the process is to determine the total number of comments for each of the two categories being analysed, and then calculate the average comments per post for each category.
total_ask_comments = 0

for row in ask_posts:
    ask_comments = row[4]
    total_ask_comments = total_ask_comments + int(ask_comments)

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average Ask HN Comments: ', round(avg_ask_comments, 2))
total_show_comments = 0

for row in show_posts:
    show_comments = row[4]
    total_show_comments = total_show_comments + int(show_comments)

avg_show_comments = total_show_comments / len(show_posts)
print('Average Show HN Comments: ', round(avg_show_comments, 2))
Average Ask HN Comments:  14.04
Average Show HN Comments:  10.32
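Since the two averaging loops above differ only in the list they iterate over, the same result could be obtained with a single reusable helper. This is a sketch of an equivalent refactoring, not part of the original analysis; the name `avg_comments` is introduced here.

```python
def avg_comments(posts, comment_index=4):
    """Return the mean comment count for a list of posts.

    Each post is a row whose comment count sits at comment_index
    (index 4, num_comments, in this data set) as a string.
    """
    total = sum(int(row[comment_index]) for row in posts)
    return total / len(posts)
```

With this helper, both averages reduce to `round(avg_comments(ask_posts), 2)` and `round(avg_comments(show_posts), 2)`.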
From the above, it is seen that "Ask HN" posts receive roughly four more comments per post. This might be attributed to more users commenting on various ways to perform the task being asked about, or to the question itself creating more dialog around solving the issue. Questions asking for clarification of the problem would also count as comments.
This answers the first part of the analysis. Now that it has been shown that "Ask HN" posts receive more comments, the focus of the analysis will shift solely to this category.
To identify whether there is a certain creation time at which a post is likely to attract the most comments, two more calculations must be completed: the number of "Ask HN" posts created in each hour of the day, and the total number of comments those posts received.
import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    hour = dt.datetime.strptime(date, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
The above created two dictionaries, "counts_by_hour" and "comments_by_hour", which hold the per-hour post and comment totals respectively. Next, the average comments per post for each hour will be calculated.
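As a brief aside, the same per-hour tallies could also be built with `collections.defaultdict` from the standard library, which removes the explicit membership check. This is an equivalent sketch, not part of the original analysis; the function name `tally_by_hour` is introduced here.

```python
from collections import defaultdict
import datetime as dt

def tally_by_hour(results):
    """Tally posts and comments per hour from [created_at, n_comments] pairs."""
    counts = defaultdict(int)    # missing hours start at 0 automatically
    comments = defaultdict(int)
    for created_at, n_comments in results:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)
```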
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

print(avg_by_hour)
[['01', 11.383333333333333], ['15', 38.5948275862069], ['07', 7.852941176470588], ['13', 14.741176470588234], ['03', 7.796296296296297], ['22', 6.746478873239437], ['14', 13.233644859813085], ['08', 10.25], ['05', 10.08695652173913], ['02', 23.810344827586206], ['09', 5.5777777777777775], ['04', 7.170212765957447], ['10', 13.440677966101696], ['20', 21.525], ['16', 16.796296296296298], ['17', 11.46], ['00', 8.127272727272727], ['06', 9.022727272727273], ['12', 9.41095890410959], ['23', 7.985294117647059], ['21', 16.009174311926607], ['18', 13.20183486238532], ['11', 11.051724137931034], ['19', 10.8]]
The data is now calculated, but the formatting leaves a little to be desired. The final step is to display the data in a manner that is easier to read.
swap_avg_by_hour = []

for row in avg_by_hour:
    first_element = row[1]
    second_element = row[0]
    swap_avg_by_hour.append([first_element, second_element])

print(swap_avg_by_hour)
print('\n')

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')

date_format = '%H'

for element in sorted_swap[:5]:
    avg = "{:.2f}".format(element[0])
    hours = element[1]
    hour = dt.datetime.strptime(hours, date_format).strftime('%H')
    print("{time}:00 EST : {value} average comments per post".format(time=hour, value=avg))
[[11.383333333333333, '01'], [38.5948275862069, '15'], [7.852941176470588, '07'], [14.741176470588234, '13'], [7.796296296296297, '03'], [6.746478873239437, '22'], [13.233644859813085, '14'], [10.25, '08'], [10.08695652173913, '05'], [23.810344827586206, '02'], [5.5777777777777775, '09'], [7.170212765957447, '04'], [13.440677966101696, '10'], [21.525, '20'], [16.796296296296298, '16'], [11.46, '17'], [8.127272727272727, '00'], [9.022727272727273, '06'], [9.41095890410959, '12'], [7.985294117647059, '23'], [16.009174311926607, '21'], [13.20183486238532, '18'], [11.051724137931034, '11'], [10.8, '19']]

Top 5 Hours for Ask Posts Comments
15:00 EST : 38.59 average comments per post
02:00 EST : 23.81 average comments per post
20:00 EST : 21.52 average comments per post
16:00 EST : 16.80 average comments per post
21:00 EST : 16.01 average comments per post
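The swap-then-sort approach works because sorted() compares the first element of each pair; the same ordering can also be obtained without the intermediate swapped list by passing a key function. This is a sketch of an equivalent alternative, not the approach used above; the name `top_hours` is introduced here.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest averages."""
    # Sort by the average (index 1) rather than the hour, highest first.
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]
```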
Per the above, the best hours to post (in EST) for increased chances of comments are 15:00, 02:00, 20:00, 16:00, and 21:00.
To convert to CST (Central Standard Time) which is the time zone this report is being written in, one would add a final step.
print('Top 5 Hours for Ask Posts Comments')

date_format = '%H'

for element in sorted_swap[:5]:
    avg = "{:.2f}".format(element[0])
    hours = element[1]
    hour = dt.datetime.strptime(hours, date_format)
    hour_cst = hour - dt.timedelta(hours=1)
    time = hour_cst.strftime('%H')
    print("{time}:00 CST : {value} average comments per post".format(time=time, value=avg))
Top 5 Hours for Ask Posts Comments
14:00 CST : 38.59 average comments per post
01:00 CST : 23.81 average comments per post
19:00 CST : 21.52 average comments per post
15:00 CST : 16.80 average comments per post
20:00 CST : 16.01 average comments per post
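Subtracting a fixed hour works for an EST-to-CST shift, but Python 3.9+'s `zoneinfo` module can handle arbitrary zone conversions (including daylight saving) directly. This is a sketch of an alternative, assuming the source hours are US/Eastern; the function name `convert_hour` and the arbitrary anchor date are assumptions introduced here.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def convert_hour(hour_str, src="US/Eastern", dst="US/Central"):
    """Interpret an 'HH' string as a wall-clock hour in src, express it in dst."""
    # An arbitrary winter date anchors the hour so the zone offset is defined.
    naive = datetime.strptime(hour_str, "%H").replace(year=2016, month=1, day=1)
    aware = naive.replace(tzinfo=ZoneInfo(src))
    return aware.astimezone(ZoneInfo(dst)).strftime("%H")
```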
Converting to CST moves each optimum time back by one hour, giving revised optimum posting times of 14:00, 01:00, 19:00, 15:00, and 20:00 CST.