Introduction

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set here. Note that we use the full file of almost 300,000 rows, which still includes submissions that didn't receive any comments. Below are descriptions of the columns:

id: the unique identifier from Hacker News for the post
title: the title of the post
url: the URL that the post links to, if the post has a URL
num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time of the post's submission

We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Data cleaning

First, we'll import the data and check its structure:

In [1]:

from csv import reader

# Read the whole CSV into a list of lists; the with block closes the file afterwards.
with open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8') as opened_file:
    hn = list(reader(opened_file))

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]

Then, we'll assign the header row to a separate variable and remove it from the data, so that the list of lists contains only posts:

In [2]:

headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
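The later cells read fields by position (`post[1]`, `post[4]`, `post[6]`). As a minimal sketch, those positions can be derived from the header row instead of being hard-coded. The names `col` and `sample_row` are hypothetical and used only for illustration; the header and the sample row are copied from the output above.

```python
# Header row and one data row, copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

# Map each column name to its position (hypothetical helper, not part of the notebook).
col = {name: index for index, name in enumerate(headers)}

print(col['title'], col['num_comments'], col['created_at'])  # 1 4 6
print(sample_row[col['author']])  # blacksqr
```

Looking up `col['num_comments']` gives the same index 4 that the analysis cells use directly.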

Next, we'll separate posts beginning with Ask HN and Show HN (and case variations) into two different lists, and all other posts into a third list.

In [3]:

ask_posts = []
show_posts = []
other_posts = []

for post in hn:
    title = post[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(post)
    elif title.lower().startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

print('There are', len(ask_posts), 'ask HN posts')
print('There are', len(show_posts), 'show HN posts')
print('There are', len(other_posts), 'other posts')

There are 9139 ask HN posts
There are 10158 show HN posts
There are 273822 other posts
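The separation above relies on lowercasing the title before testing its prefix. The sketch below isolates that check in a small function so it can be tried on individual titles. `classify_title` is a hypothetical name, and the upper-case "SHOW HN" title is a made-up case variation of one of the examples from the introduction.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Two example titles from this notebook plus one made-up case variation.
examples = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',
    'algorithmic music',
]

for title in examples:
    print(classify_title(title), '<-', title)
```

Because the comparison is done on the lowercased title, "Ask HN", "ask hn", and "ASK HN" all fall into the same group.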

Data analysis

Next, let's determine if ask posts or show posts receive more comments on average.

In [4]:

total_ask_comments = 0

for post in ask_posts:
    comments = int(post[4])
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)


total_show_comments = 0

for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)

print('The average number of comments in ask HN posts is', round(avg_ask_comments,1))
print('The average number of comments in show HN posts is', round(avg_show_comments,1))

The average number of comments in ask HN posts is 10.4
The average number of comments in show HN posts is 4.9

Therefore, we can conclude that, on average, ask HN posts receive more comments than show HN posts (10.4 vs 4.9).
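The two loops above perform the same calculation on different lists. As a sketch, the shared logic could be moved into one function. `average_comments` and `toy_posts` are hypothetical, and the comment counts in the toy rows are invented purely to make the arithmetic easy to check.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Invented rows in the notebook's column order; only index 4 (num_comments) matters here.
toy_posts = [
    ['1', 'Ask HN: first', '', '3', '6', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: second', '', '1', '3', 'user_b', '9/26/2016 3:24'],
    ['3', 'Ask HN: third', '', '1', '0', 'user_c', '9/26/2016 3:19'],
]

print(round(average_comments(toy_posts), 1))  # 3.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.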

Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

In [5]:

import datetime as dt

result_list = []

for post in ask_posts:
    comments = int(post[4])
    created = post[6]
    result_list.append([created, comments])

    
counts_by_hour = {}
comments_by_hour = {}

for item in result_list:
    created = dt.datetime.strptime(item[0], '%m/%d/%Y %H:%M')
    hour_created = created.hour
    comments = int(item[1])
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments

print('Posts count per hour:', counts_by_hour)
print('Comments count per hour:', comments_by_hour)


avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour], 1)])
    
print('Average number of comments by hour:', avg_by_hour)

Posts count per hour: {2: 269, 1: 282, 22: 383, 21: 518, 19: 552, 17: 587, 15: 646, 14: 513, 13: 444, 11: 312, 10: 282, 9: 222, 7: 226, 3: 271, 23: 343, 20: 510, 16: 579, 8: 257, 0: 301, 18: 614, 12: 342, 4: 243, 6: 234, 5: 209}
Comments count per hour: {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18525, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 23: 2297, 20: 4462, 16: 4466, 8: 2362, 0: 2277, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1838}
Average number of comments by hour: [[2, 11.1], [1, 7.4], [22, 8.8], [21, 8.7], [19, 7.2], [17, 9.4], [15, 28.7], [14, 9.7], [13, 16.3], [11, 9.0], [10, 10.7], [9, 6.7], [7, 7.0], [3, 7.9], [23, 6.7], [20, 8.7], [16, 7.7], [8, 9.2], [0, 7.6], [18, 7.9], [12, 12.4], [4, 9.7], [6, 6.8], [5, 8.8]]
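The cell above parses each `created_at` string with `strptime` and accumulates counts and comment totals per hour. The sketch below repeats those steps on three rows, using `collections.defaultdict` so that no explicit check for a missing key is needed. The timestamps follow the data set's format, but the comment counts are invented for illustration.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs; the comment counts are made up.
sample = [
    ['9/26/2016 3:26', 4],
    ['9/26/2016 3:24', 2],
    ['9/25/2016 15:05', 10],
]

counts = defaultdict(int)   # number of posts per hour
totals = defaultdict(int)   # number of comments per hour

for created, comments in sample:
    # '%m/%d/%Y %H:%M' matches strings such as '9/26/2016 3:26'.
    hour = dt.datetime.strptime(created, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    totals[hour] += comments

averages = {hour: totals[hour] / counts[hour] for hour in counts}
print(averages)  # {3: 3.0, 15: 10.0}
```

The two posts created during hour 3 share one bucket, so their six comments average to 3.0.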

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [6]:

swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    time_hour = dt.datetime.strptime(str(row[1]), '%H')
    format_hour = time_hour.strftime("%H:%M")
    avg_comments = row[0]
    print('{hour}: {comments} average comments per post'.format(hour = format_hour, comments = avg_comments))

Top 5 Hours for Ask Posts Comments
15:00: 28.7 average comments per post
13:00: 16.3 average comments per post
12:00: 12.4 average comments per post
02:00: 11.1 average comments per post
10:00: 10.7 average comments per post
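Swapping the columns is one way to sort by the average. An alternative sketch, shown here on a subset of the values printed earlier, passes a `key` function to `sorted` so the rows keep their `[hour, average]` order. The variable `top_hours` is a hypothetical name.

```python
# A subset of the [hour, average] pairs from the output above.
avg_by_hour = [[2, 11.1], [15, 28.7], [13, 16.3], [12, 12.4], [10, 10.7], [9, 6.7]]

# Sort by the second element (the average), highest first, and keep five rows.
top_hours = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)[:5]

for hour, avg_comments in top_hours:
    print('{:02d}:00: {} average comments per post'.format(hour, avg_comments))
```

One difference to be aware of: when two hours have the same average, the swapped-list approach orders them by hour, while a `key`-based sort keeps them in their original order.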

Conclusion

Therefore, we can conclude that, for an ask HN post, the best time of day to post in order to attract comments is 15:00 US Eastern Time, with an average of 28.7 comments per post. As we live in Spain, this corresponds to 21:00 in mainland Spain for most of the year, since Spain is normally six hours ahead of US Eastern Time.
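The time-zone conversion can be checked with a short calculation. This sketch assumes mainland Spain and fixed offsets valid for late September (EDT is UTC-4, CEST is UTC+2). The date is the last day covered by the data file's name and is chosen only as an example; other dates may need different offsets because of daylight saving changes.

```python
import datetime as dt

# Assumed fixed offsets for late September: EDT = UTC-4, CEST = UTC+2.
eastern = dt.timezone(dt.timedelta(hours=-4), 'EDT')
madrid = dt.timezone(dt.timedelta(hours=2), 'CEST')

# 15:00 US Eastern Time on an example date.
post_time = dt.datetime(2016, 9, 26, 15, 0, tzinfo=eastern)

# Convert the same instant to mainland Spanish time.
local_time = post_time.astimezone(madrid)
print(local_time.strftime('%H:%M'))  # 21:00
```

For dates across the whole year, `zoneinfo.ZoneInfo('America/New_York')` and `zoneinfo.ZoneInfo('Europe/Madrid')` (Python 3.9+) would handle the daylight saving transitions automatically.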