In the following project, we'll be working with a dataset of submissions to the popular technology site Hacker News.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
You can find the dataset here. The full dataset contains almost 300,000 rows; we will reduce it to approximately 80,000 rows by removing all submissions that didn't receive any comments. Below are the descriptions of the columns:

- id: the unique identifier of the post
- title: the title of the post
- url: the URL the post links to, if it has one
- num_points: the number of points the post acquired
- num_comments: the number of comments on the post
- author: the username of the person who submitted the post
- created_at: the date and time of the post's submission
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Below are a few examples:

- Ask HN: How do you pass on your work when you die?
- Show HN: Learn Japanese Vocab via multiple choice questions
We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
We now import the dataset and display the first five rows.
from csv import reader

opened_file = open('C:/Users/Benson/my_datasets/Hacker_News_20160926.csv', encoding="utf8")
read_file = reader(opened_file)
hn = list(read_file)

# Separate the header row from the data
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
From the above, we have separated the header row from the dataset and stored it in the variable headers.
Since we are only interested in posts that begin with either Ask HN or Show HN, we will create new lists of lists containing just the data for such titles. For Ask HN, we will also include submissions that start with [Ask HN; there were two such rows in the dataset.
To do this, we will use the string method startswith. Since startswith is case sensitive, we'll also use the lower method to normalize the capitalization of each title before checking it.
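As a quick illustration of why the lower call matters (the title below is a hypothetical example, not a row from the dataset):

```python
# startswith is case sensitive, so "ASK HN" would slip past a check for "ask hn"
title = "ASK HN: Is this a good example?"

print(title.startswith("ask hn"))          # False: the case doesn't match
print(title.lower().startswith("ask hn"))  # True: lowercasing first normalizes the title
```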
print("Original Dataset:", len(hn))

# Filter out submissions with no comments
hn_with_comments = []
for row in hn:
    num_comments = int(row[4])
    if num_comments != 0:
        hn_with_comments.append(row)

hn = hn_with_comments
print("Commented submissions:", len(hn))
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn") or title.startswith("[ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Ask HN:", len(ask_posts))
print("Show HN:", len(show_posts))
print("Other Posts:", len(other_posts))
Original Dataset: 293119
Commented submissions: 80401
Ask HN: 6913
Show HN: 5059
Other Posts: 68429
Below are the first five rows of ask_posts and show_posts:
print(ask_posts[:5])
print('\n')
print(show_posts[:5])
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'], ['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']]

[['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06'], ['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50'], ['12575471', 'Show HN: Project-Okot: Novel, CODE-FREE data-apps in mere seconds', 'https://studio.nuchwezi.com/', '3', '1', 'nfixx', '9/25/2016 14:30'], ['12574773', 'Show HN: Cursor that Screenshot', 'http://edward.codes/cursor-that-screenshot', '3', '3', 'ed-bit', '9/25/2016 10:50']]
At this point, we have all commented Ask HN posts in the ask_posts list and all commented Show HN posts in the show_posts list. We can now determine whether ask posts or show posts receive more comments on average.
# Average number of comments on ask posts
total_ask_comments = 0
for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on ask posts:", avg_ask_comments)

# Average number of comments on show posts
total_show_comments = 0
for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments

avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on show posts:", avg_show_comments)

Average number of comments on ask posts: 13.740778243888327
Average number of comments on show posts: 9.810832180272781
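The two loops above follow the same pattern, so they could be factored into a small helper. This is just a sketch, reusing the same column index (4) for num_comments and a hypothetical two-row sample to show the arithmetic:

```python
def avg_comments(posts):
    """Average of the num_comments column (index 4) across a list of rows."""
    total = sum(int(row[4]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the dataset's layout: (7 + 3) / 2 = 5.0
sample = [
    ['id1', 'Ask HN: example one', '', '4', '7', 'user1', '9/26/2016 2:53'],
    ['id2', 'Ask HN: example two', '', '6', '3', 'user2', '9/26/2016 1:17'],
]
print(avg_comments(sample))  # 5.0
```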
From the above result, Ask HN posts received about 14 comments on average while Show HN posts received about 10. Hence, we see that Ask HN posts are more likely to receive comments than Show HN posts.
Since ask posts are more likely to receive comments, we'll focus the remaining analysis on them.
We'll now determine whether ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments those posts received.
- Calculate the average number of comments ask posts receive for each hour in which they were created.
We will use the datetime module to work with the data in the created_at column.
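The created_at strings follow a month/day/year hour:minute pattern, so strptime with the format string "%m/%d/%Y %H:%M" parses them. A quick check on one timestamp taken from the dataset:

```python
import datetime as dt

# Parse one created_at value from the dataset into a datetime object
date = dt.datetime.strptime("9/26/2016 3:26", "%m/%d/%Y %H:%M")

print(date.hour)    # 3
print(date.minute)  # 26
```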
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result_list.append([created_at, n_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.hour
    n_comments = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments

print("Number of posts per each hour of the day:", counts_by_hour)
print("\n")
print("Number of comments per each hour of the day:", comments_by_hour)
# Calculate the average number of comments per hour
avg_by_hour = []
for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

print("\n")
print("Average number of comments per hour:", avg_by_hour)
Number of posts per each hour of the day: {2: 227, 1: 223, 22: 287, 21: 407, 19: 420, 17: 404, 15: 468, 14: 378, 13: 326, 11: 251, 10: 219, 9: 176, 7: 157, 3: 212, 16: 415, 8: 190, 0: 231, 23: 276, 20: 392, 18: 452, 12: 274, 4: 186, 6: 176, 5: 166}

Number of comments per each hour of the day: {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18526, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 16: 4466, 8: 2362, 0: 2277, 23: 2297, 20: 4462, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1841}

Average number of comments per hour: [[2, 13.198237885462555], [1, 9.367713004484305], [22, 11.749128919860627], [21, 11.056511056511056], [19, 9.414285714285715], [17, 13.73019801980198], [15, 39.585470085470085], [14, 13.153439153439153], [13, 22.2239263803681], [11, 11.143426294820717], [10, 13.757990867579908], [9, 8.392045454545455], [7, 10.095541401273886], [3, 10.160377358490566], [16, 10.76144578313253], [8, 12.43157894736842], [0, 9.857142857142858], [23, 8.322463768115941], [20, 11.38265306122449], [18, 10.789823008849558], [12, 15.452554744525548], [4, 12.688172043010752], [6, 9.017045454545455], [5, 11.090361445783133]]
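The if/else frequency-table pattern above can also be written with dict.get, which supplies a default value when a key is missing. A minimal sketch on hypothetical (hour, n_comments) pairs:

```python
counts = {}
comments = {}

# Hypothetical (hour, n_comments) pairs standing in for the parsed rows
pairs = [(15, 10), (15, 4), (13, 7)]

for hour, n_comments in pairs:
    # get(hour, 0) returns 0 the first time an hour is seen
    counts[hour] = counts.get(hour, 0) + 1
    comments[hour] = comments.get(hour, 0) + n_comments

print(counts)    # {15: 2, 13: 1}
print(comments)  # {15: 14, 13: 7}
```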
Finally, we sort the average number of comments by hour. We'll create a list equal to avg_by_hour with its columns swapped and store it in a variable named swap_avg_by_hour.
# Create swapped list
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])

# Sort by average number of comments, highest first
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print the top 5 hours for ask post comments
print("Top 5 Hours for Ask Post Comments")
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(str(row[1]), "%H")
    hour = hour.strftime("%H:%M")
    print("{0}: {1:.2f} average comments per post".format(hour, row[0]))
Top 5 Hours for Ask Post Comments
15:00: 39.59 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
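Swapping the columns is one way to make sorted order by the average; the same result can be had without a swap by passing a key function. A sketch using hypothetical rows in avg_by_hour's [hour, avg] layout:

```python
# Hypothetical [hour, avg] rows in avg_by_hour's layout
avg_by_hour_sample = [[15, 39.59], [2, 13.20], [13, 22.22]]

# Sort by the second column (the average), highest first, without swapping
top = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

print(top)  # [[15, 39.59], [13, 22.22], [2, 13.2]]
```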
From the above analysis, we saw that Ask HN posts are more likely to receive comments than Show HN posts on the Hacker News site. More specifically, Ask HN posts get about 14 comments on average while Show HN posts get about 10.
As for the time of day, ask posts created at 15:00 are the most likely to attract comments: they receive an average of 39.59 comments, considerably higher than any other hour of the day.