Hacker News Analysis - Popular Type of Posts

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.

We'll compare Ask HN and Show HN posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Below are the first five rows of our dataset.

In [1]:

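from csv import reader

# Read the whole CSV into a list of lists; the with block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Separate the header row from the data rows
headers = hn[:1]
hn = hn[1:]
print(headers)
print(hn[:5])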

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

To make the data easier to filter, we separated the header row from the dataset and stored it in a variable named headers.

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Since startswith is case sensitive, we'll first convert each title to lowercase with the lower method so that differences in capitalization don't affect the match.
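
As a small illustration of why the title is lowercased first, the sketch below checks one title both ways. The title string is hypothetical and not taken from the dataset.

```python
# Hypothetical title, used only for illustration
title = "Ask HN: How do you learn a new language?"

# startswith is case sensitive, so a lowercase prefix does not match the raw title
matches_raw = title.startswith("ask hn")

# Lowercasing the title first makes the check case insensitive
matches_lowered = title.lower().startswith("ask hn")

print(matches_raw, matches_lowered)
```

Only the lowered check matches, which is why the loop in the next cell calls lower before startswith.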

In [2]:

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194

Do 'Ask HN' or 'Show HN' receive more comments on average?

We have Ask HN posts in the list of lists named ask_posts, and Show HN posts in show_posts. Now, let's determine whether ask posts or show posts receive more comments on average.

In [3]:

#Average number of comments on ask posts
total_ask_comments = 0

for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on ask posts:", avg_ask_comments)

#Average number of comments on show posts
total_show_comments = 0

for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments

avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on show posts:", avg_show_comments)

Average number of comments on ask posts: 14.038417431192661
Average number of comments on show posts: 10.31669535283993

According to the analysis, Ask HN posts receive more comments on average than Show HN posts: about 14 comments per post, compared with about 10.3 for Show HN posts.

Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.
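
The two loops in the cell above differ only in the list they iterate over, so they could be folded into a single helper. The following is a minimal sketch rather than the notebook's own code: the function name average_comments, its comments_index parameter, and the sample rows are all hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows.

    Each row is a list of strings laid out like the dataset rows,
    with num_comments at position comments_index.
    """
    if not posts:
        return 0.0  # assumption: treat an empty list as zero rather than raising
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the dataset
sample_posts = [
    ["1", "Ask HN: Example A", "", "5", "6", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: Example B", "", "3", "10", "user_b", "8/4/2016 15:10"],
]
print(average_comments(sample_posts))
```

With a helper like this, the comparison reduces to calling it once with ask_posts and once with show_posts.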

Do posts created at a certain time receive more comments on average?

Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

We'll use the datetime module to work with the data in the created_at column.
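
Before running the full loop, it may help to see how a single created_at value is parsed. This sketch uses the timestamp from the first dataset row printed earlier; the variable names are chosen here for illustration.

```python
import datetime as dt

# A created_at value copied from the first row of the dataset shown earlier
created_at = "8/4/2016 11:52"

# %m/%d/%Y matches month/day/four-digit year; %H:%M matches 24-hour time
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)
```

The hour attribute of the parsed datetime is what the analysis groups on.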

In [4]:

import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result = [created_at, n_comments]
    result_list.append(result)

    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.hour
    n_comments = row[1]
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments

print("Number of posts per each hour of the day:", counts_by_hour)
print("Number of comments per each hour of the day:", comments_by_hour)

#Calculate the average number of comments per post

avg_by_hour = []

for item in counts_by_hour:
    avg = (comments_by_hour[item] / counts_by_hour[item])
    avg_by_hour.append([item, avg])

print(avg_by_hour)

Number of posts per each hour of the day: {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Number of comments per each hour of the day: {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
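
The if/else bookkeeping in the cell above can also be written with collections.defaultdict, which supplies a starting value of 0 for any hour not seen yet. This is an alternative sketch, not the approach used in the notebook, and the sample pairs are made up.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, n_comments] pairs shaped like result_list
sample_results = [
    ["8/4/2016 11:52", 6],
    ["9/30/2015 11:05", 10],
    ["6/23/2016 22:20", 4],
]

counts = defaultdict(int)    # hour -> number of posts
comments = defaultdict(int)  # hour -> total comments

for created_at, n_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += n_comments

# Average comments per post for each hour that appears in the data
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```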

To sort the average comments by each hour, we'll create a list that equals avg_by_hour with swapped columns and store it in a variable named swap_avg_by_hour.

In [5]:

# Create swapped list
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])

# Sort by average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print 'Top 5 hours for Ask Posts Comments'
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(str(row[1]), "%H")
    hour = hour.strftime("%H:%M")
    print("{0}: {1:.2f} average comments per post".format(hour, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
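
Swapping the columns is one way to sort on the average. As an alternative sketch, sorted also accepts a key function, which lets the [hour, average] pairs stay in their original order. The sample list holds a few pairs copied from the avg_by_hour output above; the names avg_by_hour_sample and sorted_by_avg are hypothetical.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above
avg_by_hour_sample = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort on the second element (the average), highest first
sorted_by_avg = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg:
    print("{0:02d}:00: {1:.2f} average comments per post".format(hour, avg))
```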

Conclusion

The analysis shows that Ask HN posts receive more comments on average than Show HN posts on Hacker News. Ask HN posts get about 14 comments on average, whereas Show HN posts get about 10.3.

Regarding the time of day, Ask HN posts created during the 15:00 hour receive the most comments. They get 38.59 comments per post on average, considerably more than any other hour of the day; the next highest is 02:00, with 23.81.