
Exploring Hacker News Posts¶

Introduction¶

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as 'posts') receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that we have reduced it from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

id: the unique identifier from Hacker News for the post
title: the title of the post
url: the URL that the post links to, if the post has a URL
num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time of the post's submission
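
Every field arrives from the CSV file as a string, so the numeric and date columns need converting before any arithmetic. As a minimal sketch, assuming the column order listed above and the month/day/year timestamp format seen in the data, the conversion could look like this. The `parse_row` helper and the `HEADERS` constant are hypothetical names for illustration and are not part of the notebook's own code.

```python
import datetime as dt

# Column order as described in the list above.
HEADERS = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

def parse_row(row):
    """Hypothetical helper: turn a raw CSV row (all strings) into typed values."""
    record = dict(zip(HEADERS, row))
    record['num_points'] = int(record['num_points'])
    record['num_comments'] = int(record['num_comments'])
    # Assumed timestamp format: month/day/year hour:minute
    record['created_at'] = dt.datetime.strptime(record['created_at'], '%m/%d/%Y %H:%M')
    return record

# One row taken from the data set.
sample = ['12224879', 'Interactive Dynamic Video',
          'http://www.interactivedynamicvideo.com/', '386', '52',
          'ne0phyte', '8/4/2016 11:52']
post = parse_row(sample)
print(post['num_comments'], post['created_at'].hour)
```

The analysis below works with positional indexes instead (for example, `row[4]` for num_comments), which is equivalent for this fixed column order.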

We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting.

We'll compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the dataset into a list of lists.

Open and read hacker_news.csv data

In [38]:

# Open and read the data
import csv

# The with-statement closes the file once reading is done
with open('hacker_news.csv') as opened_file:
    read_file = csv.reader(opened_file)
    # Read every row into a list of lists named hn
    hn = list(read_file)

# Show the first five rows
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]

Removing Headers from a List of Lists¶

In [39]:

headers = hn[0]  # Store the first row, which holds the column headers
hn = hn[1:]  # Drop the header row so hn contains only data rows
print(headers)  # Display the column headers
print(hn[:5])  # Display the first five data rows

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
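
An alternative to slicing is to consume the header row from the reader before building the list. The sketch below is only an illustration of that idiom: it assumes an in-memory stand-in for hacker_news.csv (built with `io.StringIO`) that contains the header and a single data row copied from the output above.

```python
import csv
import io

# Two lines standing in for hacker_news.csv; the data row comes from the output above.
sample_csv = io.StringIO(
    'id,title,url,num_points,num_comments,author,created_at\n'
    '12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,'
    '386,52,ne0phyte,8/4/2016 11:52\n'
)

reader = csv.reader(sample_csv)
headers = next(reader)  # Consume the first row as the header
rows = list(reader)     # Everything that remains is data

print(headers)
print(len(rows))
```

Both approaches leave the header in one variable and the data rows in another; `next()` simply avoids copying the list a second time.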

Extracting Ask HN and Show HN Posts¶

As explained at the beginning, we are only interested in posts with titles that begin with Ask HN or Show HN, so we separate the data into three lists:

1. ask_posts: posts with titles that begin with Ask HN
2. show_posts: posts with titles that begin with Show HN
3. other_posts: posts with titles that begin with neither Ask HN nor Show HN

In [40]:

# Create empty lists for the three categories
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]  # The title is the second column
    # Compare in lowercase so the match is case-insensitive
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)  # Title starts with "ask hn"
    elif title.lower().startswith('show hn'):
        show_posts.append(row)  # Title starts with "show hn"
    else:
        other_posts.append(row)  # Title starts with neither prefix
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
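
As a quick sanity check, the three counts should add up to the number of data rows, which here is 20,100 and matches the "approximately 20,000 rows" mentioned in the introduction. The short sketch below hard-codes the counts printed above and also reports each category's share of the total.

```python
# Counts printed by the cell above.
n_ask, n_show, n_other = 1744, 1162, 17194

total = n_ask + n_show + n_other
print(f'Total posts: {total}')

# Share of each category, formatted as a percentage.
for label, n in [('Ask HN', n_ask), ('Show HN', n_show), ('Other', n_other)]:
    print(f'{label}: {n} posts ({n / total:.1%})')
```

Ask HN and Show HN posts together make up a small fraction of the sample, so the averages that follow are based on a few thousand rows, not the full data set.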

Calculating the Average Number of Comments for Ask HN and Show HN Posts¶

In [41]:

# Average number of comments for Ask HN
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)

print(f'Average number of comments for Ask HN: {avg_ask_comments}')

# Average number of comments for Show HN
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)

print(f'Average number of comments for Show HN: {avg_show_comments}')

Average number of comments for Ask HN: 14.038417431192661
Average number of comments for Show HN: 10.31669535283993

Based on the data above, Ask HN posts receive more comments on average (about 14) than Show HN posts (about 10).
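
The two loops above differ only in the list they iterate over, so the calculation could be factored into a single function. This is a sketch and not the notebook's own code: `average_comments` is a hypothetical helper, and the rows below are made up purely to show the arithmetic.

```python
def average_comments(posts, comments_index=4):
    """Hypothetical helper: mean of the num_comments column over a list of rows."""
    if not posts:
        return 0.0  # Avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows for illustration; only the num_comments column (index 4) matters.
toy_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example two', '', '3', '20', 'user_b', '1/1/2016 10:00'],
]
print(average_comments(toy_posts))
```

Called as `average_comments(ask_posts)` and `average_comments(show_posts)`, it would perform the same calculation as the cell above.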

Finding the Number of Ask Posts and Comments by Hour Created¶

In [42]:

import datetime as dt
result_list = []

# Build a list of [created_at, num_comments] pairs, one per Ask HN post
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = [created_at, num_comments]
    result_list.append(result)
    
counts_by_hour = {}
comments_by_hour = {}

# Build dictionaries of post counts and total comments, keyed by hour
for element in result_list:
    date = dt.datetime.strptime(element[0], '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    num_comments = element[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
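
The if/else bookkeeping above can also be written with `collections.defaultdict`, which starts every missing key at zero. The sketch below illustrates the same grouping on a tiny list; the timestamps follow the data set's format, but the comment counts are made up.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs; the values are made up for illustration.
result_list = [
    ['8/4/2016 11:52', 6],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 4],
]

counts_by_hour = defaultdict(int)    # Missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in result_list:
    # Parse the timestamp, then keep only the zero-padded hour as a string
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

Here the two posts created during hour 11 are grouped under the key '11', with a combined total of 10 comments.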

Calculating the Average Number of Comments for Ask HN Posts by Hour¶

In [43]:

avg_by_hour = []

# Average number of comments by hour
for key in counts_by_hour:
    avg = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg])
    
print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]

Sorting and Printing Values from List of Lists¶

In [44]:

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
# Sort by descending value of average number of comments    
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], '%H')
    hour = dt.datetime.strftime(hour, '%H:%M')
    print('{} : {:.2f} average comments per post'.format(hour, row[0]))

Top 5 Hours for Ask Posts Comments
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
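
Swapping the columns is one way to sort by the average; another is to pass a `key` function to `sorted()` so the pairs keep their original [hour, average] order. The sketch below applies that to a handful of pairs copied from the output of the averaging cell, not to the full list.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
    ['16', 16.796296296296298],
    ['21', 16.009174311926607],
]

# Sort on the second element (the average), largest first.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:5]:
    print(f'{hour}:00 : {avg:.2f} average comments per post')
```

The ranking is the same as with the swapped list, and the intermediate `swap_avg_by_hour` list is no longer needed.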

Conclusion¶

On average, Ask HN posts receive more comments than Show HN posts (about 14 versus about 10).
For Ask HN posts, the five hours with the highest average number of comments per post are 15:00, 02:00, 20:00, 16:00, and 21:00.