Ask HN vs Show HN - Analyzing Hacker News Posts

Hacker News is a news site started by the startup incubator Y Combinator. It is a site where user-submitted stories and questions (known as "posts") are voted and commented upon, similar to Reddit.

Our goal for this project is to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Opening and Reading the Data Set

We will use this data set, which has approximately 20,000 rows.

Let's start by exploring our data set. First we will print the headers and first five rows.

In [1]:

from csv import reader

# Read the whole file into a list of rows; the with block closes the file afterwards
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]  # the first row holds the column names
hn = hn[1:]  # keep only the data rows
print(headers, '\n')

for row in hn[:5]:
    print(row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Below are descriptions of the columns:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
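
Every value read by csv.reader is a string, so the numeric and date columns need converting before any arithmetic. The following is a minimal sketch of that conversion, assuming the column order listed above. The helper name parse_row is hypothetical and not part of the notebook; the sample row is copied from the output above.

```python
import datetime as dt

def parse_row(row):
    """Convert one raw CSV row (a list of strings) into a dict with typed fields."""
    return {
        "id": row[0],
        "title": row[1],
        "url": row[2],
        "num_points": int(row[3]),
        "num_comments": int(row[4]),
        "author": row[5],
        # Dates in this data set look like "8/4/2016 11:52" (month/day/year, 24-hour time)
        "created_at": dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M"),
    }

sample = ['12224879', 'Interactive Dynamic Video',
          'http://www.interactivedynamicvideo.com/', '386', '52',
          'ne0phyte', '8/4/2016 11:52']
post = parse_row(sample)
print(post["num_comments"], post["created_at"].hour)
```

The rest of the notebook keeps the rows as lists and converts values where they are used, which works equally well for a data set of this size.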

Data Cleaning

Let's start by splitting the data into three lists:

A list containing only posts whose titles start with Ask HN
A list containing only posts whose titles start with Show HN
A list containing all other posts, whose titles start with neither Ask HN nor Show HN

In [2]:

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Ask posts length:", len(ask_posts))
print("Show posts length:", len(show_posts))
print("Other posts length:", len(other_posts))

Ask posts length: 1744
Show posts length: 1162
Other posts length: 17194

For our analysis, we are only interested in posts whose titles start with Ask HN or Show HN.
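
The split relies on a case-insensitive prefix check, so a title such as "SHOW HN: ..." still counts as a Show HN post, while a title that merely mentions "ask hn" later in the text does not. A small sketch of the same rule as a standalone function follows; classify_title is a hypothetical name and the titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Made-up titles, not taken from the data set
titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My side project",
    "Why you should ask HN before launching",
]
labels = [classify_title(t) for t in titles]
print(labels)
```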

Data Analysis

Let's start by computing the average number of comments per post for both Ask HN and Show HN posts.

In [3]:

total_ask_comments = 0

for post in ask_posts:
    comments = int(post[4])
    total_ask_comments += comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average comments per Ask HN post: {:.2f}".format(avg_ask_comments))

Average comments per Ask HN post: 14.04

In [4]:

total_show_comments = 0

for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments
    
avg_show_comments = total_show_comments / len(show_posts)  # divide by the number of Show HN posts
print("Average comments per Show HN post: {:.2f}".format(avg_show_comments))

Average comments per Show HN post: 10.32

Ask HN posts receive more comments on average (about 14.0 per post) than Show HN posts (about 10.3 per post).
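
Both averages follow the same pattern: sum the num_comments column, then divide by the number of posts in that same list. Wrapping the pattern in one function keeps the numerator and denominator tied to the same list. The sketch below assumes num_comments sits at index 4, as in the rows above; average_comments is a hypothetical helper and the sample rows are made up.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments per post; sums and counts the same list."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the notebook's column order
sample_ask = [
    ["1", "Ask HN: a", "", "3", "10", "user_a", "1/1/2016 0:00"],
    ["2", "Ask HN: b", "", "5", "20", "user_b", "1/1/2016 1:00"],
]
sample_show = [
    ["3", "Show HN: c", "http://example.com", "7", "6", "user_c", "1/1/2016 2:00"],
]

print("{:.2f}".format(average_comments(sample_ask)))
print("{:.2f}".format(average_comments(sample_show)))
```

With the real data, the calls would be average_comments(ask_posts) and average_comments(show_posts).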

With that, we will use the Ask HN posts to determine which hour of the day attracts the most comments.

In [5]:

import datetime as dt

result_list = []  # each entry is [created_at, num_comments] for one Ask HN post
for post in ask_posts:
    created_at = post[6]
    comments = post[4]
    result_list.append([created_at, comments])

counts_by_hour = {}  # dict mapping each hour to the number of posts created in that hour
comments_by_hour = {}  # dict mapping each hour to the total number of comments on those posts

for result in result_list:
    created = dt.datetime.strptime(result[0], "%m/%d/%Y %H:%M")
    hour = created.strftime("%I %p")  # 12-hour clock plus the locale's AM/PM marker, e.g. "03 PM"
    comments = int(result[1])

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

# Sort counts_by_hour by value in descending order
counts_by_hour = dict(sorted(counts_by_hour.items(), key=lambda item: item[1], reverse=True))

# Sort comments_by_hour by value in descending order
comments_by_hour = dict(sorted(comments_by_hour.items(), key=lambda item: item[1], reverse=True))

print("Hour  | # of post")
for hour, posts_len in counts_by_hour.items():
    print(hour, "| ", posts_len)

Hour  | # of post
03 PM |  116
07 PM |  110
09 PM |  109
06 PM |  109
04 PM |  108
02 PM |  107
05 PM |  100
01 PM |  85
08 PM |  80
12 PM |  73
10 PM |  71
11 PM |  68
01 AM |  60
10 AM |  59
02 AM |  58
11 AM |  58
12 AM |  55
03 AM |  54
08 AM |  48
04 AM |  47
05 AM |  46
09 AM |  45
06 AM |  44
07 AM |  34

The results show that 3 PM has the most Ask HN posts (116), followed by 7 PM (110), then 9 PM and 6 PM (109 each), and so on.
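
The hour labels in the table come from parsing each created_at string and formatting only its hour. A minimal sketch of that step on its own follows, using timestamps copied from the first rows of the data set; hour_label is a hypothetical helper name. The AM/PM text produced by %p depends on the locale, so the labels assume an English or C locale.

```python
import datetime as dt

def hour_label(created_at):
    """Turn a timestamp like '1/26/2016 19:30' into a 12-hour label like '07 PM'."""
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    return parsed.strftime("%I %p")

for stamp in ["8/4/2016 11:52", "1/26/2016 19:30", "6/17/2016 0:01"]:
    print(stamp, "->", hour_label(stamp))
```

Note that midnight is labelled 12 AM and noon 12 PM, which is why both appear as separate rows in the table.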

Now, let's check the total number of comments per hour.

In [6]:

print("Hour  | # of Comments")
for hour, comments_len in comments_by_hour.items():
    print(hour, "| ", comments_len)

Hour  | # of Comments
03 PM |  4477
04 PM |  1814
09 PM |  1745
08 PM |  1722
06 PM |  1439
02 PM |  1416
02 AM |  1381
01 PM |  1253
07 PM |  1188
05 PM |  1146
10 AM |  793
12 PM |  687
01 AM |  683
11 AM |  641
11 PM |  543
08 AM |  492
10 PM |  479
05 AM |  464
12 AM |  447
03 AM |  421
06 AM |  397
04 AM |  337
07 AM |  267
09 AM |  251

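The ranking by total comments is similar: 3 PM leads by a wide margin (4477 comments), followed by 4 PM (1814) and 9 PM (1745). Because hours with more posts naturally accumulate more comments, we next compute the average number of comments per post for each hour.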

In [7]:

avg_by_hour = []  # each entry is [hour, average comments per post for that hour]
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

# Sort avg_by_hour by average in descending order
avg_by_hour = sorted(avg_by_hour, key=lambda item: item[1], reverse=True)

print("Top 5 Hours for Ask Posts Comments")
template = "{}: {:.2f} average comments per post."
for hour, average in avg_by_hour[:5]:
    # hour is already a formatted label such as "03 PM", so no re-parsing is needed
    print(template.format(hour, average))

Top 5 Hours for Ask Posts Comments
03 PM: 38.59 average comments per post.
02 AM: 23.81 average comments per post.
08 PM: 21.52 average comments per post.
04 PM: 16.80 average comments per post.
09 PM: 16.01 average comments per post.
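
The per-hour averages above are built from two dictionaries, one for totals and one for counts. The same grouping can be written as a single reusable function; the sketch below uses collections.defaultdict so that no "is this key new?" check is needed. The name average_by_key is hypothetical and the (hour, comments) pairs are made up for illustration.

```python
from collections import defaultdict

def average_by_key(pairs):
    """Given (key, value) pairs, return [(key, mean), ...] sorted by mean, highest first."""
    totals = defaultdict(int)  # key -> sum of values
    counts = defaultdict(int)  # key -> number of values
    for key, value in pairs:
        totals[key] += value
        counts[key] += 1
    averages = [(key, totals[key] / counts[key]) for key in totals]
    return sorted(averages, key=lambda item: item[1], reverse=True)

# Made-up (hour label, number of comments) pairs
sample_pairs = [("03 PM", 40), ("03 PM", 20), ("02 AM", 25), ("09 AM", 5)]
ranking = average_by_key(sample_pairs)
for hour, average in ranking:
    print("{}: {:.2f} average comments per post.".format(hour, average))
```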

Conclusion

From our analysis, we conclude that Ask HN posts attract more comments on average than Show HN posts. Based on the top five hours by average comments per post, Ask HN posts created at 3 PM, 2 AM, and 8 PM receive the most comments on average, followed by those created at 4 PM and 9 PM.