Hacker News is an independent news site, similar to Reddit, operated by Y Combinator. The site is built on a simple posting platform with an up/down voting system and comments on posts.
Hacker News is popular in technology and start-up circles and receives hundreds of thousands of visitors daily.
By examining how different post categories (i.e., Ask HN, Show HN, etc.) perform, we can gain insight into which sort of post would generate the most visibility (through comments or aggregate votes) for our client. We can also determine whether there is an opportune time of day to post on HN to maximize visibility.
import csv
import datetime as dt
This function accepts a list that follows the format of this project's imported dataset, calculates the average of a metric per hour of submission (in this project, either comments or points), and returns a list of [average, hour] pairs sorted in descending order of that metric.
post_list = a list of lists containing a subset of the main data file. Each row represents a single post on Hacker News
dataset_name = the name of the data subset, for use in the print-out
metric_index = the list index of the metric of interest. For number of comments, this will be 4. For number of points, this will be 3.
metric_name = the name of the metric of interest, for use in the print-out
top_x_hours = the number of hours to include in the printout
def metric_sort(post_list, dataset_name, metric_index,
                metric_name, top_x_hours):
    result_list = []
    for entry in post_list:
        num_metric = int(entry[metric_index])
        created_at = dt.datetime.strptime(entry[6], "%m/%d/%Y %H:%M")
        result_list.append([created_at, num_metric])
    # Accumulators for calculation of the average
    counts_by_hour = {}
    metric_by_hour = {}
    # Perform counting for the averages
    for entry in result_list:
        hour = entry[0].hour
        metric = entry[1]
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            metric_by_hour[hour] += metric
        else:
            counts_by_hour[hour] = 1
            metric_by_hour[hour] = metric
    # List-of-lists that will contain [[Hour (n), Average metric in hour (n)],
    #                                  [Hour (n+1), Average metric in hour (n+1)],
    #                                  ...]
    avg_by_hour = []
    for hour in counts_by_hour:
        posts = counts_by_hour[hour]
        metric = metric_by_hour[hour]
        average = metric / posts
        avg_by_hour.append([hour, average])
    # List-of-lists that will contain [[Average metric in hour (n), Hour (n)],
    #                                  [Average metric in hour (n+1), Hour (n+1)],
    #                                  ...]
    swap_avg_by_hour = []
    for hour in avg_by_hour:
        swap_avg_by_hour.append([hour[1], hour[0]])
    # Sorted so the hour with the HIGHEST average metric comes first,
    # then the second highest, and so on
    sorted_swap = sorted(swap_avg_by_hour, reverse=True)
    print("The {} hours during which \"{} posts\" accumulated the most {}:".format(
        top_x_hours, dataset_name, metric_name))
    for i in range(top_x_hours):
        [hr, metric] = [sorted_swap[i][1], sorted_swap[i][0]]
        time = dt.datetime.strptime(str(hr), "%H")
        time_str = time.strftime("%H:%M")
        print("{}: {:.2f} average {} per post".format(time_str, metric, metric_name))
    return sorted_swap
This function computes the total and average of a metric, given a data subset in the format of the imported data.
data_subset = a subset of the imported dataset, in the same format
subset_name = the name of the data subset, for use in the print-out
metric_index = the list index of the metric of interest. For number of comments, this will be 4. For number of points, this will be 3.
metric_name = the name of the metric of interest, for use in the print-out
def avg_metric(data_subset, subset_name, metric_index, metric_name):
    sum_metric = 0
    num_posts = len(data_subset)
    for entry in data_subset:
        sum_metric += int(entry[metric_index])
    average_metric = sum_metric / num_posts
    print('The average {} per post in the {} subset are: {:.2f}'.format(
        metric_name, subset_name, average_metric))
    return [sum_metric, average_metric]
This project utilizes data available on Kaggle, titled Hacker News Posts.
The original data file includes over 300,000 entries; after cleaning, the dataset used below contains 293,119 posts.
Each entry has the following columns:
[0] = id: the unique identifier number
[1] = title: the title of the post
[2] = url: the url that the post links to, if it links to a URL
[3] = num_points: the number of points the post acquired, calculated as upvotes less downvotes
[4] = num_comments: the number of comments made on the post
[5] = author: the username of the person who submitted the post
[6] = created_at: the date and time at which the post was submitted
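For readability, a raw row can be mapped onto these column names. The helper below is a hypothetical sketch following the index layout above; it is not used in the analysis itself:

```python
def row_to_dict(row):
    # Map a raw dataset row onto the column names listed above
    keys = ['id', 'title', 'url', 'num_points', 'num_comments',
            'author', 'created_at']
    return dict(zip(keys, row))
```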
from csv import reader
opened_file = open('./DataSets/HackerNews/HN_posts_year_to_Sep_26_2016.csv', encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]
print('A sample of the dataset is below:')
print(headers)
print('=============================================================================')
print(hn[0:4])
print('=============================================================================')
A sample of the dataset is below:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
=============================================================================
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
=============================================================================
With the data read into our program, we start by filtering the posts into three buckets: Ask HN posts, Show HN posts, and all other posts.
Additionally, we can calculate the average number of comments for each of these categories.
ask_posts = []
show_posts = []
other_posts = []
for entry in hn:
    title = entry[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(entry)
    elif title.startswith('show hn'):
        show_posts.append(entry)
    else:
        other_posts.append(entry)
num_ask_posts = len(ask_posts)
num_show_posts = len(show_posts)
num_other_posts = len(other_posts)
print('The total number of posts is {}'.format(len(hn)))
print('======================================')
print('The number of ask posts is \t{}'.format(num_ask_posts))
print('The number of show posts is \t{}'.format(num_show_posts))
print('The number of other posts is \t{}'.format(num_other_posts))
The total number of posts is 293119
======================================
The number of ask posts is 	9139
The number of show posts is 	10158
The number of other posts is 	273822
[total_ask_comments, average_ask_comments] = avg_metric(ask_posts, "ASK", 4, "comments")
[total_show_comments, average_show_comments] = avg_metric(show_posts, "SHOW", 4, "comments")
[total_other_comments, average_other_comments] = avg_metric(other_posts, "OTHER", 4, "comments")
The average comments per post in the ASK subset are: 10.39
The average comments per post in the SHOW subset are: 4.89
The average comments per post in the OTHER subset are: 6.46
Based on our analysis, ask posts receive the most comments, followed by other posts, then show posts. The exact results are in the output above. This finding makes sense, as those seeking help are specifically soliciting comments.
Next, we will examine how the average number of comments varies with the time of day at which a post is submitted.
comments_ask_sort = metric_sort(ask_posts, "Ask", 4, "comment(s)", 5)
print('\n')
comments_show_sort = metric_sort(show_posts, "Show", 4, "comment(s)", 5)
print('\n')
comments_other_sort = metric_sort(other_posts, "Other", 4, "comment(s)", 5)
The 5 hours during which "Ask posts" accumulated the most comment(s):
15:00: 28.68 average comment(s) per post
13:00: 16.32 average comment(s) per post
12:00: 12.38 average comment(s) per post
02:00: 11.14 average comment(s) per post
10:00: 10.68 average comment(s) per post

The 5 hours during which "Show posts" accumulated the most comment(s):
12:00: 6.99 average comment(s) per post
07:00: 6.68 average comment(s) per post
11:00: 6.00 average comment(s) per post
08:00: 5.60 average comment(s) per post
14:00: 5.52 average comment(s) per post

The 5 hours during which "Other posts" accumulated the most comment(s):
12:00: 7.59 average comment(s) per post
11:00: 7.37 average comment(s) per post
02:00: 7.18 average comment(s) per post
13:00: 7.15 average comment(s) per post
05:00: 6.79 average comment(s) per post
As shown in the output above, if you want to maximize the number of comments you receive on your posts, it is best to post during the hours of 2pm, 12pm, 11am, 1am, or 9am Central time (the times above are given in Eastern time).
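The Eastern-to-Central conversion used here is a fixed one-hour shift (Eastern time is one hour ahead of Central), which can be sketched as:

```python
# Top Eastern-time hours from the output above, shifted back one hour
# to Central time, wrapping around midnight with modular arithmetic
eastern_hours = [15, 13, 12, 2, 10]
central_hours = [(hr - 1) % 24 for hr in eastern_hours]
for et, ct in zip(eastern_hours, central_hours):
    print("{:02d}:00 ET -> {:02d}:00 CT".format(et, ct))
```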
So far, we have only examined the number of comments on a post. The number of points a post receives (the difference of the post's up and down votes), gives us an idea of the community's reception of the post.
[total_ask_points, average_ask_points] = avg_metric(ask_posts, "ASK", 3, "points")
[total_show_points, average_show_points] = avg_metric(show_posts, "SHOW", 3, "points")
[total_other_points, average_other_points] = avg_metric(other_posts, "OTHER", 3, "points")
The average points per post in the ASK subset are: 11.31
The average points per post in the SHOW subset are: 14.84
The average points per post in the OTHER subset are: 15.16
The data above show that non-ask, non-show HN posts receive the most points, followed by Show HN posts. One could imagine users posting memes or cat photos, which are easy to award points to but neither promote discussion, ask for information, nor provide information.
A brief exploration of the articles composing the other category is given below.
other_titles = []
for entry in other_posts:
    other_titles.append(entry[1])
for i in range(21):
    print(other_titles[i])
You have two days to comment if you want stem cells to be classified as your own SQLAR the SQLite Archiver What if we just printed a flatscreen television on the side of our boxes? algorithmic music How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake Saving the Hassle of Shopping Macalifa A new open-source music app for UWP that won't suck GitHub theweavrs/Macalifa: A music player written for UWP Google Allo first Impression Advanced Multimedia on the Linux Command Line Muroc Maru Why companies make their products worse Tuning AWS SQS Queues The Promise of GitHub Joint R&D Has Its Ups and Downs IBM announces next implementation of Apples Swift developer language Amazons Algorithms Dont Find You the Best Deals Ruffled Feathers The Veil of Ignorance Design and Accessbility OMeta#: Who? What? When? Where? Why? (2008) Burning Ship fractal
Without reading the actual posts, I infer that the other category consists of a mixture of unlabeled "Show HN" posts and news/technical stories that users think would be interesting for the community:
Unlabeled Show HN
News/Technical Stories
Personal views/reviews
Given that news stories tend to have broader appeal than any one project, it makes sense that more users would be interested in rating the story.
Finally, we can examine whether the number of points, which I would take as a stronger proxy of user activity than the number of comments, varies by time of day.
# Using ask_posts, show_posts, and other_posts lists from code block 2
sorted_ask = metric_sort(ask_posts, "Ask", 3, "points", 5)
print('\n')
sorted_show = metric_sort(show_posts, "Show", 3, "points", 5)
print('\n')
sorted_other = metric_sort(other_posts, "Other", 3, "points", 5)
The 5 hours during which "Ask posts" accumulated the most points:
15:00: 21.64 average points per post
13:00: 17.93 average points per post
12:00: 13.58 average points per post
10:00: 13.44 average points per post
17:00: 12.19 average points per post

The 5 hours during which "Show posts" accumulated the most points:
12:00: 20.91 average points per post
11:00: 19.26 average points per post
13:00: 17.02 average points per post
19:00: 16.06 average points per post
06:00: 15.99 average points per post

The 5 hours during which "Other posts" accumulated the most points:
02:00: 16.71 average points per post
12:00: 16.70 average points per post
11:00: 16.29 average points per post
00:00: 16.12 average points per post
13:00: 16.02 average points per post
The most striking result of this analysis is that the points per post in the other category are almost flat across a 24-hour period. Contrast this with the ranges of the other two categories, which are significantly larger:
range_ask = sorted_ask[0][0] - sorted_ask[-1][0]
range_show = sorted_show[0][0] - sorted_show[-1][0]
range_other = sorted_other[0][0] - sorted_other[-1][0]
print("Range of average points per post:")
print("=================================")
print("Ask HN: {:.2f}".format(range_ask))
print("Show HN: {:.2f}".format(range_show))
print("Other: {:.2f}".format(range_other))
Range of average points per post:
=================================
Ask HN: 14.01
Show HN: 10.38
Other: 2.93
It is worth noting that there are many more "other" posts than "ask" or "show" posts, which might contribute to this decrease in variability.
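To quantify this beyond a max-minus-min range, one could compute the standard deviation of the hourly averages. The sketch below is a hypothetical helper, not part of the analysis above, assuming the [[average, hour], ...] lists returned by metric_sort:

```python
import statistics

def avg_spread(sorted_swap):
    # sorted_swap is a [[average_metric, hour], ...] list, as returned
    # by metric_sort; pstdev measures how much the hourly averages vary
    averages = [avg for avg, hour in sorted_swap]
    return statistics.pstdev(averages)
```

Under the flatness observed above, avg_spread(sorted_other) would be expected to come out small relative to avg_spread(sorted_ask).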
In this project, we examined a dataset containing posting, commenting, and rating statistics from "Hacker News." We addressed four questions: which post category receives the most comments, whether time of day affects comments, which category receives the most points, and whether time of day affects points.
Ask posts receive the most comments, followed by other posts, then show posts. I posit that this is due to the very nature of ask posts: they explicitly request feedback.
Time of day does appear to affect the number of comments received per Ask HN post. Beyond a cluster of hours in the early afternoon Central US time, I do not see a particularly predictable pattern. Overall, the hour with the highest number of comments per "Ask HN" post is 2pm Central US time.
Surprisingly, both "Ask HN" and "Show HN" posts receive fewer points than posts in the other category. A brief analysis shows that this other category is composed of news stories and unlabeled "show"-style posts.
This question also provided the surprising result that, over an average 24-hour period, the variability in points per post was drastically lower in the "other" category than in the "ask" or "show" categories. This could reflect the broader interest these topics attract, as well as news articles being more digestible during the day than entire coding projects. An alternative explanation would be that we have a significantly larger denominator in the "other" category, which might be masking a higher range.
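The larger-denominator explanation could be probed by counting raw submissions per hour in each category. A minimal sketch, assuming the dataset format described earlier (the created_at string at index 6):

```python
from collections import Counter
import datetime as dt

def posts_per_hour(post_list):
    # Tally submissions for each hour of the day; post_list rows follow
    # the dataset layout, with the created_at string at index 6
    hours = [dt.datetime.strptime(entry[6], "%m/%d/%Y %H:%M").hour
             for entry in post_list]
    return Counter(hours)
```

Comparing posts_per_hour(other_posts) with posts_per_hour(ask_posts) would show whether the "other" hourly averages are computed over far larger per-hour samples.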