Hacker News is an independent news site, similar to Reddit, operated by Y Combinator. The site is built on a simple posting platform with an up/down voting system and comments on posts.
Hacker News is popular in technology and start-up circles and receives hundreds of thousands of visitors daily.
By examining how different post categories (i.e., Ask HN, Show HN, etc.) perform, we can gain insight into which sort of post would generate the most visibility (through comments or aggregate votes) for our client. We can also determine whether there is an opportune time of day to post on HN to maximize visibility.
import csv
import datetime as dt
This function accepts a list that follows the format of this project's imported dataset, calculates the average of a metric per hour of submission (in this project, either comments or points), and returns a list of [average, hour] pairs sorted in descending order of that metric.
post_list = a list of lists containing a subset of the main data file. Each row represents a single post on Hacker News
dataset_name = the name of the data subset, for use in the print-out
metric_index = the list index of the metric of interest. For number of comments, this will be 4. For number of points, this will be 3.
metric_name = the name of the metric of interest, for use in the print-out
top_x_hours = the number of hours to include in the printout
def metric_sort(post_list, dataset_name, metric_index,
                metric_name, top_x_hours):
    result_list = []
    for entry in post_list:
        num_metric = int(entry[metric_index])
        created_at = dt.datetime.strptime(entry[6], "%m/%d/%Y %H:%M")
        result_list.append([created_at, num_metric])
    # Accumulators for calculation of the average
    counts_by_hour = {}
    metric_by_hour = {}
    # Perform counting for the averages
    for entry in result_list:
        hour = entry[0].hour
        metric = entry[1]
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            metric_by_hour[hour] += metric
        else:
            counts_by_hour[hour] = 1
            metric_by_hour[hour] = metric
    # List-of-lists that will contain [[Hour (n), Average metric in hour (n)],
    #                                  [Hour (n+1), Average metric in hour (n+1)],
    #                                  ...]
    avg_by_hour = []
    for hour in counts_by_hour:
        posts = counts_by_hour[hour]
        metric = metric_by_hour[hour]
        average = metric / posts
        avg_by_hour.append([hour, average])
    # List-of-lists that will contain [[Average metric in hour (n), Hour (n)],
    #                                  [Average metric in hour (n+1), Hour (n+1)],
    #                                  ...]
    swap_avg_by_hour = []
    for hour in avg_by_hour:
        swap_avg_by_hour.append([hour[1], hour[0]])
    # Sorted so the hour with the HIGHEST average metric comes first,
    # then the second highest, and so on
    sorted_swap = sorted(swap_avg_by_hour, reverse=True)
    print("The {} hours during which \"{} posts\" accumulated the most {}:".format(
        top_x_hours, dataset_name, metric_name))
    for i in range(top_x_hours):
        [hr, metric] = [sorted_swap[i][1], sorted_swap[i][0]]
        time = dt.datetime.strptime(str(hr), "%H")
        time_str = time.strftime("%H:%M")
        print("{}: {:.2f} average {} per post".format(time_str, metric, metric_name))
    return sorted_swap
This function computes the total and average of a metric, given a data subset in the format of the imported data.
data_subset = a subset of the imported dataset, in the same format
subset_name = the name of the data subset, for use in the print-out
metric_index = the list index of the metric of interest. For number of comments, this will be 4. For number of points, this will be 3.
metric_name = the name of the metric of interest, for use in the print-out
def avg_metric(data_subset, subset_name, metric_index, metric_name):
    sum_metric = 0
    num_posts = len(data_subset)
    for entry in data_subset:
        sum_metric += int(entry[metric_index])
    average_metric = sum_metric / num_posts
    print('The average {} per post in the {} subset are: {:.2f}'.format(
        metric_name, subset_name, average_metric))
    return [sum_metric, average_metric]
This project utilizes data available on Kaggle, titled Hacker News Posts.
The original data file includes over 300,000 entries; after cleaning, the dataset used below contains 293,119 posts.
Each entry has the following columns:
[0] = id: the unique identifier number
[1] = title: the title of the post
[2] = url: the url that the post links to, if it links to a URL
[3] = num_points: the number of points the post acquired, calculated as upvotes less downvotes
[4] = num_comments: the number of comments made on the post
[5] = author: the username of the person who submitted the post
[6] = created_at: the date and time at which the post was submitted
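For readability, a raw row can be mapped onto these column names. The helper below is a hypothetical sketch following the index layout above; it is not used in the analysis itself:

```python
def row_to_dict(row):
    # Map a raw dataset row onto the column names listed above
    keys = ['id', 'title', 'url', 'num_points', 'num_comments',
            'author', 'created_at']
    return dict(zip(keys, row))
```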
from csv import reader
opened_file = open('./DataSets/HackerNews/HN_posts_year_to_Sep_26_2016.csv', encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]
print('A sample of the dataset is below:')
print(headers)
print('=============================================================================')
print(hn[0:4])
print('=============================================================================')
A sample of the dataset is below:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
=============================================================================
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
=============================================================================
With the data read into our program, we start by filtering the posts into three buckets: Ask HN posts, Show HN posts, and all other posts.
Additionally, we can calculate the average number of comments for each of these categories.
ask_posts = []
show_posts = []
other_posts = []
for entry in hn:
    title = entry[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(entry)
    elif title.startswith('show hn'):
        show_posts.append(entry)
    else:
        other_posts.append(entry)
num_ask_posts = len(ask_posts)
num_show_posts = len(show_posts)
num_other_posts = len(other_posts)
print('The total number of posts is {}'.format(len(hn)))
print('======================================')
print('The number of ask posts is \t{}'.format(num_ask_posts))
print('The number of show posts is \t{}'.format(num_show_posts))
print('The number of other posts is \t{}'.format(num_other_posts))
The total number of posts is 293119
======================================
The number of ask posts is 	9139
The number of show posts is 	10158
The number of other posts is 	273822
[total_ask_comments, average_ask_comments] = avg_metric(ask_posts, "ASK", 4, "comments")
[total_show_comments, average_show_comments] = avg_metric(show_posts, "SHOW", 4, "comments")
[total_other_comments, average_other_comments] = avg_metric(other_posts, "OTHER", 4, "comments")
The average comments per post in the ASK subset are: 10.39
The average comments per post in the SHOW subset are: 4.89
The average comments per post in the OTHER subset are: 6.46
Based on our analysis, ask posts receive the most comments, followed by other posts, then show posts. The exact results are in the output above. This finding makes sense, as those seeking help are specifically soliciting comments.
Next, we will examine how the average number of comments varies with the time of day at which a post is submitted.
comments_ask_sort = metric_sort(ask_posts, "Ask", 4, "comment(s)", 5)
print('\n')
comments_show_sort = metric_sort(show_posts, "Show", 4, "comment(s)", 5)
print('\n')
comments_other_sort = metric_sort(other_posts, "Other", 4, "comment(s)", 5)
The 5 hours during which "Ask posts" accumulated the most comment(s):
15:00: 28.68 average comment(s) per post
13:00: 16.32 average comment(s) per post
12:00: 12.38 average comment(s) per post
02:00: 11.14 average comment(s) per post
10:00: 10.68 average comment(s) per post

The 5 hours during which "Show posts" accumulated the most comment(s):
12:00: 6.99 average comment(s) per post
07:00: 6.68 average comment(s) per post
11:00: 6.00 average comment(s) per post
08:00: 5.60 average comment(s) per post
14:00: 5.52 average comment(s) per post

The 5 hours during which "Other posts" accumulated the most comment(s):
12:00: 7.59 average comment(s) per post
11:00: 7.37 average comment(s) per post
02:00: 7.18 average comment(s) per post
13:00: 7.15 average comment(s) per post
05:00: 6.79 average comment(s) per post
As shown in the output above, if you want to maximize the number of comments you receive on your posts, it is best to post during the hours of 2pm, 12pm, 11am, 1am, or 9am Central time (the times above are given in Eastern time).
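The Eastern-to-Central conversion used here is a fixed one-hour shift (Eastern time is one hour ahead of Central), which can be sketched as:

```python
# Top Eastern-time hours from the output above, shifted back one hour
# to Central time, wrapping around midnight with modular arithmetic
eastern_hours = [15, 13, 12, 2, 10]
central_hours = [(hr - 1) % 24 for hr in eastern_hours]
for et, ct in zip(eastern_hours, central_hours):
    print("{:02d}:00 ET -> {:02d}:00 CT".format(et, ct))
```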
So far, we have only examined the number of comments on a post. The number of points a post receives (the difference of the post's up and down votes), gives us an idea of the community's reception of the post.
[total_ask_points, average_ask_points] = avg_metric(ask_posts, "ASK", 3, "points")
[total_show_points, average_show_points] = avg_metric(show_posts, "SHOW", 3, "points")
[total_other_points, average_other_points] = avg_metric(other_posts, "OTHER", 3, "points")
The average points per post in the ASK subset are: 11.31
The average points per post in the SHOW subset are: 14.84
The average points per post in the OTHER subset are: 15.16
The data above show that non-ask, non-show HN posts receive the most points, followed by Show HN posts. One could imagine users posting memes or cat photos, which are easy to award points to but neither promote discussion, ask for information, nor provide information.
A brief exploration of the articles composing the other category is given below.
other_titles = []
for entry in other_posts:
    other_titles.append(entry[1])
for i in range(21):
    print(other_titles[i])
You have two days to comment if you want stem cells to be classified as your own SQLAR the SQLite Archiver What if we just printed a flatscreen television on the side of our boxes? algorithmic music How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake Saving the Hassle of Shopping Macalifa A new open-source music app for UWP that won't suck GitHub theweavrs/Macalifa: A music player written for UWP Google Allo first Impression Advanced Multimedia on the Linux Command Line Muroc Maru Why companies make their products worse Tuning AWS SQS Queues The Promise of GitHub Joint R&D Has Its Ups and Downs IBM announces next implementation of Apples Swift developer language Amazons Algorithms Dont Find You the Best Deals Ruffled Feathers The Veil of Ignorance Design and Accessbility OMeta#: Who? What? When? Where? Why? (2008) Burning Ship fractal
Without reading the actual posts, I infer that the other category consists of a mixture of unlabeled "Show HN" posts and news/technical stories that users think would be interesting for the community:
Unlabeled Show HN
News/Technical Stories
Personal views/reviews
Given that news stories tend to have broader appeal than any one project, it makes sense that more users would be interested in rating the story.
Finally, we can examine whether the number of points, which I would take as a stronger proxy of user activity than the number of comments, varies by time of day.
# Using ask_posts, show_posts, and other_posts lists from code block 2
sorted_ask = metric_sort(ask_posts, "Ask", 3, "points", 5)
print('\n')
sorted_show = metric_sort(show_posts, "Show", 3, "points", 5)
print('\n')
sorted_other = metric_sort(other_posts, "Other", 3, "points", 5)
The 5 hours during which "Ask posts" accumulated the most points:
15:00: 21.64 average points per post
13:00: 17.93 average points per post
12:00: 13.58 average points per post
10:00: 13.44 average points per post
17:00: 12.19 average points per post

The 5 hours during which "Show posts" accumulated the most points:
12:00: 20.91 average points per post
11:00: 19.26 average points per post
13:00: 17.02 average points per post
19:00: 16.06 average points per post
06:00: 15.99 average points per post

The 5 hours during which "Other posts" accumulated the most points:
02:00: 16.71 average points per post
12:00: 16.70 average points per post
11:00: 16.29 average points per post
00:00: 16.12 average points per post
13:00: 16.02 average points per post
The most striking result of this analysis is that the points per post in the other category are almost flat across a 24-hour period. Contrast this with the ranges of the other two categories, which are significantly larger:
range_ask = sorted_ask[0][0] - sorted_ask[-1][0]
range_show = sorted_show[0][0] - sorted_show[-1][0]
range_other = sorted_other[0][0] - sorted_other[-1][0]
print("Range of average points per post:")
print("=================================")
print("Ask HN: {:.2f}".format(range_ask))
print("Show HN: {:.2f}".format(range_show))
print("Other: {:.2f}".format(range_other))
Range of average points per post:
=================================
Ask HN: 14.01
Show HN: 10.38
Other: 2.93
It is worth noting that there are many more "other" posts than "ask" or "show" posts, which might contribute to this decrease in variability.
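To quantify this beyond a max-minus-min range, one could compute the standard deviation of the hourly averages. The sketch below is a hypothetical helper, not part of the analysis above, assuming the [[average, hour], ...] lists returned by metric_sort:

```python
import statistics

def avg_spread(sorted_swap):
    # sorted_swap is a [[average_metric, hour], ...] list, as returned
    # by metric_sort; pstdev measures how much the hourly averages vary
    averages = [avg for avg, hour in sorted_swap]
    return statistics.pstdev(averages)
```

Under the flatness observed above, avg_spread(sorted_other) would be expected to come out small relative to avg_spread(sorted_ask).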
In this project, we examined a dataset containing posting, commenting, and rating statistics from "Hacker News." We addressed four questions: which post category receives the most comments, whether time of day affects comments, which category receives the most points, and whether time of day affects points.
Ask posts receive the most comments, followed by other posts, then show posts. I posit that this is due to the very nature of ask posts: they explicitly request feedback.
Time of day does appear to affect the number of comments received per Ask HN post. Beyond a cluster of hours in the early afternoon Central US time, I do not see a particularly predictable pattern. Overall, the hour with the highest number of comments per "Ask HN" post is 2pm Central US time.
Surprisingly, both "Ask HN" and "Show HN" posts receive fewer points than posts in the other category. A brief analysis shows that this other category is composed of news stories and unlabeled "show"-style posts.
This question also provided the surprising result that, over an average 24-hour period, the variability in points per post was drastically lower in the "other" category than in the "ask" or "show" categories. This could reflect the broader interest these topics attract, as well as news articles being more digestible during the day than entire coding projects. An alternative explanation would be that we have a significantly larger denominator in the "other" category, which might be masking a higher range.
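The larger-denominator explanation could be probed by counting raw submissions per hour in each category. A minimal sketch, assuming the dataset format described earlier (the created_at string at index 6):

```python
from collections import Counter
import datetime as dt

def posts_per_hour(post_list):
    # Tally submissions for each hour of the day; post_list rows follow
    # the dataset layout, with the created_at string at index 6
    hours = [dt.datetime.strptime(entry[6], "%m/%d/%Y %H:%M").hour
             for entry in post_list]
    return Counter(hours)
```

Comparing posts_per_hour(other_posts) with posts_per_hour(ask_posts) would show whether the "other" hourly averages are computed over far larger per-hour samples.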