Exploring Hacker News posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

This is the link for the original data set

The columns are described below:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
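
Because each row will be read as a plain list, columns are accessed by position. The following is a minimal sketch of how column names map to positions, assuming the CSV header follows the order listed above. The `col` lookup table and `sample_row` are illustrative names, not part of the original analysis.

```python
# Column names in the order listed above (assumed to match the CSV header).
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical lookup table: column name -> position in each row.
col = {name: position for position, name in enumerate(header)}

# One row as csv.reader would return it: every value is a string.
sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

title = sample_row[col['title']]
num_comments = int(sample_row[col['num_comments']])  # convert before doing arithmetic
print(col['title'], col['num_comments'], col['created_at'])  # 1 4 6
```

These positions (1, 4, and 6) are the indexes used for the title, comment count, and creation time in the rest of the notebook.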

We're specifically interested in posts whose titles begin with either Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, a product, or simply something interesting. For example:

Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
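
The prefix rule can be sketched as a small function. This is only an illustration: `classify_title` is a hypothetical helper, and the case-insensitive match is an assumption about how loosely the prefix should be interpreted.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

examples = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "algorithmic music",
]
labels = [classify_title(t) for t in examples]
print(labels)  # ['ask', 'show', 'other']
```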

We'll compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Let's start by reading the data set into a list of lists and printing the first five entries.

In [1]:

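from csv import reader

# Read the whole CSV file into a list of lists; the with-block closes the file.
with open("HN_posts_year_to_Sep_26_2016.csv", encoding="utf-8") as opened:
    hn = list(reader(opened))

header = hn[0]   # first row holds the column names
hn = hn[1:]      # remaining rows are the posts

print(header)
hn[:5]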

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Out[1]:

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

Now let's separate the data set into 'Ask HN', 'Show HN', and 'Other' posts.

In [2]:

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    if row[1].lower().startswith("ask hn"):
        ask_posts.append(row)
    elif row[1].lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("AH posts: ", len(ask_posts))
print("SH posts: ",len(show_posts))
print("'Other' posts:", len(other_posts))

AH posts:  9139
SH posts:  10158
'Other' posts: 273822

Now let's determine which type of post receives more comments on average

In [3]:

# Total and average number of comments for Ask HN posts
total_ask_comments = 0
for comm in ask_posts:
    total_ask_comments += int(comm[4])   # column 4 is num_comments

avg_ask_comments = round(total_ask_comments / len(ask_posts))

print("Average number of comments for the AH posts: ", avg_ask_comments)

# Same calculation for Show HN posts
total_show_comments = 0
for comm in show_posts:
    total_show_comments += int(comm[4])

avg_show_comments = round(total_show_comments / len(show_posts))

print("Average number of comments for the SH posts: ", avg_show_comments)

Average number of comments for the AH posts:  10
Average number of comments for the SH posts:  5

Ask HN posts receive about twice as many comments on average as Show HN posts (10 versus 5), so the rest of the analysis focuses on Ask HN posts.
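
The averages above are rounded to whole numbers, which hides some detail. As a sketch only, the same calculation could be wrapped in a reusable helper that keeps two decimals. `average_comments` and the toy rows are hypothetical and are not taken from the data set.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column, rounded to two decimals."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return round(total / len(posts), 2)

# Made-up rows for illustration only; real rows come from the CSV file.
toy_ask = [['1', 'Ask HN: a?', '', '3', '12', 'u1', '9/26/2016 3:24'],
           ['2', 'Ask HN: b?', '', '1', '7', 'u2', '9/26/2016 4:10']]
toy_show = [['3', 'Show HN: c', '', '5', '4', 'u3', '9/26/2016 5:00']]

print(average_comments(toy_ask))   # 9.5
print(average_comments(toy_show))  # 4.0
```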

Calculating the average number of comments for AH posts by hour

Next, we'll determine whether Ask HN (AH) posts created at a certain time of day are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask posts created in each hour of the day, along with the number of comments they received.
Calculate the average number of comments ask posts receive, by hour created.
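
Before running these steps on the full data set, they can be sketched on a tiny hand-made sample. The sample values are invented for illustration, and using `collections.defaultdict` is one possible choice rather than the approach this notebook takes.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs for illustration only.
sample = [("9/26/2016 3:24", 4), ("9/25/2016 3:50", 6), ("9/24/2016 15:05", 30)]

posts_per_hour = defaultdict(int)      # step 1: posts created in each hour
comments_per_hour = defaultdict(int)   # step 1: comments received in each hour

for created_at, n_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour  # integer 0-23
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += n_comments

# Step 2: average comments per post for each hour.
averages = {hour: comments_per_hour[hour] / posts_per_hour[hour]
            for hour in posts_per_hour}
print(averages)  # {3: 5.0, 15: 30.0}
```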

In [4]:

import datetime as dt

result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])
    
posts_by_hour = {} 
comments_by_hour ={}
date_format = "%m/%d/%Y %H:%M"             # created format type for date


for row in result_list:
    date = row[0]
    hour = dt.datetime.strptime(date, date_format).strftime("%H")      # date formatted and hour extracted
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1] 
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
        
# Now the average

avg_by_hour = []

for row in posts_by_hour:
    avg_by_hour.append([row,round(comments_by_hour[row]/posts_by_hour[row],2)])

avg_by_hour.sort(reverse = True, key = lambda x:x[1])          # sorted by highest n of comments (second column)

for h,avg in avg_by_hour:
    print("H {} - {} avg comments".format(dt.datetime.strptime(h,"%H").strftime("%H:%M"),avg) )

H 15:00 - 28.68 avg comments
H 13:00 - 16.32 avg comments
H 12:00 - 12.38 avg comments
H 02:00 - 11.14 avg comments
H 10:00 - 10.68 avg comments
H 04:00 - 9.71 avg comments
H 14:00 - 9.69 avg comments
H 17:00 - 9.45 avg comments
H 08:00 - 9.19 avg comments
H 11:00 - 8.96 avg comments
H 22:00 - 8.8 avg comments
H 05:00 - 8.79 avg comments
H 20:00 - 8.75 avg comments
H 21:00 - 8.69 avg comments
H 03:00 - 7.95 avg comments
H 18:00 - 7.94 avg comments
H 16:00 - 7.71 avg comments
H 00:00 - 7.56 avg comments
H 01:00 - 7.41 avg comments
H 19:00 - 7.16 avg comments
H 07:00 - 7.01 avg comments
H 06:00 - 6.78 avg comments
H 23:00 - 6.7 avg comments
H 09:00 - 6.65 avg comments

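According to the data set documentation, the timestamps are in US Eastern Time.

The hour with the highest average number of comments per post is 15:00, with an average of 28.68.

The 12:00-15:00 (GMT-5) band looks the most popular: 15:00, 13:00, and 12:00 are the three highest-ranked hours, although 14:00 averages only 9.69 comments.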
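
Hours expressed in GMT-5 can be translated to another time zone, for example London time. The sketch below assumes fixed UTC offsets and ignores daylight saving time. Both regions observe it, so the gap is five hours for most of the year, but not on every date. `eastern_hour_to_london` is a hypothetical helper.

```python
import datetime as dt

# Assumption: fixed offsets, ignoring daylight saving time.
EASTERN = dt.timezone(dt.timedelta(hours=-5))  # GMT-5
LONDON = dt.timezone(dt.timedelta(hours=0))    # GMT

def eastern_hour_to_london(hour):
    """Convert an hour of the day in GMT-5 to the same instant in GMT, as 'HH:MM'."""
    eastern_time = dt.datetime(2016, 1, 15, hour, 0, tzinfo=EASTERN)  # arbitrary date
    return eastern_time.astimezone(LONDON).strftime("%H:%M")

converted = [eastern_hour_to_london(h) for h in (12, 15)]
print(converted)  # ['17:00', '20:00']
```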

Conclusions

Based on this data set, a post is most likely to attract comments when it is submitted as an Ask HN (AH) post between 12:00 and 15:00 US Eastern Time, which is roughly 17:00 to 20:00 London time.