
Hacker News Site - When and What to post to get the maximum comments¶

Objective¶

In this project, I will analyse a representative set of data from the Hacker News site to determine which kind of post (Ask HN or Show HN) drives more comments, and whether the time at which a post was created has an influence on the number of comments it receives.

The data set can be found on Kaggle.

It contains approximately 20,000 rows, reduced from the roughly 300,000 rows originally extracted in 2016: posts without comments were removed and the remaining submissions were randomly sampled.

The columns are described below:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted (using Eastern Time in the US)
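
Every value arrives from the CSV file as a string, so the numeric and date columns need converting before they can be used in calculations. Here is a minimal sketch, assuming the column order listed above and the timestamp format used in the file (for example `8/4/2016 11:52`). `parse_row` is a hypothetical helper introduced only for illustration; the notebook itself indexes the raw lists directly.

```python
import datetime as dt

def parse_row(row):
    """Convert one raw CSV row (a list of strings) into typed values."""
    post_id, title, url, num_points, num_comments, author, created_at = row
    return {
        "id": int(post_id),
        "title": title,
        "url": url,
        "num_points": int(num_points),
        "num_comments": int(num_comments),
        "author": author,
        # Timestamps look like '8/4/2016 11:52' (month/day/year hour:minute)
        "created_at": dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M"),
    }

sample = ['12224879', 'Interactive Dynamic Video',
          'http://www.interactivedynamicvideo.com/', '386', '52',
          'ne0phyte', '8/4/2016 11:52']
parsed = parse_row(sample)
print(parsed["num_comments"], parsed["created_at"].hour)
```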

Importing and preparing the data¶

In [1]:

#Importing the data
from csv import reader

# Open the file in a with-block so it is closed once the rows are read
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]

In [2]:

#separating the header from the data for ease of use
header = hn[0]
hn = hn[1:]

print(header)
for row in hn[:6]:
    print(row)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48']

Preparing the data¶

We are only concerned with posts whose titles begin with Ask HN or Show HN, so we'll create new lists of lists containing just the data for those posts.

In [3]:

#empty lists to store data of interest
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('There are {} in the ask hn list'.format(len(ask_posts)))
print('There are {} in the show hn list'.format(len(show_posts)))
print('There are {} in the other list'.format(len(other_posts)))

There are 1744 in the ask hn list
There are 1162 in the show hn list
There are 17194 in the other list

Comparing both types of posts¶

We want to determine which kind of post, Ask HN or Show HN, gathers more interest.

To do so, we will look at the average number of comments each type of post receives.

In [4]:

#Ask hn - average number of comments per posts

total_ask_comments = 0

for post in ask_posts:
    n_comments = int(post[4])
    total_ask_comments +=n_comments

avg_ask_comments = total_ask_comments / len(ask_posts)  

print('There are {:.2f} comments on average for Ask HN posts'.format(avg_ask_comments))

#Show hn - average number of comments per post

total_show_comments = 0

for post in show_posts:
    n_comments = int(post[4])
    total_show_comments += n_comments

avg_show_comments = total_show_comments / len(show_posts)

print('There are {:.2f} comments on average for Show HN posts'.format(avg_show_comments))

There are 14.04 comments on average for Ask HN posts
There are 10.32 comments on average for Show HN posts
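
The two loops above repeat the same steps, so the calculation can be folded into one reusable function. This is only a sketch: `average_comments` and the `demo_posts` rows are made up for illustration, and the function assumes the comment count sits at index 4, as in the data set.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments over a list of raw rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Tiny made-up rows; only the num_comments column (index 4) matters here
demo_posts = [
    ['1', 'Ask HN: a', '', '5', '10', 'u1', '1/1/2016 10:00'],
    ['2', 'Ask HN: b', '', '3', '20', 'u2', '1/1/2016 11:00'],
]
print('{:.2f}'.format(average_comments(demo_posts)))
```

With the notebook's lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same averages as the cell above.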

In [5]:

#looking at maximum number of comments
#ask hn
max_n_comments_ask = 0 
for post in ask_posts:
    n_comments = int(post[4])
    if n_comments > max_n_comments_ask:
        max_n_comments_ask = n_comments
        
print('There is a maximum of {} comments for Ask HN posts'.format(max_n_comments_ask))

#show hn
max_n_comments_show = 0 
for post in show_posts:
    n_comments = int(post[4])
    if n_comments > max_n_comments_show:
        max_n_comments_show = n_comments
        
print('There is a maximum of {} comments for Show HN posts'.format(max_n_comments_show))

There is a maximum of 947 comments for Ask HN posts
There is a maximum of 306 comments for Show HN posts
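
The manual search for the maximum can also be written with Python's built-in `max`. The sketch below uses a hypothetical helper, `max_comments`, and made-up demo rows. It assumes the comment count is at index 4 and returns 0 for an empty list.

```python
def max_comments(posts, comments_index=4):
    """Return the largest comment count in a list of raw rows (0 if empty)."""
    return max((int(post[comments_index]) for post in posts), default=0)

# Made-up rows for illustration only
demo_posts = [
    ['1', 'Show HN: a', '', '5', '7', 'u1', '1/1/2016 10:00'],
    ['2', 'Show HN: b', '', '3', '42', 'u2', '1/1/2016 11:00'],
    ['3', 'Show HN: c', '', '9', '13', 'u3', '1/1/2016 12:00'],
]
print(max_comments(demo_posts))
```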

Comparing both types of posts - conclusion¶

The averages show that Ask HN posts receive about 1.4 times as many comments as Show HN posts: each Ask HN post in our sample received an average of 14.04 comments, compared with 10.32 for Show HN posts. As a further check, we can look at the maximum number of comments received by each type of post. The result points the same way: the most-commented Ask HN post received 947 comments, about 3 times as many as the most-commented Show HN post (306).
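
The ratios behind this comparison can be checked with a few lines of arithmetic. The values below are copied from the printed results above, and the variable names are introduced only for this check.

```python
avg_ask, avg_show = 14.04, 10.32   # averages printed above
max_ask, max_show = 947, 306       # maxima printed above

avg_ratio = avg_ask / avg_show
max_ratio = max_ask / max_show

print('Average ratio: {:.2f}'.format(avg_ratio))
print('Maximum ratio: {:.2f}'.format(max_ratio))
```

This gives a ratio of about 1.36 for the averages and about 3.09 for the maxima.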

Based on the above, we will focus on the Ask HN posts for the remaining part of the analysis.

Finding if there is a better time to make an Ask HN post¶

In order to do so, we will:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

In [6]:

# Importing the Datetime module given we will work with time 
import datetime as dt

Calculating the number of ask posts created in each hour of the day, along with the number of comments received¶

In [7]:

result_list = []

for post in ask_posts:
    created_at = post[6]
    n_comments = int(post[4])
    new_element = [created_at,n_comments]
    result_list.append(new_element)
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    time = row[0]
    comments = int(row[1])
    time_dt = dt.datetime.strptime(time,"%m/%d/%Y %H:%M")
    hour = time_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
        
#Making sure the results look realistic and are likely to be accurate
print(counts_by_hour)
print(comments_by_hour)
#making sure we have all the rows
print("There are {} lines in counts by hours".format(len(counts_by_hour)))
print("There are {} lines in comments by hours".format(len(comments_by_hour)))

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
There are 24 lines in counts by hours
There are 24 lines in comments by hours
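
As an alternative to the `if hour not in ...` check, the same tally can be written with `collections.defaultdict`, which starts every missing key at zero. This is a sketch under the same assumptions as the cell above: rows are `[created_at, n_comments]` pairs and timestamps follow the `%m/%d/%Y %H:%M` format. `tally_by_hour` and `demo_rows` are hypothetical names with made-up values.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(rows):
    """rows: iterable of [created_at_string, n_comments] pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in rows:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        counts[hour] += 1                  # one more post in this hour
        comments[hour] += int(n_comments)  # add its comments to the hourly total
    return dict(counts), dict(comments)

# Made-up rows for illustration only
demo_rows = [['8/4/2016 11:52', 52], ['1/26/2016 19:30', 10], ['6/17/2016 11:01', 1]]
counts, comments = tally_by_hour(demo_rows)
print(counts)
print(comments)
```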

Calculating the hourly average of comments¶

This will help us identify the best time to create a post in order to draw the most comments.

In [8]:

avg_by_hour = []

for hour in counts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour]/counts_by_hour[hour])])


print("There are {} lines in average comments by hours".format(len(avg_by_hour)))
for row in avg_by_hour:
    print(row)

There are 24 lines in average comments by hours
[9, 5.5777777777777775]
[13, 14.741176470588234]
[10, 13.440677966101696]
[14, 13.233644859813085]
[16, 16.796296296296298]
[23, 7.985294117647059]
[12, 9.41095890410959]
[17, 11.46]
[15, 38.5948275862069]
[21, 16.009174311926607]
[20, 21.525]
[2, 23.810344827586206]
[18, 13.20183486238532]
[3, 7.796296296296297]
[5, 10.08695652173913]
[19, 10.8]
[1, 11.383333333333333]
[22, 6.746478873239437]
[8, 10.25]
[4, 7.170212765957447]
[0, 8.127272727272727]
[6, 9.022727272727273]
[7, 7.852941176470588]
[11, 11.051724137931034]

Sorting the data¶

The above data appears accurate. However, it is hard to identify the 5 best hours for creating a post quickly by eye.

To make this easier, we will swap the order of the data (average number of comments first instead of hour) and sort the averages in descending order.

In [9]:

#Swapping the order
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
for row in sorted_swap:
    print(row)

[38.5948275862069, 15]
[23.810344827586206, 2]
[21.525, 20]
[16.796296296296298, 16]
[16.009174311926607, 21]
[14.741176470588234, 13]
[13.440677966101696, 10]
[13.233644859813085, 14]
[13.20183486238532, 18]
[11.46, 17]
[11.383333333333333, 1]
[11.051724137931034, 11]
[10.8, 19]
[10.25, 8]
[10.08695652173913, 5]
[9.41095890410959, 12]
[9.022727272727273, 6]
[8.127272727272727, 0]
[7.985294117647059, 23]
[7.852941176470588, 7]
[7.796296296296297, 3]
[7.170212765957447, 4]
[6.746478873239437, 22]
[5.5777777777777775, 9]
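
The swap is not strictly required: `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ordered by their second element directly. Here is a short sketch using a handful of values rounded from the output above. `avg_by_hour_demo` is an illustrative name, not a variable from the notebook.

```python
# [hour, average] pairs, rounded from the results above
avg_by_hour_demo = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.52]]

# Sort by the average (index 1), highest first, without swapping columns
sorted_by_avg = sorted(avg_by_hour_demo, key=lambda row: row[1], reverse=True)

for hour, avg in sorted_by_avg:
    print(hour, avg)
```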

Identifying the best 5 hours to post an Ask HN post¶

In [10]:

print("Top 5 Hours for Ask HN Posts Comments")

for row in sorted_swap[:5]:
    time_dt = dt.datetime.strptime(str(row[1]), '%H')
    time = time_dt.strftime("%H:%M")
    print("{time}: {comments:.2f} average comments per post".format(time=time,comments=row[0]))

Top 5 Hours for Ask HN Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

What and When - Conclusions¶

After our preliminary analysis we found that:

Ask HN posts attract more comments on average than Show HN posts
Posting in mid-afternoon, around 15:00 US Eastern Time, is associated with the most comments

We can also give a secondary window. The 15:00 hour stands out, with about 2.7 times the overall Ask HN average (38.59 versus 14.04 comments per post), and the 16:00 hour is also above average (16.80). A second window opens in the early evening, around 20:00 - 21:00, with about 1.5 times the overall Ask HN average at 20:00 (21.52). The 02:00 hour also ranks second overall (23.81). Note that readers located outside the US Eastern time zone will need to adjust their posting time to account for the time difference.
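
The time adjustment can be sketched with the standard `datetime` module. The example below assumes fixed UTC offsets, with US Eastern Daylight Time taken as UTC-4 (Eastern Standard Time is UTC-5 in winter), so it ignores daylight-saving changes. `convert_hour` is a hypothetical helper, and the target offsets are only examples.

```python
import datetime as dt

def convert_hour(hour, source_utc_offset, target_utc_offset):
    """Convert an hour of day between two fixed UTC offsets (in hours)."""
    source_tz = dt.timezone(dt.timedelta(hours=source_utc_offset))
    target_tz = dt.timezone(dt.timedelta(hours=target_utc_offset))
    # The date is arbitrary; only the hour of day matters for fixed offsets
    moment = dt.datetime(2016, 8, 4, hour, 0, tzinfo=source_tz)
    return moment.astimezone(target_tz).strftime("%H:%M")

EASTERN_DAYLIGHT = -4  # assumption: US Eastern Daylight Time (UTC-4)

print(convert_hour(15, EASTERN_DAYLIGHT, 0))  # 15:00 Eastern expressed in UTC
print(convert_hour(15, EASTERN_DAYLIGHT, 2))  # 15:00 Eastern expressed in UTC+2
```

For real scheduling across daylight-saving boundaries, named time zones (for example through the standard `zoneinfo` module in Python 3.9+) would be a safer choice than fixed offsets.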