Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
You can find the data set here. A reduced version of this data set also exists (approximately 20,000 rows, produced by removing all submissions that did not receive any comments and then randomly sampling from the remainder); in this project, however, we work with the full data set of almost 300,000 rows.
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, a product, or just generally something interesting.
In this project, we'll aim to find out what kind of posts are more likely to receive attention on Hacker News. To do so, we will answer two questions: Do Ask HN or Show HN posts receive more comments on average? And do posts created at a certain time of day receive more comments on average?
The conclusion from this data analysis is that Ask HN posts on Hacker News received more comments on average than Show HN posts. In addition, Ask HN posts created at 15:00 EST / 22:00 CET received the highest average number of comments.
First, we open and read the hacker_news.csv data set as a list of lists and assign it to the variable hn.
from csv import reader
opened_file = open("hacker_news.csv", encoding='utf8')
read_file = reader(opened_file)
hn = list(read_file)
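As a side note, a more defensive way to read the file is a with block, which closes the file handle automatically even if an error occurs. The sketch below substitutes a small in-memory sample (a hypothetical header plus one row) so it runs without hacker_news.csv on disk:

```python
import csv
import io

# Hypothetical in-memory stand-in for hacker_news.csv (header + one row),
# so this sketch runs even without the file present.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579008,Ask HN: Example question,,1,0,altstar,9/26/2016 3:26\n"
)

# The with block closes the handle automatically when the block exits.
with io.StringIO(sample_csv) as f:
    rows = list(csv.reader(f))

print(rows[0])   # the header row
print(len(rows)) # 2
```

With the real file, `io.StringIO(sample_csv)` would simply be replaced by `open("hacker_news.csv", encoding="utf8")`.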
Then, we print the first few rows of the data set.
print(hn[0])
print('\n')
print(hn[1:4])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]
headers = hn[0]
hn = hn[1:]
Because the first row (the first list in the list of lists) is the header row, we have separated it from the rest of the data.
print(headers)
print('\n')
print(hn[0:4])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
9139
10158
273822
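Lowercasing each title before calling startswith() is what makes the match case-insensitive, which matters because users capitalize these prefixes inconsistently. A small sketch with hypothetical titles:

```python
# Hypothetical titles illustrating why we lowercase before matching.
titles = [
    "Ask HN: How should I learn to code?",
    "ASK HN: Salary negotiation tips?",
    "Show HN: My weekend project",
    "Tell HN: Something else entirely",
]

# Both "Ask HN" variants match, regardless of capitalization.
ask = [t for t in titles if t.lower().startswith("ask hn")]
print(len(ask))  # 2
```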
There are 9,139 Ask HN posts and 10,158 Show HN posts in the data set. Next, we will determine whether Ask HN or Show HN posts receive more comments on average.
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments, 2))
10.39
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments, 2))
4.89
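The loop-and-accumulator pattern above can also be condensed with sum() and a generator expression. A sketch using hypothetical rows that mirror the data set's layout (num_comments at index 4):

```python
# Hypothetical rows shaped like the data set: num_comments is at index 4.
posts = [
    ["id1", "Ask HN: a", "", "5", "10", "user1", "9/26/2016 3:26"],
    ["id2", "Ask HN: b", "", "3", "4",  "user2", "9/26/2016 3:24"],
]

# sum() over a generator expression replaces the explicit accumulator loop.
avg = sum(int(row[4]) for row in posts) / len(posts)
print(avg)  # (10 + 4) / 2 = 7.0
```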
It seems that Ask HN posts receive more than twice as many comments as Show HN posts (10.39 compared to 4.89). Since Ask HN posts are more likely to receive comments, we'll focus the remainder of our analysis on these posts alone.
Next, we'll determine whether Ask HN posts created at a certain time are more likely to attract comments. To perform this analysis, we'll calculate the number of ask posts and the number of comments created in each hour of the day, and then compute the average number of comments per post for each hour.
First, we create a list of lists containing the number of comments and the date and hour at which each ask post was created.
result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

print(result_list[0:2])
[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3]]
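Each created_at value is a plain string like '9/26/2016 2:53', so before we can group posts by hour we need to parse it. A minimal sketch of the parsing step, using one timestamp from the output above:

```python
import datetime as dt

# Parse one created_at string and extract just the hour.
stamp = "9/26/2016 2:53"
parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")
hour = parsed.strftime("%H")
print(hour)  # '02' -- zero-padded, which keeps dictionary keys uniform
```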
Next, we calculate the number of comments on ask posts created during each hour of the day, using dictionary frequency tables and the datetime module.
import datetime as dt

counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total number of comments on ask posts created in each hour
date_format = "%m/%d/%Y %H:%M"

for result in result_list:
    creation_time_str = result[0]
    comment_result = result[1]
    time = dt.datetime.strptime(creation_time_str, date_format).strftime("%H")
    if time not in counts_by_hour:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment_result
    else:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment_result
comments_by_hour
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
counts_by_hour
{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
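As an alternative to the explicit membership check above, collections.defaultdict initializes missing keys to zero automatically. A sketch with a few hypothetical [created_at, num_comments] pairs shaped like result_list:

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs, like those in result_list.
results = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 2:10", 3],
    ["9/25/2016 15:30", 20],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, n in results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1    # missing keys start at 0, so no membership check needed
    comments[hour] += n

print(dict(counts))    # {'02': 2, '15': 1}
print(dict(comments))  # {'02': 10, '15': 20}
```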
avg_by_hour = []

for time in comments_by_hour:
    avg = comments_by_hour[time] / counts_by_hour[time]
    avg_by_hour.append([time, avg])
avg_by_hour
[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
Although we now have the results we need (the average number of comments for ask posts per hour of the post creation), this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append((row[1], row[0]))
swap_avg_by_hour
[(11.137546468401487, '02'), (7.407801418439717, '01'), (8.804177545691905, '22'), (8.687258687258687, '21'), (7.163043478260869, '19'), (9.449744463373083, '17'), (28.676470588235293, '15'), (9.692007797270955, '14'), (16.31756756756757, '13'), (8.96474358974359, '11'), (10.684397163120567, '10'), (6.653153153153153, '09'), (7.013274336283186, '07'), (7.948339483394834, '03'), (6.696793002915452, '23'), (8.749019607843136, '20'), (7.713298791018998, '16'), (9.190661478599221, '08'), (7.5647840531561465, '00'), (7.94299674267101, '18'), (12.380116959064328, '12'), (9.7119341563786, '04'), (6.782051282051282, '06'), (8.794258373205741, '05')]
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap
[(28.676470588235293, '15'), (16.31756756756757, '13'), (12.380116959064328, '12'), (11.137546468401487, '02'), (10.684397163120567, '10'), (9.7119341563786, '04'), (9.692007797270955, '14'), (9.449744463373083, '17'), (9.190661478599221, '08'), (8.96474358974359, '11'), (8.804177545691905, '22'), (8.794258373205741, '05'), (8.749019607843136, '20'), (8.687258687258687, '21'), (7.948339483394834, '03'), (7.94299674267101, '18'), (7.713298791018998, '16'), (7.5647840531561465, '00'), (7.407801418439717, '01'), (7.163043478260869, '19'), (7.013274336283186, '07'), (6.782051282051282, '06'), (6.696793002915452, '23'), (6.653153153153153, '09')]
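The swap step is one way to sort by the average; passing a key function to sorted() achieves the same ordering without building a second list. A sketch with hypothetical [hour, average] pairs shaped like avg_by_hour:

```python
# Hypothetical [hour, average] pairs shaped like avg_by_hour.
avg_by_hour_sample = [["02", 11.1], ["15", 28.7], ["13", 16.3]]

# The key function sorts by the average (index 1) directly,
# so no element-swapping pass is needed.
top = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)
print(top)  # [['15', 28.7], ['13', 16.3], ['02', 11.1]]
```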
print("Top 5 Hours for Ask Posts Comments:")

for row in sorted_swap[0:5]:
    us_hour_dt = dt.datetime.strptime(row[1], '%H')
    us_hour_str = us_hour_dt.strftime('%H:%M')
    my_time_dt = us_hour_dt + dt.timedelta(hours=7)  # shift from EST to CET
    my_time_str = my_time_dt.strftime('%H:%M')
    print('{time} EST / {my_time} CET: {avg:.2f} average comments per post'.format(
        time=us_hour_str, my_time=my_time_str, avg=row[0]))

Top 5 Hours for Ask Posts Comments:
15:00 EST / 22:00 CET: 28.68 average comments per post
13:00 EST / 20:00 CET: 16.32 average comments per post
12:00 EST / 19:00 CET: 12.38 average comments per post
02:00 EST / 09:00 CET: 11.14 average comments per post
10:00 EST / 17:00 CET: 10.68 average comments per post
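The fixed 7-hour shift above corresponds to converting EST (UTC-5) to Central European Summer Time (UTC+2); datetime's timezone objects make those offsets explicit. A sketch of the same conversion for the peak hour, with the zone names written out as assumptions:

```python
import datetime as dt

# Fixed-offset zones: EST is UTC-5; CEST (Central European Summer Time)
# is UTC+2, which is where the 7-hour shift comes from.
est = dt.timezone(dt.timedelta(hours=-5), name="EST")
cest = dt.timezone(dt.timedelta(hours=2), name="CEST")

peak = dt.datetime(2016, 9, 26, 15, 0, tzinfo=est)   # 15:00 EST
print(peak.astimezone(cest).strftime("%H:%M"))       # 22:00
```

Note that these are fixed offsets; a daylight-saving-aware conversion would use named IANA zones instead.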
In this data set, the Ask HN posts created at 15:00 EST / 22:00 CET received the highest average number of comments.
There is also a large gap between posts created at 15:00 EST / 22:00 CET and the hour with the second-highest average (13:00 EST / 20:00 CET): 28.68 versus 16.32 average comments per post, i.e. roughly 1.8 times as many for the top hour.
A fair conclusion from this data analysis is that, historically, among Ask HN posts on Hacker News, those created at 15:00 EST / 22:00 CET received the highest average number of comments.
As next steps, we could extend this analysis further:
Determine if show or ask posts receive more points on average.
Determine if posts created at a certain time are more likely to receive more points.
Compare these results to the average number of comments and points that other posts receive.