
POST TYPES AND TIMES THAT RECEIVE MORE ENGAGEMENT ON HACKER NEWS

Introduction

Hacker News is a social news website run by the startup incubator Y Combinator, with a focus on computer science and entrepreneurship. It is very popular in technology and startup communities. Users can submit any post that "gratifies one's intellectual curiosity" (Ref: Hacker News Guidelines). Posts are voted and commented on, and the top-ranked posts can draw hundreds of thousands of visitors.

The original dataset of Hacker News posts covers a 12-month period up to 26th September 2016. For this project, we use hacker_news.csv, a modified version in which the approximately 300,000 rows of the original have been trimmed down to about 20,000 rows by:

Deleting all the posts without any comments
Sampling randomly from the remaining posts after the deletion
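
The two trimming steps can be sketched as follows. This is only an illustration under assumptions: the helper name trim_posts, the fixed seed, and the tiny made-up rows are hypothetical, and this is not necessarily how hacker_news.csv was actually produced.

```python
import random

def trim_posts(rows, sample_size, seed=0):
    """Drop posts with zero comments, then sample randomly from the rest.

    Rows are laid out like the dataset (num_comments at index 4).
    Hypothetical helper, for illustration only.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return rng.sample(commented, min(sample_size, len(commented)))

# Made-up rows: id, title, url, num_points, num_comments, author, created_at
rows = [
    ['1', 'A', '', '5', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'B', '', '3', '2', 'u2', '1/1/2016 11:00'],
    ['3', 'C', '', '9', '7', 'u3', '1/1/2016 12:00'],
    ['4', 'D', '', '1', '0', 'u4', '1/1/2016 13:00'],
]
trimmed = trim_posts(rows, sample_size=2)
print(len(trimmed))
```

Because posts without comments are removed before sampling, every average computed later describes posts that received at least one comment.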

Here are the explanations for the columns of the hacker_news.csv dataset:

id: The unique identifier for the post
title: The title of the post
url: The URL that the posts link to if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted (time zone - Eastern Time in the US)

Goal of the Project

The goal of this project is to identify which type of post, Ask HN or Show HN, receives more engagement, measured by the number of comments.

We also want to know whether posts created at a certain time of day receive more comments on average.

Summary of Results

Based on our data analysis, we concluded that Ask HN posts receive more comments on average than Show HN posts (about 14 versus 10), and that the best time to submit an Ask HN post is around 15:00 Eastern Time (EST), which we report as 22:00 West African Time (WAT).

Please check out the details below for the full data analysis.

Opening and Preparing the Data

We open and read hacker_news.csv as a list of lists and assign it to the variable hn. For the analysis, we remove the header row (hn[0]) and keep only the rows that contain data (hn[1:]).

In [2]:

from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)

print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]

Above, we read the file in as a list of lists, assigned it to a variable named hn, and displayed the first five rows, including the header row.
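
As a side note, the file can also be read inside a with block, which closes it automatically. The sketch below is a minimal illustration, not the code used in this notebook: the helper read_csv_rows is hypothetical, the utf-8 encoding is an assumption about the file, and a small temporary file stands in for hacker_news.csv so the example runs on its own.

```python
import csv
import os
import tempfile

def read_csv_rows(path, encoding='utf-8'):
    """Return every row of a CSV file as a list of lists (hypothetical helper)."""
    with open(path, newline='', encoding=encoding) as f:
        return list(csv.reader(f))

# Write a small stand-in file (header plus the first data row shown above).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    writer = csv.writer(tmp)
    writer.writerow(['id', 'title', 'url', 'num_points',
                     'num_comments', 'author', 'created_at'])
    writer.writerow(['12224879', 'Interactive Dynamic Video',
                     'http://www.interactivedynamicvideo.com/',
                     '386', '52', 'ne0phyte', '8/4/2016 11:52'])
    tmp_path = tmp.name

rows = read_csv_rows(tmp_path)
os.remove(tmp_path)  # clean up the stand-in file

sample_headers, sample_rows = rows[0], rows[1:]
print(sample_headers)
print(len(sample_rows))
```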

Now let's remove the header row.

In [3]:

headers = hn[0]
hn = hn[1:]

print(headers)
print('\n') #space out the header row from the row body.
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

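Now let's separate posts whose titles begin with Ask HN or Show HN into two lists using the startswith() method, lowercasing each title first so the match is case-insensitive. All remaining posts go into a third list.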

In [4]:

ask_posts = []    #for 'Ask HN'
show_posts = []   #for 'Show HN'
other_posts = []  #for neither 'Ask HN' nor 'Show HN'

for row in hn:
    title = row[1].lower()  # lowercase once so the match is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of Ask posts:', len(ask_posts))
print('Number of Show posts:', len(show_posts))
print('Number of Other posts:', len(other_posts))

Number of Ask posts: 1744
Number of Show posts: 1162
Number of Other posts: 17194
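
To make the matching rule explicit, here is a small sketch of the same prefix test as a standalone function. The name classify_title and all sample titles except 'Interactive Dynamic Video' are made up for illustration.

```python
def classify_title(title):
    """Label a title 'ask', 'show', or 'other' by its case-insensitive prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: How do you learn?',   # made-up title
    'SHOW HN: My project',         # made-up title, upper-case prefix
    'Interactive Dynamic Video',   # from the dataset sample above
    'Why I ask HN questions',      # made-up: 'ask hn' is not at the start
]
labels = [classify_title(t) for t in titles]
print(labels)  # ['ask', 'show', 'other', 'other']
```

Only the start of the title is checked, so a title that merely mentions "ask hn" somewhere in the middle is counted as an other post.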

Data Analysis:

Step 1: Comments - Ask HN vs Show HN posts

In [5]:

total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of ask comments:' ,avg_ask_comments)

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of show comments:' ,avg_show_comments)

Average number of ask comments: 14.038417431192661
Average number of show comments: 10.31669535283993

From the results above, we see that Ask HN posts receive about 14 comments on average and Show HN posts about 10. We conclude that, on average, Ask HN posts receive more comments than Show HN posts.
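
Since the same averaging loop runs twice, it could be factored into one function. The sketch below is an assumption-labelled alternative, not the notebook's code: average_comments is a hypothetical helper and the rows are made up.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments per post; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the dataset's column order (num_comments at index 4).
sample_ask = [
    ['1', 'Ask HN: A', '', '2', '6', 'u1', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '1', '10', 'u2', '1/26/2016 19:30'],
    ['3', 'Ask HN: C', '', '4', '2', 'u3', '6/17/2016 0:01'],
]
sample_avg = average_comments(sample_ask)
print(sample_avg)  # 6.0
```

With the real lists, average_comments(ask_posts) and average_comments(show_posts) would give the two averages printed above.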

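Because Ask HN posts draw more comments, we focus on them for the rest of the analysis. Now let's calculate the number of Ask HN posts created in each hour of the day, along with the total number of comments those posts received.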

In [6]:

import datetime as dt
result_list = []
counts_by_hour = {}   # The number of ask_posts created every hour
comments_by_hour = {}  # The number of comments obtained by the ask_posts 


for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at ,num_comments])

    
for result in result_list:
    comments_result= result[1]
    created_at_result = result[0]
    created_at_result_dt = dt.datetime.strptime(created_at_result, '%m/%d/%Y %H:%M')
    creation_hour = created_at_result_dt.strftime('%H')
    
    if creation_hour in counts_by_hour:
        counts_by_hour[creation_hour] += 1
        comments_by_hour[creation_hour] += comments_result
    else:
        counts_by_hour[creation_hour] = 1
        comments_by_hour[creation_hour] = comments_result
        
print(counts_by_hour)
print('\n')
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
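
The same tallies can be built in a single pass with collections.defaultdict, which removes the need for the if/else branch. This is a sketch under assumptions: tally_by_hour is a hypothetical helper and the sample rows are made up, though the timestamp format matches the dataset.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts, date_format='%m/%d/%Y %H:%M'):
    """Return (posts per hour, comments per hour), keyed by zero-padded hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in posts:
        hour = dt.datetime.strptime(row[6], date_format).strftime('%H')
        counts[hour] += 1
        comments[hour] += int(row[4])
    return dict(counts), dict(comments)

# Made-up rows: two posts in hour 11 and one just after midnight.
sample_posts = [
    ['1', 'Ask HN: A', '', '1', '4', 'u1', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '1', '6', 'u2', '1/26/2016 11:05'],
    ['3', 'Ask HN: C', '', '1', '3', 'u3', '6/17/2016 0:01'],
]
counts, comments = tally_by_hour(sample_posts)
print(counts)    # {'11': 2, '00': 1}
print(comments)  # {'11': 10, '00': 3}
```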

Next, we use the counts_by_hour and comments_by_hour dictionaries to compute the average number of comments per Ask HN post for each hour of the day.

In [7]:

avg_comments_by_hour = []

for hour in comments_by_hour:
    avg_comments_per_post = round(comments_by_hour[hour] / counts_by_hour[hour], 1)  # rounded to one decimal place
    avg_comments_by_hour.append([hour, avg_comments_per_post])
    
print(avg_comments_by_hour)

[['09', 5.6], ['13', 14.7], ['10', 13.4], ['14', 13.2], ['16', 16.8], ['23', 8.0], ['12', 9.4], ['17', 11.5], ['15', 38.6], ['21', 16.0], ['20', 21.5], ['02', 23.8], ['18', 13.2], ['03', 7.8], ['05', 10.1], ['19', 10.8], ['01', 11.4], ['22', 6.7], ['08', 10.2], ['04', 7.2], ['00', 8.1], ['06', 9.0], ['07', 7.9], ['11', 11.1]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [8]:

swap_avg_by_hour = []
for row in avg_comments_by_hour:
    swap_avg_by_hour.append((row[1], row[0])) # row[1]= average comments per post  row[0]= hour
    
print(swap_avg_by_hour)
    

[(5.6, '09'), (14.7, '13'), (13.4, '10'), (13.2, '14'), (16.8, '16'), (8.0, '23'), (9.4, '12'), (11.5, '17'), (38.6, '15'), (16.0, '21'), (21.5, '20'), (23.8, '02'), (13.2, '18'), (7.8, '03'), (10.1, '05'), (10.8, '19'), (11.4, '01'), (6.7, '22'), (10.2, '08'), (7.2, '04'), (8.1, '00'), (9.0, '06'), (7.9, '07'), (11.1, '11')]
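
Swapping the columns is one way to sort by the average. An alternative sketch, shown here only for comparison, passes a key function to sorted() so the [hour, average] pairs can stay in their original order. The five pairs below are taken from the averages printed above; the names avg_sample and top_three are illustrative.

```python
# A few [hour, average] pairs copied from avg_comments_by_hour above.
avg_sample = [['09', 5.6], ['16', 16.8], ['15', 38.6], ['20', 21.5], ['02', 23.8]]

# Sort by the second element (the average), highest first, and keep the top three.
top_three = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_three:
    print('{}:00: {:.1f} average comments per post'.format(hour, avg))
```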

In [9]:

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("The Top Five Hours for Ask Posts Comments:")
for row in sorted_swap[:5]:
    # Dataset timestamps are in US Eastern Time: EST is UTC-05 (EDT is UTC-04)
    est_hour_dt = dt.datetime.strptime(row[1], '%H')
    est_hour_str = est_hour_dt.strftime('%H:%M')
    
    # Our timezone (WAT) is UTC+01
    # Converting the hour with the 7-hour offset used in this analysis;
    # note that WAT is 6 hours ahead of EST (5 hours ahead of EDT)
    our_hour_dt = est_hour_dt + dt.timedelta(hours=7)
    our_hour_str = our_hour_dt.strftime('%H:%M')
    
    print('   ', '{est_time} EST or {our_time} WAT:    {avg:.1f} average comments per post'.format(est_time=est_hour_str, our_time=our_hour_str, avg=row[0]))    # Use one decimal place to format avg

The Top Five Hours for Ask Posts Comments:
    15:00 EST or 22:00 WAT:    38.6 average comments per post
    02:00 EST or 09:00 WAT:    23.8 average comments per post
    20:00 EST or 03:00 WAT:    21.5 average comments per post
    16:00 EST or 23:00 WAT:    16.8 average comments per post
    21:00 EST or 04:00 WAT:    16.0 average comments per post
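
As a cross-check on the time conversion, the sketch below uses fixed-offset timezone objects from the standard library instead of adding hours by hand. EST is UTC-05:00 and WAT is UTC+01:00, so this gives a 6-hour difference, one hour less than the 7-hour offset applied in the cell above; while daylight saving time is in effect in the US (EDT, UTC-04:00) the gap is 5 hours. The names EST, WAT, and est_hour_to_wat are illustrative, and the date used is arbitrary.

```python
import datetime as dt

EST = dt.timezone(dt.timedelta(hours=-5), 'EST')  # Eastern Standard Time, UTC-05:00
WAT = dt.timezone(dt.timedelta(hours=1), 'WAT')   # West Africa Time, UTC+01:00

def est_hour_to_wat(hour_str):
    """Convert an hour string such as '15' from EST to an 'HH:MM' string in WAT."""
    # The date is arbitrary; only the hour matters with fixed offsets.
    est_time = dt.datetime(2016, 1, 15, int(hour_str), 0, tzinfo=EST)
    return est_time.astimezone(WAT).strftime('%H:%M')

for hour in ['15', '02', '20']:
    print(hour + ':00 EST ->', est_hour_to_wat(hour), 'WAT')
```

Under these fixed offsets, 15:00 EST corresponds to 21:00 WAT, 02:00 EST to 08:00 WAT, and 20:00 EST to 02:00 WAT on the following day.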

Conclusion:

Our results show that Ask HN posts created between 15:00 and 16:00 EST receive the most comments on average (38.6 per post), well ahead of the next-best hours, 02:00 (23.8) and 20:00 (21.5). One possible explanation is that 15:00 EST is a time when users in both North America and Europe are active; this rests on our assumption that most Hacker News users are on these two continents. Note that the dataset excludes posts that received no comments, so these averages describe posts that got at least one comment. Using the 7-hour offset applied above, the best time for us to submit a post in our time zone is 22:00 WAT, followed by 09:00 and 03:00 WAT.