
Do Posts Created At A Certain Time Receive More Comments On Average?

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a few examples:

Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm

We'll compare these two types of posts, using only Python and its standard library, to determine the following:

Do Ask HN or Show HN receive more comments on average?
Do posts created at a certain time receive more comments on average?

The data set contains Hacker News posts submitted from September 26, 2015 to September 26, 2016. You can find the data set here, and below are descriptions of the columns:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted (the time zone is Eastern Time in the US)

We'll read in the Hacker News CSV file. Because we are analyzing posts with comments, we'll remove posts that have none and then look at the first five rows.

In [17]:

import datetime as dt

from csv import reader


def csv_to_list(file, head=True):
    '''
    Transforms a csv file to a list of lists, and returns both the
    header and data - unless head parameter is set to False. 
    
    Arg:
        file (str):  name or path for csv file
        
        head (bool): True is default; False for no header
        
    Returns:
        tuple or list: Returns tuple with header as the first element 
        and data as the second.  If head is False, returns data as a list.
    '''
    
    # newline='' lets the csv module handle line endings inside quoted fields
    with open(file, newline='') as openfile:
        readfile = reader(openfile)
        hn = list(readfile)
        
    if head:
        return hn[0], hn[1:]
    else:
        return hn
    
    
def remove_nocomments(dataset):
    '''
    Remove posts with no comments.
    
    Arg:
        dataset (list): dataset as a list of lists
        
    Returns:
        list: dataset as a list of lists
    '''
    
    clean_dataset = []

    for row in dataset:
        num_comments = row[4]
        
        if num_comments != '0':
            clean_dataset.append(row)
            
    return clean_dataset


def create_ask_show(dataset):
    '''
    Create three different datasets: one for posts starting 
    with Ask HN, and one for posts starting with Show HN, and
    one for other posts.
    
    Arg:
        dataset (list): dataset as a list of lists
        
    Returns:
        tuple: datasets as a list of lists with posts starting 
        with Ask HN, posts starting with Show HN, and other posts        
    '''
    
    ask_posts = []
    show_posts = []
    other_posts = []

    for row in dataset:
        title = row[1].lower()
        
        if title.startswith('ask hn'):
            ask_posts.append(row)
        elif title.startswith('show hn'):
            show_posts.append(row)
        else:
            other_posts.append(row)
            
    return ask_posts, show_posts, other_posts


def avg_num_comments(dataset):
    '''
    Calculate the average number of comments for posts in the dataset.
    
    Arg:
        dataset (list): dataset as a list of lists
        
    Returns:
        float: average number of comments
    '''
    
    total_comments = 0

    for row in dataset:
        num_comments = int(row[4])
        total_comments += num_comments
    
    avg_comments = total_comments/len(dataset)
    
    return avg_comments


def counts_comments_hr(dataset):
    '''
    Create frequency tables that count both the number of 
    posts per hour and the number of comments per hour.
    
    Arg:
        dataset (list): dataset as a list of lists
        
    Returns:
        tuple: dict key=hour: value=number of posts, dict key=hour: value=number of comments
    '''

    # will contain the `created_at` datetime object and number of comments for each post in `dataset`
    result_list = []

    for row in dataset:
        created_at = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
        num_comments = int(row[4])

        result_list.append([created_at, num_comments])


    # will contain the number of posts per hour
    counts_by_hour = {}

    # will contain the total number of comments per hour
    comments_by_hour = {}

    for row in result_list:
        hour = row[0].hour
        num_comments = row[1]
    
        if hour in counts_by_hour:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += num_comments
        else:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = num_comments
            
    return counts_by_hour, comments_by_hour


def calc_avg_by_hr(comments, counts):
    '''
    Calculate the average number of comments for posts by hour.
    
    Arg:
        comments (dict): comments per hour - key=hour: value=comments per hour
        
        counts (dict):  number of posts per hour - key=hour: value=posts per hour
        
    Returns:
        list: list of lists with each row having an hour as first element and 
        average number of comments per post as second element.
    '''

    avg_by_hr = []

    for hour in comments:
        hourly = [hour, comments[hour]/counts[hour]]
        avg_by_hr.append(hourly)
    
    return avg_by_hr

In [18]:

# read in file
header, hn = csv_to_list('HN_posts_year_to_Sep_26_2016.csv')

# remove posts with no comments
hn = remove_nocomments(hn)  

# display header and the first five rows
print(header)
for row in hn[:5]:  
    print('\n', row, '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

 ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13'] 


 ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'] 


 ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'] 


 ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'] 


 ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']
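The helper functions above refer to columns by position (`row[4]` for `num_comments`, `row[6]` for `created_at`). As a minimal sketch, and not part of the original analysis, a name-to-index lookup built from the header makes those positions explicit. The name `col` is introduced here purely for illustration.

```python
# Header and one row copied from the output above.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12578908', 'Ask HN: What TLD do you use for local development?', '',
       '4', '7', 'Sevrene', '9/26/2016 2:53']

# Hypothetical helper: map each column name to its position in a row.
col = {name: index for index, name in enumerate(header)}

# The positions match the hard-coded indexes used in the functions above.
num_comments = int(row[col['num_comments']])
created_at = row[col['created_at']]

print(col['num_comments'], col['created_at'])
print(num_comments, created_at)
```

Looking columns up by name would keep the functions working if the column order of the file ever changed.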

Extracting Ask HN and Show HN Posts

Because we're specifically interested in posts whose titles begin with either Ask HN or Show HN and will be comparing the two, we'll place those posts in two different datasets. In addition, we will keep other posts to check that we captured all posts.

In [19]:

ask_posts, show_posts, other_posts = create_ask_show(hn)

print('\nThe number of posts in each data set\n'.title())
print(f'ask_posts:    {len(ask_posts)}\n')
print(f'show_posts:   {len(show_posts)}\n')
print(f'other_posts:  {len(other_posts)}')
print('---------------------\n')
print(f'total posts:  {len(hn)}')

The Number Of Posts In Each Data Set

ask_posts:    6911

show_posts:   5059

other_posts:  68431
---------------------

total posts:  80401
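The check that every post landed in exactly one of the three lists can be written out explicitly. The sketch below simply re-adds the counts printed above; the names `counts`, `total_posts`, and `parts_total` are illustrative and do not appear in the notebook.

```python
# Counts copied from the output above.
counts = {'ask_posts': 6911, 'show_posts': 5059, 'other_posts': 68431}
total_posts = 80401

# If the split is a true partition, the three parts add up to the total.
parts_total = sum(counts.values())

print(parts_total == total_posts)
```

In the notebook itself, the same comparison could be made with `len(ask_posts) + len(show_posts) + len(other_posts) == len(hn)`.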

Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [20]:

# call `avg_num_comments` to calculate the average number of comments per post in `ask_posts`
avg_ask_comments = avg_num_comments(ask_posts)

# call `avg_num_comments` to calculate the average number of comments per post in `show_posts`
avg_show_comments = avg_num_comments(show_posts)

print(f'\nAverage number of comments per post in `ask_posts`:   {avg_ask_comments}\n')
print(f'Average number of comments per post in `show_posts`:   {avg_show_comments}\n')

Average number of comments per post in `ask_posts`:   13.744175951381855

Average number of comments per post in `show_posts`:   9.810832180272781

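On average, ask posts receive more comments than show posts: about 13.7 comments per post versus about 9.8. Note that posts with no comments were removed earlier, so these averages describe posts that received at least one comment. Since ask posts attract more comments on average, we'll focus our remaining analysis on these posts.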
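To put the difference between the two averages in relative terms, the following sketch divides one by the other. It also shows a compact variant of the averaging step that returns 0.0 for an empty list instead of raising an error. The function `mean_comments` and the two sample rows are hypothetical and used only for illustration; the two averages are copied from the output above.

```python
def mean_comments(rows, comments_index=4):
    """Average of the num_comments column; 0.0 when there are no rows."""
    if not rows:
        return 0.0
    return sum(int(row[comments_index]) for row in rows) / len(rows)


# Hypothetical sample rows in the same column layout as the data set.
sample = [
    ['1', 'Ask HN: first example', '', '3', '7', 'user_a', '9/26/2016 2:53'],
    ['2', 'Ask HN: second example', '', '2', '5', 'user_b', '9/26/2016 3:13'],
]
sample_mean = mean_comments(sample)
empty_mean = mean_comments([])

# Averages copied from the notebook output above.
avg_ask_comments = 13.744175951381855
avg_show_comments = 9.810832180272781
ratio = avg_ask_comments / avg_show_comments

print(sample_mean, empty_mean)
print(f'{ratio:.2f}')
```

A ratio of roughly 1.4 means ask posts in this data set average about 40 percent more comments than show posts.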

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

In the code block below, we'll tackle the first step: calculating the number of ask posts and comments by hour created.
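Grouping by hour depends on turning each `created_at` string into a `datetime` object first. As a small sketch of that single step, here is the parsing applied to one timestamp copied from the rows displayed earlier, using the same format string as `counts_comments_hr`.

```python
import datetime as dt

# Timestamp copied from one of the rows printed earlier.
created_at_str = '9/26/2016 2:53'

# %m/%d/%Y %H:%M matches month/day/year followed by a 24-hour time.
created_at = dt.datetime.strptime(created_at_str, '%m/%d/%Y %H:%M')

# The hour attribute is the key used in the frequency tables.
hour = created_at.hour

print(created_at)
print(hour)
```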

Finding the Number of Ask Posts and Comments by Hour Created

In [21]:

counts_by_hour, comments_by_hour = counts_comments_hr(ask_posts)
counts_by_hour

Out[21]:

{2: 227,
 1: 223,
 22: 287,
 21: 407,
 19: 420,
 17: 404,
 15: 467,
 14: 378,
 13: 326,
 11: 251,
 10: 219,
 9: 176,
 7: 157,
 3: 212,
 16: 415,
 8: 190,
 0: 231,
 23: 276,
 20: 392,
 18: 452,
 12: 274,
 4: 186,
 6: 176,
 5: 165}

The dictionary above maps each hour of the day (the key) to the number of ask posts created during that hour (the value). Below is the total number of comments on those posts, again by hour created.

In [22]:

comments_by_hour

Out[22]:

Calculating the Average Number of Comments for Ask HN Posts by Hour

Now, we'll tackle the second step by using the two dictionaries above to create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

In [23]:

avg_by_hr = calc_avg_by_hr(comments_by_hour, counts_by_hour)
avg_by_hr

Out[23]:

[[2, 13.198237885462555],
 [1, 9.367713004484305],
 [22, 11.749128919860627],
 [21, 11.056511056511056],
 [19, 9.414285714285715],
 [17, 13.73019801980198],
 [15, 39.66809421841542],
 [14, 13.153439153439153],
 [13, 22.2239263803681],
 [11, 11.143426294820717],
 [10, 13.757990867579908],
 [9, 8.392045454545455],
 [7, 10.095541401273886],
 [3, 10.160377358490566],
 [16, 10.76144578313253],
 [8, 12.43157894736842],
 [0, 9.857142857142858],
 [23, 8.322463768115941],
 [20, 11.38265306122449],
 [18, 10.789823008849558],
 [12, 15.452554744525548],
 [4, 12.688172043010752],
 [6, 9.017045454545455],
 [5, 11.139393939393939]]

Sorting and Printing Values from a List of Lists

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists in descending order of average comments and printing the top five rows in a format that's easier to read.

In [24]:

# sort avg_by_hr by average comments and in descending order
sorted_avg_by_hr = sorted(avg_by_hr, key=lambda row: row[1], reverse=True)

print('\nTop 5 Hours for Ask Posts Comments\n')

for row in sorted_avg_by_hr[:5]:
    hr = dt.time(hour=row[0])
    string = f'{hr:%H:%M}: {row[1]:.2f} average comments per post\n'
    print(string)

Top 5 Hours for Ask Posts Comments

15:00: 39.67 average comments per post

13:00: 22.22 average comments per post

12:00: 15.45 average comments per post

10:00: 13.76 average comments per post

17:00: 13.73 average comments per post

Which hours should you create a post during to have a higher chance of receiving comments?

When working with time series data, it helps to group hours into periods such as morning, afternoon, evening, and night. According to the Britannica Dictionary, many people would agree on the following boundaries: morning 05:00 - 11:59, afternoon 12:00 - 16:59, evening 17:00 - 20:59, and night 21:00 - 04:59. Using this framework, the afternoon is the best time to create a post to have a higher chance of receiving comments.
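The grouping described above can be sketched as a small function that maps an hour to its period. The function name `period_of_day` is hypothetical; the boundaries are the ones quoted in the paragraph, and the hours are the top five from the table.

```python
def period_of_day(hour):
    """Map an hour (0-23) to morning, afternoon, evening, or night."""
    if not 0 <= hour <= 23:
        raise ValueError(f'hour must be between 0 and 23, got {hour}')
    if 5 <= hour <= 11:
        return 'morning'
    if 12 <= hour <= 16:
        return 'afternoon'
    if 17 <= hour <= 20:
        return 'evening'
    # Remaining hours are 21-23 and 0-4.
    return 'night'


# Top five hours (Eastern Time) from the table above, best first.
top_five_hours = [15, 13, 12, 10, 17]
periods = [period_of_day(hour) for hour in top_five_hours]

for hour, period in zip(top_five_hours, periods):
    print(f'{hour:02d}:00 -> {period}')
```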

In the table above, Top 5 Hours for Ask Posts Comments, the top three hours are all in the afternoon. The times are in Eastern Time, while I am located in the Central Time Zone, which is one hour behind. Even after adjusting for the time zone difference, three of the top five hours still fall in the afternoon.
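The time zone adjustment can be checked with a short sketch. It assumes Central Time is always exactly one hour behind Eastern Time and reuses the 12:00 - 16:59 definition of afternoon given earlier. The names `to_central` and `is_afternoon` are illustrative and not part of the notebook.

```python
# Assumption: Central Time is one hour behind Eastern Time.
EASTERN_TO_CENTRAL_OFFSET = -1


def to_central(hour_eastern):
    """Shift an Eastern hour (0-23) to Central, wrapping around midnight."""
    return (hour_eastern + EASTERN_TO_CENTRAL_OFFSET) % 24


def is_afternoon(hour):
    """Afternoon as defined in the text: 12:00 through 16:59."""
    return 12 <= hour <= 16


# Top five hours (Eastern Time) from the table above.
top_five_eastern = [15, 13, 12, 10, 17]
top_five_central = [to_central(hour) for hour in top_five_eastern]
afternoon_central = [hour for hour in top_five_central if is_afternoon(hour)]

print(top_five_central)
print(len(afternoon_central))
```

After the shift, 12:00 Eastern becomes 11:00 Central and leaves the afternoon, while 17:00 Eastern becomes 16:00 Central and enters it, so the count stays at three.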

To narrow the hours down even more, 15:00 and 13:00 are the two best times. Posts created at 15:00 have the highest average, about 17 more comments per post than posts created at 13:00, which has the second-highest average. The 13:00 average is in turn almost seven comments higher than the third-place hour, 12:00.

Lastly, to be even more specific, 15:00 is the best single hour. Still referencing the table above, its average number of comments per post is roughly 78 percent higher than that of the next best option.
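The gaps quoted in the last two paragraphs can be reproduced directly from the unrounded averages in `avg_by_hr`. The sketch below is only the arithmetic; the variable names are illustrative.

```python
# Unrounded averages copied from the avg_by_hr output above.
avg_15 = 39.66809421841542   # 15:00, highest
avg_13 = 22.2239263803681    # 13:00, second
avg_12 = 15.452554744525548  # 12:00, third

# Absolute gaps in comments per post.
gap_first_second = avg_15 - avg_13
gap_second_third = avg_13 - avg_12

# Relative gap between the best hour and the runner-up, in percent.
pct_higher = (avg_15 / avg_13 - 1) * 100

print(f'{gap_first_second:.2f}')
print(f'{gap_second_third:.2f}')
print(f'{pct_higher:.1f}')
```

The relative gap comes out a little under 79 percent when computed from the unrounded averages.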

Conclusion

After comparing posts that begin with Ask HN and Show HN using Hacker News data from September 26, 2015 to September 26, 2016, we can answer the two questions we posed. First, do posts starting with Ask HN or Show HN receive more comments on average? After separating the posts into two groups and dividing the number of comments by the number of posts, we find that posts beginning with Ask HN receive more comments per post on average. Second, do posts created at a certain time receive more comments on average? Looking only at Ask HN posts, because those receive more comments per post, and calculating the average number of comments per post by hour, we find that the afternoon, and 15:00 Eastern Time specifically, is the best time to create a post on Hacker News to receive more comments.