Finding Insights about Popular Hacker News Posts

Hacker News is a popular site within the technology community where users can share, comment, and vote on posts related to the industry. The goal of this project is to compare the different types of posts and determine which type receives the most comments and points.

Ask HN posts allow users to ask the community a specific question. Show HN posts let users share something with the community. All other types of posts are captured under Other.

In this analysis, we'll ask the following questions:

Do Ask HN, Show HN, or Other posts receive more comments on average?
Which of these types of posts receive the most points?
Do posts created at a certain time receive more comments or points?

The dataset we're using can be found here. According to the documentation for the dataset, the timezone used is U.S. Eastern Standard Time.

Exploring the Data

We'll begin by reading in the data set and exploring a sample to gain insights.

In [1]:

# Imports
from csv import reader
import datetime as dt

# Read in the data (the with block closes the file once it has been read)
with open('hacker_news_posts.csv', encoding='utf-8') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Remove the headers from the dataset
headers = hn[0]
hn = hn[1:]

In [2]:

# Define a function to extract a slice of a given dataset for analysis
def explore(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Explore a sample of the data
print(headers,'\n')
explore(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['11966167', 'UK votes to leave EU', 'http://www.bbc.co.uk/news/uk-politics-36615028', '3125', '2531', 'dmmalam', '6/24/2016 3:48']


['12445994', 'iPhone 7', 'http://www.apple.com/iPhone7', '756', '1733', 'benigeri', '9/7/2016 18:52']


['11807450', 'Moving Forward on Basic Income', 'http://blog.ycombinator.com/moving-forward-on-basic-income', '1330', '1448', 'dwaxe', '5/31/2016 16:20']


['10982340', 'Request For Research: Basic Income', 'https://blog.ycombinator.com/basic-income', '1876', '1120', 'mattkrisiloff', '1/27/2016 19:23']


['11814828', 'Ask HN: Who is hiring? (June 2016)', '', '644', '1007', 'whoishiring', '6/1/2016 15:01']


Number of rows: 293119
Number of columns: 7

We can see that we have the following fields for our analysis:

Column	Description
id	The unique ID of the post
title	The title of the post
url	The URL at which the post can be viewed
num_points	The number of points the post has
num_comments	The number of comments the post has
author	The user that submitted the post
created_at	The timestamp at which the post was submitted

The num_points, num_comments, and created_at fields should prove useful for analysis.

Cleansing the Data

Now that we've identified the useful information in our dataset, it's important to assess the integrity of the data by checking for any obvious errors. We can do this methodically and remove rows that contain erroneous data.

Checking for Shifted Columns

Sometimes rows in the dataset become offset from the column headers. We can check for this by comparing the length of the header row to each row in the dataset.

In [3]:

# Define a function to search for column shifts in the given dataset
def check_column_shifts(dataset_header, dataset):

    # Loop over the dataset and compare the length of each row to the length of the header
    errors_count = 0
    for row_index, row in enumerate(dataset):
        if len(dataset_header) != len(row):
            errors_count += 1
            print(dataset_header, '\n')
            print('Row Index: ', row_index, '\n') # Print the row number where the error was found
            print(row, '\n')

    print('Column Shift Errors: ', errors_count)
    
# Search for column shifts in the dataset
check_column_shifts(headers, hn)

Column Shift Errors:  0
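
Since the real dataset contains no shifted rows, the check never reports anything here. The sketch below runs the same length comparison on a small made-up dataset with one short row; `sample_header`, `sample_rows`, and `shifted` are illustrative names and are not part of the notebook.

```python
# Made-up header and rows; the second row is missing its last field
sample_header = ['id', 'title', 'num_comments']
sample_rows = [
    ['1', 'First post', '4'],
    ['2', 'Second post'],
    ['3', 'Third post', '0'],
]

# Collect the index of every row whose length differs from the header
shifted = [
    index for index, row in enumerate(sample_rows)
    if len(row) != len(sample_header)
]
print('Shifted row indexes:', shifted)
```

Only the row at index 1 is reported, because it has two fields while the header has three.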

Checking for Missing (or Null) Data

We can also check for missing values from our datasets and remove any affected rows. We can create a function similar to the one used above to iterate through the dataset and identify rows that contain missing data.

In [4]:

# Define a function to search for null data in the given dataset
def check_null_data(dataset_header, dataset, index):
    
    null_count = 0

    # Loop over each row in the dataset to identify any missing values at the given index
    for row_index, row in enumerate(dataset):
        if row[index] == '':
            null_count += 1
            print(dataset_header, '\n')
            print('Row Index: ', row_index, '\n') # Print the row number where the error was found
            print(row, '\n')

    # Print the number of missing values identified at the given index
    print('Missing "{}" Values Identified: {}'.format(dataset_header[index], null_count))
    
# Search for missing data in the dataset at the given index
check_null_data(headers, hn, 3)
check_null_data(headers, hn, 4)
check_null_data(headers, hn, 6)

Missing "num_points" Values Identified: 0
Missing "num_comments" Values Identified: 0
Missing "created_at" Values Identified: 0

Since we didn't identify any obvious errors within the dataset, we can proceed with further narrowing the dataset.

Limiting the Dataset to Posts that Have Comments

For the purposes of our analysis, we only need posts that contain comments. Let's create a list of the rows in our dataset that represent posts with comments.

In [5]:

# Loop over the dataset and append each row representing a post with comments to final_hn
final_hn = []
for row in hn:
    num_comments = int(row[4])
    if num_comments != 0:
        final_hn.append(row)

# Determine the number of rows and columns in final_hn
explore(final_hn, 0, 0, True)

Number of rows: 80401
Number of columns: 7

We're left with a list of lists containing information about roughly 80,000 posts. This represents a 73% reduction from our original dataset.
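
The size of the reduction can be checked with a quick calculation. This sketch hard-codes the two row counts printed earlier; the variable names are illustrative and are not part of the notebook.

```python
# Row counts copied from the outputs above (hard-coded for illustration)
original_rows = 293119
rows_with_comments = 80401

# Share of rows removed by the filter, as a percentage
reduction = (original_rows - rows_with_comments) / original_rows * 100
print('Reduction: {:.1f}%'.format(reduction))
```

The result rounds to the 73% quoted above.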

Analyzing the Data

Now that we have our finalized dataset, let's start by separating the data for the different types of posts into lists. This will aid us in our analysis for each type of post.

In [6]:

ask_posts = []
show_posts = []
other_posts = []

# Loop over each row in the dataset and create list of the rows for each type of post
for row in final_hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
# Display the number of each type of post and a sample of the first few posts
print('Number of Ask HN Posts:', len(ask_posts))
for row in ask_posts[:5]:
    print(row)
print('\n')

print('Number of Show HN Posts:', len(show_posts))
for row in show_posts[:5]:
    print(row)
print('\n')

print('Number of Other Posts:', len(other_posts))
for row in other_posts[:5]:
    print(row)

Number of Ask HN Posts: 6911
['11814828', 'Ask HN: Who is hiring? (June 2016)', '', '644', '1007', 'whoishiring', '6/1/2016 15:01']
['12202865', 'Ask HN: Who is hiring? (August 2016)', '', '534', '947', 'whoishiring', '8/1/2016 15:01']
['11611867', 'Ask HN: Who is hiring? (May 2016)', '', '553', '937', 'whoishiring', '5/2/2016 15:01']
['12405698', 'Ask HN: Who is hiring? (September 2016)', '', '521', '910', 'whoishiring', '9/1/2016 15:00']
['12016568', 'Ask HN: Who is hiring? (July 2016)', '', '566', '898', 'whoishiring', '7/1/2016 15:01']


Number of Show HN Posts: 5059
['11667494', 'Show HN: BitKeeper  Enterprise-ready version control, now open-source', 'https://www.bitkeeper.org/', '384', '306', 'wscott', '5/10/2016 14:39']
['10729068', 'Show HN: FuckFuckAdblock', 'https://github.com/Mechazawa/FuckFuckAdblock', '353', '298', 'mechazawa', '12/14/2015 2:24']
['12504012', 'Show HN: I invented a caffeinated toothpaste', 'https://www.powertoothpaste.com/', '169', '280', 'nappy', '9/15/2016 7:07']
['10320509', 'Show HN: Cancel your Comcast in 5 minutes', 'http://www.airpaperinc.com/', '386', '257', 'estsauver', '10/2/2015 18:56']
['10477721', 'Show HN: Twitch Installs Arch Linux  A cooperative text-based horror game', 'https://twitchinstalls.com', '802', '250', 'jbott', '10/30/2015 13:45']


Number of Other Posts: 68431
['11966167', 'UK votes to leave EU', 'http://www.bbc.co.uk/news/uk-politics-36615028', '3125', '2531', 'dmmalam', '6/24/2016 3:48']
['12445994', 'iPhone 7', 'http://www.apple.com/iPhone7', '756', '1733', 'benigeri', '9/7/2016 18:52']
['11807450', 'Moving Forward on Basic Income', 'http://blog.ycombinator.com/moving-forward-on-basic-income', '1330', '1448', 'dwaxe', '5/31/2016 16:20']
['10982340', 'Request For Research: Basic Income', 'https://blog.ycombinator.com/basic-income', '1876', '1120', 'mattkrisiloff', '1/27/2016 19:23']
['11049067', 'GitHub is undergoing a full-blown overhaul as execs and employees depart', 'http://www.businessinsider.com/github-the-full-inside-story-2016-2', '808', '973', 'easyd', '2/6/2016 18:43']

Calculating the Average Number of Comments for Each Type of Post

We can now use the lists we created for each type of post to calculate the average number of comments each type receives.

In [7]:

# Define a function to calculate the average number of comments for each type of post
def get_average(lst, index):
    total = 0
    for item in lst:
        total += int(item[index])
    
    return total / len(lst)

# Call the function to calculate the average number of comments for each type of post and print the results
avg_show_comments = get_average(show_posts, 4)
print('Average Show Comments: {:.2f}'.format(avg_show_comments))

avg_ask_comments = get_average(ask_posts, 4)
print('Average Ask Comments: {:.2f}'.format(avg_ask_comments))

avg_other_comments = get_average(other_posts, 4) 
print('Average Other Comments: {:.2f}'.format(avg_other_comments))

Average Show Comments: 9.81
Average Ask Comments: 13.74
Average Other Comments: 25.84

On average, ask and show posts receive fewer comments than all other posts, which receive about 26 comments each. Ask posts receive about 14 comments each and show posts receive about 10.

Calculating the Average Number of Points for Each Type of Post

Now we can repeat the above steps to calculate the average number of points each post type receives.

In [8]:

# Call the get_average() function we created previously to determine the average number of points for each post type
avg_show_points = get_average(show_posts, 3)
print('Average Show Points: {:.2f}'.format(avg_show_points))

avg_ask_points = get_average(ask_posts, 3)
print('Average Ask Points: {:.2f}'.format(avg_ask_points))

avg_other_points = get_average(other_posts, 3)
print('Average Other Points: {:.2f}'.format(avg_other_points))

Average Show Points: 26.62
Average Ask Points: 14.40
Average Other Points: 53.43

As with the comment data, other posts receive the most points on average, about twice as many as the next closest post type. However, unlike the comment data, show posts receive more points than ask posts: on average, ask posts receive about 14 points whereas show posts receive about 27 points.
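
The "about twice" comparison can be reproduced from the printed averages. In this sketch the three values are copied by hand from the output above, and `avg_points`, `ranked`, and `ratio` are illustrative names rather than variables from the notebook.

```python
# Average points per post type, copied from the output above
avg_points = {'Show HN': 26.62, 'Ask HN': 14.40, 'Other': 53.43}

# Rank the post types from highest to lowest average
ranked = sorted(avg_points.items(), key=lambda pair: pair[1], reverse=True)
top, runner_up = ranked[0], ranked[1]

# Ratio between the leader and the next closest type
ratio = top[1] / runner_up[1]
print('{} posts average {:.2f}x the points of {} posts'.format(
    top[0], ratio, runner_up[0]))
```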

Calculating the Average of Comments by Hour for Each Type of Post

Next, we can look at the relationship between the time of day and the number of comments for each type of post. We can do this by determining the number of posts created during each hour of the day and the number of comments each of those posts received. Then, for each type of post, we can calculate the average number of comments received for posts created at each hour of the day.
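
Before applying this to the full dataset, the calculation can be illustrated on a few made-up rows. This is a minimal sketch, not the notebook's implementation: `sample_posts` is invented data in the same seven-column layout, and it uses `collections.defaultdict` in place of explicit membership checks.

```python
import datetime as dt
from collections import defaultdict

# Made-up rows in the same layout as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: Example A', '', '5', '10', 'user_a', '6/1/2016 15:01'],
    ['2', 'Ask HN: Example B', '', '3', '20', 'user_b', '6/2/2016 15:45'],
    ['3', 'Ask HN: Example C', '', '8', '6', 'user_c', '6/3/2016 9:30'],
]

counts_by_hour = defaultdict(int)    # posts created in each hour
comments_by_hour = defaultdict(int)  # comments on those posts

for row in sample_posts:
    created = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
    counts_by_hour[created.hour] += 1
    comments_by_hour[created.hour] += int(row[4])

# Average comments per post for each hour
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}
print(avg_by_hour)
```

Hour 15 has two posts with 10 and 20 comments, so its average is 15.0. Hour 9 has a single post with 6 comments, so its average is 6.0.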

In [22]:

def get_average_by_hour(lst, index):

    # Loop over the given list and append the total number of comments (or points) and the created time to a results list
    results = []
    for item in lst:
        created_at = item[6]
        total = int(item[index])
        results.append([total, created_at])
        
    # Create frequency tables for the number of posts by hour and the total number of comments (or points) by hour
    counts_by_hour = {}
    total_by_hour = {} 
    for item in results:
        date = item[1]
        total = item[0]
        
        date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
        hour = date.strftime("%H")
        
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            total_by_hour[hour] = total
        else:
            counts_by_hour[hour] += 1
            total_by_hour[hour] += total
    
    # Calculate the average by hour and append the results to a list
    avg_by_hour = []
    for hour in counts_by_hour:
        avg_by_hour.append([total_by_hour[hour] / counts_by_hour[hour], hour])

    # Sort the average by hour in descending order
    sorted_avg_by_hour = sorted(avg_by_hour, reverse = True)
    
    return sorted_avg_by_hour


# Call the function we created to calculate the average number of comments for each type of post and
# print the hours with highest average comments for each type of post

avg_show_comments_by_hour = get_average_by_hour(show_posts, 4)
print("Top 5 Hours for 'Show HN' Comments")
for item in avg_show_comments_by_hour[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(item[1], "%H").strftime("%H:%M"),
            item[0]
        )
    )
print('\n')
    
avg_ask_comments_by_hour = get_average_by_hour(ask_posts, 4)
print("Top 5 Hours for 'Ask HN' Comments")
for item in avg_ask_comments_by_hour[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(item[1], "%H").strftime("%H:%M"),
            item[0]
        )
    )
print('\n')

avg_other_comments_by_hour = get_average_by_hour(other_posts, 4)
print("Top 5 Hours for 'Other' Comments")
for item in avg_other_comments_by_hour[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(item[1], "%H").strftime("%H:%M"),
            item[0]
        )
    )

Top 5 Hours for 'Show HN' Comments
07:00: 12.42 average comments per post
12:00: 12.03 average comments per post
14:00: 11.60 average comments per post
08:00: 11.07 average comments per post
04:00: 10.87 average comments per post


Top 5 Hours for 'Ask HN' Comments
15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post


Top 5 Hours for 'Other' Comments
13:00: 29.37 average comments per post
12:00: 29.20 average comments per post
14:00: 28.09 average comments per post
15:00: 27.97 average comments per post
11:00: 27.13 average comments per post

For 'Show HN' posts, the hour that receives the most comments is 07:00, or 7:00 am EST, with an average of 12.42 comments per post. Additionally, the hours with the highest average comments are most often in the morning and early afternoon.

For 'Ask HN' posts, the hour that receives the most comments is 15:00, or 3:00 pm EST, with an average of 39.67 comments per post. The highest average comments for these posts are typically in the afternoon.

For 'Other' posts, the hour that receives the most comments is 13:00, or 1:00 pm EST, with an average of 29.37 comments per post. The highest average comments for 'Other' posts are typically in the late morning and afternoon.

The hour with the highest average number of comments is different for each type of post. However, all types of posts have high average comment activity at noon and in the early afternoon.

Calculating the Average of Points by Hour for Each Type of Post

We can repeat the previous steps to determine the average number of points by hour for each type of post.

In [25]:

# Call get_average_by_hour() to calculate the average number of points for each type of post and
# print the hours with highest average points for each type of post

avg_show_points_by_hour = get_average_by_hour(show_posts, 3)
print("Top 5 Hours for 'Show HN' Points")
for item in avg_show_points_by_hour[:5]:
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(item[1], "%H").strftime("%H:%M"),
            item[0]
        )
    )
print('\n')
    
avg_ask_points_by_hour = get_average_by_hour(ask_posts, 3)
print("Top 5 Hours for 'Ask HN' Points")
for item in avg_ask_points_by_hour[:5]:
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(item[1], "%H").strftime("%H:%M"),
            item[0]
        )
    )
print('\n')

avg_other_points_by_hour = get_average_by_hour(other_posts, 3)
print("Top 5 Hours for 'Other' Points")
for item in avg_other_points_by_hour[:5]:
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(item[1], "%H").strftime("%H:%M"),
            item[0]
        )
    )

Top 5 Hours for 'Show HN' Points
12:00: 33.57 average points per post
11:00: 31.57 average points per post
23:00: 30.40 average points per post
19:00: 29.80 average points per post
06:00: 29.38 average points per post


Top 5 Hours for 'Ask HN' Points
15:00: 29.31 average points per post
13:00: 23.77 average points per post
17:00: 16.96 average points per post
10:00: 16.71 average points per post
12:00: 16.53 average points per post


Top 5 Hours for 'Other' Points
13:00: 58.62 average points per post
12:00: 57.53 average points per post
15:00: 55.95 average points per post
17:00: 55.64 average points per post
16:00: 55.62 average points per post

For 'Show HN' posts, the hour that receives the most points is 12:00, or 12:00 pm EST, with an average of 33.57 points per post. Additionally, the hours with the highest average points are most often in the late morning and nighttime.

For 'Ask HN' posts, the hour that receives the most points is 15:00, or 3:00 pm EST, with an average of 29.31 points per post. The highest average points for these posts are typically in the afternoon.

For 'Other' posts, the hour that receives the most points is 13:00, or 1:00 pm EST, with an average of 58.62 points per post. The highest average points for 'Other' posts are typically in the afternoon.

The hour with the highest average number of points is different for each type of post. However, all types of posts have high average point activity at noon.

Conclusion

In this project, we analyzed each type of Hacker News post to determine which post types and times receive the most comments and points on average. It's important to note that we purposely limited the scope of our analysis to posts that received comments.

Based on our analysis, to maximize the number of comments or points a post receives, one should create an 'Other' post between the hours of 12:00 and 15:00, or 12:00 pm and 3:00 pm EST. Excluding 'Other' posts, to maximize comments, one should create an 'Ask HN' post between the hours of 13:00 and 15:00, or 1:00 pm and 3:00 pm EST. To maximize points, one should create a 'Show HN' post between the hours of 11:00 and 12:00, or 11:00 am and 12:00 pm EST.

Finally, we found that creating a post around 12:00, or 12:00 pm EST, is a good strategy for maximizing comments and points regardless of the type of post.
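
Because these hours are expressed in Eastern time, readers elsewhere may want to convert them. The sketch below assumes the timestamps use a fixed UTC-5 offset (Eastern Standard Time with no daylight-saving adjustment), which this analysis has not verified. `to_utc_hour` is a hypothetical helper, not part of the notebook.

```python
import datetime as dt

# Assumption: created_at is fixed Eastern Standard Time (UTC-5),
# with no daylight-saving adjustment
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')

def to_utc_hour(hour_est):
    """Convert an hour of the day in EST to the matching hour in UTC."""
    # The date is arbitrary; only the hour matters for this conversion
    local = dt.datetime(2016, 1, 1, hour_est, tzinfo=EST)
    return local.astimezone(dt.timezone.utc).hour

for hour in (12, 13, 15):
    print('{:02d}:00 EST is {:02d}:00 UTC'.format(hour, to_utc_hour(hour)))
```

Under that assumption, the 12:00 to 15:00 window corresponds to 17:00 to 20:00 UTC.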