This project analyzes a sample from the Hacker News data set. Hacker News is a website where users submit posts that are voted and commented upon, similar to Reddit.
The table below provides a brief description of each of the columns.
Column Name | Description |
---|---|
id | The unique identifier from Hacker News for a post |
title | The title of the post |
url | The URL that the post links to, if it has one |
num_points | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
num_comments | The number of comments on the post |
author | The username of the author of the post |
created_at | The date and time at which the post was submitted |
# Reading the file as a list of lists
from csv import reader
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
# Function to display rows (stops at end_index - 1)
def explore_data(dataset, start_index, end_index):
    for index in range(start_index, end_index):
        print(dataset[index])
        print("\n")
# First 6 rows, including the header
explore_data(hn, 0, 6)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
# Separating the column headers from the data set
headers = hn[0]
# Removing the headers from the data set
hn = hn[1:]
print("The headers are \n", headers, "\n")
print("Checking if the header row was removed\n")
explore_data(hn, 0, 5)
The headers are
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
Checking if the header row was removed
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
We will focus on posts whose titles start with either `Ask HN` or `Show HN`. `Ask HN` posts are those submitted by users to ask the Hacker News community a specific question. Likewise, `Show HN` posts are those where users want to show a project or a product to the community.
We will compare these two types of posts to answer the following questions:
- Do `Ask HN` or `Show HN` posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
**Filtering by the type of posts**
For this purpose, we first filter the rows in the data set according to the type of posts. The categories will be `ask_posts` and `show_posts`. We will assign `other_posts` to the remaining types of posts.
ask_posts = []
show_posts = []
other_posts = []

for each_row in hn:
    title = each_row[1]
    # startswith() is case-sensitive,
    # so convert the title to lowercase first
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(each_row)
    elif title.startswith('show hn'):
        show_posts.append(each_row)
    else:
        other_posts.append(each_row)
# Function to print the number of posts in a list
def n_posts(str_posts, list_name):
    print("The number of posts in", str_posts, "is", len(list_name))

n_posts("ask_posts", ask_posts)
n_posts("show_posts", show_posts)
n_posts("other_posts", other_posts)

explore_data(ask_posts, 0, 5)
The number of posts in ask_posts is 1744
The number of posts in show_posts is 1162
The number of posts in other_posts is 17194
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']
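As the comment in the filtering cell notes, `str.startswith` is case-sensitive, which is why the titles are lowercased before matching. A quick sketch of the difference, using one of the titles above:

```python
# str.startswith is case-sensitive, so the lowercase prefix
# "ask hn" only matches after the title itself is lowercased
title = "Ask HN: How to improve my personal website?"
print(title.startswith("ask hn"))          # False
print(title.lower().startswith("ask hn"))  # True
```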
According to the type of posts
As previously mentioned, our analysis is linked to the number of comments. First, we will determine which of the two post types we are considering receives more comments on average.
# Function to compute the average number of comments for a type of post
def average_comments(type_name):
    total_comments = 0
    for each_row in type_name:
        total_comments += int(each_row[4])
    return total_comments / len(type_name)

avg_ask = average_comments(ask_posts)
avg_show = average_comments(show_posts)
print(avg_ask, avg_show)
14.038417431192661 10.31669535283993
Since `Ask HN` posts receive around 14 comments on average, while `Show HN` posts receive approximately 10, we conclude that ask posts receive more comments on average. Thus, the second part of our analysis will focus on ask posts.
According to time
We are going to analyze the average number of comments by hour. So, we need to know the number of comments by hour as well as the number of posts by hour.
# Calculating the number of comments and posts by hour
import datetime as dt

result_list = []  # Stores created_at and number of comments
for each_row in ask_posts:
    created_at = each_row[6]
    n_comments = int(each_row[4])
    result_list.append([created_at, n_comments])

counts_by_hour = {}    # Number of posts by hour
comments_by_hour = {}  # Number of comments by hour
date_format = "%m/%d/%Y %H:%M"
for each_row in result_list:
    dt_object = dt.datetime.strptime(each_row[0], date_format)
    hour = dt_object.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = each_row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += each_row[1]
# Calculating the average number of comments by hour
avg_by_hour = []  # List of lists storing [hour, average]
for each_hour in comments_by_hour:
    avg_by_hour.append([each_hour, comments_by_hour[each_hour] / counts_by_hour[each_hour]])
avg_by_hour
[['17', 11.46], ['10', 13.440677966101696], ['06', 9.022727272727273], ['23', 7.985294117647059], ['15', 38.5948275862069], ['16', 16.796296296296298], ['11', 11.051724137931034], ['04', 7.170212765957447], ['02', 23.810344827586206], ['07', 7.852941176470588], ['12', 9.41095890410959], ['14', 13.233644859813085], ['13', 14.741176470588234], ['05', 10.08695652173913], ['21', 16.009174311926607], ['00', 8.127272727272727], ['18', 13.20183486238532], ['19', 10.8], ['22', 6.746478873239437], ['09', 5.5777777777777775], ['01', 11.383333333333333], ['20', 21.525], ['03', 7.796296296296297], ['08', 10.25]]
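As a side note, the two-dictionary tally above could also be written with `dict.get`, which avoids the explicit membership test. A minimal sketch, using a hypothetical miniature `result_list` with the same `[created_at, n_comments]` structure:

```python
import datetime as dt

# Hypothetical sample rows in the same shape as result_list
result_list = [["8/4/2016 11:52", 52], ["8/4/2016 11:10", 3], ["1/26/2016 19:30", 10]]

counts_by_hour = {}
comments_by_hour = {}
for created_at, n_comments in result_list:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    # dict.get(key, 0) returns 0 for unseen hours, so no membership test is needed
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + n_comments

print(counts_by_hour)    # {'11': 2, '19': 1}
print(comments_by_hour)  # {'11': 55, '19': 10}
```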
It would be hard to identify the hours with the highest values from this list, so we need to sort `avg_by_hour`.
# sorted() compares the first element of each inner list,
# so we swap the values in avg_by_hour to sort by average
swap_avg_by_hour = []
for each_hour in avg_by_hour:
    swap_avg_by_hour.append([each_hour[1], each_hour[0]])
swap_avg_by_hour
[[11.46, '17'], [13.440677966101696, '10'], [9.022727272727273, '06'], [7.985294117647059, '23'], [38.5948275862069, '15'], [16.796296296296298, '16'], [11.051724137931034, '11'], [7.170212765957447, '04'], [23.810344827586206, '02'], [7.852941176470588, '07'], [9.41095890410959, '12'], [13.233644859813085, '14'], [14.741176470588234, '13'], [10.08695652173913, '05'], [16.009174311926607, '21'], [8.127272727272727, '00'], [13.20183486238532, '18'], [10.8, '19'], [6.746478873239437, '22'], [5.5777777777777775, '09'], [11.383333333333333, '01'], [21.525, '20'], [7.796296296296297, '03'], [10.25, '08']]
# Sorting in descending order with sorted()
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
string_format = "{}: {:.2f} average comments per post"
for average, hour in sorted_swap[:5]:
    dt_object = dt.datetime.strptime(hour, "%H")
    time = dt_object.strftime("%H:%M")
    print(string_format.format(time, average))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
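The swap step can also be avoided entirely by passing a `key` function to `sorted()`, which sorts `avg_by_hour` directly by its second element. A sketch with hypothetical `[hour, average]` pairs:

```python
# Hypothetical pairs in the same shape as avg_by_hour
avg_by_hour = [["17", 11.46], ["15", 38.59], ["02", 23.81]]

# Sort by the average (index 1), descending, without swapping
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg)  # [['15', 38.59], ['02', 23.81], ['17', 11.46]]
```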
The hour that receives the highest number of comments on average is 15:00, with an average of around 39 comments per post.
According to the data set documentation, the time zone is US Eastern Time. Thus, in Mauritius (UTC+4) the peak time would be around 23:00 (during US daylight saving time, when Eastern Time is UTC-4).
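The conversion can be checked with Python's `zoneinfo` module (available since Python 3.9). A sketch using a hypothetical date within the data set's range, chosen to fall in US daylight saving time:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical date; Eastern Time observes daylight saving (UTC-4) on it
peak_et = datetime(2016, 8, 4, 15, 0, tzinfo=ZoneInfo("America/New_York"))
peak_mu = peak_et.astimezone(ZoneInfo("Indian/Mauritius"))
print(peak_mu.strftime("%H:%M"))  # 23:00
```

Outside daylight saving time, Eastern Time is UTC-5 and the same 15:00 peak would correspond to 00:00 in Mauritius.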
The sample we used excluded posts that received no comments. So, it would be more precise to say that, among posts that received comments, `Ask HN` posts seem to do better, with the peak time being 15:00.
Thus, to maximize the number of comments a post receives, it is recommended to submit an `Ask HN` post around 15:00 Eastern Time.