
Finding the Most Popular Post Types in Hacker News¶

The main goal of this project is to find out which type of post receives the most comments on average on the website Hacker News.

We will try to find out:

Whether posts that ask questions ("Ask HN" posts) or posts that show projects and other content ("Show HN" posts) receive more comments on average
Which hour of the day is the best time to create a post in order to receive the most comments on average

Summary of Results¶

After analyzing the data, we concluded that a post asking a question at 3pm EST is likely to attract the most comments on Hacker News.

For more details, please refer to the full analysis below.

Exploring the Existing Simplified Data¶

We will first make use of an existing, reduced data set to simplify the analysis and shorten the computation time.

We will use a reduced version of a data set available on Kaggle. The original data set contained 300,000 rows. Posts with no comments were removed, and roughly 20,000 rows were then randomly sampled from the remaining posts.
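
The reduction itself is not part of this notebook. As a rough sketch of how such a step could be done, the example below filters out posts with no comments and then draws a random sample. The rows, the sample size, and the seed are all hypothetical and only illustrate the idea.

```python
import random

# Hypothetical rows in the same column order as hacker_news.csv:
# id, title, url, num_points, num_comments, author, created_at
rows = [
    ['1', 'Post A', '', '5', '0', 'alice', '1/1/2016 10:00'],
    ['2', 'Post B', '', '3', '4', 'bob', '1/2/2016 11:00'],
    ['3', 'Post C', '', '9', '0', 'carol', '1/3/2016 12:00'],
    ['4', 'Post D', '', '1', '7', 'dave', '1/4/2016 13:00'],
    ['5', 'Post E', '', '2', '1', 'erin', '1/5/2016 14:00'],
]

# Step 1: keep only the posts that received at least one comment
commented = [row for row in rows if int(row[4]) > 0]

# Step 2: draw a random sample without replacement
# (the seed is arbitrary; fixing it makes the sample reproducible)
rng = random.Random(0)
sample_size = 2
sample = rng.sample(commented, sample_size)

print(len(commented))  # 3
print(len(sample))     # 2
```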

Below, we will do a quick exploration of the hacker_news.csv file.

In [1]:

# Read in the data
from csv import reader

# Open the file in a with-block so that it is closed once the rows are read
with open('/content/hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    # Transform read_file into a list of lists
    hn = list(read_file)

# Quick exploration of the data
print(*hn[:6], sep = '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Next, we will separate the header row from the data rows to simplify the analysis later on.

In [2]:

# Extract the header from the data
headers = hn[0]
del hn[0]
print(headers)
print(*hn[:5], sep = '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Filtering Post Types by Title from the Data¶

One of our goals is to compare posts that ask the community something ("Ask HN") with posts that show the community a project or other interesting content ("Show HN"). We are therefore only concerned with posts whose titles begin with Ask HN or Show HN.

In [3]:

# Create the lists that contain three types of posts
ask_posts = []
show_posts = []
other_posts = []

# Separate rows into the lists
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Check the number of posts in lists
print('Number of \"Ask HN\" posts: ',len(ask_posts))
print(*ask_posts[:2],sep = '\n')
print('Number of \"Show HN\" posts: ',len(show_posts))
print(*show_posts[:2],sep = '\n')
print('Number of other posts: ',len(other_posts))

Number of "Ask HN" posts:  1744
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
Number of "Show HN" posts:  1162
['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
Number of other posts:  17194
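
The same rule can also be wrapped in a small helper, which makes it easy to check individual titles. This is only a sketch: the function name classify_title is our own and is not used elsewhere in the notebook. Note that the check is case-insensitive and only matches titles that begin with the prefix.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_title('Ask HN: How to improve my personal website?'))  # ask
print(classify_title('SHOW HN: Something pointless I made'))          # show
print(classify_title('Interactive Dynamic Video'))                    # other
# "ask hn" appears in the middle of the title, so it is not an ask post
print(classify_title('Why I never Ask HN anything'))                  # other
```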

Comparing the Popularity between Post Types¶

Below, we will compare the popularity of "Ask HN" and "Show HN" posts by computing the average number of comments for each post type and analyzing the results.

In [4]:

# Find the total number of comments on ask posts
total_ask_comments = 0
for ask_post in ask_posts:
    num_comments = int(ask_post[4])
    total_ask_comments += num_comments

# Compute the average number of comments on ask posts
num_of_ask_posts = len(ask_posts)
avg_ask_comments = round(total_ask_comments / num_of_ask_posts)

print('Average number of comments on ask posts:',avg_ask_comments)

# Find the total number of comments on show posts
total_show_comments = 0
for show_post in show_posts:
    num_comments = int(show_post[4])
    total_show_comments += num_comments

# Compute the average number of comments on show posts
num_of_show_posts = len(show_posts)
avg_show_comments = round(total_show_comments / num_of_show_posts)

print('Average number of comments on show posts:',avg_show_comments)

Average number of comments on ask posts: 14
Average number of comments on show posts: 10

On average, ask posts receive about 4 more comments than show posts (roughly 14 versus 10, after rounding).

In other words, posts that ask the community a question tend to receive more replies than posts that show a project.
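
Because the averages above are rounded to whole numbers, the difference of 4 is approximate. As a sketch, a small helper can compute the unrounded average for any list of rows. The name average_comments is our own, and the rows below are just the two sample rows of each type printed earlier, so the numbers are not the averages for the full data set.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of rows."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# The two sample rows of each type shown in the earlier output
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
sample_show = [
    ['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'],
    ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'],
]

print('{:.2f}'.format(average_comments(sample_ask)))   # 17.50
print('{:.2f}'.format(average_comments(sample_show)))  # 62.00
```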

Finding the Most Popular Time Period for Posting¶

Next, we will determine whether ask posts created at a certain time of day are more likely to attract comments.

We will use the following steps to perform the analysis:

Count the number of ask posts created in each hour of the day, along with the number of comments those posts received.
Calculate the average number of comments that ask posts receive, grouped by the hour in which they were created.

In [5]:

# Create a list of the data we want to focus on
import datetime as dt
result_list = []
for ask_post in ask_posts:
    created_time = ask_post[6] # The time the post was created
    num_comments = ask_post[4] # The number of comments the post got
    result_list.append([created_time,num_comments])

# Create dictionaries for later use
counts_by_hour = {} # Number of posts created in each hour
comments_by_hour = {} # Total number of comments on posts created in each hour

# Tally the posts and comments by the hour in which each post was created
for result in result_list:
    date_time = result[0]
    num_comment = result[1]
    date_time = dt.datetime.strptime(date_time,"%m/%d/%Y %H:%M") # Parse the date string into a datetime object
    hour = date_time.strftime("%H") # Extract the hour as a zero-padded string, e.g. '09'
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(num_comment)
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(num_comment)

# Calculate the average number of comments for posts created in each hour of the day
avg_by_hour = []
for hour in counts_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour,avg_comments])
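
As an aside, the same tally can be written more compactly with collections.defaultdict, which removes the need to check whether an hour has been seen before. The following is a minimal sketch on three sample timestamps (the third one is made up for illustration), not a replacement for the cell above.

```python
import datetime as dt
from collections import defaultdict

# A few [created_at, num_comments] pairs in the data set's string format
sample = [
    ['8/16/2016 9:55', '6'],
    ['11/22/2015 13:43', '29'],
    ['5/2/2016 13:10', '1'],  # hypothetical row
]

counts_by_hour = defaultdict(int)    # missing hours start at 0
comments_by_hour = defaultdict(int)

for created_at, num_comments in sample:
    # Parse the date string, then keep only the zero-padded hour
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += int(num_comments)

avg_by_hour = [
    [hour, comments_by_hour[hour] / counts_by_hour[hour]]
    for hour in counts_by_hour
]
print(avg_by_hour)  # [['09', 6.0], ['13', 15.0]]
```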

Then, we will sort the list avg_by_hour in descending order of average comments to get the top 5 hours for ask post comments.

In [7]:

# Create a list in which the two values of each avg_by_hour entry are swapped
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])
print(swap_avg_by_hour)

# Sort the swapped list
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

# Print the Top 5 hours for Ask Post Comments
print(sorted_swap[:5])

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
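
Swapping the columns is one way to sort by the average. An alternative, sketched below, is to pass a key function to sorted so that the original [hour, average] order is kept. The four entries are taken from the output above and rounded to two decimals.

```python
# A few [hour, average] pairs, rounded from the output above
avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

print(top_hours[:3])  # [['15', 38.59], ['02', 23.81], ['20', 21.52]]
```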

To improve readability, we will use str.format to format the final results.

In [9]:

# Print the formatted hour and average as the final data of the analysis
for comments in sorted_swap:
    hour = comments[1]
    formatted_hour = dt.datetime.strptime(hour,'%H')
    formatted_hour = dt.datetime.strftime(formatted_hour,'%H:%M')
    result_line = "{}: {:.2f} average comments per post".format(formatted_hour, comments[0])
    print(result_line)

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post

As the results above show, the best chance of receiving many comments on Hacker News comes from creating an ask post at 3pm EST (4am the next day in Hong Kong time, or 3am when US daylight saving time is in effect). Ask posts created in that hour received an average of 38.59 comments.
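
The Hong Kong equivalent depends on whether the US East Coast is on standard time (UTC-5) or daylight saving time (UTC-4), while Hong Kong stays at UTC+8 all year. The sketch below checks both cases with fixed offsets. The date is arbitrary, and the notebook itself does not determine which offset applies to each timestamp.

```python
import datetime as dt

# Fixed UTC offsets; the names are labels only
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')
HKT = dt.timezone(dt.timedelta(hours=8), 'HKT')

hk_times = {}
for eastern in (EST, EDT):
    # 3pm on an arbitrary date, interpreted in the given Eastern offset
    posted = dt.datetime(2016, 8, 4, 15, 0, tzinfo=eastern)
    in_hk = posted.astimezone(HKT)
    hk_times[eastern.tzname(None)] = in_hk.strftime('%Y-%m-%d %H:%M')
    print(eastern.tzname(None), '->', hk_times[eastern.tzname(None)])

# EST -> 2016-08-05 04:00
# EDT -> 2016-08-05 03:00
```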

Summary¶

In this project, we analyzed posts from Hacker News to find the post type and time of day that attract the most comments. We concluded that a post asking a question at 3pm EST is likely to receive the most comments on Hacker News.