The main goal of this project is to find out which type of post receives the most comments on average on Hacker News.
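We will try to find out whether "Ask HN" or "Show HN" posts receive more comments on average, and whether ask posts created at a certain time are more likely to attract comments.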
After analyzing the data, we conclude that an ask post created at 3pm EST is likely to attract the most comments on Hacker News.
For more details, please refer to the full analysis below.
We will first make use of an existing data set to simplify the analysis and reduce the computation time.
We will use a reduced version of the data set available on Kaggle. Note that the original data set contains 300,000 rows; posts with no comments were removed, and about 20,000 rows were randomly sampled from the remaining ones.
Below, we will do a quick exploration of the hacker_news.csv file.
# Read in the data
from csv import reader
opened_file = open('/content/hacker_news.csv')
read_file = reader(opened_file)
# Transform read_file into a list of lists
hn = list(read_file)
# Quick exploration of the data
print(*hn[:6], sep = '\n')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
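The read step above can also be written so that the file is closed automatically. The following is a minimal sketch, not the notebook's own code: the `read_rows` helper and the inline `sample_csv` text are illustrative stand-ins, and the sample contains only the header and the first data row shown above.

```python
import csv
import io

def read_rows(file_obj):
    """Return every CSV row from an open text file as a list of lists."""
    return list(csv.reader(file_obj))

# Illustrative stand-in for hacker_news.csv (header plus one row)
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)
sample_rows = read_rows(sample_csv)
print(*sample_rows, sep='\n')

# With the real file, a with block closes the file when the block ends:
# with open('/content/hacker_news.csv', newline='') as f:
#     hn = read_rows(f)
```

Passing `newline=''` to `open` is what the `csv` module documentation recommends, so that line breaks inside quoted fields are handled correctly.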
Then we will separate the header row from the data rows in order to simplify the analysis later on.
# Extract the header from the data
headers = hn[0]
del hn[0]
print(headers)
print(*hn[:5], sep = '\n')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
One of our goals is to compare posts that ask the community something ("Ask HN") with posts that show the community a project or other interesting content ("Show HN"), so we are only concerned with posts whose titles begin with Ask HN or Show HN.
# Create the lists that contain three types of posts
ask_posts = []
show_posts = []
other_posts = []
# Separate rows into the lists
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Check the number of posts in lists
print('Number of \"Ask HN\" posts: ',len(ask_posts))
print(*ask_posts[:2],sep = '\n')
print('Number of \"Show HN\" posts: ',len(show_posts))
print(*show_posts[:2],sep = '\n')
print('Number of other posts: ',len(other_posts))
Number of "Ask HN" posts: 1744
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
Number of "Show HN" posts: 1162
['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
Number of other posts: 17194
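The classification rule used above is a case-insensitive prefix match, so a title that only mentions "Ask HN" somewhere in the middle counts as an other post. A small sketch of that rule as a standalone function follows; the `post_type` name and the last two sample titles are hypothetical and not part of the notebook.

```python
def post_type(title):
    """Classify a title as 'ask', 'show', or 'other' by its prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'ASK HN: does case matter?',             # hypothetical title
    'Why I stopped reading Ask HN threads',  # hypothetical title
]
sample_types = [post_type(title) for title in sample_titles]
print(sample_types)
```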
Below, we will compare the popularity of "Ask HN" and "Show HN" posts by computing the average number of comments for each post type and analysing the results.
# Find the total number of comments on ask posts
total_ask_comments = 0
for ask_post in ask_posts:
    num_comments = int(ask_post[4])
    total_ask_comments += num_comments
#Compute the average number of comment on ask posts
num_of_ask_posts = len(ask_posts)
avg_ask_comments = round(total_ask_comments / num_of_ask_posts)
print('Average number of comments on ask posts:',avg_ask_comments)
# Find the total number of comments on show posts
total_show_comments = 0
for show_post in show_posts:
    num_comments = int(show_post[4])
    total_show_comments += num_comments
#Compute the average number of comment on show posts
num_of_show_posts = len(show_posts)
avg_show_comments = round(total_show_comments / num_of_show_posts)
print('Average number of comments on show posts:',avg_show_comments)
Average number of comments on ask posts: 14
Average number of comments on show posts: 10
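Because both averages are passed through round(), the printed values hide the fractional part of each mean. The two loops can be folded into one helper that returns the unrounded mean. This is only a sketch: `average_comments` is a hypothetical helper, and the sample uses just the two ask posts printed earlier, so its result says nothing about the full data set.

```python
def average_comments(rows, comments_index=4):
    """Return the unrounded mean comment count of rows (0.0 for an empty list)."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# The two ask posts printed earlier in the notebook
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
sample_avg = average_comments(sample_ask)
print('{:.2f}'.format(sample_avg))
```

On the real data the calls would be average_comments(ask_posts) and average_comments(show_posts).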
On average, ask posts receive about 4 more comments than show posts (14 versus 10 after rounding).
It turns out that posts asking the community a question are more likely to attract replies than posts showing a project.
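Next, we will find out whether ask posts created at a certain time are more likely to attract comments.
To perform the analysis, we will count the ask posts created in each hour of the day together with the comments they received, and then calculate the average number of comments per post for each hour.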
# Create the list of the data we want to focus
import datetime as dt
result_list = []
for ask_post in ask_posts:
    created_time = ask_post[6] # The time the post was created
    num_comments = ask_post[4] # The number of comments the post got
    result_list.append([created_time,num_comments])
# Create dictionaries for later use
counts_by_hour = {} # Counting the total number of post by hour the post was created
comments_by_hour = {} # Counting the number of comment by hour the post was created
# Count the total post and comment by hour the post was created
for result in result_list:
    date_time = result[0]
    num_comment = result[1]
    date_time = dt.datetime.strptime(date_time,"%m/%d/%Y %H:%M") # Parse the date string into a datetime object
    hour = dt.datetime.strftime(date_time,"%H") # Extract the hour as a zero-padded string
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = int(num_comment)
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += int(num_comment)
# Calculate the average number of comment for posts created in every hour of the day
avg_by_hour = []
for hour in counts_by_hour:
    avg_comments = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour,avg_comments])
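The counting and averaging steps above can be sketched more compactly with collections.defaultdict, which removes the need for the "hour not in counts_by_hour" branch. The `average_by_hour` helper is hypothetical, and the sample pairs are the two ask posts printed earlier plus one made-up row so that one hour holds more than one post.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(pairs):
    """Map each hour ('00' to '23') to the mean comment count of posts created in it.

    pairs is a list of [created_at, num_comments] string pairs.
    """
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        comments[hour] += int(num_comments)
    return {hour: comments[hour] / counts[hour] for hour in counts}

sample_pairs = [
    ['8/16/2016 9:55', '6'],
    ['11/22/2015 13:43', '29'],
    ['1/1/2016 13:05', '1'],  # hypothetical row, not from the data set
]
sample_hourly = average_by_hour(sample_pairs)
print(sample_hourly)
```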
Then, we will sort the list avg_by_hour in descending order of average comments to get the top 5 hours for ask post comments.
# Create the list which swap the value of avg_by_hour list
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1],hour[0]])
print(swap_avg_by_hour)
# Sort the swapped list
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
# Print the Top 5 hours for Ask Post Comments
print(sorted_swap[:5])
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
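Swapping the columns works because sorted() compares lists element by element. An alternative sketch, which is not what the notebook does, is to leave the [hour, average] pairs as they are and pass a key function to sorted(). The sample below reuses three of the averages printed above; the variable names are illustrative.

```python
# Three [hour, average] pairs taken from the output above
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

# Sort by the second element (the average), highest first
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
top_hours = [hour for hour, _ in sorted_by_avg]
print(top_hours)
```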
To improve readability, we will use str.format to generate the final result.
# Print the formatted hour and average as the final data of the analysis
for comments in sorted_swap:
    hour = comments[1]
    formatted_hour = dt.datetime.strptime(hour,'%H')
    formatted_hour = dt.datetime.strftime(formatted_hour,'%H:%M')
    result_line = "{}: {:.2f} average comments per post".format(formatted_hour, comments[0])
    print(result_line)
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post
As we can see from the data above, the best way to maximise the chance of receiving comments on Hacker News is to create an ask post at 3pm EST (4am the next day in Hong Kong, or 3am while US daylight saving time is in effect). Ask posts created in that hour receive 38.59 comments on average.
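The Hong Kong equivalent of the 3pm slot can be checked with fixed UTC offsets. This is a minimal sketch under stated assumptions: EST is taken as UTC-5, EDT as UTC-4 and Hong Kong time as UTC+8, the dates are arbitrary, and the `to_hong_kong` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets (assumed): EST = UTC-5, EDT = UTC-4, Hong Kong = UTC+8
EST = timezone(timedelta(hours=-5), 'EST')
EDT = timezone(timedelta(hours=-4), 'EDT')
HKT = timezone(timedelta(hours=8), 'HKT')

def to_hong_kong(year, month, day, hour, eastern_tz):
    """Convert a wall-clock hour in the given US Eastern offset to Hong Kong time."""
    eastern = datetime(year, month, day, hour, 0, tzinfo=eastern_tz)
    return eastern.astimezone(HKT)

winter = to_hong_kong(2016, 1, 15, 15, EST)  # arbitrary winter date
summer = to_hong_kong(2016, 7, 15, 15, EDT)  # arbitrary summer date
print(winter.strftime('%m/%d %H:%M'), summer.strftime('%m/%d %H:%M'))
```

Both results fall on the following calendar day in Hong Kong, at 04:00 for the standard-time offset and 03:00 for the daylight-time offset.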
In this project, we analyzed posts from Hacker News to find the post type and time of day that attract the most comments. We conclude that an ask post created at 3pm EST is likely to receive the most comments on Hacker News.