Hacker News is a site where user-submitted stories (known as "posts") are voted and commented on. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result. We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, a product, or just something interesting. The goal of our analysis is to determine the following:
Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?
After analyzing the data, we can say that Ask HN posts receive more comments on average than Show HN posts. Moreover, the time of posting has an impact on the average number of comments: Ask HN posts created at 15:00 Eastern Time (07:00 Fiji time) receive, on average, the highest number of comments.
For more details, please refer to the full analysis below.
The Hacker News post data is available on [Kaggle][1]. This data set contains 12 months of data, up to September 2016.

[1]: https://www.kaggle.com/hacker-news/hacker-news-posts
We will start by importing the data.
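from csv import reader

# Read the whole CSV file into a list of lists (one inner list per row)
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Show the first five rows (the header plus four posts)
for row in hn[:5]:
    print(row)
    print('\n')

print('The data set has ' + str(len(hn)) + ' rows.')
print('Each row has ' + str(len(hn[0])) + ' columns.')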
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
The data set has 20101 rows.
Each row has 7 columns.
The Hacker News data set has 20,101 rows (one header row plus 20,100 posts) and 7 columns.
The columns are:
id: the unique identifier of the post
title: title of the post
url: the url of the item being linked to
num_points: the number of upvotes the post received
num_comments: the number of comments the post received
author: the name of the account that made the post
created_at: the date and time the post was made (the time zone is Eastern Time in the US)
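Because every value is read from the CSV file as a string and the code below refers to columns by position (for example `row[4]` for `num_comments`), it can help to build a name-to-index lookup from the header. The following is a minimal sketch rather than part of the original analysis; the names `sample_header`, `sample_row`, and `col_index` are introduced here for illustration only, and the values are copied from the printed output above.

```python
# Header and first data row, copied from the printed output above.
sample_header = ['id', 'title', 'url', 'num_points', 'num_comments',
                 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position to avoid hard-coded indexes.
col_index = {name: position for position, name in enumerate(sample_header)}

# Values arrive as strings, so numeric columns need an explicit conversion.
num_comments = int(sample_row[col_index['num_comments']])
print(col_index['num_comments'], num_comments)
```

With this lookup, `row[col_index['created_at']]` reads the same value as `row[6]`.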
To continue with the analysis, we are going to remove the header row.
header = hn[0]
hn = hn[1:]
for row in hn[:5]:
    print(row)
    print('\n')
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
We have removed the header to make it easier to analyze the data.
For our analysis, we are interested only in Ask HN and Show HN posts.
In the next step, we are going to filter the data.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()  # compare in lower case so the match is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("There are " + str(len(ask_posts)) + " Ask HN posts")
print("There are " + str(len(show_posts)) + " Show HN posts")
print("There are " + str(len(other_posts)) + " other posts")
There are 1744 Ask HN posts
There are 1162 Show HN posts
There are 17194 other posts
In the step above, we split the data into three lists.
In the data set, there are 1,744 Ask HN posts and 1,162 Show HN posts.
We will perform the rest of our analysis on those filtered datasets.
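One detail of the filter is worth spelling out: it matches only the start of the lowercased title, so a title that merely contains "ask hn" somewhere in the middle is counted as an "other" post. The sketch below illustrates that behavior on made-up titles; `split_by_prefix` and `sample_rows` are hypothetical names introduced here and are not part of the original notebook.

```python
def split_by_prefix(rows, title_index=1):
    """Group rows into ask, show and other posts by case-insensitive title prefix."""
    groups = {'ask': [], 'show': [], 'other': []}
    for row in rows:
        title = row[title_index].lower()
        if title.startswith('ask hn'):
            groups['ask'].append(row)
        elif title.startswith('show hn'):
            groups['show'].append(row)
        else:
            groups['other'].append(row)
    return groups


# Made-up rows holding only an id and a title, for illustration.
sample_rows = [
    ['1', 'Ask HN: What are you reading?'],
    ['2', 'SHOW HN: My weekend project'],   # upper case still matches
    ['3', 'Interactive Dynamic Video'],
    ['4', 'Why I ask HN for advice'],       # "ask hn" is not a prefix here
]

groups = split_by_prefix(sample_rows)
print({name: len(rows) for name, rows in groups.items()})
```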
In the next step, we are going to determine whether Ask HN posts or Show HN posts receive more comments on average.
Ask HN posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])  # num_comments is stored as a string
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
output = "The average number of comments for Ask HN post is {:.2f} comments".format(avg_ask_comments)
print(output)
The average number of comments for Ask HN post is 14.04 comments
As we can see, Ask HN posts get 14.04 comments on average.
Show HN posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])  # num_comments is stored as a string
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
output = "The average number of comments for Show HN post is {:.2f} comments".format(avg_show_comments)
print(output)
The average number of comments for Show HN post is 10.32 comments
As we can see, Show HN posts get 10.32 comments on average.
Ask HN posts get 14.04 comments per post, whereas Show HN posts get 10.32 comments per post. Therefore, we can say that Ask HN posts, on average, get more comments than Show HN posts.
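The two averages above are computed with the same steps, so the calculation could be wrapped in one helper. This is a sketch under stated assumptions, not the notebook's own code: `average_comments` and `sample_posts` are hypothetical names, the rows are made up, and the empty-list guard is an addition that the original cells do not have.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of the given rows, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the same column order as the data set.
sample_posts = [
    ['1', 'Ask HN: A', '', '5', '10', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B', '', '3', '4', 'user_b', '8/4/2016 12:10'],
    ['3', 'Ask HN: C', '', '1', '1', 'user_c', '8/5/2016 9:00'],
]

# (10 + 4 + 1) / 3 = 5.0
print("{:.2f}".format(average_comments(sample_posts)))
```

Calling `average_comments(ask_posts)` and `average_comments(show_posts)` would then replace the two separate loops.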
As we have seen, Ask HN posts drive higher engagement than Show HN posts. For the rest of the analysis, we are going to focus on Ask HN posts only.
The next step is to determine if ask posts created at a certain time are more likely to attract comments.
To do so, we will calculate the number of ask posts and comments by hour created.
import datetime as dt  # import as dt to make code more readable

# Keep only the creation time and the comment count of each ask post
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = [created_at, num_comments]
    result_list.append(result)

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    hour = row[0]
    # Parse the timestamp, then keep only the zero-padded hour (e.g. "04")
    hour = dt.datetime.strptime(hour, "%m/%d/%Y %H:%M")
    hour = hour.strftime("%H")
    comment = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
In the step above, we created two dictionaries:
counts_by_hour: contains the number of ask posts created during each hour of the day.
comments_by_hour: contains the total number of comments received by ask posts created during each hour.
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
average_by_hour = []
for hour in comments_by_hour:
    comments = comments_by_hour[hour]
    total = counts_by_hour[hour]
    average = comments / total
    average_by_hour.append([hour, average])
In the step above, we calculated the average number of comments for each hour of the day.
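The counting and averaging steps can also be written more compactly with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. The following is a sketch on made-up data, assuming the same `"%m/%d/%Y %H:%M"` timestamp format as the data set; `sample`, `counts`, `comments`, and `averages` are names introduced here for illustration.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the data set's timestamp format.
sample = [
    ('8/4/2016 11:52', 6),
    ('8/4/2016 11:05', 2),
    ('9/30/2015 4:12', 3),
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for each hour that has at least one post.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

Missing keys start at 0 in a `defaultdict(int)`, so the first post of each hour needs no special case.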
To make the data more readable, we are going to sort it and print the top 5 hours for comments on Ask HN posts.
# Swap the columns so that sorted() orders the rows by average first
swap_avg_by_hour = []
for row in average_by_hour:
    hour = row[0]
    average = row[1]
    swap_avg_by_hour.append([average, hour])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    string = "{}:00 : {:.2f} average comments per post"
    output = string.format(row[1], row[0])
    print(output)
Top 5 Hours for Ask Posts Comments
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
From the table above, we can see that Ask HN posts created at 15:00 get, on average, the highest number of comments.
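Swapping the columns before sorting works, but `sorted` can also order the rows directly through its `key` argument. The sketch below is an alternative to the approach above, not the notebook's own method; it reuses the five averages from the printed table in shuffled order, and `averages` and `top` are illustrative names.

```python
from operator import itemgetter

# [hour, average] pairs taken from the table above, in shuffled order.
averages = [['21', 16.01], ['15', 38.59], ['16', 16.80],
            ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first, without swapping.
top = sorted(averages, key=itemgetter(1), reverse=True)

for hour, average in top[:3]:
    print("{}:00 : {:.2f} average comments per post".format(hour, average))
```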
The times in the data above are in US Eastern Time. As I am currently living in Fiji, I am going to convert the best time to post to local time in Fiji.
import pytz
# Turn hour 15 into a datetime object (strptime fills in the default date 1900-01-01)
Eastern_time = dt.datetime.strptime("15:00", "%H:%M")
# Define the HN time zone and my current time zone
Hn_time_zone = pytz.timezone("US/Eastern")
Fiji_time_zone = pytz.timezone("Pacific/Fiji")
# Convert US/Eastern to Fiji time
Eastern_time = Hn_time_zone.localize(Eastern_time)
Fiji_time = Eastern_time.astimezone(Fiji_time_zone)
Fiji_time = Fiji_time.strftime("%H")
print ("15:00 US/Eastern time corresponds to "+ Fiji_time+ ":00 local time in Fiji")
15:00 US/Eastern time corresponds to 07:00 local time in Fiji
From the above we can see that 15:00 US/Eastern time corresponds to 07:00 local time in Fiji.
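The offset between the two zones is not constant through the year, because both US Eastern Time and Fiji have used daylight saving time. The sketch below makes that dependence visible using fixed UTC offsets from the standard library instead of a time zone database. Everything in it is an assumption for illustration: the offsets (UTC-4 and UTC-5 for Eastern daylight and standard time, UTC+12 and UTC+13 for Fiji standard and summer time), the two example dates, and the helper name `to_fiji_hour`.

```python
import datetime as dt

# Assumed fixed offsets; a time zone database would pick these per date.
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")    # US Eastern, daylight time
EST = dt.timezone(dt.timedelta(hours=-5), "EST")    # US Eastern, standard time
FJT = dt.timezone(dt.timedelta(hours=12), "FJT")    # Fiji, standard time
FJST = dt.timezone(dt.timedelta(hours=13), "FJST")  # Fiji, summer time


def to_fiji_hour(eastern_dt, fiji_tz):
    """Convert an offset-aware Eastern datetime to the hour in the given Fiji offset."""
    return eastern_dt.astimezone(fiji_tz).strftime("%H")


# 15:00 on an assumed northern-summer date: EDT paired with Fiji standard time.
summer_post = dt.datetime(2016, 8, 4, 15, 0, tzinfo=EDT)
# 15:00 on an assumed northern-winter date: EST paired with Fiji summer time.
winter_post = dt.datetime(2015, 12, 15, 15, 0, tzinfo=EST)

print(to_fiji_hour(summer_post, FJT))
print(to_fiji_hour(winter_post, FJST))
```

Under these assumptions the first conversion gives hour 07 and the second gives hour 09, so the 07:00 figure is best read as the result for one particular pair of offsets.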
In this project, we analyzed Hacker News post data to find out what drives post engagement. Our analysis focused on Ask HN and Show HN posts.
In conclusion, we can say that Ask HN posts receive more comments on average than Show HN posts. Moreover, the time of posting has an impact on the average number of comments: Ask HN posts created at 15:00 Eastern Time (07:00 Fiji time) receive, on average, the highest number of comments.