Hacker News (sometimes abbreviated as HN) is a social news website focusing on computer science and entrepreneurship. It is run by the investment fund and startup incubator Y Combinator. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity." (Source: Wikipedia)
In this project, we'll compare two different types of posts from Hacker News. The two types of posts we'll explore begin with either Ask HN or Show HN.
The dataset used in this analysis can be found on this Kaggle page.
The objective is to compare Ask HN and Show HN posts to determine the following:
Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?
Note that the dataset was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments.
Below is a description of the columns in the hacker_news.csv dataset:
id: The unique identifier of the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The total number of points the post received, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The total number of comments the post received
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted (time zone: Eastern Time in the US)
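Because `csv.reader` returns every field as a string, the numeric and date columns have to be converted before they can be used in calculations. The following is a minimal sketch of that conversion for a single row, using the first data row of the dataset. The `parse_post` helper and its dictionary layout are illustrative assumptions and are not part of the analysis itself.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"  # matches values such as '8/4/2016 11:52'

def parse_post(row):
    """Convert one raw CSV row (a list of strings) into typed fields.

    Hypothetical helper: the column order follows the header row
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'].
    """
    return {
        "id": row[0],
        "title": row[1],
        "url": row[2],
        "num_points": int(row[3]),
        "num_comments": int(row[4]),
        "author": row[5],
        "created_at": dt.datetime.strptime(row[6], DATE_FORMAT),
    }

sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']
post = parse_post(sample_row)
print(post["num_comments"], post["created_at"].hour)
```

The rest of the analysis works with the raw lists and converts values only where needed, which is equally valid for a dataset of this size.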
# Import the needed libraries
from csv import reader
import datetime as dt
# Read in the data; the `with` block closes the file once it has been read
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
# View the first 5 records
hn[:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
# Remove the header row
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
First, we'll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.
# Identify posts that begin with either `Ask HN` or `Show HN` and separate the data into different lists.
ask_posts = []
show_posts = []
other_posts = []
for post in hn:
    title = post[1]
    # Compare in lowercase so that the match is case-insensitive
    if title.lower().startswith("ask hn"):
        ask_posts.append(post)
    elif title.lower().startswith("show hn"):
        show_posts.append(post)
    else:
        other_posts.append(post)
print("Total number of ASK Posts HN: ", len(ask_posts))
print("Total number of SHOW Posts HN: ", len(show_posts))
print("Total number of OTHER Posts HN:", len(other_posts))
Total number of ASK Posts HN: 1744
Total number of SHOW Posts HN: 1162
Total number of OTHER Posts HN: 17194
# Below are the first five rows of ask_posts
print(ask_posts[:5])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
# Below are the first five rows of show_posts
print(show_posts[:5])
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
# Below are the first five rows of other_posts
print(other_posts[:5])
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Now that we have separated Ask HN posts and Show HN posts into different lists, we'll calculate the average number of comments each type of post receives.
# Calculate the average number of comments `Ask HN` posts receive.
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of Ask HN comments:", round(avg_ask_comments, 2))
Average number of Ask HN comments: 14.04
# Calculate the average number of comments `Show HN` posts receive.
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of Show HN comments: ", round(avg_show_comments, 2))
Average number of Show HN comments: 10.32
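The two loops above differ only in the list they iterate over, so the calculation can be factored into a single helper. The following is a sketch, assuming the same row layout (comment count at index 4). The name `average_comments` is hypothetical, and the three rows are a small sample taken from the ask_posts output above, so the printed value is not the full-dataset average.

```python
def average_comments(posts):
    """Return the mean of the num_comments column (index 4) for a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[4]) for post in posts)
    return total / len(posts)

# Three rows copied from the ask_posts output; a sample only
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]
print(round(average_comments(sample_ask), 2))  # (6 + 29 + 1) / 3 = 12.0
```

With such a helper, the same call would serve both `ask_posts` and `show_posts`.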
The analysis above shows that Ask HN posts receive approximately 14 comments on average, whereas Show HN posts receive approximately 10. Since Ask HN posts are more likely to receive comments, the remaining analysis will focus on Ask HN posts.
Next, we'll determine whether we can maximize the number of comments an Ask HN post receives by creating it at a certain time. First, we'll find the number of Ask HN posts created during each hour of the day, along with the number of comments those posts received. Then, we'll calculate the average number of comments received by Ask HN posts created at each hour of the day.
# Collect the creation time and number of comments for each `Ask HN` post
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    result_list.append([created_at, num_comments])

# Count posts and sum comments for each hour of the day
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comment = row[1]
    time_t = dt.datetime.strptime(date, date_format).strftime("%H")
    if time_t not in counts_by_hour:
        counts_by_hour[time_t] = 1
        comments_by_hour[time_t] = comment
    else:
        counts_by_hour[time_t] += 1
        comments_by_hour[time_t] += comment
comments_by_hour
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
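The same tallies can be built without the explicit `if`/`else` branch by using `collections.defaultdict`, which supplies a starting value of 0 for any hour not yet seen. The following is a sketch of that alternative. The first three `[created_at, num_comments]` pairs come from the Ask HN rows shown earlier; the fourth pair is invented purely to show two posts accumulating in the same hour.

```python
from collections import defaultdict
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
sample = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/16/2016 9:10", 4],  # hypothetical row, not from the dataset
]

counts_by_hour = defaultdict(int)    # missing keys start at 0
comments_by_hour = defaultdict(int)
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

print(dict(counts_by_hour))
print(dict(comments_by_hour))
```

Both approaches produce the same dictionaries; the `defaultdict` version only removes the need to check whether a key already exists.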
Next, we use the dictionaries we just created to calculate the average number of comments for Ask HN posts by hour.
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
We now have the average number of comments for posts created during each hour of the day, but the format makes it difficult to identify the hours with the highest values. We'll therefore sort the list and print the five highest values in a more readable format.
# Swap the columns so that the list can be sorted by the average
swap_avg_by_hour = []
for row in avg_by_hour:
    first_e = row[1]
    second_e = row[0]
    swap_avg_by_hour.append([first_e, second_e])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('\n')
sorted_swap
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
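Swapping the columns is one way to sort by the average. `sorted` also accepts a `key` function, which leaves the `[hour, average]` order untouched. The following is a sketch of that alternative, using a handful of values copied from the avg_by_hour output above; the name `avg_by_hour_sample` is illustrative.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort by the second element (the average), highest first
top = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)
for hour, avg in top[:3]:
    print("{}:00 {:.2f} average comments per post".format(hour, avg))
```

The result is the same ranking as the swapped list, with one fewer intermediate list to maintain.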
# Print the 5 hours with the highest average number of comments.
# Hours are converted from US Eastern Time to West Africa Time (WAT, UTC+1).
# A fixed 5-hour offset is used, which matches Eastern Daylight Time (UTC-4);
# during Eastern Standard Time (UTC-5), WAT is 6 hours ahead.
print("Top 5 Hours for 'Ask HN' Comments")
for row in sorted_swap[:5]:
    wat_hr_dt = dt.datetime.strptime(row[1], '%H') + dt.timedelta(hours=5)
    wat_hr_str = wat_hr_dt.strftime('%H:%M')
    print(' ', '{wat_time} WAT: {avg:.2f} average comments per post'.format(wat_time=wat_hr_str, avg=row[0]))
Top 5 Hours for 'Ask HN' Comments
  20:00 WAT: 38.59 average comments per post
  07:00 WAT: 23.81 average comments per post
  01:00 WAT: 21.52 average comments per post
  21:00 WAT: 16.80 average comments per post
  02:00 WAT: 16.01 average comments per post
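A fixed 5-hour shift matches Eastern Daylight Time (UTC-4); during Eastern Standard Time (UTC-5) the gap to WAT (UTC+1) is 6 hours. The following sketch shows both cases using only fixed-offset `datetime.timezone` objects. The two dates are arbitrary examples chosen for illustration and are not rows from the dataset.

```python
import datetime as dt

WAT = dt.timezone(dt.timedelta(hours=1), "WAT")    # West Africa Time, UTC+1
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")   # Eastern Daylight Time, UTC-4
EST = dt.timezone(dt.timedelta(hours=-5), "EST")   # Eastern Standard Time, UTC-5

# 15:00 Eastern on an example summer date and an example winter date
summer = dt.datetime(2016, 7, 1, 15, 0, tzinfo=EDT)
winter = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)

print(summer.astimezone(WAT).strftime("%H:%M"))  # 20:00
print(winter.astimezone(WAT).strftime("%H:%M"))  # 21:00
```

Since the dataset spans both summer and winter months, the WAT hours reported here are best read as approximate to within one hour.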
From the analysis, we can conclude that Ask HN posts created at 20:00 WAT (15:00 Eastern Time) receive the most comments on average, at 38.59 comments per post. The spread across hours is large: posts created in the lowest-ranked hour (14:00 WAT, 09:00 Eastern Time) average only 5.58 comments, roughly one-seventh of the peak.
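The size of the gap between the busiest and the quietest hour can be checked directly from the averages. The following sketch uses the highest and lowest values reported in the avg_by_hour output above; the variable names are illustrative.

```python
highest = 38.5948275862069    # hour '15' (Eastern Time) in avg_by_hour
lowest = 5.5777777777777775   # hour '09' (Eastern Time) in avg_by_hour

ratio = highest / lowest
relative_drop = (highest - lowest) / highest * 100

print("Highest is {:.1f} times the lowest".format(ratio))
print("Lowest is {:.1f}% below the highest".format(relative_drop))
```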
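According to the dataset's source, the timestamps are in US Eastern Time. The hours above were converted to WAT, so 20:00 can also be written as 8:00 PM WAT. Note that the conversion assumes WAT is 5 hours ahead of Eastern Time, which holds during Eastern Daylight Time; during Eastern Standard Time the difference is 6 hours.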
Finally, to maximize the number of comments, we recommend submitting a post as an Ask HN post and creating it between 20:00 and 21:00 WAT (8:00 PM to 9:00 PM WAT).