Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
This is the link for the original data set
These are the column descriptions:
We're specifically interested in posts whose titles begin with either Ask HN or Show HN.
Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a couple examples:
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.
Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
We'll compare these two types of posts to determine the following:
Let's start by reading the data set into a list of lists and printing the first 5 entries.
from csv import reader
# Open the file in a context manager so it is closed once the rows are read
with open("HN_posts_year_to_Sep_26_2016.csv", encoding = 'utf-8') as opened:
    hn = list(reader(opened))
header = hn[0]
hn = hn[1:]
print(header)
hn[:5]
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
Now let's separate the data set into 'Ask HN', 'Show HN' and 'Other' posts.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    if row[1].lower().startswith("ask hn"):
        ask_posts.append(row)
    elif row[1].lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("AH posts: ", len(ask_posts))
print("SH posts: ",len(show_posts))
print("'Other' posts:", len(other_posts))
AH posts: 9139
SH posts: 10158
'Other' posts: 273822
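The same prefix rule can be written as a small helper, which makes it easy to check on individual titles. This is only a minimal sketch: the name `classify_title` and its return labels are illustrative and do not appear in the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # compare case-insensitively, as the loop above does
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


print(classify_title("Ask HN: How to improve my personal website?"))
print(classify_title("Show HN: Something pointless I made"))
print(classify_title("SQLAR the SQLite Archiver"))
```

Keeping the rule in one place means the three lists and any later check share the same definition of an Ask HN or Show HN post.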
Now let's determine which type of post receives more comments on average
# Average number of comments per Ask HN post
total_ask_comments = 0
for comm in ask_posts:
    total_ask_comments += int(comm[4])
avg_ask_comments = round(total_ask_comments / len(ask_posts))
print("Average number of comments for the AH posts: ", avg_ask_comments)
# Average number of comments per Show HN post
total_show_comments = 0
for comm in show_posts:
    total_show_comments += int(comm[4])
avg_show_comments = round(total_show_comments / len(show_posts))
print("Average number of comments for the SH posts: ", avg_show_comments)
Average number of comments for the AH posts: 10
Average number of comments for the SH posts: 5
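Because the two loops differ only in the list they read, the calculation could be wrapped in a single function. The sketch below assumes rows in the data set's column order (comment count at index 4); `average_comments` is a hypothetical name, and the sample rows are made up for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments per post; 0.0 for an empty list (hypothetical helper)."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows in the data set's column order, for illustration only.
sample_posts = [
    ["1", "Ask HN: example one", "", "3", "4", "user_a", "9/26/2016 3:26"],
    ["2", "Ask HN: example two", "", "1", "7", "user_b", "9/26/2016 4:10"],
]
print(average_comments(sample_posts))  # 5.5
```

Returning the unrounded value leaves the choice of rounding to the caller.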
Ask HN posts attract more comments on average (about 10 per post, versus about 5 for Show HN), so from now on we'll focus on this type for our analysis.
Next, we'll determine if AH posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.
import datetime as dt
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
posts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M" # created format type for date
for row in result_list:
    date = row[0]
    hour = dt.datetime.strptime(date, date_format).strftime("%H") # date formatted and hour extracted
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
# Now the average
avg_by_hour = []
for row in posts_by_hour:
    avg_by_hour.append([row, round(comments_by_hour[row] / posts_by_hour[row], 2)])
avg_by_hour.sort(reverse = True, key = lambda x: x[1]) # sorted by highest n of comments (second column)
for h, avg in avg_by_hour:
    print("H {} - {} avg comments".format(dt.datetime.strptime(h, "%H").strftime("%H:%M"), avg))
H 15:00 - 28.68 avg comments
H 13:00 - 16.32 avg comments
H 12:00 - 12.38 avg comments
H 02:00 - 11.14 avg comments
H 10:00 - 10.68 avg comments
H 04:00 - 9.71 avg comments
H 14:00 - 9.69 avg comments
H 17:00 - 9.45 avg comments
H 08:00 - 9.19 avg comments
H 11:00 - 8.96 avg comments
H 22:00 - 8.8 avg comments
H 05:00 - 8.79 avg comments
H 20:00 - 8.75 avg comments
H 21:00 - 8.69 avg comments
H 03:00 - 7.95 avg comments
H 18:00 - 7.94 avg comments
H 16:00 - 7.71 avg comments
H 00:00 - 7.56 avg comments
H 01:00 - 7.41 avg comments
H 19:00 - 7.16 avg comments
H 07:00 - 7.01 avg comments
H 06:00 - 6.78 avg comments
H 23:00 - 6.7 avg comments
H 09:00 - 6.65 avg comments
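An hourly average is easier to judge when the number of posts behind it is visible. The following is a sketch under the assumption that rows are `[created_at, num_comments]` pairs shaped like `result_list`; `hourly_stats` is a hypothetical helper and the sample rows are invented.

```python
import datetime as dt
from collections import defaultdict


def hourly_stats(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each hour (0-23) to a (post_count, average_comments) tuple."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        comments[hour] += num_comments
    return {h: (counts[h], round(comments[h] / counts[h], 2)) for h in counts}


# Invented rows, for illustration only.
sample = [["9/26/2016 15:05", 10], ["9/25/2016 15:40", 20], ["9/26/2016 3:26", 4]]
for hour, (n_posts, avg) in sorted(hourly_stats(sample).items()):
    print("H {:02d}:00 - {} posts, {} avg comments".format(hour, n_posts, avg))
```

Reporting the count next to the average helps show whether a high value rests on many posts or on a few heavily commented ones.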
According to the data set documentation, the time zone for this data set is Eastern Time in the US.
The hour with the most comments per post is 15:00, with an average of 28.68.
The 12:00 - 15:00 (GMT-5) time band seems to be the most popular.
After analyzing the whole data set, we can conclude that, to maximize the number of comments a post receives, it should be submitted as an Ask HN post, ideally between 12:00 and 15:00 Eastern Time (17:00 - 20:00 London time).
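The shift from Eastern Time to London time can be checked with the standard library. This sketch assumes fixed offsets, UTC-5 for Eastern Time (the offset quoted in the analysis) and UTC+0 for London, so it ignores daylight saving time; `to_london_hour` is a hypothetical helper.

```python
import datetime as dt

EASTERN = dt.timezone(dt.timedelta(hours=-5))  # assumption: fixed UTC-5, no DST
LONDON = dt.timezone(dt.timedelta(hours=0))    # assumption: fixed UTC+0, no DST


def to_london_hour(eastern_hour):
    """Convert an hour of the day in Eastern Time to the hour in London."""
    moment = dt.datetime(2016, 1, 15, eastern_hour, tzinfo=EASTERN)
    return moment.astimezone(LONDON).hour


for hour in (12, 15):
    print("{:02d}:00 Eastern -> {:02d}:00 London".format(hour, to_london_hour(hour)))
```

For dates where daylight saving time matters, named time zones (for example via the `zoneinfo` module) would be a more faithful choice than fixed offsets.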