Hacker News is a social news website run by the startup incubator Y Combinator, with a focus on computer science and entrepreneurship. It is hugely popular in technology and startup communities. Users can submit posts that "gratify one's intellectual curiosity" (Ref: Hacker News Guidelines). Posts are voted and commented on, and top-ranked posts can draw hundreds of thousands of visitors.
The original dataset of Hacker News posts, covering the 12 months up to 26 September 2016, is available here (link). For this project, we use hacker_news.csv, a modified version in which the approximately 300,000 rows have been reduced to about 20,000 by:
Deleting all the posts without any comments
Sampling randomly from the remaining posts after the deletion
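The two trimming steps above could be reproduced along the following lines. This is only a sketch: the function name `trim_posts`, the fixed `seed`, and the tiny inline rows are assumptions made for illustration, and the procedure actually used to build hacker_news.csv is not documented here.

```python
import random

def trim_posts(rows, sample_size, seed=0):
    """Drop posts with zero comments, then sample randomly from the rest.

    `rows` are lists laid out like hacker_news.csv rows, with the
    comment count (as a string) at index 4.
    """
    with_comments = [row for row in rows if int(row[4]) > 0]
    # Never ask for more rows than are available.
    k = min(sample_size, len(with_comments))
    return random.Random(seed).sample(with_comments, k)

# Hypothetical rows: id, title, url, num_points, num_comments, author, created_at
example_rows = [
    ['1', 'Ask HN: A?', '', '5', '3', 'alice', '1/1/2016 10:00'],
    ['2', 'Show HN: B', 'http://example.com', '2', '0', 'bob', '1/2/2016 11:00'],
    ['3', 'Some link', 'http://example.org', '9', '7', 'carol', '1/3/2016 12:00'],
]
trimmed = trim_posts(example_rows, sample_size=2)
print(len(trimmed))
```

Seeding the random generator makes the sample reproducible, which helps when the same subset needs to be regenerated later.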
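The columns of the hacker_news.csv dataset are:
id: the unique identifier of the post
title: the title of the post
url: the URL the post links to, if any
num_points: the number of points the post received
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time the post was submitted, for example 8/4/2016 11:52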
The goal of this project is to determine which type of post, Ask HN or Show HN, receives more comments.
We also want to know whether posts created at a certain time of day receive more comments on average.
Based on our analysis, Ask HN posts receive more comments on average than Show HN posts (about 14 versus about 10), and Ask HN posts created around 15:00 Eastern Time (EST) receive the most comments on average. The conversion of this hour to our own time zone, West Africa Time (WAT), is given with the results below.
Please check out the details below for the full data analysis.
We open and read hacker_news.csv as a list of lists and assign it to the variable hn. For the analysis, we then remove the header row (hn[0]) and keep only the rows that contain data (hn[1:]).
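from csv import reader
# Read the whole CSV into a list of lists (one inner list per row).
# The with-block closes the file once it has been read.
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])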
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
The output above shows the first five rows of hn, including the header row.
Now let's remove the header row.
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n') #space out the header row from the row body.
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
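The header can also be split off while reading, instead of slicing the list afterwards. The sketch below is one possible way to do this; the helper name `read_rows` is hypothetical, and an in-memory string stands in for hacker_news.csv so that the example runs without the file.

```python
import csv
import io

def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    csv_reader = csv.reader(file_obj)
    header = next(csv_reader)   # first row holds the column names
    rows = list(csv_reader)     # everything after the header
    return header, rows

# Stand-in for hacker_news.csv, using one row shown in the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
headers, rows = read_rows(sample_csv)
print(headers)
print(rows)
```

With the real file, the same helper could be called inside `with open('hacker_news.csv') as f:`.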
Now let's separate posts whose titles begin with Ask HN or Show HN into two different lists using the startswith() method. Titles are converted to lowercase first so that the match is case-insensitive.
ask_posts = []    # titles starting with 'Ask HN'
show_posts = []   # titles starting with 'Show HN'
other_posts = []  # neither 'Ask HN' nor 'Show HN'
for row in hn:
    title = row[1].lower()  # compare case-insensitively
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('Number of Ask posts:', len(ask_posts))
print('Number of Show posts:', len(show_posts))
print('Number of Other posts:', len(other_posts))
Number of Ask posts: 1744
Number of Show posts: 1162
Number of Other posts: 17194
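The classification rule can be isolated in a small function, which makes it easy to check against a few titles. This is a minimal sketch; the function name `post_type` and the example titles are made up for illustration.

```python
def post_type(title):
    """Classify a title as 'ask', 'show', or 'other' (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Hypothetical titles covering each branch, including mixed case.
titles = [
    'Ask HN: How do you learn?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
labels = [post_type(t) for t in titles]
print(labels)  # ['ask', 'show', 'other']
```

Note that only the start of the title is checked, so a title that merely mentions "Ask HN" later in the text is classified as 'other'.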
# Average number of comments per Ask HN post
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average number of ask comments:', avg_ask_comments)
# Average number of comments per Show HN post
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('Average number of show comments:', avg_show_comments)
Average number of ask comments: 14.038417431192661
Average number of show comments: 10.31669535283993
From the results above, we see that Ask HN posts receive about 14 comments on average, while Show HN posts receive about 10. We conclude that, on average, Ask HN posts receive more comments than Show HN posts, so the rest of the analysis focuses on Ask HN posts.
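Since the same averaging logic is applied to both lists, it can be written once as a helper. The following is a sketch under the assumption that rows follow the hacker_news.csv layout (comment count at index 4); the name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(posts):
    """Mean of the num_comments column (index 4); 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when no posts match
    return sum(int(row[4]) for row in posts) / len(posts)

# Hypothetical rows: id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '52', 'alice', '1/1/2016 10:00'],
    ['2', 'Ask HN: B?', '', '2', '10', 'bob', '1/2/2016 11:00'],
]
print(average_comments(sample_posts))  # 31.0
```

With the lists built earlier, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.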
Now let's calculate the number of Ask HN posts created in each hour of the day, along with the total number of comments those posts received.
import datetime as dt
result_list = []       # [created_at, num_comments] for each Ask HN post
counts_by_hour = {}    # number of Ask HN posts created in each hour
comments_by_hour = {}  # total comments on Ask HN posts created in each hour
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
for result in result_list:
    comments_result = result[1]
    created_at_result = result[0]
    # Parse the timestamp and keep only the hour as a two-digit string
    created_at_result_dt = dt.datetime.strptime(created_at_result, '%m/%d/%Y %H:%M')
    creation_hour = created_at_result_dt.strftime('%H')
    if creation_hour in counts_by_hour:
        counts_by_hour[creation_hour] += 1
        comments_by_hour[creation_hour] += comments_result
    else:
        counts_by_hour[creation_hour] = 1
        comments_by_hour[creation_hour] = comments_result
print(counts_by_hour)
print('\n')
print(comments_by_hour)
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
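The if/else bookkeeping for the two dictionaries can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below shows this variant on a few hypothetical rows; the function name `tally_by_hour` is an assumption, not part of the original analysis.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts):
    """Return (counts, comments) dictionaries keyed by two-digit hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in posts:
        created = dt.datetime.strptime(row[6], '%m/%d/%Y %H:%M')
        hour = created.strftime('%H')
        counts[hour] += 1             # one more post in this hour
        comments[hour] += int(row[4])  # add this post's comments
    return dict(counts), dict(comments)

# Hypothetical rows: id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['1', 'Ask HN: A?', '', '5', '52', 'alice', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '2', '8', 'bob', '8/5/2016 11:05'],
    ['3', 'Ask HN: C?', '', '1', '1', 'carol', '6/17/2016 0:01'],
]
counts, comments = tally_by_hour(sample_posts)
print(counts)
print(comments)
```

This version also makes a single pass over the posts rather than building an intermediate list first.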
Next, we use the counts_by_hour and comments_by_hour dictionaries to calculate the average number of comments per post for each hour of the day.
avg_comments_by_hour = []
for hour in comments_by_hour:
    # Average comments per post for this hour, rounded to 1 decimal place
    avg_comments_per_post = round(comments_by_hour[hour] / counts_by_hour[hour], 1)
    avg_comments_by_hour.append([hour, avg_comments_per_post])
print(avg_comments_by_hour)
[['09', 5.6], ['13', 14.7], ['10', 13.4], ['14', 13.2], ['16', 16.8], ['23', 8.0], ['12', 9.4], ['17', 11.5], ['15', 38.6], ['21', 16.0], ['20', 21.5], ['02', 23.8], ['18', 13.2], ['03', 7.8], ['05', 10.1], ['19', 10.8], ['01', 11.4], ['22', 6.7], ['08', 10.2], ['04', 7.2], ['00', 8.1], ['06', 9.0], ['07', 7.9], ['11', 11.1]]
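As a worked check of the arithmetic, the averages for three of the hours can be recomputed directly from the totals printed earlier. The sketch uses a dictionary comprehension; the variable names `sample_counts`, `sample_comments`, and `sample_avg` are chosen here for illustration.

```python
# Totals for three hours, copied from the dictionaries printed above.
sample_counts = {'15': 116, '02': 58, '20': 80}
sample_comments = {'15': 4477, '02': 1381, '20': 1722}

# average = total comments / number of posts, rounded to 1 decimal place
sample_avg = {
    hour: round(sample_comments[hour] / sample_counts[hour], 1)
    for hour in sample_counts
}
print(sample_avg)  # {'15': 38.6, '02': 23.8, '20': 21.5}
```

For example, hour 15 gives 4477 / 116, which is approximately 38.59 and rounds to 38.6.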
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
swap_avg_by_hour = []
for row in avg_comments_by_hour:
    # Put the average first so that sorting orders by it: (average, hour)
    swap_avg_by_hour.append((row[1], row[0]))
print(swap_avg_by_hour)
[(5.6, '09'), (14.7, '13'), (13.4, '10'), (13.2, '14'), (16.8, '16'), (8.0, '23'), (9.4, '12'), (11.5, '17'), (38.6, '15'), (16.0, '21'), (21.5, '20'), (23.8, '02'), (13.2, '18'), (7.8, '03'), (10.1, '05'), (10.8, '19'), (11.4, '01'), (6.7, '22'), (10.2, '08'), (7.2, '04'), (8.1, '00'), (9.0, '06'), (7.9, '07'), (11.1, '11')]
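The columns are swapped so that sorting compares the averages first. An alternative that skips the swap is to pass a `key` function to `sorted()`. The sketch below illustrates this on a subset of the values above; the names `sample_avg_by_hour` and `top_five` are assumptions for the example.

```python
from operator import itemgetter

# A subset of [hour, average] pairs from the list computed earlier.
sample_avg_by_hour = [
    ['09', 5.6], ['16', 16.8], ['15', 38.6],
    ['21', 16.0], ['20', 21.5], ['02', 23.8],
]

# Sort by the average (index 1), highest first, and keep the top five.
top_five = sorted(sample_avg_by_hour, key=itemgetter(1), reverse=True)[:5]
for hour, avg in top_five:
    print(f'{hour}:00: {avg:.1f} average comments per post')
```

One difference from the swapped-tuple approach is tie handling: with a key on the average alone, hours with equal averages keep their original relative order.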
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("The Top Five Hours for Ask Posts Comments:")
for avg, hour in sorted_swap[:5]:
    # The dataset's hours are treated as US Eastern Standard Time (EST), UTC-05:00
    est_hour_dt = dt.datetime.strptime(hour, '%H')
    est_hour_str = est_hour_dt.strftime('%H:%M')
    # Our time zone, West Africa Time (WAT), is UTC+01:00: 6 hours ahead of EST
    # Converting the hour from EST to WAT
    our_hour_dt = est_hour_dt + dt.timedelta(hours=6)
    our_hour_str = our_hour_dt.strftime('%H:%M')
    # Use one decimal place to format avg
    print(' ', '{est_time} EST or {our_time} WAT: {avg:.1f} average comments per post'.format(est_time=est_hour_str, our_time=our_hour_str, avg=avg))
The Top Five Hours for Ask Posts Comments:
  15:00 EST or 21:00 WAT: 38.6 average comments per post
  02:00 EST or 08:00 WAT: 23.8 average comments per post
  20:00 EST or 02:00 WAT: 21.5 average comments per post
  16:00 EST or 22:00 WAT: 16.8 average comments per post
  21:00 EST or 03:00 WAT: 16.0 average comments per post
Our results show that Ask HN posts created between 15:00 and 16:00 EST receive the most comments on average (38.6 per post). One possible explanation is that 15:00 EST is a time when users in both North America and Europe are active; this rests on our assumption that most Hacker News users are from these two continents. For us, the best time to submit a post is therefore 21:00 WAT, followed by 08:00 and 02:00 WAT.
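The time zone conversion can also be expressed with timezone-aware datetime objects rather than a manual timedelta. The sketch below assumes fixed offsets of UTC-05:00 for EST and UTC+01:00 for WAT, with no daylight saving adjustment; the names `EST`, `WAT`, and `est_hour_to_wat` are illustrative and not part of the original analysis.

```python
import datetime as dt

# Assumed fixed offsets (no daylight saving time).
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
WAT = dt.timezone(dt.timedelta(hours=1), 'WAT')

def est_hour_to_wat(hour):
    """Convert an hour of day (0-23) from EST to WAT as 'HH:MM'."""
    # The date is arbitrary; only the time of day matters here.
    est_time = dt.datetime(2016, 1, 4, hour, 0, tzinfo=EST)
    return est_time.astimezone(WAT).strftime('%H:%M')

for hour in (15, 2, 20):
    print(f'{hour:02d}:00 EST -> {est_hour_to_wat(hour)} WAT')
```

Declaring the offsets explicitly keeps the assumption about each time zone visible in one place, so it is easy to adjust if the timestamps turn out to use a different offset.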