In this notebook we'll work with a dataset of submissions to Hacker News. You can download it here. To keep things simple, posts without comments were removed and the remaining submissions were randomly sampled, reducing the data from almost 300,000 rows to approximately 20,000 rows.
Below are the descriptions of the columns:
For our analysis we're specifically interested in posts whose titles begin with either Ask HN (when users submit posts to ask the Hacker News community a specific question) or Show HN (when users submit posts to show the Hacker News community a project, product, or just generally something interesting).
We will compare these two types of posts to determine the following:
# import the reader function from the csv module
from csv import reader
opened_file = open('hacker_news.csv', encoding="utf8")
read_file = reader(opened_file)
# convert the read file into a list of lists
hn = list(read_file)
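As a hedged alternative to the cell above, a `with` block closes the file automatically once the rows are read. The sketch below uses a small in-memory sample in place of `hacker_news.csv` so that it runs on its own; the helper name `read_rows` and the sample rows are made up for illustration.

```python
import csv
import io


def read_rows(file_obj):
    """Return every row of an open CSV file as a list of lists."""
    return list(csv.reader(file_obj))


# In the notebook this would be:
#   with open('hacker_news.csv', encoding='utf8', newline='') as f:
#       hn = read_rows(f)
# Here an in-memory sample (made-up rows) stands in for the real file.
sample = io.StringIO(
    'id,title,num_comments\n'
    '1,"Ask HN: Tabs, or spaces?",12\n'
)
with sample as f:
    rows = read_rows(f)

print(rows)
```

Because the title is quoted in the sample, the comma inside it stays part of the title instead of starting a new column.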
After loading it, we can check the first 5 rows from our dataset:
hn[0:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
Now let's separate the header row from the data.
headers = hn[0]
hn = hn[1:]
#checking headers
headers
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
#checking data
hn[0:5]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
We will populate three lists to categorize posts beginning with Ask HN, Show HN and Other (for any other post).
#initializing empty lists
ask_posts = []
show_posts = []
other_posts = []
#iterating on dataset
for row in hn:
    title = row[1]  # title is the second column
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
#checking number of posts
print('Number of posts in ask_posts:',len(ask_posts))
print('Number of posts in show_posts:',len(show_posts))
print('Number of posts in other_posts:',len(other_posts))
Number of posts in ask_posts: 1744
Number of posts in show_posts: 1162
Number of posts in other_posts: 17194
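The categorization relies on a case-insensitive prefix match: each title is lowercased before `startswith` is applied. A minimal sketch of that rule as a standalone function follows; the name `classify_title` and the first two sample titles are hypothetical and not part of the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on how the title begins."""
    lowered = title.lower()  # makes the match case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: How do you learn?',   # made-up title
    'SHOW HN: My side project',    # made-up title, upper-case prefix
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Note that a title only counts as Ask HN or Show HN when the prefix is at the very start; a title that mentions "Ask HN" later on falls into the other category.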
Next, let's determine if ask posts or show posts receive more comments on average.
# average comments - Ask HN posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
# average comments - Show HN posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('Average comments on Ask HN posts: {:.2f}'.format(avg_ask_comments))
print('Average comments on Show HN posts: {:.2f}'.format(avg_show_comments))
Average comments on Ask HN posts: 14.04
Average comments on Show HN posts: 10.32
As expected, Ask HN posts receive more comments on average (14.04) than Show HN posts (10.32).
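The same average is computed twice above, once per list. A small helper can remove the duplication; this is only a sketch, and the function name `average_comments` and its `comments_index` parameter are assumptions rather than anything defined in the notebook. The two sample rows are taken from the dataset preview shown earlier.

```python
def average_comments(rows, comments_index=4):
    """Average the comment counts stored as strings in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# Two rows from the dataset preview (num_comments is the fifth column).
sample_rows = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]
sample_average = average_comments(sample_rows)
print('{:.2f}'.format(sample_average))
```

With such a helper, the notebook's two averages would reduce to `average_comments(ask_posts)` and `average_comments(show_posts)`.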
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
#importing datetime library
import datetime as dt
result_list = []
for row in ask_posts:
    # append the creation date and the number of comments
    result_list.append([row[6], int(row[4])])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comments_count = row[1]
    date_string = row[0]
    date_created = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M")
    hour_created = date_created.hour
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count
We now have the number of posts created and the number of comments received for each hour, so we can calculate the average number of comments per post for posts created during each hour of the day.
print('Posts created by hour: ', counts_by_hour)
print('Total comments added by hour', comments_by_hour)
Posts created by hour: {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Total comments added by hour {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
avg_by_hour = []
for key in counts_by_hour:
    # average comments per post for this hour
    avg_comments = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg_comments])
avg_by_hour
[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
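Each average is simply the total comments for an hour divided by the number of posts created in that hour. As a compact sketch of the same calculation, a dictionary comprehension can pair the two dictionaries directly. Only two of the hours from the output above are included here to keep the example short.

```python
# Two entries copied from the notebook output above (hours 9 and 15).
counts_by_hour = {9: 45, 15: 116}
comments_by_hour = {9: 251, 15: 4477}

# average comments per post = total comments / number of posts
avg_comments_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}

for hour, average in avg_comments_by_hour.items():
    print('{:02d}:00 -> {:.2f}'.format(hour, average))
```

Keeping the result as a dictionary keyed by hour is a design choice for lookups; the notebook's list of `[hour, average]` pairs is more convenient for sorting.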
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list and printing the five highest values in a format that's easier to read.
# swap the column order of avg_by_hour so the average comes first
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour
[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]
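The columns are swapped because `sorted` compares lists element by element, so the average has to come first to drive the ordering. As an alternative sketch, the `key` argument of `sorted` can point at the average directly and leave the `[hour, average]` order untouched. The four pairs below are rounded values from the output above, used only for illustration.

```python
# [hour, average] pairs, rounded from the notebook output above.
avg_by_hour = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.52]]

# Sort on the second element (the average), highest first.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_hours:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, average))
```

One difference worth noting: with the swapped list, ties on the average are broken by the hour, whereas a `key` on the average alone keeps tied entries in their original order.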
# sort swap_avg_by_hour in descending order of average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    hour_formatted = dt.datetime.strptime(str(row[1]), '%H')
    hour_formatted += dt.timedelta(hours=2)  # shift from EST (UTC-5) to UTC-3
    hour_formatted = hour_formatted.strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour_formatted, row[0]))
Top 5 Hours for Ask Posts Comments
17:00: 38.59 average comments per post
04:00: 23.81 average comments per post
22:00: 21.52 average comments per post
18:00: 16.80 average comments per post
23:00: 16.01 average comments per post
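The two-hour shift used above treats both time zones as having a constant offset. As a hedged alternative, the standard-library `zoneinfo` module (Python 3.9+, with time zone data installed) can convert each timestamp with daylight saving taken into account. This sketch assumes the `created_at` values are in US Eastern time, which the notebook only refers to as EST; the helper name `to_sao_paulo` and the two timestamps are made up for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: created_at is recorded in US Eastern time.
EASTERN = ZoneInfo('America/New_York')
SAO_PAULO = ZoneInfo('America/Sao_Paulo')


def to_sao_paulo(date_string):
    """Parse a created_at string and convert it to Sao Paulo time."""
    naive = datetime.strptime(date_string, '%m/%d/%Y %H:%M')
    return naive.replace(tzinfo=EASTERN).astimezone(SAO_PAULO)


# Same clock time on two made-up dates, one in August and one in January.
august = to_sao_paulo('8/4/2016 15:00')
january = to_sao_paulo('1/26/2016 15:00')
print(august.strftime('%H:%M'), january.strftime('%H:%M'))
```

In this sketch the gap between the two zones depends on the date (one hour in August 2016, three hours in January 2016), so converting every timestamp before grouping by hour could place some posts in different hourly buckets than shifting the final hours by a constant.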
From our top 5 hours for Ask HN posts, we can conclude that the best time to post a question is between 17:00 and 17:59 (in the America/Sao_Paulo time zone). It's interesting to note that the average number of comments for posts created around 18:00 drops to less than half the average for posts created around 17:00.
This large difference within such a short window of time could be explained by the end of office hours. To test this hypothesis, we could analyse the average comments per post for the whole dataset rather than only the Ask HN posts, and then check whether the distribution of average comments follows the same pattern.
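One way to run that comparison is to wrap the hourly calculation in a function that accepts any list of rows, then call it once for the Ask HN posts and once for the full dataset. The sketch below is only an outline under the column layout used in this notebook (`num_comments` at index 4, `created_at` at index 6); the function name `average_comments_by_hour` and the four sample rows are invented for the example.

```python
from collections import defaultdict
from datetime import datetime


def average_comments_by_hour(rows, comments_index=4, created_index=6):
    """Map each creation hour to the average comments per post."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in rows:
        created = datetime.strptime(row[created_index], '%m/%d/%Y %H:%M')
        counts[created.hour] += 1
        comments[created.hour] += int(row[comments_index])
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Made-up rows in the notebook's column order.
all_rows = [
    ['1', 'Ask HN: example A', '', '0', '10', 'user_a', '8/4/2016 15:10'],
    ['2', 'Ask HN: example B', '', '0', '30', 'user_b', '8/5/2016 15:45'],
    ['3', 'Some other story', '', '0', '2', 'user_c', '8/6/2016 15:20'],
    ['4', 'Another story', '', '0', '4', 'user_d', '8/7/2016 18:05'],
]
ask_rows = [r for r in all_rows if r[1].lower().startswith('ask hn')]

ask_avg = average_comments_by_hour(ask_rows)
all_avg = average_comments_by_hour(all_rows)
print('Ask HN:', ask_avg)
print('All posts:', all_avg)
```

In the notebook, the two calls would take `ask_posts` and `hn`; if both results peak at the same hour and fall off at the same point, that would be consistent with the office-hours explanation, though it would not prove it.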