In this project, I will analyse a representative set of data from the Hacker News site to determine which kind of post (Ask HN or Show HN) drives more comments, and whether the time at which a post was created has an influence on the number of comments it receives.
The data set can be found on Kaggle.
It contains approximately 20,000 rows, reduced from the roughly 300,000 rows originally extracted in 2016: posts without comments were removed and the remaining submissions were randomly sampled.
Below are the column descriptions:
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted (using Eastern Time in the US)
#Importing the data
opened_file = open('hacker_news.csv')
from csv import reader
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
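The import step above leaves the file handle open. A minimal sketch of the same read using a `with` block is shown here; `load_rows` and `SAMPLE_CSV` are hypothetical names introduced for illustration, the in-memory sample stands in for `hacker_news.csv` so the snippet runs on its own, and the `utf-8` encoding in the comment is an assumption about the file.

```python
import csv
import io

# Hypothetical stand-in for hacker_news.csv so the sketch runs on its own.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)


def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# With the real file this would be (encoding is an assumption):
#     with open('hacker_news.csv', newline='', encoding='utf-8') as f:
#         header, hn = load_rows(f)
with io.StringIO(SAMPLE_CSV) as f:
    sample_header, sample_rows = load_rows(f)

print(sample_header)
print(len(sample_rows))
```

The `with` block closes the file as soon as the rows are read, and returning the header separately avoids slicing the list afterwards.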
#separating the header from the data for ease of use
header = hn[0]
hn = hn[1:]
print(header)
for row in hn[:6]:
    print(row)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48']
We are only concerned with posts whose titles begin with Ask HN or Show HN, so we'll create new lists of lists containing just the data for those titles.
#empty lists to store data of interest
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('There are {} in the ask hn list'.format(len(ask_posts)))
print('There are {} in the show hn list'.format(len(show_posts)))
print('There are {} in the other list'.format(len(other_posts)))
There are 1744 in the ask hn list
There are 1162 in the show hn list
There are 17194 in the other list
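The filtering step can also be wrapped in a small function so it is easy to try on a few rows. This is only a sketch: `split_posts` and the demo rows are made up for illustration and follow the column order of the data set.

```python
def split_posts(rows, title_index=1):
    """Group rows into ask, show, and other lists by title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        # Lower-case the title so 'ASK HN' and 'Ask HN' are treated alike.
        title = row[title_index].lower()
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Made-up rows in the same column order as the data set.
demo_rows = [
    ['1', 'Ask HN: How do you learn a new language?', '', '5', '12', 'user_a', '8/4/2016 11:52'],
    ['2', 'Show HN: A tiny CSV viewer', 'http://example.com', '9', '3', 'user_b', '1/26/2016 19:30'],
    ['3', 'ASK HN: Best books of the year?', '', '2', '7', 'user_c', '6/23/2016 22:20'],
    ['4', 'Interactive Dynamic Video', 'http://example.com/video', '386', '52', 'user_d', '6/17/2016 0:01'],
]

ask, show, other = split_posts(demo_rows)
print(len(ask), len(show), len(other))  # 2 1 1
```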
We want to determine which kind of post, Ask HN or Show HN, attracts more interest.
To do so, we will look at the average number of comments each type of post receives.
#Ask HN - average number of comments per post
total_ask_comments = 0
for post in ask_posts:
    n_comments = int(post[4])
    total_ask_comments += n_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('There are {:.2f} comments on average for Ask HN posts'.format(avg_ask_comments))
#Show HN - average number of comments per post
total_show_comments = 0
for post in show_posts:
    n_comments = int(post[4])
    total_show_comments += n_comments
avg_show_comments = total_show_comments / len(show_posts)
print('There are {:.2f} comments on average for Show HN posts'.format(avg_show_comments))
There are 14.04 comments on average for Ask HN posts
There are 10.32 comments on average for Show HN posts
#looking at the maximum number of comments
#Ask HN
max_n_comments_ask = 0
for post in ask_posts:
    n_comments = int(post[4])
    if n_comments > max_n_comments_ask:
        max_n_comments_ask = n_comments
print('There is a maximum of {} comments for Ask HN posts'.format(max_n_comments_ask))
#Show HN
max_n_comments_show = 0
for post in show_posts:
    n_comments = int(post[4])
    if n_comments > max_n_comments_show:
        max_n_comments_show = n_comments
print('There is a maximum of {} comments for Show HN posts'.format(max_n_comments_show))
There is a maximum of 947 comments for Ask HN posts
There is a maximum of 306 comments for Show HN posts
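The average and maximum loops above repeat the same pattern for each list. As a possible shortcut, the sketch below gets both figures from the standard library's `statistics.mean` and the built-in `max`. The helper `comment_stats` and the demo posts are hypothetical, and the function assumes the list is not empty (`mean` raises `StatisticsError` on an empty list).

```python
from statistics import mean


def comment_stats(posts, comments_index=4):
    """Return (average, maximum) number of comments for a list of posts."""
    counts = [int(post[comments_index]) for post in posts]
    return mean(counts), max(counts)


# Made-up posts in the same column order as the data set.
demo_posts = [
    ['1', 'Ask HN: first', '', '5', '12', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: second', '', '2', '7', 'user_b', '6/23/2016 22:20'],
    ['3', 'Ask HN: third', '', '1', '2', 'user_c', '6/17/2016 0:01'],
]

avg, largest = comment_stats(demo_posts)
print('{:.2f} comments on average, {} at most'.format(avg, largest))
```

Calling the same function on `ask_posts` and `show_posts` would replace the four loops with two lines.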
From the average number of comments per post, we can see that Ask HN posts receive about 1.36 times as many comments as Show HN posts: each Ask HN post in our sample received an average of 14.04 comments, compared with 10.32 for Show HN posts. As an additional check, we can look at the maximum number of comments received by each type of post. The result points the same way: the most-commented Ask HN post received 947 comments, about 3 times as many as the 306 received by the most-commented Show HN post.
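The two ratios behind this comparison can be checked directly from the figures printed earlier. The variable names below are introduced only for this check.

```python
avg_ask, avg_show = 14.04, 10.32   # average comments per post, from the output above
max_ask, max_show = 947, 306       # maximum comments on a single post, from the output above

avg_ratio = avg_ask / avg_show
max_ratio = max_ask / max_show

print('Average ratio: {:.2f}'.format(avg_ratio))  # 1.36
print('Maximum ratio: {:.2f}'.format(max_ratio))  # 3.09
```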
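Based on the above, we will focus on the Ask HN posts for the remaining part of the analysis.
To do so, we will count the Ask HN posts created during each hour of the day and the comments they received, then compute the average number of comments per post for each hour.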
# Importing the Datetime module given we will work with time
import datetime as dt
result_list = []
for post in ask_posts:
    created_at = post[6]
    n_comments = int(post[4])
    new_element = [created_at, n_comments]
    result_list.append(new_element)
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    time = row[0]
    comments = int(row[1])
    time_dt = dt.datetime.strptime(time, "%m/%d/%Y %H:%M")
    hour = time_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
#Making sure the results are realistic and likely to be accurate
print(counts_by_hour)
print(comments_by_hour)
#making sure we have all the rows
print("There are {} lines in counts by hours".format(len(counts_by_hour)))
print("There are {} lines in comments by hours".format(len(comments_by_hour)))
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
There are 24 lines in counts by hours
There are 24 lines in comments by hours
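The hour-by-hour tally can also be written without the `if hour not in ...` branch. The sketch below uses `collections.defaultdict`, which starts every missing key at zero. `tally_by_hour` and the demo pairs are hypothetical; the date format string is the one used above.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Return (posts per hour, comments per hour) from [created_at, n_comments] pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).hour
        counts[hour] += 1
        comments[hour] += int(n_comments)
    # Convert back to plain dicts so they print like the ones above.
    return dict(counts), dict(comments)


# Made-up [created_at, n_comments] pairs.
demo_pairs = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 19:30', 10],
    ['6/23/2016 11:05', 4],
]

demo_counts, demo_comments = tally_by_hour(demo_pairs)
print(demo_counts)    # {11: 2, 19: 1}
print(demo_comments)  # {11: 56, 19: 10}
```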
Next, we compute the average number of comments per post for each hour. This will help us identify the best time to create a post to attract the most comments.
avg_by_hour = []
for hour in counts_by_hour:
    avg_by_hour.append([hour, (comments_by_hour[hour] / counts_by_hour[hour])])
print("There are {} lines in average comments by hours".format(len(avg_by_hour)))
for row in avg_by_hour:
    print(row)
There are 24 lines in average comments by hours
[9, 5.5777777777777775]
[13, 14.741176470588234]
[10, 13.440677966101696]
[14, 13.233644859813085]
[16, 16.796296296296298]
[23, 7.985294117647059]
[12, 9.41095890410959]
[17, 11.46]
[15, 38.5948275862069]
[21, 16.009174311926607]
[20, 21.525]
[2, 23.810344827586206]
[18, 13.20183486238532]
[3, 7.796296296296297]
[5, 10.08695652173913]
[19, 10.8]
[1, 11.383333333333333]
[22, 6.746478873239437]
[8, 10.25]
[4, 7.170212765957447]
[0, 8.127272727272727]
[6, 9.022727272727273]
[7, 7.852941176470588]
[11, 11.051724137931034]
The above data appears accurate. However, it is hard to pick out the 5 best hours for creating a post by eye.
To make this easier, we will swap the order of the data (average number of comments first, then hour) and sort the averages in descending order.
#Swapping the order
swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append([hour[1], hour[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
for row in sorted_swap:
    print(row)
[38.5948275862069, 15]
[23.810344827586206, 2]
[21.525, 20]
[16.796296296296298, 16]
[16.009174311926607, 21]
[14.741176470588234, 13]
[13.440677966101696, 10]
[13.233644859813085, 14]
[13.20183486238532, 18]
[11.46, 17]
[11.383333333333333, 1]
[11.051724137931034, 11]
[10.8, 19]
[10.25, 8]
[10.08695652173913, 5]
[9.41095890410959, 12]
[9.022727272727273, 6]
[8.127272727272727, 0]
[7.985294117647059, 23]
[7.852941176470588, 7]
[7.796296296296297, 3]
[7.170212765957447, 4]
[6.746478873239437, 22]
[5.5777777777777775, 9]
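Swapping the columns is one way to sort by average. An alternative sketch is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The demo list below holds four of the pairs printed earlier, rounded to two decimals for brevity.

```python
# Four [hour, average] pairs taken from the output above, rounded for brevity.
demo_avg_by_hour = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.52]]

# Sort on the second element (the average) without swapping the columns.
ranked = sorted(demo_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    # {:02d} pads the hour to two digits, so no strptime/strftime round trip is needed.
    print('{:02d}:00  {:.2f}'.format(hour, avg))
```

With the full `avg_by_hour` list, `ranked[:5]` would give the same top five hours as the swapped list.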
print("Top 5 Hours for Ask HN Posts Comments")
for row in sorted_swap[:5]:
    time_dt = dt.datetime.strptime(str(row[1]), '%H')
    time = time_dt.strftime("%H:%M")
    print("{time}: {comments:.2f} average comments per post".format(time=time, comments=row[0]))
Top 5 Hours for Ask HN Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
After our preliminary analysis, we found that Ask HN posts receive more comments on average than Show HN posts, and that Ask HN posts created at 15:00 receive the most comments on average (38.59 per post).
We can also suggest a secondary window. While the 15:00 to 16:00 window produces significantly more comments (up to about 2.7 times the overall Ask HN average of 14.04, in the case of the 15:00 hour), a second window opens in the early evening around 20:00 to 21:00, with about 1.5 times the average number of comments at 20:00. Readers located outside the Eastern US time zone will need to adjust their posting time to account for the time difference.
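As a rough illustration of that time adjustment, the sketch below converts the two suggested hours from Eastern Time to a reader's local time using fixed UTC offsets. It assumes Eastern Standard Time (UTC-5) and a hypothetical reader at UTC+1, and it ignores daylight saving time on both sides, so the result can be off by an hour for part of the year. `to_reader_time` is a made-up helper name.

```python
import datetime as dt

# Assumptions: Eastern Standard Time is UTC-5 and the reader is at UTC+1;
# daylight saving time is ignored on both sides.
EASTERN = dt.timezone(dt.timedelta(hours=-5))
READER = dt.timezone(dt.timedelta(hours=1))


def to_reader_time(hour, source=EASTERN, target=READER):
    """Convert an hour of the day from the source zone to the target zone."""
    # The date is arbitrary; only the time of day matters here.
    moment = dt.datetime(2016, 1, 15, hour, 0, tzinfo=source)
    return moment.astimezone(target).strftime('%H:%M')


for hour in (15, 20):
    print('{:02d}:00 Eastern -> {} local'.format(hour, to_reader_time(hour)))
```

Under these assumptions, 15:00 Eastern corresponds to 21:00 for the reader, and 20:00 Eastern to 02:00 the following day.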