Hacker News Site: Posts That Get the Most Comments
Objective
In this guided project, we study posts from Hacker News. The two types of posts we look into are:
Ask HN: users submit posts to ask the Hacker News community a specific question.
Show HN: users submit posts to show the Hacker News community a project, product, or something interesting.
The data set is from Kaggle.com.
The original data set from Kaggle was extracted in 2016 and contains about 300,000 rows. We work with only about 20,000 rows, because posts without comments were removed and the remaining submissions were then randomly sampled.
For the purpose of this project, we compare the two types of posts to determine the following:
Do Ask HN or Show HN posts receive more comments on average?
Importing the data
import csv
with open('hacker_news.csv') as hn_posts:
    hn=list(csv.reader(hn_posts))
#separating the header from the data for ease of use
hn_header=hn[0]
hn=hn[1:]
print(hn_header)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Since we are only concerned with post titles beginning with Ask HN or Show HN, we will separate the two types of posts into different lists.
ask_posts=[]
show_posts=[]
other_posts=[]
for row in hn:
    title=row[1]
    # Compare in lowercase so that 'Ask HN', 'ask hn', etc. all match
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
Let's check the number of posts of each type.
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744 1162 17194
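As a quick sanity check, the three counts should add up to the number of rows in the data set. The sketch below uses only the counts printed above instead of re-reading the file; the variable names are illustrative and do not appear in the original code.

```python
# Counts copied from the printed output above
ask_count = 1744
show_count = 1162
other_count = 17194

# The three lists are mutually exclusive, so their sizes sum to the row count
total_rows = ask_count + show_count + other_count
ask_share = ask_count / total_rows
show_share = show_count / total_rows

print(total_rows)
print('Ask HN share: {:.1%}'.format(ask_share))
print('Show HN share: {:.1%}'.format(show_share))
```

The total works out to 20,100 rows, which matches the "about 20,000" figure quoted in the objective, and it shows that Ask HN and Show HN posts together make up only a small fraction of the sample.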
Determining which type of posts receive more comments
Let's see whether Ask HN or Show HN posts receive more comments on average.
total_ask_comments=0
total_show_comments=0
for row in hn:
    num_comments=row[4]
    title=row[1]
    # Add each post's comment count to the total for its type
    if title.lower().startswith('show hn'):
        total_show_comments=total_show_comments+int(num_comments)
    elif title.lower().startswith('ask hn'):
        total_ask_comments=total_ask_comments+int(num_comments)
avg_ask_comments=total_ask_comments/len(ask_posts)
avg_show_comments=total_show_comments/len(show_posts)
print('For Ask HN posts, there are on average {:.2f} comments.'.format(avg_ask_comments))
print('For Show HN posts, there are on average {:.2f} comments.'.format(avg_show_comments))
For Ask HN posts, there are on average 14.04 comments.
For Show HN posts, there are on average 10.32 comments.
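Because ask_posts and show_posts already hold the separated rows, the averages could also be computed from those lists directly instead of scanning hn a second time. The following is a minimal sketch of that idea. The helper name average_comments and the sample rows are hypothetical, chosen only to keep the example self-contained.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column for a list of rows."""
    if not posts:
        # Avoid dividing by zero when a list is empty
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as hn_header
sample_posts = [
    ['1', 'Ask HN: Example question one', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: Example question two', '', '3', '4', 'user_b', '1/2/2016 15:30'],
]

print('{:.2f}'.format(average_comments(sample_posts)))
```

With the real data, calling the helper on ask_posts and on show_posts should give the same averages as the loop above, since both approaches sum the same column over the same rows.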
Let's find the maximum number of comments for each type of post.
max_ask_post=[]
for post in ask_posts:
    comment=int(post[4])
    max_ask_post.append(comment)
max_show_post=[]
for post in show_posts:
    comment=int(post[4])
    max_show_post.append(comment)
print('For Ask HN posts, there is a maximum of {} comments'.format(max(max_ask_post)))
print('For Show HN posts, there is a maximum of {} comments'.format(max(max_show_post)))
For Ask HN posts, there is a maximum of 947 comments
For Show HN posts, there is a maximum of 306 comments
Posts on Ask HN receive more comments on average than Show HN
As it turns out, Ask HN posts receive on average about 1.4 times as many comments as Show HN posts (14.04 versus 10.32), and the most-commented Ask HN post received about 3 times as many comments as the most-commented Show HN post (947 versus 306).
Since Ask HN posts tend to receive more comments, we will focus the rest of our analysis on Ask HN posts.
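The two ratios quoted above can be checked with a few lines of arithmetic. This is only a sketch that re-uses the rounded values printed earlier; the names max_ask_comments, max_show_comments, avg_ratio, and max_ratio are introduced here for illustration.

```python
# Values copied from the printed results above
avg_ask_comments = 14.04
avg_show_comments = 10.32
max_ask_comments = 947
max_show_comments = 306

# How many times as many comments Ask HN gets compared with Show HN
avg_ratio = avg_ask_comments / avg_show_comments
max_ratio = max_ask_comments / max_show_comments

print('Average ratio: {:.2f}'.format(avg_ratio))
print('Maximum ratio: {:.2f}'.format(max_ratio))
```

Both ratios round to the figures used in the text: roughly 1.4 for the averages and roughly 3 for the maximums.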
What is the best time for a post to receive more comments
Let's see if Ask HN posts created at a certain time are more likely to attract comments.
We will do this in the following steps:
1. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
# Import the datetime module, since we are working with dates and times
import datetime as dt
result_list=[]
for post in ask_posts:
    created_at=post[6]
    num_comments=int(post[4])
    result=created_at,num_comments
    result_list.append(result)
counts_by_hour={}
comments_by_hour={}
for result in result_list:
    num_comments=result[1]
    time=result[0]
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'
    time=dt.datetime.strptime(time,'%m/%d/%Y %H:%M').strftime('%H')
    if time not in counts_by_hour:
        counts_by_hour[time]=1
        comments_by_hour[time]=num_comments
    else:
        counts_by_hour[time]+=1
        comments_by_hour[time]+=num_comments
print(comments_by_hour)
print(counts_by_hour)
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
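To see how the hour keys in these dictionaries are produced, the short sketch below runs the same strptime/strftime step on a few created_at values taken from the sample rows printed at the start of the notebook. The names sample_times and hours are illustrative only.

```python
import datetime as dt

# created_at values taken from the sample rows printed earlier
sample_times = ['8/4/2016 11:52', '1/26/2016 19:30', '6/17/2016 0:01']

hours = []
for created_at in sample_times:
    # '%m/%d/%Y %H:%M' matches strings such as '8/4/2016 11:52'
    parsed = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M')
    # '%H' formats the hour as two digits, so hour 0 becomes '00'
    hours.append(parsed.strftime('%H'))

print(hours)
```

Because strftime('%H') always zero-pads, a post created at 0:01 is counted under the key '00', which is why the keys above look like '09' and '02' instead of '9' and '2'.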
2. Calculate the average number of comments ask posts receive by hour created.
avg_by_hour=[]
for hr in comments_by_hour:
    avg_by_hour.append([hr,comments_by_hour[hr]/counts_by_hour[hr]])
avg_by_hour
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
Let's sort the averages and print the five highest values in a format that's easier to read.
# Swap each pair to (average, hour) so that sorting orders by the average
swap_avg_by_hour=[]
for hour in avg_by_hour:
    swap=hour[1], hour[0]
    swap_avg_by_hour.append(swap)
sorted_swap=sorted(swap_avg_by_hour,reverse=True)
print('Top 5 Hours for Ask Posts Comments\n')
for hour in sorted_swap[:5]:
    top_hour=dt.datetime.strptime(hour[1],'%H')
    top_hour=top_hour.strftime('%H:%M')
    top_comment=hour[0]
    print(f"At {top_hour} there are {top_comment:.2f} average comments per post.")
Top 5 Hours for Ask Posts Comments

At 15:00 there are 38.59 average comments per post.
At 02:00 there are 23.81 average comments per post.
At 20:00 there are 21.52 average comments per post.
At 16:00 there are 16.80 average comments per post.
At 21:00 there are 16.01 average comments per post.
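The swap step is one way to sort by the average. An alternative, shown here only as a sketch, is to pass a key function to sorted so the [hour, average] pairs can stay in their original order. The list below is a small subset of avg_by_hour copied from the output above, and sorted_by_avg is a name introduced for this example.

```python
# A few [hour, average] pairs copied from avg_by_hour above
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the second element (the average), highest first
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_by_avg[:3]:
    print('At {}:00 there are {:.2f} average comments per post.'.format(hour, avg))
```

On this subset the first three hours come out as 15, 02, and 20, the same order as in the top-five list above.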
Conclusions
After our simple analysis we have discovered the following:
Ask HN posts receive more comments on average than Show HN posts, and the most-commented Ask HN post has about 3 times as many comments as the most-commented Show HN post.
Ask HN posts created at 15:00 US Eastern Time receive the highest average number of comments: about 2.3 times as many as posts created one hour later at 16:00, and about 1.6 times as many as posts created at 02:00, the hour with the second-highest average.
One thing to keep in mind is that the data we analyzed excludes posts without any comments.
As a result, our conclusion should be that, among posts that received one or more comments, Ask HN posts received more comments on average than Show HN posts, and Ask HN posts created between 15:00 and 16:00 US Eastern Time received the most comments on average. Further analysis with a more complete data set might change this conclusion.