Hacker News is a site that is extremely popular in technology and startup circles. As a result, posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors.
Titles that begin with either "Ask HN" or "Show HN" are particularly interesting. "Ask HN" posts usually pose a question to the Hacker News community, while "Show HN" posts share a project, a product, or just something interesting with the community.
So it is interesting to find out which of the two receives more comments on average: the "Ask HN" posts or the "Show HN" posts? And do posts created at a certain time receive more comments on average?
Let's explore together! If you are interested in knowing more about the dataset, please visit here
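import csv

# Read the whole file into a list of rows; the with block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    all_rows = list(csv.reader(opened_file))

headers = all_rows[0]  # the first row holds the column names
hn = all_rows[1:]      # the remaining rows are the posts
print(headers)
print(hn[:5])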
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
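As an optional alternative to indexing rows by position, the header row can be used to key each row by column name. The sketch below is only an illustration: `read_posts` and `sample_csv` are hypothetical names, and the in-memory sample (two rows copied from the output above) stands in for `hacker_news.csv`.

```python
import csv
import io

# Hypothetical in-memory stand-in for hacker_news.csv (assumption: same header row).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

def read_posts(file_obj):
    """Return one dict per data row, keyed by the names in the header row."""
    return list(csv.DictReader(file_obj))

posts = read_posts(sample_csv)
# Columns are now addressed by name instead of by numeric index
print(posts[0]['title'], posts[0]['num_comments'])
```

With this layout, `row['num_comments']` replaces `row[4]`, which can make later cells easier to read. The rest of this notebook keeps the list-of-lists form.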
ask_posts = []
show_posts = []
other_posts = []

# Sort each post into one of three lists based on how its title starts
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('There are {} ask posts'.format(len(ask_posts)))
print('There are {} show posts'.format(len(show_posts)))
print('There are {} other posts'.format(len(other_posts)))
There are 1744 ask posts
There are 1162 show posts
There are 17194 other posts
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
def cal_num(posts, index):
    # Sum the integer values found in column `index` across all posts
    total_comments = 0
    for row in posts:
        total_comments += int(row[index])
    return total_comments

# num_comments is column 4
total_ask_comments = cal_num(ask_posts, 4)
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = cal_num(show_posts, 4)
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
14.038417431192661
10.31669535283993
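Comparing the two averages above, we can say that "Ask HN" posts receive more comments on average (about 14.04 per post, versus about 10.32 for "Show HN" posts).
Next, we check whether ask posts created at a certain time are more likely to attract comments. We will count the posts and the comments for each hour of the day, then calculate the average number of comments per post for each hour.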
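The average used in this comparison can also be sketched with the standard library's `statistics.mean`. This is a minimal illustration, not the notebook's own method: `average_comments` and `sample_posts` are hypothetical names, and the two sample rows are copied from the dataset preview shown earlier.

```python
from statistics import mean

def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column (stored as strings) over posts."""
    return mean(int(row[comments_index]) for row in posts)

# Two rows taken from the dataset preview; assumption: num_comments is column 4.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time',
     'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
     '39', '10', 'josep2', '1/26/2016 19:30'],
]

print(average_comments(sample_posts))  # (52 + 10) / 2
```

Note that `mean` raises `statistics.StatisticsError` on an empty list, so a category with no posts would need to be handled separately.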
import datetime as dt

# Keep only the creation time and the comment count of each ask post
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comment = row[4]
    result_list.append([created_at, int(num_comment)])
Find the number of posts and number of comments for each hour
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    row_dt = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = row_dt.strftime('%H')  # extract the hour from the datetime
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print(counts_by_hour)
print(comments_by_hour)
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
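The per-hour tally above can be written more compactly with `collections.defaultdict`, which removes the need for the "first time we see this hour" branch. The following is a sketch under stated assumptions: `tally_by_hour` and `sample` are hypothetical names, and the rows follow the `[created_at, num_comments]` layout of `result_list`.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(rows, date_format='%m/%d/%Y %H:%M'):
    """Return (posts per hour, comments per hour) for [created_at, num_comments] rows."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Small illustrative sample; timestamps are taken from the dataset preview.
sample = [['8/4/2016 11:52', 52], ['6/17/2016 0:01', 1], ['9/30/2015 4:12', 2]]
counts, comments = tally_by_hour(sample)
print(counts)
print(comments)
```

Converting back to plain `dict` objects at the end keeps the return values interchangeable with `counts_by_hour` and `comments_by_hour`.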
# Average number of comments per post for each hour
avg_by_hour = []
for key in counts_by_hour:
    avg = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg])
Inspect the list of average comments per post for each hour
print(avg_by_hour)
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
Swap the two columns in the list of hourly averages so that it can be sorted by the average.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
Sort the swapped list in descending order to find the hours with the highest average number of comments per post
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
template = '{HourMinute}: {Avg_per_post:.2f} average comments per post'
for row in sorted_swap:
    hour = dt.datetime.strptime(row[1], '%H')  # parse the hour string into a datetime
    pt = hour.strftime('%H:00')  # format the hour back into a string
    print(template.format(HourMinute=pt, Avg_per_post=row[0]))
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post
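The swap-then-sort approach used for the ranking above works because lists are compared element by element. An alternative is to sort the `[hour, average]` pairs directly with a `key` function. This is a minimal sketch: `rank_hours` and `sample_avg` are hypothetical names, and the three pairs are copied from the averages printed earlier.

```python
def rank_hours(avg_by_hour):
    """Return [hour, average] pairs sorted from highest to lowest average."""
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

# Three pairs taken from the averages computed earlier in the notebook.
sample_avg = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
]

for hour, avg in rank_hours(sample_avg):
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

Sorting with a key leaves the original column order intact, so no swapped copy of the list is needed.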
Note that the times above are in US Eastern Time (EST). As we live in Europe, it is useful to convert EST to Central European Time (CET), which is six hours ahead.
template = '{HourMinute}: {Avg_per_post:.2f} average comments per post'
for row in sorted_swap:
    hour = dt.datetime.strptime(row[1], '%H')  # parse the hour string into a datetime
    hour = hour + dt.timedelta(hours=6)  # shift from EST to CET
    pt = hour.strftime('%H:00')  # format the hour back into a string
    print(template.format(HourMinute=pt, Avg_per_post=row[0]))
21:00: 38.59 average comments per post
08:00: 23.81 average comments per post
02:00: 21.52 average comments per post
22:00: 16.80 average comments per post
03:00: 16.01 average comments per post
19:00: 14.74 average comments per post
16:00: 13.44 average comments per post
20:00: 13.23 average comments per post
00:00: 13.20 average comments per post
23:00: 11.46 average comments per post
07:00: 11.38 average comments per post
17:00: 11.05 average comments per post
01:00: 10.80 average comments per post
14:00: 10.25 average comments per post
11:00: 10.09 average comments per post
18:00: 9.41 average comments per post
12:00: 9.02 average comments per post
06:00: 8.13 average comments per post
05:00: 7.99 average comments per post
13:00: 7.85 average comments per post
09:00: 7.80 average comments per post
10:00: 7.17 average comments per post
04:00: 6.75 average comments per post
15:00: 5.58 average comments per post
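The six-hour shift used for this conversion can also be expressed with timezone-aware datetimes, which makes the source and target zones explicit and carries the date along when the conversion crosses midnight. The sketch below is an illustration under assumptions: `EST`, `CET`, and `to_cet` are hypothetical names, and the fixed offsets (UTC-5 and UTC+1) deliberately ignore daylight saving time, just as the fixed shift does.

```python
import datetime as dt

# Assumption: fixed offsets, no daylight saving time.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
CET = dt.timezone(dt.timedelta(hours=1), 'CET')

def to_cet(created_at, date_format='%m/%d/%Y %H:%M'):
    """Parse an EST timestamp string and return it as a CET-aware datetime."""
    naive = dt.datetime.strptime(created_at, date_format)
    return naive.replace(tzinfo=EST).astimezone(CET)

# Timestamps taken from the dataset preview.
print(to_cet('8/4/2016 11:52').strftime('%m/%d/%Y %H:%M %Z'))
print(to_cet('1/26/2016 19:30').strftime('%m/%d/%Y %H:%M %Z'))
```

The second timestamp lands on the following calendar day in CET, a detail that is lost when only the hour is kept.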
Referring to the average comments per post for each hour (in CET), we find that ask posts created at around 9pm, 8am, 2am, 10pm and 3am receive the most comments on average. Posts created in the middle of the night (3am) also have a good chance of receiving comments because