Hacker News is a site where user-submitted stories (known as "posts") are voted and commented on. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of its listings can attract hundreds of thousands of visitors as a result.
You can find the data set here, but we'll explore a version that has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. The column names appear in the first row of the file.
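The reduction step itself isn't shown in this notebook. A minimal sketch of how such a filter-then-sample pass could look (the function name is an assumption; the column positions match hacker_news.csv, with `num_comments` at index 4):

```python
import random

def reduce_dataset(rows, sample_size, seed=1):
    """Keep only rows with at least one comment, then randomly sample them.

    Assumes num_comments is at column index 4, as in hacker_news.csv.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    random.seed(seed)  # fixed seed so the sample is reproducible
    return random.sample(commented, min(sample_size, len(commented)))

# Tiny synthetic example (ids and rows are made up):
rows = [
    ['1', 'A post', '', '10', '3', 'alice', '8/4/2016 11:52'],
    ['2', 'No comments', '', '5', '0', 'bob', '8/4/2016 12:00'],
    ['3', 'Another', '', '2', '1', 'carol', '8/4/2016 13:00'],
]
print(reduce_dataset(rows, 2))
```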
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.
The aim of this work is to answer the following question: do Ask HN or Show HN posts receive more comments on average?
Let's load our dataset:
from csv import reader
file_open = open("hacker_news.csv")
data_read = reader(file_open)
hn = list(data_read)
hn[0:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
The first row contains the column headers. We separate the headers from the data:
headers = hn[0]
hn = hn[1:]
print('Headers:')
print(headers, '\n')
print('Data:')
hn[0:5]
Headers: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] Data:
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
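Since the rows are plain lists, columns must be accessed by position. As an aside, `csv.DictReader` would give each row as a dictionary keyed by the header names, which can be less error-prone; a sketch reading an inline two-line sample in the same format as hacker_news.csv:

```python
import csv
import io

# An in-memory sample with the same header and row format as hacker_news.csv
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
for row in csv.DictReader(sample):
    print(row['title'], row['num_comments'])
```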
Now we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the rows for those titles.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
title = row[1]
if title.lower().startswith('ask hn'):
ask_posts.append(row)
elif title.lower().startswith('show hn'):
show_posts.append(row)
else:
other_posts.append(row)
print('Number of Ask HN posts: ', len(ask_posts))
print('Number of Show HN posts: ', len(show_posts))
print('Number of Other HN posts: ', len(other_posts))
print('Total: ', len(hn))
Number of Ask HN posts: 1744 Number of Show HN posts: 1162 Number of Other HN posts: 17194 Total: 20100
The split is consistent, since 1744 + 1162 + 17194 = 20100.
Below are the first five rows in the ask_posts list of lists:
print('Ask HN:')
ask_posts[0:5]
Ask HN:
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
Below are the first five rows in the show_posts list of lists:
print('Show HN:')
show_posts[0:5]
Show HN:
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
# Ask HN
total_ask_comments = 0
for row_ask in ask_posts:
num_comments = int(row_ask[4])
total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Number of comments on average, Ask HN: ', avg_ask_comments)
# Show HN
total_show_comments = 0
for row_show in show_posts:
num_comments = int(row_show[4])
total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('Number of comments on average, Show HN: ', avg_show_comments)
Number of comments on average, Ask HN: 14.038417431192661 Number of comments on average, Show HN: 10.31669535283993
So we can see that ask posts receive more comments on average: about 14.0 versus about 10.3 for show posts.
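The two loops above repeat the same logic. As an aside, a small helper would avoid the duplication (the function name and the mini-lists below are hypothetical; the row format matches the dataset, with `num_comments` at index 4):

```python
def avg_comments(posts, comments_index=4):
    """Average number of comments across a list of post rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical mini-list in the same row format as the dataset:
ask = [['1', 'Ask HN: a?', '', '1', '6', 'u1', '8/16/2016 9:55'],
       ['2', 'Ask HN: b?', '', '1', '2', 'u2', '8/16/2016 9:56']]
print(avg_comments(ask))  # → 4.0
```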
Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
Next, we'll determine whether ask posts created at a certain time of day are more likely to attract comments. To do this, we'll count the number of ask posts created in each hour of the day, along with the total number of comments those posts received, and then compute the average number of comments per post for each hour.
import datetime as dt
result_list = []
for row_ask in ask_posts:
result_list.append([row_ask[6], int(row_ask[4])])
counts_by_hour = {} # frequency table of the amount of posts by hour
comments_by_hour = {} # frequency table of the number of comments by hour
for row in result_list:
    created_at = dt.datetime.strptime(row[0], '%m/%d/%Y %H:%M')
    hour = created_at.strftime('%H')
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
print(counts_by_hour, '\n')
print(comments_by_hour)
{'01': 60, '09': 45, '16': 108, '12': 73, '04': 47, '13': 85, '05': 46, '23': 68, '11': 58, '14': 107, '21': 109, '02': 58, '10': 59, '15': 116, '03': 54, '08': 48, '20': 80, '19': 110, '18': 109, '22': 71, '17': 100, '06': 44, '00': 55, '07': 34} {'01': 683, '09': 251, '16': 1814, '12': 687, '04': 337, '13': 1253, '05': 464, '23': 543, '11': 641, '14': 1416, '21': 1745, '02': 1381, '10': 793, '15': 4477, '03': 421, '08': 492, '20': 1722, '19': 1188, '18': 1439, '22': 479, '17': 1146, '06': 397, '00': 447, '07': 267}
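The same two tables can be built a little more compactly with `collections.Counter`, which returns 0 for missing keys and so removes the if/else branch. A sketch over the same `[created_at, num_comments]` row format (the sample data below is made up):

```python
import datetime as dt
from collections import Counter

def tables_by_hour(result_list):
    """Build per-hour post counts and comment totals from [created_at, n] pairs."""
    counts = Counter()
    comments = Counter()
    for created_at, num_comments in result_list:
        hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
        counts[hour] += 1          # missing hours start at 0 automatically
        comments[hour] += num_comments
    return counts, comments

# Hypothetical sample:
counts, comments = tables_by_hour(
    [['8/16/2016 9:55', 6], ['8/16/2016 9:10', 2], ['8/16/2016 15:00', 1]])
print(dict(counts))    # e.g. {'09': 2, '15': 1}
print(dict(comments))
```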
Below, we calculate the average number of comments for posts created during each hour of the day:
avg_by_hour = []
for hour in sorted(counts_by_hour):
avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour
[['00', 8.127272727272727], ['01', 11.383333333333333], ['02', 23.810344827586206], ['03', 7.796296296296297], ['04', 7.170212765957447], ['05', 10.08695652173913], ['06', 9.022727272727273], ['07', 7.852941176470588], ['08', 10.25], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['11', 11.051724137931034], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['15', 38.5948275862069], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532], ['19', 10.8], ['20', 21.525], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.985294117647059]]
Let's make results more readable:
swap_avg_by_hour = []
for row in avg_by_hour:
swap_avg_by_hour.append([round(row[1], 2), row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap
[[38.59, '15'], [23.81, '02'], [21.52, '20'], [16.8, '16'], [16.01, '21'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [13.2, '18'], [11.46, '17'], [11.38, '01'], [11.05, '11'], [10.8, '19'], [10.25, '08'], [10.09, '05'], [9.41, '12'], [9.02, '06'], [8.13, '00'], [7.99, '23'], [7.85, '07'], [7.8, '03'], [7.17, '04'], [6.75, '22'], [5.58, '09']]
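As an aside, the swap step isn't strictly necessary: `sorted` accepts a `key` function, so `avg_by_hour` can be ordered by its second element directly. A sketch on abbreviated sample data:

```python
# Abbreviated sample in the same [hour, average] format as avg_by_hour:
avg_by_hour = [['00', 8.13], ['15', 38.59], ['02', 23.81]]

# Sort by the average (second element), highest first, without swapping:
top = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)
print(top)  # → [['15', 38.59], ['02', 23.81], ['00', 8.13]]
```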
for row in sorted_swap[0:5]:
    hour = dt.datetime.strptime(row[1], '%H').strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour, row[0]))
15:00: 38.59 average comments per post 02:00: 23.81 average comments per post 20:00: 21.52 average comments per post 16:00: 16.80 average comments per post 21:00: 16.01 average comments per post
The greatest number of comments can be expected if we create a post between 3:00 p.m. and 4:00 p.m. Eastern Time, that is, between 23:00 and 0:00 Moscow time.
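Assuming the timestamps are indeed in US Eastern time, the conversion to Moscow time can be checked with the standard library's `zoneinfo` (Python 3.9+); the date below is an arbitrary winter date:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# Best posting hour (15:00) on an arbitrary winter date, US Eastern time:
best = datetime(2016, 1, 15, 15, 0, tzinfo=ZoneInfo("America/New_York"))
print(best.astimezone(ZoneInfo("Europe/Moscow")).strftime('%H:%M'))  # → 23:00
```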
In addition, we could explore whether publishing a post on a working day affects the number of comments, among other follow-up tasks.