Welcome to Hacker News!! Hacker News is a website that is popular in technology and start-up circles. Users submit stories, known as posts.
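This analysis will compare Ask HN and Show HN posts to determine which type receives more comments on average, and then determine which hour of the day Ask HN posts receive the most comments and points.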
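Users use Ask HN posts to ask the Hacker News community a specific question. Some examples of Ask HN posts in the dataset are 'Ask HN: How to improve my personal website?' and 'Ask HN: Someone offered to buy my browser extension from me. What now?'.
Users use Show HN posts to show the Hacker News community a project, product or something interesting. Some examples of Show HN posts in the dataset are 'Show HN: Something pointless I made' and 'Show HN: Shanhu.io, a programming playground powered by e8vm'.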
The Hacker News data can be accessed at Hacker News link. The dataset has the following columns: id, title, url, num_points, num_comments, author, and created_at.
#Read in the Hacker News file
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
#show the 1st 5 rows
print('The first five rows of the Hacker News dataset')
print('\n')
hn[:5]
The first five rows of the Hacker News dataset
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
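The load step above could also be written with a context manager, so the file is closed as soon as the rows have been read. This is a minimal sketch rather than the notebook's own code: load_rows is a hypothetical helper name, and the UTF-8 encoding is an assumption about the file.

```python
import csv

def load_rows(path):
    """Return every row of a CSV file as a list of lists of strings."""
    # Assumption: the file is UTF-8 encoded. newline='' is the setting the
    # csv module documentation recommends for file objects passed to reader().
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))
```

With the dataset in the working directory, the call would be hn = load_rows('hacker_news.csv').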
Since the header row of the dataset holds the column names, it is not needed for the analysis and will be moved into a separate list.
#assign 1st row, column headers, of dataset to headers
headers = hn[0]
hn = hn[1:]
print(headers)
print('\n')
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
Now that the Hacker News dataset has been loaded, the next step is to generate lists of only those posts that are Ask HN or Show HN posts.
This is achieved by looping through the Hacker News dataset and filtering on titles that begin with Ask HN or Show HN, ignoring case. All other posts will be saved as Other.
#create new lists containing posts whose titles begin with Ask HN and Show HN
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('The number of Ask HN posts is', len(ask_posts))
print('The number of Show HN posts is', len(show_posts))
print('The number of Other posts is', len(other_posts))
The number of Ask HN posts is 1744
The number of Show HN posts is 1162
The number of Other posts is 17194
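The filtering rule above can be checked on a few titles taken from the dataset sample. This is a small sketch, not part of the original notebook; classify_title is a hypothetical helper name introduced for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # make the prefix test case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Titles copied from the rows printed in this notebook.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Because the test uses startswith, a title that only mentions Ask HN somewhere in the middle is counted as Other.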
Explore Ask HN and Show HN Posts
#print the first five rows from the Ask HN posts
print('These are the 1st five rows of the Ask HN posts')
print('\n')
print(ask_posts[:5])
These are the 1st five rows of the Ask HN posts
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
#print the first five rows from the Show HN posts
print('These are the 1st five rows of the Show HN posts')
print('\n')
print(show_posts[:5])
These are the 1st five rows of the Show HN posts
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
One aspect of the analysis is to determine which type of post, Ask HN or Show HN, receives more comments. To determine this, the average number of comments per post will be calculated.
#calculate average number of comments per post for
#Ask HN and Show HN posts
total_ask_comments = 0
total_show_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
#divide by the number of posts, not by the last post's comment count
avg_ask_comments = total_ask_comments / len(ask_posts)
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print('The average number of comments per Ask HN post is {:.2f}'.format(avg_ask_comments))
print('The average number of comments per Show HN post is {:.2f}'.format(avg_show_comments))
print('\n')
print('The total number of comments on Ask HN posts is', total_ask_comments)
print('The total number of comments on Show HN posts is', total_show_comments)
The average number of comments per Ask HN post is 14.04
The average number of comments per Show HN post is 10.32
The total number of comments on Ask HN posts is 24483
The total number of comments on Show HN posts is 11988
The total number of comments for Ask HN and Show HN posts is a snapshot that depends on when the dataset was downloaded and on how many posts of each type it contains. Since Hacker News content changes regularly, the average number of comments per post is a better gauge than the total number of comments for each type of post.
At the time the Hacker News dataset was downloaded for this analysis, Ask HN posts averaged about 14 comments per post and Show HN posts about 10, so Ask HN posts received roughly 36% more comments on average. Ask HN posts also have the larger total (24483 versus 11988), partly because the dataset contains more of them (1744 versus 1162).
By this measure, Ask HN posts attract more discussion than Show HN posts. To conclude whether this is a lasting trend, the analysis would have to be repeated on a dataset that includes newer Ask HN and Show HN posts.
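An average of this kind is the total number of comments divided by the number of posts. The sketch below illustrates that on three Ask HN rows copied from the sample printed earlier, and then re-derives the two dataset-wide averages from the totals and post counts reported in this notebook. average_comments is a hypothetical helper name, not part of the original code.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column over rows; 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Three Ask HN rows copied from the sample output above (6, 29 and 1 comments).
sample = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]
sample_avg = average_comments(sample)
print(sample_avg)  # 12.0

# Totals and post counts as reported in the notebook outputs.
ask_avg = 24483 / 1744
show_avg = 11988 / 1162
print(round(ask_avg, 2), round(show_avg, 2))  # 14.04 10.32
```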
Now that it's been determined that Ask HN posts receive more comments on average than Show HN posts, the next step is to calculate the average number of comments per Ask HN post for each hour of the day.
First, determine the number of posts created in each hour and the total number of comments those posts received.
Calculate Number of Ask HN Posts per Hour & Total Number of Comments
#calculate the number of Ask HN posts created per hour and
#calculate the total number of comments
import datetime as dt
#generate list to include created_at (time post created) and
#number of comments
result_list = []
for row in ask_posts:
    created_time = row[6]
    n_comments = int(row[4])
    result_list.append([created_time, n_comments])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    hr = row[0]
    comment = row[1]
    #extract hour from created_at column
    dt_hr = dt.datetime.strptime(hr, "%m/%d/%Y %H:%M")
    hr_only = dt_hr.strftime("%H")
    #populate counts_by_hour and comments_by_hour dictionaries
    #with hour from datetime object as the dictionary key
    if hr_only not in counts_by_hour:
        counts_by_hour[hr_only] = 1
        comments_by_hour[hr_only] = comment
    else:
        counts_by_hour[hr_only] += 1
        comments_by_hour[hr_only] += comment
print('The number of counts by hour is', counts_by_hour)
print('\n')
print('The number of comments by hour is', comments_by_hour)
The number of counts by hour is {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
The number of comments by hour is {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
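The hourly tally above can also be sketched with collections.defaultdict, which removes the need to test whether an hour key already exists. This is an alternative formulation under stated assumptions, not the notebook's code: tally_by_hour is a hypothetical helper, and the last sample pair is made up so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs):
    """Take (created_at, n_comments) pairs and return two dicts keyed by 'HH':
    the number of posts per hour and the total comments per hour."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in pairs:
        parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")  # zero-padded 24-hour string, e.g. '09'
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)

sample_pairs = [
    ('8/16/2016 9:55', 6),     # from the Ask HN sample rows
    ('11/22/2015 13:43', 29),  # from the Ask HN sample rows
    ('8/16/2016 9:05', 4),     # hypothetical row, added to share the 09 hour
]
counts, comments = tally_by_hour(sample_pairs)
print(counts)    # {'09': 2, '13': 1}
print(comments)  # {'09': 10, '13': 29}
```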
Calculate Average Number of Comments per Ask HN Post by Hour
Now that the number of Ask HN posts per hour and the number of comments per hour have been determined, the next step is to use this information to calculate the average number of comments per Ask HN post for each hour of the day.
#calculate the average number of comments per post for posts
#created during each hour of the day
avg_by_hour = []
#use the dictionaries previously calculated for number of comments per hour
#and number of counts per hour
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print('The average number of comments per post by hour of day is', '\n', avg_by_hour)
The average number of comments per post by hour of day is
 [['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
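Each average is the hour's total comments divided by the hour's post count. The sketch below repeats that division for three hours, with the counts and totals copied from the dictionaries printed above. Storing the result in a dictionary rather than a list of lists is a choice made here for illustration.

```python
# Counts and comment totals for three hours, copied from the notebook output.
sample_counts = {'15': 116, '02': 58, '20': 80}
sample_comments = {'15': 4477, '02': 1381, '20': 1722}

# Average comments per post for each hour present in both dictionaries.
sample_avg_by_hour = {
    hour: sample_comments[hour] / sample_counts[hour]
    for hour in sample_comments
}

for hour, avg in sample_avg_by_hour.items():
    print(hour, avg)
```

The values agree with the corresponding entries of avg_by_hour, for example 4477 / 116 for the 15 hour.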
For ease of readability, sort the averages in descending order to determine which hour of the day receives the highest average number of comments per post.
#sort avg_by_hour in descending order with the average number
#of comments as the first element in the list
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask HN Posts Comments')
print('Time is in EST')
#format the hour as a 12-hour clock time and the average to two decimals
for loop in sorted_swap:
    dt_object = dt.datetime.strptime(loop[1], '%H')
    loop[1] = dt_object.strftime('%I:%M %p')
    loop[0] = "{:.2f} average comments per Ask HN post".format(loop[0])
#swap back so the hour is the first element
resorted = []
for row in sorted_swap:
    resorted.append([row[1], row[0]])
print(resorted[:5])
Top 5 Hours for Ask HN Posts Comments
Time is in EST
[['03:00 PM', '38.59 average comments per Ask HN post'], ['02:00 AM', '23.81 average comments per Ask HN post'], ['08:00 PM', '21.52 average comments per Ask HN post'], ['04:00 PM', '16.80 average comments per Ask HN post'], ['09:00 PM', '16.01 average comments per Ask HN post']]
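The swap, sort and swap-back steps above could be collapsed by giving sorted a key function that selects the average. This is a minimal sketch of that alternative, not the notebook's method: top_hours is a hypothetical helper, the sample values are copied from avg_by_hour, and the AM/PM text assumes an English or C locale.

```python
import datetime as dt

def top_hours(avg_by_hour, n=5):
    """Return the n hours with the highest average as (label, rounded average) tuples."""
    # Sort on the second element (the average) without reordering each pair.
    ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
    result = []
    for hour, avg in ranked[:n]:
        label = dt.datetime.strptime(hour, '%H').strftime('%I:%M %p')
        result.append((label, round(avg, 2)))
    return result

# Four [hour, average] pairs copied from the avg_by_hour output above.
sample_avgs = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]
top = top_hours(sample_avgs, n=3)
print(top)  # [('03:00 PM', 38.59), ('02:00 AM', 23.81), ('04:00 PM', 16.8)]
```

One difference from the swap approach: if two hours had exactly the same average, this version keeps them in their original order instead of ordering them by hour string.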
*Ask HN* Post with Most Points per Hour
Now that it's been established which hour of the day has the most comments on average, which hour of the day has the most points?
#calculate the total number of points per hour for Ask HN posts
tot_pts_hr = {}
for row in ask_posts:
    hr = row[6]
    pts = int(row[3])
    dt_hr = dt.datetime.strptime(hr, "%m/%d/%Y %H:%M")
    hr_only = dt_hr.strftime("%I %p")
    if hr_only not in tot_pts_hr:
        tot_pts_hr[hr_only] = pts
    else:
        tot_pts_hr[hr_only] += pts
print('The total number of points per hour is', tot_pts_hr)
The total number of points per hour is {'09 AM': 329, '01 PM': 2062, '10 AM': 1102, '02 PM': 1282, '04 PM': 2522, '11 PM': 581, '12 PM': 782, '05 PM': 1941, '03 PM': 3479, '09 PM': 1721, '08 PM': 1151, '02 AM': 793, '06 PM': 1741, '03 AM': 374, '05 AM': 552, '07 PM': 1513, '01 AM': 700, '10 PM': 511, '08 AM': 515, '04 AM': 389, '12 AM': 451, '06 AM': 591, '07 AM': 361, '11 AM': 825}
Ask HN posts created during the 3:00 PM hour (EST) received the highest average number of comments per post (38.59). The 3:00 PM hour also received the most total points (3479). Therefore, it seems the optimal time to create an Ask HN post is around 3:00 PM EST.
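The point totals above depend on how many posts were created in each hour, so a per-post average may be a fairer comparison. The sketch below is an illustrative check on three hours only, with points and post counts copied from the outputs above. average_points is a hypothetical helper, and both dictionaries are assumed to be keyed by the same 24-hour string (the notebook keys the point totals by '%I %p' instead).

```python
def average_points(points_by_hour, counts_by_hour):
    """Average points per post for every hour that has at least one post."""
    return {
        hour: points_by_hour[hour] / counts_by_hour[hour]
        for hour in points_by_hour
        if counts_by_hour.get(hour)  # skip hours that are missing or have zero posts
    }

# Totals for 1 PM, 3 PM and 4 PM, re-keyed by 24-hour string (an assumption).
sample_points = {'13': 2062, '15': 3479, '16': 2522}
sample_post_counts = {'13': 85, '15': 116, '16': 108}

avg_points = average_points(sample_points, sample_post_counts)
best_hour = max(avg_points, key=avg_points.get)
print(best_hour, round(avg_points[best_hour], 2))
```

For these three hours the 3 PM hour still comes out on top; the remaining hours would need the same check before drawing a conclusion about average points.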