Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
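Because the analysis below treats each row as a plain list, it can help to map column names to positions rather than remembering that num_comments sits at index 4. The following is a minimal sketch, assuming the header row printed later in this notebook; COLUMN_INDEX is a hypothetical helper name and is not part of the original code.

```python
# Header row as it appears in the data set.
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row.
COLUMN_INDEX = {name: position for position, name in enumerate(header)}

# One row taken from the notebook output, used here only as sample input.
sample_row = ['12578908', 'Ask HN: What TLD do you use for local development?',
              '', '4', '7', 'Sevrene', '9/26/2016 2:53']

# Values are read from the CSV as strings, so counts must be converted with int().
num_comments = int(sample_row[COLUMN_INDEX['num_comments']])
created_at = sample_row[COLUMN_INDEX['created_at']]
print(num_comments, created_at)
```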
In this project, we’ll compare two types of Hacker News posts: those whose titles begin with Ask HN and those whose titles begin with Show HN.
Users submit Ask HN posts to ask the Hacker News community a specific question, such as “What is the best online course you’ve ever taken?” Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.
We’ll specifically compare these two types of posts to determine the following:
. Do Ask HN or Show HN posts receive more comments on average?
. Do posts created at a certain time of day receive more comments on average?
Note that although the project description mentions a sample of approximately 20,000 rows (obtained by removing all submissions that did not receive any comments and then randomly sampling the rest), the file loaded below is the full data set: it contains 293,119 rows, including posts with zero comments.
First, we’ll read in the data and remove the headers.
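from csv import reader
#Open the file in a with block so that it is closed once the rows are read
with open('/Users/umesh/Downloads/HN_posts_year_to_Sep_20_2016.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
#Separate the header row from the data rows
hn_header = hn[0]
hn = hn[1:]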
print(hn_header)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
hn[0:5]
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
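The header and data split above can also be tried without the file on disk. The sketch below feeds csv.reader an in-memory string instead; the two-line sample_csv text is an assumption made for illustration (it reuses one row from the data set), and the names rows, header, and data are hypothetical.

```python
import io
from csv import reader

# Two lines standing in for the real CSV file (assumed sample, not the full data).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578908,Ask HN: What TLD do you use for local development?,,4,7,Sevrene,9/26/2016 2:53\n"
)

rows = list(reader(sample_csv))

# The first row holds the column names; everything after it is data.
header, data = rows[0], rows[1:]
print(header)
print(data)
```

Note that every field comes back as a string, including the empty url field, which is why the later steps call int() on the comment counts.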
We can see above that the data set contains the title of the posts, the number of comments for each post, and the date the post was created. Let’s start by exploring the number of comments for each type of post.
First, we’ll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print(len(ask_posts + show_posts + other_posts))
9139
10158
273822
293119
print(ask_posts[:3])
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']]
print(show_posts[:3])
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']]
print(other_posts[:3])
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]
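The prefix test used to separate the posts can be isolated into a small function, which makes it easy to check on a few titles. This is only a sketch: post_type is a hypothetical name, and the titles are taken from the outputs above.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on how the title begins."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

titles = [
    "Ask HN: How do you pass on your work when you die?",
    "Show HN: Finding puns computationally",
    "SQLAR the SQLite Archiver",
]
for title in titles:
    print(post_type(title), "|", title)
```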
Now that we have separated ask posts and show posts into different lists, we’ll calculate the average number of comments each type of post receives.
#Calculate the average number of comments `Ask HN` posts receive.
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
print(total_ask_comments)
10.393478498741656
94986
#Calculate the average number of comments `Show HN` posts receive.
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
print(total_show_comments)
4.886099625910612
49633
On average, ask posts receive approximately 10 comments (10.39), whereas show posts receive approximately 5 (4.89). Since ask posts receive more comments on average, we’ll focus our remaining analysis on these posts first.
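The two averaging loops above differ only in the list they read, so the calculation could be written once as a function. The sketch below is one possible way to do that; average_comments and its comments_index parameter are hypothetical names, and the three sample rows are the first ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when there are no posts
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# First three ask posts from the output above (7, 3, and 0 comments).
sample_ask_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]
print(average_comments(sample_ask_posts))
```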
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
. Calculate the average number of comments ask posts receive by hour created.
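Both steps depend on pulling the hour out of the created_at string. As a small sketch of that one operation, using a timestamp from the data set, strptime() parses the string into a datetime object and strftime("%H") returns the hour as a two-digit string; the variable names here are illustrative only.

```python
import datetime as dt

created_at = '9/26/2016 2:53'

# Parse month/day/year and hour:minute into a datetime object.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# "%H" gives the zero-padded hour as a string; .hour gives it as an integer.
hour_label = parsed.strftime("%H")
print(hour_label)   # 02
print(parsed.hour)  # 2
```

The zero-padded string form is what ends up as the dictionary key in the code below, which is why the keys look like '02' rather than 2.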
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append((created_at, num_comments))
result_list[:10]
[('9/26/2016 2:53', 7), ('9/26/2016 1:17', 3), ('9/25/2016 22:57', 0), ('9/25/2016 22:48', 3), ('9/25/2016 21:50', 2), ('9/25/2016 19:30', 1), ('9/25/2016 19:22', 22), ('9/25/2016 17:55', 3), ('9/25/2016 15:48', 0), ('9/25/2016 15:35', 13)]
len(result_list)
9139
result_list[9130:]
[('9/6/2015 15:09', 0), ('9/6/2015 14:53', 6), ('9/6/2015 13:01', 9), ('9/6/2015 12:17', 0), ('9/6/2015 11:27', 0), ('9/6/2015 10:52', 1), ('9/6/2015 10:46', 4), ('9/6/2015 9:36', 20), ('9/6/2015 6:02', 20)]
#Create two empty dictionaries called counts_by_hour and comments_per_hour
counts_by_hour = {}
comments_per_hour = {}
for row in result_list:
    #The creation date is the first element of the row; the comment count is the second.
    created_at = row[0]
    comments_created = row[1]
    #Use the datetime.strptime() method to parse the date string into a datetime object,
    #then use the strftime() method to select just the hour from the datetime object.
    date_obj = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    posts_created = date_obj.strftime("%H")
    #If the hour isn't a key in counts_by_hour:
    ##Create the key in counts_by_hour and set it equal to 1.
    ##Create the key in comments_per_hour and set it equal to the comment number.
    if posts_created not in counts_by_hour:
        counts_by_hour[posts_created] = 1
        comments_per_hour[posts_created] = comments_created
    #If the hour is already a key in counts_by_hour:
    ##Increment the value in counts_by_hour by 1.
    ##Increment the value in comments_per_hour by the comment number.
    else:
        counts_by_hour[posts_created] += 1
        comments_per_hour[posts_created] += comments_created
counts_by_hour
{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
comments_per_hour
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
We now have two dictionaries:
. counts_by_hour: contains the number of ask posts created during each hour of the day.
. comments_per_hour: contains the total number of comments received by ask posts created during each hour.
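As an alternative sketch of the same tallying step, collections.defaultdict removes the need for the if/else check, because a missing key starts at 0. This is not the approach used above, just an equivalent formulation; the four sample tuples are the first entries of result_list, and the names counts and comments are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# First four (created_at, num_comments) pairs from result_list.
sample = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 1:17', 3),
    ('9/25/2016 22:57', 0),
    ('9/25/2016 22:48', 3),
]

counts = defaultdict(int)    # posts created per hour
comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```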
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
#Calculate the average number of comments per post for posts created during each hour of the day
avg_by_hour = []
for row in counts_by_hour:
    average = comments_per_hour[row] / counts_by_hour[row]
    avg_by_hour.append([row, average])
avg_by_hour
[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
Create an empty list and assign it to swap_avg_by_hour. Iterate over the rows of avg_by_hour and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row.
Print swap_avg_by_hour.
Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.
. Set the reverse argument to True, so that the highest value in the first column appears first in the list.
. Assign the result to sorted_swap.
Print the string "Top 5 Hours for Ask Posts Comments".
Loop through each average and each hour (in this order) in the first five lists of sorted_swap.
Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.
To format the hours, use the datetime.strptime() constructor to return a datetime object and then use the strftime() method to specify the format of the time. To format the average, you can use {:.2f} to indicate that just two decimal places should be used.
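The swap described above exists only so that sorted() compares the averages first. As an alternative sketch, the key argument of sorted() can point at the average directly, which makes the swap unnecessary. The four pairs below are copied from avg_by_hour; avg_by_hour_sample and top are hypothetical names used for illustration.

```python
import datetime as dt

# A few [hour, average] pairs copied from avg_by_hour.
avg_by_hour_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort by the second element (the average), highest first.
top = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top[:3]:
    # Turn '15' into '15:00' by parsing the hour and reformatting it.
    label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(label, average))
```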
#Create a list that equals avg_by_hour with swapped columns
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour
[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]
#Use the sorted() function to sort swap_avg_by_hour in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap
[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10'], [9.7119341563786, '04'], [9.692007797270955, '14'], [9.449744463373083, '17'], [9.190661478599221, '08'], [8.96474358974359, '11'], [8.804177545691905, '22'], [8.794258373205741, '05'], [8.749019607843136, '20'], [8.687258687258687, '21'], [7.948339483394834, '03'], [7.94299674267101, '18'], [7.713298791018998, '16'], [7.5647840531561465, '00'], [7.407801418439717, '01'], [7.163043478260869, '19'], [7.013274336283186, '07'], [6.782051282051282, '06'], [6.696793002915452, '23'], [6.653153153153153, '09']]
print("Top 5 Hours for Ask Posts Comments")
Top 5 Hours for Ask Posts Comments
#Use the str.format() method to print the hour and average
#Use the datetime.strptime() constructor to return a datetime object,
#then use the strftime() method to specify the format of the time.
#Use {:.2f} to indicate that just two decimal places should be used
for each in sorted_swap[0:5]:
    hour = dt.datetime.strptime(each[1], "%H")
    hour = hour.strftime("%H:%M")
    a = "{hour} > {comments:.2f} average comments per post".format(hour=hour, comments=each[0])
    print(a)
15:00 > 28.68 average comments per post
13:00 > 16.32 average comments per post
12:00 > 12.38 average comments per post
02:00 > 11.14 average comments per post
10:00 > 10.68 average comments per post
For comparison, we'll repeat the same hourly analysis for show posts.
result_list1 = []
for row in show_posts:
    created_at1 = row[6]
    num_comments1 = int(row[4])
    result_list1.append((created_at1, num_comments1))
result_list1[:10]
[('9/26/2016 0:36', 0), ('9/26/2016 0:01', 0), ('9/25/2016 23:44', 0), ('9/25/2016 23:17', 0), ('9/25/2016 20:06', 1), ('9/25/2016 19:06', 1), ('9/25/2016 18:32', 0), ('9/25/2016 16:50', 1), ('9/25/2016 16:43', 0), ('9/25/2016 14:30', 1)]
result_list1[10150:]
[('9/6/2015 18:05', 0), ('9/6/2015 15:41', 0), ('9/6/2015 15:25', 0), ('9/6/2015 14:21', 0), ('9/6/2015 13:50', 2), ('9/6/2015 13:02', 6), ('9/6/2015 12:38', 4), ('9/6/2015 12:16', 1)]
counts_by_hour1 = {}
comments_by_hour1 = {}
for row in result_list1:
    hour1 = row[0]
    dt_str1 = dt.datetime.strptime(hour1, "%m/%d/%Y %H:%M")
    post_created1 = dt_str1.strftime("%H")
    comments_created1 = row[1]
    if post_created1 not in counts_by_hour1:
        counts_by_hour1[post_created1] = 1
        comments_by_hour1[post_created1] = comments_created1
    else:
        counts_by_hour1[post_created1] += 1
        comments_by_hour1[post_created1] += comments_created1
counts_by_hour1
{'00': 276, '23': 319, '20': 525, '19': 556, '18': 656, '16': 801, '14': 696, '10': 323, '09': 302, '08': 316, '06': 192, '03': 206, '21': 430, '17': 761, '15': 836, '11': 402, '07': 236, '04': 194, '13': 610, '12': 516, '01': 247, '22': 377, '02': 209, '05': 172}
comments_by_hour1
{'00': 1283, '23': 1444, '20': 2183, '19': 2791, '18': 3242, '16': 3769, '14': 3839, '10': 1228, '09': 1411, '08': 1771, '06': 904, '03': 934, '21': 1759, '17': 3236, '15': 3824, '11': 2413, '07': 1577, '04': 978, '13': 3314, '12': 3609, '01': 1006, '22': 1450, '02': 1076, '05': 592}
avg_by_hour1 = []
for row in counts_by_hour1:
    average1 = comments_by_hour1[row] / counts_by_hour1[row]
    avg_by_hour1.append([row, average1])
avg_by_hour1
[['00', 4.648550724637682], ['23', 4.5266457680250785], ['20', 4.158095238095238], ['19', 5.01978417266187], ['18', 4.942073170731708], ['16', 4.705368289637953], ['14', 5.515804597701149], ['10', 3.801857585139319], ['09', 4.672185430463577], ['08', 5.6044303797468356], ['06', 4.708333333333333], ['03', 4.533980582524272], ['21', 4.090697674418605], ['17', 4.252299605781866], ['15', 4.574162679425838], ['11', 6.002487562189055], ['07', 6.682203389830509], ['04', 5.041237113402062], ['13', 5.432786885245902], ['12', 6.994186046511628], ['01', 4.0728744939271255], ['22', 3.8461538461538463], ['02', 5.148325358851674], ['05', 3.441860465116279]]
swap_avg_by_hour1 = []
for row in avg_by_hour1:
    swap_avg_by_hour1.append([row[1], row[0]])
swap_avg_by_hour1
[[4.648550724637682, '00'], [4.5266457680250785, '23'], [4.158095238095238, '20'], [5.01978417266187, '19'], [4.942073170731708, '18'], [4.705368289637953, '16'], [5.515804597701149, '14'], [3.801857585139319, '10'], [4.672185430463577, '09'], [5.6044303797468356, '08'], [4.708333333333333, '06'], [4.533980582524272, '03'], [4.090697674418605, '21'], [4.252299605781866, '17'], [4.574162679425838, '15'], [6.002487562189055, '11'], [6.682203389830509, '07'], [5.041237113402062, '04'], [5.432786885245902, '13'], [6.994186046511628, '12'], [4.0728744939271255, '01'], [3.8461538461538463, '22'], [5.148325358851674, '02'], [3.441860465116279, '05']]
sorted_swap1 = sorted(swap_avg_by_hour1, reverse=True)
print("Top 5 Hours for Show Posts Comments")
Top 5 Hours for Show Posts Comments
for each in sorted_swap1[0:5]:
    hour1 = dt.datetime.strptime(each[1], "%H")
    hour1 = hour1.strftime("%H:%M")
    b = "{hour}: {comments:.2f} average comments per post".format(hour=hour1, comments=each[0])
    print(b)
12:00: 6.99 average comments per post
07:00: 6.68 average comments per post
11:00: 6.00 average comments per post
08:00: 5.60 average comments per post
14:00: 5.52 average comments per post
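The ask and show analyses follow identical steps, so they could be wrapped in a single function and called once per list. The sketch below shows one way this might look; average_comments_by_hour and its date_index and comments_index parameters are hypothetical names, and the sample rows are the first three ask posts printed earlier.

```python
import datetime as dt

def average_comments_by_hour(posts, date_index=6, comments_index=4):
    """Return [average, hour] pairs sorted from highest to lowest average."""
    counts, comments = {}, {}
    for row in posts:
        parsed = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = parsed.strftime("%H")
        # dict.get() supplies 0 the first time an hour is seen.
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + int(row[comments_index])
    averages = [[comments[hour] / counts[hour], hour] for hour in counts]
    return sorted(averages, reverse=True)

# First three ask posts from the output above.
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

for average, hour in average_comments_by_hour(sample_posts):
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

With the full lists, calling the function once with ask_posts and once with show_posts would be expected to reproduce the two rankings above.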