In the following project, we'll be working with a dataset of submissions to the popular technology site Hacker News.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
You can find the dataset here. The full dataset contains almost 300,000 rows; we will reduce it to approximately 80,000 rows by removing all submissions that didn't receive any comments. Below are the descriptions of the columns:

- id: the unique identifier of the post
- title: the title of the post
- url: the URL the post links to, if it has one
- num_points: the number of points the post acquired
- num_comments: the number of comments on the post
- author: the username of the person who submitted the post
- created_at: the date and time of the post's submission
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Below are a few examples:

- Ask HN: How do you pass on your work when you die?
- Show HN: Learn Japanese Vocab via multiple choice questions
We'll compare these two types of posts to determine the following:

- Do Ask HN or Show HN posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
We now import the dataset and display the first five rows.
from csv import reader

opened_file = open('C:/Users/Benson/my_datasets/Hacker_News_20160926.csv', encoding="utf8")
read_file = reader(opened_file)
hn = list(read_file)

# Separate the header row from the data
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
From the above, we have separated the header row from the dataset and stored it in the variable headers.
Since we are only interested in posts that begin with either Ask HN or Show HN, we will create new lists of lists containing just the data for such titles. For Ask HN, we will also include submissions that start with [Ask HN; there were two such rows in the dataset.
To do this, we will use the string method startswith. Since startswith is case sensitive, we'll also use the lower method to normalize the capitalization of each title before checking it.
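As a quick illustration of why the lower call matters (the title below is a hypothetical example, not a row from the dataset):

```python
# startswith is case sensitive, so "ASK HN" would slip past a check for "ask hn"
title = "ASK HN: Is this a good example?"

print(title.startswith("ask hn"))          # False: the case doesn't match
print(title.lower().startswith("ask hn"))  # True: lowercasing first normalizes the title
```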
print("Original Dataset:", len(hn))

# Filter out submissions with no comments
hn_with_comments = []
for row in hn:
    num_comments = int(row[4])
    if num_comments != 0:
        hn_with_comments.append(row)

hn = hn_with_comments
print("Commented submissions:", len(hn))
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn") or title.startswith("[ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("Ask HN:", len(ask_posts))
print("Show HN:", len(show_posts))
print("Other Posts:", len(other_posts))
Original Dataset: 293119
Commented submissions: 80401
Ask HN: 6913
Show HN: 5059
Other Posts: 68429
Below are the first five rows of ask_posts and show_posts:
print(ask_posts[:5])
print('\n')
print(show_posts[:5])
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50'], ['12576946', 'Ask HN: How hard would it be to make a cheap, hackable phone?', '', '2', '1', 'hkt', '9/25/2016 19:30']]

[['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06'], ['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50'], ['12575471', 'Show HN: Project-Okot: Novel, CODE-FREE data-apps in mere seconds', 'https://studio.nuchwezi.com/', '3', '1', 'nfixx', '9/25/2016 14:30'], ['12574773', 'Show HN: Cursor that Screenshot', 'http://edward.codes/cursor-that-screenshot', '3', '3', 'ed-bit', '9/25/2016 10:50']]
At this point, we have all commented Ask HN posts in the ask_posts list and all commented Show HN posts in the show_posts list. We can now determine whether ask posts or show posts receive more comments on average.
# Average number of comments on ask posts
total_ask_comments = 0
for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on ask posts:", avg_ask_comments)

# Average number of comments on show posts
total_show_comments = 0
for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments

avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on show posts:", avg_show_comments)

Average number of comments on ask posts: 13.740778243888327
Average number of comments on show posts: 9.810832180272781
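The two loops above follow the same pattern, so they could be factored into a small helper. This is just a sketch, reusing the same column index (4) for num_comments and a hypothetical two-row sample to show the arithmetic:

```python
def avg_comments(posts):
    """Average of the num_comments column (index 4) across a list of rows."""
    total = sum(int(row[4]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the dataset's layout: (7 + 3) / 2 = 5.0
sample = [
    ['id1', 'Ask HN: example one', '', '4', '7', 'user1', '9/26/2016 2:53'],
    ['id2', 'Ask HN: example two', '', '6', '3', 'user2', '9/26/2016 1:17'],
]
print(avg_comments(sample))  # 5.0
```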
From the above result, Ask HN posts received about 14 comments on average while Show HN posts received about 10. Hence, we see that Ask HN posts are more likely to receive comments than Show HN posts.
Since ask posts are more likely to receive comments, we'll focus the remaining analysis on them.
We'll now determine whether ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments those posts received.
- Calculate the average number of comments ask posts receive for each hour in which they were created.
We will use the datetime module to work with the data in the created_at column.
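The created_at strings follow a month/day/year hour:minute pattern, so strptime with the format string "%m/%d/%Y %H:%M" parses them. A quick check on one timestamp taken from the dataset:

```python
import datetime as dt

# Parse one created_at value from the dataset into a datetime object
date = dt.datetime.strptime("9/26/2016 3:26", "%m/%d/%Y %H:%M")

print(date.hour)    # 3
print(date.minute)  # 26
```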
import datetime as dt

result_list = []
for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result_list.append([created_at, n_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.hour
    n_comments = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments

print("Number of posts per each hour of the day:", counts_by_hour)
print("\n")
print("Number of comments per each hour of the day:", comments_by_hour)
# Calculate the average number of comments per hour
avg_by_hour = []
for hour in counts_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])

print("\n")
print("Average number of comments per hour:", avg_by_hour)
Number of posts per each hour of the day: {2: 227, 1: 223, 22: 287, 21: 407, 19: 420, 17: 404, 15: 468, 14: 378, 13: 326, 11: 251, 10: 219, 9: 176, 7: 157, 3: 212, 16: 415, 8: 190, 0: 231, 23: 276, 20: 392, 18: 452, 12: 274, 4: 186, 6: 176, 5: 166}

Number of comments per each hour of the day: {2: 2996, 1: 2089, 22: 3372, 21: 4500, 19: 3954, 17: 5547, 15: 18526, 14: 4972, 13: 7245, 11: 2797, 10: 3013, 9: 1477, 7: 1585, 3: 2154, 16: 4466, 8: 2362, 0: 2277, 23: 2297, 20: 4462, 18: 4877, 12: 4234, 4: 2360, 6: 1587, 5: 1841}

Average number of comments per hour: [[2, 13.198237885462555], [1, 9.367713004484305], [22, 11.749128919860627], [21, 11.056511056511056], [19, 9.414285714285715], [17, 13.73019801980198], [15, 39.585470085470085], [14, 13.153439153439153], [13, 22.2239263803681], [11, 11.143426294820717], [10, 13.757990867579908], [9, 8.392045454545455], [7, 10.095541401273886], [3, 10.160377358490566], [16, 10.76144578313253], [8, 12.43157894736842], [0, 9.857142857142858], [23, 8.322463768115941], [20, 11.38265306122449], [18, 10.789823008849558], [12, 15.452554744525548], [4, 12.688172043010752], [6, 9.017045454545455], [5, 11.090361445783133]]
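The if/else frequency-table pattern above can also be written with dict.get, which supplies a default value when a key is missing. A minimal sketch on hypothetical (hour, n_comments) pairs:

```python
counts = {}
comments = {}

# Hypothetical (hour, n_comments) pairs standing in for the parsed rows
pairs = [(15, 10), (15, 4), (13, 7)]

for hour, n_comments in pairs:
    # get(hour, 0) returns 0 the first time an hour is seen
    counts[hour] = counts.get(hour, 0) + 1
    comments[hour] = comments.get(hour, 0) + n_comments

print(counts)    # {15: 2, 13: 1}
print(comments)  # {15: 14, 13: 7}
```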
Finally, we sort the average number of comments by hour. We'll create a list equal to avg_by_hour with its columns swapped and store it in a variable named swap_avg_by_hour.
# Create swapped list
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])

# Sort by average number of comments, highest first
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

# Print the top 5 hours for ask post comments
print("Top 5 Hours for Ask Post Comments")
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(str(row[1]), "%H")
    hour = hour.strftime("%H:%M")
    print("{0}: {1:.2f} average comments per post".format(hour, row[0]))
Top 5 Hours for Ask Post Comments
15:00: 39.59 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
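Swapping the columns is one way to make sorted order by the average; the same result can be had without a swap by passing a key function. A sketch using hypothetical rows in avg_by_hour's [hour, avg] layout:

```python
# Hypothetical [hour, avg] rows in avg_by_hour's layout
avg_by_hour_sample = [[15, 39.59], [2, 13.20], [13, 22.22]]

# Sort by the second column (the average), highest first, without swapping
top = sorted(avg_by_hour_sample, key=lambda row: row[1], reverse=True)

print(top)  # [[15, 39.59], [13, 22.22], [2, 13.2]]
```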
From the above analysis, we saw that Ask HN posts are more likely to receive comments than Show HN posts on the Hacker News site. More specifically, Ask HN posts get about 14 comments on average while Show HN posts get about 10.
As for the time of day, ask posts created at 15:00 are the most likely to attract comments: they receive an average of 39.59 comments, considerably higher than any other hour of the day.