Hacker News is a site where users can submit posts are voted and commented upon. Hacker News is very popular in technology and startup cicles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
There are two types of posts that are important for this project: "Ask HN" and "Show HN". People submit "Ask HN" posts to ask Hacker News community a specific question. Likewise, users submit "Shoe HN" posts to show a project, product or something interesting.
The purpose of this project is to compare these two types of posts and answer the following questions:
import csv
opened_file = open("HN_posts_year_to_Sep_26_2016.csv")
hn = list(csv.reader(opened_file))
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
Remove the header row from hn.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-\xc3\x82\xc2\x93the-data-vault\xc3\x82\xc2\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
Split ask posts and show posts into two different lists:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
title = row[1]
if title.lower().startswith('ask hn'):
ask_posts.append(row)
elif title.lower().startswith('show hn'):
show_posts.append(row)
else:
other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
9139 10158 273822
Calculate average number of comments for ask posts:
total_ask_comments = 0.0
for row in ask_posts:
num_comments = int(row[4])
total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
10.3934784987
Calculate average number of comments for show posts:
total_show_comments = 0.0
for row in show_posts:
num_comments = int(row[4])
total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
4.88609962591
On average, ask posts receive more comments than show posts.
Next, we will decide if ask posts created at certain time are more likely to get more comments. There are two steps to perform this analysis:
import datetime as dt
result_list = []
for row in ask_posts:
created_at = row[6]
num_comments = int(row[4])
result_list.append([created_at,num_comments])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
date_time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
hour = date_time.strftime("%H")
if hour in counts_by_hour:
counts_by_hour[hour] += 1
comments_by_hour[hour] += row[1]
else:
counts_by_hour[hour] = 1
comments_by_hour[hour] = row[1]
comments_by_hour
{'00': 2277, '01': 2089, '02': 2996, '03': 2154, '04': 2360, '05': 1838, '06': 1587, '07': 1585, '08': 2362, '09': 1477, '10': 3013, '11': 2797, '12': 4234, '13': 7245, '14': 4972, '15': 18525, '16': 4466, '17': 5547, '18': 4877, '19': 3954, '20': 4462, '21': 4500, '22': 3372, '23': 2297}
avg_by_hour = []
for hour in comments_by_hour:
avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
avg_by_hour
[['02', 11], ['03', 7], ['00', 7], ['01', 7], ['20', 8], ['21', 8], ['22', 8], ['23', 6], ['08', 9], ['09', 6], ['14', 9], ['06', 6], ['07', 7], ['11', 8], ['10', 10], ['13', 16], ['12', 12], ['15', 28], ['04', 9], ['17', 9], ['16', 7], ['19', 7], ['18', 7], ['05', 8]]