Hacker News is a site where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. The site is extremely popular in technology and startup circles, mainly because posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors.
The dataset can be found on Kaggle. Below are the descriptions of the columns.
Column | Description |
---|---|
id | the unique identifier from Hacker News for the post |
title | the title of the post |
url | the URL the post links to, if the post has a URL |
num_points | the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
num_comments | the number of comments on the post |
author | the username of the person who submitted the post |
created_at | the date and time of the post's submission |
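Posts on Hacker News whose titles begin with Ask HN or Show HN have a specific purpose: Ask HN posts ask the Hacker News community a question, and Show HN posts show the community a project or idea.
We are interested in these two types of posts. Using them, we will analyze two questions: whether Ask HN or Show HN posts receive more comments on average, and whether posts created at a certain time receive more comments on average.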
import datetime as dt
from csv import reader
# Opening the dataset, which is a CSV file
opened_file = open('HN_posts_year_to_Sep_26_2016.csv', encoding="utf8")
read_file = reader(opened_file)
hn = list(read_file)
# Closing the file now that all rows are in memory
opened_file.close()
# Displaying the first four rows of hn (the header plus three posts)
print(hn[:4])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]
# Extracting the first row of data which is the header
headers = hn[0]
# Removing the first row of data from hn
hn = hn[1:]
# Displaying headers to check if our header is correct
print(headers)
print('\n')
# Displaying the first four rows of hn to ensure the header has been removed
print(hn[:4])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']]
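The header/data split above can also be done while reading, using a `with` block so the file is closed automatically. The following is only a minimal sketch: `load_rows` is a hypothetical helper name, and a small in-memory sample (two rows copied from the output above) stands in for the real file so the snippet runs on its own.

```python
import csv
import io

# In-memory stand-in for the CSV file: the header plus two rows copied from the output above.
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR the SQLite Archiver,https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
    "12578989,algorithmic music,http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,1,0,poindontcare,9/26/2016 3:16\n"
)

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object (hypothetical helper)."""
    csv_reader = csv.reader(file_obj)
    header = next(csv_reader)   # the first row is the header
    rows = list(csv_reader)     # everything after it is data
    return header, rows

# With the real dataset, the equivalent call would be:
#     with open('HN_posts_year_to_Sep_26_2016.csv', encoding='utf8') as f:
#         headers, hn = load_rows(f)
with io.StringIO(SAMPLE_CSV) as f:
    sample_headers, sample_rows = load_rows(f)

print(sample_headers)
print(len(sample_rows))
```

Reading the header with `next()` avoids building the full list and then slicing it, which matters little here but keeps the header and the data separate from the start.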
As mentioned above, we are only concerned with posts whose titles begin with Ask HN or Show HN, so we will isolate them in new lists of lists containing the data for those posts.
ask_posts = []    # Ask HN posts
show_posts = []   # Show HN posts
other_posts = []  # Other posts
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Checking the number of posts in ask, show and other posts
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
9139
10158
273822
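The prefix test used in the loop above can be isolated in a small function, which makes the case-insensitive matching easy to check on a few titles. This is a sketch only: `post_type` is a hypothetical helper name, the first three titles are taken from the dataset output in this notebook, and the last one is made up to show the effect of `lower()`.

```python
def post_type(title):
    """Classify a title as 'ask', 'show', or 'other' by its prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'SQLAR the SQLite Archiver',
    'ask hn: lowercase prefixes are matched too',  # made-up title, not from the dataset
]

for sample_title in sample_titles:
    print(post_type(sample_title), '|', sample_title)
```

Without the call to `lower()`, a title typed as "ask hn" or "ASK HN" would fall into the "other" group.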
Let's have a look at ask_posts and show_posts by printing a few rows of each list of lists.
print(headers)
print('\n')
print(ask_posts[:4])
print('\n')
print(show_posts[:4])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']]

[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17']]
Now, using the data in ask_posts and show_posts, let's answer our first question: do Ask HN or Show HN posts receive more comments on average?
# We need the average number of comments for both ask and show posts,
# so we create a function that we can reuse.
def average_of_comments(dataset, index):
    '''Loops through the dataset, adds up the number
    of comments found at the given index, and returns
    the average'''
    total_comments = 0
    for row in dataset:
        comment = int(row[index])
        total_comments += comment
    average = total_comments / len(dataset)
    return average
# Finding the average of ask_posts comments
avg_ask_comments = average_of_comments(ask_posts, 4)
print(avg_ask_comments)
print('\n')
# Finding average of show_posts_comments
avg_show_comments = average_of_comments(show_posts, 4)
print(avg_show_comments)
10.393478498741656

4.886099625910612
The average for Ask HN posts is about 10 comments, and the average for Show HN posts is about 5. Posts that ask the Hacker News community a question therefore tend to receive more comments than posts that show the community a project or idea. This may be because an Ask HN post is usually a question to the community, and users are more likely to comment on a question or topic to engage in conversation with users who share their interests.
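The same average can be computed with the standard library's `statistics.mean`. The sketch below is an alternative under stated assumptions, not the function used in this analysis: `average_comments` is a hypothetical name, the guard for an empty list is an addition, and the four rows are the sample Ask HN rows printed earlier, so the result describes only that sample.

```python
from statistics import mean

def average_comments(rows, index=4):
    """Return the mean of the integer values in column `index`, or None for an empty list."""
    if not rows:
        return None  # avoids an error when there is nothing to average
    return mean(int(row[index]) for row in rows)

# The four Ask HN rows printed earlier in the notebook.
sample_ask_rows = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

print(average_comments(sample_ask_rows))  # (7 + 3 + 0 + 3) / 4 = 3.25
print(average_comments([]))               # None
```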
Now that we have answered the first question, we will next determine whether posts created at a certain time receive more comments on average. We will use the following steps: first, calculate the number of posts created during each hour of the day, along with the total number of comments those posts received; then, calculate the average number of comments per post for each hour.
# Creating a function to calculate the number of posts created per hour, along with the total number of comments
# Using keyword arguments with default values because the creation time and the number of comments are in the same columns for both lists
def posts_and_comments(dataset, index=6, comments=4):
    result_list = []
    for row in dataset:
        created_at = row[index]
        number_of_comment = int(row[comments])
        result_list.append([created_at, number_of_comment])
    counts_by_hour = {}    # Contains the number of posts created during each hour
    comments_by_hour = {}  # Contains the corresponding number of comments for each hour
    for row in result_list:
        hour = row[0]
        comment = row[1]
        parsed_datetime = dt.datetime.strptime(hour, '%m/%d/%Y %H:%M')
        parsed_hour = parsed_datetime.strftime('%H')
        if parsed_hour not in counts_by_hour:
            counts_by_hour[parsed_hour] = 1
            comments_by_hour[parsed_hour] = comment
        else:
            counts_by_hour[parsed_hour] += 1
            comments_by_hour[parsed_hour] += comment
    return counts_by_hour, comments_by_hour
# Creating the number of posts created per hour and total number of comments for ask_posts
ask_posts_by_hour, ask_comments_by_hour = posts_and_comments(ask_posts)
# Creating the number of posts created per hour and total number of comments for show_posts
show_posts_by_hour, show_comments_by_hour = posts_and_comments(show_posts)
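The per-hour tally can also be written with `collections.defaultdict`, which removes the need to check whether an hour has been seen before. This is a minimal sketch of that alternative, not the function used above: `tally_by_hour` is a hypothetical name, and the input pairs are the creation times and comment counts of the four sample Ask HN rows printed earlier.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, fmt='%m/%d/%Y %H:%M'):
    """Return (posts per hour, comments per hour) for (created_at, num_comments) pairs."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        # Parse the timestamp string and keep only the zero-padded hour, e.g. '02'
        hour = dt.datetime.strptime(created_at, fmt).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# Creation time and number of comments of the four sample Ask HN rows shown earlier.
sample_pairs = [
    ('9/26/2016 2:53', 7),
    ('9/26/2016 1:17', 3),
    ('9/25/2016 22:57', 0),
    ('9/25/2016 22:48', 3),
]

sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```

Both sample posts from 22:xx land in the same `'22'` bucket, which is exactly the grouping the averages in the next step rely on.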
Now we have two dictionaries for each post type: one with the number of posts by hour and one with the number of comments by hour. Let's calculate the average number of comments per post for each hour, for both Ask HN and Show HN posts.
# Creating a function which gives the average for each hour
def average(dict_a, dict_b):
    average_list = []
    for key in dict_a:
        avg = dict_b[key]/dict_a[key]
        avg = round(avg)
        average_list.append([key, avg])
    return average_list
# Computing average for both ask and show posts
avg_ask_comments = average(ask_posts_by_hour, ask_comments_by_hour)
avg_show_comments = average(show_posts_by_hour, show_comments_by_hour)
print(avg_ask_comments)
print('\n')
print(avg_show_comments)
[['02', 11], ['01', 7], ['22', 9], ['21', 9], ['19', 7], ['17', 9], ['15', 29], ['14', 10], ['13', 16], ['11', 9], ['10', 11], ['09', 7], ['07', 7], ['03', 8], ['23', 7], ['20', 9], ['16', 8], ['08', 9], ['00', 8], ['18', 8], ['12', 12], ['04', 10], ['06', 7], ['05', 9]]

[['00', 5], ['23', 5], ['20', 4], ['19', 5], ['18', 5], ['16', 5], ['14', 6], ['10', 4], ['09', 5], ['08', 6], ['06', 5], ['03', 5], ['21', 4], ['17', 4], ['15', 5], ['11', 6], ['07', 7], ['04', 5], ['13', 5], ['12', 7], ['01', 4], ['22', 4], ['02', 5], ['05', 3]]
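The division of total comments by post count for each hour can be expressed as a dictionary comprehension. The sketch below assumes two small hand-made dictionaries (the totals for the four sample Ask HN rows printed earlier, not the full dataset), uses the hypothetical name `average_by_hour`, and leaves the averages unrounded so that close values are not collapsed into ties.

```python
def average_by_hour(counts, totals):
    """Return {hour: average comments per post}, with keys in ascending hour order."""
    return {hour: totals[hour] / counts[hour] for hour in sorted(counts)}

# Hand-made sample values, not the full dataset.
sample_counts = {'22': 2, '02': 1}   # posts created during each hour
sample_totals = {'22': 3, '02': 7}   # comments received by those posts

sample_averages = average_by_hour(sample_counts, sample_totals)
for hour, avg in sample_averages.items():
    print(f"{hour}:00 -> {avg} comments per post")
```

Iterating over `sorted(counts)` orders the result by hour, which makes the output easier to scan than the arbitrary insertion order of the original dictionaries.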
In the format above, it is difficult to identify the hours with the highest values. We will sort the list of lists and print the five highest values in a format that's easier to read.
# Swapping each [hour, average] pair to [average, hour] so that sorting orders by the average
swap_avg_ask_comments_by_hour = []
swap_avg_show_comments_by_hour = []
for row in avg_ask_comments:
    swap_avg_ask_comments_by_hour.append([row[1], row[0]])
for row in avg_show_comments:
    swap_avg_show_comments_by_hour.append([row[1], row[0]])
We have swapped the hour and the average in each list. Let's take a look at the average number of comments in descending order by printing only the top 5 elements of each list.
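Swapping the columns is one way to make `sorted()` order by the average. Another option, sketched here with a hypothetical helper `top_hours` and a few `[hour, average]` pairs copied from the Ask HN output above, is to pass a `key` function and leave the pairs as they are. The slice is taken after sorting so that the top entries are drawn from the whole list.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n [hour, average] pairs with the highest averages."""
    # Sort the full list by the average (second element) first, then keep the first n
    return sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:n]

# A few [hour, average] pairs copied from the Ask HN output above.
sample_avg = [['02', 11], ['01', 7], ['22', 9], ['15', 29], ['13', 16], ['12', 12]]

for hour, avg in top_hours(sample_avg, n=3):
    print(f"{hour}:00: {avg} average comments per post")
```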
# Average comments for Ask HN posts
print('Top 5 hours for ask posts comments.')
# Sorting all 24 hours first, then keeping the five highest averages
for row in sorted(swap_avg_ask_comments_by_hour, reverse=True)[:5]:
    hour = row[1]
    avg = row[0]
    hour_stripped = dt.datetime.strptime(hour, '%H')
    hour_stripped = hour_stripped.strftime('%H:%M:%S')
    print(f"{hour_stripped}: {avg} average comments per post")
print('-'*40)
# Average comments for Show HN posts
print('Top 5 hours for show posts comments')
for row in sorted(swap_avg_show_comments_by_hour, reverse=True)[:5]:
    hour = row[1]
    avg = row[0]
    hour_stripped = dt.datetime.strptime(hour, '%H')
    hour_stripped = hour_stripped.strftime('%H:%M:%S')
    print(f"{hour_stripped}: {avg} average comments per post")
Top 5 hours for ask posts comments.
15:00:00: 29 average comments per post
13:00:00: 16 average comments per post
12:00:00: 12 average comments per post
10:00:00: 11 average comments per post
02:00:00: 11 average comments per post
----------------------------------------
Top 5 hours for show posts comments
12:00:00: 7 average comments per post
07:00:00: 7 average comments per post
14:00:00: 6 average comments per post
11:00:00: 6 average comments per post
08:00:00: 6 average comments per post
Looking at the output above, Ask HN posts created at 15:00 (3 pm, in the time zone used by the dataset) receive by far the most comments, about 29 per post on average, followed by posts created at 13:00 (about 16) and 12:00 (about 12).
For Show HN posts, the averages differ much less: the top five hours, which fall between 7 am and 2 pm, average only 6 to 7 comments per post.
The average for Ask HN posts is about 10 comments, and the average for Show HN posts is about 5. Posts that ask the Hacker News community a question therefore tend to receive more comments than posts that show the community a project or idea. This may be because an Ask HN post is usually a question to the community, and users are more likely to comment on a question or topic to engage in conversation with users who share their interests.
Ask HN posts created around midday and in the early afternoon, especially at 15:00, receive a significantly higher number of comments. For Show HN posts, the average number of comments varies little by hour, with the top hours averaging 6 to 7 comments per post.