In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.
We'll compare Ask HN and Show HN posts to determine the following:
Do Ask HN or Show HN posts receive more comments on average?
Below are the first five rows of our dataset.
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
#Separate header and data
headers = hn[:1]
hn = hn[1:]
print(headers)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
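The loading step can also be written with a context manager so that the file is closed automatically. The following is a minimal sketch, not the code used in this project: it reads from an in-memory sample (two rows copied from the output above) rather than from hacker_news.csv, and the names sample_csv, header, and rows are chosen here purely for illustration.

```python
import io
from csv import reader

# Hypothetical in-memory stand-in for hacker_news.csv
# (header and two rows copied from the output above).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "10301696,Note by Note: The Making of Steinway L1037 (2007),http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0,8,2,walterbell,9/30/2015 4:12\n"
)

# The with-block closes the file-like object as soon as the block exits.
with sample_csv as f:
    rows = list(reader(f))

# Split the header row from the data rows.
header, rows = rows[0], rows[1:]
print(header)
print(len(rows))
```

With the real file, the with-line would presumably become something like `with open('hacker_news.csv', newline='') as f:`. Note that header in this sketch is a flat list, whereas headers in the code above is a list containing one list, because it was taken with the slice hn[:1].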
To filter our data, we separate the header row from the dataset and store it in a variable named headers.
Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.
To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Since the startswith method is case sensitive, we'll first use the lower method to convert each title to lowercase, so that differences in capitalization don't cause missed matches.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
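As a small illustration of why the titles are lowercased before calling startswith, consider the sketch below. The titles are invented for this example and are not rows from the dataset; the variable names are likewise illustrative.

```python
# Titles invented for illustration; they are not rows from the dataset.
titles = [
    "Ask HN: How do you learn?",
    "ask hn: any tips?",
    "Show HN: My project",
    "Some other story",
]

# startswith is case sensitive, so the all-lowercase title is missed here.
strict_matches = [t for t in titles if t.startswith("Ask HN")]

# Lowercasing the title first catches both capitalizations.
normalized_matches = [t for t in titles if t.lower().startswith("ask hn")]

# startswith also accepts a tuple of prefixes and matches any of them.
ask_or_show = [t for t in titles if t.lower().startswith(("ask hn", "show hn"))]

print(len(strict_matches), len(normalized_matches), len(ask_or_show))
```

The tuple form is handy when only the combined set of matching posts is needed; the if/elif chain above is still the right shape when the two groups have to end up in separate lists.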
We now have the Ask HN posts in a list of lists named ask_posts and the Show HN posts in show_posts. Next, let's determine whether ask posts or show posts receive more comments on average.
#Average number of comments on ask posts
total_ask_comments = 0
for row in ask_posts:
    n_comments = int(row[4])
    total_ask_comments += n_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments on ask posts:", avg_ask_comments)
#Average number of comments on show posts
total_show_comments = 0
for row in show_posts:
    n_comments = int(row[4])
    total_show_comments += n_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments on show posts:", avg_show_comments)
Average number of comments on ask posts: 14.038417431192661
Average number of comments on show posts: 10.31669535283993
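The two averaging loops above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; average_comments is a hypothetical function name, and returning 0.0 for an empty list is an assumption made here to avoid a ZeroDivisionError.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column for a list of rows."""
    if not posts:
        # Assumption: report 0.0 for an empty list instead of raising ZeroDivisionError.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows copied from the first-five-rows output shown earlier, used only as sample input.
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'],
]

print(average_comments(sample_posts))
```

In the notebook itself, the helper would be called once with ask_posts and once with show_posts.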
According to the analysis, Ask HN posts receive more comments on average than Show HN posts: about 14 comments per Ask HN post, compared with about 10 per Show HN post.
Since ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.
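Next, we'll determine whether ask posts created at a certain time of day are more likely to attract comments. We'll use the following steps to perform this analysis:
Count the ask posts created in each hour of the day, along with the number of comments they received.
Calculate the average number of comments per ask post for each hour.
We'll use the datetime module to work with the data in the created_at column.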
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result = [created_at, n_comments]
    result_list.append(result)
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.hour
    n_comments = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments
print("Number of posts per each hour of the day:", counts_by_hour)
print("Number of comments per each hour of the day:", comments_by_hour)
#Calculate the average number of comments per post
avg_by_hour = []
for item in counts_by_hour:
    avg = (comments_by_hour[item] / counts_by_hour[item])
    avg_by_hour.append([item, avg])
print(avg_by_hour)
Number of posts per each hour of the day: {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Number of comments per each hour of the day: {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
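The grouping step above can be sketched more compactly with collections.defaultdict, which removes the need for the "if hour not in ..." branch. This is an illustrative alternative under stated assumptions, not the code that produced the output above: the first four sample pairs reuse created_at and num_comments values from the rows shown earlier (which are not Ask HN posts), the last pair is invented so that one hour has two posts, and the names samples, counts, comments, and averages are chosen here for illustration.

```python
import datetime as dt
from collections import defaultdict

# (created_at, num_comments) pairs. The first four come from the sample rows
# shown earlier; the last one is invented so that hour 11 appears twice.
samples = [
    ("8/4/2016 11:52", 52),
    ("1/26/2016 19:30", 10),
    ("9/30/2015 4:12", 2),
    ("6/23/2016 22:20", 1),
    ("8/5/2016 11:05", 8),  # hypothetical entry
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour

for created_at, n_comments in samples:
    # Same format string as in the analysis above.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += n_comments

# Average comments per post for each hour.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

Missing keys in a defaultdict(int) start at 0, so the first post seen for an hour and every later one are handled by the same two lines.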
To sort the hours by their average number of comments, we'll create a copy of avg_by_hour with the two columns swapped, so that the average comes first, and store it in a variable named swap_avg_by_hour.
# Create swapped list
swap_avg_by_hour = []
for row in avg_by_hour:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hour.append([avg, hour])
# Sort by average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
# Print 'Top 5 hours for Ask Posts Comments'
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(str(row[1]), "%H")
    hour = hour.strftime("%H:%M")
    print("{0}: {1:.2f} average comments per post".format(hour, row[0]))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
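Swapping the columns is one way to sort by the average. An alternative, sketched below, is to pass a key function to sorted so that the original [hour, average] layout is kept. The list sample_avg_by_hour holds a handful of pairs copied from the avg_by_hour output above, and the names sample_avg_by_hour and top_hours are illustrative.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
sample_avg_by_hour = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [20, 21.525],
    [2, 23.810344827586206],
    [16, 16.796296296296298],
]

# Sort on the second column (the average) without swapping the columns first.
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    # {:02d} zero-pads the hour, so 2 is shown as 02:00.
    print("{:02d}:00: {:.2f} average comments per post".format(hour, avg))
```

Formatting the integer hour directly also avoids the round trip through strptime and strftime, although both approaches produce the same HH:00 labels.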
The analysis shows that Ask HN posts receive more comments on average than Show HN posts on Hacker News: about 14 comments per Ask HN post, compared with about 10.3 per Show HN post.
Regarding the time of day, Ask HN posts created during the 15:00 hour receive the most comments. They average 38.59 comments per post, which is considerably higher than any other hour of the day.