Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as 'posts') receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
id: the unique identifier from Hacker News for the post
title: the title of the post
url: the URL that the post links to, if the post has a URL
num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time of the post's submission
We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting.
We'll compare these two types of posts to determine the following:
Whether Ask HN or Show HN posts receive more comments on average
Let's start by importing the libraries we need and reading the dataset into a list of lists.
Open and read the hacker_news.csv data
# Open the file and create a CSV reader
import csv as c
opened_file = open('hacker_news.csv')
read_file = c.reader(opened_file)
# Convert the reader to a list of lists and assign it to hn
hn = list(read_file)
opened_file.close()
# Show the first five rows (the header row plus four posts)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
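An alternative way to read the file, sketched here, is a with block, which closes the file automatically even if reading fails. The helper name read_rows, the utf-8 encoding, and the tiny demo file are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read a CSV file into a list of lists, closing the file afterwards."""
    # newline='' is recommended by the csv module docs for files given to csv.reader.
    # encoding='utf-8' is an assumption about how the file was saved.
    with open(path, newline='', encoding='utf-8') as f:
        return list(csv.reader(f))


# Demo on a tiny temporary file so the sketch runs without hacker_news.csv
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows([['id', 'title'], ['1', 'Ask HN: Example?']])
    demo_rows = read_rows(demo_path)

print(demo_rows)
```

With the real dataset in the working directory, the equivalent call would be read_rows('hacker_news.csv').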
headers = hn[0]  # Store the first row, which contains the column headers
hn = hn[1:]  # Remove the header row from the data
print(headers)  # Display the column headers
print(hn[:5])  # Display the first five rows, excluding the header
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
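To make the column descriptions concrete, the following minimal sketch pairs the header row with the first data row shown in the output. The names first_row, post, and column_index are introduced here for illustration only. It also shows why the later code uses row[4] for num_comments and row[6] for created_at.

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
first_row = ['12224879', 'Interactive Dynamic Video',
             'http://www.interactivedynamicvideo.com/', '386', '52',
             'ne0phyte', '8/4/2016 11:52']

# Pair each column name with the value in the same position
post = dict(zip(headers, first_row))
print(post['title'])
print(int(post['num_comments']))  # csv.reader yields strings, so convert before doing math

# Map each column name to its position in the list-of-lists layout
column_index = {name: position for position, name in enumerate(headers)}
print(column_index['num_comments'], column_index['created_at'])
```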
As explained at the beginning, we're only interested in posts with titles that begin with Ask HN or Show HN, so we separate the data into three lists:
1. ask_posts: posts with titles that begin with Ask HN
2. show_posts: posts with titles that begin with Show HN
3. other_posts: posts with titles that begin with neither Ask HN nor Show HN
# Create empty lists for the three groups of posts
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]  # The title is the second column of each row
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)  # Title starts with "ask hn" (case-insensitive)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)  # Title starts with "show hn" (case-insensitive)
    else:
        other_posts.append(row)  # Title starts with neither prefix
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
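As a quick sanity check, the three counts should add up to the number of rows in hn, since every post lands in exactly one list. The sketch below uses the counts copied from the output; the variable names are illustrative. The total of 20100 is consistent with the "approximately 20,000 rows" mentioned in the introduction.

```python
# Counts copied from the output of the classification step
ask_count = 1744
show_count = 1162
other_count = 17194

total_rows = ask_count + show_count + other_count
print(total_rows)  # 20100

# Share of the dataset taken up by each group, as a percentage
groups = [('Ask HN', ask_count), ('Show HN', show_count), ('Other', other_count)]
for name, count in groups:
    print(f'{name}: {count / total_rows:.1%}')
```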
# Average number of comments for Ask HN posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(f'Average number of comments for Ask HN: {avg_ask_comments}')
# Average number of comments for Show HN posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(f'Average number of comments for Show HN: {avg_show_comments}')
Average number of comments for Ask HN: 14.038417431192661
Average number of comments for Show HN: 10.31669535283993
Based on the data above, Ask HN posts receive more comments on average than Show HN posts (about 14 versus about 10 comments per post).
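The two averaging loops follow the same pattern, so they could be folded into one helper. This is a minimal sketch, not the notebook's own code: the function name average_comments and the choice to return 0.0 for an empty list are assumptions. The sample rows are copied from the dataset preview.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column over a list of rows."""
    if not posts:
        return 0.0  # Assumption: an empty group is reported as averaging zero
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows copied from the dataset preview
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52',
     'ne0phyte', '8/4/2016 11:52'],
    ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)',
     'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2',
     'walterbell', '9/30/2015 4:12'],
]

print(average_comments(sample_posts))  # (52 + 2) / 2 = 27.0
```

With the lists built earlier, average_comments(ask_posts) and average_comments(show_posts) would replace the two loops.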
import datetime as dt
result_list = []
# Build a list of [created_at, num_comments] pairs for the Ask HN posts
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = [created_at, num_comments]
    result_list.append(result)
counts_by_hour = {}
comments_by_hour = {}
# Count the posts and total the comments for each hour of the day
for element in result_list:
    date = dt.datetime.strptime(element[0], '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    num_comments = element[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
avg_by_hour = []
# Compute the average number of comments per post for each hour
for key in counts_by_hour:
    avg = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg])
print(avg_by_hour)
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
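The hour keys in this output ('09', '15', '00', and so on) come from parsing each created_at string and formatting the hour with zero padding. The short sketch below shows that step in isolation on timestamps copied from the dataset preview; the names samples and hour_keys are illustrative.

```python
import datetime as dt

# Timestamps copied from the dataset preview
samples = ['8/4/2016 11:52', '1/26/2016 19:30', '6/17/2016 0:01']

hour_keys = []
for text in samples:
    # The format string matches month/day/year followed by hour:minute
    parsed = dt.datetime.strptime(text, '%m/%d/%Y %H:%M')
    hour_keys.append(parsed.strftime('%H'))  # zero-padded, two-digit hour string

print(hour_keys)  # ['11', '19', '00']
```

Because the keys are zero-padded strings, '0:01' and '00:30' would both fall under the same '00' key.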
swap_avg_by_hour = []
# Swap each pair to [average, hour] so that sorting orders by the average
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
# Sort in descending order of average number of comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], '%H')
    hour = dt.datetime.strftime(hour, '%H:%M')
    print('{} : {:.2f} average comments per post'.format(hour, row[0]))
Top 5 Hours for Ask Posts Comments
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
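Swapping the columns is one way to sort by the average. Another option, sketched here as an alternative rather than as the notebook's method, is to pass a key function to sorted so the [hour, average] pairs keep their original order. The sample pairs are copied from the avg_by_hour output; the names avg_by_hour_sample and top_hours are illustrative.

```python
# A few [hour, average] pairs copied from the avg_by_hour output
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['02', 23.810344827586206],
    ['16', 16.796296296296298],
]

# Sort on the second element of each pair (the average), highest first
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(f'{hour}:00 : {avg:.2f} average comments per post')
```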
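Ask HN posts receive more comments on average than Show HN posts, and Ask HN posts created during the 15:00 hour receive the most comments on average (38.59 per post).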