Hacker News https://news.ycombinator.com/ is a social news website focused on computer science and entrepreneurship. It is run by Y Combinator https://www.ycombinator.com/, an American seed-money startup accelerator, and user-submitted stories (referred to as "posts") are voted and commented upon. The concept is similar to the broader and highly popular platform Reddit https://www.reddit.com/, where posts move to the top of the listings as users upvote the content; reaching the top can result in hundreds of thousands of views. Unlike Reddit, however, users cannot downvote content until they have accumulated enough "karma" points.
This project will focus primarily on posts whose titles begin with "Ask HN" and "Show HN". Users submit Ask HN posts to ask the Hacker News community a specific question. Below are some examples of Ask HN posts:
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are some examples of Show HN posts:
This project is for the completion of the DataQuest.io https://www.dataquest.io/ "Python for Data Science Intermediate" module, the second in a series of modules for completing the Data Science course path. For this assignment, the following questions will be answered using the material covered up to this point in the course, with particular consideration for the new material introduced in this module.
The primary objective of this project is to explore the data and use our newfound knowledge to answer the following questions:
There were 20,100 posts in our data set, with only about 14% of them being either Ask Posts or Show Posts. Of those, there were 50% more Ask Posts than Show Posts, and Ask Posts received about 36% more comments on average (roughly four more comments per post). It was determined that 3:00 pm EST was the most popular time for commenting on Ask Posts. It was interesting to find that the second most popular time was 2:00 am. A good follow-up would be to determine why so many Hacker News users are up so early (or so late).
The dataset for this project was scraped and contributed to Kaggle.com: https://www.kaggle.com/hacker-news/hacker-news-posts. It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
Column Name | Details |
---|---|
id | The unique identifier from Hacker News for the post |
title | The title of the post |
url | The URL that the posts links to, if the post has a URL |
num_points | The number of points the post acquired (the total number of upvotes minus the total number of downvotes) |
num_comments | The number of comments that were made on the post |
author | The username of the person who submitted the post |
created_at | The date and time at which the post was submitted (Eastern Time, USA) |
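The reduction described above (drop uncommented submissions, then sample) can be sketched in a couple of lines. This is a hypothetical illustration run on a tiny in-memory stand-in for the full dump, not the actual script used to build the Kaggle dataset; the column order matches the table above.

```python
import random

# Tiny stand-in for the full ~300,000-row dump:
# id, title, url, num_points, num_comments, author, created_at
data = [
    ['1', 'Ask HN: Example', '', '5', '3', 'alice', '8/4/2016 11:52'],
    ['2', 'Silent post', 'http://example.com', '1', '0', 'bob', '8/4/2016 12:00'],
    ['3', 'Show HN: Demo', 'http://example.com', '9', '7', 'carol', '8/4/2016 13:10'],
]

# Step 1: drop submissions that received no comments (num_comments is index 4).
commented = [row for row in data if int(row[4]) > 0]

# Step 2: randomly sample from what remains (here, up to 2 rows).
random.seed(0)  # repeatable sample
sample = random.sample(commented, min(2, len(commented)))
print(len(commented), len(sample))
```

On the real dump, step 2 would sample roughly 20,000 rows instead of 2.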
# Import data
from csv import reader
open_file = open('hacker_news.csv')
read_file = reader(open_file)
hn = list(read_file)
# Extract header row
headers = hn[0]
print('Header row: {}'.format(headers))
Header row: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
# Extract first 5 rows of data (minus header)
hn = hn[1:]
print('First 5 rows without header: {}'.format(hn[:5]))
First 5 rows without header: [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
There were 20,100 posts in our data set. The Ask Posts, Show Posts, and remaining posts were separated: 1,742 were Ask Posts and 1,161 were Show Posts. Ask Posts exceeded the number of Show Posts by 50%.
# Separate Ask Posts from Show Posts
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.startswith('Ask HN'):
        ask_posts.append(row)
    elif title.startswith('Show HN'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('The number of ask_posts:{}'.format(len(ask_posts)))
print('The number of show_posts:{}'.format(len(show_posts)))
print('The number of other_posts:{}'.format(len(other_posts)))
The number of ask_posts:1742
The number of show_posts:1161
The number of other_posts:17197
print('First 5 Ask_Posts: {}'.format(ask_posts[:5]))
First 5 Ask_Posts: [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
print('First 5 Show_posts: {}'.format(show_posts[:5]))
First 5 Show_posts: [['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
It was determined through iteration that the average number of comments on Ask Posts was about 14 per post, while the average for Show Posts was about 10 per post. Ask Posts received roughly 36% more comments on average.
# Find average Ask comments
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
average_ask_comments = total_ask_comments/len(ask_posts) # Get the average ask comments
print('The average number of Ask Post comments is {}'.format(average_ask_comments))
The average number of Ask Post comments is 14.044776119402986
# Find average Show comments
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
average_show_comments = total_show_comments/len(show_posts) # Get the average show comments
print('The average number of Show Post comments is {}'.format(average_show_comments))
The average number of Show Post comments is 10.324720068906116
To answer this question, the average number of comments on Ask Posts was calculated for each hour of the day, then sorted to select the top five hours. The results showed that 3:00 pm EST was the most popular, followed by 2:00 am EST. It was interesting to find so many users asking questions at 2:00 am!
Top 5 Hours for Ask Posts Comments |
---|
15:00 EST: 38.59 average comments per post |
02:00 EST: 23.81 average comments per post |
20:00 EST: 21.52 average comments per post |
16:00 EST: 16.80 average comments per post |
21:00 EST: 16.01 average comments per post |
# Create dictionaries of post counts by hour and comments by hour
import datetime as dt
result_list = []
for row in ask_posts:
    created = [row[6], int(row[4])]
    result_list.append(created)
date_format = "%m/%d/%Y %H:%M"
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    dt_string = row[0]  # Select the datetime string
    comment_count = row[1]  # Number of comments (already an int)
    dt_object = dt.datetime.strptime(dt_string, date_format)  # Convert the string into a datetime object
    hour = dt_object.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_count
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_count
print(counts_by_hour)
print(comments_by_hour)
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 108, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 54, 6: 44, 7: 34, 11: 58} {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1430, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 439, 6: 397, 7: 267, 11: 641}
# Create a list of lists of average comments per post for each hour
avg_by_hour = []
for key in comments_by_hour:
    avg_by_hour.append([key, comments_by_hour[key]/counts_by_hour[key]])
print("The average number of comments per Ask Post, by hour:")
avg_by_hour
The average number of comments per Ask Post, by hour:
[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.24074074074074], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.12962962962963], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
# Swap the order
swap_avg_by_hour = []
for row in avg_by_hour:
    swap = [row[1], row[0]]
    swap_avg_by_hour.append(swap)
swap_avg_by_hour
[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.24074074074074, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.12962962962963, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]
# Sort the values
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap
[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21], [14.741176470588234, 13], [13.440677966101696, 10], [13.24074074074074, 18], [13.233644859813085, 14], [11.46, 17], [11.383333333333333, 1], [11.051724137931034, 11], [10.8, 19], [10.25, 8], [10.08695652173913, 5], [9.41095890410959, 12], [9.022727272727273, 6], [8.12962962962963, 0], [7.985294117647059, 23], [7.852941176470588, 7], [7.796296296296297, 3], [7.170212765957447, 4], [6.746478873239437, 22], [5.5777777777777775, 9]]
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    avg = row[0]
    hour = str(row[1])
    print('{} EST: {:.2f} average comments per post'.format(dt.datetime.strptime(hour, '%H').strftime('%H:%M'), avg))
Top 5 Hours for Ask Posts Comments
15:00 EST: 38.59 average comments per post
02:00 EST: 23.81 average comments per post
20:00 EST: 21.52 average comments per post
16:00 EST: 16.80 average comments per post
21:00 EST: 16.01 average comments per post
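The swap-then-sort step above can also be written without the intermediate swapped list, by giving `sorted` a key function. The values here are an illustrative subset of `avg_by_hour`:

```python
# Sort the [hour, average] pairs directly by the average (index 1),
# highest first, instead of swapping the pair order before sorting.
avg_by_hour = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.53], [16, 16.80]]
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(top[0])  # hour with the highest average: [15, 38.59]
```

Either approach produces the same ranking; the key-function version keeps each pair in its original `[hour, average]` order.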
# Convert to PST
print('Top 5 Hours for Ask Posts Comments in Pacific Standard Time')
for row in sorted_swap[:5]:
    avg = row[0]
    hour = str(row[1])
    hour = dt.datetime.strptime(hour, '%H') + dt.timedelta(hours=-3)
    print('{} PST: {:.2f} average comments per post'.format(hour.strftime('%H:%M'), avg))
Top 5 Hours for Ask Posts Comments in Pacific Standard Time
12:00 PST: 38.59 average comments per post
23:00 PST: 23.81 average comments per post
17:00 PST: 21.52 average comments per post
13:00 PST: 16.80 average comments per post
18:00 PST: 16.01 average comments per post
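Note that the fixed `-3` hour offset works because Eastern and Pacific time are always three hours apart. A more general sketch uses the standard-library `zoneinfo` module (Python 3.9+), which tracks time-zone rules explicitly; the date below is arbitrary and chosen only for illustration:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Interpret 15:00 as US Eastern time on an arbitrary date, then convert
# it to US Pacific time; zoneinfo applies the correct offset for that date.
eastern = datetime(2016, 1, 15, 15, 0, tzinfo=ZoneInfo('America/New_York'))
pacific = eastern.astimezone(ZoneInfo('America/Los_Angeles'))
print(pacific.strftime('%H:%M'))  # 15:00 Eastern -> 12:00 Pacific
```

This avoids hard-coding the offset if the analysis were ever extended to time zones whose difference varies across the year.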
There were 20,100 posts in our Hacker News Posts data set, a sample downloaded from Kaggle of the original scraped data containing almost 300,000 rows. Of the 20,100 posts, about 14% were either Ask Posts or Show Posts. There were 1,742 Ask Posts and 1,161 Show Posts, with Ask Posts exceeding the number of Show Posts by 50%. There were on average 14 comments per Ask Post, in contrast to 10 per Show Post. Since Ask Posts received roughly 36% more comments on average, they were analyzed further to determine the top 5 hours in which comments were posted. At 3:00 pm EST, there were 38.59 average comments per post, followed by 2:00 am EST with 23.81 average comments per post. It was surprising to see so many users commenting at 2:00 am EST.