Hacker News https://news.ycombinator.com/ is a social news website focused on computer science and entrepreneurship. It is run by Y Combinator https://www.ycombinator.com/, an American seed-money startup accelerator, and user-submitted stories (referred to as "posts") are voted and commented upon. The concept is similar to the broader and highly popular platform Reddit https://www.reddit.com/, where posts move to the top of the listings as users upvote the content; reaching the top can result in hundreds of thousands of views. Unlike Reddit, however, users cannot downvote content until they have accumulated enough "karma" points.
This project will focus primarily on posts whose titles begin with "Ask HN" and "Show HN". Users submit Ask HN posts to ask the Hacker News community a specific question. Below are some examples of Ask HN posts:
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are some examples of Show HN posts:
This project is for the completion of the DataQuest.io https://www.dataquest.io/ "Python for Data Science Intermediate" module, the second in a series of modules for completing the Data Science course path. For this assignment, the following questions will be answered using the material covered up to this point in the course, with particular consideration for the new material introduced in this module.
The primary objective of this project is to explore the data and use our newfound knowledge to answer the following questions:
There were 20,100 posts in our data set, with only about 14% of them being either Ask Posts or Show Posts. Of those, there were 50% more Ask Posts than Show Posts, and Ask Posts received about 36% more comments on average (roughly four more comments per post). It was determined that 3:00 pm EST was the most popular time for commenting on Ask Posts. It was interesting to find that the second most popular time was 2:00 am. A good follow-up would be to determine why so many Hacker News users are up so early (or so late).
The dataset for this project was scraped and contributed to Kaggle.com: https://www.kaggle.com/hacker-news/hacker-news-posts. It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
Column Name | Details |
---|---|
id | The unique identifier from Hacker News for the post |
title | The title of the post |
url | The URL that the posts links to, if the post has a URL |
num_points | The number of points the post acquired (the total number of upvotes minus the total number of downvotes) |
num_comments | The number of comments that were made on the post |
author | The username of the person who submitted the post |
created_at | The date and time at which the post was submitted (Eastern Time, USA) |
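The reduction described above (drop uncommented submissions, then sample) can be sketched in a couple of lines. This is a hypothetical illustration run on a tiny in-memory stand-in for the full dump, not the actual script used to build the Kaggle dataset; the column order matches the table above.

```python
import random

# Tiny stand-in for the full ~300,000-row dump:
# id, title, url, num_points, num_comments, author, created_at
data = [
    ['1', 'Ask HN: Example', '', '5', '3', 'alice', '8/4/2016 11:52'],
    ['2', 'Silent post', 'http://example.com', '1', '0', 'bob', '8/4/2016 12:00'],
    ['3', 'Show HN: Demo', 'http://example.com', '9', '7', 'carol', '8/4/2016 13:10'],
]

# Step 1: drop submissions that received no comments (num_comments is index 4).
commented = [row for row in data if int(row[4]) > 0]

# Step 2: randomly sample from what remains (here, up to 2 rows).
random.seed(0)  # repeatable sample
sample = random.sample(commented, min(2, len(commented)))
print(len(commented), len(sample))
```

On the real dump, step 2 would sample roughly 20,000 rows instead of 2.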
# Import data
from csv import reader
open_file = open('hacker_news.csv')
read_file = reader(open_file)
hn = list(read_file)
# Extract header row
headers = hn[0]
print('Header row: {}'.format(headers))
Header row: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
# Extract first 5 rows of data (minus header)
hn = hn[1:]
print('First 5 rows without header: {}'.format(hn[:5]))
First 5 rows without header: [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
There were 20,100 posts in our data set. The Ask Posts, Show Posts, and remaining posts were separated: 1,742 were Ask Posts and 1,161 were Show Posts. Ask Posts exceeded the number of Show Posts by 50%.
# Separate Ask Posts from Show Posts
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.startswith('Ask HN'):
        ask_posts.append(row)
    elif title.startswith('Show HN'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('The number of ask_posts:{}'.format(len(ask_posts)))
print('The number of show_posts:{}'.format(len(show_posts)))
print('The number of other_posts:{}'.format(len(other_posts)))
The number of ask_posts:1742
The number of show_posts:1161
The number of other_posts:17197
print('First 5 Ask_Posts: {}'.format(ask_posts[:5]))
First 5 Ask_Posts: [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
print('First 5 Show_posts: {}'.format(show_posts[:5]))
First 5 Show_posts: [['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
It was determined through iteration that the average number of comments on Ask Posts was about 14 per post, while the average for Show Posts was about 10 per post. Ask Posts received roughly 36% more comments on average.
# Find average Ask comments
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
average_ask_comments = total_ask_comments/len(ask_posts) # Get the average ask comments
print('The average number of Ask Post comments is {}'.format(average_ask_comments))
The average number of Ask Post comments is 14.044776119402986
# Find average Show comments
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
average_show_comments = total_show_comments/len(show_posts) # Get the average show comments
print('The average number of Show Post comments is {}'.format(average_show_comments))
The average number of Show Post comments is 10.324720068906116
To answer this question, the average number of comments on Ask Posts was calculated for each hour of the day, then sorted to select the top five hours. The results showed that 3:00 pm EST was the most popular, followed by 2:00 am EST. It was interesting to find so many users asking questions at 2:00 am!
Top 5 Hours for Ask Posts Comments |
---|
15:00 EST: 38.59 average comments per post |
02:00 EST: 23.81 average comments per post |
20:00 EST: 21.52 average comments per post |
16:00 EST: 16.80 average comments per post |
21:00 EST: 16.01 average comments per post |
# Create dictionaries of post counts by hour and comments by hour
import datetime as dt
result_list = []
for row in ask_posts:
    created = [row[6], int(row[4])]
    result_list.append(created)
date_format = "%m/%d/%Y %H:%M"
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    dt_string = row[0]  # Select the datetime string
    comment_count = row[1]  # Number of comments (already an int)
    dt_object = dt.datetime.strptime(dt_string, date_format)  # Convert the string into a datetime object
    hour = dt_object.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_count
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_count
print(counts_by_hour)
print(comments_by_hour)
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 108, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 54, 6: 44, 7: 34, 11: 58} {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1430, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 439, 6: 397, 7: 267, 11: 641}
# Create a list of lists of average comments per post for each hour
avg_by_hour = []
for key in comments_by_hour:
    avg_by_hour.append([key, comments_by_hour[key]/counts_by_hour[key]])
print("The average number of comments per Ask Post, by hour:")
avg_by_hour
The average number of comments per Ask Post, by hour:
[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.24074074074074], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.12962962962963], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
# Swap the order
swap_avg_by_hour = []
for row in avg_by_hour:
    swap = [row[1], row[0]]
    swap_avg_by_hour.append(swap)
swap_avg_by_hour
[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.24074074074074, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.12962962962963, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]
# Sort the values
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap
[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21], [14.741176470588234, 13], [13.440677966101696, 10], [13.24074074074074, 18], [13.233644859813085, 14], [11.46, 17], [11.383333333333333, 1], [11.051724137931034, 11], [10.8, 19], [10.25, 8], [10.08695652173913, 5], [9.41095890410959, 12], [9.022727272727273, 6], [8.12962962962963, 0], [7.985294117647059, 23], [7.852941176470588, 7], [7.796296296296297, 3], [7.170212765957447, 4], [6.746478873239437, 22], [5.5777777777777775, 9]]
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    avg = row[0]
    hour = str(row[1])
    print('{} EST: {:.2f} average comments per post'.format(dt.datetime.strptime(hour, '%H').strftime('%H:%M'), avg))
Top 5 Hours for Ask Posts Comments
15:00 EST: 38.59 average comments per post
02:00 EST: 23.81 average comments per post
20:00 EST: 21.52 average comments per post
16:00 EST: 16.80 average comments per post
21:00 EST: 16.01 average comments per post
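The swap-then-sort step above can also be written without the intermediate swapped list, by giving `sorted` a key function. The values here are an illustrative subset of `avg_by_hour`:

```python
# Sort the [hour, average] pairs directly by the average (index 1),
# highest first, instead of swapping the pair order before sorting.
avg_by_hour = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.53], [16, 16.80]]
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(top[0])  # hour with the highest average: [15, 38.59]
```

Either approach produces the same ranking; the key-function version keeps each pair in its original `[hour, average]` order.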
# Convert to PST
print('Top 5 Hours for Ask Posts Comments in Pacific Standard Time')
for row in sorted_swap[:5]:
    avg = row[0]
    hour = str(row[1])
    hour = dt.datetime.strptime(hour, '%H') + dt.timedelta(hours=-3)
    print('{} PST: {:.2f} average comments per post'.format(hour.strftime('%H:%M'), avg))
Top 5 Hours for Ask Posts Comments in Pacific Standard Time
12:00 PST: 38.59 average comments per post
23:00 PST: 23.81 average comments per post
17:00 PST: 21.52 average comments per post
13:00 PST: 16.80 average comments per post
18:00 PST: 16.01 average comments per post
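Note that the fixed `-3` hour offset works because Eastern and Pacific time are always three hours apart. A more general sketch uses the standard-library `zoneinfo` module (Python 3.9+), which tracks time-zone rules explicitly; the date below is arbitrary and chosen only for illustration:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Interpret 15:00 as US Eastern time on an arbitrary date, then convert
# it to US Pacific time; zoneinfo applies the correct offset for that date.
eastern = datetime(2016, 1, 15, 15, 0, tzinfo=ZoneInfo('America/New_York'))
pacific = eastern.astimezone(ZoneInfo('America/Los_Angeles'))
print(pacific.strftime('%H:%M'))  # 15:00 Eastern -> 12:00 Pacific
```

This avoids hard-coding the offset if the analysis were ever extended to time zones whose difference varies across the year.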
There were 20,100 posts in our Hacker News Posts data set, a sample downloaded from Kaggle of the original scraped data containing almost 300,000 rows. Of the 20,100 posts, about 14% were either Ask Posts or Show Posts. There were 1,742 Ask Posts and 1,161 Show Posts, with Ask Posts exceeding the number of Show Posts by 50%. There were on average 14 comments per Ask Post, in contrast to 10 per Show Post. Since Ask Posts received roughly 36% more comments on average, they were analyzed further to determine the top 5 hours in which comments were posted. At 3:00 pm EST, there were 38.59 average comments per post, followed by 2:00 am EST with 23.81 average comments per post. It was surprising to see so many users commenting at 2:00 am EST.