Exploring Hacker News Posts¶

In this project, we will analyse posts from the Hacker News website.

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit.

The purpose of the analysis is to identify whether one post type is more popular than the other, and whether posts created at certain times of day are more likely to attract users' responses.

The data set used for this project is a subset of a larger one, reduced from 300,000 rows to roughly 20,000. All posts that did not receive any comments were removed, and the remaining posts were sampled randomly.
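The reduction described above (drop posts without comments, then sample at random) could be sketched as follows. This is only an illustration of the idea, not the procedure actually used to build the file: the function name `reduce_dataset`, its parameters, the seed, and the demo rows are all made up for the example.

```python
import random

def reduce_dataset(rows, sample_size, comments_index=4, seed=1):
    """Drop posts with zero comments, then draw a random sample.

    `rows` must not include the header row. `sample_size`, `comments_index`
    and `seed` are illustrative parameters, not values taken from the
    original data preparation.
    """
    commented = [row for row in rows if int(row[comments_index]) > 0]
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    # never ask for more rows than are available
    return rng.sample(commented, min(sample_size, len(commented)))

# tiny made-up rows in the same column order as the data set
demo_rows = [
    ['1', 'Post A', '', '5', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Post B', '', '3', '2', 'user_b', '1/1/2016 11:00'],
    ['3', 'Post C', '', '9', '7', 'user_c', '1/2/2016 12:00'],
    ['4', 'Post D', '', '1', '0', 'user_d', '1/3/2016 13:00'],
]

subset = reduce_dataset(demo_rows, sample_size=2)
print(len(subset))
```

Only posts B and C have comments, so they are the only rows that can end up in the sample.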

Link to access dataset: https://www.kaggle.com/hacker-news/hacker-news-posts

Read and Open the file¶

In [1]:

# read and open the file
from csv import reader

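# the with block closes the file automatically;
# newline='' is what the csv module recommends for files passed to reader
with open('hacker_news.csv', newline='') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)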

# display first 5 rows
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]

In [2]:

# separate headers 
headers = hn[0]
print(headers)
hn = hn[1:]

# check headers are removed
print(hn[0:3])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]

Check the number of posts of each type¶

We are interested in posts whose titles begin with either Ask HN or Show HN.

Ask HN are posts to ask the Hacker News community specific questions.

e.g.: Ask HN: How to improve my personal website?

Show HN are posts to show the Hacker News community a project, product, or just generally something interesting.

e.g.: Show HN: Something pointless I made
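As a small sketch of how the two prefixes can be recognised, the helper below lowercases a title and checks how it starts. The function name `post_type` is hypothetical and is not used elsewhere in this notebook; the first two titles are the examples given above.

```python
def post_type(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(post_type('Ask HN: How to improve my personal website?'))
print(post_type('Show HN: Something pointless I made'))
print(post_type('Interactive Dynamic Video'))
```

Lowercasing first means that titles such as "ASK HN: ..." or "show hn: ..." are classified the same way as the canonical spelling.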

In [3]:

ask_posts = []
show_posts = []
other_posts = []

# lowercase each title, then use startswith to find the wanted post types
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Number of ask posts:', len(ask_posts))
print('Number of show posts:',len(show_posts))
print('Number of other posts:',len(other_posts))

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194

Most posts belong to neither type, but there are more ask posts than show posts (1,744 vs. 1,162). Let's drill down and find the total number of comments for ask and show posts.

Identify the number of comments for each post type¶

In [4]:

# calculate number of comments for ask posts
total_ask_comments = 0 

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments 

# check number of comments
print('Total number of ask comments:', total_ask_comments)

# calculate average number of ask comments 
avg_ask_comments = total_ask_comments / len(ask_posts)

print('Average of ask comments:', avg_ask_comments)

Total number of ask comments: 24483
Average of ask comments: 14.038417431192661

In [5]:

# calculate number of comments for show posts
total_show_comments = 0 

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments 

# check number of comments
print('Total number of show comments:', total_show_comments)

# calculate average number of show comments
avg_show_comments = total_show_comments / len(show_posts)

print('Average of show comments:', avg_show_comments)

Total number of show comments: 11988
Average of show comments: 10.31669535283993

On average, ask posts receive about 14 comments and show posts about 10, a difference of roughly 4 comments per post. Since ask posts receive more comments on average, we'll focus our remaining analysis on this type of post.
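The two cells above repeat the same loop for each post type. One way to avoid the duplication is a small helper, sketched below. The name `average_comments` and the `comments_index` parameter are hypothetical; the three demo rows are the first three rows of the data set shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments, or 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoids dividing by zero
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# first three data rows from the file (52, 10 and 1 comments)
demo_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'],
]

print(average_comments(demo_posts))
```

With such a helper, the averages for both post types would come from `average_comments(ask_posts)` and `average_comments(show_posts)`.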

Determine if ask posts created at a certain time are more likely to receive comments¶

In [6]:

# import datetime module
import datetime as dt

# Create list with date and number of comments columns
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])

# check list 
print('Result_list:', result_list[0:5])  

# format date and count number of comments by hour
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = dt.datetime.strptime(row[0], date_format)
    hour = date.strftime("%H")
    comment = row[1]
    
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1 
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1 
        comments_by_hour[hour] = comment
    

comments_by_hour

Result_list: [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]

Out[6]:

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
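The if/else tally used in the cell above can also be written with `collections.defaultdict`, which starts every missing key at zero. The sketch below is an alternative formulation, not the code that produced the output above. The function name `tally_by_hour` is hypothetical, the first five pairs come from the `result_list` preview, and the last pair is made up to show two posts landing in the same hour.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Count posts and sum comments for each hour of the day."""
    counts = defaultdict(int)    # posts created in each hour
    comments = defaultdict(int)  # comments received by those posts
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)

sample_pairs = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
    ['1/1/2016 9:05', 4],  # made-up row: a second post in the 09 hour
]

sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```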

Identify average number of comments per post during each hour of the day¶

In [7]:

avg_comments_hour = [] 

for hour in counts_by_hour:
    avg_comments_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_comments_hour

Out[7]:

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Sort results and identify top 5 most popular hours to receive comments¶

In [8]:

# create list with swapped columns so we can sort by number of comments
swap_avg_by_hour = []

for row in avg_comments_hour:
    swap_avg_by_hour.append([row[1], row[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('Swapped list:', sorted_swap)

Swapped list: [[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
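Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the original `[hour, average]` pairs can be sorted directly. The list below is a hand-copied subset of the averages from Out[7], rounded to two decimals for brevity; the names `avg_subset` and `top_five` are hypothetical.

```python
# subset of the [hour, average] pairs from Out[7], rounded to two decimals
avg_subset = [
    ['09', 5.58],
    ['16', 16.80],
    ['15', 38.59],
    ['21', 16.01],
    ['20', 21.52],
    ['02', 23.81],
]

# sort on the second element of each pair, highest average first
top_five = sorted(avg_subset, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print(f'{hour}: {avg:.2f} average comments per post')
```

Sorting with a key keeps each pair in its original `[hour, average]` order, so no second list is needed.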

In [9]:

# display the 5 hours with the highest average number of comments per post
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[0:5]:
    # hr is already a zero-padded hour string such as '15'
    print(f'{hr}: {avg:,.2f} average comments per post')

Top 5 Hours for Ask Posts Comments
15: 38.59 average comments per post
02: 23.81 average comments per post
20: 21.52 average comments per post
16: 16.80 average comments per post
21: 16.01 average comments per post

Conclusion¶

The purpose of this project was to analyse Hacker News posts and determine whether one post type attracts more comments than the other, and whether the hour at which a post is created affects the number of comments it receives.

Because the data set contains only posts that received at least one comment, the averages describe commented posts only. On average, ask posts receive about 4 more comments than show posts (14.04 vs. 10.32). Ask posts created during the 15:00 hour (3 p.m. EST) receive the most comments, averaging 38.59 per post.
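Readers in other time zones may want to translate the hour reported above. The sketch below converts an EST hour to UTC. It assumes EST means a fixed UTC-5 offset and ignores daylight saving time; the constant `EST` and the function `est_hour_to_utc` are hypothetical names introduced for this example.

```python
import datetime as dt

# assumption: EST is treated as a fixed UTC-5 offset, with no daylight saving
EST = dt.timezone(dt.timedelta(hours=-5), name='EST')

def est_hour_to_utc(hour):
    """Convert an hour of the day in EST (fixed UTC-5) to the hour in UTC."""
    # the calendar date is arbitrary; only the hour matters here
    moment = dt.datetime(2016, 1, 1, hour, tzinfo=EST)
    return moment.astimezone(dt.timezone.utc).hour

print(est_hour_to_utc(15))
```

Under this assumption, the 15:00 EST hour corresponds to 20:00 UTC.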