In this project, we will analyse posts from the Hacker News website.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit.
The purpose of the analysis is to identify whether one post type is more popular than the other and whether posts created at certain times of day are more likely to attract comments.
The data set used for this project is a subset of a larger data set, reduced from 300,000 rows to about 20,000 rows. All posts that did not receive any comments were removed, and the remaining posts were sampled randomly.
Link to the data set: https://www.kaggle.com/hacker-news/hacker-news-posts
# open the file, read it into a list of lists, and close it automatically
from csv import reader
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
# display first 5 rows
print(hn[0:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
# separate headers
headers = hn[0]
print(headers)
hn = hn[1:]
# check headers are removed
print(hn[0:3])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
We are interested in posts whose titles begin with either Ask HN or Show HN.
Ask HN posts ask the Hacker News community a specific question, e.g. "Ask HN: How to improve my personal website?"
Show HN posts show the Hacker News community a project, product, or just generally something interesting, e.g. "Show HN: Something pointless I made"
ask_posts = []
show_posts = []
other_posts = []
# use the lower and startswith string methods to normalise the title and find the wanted post types
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('Number of ask posts:', len(ask_posts))
print('Number of show posts:', len(show_posts))
print('Number of other posts:', len(other_posts))
Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194
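The prefix check used above can also be wrapped in a small helper so that it can be tried on individual titles. This is only a sketch: the function name `classify_title` is an assumption introduced for illustration, and the sample titles are the examples quoted earlier.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Sample titles taken from the examples and rows shown earlier.
sample_titles = [
    "Ask HN: How to improve my personal website?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```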
Ask posts outnumber show posts (1,744 vs 1,162), while most posts fall into neither category. Let's drill down and find the total number of comments for ask and show posts.
# calculate number of comments for ask posts
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
# check number of comments
print('Total number of ask comments:', total_ask_comments)
# calculate average number of ask comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average of ask comments:', avg_ask_comments)
Total number of ask comments: 24483
Average of ask comments: 14.038417431192661
# calculate number of comments for show posts
total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
# check number of comments
print('Total number of show comments:', total_show_comments)
# calculate average number of show comments
avg_show_comments = total_show_comments / len(show_posts)
print('Average of show comments:', avg_show_comments)
Total number of show comments: 11988
Average of show comments: 10.31669535283993
On average, ask posts receive about 4 more comments than show posts (14.04 vs 10.32). Since ask posts tend to receive more comments, we'll focus our remaining analysis on this type of post.
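As a quick check of the roughly four-comment gap between ask and show posts, the difference can be recomputed from the totals and counts printed above. The helper name `average_comments` is an assumption introduced here; it only wraps the same divide-total-by-count step used for both post types.

```python
def average_comments(total_comments, post_count):
    """Average comments per post (hypothetical helper, not part of the original notebook)."""
    return total_comments / post_count


# Totals and post counts copied from the printed outputs above.
avg_ask = average_comments(24483, 1744)
avg_show = average_comments(11988, 1162)
difference = avg_ask - avg_show

print(f"ask: {avg_ask:.2f}, show: {avg_show:.2f}, difference: {difference:.2f}")
```

The exact difference is a little under 4, which is why it is better described as "about 4".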
# import datetime module
import datetime as dt
# create a list of [created_at, number of comments] pairs
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
# check list
print('Result_list:', result_list[0:5])
# parse each date, then count posts and total comments by hour
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = dt.datetime.strptime(row[0], date_format)
    hour = date.strftime("%H")
    comment = row[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
comments_by_hour
Result_list: [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
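The if/else bookkeeping above can be shortened with `collections.defaultdict`, which starts every missing key at zero. The sketch below is an alternative, not the notebook's own method; it runs on the first five pairs printed in `Result_list`, and the names `sample_rows`, `counts`, and `comments` are assumptions made for illustration.

```python
import datetime as dt
from collections import defaultdict

# First five [created_at, num_comments] pairs from the Result_list output above.
sample_rows = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour
for created_at, num_comments in sample_rows:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '09'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(comments))
```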
avg_comments_hour = []
for hour in counts_by_hour:
    avg_comments_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_comments_hour
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
# create a list with swapped columns so we can sort by average number of comments
swap_avg_by_hour = []
for row in avg_comments_hour:
    swap_avg_by_hour.append([row[1], row[0]])
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Swapped list:', sorted_swap)
Swapped list: [[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
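Swapping the columns is one way to sort by the average. An alternative, shown here only as a sketch and not as the approach used in this notebook, is to pass a `key` function to `sorted` so the [hour, average] pairs keep their original order. The name `sample_avg_by_hour` is an assumption, and its values are a few pairs copied from `avg_comments_hour` above.

```python
# A few [hour, average] pairs copied from the avg_comments_hour output above.
sample_avg_by_hour = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort on the second element (the average), highest first; no column swap needed.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
top_hours = [hour for hour, _ in sorted_by_avg]
print(top_hours)
```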
# display the 5 hours with the highest average number of comments per post
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_swap[0:5]:
    print('{}: {avg:,.2f} average comments per post'.format(hr, avg=avg))
Top 5 Hours for Ask Posts Comments
15: 38.59 average comments per post
02: 23.81 average comments per post
20: 21.52 average comments per post
16: 16.80 average comments per post
21: 16.01 average comments per post
The purpose of this project was to analyse post types on Hacker News and determine whether one type is more popular and attracts more comments.
Bearing in mind that the data set contains only posts that received comments, the results show that ask posts receive about 4 more comments on average than show posts (14.04 vs 10.32). Among ask posts, those created during the 15:00 hour (3 p.m. EST) received the most comments on average, at 38.59 per post.
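Readers in other time zones may want to translate the 3 p.m. EST result into their own local time. The sketch below assumes, as stated above, that the timestamps are in EST, and treats EST as a fixed UTC-5 offset with no daylight saving adjustment; the date used is arbitrary and the variable names are illustrative.

```python
import datetime as dt

# Assumption: EST is modelled as a fixed UTC-5 offset (daylight saving is ignored).
EST = dt.timezone(dt.timedelta(hours=-5), "EST")

# The date is arbitrary; only the 15:00 hour matters for this illustration.
peak_est = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)
peak_utc = peak_est.astimezone(dt.timezone.utc)

print(peak_utc.strftime("%H:%M UTC"))
```

The same `astimezone` call accepts any other `tzinfo` object, so the UTC target can be replaced with a reader's own offset.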