In this project, we will work with a data set of submissions to the popular technology site Hacker News. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.
We are specifically interested in posts whose titles begin with either Ask HN or Show HN.
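Ask HN posts are submitted to ask the Hacker News community a specific question.
Show HN posts are submitted to show the Hacker News community a project, product, or just generally something interesting.
Our goal for this project is to compare these two types of post and determine the following:
Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?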
The data set was reduced to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
Since reading the file without an explicit encoding raises a UnicodeDecodeError, we add encoding="utf8" to the open() call: open('hacker_news.csv', encoding='utf8').
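In the cell below, we:
Read the file in as a list of lists using list() and save it to a variable named hn_dataset
Extract the first row, which contains the column headers, into hn_header
Store the remaining rows in hn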
# import reader
from csv import reader
# hacker_news data set
opened_file = open('hacker_news.csv', encoding='utf8')
read_file = reader(opened_file)
hn_dataset = list(read_file)
hn_header = hn_dataset[0]
hn = hn_dataset[1:]
print(hn_header)
print('\n')
print(hn_dataset[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
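The cell above leaves the file handle open. As a minimal sketch of an alternative, the same read can be wrapped in a with block so the file is closed automatically. The helper name read_csv_rows and the small temporary sample file are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile


def read_csv_rows(path, encoding="utf8"):
    """Return (header, rows) from a CSV file, closing the file afterwards."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Hypothetical one-row file standing in for hacker_news.csv
sample = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", encoding="utf8", newline="") as f:
        f.write(sample)
    sample_header, sample_rows = read_csv_rows(sample_path)

print(sample_header)
print(sample_rows[0][1])
```

With the real file, the call would look like read_csv_rows('hacker_news.csv'), returning the header and the data rows separately.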
Since we are only concerned with post titles beginning with Ask HN or Show HN, we will create new lists of lists containing just the data for those titles.
To find the posts that begin with either Ask HN or Show HN, we will use the string method str.startswith(). To make the comparison case-insensitive, we first call str.lower(), which returns a lowercase version of the title.
# separate posts beginning with Ask HN and Show HN
ask_posts = []
show_posts = []
other_posts = []
# iterate over hn (not hn_dataset) so the header row is not counted as a post
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('ask_posts: ', len(ask_posts))
print('show_posts: ', len(show_posts))
print('other_posts: ', len(other_posts))
ask_posts: 1744
show_posts: 1162
other_posts: 17194
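To see the prefix check in isolation, here is a small sketch that labels a few titles. The function name classify_title and the sample titles are hypothetical and only illustrate how lower() and startswith() combine.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical titles for illustration only
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

The second title shows why lowercasing matters: without it, "SHOW HN" would not match "show hn".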
In the previous cell, we separated the ask posts and the show posts into two lists of lists named ask_posts and show_posts.
Next, we will determine if ask posts or show posts receive more comments on average.
# average Ask HN
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
14.038417431192661
# average Show HN
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
10.31669535283993
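The two cells above repeat the same loop. As a sketch, the calculation could be factored into one function. The name average_comments, its comments_index parameter, and the three sample rows are assumptions for illustration; only index 4 (num_comments) is read.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column order as the data set
sample_posts = [
    ["1", "Ask HN: A", "", "5", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: B", "", "7", "29", "user_b", "11/22/2015 13:43"],
    ["3", "Ask HN: C", "", "2", "1", "user_c", "5/2/2016 10:14"],
]
print(average_comments(sample_posts))  # 12.0
```

In the notebook, this would be called as average_comments(ask_posts) and average_comments(show_posts).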
On average, ask posts receive approximately 14 comments, whereas show posts receive approximately 10.
Since ask posts are more likely to receive comments, we'll focus our remaining analysis on these posts.
Next, we will determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.
We will tackle the first step: calculating the number of ask posts and comments by hour created. We will use the datetime module to work with the data in the created_at column.
We use the datetime.strptime() method to parse dates stored as strings and return datetime objects.
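As a quick illustration of the parsing step, the sketch below parses one created_at value taken from the data set and extracts the hour in two ways. The variable names are chosen here for illustration.

```python
import datetime as dt

date_string = "8/16/2016 9:55"  # a created_at value from the data set
parsed = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M")

print(parsed)                 # 2016-08-16 09:55:00
print(parsed.hour)            # 9  (integer attribute)
print(parsed.strftime("%H"))  # 09 (zero-padded string)
```

The analysis uses the zero-padded string form as the dictionary key, which is why the hours appear as '09', '13', and so on.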
# import datetime module
import datetime as dt
# collect the created_at and num_comments columns
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
result_list[:5]
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]
# number of ask posts and comments by hour
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date_string = row[0]
    # parse the date and create a datetime object
    time = dt.datetime.strptime(date_string, date_format)
    # select just the hour from the datetime object
    hour = time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
print('counts_by_hour: ', counts_by_hour)
print('\n')
print('comments_by_hour: ', comments_by_hour)
counts_by_hour: {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} comments_by_hour: {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
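The if/else bookkeeping above can also be written with collections.defaultdict, which starts every missing key at 0. This is a sketch of that alternative, not the notebook's method; the sample pairs and the sample_ variable names are hypothetical.

```python
from collections import defaultdict
import datetime as dt

# Hypothetical [created_at, num_comments] pairs for illustration
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 9:20", 3],
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # comments per hour
for date_string, num_comments in sample_results:
    hour = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'09': 2, '13': 1, '10': 1}
print(dict(sample_comments))  # {'09': 9, '13': 29, '10': 1}
```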
We created two dictionaries:
counts_by_hour: the number of ask posts created during each hour of the day
comments_by_hour: the number of comments that ask posts created at each hour received
Next, we will use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
# average number of comments
avg_by_hour = []
for hour in comments_by_hour:
    avg_num = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_num])
avg_by_hour
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
In the last cell, we calculated the average number of comments for posts created during each hour of the day, and stored the results in a list of lists named avg_by_hour.
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. We will finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
# sorting and printing values
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
# descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments:")
for row in sorted_swap[:5]:
    time_string = row[1]
    # parse the hour and create a datetime object
    time_top = dt.datetime.strptime(time_string, "%H")
    # format the datetime object as HH:MM
    hour_top = time_top.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_top, row[0]))
Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
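Swapping the columns is one way to sort by the average. As a sketch of an alternative, sorted() also accepts a key function, so the original [hour, average] pairs can be sorted directly. The sample list below holds a few rounded values from the results above; its name is chosen for illustration.

```python
# A few [hour, average] pairs, rounded from the results above
sample_avg_by_hour = [["09", 5.58], ["15", 38.59], ["02", 23.81], ["20", 21.52]]

# Sort by the second element (the average), highest first
top_hours = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, avg in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This avoids building the intermediate swapped list and keeps each pair in its original order.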
Based on our analysis, we recommend creating Ask HN posts between 15:00 and 16:00 (3:00 pm to 4:00 pm) in order to have a higher chance of receiving comments.
Posts created in the 20:00 and 21:00 hours receive on average 21.52 and 16.01 comments per post, respectively, which makes the evening another good option.
The hour that receives the most comments per post on average is 15:00, with an average of 38.59 comments per post. The time zone used is Eastern Time in the US, so we could also write 15:00 as 3:00 pm EST.
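Readers in other time zones may want to convert the best hour. The sketch below converts 15:00 to UTC under a simplifying assumption: Eastern Time is treated as a fixed UTC-5 offset, which ignores daylight saving time, and the calendar date is arbitrary.

```python
import datetime as dt

# Assumption: fixed UTC-5 offset for Eastern Standard Time (no daylight saving)
EST = dt.timezone(dt.timedelta(hours=-5), "EST")

# Arbitrary date; only the hour matters for this illustration
best_hour_est = dt.datetime(2016, 1, 15, 15, 0, tzinfo=EST)
best_hour_utc = best_hour_est.astimezone(dt.timezone.utc)

print(best_hour_est.strftime("%H:%M %Z"))  # 15:00 EST
print(best_hour_utc.strftime("%H:%M %Z"))  # 20:00 UTC
```

During daylight saving time the offset would be UTC-4, so the converted hour would shift by one.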