Hacker News is a site where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. This project explores a Hacker News dataset to compare posts whose titles begin with either 'Ask HN' or 'Show HN'. Users submit 'Ask HN' posts to ask the Hacker News community a specific question, and 'Show HN' posts to show the community a project, product, or just something interesting.
We'll compare these two types of posts to determine the following:
- Do 'Ask HN' or 'Show HN' posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the dataset into a list of lists. We will also print the header row and the first few data rows to get an idea of what the dataset contains.
from csv import reader
with open('hacker_news.csv') as opened_file:  # the file is closed automatically
    hn = list(reader(opened_file))
headers = hn[0]  # the first row holds the column names
print('Our header is:\n', headers)
print('\n')
hn = hn[1:]  # keep only the data rows for analysis
print('Our dataset for analysis is:\n', hn[:5])
Our header is: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] Our dataset for analysis is: [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
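As a small aside, the header/data split can be tried without the real file. The sketch below is only an illustration: the `sample_csv` string and the `split_header` helper are assumptions made for this example and are not part of the original notebook.

```python
import csv
import io

# Hypothetical two-row sample in the same column layout as hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

def split_header(file_obj):
    """Return (header_row, data_rows) from an open CSV file-like object."""
    all_rows = list(csv.reader(file_obj))
    return all_rows[0], all_rows[1:]

# io.StringIO stands in for an opened file here.
with io.StringIO(sample_csv) as f:
    sample_headers, sample_rows = split_header(f)

print(sample_headers)
print(len(sample_rows))
```

With the real dataset, the same helper could be called inside `with open('hacker_news.csv') as f:`.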
Next, we'll build lists containing only the posts we're interested in: those whose titles start with 'Ask HN' or 'Show HN'. We'll do that with the string method 'startswith'.
Because 'startswith' is case-sensitive, we'll convert each title to lower case before checking it, so that we don't miss posts written as, for example, 'ask hn' or 'ASK HN'.
ask_posts = []    # posts whose titles begin with 'ask hn'
show_posts = []   # posts whose titles begin with 'show hn'
other_posts = []  # all remaining posts
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):     # lower-cased title begins with 'ask hn'
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):  # lower-cased title begins with 'show hn'
        show_posts.append(row)
    else:                                      # title begins with anything else
        other_posts.append(row)
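To see why the lower-casing step matters, here is a minimal sketch of the same prefix check wrapped in a function. The `classify` helper and the test titles are hypothetical and exist only for this illustration.

```python
def classify(title):
    """Return 'ask', 'show', or 'other' using a case-insensitive prefix check."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# A case-sensitive check misses a title written in lower case.
print('ask hn: any tips?'.startswith('Ask HN'))  # False
print(classify('ask hn: any tips?'))             # ask
print(classify('Show HN: My project'))           # show
print(classify('Interactive Dynamic Video'))     # other
```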
Now let's check the number of posts in 'ask_posts', 'show_posts', and 'other_posts'. We will also print the first three rows of each list to make sure they contain the right posts.
print('The number of ask_posts:', len(ask_posts))
print('The number of show_posts:', len(show_posts))
print('The number of other_posts:', len(other_posts))
print('\n')
print(ask_posts[:3], '\n')
print(show_posts[:3], '\n')
print(other_posts[:3])
The number of ask_posts: 1744 The number of show_posts: 1162 The number of other_posts: 17194 [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']] [['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
Since we are only interested in 'ask_posts' and 'show_posts', we will set the 'other_posts' list aside from here on.
Let's determine whether 'ask_posts' or 'show_posts' receive more comments on average.
total_ask_comments = 0
total = 0  # number of ask posts, used as the divisor for the average
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    total += 1
avg_ask_comments = total_ask_comments / total
print('The average number of comments on ask_posts is:', avg_ask_comments, '\n')
total_show_comments = 0
total = 0  # number of show posts, used as the divisor for the average
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    total += 1
avg_show_comments = total_show_comments / total
print('The average number of comments on show_posts is:', avg_show_comments)
The average number of comments on ask_posts is: 14.038417431192661 The average number of comments on show_posts is: 10.31669535283993
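The two loops above do the same work on different lists, so they could be folded into one helper. The following is a sketch under that assumption; `average_comments` is a hypothetical name, and the sample rows are the first three 'Ask HN' posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column (index 4) over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[comments_index]) for row in posts) / len(posts)

sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask_posts))  # (6 + 29 + 1) / 3 = 12.0
```

Calling the helper on the full 'ask_posts' and 'show_posts' lists should give the same averages as the loops.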
The results show that 'Ask HN' posts receive more comments on average: about 14 comments per post, compared with about 10 for 'Show HN' posts.
Since 'Ask HN' posts are more likely to receive comments, we will focus the remaining analysis on them.
'Ask HN' Posts per Time
Next, we'll determine whether ask posts created at a certain time are more likely to attract comments. We'll do this in two steps: first count the ask posts and comments for each hour of the day, then calculate the average number of comments per post for each hour.
Let's calculate the number of 'ask_posts' and comments by hour created, using the 'datetime' module.
import datetime as dt
result_list = []  # pairs of [created_at, number of comments]
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
counts_by_hour = {}    # number of ask posts created in each hour
comments_by_hour = {}  # total number of comments on posts created in each hour
for row in result_list:
    date = row[0]
    comment_num = row[1]  # already converted to int above
    date_dt = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")  # parse the date string into a datetime object
    hour = date_dt.strftime("%H")  # extract the hour as a zero-padded string
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_num
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_num
print('The number of ask_posts in each hour:\n', counts_by_hour, '\n')
print('The number of comments in each hour:\n', comments_by_hour)
The number of ask_posts in each hour: {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} The number of comments in each hour: {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
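The same tallying can be written a little more compactly with 'dict.get', which supplies a default of 0 the first time an hour is seen. This is only a sketch on a tiny sample; the names `sample`, `sample_counts`, and `sample_comments` are assumptions for the example, and the last sample row is made up to show two posts landing in the same hour.

```python
import datetime as dt

# (created_at, num_comments) pairs; the final pair is a made-up row.
sample = [
    ('8/16/2016 9:55', 6),
    ('11/22/2015 13:43', 29),
    ('5/2/2016 10:14', 1),
    ('8/16/2016 9:05', 4),
]

sample_counts = {}    # posts per hour
sample_comments = {}  # comments per hour
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    sample_counts[hour] = sample_counts.get(hour, 0) + 1
    sample_comments[hour] = sample_comments.get(hour, 0) + num_comments

print(sample_counts)    # {'09': 2, '13': 1, '10': 1}
print(sample_comments)  # {'09': 10, '13': 29, '10': 1}
```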
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
avg_by_hour = []  # pairs of [hour, average comments per post in that hour]
for hour in comments_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])
print('Average number of comments per post by hour:')
avg_by_hour
Average number of comments per post by hour:
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
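For reference, the division of one dictionary by the other can also be expressed as a dictionary comprehension. The sketch below uses three of the hours from the output above; `counts_subset`, `comments_subset`, and `avg_subset` are hypothetical names used only here.

```python
# Three hours taken from the counts and comments printed earlier.
counts_subset = {'15': 116, '02': 58, '20': 80}
comments_subset = {'15': 4477, '02': 1381, '20': 1722}

# Average comments per post for each hour present in both dictionaries.
avg_subset = {
    hour: comments_subset[hour] / counts_subset[hour]
    for hour in comments_subset
}

for hour, average in avg_subset.items():
    print(hour, round(average, 2))
```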
Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])  # swap to [average, hour] so sorting uses the average
swap_avg_by_hour
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
In the cell above, we swapped the two elements of each entry in 'avg_by_hour' so that the average number of comments comes first. This way, sorting the list orders the entries by average rather than by hour. We then use the 'sorted()' function with reverse = True to sort from highest to lowest, as shown below:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
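An alternative that avoids the swap is to pass a 'key' function to 'sorted()', which tells it which element of each entry to compare. This is a sketch on four of the entries above; `avg_sample` and `sorted_sample` are hypothetical names for the example.

```python
# Four [hour, average] entries taken from the avg_by_hour output.
avg_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the second element (the average), highest first, without swapping.
sorted_sample = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in sorted_sample:
    print(hour, average)
```

The entries keep their original [hour, average] order, so no second list is needed.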
'Ask Posts' Comments
Now let's look at the five hours with the highest average number of comments per post.
print('Top 5 Hours for Ask Posts Comments\n')
for row in sorted_swap[:5]:
    avg_comment = row[0]
    hr = row[1]
    hr_dt = dt.datetime.strptime(hr, '%H')  # parse the hour string
    hr_final = hr_dt.strftime('%H')         # format it back as a zero-padded hour
    print('{}: {:.2f} average comments per post.'.format(hr_final, avg_comment))
Top 5 Hours for Ask Posts Comments 15: 38.59 average comments per post. 02: 23.81 average comments per post. 20: 21.52 average comments per post. 16: 16.80 average comments per post. 21: 16.01 average comments per post.
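If a clock-style label such as 15:00 is preferred in the printout, the format string passed to 'strftime' can include minutes. This is a minimal sketch; `format_hour` and `top_two` are hypothetical names, and the two entries come from the sorted output above.

```python
import datetime as dt

def format_hour(hour_string):
    """Turn a zero-padded hour such as '15' into a label such as '15:00'."""
    return dt.datetime.strptime(hour_string, '%H').strftime('%H:%M')

top_two = [[38.5948275862069, '15'], [23.810344827586206, '02']]
for avg_comment, hr in top_two:
    print('{}: {:.2f} average comments per post.'.format(format_hour(hr), avg_comment))
# 15:00: 38.59 average comments per post.
# 02:00: 23.81 average comments per post.
```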
These results show that 15:00 has the highest average number of comments per post (38.59). Four of the top five hours (15:00, 16:00, 20:00, and 21:00) fall between 15:00 and 21:00, which suggests that 'Ask HN' posts created in that window tend to receive more comments. The exception is 02:00, which ranks second with 23.81.
In conclusion, this project answers two questions:
- Do 'Ask HN' or 'Show HN' posts receive more comments on average?
Answer: From the findings, 'Ask HN' posts generate more comments, with an average of about 14 comments per post, while 'Show HN' posts have an average of about 10 comments per post.
- Do 'Ask HN' posts created at a certain time receive more comments on average?
Answer: From the findings, the hour with the highest average number of comments is the 15:00 hour, that is, 15:00 to 16:00 (given here as WAT). It is therefore the best time for users to create 'Ask HN' posts if they want more comments.

In summary, here is a table that lists the top 5 hours for ask posts comments.
Hour | Average Comments per Post |
---|---|
15 | 38.59 |
02 | 23.81 |
20 | 21.52 |
16 | 16.80 |
21 | 16.01 |