Hacker News is a site started by the startup incubator Y Combinator, where user-submitted posts are voted and commented upon, similar to Reddit.
In this project we're interested in posts whose titles begin with either Ask HN (where users submit posts to ask the Hacker News community a specific question) or Show HN (where users submit posts to show the Hacker News community a project, product, or just generally something interesting).
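We will compare these two types of posts to determine which type receives more comments on average, and whether posts created at a certain time of day receive more comments on average.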
You can find the data set here, but keep in mind that it has been reduced from 300,000 rows to approximately 20,000 rows by removing submissions that did not receive any comments.
First, we will read in the data.
# Import the reader to read in the file
from csv import reader
file = open('hacker_news.csv')
read_file = reader(file)
hn = list(read_file)
headers = hn[0] # The first row of the data set.
hn = hn[1:] # The data set without the first row.
print(headers) # Display the headers.
print('\n')
print(hn[:5]) # Display the first five rows of the data set.
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
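The read step above never closes the file handle. As a minimal sketch of the same step, the helper below takes an already open file object so it can be used inside a `with` block, which closes the file automatically. The helper name `load_rows`, the in-memory sample, and the `encoding` and `newline` arguments in the commented call are assumptions made for illustration, not part of the original notebook.

```python
import io
from csv import reader


def load_rows(file_obj):
    """Return (headers, rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# Hypothetical one-row sample standing in for hacker_news.csv.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
headers, rows = load_rows(sample)
print(headers)
print(rows[0][1])

# With the real file, the same helper could be called like this
# (the encoding is an assumption about how the file was saved):
# with open('hacker_news.csv', encoding='utf-8', newline='') as f:
#     headers, hn = load_rows(f)
```

Every value still comes back as a string, so numeric columns such as num_comments need an explicit int() conversion before any arithmetic.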
Since we are only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles.
To find the posts that begin with Ask HN or Show HN, we'll use the string method startswith(). Because startswith() is case-sensitive, each title is converted to lowercase first.
# Separate the hn data set into separate lists of lists for the
# various post types (ask_posts, show_posts, other_posts)
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    # lower() returns a lowercase version of the string
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Count number of posts per type
num_ask_posts = len(ask_posts)
num_show_posts = len(show_posts)
num_other_posts = len(other_posts)
# Display the number of posts per type
print('There are {} Ask HN posts.'.format(num_ask_posts))
print('\n')
print('There are {} Show HN posts.'.format(num_show_posts))
print('\n')
print('There are {} other posts.'.format(num_other_posts))
There are 1744 Ask HN posts. There are 1162 Show HN posts. There are 17194 other posts.
# Display the first five rows of each list of lists
print(ask_posts[:5])
print('\n')
print(show_posts[:5])
print('\n')
print(other_posts[:5])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']] [['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 
'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
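The prefix test used to split the posts can be sketched as a small standalone function. The name `classify_title` and its return labels are hypothetical, chosen only for this illustration; the sample titles are taken from the rows shown above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # startswith() is case-sensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
for title in titles:
    print(classify_title(title), '|', title)
```

Lowercasing once and reusing the result avoids calling lower() separately for each prefix check.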
After separating the Ask HN posts and Show HN posts into different lists of lists, we will calculate the average number of comments for each and compare them.
# This function counts the total and average number of comments in a data set
def count_comments(dataset, title_HN):
    total_comments = 0
    for row in dataset:
        total_comments += int(row[4])  # num_comments is stored as a string
    avg_comments = total_comments / len(dataset)
    print('There are {} comments in {} over {} posts.'.format(total_comments, title_HN, len(dataset)))
    print('There is an average of {} comments on {} posts.'.format(avg_comments, title_HN))
    print('\n')
    return total_comments, avg_comments
# Ask HN posts
total_ask_comments, avg_ask_comments = count_comments(ask_posts, "Ask HN")
# Show HN posts
total_show_comments, avg_show_comments = count_comments(show_posts, "Show HN")
There are 24483 comments in Ask HN over 1744 posts.
There is an average of 14.038417431192661 comments on Ask HN posts.
There are 11988 comments in Show HN over 1162 posts.
There is an average of 10.31669535283993 comments on Show HN posts.
We see that Ask HN posts receive more comments than Show HN posts: an average of about 14 comments per Ask HN post versus about 10 comments per Show HN post. Because Ask HN posts attract more comments, we will focus the remaining analysis on them.
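As a quick check of the arithmetic behind those averages, the short sketch below recomputes them from the totals and post counts printed above. The variable names are chosen for this illustration only.

```python
# Totals and post counts copied from the printed output above.
ask_total, ask_count = 24483, 1744
show_total, show_count = 11988, 1162

ask_avg = ask_total / ask_count
show_avg = show_total / show_count

print('Ask HN average: {:.2f}'.format(ask_avg))
print('Show HN average: {:.2f}'.format(show_avg))
print('Ratio of Ask HN to Show HN: {:.2f}'.format(ask_avg / show_avg))
```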
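After finding that Ask HN posts have more comments per post, we'd like to see if there is a certain period of time within a day that attracts more comments. Our strategy is to count the Ask HN posts created during each hour of the day, along with the number of comments they received, and then calculate the average number of comments per post for each hour.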
import datetime as dt
# result_list contains the date and time each post was created
# and the number of comments on that post
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
counts_by_hour = {}  # number of Ask HN posts created during each hour of the day
comments_by_hour = {}  # total number of comments on Ask HN posts created during each hour
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comment = row[1]
    # Parse the timestamp and keep only the hour as a two-digit string
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        counts_by_hour[time] += 1
        comments_by_hour[time] += comment
    else:
        counts_by_hour[time] = 1
        comments_by_hour[time] = comment
print(counts_by_hour)
print('\n')
print(comments_by_hour)
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
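The hour extraction and per-hour tallies can also be sketched with `collections.defaultdict`, which removes the need for the if/else branch. This is a swapped-in variation on the loop above, not the notebook's own method. The `sample` list is illustrative data in the same [created_at, num_comments] shape as result_list, and the names `counts` and `comments` are assumptions for this example.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Illustrative [created_at, num_comments] pairs (not the full data set).
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 9:20', 3],
]

counts = defaultdict(int)    # posts created during each hour
comments = defaultdict(int)  # total comments on posts created during each hour
for created_at, num_comments in sample:
    # strptime parses the string; strftime("%H") keeps the zero-padded hour
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```

A missing key in a defaultdict(int) starts at 0, so the first post seen for an hour needs no special case.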
avg_by_hour = []  # The average number of comments per post for each hour
for time in comments_by_hour:
    avg_by_hour.append([time, round(comments_by_hour[time] / counts_by_hour[time], 2)])
print(avg_by_hour)
[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]
We now have a list of lists with the average number of comments for posts created during each hour of the day, but this format is difficult to read. We will sort the list of lists and print the five highest values and the five lowest values for easier analysis.
def sort_avg_by_hour(dataset):
    # Swap each [hour, average] pair so that sorting orders by the average
    swap_avg_by_hour = []
    for row in dataset:
        swap_avg_by_hour.append([row[1], row[0]])
    # Sort from highest to lowest to find the best hours to post
    sorted_swap = sorted(swap_avg_by_hour, reverse=True)
    print("Top 5 Hours for `Ask HN` comments")
    for row in sorted_swap[:5]:
        print("{} : {} average comments per post".format(dt.datetime.strptime(row[1], '%H').strftime('%H:%M'), row[0]))
    print('\n')
    # Sort from lowest to highest to find the worst hours to post
    sorted_swap = sorted(swap_avg_by_hour, reverse=False)
    print("Bottom 5 Hours for `Ask HN` comments")
    for row in sorted_swap[:5]:
        print("{} : {} average comments per post".format(dt.datetime.strptime(row[1], '%H').strftime('%H:%M'), row[0]))
sort_avg_by_hour(avg_by_hour)
Top 5 Hours for `Ask HN` comments
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.8 average comments per post
21:00 : 16.01 average comments per post

Bottom 5 Hours for `Ask HN` comments
09:00 : 5.58 average comments per post
22:00 : 6.75 average comments per post
04:00 : 7.17 average comments per post
03:00 : 7.8 average comments per post
07:00 : 7.85 average comments per post
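The same ranking can be produced without swapping the columns by passing a `key` function to `sorted()`. This is an alternative sketch rather than the notebook's approach. The `avg_by_hour` values are copied from the output printed earlier, and the names `by_avg`, `top_five`, and `bottom_five` are assumptions for this example.

```python
# [hour, average comments per post] pairs copied from the earlier output.
avg_by_hour = [
    ['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23],
    ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46],
    ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81],
    ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8],
    ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17],
    ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05],
]

# Sort by the average (second element), highest first.
by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
top_five = by_avg[:5]
bottom_five = by_avg[-5:][::-1]  # lowest average first

for hour, avg in top_five:
    print('{}:00 : {} average comments per post'.format(hour, avg))
print()
for hour, avg in bottom_five:
    print('{}:00 : {} average comments per post'.format(hour, avg))
```

Sorting once and slicing both ends avoids building a second, swapped list.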
We found that Ask HN posts created at 3:00 PM, 2:00 AM, 8:00 PM, 4:00 PM, and 9:00 PM (Eastern US time zone) generate the most comments on average. Each time refers to the hour in which the post was created, so 3:00 PM covers posts created from 3:00 PM to 3:59 PM.
We also see that Ask HN posts created at 9:00 AM, 10:00 PM, 4:00 AM, 3:00 AM, and 7:00 AM (Eastern US time zone) generate the fewest comments on average.
With some exceptions, afternoon and evening hours seem like the best time to create Ask HN posts for more comment engagement.
In this project we analyzed Ask HN and Show HN posts and found that Ask HN posts receive more comments per post on average (about 14 versus about 10). If we want a post to attract more comments, it should be an Ask HN post created between 3:00 PM and 4:00 PM Eastern US time, the hour with the highest average (38.59 comments per post).