Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
You can find the data set here
We'll compare these two types of posts to determine the following:
After analyzing the data, we can say that Ask HN posts receive more comments on average than Show HN posts. Moreover, the time of posting has an impact on the average number of comments: Ask HN posts created at 15:00 receive, on average, the highest number of comments. On the other hand, average points are higher for Show HN posts, peaking for those created at 12:00. For more details, please refer to the full analysis below.
# import csv module to read data
from csv import reader
# import datetime module to convert dates (the code below calls dt.datetime.strptime)
import datetime as dt
opened_file = open('HN_posts_year_to_Sep_26_2016.csv')
# read file with reader method from csv module
read_file = reader(opened_file)
# store the rows as a list of lists in the variable hn (short for Hacker News)
hn = list(read_file)
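The `open` call above never closes the file. A minimal sketch of the same loading step with a context manager follows. The helper name `load_rows` and the `utf-8` encoding are assumptions for illustration, not part of the original notebook, and the demonstration runs on a throwaway file rather than the real data set.

```python
import csv
import os
import tempfile


def load_rows(path, encoding="utf-8"):
    """Return every CSV row in `path` as a list of lists (header included)."""
    # newline="" is what the csv module documentation recommends for csv.reader
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.reader(f))


# Demonstration on a small temporary file (hypothetical content)
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("id,title,num_comments\n1,Ask HN: example,3\n")
    demo_path = tmp.name

demo_rows = load_rows(demo_path)
os.remove(demo_path)
print(demo_rows)
```

If the real file is not UTF-8, passing a different `encoding` argument would be the place to adjust.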
Below are descriptions of the columns:
# First 5 rows of dataset including header
for row in hn[:5]:
    print(row)
    print('\n')
# Show the number of rows and columns in dataset
print('The data set has '+ str(len(hn))+' rows.')
print("Each row has "+ str(len(hn[0]))+ ' columns.')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']
The data set has 293120 rows.
Each row has 7 columns.
# set 0 index for header and rest index for rows
header, hn = hn[0], hn[1:]
print("The header is : ", header)
print()
print("Rows without header :", hn[:5])
The header is : ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Rows without header : [['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
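The rest of the analysis refers to columns by position (for example `row[4]` for the number of comments). As a small sketch of an alternative, the header row can be turned into a name-to-index lookup so the positions do not have to be remembered. The `col` dictionary is a hypothetical helper; the header and sample row are copied from the data shown in this notebook.

```python
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row
col = {name: i for i, name in enumerate(header)}

sample_row = ['12578908', 'Ask HN: What TLD do you use for local development?',
              '', '4', '7', 'Sevrene', '9/26/2016 2:53']

# Look up values by column name instead of a hard-coded index
num_points = int(sample_row[col['num_points']])
num_comments = int(sample_row[col['num_comments']])
print(num_points, num_comments)
```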
# Create three empty lists
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("There are "+ str(len(ask_posts))+" Ask HN posts")
print("There are "+ str(len(show_posts))+" Show HN posts")
print("There are "+ str(len(other_posts))+" other posts")
There are 9139 Ask HN posts
There are 10158 Show HN posts
There are 273822 other posts
We can see that there are slightly more Show HN posts (10,158) than Ask HN posts (9,139); the vast majority of posts belong to neither category.
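The prefix test used to split the posts can be sketched as a small standalone function, which makes it easy to try on individual titles. The name `classify_title` and its return labels are assumptions for illustration; the last title in the sample list is made up to show that the match is case-insensitive.

```python
def classify_title(title):
    """Label a title as 'ask', 'show' or 'other' based on its prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


titles = [
    "Ask HN: What TLD do you use for local development?",
    "Show HN: Finding puns computationally",
    "SQLAR the SQLite Archiver",
    "ask hn: a lower-case prefix",  # hypothetical title
]
labels = [classify_title(t) for t in titles]
print(labels)
```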
print("The first five rows of ask posts : \n", ask_posts[:5])
The first five rows of ask posts : [['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]
print("The first five rows of show posts : \n", show_posts[:5])
The first five rows of show posts : [['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'], ['12577991', 'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules', 'https://github.com/jakebian/zeal', '2', '0', 'dbranes', '9/25/2016 23:17'], ['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']]
# calculate total and average ask_post comments
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
print("Total ask post {:,}".format(total_ask_comments))
print("Average ask post ", round(avg_ask_comments,2))
Total ask post 94,986
Average ask post 10.39
# calculate total and average show_post comments
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)
print("Total show posts {:,}".format(total_show_comments))
print("Average show posts ", round(avg_show_comments,2))
Total show posts 49,633
Average show posts 4.89
Post type | comments(total) | comments(avg) |
---|---|---|
Ask HN | 94,986 | 10.39 |
Show HN | 49,633 | 4.89 |
As the results above show, Ask HN posts receive more comments on average (10.39) than Show HN posts (4.89).
Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
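The two averaging loops differ only in the list they read, so the calculation can be sketched once as a function. `average_column` is a hypothetical helper name; the sample rows are the first three Ask HN rows printed earlier, and the last two lines recompute the reported averages from the totals and post counts printed above.

```python
def average_column(rows, index):
    """Mean of the integer values stored at `index` across `rows`."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)


sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]
sample_avg = average_column(sample_ask, 4)  # comments are 7, 3 and 0

# Cross-check of the averages reported above: total comments / number of posts
ask_avg = round(94_986 / 9_139, 2)
show_avg = round(49_633 / 10_158, 2)
print(sample_avg, ask_avg, show_avg)
```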
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
# Create an empty list where we load new columns
result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])
comments_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for each_row in result_list:
    dt_created = each_row[0]
    num_comment = each_row[1]
    time = dt.datetime.strptime(dt_created, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += num_comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = num_comment
        counts_by_hour[time] = 1
print("Hourly comments number : ", comments_by_hour)
print("\n")
print("Hourly counted number :", counts_by_hour)
Hourly comments number : {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}

Hourly counted number : {'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}
The totals show that Ask HN posts created at 15:00 received by far the most comments overall (18,525).
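The hourly grouping can also be sketched as one function built on `collections.defaultdict`, which removes the need for the `if`/`else` branch. The function name `average_by_hour` is an assumption for illustration; the sample pairs are the `created_at` and `num_comments` values of the first five Ask HN rows printed earlier.

```python
from collections import defaultdict
from datetime import datetime


def average_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Average the values in (created_at, value) pairs, grouped by hour."""
    totals = defaultdict(int)  # hour -> sum of values
    counts = defaultdict(int)  # hour -> number of posts
    for created_at, value in pairs:
        hour = datetime.strptime(created_at, date_format).strftime("%H")
        totals[hour] += value
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}


sample = [
    ("9/26/2016 2:53", 7),
    ("9/26/2016 1:17", 3),
    ("9/25/2016 22:57", 0),
    ("9/25/2016 22:48", 3),
    ("9/25/2016 21:50", 2),
]
hourly = average_by_hour(sample)
print(hourly)
```

Because the two dictionaries are local to the function, running it on Show HN posts cannot overwrite the counts collected for Ask HN posts.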
# Calculate the average number of comments per Ask HN post for each hour
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour
[['02', 14.334928229665072], ['01', 8.45748987854251], ['22', 8.944297082228116], ['21', 10.465116279069768], ['19', 7.111510791366906], ['17', 7.289093298291721], ['15', 22.15909090909091], ['14', 7.14367816091954], ['13', 11.87704918032787], ['11', 6.95771144278607], ['10', 9.328173374613003], ['09', 4.8907284768211925], ['07', 6.716101694915254], ['03', 10.45631067961165], ['23', 7.200626959247649], ['20', 8.49904761904762], ['16', 5.575530586766542], ['08', 7.474683544303797], ['00', 8.25], ['18', 7.434451219512195], ['12', 8.205426356589147], ['04', 12.164948453608247], ['06', 8.265625], ['05', 10.686046511627907]]
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
[[14.334928229665072, '02'], [8.45748987854251, '01'], [8.944297082228116, '22'], [10.465116279069768, '21'], [7.111510791366906, '19'], [7.289093298291721, '17'], [22.15909090909091, '15'], [7.14367816091954, '14'], [11.87704918032787, '13'], [6.95771144278607, '11'], [9.328173374613003, '10'], [4.8907284768211925, '09'], [6.716101694915254, '07'], [10.45631067961165, '03'], [7.200626959247649, '23'], [8.49904761904762, '20'], [5.575530586766542, '16'], [7.474683544303797, '08'], [8.25, '00'], [7.434451219512195, '18'], [8.205426356589147, '12'], [12.164948453608247, '04'], [8.265625, '06'], [10.686046511627907, '05']]
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print("Top 5 Hours for Ask Posts Comments :", sorted_swap[:5])
Top 5 Hours for Ask Posts Comments : [[22.15909090909091, '15'], [14.334928229665072, '02'], [12.164948453608247, '04'], [11.87704918032787, '13'], [10.686046511627907, '05']]
for avg, hr in sorted_swap[:5]:
    print("{}:{:.2f}".format(
        dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg))
15:00:22.16
02:00:14.33
04:00:12.16
13:00:11.88
05:00:10.69
According to the dataset documentation, the timestamps are in US Eastern time, so 15:00 corresponds to 3:00 pm Eastern. Per the output above, the top 5 hours for average comments on Ask HN posts are 15:00, 02:00, 04:00, 13:00 and 05:00. The hour that receives the most comments on average is 15:00, with about 22 comments per post.
# calculate total and average ask_post points
total_ask_points= 0
for row in ask_posts:
    num_points = int(row[3])
    total_ask_points += num_points
avg_ask_points= total_ask_points/len(ask_posts)
print(" Total ask points {:,}".format(total_ask_points))
print("Average ask points :", round(avg_ask_points,2))
Total ask points 103,378
Average ask points : 11.31
# calculate total and average show_post points
total_show_points = 0
for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
avg_show_points = total_show_points/len(show_posts)
print("Total show posts {:,}".format(total_show_points))
print("Average show posts :", round(avg_show_points,2))
Total show posts 150,781
Average show posts : 14.84
On average, Show HN posts receive more points than Ask HN posts (14.84 vs. 11.31), even though Ask HN posts receive more comments on average.
Post type | points(total) | points(avg) |
---|---|---|
Ask HN | 103,378 | 11.31 |
Show HN | 150,781 | 14.84 |
Since Show HN posts have the higher average number of points, we'll analyze their points by hour next.
# create an empty list where we will load columns we need for analysis
result_list = []
for post in show_posts:
    result_list.append([post[6], int(post[3])])
points_by_hour = {}
counts_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for each_row in result_list:
    dt_created = each_row[0]
    num_points = each_row[1]
    time = dt.datetime.strptime(dt_created, date_format).strftime("%H")
    if time in counts_by_hour:
        points_by_hour[time] += num_points
        counts_by_hour[time] += 1
    else:
        points_by_hour[time] = num_points
        counts_by_hour[time] = 1
print("Hourly points number : ", points_by_hour)
print("\n")
print("Hourly counted number :", counts_by_hour)
Hourly points number : {'00': 4291, '23': 5060, '20': 6948, '19': 8928, '18': 9935, '16': 11487, '14': 10503, '10': 4303, '09': 3762, '08': 4640, '06': 3071, '03': 2168, '21': 5990, '17': 10563, '15': 11657, '11': 7742, '07': 3303, '04': 2707, '13': 10381, '12': 10787, '01': 2931, '22': 5026, '02': 2764, '05': 1834}

Hourly counted number : {'00': 276, '23': 319, '20': 525, '19': 556, '18': 656, '16': 801, '14': 696, '10': 323, '09': 302, '08': 316, '06': 192, '03': 206, '21': 430, '17': 761, '15': 836, '11': 402, '07': 236, '04': 194, '13': 610, '12': 516, '01': 247, '22': 377, '02': 209, '05': 172}
# Create an empty list for average points by hour
avg_by_hour = []
for hr in points_by_hour:
    avg_by_hour.append([hr, points_by_hour[hr] / counts_by_hour[hr]])
avg_by_hour
[['00', 15.547101449275363], ['23', 15.862068965517242], ['20', 13.234285714285715], ['19', 16.057553956834532], ['18', 15.144817073170731], ['16', 14.340823970037453], ['14', 15.09051724137931], ['10', 13.321981424148607], ['09', 12.456953642384105], ['08', 14.683544303797468], ['06', 15.994791666666666], ['03', 10.524271844660195], ['21', 13.930232558139535], ['17', 13.88042049934297], ['15', 13.94377990430622], ['11', 19.258706467661693], ['07', 13.995762711864407], ['04', 13.95360824742268], ['13', 17.018032786885247], ['12', 20.905038759689923], ['01', 11.866396761133604], ['22', 13.331564986737401], ['02', 13.224880382775119], ['05', 10.662790697674419]]
The hour that receives the most points for show posts on average is 12:00.
# Create an empty list
points_avg_by_hour = []
# Add columns in that list
for row in avg_by_hour:
    points_avg_by_hour.append([row[1], row[0]])
sorted_list = sorted(points_avg_by_hour, reverse=True)
print("Top 5 hours for points on 'Show HN' posts")
for avg, hr in sorted_list[:5]:
    print(
        "{}: {:.2f}".format(
            dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg
        )
    )
Top 5 hours for points on 'Show HN' posts
12:00: 20.91
11:00: 19.26
13:00: 17.02
19:00: 16.06
06:00: 15.99
The top 5 hours by average upvotes from the Hacker News community for Show HN posts are 12:00, 11:00, 13:00, 19:00 and 06:00.
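The ranking step swaps each pair so that `sorted` compares averages first. A possible alternative, sketched here, is to sort with a `key` function and skip the swapped list. The `top_hours` helper is hypothetical, and the sample dictionary holds a handful of the Show HN averages from the output above, rounded to two decimals.

```python
def top_hours(avg_by_hour, n=5):
    """Return the n (hour, average) pairs with the highest average."""
    ranked = sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# A few Show HN hourly averages, rounded from the output above
avg_points_by_hour = {
    "03": 10.52, "13": 17.02, "12": 20.91,
    "06": 15.99, "11": 19.26, "19": 16.06,
}

top_three = top_hours(avg_points_by_hour, n=3)
for hour, avg in top_three:
    print("{}:00: {:.2f}".format(hour, avg))
```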
This project analysed Ask HN and Show HN posts on Hacker News to determine which type of post and which posting hour received the most comments and points on average.
Based on the analysis, to optimise the chance of receiving comments, we'd recommend that users post on Hacker News with a title starting with 'Ask HN' and create the post between 15:00 and 16:00 US Eastern time. In my timezone, this is between 20:00 and 21:00 WAT.
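The timezone conversion can be checked with fixed UTC offsets. This is only a sketch: it assumes US Eastern time is UTC-4 during daylight saving (EDT) and UTC-5 otherwise (EST), and that WAT is UTC+1. The `to_wat` helper and the chosen date are illustrative, not part of the original analysis.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets (assumptions stated above); no time zone database is needed
EDT = timezone(timedelta(hours=-4), "EDT")
EST = timezone(timedelta(hours=-5), "EST")
WAT = timezone(timedelta(hours=1), "WAT")


def to_wat(hour, source_tz):
    """Convert a whole hour in `source_tz` to an HH:MM string in WAT."""
    local = datetime(2016, 9, 26, hour, 0, tzinfo=source_tz)
    return local.astimezone(WAT).strftime("%H:%M")


wat_from_edt = to_wat(15, EDT)
wat_from_est = to_wat(15, EST)
print(wat_from_edt, wat_from_est)
```

Under these assumptions, 15:00 Eastern corresponds to 20:00 WAT during daylight saving time and 21:00 WAT otherwise.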
Furthermore, comparing both types of posts on the points (upvotes) received, the data shows that Show HN posts earn more points on average than Ask HN posts. This suggests that while Ask HN posts are more likely to receive comments from the Hacker News community, Show HN posts tend to receive more points on average. The table shows:
Post type | points(total) | points(avg) | comments(total) | comments(avg) |
---|---|---|---|---|
Ask HN | 103,378 | 11.31 | 94,986 | 10.39 |
Show HN | 150,781 | 14.84 | 49,633 | 4.89 |
We accomplished the following task in this guided project: