Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
In this project we focus on posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.
Examples
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Similarly, users submit Show HN posts to show the Hacker News community a project, a product, or just something generally interesting.
Examples
Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
We now compare these two types of posts to explore the following question:
Do Ask HN or Show HN posts receive more comments on average?
Data Source
The data set used in this project comes from Kaggle. It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
Below are descriptions of the columns:
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
Let us start off by importing our data set:
from csv import reader

# Read the whole file into a list of rows; the with-block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]

print(len(hn))
print('\n')
print(headers)
print('\n')
print(hn[:4])
20100

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
 ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'],
 ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'],
 ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
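Since the columns are described by name above, it may help to see how a row can be read by column name instead of by index. The following is only a minimal sketch: it uses `csv.DictReader` on an in-memory sample (two rows copied from the output above) rather than the real file, and the names `sample_csv`, `rows`, and `first` are made up for this illustration.

```python
import csv
import io

# In-memory stand-in for hacker_news.csv: the header plus two rows from the output above
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader uses the first line as keys, so each row becomes a dictionary
rows = list(csv.DictReader(sample_csv))
first = rows[0]

# Values are still strings, so numeric columns need an explicit int() conversion
print(first["title"], int(first["num_comments"]))
```

The rest of this analysis keeps the list-of-lists form, where `row[1]` is the title, `row[4]` is the number of comments, and `row[6]` is the creation time.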
Filtering our Data
As discussed, we want to isolate the posts whose titles begin with Ask HN and Show HN. Let us create separate lists of lists to hold the data for these titles.
To find the posts which begin with either Ask HN or Show HN, we'll use the string method startswith. Let us do a test run:
string1 = "SwarJoshi"
print(string1.startswith("Swar"))
print(string1.startswith("swar"))
True
False
This approach works, but it shows that startswith is case sensitive. As we cannot control how users capitalize their titles, our approach has to account for this.
So, we will convert every title to lower case using lower() and then apply startswith to the lowercased strings.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
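The same case-insensitive check can also be wrapped in a small function, which makes it easy to try on individual titles. This is just a sketch: `classify_title` and `sample_titles` are hypothetical names introduced here, and the second title has been upper-cased on purpose to show that capitalization no longer matters.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the (case-insensitive) title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: How to improve my personal website?",
    "SHOW HN: Something pointless I made",  # deliberately upper-cased prefix
    "Interactive Dynamic Video",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```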
From this we can make the following basic observation:
There are more Ask HN posts (1744) than Show HN posts (1162).
Digging deeper, let us determine if ask posts or show posts receive more comments on average.
total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
14.038417431192661
10.31669535283993
As we can see, Ask HN posts receive about 14 comments on average, while Show HN posts receive about 10.
This comes in addition to the fact that there are more Ask HN posts in general than Show HN posts.
With a higher number of posts and more comments per post, it could be said that Ask HN posts have higher engagement than Show HN posts.
The reason behind this could also be that a person who asks a question is directly requesting a response. A request for a response is usually met with a few comments, followed by replies from the original poster, which leads to a higher number of comments.
As ask posts receive more comments, we will focus the rest of our analysis on these posts only.
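The averaging done above for ask posts and show posts follows the same steps twice, so it could also be written once as a helper. The sketch below is an illustration only: `average_comments` is a hypothetical function name, and the two sample rows are invented to keep the example self-contained (they are not rows from the data set).

```python
def average_comments(posts, comments_index=4):
    """Average the num_comments column (index 4 by default) over a list of rows."""
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Invented rows in the same column order as the data set
sample_posts = [
    ["1", "Ask HN: first example", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: second example", "", "3", "4", "user_b", "8/4/2016 12:10"],
]

print(average_comments(sample_posts))
```

With the real lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would be expected to give the two averages printed above.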
Bringing Time of Posting into the Picture
It is no secret that digital marketers and social media 'influencers' on Facebook, Twitter, Instagram etc. have a set time for posting their regular images (not the I-wish-it-were-occasional drunk ramblings), chosen to bring in the most engagement and views. While the users of HN are way cooler in general, we can still check whether posts created at certain times of day receive more engagement.
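In order to perform this analysis we will count the Ask HN posts created in each hour of the day, add up the comments they received, and then calculate the average number of comments per post for each hour.
We will use the datetime module on the created_at column (index 6). Its strptime() function parses the date and time strings in that column into datetime objects, from which we can read the hour.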
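As a quick test run, a single timestamp can be parsed like this. It is a minimal sketch: the sample string is the created_at value of the first row printed earlier, and `sample` and `parsed` are names used only for this illustration.

```python
import datetime as dt

# created_at value taken from the first row printed earlier
sample = "8/4/2016 11:52"

# The format string mirrors the layout of the column: month/day/year hour:minute
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")

print(parsed.hour)               # the hour as an integer
print(parsed.strftime("%H:%M"))  # the time formatted back into a string
```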
import datetime as dt

result_list = []
for row in ask_posts:
    created = row[6]
    comments = int(row[4])
    new1 = [created, comments]
    result_list.append(new1)

counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    thehour = row[0]
    comment = row[1]
    dt_object = dt.datetime.strptime(thehour, "%m/%d/%Y %H:%M")
    hour = dt_object.hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment

print(counts_by_hour)
print('\n')
print(comments_by_hour)
{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}

{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}
counts_by_hour - Shows the total number of posts created within that hour
comments_by_hour - Shows the total number of comments received by the posts created within that hour
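The tallying loop above can also be sketched with `collections.defaultdict`, which removes the need for the if/else branch because missing hours start at zero. This is an alternative illustration, not the approach used in this notebook; the `sample` pairs are invented (timestamp, comment count) values.

```python
from collections import defaultdict
import datetime as dt

# Invented (created_at, num_comments) pairs, in the same shape as result_list
sample = [
    ("8/4/2016 11:52", 52),
    ("6/17/2016 0:01", 1),
    ("1/26/2016 11:30", 10),
]

counts = defaultdict(int)    # posts created per hour
comments = defaultdict(int)  # comments received per hour

for created, num_comments in sample:
    hour = dt.datetime.strptime(created, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```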
This gives us, for each hour of the day, the number of Ask HN posts created and the total number of comments they received. Let us now find the average number of comments per post for each hour.
avg_by_hour = []
for row in comments_by_hour:
    average = (comments_by_hour[row]/counts_by_hour[row])
    new1 = [row, average]
    avg_by_hour.append(new1)

print(avg_by_hour)
[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]
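The same average can be sketched more compactly with a dictionary comprehension. To keep the example self-contained, only three hours are included, with their totals copied from the two dictionaries printed earlier; `sample_counts`, `sample_comments`, and `avg` are names chosen for this illustration.

```python
# Totals for three hours, copied from counts_by_hour and comments_by_hour above
sample_counts = {2: 58, 9: 45, 15: 116}
sample_comments = {2: 1381, 9: 251, 15: 4477}

# Average comments per post for each hour: total comments divided by number of posts
avg = {hour: sample_comments[hour] / sample_counts[hour] for hour in sample_counts}

for hour, average in avg.items():
    print("{:02d}:00 -> {:.2f}".format(hour, average))
```

The values agree with the corresponding entries in the list printed above, just rounded to two decimals.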
# Swap each [hour, average] pair so that sorting orders the list by average
swap_avg_by_hour = []
for row in avg_by_hour:
    temp = row[0]
    temp1 = row[1]
    swap_avg_by_hour.append([temp1, temp])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

# Slice with [:5] so that the list really contains the top five hours
print("Top 5 Hours for Ask Posts Comments :",sorted_swap[:5])
print('\n')

for average, hour in sorted_swap[:5]:
    dt_hour = dt.datetime.strptime(str(hour), "%H")
    dt_hour = dt_hour.strftime("%H:%M")
    print("{} : {} average comments per post".format(dt_hour,average))
print('\n')
Top 5 Hours for Ask Posts Comments : [[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21]]

15:00 : 38.5948275862069 average comments per post
02:00 : 23.810344827586206 average comments per post
20:00 : 21.525 average comments per post
16:00 : 16.796296296296298 average comments per post
21:00 : 16.009174311926607 average comments per post
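Swapping the columns is one way to sort by average. Another option, sketched below, is to pass a `key` function to `sorted()` so the [hour, average] pairs can stay in their original order. The sample list holds six of the pairs printed earlier, rounded for readability, and `sample_avg_by_hour` and `top_three` are names introduced only for this sketch.

```python
# Six [hour, average] pairs from the earlier output, rounded for readability
sample_avg_by_hour = [
    [2, 23.81], [9, 5.58], [15, 38.59],
    [16, 16.80], [20, 21.525], [21, 16.01],
]

# Sort by the second element (the average), highest first, and keep the top three
top_three = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, average in top_three:
    print("{:02d}:00 : {:.2f} average comments per post".format(hour, average))
```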
We can see that posts created in the hour between 15:00 and 16:00 EST receive the most comments on average in the Ask HN section.
This could possibly be due to the following reasons:
This is followed by the hour between 02:00 and 03:00.
Considering that Hacker News is primarily an American site, 02:00 seems rather odd. However, the contributors are not necessarily from the US, and the comments on 02:00 posts probably come from nocturnal American techies (or their Silicon Valley buddies who left work late), their non-nocturnal South Asian counterparts, or early-bird techies from Europe and Africa.
To get the most replies, posts should be made between 15:00 and 16:00 (a little later would not hurt, as the hour between 16:00 and 17:00 also made the list, though it does not look all that impressive).
But these times are relevant to people living in EST. What if I want to figure out what my chances are as a person living in Germany?
Let's add 6 hours to convert the times from EST (UTC-5) to CET (UTC+1) using dt.timedelta():
for average, hour in sorted_swap[:5]:
    dt_hour = dt.datetime.strptime(str(hour), "%H")
    # CET is 6 hours ahead of EST
    dt_hour = dt_hour + dt.timedelta(hours=6)
    dt_hour = dt_hour.strftime("%H:%M")
    print("{} : {} average comments per post".format(dt_hour,average))
print('\n')
21:00 : 38.5948275862069 average comments per post
08:00 : 23.810344827586206 average comments per post
02:00 : 21.525 average comments per post
22:00 : 16.796296296296298 average comments per post
03:00 : 16.009174311926607 average comments per post
There! As someone living in Germany, my best chance of getting the most engagement is to post between 21:00 and 22:00 CET (as I mentioned, a little later is fine too, but not optimal), or between 08:00 and 09:00. This is subject, of course, to my having something valuable to ask in my Ask HN post.
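The size of the shift can be double-checked by letting datetime do the conversion with explicit UTC offsets. EST is UTC-5 and CET is UTC+1, so the difference works out to six hours. The sketch below makes a few assumptions: it uses fixed offsets and therefore ignores daylight saving time, the date is an arbitrary winter day, and `est_hour_to_cet` is a hypothetical helper name.

```python
import datetime as dt

# Fixed-offset time zones (no daylight saving handling)
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
CET = dt.timezone(dt.timedelta(hours=1), "CET")


def est_hour_to_cet(hour):
    """Convert an hour of the day in EST to the matching clock time in CET."""
    est_time = dt.datetime(2016, 1, 15, hour, tzinfo=EST)  # arbitrary winter date
    return est_time.astimezone(CET).strftime("%H:%M")


# The five top hours found earlier, in EST
for hour in [15, 2, 20, 16, 21]:
    print("{:02d}:00 EST -> {} CET".format(hour, est_hour_to_cet(hour)))
```

For results that follow daylight saving changes through the year, named zones from the standard `zoneinfo` module (Python 3.9 and later) would be the more robust choice.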