Hacker News is a site that is extremely popular in certain technology and startup circles, started by the startup incubator Y Combinator. Much like on Reddit, users can submit stories (known as "posts"), which are then voted and commented upon.
While posting anything on Hacker News (HN) is pretty straightforward, getting heard out there is a real fight!
Whether you are asking a question, showing off your work, or sharing news (not only on forums, but on social media in general), getting heard can be subject to -
On HN, users can -
(A Screenshot of the Hacker News forum)
While the first 2 points are outside the scope of this study, the 3rd condition, i.e. whether the time at which the OP posts a new thread is related to the number of comments or points on that thread, will be the subject of my analysis.
Besides, it will be interesting to see if I can improve the chances of my thread gaining the community's attention by changing the hour at which I post it!
Let us get a fair idea of the structure of our data, and what we shall be dealing with.
The complete data-set can be found on Kaggle; the version used here has been sampled down from 300,000+ rows to around 20,000 rows. I've provided the link below for anyone who wants to dig deeper - https://www.kaggle.com/hacker-news/hacker-news-posts
# import the csv module and open 'hacker_news.csv' to
# display the first five rows
from csv import reader
opened_file = open('hacker_news.csv')
fhand = reader(opened_file)
hn = list(fhand)
for row in hn[:5]:
    print(row,'\n')
# separate the header row from the remaining rows, assign
# both lists to separate objects of list class
hn_header = hn[0]
hn = hn[1:]
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
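The loading step above can also be written with a `with` block, so that the file gets closed automatically. What follows is only a minimal sketch and not the code used in this analysis: `load_rows` is a hypothetical helper name, and it reads from a small in-memory sample (two rows copied from the output above) instead of 'hacker_news.csv' so that it runs on its own.

```python
import csv
import io

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# In-memory stand-in for 'hacker_news.csv' (assumption: same column layout)
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,3,1,hswarna,6/17/2016 0:01\n"
)

with io.StringIO(sample_csv) as sample_file:
    header, rows = load_rows(sample_file)

print(header)
print(len(rows))
```

With the real file, `open('hacker_news.csv', newline='')` would take the place of `io.StringIO(sample_csv)`.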
For the sake of convenience, let's also record the field names of our data-set along with the index number they are located at.
Index | Name of Field |
---|---|
0 | id |
1 | title |
2 | url |
3 | num_points |
4 | num_comments |
5 | author |
6 | created_at |
Since we want to perform a comparative analysis on "Ask HN" and "Show HN" titled posts, it would be better to divide our single list into 3 separate lists: one for "Ask HN" posts, one for "Show HN" posts, and one for all other posts.
Let's do it, and print the first 5 rows of each category to see if we have what we need -
# make separate lists for "Ask HN" titled posts,
# "Show HN" titled posts and posts with other titles
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# printing top 5 rows in each of the 3 lists
print("Top 5 rows in ask_posts list:")
for row in ask_posts[:5]:
    print(row)
print("Top 5 rows in show_posts list:\n")
for row in show_posts[:5]:
    print(row)
print("\nTop 5 rows in other_posts list:")
for row in other_posts[:5]:
    print(row)
Top 5 rows in ask_posts list:
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']
Top 5 rows in show_posts list:

['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']
['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']
['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']

Top 5 rows in other_posts list:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
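The prefix test used for the split can be pulled out into a small function, which makes it easy to try on individual titles. This is just a sketch: `classify_title` is a hypothetical helper name, not part of the analysis above, and the three titles are taken from the output.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
for title in titles:
    print(classify_title(title), '|', title)
```

Lower-casing the title first means that a title written as "ASK HN: ..." or "ask hn: ..." ends up in the same group.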
Now that we are sure the ask/show type posts are segregated, let's calculate the average number of comments per post for 1. Ask, 2. Show type posts.
This will give us an idea of whether one type of post tends to attract more comments than the other -
# checking which one out of ask_posts or show_posts
# receives more comments on an avg
total_ask_comments = 0
total_show_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)
print(avg_ask_comments)
print(avg_show_comments)
14.038417431192661
10.31669535283993
The difference is noticeable - posts of the "Ask HN" type receive a little over 14 comments on average, while posts of the "Show HN" type receive about 10 comments on average.
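The same average can be computed with a short helper that also guards against an empty list. This is only a sketch: `average_comments` is a hypothetical name, and the sample is just the first five "Ask HN" rows printed earlier, so the result differs from the full-list average of about 14.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# First five "Ask HN" rows from the output above
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
    ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38'],
]

# (6 + 29 + 1 + 3 + 17) / 5 = 11.2 for these five rows
print(average_comments(sample_ask_posts))
```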
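Next, we'll determine if Ask HN posts created at a certain time are more likely to attract comments. We'll do this by counting the posts created in each hour of the day along with the comments they received, and then computing the average number of comments per post for each hour.
Let's write some code that will give us the total number of posts and the total number of comments per hour -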
import datetime as dt
result_list = []
counts_by_hour = {}
comments_by_hour = {}
avg_by_hour = []
# generate "result_list" list of [created_at, num_comments] pairs
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])
# generate the per-hour post count and comment total dictionaries
for row in result_list:
    num_comments = row[1]
    date_time = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hr = date_time.strftime("%H")
    if hr in counts_by_hour:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += num_comments
    else:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = num_comments
# generating a list of hour-wise avg number of
# comments per post
for key in comments_by_hour:
    num_comments = comments_by_hour[key]
    count_posts = counts_by_hour[key]
    avg = num_comments/count_posts
    avg_by_hour.append([key, avg])
# swapping the columns so that the average comes first
avg_sorted = []
for row in avg_by_hour:
    avg_sorted.append([row[1],row[0]])
# sorting average values in descending order ...
avg_sorted = sorted(avg_sorted,reverse = True)
# printing the top-5 hours for "Ask HN" post comments...
print("Top 5 Hours for Ask Posts Comments\n")
for row in avg_sorted[:5]:
    avg = row[0]
    hr = row[1]
    hr = dt.datetime.strptime(hr, "%H")
    hr = hr.time()
    hr = hr.strftime("%H:%M")
    sentence = '{}: {:.2f} average comments per "Ask HN" post'.format(hr,avg)
    print(sentence)
Top 5 Hours for Ask Posts Comments

15:00: 38.59 average comments per "Ask HN" post
02:00: 23.81 average comments per "Ask HN" post
20:00: 21.52 average comments per "Ask HN" post
16:00: 16.80 average comments per "Ask HN" post
21:00: 16.01 average comments per "Ask HN" post
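The per-hour bookkeeping above can be shortened with `collections.defaultdict`, which removes the need for the if/else branch. The sketch below is an alternative under stated assumptions, not the code that produced the output: `average_by_hour` is a hypothetical helper name, and the three rows are made up purely for illustration.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(posts, value_index, date_index=6):
    """Map each hour (0-23) to the average of one numeric column."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in posts:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        totals[hour] += int(row[value_index])
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# Made-up rows for illustration only (not taken from the data-set)
toy_posts = [
    ['1', 'Ask HN: a', '', '4', '10', 'user_a', '1/1/2016 15:05'],
    ['2', 'Ask HN: b', '', '2', '20', 'user_b', '1/2/2016 15:40'],
    ['3', 'Ask HN: c', '', '6', '3', 'user_c', '1/3/2016 2:10'],
]

averages = average_by_hour(toy_posts, value_index=4)
# Sort by the average itself, highest first
for hour, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print('{:02d}:00: {:.2f}'.format(hour, avg))
```

Passing `value_index=3` instead would average the num_points column.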
Against an overall average of about 14 comments per Ask HN post, posts created between 1500 and 1600 hours receive 38.6 comments on average. This is a whopping 175% more than the overall Ask HN average! I will make sure to post a question on the HN forum on "What are the top blogs on Data Science" in this time period, and report in the conclusions whether I receive any comments from fellow community members there!
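The percentage can be checked directly from the two averages printed earlier. This is a small arithmetic sketch using the rounded values 14.04 and 38.59 from the outputs:

```python
overall_avg = 14.04   # average comments per "Ask HN" post (from above)
peak_avg = 38.59      # average for "Ask HN" posts created in the 15:00 hour

# Relative increase of the peak hour over the overall average
pct_more = (peak_avg - overall_avg) / overall_avg * 100
ratio = peak_avg / overall_avg
print('{:.0f}% more, i.e. {:.2f} times the overall average'.format(pct_more, ratio))
```

With these inputs the increase comes out at roughly 175%, or about 2.75 times the overall average.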
Next, we'll determine if Ask HN posts created at a certain time are more likely to receive more points. We'll do this by counting the posts created in each hour of the day along with the points they received, and then computing the average number of points per post for each hour.
Let's write some code that will give us the total number of posts and the total number of points per hour. But this time, we'll write a function that does the same job as the code above, except that it can take any of the ask_posts, show_posts or other_posts lists as its argument -
import datetime as dt
def top_avg(post_list):
    result_list = []
    posts_by_hour = {}
    points_by_hour = {}
    avg_by_hour = []
    avg_sorted = []
    # generate "result_list" list of [created_at, num_points] pairs
    for row in post_list:
        created_at = row[6]
        num_points = int(row[3])
        result_list.append([created_at,num_points])
    # generate the per-hour post count and point total dictionaries
    for row in result_list:
        num_points = row[1]
        date_time = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
        hr = date_time.strftime("%H")
        if hr in posts_by_hour:
            posts_by_hour[hr] += 1
            points_by_hour[hr] += num_points
        else:
            posts_by_hour[hr] = 1
            points_by_hour[hr] = num_points
    # generating a list of hour-wise avg number of
    # points per post
    for key in points_by_hour:
        num_points = points_by_hour[key]
        count_posts = posts_by_hour[key]
        avg = num_points/count_posts
        avg_by_hour.append([key, avg])
    # swapping the columns so that the average comes first (still unsorted)
    for row in avg_by_hour:
        avg_sorted.append([row[1],row[0]])
    # sorting average values in descending order ...
    avg_sorted = sorted(avg_sorted,reverse = True)
    # printing the top-5 hours by average points per post...
    for row in avg_sorted[:5]:
        avg = row[0]
        hr = row[1]
        hr = dt.datetime.strptime(hr, "%H")
        hr = hr.time()
        hr = hr.strftime("%H:%M")
        # pick the label that matches the list passed in
        if post_list == ask_posts:
            sentence = '{}: {:.2f} average points per "Ask HN" post'.format(hr,avg)
        elif post_list == show_posts:
            sentence = '{}: {:.2f} average points per "Show HN" post'.format(hr,avg)
        else:
            sentence = '{}: {:.2f} average points per "Others" post'.format(hr,avg)
        print(sentence)
print("Top 5 Hours for Average Points per 'Ask Post'\n")
top_avg(ask_posts)
Top 5 Hours for Average Points per 'Ask Post'

15:00: 29.99 average points per "Ask HN" post
13:00: 24.26 average points per "Ask HN" post
16:00: 23.35 average points per "Ask HN" post
17:00: 19.41 average points per "Ask HN" post
10:00: 18.68 average points per "Ask HN" post
This supports our earlier finding - Ask HN posts receive the highest average number of points per post when created between 1500 and 1600 hours, the same time slot in which Ask HN posts received the most comments on average.
One possible explanation could be that more users are active on HN during this time, and are able to interact with the community more. What causes this is difficult to guess, but it's a clear correlation in this data-set nonetheless!
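One way to probe this explanation is to look at how many posts fall in each hour, since an hourly average built on only a few posts is easily moved by a single popular thread. The sketch below counts posts per hour with `collections.Counter`; `posts_per_hour` is a hypothetical helper name and the rows are made up for illustration, so it says nothing about the real distribution.

```python
import datetime as dt
from collections import Counter

def posts_per_hour(posts, date_index=6):
    """Count how many posts were created in each hour of the day."""
    hours = (
        dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        for row in posts
    )
    return Counter(hours)

# Made-up rows for illustration only (not taken from the data-set)
toy_posts = [
    ['1', 'Ask HN: a', '', '4', '10', 'user_a', '1/1/2016 15:05'],
    ['2', 'Ask HN: b', '', '2', '20', 'user_b', '1/2/2016 15:40'],
    ['3', 'Ask HN: c', '', '6', '3', 'user_c', '1/3/2016 2:10'],
]

counts = posts_per_hour(toy_posts)
# most_common() lists the busiest hours first
for hour, count in counts.most_common():
    print('{:02d}:00: {} posts'.format(hour, count))
```

Calling `posts_per_hour(ask_posts)` on the real list would show whether the 15:00 average rests on many posts or only a handful.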
print("Top 5 Hours for Average Points per 'Show' Post: \n")
top_avg(show_posts)
Top 5 Hours for Average Points per 'Show' Post: 

23:00: 42.39 average points per "Show HN" post
12:00: 41.69 average points per "Show HN" post
22:00: 40.35 average points per "Show HN" post
00:00: 37.84 average points per "Show HN" post
18:00: 36.31 average points per "Show HN" post
Show HN posts, on the other hand, have hourly averages that lie much closer together. The average points per post for the top 5 hours is in the range of 36-42 points, and the hours themselves (23:00, 12:00, 22:00, 00:00 and 18:00) follow no obvious pattern. This suggests that the points a Show type post receives depend much less on the hour of the day at which it is posted.
print("Top 10 Hours for Average Points per 'others' post: \n")
top_avg(other_posts)
Top 10 Hours for Average Points per 'others' post: 

13:00: 62.53 average points per "Others" post
14:00: 61.79 average points per "Others" post
15:00: 60.54 average points per "Others" post
10:00: 60.48 average points per "Others" post
19:00: 60.01 average points per "Others" post
02:00: 58.47 average points per "Others" post
00:00: 58.46 average points per "Others" post
17:00: 57.98 average points per "Others" post
11:00: 57.57 average points per "Others" post
12:00: 57.40 average points per "Others" post
A similar lack of variation is seen in posts that are neither Ask nor Show type (Others posts). The average points per post for the top 5 hours in the Others category is in the range of 60-62 points, and 57-62 points for the top 10 hours. Here too, the hour of the day appears to have little effect on the points an Others type post receives.
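To put a number on how flat each category is, the drop from the best hour to the fifth-best hour can be expressed as a percentage of the best hour. This is a rough sketch using the rounded top-5 averages printed above; `relative_drop` is a hypothetical helper name, and since only the top hours are shown, it says nothing about the remaining hours of the day.

```python
# Top-5 hourly averages of points per post, copied from the outputs above
top5_avg_points = {
    'Ask HN': [29.99, 24.26, 23.35, 19.41, 18.68],
    'Show HN': [42.39, 41.69, 40.35, 37.84, 36.31],
    'Others': [62.53, 61.79, 60.54, 60.48, 60.01],
}

def relative_drop(values):
    """Percentage drop from the largest to the smallest value."""
    return (max(values) - min(values)) / max(values) * 100

drops = {name: relative_drop(values) for name, values in top5_avg_points.items()}
for name, drop in drops.items():
    print('{}: {:.1f}% drop from best to fifth-best hour'.format(name, drop))
```

On these figures the drop is largest for Ask HN, smaller for Show HN, and smallest for Others, which matches the reading above.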
These findings conclude my analysis. If you went through all the trouble of reading this far, then thanks for your time!
As stated earlier in the introduction, subject relevance and the community's interest in the subject matter too, which I learnt the hard way by posting a question of my own as an "Ask HN" type post... which sadly received no reply (it may have something to do with the time zones, I guess..:p)