
Exploring HACKER NEWS Posts

Can I get more replies to my post if I post a thread at a particular time?

Hacker News is a site, started by the startup incubator Y Combinator, that is extremely popular in certain technology and startup circles. Much like on Reddit, users can submit stories (known as "posts"), which are voted and commented upon.

While posting anything on Hacker News (HN) is pretty straightforward, getting heard out there is a real fight!

Whether you are asking a question, showing off your work or sharing news (not only on forums, but on social media in general), getting heard can depend on -

the Title / Subject being relevant or of interest to the community,
the Content / Body of the post,
the Time at which it is posted

On HN, users can -

Ask the HN community (Ask HN)
Show their work to the HN community (Show HN)
Share news related to tech and startup developments

(A Screenshot of the Hacker News forum)

While the first two factors listed earlier (the title and the content of a post) are outside the scope of this study, the third one, i.e. whether the time at which the OP posts a new thread is related to the number of comments or points on that thread, shall be the subject of my analysis.

Besides, it will be interesting to see if I can improve the chances of my thread gaining the attention of the community by changing the hour at which I post it!

Reading the Data

Let us get a fair idea of the structure of our data, and what we shall be dealing with.

The complete data set can be found on Kaggle. The version used here has been sampled down from 300,000+ rows to around 20,000 rows. I've provided the link below for anyone who wants to dig deeper - https://www.kaggle.com/hacker-news/hacker-news-posts

In [1]:

# import the csv module and open 'hacker_news.csv' to
# display the first five rows

from csv import reader

# the with-block closes the file once it has been read
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

for row in hn[:5]:
    print(row,'\n')

# separate the header row from the remaining rows, assign
# both lists to separate objects of list class

hn_header = hn[0]
hn = hn[1:]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

For the sake of convenience, let's also record the field names of our data-set along with the index number they are located at.

Index	Name of Field
0	id
1	title
2	url
3	num_points
4	num_comments
5	author
6	created_at
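Rather than memorising these index numbers, one could also build the lookup from the header row itself. Here is a minimal sketch, assuming the header row printed above. The names FIELD_INDEX and get_field are made up for this illustration and are not used anywhere else in the notebook:

```python
# Header row, as printed in the output above.
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments',
             'author', 'created_at']

# Map each field name to its position in a row (hypothetical helper name).
FIELD_INDEX = {name: index for index, name in enumerate(hn_header)}


def get_field(row, name):
    """Return the value of the named field from one row of the data set."""
    return row[FIELD_INDEX[name]]


# One of the rows printed above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

print(FIELD_INDEX['num_comments'])          # 4
print(get_field(sample_row, 'created_at'))  # 8/4/2016 11:52
```

The rest of the notebook keeps using the plain index numbers, so this is only a convenience for readers who prefer names.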

Since we want to perform a comparative analysis of "Ask HN" and "Show HN" titled posts, it would be better to divide our single list into 3 separate lists, namely -

ask_posts[] (for posts with title starting with "Ask HN")
show_posts[] (for posts with title starting with "Show HN")
other_posts[] (for all other posts that don't fit either of these categories)

Let's do it, and print the first 5 rows of each category to see if we have what we need -

In [2]:

# make separate lists for "Ask HN" titled posts, 
# "Show HN" titled posts and posts with other titles

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
    
# printing top 5 rows in each of the 3 lists
print("Top 5 rows in ask_posts list:")
for row in ask_posts[:5]:
    print(row)
print("Top 5 rows in show_posts list:\n")
for row in show_posts[:5]:
    print(row)
print("\nTop 5 rows in other_posts list:")
for row in other_posts[:5]:
    print(row)

Top 5 rows in ask_posts list:
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']
Top 5 rows in show_posts list:

['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']
['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']
['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']

Top 5 rows in other_posts list:
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
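The splitting rule itself can be checked in isolation. The sketch below pulls it out into a small function and tries it on three titles taken from the output above. The name classify_title is hypothetical; the notebook itself does the same check inline in the loop:

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on how the title starts."""
    lowered = title.lower()  # make the check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Titles taken from the output above.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]

for title in sample_titles:
    print(classify_title(title), '-', title)
```

Note that only the start of the title is checked, so a post that mentions "Ask HN" somewhere in the middle of its title still lands in the "other" group.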

Average comments per Post

Now that we are sure the Ask and Show type posts are segregated, let's calculate the average number of comments per post for each of the two types.

This will give us an idea of whether one type of post tends to attract more comments than the other -

In [9]:

# checking which one out of ask_posts or show_posts 
# receives more comments on average

total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments/len(ask_posts)

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments/len(show_posts)

print(avg_ask_comments)
print(avg_show_comments)
    

14.038417431192661
10.31669535283993

The difference is noticeable - posts of the "Ask HN" type receive about 14 comments on average, while posts of the "Show HN" type receive about 10 comments on average.
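To put the gap between the two averages into a single number, here is a small worked calculation using the two values printed above. It only expresses the difference as a percentage and says nothing about statistical significance:

```python
# Averages copied from the output of the cell above.
avg_ask_comments = 14.038417431192661
avg_show_comments = 10.31669535283993

# How many percent more comments an Ask HN post gets, relative to Show HN.
pct_more = (avg_ask_comments / avg_show_comments - 1) * 100

print('Ask HN posts receive {:.0f}% more comments on average'.format(pct_more))
```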

Hour-wise Average Comments per post

Next, we'll determine if Ask HN posts created at a certain time are more likely to attract comments. We'll do this by:

Counting the number of Ask HN posts created in each hour of the day, along with the number of comments they received, and then dividing the two to get the average number of comments per post for each hour

Let's write the code that will give us the total number of posts and the total number of comments per hour -

In [16]:

import datetime as dt

result_list = []
counts_by_hour = {}
comments_by_hour = {}
avg_by_hour = []

# generate "result_list" list
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at,num_comments])

# generate count dictionaries
for row in result_list:
    num_comments = row[1]
    date_time = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hr = date_time.strftime("%H")
    
    if hr in counts_by_hour:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += num_comments
    else:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = num_comments

# generating a list of hour-wise avg number of 
# comments per post 
for key in comments_by_hour:
    num_comments = comments_by_hour[key]
    count_posts = counts_by_hour[key]
    avg = num_comments/count_posts
    avg_by_hour.append([key, avg])
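The same bookkeeping can be written a little more compactly with collections.defaultdict, which removes the need for the if/else branch. This is only a sketch of an alternative, not the code used for the results below. The function name average_by_hour is made up, the sample pairs are invented for illustration, and the hour is kept as an integer here instead of the "%H" string used above:

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(pairs):
    """Return {hour: average value} for (created_at, value) pairs."""
    counts = defaultdict(int)   # number of posts seen in each hour
    totals = defaultdict(int)   # sum of the values seen in each hour
    for created_at, value in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        totals[hour] += value
    return {hour: totals[hour] / counts[hour] for hour in counts}


# Made-up (created_at, num_comments) pairs, for illustration only.
sample = [
    ('8/16/2016 9:55', 6),
    ('8/17/2016 9:05', 2),
    ('5/2/2016 15:14', 10),
]

print(average_by_hour(sample))  # {9: 4.0, 15: 10.0}
```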

In [20]:

# printing a list with top 5 average comment values

avg_sorted = []

for row in avg_by_hour:
    avg_sorted.append([row[1],row[0]])

# sorting average values in descending order ...
avg_sorted = sorted(avg_sorted,reverse = True)

# printing the top-5 hours for "Ask HN" post comments...

print("Top 5 Hours for Ask Posts Comments\n")
for row in avg_sorted[:5]:
    avg = row[0]
    hr = row[1]
    
    hr = dt.datetime.strptime(hr, "%H")
    hr = hr.time()
    hr = hr.strftime("%H:%M")
    sentence = '{}: {:.2f} average comments per "Ask HN" post'.format(hr,avg)
    print(sentence)

Top 5 Hours for Ask Posts Comments

15:00: 38.59 average comments per "Ask HN" post
02:00: 23.81 average comments per "Ask HN" post
20:00: 21.52 average comments per "Ask HN" post
16:00: 16.80 average comments per "Ask HN" post
21:00: 16.01 average comments per "Ask HN" post

Interesting..!

Against an overall average of 14 comments per Ask HN post, posts created between 1500 and 1600 hours receive 38.6 comments on average. This is a whopping 175% more than the overall Ask HN average! I will make sure to post a question on the HN forum on "What are the top blogs on Data Science" in this time period, and report in the conclusions if I receive any comments from fellow community members there!
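For reference, here is how each of the top 5 hours compares with the overall Ask HN average. The numbers are copied from the outputs above; the variable names are made up for this sketch:

```python
# Overall average comments per Ask HN post, from the earlier output.
overall_avg = 14.038417431192661

# Top 5 hours and their average comments, copied from the output above.
top_hours = [
    ('15:00', 38.59),
    ('02:00', 23.81),
    ('20:00', 21.52),
    ('16:00', 16.80),
    ('21:00', 16.01),
]

# Percentage by which each hour exceeds the overall average.
pct_by_hour = {}
for hour, avg in top_hours:
    pct_by_hour[hour] = (avg / overall_avg - 1) * 100
    print('{}: {:+.0f}% relative to the overall average'.format(
        hour, pct_by_hour[hour]))
```

The 15:00 slot stands well apart from the other four hours, which is why the rest of the analysis keeps coming back to it.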

Hour-wise Average Points per Post

Next, we'll determine if Ask HN posts created at a certain time are more likely to receive more points. We'll do this by:

Counting the number of Ask HN posts created in each hour of the day, along with the number of points they received, and then dividing the two to get the average number of points per post for each hour

Let's write the code that will give us the total number of posts and the total number of points per hour. But this time, we'll write a function that does the same job as the code above, except that it can take any of the ask_posts, show_posts or other_posts lists as its argument -

In [54]:

import datetime as dt

def top_avg(post_list):
    result_list = []
    posts_by_hour = {}
    points_by_hour = {}
    avg_by_hour = []
    avg_sorted = []

    
    # generate "result_list" list
    for row in post_list:
        created_at = row[6]
        num_points = int(row[3])
        result_list.append([created_at,num_points])

    # generate count dictionaries
    for row in result_list:
        num_points = row[1]
        date_time = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
        hr = date_time.strftime("%H")
        if hr in posts_by_hour:
            posts_by_hour[hr] += 1
            points_by_hour[hr] += num_points
        else:
            posts_by_hour[hr] = 1
            points_by_hour[hr] = num_points

    # generating a list of hour-wise avg number of 
    # points per post 
    for key in points_by_hour:
        num_points = points_by_hour[key]
        count_posts = posts_by_hour[key]
        avg = num_points/count_posts
        avg_by_hour.append([key, avg])
        
    # swapping the columns so that the list can be sorted by average
    for row in avg_by_hour:
        avg_sorted.append([row[1],row[0]])

    # sorting average values in descending order ...
    avg_sorted = sorted(avg_sorted,reverse = True)

    # printing the top-5 hours by average points per post...
    for row in avg_sorted[:5]:
        avg = row[0]
        hr = row[1]

        hr = dt.datetime.strptime(hr, "%H")
        hr = hr.time()
        hr = hr.strftime("%H:%M")
        if post_list == ask_posts:
            sentence = '{}: {:.2f} average points per "Ask HN" post'.format(hr,avg)
        elif post_list == show_posts:
            sentence = '{}: {:.2f} average points per "Show HN" post'.format(hr,avg)
        else:
            sentence = '{}: {:.2f} average points per "Others" post'.format(hr,avg)  
        print(sentence)
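As a side note, the column swap before sorting can be avoided by giving sorted() a key function. The sketch below shows the idea on the Ask HN point averages reported in this notebook. The function name top_n_hours and its n parameter are hypothetical and not part of top_avg(), and unlike the swap-based version it does not break ties by hour:

```python
def top_n_hours(avg_by_hour, n=5):
    """Return the n (hour, average) pairs with the highest averages."""
    ranked = sorted(avg_by_hour.items(),
                    key=lambda item: item[1],  # sort on the average
                    reverse=True)
    return ranked[:n]


# Hourly point averages for Ask HN posts, as reported in this notebook.
ask_points = {'10': 18.68, '13': 24.26, '15': 29.99,
              '16': 23.35, '17': 19.41}

print(top_n_hours(ask_points, n=3))
# [('15', 29.99), ('13', 24.26), ('16', 23.35)]
```

Returning the list instead of printing it also makes the result easier to reuse, for example to compare categories.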

In [55]:

print("Top 5 Hours for Average Points per 'Ask Post'\n")
top_avg(ask_posts)

Top 5 Hours for Average Points per 'Ask Post'

15:00: 29.99 average points per "Ask HN" post
13:00: 24.26 average points per "Ask HN" post
16:00: 23.35 average points per "Ask HN" post
17:00: 19.41 average points per "Ask HN" post
10:00: 18.68 average points per "Ask HN" post

Very Interesting..!

Our findings now stand on firmer ground - Ask HN posts receive the highest average number of points per post between 1500 and 1600 hours. This is in sync with our earlier finding, which showed that Ask HN posts created in this time slot received more comments on average than those created in any other hour of the day.

One possible explanation could be that most users are active on HN during this time, and are able to interact with the community more. What causes this is difficult to guess, but it's a clear correlation in this data set nonetheless!

In [56]:

print("Top 5 Hours for Average Points per 'Show' Post: \n")
top_avg(show_posts)

Top 5 Hours for Average Points per 'Show' Post: 

23:00: 42.39 average points per "Show HN" post
12:00: 41.69 average points per "Show HN" post
22:00: 40.35 average points per "Show HN" post
00:00: 37.84 average points per "Show HN" post
18:00: 36.31 average points per "Show HN" post

Show HN posts, on the other hand, have a more evenly spread points-per-hour average. The average points per post for the top 5 hours lies in the range of 36-42 points, and the hours themselves (23:00, 12:00, 22:00, 00:00 and 18:00) show no obvious pattern. This suggests that the chances of one's Show type post receiving more points are not much affected by the hour of the day.

In [51]:

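# note: this cell was last run (as In [51]) before top_avg() was redefined
# in In [54], which is presumably why its output lists 10 hours rather than 5
print("Top 10 Hours for Average Points per 'others' post: \n")
top_avg(other_posts)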

Top 10 Hours for Average Points per 'others' post: 

13:00: 62.53 average points per "Others" post
14:00: 61.79 average points per "Others" post
15:00: 60.54 average points per "Others" post
10:00: 60.48 average points per "Others" post
19:00: 60.01 average points per "Others" post
02:00: 58.47 average points per "Others" post
00:00: 58.46 average points per "Others" post
17:00: 57.98 average points per "Others" post
11:00: 57.57 average points per "Others" post
12:00: 57.40 average points per "Others" post

A similar lack of heterogeneity is seen in posts that are neither Ask nor Show type (Others posts). The average points per post for the top 5 hours in the Others category lies in the range of 60-63 points, and 57-63 points for the top 10 hours. Here too, the chances of one's Others type post receiving more points are not much affected by the hour of the day.
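One rough way to put a number on this "spread" is to divide the best hourly average by the 5th-best one for each category. The sketch below uses only the top 5 values printed above, so it says nothing about the remaining hours of the day. The dictionary names are made up for this illustration:

```python
# Top 5 hourly point averages per category, copied from the outputs above.
top5 = {
    'Ask HN': [29.99, 24.26, 23.35, 19.41, 18.68],
    'Show HN': [42.39, 41.69, 40.35, 37.84, 36.31],
    'Others': [62.53, 61.79, 60.54, 60.48, 60.01],
}

# Ratio of the best hour to the 5th-best hour in each category.
ratios = {}
for category, averages in top5.items():
    ratios[category] = max(averages) / min(averages)
    print('{}: best hour is {:.2f}x the 5th-best hour'.format(
        category, ratios[category]))
```

A ratio close to 1 means the top hours are nearly interchangeable, while a larger ratio means that picking the right hour matters more.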

Conclusion

For Ask HN posts created between 1500 and 1600 hours, the average number of comments is roughly 175% higher than the overall Ask HN average (38.59 against 14.04)! So, bottom line - post your Ask HN questions between 1500 and 1600 hours.
For Show HN and Others posts, the hour of posting doesn't appear to change the average number of points by much, at least across the top hours listed above.
Ask type posts receive about 36% more comments on average than Show type posts (14.04 against 10.32).
Generally speaking, Others and Show type posts receive more points than Ask HN type posts. So, if you are fishing for more points on your posts, post in these types! :-P

These findings conclude my analysis. If you went through all the trouble of reading this far, then thanks for your time!

PS -

As stated earlier in the introduction, the relevance of the subject and its interest to the community matter too. I learnt this the hard way, by posting a question of my own as an "Ask HN" post... which sadly received no reply (it may have something to do with the time zones, I guess..:p)
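On the time zone point: the hours in this analysis are the ones recorded in the data set, so they need converting before being applied to one's own clock. The sketch below shows the conversion with fixed UTC offsets. Both offsets are assumptions made purely for illustration, since this notebook does not establish which time zone the created_at column uses:

```python
import datetime as dt

# ASSUMPTION: both offsets are illustrative only. The notebook does not
# state the time zone of the 'created_at' column.
DATASET_TZ = dt.timezone(dt.timedelta(hours=-5))           # hypothetical
LOCAL_TZ = dt.timezone(dt.timedelta(hours=5, minutes=30))  # hypothetical

# The "best" Ask HN hour found above, on an arbitrary date.
best_hour = dt.datetime(2016, 8, 4, 15, 0, tzinfo=DATASET_TZ)

# The same instant, expressed on the reader's local clock.
local = best_hour.astimezone(LOCAL_TZ)
print(local.strftime('%H:%M'))  # 01:30 (on the following day)
```

Fixed offsets ignore daylight saving time, so a real conversion would need proper time zone data.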
