Exploring Hacker News Posts

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

In this project we focus on posts whose titles begin with either 'Ask HN' or 'Show HN'. Users submit Ask HN posts to ask the Hacker News community a specific question.

Examples

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?

Similarly, users use Show HN to show the Hacker News community a project, product, or just generally something interesting.

Examples

Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made

We will compare these two types of posts to determine:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Data Source

The data set used in this project comes from Kaggle. It has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.

Below are descriptions of the columns:

id: The unique identifier from Hacker News for the post

title: The title of the post

url: The URL that the post links to, if the post has a URL

num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

num_comments: The number of comments that were made on the post

author: The username of the person who submitted the post

created_at: The date and time at which the post was submitted

Let us start off by importing our data set

In [1]:

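from csv import reader

# Read every row of the CSV into a list of lists; the with block closes the file
with open('hacker_news.csv') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Split the header row from the data rows
headers = hn[0]
hn = hn[1:]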

print(len(hn))
print('\n')
print(headers)
print('\n')
print(hn[:4])

20100


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]

Filtering our Data

As discussed, we want to isolate the posts whose titles begin with Ask HN or Show HN. Let us create two separate lists of lists to hold the data for these posts.

To find the posts which begin with either Ask HN or Show HN, we'll use the string method startswith. Let us do a test run.

In [2]:

string1 = "SwarJoshi"
print(string1.startswith("Swar"))
print(string1.startswith("swar"))

True
False

While this approach works, the test shows that startswith is case sensitive. As we cannot control how users capitalize their titles, our approach has to account for this.

So, we will convert each title to lower case using lower() and then apply startswith to the lowercased string.

In [3]:

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
        

1744
1162
17194

From this we can make the following basic observations:

There are more Ask HN posts (1744) than Show HN posts (1162).
Neither type makes up more than 10% of the data set on its own.
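The 10% figure can be checked with a little arithmetic. The sketch below simply reuses the three counts printed by the cell above; the variable names are illustrative and not part of the original notebook.

```python
# Counts copied from the output of the filtering cell
ask_count = 1744
show_count = 1162
other_count = 17194
total_posts = ask_count + show_count + other_count  # 20100 rows in total

# Share of each post type as a percentage of the whole data set
ask_share = ask_count / total_posts * 100
show_share = show_count / total_posts * 100

print("Ask HN share: {:.1f}%".format(ask_share))    # Ask HN share: 8.7%
print("Show HN share: {:.1f}%".format(show_share))  # Show HN share: 5.8%
```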

Digging deeper, let us determine if ask posts or show posts receive more comments on average.

In [4]:

total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993

As we can see, Ask HN posts receive about 14 comments on average, while Show HN posts receive about 10. This comes in addition to the fact that there are more Ask HN posts than Show HN posts in general.

With a higher number of posts and more comments per post, it could be said that Ask HN posts have higher engagement than Show HN posts.

One reason could be that a person asking a question is directly requesting a response. Such a request is likely to draw a few comments, plus replies from the original poster, leading to a higher number of comments.

As ask posts receive more comments, we will focus the rest of our analysis on these posts alone.
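The two averaging loops in this section repeat the same logic, so they could be folded into one small function. The following is only a sketch: average_comments and its comments_index parameter are hypothetical names, and the two sample rows are copied from the data preview printed at the start of the notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments for a list of post rows.

    Hypothetical helper, not part of the original notebook. It assumes the
    comment count is stored as a string at position comments_index.
    """
    if not posts:
        return 0.0  # avoid dividing by zero for an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two rows from the preview output: 52 comments and 1 comment
sample_posts = [
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/',
     '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise',
     'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
     '3', '1', 'hswarna', '6/17/2016 0:01'],
]

print(average_comments(sample_posts))  # 26.5
```

With such a helper, the two averages would come from average_comments(ask_posts) and average_comments(show_posts).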

Bringing Time of Posting into the Picture

It is no secret that digital marketers and social media 'influencers' on Facebook, Twitter, Instagram etc. have set times for posting their regular images (not the I-wish-it-were-occasional drunk ramblings), chosen to bring in the most engagement and views. While the users/posters of HN are way cooler in general, we can still check whether there are times of day when post engagement is higher.

In order to perform this analysis we will:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

We will use the datetime module to work with the created_at column (index 6). The strptime() method of the datetime class parses the date and time strings in that column into datetime objects, from which we can extract the hour.
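As a quick illustration of the parsing step, the sketch below applies strptime() to one created_at value taken from the data preview. The variable names are illustrative only.

```python
import datetime as dt

# A created_at value copied from the first row of the data preview
created_at = "8/4/2016 11:52"

# %m/%d/%Y matches month/day/year, %H:%M matches a 24-hour time
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

print(parsed.hour)               # 11
print(parsed.strftime("%H:%M"))  # 11:52
```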

In [5]:

import datetime as dt

result_list = []
for row in ask_posts:
    created = row[6]
    comments = int(row[4])
    new1 = [created, comments]
    result_list.append(new1)


counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    thehour = row[0]
    comment = row[1]
    dt_object = dt.datetime.strptime(thehour, "%m/%d/%Y %H:%M")
    hour = dt_object.hour
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
        
print(counts_by_hour)
print('\n')
print(comments_by_hour)
    

{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}


{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}

counts_by_hour: the total number of ask posts created in each hour
comments_by_hour: the total number of comments on ask posts created in each hour

This gives us, for each hour of the day, the number of Ask HN posts created and the total comments they received. Let us now find the average number of comments per post for each hour.

In [38]:

avg_by_hour = []

# Iterating over the dictionary yields its keys, i.e. the hours
for hour in comments_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])
    
print(avg_by_hour)
    

[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]
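The same per-hour averages could also be kept in a dictionary keyed by hour, which makes looking up a single hour easier. This is a sketch of that alternative, not the notebook's approach: average_per_hour is a hypothetical name, and only three hours from the dictionaries printed earlier are used to keep it short.

```python
# Three entries copied from the counts_by_hour and comments_by_hour output
counts_by_hour = {2: 58, 9: 45, 15: 116}
comments_by_hour = {2: 1381, 9: 251, 15: 4477}

# Divide total comments by number of posts for every hour
average_per_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in counts_by_hour
}

for hour, average in average_per_hour.items():
    print("{:02d}:00 -> {:.2f}".format(hour, average))
```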

In [42]:

swap_avg_by_hour = []

for row in avg_by_hour:
    temp = row[0]
    temp1 = row[1]
    swap_avg_by_hour.append([temp1, temp])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Posts Comments :",sorted_swap[:5])
print('\n')

for average, hour in sorted_swap[:5]:
    dt_hour = dt.datetime.strptime(str(hour), "%H")
    dt_hour = dt_hour.strftime("%H:%M")

    print("{} : {} average comments per post".format(dt_hour,average))
    print('\n')

Top 5 Hours for Ask Posts Comments : [[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21]]


15:00 : 38.5948275862069 average comments per post


02:00 : 23.810344827586206 average comments per post


20:00 : 21.525 average comments per post


16:00 : 16.796296296296298 average comments per post


21:00 : 16.009174311926607 average comments per post
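The ranking above swaps each pair so that sorted() orders by the average. A minimal sketch of an alternative, assuming the same [hour, average] pairs, passes a key function to sorted() so no swapped copy is needed. The name ranked is hypothetical, and the sample values are copied from the averages printed earlier.

```python
# A subset of the [hour, average] pairs from the avg_by_hour output
avg_by_hour = [
    [2, 23.810344827586206],
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [16, 16.796296296296298],
    [20, 21.525],
    [21, 16.009174311926607],
]

# Sort by the second element (the average), largest first
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

top_hours = [hour for hour, _ in ranked[:5]]
print(top_hours)  # [15, 2, 20, 16, 21]

for hour, average in ranked[:5]:
    print("{:02d}:00 : {:.2f} average comments per post".format(hour, average))
```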

We can see that posts created in the hour between 15:00 and 16:00 EST receive the most comments on average in the Ask HN section.

This could possibly be due to either of the following reasons:

It is a general break time at the workplace for tech folks working in the USA and Canada, or
More of the comments come from European countries (CET), where it is already evening at that time.

This is followed by the hour between 02:00 and 03:00.

Considering that Hacker News is primarily an American site, 02:00 seems rather odd. However, the contributors are not necessarily from the US, and the comments on 02:00 posts probably come from nocturnal American techies (or their Silicon Valley buddies who left work late), their non-nocturnal South Asian counterparts, or early-bird techies from Europe/Africa.

To get the most replies, posts should be made between 15:00 and 16:00 (a little later would not hurt, as the hour between 16:00 and 17:00 has also made the list, but it doesn't look all that impressive).

But these times are relevant to people living in EST. What if I want to figure out what my chances are as a person living in Germany?

EST is UTC-5 and CET is UTC+1, so CET is 6 hours ahead. Let's add 6 hours to convert the times from EST to CET using dt.timedelta().

In [46]:

for average, hour in sorted_swap[:5]:
    dt_hour = dt.datetime.strptime(str(hour), "%H")
    dt_hour = dt_hour + dt.timedelta(hours=6)
    dt_hour = dt_hour.strftime("%H:%M")
    
    print("{} : {} average comments per post".format(dt_hour,average))
    print('\n')

21:00 : 38.5948275862069 average comments per post


08:00 : 23.810344827586206 average comments per post


02:00 : 21.525 average comments per post


22:00 : 16.796296296296298 average comments per post


03:00 : 16.009174311926607 average comments per post

There! As someone living in Germany, my best chance of getting the most engagement is to post between 21:00 and 22:00 CET (as I mentioned, a little later is fine too, but not optimal), or between 08:00 and 09:00. This is subject, of course, to my having something valuable to ask in my Ask HN post.
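The conversion can also be done with timezone-aware datetimes rather than a hand-picked offset. The sketch below assumes the timestamps are in EST (UTC-5) and the target zone is CET (UTC+1), uses fixed offsets, and ignores daylight saving time. The function est_hour_to_cet and the placeholder date are hypothetical and not part of the original notebook.

```python
import datetime as dt

# Fixed-offset zones: an assumption that ignores daylight saving time
EST = dt.timezone(dt.timedelta(hours=-5), "EST")
CET = dt.timezone(dt.timedelta(hours=1), "CET")


def est_hour_to_cet(hour):
    """Convert a whole hour in EST to an HH:MM string in CET."""
    # The date is an arbitrary placeholder; only the time of day matters here
    est_time = dt.datetime(2016, 1, 15, hour, 0, tzinfo=EST)
    return est_time.astimezone(CET).strftime("%H:%M")


# The top five hours found above, in EST
top_hours_est = [15, 2, 20, 16, 21]
converted = [est_hour_to_cet(hour) for hour in top_hours_est]
print(converted)  # ['21:00', '08:00', '02:00', '22:00', '03:00']
```

Letting astimezone() do the arithmetic takes the offset from the two zone definitions instead of a number typed by hand. For dates where daylight saving time matters, the standard library's zoneinfo module (Python 3.9+) provides named zones that account for it.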
