Driving post engagement - Analysing post engagement in Hacker News

Hacker News is a site where user-submitted stories (known as "posts") are voted and commented upon. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result. We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. The goal of our analysis is to determine the following:

  • Do Ask HN or Show HN posts receive more comments on average?
  • Do posts created at a particular time receive more comments on average?

Summary of Results

After analyzing the data, we can say that Ask HN posts receive more comments on average than Show HN posts. Moreover, the time of posting has an impact on the average number of comments: Ask HN posts created at 15:00 Eastern Time (07:00 Fiji time) receive, on average, the highest number of comments. For more details, please refer to the full analysis below.

Exploring existing data

Hacker News post data is available on Kaggle. This data set contains 12 months of data, up to September 2016.

We will start by importing the data.

In [1]:
from csv import reader

# Read the whole file into a list of rows; the with block closes the file
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

for row in hn[:5]:
    print(row)
    print('\n')

print('The data set has '+ str(len(hn))+' rows.')
print("Each row has "+ str(len(hn[0]))+ ' columns.')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


The data set has 20101 rows.
Each row has 7 columns.

The Hacker News data set has 20,101 rows (including the header row) and 7 columns. The columns are:

  • id: the unique identifier of the post

  • title: the title of the post

  • url: the url of the item being linked to

  • num_points: the number of upvotes the post received

  • num_comments: the number of comments the post received

  • author: the name of the account that made the post

  • created_at: the date and time the post was made (the time zone is Eastern Time in the US)
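
Because csv.reader returns every field as a string, num_points, num_comments and created_at have to be converted before any arithmetic or date handling. The following is a minimal sketch of that conversion, using one of the rows printed above; the helper name parse_row is hypothetical and is not part of the original notebook.

```python
import datetime as dt

# Hypothetical helper (not in the original notebook): convert one raw CSV row
# into typed values. Column order follows the header printed above.
def parse_row(row):
    return {
        "id": row[0],
        "title": row[1],
        "url": row[2],
        "num_points": int(row[3]),
        "num_comments": int(row[4]),
        "author": row[5],
        "created_at": dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M"),
    }

# One of the rows shown in the output above
sample = ['12224879', 'Interactive Dynamic Video',
          'http://www.interactivedynamicvideo.com/', '386', '52',
          'ne0phyte', '8/4/2016 11:52']

post = parse_row(sample)
print(post["num_comments"], post["created_at"].hour)
```

The rest of the analysis does the same conversions inline with int() and strptime().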

To continue with the analysis, we are going to remove the header row.

In [2]:
header = hn[0]
hn = hn[1:]

for row in hn[:5]:
    print(row)
    print('\n')
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


We have removed the header to make it easier to analyze the data.

For our analysis we are interested only in Ask HN and Show HN posts. In the next step, we are going to filter the data.

In [3]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print("There are "+ str(len(ask_posts))+" Ask HN posts")
print("There are "+ str(len(show_posts))+" Show HN posts")
print("There are "+ str(len(other_posts))+" other posts")     
There are 1744 Ask HN posts
There are 1162 Show HN posts
There are 17194 other posts

In the step above, we split the data into three lists.

In the data set, there are 1744 Ask HN posts and 1162 Show HN posts. We will perform the rest of our analysis on those filtered datasets.
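
The same rule can be written as a small function, which makes it easy to check on individual titles. This is only a sketch: classify_title is a hypothetical name, and the second and third titles below are made up for illustration.

```python
# Hypothetical helper: apply the same case-insensitive prefix rule as the
# loop above and return a category label.
def classify_title(title):
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

titles = [
    "Interactive Dynamic Video",       # taken from the data set
    "Ask HN: What are you reading?",   # made-up example
    "SHOW HN: My weekend project",     # made-up example, upper case on purpose
]

for title in titles:
    print(classify_title(title), "-", title)
```

Lower-casing the title first means that variants such as "ASK HN" or "show hn" land in the right list.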

Average comments by post category

In the next step, we are going to determine whether Ask HN posts or Show HN posts receive more comments on average.

Average comments on Ask HN posts

In [4]:
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)
output= "The average number of comments for Ask HN post is {:.2f} comments".format(avg_ask_comments)
print(output)
The average number of comments for Ask HN post is 14.04 comments

As we can see, Ask HN posts get 14.04 comments on average.

Average comments on Show HN posts

In [5]:
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)
output= "The average number of comments for Show HN post is {:.2f} comments".format(avg_show_comments)
print(output)
The average number of comments for Show HN post is 10.32 comments

As we can see, Show HN posts get 10.32 comments on average.

Summary

Ask HN posts get 14.04 comments per post, whereas Show HN posts get 10.32 comments per post. Therefore, we can say that Ask HN posts, on average, get more comments than Show HN posts.
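
The two cells above repeat the same logic, so it could be factored into one function. The sketch below assumes the same row layout (num_comments at index 4); average_comments and the demo rows are hypothetical and use made-up values. It also computes the ratio between the two averages reported above.

```python
# Hypothetical helper: mean of the num_comments column for a list of rows.
def average_comments(posts, comments_index=4):
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows for illustration only, in the same column order as the data set
demo_posts = [
    ['1', 'Ask HN: example A', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: example B', '', '3', '4', 'user_b', '1/2/2016 11:00'],
]
print(average_comments(demo_posts))

# Ratio of the two averages reported above (14.04 and 10.32)
ratio = 14.04 / 10.32
print("Ask HN / Show HN ratio: {:.2f}".format(ratio))
```

With the figures above, Ask HN posts receive roughly 1.36 times as many comments per post as Show HN posts.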

Impact of posting time on the number of comments

As we have seen, Ask HN posts drive higher engagement than Show HN posts. For the rest of the analysis, we are going to focus on Ask HN posts only.

The next step is to determine whether Ask HN posts created at a certain time are more likely to attract comments.

To do so, we will count the Ask HN posts and their comments by the hour in which each post was created.

In [6]:
import datetime as dt # import as dt to make code more readable

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result = [created_at,num_comments]
    result_list.append(result)

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    # Parse the timestamp, then keep only the two-digit hour (e.g. "15")
    created_at = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")
    comment = row[1]
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment

In the step above, we created two dictionaries:

  • counts_by_hour: contains the number of ask posts created during each hour of the day.
  • comments_by_hour: contains the total number of comments received by ask posts created during each hour of the day.
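
As an alternative to the explicit "if hour not in ..." check, the same two tallies can be built with collections.defaultdict, which starts every missing key at zero. This is a sketch on made-up [created_at, num_comments] pairs, not on the real data; the demo_ names are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs for illustration
demo_results = [
    ["8/4/2016 15:10", 12],
    ["1/26/2016 15:45", 8],
    ["6/23/2016 02:20", 5],
]

demo_counts = defaultdict(int)    # posts per hour
demo_comments = defaultdict(int)  # total comments per hour

for created_at, num_comments in demo_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    demo_counts[hour] += 1
    demo_comments[hour] += num_comments

print(dict(demo_counts))
print(dict(demo_comments))
```

The keys are zero-padded strings such as "02" and "15", matching the format produced by strftime("%H") in the cell above.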

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [7]:
average_by_hour = []

for hour in comments_by_hour:
    comments = comments_by_hour[hour]
    total = counts_by_hour[hour]
    
    average = comments / total
    average_by_hour.append([hour,average])
    

In the step above, we calculated the average number of comments for each hour of the day.

To make the data more readable, we are going to sort it and print the top 5 hours for Ask HN post comments.
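
The per-hour average is simply total comments divided by post count for each key. A compact sketch of the same calculation as a dictionary comprehension, on made-up tallies (the demo_ names and values are hypothetical):

```python
# Made-up tallies for illustration: two posts at 15:00 with 20 comments in
# total, one post at 02:00 with 5 comments.
demo_counts = {"15": 2, "02": 1}
demo_comments = {"15": 20, "02": 5}

# Average comments per post for each hour
demo_averages = {
    hour: demo_comments[hour] / demo_counts[hour]
    for hour in demo_counts
}
print(demo_averages)
```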

In [8]:
swap_avg_by_hour = []
for row in average_by_hour:
    hour = row[0]
    average = row[1]
    swap_avg_by_hour.append([average,hour])

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print( "Top 5 Hours for Ask Posts Comments")

for row in sorted_swap[:5]:
    string = "{}:00 : {:.2f} average comments per post"
    output = string.format(row[1],row[0])
    print(output)
Top 5 Hours for Ask Posts Comments
15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post

From the output above, we can see that Ask HN posts created at 15:00 get, on average, the highest number of comments.

The time zone of the data above is US Eastern Time. As I am currently living in Fiji, I am going to convert the best posting time to local time in Fiji.
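
Swapping the columns before sorting is one way to rank the hours. Another is to pass a key function to sorted(), which sorts on the average directly and leaves each pair in [hour, average] order. The sketch below uses the five averages printed above, deliberately shuffled; average_by_hour_demo is a hypothetical name.

```python
# The five [hour, average] pairs from the output above, in shuffled order
average_by_hour_demo = [
    ["16", 16.80],
    ["15", 38.59],
    ["21", 16.01],
    ["02", 23.81],
    ["20", 21.52],
]

# Sort on the second element (the average), highest first
ranked = sorted(average_by_hour_demo, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print("{}:00 : {:.2f} average comments per post".format(hour, average))
```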

In [9]:
import pytz

# Build a datetime for 15:00 on a date covered by the data set.
# A concrete date is needed because the offset between the two zones depends
# on daylight saving time; strptime without a date defaults to 1900-01-01.
Eastern_time = dt.datetime.strptime("8/4/2016 15:00", "%m/%d/%Y %H:%M")

# Define the HN time zone and my current time zone
Hn_time_zone = pytz.timezone("US/Eastern")
Fiji_time_zone = pytz.timezone("Pacific/Fiji")

# Convert US/Eastern to Fiji time
Eastern_time = Hn_time_zone.localize(Eastern_time)
Fiji_time = Eastern_time.astimezone(Fiji_time_zone)

Fiji_time = Fiji_time.strftime("%H")

print("15:00 US/Eastern time corresponds to " + Fiji_time + ":00 local time in Fiji")
                  
15:00 US/Eastern time corresponds to 07:00 local time in Fiji

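From the above, we can see that 15:00 US/Eastern time corresponds to 07:00 local time in Fiji (on the following calendar day). This holds while the US is on daylight saving time and Fiji is on standard time; at other times of year the Fiji equivalent shifts by an hour or two.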
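
The conversion can also be sketched with the standard-library zoneinfo module (Python 3.9 or later) instead of pytz. The example below is an illustration under assumptions: the helper eastern_hour_to_fiji is hypothetical, "America/New_York" is used as the canonical name for the US Eastern zone, and the two dates are arbitrary days within the period covered by the data set.

```python
import datetime as dt
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

eastern = ZoneInfo("America/New_York")  # canonical name for US Eastern time
fiji = ZoneInfo("Pacific/Fiji")

# Hypothetical helper: convert an hour on a given US Eastern date to Fiji time
def eastern_hour_to_fiji(year, month, day, hour):
    eastern_dt = dt.datetime(year, month, day, hour, tzinfo=eastern)
    return eastern_dt.astimezone(fiji)

summer = eastern_hour_to_fiji(2016, 8, 4, 15)  # US daylight saving time
winter = eastern_hour_to_fiji(2016, 3, 1, 15)  # US standard time

print(summer.strftime("%Y-%m-%d %H:%M"))
print(winter.strftime("%Y-%m-%d %H:%M"))
```

The two dates give different Fiji hours, which is why the conversion should be tied to a concrete date rather than to a bare time of day.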

Conclusion

In this project, we analyzed Hacker News post data to find what drives post engagement. Our analysis focused on Ask HN and Show HN posts.

In conclusion, we can say that Ask HN posts receive more comments on average than Show HN posts. Moreover, the time of posting has an impact on the average number of comments: Ask HN posts created at 15:00 Eastern Time (07:00 Fiji time) receive, on average, the highest number of comments.