Maximizing Positive Community Engagement on Hacker News

In this project, we will use Hacker News data to understand which posting behaviors receive the highest levels of positive engagement from the Hacker News community.

GOAL: Determine the type and timing of posts that lead to positive engagement from Hacker News users to inform what and when we should publish.

The dataset has been shortened to include only a random sample of posts that received comments on Hacker News. The columns are labeled as follows:

  • [0] id: The unique identifier from Hacker News for the post
  • [1] title: The title of the post
  • [2] url: The URL that the post links to, if the post has a URL
  • [3] num_points: The post's number of points (upvotes - downvotes)
  • [4] num_comments: The number of comments that were made on the post
  • [5] author: The username of the person who submitted the post
  • [6] created_at: The date and time at which the post was submitted
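
Because every later cell refers to these columns by position, one optional way to keep the indices readable is to give them names. The sketch below is only an illustration: the constant names are hypothetical (they are not used elsewhere in this notebook), and the sample row is the first row of the dataset as printed further down.

```python
# Hypothetical constant names; the positions follow the column list above.
ID, TITLE, URL, NUM_POINTS, NUM_COMMENTS, AUTHOR, CREATED_AT = range(7)

# First row of the dataset, as printed by the loading cell.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/',
              '386', '52', 'ne0phyte', '8/4/2016 11:52']

title = sample_row[TITLE]
points = int(sample_row[NUM_POINTS])      # CSV fields are read in as strings
comments = int(sample_row[NUM_COMMENTS])

print(title, points, comments)
```

The rest of the notebook keeps the plain numeric indices, so this is a readability option rather than a requirement.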
In [1]:
from csv import reader

# Read the whole file into a list of rows; the file is closed automatically
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

headers = hn[0]  # column names
hn = hn[1:]      # data rows only

for row in hn[:4]:
    print(row)
    print("\n")
    
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


Filtering the Data

This data does not have any 'category' delineations. To drill down to certain types of posts, we will need to use their titles, as there are some naming conventions that are in widespread use across the site.

We will focus specifically on posts whose titles begin with Ask HN and Show HN. These are posts that either pose a question to the community or show the community something new, respectively. We will filter down to those posts below:

In [2]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn: 
    title = row[1]
    lowercase = title.lower()
    
    if lowercase.startswith('ask hn'):
        ask_posts.append(row)
        
    elif lowercase.startswith('show hn'):
        show_posts.append(row)
    
    else:
        other_posts.append(row)
        
print("There are " + str(len(ask_posts)) + " Ask HN posts")
print("There are " + str(len(show_posts)) + " Show HN posts")
print("There are " + str(len(other_posts)) + " Other posts")
        
There are 1744 Ask HN posts
There are 1162 Show HN posts
There are 17194 Other posts

Analysis of Comment Activity & Timing

Next, we can dig in and look at engagement with the Ask HN and Show HN posts. One proxy for engagement is the average number of comments that each type of post received.

In [3]:
total_ask_comments = 0

for row in ask_posts: 
    num_comments = int(row[4])
    total_ask_comments += num_comments
     
avg_ask_comments = (total_ask_comments) / len(ask_posts)

print("Avg Ask HN comments:")
print(round(avg_ask_comments,2))
Avg Ask HN comments:
14.04
In [4]:
total_show_comments = 0

for row in show_posts: 
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = (total_show_comments) / len(show_posts)

print("Avg Show HN comments:")
print(round(avg_show_comments,2))
Avg Show HN comments:
10.32

We know that there are more Ask posts than Show posts, and in looking at the average comments per post, we also see heightened engagement with Ask posts. On average, Show posts see about 10 comments each, while Ask posts see around 14.
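
The two cells above repeat the same loop for each post type. As a minimal sketch, the calculation could be wrapped in a small helper; the function name `average_column` and the two sample rows are illustrative assumptions, not part of the original analysis or the dataset.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)

# Illustrative rows only (not taken from the dataset); index 4 is num_comments.
sample_posts = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: example two', '', '3', '20', 'user_b', '1/1/2016 15:00'],
]

print(round(average_column(sample_posts, 4), 2))
```

With the real lists, `average_column(ask_posts, 4)` and `average_column(show_posts, 4)` would be expected to reproduce the two averages printed above.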

For our analysis of comment activity, let's focus our attention on these Ask posts. Next, we'll look at timing: does the time of day at which a post is created correlate with higher or lower engagement?

In [5]:
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    num_comments = row[1]
    clean_date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = clean_date.strftime("%H")
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    
avg_by_hour = []

for hour in comments_by_hour:
    avg = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg])


swap_avg_by_hour = []

for row in avg_by_hour:
    new_zero = row[1]
    new_one = row[0]
    swap_avg_by_hour.append([new_zero, new_one])
    
sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print("Top 5 Hours for Ask Post Comments")
print("\n")

for row in sorted_swap[:5]:
    avg_comments = row[0]
    hour = row[1]
    clean_hour = dt.datetime.strptime(hour, "%H")
    final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
    output_string = "{0}: {1:.2f} average comments per post".format(final_hour,avg_comments)
    print(output_string)

    
Top 5 Hours for Ask Post Comments


15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

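Ask posts created during the 3pm, 2am, 8pm, 4pm, and 9pm hours receive the most comments on average. Note that these are the hours in which the posts were submitted, not the hours in which the comments were written.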
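
The cell above swaps each [hour, average] pair so that a plain `sorted` call orders by the average. A possible alternative, sketched here under the assumption that the pairs keep their original [hour, average] order, is to pass a `key` function instead. The values below are the rounded figures copied from the output above, used only for illustration.

```python
# [hour, average] pairs; the averages are the rounded values printed above.
avg_by_hour = [['16', 16.80], ['15', 38.59], ['02', 23.81],
               ['21', 16.01], ['20', 21.52]]

# Sort on the second element (the average), largest first, without swapping.
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

This avoids building a second list, at the cost of a slightly less obvious `sorted` call.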

It is intuitive that people would be more likely to engage with posts toward the end of, or after, their work day. We do see engagement throughout the day, but it is a bit lower on average:

In [6]:
# Print the averages for posts created in the 08:00 through 14:00 hours
for row in sorted_swap:
    if 8 <= int(row[1]) <= 14:
        avg_comments = row[0]
        hour = row[1]
        clean_hour = dt.datetime.strptime(hour, "%H")
        final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
        output_string = "{0}: {1:.2f} average comments per post".format(final_hour,avg_comments)
        print(output_string)
        
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
11:00: 11.05 average comments per post
08:00: 10.25 average comments per post
12:00: 9.41 average comments per post
09:00: 5.58 average comments per post

With an average of about 39 comments per post, 3pm stands out as the best hour to post an Ask HN question. This may be because folks are already at their computers for work, but are winding down their day, or perhaps burning out for the day and exploring Hacker News.
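
An hourly average can be pulled up by a handful of very active threads, so it may be worth checking how many posts sit behind each figure and how the median compares with the mean. The sketch below shows one way to do that; the sample pairs are made up for illustration and are not taken from the dataset.

```python
import datetime as dt
from collections import defaultdict
from statistics import median

# Illustrative [created_at, num_comments] pairs (not real data).
sample = [
    ['8/16/2016 15:10', 4],
    ['8/17/2016 15:45', 6],
    ['8/18/2016 15:05', 290],   # one unusually active thread
    ['8/16/2016 9:20', 5],
    ['8/17/2016 9:40', 7],
]

# Collect every comment count under the hour in which the post was created.
comments_per_hour = defaultdict(list)
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    comments_per_hour[hour].append(num_comments)

# For each hour, keep the post count, the mean, and the median.
summary = {}
for hour, values in comments_per_hour.items():
    summary[hour] = (len(values), sum(values) / len(values), median(values))

for hour in sorted(summary):
    count, avg, mid = summary[hour]
    print("{}:00  posts={}  mean={:.2f}  median={}".format(hour, count, avg, mid))
```

If the mean and median for an hour are far apart, or the post count is small, the average for that hour should probably be read with some caution.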

If we are not able to post at 3pm, posting late at night around 2am could help us catch the night owl visitors to the site, who also seem to comment quite often.

Analysis of Point Earnings for Positivity of Engagement

Comments aren't the only way to determine engagement. After all, people could be furiously commenting on a post because they strongly dislike or disagree with it. We care not only about attracting attention, but also about making a positive contribution to the Hacker News community.

Point counts on Hacker News factor in positive as well as negative reactions. Let's take a look at Ask vs Show posts and see which type of post is earning the most points on average.

In [7]:
show_post_count = 0
show_point_count = 0

for row in show_posts:
    points = int(row[3])
    show_post_count += 1
    show_point_count += points

avg_show_points = show_point_count / show_post_count

print("Average points of Show HN posts:")
print(round(avg_show_points,2))

ask_post_count = 0
ask_point_count = 0

for row in ask_posts:
    points = int(row[3])
    ask_post_count += 1
    ask_point_count += points

avg_ask_points = ask_point_count / ask_post_count

print("Average points of Ask HN posts:")
print(round(avg_ask_points,2))
    
Average points of Show HN posts:
27.56
Average points of Ask HN posts:
15.06

In looking at point counts, we can see that even though Show posts get ~4 fewer comments on average than Ask posts, they are seeing more positive engagement in the form of points. Show posts are earning close to double the points of Ask posts.

This may be because a 'Show' post and an 'Ask' post invite two different types of engagement. A 'Show' post brings a new idea or thought to the community, which will often provoke excitement and engagement in the form of upvotes. An 'Ask' post invites conversation and troubleshooting, which most effectively takes the form of comments.

Both types of engagement can be positive - that said, using this data, we are able to have more confidence in point counts as a measure of positive engagement, since they represent both upvotes and downvotes.

With this in mind, I'll focus on point counts as a proxy for positive engagement moving forward.

Analysis of Point Counts by Time of Day

Let's see if there's a best time of day to post in order to gain positive engagement through points. Here we will focus on 'Show' posts, since we've seen above that they outperform Ask posts in terms of point counts.

In [8]:
result_list = []

for row in show_posts: 
    num_points = int(row[3])
    date = row[6]
    result_list.append([num_points, date])
    
    
show_points_by_hour = {}
show_count_by_hour = {}

for row in result_list:
    date = row[1]
    num_points = row[0]
    clean_date = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = clean_date.strftime("%H")
    
    if hour not in show_count_by_hour:
        show_count_by_hour[hour] = 1
        show_points_by_hour[hour] = num_points
    else:
        show_count_by_hour[hour] += 1
        show_points_by_hour[hour] += num_points
        
show_avg_by_hour = []

for hour in show_points_by_hour:
    avg = show_points_by_hour[hour] / show_count_by_hour[hour]
    show_avg_by_hour.append([hour, avg])
    
sort_show_avg = []

for hour in show_avg_by_hour:
    new_zero = hour[1]
    new_one = hour[0]
    sort_show_avg.append([new_zero, new_one])

sort_show_avg = sorted(sort_show_avg, reverse = True)

print("Top 5 Posting Hours for Point Counts")

for row in sort_show_avg[:5]:
    avg_points = row[0]
    hour = row[1]
    clean_hour = dt.datetime.strptime(hour, "%H")
    final_hour = dt.datetime.strftime(clean_hour, "%H:%M")
    output_string = "{0}: {1:.2f} average points per post".format(final_hour,avg_points)
    print(output_string)
Top 5 Posting Hours for Point Counts
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post

We see some similarities between time-of-day engagement for comments on Ask posts and points on Show posts. In the case of point counts, however, three of the top 5 hours (10pm, 11pm, and 12am) run in immediate succession, suggesting that Show posts created in the 10pm-12am window tend to collect more points.

It looks as if this would be the ideal time frame to publish a Show HN post in order to maximize positive engagement.
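
The hourly breakdown for Show post points follows the same steps as the earlier one for Ask post comments. As a rough sketch, both could share a single helper; the function name `average_by_hour`, its parameters, and the sample rows are assumptions made for illustration.

```python
import datetime as dt
from collections import defaultdict

def average_by_hour(rows, value_index, date_index=6):
    """Average the integer column `value_index` per creation hour.

    Returns (hour, average) tuples sorted from highest to lowest average.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        created = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        totals[created.hour] += int(row[value_index])
        counts[created.hour] += 1
    averages = [(hour, totals[hour] / counts[hour]) for hour in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)

# Illustrative rows only (not taken from the dataset); index 3 is num_points.
sample_posts = [
    ['1', 'Show HN: example one', '', '40', '3', 'user_a', '1/1/2016 23:10'],
    ['2', 'Show HN: example two', '', '20', '1', 'user_b', '1/2/2016 23:50'],
    ['3', 'Show HN: example three', '', '10', '0', 'user_c', '1/2/2016 9:05'],
]

for hour, avg in average_by_hour(sample_posts, 3):
    print("{:02d}:00: {:.2f} average points per post".format(hour, avg))
```

Passing index 4 instead of 3 would give the per-hour comment averages, which is the calculation used for the Ask posts earlier.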

Checking our Findings Against 'Other' Data

For learning purposes, early on in this analysis I decided to focus on 'Ask HN' and 'Show HN' posts in the Hacker News community. As a quick gut check to ensure we're not missing any high-value posting types, we will compare our findings to the other_posts list, which contains all posts whose titles do not begin with 'Ask HN' or 'Show HN'.

Let's take a quick look at the average comment & point activity in the 'other' category, compared to the highest performing post type for each measure:

In [9]:
other_post_count = 0
other_comment_count = 0 
other_point_count = 0

for row in other_posts: 
    points = int(row[3])
    comments = int(row[4])
    
    other_point_count += points
    other_comment_count += comments
    other_post_count += 1
    
other_avg_comments = other_comment_count / other_post_count
other_avg_points = other_point_count / other_post_count

print("Average 'Other' Comments")
print(round(other_avg_comments,2))
print("Average 'Ask HN' Comments")
print(round(avg_ask_comments,2))
print('\n')
print("Average 'Other' Points")
print(round(other_avg_points,2))
print("Average 'Show HN' Points")
print(round(avg_show_points,2))
Average 'Other' Comments
26.87
Average 'Ask HN' Comments
14.04


Average 'Other' Points
55.41
Average 'Show HN' Points
27.56

Checking our Ask and Show analysis against the 'Other' data provides a really interesting next path to explore in this analysis. Though we could get some great engagement through 'Show HN' posts, it seems as though there are some hidden opportunities in the 'Other' category that we should explore.

That said, one challenge we may run into is that this data is not categorized - we were able to use a proxy for the category by analyzing two types of posts that always begin with the same string. Let's take a high level look at the 'Other' data to see if there is a clear path forward.

Quick Exploration of 'Other' Data

Just to see if we can glean some immediate insights from, or easily categorize, this 'other' data, below we've taken a look at the top 20 titles by point value.

In [10]:
other_post_points = []

# Collect [points, title] pairs so that sorting orders by points first
for row in other_posts:
    title = row[1]
    points = int(row[3])
    other_post_points.append([points, title])
    
sorted_other_points = sorted(other_post_points, reverse = True)

print("Top 20 'Other' Titles by Points")

for row in sorted_other_points[:20]:
    title = row[1]
    points = row[0]
    string = "{} points : {}"
    final_string = string.format(points, title)
    print(final_string)
    
Top 20 'Other' Titles by Points
2553 points : Pardon Snowden
2381 points : Tell HN: New features and a moderator
1851 points : Master Plan, Part Deux
1622 points : Responsive Pixel Art
1573 points : I've Just Liberated My Modules
1565 points : Being sued, in East Texas, for using the Google Play Store [video]
1562 points : Instagram's Million Dollar Bug
1559 points : TensorFlow: open-source library for machine intelligence
1447 points : Amazon's customer service backdoor
1395 points : Lee Sedol Beats AlphaGo in Game 4
1368 points : VLC contributor living in Aleppo writing about the Paris attacks
1323 points : It's The Future
1304 points : He Always Had a Dark Side
1302 points : Graphing when your Facebook friends are awake
1284 points : SpaceX launch webcast: Orbcomm-2 Mission [video]
1282 points : My First 10 Minutes on a Server
1260 points : Let's Encrypt is Trusted
1238 points : Philae Found
1207 points : Google achieves AI 'breakthrough' by beating Go champion
1195 points : Massachusetts Bans Employers from Asking Applicants About Previous Pay

At first glance, no obvious categories jump out from these top titles that could guide us toward a particular type of 'other' post to pursue. I'm interested in hearing how one might approach digging further into this section of the data.
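
One possible direction, offered only as a sketch, is to group the 'other' posts by the domain of their URL (column [2]) and compare average points per domain. The helper name `average_points_by_domain`, the `min_posts` parameter, the '(no url)' label, and the sample rows with placeholder domains are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def average_points_by_domain(rows, min_posts=1):
    """Return (domain, post_count, average_points), highest average first."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        domain = urlparse(row[2]).netloc.lower()
        if domain.startswith('www.'):
            domain = domain[4:]          # treat www.example.com as example.com
        if not domain:
            domain = '(no url)'          # posts without a link
        totals[domain] += int(row[3])
        counts[domain] += 1
    result = [(d, counts[d], totals[d] / counts[d])
              for d in totals if counts[d] >= min_posts]
    return sorted(result, key=lambda item: item[2], reverse=True)

# Illustrative rows only; the domains are placeholders, not real findings.
sample_posts = [
    ['1', 'Post A', 'https://www.example.com/a', '100', '5', 'u1', '1/1/2016 9:00'],
    ['2', 'Post B', 'http://example.com/b', '50', '2', 'u2', '1/1/2016 10:00'],
    ['3', 'Post C', 'https://blog.example.org/post', '30', '1', 'u3', '1/2/2016 11:00'],
    ['4', 'Post D', '', '10', '4', 'u4', '1/3/2016 12:00'],
]

for domain, count, avg in average_points_by_domain(sample_posts):
    print("{}: {} posts, {:.2f} average points".format(domain, count, avg))
```

Raising `min_posts` would filter out domains with only a few posts, whose averages are easily dominated by a single popular submission.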

Conclusion

To maximize positive engagement on the Hacker News site, we should opt to publish an engaging 'Show HN' post in the 10pm-12am window. This type of post and time period have shown promising levels of engagement based on point values, and would be our best bet to catch the attention of the HN community.

That said, there is an interesting opportunity to explore the other_posts list, which is more difficult to categorize & filter, but shows very high levels of engagement upon a cursory exploration of the data.