Let's study 20,000 different posts from Hacker News to determine the keys to a post's success.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
The full dataset is available on the Kaggle platform.
from csv import reader

# Read the dataset into a list of lists
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
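As a side note, a with block closes the file automatically once the read is done, which avoids leaving the file handle open. Here is a sketch of that pattern using a hypothetical two-row sample file standing in for hacker_news.csv:

```python
import csv
import os
import tempfile

# Hypothetical sample file standing in for hacker_news.csv
path = os.path.join(tempfile.gettempdir(), "hn_sample.csv")
with open(path, "w") as f:
    f.write("id,title\n1,Ask HN: example\n")

# The with block closes the file automatically at the end of the block
with open(path) as f:
    hn_sample = list(csv.reader(f))

print(hn_sample)  # [['id', 'title'], ['1', 'Ask HN: example']]
```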
Let's look at the first few rows of the dataset to get a better idea of the data we will be dealing with in this project.
To do that, we'll first write a function named explore_data() that we can reuse to print rows in a more readable way.
def explore_data(dataset, start, end):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')

explore_data(hn, 0, 4)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
Now, let's separate the header row from the rest of the dataset, so that hn contains only usable data.
headers = hn[0:1]  # the slice keeps the header as a one-row list
hn = hn[1:]        # drop the header row from the data
Note that we will focus specifically on posts whose titles begin with either Ask HN or Show HN, as these posts are the most relevant for our study.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
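As a quick sanity check, we can compute what share of the dataset each category represents, using the three counts just printed:

```python
# Counts printed above: ask, show, and other posts
counts = {"ask": 1744, "show": 1162, "other": 17194}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n / total:.1%}")
```

Ask and show posts together make up only about 15% of the 20,100 rows; the vast majority of posts fall into the "other" category.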
explore_data(ask_posts,0,2)
explore_data(show_posts,0,2)
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
We will now explore the two lists of "ask" and "show" posts. The first (ask) contains 1,744 rows, whereas the second (show) contains 1,162 rows.
At this stage, we want to determine which type of post generates more reactions from the audience. To evaluate this, we will start with the relevant quantitative data we have: the number of comments generated by each post.
To perform this analysis, we will use the fifth column: num_comments.
print(headers)
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
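Since we address columns by position, it can help to build a small name-to-index mapping from the header row. This is a convenience sketch, not required by the analysis:

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments',
           'author', 'created_at']

# Map each column name to its index
col = {name: i for i, name in enumerate(headers)}
print(col['num_comments'])  # 4
```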
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

for row in show_posts:
    total_show_comments += int(row[4])

average_ask_comments = total_ask_comments / len(ask_posts)
average_show_comments = total_show_comments / len(show_posts)

print(round(average_ask_comments, 2))
print(round(average_show_comments, 2))
14.04
10.32
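The two loops above follow the same pattern, so they could be factored into a small helper. Here is a sketch using hypothetical rows that mimic the dataset's layout (comment count at index 4):

```python
def avg_column(posts, index):
    """Average of an integer column across a list of rows."""
    return sum(int(row[index]) for row in posts) / len(posts)

# Hypothetical rows mimicking the dataset layout
sample = [
    ["1", "Ask HN: a", "", "10", "4", "user1", "8/4/2016 11:52"],
    ["2", "Ask HN: b", "", "6", "8", "user2", "8/5/2016 9:30"],
]
print(avg_column(sample, 4))  # 6.0
```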
It seems that "ask posts" generate more comments than "show posts". One explanation is that people are more likely to answer a question than to react to something being shown: a question directly addresses the reader and invites a response, whereas a show post can more easily be scrolled past without commenting.
Since ask posts attract the larger audience, we will focus the rest of our analysis on them. Let's determine whether ask posts created at a certain time are more likely to attract comments.
We will use the datetime class to work with the posts' dates and times. To better understand the format codes accepted by the .strptime() method, you can refer to the Python datetime documentation.
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]         # creation date of the post
    num_comments = int(row[4])  # number of comments on the post
    # Convert the date string into a datetime object with .strptime
    # and pair it with the number of comments in the same list
    created_at = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    result_list.append([created_at, num_comments])
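To make the format string concrete, here is how one timestamp from the dataset parses: "%m/%d/%Y %H:%M" matches month/day/year followed by hour:minute.

```python
import datetime as dt

stamp = "8/16/2016 9:55"  # timestamp format used in the dataset
parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")
print(parsed)  # 2016-08-16 09:55:00
```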
Now that we have these datetime objects, we can analyze them. We want to know which time slot gathers the most ask posts and the most comments in total.
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = row[0].strftime("%H")  # keep only the hour from the datetime object
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]  # comments of the first post seen that hour
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print(counts_by_hour)
{'17': 100, '11': 58, '01': 60, '03': 54, '02': 58, '20': 80, '07': 34, '15': 116, '04': 47, '00': 55, '14': 107, '13': 85, '09': 45, '23': 68, '19': 110, '05': 46, '21': 109, '06': 44, '16': 108, '12': 73, '08': 48, '10': 59, '18': 109, '22': 71}
print(comments_by_hour)
{'17': 1146, '11': 641, '01': 683, '03': 421, '02': 1381, '20': 1722, '07': 267, '15': 4477, '04': 337, '00': 447, '14': 1416, '13': 1253, '09': 251, '23': 543, '19': 1188, '05': 464, '21': 1745, '06': 397, '16': 1814, '12': 687, '08': 492, '10': 793, '18': 1439, '22': 479}
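As an aside, the if/else branching can be avoided with collections.defaultdict, which initializes missing keys to 0. Here is a sketch on a hypothetical three-row result_list:

```python
import datetime as dt
from collections import defaultdict

# Hypothetical mini result_list: [creation datetime, number of comments]
result_list = [
    [dt.datetime(2016, 8, 16, 9, 55), 6],
    [dt.datetime(2016, 8, 16, 9, 10), 4],
    [dt.datetime(2016, 8, 16, 15, 30), 2],
]

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created, n_comments in result_list:
    hour = created.strftime("%H")
    counts_by_hour[hour] += 1        # missing hours start at 0
    comments_by_hour[hour] += n_comments

print(dict(counts_by_hour))    # {'09': 2, '15': 1}
print(dict(comments_by_hour))  # {'09': 10, '15': 2}
```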
Thus, we created two dictionaries: counts_by_hour, which stores the number of ask posts created during each hour of the day, and comments_by_hour, which stores the corresponding total number of comments.
Next, let's use these two dictionaries to calculate the average number of comments for ask posts created during each hour of the day.
avg_by_hour = []

for hour in comments_by_hour:
    mean = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, round(mean, 3)])

avg_by_hour
[['17', 11.46], ['11', 11.052], ['01', 11.383], ['03', 7.796], ['02', 23.81], ['20', 21.525], ['07', 7.853], ['15', 38.595], ['04', 7.17], ['00', 8.127], ['14', 13.234], ['13', 14.741], ['09', 5.578], ['23', 7.985], ['19', 10.8], ['05', 10.087], ['21', 16.009], ['06', 9.023], ['16', 16.796], ['12', 9.411], ['08', 10.25], ['10', 13.441], ['18', 13.202], ['22', 6.746]]
This format is difficult to read, so we will finally sort the list and print the five highest values in a format that's easier to read.
swap_avg_by_hour = []

# Swap the two elements of each [hour, average] pair in avg_by_hour
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)
[[11.46, '17'], [11.052, '11'], [11.383, '01'], [7.796, '03'], [23.81, '02'], [21.525, '20'], [7.853, '07'], [38.595, '15'], [7.17, '04'], [8.127, '00'], [13.234, '14'], [14.741, '13'], [5.578, '09'], [7.985, '23'], [10.8, '19'], [10.087, '05'], [16.009, '21'], [9.023, '06'], [16.796, '16'], [9.411, '12'], [10.25, '08'], [13.441, '10'], [13.202, '18'], [6.746, '22']]
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments\n")

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hour = hour.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour, row[0]))
Top 5 Hours for Ask Posts Comments

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
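The swap step exists only so that sorted() compares averages before hours; the same top five can be obtained directly with a sort key. A sketch on a hypothetical subset of avg_by_hour:

```python
# Hypothetical subset of avg_by_hour: [hour, average comments]
avg_subset = [['17', 11.46], ['15', 38.595], ['02', 23.81],
              ['20', 21.525], ['16', 16.796], ['21', 16.009]]

# Sort by the average (index 1), highest first, without swapping
top_five = sorted(avg_subset, key=lambda pair: pair[1], reverse=True)[:5]
for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```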
It seems that the best way to receive many answers on Hacker News is to publish an "ask post" around 3 p.m.
Be careful! The time zone used here is US Eastern Time, so you will need to convert 3 p.m. to your own time zone to take full advantage of this finding 😉
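For example, with zoneinfo (in the standard library since Python 3.9), 3 p.m. Eastern converts to another time zone like this; the date and the Europe/Paris zone are just illustrative choices:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# 15:00 US Eastern on an arbitrary date from the dataset period
eastern = datetime(2016, 8, 16, 15, 0, tzinfo=ZoneInfo("US/Eastern"))
paris = eastern.astimezone(ZoneInfo("Europe/Paris"))
print(paris.strftime("%H:%M"))  # 21:00 (Eastern daylight time is 6 hours behind Paris)
```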