Let's study 20,000 different posts from Hacker News to determine the keys to a post's success.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
The full dataset is available on the Kaggle platform.
from csv import reader

# Read the dataset into a list of lists
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
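As a side note, a with block closes the file automatically once the read is done, which avoids leaving the file handle open. Here is a sketch of that pattern using a hypothetical two-row sample file standing in for hacker_news.csv:

```python
import csv
import os
import tempfile

# Hypothetical sample file standing in for hacker_news.csv
path = os.path.join(tempfile.gettempdir(), "hn_sample.csv")
with open(path, "w") as f:
    f.write("id,title\n1,Ask HN: example\n")

# The with block closes the file automatically at the end of the block
with open(path) as f:
    hn_sample = list(csv.reader(f))

print(hn_sample)  # [['id', 'title'], ['1', 'Ask HN: example']]
```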
Let's look at the first few rows of the dataset to get a better idea of the data we will be dealing with in this project.
To do that, we'll first write a function named explore_data() that we can reuse to print rows in a more readable way.
def explore_data(dataset, start, end):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')

explore_data(hn, 0, 4)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
Now, let's separate the header row from the rest of the dataset, so that hn contains only usable data.
headers = hn[0:1]  # the slice keeps the header as a one-row list
hn = hn[1:]        # drop the header row from the data
Note that we will focus specifically on posts whose titles begin with either Ask HN or Show HN, as these posts are the most relevant for our study.
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
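As a quick sanity check, we can compute what share of the dataset each category represents, using the three counts just printed:

```python
# Counts printed above: ask, show, and other posts
counts = {"ask": 1744, "show": 1162, "other": 17194}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n / total:.1%}")
```

Ask and show posts together make up only about 15% of the 20,100 rows; the vast majority of posts fall into the "other" category.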
explore_data(ask_posts,0,2)
explore_data(show_posts,0,2)
['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
We will now explore the two lists of "ask" and "show" posts. The first (ask) contains 1,744 rows, whereas the second (show) contains 1,162 rows.
At this stage, we want to determine which type of post generates more reactions from the audience. To evaluate this, we will start with the relevant quantitative data we have: the number of comments generated by each post.
To perform this analysis, we will use the fifth column: num_comments.
print(headers)
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
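Since we address columns by position, it can help to build a small name-to-index mapping from the header row. This is a convenience sketch, not required by the analysis:

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments',
           'author', 'created_at']

# Map each column name to its index
col = {name: i for i, name in enumerate(headers)}
print(col['num_comments'])  # 4
```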
total_ask_comments = 0
total_show_comments = 0

for row in ask_posts:
    total_ask_comments += int(row[4])

for row in show_posts:
    total_show_comments += int(row[4])

average_ask_comments = total_ask_comments / len(ask_posts)
average_show_comments = total_show_comments / len(show_posts)

print(round(average_ask_comments, 2))
print(round(average_show_comments, 2))
14.04
10.32
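The two loops above follow the same pattern, so they could be factored into a small helper. Here is a sketch using hypothetical rows that mimic the dataset's layout (comment count at index 4):

```python
def avg_column(posts, index):
    """Average of an integer column across a list of rows."""
    return sum(int(row[index]) for row in posts) / len(posts)

# Hypothetical rows mimicking the dataset layout
sample = [
    ["1", "Ask HN: a", "", "10", "4", "user1", "8/4/2016 11:52"],
    ["2", "Ask HN: b", "", "6", "8", "user2", "8/5/2016 9:30"],
]
print(avg_column(sample, 4))  # 6.0
```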
It seems that "ask posts" generate more comments than "show posts". One explanation is that people are more likely to answer a question than to react to something being shown: a question directly addresses the reader and invites a response, whereas a show post can more easily be scrolled past without commenting.
Since ask posts attract the larger audience, we will focus the rest of our analysis on them. Let's determine whether ask posts created at a certain time are more likely to attract comments.
We will use the datetime class to work with the posts' dates and times. To better understand the format codes accepted by the .strptime() method, you can refer to the Python datetime documentation.
import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]         # creation date of the post
    num_comments = int(row[4])  # number of comments on the post
    # Convert the date string into a datetime object with .strptime
    # and pair it with the number of comments in the same list
    created_at = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    result_list.append([created_at, num_comments])
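To make the format string concrete, here is how one timestamp from the dataset parses: "%m/%d/%Y %H:%M" matches month/day/year followed by hour:minute.

```python
import datetime as dt

stamp = "8/16/2016 9:55"  # timestamp format used in the dataset
parsed = dt.datetime.strptime(stamp, "%m/%d/%Y %H:%M")
print(parsed)  # 2016-08-16 09:55:00
```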
Now that we have these datetime objects, we can analyze them. We want to know which time slot gathers the most ask posts and the most comments in total.
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    hour = row[0].strftime("%H")  # keep only the hour from the datetime object
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]  # comments of the first post seen that hour
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print(counts_by_hour)
{'17': 100, '11': 58, '01': 60, '03': 54, '02': 58, '20': 80, '07': 34, '15': 116, '04': 47, '00': 55, '14': 107, '13': 85, '09': 45, '23': 68, '19': 110, '05': 46, '21': 109, '06': 44, '16': 108, '12': 73, '08': 48, '10': 59, '18': 109, '22': 71}
print(comments_by_hour)
{'17': 1146, '11': 641, '01': 683, '03': 421, '02': 1381, '20': 1722, '07': 267, '15': 4477, '04': 337, '00': 447, '14': 1416, '13': 1253, '09': 251, '23': 543, '19': 1188, '05': 464, '21': 1745, '06': 397, '16': 1814, '12': 687, '08': 492, '10': 793, '18': 1439, '22': 479}
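As an aside, the if/else branching can be avoided with collections.defaultdict, which initializes missing keys to 0. Here is a sketch on a hypothetical three-row result_list:

```python
import datetime as dt
from collections import defaultdict

# Hypothetical mini result_list: [creation datetime, number of comments]
result_list = [
    [dt.datetime(2016, 8, 16, 9, 55), 6],
    [dt.datetime(2016, 8, 16, 9, 10), 4],
    [dt.datetime(2016, 8, 16, 15, 30), 2],
]

counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created, n_comments in result_list:
    hour = created.strftime("%H")
    counts_by_hour[hour] += 1        # missing hours start at 0
    comments_by_hour[hour] += n_comments

print(dict(counts_by_hour))    # {'09': 2, '15': 1}
print(dict(comments_by_hour))  # {'09': 10, '15': 2}
```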
Thus, we created two dictionaries: counts_by_hour, which stores the number of ask posts created during each hour of the day, and comments_by_hour, which stores the corresponding total number of comments.
Next, let's use these two dictionaries to calculate the average number of comments for ask posts created during each hour of the day.
avg_by_hour = []

for hour in comments_by_hour:
    mean = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, round(mean, 3)])

avg_by_hour
[['17', 11.46], ['11', 11.052], ['01', 11.383], ['03', 7.796], ['02', 23.81], ['20', 21.525], ['07', 7.853], ['15', 38.595], ['04', 7.17], ['00', 8.127], ['14', 13.234], ['13', 14.741], ['09', 5.578], ['23', 7.985], ['19', 10.8], ['05', 10.087], ['21', 16.009], ['06', 9.023], ['16', 16.796], ['12', 9.411], ['08', 10.25], ['10', 13.441], ['18', 13.202], ['22', 6.746]]
This format is difficult to read, so we will finally sort the list and print the five highest values in a format that's easier to read.
swap_avg_by_hour = []

# Swap the two elements of each [hour, average] pair in avg_by_hour
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)
[[11.46, '17'], [11.052, '11'], [11.383, '01'], [7.796, '03'], [23.81, '02'], [21.525, '20'], [7.853, '07'], [38.595, '15'], [7.17, '04'], [8.127, '00'], [13.234, '14'], [14.741, '13'], [5.578, '09'], [7.985, '23'], [10.8, '19'], [10.087, '05'], [16.009, '21'], [9.023, '06'], [16.796, '16'], [9.411, '12'], [10.25, '08'], [13.441, '10'], [13.202, '18'], [6.746, '22']]
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments\n")

for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hour = hour.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour, row[0]))
Top 5 Hours for Ask Posts Comments

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
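The swap step exists only so that sorted() compares averages before hours; the same top five can be obtained directly with a sort key. A sketch on a hypothetical subset of avg_by_hour:

```python
# Hypothetical subset of avg_by_hour: [hour, average comments]
avg_subset = [['17', 11.46], ['15', 38.595], ['02', 23.81],
              ['20', 21.525], ['16', 16.796], ['21', 16.009]]

# Sort by the average (index 1), highest first, without swapping
top_five = sorted(avg_subset, key=lambda pair: pair[1], reverse=True)[:5]
for hour, avg in top_five:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```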
It seems that the best way to receive many answers on Hacker News is to publish an "ask post" around 3 p.m.
Be careful! The time zone used here is US Eastern Time, so you will need to convert 3 p.m. to your own time zone to take full advantage of this finding 😉
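For example, with zoneinfo (in the standard library since Python 3.9), 3 p.m. Eastern converts to another time zone like this; the date and the Europe/Paris zone are just illustrative choices:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# 15:00 US Eastern on an arbitrary date from the dataset period
eastern = datetime(2016, 8, 16, 15, 0, tzinfo=ZoneInfo("US/Eastern"))
paris = eastern.astimezone(ZoneInfo("Europe/Paris"))
print(paris.strftime("%H:%M"))  # 21:00 (Eastern daylight time is 6 hours behind Paris)
```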