In this project, we'll work with a dataset of submissions to the popular technology site Hacker News. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
You can find the data set here; it contains almost 300,000 rows. Below are descriptions of the columns:
We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. We'll compare these two types of posts to determine the following:
Let's start by importing the libraries we need, reading the dataset into a list of lists named hn_data, and displaying the first 5 rows.
from csv import reader
# reading the .csv file and transforming the data into a list of lists
opened_file = open('hacker_news.csv', encoding="utf8")
read_file = reader(opened_file)
hn_data = list(read_file)
# closing the file once all rows are in memory
opened_file.close()
# displaying the first 5 rows
for row in hn_data[:5]:
    print(row)
print('\n')
print('Number of rows in dataset:', len(hn_data))
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']
['12579005', 'SQLAR the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']
['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']
['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']

Number of rows in dataset: 293120
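As a side note, the same loading step can be sketched with a `with` block, which closes the file automatically. The helper name `load_rows` and the two-line sample are hypothetical and stand in for `hacker_news.csv` so the sketch runs without the real file.

```python
import csv
import io


def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object (hypothetical helper)."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# Made-up sample with the same column layout as the dataset.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    '1,"Ask HN: Example, with a comma?",,4,7,alice,9/26/2016 2:53\n'
)
header, rows = load_rows(sample)
print(header)
print(rows)

# With the real file, the handle is closed when the block ends:
# with open('hacker_news.csv', encoding='utf8', newline='') as f:
#     header, rows = load_rows(f)
```

The `csv` reader keeps the quoted comma inside the title, which is why we rely on it instead of splitting lines by hand.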
The posts shown above all have 0 (zero) comments. Since our goal is to examine which posts get more comments, we will remove posts that have no comments from the dataset.
# collecting rows with comments in a separate list 'hn'
# (the header row is kept as well, because 'num_comments' != '0')
hn = []
for row in hn_data:
    if row[4] != '0':
        hn.append(row)
# checking if there are rows with '0' points
number_points_0 = 0
for row in hn:
    if row[3] == '0':
        number_points_0 += 1
print("Number of rows with '0' points:", number_points_0)
print('Number of rows in dataset:', len(hn))
print('First 5 rows:')
for row in hn[:5]:
    print(row)
Number of rows with '0' points: 0
Number of rows in dataset: 80402
First 5 rows:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']
['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']
We reduced our dataset to 80,402 rows, including the header row.
Let's extract the header row and assign it to the variable headers. Next, we remove the header row from hn and display the first 5 rows to check that the header row was removed.
headers = hn[0]
hn = hn[1:]
print(hn[:5], '\n')
print('Title values:', headers)
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'], ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37'], ['12578556', 'OpenMW, Open Source Elderscrolls III: Morrowind Reimplementation', 'https://openmw.org/en/', '32', '3', 'rocky1138', '9/26/2016 1:24']]

Title values: ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
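Note that the 'Title values' line in the output above shows a data row, not the column names. That is what we would expect to see if the cell that reassigns hn were run a second time, since each run strips one more row. A minimal sketch of a re-run-safe alternative follows; the helper names `split_header` and `with_comments` and the three sample rows are hypothetical, not part of the original notebook.

```python
def split_header(raw_rows):
    """Return (header, data) without modifying raw_rows, so re-running is harmless."""
    return raw_rows[0], raw_rows[1:]


def with_comments(rows, comments_idx=4):
    """Keep only rows whose num_comments value is greater than zero."""
    return [row for row in rows if int(row[comments_idx]) > 0]


# Made-up rows with the same column layout as the dataset.
raw = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['1', 'No comments yet', '', '1', '0', 'alice', '9/26/2016 3:26'],
    ['2', 'Has comments', '', '4', '7', 'bob', '9/26/2016 2:53'],
]

headers, posts = split_header(raw)   # always derived from the untouched raw list
commented = with_comments(posts)     # the header is already gone, so int() is safe

print('Column names:', headers)
print(commented)
```

Because the header is split off before filtering, it never has to survive the `!= '0'` test, and the results are the same however many times the cell runs.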
Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles: ask_posts to collect rows whose titles start with Ask HN, show_posts to collect rows whose titles start with Show HN, and other_posts for the remaining rows.
To split the rows we use the startswith() method. To make the match case-insensitive, we first convert each title to lowercase with the lower() method.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    # checking if the title starts with 'ask hn' in lower case
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    # otherwise, checking if the title starts with 'show hn' in lower case
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    # if neither condition is met, the row is appended to the 'other_posts' list of lists
    else:
        other_posts.append(row)
print(len(ask_posts))
print(ask_posts[:3])
print('\n')
print(len(show_posts))
print(show_posts[:3])
print('\n')
print(len(other_posts))
print(other_posts[:3])
print('\n')
6911
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']]

5059
[['12577142', 'Show HN: Jumble Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06'], ['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50']]

68430
[['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'], ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']]
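The lower() plus startswith() check can also be isolated in a small function, which makes the case-insensitive behaviour easy to try on a few titles. This is only a sketch: the name `classify_title` is hypothetical, and the last title in the sample is made up to show the upper-case case.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the start of the title (hypothetical helper)."""
    lowered = title.lower()  # lower-casing makes the prefix test case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Learn Japanese Vocab via multiple choice questions',
    'algorithmic music',
    'ASK HN: an upper-case variant (made-up title)',
]

groups = {'ask': [], 'show': [], 'other': []}
for title in titles:
    groups[classify_title(title)].append(title)

for name, members in groups.items():
    print(name, len(members))
```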
Now let's determine if ask posts or show posts receive more comments on average.
# creating variable total_ask_comments to count the total number of comments for ask posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
# calculating the average number of comments for ask_posts
avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments, 3))
13.744
# creating variable total_show_comments to count the total number of comments for show posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
# calculating the average number of comments for show_posts
avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments, 3))
9.811
Let's check what is going on in the category of other posts: how many comments on average do people leave?
# creating variable total_other_comments to count the total number of comments for other posts
total_other_comments = 0
for row in other_posts:
    num_comments = int(row[4])
    total_other_comments += num_comments
# calculating the average number of comments for other_posts
avg_other_comments = total_other_comments / len(other_posts)
print(round(avg_other_comments, 3))
25.839
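The three averaging cells above repeat the same loop, so one way to shorten them would be a small helper. This is a sketch only; the name `average_comments` is hypothetical, and the sample reuses the three Ask HN rows printed earlier.

```python
def average_comments(posts, comments_idx=4):
    """Mean of the num_comments column; returns 0.0 for an empty list (hypothetical helper)."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_idx]) for row in posts)
    return total / len(posts)


# The first three Ask HN rows shown earlier (7, 3 and 3 comments).
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

print(round(average_comments(sample_ask), 3))  # (7 + 3 + 3) / 3
```

With this helper, the three cells would reduce to calls such as average_comments(ask_posts), average_comments(show_posts) and average_comments(other_posts).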
We can see that on average ask posts get more comments than show posts. Maybe this is because people prefer giving advice to giving feedback on something.
We can also see that posts with other titles have the highest average number of comments. This may be because they cover many different topics. Some of these topics can be very popular or controversial, which is why people discuss them a lot.
Since Ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
First, we'll work on calculating the number of ask posts and comments by hour created. We'll use the datetime module to work with the data in the created_at column.
Now let's create an empty list result_list. We will iterate over the ask_posts list of lists and append a list of 2 elements (the value of the 'created_at' column and the number of comments of the post) to result_list.
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
print(result_list[:5])
[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2], ['9/25/2016 19:30', 1]]
Next, we create 2 empty dictionaries: counts_by_hour to collect the number of posts created in each hour, and comments_by_hour to collect the number of comments received by the posts created in each hour. To do that we need to create a datetime object using datetime.strptime().
# importing datetime module using alias 'dt'
import datetime as dt
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comments = row[1]
    date_str = row[0]
    # creating datetime object from the string 'date_str'
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    # extracting hour from the datetime object and assigning to variable hour_created
    hour_created = date_dt.strftime('%H')
    if hour_created not in counts_by_hour:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments
    else:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments
print('Posts created by hour:', counts_by_hour)
print('Comments left by hour:', comments_by_hour)
Posts created by hour: {'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
Comments left by hour: {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
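The if/else bookkeeping in the loop above can be avoided with collections.defaultdict, which starts every missing key at zero. The sketch below assumes the same "%m/%d/%Y %H:%M" date format; the function name `tally_by_hour` is hypothetical, and the sample pairs are the first three entries of result_list plus one made-up pair in the same hour as the first.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Count posts and sum comments per creation hour (hypothetical helper)."""
    counts = defaultdict(int)    # hour -> number of posts created in that hour
    comments = defaultdict(int)  # hour -> comments received by those posts
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).strftime('%H')
        counts[hour] += 1
        comments[hour] += n_comments
    return dict(counts), dict(comments)


sample_pairs = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:48', 3],
    ['9/25/2016 2:05', 1],  # made-up pair, same hour as the first one
]

counts, comments = tally_by_hour(sample_pairs)
print('Posts created by hour:', counts)
print('Comments left by hour:', comments)
```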
Now we will create a list of lists avg_by_hour containing the hours during which posts were created and the average number of comments those posts received.
avg_by_hour = []
for key in comments_by_hour:
    # calculating the average number of comments for each hour
    # for better readability we round the average to 3 decimal places
    avg_comments = round(comments_by_hour[key] / counts_by_hour[key], 3)
    avg_by_hour.append([key, avg_comments])
for row in avg_by_hour:
    print(row)
['02', 13.198]
['01', 9.368]
['22', 11.749]
['21', 11.057]
['19', 9.414]
['17', 13.73]
['15', 39.668]
['14', 13.153]
['13', 22.224]
['11', 11.143]
['10', 13.758]
['09', 8.392]
['07', 10.096]
['03', 10.16]
['16', 10.761]
['08', 12.432]
['00', 9.857]
['23', 8.322]
['20', 11.383]
['18', 10.79]
['12', 15.453]
['04', 12.688]
['06', 9.017]
['05', 11.139]
In order to make it easier to sort our data, let's swap the columns.
# creating an empty list of lists to hold the swapped columns
swap_avg_by_hour = []
for row in avg_by_hour:
    x = row[0]
    y = row[1]
    swap_avg_by_hour.append([y, x])
# sorting our data in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)
[[39.668, '15'], [22.224, '13'], [15.453, '12'], [13.758, '10'], [13.73, '17'], [13.198, '02'], [13.153, '14'], [12.688, '04'], [12.432, '08'], [11.749, '22'], [11.383, '20'], [11.143, '11'], [11.139, '05'], [11.057, '21'], [10.79, '18'], [10.761, '16'], [10.16, '03'], [10.096, '07'], [9.857, '00'], [9.414, '19'], [9.368, '01'], [9.017, '06'], [8.392, '09'], [8.322, '23']]
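Swapping the columns works because sorted() compares lists element by element. An alternative sketch that leaves the [hour, average] order untouched is to pass a key function to sorted(). The four pairs below are taken from the avg_by_hour output above; the name `ranked` is hypothetical.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour = [['02', 13.198], ['15', 39.668], ['13', 22.224], ['23', 8.322]]

# key= tells sorted() to compare rows by their second element (the average).
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked:
    print(f'{hour}:00: {average:.2f} average comments per post')
```

The original list is not modified, and no second list with swapped columns is needed.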
# displaying the top 5 most-commented hours
print("Top 5 Hours for Ask Posts Comments:")
for hour in sorted_swap[:5]:
    print(hour)
Top 5 Hours for Ask Posts Comments:
[39.668, '15']
[22.224, '13']
[15.453, '12']
[13.758, '10']
[13.73, '17']
Let's present our findings in a more readable way, using string formatting.
for hour in sorted_swap[:5]:
    time_str = hour[1]
    # creating datetime object from the string
    time_dt = dt.datetime.strptime(time_str, '%H')
    # setting the format of the string - transforming from 'hour' format to 'hour:minute' format
    post_time = time_dt.strftime('%H:%M')
    average = hour[0]
    print(f'{post_time}: {average:.2f} average comments per post')
15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
Our main goal was to examine Ask HN and Show HN posts.
But since we got a high average number of comments in the 'other posts' category, it will be interesting to analyse this data too and check whether it follows the same commenting pattern as Ask posts.
Let's repeat for other posts the analysis we performed for Ask posts.
other_posts_result = []
for row in other_posts:
    other_posts_result.append([int(row[4]), row[6]])
print(other_posts_result[:5])
[[1, '9/26/2016 2:26'], [1, '9/26/2016 1:54'], [1, '9/26/2016 1:37'], [3, '9/26/2016 1:24'], [1, '9/26/2016 0:31']]
We create 2 empty dictionaries: other_posts_byhour to collect the number of posts created in each hour, and other_comments_byhour to collect the number of comments received by the posts created in each hour. To do that we need to create a datetime object using datetime.strptime().
other_posts_byhour = {}
other_comments_byhour = {}
for row in other_posts_result:
    comments = row[0]
    date_str = row[1]
    # creating datetime object from the string 'date_str'
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    # extracting hour from the datetime object and assigning to variable hour_created
    hour_created = date_dt.strftime("%H")
    if hour_created not in other_posts_byhour:
        other_posts_byhour[hour_created] = 1
        other_comments_byhour[hour_created] = comments
    else:
        other_posts_byhour[hour_created] += 1
        other_comments_byhour[hour_created] += comments
print('Posts created by hour:', other_posts_byhour)
print('Comments left by hour:', other_comments_byhour)
Posts created by hour: {'02': 1870, '01': 2031, '00': 2271, '23': 2556, '22': 2995, '21': 3470, '20': 3730, '19': 3986, '18': 4314, '17': 4392, '16': 4335, '15': 4122, '14': 3854, '13': 3619, '12': 3085, '11': 2620, '10': 2298, '09': 2149, '08': 1919, '07': 1826, '06': 1789, '05': 1598, '04': 1861, '03': 1740}
Comments left by hour: {'02': 50100, '01': 47756, '00': 55491, '23': 58378, '22': 68059, '21': 79996, '20': 88320, '19': 101127, '18': 112502, '17': 118217, '16': 116322, '15': 115286, '14': 108277, '13': 106302, '12': 90082, '11': 71072, '10': 59147, '09': 56141, '08': 49804, '07': 44424, '06': 43050, '05': 41773, '04': 43753, '03': 42762}
# creating a list of lists containing the hours during which posts were created
# and the average number of comments those posts received
other_avg_byhour = []
for key in other_comments_byhour:
    # calculating the average number of comments for each hour
    # for better readability we round the average to 2 decimal places
    average_comment = round(other_comments_byhour[key] / other_posts_byhour[key], 2)
    other_avg_byhour.append([average_comment, key])
for row in other_avg_byhour:
    print(row)
[26.79, '02']
[23.51, '01']
[24.43, '00']
[22.84, '23']
[22.72, '22']
[23.05, '21']
[23.68, '20']
[25.37, '19']
[26.08, '18']
[26.92, '17']
[26.83, '16']
[27.97, '15']
[28.09, '14']
[29.37, '13']
[29.2, '12']
[27.13, '11']
[25.74, '10']
[26.12, '09']
[25.95, '08']
[24.33, '07']
[24.06, '06']
[26.14, '05']
[23.51, '04']
[24.58, '03']
# sorting our list of lists in descending order
sorted_other_avg = sorted(other_avg_byhour, reverse=True)
# presenting our findings in a more readable way, using string formatting
for hour in sorted_other_avg[:5]:
    date_str = hour[1]
    date_dt = dt.datetime.strptime(date_str, "%H")
    hour_str = date_dt.strftime("%H:%M")
    average_com = hour[0]
    print(f'{hour_str}: {average_com} average comments per post')
13:00: 29.37 average comments per post
12:00: 29.2 average comments per post
14:00: 28.09 average comments per post
15:00: 27.97 average comments per post
11:00: 27.13 average comments per post
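Since the hourly analysis was carried out twice with almost identical code, the steps could be wrapped in one function and applied to any group of posts. The sketch below is an assumption about how that might look, not the notebook's own method; the name `top_hours` and the three sample rows are hypothetical, and the column positions follow the dataset layout used above.

```python
import datetime as dt


def top_hours(posts, n=5, date_idx=6, comments_idx=4):
    """Return up to n (hour, average_comments) pairs, highest average first."""
    totals = {}  # hour (int) -> [number of posts, number of comments]
    for row in posts:
        hour = dt.datetime.strptime(row[date_idx], "%m/%d/%Y %H:%M").hour
        entry = totals.setdefault(hour, [0, 0])
        entry[0] += 1
        entry[1] += int(row[comments_idx])
    averages = [(hour, n_comments / n_posts)
                for hour, (n_posts, n_comments) in totals.items()]
    averages.sort(key=lambda item: item[1], reverse=True)
    return averages[:n]


# Made-up rows in the dataset's column order.
sample_posts = [
    ['1', 'a', '', '1', '10', 'u', '9/26/2016 15:10'],
    ['2', 'b', '', '1', '20', 'u', '9/25/2016 15:45'],
    ['3', 'c', '', '1', '4', 'u', '9/26/2016 2:53'],
]

for hour, avg in top_hours(sample_posts):
    print(f'{hour:02d}:00: {avg:.2f} average comments per post')
```

Calling top_hours(ask_posts) and top_hours(other_posts) would then cover both analyses with the same code path.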
Our main goal was to compare two types of posts to determine the following:
We found that on average Ask posts receive more comments than Show posts (13.744 versus 9.811). One possible explanation is that people prefer giving advice to giving feedback on something.
We also looked at the remaining posts (other_posts) and found that they have the highest average number of comments. This may be because they cover many different topics. Some of these topics can be very popular or controversial, which is why people discuss them a lot.
Regarding the second question, the analysis of Ask posts showed that the most-commented hours are daytime hours:
Analysis of other posts showed that the hours don't differ much on average. The average number of comments is quite similar for each hour of the day. The top 5 hours are:
So if you are deciding what time to post in order to receive the most comments or feedback, the answer is: do it between 11:00 and 15:00.