
Finding the Best Time to Post to Get More Comments¶

In this project, we'll work with a dataset of submissions to the popular technology site Hacker News. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the data set here; it contains almost 300,000 rows. Below are descriptions of the columns:

id: the unique identifier from Hacker News for the post
title: the title of the post
url: the URL that the post links to, if the post has a URL
num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time of the post's submission

We're specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. We'll compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need, reading the dataset into a list of lists called hn_data, and displaying the first 5 rows.

Cleaning data¶

In [2]:

from csv import reader
# reading .csv file and transforming data into list of lists
opened_file = open('hacker_news.csv', encoding="utf8")
read_file = reader(opened_file)
hn_data = list(read_file)
# demonstrating first 5 rows
for row in hn_data[:5]:
    print(row)
    print('\n')
print('Number of rows in dataset:', len(hn_data))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Number of rows in dataset: 293120
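
The loading step can also be sketched with a with block, which closes the file automatically, and with the header split off as soon as the file is read. This is only an illustration: the helper name load_rows and the sample_csv string are made up for the example, and the snippet reads from an in-memory sample rather than hacker_news.csv.

```python
import io
from csv import reader


def load_rows(file_obj):
    """Return (header, data_rows) from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# made-up sample standing in for hacker_news.csv
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,4,7,alice,9/26/2016 2:53\n"
    "2,Example story,http://example.com,1,0,bob,9/26/2016 3:26\n"
)

# the with block closes the file object when the block ends
with io.StringIO(sample_csv) as f:
    header, rows = load_rows(f)

print(header)
print('Number of data rows:', len(rows))
```

With the real file, the same helper would be called inside with open('hacker_news.csv', encoding='utf8') as f.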

We can see that all the posts displayed above have 0 (zero) comments. Since our goal is to examine posts that attract comments, we will remove posts without any comments from the dataset.

In [3]:

# collecting rows with comments in a separate list 'hn'
# (the header row is kept as well, because 'num_comments' != '0')
hn = []
for row in hn_data:
    if row[4] != '0':
        hn.append(row)

# checking if there are rows with '0' points
number_points_0 = 0
for row in hn:
    if row[3] == '0':
        number_points_0 += 1
print("Number of rows with '0' points:", number_points_0)        

print('Number of rows in dataset:', len(hn))  

print('First 5 rows:')

for row in hn[:5]:
    print(row)

Number of rows with '0' points: 0
Number of rows in dataset: 80402
First 5 rows:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']
['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']

We reduced our dataset to 80,402 rows (including the header row).

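Let's extract the header row and assign it to the variable headers. Next, we remove the header row from hn and display the first 5 rows to check that it was removed. Note that in the output below, headers holds the first data row (id 12578975) rather than the column names. This suggests the cell was executed a second time after the header had already been removed, so that one data row was dropped from hn as well.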

In [5]:

headers = hn[0]
hn = hn[1:]
print(hn[:5], '\n')
print('Title values:', headers)

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'], ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37'], ['12578556', 'OpenMW, Open Source Elderscrolls III: Morrowind Reimplementation', 'https://openmw.org/en/', '32', '3', 'rocky1138', '9/26/2016 1:24']] 

Title values: ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
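
A cell that strips the first row unconditionally drops a data row each time it is re-run. One way to guard against that, sketched here on made-up rows, is to remove the first row only if it really is the header. The function name split_header, the constant EXPECTED_HEADER, and the sample data are assumptions for illustration.

```python
EXPECTED_HEADER = ['id', 'title', 'url', 'num_points',
                   'num_comments', 'author', 'created_at']


def split_header(rows, expected=EXPECTED_HEADER):
    """Return (header, data_rows); leave rows untouched if the header is already gone."""
    if rows and rows[0] == expected:
        return rows[0], rows[1:]
    return expected, rows


# made-up sample rows
sample = [
    EXPECTED_HEADER,
    ['1', 'Example story', 'http://example.com', '1', '1', 'bob', '9/26/2016 3:13'],
    ['2', 'Ask HN: Example question?', '', '4', '7', 'alice', '9/26/2016 2:53'],
]

headers, data = split_header(sample)
# calling it again on the result does not drop a data row
headers_again, data_again = split_header(data)
print(len(data), len(data_again))
```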

Distributing posts by titles (topics)¶

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles: ask_posts to collect rows starting with Ask HN, show_posts to collect rows starting with Show HN and other_posts for the rest of rows.

To split the rows, we use the startswith() method. We first convert each title to lower case with the lower() method, so that the match does not depend on capitalization.

In [6]:

ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    # checking if the title starts with 'ask hn' in lower case
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    # otherwise, checking if the title starts with 'show hn' in lower case
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    # every remaining row goes to the 'other_posts' list of lists
    else:
        other_posts.append(row)
print(len(ask_posts)) 
print(ask_posts[:3])
print('\n')
print(len(show_posts))
print(show_posts[:3])
print('\n')
print(len(other_posts)) 
print(other_posts[:3])
print('\n')

6911
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']]


5059
[['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06'], ['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50']]


68430
[['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'], ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37']]
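
The prefix check above can be isolated in a small function, which makes the case-insensitive matching easy to test on its own. The sketch below assumes a hypothetical helper classify_title and a few made-up titles; it is not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# made-up titles with mixed capitalization
titles = [
    'Ask HN: What TLD do you use for local development?',
    'ASK HN: upper-case prefix',
    'Show HN: Learn Japanese Vocab via multiple choice questions',
    'algorithmic music',
]

groups = {'ask': [], 'show': [], 'other': []}
for title in titles:
    groups[classify_title(title)].append(title)

for name, members in groups.items():
    print(name, len(members))
```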

Calculating average number of comments¶

Now let's determine if ask posts or show posts receive more comments on average.

In [7]:

# creating variable total_ask_comments to count the total number of comments for ask posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
# calculating average number of comments for ask_post
avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments,3))

13.744

In [8]:

# creating variable total_show_comments to count the total number of comments for show posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
# calculating average number of comments for show_post    
avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments,3))

9.811

Let's check what is going on in the category of other posts: how many comments on average do people leave?

In [10]:

# creating variable total_other_comments to count the total number of comments for other posts
total_other_comments = 0
for row in other_posts:
    num_comments = int(row[4])
    total_other_comments += num_comments
# calculating average number of comments for other_post
avg_other_comments = total_other_comments / len(other_posts)
print(round(avg_other_comments,3))

25.839

We can see that, on average, Ask posts receive more comments than Show posts. Perhaps this is because people prefer giving advice to giving feedback on a project.

We can also see that posts with other titles have the highest average number of comments. A possible reason is that they cover many different topics, some of which are very popular or controversial, so people discuss them a lot.
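
The three averages above are computed with the same loop written three times. A minimal sketch of a shared helper is shown below; the name average_comments, its comments_index parameter, and the sample rows are assumptions made for the example.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# made-up rows in the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: A?', '', '4', '7', 'alice', '9/26/2016 2:53'],
    ['2', 'Ask HN: B?', '', '6', '3', 'bob', '9/26/2016 1:17'],
    ['3', 'Ask HN: C?', '', '1', '2', 'carol', '9/25/2016 22:48'],
]

print(round(average_comments(sample_posts), 3))
```

The helper could then be called once each for ask_posts, show_posts and other_posts.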

Analysis of Ask posts¶

Distributing number of posts and comments by hour created¶

Since Ask posts receive more comments on average than Show posts, we'll focus our remaining analysis on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

First, we'll work on calculating the number of ask posts and comments by hour created. We'll use the datetime module to work with the data in the created_at column.

Now let's create an empty list, result_list. We will iterate over the ask_posts list of lists and, for each post, append a two-element list (the created_at value and the number of comments) to result_list.

In [14]:

result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
print(result_list[:5])

[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2], ['9/25/2016 19:30', 1]]

Next, we create two empty dictionaries: counts_by_hour, which holds the number of posts created in each hour, and comments_by_hour, which holds the total number of comments received by the posts created in each hour. To extract the hour, we parse each date string into a datetime object using datetime.strptime().

In [15]:

# importing datetime module using alias 'dt' 
import datetime as dt

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comments = row[1]
    date_str = row[0]
    # creating datetime object from the string 'date_str'
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M") 
    # extracting hour from the datetime object and assigning to variable hour_created
    hour_created = date_dt.strftime('%H')
     
    if hour_created not in counts_by_hour:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments
    else:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments
print('Posts created by hour:', counts_by_hour) 
print('Comments left by hour:', comments_by_hour) 

Posts created by hour: {'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
Comments left by hour: {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
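
The same tallying can be sketched with collections.defaultdict, which removes the need for the if/else branch, and with the hour attribute of the datetime object. Note that this variant uses integer hours as keys rather than the two-digit strings used above; the sample list is made up for the example.

```python
import datetime as dt
from collections import defaultdict

# made-up [created_at, num_comments] pairs in the same shape as result_list
sample = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 2:10', 3],
    ['9/25/2016 22:48', 3],
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by posts created in each hour

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```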

Calculating average number of comments for each hour¶

Now we will create a list of lists avg_by_hour containing the hours during which posts were created and the average number of comments those posts received.

In [16]:

avg_by_hour = []
for key in comments_by_hour:
    # calculating the average number of comments for each hour
    # for better readability we round the average to 3 decimal places
    avg_comments = round(comments_by_hour[key] / counts_by_hour[key], 3)
    avg_by_hour.append([key, avg_comments]) 
for row in avg_by_hour:
    print(row)

['02', 13.198]
['01', 9.368]
['22', 11.749]
['21', 11.057]
['19', 9.414]
['17', 13.73]
['15', 39.668]
['14', 13.153]
['13', 22.224]
['11', 11.143]
['10', 13.758]
['09', 8.392]
['07', 10.096]
['03', 10.16]
['16', 10.761]
['08', 12.432]
['00', 9.857]
['23', 8.322]
['20', 11.383]
['18', 10.79]
['12', 15.453]
['04', 12.688]
['06', 9.017]
['05', 11.139]
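
For reference, the per-hour average is simply the total number of comments divided by the number of posts for that hour. The sketch below recomputes three of the values above with a dictionary comprehension, using the counts and totals printed earlier for hours 15, 13 and 09; storing the result in a dictionary instead of a list of lists is a choice made for this example.

```python
# values copied from the output of the previous cell
counts_by_hour = {'15': 467, '13': 326, '09': 176}
comments_by_hour = {'15': 18525, '13': 7245, '09': 1477}

# average = total comments / number of posts, rounded to 3 decimal places
avg_by_hour = {
    hour: round(comments_by_hour[hour] / counts_by_hour[hour], 3)
    for hour in counts_by_hour
}

print(avg_by_hour)
```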

Formatting the output in more readable way¶

In order to make it easier to sort our data, let's swap the columns.

In [19]:

# creating an empty list to hold the rows with swapped columns
swap_avg_by_hour = []
for row in avg_by_hour:
    x = row[0]
    y = row[1]
    swap_avg_by_hour.append([y, x])

# sorting our data in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)

[[39.668, '15'], [22.224, '13'], [15.453, '12'], [13.758, '10'], [13.73, '17'], [13.198, '02'], [13.153, '14'], [12.688, '04'], [12.432, '08'], [11.749, '22'], [11.383, '20'], [11.143, '11'], [11.139, '05'], [11.057, '21'], [10.79, '18'], [10.761, '16'], [10.16, '03'], [10.096, '07'], [9.857, '00'], [9.414, '19'], [9.368, '01'], [9.017, '06'], [8.392, '09'], [8.322, '23']]
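
Swapping the columns is one way to sort by the average. Another, sketched below on a few of the [hour, average] rows computed earlier, is to pass a key function to sorted(), which leaves each row in its original column order.

```python
# a few [hour, average] rows taken from avg_by_hour
avg_by_hour = [['02', 13.198], ['15', 39.668], ['13', 22.224], ['09', 8.392]]

# sort by the second element (the average), highest first
sorted_avg = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

for hour, average in sorted_avg:
    print(hour, average)
```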

In [20]:

# demonstrating top 5 commented hours
print("Top 5 Hours for Ask Posts Comments:")
for hour in sorted_swap[:5]:
    print(hour)

Top 5 Hours for Ask Posts Comments:
[39.668, '15']
[22.224, '13']
[15.453, '12']
[13.758, '10']
[13.73, '17']

Let's present our findings in a more readable way, using string formatting.

In [21]:

for hour in sorted_swap[:5]:
    time_str = hour[1]
    # creating datetime object from the string
    time_dt = dt.datetime.strptime(time_str, '%H')
    # setting the format of the string - transforming from 'hour' format to 'hour:minute' format
    post_time = time_dt.strftime('%H:%M')
    average = hour[0]
    print(f'{post_time}: {average:.2f} average comments per post')

15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post
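
Because the hour is already stored as a two-digit string, the round trip through strptime() and strftime() is not strictly required. A minimal alternative, assuming the same [average, hour] row layout and using the top three values from above, builds the label directly in the f-string.

```python
# top three [average, hour] rows from the sorted list above
top_rows = [[39.668, '15'], [22.224, '13'], [15.453, '12']]

lines = []
for average, hour in top_rows:
    # int(...) with :02d keeps the label two digits wide even for a value like '9'
    lines.append(f'{int(hour):02d}:00: {average:.2f} average comments per post')

for line in lines:
    print(line)
```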

Analysis of Other posts¶

Distributing number of posts and comments by hour created¶

Our main goal was to examine Ask HN and Show HN posts.

But since the 'other posts' category has a large average number of comments, it is worth analysing this data too and checking whether it follows the same commenting pattern as Ask posts.

Let's do the same analysis for other posts as we did for Ask posts.

In [23]:

other_posts_result = []
for row in other_posts:
    other_posts_result.append([int(row[4]), row[6]])
print(other_posts_result[:5])    

[[1, '9/26/2016 2:26'], [1, '9/26/2016 1:54'], [1, '9/26/2016 1:37'], [3, '9/26/2016 1:24'], [1, '9/26/2016 0:31']]

Next, we create two empty dictionaries: other_posts_byhour, which holds the number of posts created in each hour, and other_comments_byhour, which holds the total number of comments received by the posts created in each hour. To extract the hour, we parse each date string into a datetime object using datetime.strptime().

In [33]:

other_posts_byhour = {}
other_comments_byhour = {}
for row in other_posts_result:
    comments = row[0]
    date_str = row[1]
    # creating datetime object from the string 'date_str'
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    # extracting hour from the datetime object and assigning to variable hour_created
    hour_created = date_dt.strftime("%H")
    if hour_created not in other_posts_byhour:
        other_posts_byhour[hour_created] = 1
        other_comments_byhour[hour_created] = comments
    else:
        other_posts_byhour[hour_created] += 1
        other_comments_byhour[hour_created] += comments
print('Posts created by hour:', other_posts_byhour) 
print('Comments left by hour:', other_comments_byhour)        

Posts created by hour: {'02': 1870, '01': 2031, '00': 2271, '23': 2556, '22': 2995, '21': 3470, '20': 3730, '19': 3986, '18': 4314, '17': 4392, '16': 4335, '15': 4122, '14': 3854, '13': 3619, '12': 3085, '11': 2620, '10': 2298, '09': 2149, '08': 1919, '07': 1826, '06': 1789, '05': 1598, '04': 1861, '03': 1740}
Comments left by hour: {'02': 50100, '01': 47756, '00': 55491, '23': 58378, '22': 68059, '21': 79996, '20': 88320, '19': 101127, '18': 112502, '17': 118217, '16': 116322, '15': 115286, '14': 108277, '13': 106302, '12': 90082, '11': 71072, '10': 59147, '09': 56141, '08': 49804, '07': 44424, '06': 43050, '05': 41773, '04': 43753, '03': 42762}

Calculating average number of comments for each hour¶

In [34]:

#creating a list of lists containing the hours during which posts were created 
#and the average number of comments those posts received.
other_avg_byhour = []
for key in other_comments_byhour:
    # calculating the average number of comments for each hour
    # for better readability we round the average to 2 decimal places
    average_comment = round(other_comments_byhour[key] / other_posts_byhour[key], 2)
    other_avg_byhour.append([average_comment, key])
for row in other_avg_byhour:
    print(row)

[26.79, '02']
[23.51, '01']
[24.43, '00']
[22.84, '23']
[22.72, '22']
[23.05, '21']
[23.68, '20']
[25.37, '19']
[26.08, '18']
[26.92, '17']
[26.83, '16']
[27.97, '15']
[28.09, '14']
[29.37, '13']
[29.2, '12']
[27.13, '11']
[25.74, '10']
[26.12, '09']
[25.95, '08']
[24.33, '07']
[24.06, '06']
[26.14, '05']
[23.51, '04']
[24.58, '03']

Formatting the output in more readable way¶

In [35]:

# sorting our list of lists in descending order
sorted_other_avg = sorted(other_avg_byhour, reverse=True)
# presenting our findings in a more readable way, using string formatting
for hour in sorted_other_avg[:5]:
    date_str = hour[1]
    date_dt = dt.datetime.strptime(date_str, "%H")
    hour_str = date_dt.strftime("%H:%M")
    average_com = hour[0]
    print(f'{hour_str}: {average_com} average comments per post')

13:00: 29.37 average comments per post
12:00: 29.2 average comments per post
14:00: 28.09 average comments per post
15:00: 27.97 average comments per post
11:00: 27.13 average comments per post
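
Since the Ask and Other analyses follow the same steps, they could be wrapped in one reusable function. The sketch below is one possible shape for it, run on made-up rows; the function name top_hours and its parameters are hypothetical, and it returns integer hours rather than two-digit strings.

```python
import datetime as dt
from collections import defaultdict


def top_hours(posts, n=5, created_index=6, comments_index=4):
    """Return up to n (hour, average_comments) pairs, highest average first."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for row in posts:
        hour = dt.datetime.strptime(row[created_index], "%m/%d/%Y %H:%M").hour
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    averages = [(hour, comments[hour] / counts[hour]) for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:n]


# made-up rows in the same column layout as the dataset
sample_posts = [
    ['1', 'Ask HN: A?', '', '4', '7', 'alice', '9/26/2016 2:53'],
    ['2', 'Ask HN: B?', '', '6', '3', 'bob', '9/26/2016 2:17'],
    ['3', 'Ask HN: C?', '', '1', '12', 'carol', '9/25/2016 15:48'],
]

for hour, average in top_hours(sample_posts):
    print(f'{hour:02d}:00: {average:.2f} average comments per post')
```

The same call could then be made once with ask_posts and once with other_posts.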

Conclusion¶

Our main goal was to compare two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

We found that, on average, Ask posts receive more comments than Show posts (13.744 versus 9.811). One possible explanation is that people prefer giving advice to giving feedback on a project.

We also looked at the remaining posts (other_posts) and found that they have the highest average number of comments (25.839). A possible reason is that they cover many different topics, some of which are very popular or controversial, so people discuss them a lot.

Regarding the second question, the analysis of Ask posts showed that the hours with the most comments per post are daytime hours:

15:00
13:00
12:00
10:00
17:00

The analysis of Other posts showed that the hours don't differ too much: the average number of comments is fairly similar for each hour of the day (between 22.72 and 29.37). The top 5 hours are:

13:00
12:00
14:00
15:00
11:00

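So if you are deciding when to post in order to receive as many comments or as much feedback as possible, the answer is: do it between 11:00 and 15:00. For Ask HN posts in particular, the 15:00 hour stands out, with 39.67 comments per post on average. Keep in mind that posts without any comments were removed at the start, so these averages describe posts that received at least one comment.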
