
Exploring Hacker News Posts¶

Introduction¶

In this notebook we'll work with a dataset of submissions to Hacker News. You can download it here. To keep things simple, posts without comments were removed and the remaining submissions were randomly sampled (the data was reduced from almost 300,000 rows to approximately 20,000 rows).

Below are the descriptions of the columns:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted

For our analysis we're specifically interested in posts whose titles begin with either Ask HN (when users submit posts to ask the Hacker News community a specific question) or Show HN (when users submit posts to show the Hacker News community a project, product, or just generally something interesting).

We will compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

Loading data¶

In [1]:

#import the reader function from the csv module
from csv import reader

#open the file in a with block so it is closed once reading is done
with open('hacker_news.csv', encoding="utf8", newline='') as opened_file:
    read_file = reader(opened_file)
    #converting the read file into a list of lists format
    hn = list(read_file)

After loading it, we can check the first five rows of our dataset (the header row plus four posts):

In [2]:

hn[0:5]

Out[2]:

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

Now let's separate the header row from the data:

In [3]:

headers = hn[0]
hn = hn[1:]

In [4]:

#checking headers
headers

Out[4]:

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
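
The later cells refer to columns by position (`row[1]`, `row[4]`, `row[6]`). As a minimal sketch, a lookup from column name to position can be built from the header row, which makes those positions easier to verify. The name `col_index` is hypothetical and is not used elsewhere in this notebook.

```python
# Header row as shown in Out[4], repeated so the sketch is self-contained.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# col_index is a hypothetical helper name: it maps each column name to its position.
col_index = {name: position for position, name in enumerate(headers)}

print(col_index['title'], col_index['num_comments'], col_index['created_at'])
# prints: 1 4 6
```

With such a lookup, `row[col_index['num_comments']]` would be equivalent to `row[4]`.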

In [5]:

#checking data
hn[0:5]

Out[5]:

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

Data analysis¶

Splitting Ask HN, Show HN and Other posts¶

We will populate three lists to categorize the posts: those whose titles begin with Ask HN, those whose titles begin with Show HN, and Other (any other post).

In [6]:

#initializing empty lists
ask_posts = []
show_posts = []
other_posts = []

#iterating on dataset
for row in hn:
    title = row[1] #title is the second column
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        

In [7]:

#checking number of posts
print('Number of posts in ask_posts:',len(ask_posts))
print('Number of posts in show_posts:',len(show_posts))
print('Number of posts in other_posts:',len(other_posts))

Number of posts in ask_posts: 1744
Number of posts in show_posts: 1162
Number of posts in other_posts: 17194
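
The same categorization rule can be sketched as a small function, which makes it easy to try on individual titles. The function name `classify_title` and the sample titles are assumptions made for illustration; they are not taken from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on how the title starts."""
    lowered = title.lower()  # compare case-insensitively, as in the loop above
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'SHOW HN: My side project',
    'Interactive Dynamic Video',
]
print([classify_title(title) for title in sample_titles])
# prints: ['ask', 'show', 'other']
```

Note that the rule only looks at the start of the title, so a title that merely mentions "Ask HN" later on still counts as Other.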

Next, let's determine if ask posts or show posts receive more comments on average.

In [8]:

#average comments - ask hn posts
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)

In [9]:

#average comments - show hn posts
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)

In [10]:

print('Average comments on Ask HN posts: {:.2f}'.format(avg_ask_comments))
print('Average comments on Show HN posts: {:.2f}'.format(avg_show_comments))

Average comments on Ask HN posts: 14.04
Average comments on Show HN posts: 10.32

As expected, Ask HN posts have a higher average number of comments (14.04 versus 10.32 for Show HN posts).
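
The two averaging loops differ only in the list they iterate over, so the calculation could be sketched once as a helper. The name `average_comments` and the sample rows are hypothetical; the rows are made up and only mimic the seven-column layout of the dataset.

```python
def average_comments(posts, comments_column=4):
    """Mean of the num_comments column over a list of rows (0.0 for an empty list)."""
    if not posts:
        return 0.0  # avoid dividing by zero when a category has no posts
    total = sum(int(row[comments_column]) for row in posts)
    return total / len(posts)

# Made-up rows in the dataset's column order; only num_comments matters here.
sample_rows = [
    ['1', 'Ask HN: a', '', '3', '52', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: b', '', '5', '10', 'user_b', '1/26/2016 19:30'],
    ['3', 'Ask HN: c', '', '1', '1', 'user_c', '6/23/2016 22:20'],
]
print('{:.2f}'.format(average_comments(sample_rows)))
# prints: 21.00
```

In the notebook, this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.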

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

Calculating average number of comments by hour created¶

In [11]:

#importing datetime library
import datetime as dt

In [34]:

result_list = []

for row in ask_posts:
    result_list.append([row[6],int(row[4])])#appending created at and number of comments

In [38]:

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comments_count = row[1]
    date_string = row[0]
    date_created = dt.datetime.strptime(date_string,"%m/%d/%Y %H:%M")
    hour_created = date_created.hour
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count 
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count

We now have the number of posts created and the number of comments received for each hour, so we can calculate the average number of comments per post for posts created during each hour of the day.

In [41]:

print('Posts created by hour: ', counts_by_hour)

print('Total comments added by hour', comments_by_hour)

Posts created by hour:  {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Total comments added by hour {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
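
The counting step can be checked on a few hand-made rows. The sketch below uses `collections.defaultdict` from the standard library instead of the explicit `if`/`else`, which is a swapped-in variation rather than the approach used in the cell above. The sample pairs are made up and only follow the shape of `result_list`.

```python
import datetime as dt
from collections import defaultdict

# Made-up [created_at, num_comments] pairs in the same shape as result_list.
sample_result_list = [
    ['8/4/2016 11:52', 52],
    ['1/26/2016 11:05', 10],
    ['6/23/2016 22:20', 1],
]

# Missing keys start at 0, so no membership check is needed.
sample_counts_by_hour = defaultdict(int)
sample_comments_by_hour = defaultdict(int)

for date_string, comments_count in sample_result_list:
    hour_created = dt.datetime.strptime(date_string, '%m/%d/%Y %H:%M').hour
    sample_counts_by_hour[hour_created] += 1
    sample_comments_by_hour[hour_created] += comments_count

print(dict(sample_counts_by_hour))    # prints: {11: 2, 22: 1}
print(dict(sample_comments_by_hour))  # prints: {11: 62, 22: 1}
```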

In [44]:

avg_by_hour = []

for key in counts_by_hour:
    avg_comments = comments_by_hour[key]/counts_by_hour[key] #average comments per post for this hour
    avg_by_hour.append([key,avg_comments])

In [45]:

avg_by_hour

Out[45]:

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list and printing the five highest values in a format that's easier to read.

In [47]:

#let's swap the column order from avg_by_hour
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
    
swap_avg_by_hour

Out[47]:

[[5.5777777777777775, 9],
 [14.741176470588234, 13],
 [13.440677966101696, 10],
 [13.233644859813085, 14],
 [16.796296296296298, 16],
 [7.985294117647059, 23],
 [9.41095890410959, 12],
 [11.46, 17],
 [38.5948275862069, 15],
 [16.009174311926607, 21],
 [21.525, 20],
 [23.810344827586206, 2],
 [13.20183486238532, 18],
 [7.796296296296297, 3],
 [10.08695652173913, 5],
 [10.8, 19],
 [11.383333333333333, 1],
 [6.746478873239437, 22],
 [10.25, 8],
 [7.170212765957447, 4],
 [8.127272727272727, 0],
 [9.022727272727273, 6],
 [7.852941176470588, 7],
 [11.051724137931034, 11]]
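
Swapping the columns works because `sorted` compares lists element by element, so the average ends up being compared first. As an alternative sketch, the original `[hour, average]` pairs can be sorted directly with the `key` argument of `sorted`. The sample values are copied from Out[45]; the names `avg_by_hour_sample` and `top_hours` are hypothetical.

```python
# A few [hour, average] pairs copied from Out[45].
avg_by_hour_sample = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort by the second element (the average), highest first, without swapping columns.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, average))
```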

In [67]:

#sorting swap_avg_by_hour in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print('Top 5 Hours for Ask Posts Comments')

for row in sorted_swap[:5]:
    hour_formatted = dt.datetime.strptime(str(row[1]),'%H')
    hour_formatted += dt.timedelta(hours=2) #shifting from EST (UTC-5) to UTC-3 by adding 2 hours
    hour_formatted = hour_formatted.strftime('%H:%M')

    print('{}: {:.2f} average comments per post'.format(hour_formatted,row[0]))
    
    

Top 5 Hours for Ask Posts Comments
17:00: 38.59 average comments per post
04:00: 23.81 average comments per post
22:00: 21.52 average comments per post
18:00: 16.80 average comments per post
23:00: 16.01 average comments per post
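
The two-hour shift can also be written with explicit time zone objects, which makes the assumption behind it visible. This is a minimal sketch that assumes fixed offsets of UTC-5 for the dataset and UTC-3 for the target, as in the cell's comment; it does not account for daylight saving time. The name `convert_hour` is hypothetical.

```python
import datetime as dt

# Assumed fixed offsets: the dataset hours are treated as EST (UTC-5)
# and the target is UTC-3. Daylight saving time is ignored.
EST = dt.timezone(dt.timedelta(hours=-5))
UTC_MINUS_3 = dt.timezone(dt.timedelta(hours=-3))

def convert_hour(hour):
    """Convert an hour of the day from the assumed EST offset to UTC-3."""
    # The date is arbitrary; only the time of day matters here.
    source_time = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)
    return source_time.astimezone(UTC_MINUS_3).strftime('%H:%M')

print(convert_hour(15))  # prints: 17:00
print(convert_hour(2))   # prints: 04:00
print(convert_hour(23))  # prints: 01:00 (wraps past midnight)
```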

Conclusions and next steps¶

From our top 5 hours for Ask Posts we can conclude that the best time to post a question would be around 17:00 - 17:59 (in the America/Sao_Paulo timezone). It's interesting to note that the average number of comments for posts created around 18:00 drops to less than half of the average for posts created around 17:00.

The reason behind this huge difference in a small window of time could be the end of office hours. To verify whether this is a valid hypothesis we could analyse the average comments per post for the whole dataset and not only the Ask HN posts. We could then check whether the distribution of average comments is the same.
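
As a starting point for that next step, the per-hour calculation could be wrapped in a function that accepts any list of rows, so it can be run on the whole dataset as well as on the Ask HN posts. This is only a sketch: the name `avg_comments_by_hour` is hypothetical, and the sample rows are made up to mimic the seven-column layout of the dataset.

```python
import datetime as dt

def avg_comments_by_hour(rows, date_column=6, comments_column=4):
    """Return {hour: average number of comments} for a list of dataset rows."""
    counts = {}
    totals = {}
    for row in rows:
        hour = dt.datetime.strptime(row[date_column], '%m/%d/%Y %H:%M').hour
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[comments_column])
    return {hour: totals[hour] / counts[hour] for hour in counts}

# Made-up rows with the same seven columns as the dataset.
sample_rows = [
    ['1', 'Ask HN: a', '', '3', '10', 'user_a', '8/4/2016 15:10'],
    ['2', 'Some link', 'http://example.com', '5', '20', 'user_b', '8/5/2016 15:45'],
    ['3', 'Show HN: b', '', '1', '4', 'user_c', '8/6/2016 9:00'],
]
print(avg_comments_by_hour(sample_rows))
# prints: {15: 15.0, 9: 4.0}
```

In the notebook, `avg_comments_by_hour(hn)` and `avg_comments_by_hour(ask_posts)` would give the two distributions to compare.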