In this project I analyze data about submissions to the technology site Hacker News.
The goal of the project is to determine the following: do Ask HN or Show HN posts receive more comments on average?
Note: This project forms part of Dataquest.io's course 'Data Science in Python - Fundamentals'.
Let's start out by gathering the data.
The freely available data set can be found here, and was stored in hacker_news.csv.
First we open and read the file using the reader() function of the csv module, and read the file in as a list of lists. Then we display the first five rows, each row on a new line.
from csv import reader

opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
print(*hn[:5], sep="\n")
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
The data contains a header row, which we need to remove to analyze the data itself.
Let's remove the header row and save it to a separate variable. Then we display the first five rows of the data set again, to verify that all went well.
# run this cell only once,
# or it will remove more than just the header row
headers = hn[0]
hn = hn[1:]
print(headers)
print()  # just a blank line, to separate the two lists
print(*hn[:5], sep="\n")
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
We are only interested in post titles starting with Ask HN or Show HN, so we create a new list of lists with our filtered data.
To filter the data we use the string method startswith.
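As a quick illustration (using a made-up title), startswith is case-sensitive, which is why we lowercase each title before checking it:

```python
# str.startswith is case-sensitive, so "Ask HN: ..." would slip past a
# check against the lowercase prefix "ask hn" unless we lowercase first.
title = "Ask HN: How do you learn Python?"
print(title.startswith("ask hn"))          # False: case mismatch
print(title.lower().startswith("ask hn"))  # True after lowercasing
```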
With that, our data is ready to be analysed.
# creating 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

# Looping through the data; `title` is the second column (index 1).
# Check the lowercase version of `title` and append the row to the right list.
for item in hn:
    title = item[1]
    lower_title = title.lower()
    if lower_title.startswith("ask hn"):
        ask_posts.append(item)
    elif lower_title.startswith("show hn"):
        show_posts.append(item)
    else:
        other_posts.append(item)
# checking the number of posts in each list
print("ask_posts: " + str(len(ask_posts)))
print("show_posts: " + str(len(show_posts)))
print("other_posts: " + str(len(other_posts)))
ask_posts: 1744
show_posts: 1162
other_posts: 17194
To determine if Ask posts or Show posts receive more comments on average, let's find the total number of comments for these types of posts.
total_ask_comments = 0
total_show_comments = 0

# looping over the ask posts;
# the number of comments is the fifth column (index 4);
# convert this number to an integer and add it to `total_ask_comments`
for item in ask_posts:
    comments = int(item[4])
    total_ask_comments += comments

# computing the average number of comments per ask post
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average comments per Ask post: " + str(avg_ask_comments))

# looping over the show posts;
# the number of comments is again the fifth column (index 4);
# convert this number to an integer and add it to `total_show_comments`
for item in show_posts:
    comments = int(item[4])
    total_show_comments += comments

# computing the average number of comments per show post
avg_show_comments = total_show_comments / len(show_posts)
print("Average comments per Show post: " + str(avg_show_comments))
Average comments per Ask post: 14.038417431192661
Average comments per Show post: 10.31669535283993
It seems the Ask posts receive more comments on average (14) than Show posts (10).
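As an aside, the two loops above follow the same pattern, so a small helper could compute both averages. This is just a sketch, assuming the comment count stays in column index 4, shown here with two made-up posts:

```python
def avg_comments(posts, comments_index=4):
    """Return the average number of comments across a list of posts."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Two made-up posts holding '10' and '20' comments respectively:
sample = [["id1", "Title 1", "url", "5", "10", "a", "1/1/2016 0:00"],
          ["id2", "Title 2", "url", "5", "20", "b", "1/1/2016 0:00"]]
print(avg_comments(sample))  # 15.0
```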
Let's focus our remaining analysis on the Ask posts and determine if posts created at a certain time attract more comments. We'll go about it in two steps:
1. Calculate the number of Ask posts created in each hour of the day, along with the number of comments they received.
2. Calculate the average number of comments Ask posts receive per hour created.
We begin with the first step, using the datetime module.
import datetime as dt

result_list = []

# looping over the ask posts and appending the creation date
# and number of comments as a list to result_list
for item in ask_posts:
    created_at = item[6]
    comments = int(item[4])
    result_list.append([created_at, comments])
counts_by_hour = {}
comments_by_hour = {}

# Looping through result_list:
# extract the created_at time from each entry,
# parse it into a datetime object,
# and select just the hour from that object.
for item in result_list:
    created_at = item[0]
    comments = item[1]
    dt_hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = dt_hour.strftime("%H")
    # check if the hour is already in the dictionaries:
    # if not, create an entry in both; if it is, increment both entries
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
# let's see how that went
print("Counts by hour:")
for key, value in counts_by_hour.items():
    print(key, value)

print()  # blank line to separate

print("Comments by hour:")
for key, value in comments_by_hour.items():
    print(key, value)
Counts by hour:
15 116
06 44
23 68
21 109
18 109
03 54
22 71
13 85
12 73
10 59
00 55
17 100
08 48
02 58
09 45
16 108
05 46
14 107
07 34
20 80
11 58
04 47
01 60
19 110

Comments by hour:
15 4477
06 397
23 543
21 1745
18 1439
03 421
22 479
13 1253
12 687
10 793
00 447
17 1146
08 492
02 1381
09 251
16 1814
05 464
14 1416
07 267
20 1722
11 641
04 337
01 683
19 1188
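As a quick sanity check of the date parsing used above: strptime turns a timestamp string into a datetime object according to a format string, and strftime extracts just the hour. Here is a minimal sketch using the first timestamp from the data set:

```python
import datetime as dt

created_at = "8/4/2016 11:52"  # timestamp format used in the data set
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
print(parsed.strftime("%H"))  # 11
```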
Now that we have the number of Ask posts created each hour of the day, and the number of comments each received, we can proceed with step 2:
For example, at 15:00 there were 116 Ask posts that received 4477 comments, so on average 4477/116 = 38.59 comments per post. To calculate this for each hour we will iterate over the two dictionaries we created in the previous step: for each hour-key in comments_by_hour we get the corresponding comments-value, and divide it by the posts-value for the same hour-key in counts_by_hour.
The result of every iteration is appended to a new list of lists.
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

# let's print the result, each item on a new line
for element in avg_by_hour:
    print(element)
['15', 38.5948275862069]
['06', 9.022727272727273]
['23', 7.985294117647059]
['21', 16.009174311926607]
['18', 13.20183486238532]
['03', 7.796296296296297]
['22', 6.746478873239437]
['13', 14.741176470588234]
['12', 9.41095890410959]
['10', 13.440677966101696]
['00', 8.127272727272727]
['17', 11.46]
['08', 10.25]
['02', 23.810344827586206]
['09', 5.5777777777777775]
['16', 16.796296296296298]
['05', 10.08695652173913]
['14', 13.233644859813085]
['07', 7.852941176470588]
['20', 21.525]
['11', 11.051724137931034]
['04', 7.170212765957447]
['01', 11.383333333333333]
['19', 10.8]
Finally, let's sort this list so it is easier to understand.
To do this we reverse the order of each element, so the average number of comments appears as the first element and the hour as the second. Then we sort this list of lists by the first element in descending order (the hour with the highest average number of comments appears at the top of the list) and print it.
To finish off we present the top five results in an attractive format.
swap_avg_by_hour = []

for item in avg_by_hour:
    swap_avg_by_hour.append([item[1], item[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Sorted list:")
for element in sorted_swap:
    print(element)

print()  # just a blank line

print("Top 5 Hours for Ask Posts Comments")
print()  # just a blank line

for element in sorted_swap[0:5]:
    avg_comments = element[0]
    avg_comments_format = "{:.2f}".format(avg_comments)
    hour = element[1]
    hour_object = dt.datetime.strptime(hour, "%H")
    hour_format = hour_object.strftime("%H:%M")
    print(str(hour_format) + ": " + str(avg_comments_format) + " comments per post")
Sorted list:
[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']

Top 5 Hours for Ask Posts Comments

15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.80 comments per post
21:00: 16.01 comments per post
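As an aside, the same ordering can be obtained without the swap step by passing a key function to sorted. This is a minimal sketch over a few made-up hour/average pairs:

```python
avg_by_hour = [["15", 38.59], ["06", 9.02], ["02", 23.81]]

# Sort by the average (index 1) in descending order; no swapping needed.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(top)  # [['15', 38.59], ['02', 23.81], ['06', 9.02]]
```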
So to conclude, there seem to be three time slots for creating Ask posts that have a high chance of receiving comments: 15:00-16:00, 20:00-21:00, and 02:00.
That should give you plenty of options for choosing the right moment to create your Ask post, then sit back and enjoy seeing the comments roll in.
That's all on this analysis for now!