In this project I analyze data about submissions to the technology site Hacker News.
The goal of the project is to determine the following: do Ask HN or Show HN posts receive more comments on average?
Note: This project forms part of Dataquest.io's course 'Data Science in Python - Fundamentals'.
Let's start out by gathering the data.
The freely available data set can be found here, and was stored in hacker_news.csv.
First we open and read the file using the reader() function of the csv module, and read the file in as a list of lists. Then we display the first five rows, each row on a new line.
from csv import reader

opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
print(*hn[:5], sep="\n")
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
The data contains a header row, which we need to remove to analyze the data itself.
Let's remove the header row and save it to a separate variable. Then we display the first five rows of the data set again, to verify that all went well.
# run this cell only once,
# or it will remove more than just the header row
headers = hn[0]
hn = hn[1:]
print(headers)
print()  # just a blank line, to separate the two lists
print(*hn[:5], sep="\n")
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
We are only interested in post titles starting with Ask HN or Show HN, so we create a new list of lists with our filtered data.
To filter the data we use the string method startswith.
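As a quick illustration (using a made-up title), startswith is case-sensitive, which is why we lowercase each title before checking it:

```python
# str.startswith is case-sensitive, so "Ask HN: ..." would slip past a
# check against the lowercase prefix "ask hn" unless we lowercase first.
title = "Ask HN: How do you learn Python?"
print(title.startswith("ask hn"))          # False: case mismatch
print(title.lower().startswith("ask hn"))  # True after lowercasing
```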
With that, our data is ready to be analysed.
# creating 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

# Looping through the data; `title` is the second column (index 1).
# Check the lowercase version of `title` and append the row to the right list.
for item in hn:
    title = item[1]
    lower_title = title.lower()
    if lower_title.startswith("ask hn"):
        ask_posts.append(item)
    elif lower_title.startswith("show hn"):
        show_posts.append(item)
    else:
        other_posts.append(item)
# checking the number of posts in each list
print("ask_posts: " + str(len(ask_posts)))
print("show_posts: " + str(len(show_posts)))
print("other_posts: " + str(len(other_posts)))
ask_posts: 1744
show_posts: 1162
other_posts: 17194
To determine if Ask posts or Show posts receive more comments on average, let's find the total number of comments for these types of posts.
total_ask_comments = 0
total_show_comments = 0

# looping over the ask posts;
# the number of comments is the fifth column (index 4);
# convert this number to an integer and add it to `total_ask_comments`
for item in ask_posts:
    comments = int(item[4])
    total_ask_comments += comments

# computing the average number of comments per ask post
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average comments per Ask post: " + str(avg_ask_comments))

# looping over the show posts;
# the number of comments is again the fifth column (index 4);
# convert this number to an integer and add it to `total_show_comments`
for item in show_posts:
    comments = int(item[4])
    total_show_comments += comments

# computing the average number of comments per show post
avg_show_comments = total_show_comments / len(show_posts)
print("Average comments per Show post: " + str(avg_show_comments))
Average comments per Ask post: 14.038417431192661
Average comments per Show post: 10.31669535283993
It seems the Ask posts receive more comments on average (14) than Show posts (10).
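As an aside, the two loops above follow the same pattern, so a small helper could compute both averages. This is just a sketch, assuming the comment count stays in column index 4, shown here with two made-up posts:

```python
def avg_comments(posts, comments_index=4):
    """Return the average number of comments across a list of posts."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Two made-up posts holding '10' and '20' comments respectively:
sample = [["id1", "Title 1", "url", "5", "10", "a", "1/1/2016 0:00"],
          ["id2", "Title 2", "url", "5", "20", "b", "1/1/2016 0:00"]]
print(avg_comments(sample))  # 15.0
```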
Let's focus our remaining analysis on the Ask posts and determine if posts created at a certain time attract more comments. We'll go about it in two steps:
1. Calculate the number of Ask posts created in each hour of the day, along with the number of comments they received.
2. Calculate the average number of comments Ask posts receive per hour created.
We begin with the first step, using the datetime module.
import datetime as dt

result_list = []

# looping over the ask posts and appending the creation date
# and number of comments as a list to result_list
for item in ask_posts:
    created_at = item[6]
    comments = int(item[4])
    result_list.append([created_at, comments])
counts_by_hour = {}
comments_by_hour = {}

# Looping through result_list:
# extract the created_at time from each entry,
# parse it into a datetime object,
# and select just the hour from that object.
for item in result_list:
    created_at = item[0]
    comments = item[1]
    dt_hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = dt_hour.strftime("%H")
    # check if the hour is already in the dictionaries:
    # if not, create an entry in both; if it is, increment both entries
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
# let's see how that went
print("Counts by hour:")
for key, value in counts_by_hour.items():
    print(key, value)

print()  # blank line to separate

print("Comments by hour:")
for key, value in comments_by_hour.items():
    print(key, value)
Counts by hour:
15 116
06 44
23 68
21 109
18 109
03 54
22 71
13 85
12 73
10 59
00 55
17 100
08 48
02 58
09 45
16 108
05 46
14 107
07 34
20 80
11 58
04 47
01 60
19 110

Comments by hour:
15 4477
06 397
23 543
21 1745
18 1439
03 421
22 479
13 1253
12 687
10 793
00 447
17 1146
08 492
02 1381
09 251
16 1814
05 464
14 1416
07 267
20 1722
11 641
04 337
01 683
19 1188
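As a quick sanity check of the date parsing used above: strptime turns a timestamp string into a datetime object according to a format string, and strftime extracts just the hour. Here is a minimal sketch using the first timestamp from the data set:

```python
import datetime as dt

created_at = "8/4/2016 11:52"  # timestamp format used in the data set
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
print(parsed.strftime("%H"))  # 11
```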
Now that we have the number of Ask posts created each hour of the day, and the number of comments each received, we can proceed with step 2:
For example, at 15:00 there were 116 Ask posts that received 4477 comments, so on average 4477/116 = 38.59 comments per post. To calculate this for each hour we will iterate over the two dictionaries we created in the previous step: for each hour-key in comments_by_hour we get the corresponding comments-value, and divide it by the posts-value for the same hour-key in counts_by_hour.
The result of every iteration is appended to a new list of lists.
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

# let's print the result, each item on a new line
for element in avg_by_hour:
    print(element)
['15', 38.5948275862069]
['06', 9.022727272727273]
['23', 7.985294117647059]
['21', 16.009174311926607]
['18', 13.20183486238532]
['03', 7.796296296296297]
['22', 6.746478873239437]
['13', 14.741176470588234]
['12', 9.41095890410959]
['10', 13.440677966101696]
['00', 8.127272727272727]
['17', 11.46]
['08', 10.25]
['02', 23.810344827586206]
['09', 5.5777777777777775]
['16', 16.796296296296298]
['05', 10.08695652173913]
['14', 13.233644859813085]
['07', 7.852941176470588]
['20', 21.525]
['11', 11.051724137931034]
['04', 7.170212765957447]
['01', 11.383333333333333]
['19', 10.8]
Finally, let's sort this list so it is easier to understand.
To do this we reverse the order of each element, so the average number of comments appears as the first element and the hour as the second. Then we sort this list of lists by the first element in descending order (the hour with the highest average number of comments appears at the top of the list) and print it.
To finish off we present the top five results in an attractive format.
swap_avg_by_hour = []

for item in avg_by_hour:
    swap_avg_by_hour.append([item[1], item[0]])

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Sorted list:")
for element in sorted_swap:
    print(element)

print()  # just a blank line

print("Top 5 Hours for Ask Posts Comments")
print()  # just a blank line

for element in sorted_swap[0:5]:
    avg_comments = element[0]
    avg_comments_format = "{:.2f}".format(avg_comments)
    hour = element[1]
    hour_object = dt.datetime.strptime(hour, "%H")
    hour_format = hour_object.strftime("%H:%M")
    print(str(hour_format) + ": " + str(avg_comments_format) + " comments per post")
Sorted list:
[38.5948275862069, '15']
[23.810344827586206, '02']
[21.525, '20']
[16.796296296296298, '16']
[16.009174311926607, '21']
[14.741176470588234, '13']
[13.440677966101696, '10']
[13.233644859813085, '14']
[13.20183486238532, '18']
[11.46, '17']
[11.383333333333333, '01']
[11.051724137931034, '11']
[10.8, '19']
[10.25, '08']
[10.08695652173913, '05']
[9.41095890410959, '12']
[9.022727272727273, '06']
[8.127272727272727, '00']
[7.985294117647059, '23']
[7.852941176470588, '07']
[7.796296296296297, '03']
[7.170212765957447, '04']
[6.746478873239437, '22']
[5.5777777777777775, '09']

Top 5 Hours for Ask Posts Comments

15:00: 38.59 comments per post
02:00: 23.81 comments per post
20:00: 21.52 comments per post
16:00: 16.80 comments per post
21:00: 16.01 comments per post
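As an aside, the same ordering can be obtained without the swap step by passing a key function to sorted. This is a minimal sketch over a few made-up hour/average pairs:

```python
avg_by_hour = [["15", 38.59], ["06", 9.02], ["02", 23.81]]

# Sort by the average (index 1) in descending order; no swapping needed.
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(top)  # [['15', 38.59], ['02', 23.81], ['06', 9.02]]
```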
So to conclude, there seem to be three time slots for creating Ask posts that have a high chance of receiving comments: 15:00-16:00, 20:00-21:00, and 02:00.
That should give you plenty of options for choosing the right moment to create your Ask post, then sit back and enjoy seeing the comments roll in.
That's all on this analysis for now!