This project analyses the number of comments on posts found on Hacker News, a popular technology site. Hacker News hosts many different types of posts, but we will focus on two in this project:

- 'Ask HN' posts, where users ask the Hacker News community a question
- 'Show HN' posts, where users show the community a project, product, or something of interest

We will isolate these posts from the data set and use them to answer two questions:

- Do 'Ask HN' or 'Show HN' posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?
The original data set for this project can be found on Kaggle. However, a reduced data set will be used here: since the purpose of this project is to analyse the number of comments that different types of posts receive, all posts that received no comments have been removed.
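For reference, the reduction step could be reproduced from the full Kaggle data set with a simple filter. A minimal sketch, assuming the full set shares the reduced set's column order (with `num_comments` at index 4); the sample rows below are made up for illustration:

```python
# Hypothetical reduction step: keep only posts that received comments.
# Sample rows stand in for the full Kaggle data set.
rows = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['1', 'Ask HN: Example?', '', '10', '0', 'alice', '8/4/2016 11:52'],
    ['2', 'Show HN: Example', '', '5', '3', 'bob', '8/4/2016 12:00'],
]
header, data = rows[0], rows[1:]
with_comments = [row for row in data if int(row[4]) > 0]
print(len(with_comments))  # 1 - only the post with 3 comments survives
```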
# read in the CSV file
import csv
with open('hacker_news.csv') as file:
    reader = csv.reader(file)
    hn = list(reader)

header = hn[0]  # keep a header list to track what each value in a row relates to
hn = hn[1:]

# display the header and the first 5 rows
print(header)
print('\n')
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
 ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'],
 ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'],
 ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'],
 ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
# create new lists to isolate the 'Ask HN', 'Show HN' and other posts
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()  # lower-case the title for easier matching
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# check the number of posts in each list
print('Number of Ask posts:', len(ask_posts))
print('Number of Show posts:', len(show_posts))
print('Number of all other posts:', len(other_posts))
Number of Ask posts: 1744
Number of Show posts: 1162
Number of all other posts: 17194
# create a function to calculate the average number of comments for a list of posts
def avg_comments(posts):  # parameter renamed: 'list' would shadow the built-in
    total_comments = 0
    for row in posts:
        num_comments = int(row[4])
        total_comments += num_comments
    return total_comments / len(posts)

avg_ask_comments = avg_comments(ask_posts)
avg_show_comments = avg_comments(show_posts)
print('Average comments per ask post:', avg_ask_comments)
print('Average comments per show post:', avg_show_comments)

Average comments per ask post: 14.038417431192661
Average comments per show post: 10.31669535283993
So far, we have read the Hacker News CSV file into a list of lists, separated out the header row, and created three new lists to isolate the data according to what we would like to analyse:

- ask_posts, for posts whose titles begin with 'Ask HN'
- show_posts, for posts whose titles begin with 'Show HN'
- other_posts, for everything else

We then created a function to compute the average number of comments that each type of post receives. The results are as follows:
Ask HN posts receive, on average, 14.04 comments, while Show HN posts receive 10.32. This suggests that Ask HN posts attract more engagement than Show HN posts.
Since Ask HN posts receive more comments, we will analyse these posts further to see whether the time that the post is posted affects the number of comments that the post receives.
# import the datetime module
import datetime as dt

result_list = []  # each element: [time the post was created, number of comments]
for row in ask_posts:
    created_at = row[6]
    num_comments = row[4]
    result_list.append([created_at, num_comments])

# print the first few rows to check that the list has been built correctly
print(result_list[:3])
[['8/16/2016 9:55', '6'], ['11/22/2015 13:43', '29'], ['5/2/2016 10:14', '1']]
# create frequency tables for the number of posts per hour, and number of comments per hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    date = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    num_comments = int(row[1])
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

print('Posts per hour:')
print(counts_by_hour)
print('\n')
print('Comments per hour:')
print(comments_by_hour)
Posts per hour:
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

Comments per hour:
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
# calculate the average number of comments per post for each hour
avg_by_hour = []
for hour in counts_by_hour:
    average = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, average])

print('Average number of comments per hour (hour first):')
print(avg_by_hour)
Average number of comments per hour (hour first):
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
# swap the columns so that the average is the first element
swap_avg_by_hour = []
for row in avg_by_hour:
    swapped = [row[1], row[0]]
    swap_avg_by_hour.append(swapped)

print('Unsorted average number of comments per hour (average first):')
print(swap_avg_by_hour)

# sort the list (highest average first) so that it is easier to analyse
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('\n')
print('Sorted average number of comments per hour (average first):')
print(sorted_swap)
Unsorted average number of comments per hour (average first):
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

Sorted average number of comments per hour (average first):
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
Up to this point, we have isolated the time each Ask post was created and its number of comments, and stored these pairs in a new list. Using this list, we extracted the hour from each timestamp and built two frequency tables:

- counts_by_hour, the number of Ask posts created during each hour of the day
- comments_by_hour, the total number of comments those posts received
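As an aside, the same two frequency tables could be built with collections.Counter, which removes the need for the if/else initialisation because missing keys default to zero. A sketch on a couple of made-up sample rows:

```python
from collections import Counter

# Sample [created_at, num_comments] pairs, as in result_list above
sample = [['8/16/2016 9:55', '6'], ['11/22/2015 13:43', '29']]

counts = Counter()
comments = Counter()
for created_at, num_comments in sample:
    # extract the hour from e.g. '8/16/2016 9:55' and zero-pad it
    hour = created_at.split()[1].split(':')[0].zfill(2)
    counts[hour] += 1  # missing keys start at 0, no if/else needed
    comments[hour] += int(num_comments)

print(counts)    # Counter({'09': 1, '13': 1})
print(comments)  # Counter({'13': 29, '09': 6})
```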
These frequency tables then allowed us to calculate the average number of comments per post for each hour. Although this gave us the information we needed, the unsorted list was hard to read, so we sorted the averages in descending order for ease of analysis.
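As an alternative to swapping the columns before sorting, sorted() can be given a key function that sorts the original list by its second element directly. A sketch on sample data:

```python
# Sample [hour, average] pairs, as in avg_by_hour above
avg_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81]]

# sort by the average (index 1) directly, highest first,
# without building a swapped copy of the list
top_hours = sorted(avg_sample, key=lambda row: row[1], reverse=True)
print(top_hours)  # [['15', 38.59], ['02', 23.81], ['09', 5.58]]
```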
Next, we will format a string and print out the top 5 hours that generate the highest number of comments:
print('Top 5 Hours for Ask Posts Comments.')
for row in sorted_swap[:5]:
    average = row[0]
    hour = row[1]
    hour = dt.datetime.strptime(hour, "%H")
    hour = hour.strftime("%H:%M")
    top_5 = '{}: {:.2f} average comments per post'.format(hour, average)
    print(top_5)
Top 5 Hours for Ask Posts Comments.
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
Top 5 Hours for Ask Posts Comments (US/Eastern Time)
The 'Ask' posts with the highest number of comments are those posted at 15:00 US/Eastern Time. The next best time to post would be 02:00, then 20:00, and finally 16:00 and 21:00, which have essentially the same average number of comments per post.
South African time is 6 hours ahead of US/Eastern while US daylight saving is in effect, so in South Africa the best times to post are 6 hours later than those above. Therefore, to attract a higher number of comments, posts should be created at 21:00 South African time.
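The 6-hour offset can be checked with Python's zoneinfo module (Python 3.9+). The date below is an arbitrary date from the data set's period, chosen so that US daylight saving is in effect; outside daylight saving the offset grows to 7 hours, shifting the local best time to 22:00.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

eastern = ZoneInfo("US/Eastern")
sast = ZoneInfo("Africa/Johannesburg")

# 15:00 Eastern on 4 August 2016 (US daylight saving in effect)
best_eastern = datetime(2016, 8, 4, 15, 0, tzinfo=eastern)
best_sast = best_eastern.astimezone(sast)
print(best_sast.strftime('%H:%M'))  # 21:00
```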
This project had two aims: to determine whether 'Ask' or 'Show' posts receive more comments on average, and to determine whether the time that the post was created affected the number of comments.
We found that 'Ask' posts receive more comments on average than 'Show' posts. We also found that posts created at 15:00 US/Eastern time received considerably more comments than those created at other times, with the second-best time being 02:00.
Therefore, if you are looking to create a post that receives optimal engagement from readers, I'd suggest creating an 'Ask' post, and posting it at 15:00 Eastern time.