The goal of this project is to compare the different post types on the Hacker News website and to identify which types attract the most comments.
Here I open the file and get an overview of its content by printing the header and the first four data rows. I do not remove the header; instead, I do all the computational work on the slice hn[1:], which corresponds to the original dataset minus the header.
from csv import reader
# Read the whole file into a list of rows; the with block closes the file afterwards.
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
print(hn[1:4])  # shows the first three data rows, without the header
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
I want to separate the posts into three lists: questions (Ask HN), shared projects (Show HN), and all remaining posts.
ask_posts = []
show_posts = []
other_posts = []
header = hn[0]
ask_posts.append(header)
show_posts.append(header)
other_posts.append(header)
for row in hn[1:]:
    title = str(row[1])
    lower_title = title.lower()
    if lower_title.startswith('ask hn'):
        ask_posts.append(row)
    elif lower_title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("Number of Ask HN posts:", len(ask_posts[1:]))
print("Number of Show HN posts:", len(show_posts[1:]))
print("Number of Other posts:", len(other_posts[1:]))
Number of Ask HN posts: 1744
Number of Show HN posts: 1162
Number of Other posts: 17194
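The same split can also be written as a small helper that returns a label for each title. This is only a sketch: the function name classify_title is hypothetical, and the first two sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lower_title = title.lower()
    if lower_title.startswith("ask hn"):
        return "ask"
    if lower_title.startswith("show hn"):
        return "show"
    return "other"


# The first two titles are invented; the third comes from the dataset preview.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "Show HN: My weekend project",
    "Interactive Dynamic Video",
]
labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Returning a label keeps the prefix rules in one place, so they could be reused if more categories were added later.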
Now I compare the number of comments for Questions and Project posts.
# Each list still starts with the header row, so the number of posts is len(...) - 1.
total_ask_comments = 0
for row in ask_posts[1:]:
    n_comments = int(row[4])
    total_ask_comments += n_comments
avg_ask_comments = total_ask_comments / (len(ask_posts) - 1)
print("Average number of comments for Question posts: {:.2f}".format(avg_ask_comments))
total_show_comments = 0
for row in show_posts[1:]:
    n_comments = int(row[4])
    total_show_comments += n_comments
avg_show_comments = total_show_comments / (len(show_posts) - 1)
print("Average number of comments for Project posts: {:.2f}".format(avg_show_comments))
total_other_comments = 0
for row in other_posts[1:]:
    n_comments = int(row[4])
    total_other_comments += n_comments
avg_other_comments = total_other_comments / (len(other_posts) - 1)
print("Average number of comments for Other posts: {:.2f}".format(avg_other_comments))
Average number of comments for Question posts: 14.04
Average number of comments for Project posts: 10.32
Average number of comments for Other posts: 26.87
At this point in the analysis, Question posts receive more comments on average than Project posts (about 14 versus about 10). Note, however, that the mean number of comments for the remaining posts is much higher (about 27).
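Because the three lists keep the header as their first row, the number of posts in each is the list length minus one. A minimal sketch of a helper that makes this explicit, assuming the comment count sits at index 4 as in the header shown earlier; average_comments and the toy rows are hypothetical and not part of the dataset.

```python
def average_comments(rows_with_header, comments_index=4):
    """Average the comment counts of all data rows, skipping the header row."""
    data_rows = rows_with_header[1:]
    if not data_rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in data_rows)
    return total / len(data_rows)


# Toy table in the same column layout as hacker_news.csv (values are invented).
toy_posts = [
    ["id", "title", "url", "num_points", "num_comments", "author", "created_at"],
    ["1", "Ask HN: A?", "", "3", "10", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: B?", "", "5", "20", "bob", "8/4/2016 12:10"],
]
toy_average = average_comments(toy_posts)
print(toy_average)
```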
From now on, I will focus only on Question posts. I start by counting the Question posts and their comments for each hour of creation. I then compute the average number of comments per post for each hour.
import datetime as dt
result_list = []
for row in ask_posts[1:]:
    result_list.append([row[6], int(row[4])])
print(result_list[:4])
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date_time = row[0]
    n_comments = int(row[1])
    hour = dt.datetime.strptime(date_time, date_format).strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = n_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += n_comments
print(counts_by_hour)
print(comments_by_hour)
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3]]
{'19': 110, '14': 107, '13': 85, '21': 109, '07': 34, '20': 80, '00': 55, '01': 60, '11': 58, '22': 71, '08': 48, '12': 73, '06': 44, '15': 116, '02': 58, '18': 109, '23': 68, '17': 100, '09': 45, '10': 59, '03': 54, '16': 108, '04': 47, '05': 46}
{'19': 1188, '14': 1416, '13': 1253, '21': 1745, '07': 267, '20': 1722, '00': 447, '01': 683, '11': 641, '22': 479, '08': 492, '12': 687, '06': 397, '15': 4477, '02': 1381, '18': 1439, '23': 543, '17': 1146, '09': 251, '10': 793, '03': 421, '16': 1814, '04': 337, '05': 464}
avg_by_hour = []
# Both dictionaries share the same keys, so one loop over the hours is enough.
for hour in counts_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print("list of hours and of the average number of comments per day hour", avg_by_hour)
list of hours and of the average number of comments per day hour [['19', 10.8], ['14', 13.233644859813085], ['13', 14.741176470588234], ['21', 16.009174311926607], ['07', 7.852941176470588], ['20', 21.525], ['00', 8.127272727272727], ['01', 11.383333333333333], ['11', 11.051724137931034], ['22', 6.746478873239437], ['08', 10.25], ['12', 9.41095890410959], ['06', 9.022727272727273], ['15', 38.5948275862069], ['02', 23.810344827586206], ['18', 13.20183486238532], ['23', 7.985294117647059], ['17', 11.46], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['03', 7.796296296296297], ['16', 16.796296296296298], ['04', 7.170212765957447], ['05', 10.08695652173913]]
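The counting and averaging steps can also be combined in one function. The sketch below uses the same date format and the four rows printed from result_list above, plus one invented row; hourly_average is a hypothetical helper name, and defaultdict replaces the explicit "if hour not in ..." check.

```python
import datetime as dt
from collections import defaultdict


def hourly_average(rows, date_format="%m/%d/%Y %H:%M"):
    """Map each hour ('00'..'23') to the mean number of comments per post."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += n_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


sample_rows = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["8/2/2016 14:20", 3],
    ["8/3/2016 9:05", 4],  # invented row, so that hour 09 has two posts
]
sample_averages = hourly_average(sample_rows)
print(sample_averages)
```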
To identify the hours with the highest average number of comments more easily, I sort the list and retrieve the top entries. I start by creating a new list with the columns swapped, so that the average number of comments per post comes first and the hour second.
swap_avg_by_hour = []
for row in avg_by_hour:
    first_element = row[0]   # hour
    second_element = row[1]  # average number of comments
    swap_avg_by_hour.append([second_element, first_element])
print(swap_avg_by_hour)
[[10.8, '19'], [13.233644859813085, '14'], [14.741176470588234, '13'], [16.009174311926607, '21'], [7.852941176470588, '07'], [21.525, '20'], [8.127272727272727, '00'], [11.383333333333333, '01'], [11.051724137931034, '11'], [6.746478873239437, '22'], [10.25, '08'], [9.41095890410959, '12'], [9.022727272727273, '06'], [38.5948275862069, '15'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.985294117647059, '23'], [11.46, '17'], [5.5777777777777775, '09'], [13.440677966101696, '10'], [7.796296296296297, '03'], [16.796296296296298, '16'], [7.170212765957447, '04'], [10.08695652173913, '05']]
Finally, I sort the newly created list of lists in descending order and show its first five rows.
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Post Comments")
hour_format = "%H:%M"
for row in sorted_swap[:5]:
    hour = str(row[1])
    avg_comments = row[0]
    hour_f = dt.datetime.strptime(hour, "%H")
    hour_ff = hour_f.strftime(hour_format)
    print("{a}: {b:.2f} average comments per post.".format(a = hour_ff, b = avg_comments))
Top 5 Hours for Ask Post Comments
15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.
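The column swap is needed only because sorted compares the first element of each row. An alternative, sketched here with a few values copied from the avg_by_hour output, keeps the rows as [hour, average] and passes a key function instead.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
sample_avg_by_hour = [
    ["19", 10.8],
    ["20", 21.525],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
]

# Sort on the second column (the average), highest first.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)
for hour, avg in sorted_by_avg:
    print("{}:00: {:.2f} average comments per post.".format(hour, avg))
```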
In conclusion, it seems that the best time to post a question on Hacker News is between 3 and 4 pm. Any other time results in far fewer comments on average, although afternoon and evening hours generally seem preferable. However, let's keep in mind that I did not take into account the distribution of the data, so outliers (a post with an unusually large number of comments that happened to be published at a certain time of day) may have affected the whole outcome. In addition, posts belonging to other categories may have a very different distribution of comments per hour of the day.
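One way to check how much outliers drive these averages would be to compare the mean with the median for each hour. A minimal sketch using the standard library's statistics module; the comment counts below are invented to show the effect, and summarize_by_hour is a hypothetical helper rather than part of the analysis above.

```python
import statistics
from collections import defaultdict


def summarize_by_hour(rows):
    """Return {hour: (mean, median)} for rows of [hour, n_comments]."""
    grouped = defaultdict(list)
    for hour, n_comments in rows:
        grouped[hour].append(n_comments)
    return {
        hour: (statistics.mean(values), statistics.median(values))
        for hour, values in grouped.items()
    }


# Invented data: one very popular post at 15:00 inflates the mean for that hour.
toy_rows = [["15", 2], ["15", 3], ["15", 4], ["15", 991], ["16", 10], ["16", 12]]
summary = summarize_by_hour(toy_rows)
for hour, (mean, median) in sorted(summary.items()):
    print("{}:00 mean={:.1f} median={:.1f}".format(hour, mean, median))
```

When the mean for an hour is far above its median, a handful of posts is probably responsible for most of that hour's comments.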