In this project we explore data about posts on Hacker News. We try to determine which type of post receives more comments: posts asking the Hacker News community a question (Ask HN) or posts showcasing something new and interesting to the community (Show HN). We also try to determine the hour at which posts receive the most comments on average (the best time to post).
#reading the hacker_news.csv file and extracting the header
from csv import reader
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
headers = hn[0]
hn = hn[1:]
# testing the hn list of lists
print(hn[:5])
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
# displaying the column headers of the list of lists
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
To extract the posts whose titles begin with Ask HN or Show HN, we:
- Create three empty lists: ask_posts, show_posts and other_posts.
- Loop through the hn list of lists, convert the title column (index 1) to lowercase using the lower method and assign it to a variable called title.
- Use the startswith method to find the posts whose title begins with ask hn and append them to the ask_posts list.
- Use the startswith method to find the posts whose title begins with show hn and append them to the show_posts list.
- Append all remaining posts to the other_posts list.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    # elif (not a second if) keeps Ask HN posts out of other_posts
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# checking the number of posts in each list
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
# exploring the ask_posts list
print(ask_posts[:5])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
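The classification rule described above can also be isolated in a small function, which makes it easy to check on a few titles before running it over the whole dataset. This is only a sketch: the name classify_title and the sample titles are hypothetical and not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on how the title starts."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    elif lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical sample titles, used only to illustrate the three outcomes.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: A small side project',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Because the function returns exactly one label per title, a post can never be counted in two lists at once.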
To find out which type of post receives more comments, we compare the average number of comments on ask posts with the average number of comments on show posts. We can calculate the average for ask posts by:
- Initializing total_ask_comments at 0.
- Looping through the ask_posts list, converting the comments column (index 4) to an integer and storing it in the variable num_comments.
- Adding num_comments to total_ask_comments.
- Dividing total_ask_comments by the number of posts in the ask_posts list and saving the result in the variable avg_ask_comments.
We follow the same steps for the show_posts list to determine the average number of comments received by show posts.
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
print('Total number of ask post comments:', total_ask_comments)
avg_ask_comments = total_ask_comments/len(ask_posts)
print('Average number of ask comments:', avg_ask_comments)
Total number of ask post comments: 24483
Average number of ask comments: 14.038417431192661
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
print('Total number of show post comments:', total_show_comments)
avg_show_comments = total_show_comments/len(show_posts)
print('Average number of show comments:', avg_show_comments)
Total number of show post comments: 11988
Average number of show comments: 10.31669535283993
Looking at the values above, ask posts receive 14.04 comments on average and show posts receive 10.32.
This tells us that ask posts on Hacker News receive more comments on average than show posts.
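Since the same steps are repeated for ask posts and show posts, they could be wrapped in one small function. The following is a minimal sketch; the function name average_comments, its comments_index parameter and the sample rows are hypothetical and only illustrate the idea.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments over a list of post rows."""
    if not posts:
        # Avoid dividing by zero when a list is empty.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the same column layout as the dataset.
sample_posts = [
    ['1', 'Ask HN: example one', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: example two', '', '5', '10', 'user_b', '8/17/2016 10:15'],
]
sample_average = average_comments(sample_posts)
print(sample_average)
```

With such a helper, the two averages would come from average_comments(ask_posts) and average_comments(show_posts) instead of two near-identical loops.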
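Since ask posts receive more comments on average, we'll focus the remaining analysis on these posts. Next we determine the hour at which ask posts are likely to attract the most comments. To do this, we collect the creation time and comment count of each ask post, count the posts and comments for each hour of the day, and then compute the average number of comments per post for each hour.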
# Collecting the creation time and number of comments of each ask post
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    list_1 = [created_at, num_comments]
    result_list.append(list_1)
print(len(result_list))
1744
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = row[0]
    comment = row[-1]
    date_dt = dt.datetime.strptime(date, '%m/%d/%Y %H:%M')
    hour = date_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment
print('Posts by the Hour')
print(counts_by_hour)
print('Comments by the Hour')
print(comments_by_hour)
Posts by the Hour
{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}
Comments by the Hour
{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}
# Calculating the average number of comments per post for each hour
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
print(avg_by_hour)
[[0, 8.127272727272727], [1, 11.383333333333333], [2, 23.810344827586206], [3, 7.796296296296297], [4, 7.170212765957447], [5, 10.08695652173913], [6, 9.022727272727273], [7, 7.852941176470588], [8, 10.25], [9, 5.5777777777777775], [10, 13.440677966101696], [11, 11.051724137931034], [12, 9.41095890410959], [13, 14.741176470588234], [14, 13.233644859813085], [15, 38.5948275862069], [16, 16.796296296296298], [17, 11.46], [18, 13.20183486238532], [19, 10.8], [20, 21.525], [21, 16.009174311926607], [22, 6.746478873239437], [23, 7.985294117647059]]
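The counting and averaging steps above can also be written with collections.defaultdict from the standard library, which removes the need to check whether an hour is already a key. This is a sketch of an alternative, not the notebook's own code; the function name average_comments_by_hour and the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, date_format='%m/%d/%Y %H:%M'):
    """Map each hour of the day to the mean comment count of its posts.

    rows is a list of [created_at, num_comments] pairs, like result_list.
    """
    counts = defaultdict(int)    # number of posts per hour
    comments = defaultdict(int)  # total comments per hour
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Hypothetical rows: two posts at 9 o'clock and one at 14 o'clock.
sample_rows = [['8/16/2016 9:55', 6], ['8/17/2016 9:05', 10], ['8/2/2016 14:20', 3]]
sample_averages = average_comments_by_hour(sample_rows)
print(sample_averages)
```

Missing keys start at 0 automatically, so both dictionaries are updated with a plain += on every row.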
At this point we have the desired result, but the output is hard to read in its current order. We therefore sort the data into a more presentable form with the help of the sorted() function.
# swapping the data in the avg_by_hour list to make it easier to sort
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[-1], row[0]])
# Sorting the swapped data, reverse = True to present the data in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    string = '{}: {:.2f} average comments per post'
    hour = str(row[-1])
    hour_dt = dt.datetime.strptime(hour, '%H')
    hour = hour_dt.strftime('%H:%M')
    string = string.format(hour, row[0])
    print(string)
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
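The swap is only needed because sorted() compares the first element of each inner list by default. As an alternative sketch, the key argument of sorted() can point at the average directly, so the [hour, average] pairs stay in their original order. The name avg_by_hour_sample is hypothetical and its values are rounded from the output above.

```python
# A few [hour, average] pairs, rounded from the avg_by_hour output.
avg_by_hour_sample = [[0, 8.13], [2, 23.81], [15, 38.59], [20, 21.52]]

# Sort by the second element (the average) in descending order.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    # {:02d} pads the hour to two digits, e.g. 2 becomes 02.
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, avg))
```

Formatting the hour with {:02d} also avoids the round trip through strptime and strftime.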
After analyzing the Hacker News posts, we find that Ask HN posts receive more comments on average than Show HN posts and that the average number of comments peaks at specific hours. Therefore, if you are in Lagos, Nigeria and want to maximize engagement on Hacker News, it is best to make an Ask HN post at 10:00 AM, 3:00 PM or 9:00 PM.
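Translating the peak hours into Lagos time depends on the time zone in which created_at was recorded, which this write-up does not state. The sketch below shows one way to do the conversion under a loudly labeled assumption: the timestamps are taken to be US Eastern standard time (UTC-5) with daylight saving ignored. The names SOURCE_TZ, LAGOS_TZ and to_lagos_hour are hypothetical, and the dataset's documentation should be checked before relying on the converted hours.

```python
import datetime as dt

# ASSUMPTION: created_at is in US Eastern standard time (UTC-5).
# The source does not say; change this offset if the dataset differs.
SOURCE_TZ = dt.timezone(dt.timedelta(hours=-5))
# Lagos uses West Africa Time (UTC+1) all year round.
LAGOS_TZ = dt.timezone(dt.timedelta(hours=1))


def to_lagos_hour(hour):
    """Convert an hour of day (0-23) in SOURCE_TZ to 'HH:MM' in Lagos."""
    # The date is arbitrary; with fixed offsets only the hour matters.
    source_time = dt.datetime(2016, 1, 1, hour, tzinfo=SOURCE_TZ)
    return source_time.astimezone(LAGOS_TZ).strftime('%H:%M')


# The three hours with the highest averages in the analysis.
for hour in [15, 2, 20]:
    print('{:02d}:00 source time is {} in Lagos'.format(hour, to_lagos_hour(hour)))
```

With a different source time zone, or during daylight saving time, the converted hours shift accordingly.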