In this project we'll work with a dataset of posts from the popular technology website Hacker News. We'll focus on those posts whose titles begin with Ask HN or Show HN.
Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What's the best online course you've ever done?" Similarly, users submit Show HN posts to show the Hacker News community a project, a product, or something interesting in general.
Bear in mind that the dataset we are working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all posts that did not receive any comments and then randomly sampling the remaining posts.
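To make that reduction step concrete, here is a minimal sketch of how such a sample could be drawn. It is not the procedure actually used to build this dataset; the helper name `reduce_posts`, the fixed seed, and the toy rows are assumptions for illustration only.

```python
import random

def reduce_posts(rows, sample_size, comments_index=4, seed=1):
    """Drop rows with zero comments, then draw a random sample (hypothetical helper)."""
    commented = [row for row in rows if int(row[comments_index]) > 0]
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    return rng.sample(commented, min(sample_size, len(commented)))

# Toy rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'Ask HN: A', '', '3', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Show HN: B', '', '5', '2', 'user_b', '1/1/2016 11:00'],
    ['3', 'Some link', '', '9', '7', 'user_c', '1/1/2016 12:00'],
    ['4', 'Ask HN: D', '', '1', '4', 'user_d', '1/1/2016 13:00'],
]
sample = reduce_posts(toy_rows, 2)
print(len(sample))
```

Because posts without comments are removed before sampling, every average computed later describes only posts that received at least one comment.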
Our goal is to determine the following:

- Do Ask HN or Show HN posts get more comments (on average)?
- Do Ask HN or Show HN posts get more points (on average)?

Post Type | Comments on average | Points on average |
---|---|---|
Ask HN | 14.04 | 15.06 |
Show HN | 10.32 | 27.56 |
Hour of the day with the highest average number of comments per post
Post type | Rush Hour | AVG Comments |
---|---|---|
Ask HN | 15:00 - 16:00 | 38.59 |
Show HN | 18:00 - 19:00 | 15.77 |
Hour of the day with the highest average number of points per post
Post type | Rush Hour | AVG Points |
---|---|---|
Ask HN | 15:00 - 16:00 | 29.99 |
Show HN | 23:00 - 00:00 | 42.39 |
from csv import reader
# Read in the data
opened_file = open('hacker_news.csv')
read_file = reader(opened_file)
hn = list(read_file)
# Showing the first five rows.
print(*hn[:5], sep='\n\n')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
We notice that the first row contains the header of the dataset. To carry out our analysis we must first separate the header from the data.
headers = hn[0] # First row contains the headers
hn = hn[1:] # Selecting data without headers
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
print(*hn[:5], sep='\n\n') # Showing the first five records
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
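The read-and-split step can also be written so that the header is taken off the reader directly. The following is a minimal sketch, not the code used in this project: the helper name `load_rows` is hypothetical, and an in-memory string stands in for `hacker_news.csv` so the example runs on its own.

```python
import csv
import io

def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object (hypothetical helper)."""
    rows = csv.reader(file_obj)
    header = next(rows)          # the first row holds the column names
    return header, list(rows)    # the remaining rows are the data

# In-memory stand-in for hacker_news.csv (one data row copied from the output above).
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)
header, rows = load_rows(sample_csv)
print(header)
print(len(rows))
```

With the real file, the same helper could be called inside `with open('hacker_news.csv') as f:`, which also closes the file once reading is finished.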
Now, we can start exploring the number of comments for each type of post.
We'll identify posts that start with Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data will facilitate analysis for the next steps.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]  # Select the post title
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Shows the number of records for each list
print("Total 'Ask HN' posts: {:,}".format(len(ask_posts)))
print("Total 'Show HN' posts: {:,}".format(len(show_posts)))
print("Total other posts: {:,}".format(len(other_posts)))
Total 'Ask HN' posts: 1,744
Total 'Show HN' posts: 1,162
Total other posts: 17,194
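The classification rule relies on a case-insensitive prefix check, so a title that merely mentions "ask hn" later in the text is not counted. A minimal sketch of that rule follows; the helper name `post_type` and the example titles are made up for illustration.

```python
def post_type(title):
    """Classify a title as 'ask', 'show', or 'other' by its prefix (hypothetical helper)."""
    lowered = title.lower()  # lowercasing makes the prefix check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only.
examples = [
    'Ask HN: Best online course?',
    'SHOW HN: My side project',
    'Why I ask HN for advice',
]
print([post_type(title) for title in examples])  # ['ask', 'show', 'other']
```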
To make our task easier, we will implement a function that allows us to obtain the average value in a given column:
def avg(data, index):
    total = 0
    for row in data:
        total += int(row[index])
    return total / len(data)
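One edge case worth keeping in mind is that `avg` divides by `len(data)`, so an empty list raises `ZeroDivisionError`. The sketch below shows one possible guarded variant built on the standard library's `statistics.mean`; the name `safe_avg` and the toy rows are assumptions, and this is not the function used in the rest of the analysis.

```python
from statistics import mean

def safe_avg(data, index):
    """Average of an integer column; returns None for an empty list (hypothetical variant)."""
    if not data:
        return None  # avoids the ZeroDivisionError that total / len(data) would raise
    return mean(int(row[index]) for row in data)

# Toy rows: the values to average sit in column 1.
toy = [['a', '10'], ['b', '20'], ['c', '33']]
print(safe_avg(toy, 1))
print(safe_avg([], 1))
```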
Now that we have the separate lists, we will calculate the average number of comments each post type receives.
avg_ask_comments = avg(ask_posts, 4)
print("Average number of comments for 'Ask HN' posts: {:.2f}".format(avg_ask_comments))
Average number of comments for 'Ask HN' posts: 14.04
avg_show_comments = avg(show_posts, 4)
print("Average number of comments for 'Show HN' posts: {:.2f}".format(avg_show_comments))
Average number of comments for 'Show HN' posts: 10.32
avg_other_comments = avg(other_posts, 4)
print("Average number of comments for other posts: {:.2f}".format(avg_other_comments))
Average number of comments for other posts: 26.87
Other posts receive more comments on average, which is plausible because they cover many other topics. If we focus on the Ask HN and Show HN posts, we see that Ask HN posts receive more comments on average (14.04 versus 10.32).
avg_ask_points = avg(ask_posts, 3)
print("Average number of points for 'Ask HN' posts: {:.2f}".format(avg_ask_points))
Average number of points for 'Ask HN' posts: 15.06
avg_show_points = avg(show_posts, 3)
print("Average number of points for 'Show HN' posts: {:.2f}".format(avg_show_points))
Average number of points for 'Show HN' posts: 27.56
On average, the Show HN posts in our sample receive approximately 28 points, while the Ask HN posts receive approximately 15. In this sample, Show HN posts therefore receive more points on average.
Next, we'll implement a function that, given a dataset and the index of one of its columns, returns two dictionaries (contained in a tuple). The first contains the total of that column's values for each hour of the day, and the second contains the average value per post for each hour.
import datetime as dt
def amount_avg_by_hour(data, column):
    result_list = []
    for row in data:
        result_list.append([row[6], int(row[column])])
    date_format = "%m/%d/%Y %H:%M"
    counts_by_hour = {}
    amount_by_hour = {}
    for row in result_list:
        date = row[0]
        n_vals = row[1]
        time = dt.datetime.strptime(date, date_format).strftime("%H")
        if time not in counts_by_hour:
            counts_by_hour[time] = 0
            amount_by_hour[time] = 0
        counts_by_hour[time] += 1
        amount_by_hour[time] += n_vals
    avg_by_hour = {}
    for hour in amount_by_hour:
        avg_by_hour[hour] = round(amount_by_hour[hour] / counts_by_hour[hour], 2)
    return amount_by_hour, avg_by_hour
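The hour extraction at the core of this function can be checked in isolation. The sketch below applies the same format string to two `created_at` values taken from the rows printed earlier; the helper name `hour_of` is hypothetical.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"  # same pattern as date_format in the function above

def hour_of(created_at):
    """Return the two-digit hour ('00' to '23') of a created_at string (hypothetical helper)."""
    return dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")

print(hour_of("8/4/2016 11:52"))   # 11
print(hour_of("6/17/2016 0:01"))   # 00
```

Formatting with `%H` zero-pads the hour, which is why the dictionary keys sort correctly as strings ('00', '01', ..., '23').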
Now, let's determine if posts made at a certain time are more likely to attract comments. The following steps will help us to perform this analysis:

1. We'll calculate the number of posts made in each hour of the day along with the number of comments received.
2. We'll get the average number of comments posts get per hour.
# Gets the results for "Ask HN" posts
ask_comments = amount_avg_by_hour(ask_posts, 4) # Index 4 contains the number of comments
# Selects the total number of comments on Ask posts created during each hour of the day
ask_comments_by_hour = ask_comments[0]
print("ASK HN - NUMBER OF COMMENTS BY HOUR:")
print(*sorted(ask_comments_by_hour.items()), sep='\n')
ASK HN - NUMBER OF COMMENTS BY HOUR:
('00', 447)
('01', 683)
('02', 1381)
('03', 421)
('04', 337)
('05', 464)
('06', 397)
('07', 267)
('08', 492)
('09', 251)
('10', 793)
('11', 641)
('12', 687)
('13', 1253)
('14', 1416)
('15', 4477)
('16', 1814)
('17', 1146)
('18', 1439)
('19', 1188)
('20', 1722)
('21', 1745)
('22', 479)
('23', 543)
# Gets the results for "Show HN" posts
show_comments = amount_avg_by_hour(show_posts, 4)
# Selects the total number of comments on Show posts created during each hour of the day
show_comments_by_hour = show_comments[0]
print("SHOW HN - NUMBER OF COMMENTS BY HOUR:")
print(*sorted(show_comments_by_hour.items()), sep='\n')
SHOW HN - NUMBER OF COMMENTS BY HOUR:
('00', 487)
('01', 246)
('02', 127)
('03', 287)
('04', 247)
('05', 58)
('06', 142)
('07', 299)
('08', 165)
('09', 291)
('10', 297)
('11', 491)
('12', 720)
('13', 946)
('14', 1156)
('15', 632)
('16', 1084)
('17', 911)
('18', 962)
('19', 539)
('20', 612)
('21', 272)
('22', 570)
('23', 447)
# Selects the average number of comments per post by hour of the day
avg_ask_comments_by_hour = ask_comments[1]
print("AVG NUMBER OF COMMENTS BY HOUR:")
print(*sorted(avg_ask_comments_by_hour.items()), sep='\n')
AVG NUMBER OF COMMENTS BY HOUR:
('00', 8.13)
('01', 11.38)
('02', 23.81)
('03', 7.8)
('04', 7.17)
('05', 10.09)
('06', 9.02)
('07', 7.85)
('08', 10.25)
('09', 5.58)
('10', 13.44)
('11', 11.05)
('12', 9.41)
('13', 14.74)
('14', 13.23)
('15', 38.59)
('16', 16.8)
('17', 11.46)
('18', 13.2)
('19', 10.8)
('20', 21.52)
('21', 16.01)
('22', 6.75)
('23', 7.99)
To make it easier to identify the times with the highest values, we will sort the results by values and retrieve the five highest values in a format that is easier to read.
swap_avg_ask_by_hour = {v: k for k, v in avg_ask_comments_by_hour.items()}
sorted_swap_ask = sorted(swap_avg_ask_by_hour.items(), reverse=True) # Returns an ordered list of tuples
print("AVG NUMBER OF COMMENTS BY HOUR - ORDERED LIST")
print(*sorted_swap_ask, sep='\n')
AVG NUMBER OF COMMENTS BY HOUR - ORDERED LIST
(38.59, '15')
(23.81, '02')
(21.52, '20')
(16.8, '16')
(16.01, '21')
(14.74, '13')
(13.44, '10')
(13.23, '14')
(13.2, '18')
(11.46, '17')
(11.38, '01')
(11.05, '11')
(10.8, '19')
(10.25, '08')
(10.09, '05')
(9.41, '12')
(9.02, '06')
(8.13, '00')
(7.99, '23')
(7.85, '07')
(7.8, '03')
(7.17, '04')
(6.75, '22')
(5.58, '09')
# Shows the 5 hours with the highest average comments.
print("TOP 5 HOURS FOR 'ASK' POSTS COMMENTS")
for average, hour in sorted_swap_ask[:5]:  # 'average' avoids shadowing the avg() function
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), average)
    )
TOP 5 HOURS FOR 'ASK' POSTS COMMENTS
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
Ask HN posts receive the most comments on average when created at 15:00 (3:00 PM), with an average of 38.59 comments per post. That is approximately 62% more than the hour with the second-highest average, 02:00 (23.81 comments per post).
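The relative difference between the top two hours in the ranking above (15:00 and 02:00) can be verified with a one-line calculation. The helper name `percent_increase` is an assumption introduced for this sketch.

```python
def percent_increase(new, old):
    """Relative increase of new over old, in percent (hypothetical helper)."""
    return (new - old) / old * 100

# Averages taken from the top-5 list above: 15:00 versus 02:00 for Ask HN comments.
print("{:.1f}%".format(percent_increase(38.59, 23.81)))  # 62.1%
```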
# Selects the average number of comments per post by hour of the day
avg_show_comments_by_hour = show_comments[1]
print("AVG NUMBER OF COMMENTS BY HOUR:")
print(*sorted(avg_show_comments_by_hour.items()), sep='\n')
AVG NUMBER OF COMMENTS BY HOUR:
('00', 15.71)
('01', 8.79)
('02', 4.23)
('03', 10.63)
('04', 9.5)
('05', 3.05)
('06', 8.88)
('07', 11.5)
('08', 4.85)
('09', 9.7)
('10', 8.25)
('11', 11.16)
('12', 11.8)
('13', 9.56)
('14', 13.44)
('15', 8.1)
('16', 11.66)
('17', 9.8)
('18', 15.77)
('19', 9.8)
('20', 10.2)
('21', 5.79)
('22', 12.39)
('23', 12.42)
We sort the results by values and retrieve the five highest values.
swap_avg_show_by_hour = {v: k for k, v in avg_show_comments_by_hour.items()}
sorted_swap_show = sorted(swap_avg_show_by_hour.items(), reverse=True) # Returns an ordered list of tuples
print("AVG NUMBER OF COMMENTS BY HOUR - ORDERED LIST")
print(*sorted_swap_show, sep='\n')
AVG NUMBER OF COMMENTS BY HOUR - ORDERED LIST
(15.77, '18')
(15.71, '00')
(13.44, '14')
(12.42, '23')
(12.39, '22')
(11.8, '12')
(11.66, '16')
(11.5, '07')
(11.16, '11')
(10.63, '03')
(10.2, '20')
(9.8, '17')
(9.7, '09')
(9.56, '13')
(9.5, '04')
(8.88, '06')
(8.79, '01')
(8.25, '10')
(8.1, '15')
(5.79, '21')
(4.85, '08')
(4.23, '02')
(3.05, '05')
# Shows the 5 hours with the highest average comments.
print("TOP 5 HOURS FOR 'SHOW' POSTS COMMENTS")
for average, hour in sorted_swap_show[:5]:  # 'average' avoids shadowing the avg() function
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), average)
    )
TOP 5 HOURS FOR 'SHOW' POSTS COMMENTS
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post
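One caveat about the key/value swap used for sorting: a dictionary cannot hold duplicate keys, so hours that share the same average collapse into a single entry. That appears to be why the ordered Show HN list above has 23 entries instead of 24 (hours 17 and 19 both average 9.8). The top five are unaffected here, but a tie-safe alternative is to sort the (hour, average) pairs directly by value. A minimal sketch, where the name `top_hours` and the toy dictionary are assumptions for illustration:

```python
def top_hours(avg_by_hour, n=5):
    """Return the n (hour, average) pairs with the highest averages, keeping ties (hypothetical helper)."""
    return sorted(avg_by_hour.items(), key=lambda item: item[1], reverse=True)[:n]

# Toy averages containing a tie between hours '17' and '19'.
toy_avg = {'17': 9.8, '18': 15.77, '19': 9.8, '00': 15.71}
for hour, value in top_hours(toy_avg, n=4):
    print("{}:00: {:.2f} average comments per post".format(hour, value))
```

Both tied hours survive in the result, because no dictionary keyed by the average is ever built.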
We see that Show HN posts receive the most comments on average when created at 18:00 and at midnight, with averages of 15.77 and 15.71 comments per post respectively. The peak is approximately 24% above the mean of the other three hours in the top five (about 12.75 comments per post).
Based on the dataset documentation, the time zone used is US Eastern Time.
ask_points = amount_avg_by_hour(ask_posts, 3) # Index 3 contains the points
ask_points_by_hour = ask_points[0]
print("ASK HN - POINTS BY HOUR:")
print(*sorted(ask_points_by_hour.items()), sep='\n')
ASK HN - POINTS BY HOUR:
('00', 451)
('01', 700)
('02', 793)
('03', 374)
('04', 389)
('05', 552)
('06', 591)
('07', 361)
('08', 515)
('09', 329)
('10', 1102)
('11', 825)
('12', 782)
('13', 2062)
('14', 1282)
('15', 3479)
('16', 2522)
('17', 1941)
('18', 1741)
('19', 1513)
('20', 1151)
('21', 1721)
('22', 511)
('23', 581)
avg_points_ask_by_hour = ask_points[1]
print("AVG OF POINTS BY HOUR:")
print(*sorted(avg_points_ask_by_hour.items()), sep='\n')
AVG OF POINTS BY HOUR:
('00', 8.2)
('01', 11.67)
('02', 13.67)
('03', 6.93)
('04', 8.28)
('05', 12.0)
('06', 13.43)
('07', 10.62)
('08', 10.73)
('09', 7.31)
('10', 18.68)
('11', 14.22)
('12', 10.71)
('13', 24.26)
('14', 11.98)
('15', 29.99)
('16', 23.35)
('17', 19.41)
('18', 15.97)
('19', 13.75)
('20', 14.39)
('21', 15.79)
('22', 7.2)
('23', 8.54)
swap_avg_ask_by_hour = {v: k for k, v in avg_points_ask_by_hour.items()}
sorted_swap_ask = sorted(swap_avg_ask_by_hour.items(), reverse=True)
print("TOP 5 HOURS FOR 'ASK' POSTS POINTS")
for average, hour in sorted_swap_ask[:5]:  # 'average' avoids shadowing the avg() function
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), average)
    )
TOP 5 HOURS FOR 'ASK' POSTS POINTS
15:00: 29.99 average points per post
13:00: 24.26 average points per post
16:00: 23.35 average points per post
17:00: 19.41 average points per post
10:00: 18.68 average points per post
For Ask HN posts, the hour with the most points per post on average is 15:00, with an average of 29.99 points per post. That is approximately 24% more than the hour with the second-highest average, 13:00 (24.26 points per post).
show_points = amount_avg_by_hour(show_posts, 3) # Index 3 contains the points
show_points_by_hour = show_points[0]
print("SHOW HN - POINTS BY HOUR:")
print(*sorted(show_points_by_hour.items()), sep='\n')
SHOW HN - POINTS BY HOUR:
('00', 1173)
('01', 700)
('02', 340)
('03', 679)
('04', 386)
('05', 104)
('06', 375)
('07', 494)
('08', 519)
('09', 553)
('10', 681)
('11', 1480)
('12', 2543)
('13', 2438)
('14', 2187)
('15', 2228)
('16', 2634)
('17', 2521)
('18', 2215)
('19', 1702)
('20', 1819)
('21', 866)
('22', 1856)
('23', 1526)
avg_points_show_by_hour = show_points[1]
print("AVG OF POINTS BY HOUR:")
print(*sorted(avg_points_show_by_hour.items()), sep='\n')
AVG OF POINTS BY HOUR:
('00', 37.84)
('01', 25.0)
('02', 11.33)
('03', 25.15)
('04', 14.85)
('05', 5.47)
('06', 23.44)
('07', 19.0)
('08', 15.26)
('09', 18.43)
('10', 18.92)
('11', 33.64)
('12', 41.69)
('13', 24.63)
('14', 25.43)
('15', 28.56)
('16', 28.32)
('17', 27.11)
('18', 36.31)
('19', 30.95)
('20', 30.32)
('21', 18.43)
('22', 40.35)
('23', 42.39)
swap_avg_show_by_hour = {v: k for k, v in avg_points_show_by_hour.items()}
sorted_swap_show = sorted(swap_avg_show_by_hour.items(), reverse=True)
print("TOP 5 HOURS FOR 'SHOW' POSTS POINTS")
for average, hour in sorted_swap_show[:5]:  # 'average' avoids shadowing the avg() function
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), average)
    )
TOP 5 HOURS FOR 'SHOW' POSTS POINTS
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post
We see that Show HN posts receive the most points on average when created at 23:00 and at 12:00 (noon), with averages of 42.39 and 41.69 points per post respectively. These two hours are roughly 10% above the mean of the other three hours in the top five (about 38.17 points per post).
In this project, we analyzed Ask HN and Show HN posts to determine which type of post, and which posting hour, receives the most comments and points on average.
Based on our analysis, to maximize the number of comments a post receives, we recommend creating an Ask HN post between 15:00 and 16:00 (3:00 pm - 4:00 pm EST), when the highest average values were recorded.
To maximize the number of points a post receives, we recommend creating a Show HN post between 23:00 and 00:00 (11:00 pm - 12:00 am EST), when the highest average values were recorded.
However, it should be noted that the dataset we analyzed excluded posts that received no comments.