The word *Hacker* can mean many things. Hackers can hack with good intentions or malicious ones.
For this project we will focus on a community of hackers with good intentions. We will look at a popular technology platform, Hacker News. The site works much like other online forums such as Reddit, where users submit posts and receive answers, comments, votes and feedback.
We removed all submissions without comments, bringing the total to approximately 20,000 rows. Within the dataset, we are specifically interested in posts whose titles begin with either *Ask HN* or *Show HN*.
*Ask HN* - users ask the Hacker News community specific questions
*Show HN* - users post to show the Hacker News community projects or anything interesting
Because Hacker News is extremely popular in hacker culture and among tech startups, a user's post has a chance of receiving hundreds of thousands of visitors.
In this analysis, we will compare *Ask HN* and *Show HN* posts to determine whether either type receives more comments on average, and whether posts created at a certain time receive more comments on average.
### import csv file, open, read ###
import csv
opened_file = open('hacker_news.csv')
read_file = csv.reader(opened_file)
hn = list(read_file)
### iterate through hn, print first 5 rows ###
for i in hn[:5]:
    print(i)
    print('\n')
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
To continue, we must separate the column header row from the rest of the data:
headers = hn[0]
hn = hn[1:]
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
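As an alternative to opening the file and slicing the list by hand, the read-and-split step can be sketched with a context manager and `next()`. The helper name `read_rows` and the in-memory sample are assumptions made for illustration; the two data rows are copied from the output shown earlier.

```python
import csv
import io

def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    reader = csv.reader(file_obj)
    header = next(reader)   # the first row holds the column names
    rows = list(reader)     # everything after it is data
    return header, rows

# Small in-memory stand-in for hacker_news.csv so the sketch runs on its own.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

sample_headers, sample_rows = read_rows(sample_csv)
print(sample_headers)
print(len(sample_rows), 'data rows')
```

With the real file, the same helper could be called inside `with open('hacker_news.csv', newline='') as f:` so that the file is closed automatically once the rows are read.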
### first 5 rows of data ###
hn[0:5]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
The column headers that we will use for our analysis are: title, num_comments and created_at.
Create three empty lists, each containing a different type of post (Ask HN, Show HN, Other).
Then iterate over each row and append it to the appropriate list based on its title.
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('Total Number Ask HN posts:', len(ask_posts))
print('Total Number Show HN posts:', len(show_posts))
print('Total Number Other posts:', len(other_posts))
Total Number Ask HN posts: 1744
Total Number Show HN posts: 1162
Total Number Other posts: 17194
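The prefix check used above can be isolated in a small function, which makes the case-insensitive matching easy to try on a few titles. This is only a sketch: `post_type` is a hypothetical helper name, and the first two sample titles are made up for illustration.

```python
def post_type(title):
    """Classify a title as 'ask', 'show' or 'other' by its prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How do you learn a new language?',  # made-up title
    'SHOW HN: My weekend project',               # made-up title, upper case on purpose
    'Interactive Dynamic Video',                 # title taken from the dataset
]
labels = [post_type(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```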
We can now use the lists we created above to calculate the average number of comments for Ask posts, Show posts and other posts.
total_ask_comments = 0
### Ask HN Avg ###
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print('Total number of Ask Comments:', total_ask_comments)
print('Avg number of Ask Comments:', avg_ask_comments)
print('\n')
print('Clean Avg number of Ask Comments: {:.2f}'.format(avg_ask_comments))
Total number of Ask Comments: 24483
Avg number of Ask Comments: 14.038417431192661
Clean Avg number of Ask Comments: 14.04
total_show_comments = 0
### Show HN Avg ###
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print('Total number of Show Comments:', total_show_comments)
print('Avg number of Show Comments:', avg_show_comments)
print('\n')
print('Clean Avg number of Show Comments: {:.2f}'.format(avg_show_comments))
Total number of Show Comments: 11988
Avg number of Show Comments: 10.31669535283993
Clean Avg number of Show Comments: 10.32
total_other_comments = 0
### Other Posts Avg ###
for row in other_posts:
    num_comments = int(row[4])
    total_other_comments += num_comments
avg_other_comments = total_other_comments / len(other_posts)
print('Total number of Other Comments:', total_other_comments)
print('Avg number of Other Comments:', avg_other_comments)
print('\n')
print('Clean Avg number of Other Comments: {:.2f}'.format(avg_other_comments))
Total number of Other Comments: 462055
Avg number of Other Comments: 26.8730371059672
Clean Avg number of Other Comments: 26.87
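The three loops above differ only in the list they read, so the calculation could be factored into one helper. The sketch below assumes a hypothetical function name, `average_comments`, and uses three abbreviated rows from the dataset (the URL field is left blank for brevity, since the function does not read it).

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Abbreviated rows: id, title, url (blanked), num_points, num_comments, author, created_at
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', '', '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", '', '2', '1', 'vezycash', '6/23/2016 22:20'],
]
sample_average = average_comments(sample_posts)
print('{:.2f}'.format(sample_average))  # 21.00
```

Calling it as `average_comments(ask_posts)`, `average_comments(show_posts)` and `average_comments(other_posts)` would be expected to reproduce the three averages printed above.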
We can now see that other posts receive more comments on average (26.87) than the Ask HN and Show HN averages added together (14.04 + 10.32 = 24.36).
For our analysis, however, we are interested in the difference between the Ask HN average and the Show HN average. The results above show that Ask HN posts receive more comments on average than Show HN posts.
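As a quick arithmetic check, the averages can be recomputed directly from the totals and post counts printed above. The pooled Ask-plus-Show figure is an extra comparison added here for illustration, not something the analysis itself computes.

```python
# (total comments, number of posts) as printed in the outputs above
totals = {
    'ask': (24483, 1744),
    'show': (11988, 1162),
    'other': (462055, 17194),
}

averages = {name: comments / posts for name, (comments, posts) in totals.items()}
for name, value in averages.items():
    print('{}: {:.2f}'.format(name, value))

# Average over Ask HN and Show HN posts pooled into one group
pooled_ask_show = (24483 + 11988) / (1744 + 1162)
print('ask + show pooled: {:.2f}'.format(pooled_ask_show))
```

Whether the two averages are added together or the posts are pooled, the figure for other posts stays higher.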
Now it's time to determine whether posts created at a certain time are more likely to attract comments.
We'll work with the data in the created_at column to:
Calculate the number of ask posts and show posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts and show posts receive by hour created.
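Both steps depend on extracting the hour from a created_at string. A minimal sketch of that parsing step, using a timestamp copied from the first data row, is:

```python
import datetime as dt

sample_created_at = '8/4/2016 11:52'  # created_at value from the first data row

# %m/%d/%Y matches month/day/four-digit year; %H:%M matches a 24-hour time
parsed = dt.datetime.strptime(sample_created_at, '%m/%d/%Y %H:%M')
print(parsed.hour)  # 11
```

The `hour` attribute is an integer from 0 to 23, which makes it convenient as a dictionary key when grouping posts.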
import datetime as dt
result_list = []
### Ask HN ###
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comments_count = row[1]
    date_string = row[0]
    date_created = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M")
    hour_created = date_created.hour
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count
print('Ask HN posts created by hour:', counts_by_hour)
print('\n')
print('Ask HN comments post per hour:', comments_by_hour)
Ask HN posts created by hour: {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Ask HN comments post per hour: {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
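The if/else bookkeeping above can also be written with `collections.defaultdict`, which supplies a starting value of 0 for any hour not seen yet. This is a sketch of an alternative, not the approach used in this analysis; `tally_by_hour` is a hypothetical name and the last sample pair is made up so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs):
    """Count posts and sum comments per hour from (created_at, num_comments) pairs."""
    counts = defaultdict(int)    # missing hours start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

sample_pairs = [
    ('8/4/2016 11:52', 52),   # from the dataset
    ('1/26/2016 19:30', 10),  # from the dataset
    ('6/17/2016 0:01', 1),    # from the dataset
    ('8/5/2016 11:05', 8),    # made-up pair that shares hour 11
]
sample_counts, sample_comments = tally_by_hour(sample_pairs)
print(sample_counts)
print(sample_comments)
```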
Next, we create a list of lists containing the hours during which posts were created and the average number of comments those posts received.
avg_by_hour = []
### Avg comments posts receive ###
for key in counts_by_hour:
    avg_posts = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg_posts])
avg_by_hour
[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
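To make the division concrete, here is the same calculation for three of the hours, using the counts and comment totals printed earlier for Ask HN posts. The dictionary names are chosen for this sketch only.

```python
# Post counts and comment totals for three hours, copied from the Ask HN output
hour_counts = {15: 116, 2: 58, 20: 80}
hour_comments = {15: 4477, 2: 1381, 20: 1722}

# Average comments per post = total comments in the hour / posts in the hour
hour_averages = {hour: hour_comments[hour] / hour_counts[hour] for hour in hour_counts}

for hour, average in hour_averages.items():
    print('{:02d}:00 -> {:.4f}'.format(hour, average))
```

The 15:00 average stands out because 4477 comments are spread over only 116 posts, so a few very active threads in that hour could account for much of the gap.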
We now have the results we need, but the format above makes it difficult to read and to identify the hours with the highest values. We will swap the elements so that the average number of comments comes first and the hour second.
Let's then sort the list of lists and print the 5 highest values in a more presentable format.
swap_avg_by_hour = []
### swapping elements and appending ###
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]
### time to sort swap_avg_by_hour descending order ###
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Ask Posts Comments')
for row in sorted_swap[:5]:
    hour_formatted = dt.datetime.strptime(str(row[1]), '%H')
    hour_formatted = hour_formatted.strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour_formatted, row[0]))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
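The swap step is one way to sort by the average; another is to pass a `key` function to `sorted` and leave the pairs as [hour, average]. The sketch below uses a subset of the Ask HN averages computed above, and the variable names are chosen for illustration.

```python
# [hour, average comments] pairs copied from the Ask HN averages above (subset)
hour_avg_pairs = [
    [9, 5.5777777777777775],
    [13, 14.741176470588234],
    [16, 16.796296296296298],
    [15, 38.5948275862069],
    [21, 16.009174311926607],
    [20, 21.525],
    [2, 23.810344827586206],
]

# Sort on the second element (the average), highest first, and keep five
top_five = sorted(hour_avg_pairs, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top_five:
    print('{:02d}:00: {:.2f} average comments per post'.format(hour, average))
```

Because the hour is a plain integer, the `{:02d}` format code produces the zero-padded hour without a round trip through `strptime` and `strftime`.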
Now we repeat the same steps to find the top 5 hours for Show HN post comments.
import datetime as dt
result_list = []
### Show HN ###
for row in show_posts:
    result_list.append([row[6], int(row[4])])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    comments_count = row[1]
    date_string = row[0]
    date_created = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M")
    hour_created = date_created.hour
    if hour_created in counts_by_hour:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments_count
    else:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments_count
print('Show HN posts created by hour:', counts_by_hour)
print('\n')
print('Show HN comments post per hour:', comments_by_hour)
Show HN posts created by hour: {14: 86, 22: 46, 18: 61, 7: 26, 20: 60, 5: 19, 16: 93, 19: 55, 15: 78, 3: 27, 17: 93, 6: 16, 2: 30, 13: 99, 8: 34, 21: 47, 4: 26, 11: 44, 12: 61, 23: 36, 9: 30, 1: 28, 10: 36, 0: 31}
Show HN comments post per hour: {14: 1156, 22: 570, 18: 962, 7: 299, 20: 612, 5: 58, 16: 1084, 19: 539, 15: 632, 3: 287, 17: 911, 6: 142, 2: 127, 13: 946, 8: 165, 21: 272, 4: 247, 11: 491, 12: 720, 23: 447, 9: 291, 1: 246, 10: 297, 0: 487}
avg_by_hour = []
### Avg comments posts receive ###
for key in counts_by_hour:
    avg_posts = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg_posts])
avg_by_hour
[[14, 13.44186046511628], [22, 12.391304347826088], [18, 15.770491803278688], [7, 11.5], [20, 10.2], [5, 3.0526315789473686], [16, 11.655913978494624], [19, 9.8], [15, 8.102564102564102], [3, 10.62962962962963], [17, 9.795698924731182], [6, 8.875], [2, 4.233333333333333], [13, 9.555555555555555], [8, 4.852941176470588], [21, 5.787234042553192], [4, 9.5], [11, 11.159090909090908], [12, 11.80327868852459], [23, 12.416666666666666], [9, 9.7], [1, 8.785714285714286], [10, 8.25], [0, 15.709677419354838]]
swap_avg_by_hour = []
### swapping elements and appending ###
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
[[13.44186046511628, 14], [12.391304347826088, 22], [15.770491803278688, 18], [11.5, 7], [10.2, 20], [3.0526315789473686, 5], [11.655913978494624, 16], [9.8, 19], [8.102564102564102, 15], [10.62962962962963, 3], [9.795698924731182, 17], [8.875, 6], [4.233333333333333, 2], [9.555555555555555, 13], [4.852941176470588, 8], [5.787234042553192, 21], [9.5, 4], [11.159090909090908, 11], [11.80327868852459, 12], [12.416666666666666, 23], [9.7, 9], [8.785714285714286, 1], [8.25, 10], [15.709677419354838, 0]]
### sort swap_avg_by_hour descending order ###
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
print('Top 5 Hours for Show Posts Comments')
for row in sorted_swap[:5]:
    hour_formatted = dt.datetime.strptime(str(row[1]), '%H')
    hour_formatted = hour_formatted.strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour_formatted, row[0]))
Top 5 Hours for Show Posts Comments
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post
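Since the Ask HN and Show HN steps are identical apart from the input list, the whole pipeline could be wrapped in one function and called twice. The following is a sketch under that assumption: `top_hours` is a hypothetical name, and the sample rows are abbreviated (only num_comments and created_at are filled in), with the last one made up so that an hour repeats.

```python
import datetime as dt

def top_hours(posts, n=5, date_index=6, comments_index=4):
    """Return the n (hour, average comments) pairs with the highest averages."""
    counts = {}
    comments = {}
    for row in posts:
        hour = dt.datetime.strptime(row[date_index], '%m/%d/%Y %H:%M').hour
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + int(row[comments_index])
    averages = [(hour, comments[hour] / counts[hour]) for hour in counts]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)[:n]

# Abbreviated sample rows: only num_comments (index 4) and created_at (index 6) are filled in
sample_rows = [
    ['', '', '', '', '52', '', '8/4/2016 11:52'],
    ['', '', '', '', '10', '', '1/26/2016 19:30'],
    ['', '', '', '', '1', '', '6/17/2016 0:01'],
    ['', '', '', '', '8', '', '8/5/2016 11:05'],  # made-up row sharing hour 11
]
sample_top = top_hours(sample_rows, n=2)
print(sample_top)  # [(11, 30.0), (19, 10.0)]
```

With the real data, `top_hours(ask_posts)` and `top_hours(show_posts)` would be expected to give the same rankings as the two printed lists, without the second copy of the code.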
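We now have the hours of the day at which a user can expect the most comments on an Ask HN post and on a Show HN post.
Ask HN: 15:00 (38.59 average comments per post), 02:00 (23.81), 20:00 (21.52), 16:00 (16.80), 21:00 (16.01)
Show HN: 18:00 (15.77 average comments per post), 00:00 (15.71), 14:00 (13.44), 23:00 (12.42), 22:00 (12.39)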
We can see that Ask HN posts receive more comments per post on average than Show HN posts. One possible explanation is that users comment more on question-based posts for the intellectual gratification of helping another user understand the problem they are facing. Problems can be solved in many different ways, which can often lead to debate among commenters and hence more comments.