from csv import reader

# Open the dataset and read it into a list of lists.
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
Here's what the columns represent:
Column Name | What it represents |
---|---|
id | The unique identifier from Hacker News for the post |
title | The title of the post |
url | The URL that the post links to, if the post has a URL |
num_points | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
num_comments | The number of comments that were made on the post |
author | The username of the person who submitted the post |
created_at | The date and time the post was made (the time zone is Eastern Time in the US) |
headers = hn[0]
hn = hn[1:] #we need to remove the header to analyse our data.
print(headers)
print('\n')
# Display the first five rows of the data set after removing the header.
for row in hn[:5]:
    print(row)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']
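The code below refers to columns by position (for example `row[4]` for `num_comments`). As a small sketch of an alternative, we could build a name-to-index lookup from the header row. The name `col_index` is a hypothetical helper that does not appear elsewhere in this notebook, and the sample row is copied from the output above.

```python
# Header row and one data row, copied from the output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Hypothetical lookup: column name -> position of that column in each row.
col_index = {name: position for position, name in enumerate(headers)}

# Values are read from the CSV as strings, so counts must be converted to int.
title = sample_row[col_index['title']]
num_comments = int(sample_row[col_index['num_comments']])
print(title, num_comments)
```

With a lookup like this, `row[col_index['created_at']]` reads the same value as `row[6]`.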
We're specifically interested in posts whose titles begin with either 'Ask HN' or 'Show HN'. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a couple examples:
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a couple of examples:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row) # creating the list of ask hn posts
    elif title.lower().startswith('show hn'):
        show_posts.append(row) # creating the list of show hn posts
    else:
        other_posts.append(row)
print('Number of ask hn posts:',len(ask_posts))
print('\n')
print('Below are two instances of Ask HN posts:')
print('\n')
print(ask_posts[:2])
print('\n')
print('Number of show hn posts:',len(show_posts))
print('\n')
print('Below are two instances of Show HN posts:')
print('\n')
print(show_posts[:2])
print('\n')
print('Number of other posts:',len(other_posts))
Number of ask hn posts: 1744
Below are two instances of Ask HN posts:
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]
Number of show hn posts: 1162
Below are two instances of Show HN posts:
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]
Number of other posts: 17194
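The prefix test used to split the posts can also be isolated in a small function, which makes it easy to check on a few titles before running it over the whole data set. This is only a sketch: `classify_title` is a hypothetical name, and the sample titles are taken from the outputs above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix.

    Hypothetical helper; the comparison is case-insensitive,
    matching the title.lower().startswith(...) checks in the loop above.
    """
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
labels = [classify_title(title) for title in sample_titles]
print(labels)
```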
# finding total comments in ask posts
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
# calculating average number of comments on ask posts
avg_ask_comments = total_ask_comments/len(ask_posts)
print('Average number of ask post comments:',avg_ask_comments)
# finding total comments in show posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
# calculating average number of comments on show posts
avg_show_comments = total_show_comments/len(show_posts)
print('Average number of show post comments:',avg_show_comments)
Average number of ask post comments: 14.038417431192661
Average number of show post comments: 10.31669535283993
The output of the code cell above shows that Ask HN posts receive about 14 comments on average, while Show HN posts receive about 10. This is plausible, since a question invites replies more directly than a post that showcases a project.
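The two averages are computed with the same steps, so the calculation could be wrapped in one function. The sketch below assumes a hypothetical helper named `average_comments` and uses the two Ask HN rows printed earlier as sample input, so its result is not the full-data average.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column for a list of rows.

    Hypothetical helper. Returns 0.0 for an empty list so that
    an empty category does not raise ZeroDivisionError.
    """
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two Ask HN rows copied from the earlier output (6 and 29 comments).
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
sample_average = average_comments(sample_posts)
print(sample_average)
```

Calling the same function on `ask_posts` and `show_posts` would reproduce the two averages above without repeating the loop.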
Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.
Next, we'll determine if ask posts created at a certain time are more likely to attract comments.
import datetime as dt
result_list = []
for row in ask_posts:
    date = row[6]
    comments = int(row[4])
    result_list.append([date,comments])
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    hour_dt = dt.datetime.strptime(row[0],'%m/%d/%Y %H:%M')
    hour = hour_dt.strftime('%H')
    comments = row[1]
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
print(counts_by_hour)
print('\n')
print(comments_by_hour)
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
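The `if hour in counts_by_hour ... else` branch can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below is one possible variant, not the approach used in this notebook. The names `counts` and `totals` are hypothetical, the first two sample rows use timestamps and comment counts from the Ask HN output above, and the third row is made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs. The last pair is a made-up example.
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['8/4/2016 9:05', 4],
]

counts = defaultdict(int)  # hour -> number of posts
totals = defaultdict(int)  # hour -> total number of comments

for created_at, num_comments in sample:
    # Parse the timestamp, then keep only the zero-padded hour, e.g. '09'.
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts[hour] += 1
    totals[hour] += num_comments

print(dict(counts))
print(dict(totals))
```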
avg_by_hour = []
for hour in comments_by_hour:
    avg_comments = comments_by_hour[hour]/counts_by_hour[hour]
    avg_by_hour.append([hour,avg_comments])
print(avg_by_hour)
[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour,reverse = True)
print('\n')
print('Top 5 Hours for Ask Posts Comments:')
print('\n')
for avg,hours in sorted_swap[:5]:
    hour = dt.datetime.strptime(hours,'%H')
    hour = hour.strftime('%H:%M')
    string = '{hours} : {comments:.2f} average comments per post'.format(hours = hour,comments = avg)
    print(string)
[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

Top 5 Hours for Ask Posts Comments:

15:00 : 38.59 average comments per post
02:00 : 23.81 average comments per post
20:00 : 21.52 average comments per post
16:00 : 16.80 average comments per post
21:00 : 16.01 average comments per post
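Swapping the columns before sorting works, but `sorted` also accepts a `key` argument, so the `[hour, average]` pairs can be sorted directly by their second element. The sketch below uses a handful of pairs copied from the `avg_by_hour` output above, and `top_five` is a hypothetical name.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort by the average (index 1), highest first, and keep the top five.
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    print('{}:00 : {:.2f} average comments per post'.format(hour, avg))
```

This avoids the intermediate `swap_avg_by_hour` list and keeps each pair in its original `[hour, average]` order.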
Ask HN posts receive more comments on average than Show HN posts. This is to be expected, as questions are more likely to draw replies than posts that showcase a project.
Ask HN posts average about 14 comments. Broken down by hour, the average peaks between 3 pm and 4 pm at about 38.6 comments per post. That is roughly 1.6 times the second-highest average of about 23.8, recorded at night between 2 am and 3 am. The afternoon peak could be due to a mix of younger and older users being active on the web after lunch time. The late-night peak could come from younger users, who are mostly active on the internet at that time.
The pattern is reversed if we convert the times from EST to IST, the time zone where I live. The peak then falls around 1:30 am, and the second-highest average falls around 12:30 pm, which is just past lunch time here. This reversal could be due to the larger population here, much of it young and, as elsewhere, mostly active at night.
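The conversion from EST to IST can be checked with fixed UTC offsets from the standard library. This is a sketch under one stated assumption: every timestamp is treated as EST (UTC-5), so daylight saving time is ignored, and IST is UTC+5:30. The function name `est_hour_to_ist` is hypothetical.

```python
import datetime as dt

# Assumption: fixed offsets, ignoring daylight saving time in the US.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), 'IST')


def est_hour_to_ist(hour):
    """Convert an hour string such as '15' (EST) to 'HH:MM' in IST."""
    # The date is arbitrary; only the time of day matters here.
    est_time = dt.datetime(2016, 1, 15, int(hour), 0, tzinfo=EST)
    return est_time.astimezone(IST).strftime('%H:%M')


# The top five hours from the output above.
for hour in ['15', '02', '20', '16', '21']:
    print(hour + ':00 EST ->', est_hour_to_ist(hour), 'IST')
```

For posts created while daylight saving time was in effect in the US, the difference would be 9.5 hours instead of 10.5, so the IST times would fall one hour earlier.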