HACKER NEWS POST COMPARISON PROJECT
The project aims to compare two types of posts (Ask and Show) on Hacker News in terms of the comments they receive.
The dataset used in this project was downloaded from here; however, the number of rows has been reduced from almost 300k to nearly 20k by removing posts with no comments and then randomly sampling from the remaining ones.
For removing unwanted data from a dataset, please refer to my project here.
For more about Hacker News (HN), please visit here.
from csv import reader
# Read the whole CSV file into a list of rows (each row is a list of strings)
with open("hacker_news.csv") as opened_file_hn:
    read_file = reader(opened_file_hn)
    hn = list(read_file)
print(hn[:3])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
Column Headers | Description |
---|---|
id: | the unique identifier from Hacker News for the post |
title: | the title of the post |
url: | the URL that the post links to, if the post has a URL |
num_points: | the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
num_comments: | the number of comments on the post |
author: | the username of the person who submitted the post |
created_at: | the date and time of the post's submission |
To analyze the dataset without running into inconsistencies, we need to remove the first row, as it contains only the column headers.
headers = hn[0]
print(headers)
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
hn = hn[1:]
print(hn[:3])
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
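The header split above can also be done in one step with sequence unpacking. The following is a minimal sketch, not the notebook's own method: it uses a small in-memory CSV so that it runs without `hacker_news.csv`, and the names `sample_csv`, `sample_headers`, and `sample_rows` are illustrative assumptions.

```python
import io
from csv import reader

# Hypothetical two-row sample that mimics the layout of hacker_news.csv
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

# The first row goes to sample_headers; all remaining rows go to sample_rows
sample_headers, *sample_rows = reader(io.StringIO(sample_csv))

print(sample_headers)
print(len(sample_rows))
```

Unpacking keeps the header available for later reference while guaranteeing that the data rows never include it.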
In line with the aim of the project, we first identify the posts whose titles begin with either Ask HN or Show HN.
ask_posts = []    # posts whose titles start with "ask hn"
show_posts = []   # posts whose titles start with "show hn"
other_posts = []  # posts whose titles start with neither "ask hn" nor "show hn"
for row in hn:
    title = row[1].lower()  # lowercase so the match is case-insensitive
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744
1162
17194
print(ask_posts[:3])
print("\n")
print(show_posts[:3])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']] [['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]
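The title check used to split the posts can be isolated in a small function, which makes it easy to try on individual titles. This is only a sketch: `classify_title` is a hypothetical helper name, and the second title below has been upper-cased on purpose to show why the title is lowercased before matching.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' depending on how the title starts."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


print(classify_title("Ask HN: How to improve my personal website?"))
print(classify_title("SHOW HN: Something pointless I made"))
print(classify_title("Interactive Dynamic Video"))
```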
Calculation of the average number of comments - Let's determine which of the two post types receives more comments on average.
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])  # num_comments is the column at index 4
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments per post for Ask HN is " + str(round(avg_ask_comments, 2)))
print("\n")
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])  # num_comments is the column at index 4
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments per post for Show HN is " + str(round(avg_show_comments, 2)))
Average number of comments per post for Ask HN is 14.04
Average number of comments per post for Show HN is 10.32
First finding - As the results above show, posts beginning with "Ask HN" receive more comments on average (14.04 per post) than those beginning with "Show HN" (10.32 per post).
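Since the same averaging is done for both post types, it could be wrapped in one function. The sketch below assumes a hypothetical helper named `average_comments` and runs on the three Ask HN rows printed earlier, so its result differs from the full-dataset average.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three Ask HN rows taken from the printed sample above
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_ask_posts))  # (6 + 29 + 1) / 3 = 12.0
```

Guarding against an empty list avoids a `ZeroDivisionError` if a category happens to contain no posts.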
Consideration of the time element - Building on the finding above, let's look at the time each Ask HN post was created to determine which hours of the day are more likely to attract comments.
import datetime as dt  # needed to parse the created_at timestamps
result_list = []  # list of [created_at, num_comments] pairs
for row in ask_posts:
    date_created = row[6]  # created_at is the seventh column (index 6)
    number_of_comment = int(row[4])
    result_list.append([date_created, number_of_comment])
counts_by_hour = {}    # number of ask posts created during each hour of the day
comments_by_hour = {}  # total number of comments received by ask posts created in each hour
for row in result_list:
    date_created = row[0]
    dt_object = dt.datetime.strptime(date_created, "%m/%d/%Y %H:%M")  # parse the timestamp string
    hour = dt_object.strftime("%H")  # extract the hour as a zero-padded string
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
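To make the parsing step concrete, here is a small worked example on two `created_at` values from the rows shown earlier. The variable names are illustrative only. It shows that `strptime` accepts months, days, and hours without leading zeros, while `strftime("%H")` always returns a two-digit hour string.

```python
import datetime as dt

created_at = "8/4/2016 11:52"
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
print(parsed)                 # 2016-08-04 11:52:00
print(parsed.strftime("%H"))  # 11

# A single-digit hour in the input is zero-padded in the output
early = dt.datetime.strptime("8/16/2016 9:55", "%m/%d/%Y %H:%M")
print(early.strftime("%H"))   # 09
```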
Finding the hours with the most comments - With the two dictionaries above in place, we can calculate the average number of comments per post for each hour of the day.
avg_by_hour = []
for hour in counts_by_hour:
    # both dictionaries share the same keys, so look up the total by hour
    avg_by_hour.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour], 2)])
print(avg_by_hour)
[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]
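The whole count, total, and average pipeline can be traced on a handful of rows. The sketch below is an alternative formulation using `collections.defaultdict`, which is not what the notebook itself uses. The first three pairs come from the Ask HN rows printed earlier; the fourth is a made-up row added only so that one hour contains two posts.

```python
from collections import defaultdict
import datetime as dt

# [created_at, num_comments] pairs; the last one is made up for illustration
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 10:14", 1],
    ["1/1/2016 13:05", 11],  # hypothetical row, not from the dataset
]

sample_counts = defaultdict(int)    # posts per hour
sample_comments = defaultdict(int)  # total comments per hour
for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

sample_avg = {
    hour: round(sample_comments[hour] / sample_counts[hour], 2)
    for hour in sample_counts
}
print(sample_avg)  # {'09': 6.0, '13': 20.0, '10': 1.0}
```

With `defaultdict(int)`, missing keys start at 0, so the `if`/`else` branch for first-time hours is no longer needed.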
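Uncluttering - Hooray! We have the result. But wait! Doesn't it look a little messy to you? The list is in no particular order, so let's swap each pair to put the average first, sort in descending order, and print the top five hours.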
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])  # [average, hour] so sorting uses the average
print(swap_avg_by_hour)
[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print("Top 5 Hours for Ask Posts Comments")
for avg, hour in sorted_swap[:5]:
    dt_object = dt.datetime.strptime(hour, "%H")
    dt_string = dt_object.strftime("%H:%M")
    print("{time}: {comm:.2f} average comments per post".format(time=dt_string, comm=avg))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
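Swapping the columns is one way to sort by the average. An alternative sketch is to pass a `key` function to `sorted`, which leaves the `[hour, average]` pairs as they are. The list below is a five-element excerpt of the averages printed above, and `sample_avg_by_hour` and `top_three` are illustrative names.

```python
# Excerpt of the [hour, average] pairs computed earlier
sample_avg_by_hour = [['09', 5.58], ['13', 14.74], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), highest first, and keep three
top_three = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:3]

for hour, avg in top_three:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Because the hour strings are already zero-padded, appending `:00` gives the same `HH:MM` format without a second round of `strptime` and `strftime`.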