In this project, we'll work with a data set of submissions to the popular technology site Hacker News, and along the way I'll have to get the facts for my friend.
**My friend had too many complaints about his submissions on the Hacker News site. He argues that every post he submitted got fewer comments than other people's posts. As a data analyst, I had to look into the site and come up with the facts.** **We had a long phone conversation before he believed me.**
This is how our conversation went, but before that, you can click here to get the data set (note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions).
Friend: Hello, how are you doing?
Me: I am doing well. Tell me, I hope this time it's not about the apps.
Friend: Not really. I don't trust this site. Imagine, ever since I started submitting my posts, the number of comments I receive is so low compared to other posts, which end up with countless comments.
Me: That's funny. And which site are you talking about, if you don't mind?
Friend: Hacker News. I can't trust it anymore.
Me: You don't have to say that yet. Let me look into the site so that we can get the facts. You will have to hold on for at most 45 minutes.
Friend: It's okay.
Without further ado, let's import and read our data set, which is stored in a CSV file named hacker_news.csv.
from csv import reader

opened_file = open('hacker_news.csv')  # we open our file
reader_file = reader(opened_file)      # we read it here
hn = list(reader_file)  # we convert it to a list of lists and store the result in a variable named hn

# let's display the first five rows
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
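Notice in the output above that titles containing commas (like the "Florida DJs" one) still occupy a single column; `csv.reader` handles the quoting for us. A tiny self-contained sketch of that behaviour (the sample row below is made up for illustration):

```python
from csv import reader
from io import StringIO

# a two-row sample in the same shape as hacker_news.csv
sample = StringIO(
    'id,title,url,num_points,num_comments,author,created_at\n'
    '1,"Ask HN: commas, handled?",,5,3,someone,8/4/2016 11:52\n'
)
rows = list(reader(sample))
print(rows[1][1])  # the title is one field despite the embedded comma
```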
With the header row in place, we may end up with errors while analysing the data, and my friend may not get the actual facts, so in the cell below we remove the first row (the header row) from hn.
Have a look:
headers = hn[0]
hn = hn[1:]
# let's now display the headers and the first five rows of hn for verification
print(headers)
print("\n") # this creates a space
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
**Since we have removed the headers from hn, we can now filter our data. But before that, I have to get back to my friend to learn what type of posts he has been submitting.**
Me: Can you hear me, friend?
Friend: Yes, I can hear you.
Me: Can you recall the kind of posts you have been submitting?
Friend: Ask posts.
Me: It's okay, I'll get back to you.
For now the response is somewhat promising, since in the cell below we will be concerned only with posts beginning with Ask HN or Show HN, and we'll create new lists of lists containing just the data for those titles.
# let's now create three empty lists
ask_posts = []    # this will hold posts titled Ask HN
show_posts = []   # this will hold posts titled Show HN
other_posts = []  # posts whose titles are neither Ask HN nor Show HN

# let's now loop through each row in hn
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

# let's check the number of posts in each list
print("posts in ask hn: ", len(ask_posts))
print("posts in show hn: ", len(show_posts))
print("posts in other posts: ", len(other_posts))
posts in ask hn: 1744 posts in show hn: 1162 posts in other posts: 17194
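The classification above hinges on `str.lower().startswith(...)`, which makes the match case-insensitive. A tiny self-contained sketch of the same idea (the sample titles below are made up for illustration):

```python
titles = ["Ask HN: How to learn Python?",
          "Show HN: My side project",
          "ask hn: lowercase still counts",
          "Interactive Dynamic Video"]

ask, show, other = [], [], []
for title in titles:
    # lower() first, so "Ask HN" and "ask hn" are treated the same
    if title.lower().startswith("ask hn"):
        ask.append(title)
    elif title.lower().startswith("show hn"):
        show.append(title)
    else:
        other.append(title)

print(len(ask), len(show), len(other))  # 2 1 1
```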
Let's check a few rows in ask posts and show posts.
**1. Ask posts**
print(ask_posts[:2])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']]
**2. Show posts**
print(show_posts[:2])
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']]
Remember, our aim is to determine which type of post receives more comments. To do this, we will compute the average number of comments for each type of post, i.e. ask posts and show posts.
**i) Ask posts**
# working on the average number of comments in ask posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = row[4]
    total_ask_comments += int(num_comments)

# let's now compute the average
avg_ask_comments = total_ask_comments / len(ask_posts)
print("sum total of comments in ask posts: ", total_ask_comments)
print("Average number of comments in ask posts: ", avg_ask_comments)
print("\n")  # to create space

# working on the average number of comments in show posts
total_show_comments = 0
for row in show_posts:
    num_comments = row[4]
    total_show_comments += int(num_comments)

# let's now compute the average
avg_show_comments = total_show_comments / len(show_posts)
print("sum total of comments in show posts: ", total_show_comments)
print("Average number of comments in show posts: ", avg_show_comments)
sum total of comments in ask posts: 24483 Average number of comments in ask posts: 14.038417431192661 sum total of comments in show posts: 11988 Average number of comments in show posts: 10.31669535283993
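The two loops above repeat the same pattern, so they could be folded into one small helper (a sketch; the helper name `avg_comments` and the sample rows are mine, not part of the original analysis):

```python
def avg_comments(posts, comment_index=4):
    """Average of the comment column over a list of post rows."""
    total = sum(int(row[comment_index]) for row in posts)
    return total / len(posts)

# usage with made-up rows in the same shape as hn
sample = [['1', 'title a', '', '5', '10', 'a', '1/1/2016 0:00'],
          ['2', 'title b', '', '5', '20', 'b', '1/1/2016 0:00']]
print(avg_comments(sample))  # 15.0
```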
From the workings above, we find that ask posts receive more comments, with an average of about 14, compared to show posts, which average about 10 comments.
We are very lucky: my friend's issue was with ask posts, and these are the posts with more comments compared to show posts. We'll now focus the rest of our analysis on ask posts.
With this I'll be in a position to give my friend the facts about the site, since we'll determine whether ask posts created at a certain time are more likely to attract comments.
To do this, we'll:
i. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
ii. Calculate the average number of comments ask posts receive by hour created.
# we will first import the datetime module
import datetime as dt

result_list = []  # this will hold [created_at, num_comments] pairs for posts created at different times
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# let's now create two empty dictionaries
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    extract_date = row[0]  # extract the date
    comment = row[1]       # extract the comments
    date_time = dt.datetime.strptime(extract_date, "%m/%d/%Y %H:%M")  # parse the date
    extract_hour = dt.datetime.strftime(date_time, "%H")  # strftime extracts the hour
    if extract_hour not in counts_by_hour:
        counts_by_hour[extract_hour] = 1
        comments_by_hour[extract_hour] = comment
    else:
        counts_by_hour[extract_hour] += 1
        comments_by_hour[extract_hour] += comment

print("the number of comments on ask posts created by the hour:")
print("hour: " "comments")
comments_by_hour
the number of comments on ask posts created by the hour: hour: comments
{'00': 447, '01': 683, '02': 1381, '03': 421, '04': 337, '05': 464, '06': 397, '07': 267, '08': 492, '09': 251, '10': 793, '11': 641, '12': 687, '13': 1253, '14': 1416, '15': 4477, '16': 1814, '17': 1146, '18': 1439, '19': 1188, '20': 1722, '21': 1745, '22': 479, '23': 543}
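The peak hour can also be read straight off the dictionary with `max()` and a `key` function (a sketch using a few of the totals above):

```python
# a handful of the hour totals from the output above
comments_by_hour = {'13': 1253, '14': 1416, '15': 4477, '16': 1814}

# key=comments_by_hour.get ranks the hours by their comment totals
peak_hour = max(comments_by_hour, key=comments_by_hour.get)
print(peak_hour)  # '15'
```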
From the output above, we can see that ask posts uploaded from around noon to late evening, i.e. between 13:00 and 21:00, receive more comments, with a peak at around 15:00.
We can also notice that ask posts uploaded between midnight and 8:00 in the morning, with the exception of 2:00, receive fewer comments.
Let me now get back to my friend to acquire more information.
Me: Hello... what hours do you usually upload your posts?
Friend: Often, I submit my posts in the late afternoon.
Me: Be specific, please...
Friend: Between 12:00 and 15:00 (my time zone is East Africa Time, EAT).
Me: Now I've caught you, but hold on; in not more than 5 minutes I'll present the facts to you.
So you can see why my friend had to complain. Can we say it was out of ignorance? I don't know, but there was no reason at all for badmouthing the site.
We aren't done yet; let's calculate the average number of comments per post for posts created during each hour of the day.
To achieve this, we use the two dictionaries we created earlier. Have a look:
avg_by_hour = []
for hours in comments_by_hour:
    avg_by_hour.append([hours, comments_by_hour[hours] / counts_by_hour[hours]])

print("average number of comments for posts created during each hour of the day")
avg_by_hour
average number of comments for posts created during each hour of the day
[['12', 9.41095890410959], ['10', 13.440677966101696], ['22', 6.746478873239437], ['15', 38.5948275862069], ['11', 11.051724137931034], ['07', 7.852941176470588], ['00', 8.127272727272727], ['18', 13.20183486238532], ['21', 16.009174311926607], ['23', 7.985294117647059], ['02', 23.810344827586206], ['14', 13.233644859813085], ['17', 11.46], ['05', 10.08695652173913], ['08', 10.25], ['20', 21.525], ['13', 14.741176470588234], ['09', 5.5777777777777775], ['19', 10.8], ['01', 11.383333333333333], ['06', 9.022727272727273], ['03', 7.796296296296297], ['04', 7.170212765957447], ['16', 16.796296296296298]]
When we sort the output above, it will be much easier to read.
Have a look:
# we create an empty list first
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])  # this makes the list easy to sort

# let's print our new list to confirm the order of arrangement
print("list with swapped columns:")
print("\n")  # to create space
print(*swap_avg_by_hour[:3], sep="\n")
print("\n")

# sorting the list
sorted_swap = sorted(swap_avg_by_hour, reverse=True)  # this sorts the list in descending order
print("Top 5 Hours for Ask posts comments")
print("\n")
print(*sorted_swap[:5], sep="\n")
print("\n")

# formatting our top 5 hours for ask post comments
for row in sorted_swap[:5]:
    avg_comments = row[0]
    hour = row[1]
    print("{}: {:.2f} average comments per post.".format(
        dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg_comments))
list with swapped columns: [9.41095890410959, '12'] [13.440677966101696, '10'] [6.746478873239437, '22'] Top 5 Hours for Ask posts comments [38.5948275862069, '15'] [23.810344827586206, '02'] [21.525, '20'] [16.796296296296298, '16'] [16.009174311926607, '21'] 15:00 : 38.59 average comments per post. 02:00 : 23.81 average comments per post. 20:00 : 21.52 average comments per post. 16:00 : 16.80 average comments per post. 21:00 : 16.01 average comments per post.
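Swapping the columns works, but the same ordering can also be had without swapping by passing a `key` function to `sorted()` (a sketch using a few of the hour averages above, rounded for brevity):

```python
# [hour, average comments] pairs, as in avg_by_hour above
avg_by_hour = [['12', 9.41], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# sort by the average (second element), largest first, without swapping columns
top = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)
print(top[0])  # ['15', 38.59]
```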
Before I get back to my friend, I have to account for my time zone, East Africa Time (EAT), which is 7 hours ahead of the time zone of the data.
Have a look:
print("Top 5 Hours for Ask Posts Comments in the EAT time zone")
print("\n")
date_format = "%H"
for row in sorted_swap[:5]:
    avg_comments = "{:.2f}".format(row[0])  # we use two decimal places
    hour = dt.datetime.strptime(row[1], date_format)
    hour_in_eat = hour + dt.timedelta(hours=7)  # shift 7 hours ahead to EAT
    time = hour_in_eat.strftime("%H:%M")
    print("{}: {} average comments per post".format(time, avg_comments))
Top 5 Hours for Ask Posts Comments in the EAT time zone 22:00: 38.59 average comments per post 09:00: 23.81 average comments per post 03:00: 21.52 average comments per post 23:00: 16.80 average comments per post 04:00: 16.01 average comments per post
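A detail worth noting in the conversion above: `strptime("%H")` yields a datetime on a default date, so adding a `timedelta` correctly wraps hours past midnight instead of overflowing. A minimal sketch:

```python
import datetime as dt

hour = dt.datetime.strptime("21", "%H")  # a datetime at 21:00 on a default date
shifted = hour + dt.timedelta(hours=7)   # 21:00 + 7h wraps past midnight
print(shifted.strftime("%H:%M"))  # 04:00
```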
I think I now have the facts for my friend.
Me: Hello friend, I am done...
Friend: I hope you've seen it for yourself.
Me: Not really. First, ask posts on the Hacker News site are heavily affected by the time of submission, and the effect varies by hour. If by any chance you had submitted your posts in the late afternoon at 15:00 (or 22:00 in the EAT zone), which is not your case, you would automatically have received more comments. But your submissions have been during the morning hours between 05:00 and 08:00 (or 12:00 and 15:00 in the EAT zone), which receive the fewest comments, an average of at most 10. So always make your submissions between 15:00 and 21:00 (or between 22:00 and 04:00 in the EAT zone) and you'll have countless comments, just the way you claimed before.
Friend: Wow! I can't believe this. You mean it's a matter of time (hours)! Thank you very much, friend, and I wish you all the best in your journey (data science).
Me: Thank you, and you're always welcome.
From the results above we can conclude:
1.) Ask HN posts received more comments on average than Show HN posts.
2.) Ask posts receive more comments, especially those uploaded between 13:00 and 21:00 (or 20:00 to 04:00 in the EAT zone).