In this project, we work with a dataset from the Hacker News site. On this site, users submit stories, known as posts, which other users vote and comment on. For this project, we are interested in posts whose titles begin with Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, while Show HN posts show the community a project, a product, or something else interesting. The goal of this project is to compare these two types of posts and determine the following:
1. Do Ask HN or Show HN posts receive more comments on average?
2. Do posts created at a certain time receive more comments on average?
# Import reader from the csv module so that we can read and extract our data
from csv import reader
# Open the file in a with block so that it is closed once the rows are read
with open("hacker_news.csv") as open_file:
    read_file = reader(open_file)
    hn = list(read_file)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
# Separate the header row from the rest of the dataset
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])  # preview the first five rows rather than the whole dataset
# Separate the posts whose titles begin with Ask HN or Show HN from all other posts
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    title = title.lower()  # convert the title to lowercase so the match is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Report the number of posts in each list
print("The number of posts in ask_posts is : ", len(ask_posts))
print("The number of posts in show_posts is : ", len(show_posts))
print("The number of posts in other_posts is : ", len(other_posts))
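The same split can be wrapped in a small function so that it can be tried on a handful of rows before running it on the full dataset. This is only a sketch: the name split_posts and the sample rows are hypothetical and are not taken from hacker_news.csv.

```python
def split_posts(rows):
    """Split rows into Ask HN, Show HN, and other posts by title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[1].lower()  # the title is the second column
        if title.startswith("ask hn"):
            ask.append(row)
        elif title.startswith("show hn"):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Hypothetical sample rows in the same column order as the dataset
sample_rows = [
    ["1", "Ask HN: How do you learn Python?", "", "5", "12", "user_a", "8/4/2016 11:52"],
    ["2", "SHOW HN: My weekend project", "", "3", "4", "user_b", "1/26/2016 19:30"],
    ["3", "Interactive Dynamic Video", "", "386", "52", "user_c", "8/4/2016 11:52"],
]

ask, show, other = split_posts(sample_rows)
print(len(ask), len(show), len(other))  # 1 1 1
```

Because the title is lowercased before the check, "SHOW HN" in capitals still lands in the show list.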
# Find out whether ask posts or show posts receive more comments on average
total_ask_comments = 0
for post in ask_posts:
    num_comments = int(post[4])  # num_comments is the fifth column
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average comments per ask post: ", avg_ask_comments)
total_show_comments = 0
for post in show_posts:
    num_comments = int(post[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Average comments per show post: ", avg_show_comments)
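Since the two loops above differ only in the list they read, the calculation could also be written once as a helper. The sketch below assumes the comment count sits at index 4, as in the dataset header. The name average_comments and the sample rows are hypothetical.

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0  # avoid dividing by zero when there are no posts
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Hypothetical rows: only the comment count (index 4) matters here
sample_posts = [
    ["1", "Ask HN: first question", "", "5", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: second question", "", "3", "20", "user_b", "1/26/2016 19:30"],
    ["3", "Ask HN: third question", "", "1", "6", "user_c", "6/23/2016 22:20"],
]

print(average_comments(sample_posts))  # 12.0
```

With the notebook's own lists, the calls would be average_comments(ask_posts) and average_comments(show_posts).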
The output above fulfils the first goal of this project: Ask HN posts receive more comments on average than Show HN posts. Our second goal is to determine whether posts created at a particular time receive more comments. To do that, we parse the creation time of each ask post into a datetime object, then extract the hour so that we can count posts and comments by hour.
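As a quick illustration of the parsing step, the snippet below converts one created_at value from the data preview into a datetime object and extracts its hour. The format string assumes every date in the file follows the month/day/year hour:minute layout seen in the preview.

```python
import datetime as dt

# A created_at value as it appears in the first data row of the preview
created_at = "8/4/2016 11:52"

# Parse the string into a datetime object, then format only the hour
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
hour = parsed.strftime("%H")

print(parsed)  # 2016-08-04 11:52:00
print(hour)    # 11
```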
import datetime as dt
# Collect the creation time and the number of comments for each ask post
result_list = []
for post in ask_posts:
    created_at = post[6]
    num_comments = int(post[4])
    appended_list = [created_at, num_comments]
    result_list.append(appended_list)
print(result_list[:5])  # preview the first five entries
# Count the posts and total the comments for each hour of the day
counts_by_hour = {}
comments_by_hour = {}
for result in result_list:
    h_dt = result[0]
    h = dt.datetime.strptime(h_dt, "%m/%d/%Y %H:%M")
    hour = h.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = result[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]
print("Counts by hour is:", counts_by_hour)
print("Comments by hour is: ", comments_by_hour)
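The if/else bookkeeping for the two dictionaries can also be expressed with collections.defaultdict, which starts every missing key at zero. This is a sketch of that alternative, not the approach used in this notebook. The function name tally_by_hour is hypothetical, the first two dates come from the data preview, and the third date and all comment counts are made up.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs, date_format="%m/%d/%Y %H:%M"):
    """Return (posts per hour, comments per hour) for [created_at, num_comments] pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# Hypothetical [created_at, num_comments] pairs
sample_pairs = [
    ["8/4/2016 11:52", 6],
    ["1/26/2016 19:30", 10],
    ["6/23/2016 11:05", 4],
]

counts, comments = tally_by_hour(sample_pairs)
print(counts)    # {'11': 2, '19': 1}
print(comments)  # {'11': 10, '19': 10}
```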
Next, we calculate the average number of comments per post for each hour.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print(avg_by_hour)
The output above contains the result we need, but in this format it is hard to identify the hour with the highest average. To make it easier to read, we swap the two values in each pair so that the average becomes the first element and the hour becomes the second. We then use the sorted() function to sort the pairs in descending order of average.
swap_avg_by_hour = []
for avg in avg_by_hour:
    # Put the average first so that sorted() orders the pairs by average
    swap_avg_by_hour.append([avg[1], avg[0]])
print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)
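Swapping the columns is one way to sort by average. Another is to pass a key function to sorted(), which leaves the [hour, average] pairs in their original order. The sketch below uses hypothetical sample values. Only the pairing of hour 15 with 38.59 comes from the example mentioned in this notebook.

```python
# Hypothetical [hour, average] pairs in the same shape as avg_by_hour
avg_by_hour_sample = [["09", 5.5], ["15", 38.59], ["20", 21.5]]

# Sort by the second element (the average), highest first
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

print(ranked)  # [['15', 38.59], ['20', 21.5], ['09', 5.5]]
```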
Our second goal was to find out whether posts created at a particular time receive more comments. In the cell below, we loop through the first five entries of sorted_swap, parse each hour with the datetime.strptime() method to get a datetime object, and then use the strftime() method to format it as an hour. To format the average, we use {:.2f} so that it is rounded to two decimal places. Finally, we use the str.format() method to print the hour and the average in this format: 15 : 38.59 comments per post.
print("Top 5 Hours for Ask Posts Comments")
for swap in sorted_swap[:5]:
    average = swap[0]
    hour = swap[1]
    hour = dt.datetime.strptime(hour, "%H")
    hour_h = hour.strftime("%H")
    final = "{0} : {1:.2f} comments per post".format(hour_h, average)
    print(final)
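For comparison, the same line can be produced with an f-string, and strftime("%H:%M") can render the hour as a clock time. This is an optional sketch: the function name format_hour_line is hypothetical, and the unrounded average 38.594 is made up so that the rounding is visible.

```python
import datetime as dt


def format_hour_line(hour, average):
    """Format an hour string such as '15' and an average as one report line."""
    clock = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    return f"{clock} : {average:.2f} comments per post"


print(format_hour_line("15", 38.594))  # 15:00 : 38.59 comments per post
```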
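In conclusion, the project has achieved both of its goals: we determined that Ask HN posts receive more comments on average than Show HN posts, and we identified the hours at which Ask HN posts receive more comments than others.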