In this project we will look at the difference between posts that ask for reviews, advice, and input and posts that are simply shared for exhibition. These are referred to as Ask HN and Show HN, respectively. We hope to determine whether one category garners more comments and attention, and whether posting time affects the response.
#open, prepare and read the file
from csv import reader
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
hn = list(read_file)
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
#create a headers list and a main hn list. This ensures that
#the header row is not included in count data
headers = hn[0]
hn = hn[1:]
print(hn[:5])
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
#after creating the hn list of lists w/o a header we will
#now create three empty lists which will be populated with
#posts beginning with Ask HN and Show HN and other
ask_posts = []
show_posts = []
other_posts = []
#looping through hn to find titled articles
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("The ask hn posts have",len(ask_posts),"entries")
print("The show hn posts have",len(show_posts),"entries")
print("The other posts have",len(other_posts),"entries")
The ask hn posts have 1744 entries
The show hn posts have 1162 entries
The other posts have 17194 entries
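As a quick sanity check, the same case-insensitive `startswith` logic can be sketched on a handful of made-up titles to confirm that every row lands in exactly one bucket (the titles below are illustrative, not from the dataset):

```python
# Hypothetical sample titles, not rows from hacker_news.csv
sample_titles = [
    "Ask HN: How do you learn?",
    "ASK HN: Anyone using X?",
    "Show HN: My side project",
    "Interactive Dynamic Video",
]

ask, show, other = [], [], []
for title in sample_titles:
    # lower() makes the prefix check case-insensitive
    if title.lower().startswith("ask hn"):
        ask.append(title)
    elif title.lower().startswith("show hn"):
        show.append(title)
    else:
        other.append(title)

# every title lands in exactly one bucket
assert len(ask) + len(show) + len(other) == len(sample_titles)
print(len(ask), len(show), len(other))  # 2 1 1
```

Because `elif`/`else` chains are mutually exclusive, the three lists always partition the input, which is why the counts above (1744 + 1162 + 17194) sum to the full dataset.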
#below we will look at the first few rows of both ask_posts and
#show_posts to see if we've gathered the correct data
print(ask_posts[:3])
print('\n')
print(show_posts[:3])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]

[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]
Now that we've separated the ask posts from the show posts, we want to see the number of comments each category received. This begins to show whether one category gathers more comments, as suggested in the introduction.
#finding the total number of comments for ask_posts
total_ask_comments = 0
#iterating over ask_posts
for comments in ask_posts:
    total_ask_comments += int(comments[4])
#find the average number of ask post comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of ask comments is","{:.2f}".format(avg_ask_comments))
The average number of ask comments is 14.04
#We'll now repeat the same process that we ran on ask_posts
#but with show_posts instead.
total_show_comments = 0
#iterating over show_posts
for comments in show_posts:
    total_show_comments += int(comments[4])
#find the average number of show post comments
avg_show_comments = total_show_comments/len(show_posts)
print("The average number of show comments is","{:.2f}".format(avg_show_comments))
The average number of show comments is 10.32
Based on this short series of data queries and averages, ask posts receive 14.04 comments per post on average, compared with 10.32 for show posts. This makes sense in some ways: ask posts are just that, asks. They are soliciting responses and feedback, while show posts are more about exhibition. The remainder of the analysis will focus on ask_posts.
Now that we're working with ask_posts, we'll dig a little deeper and find the number of ask posts created in each hour of the day, along with the comments received during the same hours. We will then average these numbers to get a better idea of which parts of the day are more active than others.
import datetime as dt
result_list = []
#iterate over ask_posts to collect creation times and comment counts
for results in ask_posts:
    result_list.append([results[6], int(results[4])])
#verify that result_list was built successfully
print(result_list[:3])
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]
#We'll now create two empty dictionaries that will house counts and
#comments by the hour
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    date = row[0]
    comments_number = row[1]
    date_made = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour_made = date_made.hour
    if hour_made in counts_by_hour:
        counts_by_hour[hour_made] += 1
        comments_by_hour[hour_made] += comments_number
    else:
        counts_by_hour[hour_made] = 1
        comments_by_hour[hour_made] = comments_number
#check to see if counts and comments by hour were successfully created
print("The counts by hour were:",counts_by_hour)
print("\n")
print("The comments by hour were:",comments_by_hour)
The counts by hour were: {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}

The comments by hour were: {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
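As an aside, the same hourly tallies could be built with `collections.defaultdict`, which removes the explicit membership check. This is only a sketch on a small made-up sample, not the project's code:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical [created_at, num_comments] pairs, not real dataset rows
sample = [["8/16/2016 9:55", 6], ["11/22/2015 13:43", 29], ["5/2/2016 9:14", 1]]

# defaultdict(int) starts every missing key at 0, so no if/else is needed
counts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)
for created_at, n_comments in sample:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += n_comments

print(dict(counts_by_hour))    # {9: 2, 13: 1}
print(dict(comments_by_hour))  # {9: 7, 13: 29}
```

Both approaches produce identical dictionaries; the explicit if/else version used above is simply more beginner-friendly.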
Our goal now is to compute the average number of comments per post for each hour. The best way to do this is to create a list of lists from the two dictionaries we built (counts_by_hour and comments_by_hour). This is fairly straightforward: for each hour, divide the number of comments by the number of posts in that hour.
#create an empty list which will count posts by hours and average of
#comments for each hour.
avg_by_hour = []
for posts in counts_by_hour:
    post_avg = comments_by_hour[posts]/counts_by_hour[posts]
    avg_by_hour.append([posts, post_avg])
#check for accuracy in averages
print(avg_by_hour)
[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
Now that we have the information we need, it would be helpful to show it in a way that is easier to read, sorted by the highest averages per hour and with formatted decimal places. The finished result will highlight the top five average comment counts by hour and give some insight into what times of day both posters and commenters are most active in this part of Hacker News.
#create a list of lists for average by hour
swap_avg_by_hour = []
#iterate over avg_by_hour
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]
#We'll use the sorted function here to put these values in descending order
sorted_swap_avg = sorted(swap_avg_by_hour, reverse = True)
print("The Top 5 Hours for Ask Post Comments:")
#loop through the first five sorted values and format
for row in sorted_swap_avg[:5]:
    hour_form = dt.datetime.strptime(str(row[1]), "%H")
    hour_form = hour_form.strftime("%H:%M")
    print(hour_form, '{:.2f}'.format(row[0]), 'average comments per post')
The Top 5 Hours for Ask Post Comments:
15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post
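As a design note, the swap step can be avoided entirely: `sorted` accepts a `key` function, so the `[hour, average]` pairs can be sorted directly on the average. The values below are a small illustrative subset, not the full results:

```python
# A few [hour, average] pairs (illustrative subset of avg_by_hour)
avg_by_hour = [[9, 5.58], [15, 38.59], [2, 23.81], [20, 21.53]]

# key=lambda pair: pair[1] sorts on the average without swapping elements
top = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)
for hour, avg in top[:3]:
    print("{:02d}:00 {:.2f} average comments per post".format(hour, avg))
```

The swap-then-sort approach used above works because `sorted` compares lists element by element, starting with the first; `key` simply makes that intent explicit.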
Now that we have the top five hours and their averages, we're going to convert the listed times from Eastern Standard Time to this author's time zone, Pacific Standard Time. The dataset documentation on the Kaggle page (https://www.kaggle.com/hacker-news/hacker-news-posts) confirms that the timestamps are in Eastern Standard Time. This exercise demonstrates the time conversion and identifies advantageous posting times in the author's time zone.
#convert times to Pacific Standard Time-United States(PST)
print('Top 5 Hours for Ask Posts Comments (Time zone: PST, -3 from EST)')
for row in sorted_swap_avg[:5]:
    # format the hours and shift back three hours for PST
    adj_time = (dt.datetime.strptime(str(row[1]), "%H") + dt.timedelta(hours=-3)).strftime("%H:%M")
    print(adj_time, '{:.2f}'.format(row[0]), 'average comments per post')
Top 5 Hours for Ask Posts Comments (Time zone: PST, -3 from EST)
12:00 38.59 average comments per post
23:00 23.81 average comments per post
17:00 21.52 average comments per post
13:00 16.80 average comments per post
18:00 16.01 average comments per post
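Note that the `timedelta` arithmetic wraps correctly across midnight (02:00 EST becomes 23:00 PST). The same shift can be sketched with modular arithmetic on the hours alone:

```python
# EST hours from the top-five list above; subtracting 3 modulo 24
# handles the wrap-around below midnight (2 - 3 -> 23)
est_hours = [15, 2, 20, 16, 21]
pst_hours = [(h - 3) % 24 for h in est_hours]
print(pst_hours)  # [12, 23, 17, 13, 18]
```

The `datetime` route used above is preferable in practice because it also produces the formatted "HH:MM" string in one chain.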
After converting from Eastern Standard to Pacific Standard time, it's easy to see the top five times to post and, hopefully, receive a large share of comments. While this is not guaranteed, it provides insight into trends that may help us understand the behavior of Hacker News users at certain times of day.