Hacker News is a popular site where users can submit posts, comment on them, and vote.
We are interested in two types of posts: Ask HN and Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, while Show HN posts show the community a project, product, or just something interesting.
Posing Questions:
1-Do Ask HN or Show HN posts receive more comments on average?
2-Do posts created at a certain time receive more comments on average?
The source of the data set is here
#import required package
from csv import reader
#open the file
opened_file = open('hacker_news.csv')
#read the file
read_file = reader(opened_file)
#Convert csv file into list of lists
hn = list(read_file)
#explore the first 5 rows
print(hn[:5])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
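A minimal sketch of the same loading step using a with block, so the file object is closed automatically. To keep it runnable without the data set, it reads from an in-memory string: sample_csv is an assumption (the header plus the first row shown in the output above). With the real file you would open 'hacker_news.csv' in the with statement instead.

```python
import io
from csv import reader

# Assumption: a tiny in-memory stand-in for hacker_news.csv (header + 1 row).
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# The with block closes the file object even if reading raises an error.
with io.StringIO(sample_csv) as f:
    rows = list(reader(f))

# Split the header row from the data rows.
headers, data = rows[0], rows[1:]
print(headers)
print(data)
```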
#extract the first row of data(headers)
headers = hn[0]
#remove the headers from data set hn
hn = hn[1:]
print('Headers',headers)
#explore the first 5 rows without the header
print('First 5 Rows in The Data Set\n',hn[:5])
Headers ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
First 5 Rows in The Data Set
 [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
We are only concerned with posts whose titles begin with Ask HN or Show HN, so we will create a separate list of lists for each of these post types.
Because the startswith() method is case-sensitive, we will use the lower() method to get a lowercase version of each title before checking it, so that every capitalization is matched.
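A small illustration of why the titles are lowercased first. The first and third titles are copied from the data set output in this notebook; the second is a hypothetical title added only to show a lowercase variant.

```python
# Sample titles; the all-lowercase one is hypothetical.
titles = ['Ask HN: How to improve my personal website?',
          'ask hn: is anyone there?',
          'Show HN: Something pointless I made']

results = []
for title in titles:
    # Case-sensitive check on the title as written.
    raw_match = title.startswith('ask hn')
    # Same check after normalizing the title to lowercase.
    lowered_match = title.lower().startswith('ask hn')
    results.append((raw_match, lowered_match))
    print(title, '|', raw_match, '|', lowered_match)
```

Only the lowercased check matches both spellings of the Ask HN prefix, while the Show HN title is rejected either way.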
#lists for posts based on their title
ask_posts = []
show_posts = []
other_posts = []
#loop through each row in hn
for row in hn:
    #Assign the title of the post to the variable title
    title = row[1]
    #convert the title into its lowercase version
    title = title.lower()
    #check if the title starts with 'ask hn'
    if title.startswith('ask hn'):
        #add the whole row to the list
        ask_posts.append(row)
    #check if the title starts with 'show hn'
    elif title.startswith('show hn'):
        #add the whole row to the list
        show_posts.append(row)
    else:
        other_posts.append(row)
#Check the number of posts in each list
print('The number of posts with Ask hn title is :',len(ask_posts))
print('The number of posts with Show hn title is :',len(show_posts))
print('The number of posts with other title is :',len(other_posts))
The number of posts with Ask hn title is : 1744
The number of posts with Show hn title is : 1162
The number of posts with other title is : 17194
#explore the first 5 rows of ask_posts list
ask_posts[:5]
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
#explore the first 5 rows of show_posts list
show_posts[:5]
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11'], ['10872799', 'Show HN: GeoScreenshot Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']]
#explore the first 5 rows of other_posts list
other_posts[:5]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
In this step, we calculate the average number of comments for Ask HN and Show HN posts to find out which post type gets the higher average.
#set the counter to 0
total_ask_comments = 0
#loop through the list
for row in ask_posts:
    #assign the number of comments to the num_comments variable
    #and convert it from string into integer
    num_comments = int(row[4])
    #add the value
    total_ask_comments += num_comments
#calculate the average
avg_ask_comments = total_ask_comments / len(ask_posts)
print('The average number of comments on ask posts is :{:.2f}'.format(avg_ask_comments))
The average number of comments on ask posts is :14.04
#set the counter to 0
total_show_comments = 0
#loop through the list
for row in show_posts:
    #assign the number of comments to the num_comments variable
    #and convert it from string into integer
    num_comments = int(row[4])
    #add the value
    total_show_comments += num_comments
#calculate the average
avg_show_comments = total_show_comments / len(show_posts)
print('The average number of comments on show posts is :{:.2f}'.format(avg_show_comments))
The average number of comments on show posts is :10.32
As we see, Ask HN posts receive more comments on average (14.04) than Show HN posts (10.32).
Since Ask HN posts are more likely to receive comments, we will focus the rest of the analysis on them to understand more about how members interact on this website.
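The two averaging loops above repeat the same logic, so they could be folded into one helper. This is only a sketch: average_comments and its comments_index parameter are hypothetical names, and the two sample rows are copied from the ask_posts output earlier in the notebook.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the num_comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    # Convert each comment count from string to int before summing.
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two rows copied from the ask_posts output, used as a small check.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print('{:.2f}'.format(average_comments(sample_posts)))
```

With the full lists, average_comments(ask_posts) and average_comments(show_posts) would replace the two loops.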
We will find the number of Ask HN posts and comments by hour created in two steps.
1-Create a list of lists, each with two elements: the time the post was created and its number of comments.
2-Create two dictionaries: one to keep track of the number of posts created in each hour, and one to keep track of the number of comments on the posts created in each hour.
#required package
import datetime as dt
#a list of lists with 2 elements time and the number of comments for the post
result_list = []
#loop
for row in ask_posts:
    #get time
    post_time = row[6]
    #get the number of comments
    num_comments = int(row[4])
    #add time and the number of comments as list to the result_list
    result_list.append([post_time,num_comments])
#two dics to find amount of Ask posts and Comments by Hour Created
#keep track of the number of posts created by hour
counts_by_hour = {}
#keep track of the number of comments created by hour
comments_by_hour = {}
#loop
for item in result_list:
    #get the date
    date = item[0]
    #get datetime object
    date = dt.datetime.strptime(date,'%m/%d/%Y %H:%M')
    #get the hour
    hour = date.hour
    #if hour is already a key in the dic
    if hour in counts_by_hour:
        #Increment the value by 1
        counts_by_hour[hour] += 1
        #Increment the value by comment number
        comments_by_hour[hour] += item[1]
    #if not
    else:
        #set the value to 1
        counts_by_hour[hour] = 1
        #set the value to the comment number
        comments_by_hour[hour] = item[1]
#explore the number of posts created by hour
counts_by_hour
{0: 55, 1: 60, 2: 58, 3: 54, 4: 47, 5: 46, 6: 44, 7: 34, 8: 48, 9: 45, 10: 59, 11: 58, 12: 73, 13: 85, 14: 107, 15: 116, 16: 108, 17: 100, 18: 109, 19: 110, 20: 80, 21: 109, 22: 71, 23: 68}
#explore the number of comments created by hour
comments_by_hour
{0: 447, 1: 683, 2: 1381, 3: 421, 4: 337, 5: 464, 6: 397, 7: 267, 8: 492, 9: 251, 10: 793, 11: 641, 12: 687, 13: 1253, 14: 1416, 15: 4477, 16: 1814, 17: 1146, 18: 1439, 19: 1188, 20: 1722, 21: 1745, 22: 479, 23: 543}
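The if/else bookkeeping above can also be written with collections.defaultdict, which starts missing keys at 0. This is a sketch of that alternative, not the notebook's method; sample_result_list, sample_counts, and sample_comments are hypothetical names, and the three [created_at, num_comments] pairs are illustrative values in the same format as result_list.

```python
import datetime as dt
from collections import defaultdict

# Assumption: three illustrative [created_at, num_comments] pairs.
sample_result_list = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 1],
]

sample_counts = defaultdict(int)    # posts per hour, missing hours start at 0
sample_comments = defaultdict(int)  # comments per hour, missing hours start at 0

for created_at, num_comments in sample_result_list:
    # Parse the timestamp and keep only the hour of day.
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))
print(dict(sample_comments))
```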
In this step, we will calculate the average number of comments for posts created during each individual hour. For each hour we will take the total comments and divide it by the total posts from the dictionaries we have created.
#list of lists ,2 elements : hour & avg
avg_by_hour = []
#loop through the hours; both dics share the same keys
for hour in comments_by_hour:
    #calculate the avg
    avg = comments_by_hour[hour]/counts_by_hour[hour]
    #add the hour & avg to the list
    avg_by_hour.append([hour,avg])
#display the results
print('The Average Number of Comments per Post in Each Hour')
#loop
for h,avg in avg_by_hour:
    #print hour & average
    print(h,":",avg)
The Average Number of Comments per Post in Each Hour
0 : 8.127272727272727
1 : 11.383333333333333
2 : 23.810344827586206
3 : 7.796296296296297
4 : 7.170212765957447
5 : 10.08695652173913
6 : 9.022727272727273
7 : 7.852941176470588
8 : 10.25
9 : 5.5777777777777775
10 : 13.440677966101696
11 : 11.051724137931034
12 : 9.41095890410959
13 : 14.741176470588234
14 : 13.233644859813085
15 : 38.5948275862069
16 : 16.796296296296298
17 : 11.46
18 : 13.20183486238532
19 : 10.8
20 : 21.525
21 : 16.009174311926607
22 : 6.746478873239437
23 : 7.985294117647059
To make the results more readable and make it easy to identify the hour with the highest average, we will sort the list by average in descending order.
Since the sorted() function compares lists by their first element, we need to swap the columns of the avg_by_hour list so that the average comes first.
#list has swapped columns of avg_by_hour list
swap_avg_by_hour = []
#loop
for item in avg_by_hour:
    #add the swapped columns to the list
    swap_avg_by_hour.append([item[1],item[0]])
#explore
swap_avg_by_hour
[[8.127272727272727, 0], [11.383333333333333, 1], [23.810344827586206, 2], [7.796296296296297, 3], [7.170212765957447, 4], [10.08695652173913, 5], [9.022727272727273, 6], [7.852941176470588, 7], [10.25, 8], [5.5777777777777775, 9], [13.440677966101696, 10], [11.051724137931034, 11], [9.41095890410959, 12], [14.741176470588234, 13], [13.233644859813085, 14], [38.5948275862069, 15], [16.796296296296298, 16], [11.46, 17], [13.20183486238532, 18], [10.8, 19], [21.525, 20], [16.009174311926607, 21], [6.746478873239437, 22], [7.985294117647059, 23]]
#sort the list based on the average in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
#explore
sorted_swap
[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [13.20183486238532, 18], [11.46, 17], [11.383333333333333, 1], [11.051724137931034, 11], [10.8, 19], [10.25, 8], [10.08695652173913, 5], [9.41095890410959, 12], [9.022727272727273, 6], [8.127272727272727, 0], [7.985294117647059, 23], [7.852941176470588, 7], [7.796296296296297, 3], [7.170212765957447, 4], [6.746478873239437, 22], [5.5777777777777775, 9]]
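As an alternative to swapping the columns, sorted() accepts a key function that picks the element to sort on. A minimal sketch, using three [hour, average] pairs taken from avg_by_hour and rounded to two decimals; sample_avg_by_hour and sorted_by_avg are hypothetical names.

```python
# Three [hour, average] pairs from avg_by_hour, rounded for brevity.
sample_avg_by_hour = [[0, 8.13], [2, 23.81], [15, 38.59]]

# key selects the average (index 1); reverse=True gives descending order.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)
print(sorted_by_avg)
```

This keeps each pair in [hour, average] order, so no second swapped list is needed.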
As we can see, the highest average is 38.59, for posts created at 15:00. To print the hours in that HH:MM format, we will use the datetime module.
print('Top 5 Hours for Ask Posts Comments')
#loop
for item in sorted_swap[:5]:
    #get hour as string
    hour = str(item[1])
    #get the average
    avg = item[0]
    #create a datetime object
    hour = dt.datetime.strptime(hour,'%H')
    #format hour ex:15:00
    hour = hour.strftime('%H:%M')
    #display
    print(hour,': {:.2f}'.format(avg))
Top 5 Hours for Ask Posts Comments
15:00 : 38.59
02:00 : 23.81
20:00 : 21.52
16:00 : 16.80
21:00 : 16.01
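Because the hours here are whole numbers, the same HH:MM labels can also be produced with plain string formatting, without going through strptime and strftime. A short sketch; top_hours and labels are hypothetical names, and the hours are copied from the output above.

```python
# Hours copied from the top-5 output above.
top_hours = [15, 2, 20, 16, 21]

# {:02d} left-pads the hour with a zero to two digits.
labels = ['{:02d}:00'.format(hour) for hour in top_hours]
print(labels)
```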
These are the hours at which Ask HN posts receive the most comments on average, which suggests when people are most active on the website.
According to the data set documentation, the timestamps are in US Eastern Time. To make this insight more useful for people in my own time zone, we will convert the hours to the Asia/Riyadh time zone.
#required package to deal with the time zone
import pytz
#get the current time of 'US/Eastern' time zone
timezone_Eastern = 'US/Eastern'
time_Eastern = dt.datetime.now(pytz.timezone(timezone_Eastern))
print(time_Eastern)
#convert the time from 'US/Eastern' into 'Asia/Riyadh'
myTimezone = time_Eastern.astimezone(pytz.timezone('Asia/Riyadh'))
print(myTimezone)
2020-07-23 09:09:51.084056-04:00
2020-07-23 16:09:51.084056+03:00
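As the output shows, Asia/Riyadh was 7 hours ahead of US/Eastern when this cell was run (July, during US daylight saving time, UTC-4). In winter, when US Eastern Time is at UTC-5, the difference is 8 hours. We will use a timedelta object to add 7 hours to the top 5 hours to get them in the Asia/Riyadh time zone.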
# get the top 5 hours in Asia/Riyadh time zone
print('Top 5 Hours for Ask Posts Comments in Asia/Riyadh Time Zone')
for item in sorted_swap[:5]:
    hour = str(item[1])
    avg = item[0]
    hour = dt.datetime.strptime(hour,'%H')
    #add 7 hours to the hours that are in US/Eastern time zone
    hour = hour + dt.timedelta(hours = 7)
    hour = hour.strftime('%H:%M')
    print(hour,': {:.2f}'.format(avg))
Top 5 Hours for Ask Posts Comments in Asia/Riyadh Time Zone
22:00 : 38.59
09:00 : 23.81
03:00 : 21.52
23:00 : 16.80
04:00 : 16.01
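The fixed 7-hour shift matches US Eastern daylight time (UTC-4). Riyadh stays at UTC+3 all year, so during US Eastern standard time (UTC-5) the shift is 8 hours. The sketch below shows both cases using only the standard library. It assumes fixed UTC offsets rather than a time zone database, and EDT, EST, RIYADH, and to_riyadh_hour are hypothetical names introduced for illustration.

```python
import datetime as dt

# Assumption: fixed UTC offsets instead of a time zone database.
EDT = dt.timezone(dt.timedelta(hours=-4), 'EDT')       # US Eastern, summer
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')       # US Eastern, winter
RIYADH = dt.timezone(dt.timedelta(hours=3), 'Riyadh')  # no daylight saving

def to_riyadh_hour(hour, eastern_tz):
    """Convert an Eastern hour of day to the matching Riyadh hour of day."""
    # The date is arbitrary; only the hour matters with fixed offsets.
    eastern = dt.datetime(2016, 1, 1, hour, 0, tzinfo=eastern_tz)
    return eastern.astimezone(RIYADH).hour

# Top 5 hours from the analysis above.
for hour in [15, 2, 20, 16, 21]:
    print(hour, '->', to_riyadh_hour(hour, EDT), '(EDT) /',
          to_riyadh_hour(hour, EST), '(EST)')
```

Since the data set covers posts from different months, converting each post's full timestamp with a real time zone database would be more precise than a single fixed shift.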
Ask HN posts receive more comments on average than Show HN posts. So which hours give a post the best chance of receiving comments? Based on this analysis, the top two hours in US Eastern Time are 3:00 p.m., with an average of 38.59 comments per post, and 2:00 a.m., with an average of 23.81. In my time zone (Asia/Riyadh), using the 7-hour offset, these correspond to 10:00 p.m. and 9:00 a.m.