The objective of this project is to gain insights into post activity on Hacker News. This information could be used to decide which kind of post to publish, and at which hour, to maximize the chance of obtaining the most points or comments.
We are going to use string methods to obtain the titles that we need, and datetime methods to extract information about the hours.
The most important insight is that the results depend on the type of post being analyzed: news posts earn more points early, ask posts get more interaction in the afternoon, and show posts late at night. So we need to be clear about what we want to post in order to choose the best hour to publish it.
First, we need to open the dataset. This dataset was obtained from this link. There are no comments or special observations about the dataset on its discussion page.
The descriptions of the columns are as follows:
import csv
open_file = open('hacker_news.csv')
hn = list(csv.reader(open_file))
hn[:5]
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
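As a side note, the file handle above is never closed. A `with` block handles that automatically; here is a minimal sketch, using a small inline string as a stand-in for the real `hacker_news.csv` file:

```python
import csv
from io import StringIO

# A tiny inline sample standing in for the real hacker_news.csv file
sample = "id,title\n1,Interactive Dynamic Video\n"

# The with block closes the file object automatically when the block ends
with StringIO(sample) as f:
    rows = list(csv.reader(f))

print(rows)  # [['id', 'title'], ['1', 'Interactive Dynamic Video']]
```

With the real file, `open('hacker_news.csv')` would take the place of the `StringIO` object.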
After opening the dataset, we proceed to remove the header row.
headers = hn[0]
hn = hn[1:]
print(headers)
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
In this part we are going to use the string methods startswith and lower. startswith compares the first characters of a string against a given string, and lower converts a string to lower case. Because startswith is case sensitive, we need to put both strings in the same case to avoid missing data. Hacker News has two kinds of user-generated posts that interest us: Ask Hacker News (Ask HN) and Show Hacker News (Show HN). There are other kinds, but in this exercise we want to focus on these two.
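A quick illustration of the case-sensitivity issue, using one of the titles from the dataset:

```python
# startswith is case sensitive: without lower(), titles that begin
# with "Ask HN" would not match the lowercase target "ask hn"
title = "Ask HN: How to improve my personal website?"

print(title.startswith("ask hn"))          # False, the case does not match
print(title.lower().startswith("ask hn"))  # True once the case is lowered
```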
# We need to create empty lists to sort the data into later
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    # We lower the case first, then check the prefix with startswith
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Print the len to verify the quantities
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
1744 1162 17194
# Print the first two rows of each list to see the results.
print(ask_posts[:2])
print(show_posts[:2])
print(other_posts[:2])
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']] [['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]
Once we have three lists with the different kinds of posts, we need to identify which kind of post has the most comments on average. Because we are not comparing against the other posts, we are not going to use that list here.
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)

total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_ask_comments)
print(avg_show_comments)
14.038417431192661 10.31669535283993
On average, Ask posts receive 14 comments per post, while Show posts receive 10 comments per post, so we are going to analyze the Ask posts first.
Now that we know which kind of post has, on average, the most comments, we want to know at which hours those posts receive them.
For this task we need to use datetime. This class gives us methods to transform our raw date strings into standardized datetime objects that Python can manipulate for further analysis.
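Before applying it to the whole dataset, a minimal example of how parsing and formatting work, using the first row's timestamp:

```python
import datetime as dt

# strptime parses a string in the dataset's format into a datetime object
parsed = dt.datetime.strptime("8/4/2016 11:52", "%m/%d/%Y %H:%M")

# strftime goes the other way: it formats parts of the datetime back into a string
hour = parsed.strftime("%H")

print(parsed.year, parsed.month, hour)  # 2016 8 11
```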
import datetime as dt  # Imports the datetime module so we can work with time data

# An empty list to store the creation date and the number of comments of each ask post
result_list = []
for row in ask_posts:
    created_at = row[6]
    n_comments = int(row[4])
    result_list.append([created_at, n_comments])
print(result_list[:5])
# First we create two empty dictionaries to store the number of posts and of comments per hour
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = row[0]
    comments = row[1]
    date_format = "%m/%d/%Y %H:%M"
    # strptime parses the date, converting the string into a datetime object
    time = dt.datetime.strptime(date, date_format)
    # strftime extracts just the hour from the parsed date
    hour = time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

comments_by_hour
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17]]
{'00': 447, '01': 683, '02': 1381, '03': 421, '04': 337, '05': 464, '06': 397, '07': 267, '08': 492, '09': 251, '10': 793, '11': 641, '12': 687, '13': 1253, '14': 1416, '15': 4477, '16': 1814, '17': 1146, '18': 1439, '19': 1188, '20': 1722, '21': 1745, '22': 479, '23': 543}
In this part we are going to calculate the average number of comments per hour and store the values in a new list.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
avg_by_hour
[['07', 7.852941176470588], ['10', 13.440677966101696], ['00', 8.127272727272727], ['14', 13.233644859813085], ['23', 7.985294117647059], ['02', 23.810344827586206], ['11', 11.051724137931034], ['16', 16.796296296296298], ['09', 5.5777777777777775], ['22', 6.746478873239437], ['04', 7.170212765957447], ['08', 10.25], ['03', 7.796296296296297], ['21', 16.009174311926607], ['19', 10.8], ['15', 38.5948275862069], ['13', 14.741176470588234], ['18', 13.20183486238532], ['01', 11.383333333333333], ['05', 10.08695652173913], ['20', 21.525], ['12', 9.41095890410959], ['06', 9.022727272727273], ['17', 11.46]]
Now that we have the average number of comments, we need to sort the results so we can see more clearly which hours get the most comments. To sort, we first swap the columns of each inner list so that the sorted function orders by the average.
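As an aside, the same ordering can be obtained without swapping columns by passing a key function to sorted. A sketch, using a small hypothetical sample in the same [hour, average] shape as avg_by_hour:

```python
# Hypothetical rows in the same [hour, average] shape as avg_by_hour
sample_avg = [['07', 7.85], ['15', 38.59], ['02', 23.81]]

# The key function sorts by the average (index 1) directly, no swap needed
top = sorted(sample_avg, key=lambda row: row[1], reverse=True)

print(top)  # [['15', 38.59], ['02', 23.81], ['07', 7.85]]
```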
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)
[[7.852941176470588, '07'], [13.440677966101696, '10'], [8.127272727272727, '00'], [13.233644859813085, '14'], [7.985294117647059, '23'], [23.810344827586206, '02'], [11.051724137931034, '11'], [16.796296296296298, '16'], [5.5777777777777775, '09'], [6.746478873239437, '22'], [7.170212765957447, '04'], [10.25, '08'], [7.796296296296297, '03'], [16.009174311926607, '21'], [10.8, '19'], [38.5948275862069, '15'], [14.741176470588234, '13'], [13.20183486238532, '18'], [11.383333333333333, '01'], [10.08695652173913, '05'], [21.525, '20'], [9.41095890410959, '12'], [9.022727272727273, '06'], [11.46, '17']]
# In the sorted function we use reverse=True to sort the results from largest to smallest average comments
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
Now, with the sorted data, we have the information we need to know which hours are the best to post, so we are going to print the top five results with their average number of comments.
print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour_format = "%H"  # We only need the hour info
    hour = dt.datetime.strptime(row[1], hour_format)
    hr = hour.strftime("%H:%M")
    # The format method lets us insert values into a string and print it inside the loop
    string = "{h} ---> {avg:.2f} average comments per post".format(h=hr, avg=row[0])
    print(string)
Top 5 Hours for Ask Posts Comments
15:00 ---> 38.59 average comments per post
02:00 ---> 23.81 average comments per post
20:00 ---> 21.52 average comments per post
16:00 ---> 16.80 average comments per post
21:00 ---> 16.01 average comments per post
This is a failed experiment, but an instructive one. In this part I first tried to assign a time zone to the data and then convert the datetimes to my own time zone, using the pytz library.
The time zones that Python can use are listed by a pytz function, or you can see here a list of the principal time zones. According to the documentation, the dataset's timestamps are in 'US/Eastern', so I need to convert them to 'America/Lima'.
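One way to list the names is through pytz.all_timezones; a quick check that both zones used below are recognized by the library:

```python
import pytz

# all_timezones is the library's complete list of supported zone names
print('US/Eastern' in pytz.all_timezones)    # True
print('America/Lima' in pytz.all_timezones)  # True
```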
from pytz import timezone  # timezone lets me attach a time zone to my data
import pytz

# An empty list to fill with the converted hours
sorted_swap_east = []
for row in sorted_swap[:5]:
    hour_format = "%H"
    hour = dt.datetime.strptime(row[1], hour_format)
    # The variable eastern holds the dataset's time zone
    eastern = timezone('US/Eastern')
    # The localize method attaches the time zone to the naive datetime
    hour_tz = eastern.localize(hour)
    # Now I can use the hour with the assigned time zone
    hr = hour_tz.strftime("%H:%M")
    sorted_swap_east.append([row[0], hour_tz])
    string = "{h} ---> {avg:.2f} average comments per post".format(h=hr, avg=row[0])
    print(string)
print('\n')
15:00 ---> 38.59 average comments per post
02:00 ---> 23.81 average comments per post
20:00 ---> 21.52 average comments per post
16:00 ---> 16.80 average comments per post
21:00 ---> 16.01 average comments per post
for row in sorted_swap_east:
    hour = row[1]
    # My own time zone
    tz = timezone('America/Lima')
    # astimezone converts the Eastern hour into my time zone
    hour_ame = hour.astimezone(tz)
    hr = hour_ame.strftime("%H:%M")
    string = "{h} ---> {avg:.2f} average comments per post".format(h=hr, avg=row[0])
    print(string)
14:48 ---> 38.59 average comments per post
01:48 ---> 23.81 average comments per post
19:48 ---> 21.52 average comments per post
15:48 ---> 16.80 average comments per post
20:48 ---> 16.01 average comments per post
As we can see, the result is not good: I expected to obtain 14:00 but instead obtained 14:48, so clearly something is off. A likely culprit is that parsing only "%H" gives every datetime the default date of 1900-01-01, and for a date that old pytz applies the zones' historical local mean time offsets, which are not whole hours; the 12-minute shift would be the difference between those old offsets. I don't want to dive deeper into the problem here.
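The first half of that diagnosis can be verified with the standard library alone (the local-mean-time behavior of pytz is my reading of the library, not something the original verified):

```python
import datetime as dt

# strptime fills every unspecified field with a default:
# parsing only the hour silently produces the date 1900-01-01
hour = dt.datetime.strptime("15", "%H")

print(hour)  # 1900-01-01 15:00:00
```

Attaching a full, modern date before localizing would avoid the historical offsets entirely.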
With the comments part finished, we are going to explore the points. It is exactly the same procedure, but in this case we use the index for points, so we only need to extract the data and see what we obtain.
total_ask_points = 0
for row in ask_posts:
    num_points = int(row[3])  # The points index is 3
    total_ask_points += num_points
avg_ask_points = total_ask_points / len(ask_posts)

total_show_points = 0
for row in show_posts:
    num_points = int(row[3])
    total_show_points += num_points
avg_show_points = total_show_points / len(show_posts)
print(avg_ask_points)
print(avg_show_points)
15.061926605504587 27.555077452667813
In this case we can see that the Show posts have more points than the Ask posts, which is expected: ask posts are designed to generate comments, while show posts are not.
The best exercise here is to follow the same steps as the comments analysis, applying the code to obtain the average points per hour for each kind of post.
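Since the same per-hour pipeline is about to be repeated for each list, it could also be wrapped in a helper function. A sketch; avg_per_hour, its parameter names, and the demo rows are mine, not part of the original code:

```python
import datetime as dt

def avg_per_hour(posts, value_index):
    """Return [[average, hour], ...] sorted from highest to lowest average."""
    counts, totals = {}, {}
    for row in posts:
        # Column 6 holds the creation timestamp in the dataset's format
        hour = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        totals[hour] = totals.get(hour, 0) + int(row[value_index])
    return sorted(([totals[h] / counts[h], h] for h in counts), reverse=True)

# Tiny hypothetical rows in the dataset's column layout
# (points at index 3, comments at index 4)
demo = [
    ['1', 't', 'u', '10', '2', 'a', '8/4/2016 11:52'],
    ['2', 't', 'u', '30', '4', 'b', '8/4/2016 11:10'],
    ['3', 't', 'u', '5', '1', 'c', '8/4/2016 9:30'],
]

print(avg_per_hour(demo, 3))  # [[20.0, '11'], [5.0, '09']]
```

The same function then covers comments by passing value_index=4, instead of copying the loop three more times.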
result_list_point = []
for row in ask_posts:
    created_at = row[6]
    n_points = int(row[3])
    result_list_point.append([created_at, n_points])

countspoint_by_hour = {}
points_by_hour = {}
for row in result_list_point:
    date = row[0]
    points = row[1]
    date_format = "%m/%d/%Y %H:%M"
    # strptime parses the date string into a datetime object
    time = dt.datetime.strptime(date, date_format)
    # strftime extracts just the hour
    hour = time.strftime("%H")
    if hour not in countspoint_by_hour:
        countspoint_by_hour[hour] = 1
        points_by_hour[hour] = points
    else:
        countspoint_by_hour[hour] += 1
        points_by_hour[hour] += points

avgpoint_by_hour = []
for hour in points_by_hour:
    avgpoint_by_hour.append([hour, points_by_hour[hour] / countspoint_by_hour[hour]])

swap_avgpoint_by_hour = []
for row in avgpoint_by_hour:
    swap_avgpoint_by_hour.append([row[1], row[0]])

sortedpoint_swap = sorted(swap_avgpoint_by_hour, reverse=True)
sortedpoint_swap[:5]
[[29.99137931034483, '15'], [24.258823529411764, '13'], [23.35185185185185, '16'], [19.41, '17'], [18.677966101694917, '10']]
result_list_point = []
for row in show_posts:
    created_at = row[6]
    n_points = int(row[3])
    result_list_point.append([created_at, n_points])

countspoint_by_hour = {}
points_by_hour = {}
for row in result_list_point:
    date = row[0]
    points = row[1]
    date_format = "%m/%d/%Y %H:%M"
    # strptime parses the date string into a datetime object
    time = dt.datetime.strptime(date, date_format)
    # strftime extracts just the hour
    hour = time.strftime("%H")
    if hour not in countspoint_by_hour:
        countspoint_by_hour[hour] = 1
        points_by_hour[hour] = points
    else:
        countspoint_by_hour[hour] += 1
        points_by_hour[hour] += points

avgpoint_by_hour = []
for hour in points_by_hour:
    avgpoint_by_hour.append([hour, points_by_hour[hour] / countspoint_by_hour[hour]])

swap_avgpoint_by_hour = []
for row in avgpoint_by_hour:
    swap_avgpoint_by_hour.append([row[1], row[0]])

sortedpoint_swap = sorted(swap_avgpoint_by_hour, reverse=True)
sortedpoint_swap[:5]
[[42.388888888888886, '23'], [41.68852459016394, '12'], [40.34782608695652, '22'], [37.83870967741935, '00'], [36.31147540983606, '18']]
result_list_other = []
for row in other_posts:
    created_at = row[6]
    n_other = int(row[3])
    result_list_other.append([created_at, n_other])

countsother_by_hour = {}
points_by_hour = {}
for row in result_list_other:
    date = row[0]
    points = row[1]
    date_format = "%m/%d/%Y %H:%M"
    # strptime parses the date string into a datetime object
    time = dt.datetime.strptime(date, date_format)
    # strftime extracts just the hour
    hour = time.strftime("%H")
    if hour not in countsother_by_hour:
        countsother_by_hour[hour] = 1
        points_by_hour[hour] = points
    else:
        countsother_by_hour[hour] += 1
        points_by_hour[hour] += points

avgpoint_by_hour = []
for hour in points_by_hour:
    avgpoint_by_hour.append([hour, points_by_hour[hour] / countsother_by_hour[hour]])

swap_avgpoint_by_hour = []
for row in avgpoint_by_hour:
    swap_avgpoint_by_hour.append([row[1], row[0]])

sortedpoint_swap = sorted(swap_avgpoint_by_hour, reverse=True)
sortedpoint_swap[:5]
[[153.0, '06'], [122.52631578947368, '05'], [118.25806451612904, '00'], [112.33333333333333, '23'], [111.57446808510639, '21']]
So, we can summarize the findings:
The Ask posts obtained more points in the afternoon, coinciding with some of the best hours for comments.
The Show posts obtained more points late at night, perhaps because people who like to explore new content do so in their free evening hours.
The other posts obtained more points early in the morning, which coincides with the habit of reading the news before going to work.
In this project we wanted to find which kinds of posts get the most comments and points, and at which hours.
We found that it depends on the type of post, and each kind clearly has different audience hours.
We need to know what we want to publish: for an Ask post, the afternoon is a good time; Show posts do best late at night, and other posts such as news are better in the morning.