We are analyzing a data set of submissions to the popular technology site Hacker News. Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result.
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the community a project, a product, or just generally something interesting.
This project will determine whether Ask HN or Show HN posts receive more comments on average, and then whether the time a post is created affects how many comments it receives on average.
We found that Ask HN posts receive more comments on average, and that 3:00 p.m. Eastern is the peak hour for comments each day. This will be useful to users who want answers to a specific question: they now have an idea of when to post a question to Hacker News to get the most answers.
For more details, you can explore the steps below.
First, let's read the data set into the Jupyter Notebook.
from csv import reader

# Read the data set
opened_file = open("hacker_news.csv")
read_file = reader(opened_file)
So that the header doesn't interfere with our analysis, we will separate it from the actual data set that we are querying.
# Turn the data set into a list of lists
hn = list(read_file)

# Remove the header
headers = hn[0]
hn = hn[1:]  # Reassign the data set without the header
print(headers, "\n")

# Display four sample rows from the data set
for i in range(5, 25, 5):  # Rows 5, 10, 15, and 20
    print(hn[i], "\n")
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48']

['11370829', 'Crate raises $4M seed round for its next-gen SQL database', 'http://techcrunch.com/2016/03/15/crate-raises-4m-seed-round-for-its-next-gen-sql-database/', '3', '1', 'hitekker', '3/27/2016 18:08']

['12335860', 'How often to update third party libraries?', '', '7', '5', 'rabid_oxen', '8/22/2016 12:37']

['11079821', 'APOD: LIGO detects gravity waves...', 'http://apod.nasa.gov/apod/astropix.html', '1', '2', 'AliCollins', '2/11/2016 12:57']
You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are the columns, along with a few example rows:

id | title | url | num_points | num_comments | author | created_at
---|---|---|---|---|---|---
12296411 | Ask HN: How to improve my personal website? | | 2 | 6 | ahmedbaracat | 8/16/2016 9:55:00 AM
10975351 | How to Use Open Source and Shut the F*ck Up at the Same Time | http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/ | 39 | 10 | josep2 | 1/26/2016 19:30
11964716 | Florida DJs May Face Felony for April Fools' Water Joke | http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/ | 2 | 1 | vezycash | 6/23/2016 22:20
As you can see, some entries have missing values and others don't have a tag. The missing value above is the url, which we don't need, so that entry is fine as it is. Titles that don't start with a tag (Ask HN or Show HN) will be separated from the data set later in the process.
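To illustrate why the missing url is harmless, here is a minimal sketch (using a hypothetical two-row sample in place of the real file) showing that csv.reader keeps an empty string for a missing column, so every row still has all seven fields and positional indexing stays safe:

```python
import csv
from io import StringIO

# A tiny inline sample standing in for hacker_news.csv (hypothetical rows)
sample = StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
    "2,How to Use Open Source,http://example.com/post,39,10,josep2,1/26/2016 19:30\n"
)

rows = list(csv.reader(sample))[1:]  # skip the header row

# The missing url parses as an empty string, so the row keeps 7 fields
missing_url = [row for row in rows if row[2] == ""]
print(len(rows[0]))      # 7
print(len(missing_url))  # 1
```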
Since we are only interested in posts tagged Ask HN or Show HN, we will use the startswith() method to test whether the title column of each row begins with Ask HN or Show HN. We will then append each row to a corresponding list so that each kind of post is separated: ask_post, show_post, and other_post are lists of lists containing the separated posts.
ask_post = []
show_post = []
other_post = []

# Separate "Ask HN" and "Show HN" posts into their own lists
for row in hn:  # Iterate over each row in the data set "hn"
    title = row[1]
    # Append rows whose title starts with "ask hn" to ask_post
    if title.lower().startswith("ask hn"):
        ask_post.append(row)
    # Append rows whose title starts with "show hn" to show_post
    elif title.lower().startswith("show hn"):
        show_post.append(row)
    # If neither, append to other_post
    else:
        other_post.append(row)

# Print the number of posts in each list
print("Number of posts in Ask HN: {:,}".format(len(ask_post)))
print("Number of posts in Show HN: {:,}".format(len(show_post)))
print("Number of other posts: {:,}".format(len(other_post)))
Number of posts in Ask HN: 1,744
Number of posts in Show HN: 1,162
Number of other posts: 17,194
Ask HN has more posts than Show HN. For a technology site, this suggests that more posts ask questions than show something interesting. But this alone does not tell us whether ask_post has more comments than show_post.
Now that we have segregated the data set by tag, we will sum all the comments across the posts in each list and divide by the number of posts, leaving us with the average comments per post. The code below determines the average comments for both tags (ask_post and show_post).
total_ask_comments = 0   # Running total of ask_post comments
total_show_comments = 0  # Running total of show_post comments

# Total all of the comments in ask_post
for post in ask_post:
    num_comments = int(post[4])  # num_comments column, converted to an integer
    total_ask_comments += num_comments

# Total all of the comments in show_post
for post in show_post:
    num_comments = int(post[4])
    total_show_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_post)  # Average comments per ask post
print("Average Comments in ask_post: {:.2f}".format(avg_ask_comments))

avg_show_comments = total_show_comments / len(show_post)  # Average comments per show post
print("Average Comments in show_post: {:.2f}".format(avg_show_comments))
Average Comments in ask_post: 14.04
Average Comments in show_post: 10.32
Show HN garnered an average of 10.32 comments per post, while Ask HN is 3.72 ahead with 14.04. Since ask_post has more comments on average than show_post, we will use ask_post to determine the peak time for receiving comments.
In the previous cell we determined which tag gets the most comments on average; now we will look in greater detail at what time it is optimal to post an Ask HN to garner a lot of comments. To do that, we need to count the posts per hour of the day and the comments per hour of the day, then use the hour as a key to divide the comment total by the post total, leaving us the average comments per post in each hour.
The code below creates dictionaries of comment counts per hour and post counts per hour, which we will use later to compute the average comments per post per hour.
Ask HN has more average comments than Show HN, so we will now use ask_post to count the posts and comments in each hour of the day.

import datetime as dt  # Module to process dates and times
result_list = []  # Will hold [created_at, comments] pairs

# Build the list-of-lists data set
for post in ask_post:
    time_created = post[6]
    comments = int(post[4])  # Convert the comment count to an integer
    result_list.append([time_created, comments])
counts_by_hour = {}    # Dictionary to store total posts per hour of the day
comments_by_hour = {}  # Dictionary to store total comments per hour of the day

# Build dictionaries with the hour as the key and counts as the values
for result in result_list:  # Retrieve each row of result_list
    # Assign each element of the row to a descriptive variable
    date_str = result[0]
    comments = result[1]
    time = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")  # Parse created_at into a datetime object
    hour = str(time.hour)  # Keep only the hour of the datetime object
    # Initialize or increment the counts for this hour
    if hour not in counts_by_hour:  # First post seen in this hour: initialize
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:  # Otherwise, add to the running totals
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
Now that we have the count of posts and comments in each hour, we need to find the average comments per post in each hour, which will let us determine the time of day with the most comments per post.
avg_by_hour = []  # List of lists: [hour of day, average comments per post in that hour]

# Determine the average comments per post in each hour; both dictionaries
# share the same hour keys, so we can look up the comment total directly
for hour, post_count in counts_by_hour.items():
    avg_by_hour.append([hour, comments_by_hour[hour] / post_count])
Now that we have the average comments per post for each hour of the day, all we have to do is make it readable so we can see the peak time for posting an Ask HN.
The code below sorts the list of average comments in descending order. To do that, we swap the elements of each inner list so that sorting orders by the average comments per post, not by the hour the posts were created.
swap_avg_by_hour = []  # avg_by_hour with the elements of each row swapped, for sorting

# Swap each [hour, average] row into [average, hour]
for row in avg_by_hour:
    # Assign columns to variables for readability
    hour = row[0]
    comments = row[1]
    swap_avg_by_hour.append([comments, hour])  # Append the swapped row

sorted_swap = sorted(swap_avg_by_hour, reverse=True)  # Sort the list in descending order
sorted_swap
[[38.5948275862069, '15'], [23.810344827586206, '2'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '1'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '8'], [10.08695652173913, '5'], [9.41095890410959, '12'], [9.022727272727273, '6'], [8.127272727272727, '0'], [7.985294117647059, '23'], [7.852941176470588, '7'], [7.796296296296297, '3'], [7.170212765957447, '4'], [6.746478873239437, '22'], [5.5777777777777775, '9']]
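As an aside, the swap step above can be avoided entirely by passing a key function to sorted(). A minimal sketch on a hypothetical subset of avg_by_hour:

```python
# Hypothetical subset of avg_by_hour: [hour, average comments per post]
avg_by_hour = [["9", 5.58], ["15", 38.59], ["2", 23.81]]

# The key function tells sorted() to order by the second element directly,
# so no swapped copy of the list is needed
top = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)
print(top)  # [['15', 38.59], ['2', 23.81], ['9', 5.58]]
```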
We have now sorted the list. The only step left is to display the top 5 hours for posting an Ask HN and format them for readability.
print("Top 5 Hours for Ask Posts Comments in Eastern Time")

# Output the top 5 hours by average comments per post
for i in range(5):
    hours = dt.datetime.strptime(sorted_swap[i][1], "%H").strftime("%H:%M:%S")  # Format the hour as HH:MM:SS
    print("{hours}: {comments:.2f} average comments per post".format(hours=hours, comments=sorted_swap[i][0]))
Top 5 Hours for Ask Posts Comments in Eastern Time
15:00:00: 38.59 average comments per post
02:00:00: 23.81 average comments per post
20:00:00: 21.52 average comments per post
16:00:00: 16.80 average comments per post
21:00:00: 16.01 average comments per post
The output above applies only to Eastern Time and may not be directly usable for users in other timezones. For convenience, I will convert Eastern Time to 3 different timezones.
import pytz

localFormat = "%H:%M:%S"  # Output time format
timezones = ["America/Los_Angeles", "Europe/Madrid", "America/Puerto_Rico"]  # Target timezones
eastern = pytz.timezone("US/Eastern")

for tz in timezones:
    print("\nEastern Timezone -> {} Timezone".format(tz))
    for i in range(5):  # Iterate over the top 5 hours with the most average comments per post
        # Anchor the hour to a modern date before localizing: strptime("%H")
        # defaults to the year 1900, and localizing that makes pytz apply a
        # pre-standard local-mean-time offset with odd extra minutes
        hours = dt.datetime.strptime(sorted_swap[i][1], "%H").replace(year=2016, month=1, day=1)
        est_moment = eastern.localize(hours)  # Attach the Eastern timezone
        localDatetime = est_moment.astimezone(pytz.timezone(tz))  # Convert to the target timezone
        print("{0} -> {1} : {2:.2f} Average Comments per Post".format(est_moment.strftime(localFormat), localDatetime.strftime(localFormat), sorted_swap[i][0]))

Eastern Timezone -> America/Los_Angeles Timezone
15:00:00 -> 12:00:00 : 38.59 Average Comments per Post
02:00:00 -> 23:00:00 : 23.81 Average Comments per Post
20:00:00 -> 17:00:00 : 21.52 Average Comments per Post
16:00:00 -> 13:00:00 : 16.80 Average Comments per Post
21:00:00 -> 18:00:00 : 16.01 Average Comments per Post

Eastern Timezone -> Europe/Madrid Timezone
15:00:00 -> 21:00:00 : 38.59 Average Comments per Post
02:00:00 -> 08:00:00 : 23.81 Average Comments per Post
20:00:00 -> 02:00:00 : 21.52 Average Comments per Post
16:00:00 -> 22:00:00 : 16.80 Average Comments per Post
21:00:00 -> 03:00:00 : 16.01 Average Comments per Post

Eastern Timezone -> America/Puerto_Rico Timezone
15:00:00 -> 16:00:00 : 38.59 Average Comments per Post
02:00:00 -> 03:00:00 : 23.81 Average Comments per Post
20:00:00 -> 21:00:00 : 21.52 Average Comments per Post
16:00:00 -> 17:00:00 : 16.80 Average Comments per Post
21:00:00 -> 22:00:00 : 16.01 Average Comments per Post
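As a cross-check, the same conversion can be done with the standard-library zoneinfo module (Python 3.9+), which sidesteps the pytz pitfall: localizing a default year-1900 datetime picks up pre-standard local-mean-time offsets with odd extra minutes. The January 2016 anchor date is an assumption so that standard-time (non-DST) offsets apply:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Standard library in Python 3.9+

# Anchor the peak hour (15:00 Eastern) to a real date so the correct
# UTC offset is used; January 2016 is an assumed standard-time date
est_time = datetime(2016, 1, 1, 15, 0, tzinfo=ZoneInfo("US/Eastern"))

for tz in ["America/Los_Angeles", "Europe/Madrid", "America/Puerto_Rico"]:
    local = est_time.astimezone(ZoneInfo(tz))
    print(tz, local.strftime("%H:%M:%S"))
```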
And there you have it: the top 5 hours to post your Ask HN to gain a large volume of comments. Hope you find this useful!
TIP: You can convert this to your own timezone by modifying the list of timezones above!
We have determined that Ask HN posts receive more comments on average than Show HN posts, and that 3:00 p.m. Eastern is the peak time each day for comments on ask posts. This answers the two questions we posed earlier.
This project will be beneficial to users who want answers to their specific questions: they now have an idea of when to post a question to Hacker News to gain a lot of answers.