In this project, we are going to work with a dataset from Hacker News.
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
The original dataset is available on Kaggle.
id | title | url |
---|---|---|
12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ |
10975351 | How to Use Open Source and Shut the F*ck Up at the Same Time | http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/ |
11964716 | Florida DJs May Face Felony for April Fools' Water Joke | http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/ |
11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429 |
10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0 |
We're specifically interested in posts whose titles begin with either Ask HN or Show HN.
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting. Below are a couple of examples:
Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
**The purpose of this project is to answer the following questions:**
(1) Which kind of posts receive the most comments?
As a start, we will do the following:
(1) Read the hacker_news.csv file in as a list of lists.
from csv import reader
opened_file = open('hacker_news.csv') #(1)
read_file = reader(opened_file)
hn = list(read_file) #(2)
opened_file.close() #release the file handle once the rows are in memory
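The same read can be sketched with a small helper so that the caller controls how the file is opened and closed. The helper name `read_rows` and the in-memory sample below are illustrative assumptions, not part of the original notebook.

```python
import csv
import io


def read_rows(file_obj):
    """Return every row of an open CSV file object as a list of lists."""
    return list(csv.reader(file_obj))


# Hypothetical two-line sample standing in for hacker_news.csv.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)
rows = read_rows(sample)
header, data = rows[0], rows[1:]
print(header)
print(len(data))
```

With the real file, the helper would typically be called inside `with open('hacker_news.csv', newline='') as f:` so the file is closed automatically.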
1.1. Header and columns
Next, after opening the data we want to:
hn_header = hn[0] #(1)
print('header: ' + str(hn_header)) #(2)
#(3) column description below
header: ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
Column description:
id: The unique identifier from Hacker News for the post
1.2. Overview
Next, we want to get an overview of the dataset that we are working on.
To do that, we will do the following:
(1) Separate the header from the rest of the file
hn = hn[1:] #(1)
print("The first 4 rows are:") #(2)
hn[:4]
The first 4 rows are:
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
print('length: ' + str(len(hn)) + ' rows') #(3)
length: 20100 rows
Note: To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith( ).
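As a minimal sketch of how startswith( ) behaves here, the hypothetical helper `classify_title` below lowercases a title before testing its prefix, so that capitalization differences do not matter. The function name and the three sample titles are chosen only for illustration.

```python
def classify_title(title):
    """Label a title as 'ask', 'show', or 'other' (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in titles]
print(labels)
```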
#(1)
ask_posts = [] #Ask HN
show_posts = [] #Show HN
other_posts = []
for row in hn: #(2)
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('The first four rows in each list:') #(3)
print('\n')
print('--ask_posts--')
print(ask_posts[:4])
print('\n')
print('--show_posts--')
print(show_posts[:4])
print('\n')
print('--other_posts--')
print(other_posts[:4])
The first four rows in each list:

--ask_posts--
[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']]

--show_posts--
[['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05'], ['12178806', 'Show HN: Webscope Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']]

--other_posts--
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
print('Checking the number of posts:') #(4)
print('\n')
print('length of ask_posts: ' + str(len(ask_posts)) + ' posts')
print('length of show_posts: ' + str(len(show_posts)) + ' posts')
print('length of other_posts: ' + str(len(other_posts)) + ' posts')
Checking the number of posts:

length of ask_posts: 1744 posts
length of show_posts: 1162 posts
length of other_posts: 17194 posts
3.1. explore( )
To explore the num_comments (index 4) or num_points (index 3) column in each list, we create a function, explore( ).
Using explore( ), we explore the num_comments column in ask_posts and show_posts.
def explore(list_name, variable_name):
    total_variable = 0 #(1)
    row_variable = 0
    if variable_name == 'comments':
        row_variable = 4
    if variable_name == 'points':
        row_variable = 3
    for row in list_name: #(2)
        num_variable = int(row[row_variable])
        total_variable += num_variable
    avg_variable = total_variable / len(list_name) #(3)
    print(round(avg_variable,2)) #(4)
print('Average comments on ask_posts:') #calling explore( )
explore(ask_posts, 'comments') #to explore the num_comments column
print('\n') #in ask_posts, and show_posts
print('Average comments on show_posts:')
explore(show_posts, 'comments')
Average comments on ask_posts:
14.04

Average comments on show_posts:
10.32
Using explore( ), we found the following:
Since ask_posts has more comments on average (14.04 versus 10.32 for show_posts), Ask HN posts tend to attract more comments. Hence, we will focus on ask_posts when exploring the num_comments column.
This answers our first question: Which kind of posts receive the most comments?
Answer: Posts that start with Ask HN (ask_posts)
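The averaging step can also be sketched as a function that returns its result instead of printing it, which makes the value easier to reuse or compare. The name `average_column` is a hypothetical helper; the three sample rows are the first three Ask HN rows shown earlier, so the averages below describe only that tiny sample, not the full dataset.

```python
def average_column(rows, column_index):
    """Return the mean of an integer column, rounded to 2 decimals."""
    values = [int(row[column_index]) for row in rows]
    if not values:
        return 0.0  # avoid dividing by zero on an empty list
    return round(sum(values) / len(values), 2)


# First three ask_posts rows from the output shown earlier.
sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
]
avg_comments = average_column(sample_rows, 4)  # num_comments is index 4
avg_points = average_column(sample_rows, 3)    # num_points is index 3
print(avg_comments, avg_points)
```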
Finding the optimal time to create a post in ask_posts in order to get the most comments
We want to determine whether there is an optimal time to create a post in ask_posts that gets the most comments.
We will do the following:
(1) Calculate the number of posts created in ask_posts in each hour of the day (posts per hour).
3.2 hourly_average( )
We will carry out the steps above by creating a function, hourly_average( ), which does the following:
(1) Create a new variable, row_variable, and set it to 0
- row_variable will hold the column index for comments (4) or points (3)
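The key step in this kind of hourly grouping is turning the created_at string into an hour label. A minimal sketch, assuming the `%m/%d/%Y %H:%M` layout seen in the dataset and an English locale for the AM/PM marker; `hour_label` is a hypothetical helper name.

```python
import datetime as dt

DATE_FORMAT = "%m/%d/%Y %H:%M"


def hour_label(created_at):
    """Parse a created_at string and return its hour as e.g. '07 PM'."""
    parsed = dt.datetime.strptime(created_at, DATE_FORMAT)
    return parsed.strftime("%I %p")  # 12-hour clock plus AM/PM


# Two created_at values taken from the rows shown earlier.
morning = hour_label('8/4/2016 11:52')
evening = hour_label('1/26/2016 19:30')
print(morning, evening)
```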
import datetime as dt
def hourly_average(list_name, variable_name):
    #(1)
    row_variable = 0
    if variable_name == 'comments':
        row_variable = 4
    if variable_name == 'points':
        row_variable = 3
    #(2)
    posts_per_hour = {} #counting number of posts per hour
    variable_per_hour = {} #counting number of comments (or points) per hour
    #(3)
    date_format = "%m/%d/%Y %H:%M"
    #(4)
    for row in list_name:
        num_variable = int(row[row_variable])
        datetime_str = row[6]
        #(5)
        datetime_dt = dt.datetime.strptime(datetime_str, date_format) #parse the date and create a datetime object
        hour_str = datetime_dt.strftime("%I %p")
        #(6)
        if hour_str not in posts_per_hour:
            posts_per_hour[hour_str] = 1
            variable_per_hour[hour_str] = num_variable
        else:
            posts_per_hour[hour_str] += 1
            variable_per_hour[hour_str] += num_variable
    #(7)
    print('posts_per_hour')
    print(posts_per_hour)
    print('\n')
    if variable_name == 'comments':
        print('comments_per_hour')
    if variable_name == 'points':
        print('points_per_hour')
    print(variable_per_hour)
    #(8)
    avg_by_hour = []
    #(9)
    for hour in posts_per_hour:
        avg_by_hour.append([ round(variable_per_hour[hour] / posts_per_hour[hour],2), hour ])
    #(10)
    avg_by_hour = sorted(avg_by_hour, reverse = True)
    print('\n')
    print('avg_by_hour (sorted in descending order)')
    print(avg_by_hour)
    #(11)
    return avg_by_hour
avg_hr_comments = hourly_average(ask_posts, 'comments') #function call
posts_per_hour
{'09 AM': 45, '01 PM': 85, '10 AM': 59, '02 PM': 107, '04 PM': 108, '11 PM': 68, '12 PM': 73, '05 PM': 100, '03 PM': 116, '09 PM': 109, '08 PM': 80, '02 AM': 58, '06 PM': 109, '03 AM': 54, '05 AM': 46, '07 PM': 110, '01 AM': 60, '10 PM': 71, '08 AM': 48, '04 AM': 47, '12 AM': 55, '06 AM': 44, '07 AM': 34, '11 AM': 58}

comments_per_hour
{'09 AM': 251, '01 PM': 1253, '10 AM': 793, '02 PM': 1416, '04 PM': 1814, '11 PM': 543, '12 PM': 687, '05 PM': 1146, '03 PM': 4477, '09 PM': 1745, '08 PM': 1722, '02 AM': 1381, '06 PM': 1439, '03 AM': 421, '05 AM': 464, '07 PM': 1188, '01 AM': 683, '10 PM': 479, '08 AM': 492, '04 AM': 337, '12 AM': 447, '06 AM': 397, '07 AM': 267, '11 AM': 641}

avg_by_hour (sorted in descending order)
[[38.59, '03 PM'], [23.81, '02 AM'], [21.52, '08 PM'], [16.8, '04 PM'], [16.01, '09 PM'], [14.74, '01 PM'], [13.44, '10 AM'], [13.23, '02 PM'], [13.2, '06 PM'], [11.46, '05 PM'], [11.38, '01 AM'], [11.05, '11 AM'], [10.8, '07 PM'], [10.25, '08 AM'], [10.09, '05 AM'], [9.41, '12 PM'], [9.02, '06 AM'], [8.13, '12 AM'], [7.99, '11 PM'], [7.85, '07 AM'], [7.8, '03 AM'], [7.17, '04 AM'], [6.75, '10 PM'], [5.58, '09 AM']]
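The count-and-total bookkeeping behind these averages can be sketched more compactly with `collections.defaultdict`, which removes the need for the "is this key already present" check. This is an alternative sketch, not the notebook's own implementation; `average_by_key` and the sample pairs are made up for illustration.

```python
from collections import defaultdict


def average_by_key(pairs):
    """Given (key, value) pairs, return [[average, key], ...] sorted descending."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
        counts[key] += 1
    averages = [[round(totals[k] / counts[k], 2), k] for k in totals]
    return sorted(averages, reverse=True)


# Made-up (hour, comments) pairs, for illustration only.
pairs = [('03 PM', 10), ('03 PM', 20), ('02 AM', 7)]
result = average_by_key(pairs)
print(result)
```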
3.3. print_hourly_avg( )
Now that we have the data we need, it is time to display our findings.
We want to print the top 5 hours for posting in order to get the most comments.
(1) Loop through each element in list_name up to the 5th row (since we want to print the top 5 hours)
def print_hourly_avg(list_name):
    print('==============================================================================')
    for variable_per_post, hour in list_name[:5]: #(1)
        output = "{h}: {cp:.2f} per post".format(h = hour, cp = variable_per_post) #(2)
        print(output)
3.4. convert_pt( )
import pytz #(1)
def convert_pt(list_name):
    et_timezone = pytz.timezone("US/Eastern") #(2)
    pt_timezone = pytz.timezone("America/Los_Angeles")
    for row in list_name: #(3)
        hour = dt.datetime.strptime(row[1] ,"%I %p") #(4)
        et = et_timezone.localize(hour)
        pt = et.astimezone(pt_timezone)
        pt_string = pt.strftime("%I %p")
        row[1] = pt_string #overwrites the hour in place
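Because convert_pt( ) overwrites each hour in place, calling it twice on the same list would shift the hours twice. A minimal non-mutating sketch is shown here, assuming a fixed three-hour gap between Eastern and Pacific time (a simplification that ignores time zone databases entirely). The names `shift_hour_label` and `to_pacific` are hypothetical; the sample rows are the top two entries from the averages above.

```python
import datetime as dt


def shift_hour_label(hour_label, offset_hours):
    """Shift a label such as '03 PM' by a whole number of hours."""
    parsed = dt.datetime.strptime(hour_label, "%I %p")
    # Wrap around midnight with modulo arithmetic on the hour.
    shifted = parsed.replace(hour=(parsed.hour + offset_hours) % 24)
    return shifted.strftime("%I %p")


def to_pacific(avg_by_hour):
    """Return a new list with ET labels moved 3 hours back; input is untouched."""
    return [[value, shift_hour_label(hour, -3)] for value, hour in avg_by_hour]


et_rows = [[38.59, '03 PM'], [23.81, '02 AM']]
pt_rows = to_pacific(et_rows)
print(pt_rows)
```

Returning a new list keeps the Eastern Time results available for later comparison.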
We just created 2 functions:
(i). print_hourly_avg( )
(ii). convert_pt( )
Next, we will use both functions to display the top 5 hours to get the most comments in both Eastern Time (ET) and Pacific Time (PT).
#Calling both functions: print_hourly_avg( ), and convert_pt( )
print('Top 5 Hours to post on Ask Posts to get the most comments in Eastern Time (ET)')
print_hourly_avg(avg_hr_comments)
convert_pt(avg_hr_comments)
print('\n')
print('Top 5 Hours to post on Ask Posts to get the most comments in Pacific Time (PT)')
print_hourly_avg(avg_hr_comments)
Top 5 Hours to post on Ask Posts to get the most comments in Eastern Time (ET)
==============================================================================
03 PM: 38.59 per post
02 AM: 23.81 per post
08 PM: 21.52 per post
04 PM: 16.80 per post
09 PM: 16.01 per post

Top 5 Hours to post on Ask Posts to get the most comments in Pacific Time (PT)
==============================================================================
12 PM: 38.59 per post
11 PM: 23.81 per post
05 PM: 21.52 per post
01 PM: 16.80 per post
06 PM: 16.01 per post
Finally! We get the top 5 hours for posting that will give us the most comments.
However, one question comes to mind:
Do the posts created at those hours really get the most comments because of their timing, or is another factor at play, such as the author or the post title?
3.5. check_author( )
We want to check the top 5 authors with the most comments.
def check_author(list_name, variable_name):
    authors = {} #(1)
    if variable_name == 'comments':
        row_variable = 4
    if variable_name == 'points':
        row_variable = 3
    for row in list_name:
        author = row[5]
        num_variable = int(row[row_variable])
        if author not in authors: #(3)
            authors[author] = num_variable
        else:
            authors[author] += num_variable
    authors_list = [] #(4)
    for author in authors: #(5)
        authors_list.append([authors[author], author])
    authors_list = sorted(authors_list, reverse = True) #(6)
    return authors_list #(7)
#calling check_author( )
print('Top 5 Authors with the most comments')
print('====================================')
author_most_comments = check_author(ask_posts, 'comments')
author_most_comments[:5]
Top 5 Authors with the most comments
====================================
[[3046, 'whoishiring'], [868, 'mod50ack'], [691, 'boren_ave11'], [531, 'schappim'], [520, 'sama']]
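Summing a column per author and taking the largest totals is the kind of task `collections.Counter` handles directly. The sketch below is an alternative to check_author( ), not a replacement the notebook uses; `top_authors`, the author names, and the rows are hypothetical placeholders.

```python
from collections import Counter


def top_authors(rows, column_index, n=5):
    """Return the n (author, total) pairs with the largest column totals."""
    totals = Counter()
    for row in rows:
        totals[row[5]] += int(row[column_index])  # author is index 5
    return totals.most_common(n)


# Hypothetical rows in the dataset's column layout, for illustration only.
rows = [
    ['1', 'Ask HN: first', '', '5', '10', 'alice', '1/1/2016 10:00'],
    ['2', 'Ask HN: second', '', '3', '4', 'bob', '1/2/2016 11:00'],
    ['3', 'Ask HN: third', '', '1', '7', 'alice', '1/3/2016 12:00'],
]
by_comments = top_authors(rows, 4)
print(by_comments)
```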
Now that we have the top 5 authors, we want to check the titles of their posts.
3.6. print_title( )
(1) Loop through each row in list_name
def print_title(list_name, author_name):
    for row in list_name: #(1)
        author = row[5] #(2)
        title = row[1]
        if author == author_name: #(3)
            print(title)
#calling print_title( )
print("whoishiring's posts title")
print('==========================')
print_title(ask_posts, 'whoishiring')
print('\n')
print("mod50ack's posts title")
print('======================')
print_title(ask_posts, 'mod50ack')
print('\n')
print("boren_ave11's posts title")
print('=========================')
print_title(ask_posts, 'boren_ave11')
print('\n')
print("schappim's posts title")
print('======================')
print_title(ask_posts, 'schappim')
print('\n')
print("sama's posts title")
print('===================')
print_title(ask_posts, 'sama')
whoishiring's posts title
==========================
Ask HN: Who wants to be hired? (June 2016)
Ask HN: Freelancer? Seeking freelancer? (December 2015)
Ask HN: Who is hiring? (September 2016)
Ask HN: Who wants to be hired? (August 2016)
Ask HN: Freelancer? Seeking freelancer? (September 2016)
Ask HN: Who is hiring? (August 2016)
Ask HN: Who wants to be hired? (April 2016)
Ask HN: Freelancer? Seeking freelancer? (November 2015)
Ask HN: Who wants to be hired? (March 2016)

mod50ack's posts title
======================
Ask HN: What's the best tool you used to use that doesn't exist anymore?

boren_ave11's posts title
=========================
Ask HN: How much do you make at Amazon? Here is how much I make at Amazon

schappim's posts title
======================
Ask HN: What book have you given as a gift?
Ask HN: Is it feasible to port Apple's Swift to the ESP8266?

sama's posts title
===================
Ask HN: What should we fund at YC Research?
From the result above, we find that all of whoishiring's posts are related to hiring, whether they are aimed at people who want to be hired or at people who are hiring.
It appears that the posts gathered a lot of comments not because they were timed perfectly, but because of their nature. Hiring-related posts naturally attract many comments, regardless of the time.
For this reason, I decided to check the times at which these posts were created. If they were created during the top 5 hours mentioned above, I am going to remove whoishiring's posts from the analysis, because their nature is different from that of the other authors' posts.
3.7. author_post_hr_list( )
To get the author posting times, we create a function, author_post_hr_list( )
This function takes list_name and author_name as arguments and does the following:
(1) Create an empty dictionary, author_dict, to hold a frequency table with the hour as key and the number of posts created as its value
def author_post_hr_list(list_name, author_name):
    date_format = "%m/%d/%Y %H:%M"
    author_dict = {} #(1)
    for row in list_name:
        if (row[5] == author_name):
            datetime_str = row[6]
            datetime_dt = dt.datetime.strptime(datetime_str, date_format) #(2)
            hour_str = datetime_dt.strftime("%I %p")
            if hour_str not in author_dict: #(3)
                author_dict[hour_str] = 1
            else:
                author_dict[hour_str] += 1
    author_list = [] #(4)
    for hour in author_dict:
        author_list.append([author_dict[hour], hour])
    return author_list #(5)
#calling author_post_hr_list( )
whoishiring_hr = author_post_hr_list(ask_posts, 'whoishiring')
print('whoishiring_hr')
print(whoishiring_hr)
whoishiring_hr [[9, '03 PM']]
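A frequency table of posting hours for one author can also be sketched with `collections.Counter` fed by a generator expression. This is an illustrative alternative under the same date format assumption; `author_hour_counts`, the author name `example_user`, and the rows are hypothetical.

```python
import datetime as dt
from collections import Counter


def author_hour_counts(rows, author_name, date_format="%m/%d/%Y %H:%M"):
    """Count how many of an author's posts fall in each hour of the day."""
    hours = (
        dt.datetime.strptime(row[6], date_format).strftime("%I %p")
        for row in rows
        if row[5] == author_name
    )
    return Counter(hours)


# Hypothetical rows in the dataset's column layout, for illustration only.
rows = [
    ['1', 'Ask HN: a', '', '1', '1', 'example_user', '6/1/2016 15:00'],
    ['2', 'Ask HN: b', '', '1', '1', 'example_user', '7/1/2016 15:01'],
    ['3', 'Ask HN: c', '', '1', '1', 'someone_else', '6/1/2016 9:00'],
]
counts = author_hour_counts(rows, 'example_user')
print(counts)
```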
3.8. print_post_count( )
(2) Use format( ) to print string in a formatted way
(3) Print the output
def print_post_count(list_name): #(1)
    for posts, hour in list_name:
        output = '{p} post(s) created at {h}'.format(h = hour, p = posts) #(2)
        print(output) #(3)
print("whoishiring's post count in Eastern Time (ET)")
print('=============================================')
print_post_count(whoishiring_hr)
whoishiring's post count in Eastern Time (ET)
=============================================
9 post(s) created at 03 PM
Next, we will convert the time from ET to PT using convert_pt( ), which we defined earlier, and then print the posting time in PT using print_post_count( ).
convert_pt(whoishiring_hr) #calling a function that we defined earlier
print("whoishiring's post count in Pacific Time (PT)")
print('=============================================')
print_post_count(whoishiring_hr) #calling a function that we defined earlier
whoishiring's post count in Pacific Time (PT)
=============================================
9 post(s) created at 12 PM
Based on the result above, we find that whoishiring created 9 posts at 12 PM (PT). If we look back to our findings above, 12 PM (PT) is ranked number 1 in the Top 5 Hours to post on Ask Posts to get the most comments in Pacific Time (PT).
3.9. Remove whoishiring's posts from the analysis and re-explore the num_comments column
As previously mentioned, since whoishiring's posts were created during one of the top 5 hours, I am going to remove them from the analysis, because their nature is different from that of the other authors' posts.
(Note: whoishiring's posts are related to hiring)
We are going to do the following:
posts_without_whoishiring = [] #(1)
for row in ask_posts:
    author = row[5]
    if author != 'whoishiring':
        posts_without_whoishiring.append(row)
avg_hr_comments_wo_whoishiring = hourly_average(posts_without_whoishiring, 'comments') #(2)
posts_per_hour
{'09 AM': 45, '01 PM': 85, '10 AM': 59, '02 PM': 107, '04 PM': 108, '11 PM': 68, '12 PM': 73, '05 PM': 100, '03 PM': 107, '09 PM': 109, '08 PM': 80, '02 AM': 58, '06 PM': 109, '03 AM': 54, '05 AM': 46, '07 PM': 110, '01 AM': 60, '10 PM': 71, '08 AM': 48, '04 AM': 47, '12 AM': 55, '06 AM': 44, '07 AM': 34, '11 AM': 58}

comments_per_hour
{'09 AM': 251, '01 PM': 1253, '10 AM': 793, '02 PM': 1416, '04 PM': 1814, '11 PM': 543, '12 PM': 687, '05 PM': 1146, '03 PM': 1431, '09 PM': 1745, '08 PM': 1722, '02 AM': 1381, '06 PM': 1439, '03 AM': 421, '05 AM': 464, '07 PM': 1188, '01 AM': 683, '10 PM': 479, '08 AM': 492, '04 AM': 337, '12 AM': 447, '06 AM': 397, '07 AM': 267, '11 AM': 641}

avg_by_hour (sorted in descending order)
[[23.81, '02 AM'], [21.52, '08 PM'], [16.8, '04 PM'], [16.01, '09 PM'], [14.74, '01 PM'], [13.44, '10 AM'], [13.37, '03 PM'], [13.23, '02 PM'], [13.2, '06 PM'], [11.46, '05 PM'], [11.38, '01 AM'], [11.05, '11 AM'], [10.8, '07 PM'], [10.25, '08 AM'], [10.09, '05 AM'], [9.41, '12 PM'], [9.02, '06 AM'], [8.13, '12 AM'], [7.99, '11 PM'], [7.85, '07 AM'], [7.8, '03 AM'], [7.17, '04 AM'], [6.75, '10 PM'], [5.58, '09 AM']]
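The filtering step can be written as a list comprehension, and the effect of the removal on the 03 PM bucket can be checked by hand. In this sketch `exclude_author` and the three tiny rows are hypothetical; the four bucket figures are copied from the two hourly_average( ) outputs, and the difference should match the 9 posts and 3046 comments attributed to whoishiring earlier.

```python
def exclude_author(rows, author_name):
    """Return a new list without the given author's rows (author is index 5)."""
    return [row for row in rows if row[5] != author_name]


# Hypothetical rows, for illustration only.
rows = [
    ['1', 'Ask HN: a', '', '1', '2', 'whoishiring', '6/1/2016 15:00'],
    ['2', 'Ask HN: b', '', '1', '3', 'someone_else', '6/1/2016 9:00'],
]
kept = exclude_author(rows, 'whoishiring')

# 03 PM figures copied from the outputs before and after the removal.
posts_before, comments_before = 116, 4477
posts_after, comments_after = 107, 1431
removed_posts = posts_before - posts_after
removed_comments = comments_before - comments_after
avg_before = round(comments_before / posts_before, 2)
avg_after = round(comments_after / posts_after, 2)
print(removed_posts, removed_comments, avg_before, avg_after)
```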
print('avg_hr_comments_wo_whoishiring')
print(avg_hr_comments_wo_whoishiring) #printing the result from (2)
avg_hr_comments_wo_whoishiring [[23.81, '02 AM'], [21.52, '08 PM'], [16.8, '04 PM'], [16.01, '09 PM'], [14.74, '01 PM'], [13.44, '10 AM'], [13.37, '03 PM'], [13.23, '02 PM'], [13.2, '06 PM'], [11.46, '05 PM'], [11.38, '01 AM'], [11.05, '11 AM'], [10.8, '07 PM'], [10.25, '08 AM'], [10.09, '05 AM'], [9.41, '12 PM'], [9.02, '06 AM'], [8.13, '12 AM'], [7.99, '11 PM'], [7.85, '07 AM'], [7.8, '03 AM'], [7.17, '04 AM'], [6.75, '10 PM'], [5.58, '09 AM']]
print('Top 5 Hours to post on Ask Posts to get the most comments in Eastern Time (ET)')
print_hourly_avg(avg_hr_comments_wo_whoishiring) #(3)
Top 5 Hours to post on Ask Posts to get the most comments in Eastern Time (ET)
==============================================================================
02 AM: 23.81 per post
08 PM: 21.52 per post
04 PM: 16.80 per post
09 PM: 16.01 per post
01 PM: 14.74 per post
convert_pt(avg_hr_comments_wo_whoishiring) #(4)
print('Top 5 Hours to post on Ask Posts to get the most comments Pacific Time (PT)')
print_hourly_avg(avg_hr_comments_wo_whoishiring) #(5)
Top 5 Hours to post on Ask Posts to get the most comments Pacific Time (PT)
==============================================================================
11 PM: 23.81 per post
05 PM: 21.52 per post
01 PM: 16.80 per post
06 PM: 16.01 per post
10 AM: 14.74 per post
At last! We found the optimal time to post on ask_posts to get the most comments. Based on the result above, after removing whoishiring's posts from the analysis, the best time to post in order to get the most comments is 11 PM (PT), or 2 AM (ET).
That answers our second question: What is the optimal time to create posts that gather the most comments?
There is one more thing: we want to find the optimal time to get the most points too. To do that, we are going to explore num_points the same way we explored num_comments.
Thankfully, we created a lot of functions when exploring num_comments, so exploring num_points will be easier. All we need to do is call the functions we defined earlier.
4.1. explore( )
Using explore( ), we will be able to find the average points in both ask_posts and show_posts
print('Average points on ask_posts:')
explore(ask_posts, 'points')
print('\n')
print('Average points on show_posts:')
explore(show_posts, 'points')
Average points on ask_posts:
15.06

Average points on show_posts:
27.56
Since show_posts has more points on average (27.56 versus 15.06 for ask_posts), Show HN posts tend to earn more points. Hence, we will focus on show_posts when exploring the num_points column.
This answers our third question: Which kind of posts receive the most points?
Answer: Posts that start with Show HN (show_posts)
4.2 hourly_average( )
Using hourly_average( ), we will do the following:
(1) Calculate the number of posts created in show_posts in each hour of the day (posts per hour).
print('avg_hr_points')
print('=============')
avg_hr_points = hourly_average(show_posts, 'points') #(1), (2), (3)
avg_hr_points
=============
posts_per_hour
{'02 PM': 86, '10 PM': 46, '06 PM': 61, '07 AM': 26, '08 PM': 60, '05 AM': 19, '04 PM': 93, '07 PM': 55, '03 PM': 78, '03 AM': 27, '05 PM': 93, '06 AM': 16, '02 AM': 30, '01 PM': 99, '08 AM': 34, '09 PM': 47, '04 AM': 26, '11 AM': 44, '12 PM': 61, '11 PM': 36, '09 AM': 30, '01 AM': 28, '10 AM': 36, '12 AM': 31}

points_per_hour
{'02 PM': 2187, '10 PM': 1856, '06 PM': 2215, '07 AM': 494, '08 PM': 1819, '05 AM': 104, '04 PM': 2634, '07 PM': 1702, '03 PM': 2228, '03 AM': 679, '05 PM': 2521, '06 AM': 375, '02 AM': 340, '01 PM': 2438, '08 AM': 519, '09 PM': 866, '04 AM': 386, '11 AM': 1480, '12 PM': 2543, '11 PM': 1526, '09 AM': 553, '01 AM': 700, '10 AM': 681, '12 AM': 1173}

avg_by_hour (sorted in descending order)
[[42.39, '11 PM'], [41.69, '12 PM'], [40.35, '10 PM'], [37.84, '12 AM'], [36.31, '06 PM'], [33.64, '11 AM'], [30.95, '07 PM'], [30.32, '08 PM'], [28.56, '03 PM'], [28.32, '04 PM'], [27.11, '05 PM'], [25.43, '02 PM'], [25.15, '03 AM'], [25.0, '01 AM'], [24.63, '01 PM'], [23.44, '06 AM'], [19.0, '07 AM'], [18.92, '10 AM'], [18.43, '09 PM'], [18.43, '09 AM'], [15.26, '08 AM'], [14.85, '04 AM'], [11.33, '02 AM'], [5.47, '05 AM']]
4.3. print_hourly_avg( )
Using print_hourly_avg( ), we will display the top 5 hours to post on Show Posts in order to get the most points, in both Eastern Time and Pacific Time.
print('Top 5 Hours to post on Show Posts to get the most points in Eastern Time (ET)')
print_hourly_avg(avg_hr_points) #calling print_hourly_avg for ET
Top 5 Hours to post on Show Posts to get the most points in Eastern Time (ET)
==============================================================================
11 PM: 42.39 per post
12 PM: 41.69 per post
10 PM: 40.35 per post
12 AM: 37.84 per post
06 PM: 36.31 per post
4.4. convert_pt( )
Using convert_pt( ), we will convert the time from ET to PT.
convert_pt(avg_hr_points)
print('Top 5 Hours to post on Show Posts to get the most points in Pacific Time (PT)')
print_hourly_avg(avg_hr_points) #calling print_hourly_avg for PT (see 4.3)
Top 5 Hours to post on Show Posts to get the most points in Pacific Time (PT)
==============================================================================
08 PM: 42.39 per post
09 AM: 41.69 per post
07 PM: 40.35 per post
09 PM: 37.84 per post
03 PM: 36.31 per post
4.5. check_author( )
Using check_author( ), we will display the top 5 authors with the most points.
print('Top 5 Authors with the most points')
print('====================================')
author_most_points = check_author(show_posts, 'points')
print(author_most_points[:5])
Top 5 Authors with the most points
====================================
[[825, 'petermolyneux'], [747, 'dhotson'], [681, 'damjanstankovic'], [572, 'orf'], [553, 'Capira']]
4.6. print_title( )
Using print_title( ), we will display the titles of the posts created by the authors above.
print("petermolyneux's posts title")
print('==========================')
print_title(show_posts, 'petermolyneux')
print('\n')
print("dhotson's posts title")
print('======================')
print_title(show_posts, 'dhotson')
print('\n')
print("damjanstankovic's posts title")
print('=============================')
print_title(show_posts, 'damjanstankovic')
print('\n')
print("orf's posts title")
print('=================')
print_title(show_posts, 'orf')
print('\n')
print("Capira's posts title")
print('======================')
print_title(show_posts, 'Capira')
petermolyneux's posts title
==========================
Show HN: New calendar app idea

dhotson's posts title
======================
Show HN: Something pointless I made

damjanstankovic's posts title
=============================
Show HN: I spent a year making an electro-mechanical prototype of a liquid clock

orf's posts title
=================
Show HN: Hacker News Simulator

Capira's posts title
======================
Show HN: What every browser knows about you
The results suggest that none of these authors' posts belong to a single special category, such as hiring (which was the case in the num_comments exploration). Therefore, I'm not going to remove any posts from the analysis, unlike what we did with whoishiring in the num_comments exploration.
Thus, the answer to our last question, 'What is the optimal time to create posts that gather the most points?',
is 08 PM (PT), or 11 PM (ET), with 42.39 points per post (see 4.3 and 4.4)
#Conclusion data
print('Top 5 Hours to post on Ask Posts to get the most comments Pacific Time (PT)')
print_hourly_avg(avg_hr_comments_wo_whoishiring) #(see 3.9)
print('\n')
print('Top 5 Hours to post on Show Posts to get the most points in Pacific Time (PT)')
print_hourly_avg(avg_hr_points) #(already converted to PT, see 4.4)
Top 5 Hours to post on Ask Posts to get the most comments Pacific Time (PT)
==============================================================================
11 PM: 23.81 per post
05 PM: 21.52 per post
01 PM: 16.80 per post
06 PM: 16.01 per post
10 AM: 14.74 per post

Top 5 Hours to post on Show Posts to get the most points in Pacific Time (PT)
==============================================================================
08 PM: 42.39 per post
09 AM: 41.69 per post
07 PM: 40.35 per post
09 PM: 37.84 per post
03 PM: 36.31 per post
After exploring the Hacker News dataset, we found the following:
(1) Posts whose titles begin with Ask HN receive more comments