Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.
You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
# Open the file and read it as a list of lists.
from csv import reader

# The with statement closes the file once it has been read.
with open("hacker_news.csv") as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

print(hn[:7])
[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48']]
headers = hn[0]  # Extract the first row and assign it to the headers variable
hn = hn[1:]  # Remove the header row by replacing hn with hn minus its first row
print(headers)
print('\n')  # Print an empty line
print(hn[:5])
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
To find the posts that begin with Ask HN or Show HN, we will use the string method startswith. Because capitalization can vary between titles, we first call the lower method, which returns a lowercase version of the string it is called on.
If the lowercase version of title starts with ask hn, append the row to ask_posts. Else, if the lowercase version of title starts with show hn, append the row to show_posts. Otherwise, append the row to other_posts.
Check the number of posts in ask_posts, show_posts, and other_posts.
Print the first three rows of each list.
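Before applying this to the full data set, here is a minimal sketch of how lower and startswith combine. The titles and the classify helper are made up for illustration and are not part of the data set or the original analysis.

```python
# Hypothetical titles, invented for this example only.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Why we moved away from microservices",
]


def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # normalize case before comparing prefixes
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


labels = [classify(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Lowercasing first means that a title written as "SHOW HN" still lands in the show group.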
ask_posts = []  # Create 3 empty lists to divide posts into 3 groups
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))  # Check the number of posts in each list
print(len(show_posts))
print(len(other_posts))
print('\n')  # Leave a blank line
print(ask_posts[:3])
print('\n')
print(show_posts[:3])
print('\n')
print(other_posts[:3])
1744 1162 17194 [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']] [['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']] [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
Next, we determine whether ask posts or show posts receive more comments on average. To do this, we calculate the total number of comments on all ask posts and divide it by the number of ask posts, then do the same for the show posts.
To calculate the total, we create a variable total_ask_comments that starts at 0. As we loop over the ask_posts list, each post's comment count is added to total_ask_comments, so that after the loop it holds the total number of comments.
To calculate the average number of comments, we divide total_ask_comments by the length of the ask_posts list.
total_ask_comments = 0
for row in ask_posts:
    # The column at index 4 holds the number of comments
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments in ask posts = ", avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments in show posts = ", avg_show_comments)
Average number of comments in ask posts = 14.038417431192661
Average number of comments in show posts = 10.31669535283993
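The running-total loop can also be written with the built-in sum(). The sketch below is one possible variant, not the code used above: the helper name average_comments, its comments_index parameter, and the sample rows are assumptions made for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings at comments_index."""
    if not posts:
        # Assumption: report 0.0 for an empty list instead of dividing by zero.
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows with the same column layout as the data set.
sample_posts = [
    ['1', 'Ask HN: A?', '', '2', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: B?', '', '28', '29', 'user_b', '11/22/2015 13:43'],
    ['3', 'Ask HN: C?', '', '1', '1', 'user_c', '5/2/2016 10:14'],
]

sample_average = average_comments(sample_posts)
print(sample_average)  # (6 + 29 + 1) / 3 = 12.0
```

Wrapping the calculation in a function avoids repeating the loop once for ask_posts and once for show_posts.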
We'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:
Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.
This is the plan:
Import the datetime module as dt.
Create an empty list and assign it to result_list. This will be a list of lists.
Iterate over ask_posts and append to result_list a list with two elements:
- The first element is the value of the created_at column. Because created_at is the seventh column in ask_posts, it sits at index 6 in each row.
- The second element is the number of comments of the post (index 4), converted to an integer.
Create two empty dictionaries, counts_by_hour and comments_by_hour.
Loop through each row of result_list and extract the hour from the date string in the first element:
- Use the datetime.strptime() method to parse the date string into a datetime object.
- Use the datetime.strftime() method to select just the hour from the datetime object.
If the hour isn't a key in counts_by_hour:
- Create the key in counts_by_hour and set it equal to 1.
- Create the key in comments_by_hour and set it equal to the comment number.
If the hour is already a key in counts_by_hour:
- Increment the value in counts_by_hour by 1.
- Increment the value in comments_by_hour by the comment number.
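As a small illustration of the parsing step, the sketch below runs strptime and strftime on a single created_at value taken from the first ask post. The variable names are chosen for this example only.

```python
import datetime as dt

# One created_at value in the data set's month/day/year hour:minute layout.
sample_created_at = "8/16/2016 9:55"

# strptime parses the string into a datetime object using the given format.
parsed = dt.datetime.strptime(sample_created_at, "%m/%d/%Y %H:%M")

# strftime turns the datetime back into a string; "%H" keeps only the
# zero-padded hour.
hour = parsed.strftime("%H")

print(parsed)  # 2016-08-16 09:55:00
print(hour)    # 09
```

Note that the hour comes back as a two-character string, which is why the dictionary keys later look like '09' rather than 9.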
# Create a list of lists that holds, for each ask post,
# the creation time and the number of comments
import datetime as dt

result_list = []
for row in ask_posts:
    column1 = row[6]
    column2 = int(row[4])
    result_list.append([column1, column2])

print(result_list[0:3])
print(len(result_list))
[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]
1744
# Create 2 empty dictionaries
counts_by_hour = {}  # Number of ask posts created during each hour of the day
comments_by_hour = {}  # Number of comments received by ask posts created at each hour

for row in result_list:
    date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]

print("Counts per hour:", counts_by_hour)
print(len(counts_by_hour))
print('\n')
print("Comments by hour:", comments_by_hour)
print(len(comments_by_hour))
Counts per hour: {'04': 47, '09': 45, '12': 73, '02': 58, '01': 60, '06': 44, '08': 48, '14': 107, '22': 71, '17': 100, '07': 34, '18': 109, '13': 85, '19': 110, '20': 80, '16': 108, '15': 116, '05': 46, '03': 54, '23': 68, '21': 109, '11': 58, '10': 59, '00': 55}
24
Comments by hour: {'04': 337, '09': 251, '12': 687, '02': 1381, '01': 683, '06': 397, '08': 492, '14': 1416, '22': 479, '17': 1146, '07': 267, '18': 1439, '13': 1253, '19': 1188, '20': 1722, '16': 1814, '15': 4477, '05': 464, '03': 421, '23': 543, '21': 1745, '11': 641, '10': 793, '00': 447}
24
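The if/else bookkeeping for missing keys can be avoided with collections.defaultdict from the standard library. This is an alternative sketch, not the approach taken above, and the sample rows and variable names are made up for illustration.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] rows in the same shape as result_list.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 9:14', 1],
]

# defaultdict(int) starts every missing key at 0, so no key check is needed.
sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += num_comments

print(dict(sample_counts))    # {'09': 2, '13': 1}
print(dict(sample_comments))  # {'09': 7, '13': 29}
```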
Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.
To illustrate the technique, let's work with the following dictionary:
sample_dict = {
    'apple': 2,
    'banana': 4,
    'orange': 6
}
Suppose we wanted to multiply each of the values by ten and return the results as a list of lists. We can use the following code:
fruits = []
for fruit in sample_dict:
    fruits.append([fruit, 10*sample_dict[fruit]])
Below are the results:
[['apple', 20], ['banana', 40], ['orange', 60]]
In the example above, we:
- Initialized an empty list (of lists) and assigned it to fruits.
- Iterated over the keys of sample_dict and appended to fruits a list whose:
  - First element is the key from sample_dict.
  - Second element is the value corresponding to that key multiplied by ten.
Let's use this format to create a list of lists containing the hours during which posts were created and the average number of comments those posts received.
Use the example above to calculate the average number of comments per post for posts created during each hour of the day.
The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post. Assign the result to a variable named avg_by_hour. Display the results.
-The average number of comments per post for each hour = total number of comments in that hour / total number of posts in that hour
-To generate the list, for each hour we divide its value in comments_by_hour by its value in counts_by_hour
avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

print("Average number of comments per post by hour:", avg_by_hour)
Average number of comments per post by hour: [['04', 7.170212765957447], ['09', 5.5777777777777775], ['12', 9.41095890410959], ['02', 23.810344827586206], ['01', 11.383333333333333], ['06', 9.022727272727273], ['08', 10.25], ['14', 13.233644859813085], ['22', 6.746478873239437], ['17', 11.46], ['07', 7.852941176470588], ['18', 13.20183486238532], ['13', 14.741176470588234], ['19', 10.8], ['20', 21.525], ['16', 16.796296296296298], ['15', 38.5948275862069], ['05', 10.08695652173913], ['03', 7.796296296296297], ['23', 7.985294117647059], ['21', 16.009174311926607], ['11', 11.051724137931034], ['10', 13.440677966101696], ['00', 8.127272727272727]]
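The same per-hour division can be expressed as a list comprehension. The sketch below uses two small made-up dictionaries (the sample_ names are assumptions) so the arithmetic is easy to check by hand.

```python
# Hypothetical per-hour totals, in the same shape as the two dictionaries above.
sample_counts_by_hour = {'09': 2, '13': 1}
sample_comments_by_hour = {'09': 7, '13': 29}

# For each hour: total comments in that hour / number of posts in that hour.
sample_avg_by_hour = [
    [hour, sample_comments_by_hour[hour] / sample_counts_by_hour[hour]]
    for hour in sample_comments_by_hour
]

print(sample_avg_by_hour)  # [['09', 3.5], ['13', 29.0]]
```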
Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
1- Create a list that equals avg_by_hour with swapped columns.
a- Create an empty list and assign it to swap_avg_by_hour.
b- Iterate over the rows of avg_by_hour and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row.
2- Print swap_avg_by_hour.
3- Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.
a- Set the reverse argument to True, so that the highest value in the first column appears first in the list.
b- Assign the result to sorted_swap.
4- Print the string "Top 5 Hours for Ask Posts Comments".
5- Loop through each average and each hour (in this order) in the first five lists of sorted_swap.
6- Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.
a- To format the hours, use the datetime.strptime() constructor to return a datetime object and then use the strftime() method to specify the format of the time.
b- To format the average, you can use {:.2f} to indicate that just two decimal places should be used.
7- Which hours should you create a post during to have a higher chance of receiving comments? Refer back to the documentation for the data set to convert the times to the time zone you live in. Write a markdown cell explaining your findings.
swap_avg_by_hour = []
for row in avg_by_hour:
    first_e = row[0]
    second_e = row[1]
    swap_avg_by_hour.append([second_e, first_e])

print(swap_avg_by_hour)
[[7.170212765957447, '04'], [5.5777777777777775, '09'], [9.41095890410959, '12'], [23.810344827586206, '02'], [11.383333333333333, '01'], [9.022727272727273, '06'], [10.25, '08'], [13.233644859813085, '14'], [6.746478873239437, '22'], [11.46, '17'], [7.852941176470588, '07'], [13.20183486238532, '18'], [14.741176470588234, '13'], [10.8, '19'], [21.525, '20'], [16.796296296296298, '16'], [38.5948275862069, '15'], [10.08695652173913, '05'], [7.796296296296297, '03'], [7.985294117647059, '23'], [16.009174311926607, '21'], [11.051724137931034, '11'], [13.440677966101696, '10'], [8.127272727272727, '00']]
from operator import itemgetter
sorted_swap = sorted(swap_avg_by_hour, key=itemgetter(0), reverse=True)
print(sorted_swap)
[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]
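Swapping the columns is one way to sort by the average. As an alternative sketch, sorted() can be pointed at the second column directly through its key argument, which leaves the [hour, average] order intact. The sample list below is made up for illustration.

```python
# Hypothetical [hour, average] rows in the same shape as avg_by_hour.
sample_avg_by_hour = [['09', 3.5], ['15', 38.59], ['02', 23.81]]

# key picks the value to sort on (the average at index 1);
# reverse=True puts the highest average first.
sample_sorted = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

print(sample_sorted)  # [['15', 38.59], ['02', 23.81], ['09', 3.5]]
```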
print("Top 5 Hours for Ask Posts Comments:")
hour_format = "%H"
for row in sorted_swap[:5]:
    hour = row[1]
    # The original time zone was Eastern Standard.
    # Here I am converting to Central Standard by subtracting an hour.
    convert_to_cst = dt.datetime.strptime(hour, hour_format)
    cst = convert_to_cst - dt.timedelta(hours=1)
    cst = cst.strftime("%H:%M")
    avg_comments = round(row[0], 2)
    print("{0} {1} average comments per post".format(cst, avg_comments))
Top 5 Hours for Ask Posts Comments:
14:00 38.59 average comments per post
01:00 23.81 average comments per post
19:00 21.52 average comments per post
15:00 16.8 average comments per post
20:00 16.01 average comments per post
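The output above uses round(), which drops trailing zeros (16.8 rather than 16.80). A minimal sketch of the {:.2f} format specifier follows. The helper name format_row and its shift_hours parameter are assumptions for this example, with shift_hours standing in for the time zone adjustment.

```python
import datetime as dt


def format_row(avg, hour, shift_hours=0):
    """Format one [average, hour] pair, optionally shifting the hour."""
    parsed = dt.datetime.strptime(hour, "%H")
    shifted = parsed + dt.timedelta(hours=shift_hours)
    # {:.2f} always prints exactly two decimal places.
    return "{}: {:.2f} average comments per post".format(
        shifted.strftime("%H:%M"), avg
    )


line_unshifted = format_row(38.5948275862069, '15')
line_shifted = format_row(16.796296296296298, '16', shift_hours=-1)

print(line_unshifted)  # 15:00: 38.59 average comments per post
print(line_shifted)    # 15:00: 16.80 average comments per post
```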
Explaining the previous steps:
Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.
Set the reverse argument to True, so that the highest value in the first column appears first in the list. Assign the result to sorted_swap.
Print the string "Top 5 Hours for Ask Posts Comments".
Loop through each average and each hour (in this order) in the first five lists of sorted_swap. Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.
To format the hours, use the datetime.strptime() constructor to return a datetime object and then use the strftime() method to specify the format of the time.
To format the average, you can use {:.2f} to indicate that just two decimal places should be used.