
Analysing posts from Hacker News¶

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

In [1]:

# Open the file and read it as a list of lists
from csv import reader

# The with block closes the file once it has been read
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
print(hn[:7])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12'], ['10482257', 'Title II kills investment? Comcast and other ISPs are now spending more', 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/', '53', '22', 'Deinos', '10/31/2015 9:48']]

In [2]:

headers=hn[0] #Extracting the first row of data and assigning it to the headers variable
hn = hn[1:] #Removing the first row from hn by replacing the dataset hn with hn without the headers 
print(headers)
print('\n') #printing an empty line
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
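As an aside, the header handling above could also be sketched with csv.DictReader, which reads the header row itself and yields one dictionary per data row. This is only an illustrative alternative, not the approach used in the rest of this notebook; the two-row sample_csv below is a stand-in built from rows shown in the output above, so the snippet runs without the data file.

```python
import csv
import io

# Stand-in for hacker_news.csv: the header plus two rows copied from the output above.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

# DictReader consumes the header row and uses it as the keys of each row.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    # Values are still strings, so numeric columns need an explicit int().
    print(row["title"], int(row["num_comments"]))
```

With this layout, columns are accessed by name (row["num_comments"]) instead of by index (row[4]), at the cost of departing from the list-of-lists format the following cells rely on.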

Sorting the posts into 3 groups:¶

To find the posts that begin with Ask HN or Show HN, we will use the string method startswith. We can also use the lower method, which returns a lowercase version of the string, so that the comparison ignores capitalization.

If the lowercase version of title starts with ask hn, append the row to ask_posts. Else if the lowercase version of title starts with show hn, append the row to show_posts. Else append to other_posts.

Check the number of posts in ask_posts, show_posts, and other_posts.

Print the first 3 rows of each of the lists.
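Before running this on the full data set, the prefix test can be tried on a few titles. The sketch below is a minimal illustration: the classify helper and the second title are hypothetical and only serve to show that lower() makes the match case-insensitive.

```python
# Sample titles; the second one is made up to show the effect of lower().
titles = [
    "Ask HN: How to improve my personal website?",
    "ask hn: is a lowercase prefix matched too?",
    "Show HN: Something pointless I made",
    "Interactive Dynamic Video",
]

def classify(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

labels = [classify(title) for title in titles]
print(labels)
```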

In [3]:

ask_posts=[] #creating 3 empty lists to divide posts into 3 groups
show_posts=[]
other_posts=[]

for row in hn: 
    title=row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts)) # checking the number of posts in each list
print(len(show_posts))
print(len(other_posts))
print('\n') # leaves a blank line
print(ask_posts[:3])
print('\n')
print(show_posts[:3])
print('\n')
print(other_posts[:3])

       
    

1744
1162
17194


[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]

Determining whether ask posts or show posts receive more comments on average¶

To do this, we have to calculate the total number of comments on all ask posts and divide it by the number of ask posts. We then do the same with the show posts.

To calculate the total number of comments, we create a variable total_ask_comments that starts at 0. As we loop over the ask_posts list, the number of comments on each post is added to total_ask_comments, so that it ends up holding the total number of comments.

To calculate the average number of comments, we divide total_ask_comments by the length of the ask_posts list.

In [4]:

total_ask_comments=0

for row in ask_posts:
    # The column at index 4 holds the number of comments
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments =total_ask_comments / (len(ask_posts))
print("Average number of comments in ask posts = ",avg_ask_comments)
    
total_show_comments=0

for row in show_posts:
    total_show_comments+= int(row[4])
avg_show_comments=total_show_comments/(len(show_posts))

print("Average number of comments in show posts = ", avg_show_comments)

Average number of comments in ask posts =  14.038417431192661
Average number of comments in show posts =  10.31669535283993
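The two loops above follow the same pattern, so they could be folded into one small helper. The sketch below is one possible way to do that; average_comments and sample_posts are hypothetical names, and the sample rows are made up in the same column layout as the data set.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the integer values stored at comments_index in each row."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Made-up rows in the notebook's column layout (index 4 is num_comments).
sample_posts = [
    ["1", "Ask HN: A?", "", "2", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: B?", "", "28", "29", "user_b", "11/22/2015 13:43"],
    ["3", "Ask HN: C?", "", "1", "1", "user_c", "5/2/2016 10:14"],
]

# (6 + 29 + 1) / 3
print(average_comments(sample_posts))
```

In the notebook, the same helper could be called as average_comments(ask_posts) and average_comments(show_posts).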

Calculating the number of ask posts and comments by hour created¶

We'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.

This is the plan:

Import the datetime module as dt.
Create an empty list and assign it to result_list. This will be a list of lists.
Iterate over ask_posts and append to result_list a list with two elements:

The first element shall be the value of the created_at column. Because created_at is the seventh column in ask_posts, you'll need to get the element at index 6 in each row.
The second element shall be the number of comments of the post. You'll also need to convert the value to an integer.

Create two empty dictionaries called counts_by_hour and comments_by_hour.
Loop through each row of result_list.
Extract the hour from the date, which is the first element of the row.
Use the datetime.strptime() method to parse the date and create a datetime object.
Use the string we want to parse as the first argument and a string that specifies the format as the second argument.

Use the datetime.strftime() method to select just the hour from the datetime object.
If the hour isn't a key in counts_by_hour:
Create the key in counts_by_hour and set it equal to 1.
Create the key in comments_by_hour and set it equal to the comment number.
If the hour is already a key in counts_by_hour:
Increment the value in counts_by_hour by 1.
Increment the value in comments_by_hour by the comment number.
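The parsing step in the plan above can be tried on a single value first. The sketch below takes one created_at string from the data set, parses it with strptime, and pulls out the hour with strftime; the variable names are only illustrative.

```python
import datetime as dt

# One created_at value from the data set: month/day/year hour:minute.
date_string = "8/16/2016 9:55"

# strptime turns the string into a datetime object using the given format.
parsed = dt.datetime.strptime(date_string, "%m/%d/%Y %H:%M")

# strftime("%H") returns the hour as a zero-padded two-character string.
hour = parsed.strftime("%H")

print(parsed)
print(hour)
```

Note that the hour comes back as a string ("09"), which is why the dictionaries built in the next cells have string keys.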

In [5]:

# Building a list of lists that holds, for each ask post,
# the date and time it was created at and
# the number of comments it received

import datetime as dt
result_list = []  
for row in ask_posts:
    column1=row[6]
    column2=int(row[4])
    result_list.append([column1, column2])
print(result_list[0:3])   
len(result_list)

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

Out[5]:

In [6]:

# Creating 2 empty dictionaries
counts_by_hour = {}  # number of ask posts created during each hour of the day
comments_by_hour = {}  # total number of comments received by ask posts created during each hour


for row in result_list:
    date = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]

print("Counts per hour:", counts_by_hour)
print(len(counts_by_hour))
print('\n')
print("Comments by hour:", comments_by_hour)
print(len(comments_by_hour))

Counts per hour: {'04': 47, '09': 45, '12': 73, '02': 58, '01': 60, '06': 44, '08': 48, '14': 107, '22': 71, '17': 100, '07': 34, '18': 109, '13': 85, '19': 110, '20': 80, '16': 108, '15': 116, '05': 46, '03': 54, '23': 68, '21': 109, '11': 58, '10': 59, '00': 55}
24


Comments by hour: {'04': 337, '09': 251, '12': 687, '02': 1381, '01': 683, '06': 397, '08': 492, '14': 1416, '22': 479, '17': 1146, '07': 267, '18': 1439, '13': 1253, '19': 1188, '20': 1722, '16': 1814, '15': 4477, '05': 464, '03': 421, '23': 543, '21': 1745, '11': 641, '10': 793, '00': 447}
24
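The if/else bookkeeping in the cell above could also be written with collections.defaultdict, which creates a missing key with a starting value of 0 on first access. This is an alternative sketch rather than the method used in this notebook, and sample_results is a small made-up list in the same shape as result_list.

```python
from collections import defaultdict
import datetime as dt

# Made-up rows in the same [created_at, num_comments] shape as result_list.
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

counts = defaultdict(int)    # posts per hour, missing keys start at 0
comments = defaultdict(int)  # comments per hour, missing keys start at 0

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```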

Calculating the average number of comments per post for posts created during each hour of the day¶

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

To illustrate the technique, let's work with the following dictionary:

sample_dict = {'apple': 2, 'banana': 4, 'orange': 6}

Suppose we wanted to multiply each of the values by ten and return the results as a list of lists. We can use the following code:

fruits = []
for fruit in sample_dict:
    fruits.append([fruit, 10*sample_dict[fruit]])

Below are the results:

[['apple', 20], ['banana', 40], ['orange', 60]]

In the example above, we:

Initialized an empty list (of lists) and assigned it to fruits.
Iterated over the keys of sample_dict and appended to fruits a list whose:
First element is the key from sample_dict.
Second element is the value corresponding to that key multiplied by ten.

Let's use this format to create a list of lists containing the hours during which posts were created and the average number of comments those posts received.

Use the example above to calculate the average number of comments per post for posts created during each hour of the day.

The result should be a list of lists in which the first element is the hour and the second element is the average number of comments per post. Assign the result to a variable named avg_by_hour. Display the results.

-The average number of comments per post for each hour = total number of comments in that hour / total number of posts for that hour

-To generate the list, we divide each hour's value in comments_by_hour by the same hour's value in counts_by_hour
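The per-hour division described above can be checked on two tiny dictionaries before it is applied to the real ones. The names sample_comments, sample_counts, and sample_avg below are made up for this illustration.

```python
# Made-up totals for two hours, keyed the same way as the real dictionaries.
sample_comments = {"09": 7, "13": 29}
sample_counts = {"09": 2, "13": 1}

# For each hour, divide the comment total by the post count for that same hour.
sample_avg = [
    [hour, sample_comments[hour] / sample_counts[hour]]
    for hour in sample_comments
]

print(sample_avg)
```

Both dictionaries must contain the same keys for this to work, which holds here because they are always updated together.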

In [7]:

avg_by_hour = []
for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr]/counts_by_hour[hr]])

print("Average number of comments per post by hour:", avg_by_hour)    
                                                            

Average number of comments per post by hour: [['04', 7.170212765957447], ['09', 5.5777777777777775], ['12', 9.41095890410959], ['02', 23.810344827586206], ['01', 11.383333333333333], ['06', 9.022727272727273], ['08', 10.25], ['14', 13.233644859813085], ['22', 6.746478873239437], ['17', 11.46], ['07', 7.852941176470588], ['18', 13.20183486238532], ['13', 14.741176470588234], ['19', 10.8], ['20', 21.525], ['16', 16.796296296296298], ['15', 38.5948275862069], ['05', 10.08695652173913], ['03', 7.796296296296297], ['23', 7.985294117647059], ['21', 16.009174311926607], ['11', 11.051724137931034], ['10', 13.440677966101696], ['00', 8.127272727272727]]

Sorting the list representing average number of comments per post by hour¶

Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

1- Create a list that equals avg_by_hour with swapped columns.
a- Create an empty list and assign it to swap_avg_by_hour.
b- Iterate over the rows of avg_by_hour and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row.

2- Print swap_avg_by_hour.

3- Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.
a- Set the reverse argument to True, so that the highest value in the first column appears first in the list.
b- Assign the result to sorted_swap.

4- Print the string "Top 5 Hours for Ask Posts Comments".

5- Loop through each average and each hour (in this order) in the first five lists of sorted_swap.

6- Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.
a- To format the hours, use the datetime.strptime() method to return a datetime object and then use the strftime() method to specify the format of the time.
b- To format the average, you can use {:.2f} to indicate that just two decimal places should be used.

7- Which hours should you create a post during to have a higher chance of receiving comments? Refer back to the documentation for the data set to convert the times to the time zone you live in. Write a markdown cell explaining your findings.
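As a side note to the steps above, sorted() also accepts a key function, so the list could be ordered by its second column without swapping the columns first. The sketch below shows that alternative on a small made-up list; it is not the approach the following cells take.

```python
# Made-up [hour, average] rows in the same shape as avg_by_hour.
sample_avg_by_hour = [["09", 3.5], ["13", 29.0], ["15", 12.25]]

# key picks the value to sort on (the average); reverse=True puts the largest first.
sorted_by_avg = sorted(sample_avg_by_hour, key=lambda row: row[1], reverse=True)

print(sorted_by_avg)
```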

In [8]:

swap_avg_by_hour=[]
for row in avg_by_hour:
    first_e=row[0]
    second_e=row[1]
    swap_avg_by_hour.append([second_e, first_e])

print(swap_avg_by_hour)

[[7.170212765957447, '04'], [5.5777777777777775, '09'], [9.41095890410959, '12'], [23.810344827586206, '02'], [11.383333333333333, '01'], [9.022727272727273, '06'], [10.25, '08'], [13.233644859813085, '14'], [6.746478873239437, '22'], [11.46, '17'], [7.852941176470588, '07'], [13.20183486238532, '18'], [14.741176470588234, '13'], [10.8, '19'], [21.525, '20'], [16.796296296296298, '16'], [38.5948275862069, '15'], [10.08695652173913, '05'], [7.796296296296297, '03'], [7.985294117647059, '23'], [16.009174311926607, '21'], [11.051724137931034, '11'], [13.440677966101696, '10'], [8.127272727272727, '00']]

In [9]:

from operator import itemgetter
sorted_swap = sorted(swap_avg_by_hour, key=itemgetter(0), reverse=True)
print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]

In [10]:

        
print("Top 5 Hours for Ask Posts Comments:")
hour_format = "%H"
for row in sorted_swap[:5]:
    hour = row[1]

    # The original time zone was Eastern Standard.
    # Here I am converting to Central Standard by subtracting an hour.
    convert_to_cst = dt.datetime.strptime(hour, hour_format)
    cst = convert_to_cst - dt.timedelta(hours=1)
    cst = cst.strftime("%H:%M")

    avg_comments = round(row[0], 2)
    print("{0} {1} average comments per post".format(cst, avg_comments))

Top 5 Hours for Ask Posts Comments:
14:00 38.59 average comments per post
01:00 23.81 average comments per post
19:00 21.52 average comments per post
15:00 16.8 average comments per post
20:00 16.01 average comments per post

Explaining the previous steps:

Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments.

Set the reverse argument to True, so that the highest value in the first column appears first in the list. Assign the result to sorted_swap.

Print the string "Top 5 Hours for Ask Posts Comments".

Loop through each average and each hour (in this order) in the first five lists of sorted_swap. Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.

To format the hours, use the datetime.strptime() constructor to return a datetime object and then use the strftime() method to specify the format of the time.

To format the average, you can use {:.2f} to indicate that just two decimal places should be used.
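The formatting described in these steps can be sketched as follows. The two rows in sample_sorted are copied from the sorted output earlier in the notebook; unlike the cell above, this sketch keeps the hours in the data set's original time zone and uses {:.2f}, so every average is shown with exactly two decimal places.

```python
import datetime as dt

# The first two rows of the sorted output shown earlier: [average, hour].
sample_sorted = [[38.5948275862069, "15"], [23.810344827586206, "02"]]

lines = []
for avg, hour in sample_sorted:
    # Parse the two-digit hour, then render it as HH:MM.
    time_label = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
    # {:.2f} always prints two digits after the decimal point.
    lines.append("{}: {:.2f} average comments per post".format(time_label, avg))

for line in lines:
    print(line)
```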
