
Project 2: Exploring Hacker News Posts

1. Introduction:

In this project, we will work with a data set of submissions to the popular technology site Hacker News. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

We are specifically interested in posts whose titles begin with either Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question.
Users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

Our goal for this project is to compare these two types of posts and determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do posts created at a certain time receive more comments on average?

2. Opening and Exploring the Data:

2.1. File: "hacker_news.csv"

The data set contains approximately 20,000 rows. It was reduced to this size by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted

Opening the file with the default encoding raises a UnicodeDecodeError, so we pass encoding='utf8' to the open() function: open('hacker_news.csv', encoding='utf8').

In the cell below, we:

Transform the read-in file to a list of lists using list() and save it to a variable named hn_dataset
Save the header to a variable named hn_header
Save the values to a variable named hn

In [26]:

# import reader
from csv import reader

In [27]:

# hacker_news data set
opened_file = open('hacker_news.csv', encoding='utf8')
read_file = reader(opened_file)

hn_dataset = list(read_file)

hn_header = hn_dataset[0]
hn = hn_dataset[1:]

print(hn_header)
print('\n')
print(hn_dataset[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]
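The encoding fix can be tried out on a small scale. The sketch below is only an illustration: the file name sample.csv and its two rows are made up, not taken from hacker_news.csv. It writes a tiny UTF-8 file containing a non-ASCII character and reads it back with an explicit encoding, using a with block so the file is closed automatically.

```python
import csv
import os
import tempfile

# Hypothetical sample data; the title contains a non-ASCII character,
# so the encoding used to read the file matters.
sample_rows = [
    ["id", "title"],
    ["1", "Ask HN: Café recommendations?"],
]

# Write the sample to a temporary file as UTF-8.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w", encoding="utf8", newline="") as f:
    csv.writer(f).writerows(sample_rows)

# Read it back with the same explicit encoding.
with open(sample_path, encoding="utf8", newline="") as f:
    rows = list(csv.reader(f))

# Split the header from the data rows, as in the cell above.
header = rows[0]
data = rows[1:]
print(header)
print(data)
```

The same with pattern could be applied to hacker_news.csv so that the file handle does not stay open after reading.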

3. Extracting Ask HN and Show HN Posts

Since we are only concerned with post titles beginning with Ask HN or Show HN, we will create new lists of lists containing just the data for those titles.

To find the posts that begin with either Ask HN or Show HN, we will use the string method str.startswith(). Because capitalization varies between titles, we first convert each title to lowercase with str.lower() and compare it against a lowercase prefix.

In [28]:

# separate posts beginning with Ask HN and Show HN
ask_posts = []
show_posts = []
other_posts = []

# iterate over hn (not hn_dataset) so the header row is not counted as a post
for row in hn:
    title = row[1].lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('ask_posts: ', len(ask_posts))
print('show_posts: ', len(show_posts))
print('other_posts: ', len(other_posts))

ask_posts:  1744
show_posts:  1162
other_posts:  17194
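The case-insensitive prefix check can also be isolated into a small helper, which makes its behavior easy to inspect. This is a sketch only: classify_title is a hypothetical name, and all sample titles except "Interactive Dynamic Video" are made up.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title's prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Mostly made-up titles used only to exercise the helper.
sample_titles = [
    "Ask HN: How do you learn Python?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
    "Why I ask HN for advice",
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

The last title contains "ask HN" but does not begin with it, so it falls into the "other" group, which is the behavior we want from startswith().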

4. Calculating the Average Number of Comments for Ask HN and Show HN Posts

In the previous section, we separated the ask posts and the show posts into two lists of lists named ask_posts and show_posts.

Next, we will determine if ask posts or show posts receive more comments on average.

In [29]:

# average Ask HN
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

# compute the average once, after the loop has summed all comments
avg_ask_comments = total_ask_comments / len(ask_posts)

print(avg_ask_comments)

14.038417431192661

In [30]:

# average Show HN
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

# compute the average once, after the loop has summed all comments
avg_show_comments = total_show_comments / len(show_posts)

print(avg_show_comments)

10.31669535283993

On average, ask posts receive approximately 14 comments, whereas show posts receive approximately 10. Ask posts therefore attract more comments on average.
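Since the same averaging logic is applied to both lists, it could be factored into one function. The following is a minimal sketch, not the notebook's own code: average_comments is a hypothetical helper, and the three sample rows are made up, following the column layout of the data set with num_comments at index 4.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of posts, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Made-up rows with the same seven columns as the data set.
sample_posts = [
    ["1", "Ask HN: A", "", "5", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: B", "", "7", "29", "user_b", "11/22/2015 13:43"],
    ["3", "Ask HN: C", "", "2", "1", "user_c", "5/2/2016 10:14"],
]

sample_average = average_comments(sample_posts)
print(sample_average)
```

With the real lists, the calls would be average_comments(ask_posts) and average_comments(show_posts). The empty-list guard avoids a ZeroDivisionError if a filter ever matches no posts.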

5. Finding the Number of Ask Posts and Comments by Hour Created

Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts.

Next, we will determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received
Calculate the average number of comments ask posts receive by hour created

We will tackle the first step first: calculating the number of ask posts and comments by hour created. We will use the datetime module to work with the data in the created_at column.

We use the datetime.strptime() constructor to parse dates stored as strings and return datetime objects.
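As a quick illustration of the parsing step, the sketch below applies strptime() to one created_at value from the data set, using a format string that matches the month/day/year hour:minute layout of that column.

```python
import datetime as dt

# Format matching values such as '8/16/2016 9:55'.
date_format = "%m/%d/%Y %H:%M"

parsed = dt.datetime.strptime("8/16/2016 9:55", date_format)

print(parsed)                 # 2016-08-16 09:55:00
print(parsed.hour)            # 9 (an int)
print(parsed.strftime("%H"))  # 09 (a zero-padded string)
```

The hour attribute gives an integer, while strftime("%H") gives a zero-padded string. Either can serve as a dictionary key, as long as one is used consistently.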

In [31]:

# import datetime module
import datetime as dt

In [32]:

# appending columns: created_at and num_comments
result_list = []

for row in ask_posts:
    result_list.append([row[6], int(row[4])])    

In [38]:

result_list[:5]

Out[38]:

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [33]:

# number of ask posts and comments by hour
counts_by_hour = {}
comments_by_hour = {}

date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date_string = row[0]
    # parse the date and create a datetime object
    time = dt.datetime.strptime(date_string, date_format)
    # select just the hour from the datetime object
    hour = time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]    

In [52]:

print('counts_by_hour: ', counts_by_hour)
print('\n')
print('comments_by_hour: ', comments_by_hour)

counts_by_hour:  {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


comments_by_hour:  {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
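The if/else bookkeeping for first-time keys can be avoided with collections.defaultdict, which creates a missing key with a default of 0. The sketch below is an alternative formulation, not the approach used in the cell above, and it runs on a three-item sample in which the last entry is made up.

```python
import datetime as dt
from collections import defaultdict

# Illustrative [created_at, num_comments] pairs; the last one is made up.
sample = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

# Missing hours start at 0, so no membership check is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```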

6. Calculating the Average Number of Comments for Ask HN Posts by Hour

We created two dictionaries:

counts_by_hour: Contains the number of ask posts created during each hour of the day
comments_by_hour: Contains the corresponding number of comments ask posts created at each hour received

Next, we will use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [53]:

# average number of comments
avg_by_hour = []

for hour in comments_by_hour:
    avg_num = comments_by_hour[hour] / counts_by_hour[hour]
    avg_by_hour.append([hour, avg_num])

avg_by_hour

Out[53]:

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
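The same division can be written as a dictionary comprehension, which keeps each hour paired with its average. The sketch below uses the totals for three hours copied from the printed dictionaries, so it can serve as a spot check on the results.

```python
# Totals for three hours, copied from the printed dictionaries.
comments_by_hour = {"15": 4477, "20": 1722, "02": 1381}
counts_by_hour = {"15": 116, "20": 80, "02": 58}

# Average comments per post for each hour.
avg_comments = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}

for hour, avg in avg_comments.items():
    print(hour, round(avg, 2))
```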

7. Sorting and Printing Values from a List of Lists

In the last cell, we calculated the average number of comments for posts created during each hour of the day, and stored the results in a list of lists named avg_by_hour.

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. We will finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [54]:

# sorting and printing values
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])  
print(swap_avg_by_hour)    

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

In [56]:

# descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments:")

for row in sorted_swap[:5]:
    time_string = row[1]
    # parse the hour and create a datetime object
    time_top = dt.datetime.strptime(time_string, "%H")
    # format the datetime object as HH:MM
    hour_top = time_top.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour_top, row[0])) 

Top 5 Hours for Ask Posts Comments:
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
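Swapping the columns is one way to sort by the average. An alternative is to pass a key function to sorted(), which leaves each row in its [hour, average] order. The sketch below shows that variant on five rows copied from avg_by_hour. It is not the method used in the cells above.

```python
# Five [hour, average] rows copied from avg_by_hour.
avg_by_hour = [
    ["09", 5.5777777777777775],
    ["16", 16.796296296296298],
    ["15", 38.5948275862069],
    ["20", 21.525],
    ["02", 23.810344827586206],
]

# Sort by the average (index 1), highest first, without swapping columns.
sorted_by_avg = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)

top_hours = [hour for hour, avg in sorted_by_avg[:3]]
for hour, avg in sorted_by_avg[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

Because the hour is already a zero-padded string, appending ":00" gives the same display as the strptime/strftime round trip.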

8. Conclusion:

Based on our analysis, we recommend creating an Ask HN post between 15:00 and 16:00 (3:00 pm to 4:00 pm) for the best chance of receiving comments. Posts created between 20:00 and 22:00 receive between 16.01 and 21.52 comments per post on average, which makes that window another good option.

The hour with the most comments per post on average is 15:00, with 38.59 comments per post. The time zone used is Eastern Time in the US, so 15:00 can also be written as 3:00 pm EST.