Exploring Hacker News Posts

Introduction

In this project we'll work with a dataset of posts from the popular technology website Hacker News. We'll focus on those posts whose titles begin with Ask HN or Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as "What's the best online course you've ever done?" Similarly, users send Show HN posts to show the Hacker News community a project, a product, or something interesting in general.

Keep in mind that this dataset was reduced from almost 300,000 rows to approximately 20,000 by removing every post that received no comments and then randomly sampling the remaining posts.

Our goal is to determine the following:

Do Ask HN or Show HN posts receive more comments on average?
Do Ask HN or Show HN posts receive more points on average?
Do posts created at a certain time of day receive more comments on average?
Do posts created at a certain time of day receive more points on average?

Summary of Results

Post Type	Comments on average	Points on average
Ask HN	14.04	15.06
Show HN	10.32	27.56

Hour with the highest average number of comments per post

Post Type	Peak Hour	Average Comments
Ask HN	15:00 - 16:00	38.59
Show HN	18:00 - 19:00	15.77

Hour with the highest average number of points per post

Post Type	Peak Hour	Average Points
Ask HN	15:00 - 16:00	29.99
Show HN	23:00 - 00:00	42.39

Initial exploration of the data set

In [1]:

from csv import reader

# Read in the data; the with-block closes the file once it has been read
with open('hacker_news.csv', newline='') as opened_file:
    read_file = reader(opened_file)
    hn = list(read_file)

# Showing the first five rows.
print(*hn[:5], sep='\n\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

We notice that the first row contains the header of the dataset. To carry out our analysis we must first separate the header from the data.

In [2]:

headers = hn[0] # First row contains the headers

hn = hn[1:] # Selecting data without headers

print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
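
As an aside, the standard library's `csv.DictReader` consumes the header row itself and yields each record as a dictionary keyed by column name. The sketch below is an alternative to the list-based approach used in this notebook, not part of it; `sample_csv` is a stand-in for the real file and holds one record copied from the output above.

```python
import csv
import io

# Header plus one record copied from the output above; a stand-in for hacker_news.csv.
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

# DictReader treats the first line as the header automatically.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

print(rows[0]["title"])              # Interactive Dynamic Video
print(int(rows[0]["num_comments"]))  # 52
```

With this approach columns are addressed by name rather than by position, at the cost of departing from the index-based code used in the rest of the notebook.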

In [3]:

print(*hn[:5], sep='\n\n') # Showing the first five records

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Now, we can start exploring the number of comments for each type of post.

Extracting Ask HN and Show HN Posts

We'll identify posts that start with Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data will facilitate analysis for the next steps.

In [4]:

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1] # Select the post title
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

# Shows the number of records for each list
print("Total 'Ask HN' posts: {:,}".format(len(ask_posts)))
print("Total 'Show HN' posts: {:,}".format(len(show_posts)))
print("Total other posts: {:,}".format(len(other_posts)))

Total 'Ask HN' posts: 1,744
Total 'Show HN' posts: 1,162
Total other posts: 17,194

Calculating the Average Number of Comments on the Posts

To make our task easier, we will implement a function that allows us to obtain the average value in a given column:

In [5]:

def avg(data, index):
    total = 0
    for row in data:
        total += int(row[index])
    return total / len(data)
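
A quick way to sanity-check a hand-written average is to compare it with `statistics.mean` from the standard library. The sketch below repeats the `avg` function so that it runs on its own; `toy_rows` is made-up data shaped like the dataset, not taken from it.

```python
from statistics import mean

def avg(data, index):
    """Same logic as the function above, repeated so the sketch is self-contained."""
    total = 0
    for row in data:
        total += int(row[index])
    return total / len(data)

# Hypothetical rows shaped like the dataset: index 4 holds num_comments.
toy_rows = [
    ["1", "Ask HN: a", "", "3", "10", "u1", "8/4/2016 11:52"],
    ["2", "Ask HN: b", "", "5", "20", "u2", "8/4/2016 12:10"],
    ["3", "Ask HN: c", "", "7", "30", "u3", "8/4/2016 13:05"],
]

hand_rolled = avg(toy_rows, 4)
library = mean(int(row[4]) for row in toy_rows)
print(hand_rolled == library)
```

Both versions fail on an empty list (`ZeroDivisionError` and `StatisticsError` respectively), so a guard would be needed if a post type could have no rows.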

Now that we have the separate lists, we will calculate the average number of comments each post type receives.

In [6]:

avg_ask_comments = avg(ask_posts, 4)
print("Average number of comments for 'Ask HN' posts: {:.2f}".format(avg_ask_comments))

Average number of comments for 'Ask HN' posts: 14.04

In [7]:

avg_show_comments = avg(show_posts, 4)
print("Average number of comments for 'Show HN' posts: {:.2f}".format(avg_show_comments))

Average number of comments for 'Show HN' posts: 10.32

In [8]:

avg_other_comments = avg(other_posts, 4)
print("Average number of comments for other posts: {:.2f}".format(avg_other_comments))

Average number of comments for other posts: 26.87

It's not surprising that other posts receive more comments on average, since they cover a much wider range of topics. Comparing only Ask HN and Show HN posts, Ask HN posts receive more comments on average (14.04 versus 10.32).

Calculating the Average Number of Points on the Posts

In [9]:

avg_ask_points = avg(ask_posts, 3)
print("Average number of points for 'Ask HN' posts: {:.2f}".format(avg_ask_points))

Average number of points for 'Ask HN' posts: 15.06

In [10]:

avg_show_points = avg(show_posts, 3)
print("Average number of points for 'Show HN' posts: {:.2f}".format(avg_show_points))

Average number of points for 'Show HN' posts: 27.56

On average, the Show HN posts in our sample receive approximately 28 points, while the Ask HN posts receive approximately 15. Show HN posts therefore tend to earn more points.

Finding the total and average number of comments by hour created

Next, we'll implement a function that, given a dataset and the index of one of its columns, returns two dictionaries (as a tuple). The first holds the sum of that column's values for each hour of creation; the second holds the average value per post for each hour.

In [11]:

import datetime as dt

def amount_avg_by_hour(data, column):
    result_list = []
    
    for row in data:
        result_list.append([row[6], int(row[column])])
    
    date_format = "%m/%d/%Y %H:%M"
    
    counts_by_hour = {}
    amount_by_hour = {}
    for row in result_list:
        date = row[0]
        n_vals = row[1]
        time = dt.datetime.strptime(date, date_format).strftime("%H")
        if time not in counts_by_hour:
            counts_by_hour[time] = 0
            amount_by_hour[time] = 0
        counts_by_hour[time] += 1
        amount_by_hour[time] += n_vals
    
    avg_by_hour = {}
    for hour in amount_by_hour:
        avg_by_hour[hour] = round(amount_by_hour[hour] / counts_by_hour[hour], 2)
    
    return amount_by_hour, avg_by_hour
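
To see what the two returned dictionaries look like, here is a sketch on three made-up rows. `totals_and_averages_by_hour` is a hypothetical compact variant written for this illustration (it uses `collections.defaultdict`), and `toy_rows` is invented data; neither is part of the original notebook.

```python
import datetime as dt
from collections import defaultdict

def totals_and_averages_by_hour(rows, column, date_column=6):
    """Return (total per hour, average per hour) for the given column."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        created = dt.datetime.strptime(row[date_column], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")  # zero-padded hour, e.g. "09"
        totals[hour] += int(row[column])
        counts[hour] += 1
    averages = {hour: round(totals[hour] / counts[hour], 2) for hour in totals}
    return dict(totals), averages

# Hypothetical rows: two posts created in the 09 hour, one in the 15 hour.
toy_rows = [
    ["1", "Ask HN: a", "", "3", "10", "u1", "8/4/2016 9:52"],
    ["2", "Ask HN: b", "", "5", "20", "u2", "8/5/2016 9:10"],
    ["3", "Ask HN: c", "", "7", "6", "u3", "8/4/2016 15:05"],
]

totals, averages = totals_and_averages_by_hour(toy_rows, 4)
print(totals)    # {'09': 30, '15': 6}
print(averages)  # {'09': 15.0, '15': 6.0}
```

The first dictionary is a sum, not a count of posts: the two posts created in the 09 hour contribute 30 comments in total and 15.0 on average.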

Now, let's determine if posts made at a certain time are more likely to attract comments. The following steps will help us to perform this analysis:

We'll calculate the total number of comments received by the posts created during each hour of the day.
We'll calculate the average number of comments per post for each hour.

In [12]:

# Gets the results for "Ask HN" posts
ask_comments = amount_avg_by_hour(ask_posts, 4) # Index 4 contains the number of comments

# Selects the total number of comments on Ask posts created during each hour of the day
ask_comments_by_hour = ask_comments[0]

print("ASK HN - NUMBER OF COMMENTS BY HOUR:")
print(*sorted(ask_comments_by_hour.items()), sep='\n')

ASK HN - NUMBER OF COMMENTS BY HOUR:
('00', 447)
('01', 683)
('02', 1381)
('03', 421)
('04', 337)
('05', 464)
('06', 397)
('07', 267)
('08', 492)
('09', 251)
('10', 793)
('11', 641)
('12', 687)
('13', 1253)
('14', 1416)
('15', 4477)
('16', 1814)
('17', 1146)
('18', 1439)
('19', 1188)
('20', 1722)
('21', 1745)
('22', 479)
('23', 543)

In [13]:

# Gets the results for "Show HN" posts
show_comments = amount_avg_by_hour(show_posts, 4)

# Selects the total number of comments on Show posts created during each hour of the day
show_comments_by_hour = show_comments[0]

print("SHOW HN - NUMBER OF COMMENTS BY HOUR:")
print(*sorted(show_comments_by_hour.items()), sep='\n')

SHOW HN - NUMBER OF COMMENTS BY HOUR:
('00', 487)
('01', 246)
('02', 127)
('03', 287)
('04', 247)
('05', 58)
('06', 142)
('07', 299)
('08', 165)
('09', 291)
('10', 297)
('11', 491)
('12', 720)
('13', 946)
('14', 1156)
('15', 632)
('16', 1084)
('17', 911)
('18', 962)
('19', 539)
('20', 612)
('21', 272)
('22', 570)
('23', 447)

Getting the Average Number of Comments on the Posts by Hour

We'll now look at the second dictionary returned for each post type, which holds the average number of comments on the posts created during each hour of the day.

Average comments for "Ask HN" posts

In [14]:

# Selects the average number of comments per post by hour of the day
avg_ask_comments_by_hour = ask_comments[1]

print("AVG NUMBER OF COMMENTS BY HOUR:")
print(*sorted(avg_ask_comments_by_hour.items()), sep='\n')

AVG NUMBER OF COMMENTS BY HOUR:
('00', 8.13)
('01', 11.38)
('02', 23.81)
('03', 7.8)
('04', 7.17)
('05', 10.09)
('06', 9.02)
('07', 7.85)
('08', 10.25)
('09', 5.58)
('10', 13.44)
('11', 11.05)
('12', 9.41)
('13', 14.74)
('14', 13.23)
('15', 38.59)
('16', 16.8)
('17', 11.46)
('18', 13.2)
('19', 10.8)
('20', 21.52)
('21', 16.01)
('22', 6.75)
('23', 7.99)

To make it easier to identify the times with the highest values, we will sort the results by values and retrieve the five highest values in a format that is easier to read.

In [15]:

# Build (average, hour) pairs; a list (unlike a dict) keeps hours that share the same average
swap_avg_ask_by_hour = [(v, k) for k, v in avg_ask_comments_by_hour.items()]

sorted_swap_ask = sorted(swap_avg_ask_by_hour, reverse=True) # Returns an ordered list of tuples

print("AVG NUMBER OF COMMENTS BY HOUR - ORDERED LIST")
print(*sorted_swap_ask, sep='\n')

AVG NUMBER OF COMMENTS BY HOUR - ORDERED LIST
(38.59, '15')
(23.81, '02')
(21.52, '20')
(16.8, '16')
(16.01, '21')
(14.74, '13')
(13.44, '10')
(13.23, '14')
(13.2, '18')
(11.46, '17')
(11.38, '01')
(11.05, '11')
(10.8, '19')
(10.25, '08')
(10.09, '05')
(9.41, '12')
(9.02, '06')
(8.13, '00')
(7.99, '23')
(7.85, '07')
(7.8, '03')
(7.17, '04')
(6.75, '22')
(5.58, '09')
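
Another way to rank the hours is to sort the (hour, average) pairs directly with a key function. Because nothing is keyed by the average, hours that happen to share the same average are all kept. The sketch below uses a few averages copied from the output above; `sample_avgs` is just an illustrative subset.

```python
# A handful of hour -> average pairs copied from the output above.
sample_avgs = {"02": 23.81, "15": 38.59, "16": 16.8, "20": 21.52, "21": 16.01}

# Sort by the average (the second item of each pair), highest first.
ranked = sorted(sample_avgs.items(), key=lambda pair: pair[1], reverse=True)

for hour, average in ranked[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```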

In [16]:

# Shows the 5 hours with the highest average comments.
print("TOP 5 HOURS FOR 'ASK' POSTS COMMENTS")

# avg_value avoids shadowing the avg() function defined earlier
for avg_value, hour in sorted_swap_ask[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg_value
        )
    )

TOP 5 HOURS FOR 'ASK' POSTS COMMENTS
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

Ask posts receive the most comments on average when created at 3:00 PM, with 38.59 comments per post. That is approximately 62% more than the second-ranked hour, 2:00 AM (23.81), which is followed by 8:00 PM (21.52).
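
The gap between the top hour and the runner-up can be checked with a short calculation. This is only a sketch of the arithmetic, using the two averages copied from the top-five list above.

```python
# Averages copied from the top-five list above.
top_avg = 38.59      # 15:00
second_avg = 23.81   # 02:00

# Relative difference of the top hour over the runner-up, in percent.
pct_higher = (top_avg - second_avg) / second_avg * 100
print("{:.1f}% higher".format(pct_higher))  # 62.1% higher
```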

Average comments for "Show HN" posts

In [17]:

# Selects the average number of comments per post by hour of the day
avg_show_comments_by_hour = show_comments[1]

print("AVG NUMBER OF COMMENTS BY HOUR:")
print(*sorted(avg_show_comments_by_hour.items()), sep='\n')

AVG NUMBER OF COMMENTS BY HOUR:
('00', 15.71)
('01', 8.79)
('02', 4.23)
('03', 10.63)
('04', 9.5)
('05', 3.05)
('06', 8.88)
('07', 11.5)
('08', 4.85)
('09', 9.7)
('10', 8.25)
('11', 11.16)
('12', 11.8)
('13', 9.56)
('14', 13.44)
('15', 8.1)
('16', 11.66)
('17', 9.8)
('18', 15.77)
('19', 9.8)
('20', 10.2)
('21', 5.79)
('22', 12.39)
('23', 12.42)

We sort the results by values and retrieve the five highest values.

In [18]:

# Build (average, hour) pairs; a list (unlike a dict) keeps hours that share the same average
swap_avg_show_by_hour = [(v, k) for k, v in avg_show_comments_by_hour.items()]

sorted_swap_show = sorted(swap_avg_show_by_hour, reverse=True) # Returns an ordered list of tuples

print("AVG NUMBER OF COMMENTS BY HOUR - ORDERED LIST")
print(*sorted_swap_show, sep='\n')

AVG NUMBER OF COMMENTS BY HOUR - ORDERED LIST
(15.77, '18')
(15.71, '00')
(13.44, '14')
(12.42, '23')
(12.39, '22')
(11.8, '12')
(11.66, '16')
(11.5, '07')
(11.16, '11')
(10.63, '03')
(10.2, '20')
(9.8, '19')
(9.8, '17')
(9.7, '09')
(9.56, '13')
(9.5, '04')
(8.88, '06')
(8.79, '01')
(8.25, '10')
(8.1, '15')
(5.79, '21')
(4.85, '08')
(4.23, '02')
(3.05, '05')

In [19]:

# Shows the 5 hours with the highest average comments.
print("TOP 5 HOURS FOR 'SHOW' POSTS COMMENTS")

# avg_value avoids shadowing the avg() function defined earlier
for avg_value, hour in sorted_swap_show[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg_value
        )
    )

TOP 5 HOURS FOR 'SHOW' POSTS COMMENTS
18:00: 15.77 average comments per post
00:00: 15.71 average comments per post
14:00: 13.44 average comments per post
23:00: 12.42 average comments per post
22:00: 12.39 average comments per post

Show posts receive the most comments on average when created at 18:00 (15.77) or at midnight (15.71), about 15.7 comments per post in both cases. That is roughly 23% to 24% more than the average of the other three hours in the top five (12.75).

Based on the dataset documentation, the time zone used is US Eastern Time.
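
Readers outside that time zone may want to translate the hours reported here. The sketch below uses the standard library's `zoneinfo` module (Python 3.9+) and assumes that "US Eastern Time" corresponds to the IANA zone `America/New_York`; the target zone `Europe/Madrid` is an arbitrary example, and the date is one that appears in the dataset preview.

```python
import datetime as dt
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# Assumption: the dataset's "US Eastern Time" maps to this IANA zone.
eastern = ZoneInfo("America/New_York")

# 3:00 PM Eastern on a date seen in the dataset (daylight saving time applies).
post_time = dt.datetime(2016, 8, 4, 15, 0, tzinfo=eastern)

utc_time = post_time.astimezone(dt.timezone.utc)
madrid_time = post_time.astimezone(ZoneInfo("Europe/Madrid"))  # arbitrary example zone

print(utc_time.strftime("%H:%M"))     # 19:00
print(madrid_time.strftime("%H:%M"))  # 21:00
```

Because the offset depends on daylight saving time, the conversion should be done per date, not with a single fixed offset.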

Finding the total and average number of points by hour created

Points for "Ask HN" posts

In [20]:

ask_points = amount_avg_by_hour(ask_posts, 3) # Index 3 contains the points

ask_points_by_hour = ask_points[0]

print("ASK HN - POINTS BY HOUR:")
print(*sorted(ask_points_by_hour.items()), sep='\n')

ASK HN - POINTS BY HOUR:
('00', 451)
('01', 700)
('02', 793)
('03', 374)
('04', 389)
('05', 552)
('06', 591)
('07', 361)
('08', 515)
('09', 329)
('10', 1102)
('11', 825)
('12', 782)
('13', 2062)
('14', 1282)
('15', 3479)
('16', 2522)
('17', 1941)
('18', 1741)
('19', 1513)
('20', 1151)
('21', 1721)
('22', 511)
('23', 581)

In [21]:

avg_points_ask_by_hour = ask_points[1]

print("AVG OF POINTS BY HOUR:")
print(*sorted(avg_points_ask_by_hour.items()), sep='\n')

AVG OF POINTS BY HOUR:
('00', 8.2)
('01', 11.67)
('02', 13.67)
('03', 6.93)
('04', 8.28)
('05', 12.0)
('06', 13.43)
('07', 10.62)
('08', 10.73)
('09', 7.31)
('10', 18.68)
('11', 14.22)
('12', 10.71)
('13', 24.26)
('14', 11.98)
('15', 29.99)
('16', 23.35)
('17', 19.41)
('18', 15.97)
('19', 13.75)
('20', 14.39)
('21', 15.79)
('22', 7.2)
('23', 8.54)

In [22]:

# Build (average, hour) pairs; a list keeps hours that share the same average
swap_avg_ask_by_hour = [(v, k) for k, v in avg_points_ask_by_hour.items()]
sorted_swap_ask = sorted(swap_avg_ask_by_hour, reverse=True)

print("TOP 5 HOURS FOR 'ASK' POSTS POINTS")

# avg_value avoids shadowing the avg() function defined earlier
for avg_value, hour in sorted_swap_ask[:5]:
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg_value
        )
    )

TOP 5 HOURS FOR 'ASK' POSTS POINTS
15:00: 29.99 average points per post
13:00: 24.26 average points per post
16:00: 23.35 average points per post
17:00: 19.41 average points per post
10:00: 18.68 average points per post

Ask posts created at 15:00 receive the most points on average, with 29.99 points per post. That is approximately 24% more than the second-ranked hour, 13:00 (24.26).

Points for "Show HN" posts

In [23]:

show_points = amount_avg_by_hour(show_posts, 3) # Index 3 contains the points

show_points_by_hour = show_points[0]

print("SHOW HN - POINTS BY HOUR:")
print(*sorted(show_points_by_hour.items()), sep='\n')

SHOW HN - POINTS BY HOUR:
('00', 1173)
('01', 700)
('02', 340)
('03', 679)
('04', 386)
('05', 104)
('06', 375)
('07', 494)
('08', 519)
('09', 553)
('10', 681)
('11', 1480)
('12', 2543)
('13', 2438)
('14', 2187)
('15', 2228)
('16', 2634)
('17', 2521)
('18', 2215)
('19', 1702)
('20', 1819)
('21', 866)
('22', 1856)
('23', 1526)

In [24]:

avg_points_show_by_hour = show_points[1]

print("AVG OF POINTS BY HOUR:")
print(*sorted(avg_points_show_by_hour.items()), sep='\n')

AVG OF POINTS BY HOUR:
('00', 37.84)
('01', 25.0)
('02', 11.33)
('03', 25.15)
('04', 14.85)
('05', 5.47)
('06', 23.44)
('07', 19.0)
('08', 15.26)
('09', 18.43)
('10', 18.92)
('11', 33.64)
('12', 41.69)
('13', 24.63)
('14', 25.43)
('15', 28.56)
('16', 28.32)
('17', 27.11)
('18', 36.31)
('19', 30.95)
('20', 30.32)
('21', 18.43)
('22', 40.35)
('23', 42.39)

In [25]:

# Build (average, hour) pairs; a list keeps hours that share the same average
swap_avg_show_by_hour = [(v, k) for k, v in avg_points_show_by_hour.items()]
sorted_swap_show = sorted(swap_avg_show_by_hour, reverse=True)

print("TOP 5 HOURS FOR 'SHOW' POSTS POINTS")

# avg_value avoids shadowing the avg() function defined earlier
for avg_value, hour in sorted_swap_show[:5]:
    print(
        "{}: {:.2f} average points per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"), avg_value
        )
    )

TOP 5 HOURS FOR 'SHOW' POSTS POINTS
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post

Show posts receive the most points on average when created at 23:00 (42.39) or at 12:00 (41.69), about 42 points per post in both cases. That is roughly 10% more than the average of the other three hours in the top five (about 38.17).
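
The comparison between the two leading hours and the remaining top-five hours can be checked with a few lines. Treating the comparison group as the other three hours in the top five is an assumption made for this sketch; the values are copied from the output above.

```python
from statistics import mean

# Top-five averages copied from the output above.
top_two = [42.39, 41.69]          # 23:00 and 12:00
others = [40.35, 37.84, 36.31]    # 22:00, 00:00 and 18:00

# Relative difference between the two group means, in percent.
pct_higher = (mean(top_two) - mean(others)) / mean(others) * 100
print("{:.1f}% higher".format(pct_higher))
```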

Conclusion

In this project, we analyzed Ask HN and Show HN posts to determine which type of post, and which hour of creation, attracts the most comments and points on average.

Based on our analysis, to maximize the number of comments a post receives, we recommend creating an Ask HN post between 3:00 PM and 4:00 PM US Eastern Time (15:00 - 16:00), the hour with the highest average number of comments.

To maximize the number of points a post receives, we recommend creating a Show HN post between 11:00 PM and midnight US Eastern Time (23:00 - 00:00), the hour with the highest average number of points.

However, it should be noted that the dataset we analyzed excludes posts that received no comments.