
Exploring Hacker News Posts¶

Hacker News is a site that is extremely popular in technology and startup circles. Posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors.

Posts whose titles begin with either "Ask HN" or "Show HN" are particularly interesting. "Ask HN" posts usually ask the Hacker News community a question, and "Show HN" posts show the community a project, a product, or just something interesting.

So it is interesting to know which of these post types receives more comments on average: the "Ask HN" posts or the "Show HN" posts? And do posts created at a certain time receive more comments on average?

Let's explore together! If you are interested in knowing more about the dataset, please visit here

Import the Hacker News dataset¶

In [1]:

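import csv

# Read the whole CSV into a list of rows; the context manager closes the file.
with open('hacker_news.csv') as opened_file:
    reader = csv.reader(opened_file)
    all_rows = list(reader)

headers = all_rows[0]  # the first row holds the column names
hn = all_rows[1:]      # the remaining rows are the posts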

In [2]:

print(headers)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

In [3]:

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    title = title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print('There are {} ask posts'.format(len(ask_posts)))
print('There are {} show posts'.format(len(show_posts)))
print('There are {} other posts'.format(len(other_posts)))
    

There are 1744 ask posts
There are 1162 show posts
There are 17194 other posts
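The prefix check above can also be expressed as a small reusable function. This is only a sketch: `classify_title` and the sample titles are hypothetical names and data introduced for illustration, not part of the dataset.

```python
from collections import Counter


def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Hypothetical sample titles, made up for this example.
sample_titles = [
    'Ask HN: How do you learn a new language?',
    'Show HN: A tiny static site generator',
    'ask hn: best books of the year?',
    'An ordinary link post',
]

# Count how many titles fall into each category.
category_counts = Counter(classify_title(title) for title in sample_titles)
print(category_counts['ask'], category_counts['show'], category_counts['other'])
```

Keeping the rule in one function makes it easy to count categories with `Counter` or to reuse the same rule when splitting `hn` into separate lists.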

Header information¶

id: The unique identifier from Hacker News for the post

title: The title of the post

url: The URL that the post links to, if the post has a URL

num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

num_comments: The number of comments that were made on the post

author: The username of the person who submitted the post

created_at: The date and time at which the post was submitted
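The cells in this notebook access columns by position (`row[1]`, `row[4]`, `row[6]`). As a sketch of one alternative, the header row can be turned into a name-to-position lookup. The name `column_index` is an assumption made for this example; the header and the sample row are copied from the output printed earlier.

```python
# Column names as printed from the CSV header.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row (hypothetical helper dict).
column_index = {name: position for position, name in enumerate(headers)}

# First row of the dataset, as printed earlier.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/', '386', '52',
       'ne0phyte', '8/4/2016 11:52']

# Look columns up by name instead of by a hard-coded index.
title = row[column_index['title']]
num_comments = int(row[column_index['num_comments']])
print(title, num_comments)
```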

Let's find the average number of comments in ask posts and show posts¶

In [4]:

def cal_num(posts, index):
    """Sum the integer values found in column `index` of every row in `posts`."""
    total_comments = 0

    for row in posts:
        total_comments += int(row[index])
    return total_comments

In [5]:

total_ask_comments = cal_num(ask_posts, 4)
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = cal_num(show_posts, 4)
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993

Comparing the two averages above, we can see that ask posts receive more comments on average (about 14.04 per post) than show posts (about 10.32 per post).
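The sum-then-divide steps in the cells above could be folded into a single helper. The sketch below assumes rows in the dataset's column order; `average_comments` and the sample rows are hypothetical, and returning 0.0 for an empty list is a design choice made here to avoid a division by zero.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts in `posts`; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the dataset's column order, made up for this example.
sample_posts = [
    ['1', 'Ask HN: a', '', '3', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: b', '', '5', '20', 'user_b', '1/1/2016 11:00'],
]

print(average_comments(sample_posts))
```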

To analyze the behavior of ask post comments, we would like to study:¶

Whether ask posts created at a certain time are more likely to attract comments. We will:

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
Calculate the average number of comments ask posts receive by hour created.
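The two steps listed above can be sketched in one function. This is an illustrative variant, not the approach the notebook cells take: it uses `collections.defaultdict` and integer hours (the cells use `'%H'` strings), and `average_comments_by_hour` and the sample pairs are hypothetical.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(pairs, date_format='%m/%d/%Y %H:%M'):
    """Return {hour: average comments} from (created_at, num_comments) pairs."""
    counts = defaultdict(int)    # number of posts created in each hour
    comments = defaultdict(int)  # total comments on posts created in each hour
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, date_format).hour
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in counts}


# Hypothetical (created_at, num_comments) pairs in the dataset's date format.
sample_pairs = [
    ('8/4/2016 11:52', 6),
    ('8/5/2016 11:05', 2),
    ('9/30/2015 4:12', 3),
]

print(average_comments_by_hour(sample_pairs))
```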

In [6]:

import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comment = row[4]
    result_list.append([created_at, int(num_comment)])

Find the number of posts and number of comments for each hour

In [7]:

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    row_dt = dt.datetime.strptime(row[0],'%m/%d/%Y %H:%M')
    Hour = row_dt.strftime('%H') # Extract hours from datetime

    if Hour not in counts_by_hour:
        counts_by_hour[Hour] = 1
        comments_by_hour[Hour] = row[1]
    else:
        counts_by_hour[Hour] += 1
        comments_by_hour[Hour] += row[1]

print(counts_by_hour)
print(comments_by_hour)

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}

In [8]:

avg_by_hour = []
for key in counts_by_hour:
    avg = comments_by_hour[key] / counts_by_hour[key]
    avg_by_hour.append([key, avg])

Inspect the list of average comments per post for each hour

In [9]:

print(avg_by_hour)

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]

Swap the two columns in the average-by-hour list so that the average comes first and the list can be sorted by it.

In [10]:

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]

Sort the swapped list in descending order to locate the hours with the highest average number of comments per post

In [11]:

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

In [12]:

template = '{HourMinute}: {Avg_per_post:.2f} average comments per post'
for row in sorted_swap:
    Hour = dt.datetime.strptime(row[1], '%H')  # parse the hour string into a datetime
    Pt = Hour.strftime('%H:00')  # format the hour as HH:00
    print(template.format(HourMinute=Pt, Avg_per_post=row[0]))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
13:00: 14.74 average comments per post
10:00: 13.44 average comments per post
14:00: 13.23 average comments per post
18:00: 13.20 average comments per post
17:00: 11.46 average comments per post
01:00: 11.38 average comments per post
11:00: 11.05 average comments per post
19:00: 10.80 average comments per post
08:00: 10.25 average comments per post
05:00: 10.09 average comments per post
12:00: 9.41 average comments per post
06:00: 9.02 average comments per post
00:00: 8.13 average comments per post
23:00: 7.99 average comments per post
07:00: 7.85 average comments per post
03:00: 7.80 average comments per post
04:00: 7.17 average comments per post
22:00: 6.75 average comments per post
09:00: 5.58 average comments per post
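The swap-then-sort steps above can also be done without building a swapped copy, by passing a `key` function to `sorted`. A minimal sketch, using a few `[hour, average]` pairs copied from the `avg_by_hour` output; the name `ranked` is an assumption for this example.

```python
# A few [hour, average] pairs copied from the avg_by_hour output.
avg_by_hour = [['09', 5.5777777777777775], ['15', 38.5948275862069],
               ['02', 23.810344827586206], ['20', 21.525]]

# Sort on the average (second element) in descending order.
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

# Print the three hours with the highest averages.
for hour, average in ranked[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, average))
```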

Conclusion¶

Note that the times above are in US Eastern Time (EST). As we live in Europe, it is convenient to convert EST to Central European Time (CET), which is six hours ahead.

In [13]:

template = '{HourMinute}: {Avg_per_post:.2f} average comments per post'
for row in sorted_swap:
    Hour = dt.datetime.strptime(row[1], '%H')  # parse the hour string into a datetime
    Hour = Hour + dt.timedelta(hours=6)  # shift from EST to CET (+6 hours)
    Pt = Hour.strftime('%H:00')  # format the hour as HH:00
    print(template.format(HourMinute=Pt, Avg_per_post=row[0]))

21:00: 38.59 average comments per post
08:00: 23.81 average comments per post
02:00: 21.52 average comments per post
22:00: 16.80 average comments per post
03:00: 16.01 average comments per post
19:00: 14.74 average comments per post
16:00: 13.44 average comments per post
20:00: 13.23 average comments per post
00:00: 13.20 average comments per post
23:00: 11.46 average comments per post
07:00: 11.38 average comments per post
17:00: 11.05 average comments per post
01:00: 10.80 average comments per post
14:00: 10.25 average comments per post
11:00: 10.09 average comments per post
18:00: 9.41 average comments per post
12:00: 9.02 average comments per post
06:00: 8.13 average comments per post
05:00: 7.99 average comments per post
13:00: 7.85 average comments per post
09:00: 7.80 average comments per post
10:00: 7.17 average comments per post
04:00: 6.75 average comments per post
15:00: 5.58 average comments per post
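The fixed six-hour shift used above can also be written with timezone-aware datetimes, which makes the source and target zones explicit. This is a sketch under a stated assumption: it uses fixed UTC offsets (EST as UTC-5, CET as UTC+1) and therefore ignores daylight saving time, just like the six-hour shift. The helper name `est_hour_to_cet` is hypothetical.

```python
import datetime as dt

# Fixed UTC offsets; daylight saving time is deliberately ignored here.
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
CET = dt.timezone(dt.timedelta(hours=1), 'CET')


def est_hour_to_cet(hour_str):
    """Convert an hour string such as '15' from EST to an 'HH:00' string in CET."""
    est_time = dt.datetime.strptime(hour_str, '%H').replace(tzinfo=EST)
    return est_time.astimezone(CET).strftime('%H:00')


print(est_hour_to_cet('15'))
print(est_hour_to_cet('20'))
```

For the two hours shown, this gives the same results as the cell above (21:00 and 02:00).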

By referring to the average comments per post for each hour, we found that ask posts created at around 9pm, 8am, 2am, 10pm and 3am (CET) receive the most comments on average. Posts created in the middle of the night (3am) also have a good chance of receiving comments because