
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

You can find the data set here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

id: The unique identifier from Hacker News for the post

title: The title of the post

url: The URL that the post links to, if the post has a URL

num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes

num_comments: The number of comments that were made on the post

author: The username of the person who submitted the post

created_at: The date and time at which the post was submitted

In this project, we’ll compare two types of Hacker News posts: those whose titles begin with Ask HN and those whose titles begin with Show HN.

Users submit Ask HN posts to ask the Hacker News community a specific question, such as “What is the best online course you’ve ever taken?” Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We’ll specifically compare these two types of posts to determine the following:

. Do Ask HN or Show HN posts receive more comments on average?

. Do posts created at a certain time receive more comments on average?

It should be noted that the data set we’re working with was reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions.

Introduction¶

First, we’ll read in the data and remove the headers.

In [1]:

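from csv import reader

# Read the CSV file into a list of lists; the with block closes the file afterwards.
with open('/Users/umesh/Downloads/HN_posts_year_to_Sep_20_2016.csv') as opened_file:
    hn = list(reader(opened_file))

# Separate the header row from the data rows.
hn_header = hn[0]
hn = hn[1:]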

In [2]:

print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [3]:

hn[0:5]

Out[3]:

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

We can see above that the data set contains the title of the posts, the number of comments for each post, and the date the post was created. Let’s start by exploring the number of comments for each type of post.
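The cells that follow refer to columns by position (row[1], row[4], row[6]). As a minimal sketch, assuming the header row printed above, the positions can instead be looked up by name. The names header, sample_row, and col are illustrative only and are not part of the notebook; the sample row is copied from the output above.

```python
# Header and first data row as printed above (sample data, not the full file).
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
sample_row = ['12579008',
              'You have two days to comment if you want stem cells to be classified as your own',
              'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
              '1', '0', 'altstar', '9/26/2016 3:26']

# Map each column name to its position so later code can avoid magic numbers.
col = {name: position for position, name in enumerate(header)}

title = sample_row[col['title']]
num_comments = int(sample_row[col['num_comments']])  # csv values are read as strings
created_at = sample_row[col['created_at']]

print(col['title'], col['num_comments'], col['created_at'])
```

The three printed positions are the indices that the rest of the analysis uses for the title, the comment count, and the creation date.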

Extracting Ask HN and Show HN Posts¶

First, we’ll identify posts that begin with either Ask HN or Show HN and separate the data for those two types of posts into different lists. Separating the data makes it easier to analyze in the following steps.

In [4]:

ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))
print(len(ask_posts + show_posts + other_posts))

In [5]:

print(ask_posts[:3])

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57']]

In [6]:

print(show_posts[:3])

[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44']]

In [7]:

print(other_posts[:3])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']]
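The case-insensitive prefix test used above can be isolated in a small helper. This is only a sketch: classify_title is a hypothetical name, and the titles are taken from the sample output above.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts."""
    lowered = title.lower()  # lower-casing makes the prefix test case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Titles copied from the sample rows printed above.
titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Finding puns computationally',
    'SQLAR  the SQLite Archiver',
]
labels = [classify_title(t) for t in titles]
print(labels)
```

Because the title is lower-cased first, variants such as "ASK HN" or "ask hn" fall into the same group.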

Calculating the Average Number of Comments for Ask HN and Show HN Posts¶

Now that we have separated ask posts and show posts into different lists, we’ll calculate the average number of comments each type of post receives.

In [8]:

#Calculate the average number of comments `Ask HN` posts receive.
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)
print(total_ask_comments)

10.393478498741656
94986

In [9]:

#Calculate the average number of comments `Show HN` posts receive.

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)
print(total_show_comments)

4.886099625910612
49633

==> On average, ask posts in our sample receive approximately 10 comments, whereas show posts receive approximately 5. Since ask posts receive more comments on average, we’ll focus our remaining analysis just on these posts.
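The two cells above repeat the same loop for each list. A minimal sketch of a shared helper follows, assuming the comment count sits at index 4 as in this data set. average_comments is a hypothetical name, and the three rows are only the first ask posts printed earlier, so the result is not the data set average.

```python
def average_comments(rows, comments_index=4):
    """Average the comment counts of the given rows (raises ZeroDivisionError if rows is empty)."""
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# First three ask posts as printed earlier in the notebook.
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

sample_average = average_comments(sample_ask)
print(sample_average)  # (7 + 3 + 0) / 3
```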

Finding the Number of Ask Posts and Comments by Hour Created¶

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.

. Calculate the average number of comments ask posts receive by hour created.

In [10]:

import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append((created_at, num_comments))
result_list[:10]  

Out[10]:

[('9/26/2016 2:53', 7),
 ('9/26/2016 1:17', 3),
 ('9/25/2016 22:57', 0),
 ('9/25/2016 22:48', 3),
 ('9/25/2016 21:50', 2),
 ('9/25/2016 19:30', 1),
 ('9/25/2016 19:22', 22),
 ('9/25/2016 17:55', 3),
 ('9/25/2016 15:48', 0),
 ('9/25/2016 15:35', 13)]

In [11]:

len(result_list)

Out[11]:

In [12]:

result_list[9130:]

Out[12]:

[('9/6/2015 15:09', 0),
 ('9/6/2015 14:53', 6),
 ('9/6/2015 13:01', 9),
 ('9/6/2015 12:17', 0),
 ('9/6/2015 11:27', 0),
 ('9/6/2015 10:52', 1),
 ('9/6/2015 10:46', 4),
 ('9/6/2015 9:36', 20),
 ('9/6/2015 6:02', 20)]
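The created_at strings above have the form month/day/year hour:minute. As a small sketch of how one such string can be parsed, the example below uses datetime.strptime() with the format "%m/%d/%Y %H:%M" and then reads the hour both as an integer and as a zero-padded string. The variable names are illustrative.

```python
import datetime as dt

created_at = '9/26/2016 2:53'  # first ask post in the list above

# Parse the string into a datetime object.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

hour_as_int = parsed.hour            # integer hour, 0 to 23
hour_as_str = parsed.strftime("%H")  # zero-padded string, "00" to "23"

print(parsed, hour_as_int, hour_as_str)
```

The zero-padded string form is convenient as a dictionary key because it keeps hours such as "02" and "15" the same width.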

In [13]:

#Create two empty dictionaries called counts_by_hour and comments_per_hour
counts_by_hour = {}
comments_per_hour = {}

for row in result_list:
    #The first element of the row is the creation date string,
    #the second is the number of comments.
    created_at = row[0]
    comments_created = row[1]

    #Use datetime.strptime() to parse the date string into a datetime object,
    #then use strftime() to select just the hour from that object.
    created_dt = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    posts_created = created_dt.strftime("%H")

    if posts_created not in counts_by_hour:
        #First post seen for this hour: create the key in both dictionaries.
        counts_by_hour[posts_created] = 1
        comments_per_hour[posts_created] = comments_created
    else:
        #Hour already seen: increment the post count and the comment total.
        counts_by_hour[posts_created] += 1
        comments_per_hour[posts_created] += comments_created

In [14]:

counts_by_hour

Out[14]:

{'02': 269,
 '01': 282,
 '22': 383,
 '21': 518,
 '19': 552,
 '17': 587,
 '15': 646,
 '14': 513,
 '13': 444,
 '11': 312,
 '10': 282,
 '09': 222,
 '07': 226,
 '03': 271,
 '23': 343,
 '20': 510,
 '16': 579,
 '08': 257,
 '00': 301,
 '18': 614,
 '12': 342,
 '04': 243,
 '06': 234,
 '05': 209}

In [15]:

comments_per_hour

Out[15]:

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}
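The same per-hour tallies can be written without the explicit "is the key present" branch. The sketch below is an alternative, not the notebook's method: it uses collections.defaultdict from the standard library on the first four entries of result_list shown earlier, and the names sample, counts, and comments are illustrative.

```python
from collections import defaultdict
import datetime as dt

# First four (created_at, num_comments) pairs from result_list above.
sample = [('9/26/2016 2:53', 7),
          ('9/26/2016 1:17', 3),
          ('9/25/2016 22:57', 0),
          ('9/25/2016 22:48', 3)]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)

for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```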

In the last cell, we created two dictionaries:¶

. counts_by_hour: contains the number of ask posts created during each hour of the day.

. comments_per_hour: contains the corresponding number of comments ask posts created at each hour received.

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day.

In [16]:

#calculate the average number of comments per post for posts created during each hour of the day

avg_by_hour = []
for row in counts_by_hour:
    average = comments_per_hour[row] / counts_by_hour[row]
    avg_by_hour.append([row, average])
avg_by_hour    
    

Out[16]:

[['02', 11.137546468401487],
 ['01', 7.407801418439717],
 ['22', 8.804177545691905],
 ['21', 8.687258687258687],
 ['19', 7.163043478260869],
 ['17', 9.449744463373083],
 ['15', 28.676470588235293],
 ['14', 9.692007797270955],
 ['13', 16.31756756756757],
 ['11', 8.96474358974359],
 ['10', 10.684397163120567],
 ['09', 6.653153153153153],
 ['07', 7.013274336283186],
 ['03', 7.948339483394834],
 ['23', 6.696793002915452],
 ['20', 8.749019607843136],
 ['16', 7.713298791018998],
 ['08', 9.190661478599221],
 ['00', 7.5647840531561465],
 ['18', 7.94299674267101],
 ['12', 12.380116959064328],
 ['04', 9.7119341563786],
 ['06', 6.782051282051282],
 ['05', 8.794258373205741]]
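Each average above is simply the comment total for an hour divided by the post count for that hour. As a quick check, the sketch below recomputes three of them from the values printed in counts_by_hour and comments_per_hour. The names counts_sample, comments_sample, and avg_sample are illustrative.

```python
# Three hours copied from the counts_by_hour and comments_per_hour outputs above.
counts_sample = {'15': 646, '13': 444, '12': 342}
comments_sample = {'15': 18525, '13': 7245, '12': 4234}

# Average comments per post = total comments / number of posts, per hour.
avg_sample = {hour: comments_sample[hour] / counts_sample[hour]
              for hour in counts_sample}

for hour, average in avg_sample.items():
    print(hour, round(average, 2))
```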

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

. Create a list that equals avg_by_hour with swapped columns. Create an empty list and assign it to swap_avg_by_hour, then iterate over the rows of avg_by_hour and append to swap_avg_by_hour a list whose first element is the second element of the row, and whose second element is the first element of the row.

. Print swap_avg_by_hour.

. Use the sorted() function to sort swap_avg_by_hour in descending order. Since the first column of this list is the average number of comments, sorting the list will sort by the average number of comments. Set the reverse argument to True, so that the highest value in the first column appears first in the list, and assign the result to sorted_swap.

. Print the string "Top 5 Hours for Ask Posts Comments".

. Loop through each average and each hour (in this order) in the first five lists of sorted_swap.

. Use the str.format() method to print the hour and average in the following format: 15:00: 38.59 average comments per post.

To format the hours, use the datetime.strptime() constructor to return a datetime object and then use the strftime() method to specify the format of the time. To format the average, you can use {:.2f} to indicate that just two decimal places should be used.
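A minimal sketch of the formatting step just described, using one hour and average from the avg_by_hour output above. The variable names are illustrative.

```python
import datetime as dt

hour_str = '15'
average = 28.676470588235293  # value for hour '15' in avg_by_hour

# Parse the two-digit hour, then render it as hour:minute.
hour_label = dt.datetime.strptime(hour_str, "%H").strftime("%H:%M")

# {:.2f} keeps two decimal places of the average.
line = "{}: {:.2f} average comments per post".format(hour_label, average)
print(line)
```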

In [17]:

#Create a list that equals avg_by_hour with swapped columns
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour    

Out[17]:

[[11.137546468401487, '02'],
 [7.407801418439717, '01'],
 [8.804177545691905, '22'],
 [8.687258687258687, '21'],
 [7.163043478260869, '19'],
 [9.449744463373083, '17'],
 [28.676470588235293, '15'],
 [9.692007797270955, '14'],
 [16.31756756756757, '13'],
 [8.96474358974359, '11'],
 [10.684397163120567, '10'],
 [6.653153153153153, '09'],
 [7.013274336283186, '07'],
 [7.948339483394834, '03'],
 [6.696793002915452, '23'],
 [8.749019607843136, '20'],
 [7.713298791018998, '16'],
 [9.190661478599221, '08'],
 [7.5647840531561465, '00'],
 [7.94299674267101, '18'],
 [12.380116959064328, '12'],
 [9.7119341563786, '04'],
 [6.782051282051282, '06'],
 [8.794258373205741, '05']]

In [18]:

#Use the sorted() function to sort swap_avg_by_hour in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
sorted_swap

Out[18]:

[[28.676470588235293, '15'],
 [16.31756756756757, '13'],
 [12.380116959064328, '12'],
 [11.137546468401487, '02'],
 [10.684397163120567, '10'],
 [9.7119341563786, '04'],
 [9.692007797270955, '14'],
 [9.449744463373083, '17'],
 [9.190661478599221, '08'],
 [8.96474358974359, '11'],
 [8.804177545691905, '22'],
 [8.794258373205741, '05'],
 [8.749019607843136, '20'],
 [8.687258687258687, '21'],
 [7.948339483394834, '03'],
 [7.94299674267101, '18'],
 [7.713298791018998, '16'],
 [7.5647840531561465, '00'],
 [7.407801418439717, '01'],
 [7.163043478260869, '19'],
 [7.013274336283186, '07'],
 [6.782051282051282, '06'],
 [6.696793002915452, '23'],
 [6.653153153153153, '09']]
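Swapping the columns is one way to sort by the average. As an alternative sketch, sorted() also accepts a key function, so the original [hour, average] pairs can be sorted directly. The sample below uses four pairs from avg_by_hour; avg_sample and by_average are illustrative names.

```python
# Four [hour, average] pairs copied from avg_by_hour above.
avg_sample = [['02', 11.137546468401487],
              ['15', 28.676470588235293],
              ['13', 16.31756756756757],
              ['09', 6.653153153153153]]

# Sort by the second element (the average), highest first, without swapping columns.
by_average = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)
print(by_average)
```

With this approach the hour stays in the first position, so the later printing loop would read the hour from index 0 and the average from index 1.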

In [19]:

print("Top 5 Hours for Ask Posts Comments")

Top 5 Hours for Ask Posts Comments

In [20]:

#Use the datetime.strptime() constructor to return a datetime object,
#then the strftime() method to specify the format of the time.
#Use str.format() with {:.2f} so that just two decimal places of the average are printed.
for each in sorted_swap[0:5]:
    hour = dt.datetime.strptime(each[1],"%H")
    hour = hour.strftime("%H:%M")
    a = "{hour} > {comments:.2f} average comments per post".format(hour = hour, comments = each[0])
    print(a)

15:00 > 28.68 average comments per post
13:00 > 16.32 average comments per post
12:00 > 12.38 average comments per post
02:00 > 11.14 average comments per post
10:00 > 10.68 average comments per post

Finding the Number of Show Posts and Comments by Hour Created¶

In [33]:

import datetime as dt
result_list1 = []
for row in show_posts:
    created_at1 = row[6]
    num_comments1 = int(row[4])
    result_list1.append((created_at1, num_comments1))
result_list1[:10]

Out[33]:

[('9/26/2016 0:36', 0),
 ('9/26/2016 0:01', 0),
 ('9/25/2016 23:44', 0),
 ('9/25/2016 23:17', 0),
 ('9/25/2016 20:06', 1),
 ('9/25/2016 19:06', 1),
 ('9/25/2016 18:32', 0),
 ('9/25/2016 16:50', 1),
 ('9/25/2016 16:43', 0),
 ('9/25/2016 14:30', 1)]

In [34]:

result_list1[10150:]

Out[34]:

[('9/6/2015 18:05', 0),
 ('9/6/2015 15:41', 0),
 ('9/6/2015 15:25', 0),
 ('9/6/2015 14:21', 0),
 ('9/6/2015 13:50', 2),
 ('9/6/2015 13:02', 6),
 ('9/6/2015 12:38', 4),
 ('9/6/2015 12:16', 1)]

In [38]:

counts_by_hour1 = {}
comments_by_hour1 = {}

for row in result_list1:
    hour1 = row[0]
    dt_str1 = dt.datetime.strptime(hour1, "%m/%d/%Y %H:%M")
    post_created1 = dt_str1.strftime("%H")
    comments_created1 = row[1]
    if post_created1 not in counts_by_hour1:
        counts_by_hour1[post_created1] = 1
        comments_by_hour1[post_created1] = comments_created1
    else:
        counts_by_hour1[post_created1] += 1
        comments_by_hour1[post_created1] += comments_created1
        

In [39]:

counts_by_hour1

Out[39]:

{'00': 276,
 '23': 319,
 '20': 525,
 '19': 556,
 '18': 656,
 '16': 801,
 '14': 696,
 '10': 323,
 '09': 302,
 '08': 316,
 '06': 192,
 '03': 206,
 '21': 430,
 '17': 761,
 '15': 836,
 '11': 402,
 '07': 236,
 '04': 194,
 '13': 610,
 '12': 516,
 '01': 247,
 '22': 377,
 '02': 209,
 '05': 172}

In [40]:

comments_by_hour1

Out[40]:

{'00': 1283,
 '23': 1444,
 '20': 2183,
 '19': 2791,
 '18': 3242,
 '16': 3769,
 '14': 3839,
 '10': 1228,
 '09': 1411,
 '08': 1771,
 '06': 904,
 '03': 934,
 '21': 1759,
 '17': 3236,
 '15': 3824,
 '11': 2413,
 '07': 1577,
 '04': 978,
 '13': 3314,
 '12': 3609,
 '01': 1006,
 '22': 1450,
 '02': 1076,
 '05': 592}

In [41]:

avg_by_hour1 = []
for row in counts_by_hour1:
    average1 = comments_by_hour1[row] / counts_by_hour1[row]
    avg_by_hour1.append([row, average1])
avg_by_hour1    

Out[41]:

[['00', 4.648550724637682],
 ['23', 4.5266457680250785],
 ['20', 4.158095238095238],
 ['19', 5.01978417266187],
 ['18', 4.942073170731708],
 ['16', 4.705368289637953],
 ['14', 5.515804597701149],
 ['10', 3.801857585139319],
 ['09', 4.672185430463577],
 ['08', 5.6044303797468356],
 ['06', 4.708333333333333],
 ['03', 4.533980582524272],
 ['21', 4.090697674418605],
 ['17', 4.252299605781866],
 ['15', 4.574162679425838],
 ['11', 6.002487562189055],
 ['07', 6.682203389830509],
 ['04', 5.041237113402062],
 ['13', 5.432786885245902],
 ['12', 6.994186046511628],
 ['01', 4.0728744939271255],
 ['22', 3.8461538461538463],
 ['02', 5.148325358851674],
 ['05', 3.441860465116279]]
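The show post cells repeat the ask post steps with renamed variables. One possible way to avoid the duplication is a single function that works on either list. This is a sketch under the assumption that the date is at index 6 and the comment count at index 4, as in this data set; avg_comments_by_hour is a hypothetical name, and the rows are the first show and ask posts printed earlier.

```python
import datetime as dt

def avg_comments_by_hour(posts, date_index=6, comments_index=4):
    """Return {hour: average comments per post} for the given rows."""
    counts = {}
    comments = {}
    for row in posts:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + int(row[comments_index])
    return {hour: comments[hour] / counts[hour] for hour in counts}

# First three show posts and first two ask posts as printed earlier.
sample_show = [
    ['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'],
    ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'],
    ['12578098', 'Show HN: WebGL visualization of DNA sequences', 'http://grondilu.github.io/dna.html', '1', '0', 'grondilu', '9/25/2016 23:44'],
]
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

show_avg = avg_comments_by_hour(sample_show)
ask_avg = avg_comments_by_hour(sample_ask)
print(show_avg)
print(ask_avg)
```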

In [43]:

swap_avg_by_hour1 = []
for row in avg_by_hour1:
    swap_avg_by_hour1.append([row[1], row[0]])
swap_avg_by_hour1    

Out[43]:

[[4.648550724637682, '00'],
 [4.5266457680250785, '23'],
 [4.158095238095238, '20'],
 [5.01978417266187, '19'],
 [4.942073170731708, '18'],
 [4.705368289637953, '16'],
 [5.515804597701149, '14'],
 [3.801857585139319, '10'],
 [4.672185430463577, '09'],
 [5.6044303797468356, '08'],
 [4.708333333333333, '06'],
 [4.533980582524272, '03'],
 [4.090697674418605, '21'],
 [4.252299605781866, '17'],
 [4.574162679425838, '15'],
 [6.002487562189055, '11'],
 [6.682203389830509, '07'],
 [5.041237113402062, '04'],
 [5.432786885245902, '13'],
 [6.994186046511628, '12'],
 [4.0728744939271255, '01'],
 [3.8461538461538463, '22'],
 [5.148325358851674, '02'],
 [3.441860465116279, '05']]

In [44]:

sorted_swap = sorted(swap_avg_by_hour1, reverse = True)

In [45]:

print("Top 5 Hours for Show Posts Comments")

Top 5 Hours for Show Posts Comments

In [46]:

#Format each hour as HH:MM and print it with its average to two decimal places
for each in sorted_swap[0:5]:
    hour1 = dt.datetime.strptime(each[1],"%H")
    hour1 = hour1.strftime("%H:%M")
    b = "{hour}: {comments:.2f} average comments per post".format(hour = hour1, comments = each[0])
    print(b)

12:00: 6.99 average comments per post
07:00: 6.68 average comments per post
11:00: 6.00 average comments per post
08:00: 5.60 average comments per post
14:00: 5.52 average comments per post
