
Hacker News - When should you ask your question?

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

The aim of the project is to compare the two types of posts submitted on Hacker News:

Questions (Ask HN) - asking the community a question
Showcase (Show HN) - showing the community a project, product, or just something interesting

in order to determine:

which type, Ask HN (Question) or Show HN (Showcase), receives more comments
how the time of day at which a post is created influences the number of comments it receives.

The dataset can be found here.

It contains data about posts submitted to the community over a 12-month period (up to September 2016) and was originally put together by Hacker News in 2016. It has not been updated since.

It originally had 300,000 rows. We have reduced it to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.
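The reduction described above could be reproduced roughly as follows. This is a minimal sketch, not the procedure actually used to build the file: the function name `reduce_dataset`, the sample size, the seed, and the toy rows are all assumptions made for illustration, and each row is assumed to hold the comment count as its fifth field.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop posts without comments, then sample the rest without replacement.

    Hypothetical helper: assumes row[4] holds the comment count as a string.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(commented, min(sample_size, len(commented)))

# Toy rows in the same column order as the dataset (made-up values)
toy_rows = [
    ['1', 'Post A', '', '3', '0', 'user_a', '1/1/2016 10:00'],
    ['2', 'Post B', '', '5', '2', 'user_b', '1/2/2016 11:00'],
    ['3', 'Post C', '', '1', '7', 'user_c', '1/3/2016 12:00'],
    ['4', 'Post D', '', '9', '1', 'user_d', '1/4/2016 13:00'],
]

sample = reduce_dataset(toy_rows, sample_size=2)
print(len(sample))  # 2
```

Filtering before sampling matters here: sampling first and filtering afterwards would give a final row count that depends on how many zero-comment posts happened to be drawn.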

Below are descriptions of the columns:

Column Name in Dataset	Description
"id"	the unique identifier from Hacker News for the post
"title"	the title of the post
"url"	the URL that the post links to, if the post has a URL
"num_points"	the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
"num_comments"	the number of comments on the post
"author"	the username of the person who submitted the post
"created_at"	the date and time of the post's submission

For the purposes of this project, the columns of interest to us are: num_comments and created_at.

Let's open the dataset and display the first 5 rows.

In [1]:

from csv import reader

# Read the whole CSV into a list of rows; the with-block closes the file afterwards
with open('hacker_news.csv') as opened_file:
    hn = list(reader(opened_file))

hn[:5]

Out[1]:

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [2]:

headers = hn[0]  # keep the header row separately
hn = hn[1:]      # the remaining rows are the posts
print(headers)
hn[:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Out[2]:

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

Extracting the Ask HN and Show HN Posts

Since we want to compare the Question and Showcase submissions, we will split them into two lists using the string method startswith. So that the match does not depend on capitalization, we first convert each title to lower case with the lower method.

In [3]:

# Lists to hold each type of post
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()  # lower-case so the prefix match ignores capitalization
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Ask posts:', len(ask_posts))
print('Show Posts:', len(show_posts))
print('Other Posts:', len(other_posts))

Ask posts: 1744
Show Posts: 1162
Other Posts: 17194
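To see why the titles are lower-cased before the prefix test, here is a small self-contained check. The function `classify_title` and the sample titles are hypothetical, introduced only for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on a case-insensitive prefix match."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles covering different capitalizations
titles = [
    'Ask HN: Best laptop?',
    'ASK HN: any tips?',
    'show hn: my project',
    'I want to ask HN something',
]
labels = [classify_title(title) for title in titles]
print(labels)  # ['ask', 'ask', 'show', 'other']
```

Only the start of the title is tested, so a title that merely mentions "ask HN" further along is counted as an ordinary post.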

Let's check that we have collected the right posts in each list.

In [4]:

ask_posts [:3]

Out[4]:

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14']]

In [5]:

show_posts [:3]

Out[5]:

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05']]

Calculating the Average Number of Comments for Ask HN and Show HN Posts

Which posts receive more comments? Let's compare the averages for these two types of posts. In order to do this, we will:

Determine the total number of comments for each post type
Divide by the number of posts in each 'category': ask or show
Compare the two figures

In [6]:

total_ask_comments = 0

for post in ask_posts:
    comments = int(post[4])  # num_comments is stored as a string
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('Average comments per "Ask" post:', avg_ask_comments)

total_show_comments = 0

for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)
print('Average comments per "Show" post:', avg_show_comments)

Average comments per "Ask" post: 14.038417431192661
Average comments per "Show" post: 10.31669535283993

We can observe that Ask posts receive more comments on average (about 14) than Show posts (about 10.3).

Since Ask posts receive more comments on average, we'll focus our remaining analysis just on these posts.
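The two loops in the cell above follow the same pattern, so the calculation could also be wrapped in a small helper. This is only a sketch: `average_comments` is a hypothetical name, the list is assumed to be non-empty, and the toy rows are the first three Ask posts displayed earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments across posts; counts are stored as strings."""
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)  # raises ZeroDivisionError for an empty list

# First three Ask posts from the output shown earlier
toy_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
]

print(average_comments(toy_ask_posts))  # 12.0
```

With such a helper, the comparison reduces to calling it once on ask_posts and once on show_posts.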

Finding the Number of Ask Posts and Comments by Hour Created

Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Extract time and number of comments per post
- Clean created_at column to extract time (in 24h format) of the day
- Create frequency table for posts created in each hour of the day
- Create frequency table for comments created in each hour of the day - based on the comments of each post in that hour
Calculate the average number of comments ask posts receive by hour created.
- Divide the comment count frequency value by the posts created frequency value based on their corresponding dictionary key value

In [7]:

import datetime as dt
results_list = []

for post in ask_posts:
    created_at = post [6]
    comments = int (post[4])
    results_list.append ([created_at, comments])

results_list [:5]

Out[7]:

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [8]:

counts_by_hour = {}
comments_by_hour = {}
date_format = '%m/%d/%Y %H:%M'  # format string for created_at, passed as the second argument to strptime

for row in results_list:
    date = row[0]
    comments = row[1]  # already converted to int in the previous cell
    hour = dt.datetime.strptime(date, date_format).strftime('%H')  # keep only the hour (24h format)
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

#presenting the comments by hour in order of the hour of the day
sorted_comments = sorted(comments_by_hour.items(), key=lambda x:x[0]) #we are sorting by key, not value, hence the '0'
for item in sorted_comments:
    print (item)

('00', 447)
('01', 683)
('02', 1381)
('03', 421)
('04', 337)
('05', 464)
('06', 397)
('07', 267)
('08', 492)
('09', 251)
('10', 793)
('11', 641)
('12', 687)
('13', 1253)
('14', 1416)
('15', 4477)
('16', 1814)
('17', 1146)
('18', 1439)
('19', 1188)
('20', 1722)
('21', 1745)
('22', 479)
('23', 543)
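The same two frequency tables can be built without the explicit "key not seen yet" branch. The following is a minimal sketch using collections.defaultdict on the first three [created_at, comments] pairs shown earlier; the names `DATE_FORMAT`, `rows`, `post_counts` and `comment_totals` are chosen here for illustration and are not part of the notebook.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = '%m/%d/%Y %H:%M'

# First three [created_at, comments] pairs from the earlier output
rows = [['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1]]

post_counts = defaultdict(int)     # number of posts per hour
comment_totals = defaultdict(int)  # number of comments per hour

for created_at, n_comments in rows:
    # Parse the timestamp, then keep only the zero-padded hour as the key
    hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime('%H')
    post_counts[hour] += 1
    comment_totals[hour] += n_comments

print(dict(post_counts))     # {'09': 1, '13': 1, '10': 1}
print(dict(comment_totals))  # {'09': 6, '13': 29, '10': 1}
```

A defaultdict(int) starts every missing key at 0, which is why a single += line covers both the first and later occurrences of an hour.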

Now let's find the average number of comments per post for each hour of the day.

In [9]:

avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append ([hour, comments_by_hour[hour]/counts_by_hour[hour]])

avg_by_hour

Out[9]:

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

Let's print it in a more eye-friendly way to see which hours of the day have the most comments per post for the Ask posts.

In [10]:

sorted_list = sorted(avg_by_hour)  # each entry is [hour, average], so this orders by hour
sorted_list

Out[10]:

[['00', 8.127272727272727],
 ['01', 11.383333333333333],
 ['02', 23.810344827586206],
 ['03', 7.796296296296297],
 ['04', 7.170212765957447],
 ['05', 10.08695652173913],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['08', 10.25],
 ['09', 5.5777777777777775],
 ['10', 13.440677966101696],
 ['11', 11.051724137931034],
 ['12', 9.41095890410959],
 ['13', 14.741176470588234],
 ['14', 13.233644859813085],
 ['15', 38.5948275862069],
 ['16', 16.796296296296298],
 ['17', 11.46],
 ['18', 13.20183486238532],
 ['19', 10.8],
 ['20', 21.525],
 ['21', 16.009174311926607],
 ['22', 6.746478873239437],
 ['23', 7.985294117647059]]

This list gives us a clear picture of how the average number of comments per Ask post varies throughout the day, but it does not show at a glance which hours are best for receiving comments.

Let's print a list in order of highest average comments per post to find out.

In [11]:

swap_avg_by_hour = []
for hour in avg_by_hour:
    swap_avg_by_hour.append ([hour[1], hour[0]])
swap_avg_by_hour

Out[11]:

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

In [12]:

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print('Top 5 Hours for Ask Posts Comments')

for avg, hr in sorted_swap[:5]:
    hour_label = dt.datetime.strptime(hr, '%H').strftime('%H:%M')
    print('{}: {:.2f} average comments per post'.format(hour_label, avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
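Swapping the columns is one way to sort by the average. An alternative is to keep the [hour, average] pairs as they are and pass a key function to sorted. The sketch below uses a few rounded values from the averages above; `avg_by_hour_sample` and `top_hours` are names introduced here for illustration.

```python
# A few [hour, average] pairs taken from the averages above (rounded)
avg_by_hour_sample = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element of each pair (the average), highest first
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

This avoids building a second list, and the original pairs stay in [hour, average] order for any later step.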

We can see that the top hour is 15:00 (3 pm). The top hours are fairly spread out, though two pairs of consecutive hours, 15:00 and 16:00, and 20:00 and 21:00, appear in the top 5, which suggests increased activity at those times.

One could argue that the first interval is when students finish college or school. The second interval falls, on average, after dinner, which is when professionals would most likely be on the platform. This second time slot could also correspond to people posting after school or work activities.

There is also a surge at 02:00 (2 am), which could be related to some members of the community posting at unusual hours, or to morning activity in Europe.

As specified in the dataset documentation:

created_at: the date and time the post was made (the time zone is Eastern Time in the US)
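Because the timestamps are in US Eastern Time, it can help to translate an hour into other time zones before interpreting it. The following is a sketch using the standard-library zoneinfo module (Python 3.9+); it assumes the IANA time zone database is available on the system, and the date and the two European zones are arbitrary choices made for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library from Python 3.9

eastern = ZoneInfo('America/New_York')

# 02:00 Eastern on an arbitrary summer date within the dataset's period
post_time = datetime(2016, 8, 16, 2, 0, tzinfo=eastern)

for zone in ('Europe/London', 'Europe/Berlin'):
    local = post_time.astimezone(ZoneInfo(zone))
    print(zone, local.strftime('%H:%M'))
# Europe/London 07:00
# Europe/Berlin 08:00
```

On that date, 02:00 ET falls at the start of the working day in Western and Central Europe, which is consistent with the European-morning explanation suggested above, though it does not confirm it.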

Conclusion

In conclusion, if you want high engagement on your question posts, you should post between 3pm and 5pm ET and abstain from posting before the afternoon (US Eastern Time). This is supported by the generally higher average comments per post from 2pm ET onward in the hour-sorted output above (Out[10]). The average drops off from 10pm ET, with another surge around 2am ET.

With additional data such as the geography of users, it might be possible to see whether these hourly trends are driven by different regions. Since Hacker News is a predominantly English-language website, I will assume for the time being that these variations are due to user activity in North America and Europe, where English is more widespread as a language for news and blogging.

Furthermore, these conclusions are drawn from a sample of the dataset; with more data at our disposal, the results may vary.