
Exploring Hacker News Posts

We will explore posts submitted to Hacker News. From the data set, we want to look at posts whose titles begin with either "Ask HN" or "Show HN". We want to determine which of the two receives more comments on average, and whether posts created at certain times receive more comments. The data set was reduced by removing all submissions that did not have any comments and then randomly sampling from the remaining submissions. You can find the original data set here: https://www.kaggle.com/hacker-news/hacker-news-posts.

In [11]:

from csv import reader

# Read every row into a list; the with block closes the file afterwards.
with open('hacker_news.csv') as f:
    hn = list(reader(f))
print(*hn[:5], sep = '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

In [2]:

headers = hn[0]
hn = hn[1:]
print(headers)
print(*hn[:4],sep = '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

We separate the header row from the data so that we do not have to account for it when working with the list.

In [3]:

ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194

We counted the posts whose titles start with "Ask HN" or "Show HN": 1,744 Ask HN posts, 1,162 Show HN posts, and 17,194 other posts.
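The prefix check used above can be pulled out into a small function, which makes the rule easy to test on individual titles. This is only a sketch: the function name `classify_title` and its return labels are illustrative and do not appear in the original notebook. The sample titles are taken from the rows printed in this notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix.

    The comparison is case-insensitive, mirroring the .lower() call
    used in the loop above. The labels are hypothetical names chosen
    for this sketch.
    """
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Titles copied from rows shown in this notebook.
sample_titles = [
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Note that `startswith` only matches titles that begin with the prefix, so a title that merely mentions "Ask HN" later in the text falls into the other group.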

In [4]:

print(*ask_posts[:5], sep = '\n')
print('\n')
print(*show_posts[:5], sep = '\n')

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03']
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46']
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']
['12178806', 'Show HN: Webscope  Easy way for web developers to communicate with Clients', 'http://webscopeapp.com', '3', '3', 'fastbrick', '7/28/2016 7:11']
['10872799', 'Show HN: GeoScreenshot  Easily test Geo-IP based web pages', 'https://www.geoscreenshot.com/', '1', '9', 'kpsychwave', '1/9/2016 20:45']

A quick check of the first five rows of each list.

In [5]:

total_ask_comments = 0
for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

total_show_comments = 0
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
    
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)

14.038417431192661
10.31669535283993

We can see that "Ask HN" posts receive more comments on average than "Show HN" posts (about 14.04 versus 10.32). We will therefore focus on the "Ask HN" posts to see which times are best for posting an "Ask HN" question.
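The two averaging loops above follow the same pattern, so they could be expressed as one helper. The sketch below assumes the comment count sits at index 4, as in this data set; the name `average_comments` and the `comments_index` parameter are illustrative, not part of the original notebook. The sample rows are the first five Ask HN rows printed earlier.

```python
def average_comments(rows, comments_index=4):
    """Return the mean number of comments across rows.

    comments_index is the column holding the comment count
    (index 4 in this data set). An empty list returns 0.0 so the
    function does not divide by zero.
    """
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)


# First five Ask HN rows as printed earlier in this notebook.
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
    ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38'],
]

sample_average = average_comments(sample_ask_posts)
print(sample_average)
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` should give the same two figures as the loops above.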

In [6]:

import datetime as dt
result_list = []

for row in ask_posts:
    created_at = [row[6],int(row[4])]
    result_list.append(created_at)
    
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date = dt.datetime.strptime(row[0],'%m/%d/%Y %H:%M')
    hour = date.strftime('%H')
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

In [7]:

avg_by_hour = []

for row in counts_by_hour:
    avg_by_hour.append([row, comments_by_hour[row]/counts_by_hour[row]])
    
print(*avg_by_hour,sep = '\n')

['09', 5.5777777777777775]
['13', 14.741176470588234]
['10', 13.440677966101696]
['14', 13.233644859813085]
['16', 16.796296296296298]
['23', 7.985294117647059]
['12', 9.41095890410959]
['17', 11.46]
['15', 38.5948275862069]
['21', 16.009174311926607]
['20', 21.525]
['02', 23.810344827586206]
['18', 13.20183486238532]
['03', 7.796296296296297]
['05', 10.08695652173913]
['19', 10.8]
['01', 11.383333333333333]
['22', 6.746478873239437]
['08', 10.25]
['04', 7.170212765957447]
['00', 8.127272727272727]
['06', 9.022727272727273]
['07', 7.852941176470588]
['11', 11.051724137931034]
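The bucketing and averaging steps above can also be combined into one function. The sketch below uses `collections.defaultdict` in place of the explicit `if hour not in ...` check. That is a substitution for illustration, not the notebook's own approach, and the function name and its parameters are hypothetical. The sample rows are the first five Ask HN rows printed earlier; each falls in a different hour, so every average equals that post's own comment count.

```python
import datetime as dt
from collections import defaultdict


def average_comments_by_hour(rows, date_index=6, comments_index=4):
    """Return {hour: average comments} with hours as zero-padded strings.

    Assumes dates look like '8/16/2016 9:55', the format used in the
    created_at column of this data set.
    """
    counts = defaultdict(int)    # number of posts created in each hour
    comments = defaultdict(int)  # total comments on those posts
    for row in rows:
        created = dt.datetime.strptime(row[date_index], '%m/%d/%Y %H:%M')
        hour = created.strftime('%H')
        counts[hour] += 1
        comments[hour] += int(row[comments_index])
    return {hour: comments[hour] / counts[hour] for hour in counts}


# First five Ask HN rows as printed earlier in this notebook.
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'],
    ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38'],
]

sample_by_hour = average_comments_by_hour(sample_ask_posts)
print(sample_by_hour)
```

Returning the post counts alongside the averages would also show how many posts each hourly figure is based on, which matters when an hour contains only a few posts.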

In [10]:

swap_avg_by_hour = []

for row in avg_by_hour:
    swap = [row[1],row[0]]
    swap_avg_by_hour.append(swap)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print('Top 5 Hours for Ask Posts Comments \n')

for avg, hour in sorted_swap[:5]:
    # Convert the zero-padded hour string (e.g. '15') to 'HH:MM' for display.
    time = dt.datetime.strptime(hour, '%H').strftime('%H:%M')
    print(time + ': {0:.2f} average comments per post \n'.format(avg))

Top 5 Hours for Ask Posts Comments 

15:00: 38.59 average comments per post 

02:00: 23.81 average comments per post 

20:00: 21.52 average comments per post 

16:00: 16.80 average comments per post 

21:00: 16.01 average comments per post

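Based on the average number of comments per post, the best time to post an "Ask HN" question is 15:00 (3:00 pm) Eastern Time, which averages 38.59 comments per post. The next best hours are 02:00 (23.81) and 20:00 (21.52).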
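The swap-and-sort step used to rank the hours can also be written with the `key` argument of `sorted`, which avoids building the swapped list. This is a sketch of an alternative, not the notebook's own method. The pairs below are a subset of the hourly averages printed earlier, and the name `top_hours` is illustrative.

```python
# A subset of the [hour, average] pairs printed earlier in this notebook.
avg_by_hour = [
    ['09', 5.5777777777777775],
    ['13', 14.741176470588234],
    ['16', 16.796296296296298],
    ['15', 38.5948275862069],
    ['21', 16.009174311926607],
    ['20', 21.525],
    ['02', 23.810344827586206],
]

# Sort on the average (second element) in descending order, keep the top five.
top_hours = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_hours:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

Sorting with `key` leaves each pair in its original `[hour, average]` order, so the loop can unpack the values directly.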