Notebook

Exploring Hacker News

This project explores data taken from Hacker News (hacker_news.csv). The primary goals are to compare post types and to analyze comments based on the time the posts are created.

In [1]:

from csv import reader

# Read the whole CSV into a list of rows; the with block closes the file for us.
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
print(hn[0:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]

In [2]:

headers = hn[0]
hn = hn[1:]
print(headers)
print("\n")
print(hn[0:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

In [3]:

ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1].lower()  # compare case-insensitively
    # Append the whole row (not just the title) so later cells can use the other columns.
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts), "ask posts")
print(len(show_posts), "show posts")
print(len(other_posts), "other posts")

1744 ask posts
1162 show posts
17194 other posts

So far we have split off the header row and reassigned hn to the remaining data rows. We then created three empty lists and, by iterating over hn, appended each post to the list matching its type: titles starting with "ask hn", titles starting with "show hn", and everything else (compared case-insensitively). Next we will find out which post type receives more comments on average.
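The prefix check described above can also be sketched as a small function, which makes the rule easy to try on individual titles. The name classify_post and the "Show HN" sample title are hypothetical and not part of the original notebook.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


sample_titles = [
    "Ask HN: Am I the only one outraged by Twitter shutting down share counts?",
    "Show HN: A hypothetical side project",  # made-up title for illustration
    "Interactive Dynamic Video",
]
labels = [classify_post(t) for t in sample_titles]
print(labels)
```

Note that a title which merely mentions "Ask HN" somewhere in the middle still counts as "other", because only the start of the title is checked.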

In [4]:

print(ask_posts[1])

['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']

In [5]:

total_ask_comments=0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
    
avg_ask_comments = total_ask_comments/len(ask_posts)

print(avg_ask_comments, "avg ask hn comments")

14.038417431192661 avg ask hn comments

In [6]:

total_show_comments=0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
    
avg_show_comments = total_show_comments/len(show_posts)

print(avg_show_comments, "avg show hn comments")

10.31669535283993 avg show hn comments
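The two averaging cells above repeat the same loop, so one possible refactoring is a shared helper. This is only a sketch: average_comments and its comments_index parameter are names introduced here, and the sample rows are shortened copies of two rows from the first output.

```python
def average_comments(posts, comments_index=4):
    """Mean of the comment counts stored as strings in column comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Columns: id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', '', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', '', '3', '1', 'hswarna', '6/17/2016 0:01'],
]
print(average_comments(sample_posts))
```

With the real lists, average_comments(ask_posts) and average_comments(show_posts) would replace the two separate loops.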

Based on the data so far, Ask HN posts receive more comments on average (about 14.04) than Show HN posts (about 10.32), so we will concentrate on exploring the ask posts further. The next step is to determine whether ask posts created at a certain time attract more comments than those created at other times:

Calculate the number of ask posts created in each hour of the day, along with the number of comments they received
Calculate the average number of comments ask posts receive by hour
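Both steps depend on extracting the hour from the created_at column. As a minimal sketch, here is that parsing step on a single timestamp in the dataset's format (the value is copied from the first data row; the variable names are illustrative).

```python
import datetime as dt

sample = "8/4/2016 11:52"  # created_at value from the first data row

# %m/%d/%Y %H:%M matches month/day/year followed by a 24-hour time.
parsed = dt.datetime.strptime(sample, "%m/%d/%Y %H:%M")
hour = parsed.strftime("%H")  # zero-padded hour as a string
print(hour)
```

Keeping the hour as a zero-padded string means the dictionary keys sort in clock order when sorted as text.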

In [7]:

#date and number of ask hn comments

import datetime as dt
result_list=[]
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])
    
print(result_list[0:6])    

[['8/16/2016 9:55', 6], ['11/22/2015 13:43', 29], ['5/2/2016 10:14', 1], ['8/2/2016 14:20', 3], ['10/15/2015 16:38', 17], ['9/26/2015 23:23', 1]]

In [17]:

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_time = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hour = date_time.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:    
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print('num ask hn posts each hour', counts_by_hour)
print('\n')
print('num comments ask hn per hour', comments_by_hour)

num ask hn posts each hour {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


num comments ask hn per hour {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}

In [24]:

avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour,comments_by_hour[hour]/counts_by_hour[hour]])
    
print(avg_by_hour)
print('\n')
print(sorted(avg_by_hour))

[['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


[['00', 8.127272727272727], ['01', 11.383333333333333], ['02', 23.810344827586206], ['03', 7.796296296296297], ['04', 7.170212765957447], ['05', 10.08695652173913], ['06', 9.022727272727273], ['07', 7.852941176470588], ['08', 10.25], ['09', 5.5777777777777775], ['10', 13.440677966101696], ['11', 11.051724137931034], ['12', 9.41095890410959], ['13', 14.741176470588234], ['14', 13.233644859813085], ['15', 38.5948275862069], ['16', 16.796296296296298], ['17', 11.46], ['18', 13.20183486238532], ['19', 10.8], ['20', 21.525], ['21', 16.009174311926607], ['22', 6.746478873239437], ['23', 7.985294117647059]]

Above we see the average number of comments per ask post for each hour of the day (the second printout is the same list sorted in ascending order by hour).

In [34]:

swap_avg_hour = []
for hour in comments_by_hour:
    swap_avg_hour.append([comments_by_hour[hour]/counts_by_hour[hour], hour])
    
sorted_swap = sorted(swap_avg_hour, reverse = True)
print(sorted_swap)

[[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [13.20183486238532, '18'], [11.46, '17'], [11.383333333333333, '01'], [11.051724137931034, '11'], [10.8, '19'], [10.25, '08'], [10.08695652173913, '05'], [9.41095890410959, '12'], [9.022727272727273, '06'], [8.127272727272727, '00'], [7.985294117647059, '23'], [7.852941176470588, '07'], [7.796296296296297, '03'], [7.170212765957447, '04'], [6.746478873239437, '22'], [5.5777777777777775, '09']]

We rebuilt the list with the average first and the hour second, then sorted it with reverse=True so that it runs from the highest average to the lowest. The first five items are therefore the hours with the most comments per post, and those are the ones to concentrate on.
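An alternative to swapping the two columns is to pass a key function to sorted, which orders the original [hour, average] pairs by their second element. The sketch below uses four of the averages printed earlier; avg_by_hour_sample and ranked are names introduced here for illustration.

```python
avg_by_hour_sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (index 1), highest first, without reordering the columns.
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)
for hour, avg in ranked:
    print(hour, "{:.2f}".format(avg))
```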

In [67]:

print('Top 5 hours for ask hn posts comments')
for avg,h in sorted_swap[:5]:
    time = dt.datetime.strptime(h,"%H").strftime("%H:%M")
    print(time, "{:.2f} average comments per post".format(avg))

Top 5 hours for ask hn posts comments
15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post

According to the dataset description, the times listed are EST. Our current location is on MST, which is two hours behind, so we subtract 2 hours from each time: 13:00, 00:00, 18:00, 14:00, and 19:00. Based on this dataset, creating ask posts during these hours gives the best chance of a high average number of comments.
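The two-hour shift can be sketched with datetime.timedelta. This assumes a fixed offset of two hours between the two time zones, which is a simplification; the names top_hours_est, offset, and local_times are introduced here for illustration.

```python
import datetime as dt

top_hours_est = ["15", "02", "20", "16", "21"]  # top 5 hours from the output above
offset = dt.timedelta(hours=2)  # assumed fixed difference between EST and MST

local_times = []
for h in top_hours_est:
    est = dt.datetime.strptime(h, "%H")  # date part defaults to 1900-01-01
    local_times.append((est - offset).strftime("%H:%M"))

print(local_times)
```

Subtracting a timedelta from a full datetime handles wrap-around past midnight, which plain integer subtraction on the hour would not.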