HACKER NEWS POST COMPARISON PROJECT

The project aims to compare two types of posts (Ask and Show) posted on Hacker News in terms of comments received.

The dataset used in this project was downloaded from here. The number of rows has been reduced from almost 300k to nearly 20k by removing posts with no comments and then randomly sampling from the remaining ones.

For removing unwanted data from a dataset, please refer to my project here.

For more about Hacker News (HN), please visit here.

In [1]:

from csv import reader

# Read every row of the CSV into a list of lists; the with-block closes the file afterwards
with open("hacker_news.csv") as opened_file_hn:
    hn = list(reader(opened_file_hn))

print(hn[:3])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]

Column Headers	Description
id:	the unique identifier from Hacker News for the post
title:	the title of the post
url:	the URL that the post links to, if the post has a URL
num_points:	the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments:	the number of comments on the post
author:	the username of the person who submitted the post
created_at:	the date and time of the post's submission

To analyze the dataset without running into inconsistencies, we first need to remove the first row, as it contains only the column headers.

In [2]:

headers = hn[0]
print(headers)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

In [3]:

hn = hn[1:]
print(hn[:3])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']]
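
Because each row is a plain list, columns are addressed by position (index 4 is num_comments, index 6 is created_at). The sketch below shows one way to make that mapping explicit. The name `column_index` is a hypothetical helper that does not appear in the original notebook; the header list and the sample row are copied from the outputs above.

```python
# Header row as printed above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper (not in the original notebook): map each column name to its position.
column_index = {name: position for position, name in enumerate(headers)}

# Sample row taken from the output above.
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

print(column_index['num_comments'])                    # 4
print(int(sample_row[column_index['num_comments']]))   # 52
```

The rest of the notebook keeps using the literal indexes, so this lookup is only a readability aid.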

In line with the aim of the project, we first identify the posts whose titles begin with either Ask HN or Show HN.

In [4]:

ask_posts = [] # Posts whose titles start with "ask hn"
show_posts = [] # Posts whose titles start with "show hn"
other_posts = [] # Posts whose titles start with neither "ask hn" nor "show hn"

for row in hn:
    title = row[1].lower()
        
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
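
The classification rule in the loop above can be sketched as a small standalone function, which makes it easy to check against individual titles. `classify_title` is a hypothetical name introduced here for illustration; the titles in the usage lines are taken from the notebook's own outputs.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on how the title starts (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

print(classify_title("Ask HN: How to improve my personal website?"))  # ask
print(classify_title("Show HN: Something pointless I made"))          # show
print(classify_title("Interactive Dynamic Video"))                    # other
```

Lowercasing the title first means variants such as "ASK HN" or "ask hn" fall into the same group, which matches the behavior of the loop above.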

In [5]:

print(ask_posts[:3])
print("\n")
print(show_posts[:3])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']]


[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h8liu', '4/28/2016 18:05']]

Calculation of the average number of comments - Let's determine which type of post receives more comments on average.

In [6]:

total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])  # index 4 holds num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print("Average number of comments per post for Ask HN is " + str(round(avg_ask_comments, 2)))
print("\n")

total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])  # index 4 holds num_comments
avg_show_comments = total_show_comments / len(show_posts)
print("Average number of comments per post for Show HN is " + str(round(avg_show_comments, 2)))

Average number of comments per post for Ask HN is 14.04


Average number of comments per post for Show HN is 10.32

First finding - As the result above shows, posts that start with "Ask HN" received more comments on average (14.04) than those beginning with "Show HN" (10.32).
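
The two averaging loops follow the same pattern, so they could be folded into one function. This is only a sketch: `average_comments` and `sample_ask` are hypothetical names, and the three rows are the Ask HN samples printed earlier (6, 29, and 1 comments), not the full dataset.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column over a list of rows; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Three Ask HN rows copied from the sample output above.
sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1',
     'polskibus', '5/2/2016 10:14'],
]

print(round(average_comments(sample_ask), 2))  # 12.0
```

The empty-list guard avoids a ZeroDivisionError if a category happens to contain no posts.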

Consideration of the time element - Building on the finding above, let's look at when Ask HN posts were created to determine which hour of the day is most likely to attract comments.

In [7]:

import datetime as dt  # needed to parse the created_at timestamps

result_list = []  # list of [created_at, num_comments] pairs
for row in ask_posts:
    date_created = row[6]  # created_at is the seventh column (index 6)
    number_of_comment = int(row[4])  # num_comments is the fifth column (index 4)
    result_list.append([date_created, number_of_comment])


counts_by_hour = {}  # number of ask posts created during each hour of the day
comments_by_hour = {}  # total comments received by ask posts created in each hour
for row in result_list:
    date_created = row[0]
    dt_object = dt.datetime.strptime(date_created, "%m/%d/%Y %H:%M")  # parse the timestamp string
    dt_string = dt_object.strftime("%H")  # extract the hour as a two-digit string
    if dt_string not in counts_by_hour:
        counts_by_hour[dt_string] = 1
        comments_by_hour[dt_string] = row[1]
    else:
        counts_by_hour[dt_string] += 1
        comments_by_hour[dt_string] += row[1]
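
To see what the parsing step does to a single value, here is a minimal worked example using the created_at string from the first Ask HN sample row. The variable names `created_at`, `parsed`, and `hour_key` are illustrative only.

```python
import datetime as dt

created_at = "8/16/2016 9:55"  # created_at value from the first Ask HN sample row

# Parse month/day/year hour:minute into a datetime object.
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# Format only the hour, zero-padded to two digits; this is the dictionary key used above.
hour_key = parsed.strftime("%H")

print(parsed)    # 2016-08-16 09:55:00
print(hour_key)  # 09
```

The zero-padded string is why the dictionary keys appear as '09' rather than 9.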

Finding the hour with the most comments - After creating the two dictionaries above, we can calculate the average number of comments for posts created during each hour of the day.

In [8]:

avg_by_hour = []
# Both dictionaries share the same hour keys, so look each hour up directly
for hour in counts_by_hour:
    avg_by_hour.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour], 2)])
print(avg_by_hour)

[['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8], ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01], ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09], ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17], ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05]]

Uncluttering - Hooray! We have the result. But wait! Doesn't it look a little messy to you?

In [10]:

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]

In [42]:

sorted_swap = sorted(swap_avg_by_hour, reverse=True)

print("Top 5 Hours for Ask Posts Comments")
for avg, hour in sorted_swap[:5]:
    dt_object = dt.datetime.strptime(hour, "%H")
    dt_string = dt_object.strftime("%H:%M")
    print("{time}: {comm:.2f} average comments per post".format(time=dt_string, comm=avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
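
As an alternative to swapping the columns before sorting, the same ranking can be produced by passing a `key` function to `sorted`. This is a sketch of a different route to the same result, not the approach used above; `hourly_averages` and `ranked` are hypothetical names, and the values are copied from the avg_by_hour output above.

```python
# Hour/average pairs copied from the avg_by_hour output above.
hourly_averages = [
    ['09', 5.58], ['13', 14.74], ['10', 13.44], ['14', 13.23], ['16', 16.8],
    ['23', 7.99], ['12', 9.41], ['17', 11.46], ['15', 38.59], ['21', 16.01],
    ['20', 21.52], ['02', 23.81], ['18', 13.2], ['03', 7.8], ['05', 10.09],
    ['19', 10.8], ['01', 11.38], ['22', 6.75], ['08', 10.25], ['04', 7.17],
    ['00', 8.13], ['06', 9.02], ['07', 7.85], ['11', 11.05],
]

# Sort by the average (second element of each pair), largest first.
ranked = sorted(hourly_averages, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked[:5]:
    print("{hour}:00: {avg:.2f} average comments per post".format(hour=hour, avg=avg))
```

Sorting with a key leaves each pair in its original [hour, average] order, so no intermediate swapped list is needed.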