
Project Title: Hacker News Post

Hacker News (news.ycombinator.com) is a social news site run by the startup accelerator Y Combinator. Its readers include IT professionals, researchers, technologists, and enthusiasts.

On Hacker News, users submit posts (links, questions, and projects), and other users vote and comment on them. The data set used in this project comes from this site, not from the similarly named cybersecurity outlet The Hacker News.

Project File & Coding

The original data set can be found at https://www.kaggle.com/hacker-news/hacker-news-posts.

For this project, the original data set was reduced to 20,100 rows (plus a header row) by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
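The exact procedure used to reduce the original file is not documented here. As a rough sketch only, a reduction of this kind could look like the following. The function name, `sample_size`, and `seed` are hypothetical, and the rows are made up in the same seven-column layout as the data set.

```python
import random

def reduce_posts(rows, sample_size, seed=0):
    """Keep rows with at least one comment, then draw a random sample.

    `rows` are data rows without the header; column 4 is num_comments.
    `sample_size` and `seed` are hypothetical parameters for illustration.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return rng.sample(commented, min(sample_size, len(commented)))

# Tiny made-up rows in the same 7-column layout as the data set.
rows = [
    ['1', 'A', '', '5', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'B', '', '3', '2', 'u2', '1/1/2016 11:00'],
    ['3', 'C', '', '9', '7', 'u3', '1/1/2016 12:00'],
    ['4', 'D', '', '1', '1', 'u4', '1/1/2016 13:00'],
]
sample = reduce_posts(rows, sample_size=2)
print(len(sample))
```

Because posts without comments are dropped before sampling, every average computed later describes only posts that received at least one comment.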

The reduced data set used for this project is saved as 'hacker.csv'.

The code uses basic Python for working with lists of lists.

Project Task:

Compare two different types of posts:

Posts starting with 'Ask HN', in which a user asks the Hacker News community a question.
Posts starting with 'Show HN', in which a user shows the community a project or other work they have made.

Project Purpose:

Part 1

Total number of posts starting with 'Ask HN' or 'Show HN'.
Which type of post ('Ask HN' or 'Show HN') received more comments on average.
Average number of comments received by each type of post ('Ask HN' and 'Show HN').

Part 2

In each hour, for 'Ask HN', the total number of posts and comments.
In each hour, for 'Ask HN', the average number of comments received.
For 'Ask HN', the hours that received the highest average number of comments.

In [1]:

'''This class groups helper functions that:
(1) separate the header.
(2) find out the number of rows and columns.
(3) print the first few rows.
(4) check for missing values and duplicate rows.'''
 
class Data:
#-------------------------------------------------------------------#
    @staticmethod
    def header(dataset):
        # The first row of the raw data set holds the column names.
        return dataset[0]
#-------------------------------------------------------------------#
    @staticmethod
    def data_without_header(dataset):
        return dataset[1:]
#-------------------------------------------------------------------#
    @staticmethod
    def explore_data(dataset):
        dataset_slice = dataset[0:5]
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))
        print('\n')
        print("First 5 rows:")
        print('\n')
        for row in dataset_slice:
            print(row)
            print('\n')
#-------------------------------------------------------------------#
    @staticmethod
    def missing_value(dataset):
        len_row = 0
        # The first row sets the expected number of columns.
        expected = len(dataset[0])
        # enumerate gives the true index even when rows repeat.
        for index, row in enumerate(dataset):
            if len(row) != expected:
                len_row += 1
                print(row)
                print("Row Index Number:", index)
        print("Number of rows with missing value:", len_row)
#-------------------------------------------------------------------#
    @staticmethod
    def duplicate_row(dataset, integer):
        duplicate_entry = []
        unique_entry = set()  # a set makes the membership test fast
        for row in dataset:
            value = row[integer]
            if value in unique_entry:
                duplicate_entry.append(value)
            else:
                unique_entry.add(value)
        print("Rows with duplicate Entries:{num}".format(num=len(duplicate_entry)))

In [2]:

from csv import reader

# Read the whole file into a list of lists; the with block closes the file.
with open('hacker.csv') as file:
    hacker = list(reader(file))

### header removed from the rest of data set ###
hacker_header = Data.header(hacker)
hacker = Data.data_without_header(hacker)

Exploring the Hacker News data

In [3]:

'''This calls functions in the Data class to explore:
(1) header.
(2) number of rows and columns.
(3) first 5 rows.
(4) missing values and duplicate entries. '''

print("Header:")
print(hacker_header)
print('\n')
Data.explore_data(hacker)
Data.missing_value(hacker)
Data.duplicate_row(hacker, 0)

Header:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


Number of rows: 20100
Number of columns: 7


First 5 rows:


['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']


['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']


['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']


['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']


['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']


Number of rows with missing value: 0
Rows with duplicate Entries:0
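The notebook addresses columns by position (for example row[1] for the title). As an optional alternative, and only as a sketch, the standard library's csv.DictReader reads each row as a dictionary keyed by the header names printed above. The CSV text below is rebuilt from the first row of the output rather than read from 'hacker.csv'.

```python
import csv
import io

# Header and first data row copied from the exploration output, rebuilt as CSV text.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# Each record is a dict, so columns are looked up by name instead of index.
records = list(csv.DictReader(sample_csv))
first = records[0]
print(first['title'], int(first['num_comments']))
```

Values are still strings, so numeric columns such as num_comments need int() just as they do with the list-of-lists approach.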

Part 1

1. Total number of posts starting with 'Ask HN' or 'Show HN'.

In [4]:

'''Separating all posts into 3 separate lists:
1. ask_posts []: to take in posts starting with 'Ask HN'
2. show_posts []: to take in posts starting with 'Show HN'
3. other_posts []: to take in all other posts'''

### creating empty lists ###
ask_posts = []
show_posts = []
other_posts = []

'''iterate over each row in the Hacker News data, pick out the rows for
'Ask HN' and 'Show HN', and put them in the matching empty list above. '''

for row in hacker:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('Posts beginning with "Ask HN":')
print(len(ask_posts), "posts")
print('\n')
print('Posts beginning with "Show HN":')
print(len(show_posts), "posts")
print('\n')
print('Posts not beginning with "Ask HN" and "Show HN":')
print(len(other_posts), "posts")

Posts beginning with "Ask HN":
1744 posts


Posts beginning with "Show HN":
1162 posts


Posts not beginning with "Ask HN" and "Show HN":
17194 posts
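The same prefix test can be wrapped in a small function and combined with collections.Counter when only the counts are needed. This is a sketch with a hypothetical helper name (post_type) and made-up titles, not a replacement for the lists built above, which later cells still rely on.

```python
from collections import Counter

def post_type(title):
    """Return 'ask', 'show', or 'other' from the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration only.
titles = [
    'Ask HN: How do you learn a new language?',
    'Show HN: My weekend project',
    'ask hn: lower-case prefix still matches',
    'Interactive Dynamic Video',
]
counts = Counter(post_type(t) for t in titles)
print(counts)
```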

Part 1

2. Which type of posts ('Ask HN' or 'Show HN') received more comments on average?

In [5]:

'''Creating 3 variables for:
- total_ask_comments, total_show_comments and total_other_comments and,
- assigning each of them the value '0'.'''

total_ask_comments = 0
total_show_comments = 0
total_other_comments = 0

''' iterating over each row in ask_posts to lift and add the 
number of comments to total_ask_comments'''

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments 
    
''' iterating over each row in show_posts to lift and add the 
number of comments to total_show_comments'''

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments 

''' iterating over each row in other_posts to lift and add the 
number of comments to total_other_comments'''

for row in other_posts:
    comments = int(row[4])
    total_other_comments += comments 
    
    
print('Posts beginning with "Ask HN":')
print(len(ask_posts), "posts")
print(total_ask_comments, "comments")
print('\n')

print('Posts beginning with "Show HN":')
print(len(show_posts), "posts")
print(total_show_comments, "comments")
print('\n')

print('Posts NOT beginning with "Ask HN" and "Show HN":')
print(len(other_posts), "posts")
print(total_other_comments, "comments")
print('\n')

Posts beginning with "Ask HN":
1744 posts
24483 comments


Posts beginning with "Show HN":
1162 posts
11988 comments


Posts NOT beginning with "Ask HN" and "Show HN":
17194 posts
462055 comments

Part 1

3. Average number of comments received by each type of posts ('Ask HN' and 'Show HN').

In [6]:

### Calculating the average number of comments received by each post ###

avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)
avg_other_comments = total_other_comments / len(other_posts)

print('Posts beginning with "Ask HN":')
print(len(ask_posts), "posts")
print(total_ask_comments, "comments")
print("Average comments:", avg_ask_comments)
print('\n')

print('Posts beginning with "Show HN":')
print(len(show_posts), "posts")
print(total_show_comments, "comments")
print("Average comments:", avg_show_comments)
print('\n')

print('Posts NOT beginning with "Ask HN" and "Show HN":')
print(len(other_posts), "posts")
print(total_other_comments, "comments")
print("Average comments:", avg_other_comments)
print('\n')

Posts beginning with "Ask HN":
1744 posts
24483 comments
Average comments: 14.038417431192661


Posts beginning with "Show HN":
1162 posts
11988 comments
Average comments: 10.31669535283993


Posts NOT beginning with "Ask HN" and "Show HN":
17194 posts
462055 comments
Average comments: 26.8730371059672

Part 1 - Observation

The above analysis shows that posts beginning with 'Ask HN' received 12,495 more comments in total than posts beginning with 'Show HN'.

Based on the posts included in 'hacker.csv', an 'Ask HN' post received on average about 3.7 more comments than a 'Show HN' post. Posts that did not receive any comments do not affect these averages because they were not included in 'hacker.csv'.
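The two differences quoted in this observation follow directly from the counts printed in Part 1. A short check, with the totals copied from that output (the variable names here are illustrative only):

```python
# Totals copied from the Part 1 output.
ask_post_count, ask_comment_total = 1744, 24483
show_post_count, show_comment_total = 1162, 11988

# Difference in total comments between the two post types.
comment_gap = ask_comment_total - show_comment_total

# Difference between the two per-post averages.
avg_gap = ask_comment_total / ask_post_count - show_comment_total / show_post_count

print(comment_gap)
print(round(avg_gap, 1))
```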

Part 2

4. In each hour, for 'Ask HN', the total number of posts and comments.

In [7]:

''' From the 'Ask HN' list above, extracting the creation time and 
the number of comments and appending them to a new empty list (result_list). '''

result_list = [] 

for row in ask_posts:
    created_at = row[-1] ### creation time is in the last column ###
    comments = int(row[-3]) ### num_comments, the same column as row[4] ###
    result_list.append([created_at, comments])

'''Creating 2 empty dictionaries to take in:
1. each hour as the key and total number of posts as the value (hour_post).  
2. each hour as the key and total number of comments as the value (hour_comment).'''

hour_post = {}
hour_comment = {}

'''importing the datetime module and creating a date_format variable which
is consistent with the format used in the hacker news data'''

import datetime as dt
date_format = "%m/%d/%Y %H:%M"

''' iterating over each row in the result_list '''
for row in result_list:
    comment = row[1]
    hour_dt = row[0] 
    
    # parse the datetime string and extract only the hour (%H)
    hour_dt = dt.datetime.strptime(hour_dt, date_format)
    hour_dt = hour_dt.strftime("%H")

    # In the hour_post and hour_comment dictionaries, use %H as the key
    # and add up the number of posts and comments as the value

    if hour_dt not in hour_post:
        hour_post[hour_dt] = 1
        hour_comment[hour_dt] = comment
    else:
        hour_post[hour_dt] +=1
        hour_comment[hour_dt] += comment

print("Hour : Number of Posts")
print('\n')
print(hour_post)
print('\n')
print("Hour : Number of Comments")
print('\n')
print(hour_comment)

Hour : Number of Posts


{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}


Hour : Number of Comments


{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
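The if/else bookkeeping for the two dictionaries can also be written with collections.defaultdict, which starts every missing key at 0. The sketch below assumes the same [created_at, comments] layout as result_list and the same date format; totals_by_hour and the sample rows are hypothetical.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def totals_by_hour(pairs):
    """Sum posts and comments per creation hour.

    `pairs` is a list of [created_at, comments] items, like result_list.
    """
    posts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        posts[hour] += 1
        comments[hour] += n_comments
    return dict(posts), dict(comments)

# Made-up sample in the same shape as result_list.
sample = [['8/4/2016 11:52', 52], ['1/26/2016 19:30', 10], ['6/23/2016 11:05', 1]]
posts, comments = totals_by_hour(sample)
print(posts)
print(comments)
```

Note that the hour always comes from the post's creation time, so the comment totals describe posts created in that hour, not comments written in that hour.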

Part 2

5. In each hour, for 'Ask HN', the average number of comments received.

In [8]:

''' To an empty list (hr_post_comment), transferring  
each hour, number of posts and number of comments 
from the hour_post and hour_comment dictionaries'''

hr_post_comment = []
for hour, post in hour_post.items():
    hr_post_comment.append([hour, post, hour_comment[hour]])

print("Hour, Posts, Comments")
hr_post_comment.sort()
for row in hr_post_comment:
    print(row)

### calculating the average number of comments received in each hour. ###


hr_avg_comment = []
for row in hr_post_comment:
    post = row[1]
    comment = row[2]
    avg = comment / post
    hr_avg_comment.append([row[0], avg])
    
print('\n')
hr_avg_comment.sort()
print("Hour, Avg Comments")
for row in hr_avg_comment:
    print(row)

Hour, Posts, Comments
['00', 55, 447]
['01', 60, 683]
['02', 58, 1381]
['03', 54, 421]
['04', 47, 337]
['05', 46, 464]
['06', 44, 397]
['07', 34, 267]
['08', 48, 492]
['09', 45, 251]
['10', 59, 793]
['11', 58, 641]
['12', 73, 687]
['13', 85, 1253]
['14', 107, 1416]
['15', 116, 4477]
['16', 108, 1814]
['17', 100, 1146]
['18', 109, 1439]
['19', 110, 1188]
['20', 80, 1722]
['21', 109, 1745]
['22', 71, 479]
['23', 68, 543]


Hour, Avg Comments
['00', 8.127272727272727]
['01', 11.383333333333333]
['02', 23.810344827586206]
['03', 7.796296296296297]
['04', 7.170212765957447]
['05', 10.08695652173913]
['06', 9.022727272727273]
['07', 7.852941176470588]
['08', 10.25]
['09', 5.5777777777777775]
['10', 13.440677966101696]
['11', 11.051724137931034]
['12', 9.41095890410959]
['13', 14.741176470588234]
['14', 13.233644859813085]
['15', 38.5948275862069]
['16', 16.796296296296298]
['17', 11.46]
['18', 13.20183486238532]
['19', 10.8]
['20', 21.525]
['21', 16.009174311926607]
['22', 6.746478873239437]
['23', 7.985294117647059]

Part 2

6. For 'Ask HN', the hours that received the highest average number of comments.

In [9]:

''' To an empty list, transfer hour and average number 
of comments with their positions swapped. '''

swap_avg_by_hr = []
for row in hr_avg_comment:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hr.append([avg,  hour])
swap_avg_by_hr.sort(reverse=True)

'''
Converting hour to '%H:%M' and using format() to display the hours with the highest average number of comments.
'''

print("Top 5 Hours for Ask Posts Comments")
for avg, hr in swap_avg_by_hr[:5]:
    hr = dt.datetime.strptime(hr, "%H")
    hr = hr.strftime("%H:%M")
    template = '{}: {:.2f} average comments per post'
    print(template.format(hr, avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
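Swapping the columns before sorting is one way to rank by average. An alternative sketch uses sorted() with a key function, which leaves the [hour, average] pairs in their original order. The pairs below are a subset copied from the "Hour, Avg Comments" output; hr_avg_sample is an illustrative name.

```python
# A few [hour, average] pairs copied from the "Hour, Avg Comments" output.
hr_avg_sample = [
    ['02', 23.810344827586206],
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['16', 16.796296296296298],
    ['20', 21.525],
    ['21', 16.009174311926607],
]

# Sort by the second element (the average), largest first, and keep five.
top = sorted(hr_avg_sample, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```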

Conclusion

The above analysis shows that there are 582 more posts beginning with 'Ask HN' than with 'Show HN'. 'Ask HN' posts also received 12,495 more comments in total. Based on the posts included in 'hacker.csv', an 'Ask HN' post received on average about 3.7 more comments than a 'Show HN' post.

For 'Ask HN', more posts were created in the '15:00' hour than in any other hour, and those posts also received the most comments. Note that comments are grouped by the hour in which the post was created, not the hour in which each comment was written. Posts created in the '15:00' hour received 4,477 comments in total, which is 2,663 more than the 1,814 comments received by posts created in the '16:00' hour (the second highest). The gap in the number of posts is much smaller: '15:00' has 116 posts, only 6 more than '19:00' (the second highest, with 110). On both counts '15:00' is the most popular hour, and this is reflected in its having the highest average number of comments per post.

However, the same cannot be said of the '02:00' hour, which is second on the list with 23.81 average comments per post. The second highest total number of comments (1,814) belongs to the '16:00' hour, followed by:

- '21:00' with 1,745 comments,

- '20:00' with 1,722 comments,

- '18:00' with 1,439 comments,

- '14:00' with 1,416 comments,

- and only then the '02:00' hour with 1,381 comments.

Even in terms of the number of posts submitted, '02:00' ranks only 15th with 58 posts, which is 52 fewer than the '19:00' hour (the second highest).

The following hours therefore appear to be more popular than the '02:00' hour for both posts and comments:

- '16:00' with 108 posts and 1,814 comments

- '21:00' with 109 posts and 1,745 comments

- '20:00' with 80 posts and 1,722 comments

- '18:00' with 109 posts and 1,439 comments

- '14:00' with 107 posts and 1,416 comments

In comparison, the '02:00' hour, with 58 posts and 1,381 comments, takes the second spot because far fewer posts were created during this hour. The smaller number of posts gives '02:00' a higher average number of comments per post.

Therefore, among the top 5 hours by average comments per post, '15:00', '16:00', '20:00', and '21:00' seem reasonable if the purpose of this exercise is to find the hours in which users are most active in submitting posts and comments. On that measure, the '18:00' and '14:00' hours appear to be better candidates than the '02:00' hour.
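The rankings discussed in this conclusion can be checked against the "Hour, Posts, Comments" output. The sketch below copies those rows and ranks the '02:00' hour by posts, by total comments, and by average. The rank helper is hypothetical and treats ties as sharing the better rank.

```python
# [hour, posts, comments] rows copied from the "Hour, Posts, Comments" output.
hr_post_comment = [
    ['00', 55, 447], ['01', 60, 683], ['02', 58, 1381], ['03', 54, 421],
    ['04', 47, 337], ['05', 46, 464], ['06', 44, 397], ['07', 34, 267],
    ['08', 48, 492], ['09', 45, 251], ['10', 59, 793], ['11', 58, 641],
    ['12', 73, 687], ['13', 85, 1253], ['14', 107, 1416], ['15', 116, 4477],
    ['16', 108, 1814], ['17', 100, 1146], ['18', 109, 1439], ['19', 110, 1188],
    ['20', 80, 1722], ['21', 109, 1745], ['22', 71, 479], ['23', 68, 543],
]

def rank(hour, value_of):
    """1-based rank of `hour`: one plus the number of hours with a larger value."""
    target = next(value_of(row) for row in hr_post_comment if row[0] == hour)
    return 1 + sum(value_of(row) > target for row in hr_post_comment)

measures = [
    ('posts', lambda row: row[1]),
    ('comments', lambda row: row[2]),
    ('average', lambda row: row[2] / row[1]),
]
for label, value_of in measures:
    print(label, rank('02', value_of))
```

A large gap between an hour's rank by average and its rank by posts is a sign that the average rests on relatively few posts.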