Hacker News (news.ycombinator.com) is a social news site run by the startup incubator Y Combinator, where users submit posts that other users vote and comment on. It should not be confused with The Hacker News, a separate cybersecurity news platform.
Posts on Hacker News cover technology, startups, and related topics; the two post types analysed here, 'Ask HN' and 'Show HN', are conventions of that site.
The original data set can be found at https://www.kaggle.com/hacker-news/hacker-news-posts.
For this project, the original data set was reduced to 20,100 rows (plus a header row) by removing all submissions that did not receive any comments and then randomly sampling from the remaining submissions.
The reduced data set used for this project is saved as 'hacker.csv'.
The code uses only basic Python for working with a list of lists.
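As a small illustration of the list-of-lists layout, the sketch below parses a tiny in-memory sample with `csv.reader`. The sample rows and the three-column layout are made up for illustration and are not taken from 'hacker.csv'.

```python
from csv import reader
from io import StringIO

# Hypothetical sample; the real file has seven columns.
sample_csv = StringIO(
    "id,title,num_comments\n"
    "1,Ask HN: Example question?,5\n"
    "2,Show HN: Example project,3\n"
)

rows = list(reader(sample_csv))    # every row becomes a list of strings
header, data = rows[0], rows[1:]   # first row holds the column names

print(header)       # column names
print(data[0][1])   # title of the first data row
```

Every value, including the comment count, is read as a string, which is why the analysis converts counts with `int()` before adding them up.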
Compare two different types of posts:
Posts starting with 'Ask HN', in which a user asks the Hacker News community a question.
Posts starting with 'Show HN', in which a user shows a project or other work they have made.
Total number of posts starting with 'Ask HN' or 'Show HN'.
Which type of post ('Ask HN' or 'Show HN') received more comments on average.
Average number of comments received by each type of post ('Ask HN' and 'Show HN').
For 'Ask HN' posts, the total number of posts and comments in each hour.
For 'Ask HN' posts, the average number of comments received in each hour.
For 'Ask HN' posts, the hours that received the highest average number of comments.
'''This class groups helper functions to:
(1) separate the header.
(2) find the number of rows and columns.
(3) print the first few rows.
(4) check for missing values and duplicate rows.'''
class Data:
    #-------------------------------------------------------------------#
    @staticmethod
    def header(dataset):
        # The first row of the raw data set holds the column names.
        return dataset[0]
    #-------------------------------------------------------------------#
    @staticmethod
    def data_without_header(dataset):
        # Everything after the first row is data.
        return dataset[1:]
    #-------------------------------------------------------------------#
    @staticmethod
    def explore_data(dataset):
        dataset_slice = dataset[0:5]
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))
        print('\n')
        print("First 5 rows:")
        print('\n')
        for row in dataset_slice:
            print(row)
        print('\n')
    #-------------------------------------------------------------------#
    @staticmethod
    def missing_value(dataset):
        # A row is treated as having a missing value when its length
        # differs from the length of the first row of the data set.
        len_row = 0
        expected_len = len(dataset[0])
        for index, row in enumerate(dataset):
            if len(row) != expected_len:
                len_row += 1
                print(row)
                print("Row Index Number:", index)
        print("Number of rows with missing value:", len_row)
    #-------------------------------------------------------------------#
    @staticmethod
    def duplicate_row(dataset, integer):
        # 'integer' is the index of the column checked for repeated values.
        duplicate_entry = []
        unique_entry = set()  # set membership checks are fast
        for row in dataset:
            value = row[integer]
            if value in unique_entry:
                duplicate_entry.append(value)
            else:
                unique_entry.add(value)
        print("Rows with duplicate Entries:{num}".format(num=len(duplicate_entry)))
from csv import reader

# Read the whole file into a list of lists; the with-block closes the file.
with open('hacker.csv', newline='') as file:
    hacker = list(reader(file))
### header removed from the rest of data set ###
hacker_header = Data.header(hacker)
hacker = Data.data_without_header(hacker)
'''This calls the functions in the Data class to explore:
(1) the header.
(2) the number of rows and columns.
(3) the first 5 rows.
(4) missing values and duplicate entries.'''
print("Header:")
print(hacker_header)
print('\n')
Data.explore_data(hacker)
Data.missing_value(hacker)
Data.duplicate_row(hacker, 0)
Header:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

Number of rows: 20100
Number of columns: 7

First 5 rows:

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']
['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']

Number of rows with missing value: 0
Rows with duplicate Entries:0
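The duplicate check above only reports a count. As an alternative sketch, `collections.Counter` can also report which values repeat. The function name `duplicate_ids` and the sample rows are hypothetical and not part of the original notebook.

```python
from collections import Counter

def duplicate_ids(rows, column=0):
    """Return the sorted values that occur more than once in one column."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

# Made-up rows: the id '101' appears twice.
sample_rows = [['101', 'first'], ['102', 'second'], ['101', 'third']]
print(duplicate_ids(sample_rows))
```

Returning the repeated values, rather than printing a count, makes it easier to inspect or remove the affected rows afterwards.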
'''Separating all posts into 3 separate lists:
1. ask_posts []: to take in posts starting with 'Ask HN'
2. show_posts []: to take in posts starting with 'Show HN'
3. other_posts []: to take in all other posts'''
### creating empty lists ###
ask_posts = []
show_posts = []
other_posts = []
'''Iterate over each row in the Hacker News data and append rows whose title starts
with 'Ask HN' or 'Show HN' to the matching list; all other rows go to other_posts.'''
for row in hacker:
    title = row[1].lower()  # lower-case so the match is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
print('Posts beginning with "Ask HN":')
print(len(ask_posts), "posts")
print('\n')
print('Posts beginning with "Show HN":')
print(len(show_posts), "posts")
print('\n')
print('Posts not beginning with "Ask HN" and "Show HN":')
print(len(other_posts), "posts")
Posts beginning with "Ask HN":
1744 posts

Posts beginning with "Show HN":
1162 posts

Posts not beginning with "Ask HN" and "Show HN":
17194 posts
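The prefix test used to split the posts can be isolated in a small function, which makes the rule easy to check on its own. This is a minimal sketch: the function name `classify_title` and the sample titles are assumptions for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix."""
    lowered = title.lower()  # case-insensitive comparison
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles covering each branch.
sample_titles = [
    'Ask HN: Example question?',
    'SHOW HN: Example project',
    'An article that mentions Ask HN later on',
]
for title in sample_titles:
    print(classify_title(title), '|', title)
```

Because `startswith` only looks at the beginning of the string, a title that mentions 'Ask HN' somewhere in the middle is still classified as 'other'.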
'''Creating 3 variables (total_ask_comments, total_show_comments and
total_other_comments) and initialising each of them to 0.'''
total_ask_comments = 0
total_show_comments = 0
total_other_comments = 0
'''Iterating over each row in ask_posts and adding its
number of comments to total_ask_comments.'''
for row in ask_posts:
    comments = int(row[4])  # num_comments is stored as a string
    total_ask_comments += comments
'''Iterating over each row in show_posts and adding its
number of comments to total_show_comments.'''
for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments
'''Iterating over each row in other_posts and adding its
number of comments to total_other_comments.'''
for row in other_posts:
    comments = int(row[4])
    total_other_comments += comments
print('Posts beginning with "Ask HN":')
print(len(ask_posts), "posts")
print(total_ask_comments, "comments")
print('\n')
print('Posts beginning with "Show HN":')
print(len(show_posts), "posts")
print(total_show_comments, "comments")
print('\n')
print('Posts NOT beginning with "Ask HN" and "Show HN":')
print(len(other_posts), "posts")
print(total_other_comments, "comments")
print('\n')
Posts beginning with "Ask HN":
1744 posts
24483 comments

Posts beginning with "Show HN":
1162 posts
11988 comments

Posts NOT beginning with "Ask HN" and "Show HN":
17194 posts
462055 comments
### Calculating the average number of comments received by each post ###
avg_ask_comments = total_ask_comments / len(ask_posts)
avg_show_comments = total_show_comments / len(show_posts)
avg_other_comments = total_other_comments / len(other_posts)
print('Posts beginning with "Ask HN":')
print(len(ask_posts), "posts")
print(total_ask_comments, "comments")
print("Average comments:", avg_ask_comments)
print('\n')
print('Posts beginning with "Show HN":')
print(len(show_posts), "posts")
print(total_show_comments, "comments")
print("Average comments:", avg_show_comments)
print('\n')
print('Posts NOT beginning with "Ask HN" and "Show HN":')
print(len(other_posts), "posts")
print(total_other_comments, "comments")
print("Average comments:", avg_other_comments)
print('\n')
Posts beginning with "Ask HN":
1744 posts
24483 comments
Average comments: 14.038417431192661

Posts beginning with "Show HN":
1162 posts
11988 comments
Average comments: 10.31669535283993

Posts NOT beginning with "Ask HN" and "Show HN":
17194 posts
462055 comments
Average comments: 26.8730371059672
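The three total-then-divide calculations follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; the function name `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical, with only the column position (index 4 for num_comments) taken from the header shown earlier.

```python
def average_comments(rows, comments_index=4):
    """Return the mean number of comments per row, or 0.0 for no rows."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(row[comments_index]) for row in rows)
    return total / len(rows)

# Made-up rows in the same seven-column layout as the data set.
sample_rows = [
    ['1', 'Ask HN: A?', '', '3', '6', 'user_a', '8/4/2016 11:52'],
    ['2', 'Ask HN: B?', '', '1', '2', 'user_b', '8/4/2016 12:10'],
]
print(average_comments(sample_rows))

# The reported 'Ask HN' average is the reported total divided by the count.
print(24483 / 1744)
```

The guard for an empty list matters if a category ever ends up with no posts, for example after further filtering.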
'''From the 'Ask HN' list above, extracting the creation time and the number
of comments and appending them to a new empty list (result_list).'''
result_list = []
for row in ask_posts:
    created_at = row[-1]  # creation time is stored in the last column
    comments = int(row[-3])  # num_comments is the third column from the end
    result_list.append([created_at, comments])
'''Creating 2 empty dictionaries to take in:
1. each hour as the key and total number of posts as the value (hour_post).
2. each hour as the key and total number of comments as the value (hour_comment).'''
hour_post = {}
hour_comment = {}
'''Importing the datetime module and creating a date_format variable which
is consistent with the format used in the Hacker News data.'''
import datetime as dt
date_format = "%m/%d/%Y %H:%M"
# Iterating over each row in result_list.
for row in result_list:
    comment = row[1]
    hour_dt = row[0]
    # Parse the timestamp and keep only the hour (%H).
    hour_dt = dt.datetime.strptime(hour_dt, date_format)
    hour_dt = hour_dt.strftime("%H")
    # Use the hour as the key in both dictionaries and accumulate
    # the number of posts and the number of comments as the values.
    if hour_dt not in hour_post:
        hour_post[hour_dt] = 1
        hour_comment[hour_dt] = comment
    else:
        hour_post[hour_dt] += 1
        hour_comment[hour_dt] += comment
print("Hour : Number of Posts")
print('\n')
print(hour_post)
print('\n')
print("Hour : Number of Comments")
print('\n')
print(hour_comment)
Hour : Number of Posts

{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}

Hour : Number of Comments

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
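The "is the key already present?" branch in the hourly tally can be avoided with `collections.defaultdict`, which starts every missing key at 0. The following is a sketch of that alternative rather than the notebook's own code; `tally_by_hour` and the sample pairs are assumptions, while the date format is the one used above.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"

def tally_by_hour(pairs):
    """Count posts and sum comments per hour from [created_at, comments] pairs."""
    posts = defaultdict(int)     # missing hours start at 0
    comments = defaultdict(int)
    for created_at, n_comments in pairs:
        hour = dt.datetime.strptime(created_at, DATE_FORMAT).strftime("%H")
        posts[hour] += 1
        comments[hour] += n_comments
    return dict(posts), dict(comments)

# Made-up pairs: two posts in hour 11 and one in hour 04.
sample_pairs = [["8/4/2016 11:52", 52], ["1/26/2016 11:05", 10], ["9/30/2015 4:12", 2]]
posts_per_hour, comments_per_hour = tally_by_hour(sample_pairs)
print(posts_per_hour)
print(comments_per_hour)
```

Formatting the hour with `strftime("%H")` zero-pads it, so a timestamp such as "4:12" is stored under the key '04', which keeps the keys sortable as strings.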
'''Transferring each hour, its number of posts and its number of comments
from the hour_post and hour_comment dictionaries to an empty list (hr_post_comment).'''
hr_post_comment = []
for hour, post in hour_post.items():
    hr_post_comment.append([hour, post, hour_comment[hour]])
print("Hour, Posts, Comments")
hr_post_comment.sort()  # sort by hour, the first element of each row
for row in hr_post_comment:
    print(row)
### calculating the average number of comments per post for each hour. ###
hr_avg_comment = []
for row in hr_post_comment:
    post = row[1]
    comment = row[2]
    avg = comment / post
    hr_avg_comment.append([row[0], avg])
print('\n')
hr_avg_comment.sort()
print("Hour, Avg Comments")
for row in hr_avg_comment:
    print(row)
Hour, Posts, Comments
['00', 55, 447]
['01', 60, 683]
['02', 58, 1381]
['03', 54, 421]
['04', 47, 337]
['05', 46, 464]
['06', 44, 397]
['07', 34, 267]
['08', 48, 492]
['09', 45, 251]
['10', 59, 793]
['11', 58, 641]
['12', 73, 687]
['13', 85, 1253]
['14', 107, 1416]
['15', 116, 4477]
['16', 108, 1814]
['17', 100, 1146]
['18', 109, 1439]
['19', 110, 1188]
['20', 80, 1722]
['21', 109, 1745]
['22', 71, 479]
['23', 68, 543]

Hour, Avg Comments
['00', 8.127272727272727]
['01', 11.383333333333333]
['02', 23.810344827586206]
['03', 7.796296296296297]
['04', 7.170212765957447]
['05', 10.08695652173913]
['06', 9.022727272727273]
['07', 7.852941176470588]
['08', 10.25]
['09', 5.5777777777777775]
['10', 13.440677966101696]
['11', 11.051724137931034]
['12', 9.41095890410959]
['13', 14.741176470588234]
['14', 13.233644859813085]
['15', 38.5948275862069]
['16', 16.796296296296298]
['17', 11.46]
['18', 13.20183486238532]
['19', 10.8]
['20', 21.525]
['21', 16.009174311926607]
['22', 6.746478873239437]
['23', 7.985294117647059]
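Each hourly average is simply that hour's comment total divided by its post count. As a compact sketch, the same division can be written as a dictionary comprehension; the two sample dictionaries below hold only the figures reported above for hours 15 and 02, and their names are hypothetical.

```python
# Post and comment totals for two hours, copied from the table above.
hour_post_sample = {'15': 116, '02': 58}
hour_comment_sample = {'15': 4477, '02': 1381}

# Average comments per post for every hour present in both dictionaries.
avg_by_hour = {
    hour: hour_comment_sample[hour] / hour_post_sample[hour]
    for hour in hour_post_sample
}

for hour in sorted(avg_by_hour):
    print(hour, round(avg_by_hour[hour], 2))
```

Hour 15 stands out because 4,477 comments are spread over only 116 posts, so a handful of heavily discussed posts may be driving that average.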
'''Transferring each hour and its average number of comments to an empty list
with their positions swapped, so that sorting orders the rows by average.'''
swap_avg_by_hr = []
for row in hr_avg_comment:
    hour = row[0]
    avg = row[1]
    swap_avg_by_hr.append([avg, hour])
swap_avg_by_hr.sort(reverse=True)  # highest average first
'''Converting each hour to '%H:%M' and using format() to display the hours
with the highest average number of comments.'''
print("Top 5 Hours for Ask Posts Comments")
template = '{}: {:.2f} average comments per post'
for avg, hr in swap_avg_by_hr[:5]:
    hr = dt.datetime.strptime(hr, "%H")
    hr = hr.strftime("%H:%M")
    print(template.format(hr, avg))
Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
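Swapping the columns is one way to sort by average; another is to pass a `key` function to `sorted`, which leaves the [hour, average] order intact. The sketch below assumes a shortened sample list, `hr_avg_sample`, whose averages are rounded from the table above.

```python
# [hour, average] pairs, rounded from the hourly averages reported above.
hr_avg_sample = [
    ['13', 14.7412], ['15', 38.5948], ['16', 16.7963],
    ['02', 23.8103], ['21', 16.0092], ['20', 21.525],
]

# Sort by the second element (the average), highest first, and keep five rows.
top5 = sorted(hr_avg_sample, key=lambda row: row[1], reverse=True)[:5]

for hour, avg in top5:
    print('{}:00: {:.2f} average comments per post'.format(hour, avg))
```

With a `key` function there is no need for a second list, and the original list stays in hour order for any later steps.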