Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of its listings can get hundreds of thousands of visitors as a result. We are mostly concerned with `Ask HN` posts, where the user asks the community a question, and `Show HN` posts, where the user showcases something they made or something they find interesting.
This project aims to:
- Identify which posts, between `Ask HN` and `Show HN`, are more likely to receive comments
- Analyze whether the time of day a post is created affects the number of comments it receives
- Pinpoint the peak hours when most users are posting and commenting
The following are example post titles for `Ask HN` and `Show HN`:
Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?
Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm
Sources:
- Hacker News
- Kaggle
- DataQuest
Before we begin our analysis, we are going to create a function that will open, read, and save the csv file in a variable named `hn`. We will also create a function that we can use for the initial exploration of the dataset. And lastly, we are going to import the modules that we are going to use in this project.
def dataset_csv(dataset, header = True):
    '''
    Opens and reads the csv file
    Parameter
    ---
    dataset: str
        A string in the format of 'example.csv'
    header: bool, default = True
        While True, the result will include the header row, otherwise it will return the data rows only
    Return
    ---
    A list of lists
    '''
    from csv import reader
    with open(dataset) as file:  # `with` closes the file automatically
        data = list(reader(file))
    if header:
        return data
    return data[1:]
def explore_data(dataset, start, end, rows_and_columns = True):
    '''
    Prints out an initial exploration of the data
    Parameter
    ---
    dataset: list
        A variable of the dataset
    start: int
        The starting index to be shown
    end: int
        The ending index to be shown
    rows_and_columns: bool, default = True
        While True, it will show the total number of rows and columns of the dataset
    Return
    ---
    A print statement of the above parameters
    '''
    print('Sample {0} no. rows'.format(len(dataset[start:end])))
    print('')
    for row in dataset[start:end]:
        print(row)
        print('')
    if rows_and_columns:
        print('Total number of columns: {0}\nTotal number of rows: {1}'.format(len(dataset[0]),len(dataset[1:])))
import datetime as dt # Since we are going to deal with the hours later on in our analysis, we are going to need the datetime module
from matplotlib import pyplot as plt # We are going to represent our data in a graph later on in our analysis to derive a meaning from the data
The function named `dataset_csv()` will be used to open the Hacker News dataset. We will keep the `header` argument as `True` since we want to see what our headers are. But we are going to create separate variables for the header and the rows.
hn = dataset_csv(r'C:\Users\Mico\OneDrive\Desktop\DATASETS\KAGGLE\HACKER NEWS POST\DATAQUEST\hacker_news.csv')
hn_header = hn[0] # this will extract the header from the hn dataset
hn_rows = hn[1:] # this will exclude the header from the hn dataset
To begin our data analysis, we will first do an initial data exploration in order to gain insight into the contents of our dataset. In this section we are going to use `explore_data()` to display an initial overview of the dataset.
explore_data(hn,0,5)
Sample 5 no. rows

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']
['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

Total number of columns: 7
Total number of rows: 20100
The data above shows the 7 attributes of each row, which are described in the table below.
Index | Columns | Description | Ideal Data Type |
---|---|---|---|
0 | id | The unique identifier from the hacker news post | str |
1 | title | Title of the post | str |
2 | url | Url of the item being linked to | str |
3 | num_points | The number of upvotes the post received | int |
4 | num_comments | The number of the comments the post received | int |
5 | author | The name of the account that made the post | str |
6 | created_at | The date and time the post was made (Timezone: Eastern Time in the US) | datetime |
Original dataset source: Kaggle - Hacker News Posts
Date created: 2016-09-27
Date updated: 2016-09-27
Source used in this analysis: DataQuest - Guided Project: Exploring Hacker News Posts
Note: The dataset from Kaggle contains more than 300,000 rows while the dataset in this analysis contains 20,101 rows (including the header). The dataset was reduced by removing submissions that did not receive any comments, and then randomly sampling from the remaining submissions.
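The reduction described above can be sketched as follows. This is a hedged illustration of the idea only, not the actual script used to prepare the dataset; the toy `rows` list and sample size are made up for demonstration.

```python
import random

# Illustrative rows in the dataset's column order; the comment count
# (index 4) is still a string at this stage.
rows = [
    ['id1', 'Post A', 'url1', '3', '0', 'user1', '1/1/2016 10:00'],
    ['id2', 'Post B', 'url2', '5', '4', 'user2', '1/2/2016 11:00'],
    ['id3', 'Post C', 'url3', '2', '7', 'user3', '1/3/2016 12:00'],
]

# Step 1: drop submissions without comments.
commented = [row for row in rows if int(row[4]) > 0]

# Step 2: randomly sample from what remains (20,100 rows in the real dataset).
random.seed(0)  # fixed seed so the illustration is reproducible
reduced = random.sample(commented, 2)
print(len(commented), len(reduced))
```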
As we've mentioned, we're only concerned with posts that have a title beginning with `Ask HN` or `Show HN`. In this section we are going to do the following:
- Verify that `Ask HN` and `Show HN` are in the title column, and check whether there are any missing values in this column.
- Separate the posts with `Ask HN` or `Show HN` at the beginning of the title.
In order to determine the usability of our dataset, we have to check whether there are missing values and identify how they will affect our analysis.
def missing_data(dataset,index):
    '''
    Counts the rows with a missing value ('')
    Parameter
    ---
    dataset: list
        A variable of the dataset
    index: int
        The index value of the column to be iterated
    Return
    ---
    Statement of the number of rows with missing value
    '''
    missing_column = list()
    for row in dataset:
        if len(row) != len(hn_header):  # This is to check if a row is missing a field entirely
            missing_column.append(row)
        elif row[index] == '':  # This is to check if a row has an empty value in the given column
            missing_column.append(row)
    return 'The number of rows with missing data is {0}.'.format(len(missing_column))
for columns in range(7):
    print('Column -',hn_header[columns],', Index - ',columns)
    print(missing_data(hn_rows,columns))
    print('')
Column - id , Index - 0
The number of rows with missing data is 0.

Column - title , Index - 1
The number of rows with missing data is 0.

Column - url , Index - 2
The number of rows with missing data is 2440.

Column - num_points , Index - 3
The number of rows with missing data is 0.

Column - num_comments , Index - 4
The number of rows with missing data is 0.

Column - author , Index - 5
The number of rows with missing data is 0.

Column - created_at , Index - 6
The number of rows with missing data is 0.
As we've investigated, the only column that contains a missing value ('') is `url`, which is not a concern for our analysis since it is only the URL the item links to.
Data usability is an integral part of our analysis. We have to make sure that the data types we are using comply with our data requirements. As shown in the Attributes Description table above, we have an ideal data type for each of our columns. Below, we are going to check the data type of each column and convert it whenever it doesn't meet our data requirements.
for element in range(7):
    print('Column -',hn_header[element],', Index - ',element)
    print(type(hn_rows[1][element]))
    print('')
Column - id , Index - 0
<class 'str'>

Column - title , Index - 1
<class 'str'>

Column - url , Index - 2
<class 'str'>

Column - num_points , Index - 3
<class 'str'>

Column - num_comments , Index - 4
<class 'str'>

Column - author , Index - 5
<class 'str'>

Column - created_at , Index - 6
<class 'str'>
As we can see, the following columns do not meet our data requirements:
- `num_points` should be an `integer`.
- `num_comments` should be an `integer`.
- `created_at` should be a `datetime`.
In order to comply, we have to convert these three columns to their corresponding data types.
def convert_data(dataset,index,type_of_data):
    '''
    Converts a column in place, either to int or to datetime
    '''
    import datetime as dt
    for rows in dataset:
        value = rows[index]
        if type_of_data == 'datetime':
            rows[index] = dt.datetime.strptime(value,'%m/%d/%Y %H:%M')
        elif type_of_data == int:
            rows[index] = int(value)
# To convert `num_points` and `num_comments` to int
convert_data(hn_rows,3,int)
convert_data(hn_rows,4,int)
# To convert `created_at` to datetime
convert_data(hn_rows,6,'datetime')
# To verify the changes that we've made, we will iterate over one of the rows in our dataset.
for element in range(7):
    print('Column -',hn_header[element],', Index - ',element)
    print(type(hn_rows[1][element]))
    print('')
Column - id , Index - 0
<class 'str'>

Column - title , Index - 1
<class 'str'>

Column - url , Index - 2
<class 'str'>

Column - num_points , Index - 3
<class 'int'>

Column - num_comments , Index - 4
<class 'int'>

Column - author , Index - 5
<class 'str'>

Column - created_at , Index - 6
<class 'datetime.datetime'>
Now that we comply with our data requirements, we can proceed to segregate our dataset. Remember that we are interested in `Ask HN` and `Show HN` posts, to analyze whether the date and time of posting affects the average number of comments.
In order to derive meaning from our dataset, we're going to create separate lists for `Ask HN` posts, `Show HN` posts, and other posts. Even though our concern is the posts that begin with `Ask HN` and `Show HN`, we will still create a separate list for the other posts, for comparison purposes further down in our analysis.
ask_post = list()
show_post = list()
other_post = list()
for row in hn_rows:
    title = row[1].lower()  # Some posts may use `ask HN` or `show HN` rather than `Ask HN` and `Show HN`, so we lowercase every title and match on `ask hn` and `show hn`
    if title.startswith('ask hn'):
        ask_post.append(row)
    elif title.startswith('show hn'):
        show_post.append(row)
    else:
        other_post.append(row)
# We'll print out the first 3 rows of each list
print('Posts that starts with Ask HN:')
explore_data(ask_post,0,3)
print('\n')
print('Posts that starts with Show HN:')
explore_data(show_post,0,3)
print('\n')
print('Other posts:')
explore_data(other_post,0,3)
Posts that starts with Ask HN:
Sample 3 no. rows

['12296411', 'Ask HN: How to improve my personal website?', '', 2, 6, 'ahmedbaracat', datetime.datetime(2016, 8, 16, 9, 55)]
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', 28, 29, 'tkfx', datetime.datetime(2015, 11, 22, 13, 43)]
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', 1, 1, 'polskibus', datetime.datetime(2016, 5, 2, 10, 14)]

Total number of columns: 7
Total number of rows: 1743

Posts that starts with Show HN:
Sample 3 no. rows

['10627194', 'Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', 26, 22, 'kfihihc', datetime.datetime(2015, 11, 25, 14, 3)]
['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', 747, 102, 'dhotson', datetime.datetime(2015, 11, 29, 22, 46)]
['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', 1, 1, 'h8liu', datetime.datetime(2016, 4, 28, 18, 5)]

Total number of columns: 7
Total number of rows: 1161

Other posts:
Sample 3 no. rows

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', 386, 52, 'ne0phyte', datetime.datetime(2016, 8, 4, 11, 52)]
['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', 39, 10, 'josep2', datetime.datetime(2016, 1, 26, 19, 30)]
['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', 2, 1, 'vezycash', datetime.datetime(2016, 6, 23, 22, 20)]

Total number of columns: 7
Total number of rows: 17193
After segregating the dataset, we can see that approximately 14% of the dataset consists of posts that start with either `Ask HN` or `Show HN`. Now that we have separate datasets, we can proceed with the data analysis.
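As a quick sanity check on the ~14% figure, we can recompute it from the row counts reported by `explore_data()` above (1,743 `Ask HN` rows and 1,161 `Show HN` rows out of 20,100):

```python
# Counts taken from the explore_data() output above.
ask_count, show_count, total_count = 1743, 1161, 20100
share = round((ask_count + show_count) / total_count * 100, 1)
print(share)  # 14.4
```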
Now that we have a clean set of data, we are going to analyze it to gain information that will drive our conclusions and recommendations. In this section we are going to compute the total and average number of comments for each of the three lists of posts.
def total_column(dataset, index):
    '''
    Sums the entire column, given that the data is int
    Parameter
    ---
    dataset: list
        A variable of the dataset
    index: int
        The index number of the column to be iterated
    Return
    ---
    A tuple of (total, average)
    '''
    total = 0
    for row in dataset:
        total += row[index]
    average = round((total / (len(dataset))),2)
    return total, average
ask_posts_comments, show_posts_comments, other_posts_comments = total_column(ask_post,4), total_column(show_post,4), total_column(other_post,4)
print('ask_post dataset:')
print('The total number of comments is {0} with an average of {1} comments'.format(ask_posts_comments[0],ask_posts_comments[1]))
print('\n')
print('show_post dataset:')
print('The total number of comments is {0} with an average of {1} comments'.format(show_posts_comments[0],show_posts_comments[1]))
print('\n')
print('other_post dataset:')
print('The total number of comments is {0} with an average of {1} comments'.format(other_posts_comments[0],other_posts_comments[1]))
ask_post dataset:
The total number of comments is 24483 with an average of 14.04 comments

show_post dataset:
The total number of comments is 11988 with an average of 10.32 comments

other_post dataset:
The total number of comments is 462055 with an average of 26.87 comments
We can observe that, on average, `Ask HN` posts have a higher number of comments compared to `Show HN` posts. And the other posts have a much larger number of comments still.
Since `Ask HN` posts are more likely to receive comments, with an average of 14.04 comments, we'll focus our analysis just on these posts. But a further study of `Show HN` and other posts, outside this project, is recommended to give more concrete support to our conclusions and recommendations.
We will start our analysis by looking at the first five rows of the `Ask HN` dataset that we segregated. At the same time, we are going to extract the number of comments and the date and time created, in order to show the frequency of posts and comments by the hour.
print('Ask HN dataset:')
print('')
explore_data(ask_post,0,5)
result_list = list()
for row in ask_post:
    result = row[6], row[4]
    result_list.append(result)
print('\n')
print('Extracted date and time created and number of comments by the hour')
print('')
explore_data(result_list,0,3,rows_and_columns = False)
Ask HN dataset:
Sample 5 no. rows

['12296411', 'Ask HN: How to improve my personal website?', '', 2, 6, 'ahmedbaracat', datetime.datetime(2016, 8, 16, 9, 55)]
['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', 28, 29, 'tkfx', datetime.datetime(2015, 11, 22, 13, 43)]
['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', 1, 1, 'polskibus', datetime.datetime(2016, 5, 2, 10, 14)]
['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', 1, 3, 'sph130', datetime.datetime(2016, 8, 2, 14, 20)]
['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', 28, 17, 'roykolak', datetime.datetime(2015, 10, 15, 16, 38)]

Total number of columns: 7
Total number of rows: 1743

Extracted date and time created and number of comments by the hour
Sample 3 no. rows

(datetime.datetime(2016, 8, 16, 9, 55), 6)
(datetime.datetime(2015, 11, 22, 13, 43), 29)
(datetime.datetime(2016, 5, 2, 10, 14), 1)
Now that we have a list of tuples containing the attributes `created_at` and `num_comments`, we are going to create frequency tables for both of these attributes. The frequency tables will take the hour of the day, in 24-hour format, as the key, and the posts created and comments made during that hour as the values. So we will have two frequency tables:
- `counts_by_hour`: number of posts created per hour
- `comments_by_hour`: number of comments made per hour
counts_by_hour = dict()
comments_by_hour = dict()
for row in result_list:
    hour = row[0].time().hour
    counts_by_hour[hour] = counts_by_hour.get(hour, 0) + 1
    comments_by_hour[hour] = comments_by_hour.get(hour, 0) + row[1]
print('Number of posts created by the hour of the day:')
print('Hour:Number of posts')
print(counts_by_hour)
print('\n')
print('Number of comments made by the hour of the day:')
print('Hour:Number of comments')
print(comments_by_hour)
Number of posts created by the hour of the day:
Hour:Number of posts
{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}

Number of comments made by the hour of the day:
Hour:Number of comments
{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
Getting an insight from these raw frequency tables is difficult. A much better way of representing these numbers is a line graph; it is easier to derive meaning from a graph than from the numbers alone. In the graphs below, we are going to show the relationship between the hour of the day and the number of posts created and comments made.
# Since we are using 24-hour time, we will create a list of the hour values. The purpose of this list is to label our x-axis, which is the `hour` in our dataset.
hours_24 = [hour for hour in range(24)]
x = sorted(counts_by_hour.items(), key=lambda x: x[0])
plt.plot(*zip(*x))
plt.xlabel("Hour")
plt.ylabel("Number of posts")
plt.title('Posts created by hour')
plt.xticks(hours_24[::2])
plt.show()
As we can see, there is a spike of posts being created from the afternoon up to the evening. We can see three peaks in our line graph, all of which are in the afternoon and evening.
y=sorted(comments_by_hour.items(), key=lambda x: x[0])
plt.plot(*zip(*y))
plt.xlabel("Hour")
plt.ylabel("Number of comments")
plt.title('Comments by hour')
plt.xticks(hours_24[::2])
plt.show()
We can observe from the line graph above that there is a surge of comments made at 15:00 (3:00 PM), and far fewer throughout the rest of the day.
Now that we have an insight into posts created and comments made by the hour, we are going to look at the average number of comments per post by the hour. Below, we are going to create a list of the average comments per post by the hour. Since we have already sorted our data above, we can directly iterate over our two lists to create it.
average_per_hour = list()
for i in range(24):
    posts = x[i][1]  # Since we have a tuple of (hour, posts), we extract the value at index [1]
    comments = y[i][1]  # Since we have a tuple of (hour, comments), we extract the value at index [1]
    average = round((comments/posts),2)
    average_per_hour.append((i, average))
plt.plot(*zip(*average_per_hour))
plt.xlabel("Hour")
plt.ylabel("Number of comments")
plt.title('Average comments per posts')
plt.xticks(hours_24[::2])
plt.show()
As expected, there is a spike of comments made per post created at 15:00 (3:00 PM), but we can also observe peaks in the morning and the evening. Remember that the `Ask HN` data has an average of 14.04 comments per post. So we can say that the hours around 02:00 (2:00 AM), 15:00 (3:00 PM) and 20:00 (8:00 PM) are all above average.
above_average_hours = list()
for hour, avg in average_per_hour:
    if avg > ask_posts_comments[1]:  # keep only the hours above the overall 14.04 average
        above_average_hours.append((hour, avg))
top_above_avg_hours = sorted(above_average_hours, key = lambda x: x[1], reverse = True)
for hours in top_above_avg_hours:
    formatted_time = dt.time(hours[0]).strftime('%H:%M')
    print('{0}: Average of {1} comments'.format(formatted_time, hours[1]))
15:00: Average of 38.59 comments
02:00: Average of 23.81 comments
20:00: Average of 21.52 comments
16:00: Average of 16.8 comments
21:00: Average of 16.01 comments
13:00: Average of 14.74 comments
We can see that the hours with an above-average number of comments per post correspond with the results of our line graph above.
From the information we extracted, we can see the peak times below, where a user who creates a post is more likely to receive a comment:
Note: These timings are in Eastern Time in the US.
The following times below are in Qatar Timezone (GMT +3)
And the following times below are in Philippines Timezone (GMT +8)
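The timezone conversion can be sketched with the `datetime` module. This is a minimal sketch, assuming Eastern Standard Time (UTC-5); during daylight saving (UTC-4) each converted hour shifts one hour later.

```python
from datetime import datetime, timedelta

def convert_hour(hour_et, utc_offset, et_offset=-5):
    '''Convert an hour in Eastern Time (assumed UTC-5) to another UTC offset.'''
    base = datetime(2016, 1, 1, hour_et)  # arbitrary date; only the hour matters
    shifted = base + timedelta(hours=utc_offset - et_offset)
    return shifted.strftime('%H:%M')

# Peak hours from our analysis, in Eastern Time.
for hour in (15, 2, 20):
    print('{0:02d}:00 ET -> {1} (GMT+3, Qatar) / {2} (GMT+8, Philippines)'.format(
        hour, convert_hour(hour, 3), convert_hour(hour, 8)))
```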
Even though these times exceed the average number of comments per post created, we have to remember that we have only identified the times when an `Ask HN` post is more likely to receive a comment. We cannot guarantee that a post made at these times will receive a comment.
Still, due to the high number of active users at these times, we recommend posting `Ask HN` related posts during these hours.