Exploring Hacker News Posts¶

1. Introduction
2. Scope
3. Goals
4. Read the Data
5. Clean the Data
6. Explore the Data
7. Conclusion

1. Introduction¶

Hacker News is a site that is extremely popular in technology and startup circles, where users can submit posts that receive votes and comments, similar to Reddit.

The full dataset of almost 300,000 rows can be found here. We have taken a sample of approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.
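The sampling procedure described above can be sketched in pandas as follows. This is only an illustration under assumptions: the column name `num_comments` comes from the dataset, but the toy frame, the helper name `sample_commented_posts`, and the fixed seed are hypothetical and not necessarily how the sample was actually produced.

```python
import pandas as pd

def sample_commented_posts(posts, n, seed=0):
    """Drop posts without comments, then draw a random sample of up to n rows."""
    commented = posts[posts['num_comments'] > 0]
    return commented.sample(n=min(n, len(commented)), random_state=seed)

# Toy stand-in for the full dataset (hypothetical values)
full = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'num_comments': [0, 3, 0, 12, 1, 7],
})
sample = sample_commented_posts(full, n=3)
print(sample)
```

One consequence of this kind of filtering is that every post in the sample has at least one comment, which matches the minimum of 1 for `num_comments` seen later in `df.describe()`.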

Below are descriptions of the columns:

id: the unique identifier from Hacker News for the post
title: the title of the post
url: the URL that the post links to, if the post has a URL
num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: the number of comments on the post
author: the username of the person who submitted the post
created_at: the date and time the post was made (the time zone is Eastern Time in the US)

2. Scope¶

The scope of this project is limited to the exploratory data analysis of the sample dataset provided by Dataquest in its Guided Project: Exploring Hacker News Posts.

3. Goals¶

In this project, we are specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:

Ask HN: How to improve my personal website?
Ask HN: Am I the only one outraged by Twitter shutting down share counts?
Ask HN: Aby recent changes to CSS that broke mobile?

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:

Show HN: Wio Link ESP8266 Based Web of Things Hardware Development Platform
Show HN: Something pointless I made
Show HN: Shanhu.io, a programming playground powered by e8vm

We will analyze the sampled data and compare these two types of posts to determine the following:

Do Ask HN or Show HN posts receive more comments on average?

Do posts created at a certain time receive more comments on average?

4. Read the Data¶

4.1 Get data¶

In [1]:

# Imports
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import matplotlib.style as style
style.use('fivethirtyeight')
sns.set(style="darkgrid")
sns.set_palette("pastel")

In [2]:

# Load dataframe 
file = 'hacker_news.csv'
df = pd.read_csv(file)

In [3]:

# Look at first 5 rows
df.head()

Out[3]:

	id	title	url	num_points	num_comments	author	created_at
0	12224879	Interactive Dynamic Video	http://www.interactivedynamicvideo.com/	386	52	ne0phyte	8/4/2016 11:52
1	10975351	How to Use Open Source and Shut the Fuck Up at...	http://hueniverse.com/2016/01/26/how-to-use-op...	39	10	josep2	1/26/2016 19:30
2	11964716	Florida DJs May Face Felony for April Fools' W...	http://www.thewire.com/entertainment/2013/04/f...	2	1	vezycash	6/23/2016 22:20
3	11919867	Technology ventures: From Idea to Enterprise	https://www.amazon.com/Technology-Ventures-Ent...	3	1	hswarna	6/17/2016 0:01
4	10301696	Note by Note: The Making of Steinway L1037 (2007)	http://www.nytimes.com/2007/11/07/movies/07ste...	8	2	walterbell	9/30/2015 4:12

4.2 Check for data type, missing data, unwanted columns, and sensibility of data¶

In [4]:

# Check for missing data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            20100 non-null  int64 
 1   title         20100 non-null  object
 2   url           17660 non-null  object
 3   num_points    20100 non-null  int64 
 4   num_comments  20100 non-null  int64 
 5   author        20100 non-null  object
 6   created_at    20100 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB

In [5]:

# Check if data makes sense
df.describe()

Out[5]:

	id	num_points	num_comments
count	2.010000e+04	20100.000000	20100.000000
mean	1.131753e+07	50.296070	24.802289
std	6.964399e+05	107.107687	56.107340
min	1.017691e+07	1.000000	1.000000
25%	1.070176e+07	3.000000	1.000000
50%	1.128445e+07	9.000000	3.000000
75%	1.192607e+07	54.000000	21.000000
max	1.257898e+07	2553.000000	1733.000000

Observations

About 12% of the url values are missing (2,440 of 20,100 rows). There is no need to fix these rows, as the url is not important for our analysis.
The created_at column is stored as a string (object) and should be converted to a datetime type.
Data looks sensible.
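The share of missing values can also be computed directly instead of being read off `df.info()`. The sketch below uses a small toy frame (hypothetical values) in place of `df`, and then reproduces the 12% figure from the counts shown above.

```python
import pandas as pd

# Toy frame standing in for `df` (hypothetical values)
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'url': ['http://x.example', None, 'http://y.example', None],
})

# Fraction of missing values per column
missing_share = toy.isna().mean()
print(missing_share)

# The figure quoted above, from the df.info() counts
url_missing = (20100 - 17660) / 20100
print(f'{url_missing:.1%}')
```

On the real data, `df.isna().mean()` would give the same per-column breakdown in one call.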

5. Clean the Data¶

5.1 Change data type, drop unwanted columns, deal with missing data¶

In [6]:

# convert `created_at` column to datetime object (values look like '8/4/2016 11:52')
df['created_at'] = pd.to_datetime(df['created_at'], format='%m/%d/%Y %H:%M')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   id            20100 non-null  int64         
 1   title         20100 non-null  object        
 2   url           17660 non-null  object        
 3   num_points    20100 non-null  int64         
 4   num_comments  20100 non-null  int64         
 5   author        20100 non-null  object        
 6   created_at    20100 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 1.1+ MB

6. Explore the Data¶

6.1 Posts categories¶

Since we are concerned with post titles beginning with Ask HN or Show HN, we will add a new column with three labels: ask posts, show posts, and other posts.

In [7]:

# Add new column `category`
def categorize(row):
    if row['title'].lower().startswith('ask hn'):
        return 'ask posts'
    elif row['title'].lower().startswith('show hn'):
        return 'show posts'
    else:
        return 'other posts'

df['category'] = df.apply(categorize, axis=1)
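As an alternative to the row-wise `categorize` function above, the same labels can be produced with vectorized string methods. This is a sketch on a few toy titles (the last one is made up to show the case-insensitive match); `np.select` picks the first matching condition and falls back to the default.

```python
import numpy as np
import pandas as pd

# Toy titles standing in for df['title']
titles = pd.Series([
    'Ask HN: How to improve my personal website?',
    'Show HN: Something pointless I made',
    'Interactive Dynamic Video',
    'ask hn: lowercase prefix',  # hypothetical title
])

lowered = titles.str.lower()
category = np.select(
    [lowered.str.startswith('ask hn'), lowered.str.startswith('show hn')],
    ['ask posts', 'show posts'],
    default='other posts',
)
print(pd.Series(category).value_counts())
```

On a frame of this size either approach is fast enough; the vectorized form mainly pays off on the full dataset.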

In [8]:

# Show frequency of the categories
df['category'].value_counts()

Out[8]:

other posts    17194
ask posts       1744
show posts      1162
Name: category, dtype: int64

In [9]:

# Show frequency of the categories in percentages
df['category'].value_counts(normalize=True)

Out[9]:

other posts    0.855423
ask posts      0.086766
show posts     0.057811
Name: category, dtype: float64

In [10]:

# Plot graph
sns.countplot(x=df['category'])
plt.title('Number of Posts By Categories')
plt.xlabel('Categories')
plt.show()

Ask HN posts account for around 9% (1,744) of all posts while Show HN posts account for around 6% (1,162) of all posts. Other posts account for about 85% (17,194) of all posts.

6.2 Compare total comments of Ask HN and Show HN categories¶

In [11]:

# Filter post categories
ask_posts = df[df['category']=='ask posts'].copy()
show_posts = df[df['category']=='show posts'].copy()
other_posts = df[df['category']=='other posts'].copy()

In [12]:

# Total comments of `ask_posts`
ask_posts['num_comments'].sum()

Out[12]:

24483

In [13]:

# Total comments of `show_posts`
show_posts['num_comments'].sum()

Out[13]:

11988

In [14]:

# Plot graph
sns.barplot(data=df[df['category'].isin(['ask posts', 'show posts'])], x='category', y='num_comments', estimator=sum, errorbar=None)
plt.title('Number of Comments By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Comments')

plt.show()

Ask HN posts have a total of 24,483 comments, while Show HN posts have a total of 11,988 comments.

6.3 Compare average comments of Ask HN and Show HN categories¶

In [15]:

# Average comments of `ask_posts`
ask_posts['num_comments'].mean().round(2)

Out[15]:

14.04

In [16]:

# Average comments of `show_posts`
show_posts['num_comments'].mean().round(2)

Out[16]:

10.32

In [17]:

# Plot graph
sns.barplot(data=df[df['category'].isin(['ask posts', 'show posts'])], x='category', y='num_comments', estimator=np.mean, errorbar=None)
plt.title('Mean Number of Comments By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Comments')

plt.show()

Ask HN posts receive more comments on average (14 comments) compared to Show HN posts (10 comments).

We will further explore Ask HN posts and their comments for a deeper analysis.
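The counts, totals and means compared above can also be obtained in a single `groupby` call. The sketch below uses a toy frame with hypothetical values in place of `df`.

```python
import pandas as pd

# Toy data standing in for df (hypothetical values)
toy = pd.DataFrame({
    'category': ['ask posts', 'ask posts', 'show posts', 'show posts', 'other posts'],
    'num_comments': [10, 20, 4, 6, 50],
})

summary = (
    toy[toy['category'].isin(['ask posts', 'show posts'])]
    .groupby('category')['num_comments']
    .agg(['count', 'sum', 'mean'])
)
print(summary)
```

The three columns are tied together by mean = sum / count, which gives a quick consistency check on the real figures: 24,483 / 1,744 ≈ 14.04 for Ask HN and 11,988 / 1,162 ≈ 10.32 for Show HN.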

6.4 Number of Ask HN posts and comments by hour¶

In [18]:

# Extract the hour from `created_at`
def get_hour(row):
    return dt.datetime.strftime(row['created_at'], '%H')

ask_posts['created_hour'] = ask_posts.apply(lambda row: get_hour(row), axis=1)
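Once `created_at` is a datetime column, the hour can also be extracted without a row-wise `apply`, using the `.dt` accessor. The sketch below uses three timestamps copied from the rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Timestamps in the same format as `created_at`
created = pd.to_datetime(
    pd.Series(['8/4/2016 11:52', '1/26/2016 19:30', '6/17/2016 0:01']),
    format='%m/%d/%Y %H:%M',
)

hour_str = created.dt.strftime('%H')  # zero-padded strings, like get_hour
hour_int = created.dt.hour            # plain integers
print(hour_str.tolist(), hour_int.tolist())
```

The string form sorts correctly because it is zero-padded; the integer form avoids the later `int(index)` conversion when printing times.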

In [19]:

# Show frequency of ask posts by hour
ask_posts_counts_by_hour = ask_posts['created_hour'].value_counts().sort_index()
ask_posts_counts_by_hour

Out[19]:

00     55
01     60
02     58
03     54
04     47
05     46
06     44
07     34
08     48
09     45
10     59
11     58
12     73
13     85
14    107
15    116
16    108
17    100
18    109
19    110
20     80
21    109
22     71
23     68
Name: created_hour, dtype: int64

In [20]:

# Create a palette to highlight the maximum and minimum values
def set_max_min_palette(series, max_color='turquoise', min_color='coral', other_color='lightgray'):
    palette = []
    for item in series:
        if item == series.max():
            palette.append(max_color)
        elif item == series.min():
            palette.append(min_color)
        else:
            palette.append(other_color)
    return palette

In [21]:

# Plot graph
palette = set_max_min_palette(ask_posts_counts_by_hour)
sns.barplot(x=ask_posts_counts_by_hour.index, y=ask_posts_counts_by_hour.values, palette=palette)
plt.title('Ask HN Posts Counts By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Number of Posts')

plt.show()

In [22]:

# Show frequency of ask posts comments by hour
ask_posts_comments_by_hour = ask_posts.groupby(by=['created_hour'])['num_comments'].sum()
ask_posts_comments_by_hour

Out[22]:

created_hour
00     447
01     683
02    1381
03     421
04     337
05     464
06     397
07     267
08     492
09     251
10     793
11     641
12     687
13    1253
14    1416
15    4477
16    1814
17    1146
18    1439
19    1188
20    1722
21    1745
22     479
23     543
Name: num_comments, dtype: int64

In [23]:

# Plot graph
palette = set_max_min_palette(ask_posts_comments_by_hour)
sns.barplot(x=ask_posts_comments_by_hour.index, y=ask_posts_comments_by_hour.values, palette=palette)
plt.title('Ask HN Comments By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Number of Comments')

plt.show()

For Ask HN posts, there is a daily cycle, with the highest number of posts at 3 PM and the lowest at 7 AM (US Eastern Time).

As for Ask HN comments, there is a similar cycle, with the highest number of comments at 3 PM and the lowest at 9 AM (US Eastern Time).

Since Hacker News is popular globally, we would expect the peaks and troughs of both graphs to be less pronounced. Instead, the peak of the Ask HN posts graph is more than 3 times its trough (116 vs. 34 posts). For the Ask HN comments graph, the peak is almost 18 times the trough (4,477 vs. 251 comments).
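The peak-to-trough ratios can be checked directly from the hourly tables above; the variable names are ours.

```python
# Peak and trough values read from the hourly tables above
posts_peak, posts_trough = 116, 34          # 3 PM vs 7 AM
comments_peak, comments_trough = 4477, 251  # 3 PM vs 9 AM

posts_ratio = posts_peak / posts_trough
comments_ratio = comments_peak / comments_trough
print(f'posts: {posts_ratio:.1f}x, comments: {comments_ratio:.1f}x')
# prints: posts: 3.4x, comments: 17.8x
```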

6.5 Average number of comments for Ask HN posts by hour¶

In [24]:

# Show frequency of ask posts average comments by hour
ask_posts_average_comments_by_hour = ask_posts.groupby(by=['created_hour'])['num_comments'].mean()
ask_posts_average_comments_by_hour

Out[24]:

created_hour
00     8.127273
01    11.383333
02    23.810345
03     7.796296
04     7.170213
05    10.086957
06     9.022727
07     7.852941
08    10.250000
09     5.577778
10    13.440678
11    11.051724
12     9.410959
13    14.741176
14    13.233645
15    38.594828
16    16.796296
17    11.460000
18    13.201835
19    10.800000
20    21.525000
21    16.009174
22     6.746479
23     7.985294
Name: num_comments, dtype: float64

In [25]:

# Plot graph
palette = set_max_min_palette(ask_posts_average_comments_by_hour)
sns.barplot(x=ask_posts_average_comments_by_hour.index, y=ask_posts_average_comments_by_hour.values, palette=palette)
plt.title('Mean Number of Comments Per Ask HN Post By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Number of Comments')

plt.show()

The mean number of comments per Ask HN post follows the same trend as the total number of comments, with the highest mean occurring at 3 PM and the lowest occurring at 9 AM (US Eastern Time).

With a large disparity between the maximum and minimum average comments of Ask HN posts (about 7x), it is worth identifying the hours that attract the most comments.

6.6 Top 5 Hours for Ask Posts Comments¶

In [26]:

# Show top five ask posts average comments by hour
top_five_hours = ask_posts_average_comments_by_hour.sort_values(ascending=False)[:5]
top_five_hours.round(2)

Out[26]:

created_hour
15    38.59
02    23.81
20    21.52
16    16.80
21    16.01
Name: num_comments, dtype: float64

In [27]:

# Print in this format
# 15:00: 38.59 average comments per post
print('---- Top 5 hours for Ask Posts Comments -----')
for index, value in top_five_hours.items():
    time_t = dt.time(hour=int(index))
    time_str = time_t.strftime('%H:%M')
    print(f'{time_str}: {value:.2f} average comments per post')

---- Top 5 hours for Ask Posts Comments -----
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post

In [28]:

# Create a palette to highlight the top n values
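def set_top_values_palette(series, top_n, top_color='xkcd:lightblue', other_color='lightgray'):
    # Threshold is the (top_n + 1)-th largest value; .iloc gives positional access
    # regardless of the index labels, and it is computed once instead of per item
    threshold = series.nlargest(top_n + 1).iloc[-1]
    palette = []
    for item in series:
        if item > threshold:
            palette.append(top_color)
        else:
            palette.append(other_color)
    return palette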

In [29]:

# Plot graph
palette = set_top_values_palette(ask_posts_average_comments_by_hour, 5)
sns.barplot(x=ask_posts_average_comments_by_hour.index, y=ask_posts_average_comments_by_hour.values, palette=palette)
plt.title('Mean Number of Comments Per Ask HN Post By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Number of Comments')

plt.show()

6.7 Convert Top 5 Hours for Ask Posts Comments to Local Time¶

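Singapore (UTC+8) is 13 hours ahead of US Eastern Standard Time (EST, UTC-5); while daylight saving time is in effect in the US, the difference is 12 hours. We now convert our findings to Singapore time using the 13-hour offset.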

In [30]:

# Convert to Singapore time
time_difference = dt.timedelta(hours=13)

print('---- Top 5 hours (+8 GMT) for Ask Posts Comments -----')
for index, value in top_five_hours.items():
    time_t = dt.time(hour=int(index))
    time_dt = dt.datetime.combine(dt.date(1, 1, 1), time_t) # convert time_t to a datetime object
    time_t = (time_dt + time_difference).time() # perform timedelta calculation, then extract the time
    time_str = time_t.strftime('%H:%M')
    print(f'{time_str}: {value:.2f} average comments per post')

---- Top 5 hours (+8 GMT) for Ask Posts Comments -----
04:00: 38.59 average comments per post
15:00: 23.81 average comments per post
09:00: 21.52 average comments per post
05:00: 16.80 average comments per post
10:00: 16.01 average comments per post

The most practical hours for Ask HN posts to attract comments are 9 AM, 10 AM and 3 PM (Singapore time); 4 AM and 5 AM are too early.
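The fixed 13-hour offset used above holds while US Eastern Time is on standard time. A daylight-saving-aware conversion can be sketched with the standard-library `zoneinfo` module (Python 3.9+). The helper name `to_singapore_hour` and the two example dates are our own choices for illustration, not part of the analysis.

```python
import datetime as dt
from zoneinfo import ZoneInfo  # standard library since Python 3.9

eastern = ZoneInfo('America/New_York')
singapore = ZoneInfo('Asia/Singapore')

def to_singapore_hour(year, month, day, hour):
    """Convert a US Eastern wall-clock hour on a given date to Singapore time."""
    local = dt.datetime(year, month, day, hour, tzinfo=eastern)
    return local.astimezone(singapore).strftime('%H:%M')

print(to_singapore_hour(2016, 1, 15, 15))  # winter date (EST)
print(to_singapore_hour(2016, 7, 15, 15))  # summer date (EDT)
```

For a 3 PM Eastern post this gives 04:00 in January and 03:00 in July, so the Singapore-time hours listed above shift one hour earlier during the US daylight saving period.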

6.8 Compare average points of Ask HN and Show HN categories¶

In [31]:

# Find average ask posts points
ask_posts['num_points'].mean().round(2)

Out[31]:

15.06

In [32]:

# Find average show posts points
show_posts['num_points'].mean().round(2)

Out[32]:

27.56

In [33]:

# Plot graph
sns.barplot(data=df[df['category'].isin(['ask posts', 'show posts'])], x='category', y='num_points', estimator=np.mean, errorbar=None)
plt.title('Mean Number of Points By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Points')

plt.show()

Show HN posts receive more points on average (27.56 points) compared to Ask HN posts (15.06 points). Recall that Ask HN posts have more average comments.

This suggests that users are more likely to give points (similar to a "like") to Show HN posts, whereas they are more likely to respond to Ask HN posts with comments.
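One rough way to put a number on this contrast is the ratio of mean points to mean comments in each category, using the averages reported above. This is a descriptive figure only (a ratio of means, not a per-post statistic), and the dictionary layout is our own.

```python
# Category means reported above
means = {
    'ask posts': {'points': 15.06, 'comments': 14.04},
    'show posts': {'points': 27.56, 'comments': 10.32},
}

points_per_comment = {
    category: values['points'] / values['comments']
    for category, values in means.items()
}
for category, ratio in points_per_comment.items():
    print(f'{category}: {ratio:.2f} points per comment')
```

Ask HN posts come out at roughly one point per comment, while Show HN posts collect between two and three points per comment.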

We will further explore Show HN posts and their points for a deeper analysis.

6.9 Average number of points for Show HN posts by hour¶

In [34]:

# Extract the hour from `created_at`
show_posts['created_hour'] = show_posts.apply(get_hour, axis=1)

In [35]:

# Show frequency of show posts average points by hour
show_posts_average_points_by_hour = show_posts.groupby(by=['created_hour'])['num_points'].mean()
show_posts_average_points_by_hour.round(2)

Out[35]:

created_hour
00    37.84
01    25.00
02    11.33
03    25.15
04    14.85
05     5.47
06    23.44
07    19.00
08    15.26
09    18.43
10    18.92
11    33.64
12    41.69
13    24.63
14    25.43
15    28.56
16    28.32
17    27.11
18    36.31
19    30.95
20    30.32
21    18.43
22    40.35
23    42.39
Name: num_points, dtype: float64

In [36]:

# Plot graph
palette = set_max_min_palette(show_posts_average_points_by_hour)
sns.barplot(x=show_posts_average_points_by_hour.index, y=show_posts_average_points_by_hour.values, palette=palette)
plt.title('Mean Number of Points Per Show HN Posts By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Number of Points')

plt.show()

The highest mean number of points per Show HN post occurs at 11 PM and the lowest occurs at 5 AM (US Eastern Time).

With a large disparity between the maximum and minimum average points of Show HN posts (nearly 8x, 42.39 vs. 5.47), it is worth identifying the hours that attract the most points.

6.10 Top 5 Hours for Show Posts Points¶

In [37]:

# Show top five show posts average points by hour
top_five_hours_show_points = show_posts_average_points_by_hour.sort_values(ascending=False)[:5]
top_five_hours_show_points.round(2)

Out[37]:

created_hour
23    42.39
12    41.69
22    40.35
00    37.84
18    36.31
Name: num_points, dtype: float64

In [38]:

# Print in this format
# 23:00: 42.39 average points per post
print('---- Top 5 hours for Show Posts Points -----')
for index, value in top_five_hours_show_points.items():
    time_t = dt.time(hour=int(index))
    time_str = time_t.strftime('%H:%M')
    print(f'{time_str}: {value:.2f} average points per post')

---- Top 5 hours for Show Posts Points -----
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post

In [39]:

# Plot graph
palette = set_top_values_palette(show_posts_average_points_by_hour, 5)
sns.barplot(x=show_posts_average_points_by_hour.index, y=show_posts_average_points_by_hour.values, palette=palette)
plt.title('Mean Number of Points Per Show HN Posts By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Number of Points')

plt.show()

6.11 Compare average comments and points of other posts category¶

In [40]:

# Find average other posts comments
other_posts['num_comments'].mean().round(2)

Out[40]:

26.87

In [41]:

# Plot graph
sns.barplot(data=df, x='category', y='num_comments', estimator=np.mean, errorbar=None)
plt.title('Mean Number of Comments By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Comments')

plt.show()

In [42]:

# Find average other posts points
other_posts['num_points'].mean().round(2)

Out[42]:

55.41

In [43]:

# Plot graph
sns.barplot(data=df, x='category', y='num_points', estimator=np.mean, errorbar=None)
plt.title('Mean Number of Points By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Points')

plt.show()

Other posts have a much higher average number of comments and points as compared to Ask HN and Show HN posts.

7. Conclusion¶

7.1 Takeaways¶

Breakdown of categories: Ask HN posts take up 8.7%, Show HN posts take up 5.8%, and other posts take up 85.5% of all posts.
Total comments: Ask HN posts have 24,483 comments, while Show HN posts have 11,988 comments.
Average comments: Ask HN posts have 14 comments on average, Show HN posts have 10 comments on average, and other posts have 27 comments on average.
Ask HN posts: most posts at 3 PM and fewest posts at 7 AM (US Eastern Time).
Ask HN comments: most comments at 3 PM and fewest comments at 9 AM (US Eastern Time).
Ask HN average comments per post: highest average comments per post at 3 PM and lowest average comments per post at 9 AM (US Eastern Time).
Top 5 hours for Ask HN comments: 3 PM, 2 AM, 8 PM, 4 PM, 9 PM (US Eastern Time), or 4 AM, 3 PM, 9 AM, 5 AM, 10 AM (Singapore Time).
Average points: Ask HN posts have 15.06 points on average, Show HN posts have 27.56 points on average, and other posts have 55.41 points on average.
Show HN average points per post: highest average points per post at 11 PM and lowest average points per post at 5 AM (US Eastern Time).
Top 5 hours for Show HN points: 11 PM, 12 PM, 10 PM, 12 AM, 6 PM (US Eastern Time).

7.2 Possible next steps¶

The project could be extended to look more closely at the other posts, which receive more comments and points on average, to see which of them do best, perhaps taking into account the topic and the number of words in the title of each post.
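As a starting point for the title-length idea, a word count per title can be added as a new column. This is only a sketch: the `title_words` column name is hypothetical, and apart from the first row (taken from the data shown earlier) the comment counts are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': [
        'Interactive Dynamic Video',
        'Ask HN: How to improve my personal website?',
        'Show HN: Something pointless I made',
    ],
    'num_comments': [52, 12, 3],  # last two values are hypothetical
})

# Split on whitespace and count the resulting tokens
toy['title_words'] = toy['title'].str.split().str.len()
print(toy[['title_words', 'num_comments']])
```

On the full frame, grouping by `title_words` and averaging `num_comments` or `num_points` would show whether title length is related to engagement.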
