Hacker News is a site, extremely popular in technology and startup circles, where users submit posts that receive votes and comments, similar to Reddit.
The full dataset of almost 300,000 rows can be found here. We have taken a sample of approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling from the remaining submissions.
Below are descriptions of the columns:
- `id`: the unique identifier from Hacker News for the post
- `title`: the title of the post
- `url`: the URL that the post links to, if the post has a URL
- `num_points`: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- `num_comments`: the number of comments on the post
- `author`: the username of the person who submitted the post
- `created_at`: the date and time the post was made (the time zone is Eastern Time in the US)

The scope of this project is limited to the exploratory data analysis of the sample dataset provided by Dataquest in its Guided Project: Exploring Hacker News Posts.
In this project, we are specifically interested in posts with titles that begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question. Below are a few examples:
Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. Below are a few examples:
We will analyze the sampled data and compare these two types of posts to determine the following:
- Do Ask HN or Show HN receive more comments on average?
- Do posts created at a certain time receive more comments on average?
# Imports
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import matplotlib.style as style
style.use('fivethirtyeight')
sns.set(style="darkgrid")
sns.set_palette("pastel")
# Load dataframe
file = 'hacker_news.csv'
df = pd.read_csv(file)
# Look at first 5 rows
df.head()
| | id | title | url | num_points | num_comments | author | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 10975351 | How to Use Open Source and Shut the Fuck Up at... | http://hueniverse.com/2016/01/26/how-to-use-op... | 39 | 10 | josep2 | 1/26/2016 19:30 |
| 2 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 3 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 4 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
# Check for missing data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            20100 non-null  int64
 1   title         20100 non-null  object
 2   url           17660 non-null  object
 3   num_points    20100 non-null  int64
 4   num_comments  20100 non-null  int64
 5   author        20100 non-null  object
 6   created_at    20100 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB
# Check if data makes sense
df.describe()
| | id | num_points | num_comments |
|---|---|---|---|
| count | 2.010000e+04 | 20100.000000 | 20100.000000 |
| mean | 1.131753e+07 | 50.296070 | 24.802289 |
| std | 6.964399e+05 | 107.107687 | 56.107340 |
| min | 1.017691e+07 | 1.000000 | 1.000000 |
| 25% | 1.070176e+07 | 3.000000 | 1.000000 |
| 50% | 1.128445e+07 | 9.000000 | 3.000000 |
| 75% | 1.192607e+07 | 54.000000 | 21.000000 |
| max | 1.257898e+07 | 2553.000000 | 1733.000000 |
Observations
- `url` has missing values. There is no need to fix the rows with missing values, as the `url` is not important for our analysis.
- The `created_at` column should be a datetime object.
# convert `created_at` column to datetime object
# use an explicit month/day/year format; `infer_datetime_format` is deprecated in pandas 2.0+
df['created_at'] = pd.to_datetime(df['created_at'], format='%m/%d/%Y %H:%M')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            20100 non-null  int64
 1   title         20100 non-null  object
 2   url           17660 non-null  object
 3   num_points    20100 non-null  int64
 4   num_comments  20100 non-null  int64
 5   author        20100 non-null  object
 6   created_at    20100 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(3)
memory usage: 1.1+ MB
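As a minimal sketch of the conversion step, the snippet below parses a few `created_at` strings copied from the preview above with an explicit format, then attaches the US Eastern time zone mentioned in the column descriptions. The `sample` frame and the `created_at_et` column are hypothetical names used only for illustration, and localizing the full dataset may need extra handling for ambiguous daylight saving timestamps.

```python
import pandas as pd

# Hypothetical sample mirroring the `created_at` strings shown in df.head()
sample = pd.DataFrame({'created_at': ['8/4/2016 11:52', '1/26/2016 19:30', '9/30/2015 4:12']})

# Parse with an explicit month/day/year format instead of relying on inference
sample['created_at'] = pd.to_datetime(sample['created_at'], format='%m/%d/%Y %H:%M')

# Attach the time zone stated in the column description (US Eastern Time);
# `created_at_et` is an illustrative column name, not part of the dataset
sample['created_at_et'] = sample['created_at'].dt.tz_localize('America/New_York')

print(sample.dtypes)
print(sample['created_at'].dt.hour.tolist())
```

Keeping the naive column alongside the localized one leaves the rest of the analysis unchanged while making the time zone explicit.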
Since we are concerned with post titles beginning with Ask HN or Show HN, we will add a new column with three labels: ask posts, show posts, and other posts.
# Add new column `category`
def categorize(row):
    # Compare in lowercase so that 'Ask HN', 'ask hn', etc. are all matched
    title = row['title'].lower()
    if title.startswith('ask hn'):
        return 'ask posts'
    elif title.startswith('show hn'):
        return 'show posts'
    else:
        return 'other posts'

df['category'] = df.apply(categorize, axis=1)
# Show frequency of the categories
df['category'].value_counts()
other posts    17194
ask posts       1744
show posts      1162
Name: category, dtype: int64
# Show frequency of the categories in percentages
df['category'].value_counts(normalize=True)
other posts    0.855423
ask posts      0.086766
show posts     0.057811
Name: category, dtype: float64
# Plot graph
sns.countplot(x=df['category'])
plt.title('Number of Posts By Categories')
plt.xlabel('Categories')
plt.show()
Ask HN posts account for around 9% (1,744) of all posts, while Show HN posts account for around 6% (1,162) of all posts. Other posts account for about 85% (17,194) of all posts.
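The row-wise labelling used for the `category` column can also be expressed with vectorized string methods. The following is only a sketch on a hypothetical `toy` frame with made-up titles; it assumes the same three labels and the same case-insensitive prefix matching.

```python
import numpy as np
import pandas as pd

# Hypothetical titles for illustration only
toy = pd.DataFrame({'title': ['Ask HN: How do you learn?',
                              'Show HN: My side project',
                              'show hn: lowercase works too',
                              'A regular link post']})

lower_titles = toy['title'].str.lower()
conditions = [
    lower_titles.str.startswith('ask hn').to_numpy(),
    lower_titles.str.startswith('show hn').to_numpy(),
]
labels = ['ask posts', 'show posts']

# np.select picks the label of the first matching condition, else the default
toy['category'] = np.select(conditions, labels, default='other posts')
print(toy['category'].value_counts())
```

On a frame of this size the difference is negligible, but the vectorized form avoids calling a Python function once per row.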
# Filter post categories
ask_posts = df[df['category']=='ask posts'].copy()
show_posts = df[df['category']=='show posts'].copy()
other_posts = df[df['category']=='other posts'].copy()
# Total comments of `ask_posts`
ask_posts['num_comments'].sum()
24483
# Total comments of `show_posts`
show_posts['num_comments'].sum()
11988
# Plot graph
sns.barplot(data=df[df['category'].isin(['ask posts', 'show posts'])], x='category', y='num_comments', estimator=sum, errorbar=None)
plt.title('Number of Comments By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Comments')
plt.show()
Ask HN posts have a total of 24,483 comments, while Show HN posts have a total of 11,988 comments.
# Average comments of `ask_posts`
ask_posts['num_comments'].mean().round(2)
14.04
# Average comments of `show_posts`
show_posts['num_comments'].mean().round(2)
10.32
# Plot graph
sns.barplot(data=df[df['category'].isin(['ask posts', 'show posts'])], x='category', y='num_comments', estimator=np.mean, errorbar=None)
plt.title('Mean Number of Comments By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Comments')
plt.show()
Ask HN posts receive more comments on average (14 comments) compared to Show HN posts (10 comments).
We will further explore Ask HN posts and their comments for a deeper analysis.
# Extract the hour from `created_at`
def get_hour(row):
    # Return the hour as a zero-padded string, e.g. '07' or '15'
    return dt.datetime.strftime(row['created_at'], '%H')

ask_posts['created_hour'] = ask_posts.apply(get_hour, axis=1)
# Show frequency of ask posts by hour
ask_posts_counts_by_hour = ask_posts['created_hour'].value_counts().sort_index()
ask_posts_counts_by_hour
00     55
01     60
02     58
03     54
04     47
05     46
06     44
07     34
08     48
09     45
10     59
11     58
12     73
13     85
14    107
15    116
16    108
17    100
18    109
19    110
20     80
21    109
22     71
23     68
Name: created_hour, dtype: int64
# Create a palette to highlight the maximum and minimum values
def set_max_min_palette(series, max_color='turquoise', min_color='coral', other_color='lightgray'):
    palette = []
    for item in series:
        if item == series.max():
            palette.append(max_color)
        elif item == series.min():
            palette.append(min_color)
        else:
            palette.append(other_color)
    return palette
# Plot graph
palette = set_max_min_palette(ask_posts_counts_by_hour)
sns.barplot(x=ask_posts_counts_by_hour.index, y=ask_posts_counts_by_hour.values, palette=palette)
plt.title('Ask HN Posts Counts By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Number of Posts')
plt.show()
# Show frequency of ask posts comments by hour
ask_posts_comments_by_hour = ask_posts.groupby(by=['created_hour'])['num_comments'].sum()
ask_posts_comments_by_hour
created_hour
00     447
01     683
02    1381
03     421
04     337
05     464
06     397
07     267
08     492
09     251
10     793
11     641
12     687
13    1253
14    1416
15    4477
16    1814
17    1146
18    1439
19    1188
20    1722
21    1745
22     479
23     543
Name: num_comments, dtype: int64
# Plot graph
palette = set_max_min_palette(ask_posts_comments_by_hour)
sns.barplot(x=ask_posts_comments_by_hour.index, y=ask_posts_comments_by_hour.values, palette=palette)
plt.title('Ask HN Comments By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Number of Comments')
plt.show()
For Ask HN posts, there is a daily cycle, with the most posts at 3 PM and the fewest posts at 7 AM (US Eastern Time).
As for Ask HN comments, there is a similar cycle, with the most comments at 3 PM and the fewest comments at 9 AM (US Eastern Time).
Since Hacker News is popular globally, we would expect the peaks and troughs of both graphs to be less pronounced. Instead, the peak of the Ask HN posts graph is roughly 3 times its trough. For the Ask HN comments graph, the peak is almost 18 times the trough.
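The per-hour post counts and comment totals discussed here can also be produced in a single grouped aggregation. This is a sketch on a hypothetical `toy_ask` frame with made-up values; the output column names `posts`, `total_comments`, and `mean_comments` are illustrative choices, not names from the original notebook.

```python
import pandas as pd

# Hypothetical Ask HN rows for illustration only
toy_ask = pd.DataFrame({
    'created_at': pd.to_datetime(['2016-08-04 15:10', '2016-08-04 15:45', '2016-08-05 07:05']),
    'num_comments': [40, 10, 3],
})

# Zero-padded hour string, matching the '00'..'23' labels used in the analysis
toy_ask['created_hour'] = toy_ask['created_at'].dt.strftime('%H')

# Named aggregation: one row per hour with count, sum, and mean of comments
hourly = toy_ask.groupby('created_hour')['num_comments'].agg(
    posts='count',
    total_comments='sum',
    mean_comments='mean',
)
print(hourly)
```

Having all three measures in one table makes it easier to see whether a high total comes from many posts or from a few heavily discussed ones.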
# Show frequency of ask posts average comments by hour
ask_posts_average_comments_by_hour = ask_posts.groupby(by=['created_hour'])['num_comments'].mean()
ask_posts_average_comments_by_hour
created_hour
00     8.127273
01    11.383333
02    23.810345
03     7.796296
04     7.170213
05    10.086957
06     9.022727
07     7.852941
08    10.250000
09     5.577778
10    13.440678
11    11.051724
12     9.410959
13    14.741176
14    13.233645
15    38.594828
16    16.796296
17    11.460000
18    13.201835
19    10.800000
20    21.525000
21    16.009174
22     6.746479
23     7.985294
Name: num_comments, dtype: float64
# Plot graph
palette = set_max_min_palette(ask_posts_average_comments_by_hour)
sns.barplot(x=ask_posts_average_comments_by_hour.index, y=ask_posts_average_comments_by_hour.values, palette=palette)
plt.title('Mean Number of Comments Per ASK HN Posts By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Number of Comments')
plt.show()
The mean number of comments per Ask HN post follows the same trend as the total number of comments, with the highest mean occurring at 3 PM and the lowest at 9 AM (US Eastern Time).
With a large disparity between the maximum and minimum average comments of Ask HN posts (about 7x), it is worth seeing which hours attract the most comments.
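The peak-to-trough ratios quoted in this analysis can be checked with simple arithmetic. The numbers below are copied from the hourly outputs shown earlier; the variable names are only illustrative.

```python
# Peak and trough values copied from the hourly outputs shown earlier
posts_max, posts_min = 116, 34              # Ask HN posts at 15:00 and 07:00
comments_max, comments_min = 4477, 251      # Ask HN comments at 15:00 and 09:00
mean_max, mean_min = 38.594828, 5.577778    # mean comments per post at 15:00 and 09:00

posts_ratio = posts_max / posts_min
comments_ratio = comments_max / comments_min
mean_ratio = mean_max / mean_min

print(f'posts: {posts_ratio:.1f}x, comments: {comments_ratio:.1f}x, mean: {mean_ratio:.1f}x')
# prints: posts: 3.4x, comments: 17.8x, mean: 6.9x
```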
# Show top five ask posts average comments by hour
top_five_hours = ask_posts_average_comments_by_hour.sort_values(ascending=False)[:5]
top_five_hours.round(2)
created_hour
15    38.59
02    23.81
20    21.52
16    16.80
21    16.01
Name: num_comments, dtype: float64
# Print in this format
# 15:00: 38.59 average comments per post
print('---- Top 5 hours for Ask Posts Comments -----')
for index, value in top_five_hours.items():
    time_t = dt.time(hour=int(index))
    time_str = time_t.strftime('%H:%M')
    print(f'{time_str}: {value:.2f} average comments per post')
---- Top 5 hours for Ask Posts Comments -----
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
# Create a palette to highlight the top n values
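def set_top_values_palette(series, top_n, top_color='xkcd:lightblue', other_color='lightgray'):
    # Values strictly above the (top_n + 1)-th largest value are highlighted.
    # Use .iloc for positional access because the index holds hour strings.
    threshold = series.nlargest(top_n + 1).iloc[-1]
    palette = []
    for item in series:
        if item > threshold:
            palette.append(top_color)
        else:
            palette.append(other_color)
    return palette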
# Plot graph
palette = set_top_values_palette(ask_posts_average_comments_by_hour, 5)
sns.barplot(x=ask_posts_average_comments_by_hour.index, y=ask_posts_average_comments_by_hour.values, palette=palette)
plt.title('Mean Number of Comments Per ASK HN Posts By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Number of Comments')
plt.show()
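As an alternative sketch of the top-n highlighting, the helper below ranks the values instead of comparing against a threshold. `top_n_palette` and `toy_means` are hypothetical names introduced here for illustration; the colors match the defaults used above.

```python
import pandas as pd

def top_n_palette(series, top_n, top_color='xkcd:lightblue', other_color='lightgray'):
    """Return one color per item, highlighting the top_n largest values."""
    # Rank 1 is the largest value; tied values share the smallest rank
    ranks = series.rank(ascending=False, method='min')
    return [top_color if rank <= top_n else other_color for rank in ranks]

# Hypothetical hourly means, indexed by hour strings as in the analysis
toy_means = pd.Series([8.1, 38.6, 23.8, 5.6], index=['00', '15', '02', '09'])
print(top_n_palette(toy_means, top_n=2))
# prints: ['lightgray', 'xkcd:lightblue', 'xkcd:lightblue', 'lightgray']
```

The returned list keeps the order of the input series, so it lines up with the bars when passed as the palette.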
Singapore is 13 hours ahead of US Eastern Standard Time (EST), or 12 hours ahead while daylight saving time is in effect. We now convert our findings to Singapore time using the 13-hour offset.
# Convert to Singapore time
time_difference = dt.timedelta(hours=13)
print('---- Top 5 hours (+8 GMT) for Ask Posts Comments -----')
for index, value in top_five_hours.items():
    time_t = dt.time(hour=int(index))
    time_dt = dt.datetime.combine(dt.date(1, 1, 1), time_t)  # convert time_t to a datetime object
    time_t = (time_dt + time_difference).time()  # perform timedelta calculation, then extract the time
    time_str = time_t.strftime('%H:%M')
    print(f'{time_str}: {value:.2f} average comments per post')
---- Top 5 hours (+8 GMT) for Ask Posts Comments -----
04:00: 38.59 average comments per post
15:00: 23.81 average comments per post
09:00: 21.52 average comments per post
05:00: 16.80 average comments per post
10:00: 16.01 average comments per post
The best hours for Ask HN post comments are 9 AM, 10 AM, and 3 PM (Singapore time); 4 AM and 5 AM are too early to be practical.
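A fixed 13-hour offset matches US Eastern Standard Time only. As a hedged sketch, the standard-library `zoneinfo` module (Python 3.9+) can perform the conversion with daylight saving time taken into account. The helper name `to_singapore` and the two example dates are assumptions chosen for illustration.

```python
import datetime as dt
from zoneinfo import ZoneInfo

eastern = ZoneInfo('America/New_York')
singapore = ZoneInfo('Asia/Singapore')

def to_singapore(naive_eastern):
    """Interpret a naive datetime as US Eastern Time and convert it to Singapore time."""
    return naive_eastern.replace(tzinfo=eastern).astimezone(singapore)

# 3 PM Eastern on a winter date (standard time) and a summer date (daylight saving time)
winter = to_singapore(dt.datetime(2016, 1, 15, 15, 0))
summer = to_singapore(dt.datetime(2016, 7, 15, 15, 0))
print(winter.strftime('%H:%M'), summer.strftime('%H:%M'))
# prints: 04:00 03:00
```

Because the dataset spans both standard and daylight saving periods, the Singapore-time hours reported with a single offset are approximate by up to one hour.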
# Find average ask posts points
ask_posts['num_points'].mean().round(2)
15.06
# Find average show posts points
show_posts['num_points'].mean().round(2)
27.56
# Plot graph
sns.barplot(data=df[df['category'].isin(['ask posts', 'show posts'])], x='category', y='num_points', estimator=np.mean, errorbar=None)
plt.title('Mean Number of Points By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Points')
plt.show()
Show HN posts receive more points on average (27.56 points) compared to Ask HN posts (15.06 points). Recall that Ask HN posts have more comments on average.
This suggests that users are more likely to give points (similar to a "like") to Show HN posts, whereas they are more likely to help out on Ask HN posts by commenting.
We will further explore Show HN posts and their points for a deeper analysis.
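The contrast between points and comments can be summarized per category in one table. The sketch below uses a hypothetical `toy` frame with made-up values, and the `comments_per_point` column is an illustrative measure that does not appear in the original analysis.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'category': ['ask posts', 'ask posts', 'show posts', 'show posts'],
    'num_comments': [20, 10, 4, 6],
    'num_points': [12, 18, 30, 20],
})

# Mean comments and mean points side by side for each category
summary = toy.groupby('category')[['num_comments', 'num_points']].mean()

# Illustrative ratio: how many comments a category draws per point received
summary['comments_per_point'] = summary['num_comments'] / summary['num_points']
print(summary)
```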
# Extract the hour from `created_at`
show_posts['created_hour'] = show_posts.apply(get_hour, axis=1)
# Show frequency of show posts average points by hour
show_posts_average_points_by_hour = show_posts.groupby(by=['created_hour'])['num_points'].mean()
show_posts_average_points_by_hour.round(2)
created_hour
00    37.84
01    25.00
02    11.33
03    25.15
04    14.85
05     5.47
06    23.44
07    19.00
08    15.26
09    18.43
10    18.92
11    33.64
12    41.69
13    24.63
14    25.43
15    28.56
16    28.32
17    27.11
18    36.31
19    30.95
20    30.32
21    18.43
22    40.35
23    42.39
Name: num_points, dtype: float64
# Plot graph
palette = set_max_min_palette(show_posts_average_points_by_hour)
sns.barplot(x=show_posts_average_points_by_hour.index, y=show_posts_average_points_by_hour.values, palette=palette)
plt.title('Mean Number of Points Per Show HN Posts By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Number of Points')
plt.show()
The highest mean number of points per Show HN post occurs at 11 PM and the lowest at 5 AM (US Eastern Time).
With a large disparity between the maximum and minimum average points of Show HN posts (almost 8x), it is worth seeing which hours attract the most points.
# Show top five show posts average points by hour
top_five_hours_show_points = show_posts_average_points_by_hour.sort_values(ascending=False)[:5]
top_five_hours_show_points.round(2)
created_hour
23    42.39
12    41.69
22    40.35
00    37.84
18    36.31
Name: num_points, dtype: float64
# Print in this format
# 23:00: 42.39 average points per post
print('---- Top 5 hours for Show Posts Points -----')
for index, value in top_five_hours_show_points.items():
    time_t = dt.time(hour=int(index))
    time_str = time_t.strftime('%H:%M')
    print(f'{time_str}: {value:.2f} average points per post')
---- Top 5 hours for Show Posts Points -----
23:00: 42.39 average points per post
12:00: 41.69 average points per post
22:00: 40.35 average points per post
00:00: 37.84 average points per post
18:00: 36.31 average points per post
# Plot graph
palette = set_top_values_palette(show_posts_average_points_by_hour, 5)
sns.barplot(x=show_posts_average_points_by_hour.index, y=show_posts_average_points_by_hour.values, palette=palette)
plt.title('Mean Number of Points Per Show HN Posts By Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Mean Number of Points')
plt.show()
# Find average other posts comments
other_posts['num_comments'].mean().round(2)
26.87
# Plot graph
sns.barplot(data=df, x='category', y='num_comments', estimator=np.mean, errorbar=None)
plt.title('Mean Number of Comments By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Comments')
plt.show()
# Find average other posts points
other_posts['num_points'].mean().round(2)
55.41
# Plot graph
sns.barplot(data=df, x='category', y='num_points', estimator=np.mean, errorbar=None)
plt.title('Mean Number of Points By Categories')
plt.xlabel('Categories')
plt.ylabel('Number of Points')
plt.show()
Other posts have a much higher average number of comments and points compared to Ask HN and Show HN posts.
- Ask HN posts take up 8.7%, Show HN posts take up 5.8%, and other posts take up 85.5% of all posts.
- Ask HN posts have 24,483 comments, while Show HN posts have 11,988 comments.
- Ask HN posts have 14 comments on average, Show HN posts have 10 comments on average, and other posts have 27 comments on average.
- Ask HN posts: most posts at 3 PM and fewest posts at 7 AM (US Eastern Time).
- Ask HN comments: most comments at 3 PM and fewest comments at 9 AM (US Eastern Time).
- Ask HN average comments per post: highest at 3 PM and lowest at 9 AM (US Eastern Time).
- Ask HN comments, top five hours by average comments per post: 3 PM, 2 AM, 8 PM, 4 PM, 9 PM (US Eastern Time), or 4 AM, 3 PM, 9 AM, 5 AM, 10 AM (Singapore Time).
- Ask HN posts have 15.06 points on average, Show HN posts have 27.56 points on average, and other posts have 55.41 points on average.
- Show HN average points per post: highest at 11 PM and lowest at 5 AM (US Eastern Time).
- Show HN points, top five hours by average points per post: 11 PM, 12 PM, 10 PM, 12 AM, 6 PM (US Eastern Time).

Perhaps the project can be extended to examine which of the other posts do better in this regard, possibly taking into account the topic and the number of words in the title of the posts.
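As a starting point for the suggested extension, the number of words in each title can be derived with pandas string methods. This is only a sketch on a hypothetical `toy` frame; the `title_words` column name is an assumption, and splitting on whitespace is a simplistic definition of a word.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'title': ['Ask HN: How do you learn?', 'Show HN: My side project', 'Interactive Dynamic Video'],
    'num_comments': [12, 4, 52],
})

# Split each title on whitespace and count the resulting tokens
toy['title_words'] = toy['title'].str.split().str.len()
print(toy[['title_words', 'num_comments']])
```

The new column could then be grouped or binned in the same way as `created_hour` to compare average comments and points across title lengths.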