Y Combinator is a startup incubator that hosts a site called Hacker News. The site contains posts (called stories on Hacker News) submitted by users, which other users can vote or comment on. The site is very popular in start-up circles, so posts that collect many upvotes tend to attract a large audience.
Among the various post types, Hacker News (HN) has special sections for two in particular: Ask HN and Show HN.
A post whose title is prefixed with Ask HN poses a question to the HN community. A post whose title is prefixed with Show HN exhibits a project the user has built, often to get the community's opinion on it.
The goal of this project is to compare Ask HN and Show HN posts. The focus will be on:
The full dataset for the project can be found here. The explanation of every column has been provided in the link.
Note that while the full dataset contains nearly 300,000 records, this project uses a sample of about 20,000 records.
The dataset used here was created by Dataquest (DQ) for learning purposes. It was built by removing all posts that had no comments and then randomly sampling 20,000 records from the remainder.
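The two preparation steps described above (drop posts without comments, then sample at random) can be sketched in pandas. This is only an illustration on a small made-up frame, not the procedure Dataquest actually ran; the frame `full`, its values, and the `random_state` seed are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset (the real one has ~300,000 rows).
full = pd.DataFrame({
    "id": range(1, 11),
    "num_comments": [0, 3, 0, 12, 1, 0, 7, 0, 25, 4],
})

# Step 1: keep only posts that received at least one comment.
commented = full[full["num_comments"] > 0]

# Step 2: draw a random sample (4 rows here; 20,000 in the description above).
# random_state is an assumption, added only to make the sketch repeatable.
sample = commented.sample(n=4, random_state=0)

print(len(commented), len(sample))
```

Because uncommented posts are removed before sampling, any average number of comments computed on the sample will be higher than it would be across all Hacker News posts.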
import pandas as pd
import seaborn as sns
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
#Read the dataset
hn = pd.read_csv("hacker_news.csv")
hn[:10]
| | id | title | url | num_points | num_comments | author | created_at |
---|---|---|---|---|---|---|---|
0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
1 | 10975351 | How to Use Open Source and Shut the Fuck Up at... | http://hueniverse.com/2016/01/26/how-to-use-op... | 39 | 10 | josep2 | 1/26/2016 19:30 |
2 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
3 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
4 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
5 | 10482257 | Title II kills investment? Comcast and other I... | http://arstechnica.com/business/2015/10/comcas... | 53 | 22 | Deinos | 10/31/2015 9:48 |
6 | 10557283 | Nuts and Bolts Business Advice | NaN | 3 | 4 | shomberj | 11/13/2015 0:45 |
7 | 12296411 | Ask HN: How to improve my personal website? | NaN | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 |
8 | 11337617 | Shims, Jigs and Other Woodworking Concepts to ... | http://firstround.com/review/shims-jigs-and-ot... | 34 | 7 | zt | 3/22/2016 16:18 |
9 | 10379326 | That self-appendectomy | http://www.southpolestation.com/trivia/igy1/ap... | 91 | 10 | jimsojim | 10/13/2015 9:30 |
The id column is only a unique identifier and the url column has no relevance to our analysis, so both can be removed.
#Remove the ID and URL columns
hn = hn.drop(["id",'url'],axis = 1)
hn[:15]
| | title | num_points | num_comments | author | created_at |
---|---|---|---|---|---|
0 | Interactive Dynamic Video | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
1 | How to Use Open Source and Shut the Fuck Up at... | 39 | 10 | josep2 | 1/26/2016 19:30 |
2 | Florida DJs May Face Felony for April Fools' W... | 2 | 1 | vezycash | 6/23/2016 22:20 |
3 | Technology ventures: From Idea to Enterprise | 3 | 1 | hswarna | 6/17/2016 0:01 |
4 | Note by Note: The Making of Steinway L1037 (2007) | 8 | 2 | walterbell | 9/30/2015 4:12 |
5 | Title II kills investment? Comcast and other I... | 53 | 22 | Deinos | 10/31/2015 9:48 |
6 | Nuts and Bolts Business Advice | 3 | 4 | shomberj | 11/13/2015 0:45 |
7 | Ask HN: How to improve my personal website? | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 |
8 | Shims, Jigs and Other Woodworking Concepts to ... | 34 | 7 | zt | 3/22/2016 16:18 |
9 | That self-appendectomy | 91 | 10 | jimsojim | 10/13/2015 9:30 |
10 | Crate raises $4M seed round for its next-gen S... | 3 | 1 | hitekker | 3/27/2016 18:08 |
11 | Advertising Cannot Maintain the Internet. Here... | 2 | 1 | dredmorbius | 5/10/2016 4:46 |
12 | Coding Is Over | 18 | 14 | prostoalex | 6/26/2016 16:36 |
13 | Show HN: Wio Link ESP8266 Based Web of Things... | 26 | 22 | kfihihc | 11/25/2015 14:03 |
14 | Custom Deleters for C++ Smart Pointers | 59 | 18 | ingve | 4/28/2016 10:01 |
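The claim that id is unique can be checked before the column is discarded. The sketch below does this on a small made-up frame; the `toy` frame and its example.com URLs are placeholders, not rows from the real file.

```python
import pandas as pd

# Made-up rows; the URLs are placeholders for illustration only.
toy = pd.DataFrame({
    "id": [12224879, 10975351, 11964716],
    "title": ["Interactive Dynamic Video", "Coding Is Over", "That self-appendectomy"],
    "url": ["https://example.com/a", "https://example.com/b", None],
})

# True when no id value is repeated.
id_is_unique = toy["id"].is_unique

# drop(columns=...) is equivalent to drop([...], axis=1).
trimmed = toy.drop(columns=["id", "url"])

print(id_is_unique, list(trimmed.columns))
```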
Based on the discussion above, we need to classify the posts by type. Once the posts are segregated, we can calculate the average and total number of comments and points for each post type.
#Set the title column to lower case
hn["title"] = hn["title"].str.lower()
#Identify the type of each post
hn.loc[hn["title"].str.startswith("ask hn"), "type"]="ask"
hn.loc[hn["title"].str.startswith("show hn"), "type"]="show"
hn.loc[hn["type"].isnull(),"type"]="others"
hn[:10]
| | title | num_points | num_comments | author | created_at | type |
---|---|---|---|---|---|---|
0 | interactive dynamic video | 386 | 52 | ne0phyte | 8/4/2016 11:52 | others |
1 | how to use open source and shut the fuck up at... | 39 | 10 | josep2 | 1/26/2016 19:30 | others |
2 | florida djs may face felony for april fools' w... | 2 | 1 | vezycash | 6/23/2016 22:20 | others |
3 | technology ventures: from idea to enterprise | 3 | 1 | hswarna | 6/17/2016 0:01 | others |
4 | note by note: the making of steinway l1037 (2007) | 8 | 2 | walterbell | 9/30/2015 4:12 | others |
5 | title ii kills investment? comcast and other i... | 53 | 22 | Deinos | 10/31/2015 9:48 | others |
6 | nuts and bolts business advice | 3 | 4 | shomberj | 11/13/2015 0:45 | others |
7 | ask hn: how to improve my personal website? | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 | ask |
8 | shims, jigs and other woodworking concepts to ... | 34 | 7 | zt | 3/22/2016 16:18 | others |
9 | that self-appendectomy | 91 | 10 | jimsojim | 10/13/2015 9:30 | others |
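The same classification can also be written as a small function applied to each title, which leaves the original capitalisation of the title column untouched. This is an alternative sketch, not the approach used above; `post_type` is a hypothetical helper and the Show HN title is made up.

```python
import pandas as pd

def post_type(title: str) -> str:
    """Return 'ask', 'show' or 'others' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "others"

titles = pd.Series([
    "Ask HN: How to improve my personal website?",
    "Show HN: My weekend project",  # made-up title
    "Coding Is Over",
])

# map() calls post_type once per title and returns a new Series.
types = titles.map(post_type)
print(types.tolist())
```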
Now that we have segregated the posts by type, we can find the average number of comments and average points by post type.
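#Calculate the average number of comments and points by post type
#Select the numeric columns explicitly; recent pandas versions raise an error when averaging text columns
new_hn_avg = hn.groupby(by="type")[["num_points","num_comments"]].mean().round(2).reset_index()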
new_hn_avg.rename(columns = {"num_points":"avg_points", "num_comments":"avg_num_comments"},inplace=True)
new_hn_avg
| | type | avg_points | avg_num_comments |
---|---|---|---|
0 | ask | 15.06 | 14.04 |
1 | others | 55.41 | 26.87 |
2 | show | 27.56 | 10.32 |
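We can also find the total number of comments and the total number of points by post type.
#Calculate the total number of comments and points by post type
#Select the numeric columns explicitly so that text columns are not summed (concatenated)
new_hn_total = hn.groupby(by="type")[["num_points","num_comments"]].sum().reset_index()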
new_hn_total.rename(columns = {"num_points":"total_points","num_comments":"total_num_comments"},inplace=True)
new_hn_total
| | type | total_points | total_num_comments |
---|---|---|---|
0 | ask | 26268 | 24483 |
1 | others | 952664 | 462055 |
2 | show | 32019 | 11988 |
#Combine the aggregated data
joined_data = pd.merge(new_hn_avg,new_hn_total,on = "type")
With the combined data, we can view everything in a single table and also analyse it with graphs.
joined_data["avg_points"] = joined_data["avg_points"].round(2)
joined_data["avg_num_comments"] = joined_data["avg_num_comments"].round(2)
joined_data
| | type | avg_points | avg_num_comments | total_points | total_num_comments |
---|---|---|---|---|---|
0 | ask | 15.06 | 14.04 | 26268 | 24483 |
1 | others | 55.41 | 26.87 | 952664 | 462055 |
2 | show | 27.56 | 10.32 | 32019 | 11988 |
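The averages and totals can also be produced in one pass with pandas named aggregation, which avoids the separate rename and merge steps. The sketch below runs on a small made-up frame (`toy` and its values are assumptions); the output column names mirror the ones used in the table above.

```python
import pandas as pd

# Made-up posts, already labelled by type.
toy = pd.DataFrame({
    "type": ["ask", "ask", "show", "show", "others"],
    "num_points": [2, 10, 26, 4, 386],
    "num_comments": [6, 4, 22, 2, 52],
})

# Each keyword names an output column: (input column, aggregation).
summary = (
    toy.groupby("type")
       .agg(
           avg_points=("num_points", "mean"),
           avg_num_comments=("num_comments", "mean"),
           total_points=("num_points", "sum"),
           total_num_comments=("num_comments", "sum"),
       )
       .round(2)
       .reset_index()
)

print(summary)
```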
def gen_barplot(x_axis,y_axis,title):
    """
    Generate a horizontal barplot of joined_data with a fixed style
    Args:
        x_axis (string): Column name of joined_data whose values set the bar lengths
        y_axis (string): Column name of joined_data whose values label the bars
        title (string): Name of the plot
    """
    #Store the Axes in ax rather than plt so that the pyplot module is not shadowed
    ax = sns.barplot(data = joined_data, x = x_axis, y = y_axis, order = ['ask','show','others'])
    sns.despine(left = True, top=False, bottom=True)
    ax.set_title(title, size = 17)
    ax.set_xlabel(None)
    ax.set_ylabel(None)
    ax.tick_params(left=False, labelsize=13)
    ax.xaxis.tick_top()
    #Render this figure so that the next call starts on a fresh one
    plt.show()
gen_barplot(x_axis = 'avg_points', y_axis = 'type', title = 'Average Number of Points per Post Type')
gen_barplot(x_axis = 'avg_num_comments', y_axis = 'type', title = 'Average Number of Comments per Post Type')
The analysis above shows that show posts earn more points on average (27.56 against 15.06), while ask posts draw more comments on average (14.04 against 10.32). A question invites many people to attempt an answer, and questions can grow into discussions, which increases the number of comments. Some questions are also common or helpful to a wide audience, which earns them upvotes.
Show posts, on the other hand, do not usually call for a deep discussion. Instead they seek feedback, which is more easily given as points, possibly with a few additional comments.
Each post has a record of the date and time at which it was created. Analyzing this time could reveal when a question should be asked to receive the most answers. More specifically, we could find the hour of the day during which a question receives the most comments.
To get this information, we first extract the hour from the creation time and then summarize the number of comments on ask posts by hour.
#Extract the hour from the date-time detail.
hn["hour"] = pd.to_datetime(hn["created_at"],format = "%m/%d/%Y %H:%M")
hn["hour"] = hn["hour"].dt.hour
hn[:10]
| | title | num_points | num_comments | author | created_at | type | hour |
---|---|---|---|---|---|---|---|
0 | interactive dynamic video | 386 | 52 | ne0phyte | 8/4/2016 11:52 | others | 11 |
1 | how to use open source and shut the fuck up at... | 39 | 10 | josep2 | 1/26/2016 19:30 | others | 19 |
2 | florida djs may face felony for april fools' w... | 2 | 1 | vezycash | 6/23/2016 22:20 | others | 22 |
3 | technology ventures: from idea to enterprise | 3 | 1 | hswarna | 6/17/2016 0:01 | others | 0 |
4 | note by note: the making of steinway l1037 (2007) | 8 | 2 | walterbell | 9/30/2015 4:12 | others | 4 |
5 | title ii kills investment? comcast and other i... | 53 | 22 | Deinos | 10/31/2015 9:48 | others | 9 |
6 | nuts and bolts business advice | 3 | 4 | shomberj | 11/13/2015 0:45 | others | 0 |
7 | ask hn: how to improve my personal website? | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 | ask | 9 |
8 | shims, jigs and other woodworking concepts to ... | 34 | 7 | zt | 3/22/2016 16:18 | others | 16 |
9 | that self-appendectomy | 91 | 10 | jimsojim | 10/13/2015 9:30 | others | 9 |
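The hour extraction can be checked on a few of the created_at strings shown in the table above. The format string is the one used in the notebook; the small Series `created` is only an illustration.

```python
import pandas as pd

# Three created_at values taken from the preview table.
created = pd.Series(["8/4/2016 11:52", "6/17/2016 0:01", "1/26/2016 19:30"])

# Month/day/year followed by a 24-hour time, as in the dataset.
parsed = pd.to_datetime(created, format="%m/%d/%Y %H:%M")

# The .dt accessor exposes datetime components of a Series.
hours = parsed.dt.hour
print(hours.tolist())  # [11, 0, 19]
```

The dataset preview does not show a time zone, so the hours are whatever zone the created_at values were recorded in.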
#Average the number of comments on Ask posts by hour
ask_posts = hn[hn["type"] == 'ask'][["num_comments","hour"]]
ask_posts = ask_posts.groupby("hour").mean().round().reset_index()
print("\033[1m"+"Average Number of Comments for Ask Posts by Hour"+"\033[0m")
ask_posts
Average Number of Comments for Ask Posts by Hour
| | hour | num_comments |
---|---|---|
0 | 0 | 8.0 |
1 | 1 | 11.0 |
2 | 2 | 24.0 |
3 | 3 | 8.0 |
4 | 4 | 7.0 |
5 | 5 | 10.0 |
6 | 6 | 9.0 |
7 | 7 | 8.0 |
8 | 8 | 10.0 |
9 | 9 | 6.0 |
10 | 10 | 13.0 |
11 | 11 | 11.0 |
12 | 12 | 9.0 |
13 | 13 | 15.0 |
14 | 14 | 13.0 |
15 | 15 | 39.0 |
16 | 16 | 17.0 |
17 | 17 | 11.0 |
18 | 18 | 13.0 |
19 | 19 | 11.0 |
20 | 20 | 22.0 |
21 | 21 | 16.0 |
22 | 22 | 7.0 |
23 | 23 | 8.0 |
fig = plt.figure(figsize=(7,7))
sns.set(style = "white")
#Store the Axes in ax so that the plt module is not overwritten
ax = sns.lineplot(data = ask_posts, x = 'hour', y = 'num_comments', marker = 'o', color='green')
sns.despine(left = True,bottom = True, top=True)
ax.set_title("Average Comments per Ask Post by Hour", size=18)
ax.set_xlabel(None)
ax.set_ylabel(None)
ax.xaxis.tick_top()
ax.set_xticks([0,2,4,6,8,10,12,14,16,18,20,22])
ax.tick_params(axis='x',top=False,labelsize=16)
ax.tick_params(axis='y',top=False,labelsize=13)
The analysis above shows that questions asked during the 15th hour of the day, between 3:00 and 4:00 PM, attract the most comments, at about 39 per post on average.
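The peak hour can also be read off programmatically instead of by eye. The sketch below hard-codes the 24 rounded hourly averages from the table above into a Series (the name `avg_comments` is an assumption) and picks out the highest values.

```python
import pandas as pd

# Rounded average comments per ask post for hours 0..23, copied from the table.
avg_comments = pd.Series(
    [8, 11, 24, 8, 7, 10, 9, 8, 10, 6, 13, 11,
     9, 15, 13, 39, 17, 11, 13, 11, 22, 16, 7, 8],
    index=pd.RangeIndex(24, name="hour"),
    name="num_comments",
)

# idxmax returns the index label (the hour) of the largest value.
peak_hour = avg_comments.idxmax()

# nlargest keeps the three highest averages with their hours.
top_three = avg_comments.nlargest(3)

print(peak_hour)
print(top_three)
```

Since these are per-hour averages, an hour with few ask posts could be dominated by one heavily commented question, so counting the posts per hour alongside the mean would be a reasonable extra check.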
In the same vein, it would be interesting to know the impact of the time of day on the points allotted. We saw earlier that show posts tend to have more points on average than ask posts, although the total points for show and ask posts are not too far apart. For this reason we shall compare the average number of points per hour for all posts.
#Average the post points by hour
most_points = hn[["num_points","hour"]]
most_points = most_points.groupby("hour").mean().round().reset_index()
print("\033[1m"+"Average Number of Points for all Posts by Hour"+"\033[0m")
most_points
Average Number of Points for all Posts by Hour
| | hour | num_points |
---|---|---|
0 | 0 | 54.0 |
1 | 1 | 45.0 |
2 | 2 | 51.0 |
3 | 3 | 50.0 |
4 | 4 | 44.0 |
5 | 5 | 44.0 |
6 | 6 | 42.0 |
7 | 7 | 52.0 |
8 | 8 | 48.0 |
9 | 9 | 49.0 |
10 | 10 | 55.0 |
11 | 11 | 53.0 |
12 | 12 | 53.0 |
13 | 13 | 56.0 |
14 | 14 | 54.0 |
15 | 15 | 56.0 |
16 | 16 | 50.0 |
17 | 17 | 53.0 |
18 | 18 | 50.0 |
19 | 19 | 54.0 |
20 | 20 | 42.0 |
21 | 21 | 44.0 |
22 | 22 | 46.0 |
23 | 23 | 48.0 |
from matplotlib import pyplot as plt
# Set size of layout
fig = plt.figure(figsize=(7,7))
ax1 = fig.add_subplot(111)
# Set the line plot, drawing explicitly on ax1
sns.lineplot(data = most_points, x = 'hour', y = 'num_points', marker = 'o', ax = ax1)
#Clean up the plot
sns.despine(left = True, bottom = True)
ax1.set_title("Average Points for All Posts by Hour", size=18)
ax1.set_xlabel(None)
ax1.set_ylabel(None)
ax1.xaxis.tick_top()
ax1.tick_params(axis='x',top=False,labelsize=16)
ax1.tick_params(axis='y',top=False,labelsize=13)
ax1.set_xticks([0,2,4,6,8,10,12,14,16,18,20,22])
ax1.set_yticks([40,45,50,55])
plt.show()
As can be seen above, the average points per post at any hour of the day range between 42 and 56. We cannot definitively say that the time of day has a significant impact on the points allotted, because the averages do not vary much over the course of the day and because the points a post receives depend on the kind of post that has been put up.
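How much the hourly averages vary can be put into numbers using the values in the table above. The sketch below hard-codes those 24 rounded averages (the name `avg_points` is an assumption) and computes the lowest and highest value and the spread relative to the overall mean.

```python
import pandas as pd

# Rounded average points per post for hours 0..23, copied from the table.
avg_points = pd.Series(
    [54, 45, 51, 50, 44, 44, 42, 52, 48, 49, 55, 53,
     53, 56, 54, 56, 50, 53, 50, 54, 42, 44, 46, 48],
    index=pd.RangeIndex(24, name="hour"),
    name="num_points",
)

lowest = avg_points.min()
highest = avg_points.max()
spread = highest - lowest

# Spread expressed as a fraction of the mean hourly average.
relative_spread = spread / avg_points.mean()

print(lowest, highest, spread, round(relative_spread, 2))
```

This is only a descriptive summary of the table; it is not a statistical test of whether the hour affects points.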
The goal of this project was to analyze a sample of 20,000 posts from the Hacker News site run by Y Combinator and deduce the impact of the type of post on the number of comments from users and the number of points allotted by users. In addition, we analysed the impact of the time of day on the number of comments put up by users and the number of points allotted by readers.
Ask posts tend to get more comments, while Show posts tend to get more points in comparison. Questions posed between 3 and 4 PM tend to get more replies than at any other time of the day. However, the average points per post over the course of the day stay between 42 and 56.