Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that reach the top of the Hacker News listings can get hundreds of thousands of visitors as a result.
You can find the dataset here, but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows, in part by removing all submissions that didn't receive any comments. Below are the descriptions of the columns:
We're specifically interested in posts whose titles begin with either Ask HN or Show HN.
We now import the dataset and display the first ten rows.
import pandas as pd
hacker_news = pd.read_csv('hacker_news.csv')
hacker_news.head(10)
| | id | title | url | num_points | num_comments | author | created_at |
---|---|---|---|---|---|---|---|
0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
1 | 10975351 | How to Use Open Source and Shut the Fuck Up at... | http://hueniverse.com/2016/01/26/how-to-use-op... | 39 | 10 | josep2 | 1/26/2016 19:30 |
2 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
3 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
4 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
5 | 10482257 | Title II kills investment? Comcast and other I... | http://arstechnica.com/business/2015/10/comcas... | 53 | 22 | Deinos | 10/31/2015 9:48 |
6 | 10557283 | Nuts and Bolts Business Advice | NaN | 3 | 4 | shomberj | 11/13/2015 0:45 |
7 | 12296411 | Ask HN: How to improve my personal website? | NaN | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 |
8 | 11337617 | Shims, Jigs and Other Woodworking Concepts to ... | http://firstround.com/review/shims-jigs-and-ot... | 34 | 7 | zt | 3/22/2016 16:18 |
9 | 10379326 | That self-appendectomy | http://www.southpolestation.com/trivia/igy1/ap... | 91 | 10 | jimsojim | 10/13/2015 9:30 |
hacker_news.describe()
| | id | num_points | num_comments |
---|---|---|---|
count | 2.010000e+04 | 20100.000000 | 20100.000000 |
mean | 1.131753e+07 | 50.296070 | 24.802289 |
std | 6.964399e+05 | 107.107687 | 56.107340 |
min | 1.017691e+07 | 1.000000 | 1.000000 |
25% | 1.070176e+07 | 3.000000 | 1.000000 |
50% | 1.128445e+07 | 9.000000 | 3.000000 |
75% | 1.192607e+07 | 54.000000 | 21.000000 |
max | 1.257898e+07 | 2553.000000 | 1733.000000 |
hacker_news.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20100 entries, 0 to 20099
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   id            20100 non-null  int64
 1   title         20100 non-null  object
 2   url           17660 non-null  object
 3   num_points    20100 non-null  int64
 4   num_comments  20100 non-null  int64
 5   author        20100 non-null  object
 6   created_at    20100 non-null  object
dtypes: int64(3), object(4)
memory usage: 1.1+ MB
hacker_news.count()
id              20100
title           20100
url             17660
num_points      20100
num_comments    20100
author          20100
created_at      20100
dtype: int64
hacker_news['DateTime'] = pd.to_datetime(hacker_news['created_at'], errors='coerce')
hacker_news['Date'] = hacker_news['DateTime'].dt.date
hacker_news['Time'] = hacker_news['DateTime'].dt.time
hacker_news['h'] = hacker_news['DateTime'].dt.hour  # hour of creation, 0-23
hacker_news['m'] = hacker_news['DateTime'].dt.minute
hacker_news['s'] = hacker_news['DateTime'].dt.second
Now that we've seen the data, let's sort the posts into three categories: ask_posts, show_posts, and other_posts.
These categories will help us see which types of posts are more common.
ask_post = hacker_news["title"].str.contains("Ask HN")
ASK_POST = len(hacker_news[ask_post])
ASK_POST
1742
show_post = hacker_news["title"].str.contains("Show HN")
SHOW_POST = len(hacker_news[show_post])
SHOW_POST
1169
other_posts = len(hacker_news["title"]) - (SHOW_POST + ASK_POST )
other_posts
17189
We can see that there are more Ask HN posts than Show HN posts.
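The filters above use str.contains, which matches "Ask HN" or "Show HN" anywhere in a title and is case-sensitive. Since the stated interest is in titles that begin with those prefixes, a stricter variant is sketched below. The titles are made up for illustration, and this filter may give slightly different counts on the real data than the ones shown above.

```python
import pandas as pd

# Made-up titles for illustration; the real ones live in hacker_news["title"].
titles = pd.Series([
    "Ask HN: How to improve my personal website?",
    "Show HN: A tiny static site generator",
    "ask hn: lowercase prefix",
    "Why I stopped reading Ask HN threads",
    "Interactive Dynamic Video",
])

# Lowercase first so the match is case-insensitive, then test the prefix only.
lowered = titles.str.lower()
is_ask = lowered.str.startswith("ask hn")
is_show = lowered.str.startswith("show hn")
is_other = ~(is_ask | is_show)

counts = {
    "ask": int(is_ask.sum()),
    "show": int(is_show.sum()),
    "other": int(is_other.sum()),
}
print(counts)
```

The fourth title mentions Ask HN in the middle, so a contains-based filter would count it as an ask post while the prefix test does not.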
At this point, we have counted the Ask HN and Show HN posts, all of which received at least one comment. We can now determine whether Ask HN or Show HN posts receive more comments on average.
ask = hacker_news[ask_post]
avg_ask_comments = ask['num_comments'].sum()/ASK_POST
Show = hacker_news[show_post]
avg_show_comments = Show['num_comments'].sum()/SHOW_POST
template = "Average number of comments for {name}: {avg: ,.5f}"
print(template.format(name = "ask hn posts", avg = avg_ask_comments))
print(template.format(name = "show hn posts", avg = avg_show_comments))
Average number of comments for ask hn posts: 14.04478
Average number of comments for show hn posts: 10.29170
Ask HN posts receive about 14 comments on average, roughly 36% more than the 10 or so that Show HN posts receive. The rest of our analysis therefore focuses on Ask HN posts, to find out which times are best for creating them.
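The relative gap between the two averages can be checked with a quick calculation. The two values below are copied from the printed output; nothing else is assumed.

```python
# Averages as printed above.
avg_ask_comments = 14.04478
avg_show_comments = 10.29170

# How much larger the Ask HN average is, as a fraction of the Show HN average.
relative_diff = (avg_ask_comments - avg_show_comments) / avg_show_comments
print(f"Ask HN posts receive {relative_diff:.1%} more comments on average")
```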
To do this, we calculate the number of Ask HN posts created in each hour, along with the number of comments they received, and then the average number of comments per post for each hour of creation. The hour is already available in the h column, which was derived from created_at using pd.to_datetime(). The same result can be obtained with the datetime module: datetime.strptime() parses a date stored as a string into a datetime object, and the strftime() method then formats that object as an hour.
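The hourly steps described above can be sketched with the standard library alone. This is an illustration and not the notebook's own method, which relies on pandas columns. The format string is an assumption based on how created_at appears in the preview, and the last sample row is made up so that one hour has two posts.

```python
from collections import defaultdict
from datetime import datetime

# (created_at, num_comments) pairs in the same layout as the dataset.
rows = [
    ("8/16/2016 9:55", 6),
    ("11/22/2015 13:43", 29),
    ("5/2/2016 10:14", 1),
    ("1/1/2016 9:05", 10),  # made-up row so that hour 09 has two posts
]

DATE_FORMAT = "%m/%d/%Y %H:%M"  # assumed layout of created_at

posts_by_hour = defaultdict(int)
comments_by_hour = defaultdict(int)

for created_at, num_comments in rows:
    parsed = datetime.strptime(created_at, DATE_FORMAT)  # string -> datetime
    hour = parsed.strftime("%H")                         # datetime -> "09"
    posts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

# Average comments per post for each hour of creation.
avg_by_hour = {
    hour: comments_by_hour[hour] / posts_by_hour[hour]
    for hour in sorted(posts_by_hour)
}

for hour, avg in avg_by_hour.items():
    print(f"{hour}:00  {avg:.2f} average comments per post")
```

Dividing total comments by the number of posts in each hour matters because a busy hour can have a large comment total simply because more posts were created then.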
ask.head()
| | id | title | url | num_points | num_comments | author | created_at | DateTime | Date | Time | h | m | s |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 12296411 | Ask HN: How to improve my personal website? | NaN | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 | 2016-08-16 09:55:00 | 2016-08-16 | 09:55:00 | 9 | 55 | 0 |
17 | 10610020 | Ask HN: Am I the only one outraged by Twitter ... | NaN | 28 | 29 | tkfx | 11/22/2015 13:43 | 2015-11-22 13:43:00 | 2015-11-22 | 13:43:00 | 13 | 43 | 0 |
22 | 11610310 | Ask HN: Aby recent changes to CSS that broke m... | NaN | 1 | 1 | polskibus | 5/2/2016 10:14 | 2016-05-02 10:14:00 | 2016-05-02 | 10:14:00 | 10 | 14 | 0 |
30 | 12210105 | Ask HN: Looking for Employee #3 How do I do it? | NaN | 1 | 3 | sph130 | 8/2/2016 14:20 | 2016-08-02 14:20:00 | 2016-08-02 | 14:20:00 | 14 | 20 | 0 |
31 | 10394168 | Ask HN: Someone offered to buy my browser exte... | NaN | 28 | 17 | roykolak | 10/15/2015 16:38 | 2015-10-15 16:38:00 | 2015-10-15 | 16:38:00 | 16 | 38 | 0 |
dff = ask.groupby('h').num_points.sum().reset_index()  # total points of Ask HN posts by hour of creation
dff
| | h | num_points |
---|---|---|
0 | 0 | 449 |
1 | 1 | 700 |
2 | 2 | 793 |
3 | 3 | 374 |
4 | 4 | 389 |
5 | 5 | 552 |
6 | 6 | 591 |
7 | 7 | 361 |
8 | 8 | 515 |
9 | 9 | 329 |
10 | 10 | 1102 |
11 | 11 | 825 |
12 | 12 | 782 |
13 | 13 | 2062 |
14 | 14 | 1282 |
15 | 15 | 3479 |
16 | 16 | 2522 |
17 | 17 | 1941 |
18 | 18 | 1739 |
19 | 19 | 1513 |
20 | 20 | 1151 |
21 | 21 | 1721 |
22 | 22 | 511 |
23 | 23 | 581 |
Count_by_hour = dff = ask.groupby('h').num_comments.sum().reset_index()  # total comments by hour; note that dff is rebound here
Count_by_hour
| | h | num_comments |
---|---|---|
0 | 0 | 439 |
1 | 1 | 683 |
2 | 2 | 1381 |
3 | 3 | 421 |
4 | 4 | 337 |
5 | 5 | 464 |
6 | 6 | 397 |
7 | 7 | 267 |
8 | 8 | 492 |
9 | 9 | 251 |
10 | 10 | 793 |
11 | 11 | 641 |
12 | 12 | 687 |
13 | 13 | 1253 |
14 | 14 | 1416 |
15 | 15 | 4477 |
16 | 16 | 1814 |
17 | 17 | 1146 |
18 | 18 | 1430 |
19 | 19 | 1188 |
20 | 20 | 1722 |
21 | 21 | 1745 |
22 | 22 | 479 |
23 | 23 | 543 |