import pandas as pd
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/mashable.csv'
df = pd.read_csv(url, index_col=0)
df.head()
| | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | Popular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2014/12/10/cia-torture-rep... | 28.0 | 9.0 | 188.0 | 0.732620 | 1.0 | 0.844262 | 5.0 | 1.0 | 1.0 | ... | 0.200000 | 0.80 | -0.487500 | -0.60 | -0.250000 | 0.9 | 0.8 | 0.4 | 0.8 | 1 |
| 1 | http://mashable.com/2013/10/18/bitlock-kicksta... | 447.0 | 7.0 | 297.0 | 0.653199 | 1.0 | 0.815789 | 9.0 | 4.0 | 1.0 | ... | 0.160000 | 0.50 | -0.135340 | -0.40 | -0.050000 | 0.1 | -0.1 | 0.4 | 0.1 | 0 |
| 2 | http://mashable.com/2013/07/24/google-glass-po... | 533.0 | 11.0 | 181.0 | 0.660377 | 1.0 | 0.775701 | 4.0 | 3.0 | 1.0 | ... | 0.136364 | 1.00 | 0.000000 | 0.00 | 0.000000 | 0.3 | 1.0 | 0.2 | 1.0 | 0 |
| 3 | http://mashable.com/2013/11/21/these-are-the-m... | 413.0 | 12.0 | 781.0 | 0.497409 | 1.0 | 0.677350 | 10.0 | 3.0 | 1.0 | ... | 0.100000 | 1.00 | -0.195701 | -0.40 | -0.071429 | 0.0 | 0.0 | 0.5 | 0.0 | 0 |
| 4 | http://mashable.com/2014/02/11/parking-ticket-... | 331.0 | 8.0 | 177.0 | 0.685714 | 1.0 | 0.830357 | 3.0 | 2.0 | 1.0 | ... | 0.100000 | 0.55 | -0.175000 | -0.25 | -0.100000 | 0.0 | 0.0 | 0.5 | 0.0 | 0 |
5 rows × 61 columns
train_df = df
train_df.shape
(6000, 61)
X = train_df.drop(['url', 'Popular'], axis=1)
y = train_df['Popular']
y.mean()
0.5
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Estimate a Decision Tree Classifier and a Logistic Regression
Evaluate using the following metrics:
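The metrics list above appears to have been lost in the export; a first step might look like the sketch below, assuming accuracy, F1-Score, and ROC AUC as the evaluation metrics (an assumption, not the notebook's actual list). A synthetic dataset stands in for the Mashable features so the snippet is self-contained:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic stand-in for the Mashable features (the real notebook uses X/y above).
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

results = {}
for name, clf in [('tree', DecisionTreeClassifier(max_depth=5, random_state=1)),
                  ('logreg', LogisticRegression(max_iter=1000))]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)[:, 1]
    results[name] = {'accuracy': accuracy_score(y_test, pred),
                     'f1': f1_score(y_test, pred),
                     'auc': roc_auc_score(y_test, proba)}
print(results)
```

The `max_depth=5` for the tree is an illustrative choice; an unconstrained tree would overfit the training split.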
Estimate 300 bagged samples
Estimate the following set of classifiers:
Estimate the probability as the percentage of models that predict the positive class
Modify the probability threshold and select the one that maximizes the F1-Score
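The three steps above (300 bootstrap samples, vote-fraction probability, F1-maximizing threshold) could be sketched as follows; the synthetic data, the choice of decision trees as base models, and the threshold grid are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in data (assumption; the notebook uses the Mashable X/y).
X, y = make_classification(n_samples=1200, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_estimators = 300
n = len(X_train)
votes = np.zeros(len(X_test))
for _ in range(n_estimators):
    idx = rng.randint(0, n, n)  # bootstrap sample (draw n rows with replacement)
    tree = DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx])
    votes += tree.predict(X_test)  # accumulate 0/1 votes for the positive class

# Probability = fraction of the 300 models that predict the positive class
proba = votes / n_estimators

# Scan candidate thresholds and keep the one that maximizes the F1-Score
thresholds = np.linspace(0.05, 0.95, 19)
f1s = np.array([f1_score(y_test, (proba >= t).astype(int)) for t in thresholds])
best_threshold = thresholds[f1s.argmax()]
print(best_threshold, f1s.max())
```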
Create an ensemble using weighted voting, with weights derived from the `oob_error`
Evaluate using the following metrics:
Estimate the probability using the weighted voting
Modify the probability threshold and select the one that maximizes the F1-Score
Estimate a logistic regression using as input the estimated classifiers
Modify the probability threshold and select the one that maximizes the F1-Score
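The stacking step could be sketched as below: base classifiers are fit on part of the training data, a logistic regression is fit on their predicted probabilities for a held-out slice, and the final threshold is tuned for F1. The specific base models (trees of three depths), the validation split, and the synthetic data are all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in data (assumption; the notebook uses the Mashable X/y).
X, y = make_classification(n_samples=1600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Hold out part of the training data to fit the second-stage model
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, random_state=1)

# First stage: a few base classifiers (the depths are illustrative choices)
base = [DecisionTreeClassifier(max_depth=d, random_state=1).fit(X_fit, y_fit)
        for d in (3, 5, None)]

def stack_features(X_):
    # One column per base classifier: its predicted probability of the positive class
    return np.column_stack([b.predict_proba(X_)[:, 1] for b in base])

# Second stage: logistic regression taking the base classifiers' outputs as input
meta = LogisticRegression(max_iter=1000).fit(stack_features(X_val), y_val)
proba = meta.predict_proba(stack_features(X_test))[:, 1]

# Tune the threshold to maximize the F1-Score
thresholds = np.linspace(0.05, 0.95, 19)
f1s = np.array([f1_score(y_test, (proba >= t).astype(int)) for t in thresholds])
best_threshold = thresholds[f1s.argmax()]
print(best_threshold, f1s.max())
```

Fitting the second stage on a held-out slice (rather than the base models' own training rows) avoids leaking the base models' training fit into the meta-model.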