import pandas as pd
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/mashable.csv'
df = pd.read_csv(url, index_col=0)
df.head()
| | url | timedelta | n_tokens_title | n_tokens_content | n_unique_tokens | n_non_stop_words | n_non_stop_unique_tokens | num_hrefs | num_self_hrefs | num_imgs | ... | min_positive_polarity | max_positive_polarity | avg_negative_polarity | min_negative_polarity | max_negative_polarity | title_subjectivity | title_sentiment_polarity | abs_title_subjectivity | abs_title_sentiment_polarity | Popular |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://mashable.com/2014/12/10/cia-torture-rep... | 28.0 | 9.0 | 188.0 | 0.732620 | 1.0 | 0.844262 | 5.0 | 1.0 | 1.0 | ... | 0.200000 | 0.80 | -0.487500 | -0.60 | -0.250000 | 0.9 | 0.8 | 0.4 | 0.8 | 1 |
| 1 | http://mashable.com/2013/10/18/bitlock-kicksta... | 447.0 | 7.0 | 297.0 | 0.653199 | 1.0 | 0.815789 | 9.0 | 4.0 | 1.0 | ... | 0.160000 | 0.50 | -0.135340 | -0.40 | -0.050000 | 0.1 | -0.1 | 0.4 | 0.1 | 0 |
| 2 | http://mashable.com/2013/07/24/google-glass-po... | 533.0 | 11.0 | 181.0 | 0.660377 | 1.0 | 0.775701 | 4.0 | 3.0 | 1.0 | ... | 0.136364 | 1.00 | 0.000000 | 0.00 | 0.000000 | 0.3 | 1.0 | 0.2 | 1.0 | 0 |
| 3 | http://mashable.com/2013/11/21/these-are-the-m... | 413.0 | 12.0 | 781.0 | 0.497409 | 1.0 | 0.677350 | 10.0 | 3.0 | 1.0 | ... | 0.100000 | 1.00 | -0.195701 | -0.40 | -0.071429 | 0.0 | 0.0 | 0.5 | 0.0 | 0 |
| 4 | http://mashable.com/2014/02/11/parking-ticket-... | 331.0 | 8.0 | 177.0 | 0.685714 | 1.0 | 0.830357 | 3.0 | 2.0 | 1.0 | ... | 0.100000 | 0.55 | -0.175000 | -0.25 | -0.100000 | 0.0 | 0.0 | 0.5 | 0.0 | 0 |
5 rows × 61 columns
train_df = df
train_df.shape
(6000, 61)
X = train_df.drop(['url', 'Popular'], axis=1)
y = train_df['Popular']
y.mean()
0.5
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
Estimate a Decision Tree Classifier and a Logistic Regression
Evaluate using the following metrics:
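The metrics list above appears to have been lost in the export; a first step might look like the sketch below, assuming accuracy, F1-Score, and ROC AUC as the evaluation metrics (an assumption, not the notebook's actual list). A synthetic dataset stands in for the Mashable features so the snippet is self-contained:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic stand-in for the Mashable features (the real notebook uses X/y above).
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

results = {}
for name, clf in [('tree', DecisionTreeClassifier(max_depth=5, random_state=1)),
                  ('logreg', LogisticRegression(max_iter=1000))]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    proba = clf.predict_proba(X_test)[:, 1]
    results[name] = {'accuracy': accuracy_score(y_test, pred),
                     'f1': f1_score(y_test, pred),
                     'auc': roc_auc_score(y_test, proba)}
print(results)
```

The `max_depth=5` for the tree is an illustrative choice; an unconstrained tree would overfit the training split.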
Estimate 300 bagged samples
Estimate the following set of classifiers:
Estimate the probability as the percentage of models that predict the positive class
Modify the probability threshold and select the one that maximizes the F1-Score
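The three steps above (300 bootstrap samples, vote-fraction probability, F1-maximizing threshold) could be sketched as follows; the synthetic data, the choice of decision trees as base models, and the threshold grid are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in data (assumption; the notebook uses the Mashable X/y).
X, y = make_classification(n_samples=1200, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

rng = np.random.RandomState(1)
n_estimators = 300
n = len(X_train)
votes = np.zeros(len(X_test))
for _ in range(n_estimators):
    idx = rng.randint(0, n, n)  # bootstrap sample (draw n rows with replacement)
    tree = DecisionTreeClassifier(random_state=1).fit(X_train[idx], y_train[idx])
    votes += tree.predict(X_test)  # accumulate 0/1 votes for the positive class

# Probability = fraction of the 300 models that predict the positive class
proba = votes / n_estimators

# Scan candidate thresholds and keep the one that maximizes the F1-Score
thresholds = np.linspace(0.05, 0.95, 19)
f1s = np.array([f1_score(y_test, (proba >= t).astype(int)) for t in thresholds])
best_threshold = thresholds[f1s.argmax()]
print(best_threshold, f1s.max())
```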
Create an ensemble using weighted voting, with weights derived from the `oob_error`
Evaluate using the following metrics:
Estimate the probability using the weighted voting
Modify the probability threshold and select the one that maximizes the F1-Score
Estimate a logistic regression using as input the estimated classifiers
Modify the probability threshold and select the one that maximizes the F1-Score
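The stacking step could be sketched as below: base classifiers are fit on part of the training data, a logistic regression is fit on their predicted probabilities for a held-out slice, and the final threshold is tuned for F1. The specific base models (trees of three depths), the validation split, and the synthetic data are all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in data (assumption; the notebook uses the Mashable X/y).
X, y = make_classification(n_samples=1600, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Hold out part of the training data to fit the second-stage model
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, random_state=1)

# First stage: a few base classifiers (the depths are illustrative choices)
base = [DecisionTreeClassifier(max_depth=d, random_state=1).fit(X_fit, y_fit)
        for d in (3, 5, None)]

def stack_features(X_):
    # One column per base classifier: its predicted probability of the positive class
    return np.column_stack([b.predict_proba(X_)[:, 1] for b in base])

# Second stage: logistic regression taking the base classifiers' outputs as input
meta = LogisticRegression(max_iter=1000).fit(stack_features(X_val), y_val)
proba = meta.predict_proba(stack_features(X_test))[:, 1]

# Tune the threshold to maximize the F1-Score
thresholds = np.linspace(0.05, 0.95, 19)
f1s = np.array([f1_score(y_test, (proba >= t).astype(int)) for t in thresholds])
best_threshold = thresholds[f1s.argmax()]
print(best_threshold, f1s.max())
```

Fitting the second stage on a held-out slice (rather than the base models' own training rows) avoids leaking the base models' training fit into the meta-model.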