%matplotlib inline
import pandas as pd
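# Load the car listings training set (pandas reads the zipped CSV directly)
data = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/dataTrain_carListings.zip')
# Keep only Camry listings and drop the columns that are not used as features
data = data.loc[data['Model'].str.contains('Camry')].drop(['Make', 'State'], axis=1)
# One-hot encode the model variant; dtype=int keeps the 0/1 columns shown in the output
data = data.join(pd.get_dummies(data['Model'], prefix='M', dtype=int))
# Binary target: 1 if the price is above the mean Camry price, else 0
data['HighPrice'] = (data['Price'] > data['Price'].mean()).astype(int)
data = data.drop(['Model', 'Price'], axis=1)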
data.head()
| | Year | Mileage | M_Camry | M_Camry4dr | M_CamryBase | M_CamryL | M_CamryLE | M_CamrySE | M_CamryXLE | HighPrice |
|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 2016 | 29242 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 47 | 2015 | 26465 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 85 | 2012 | 46739 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 141 | 2017 | 41722 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 226 | 2014 | 77669 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
data.shape
(13150, 10)
y = data['HighPrice']
X = data.drop(['HighPrice'], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Estimate a Decision Tree Classifier manually, using the code created in Notebook #13
Evaluate the accuracy on the testing set
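The code from Notebook #13 is not reproduced here, so the following is only a minimal sketch of what a manual decision tree could look like, assuming binary 0/1 labels, Gini impurity, and percentile-based candidate thresholds. The names `gini`, `best_split`, `tree_grow`, `tree_predict` and `accuracy`, and the defaults `max_depth=3`, `min_gain=0.001` and `num_pct=10`, are illustrative assumptions rather than the notebook's actual definitions.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary 0/1 label array."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def best_split(X, y, num_pct=10):
    """Return (feature index, threshold, gain) of the best split, or (None, None, 0.0)."""
    best = (None, None, 0.0)
    parent = gini(y)
    for j in range(X.shape[1]):
        # Candidate thresholds are percentiles of the feature; the smallest
        # value is skipped because "x < min" would leave the left side empty.
        cuts = np.unique(np.percentile(X[:, j], np.linspace(0, 100, num_pct + 1)))[1:]
        for s in cuts:
            left = X[:, j] < s
            n_l = left.sum()
            n_r = len(y) - n_l
            if n_l == 0 or n_r == 0:
                continue
            # Gain = parent impurity minus the size-weighted child impurity
            gain = parent - (n_l * gini(y[left]) + n_r * gini(y[~left])) / len(y)
            if gain > best[2]:
                best = (j, s, gain)
    return best

def tree_grow(X, y, level=0, max_depth=3, min_gain=0.001):
    """Recursively grow a tree stored as nested dicts (hypothetical layout)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    node = {'y_pred': int(np.mean(y) >= 0.5), 'split': None}
    if level >= max_depth or len(y) < 2:
        return node
    j, s, gain = best_split(X, y)
    if j is None or gain < min_gain:
        return node
    left = X[:, j] < s
    node['split'] = (j, s)
    node['sl'] = tree_grow(X[left], y[left], level + 1, max_depth, min_gain)
    node['sr'] = tree_grow(X[~left], y[~left], level + 1, max_depth, min_gain)
    return node

def tree_predict(X, tree):
    """Route every row down the tree and return the leaf predictions."""
    X = np.asarray(X, dtype=float)
    pred = np.empty(len(X), dtype=int)
    for i, row in enumerate(X):
        node = tree
        while node['split'] is not None:
            j, s = node['split']
            node = node['sl'] if row[j] < s else node['sr']
        pred[i] = node['y_pred']
    return pred

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

With the split defined earlier, this sketch would be used as `tree = tree_grow(X_train, y_train)` followed by `accuracy(y_test, tree_predict(X_test, tree))`.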
Estimate a Bagging of 10 Decision Tree Classifiers manually, using the code created in Notebook #13
Evaluate the accuracy on the testing set
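A possible sketch of manual bagging is shown here: each tree is fit on a bootstrap sample and the ensemble predicts by majority vote. To keep the example short, sklearn's `DecisionTreeClassifier` stands in for the hand-written tree from Notebook #13, which is a substitution and not what the exercise asks for. The function names `bagging_fit` and `bagging_predict` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, max_features=None, random_state=42):
    """Fit n_estimators trees, each on a bootstrap sample of (X, y)."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.RandomState(random_state)
    trees = []
    for _ in range(n_estimators):
        # Bootstrap: draw n rows with replacement
        idx = rng.choice(len(y), size=len(y), replace=True)
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=rng.randint(1_000_000))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Majority vote over the trees for 0/1 labels (ties go to class 1)."""
    X = np.asarray(X)
    votes = np.array([tree.predict(X) for tree in trees])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

On the notebook's data this would read `trees = bagging_fit(X_train, y_train, n_estimators=10)`, and the test accuracy is then `(bagging_predict(trees, X_test) == y_test).mean()`.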
Implement the max_features parameter in the Decision Tree Classifier created in 11.1.
Compare how the results change when varying the max_features parameter
Evaluate the accuracy on the testing set
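One way to add `max_features` to a manual tree is to restrict the split search at each node to a random subset of the columns. The sketch below shows only that split search, assuming Gini impurity and percentile thresholds; the function names and the `rng` argument are illustrative assumptions, not the code from 11.1.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary 0/1 label array."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def best_split(X, y, max_features=None, rng=None, num_pct=10):
    """Best Gini split among a random subset of max_features columns."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.RandomState() if rng is None else rng
    n_features = X.shape[1]
    # max_features=None means every column is considered
    k = n_features if max_features is None else min(int(max_features), n_features)
    candidates = rng.choice(n_features, size=k, replace=False)
    best = (None, None, 0.0)
    parent = gini(y)
    for j in candidates:
        cuts = np.unique(np.percentile(X[:, j], np.linspace(0, 100, num_pct + 1)))[1:]
        for s in cuts:
            left = X[:, j] < s
            n_l = left.sum()
            n_r = len(y) - n_l
            if n_l == 0 or n_r == 0:
                continue
            gain = parent - (n_l * gini(y[left]) + n_r * gini(y[~left])) / len(y)
            if gain > best[2]:
                best = (int(j), s, gain)
    return best
```

Calling this function from the tree-growing routine with different values of `max_features` (for example 1, 3 and all 9 columns) and recording the test accuracy for each gives the comparison the exercise asks for. Smaller values make individual trees weaker but less correlated, which matters mainly once the trees are bagged.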
Estimate a Bagging of 10 Decision Tree Classifiers with max_features = log(n_features)
Evaluate the accuracy on the testing set
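The exercise does not say which logarithm is meant, so the sketch below assumes the natural log, truncated to an integer and floored at 1 (which gives 2 for the 9 features here). It uses sklearn's `BaggingClassifier` and `DecisionTreeClassifier` as stand-ins for the manual code; `log_max_features` and `bagging_log_features` are hypothetical helper names.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def log_max_features(n_features):
    """Natural log of the feature count, truncated, and at least 1 (assumption)."""
    return max(1, int(np.log(n_features)))

def bagging_log_features(X_train, y_train, n_estimators=10, random_state=42):
    """Bag trees that each consider log(n_features) columns per split."""
    k = log_max_features(np.asarray(X_train).shape[1])
    # max_features is set on the tree, so the subset is redrawn at every split
    base = DecisionTreeClassifier(max_features=k, random_state=random_state)
    model = BaggingClassifier(base, n_estimators=n_estimators,
                              random_state=random_state)
    return model.fit(X_train, y_train)
```

After `model = bagging_log_features(X_train, y_train)`, the test accuracy is `model.score(X_test, y_test)`.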
Using sklearn, train a RandomForestClassifier
Evaluate the accuracy on the testing set
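A minimal sketch of this step with sklearn follows. The helper name `rf_accuracy` and the fixed `random_state=42` are assumptions made for illustration; any other `RandomForestClassifier` parameters can be passed as keyword arguments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def rf_accuracy(X_train, y_train, X_test, y_test, **params):
    """Fit a random forest and return (model, accuracy on the test set)."""
    clf = RandomForestClassifier(random_state=42, **params)
    clf.fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))
```

For the split defined earlier, `clf, acc = rf_accuracy(X_train, y_train, X_test, y_test)` trains a forest with sklearn's default settings.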
Find the best parameters of the RandomForestClassifier (max_depth, max_features, n_estimators)
Evaluate the accuracy on the testing set
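One common way to search over these three parameters is a cross-validated grid search on the training set, sketched below with `GridSearchCV`. The default grid values and the helper name `tune_random_forest` are illustrative assumptions; the exercise does not prescribe a grid, and a larger one takes proportionally longer to run.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_random_forest(X_train, y_train, param_grid=None, cv=5):
    """Grid-search max_depth, max_features and n_estimators by CV accuracy."""
    if param_grid is None:
        # Hypothetical default grid; adjust to the time available
        param_grid = {
            'max_depth': [3, 5, 10, None],
            'max_features': [2, 4, 'sqrt'],
            'n_estimators': [50, 100, 200],
        }
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=cv, scoring='accuracy')
    search.fit(X_train, y_train)
    return search
```

After `search = tune_random_forest(X_train, y_train)`, the chosen values are in `search.best_params_`, and `search.score(X_test, y_test)` reports the accuracy of the refitted best model on the testing set. Tuning only on the training folds keeps the test set untouched until this final evaluation.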