In this project we will be participating in the legendary Titanic competition on Kaggle. This is an excellent project to take part in, both to become familiar with machine learning and to learn how to use the Kaggle platform. Those are the two main goals of this project: becoming familiar with a machine learning workflow, and learning how to use the Kaggle platform.
This project will be using many models and tools from the scikit-learn and scipy libraries.
#data handling & visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
#tools
from scipy.stats import randint
from scipy.stats import loguniform
from scipy.stats import uniform
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
import time
#we include the following to silence the warnings
import warnings
from sklearn.exceptions import ConvergenceWarning
train = pd.read_csv('train.csv')
holdout = pd.read_csv('test.csv')
print(train.shape)
print(holdout.shape)
(891, 12)
(418, 11)
def process_missing(df):
    #handles missing values from the dataset
    df["Fare"] = df["Fare"].fillna(train["Fare"].mean())
    df["Embarked"] = df["Embarked"].fillna("S")
    return df
def process_age(df):
    #categorizes age into bins
    df["Age"] = df["Age"].fillna(-0.5)
    cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
    label_names = ["Missing", "Infant", "Child", "Teenager", "Young Adult", "Adult", "Senior"]
    df["Age_categories"] = pd.cut(df["Age"], cut_points, labels=label_names)
    return df
def process_fare(df):
    #categorizes fare price into bins
    cut_points = [-1, 12, 50, 100, 1000]
    label_names = ["0-12", "12-50", "50-100", "100+"]
    df["Fare_categories"] = pd.cut(df["Fare"], cut_points, labels=label_names)
    return df
def process_cabin(df):
    #categorizes cabin type into bins
    df["Cabin_type"] = df["Cabin"].str[0]
    df["Cabin_type"] = df["Cabin_type"].fillna("Unknown")
    df = df.drop('Cabin', axis=1)
    return df
def process_titles(df):
    #extracts and categorizes titles from passenger names
    titles = {
        "Mr": "Mr",
        "Mme": "Mrs",
        "Ms": "Mrs",
        "Mrs": "Mrs",
        "Master": "Master",
        "Mlle": "Miss",
        "Miss": "Miss",
        "Capt": "Officer",
        "Col": "Officer",
        "Major": "Officer",
        "Dr": "Officer",
        "Rev": "Officer",
        "Jonkheer": "Royalty",
        "Don": "Royalty",
        "Sir": "Royalty",
        "Countess": "Royalty",
        "Dona": "Royalty",
        "Lady": "Royalty"
    }
    extracted_titles = df["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
    df["Title"] = extracted_titles.map(titles)
    return df
def process_fam(df):
    df['Fam_onboard'] = df['SibSp'] + df['Parch']
    df['Fam_onboard'] = df['Fam_onboard'].fillna(0.0)
    return df
def process_alone(df):
    df['Is_alone'] = (df['Fam_onboard'] == 0).astype(int)
    df.drop('Fam_onboard', axis=1, inplace=True)
    return df
def process_scaling(df):
    #list of columns to scale
    cols = ['SibSp', 'Parch', 'Fare']
    #rescales data in specified columns
    for col in cols:
        df[col + '_scaled'] = minmax_scale(df[col])
        # df.drop(col, axis=1, inplace=True)
    return df
def create_dummies(df, column_name):
    #implements one-hot encoding for given column
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies], axis=1)
    return df
def process_all(df):
    #look up the dataframe's variable name (for the status message below)
    name = [x for x in globals() if globals()[x] is df][0]
    df = process_missing(df)
    df = process_age(df)
    df = process_fare(df)
    df = process_titles(df)
    df = process_cabin(df)
    for col in ['Age_categories', 'Fare_categories', 'Title',
                'Cabin_type', 'Sex', 'Pclass', 'Embarked']:
        df = create_dummies(df, col)
    print(f'Processing {name} complete!')
    return df
#process train and holdout sets
train = process_all(train)
holdout = process_all(holdout)
#print shapes (holdout lacks Survived, plus any dummy columns for categories it doesn't contain)
print(f'train: {train.shape}')
print(f'holdout: {holdout.shape}')
Processing train complete!
Processing holdout complete!
train: (891, 49)
holdout: (418, 47)
The following are helper functions to save us from reusing code multiple times. These functions don't change the data; rather, they help present and visualize it.
title_map = {
    #column : [title, xlabel, bins]
    'Pclass': ['Passenger Class', 'Class', 3],
    'Sex': ['Sex', 'Sex', 4],
    'Age': ['Age', 'Age', 5],
    'SibSp': ['Sibling & Spouse Count', '# of Siblings/Spouses', 8],
    'Parch': ['Parent & Child Count', '# of Parents/Children', 6],
    'Fare': ['Fare Paid', 'Fare', 10],
    'Cabin': ['Cabin Number', 'Cabin Type', 10],
    'Embarked': ['Embarkment Point', 'Embarkment Point', 3],
    'Fam_onboard': ['Family Member Count', '# of Family Members', 10]
}
def plot_distr(column):
    #get title & label
    title = title_map[column][0]
    xlabel = title_map[column][1]
    bin_count = title_map[column][2]
    #figure details
    fig, ax = plt.subplots(figsize=(12, 8))
    fig.suptitle(f'{title} Distribution', fontsize=22, y=0.92)
    sns.histplot(data=train, x=column, hue='Survived', bins=bin_count,
                 palette='deep', alpha=1, multiple="dodge")
    #plot details
    ax.set_xlabel(f'{xlabel}', fontsize=16)
    ax.set_ylabel('Frequency', fontsize=16)
    ax.tick_params(axis='x', labelsize=12)
    ax.tick_params(axis='y', labelsize=12)
    labels = ['Survived', 'Died']
    ax.legend(labels, frameon=True, bbox_to_anchor=(1, 1), fontsize=14)
    plt.show()
def plot_vs_survival(column):
    #get title & label
    title = title_map[column][0]
    xlabel = title_map[column][1]
    #figure details
    fig, ax = plt.subplots(figsize=(12, 8))
    fig.suptitle(f'Survival Rate vs. {title}', fontsize=22, y=0.92)
    sns.barplot(x=train[column], y=train['Survived'], palette='deep')
    ax.bar_label(ax.containers[0], fmt='%.2f', fontsize=14)
    #plot details
    if column == 'SibSp':
        ax.set_xlim(-0.5, 4.5)
    ax.set_xlabel(f'{xlabel}', fontsize=16)
    ax.set_ylabel('Survival Rate', fontsize=16)
    ax.tick_params(axis='x', labelsize=12)
    ax.tick_params(axis='y', labelsize=12)
    plt.show()
def plot_correlation_heatmap(df):
    cols_to_drop = ['Survived', 'PassengerId']
    df_cols = df.columns
    df_cols = df_cols.drop(cols_to_drop)
    #create correlation matrix
    corr = df[df_cols].corr()
    #create mask to hide "upper" half
    # sns.set(style="white")
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    #figure details
    fig, ax = plt.subplots(figsize=(11, 9))
    ax.set_title('Feature Correlation Heatmap', fontsize=22)
    #plot details
    cmap = sns.diverging_palette(220, 5, as_cmap=True)
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, square=True,
                linewidths=.5, linecolor='black', cbar_kws={'shrink': .8})
    #colorbar details
    cbar = ax.collections[0].colorbar
    cbar.set_ticks([-1, -0.5, 0, .5, 1])
    cbar.ax.tick_params(axis='both', which='both', length=0)
    plt.show()
def score_comp(results, col='Scores'):
    scores = []
    names = []
    for i in range(len(results)):
        scores.append(results[i]['best_score'])
        names.append(results[i]['name'])
    comp = pd.DataFrame()
    comp['Model'] = names
    comp[col] = scores
    comp.set_index('Model', inplace=True)
    return comp
def plot_scores_vs_kaggle(scores, kaggle):
    #predicted scores from models
    accuracies = []
    names = ['Logistic\nRegression', 'K-Nearest\nNeighbor', 'Random\nForest',
             'Support Vector\nMachine', 'Stochastic\nGradient Descent', 'Linear\nPerceptron']
    # #have to input these manually
    # kaggle = [0.78229, 0.75598, 0.75358, 0.76794, 0.77033, 0.75598]
    #extracts actual score
    for i in range(len(scores)):
        accuracies.append(scores[i]['best_score'])
    #figure details
    fig, ax = plt.subplots(figsize=(12, 8))
    fig.suptitle('Kaggle Score vs. Predicted Score', fontsize=22, y=0.92)
    #plot predicted
    sns.barplot(x=names, y=accuracies, palette='deep', alpha=0.3, label='Predicted')
    ax.bar_label(ax.containers[0], fmt='%.4f', fontsize=14)
    #plot kaggle
    sns.barplot(x=names, y=kaggle, palette='deep', alpha=0.7, label='Kaggle')
    ax.bar_label(ax.containers[1], fmt='%.4f', fontsize=14)
    #plot horizontal lines
    ax.axhline(y=0.75, color='red', linestyle='dashed', alpha=0.7, label='Minimum')
    ax.axhline(y=0.8, color='grey', linestyle='dashed', alpha=0.7, label='Goal')
    #plot details
    ax.set_ylim(0.68, 0.85)
    ax.set_xlabel('Models', fontsize=16)
    ax.set_ylabel('Accuracy', fontsize=16)
    ax.tick_params(axis='x', labelsize=12)
    ax.tick_params(axis='y', labelsize=12)
    ax.legend(frameon=True, bbox_to_anchor=(1, 1), fontsize=12)
    plt.show()
def plot_model_progression(df):
    dfT = df.T
    #figure details
    fig, ax = plt.subplots(figsize=(12, 8))
    fig.suptitle('Model Progression of Predicted Accuracy', fontsize=22, y=0.92)
    #plot
    for i in df.index:
        sns.lineplot(data=dfT[i], marker='o')
    #plot details
    labels = ['Logistic Regression', 'K-Nearest Neighbors', 'Random Forest',
              'Support Vector Machine', 'Stochastic Gradient Descent',
              'Linear Perceptron']
    ax.set_ylim(0.70, 0.85)
    ax.set_xlabel('Iterations', fontsize=16)
    ax.set_ylabel('Predicted Accuracy', fontsize=16)
    ax.tick_params(axis='x', labelsize=12)
    ax.tick_params(axis='y', labelsize=12)
    ax.legend(labels, frameon=True, fontsize=12)
    plt.show()
print(train.dtypes,'\n')
print(f'train: {train.shape}')
print(f'holdout: {holdout.shape}')
PassengerId                     int64
Survived                        int64
Pclass                          int64
Name                           object
Sex                            object
Age                           float64
SibSp                           int64
Parch                           int64
Ticket                         object
Fare                          float64
Embarked                       object
Age_categories               category
Fare_categories              category
Title                          object
Cabin_type                     object
Age_categories_Missing          uint8
Age_categories_Infant           uint8
Age_categories_Child            uint8
Age_categories_Teenager         uint8
Age_categories_Young Adult      uint8
Age_categories_Adult            uint8
Age_categories_Senior           uint8
Fare_categories_0-12            uint8
Fare_categories_12-50           uint8
Fare_categories_50-100          uint8
Fare_categories_100+            uint8
Title_Master                    uint8
Title_Miss                      uint8
Title_Mr                        uint8
Title_Mrs                       uint8
Title_Officer                   uint8
Title_Royalty                   uint8
Cabin_type_A                    uint8
Cabin_type_B                    uint8
Cabin_type_C                    uint8
Cabin_type_D                    uint8
Cabin_type_E                    uint8
Cabin_type_F                    uint8
Cabin_type_G                    uint8
Cabin_type_T                    uint8
Cabin_type_Unknown              uint8
Sex_female                      uint8
Sex_male                        uint8
Pclass_1                        uint8
Pclass_2                        uint8
Pclass_3                        uint8
Embarked_C                      uint8
Embarked_Q                      uint8
Embarked_S                      uint8
dtype: object

train: (891, 49)
holdout: (418, 47)
According to the data dictionary, the SibSp column represents the number of siblings/spouses the passenger had aboard the Titanic. Let's explore this data a little further by looking at the distribution of siblings/spouses and their survival rate.
plot_distr('SibSp')
In the above chart we can see that the majority of passengers aboard the Titanic were single and/or had no siblings on board with them. It appears that roughly 2/3 of passengers who were single and/or had no siblings did not survive. If you were married and/or had at least one sibling on board, that ratio approaches roughly 1/2.
sibsp_pivot = train.pivot_table(index='SibSp',values='Survived')
sibsp_pivot
| SibSp | Survived |
|---|---|
| 0 | 0.345395 |
| 1 | 0.535885 |
| 2 | 0.464286 |
| 3 | 0.250000 |
| 4 | 0.166667 |
| 5 | 0.000000 |
| 8 | 0.000000 |
plot_vs_survival('SibSp')
The chart above shows us the survival rate of passengers who had a given number of siblings/spouses on board the Titanic. It appears that if you were married or had one sibling on board with you, your odds of survival peaked at roughly 54%. If you had 4 siblings/spouses, your odds of survival dropped to 17%. Unfortunately, the survival rate for any passenger with 5 or more siblings/spouses was 0%, and thus is not shown on the graph.
According to the data dictionary, the Parch column represents the number of parents or children the passenger had aboard the Titanic. Let's also explore this data further by looking at the distribution of parents/children, as well as their survival rates.
plot_distr('Parch')
The above plot shows a very similar distribution to the SibSp plot. Roughly 2/3 of passengers with no parents/children on board didn't survive. With 1 or more, that ratio approaches 1/2.
parch_pivot = train.pivot_table(index="Parch",values="Survived")
parch_pivot
| Parch | Survived |
|---|---|
| 0 | 0.343658 |
| 1 | 0.550847 |
| 2 | 0.500000 |
| 3 | 0.600000 |
| 4 | 0.000000 |
| 5 | 0.200000 |
| 6 | 0.000000 |
plot_vs_survival('Parch')
Here we can see the survival rate for passengers with a given number of parents or children on board. It makes sense that passengers with 3 parents/children on board had a higher survival rate than those with none. This is most likely because women and children had priority access to the lifeboats.
The two columns SibSp and Parch both represent numbers of family members on board the Titanic. Since they are both numerical datatypes, and they both represent similar things, we can go ahead and combine them and see if/how it affects the distribution and survival rates.
#creating new column by combining two existing ones
train = process_fam(train)
holdout = process_fam(holdout)
train['Fam_onboard'].value_counts()
0     537
1     161
2     102
3      29
5      22
4      15
6      12
10      7
7       6
Name: Fam_onboard, dtype: int64
plot_distr('Fam_onboard')
In the combined Fam_onboard column, it's much easier to see how many family members each passenger had on board with them. The overwhelming majority of passengers were traveling without any family. They were also the group with the most deaths; since they didn't have children with them, they weren't given lifeboat priority.
fam_pivot = train.pivot_table(index='Fam_onboard', values="Survived")
fam_pivot
| Fam_onboard | Survived |
|---|---|
| 0 | 0.303538 |
| 1 | 0.552795 |
| 2 | 0.578431 |
| 3 | 0.724138 |
| 4 | 0.200000 |
| 5 | 0.136364 |
| 6 | 0.333333 |
| 7 | 0.000000 |
| 10 | 0.000000 |
plot_vs_survival('Fam_onboard')
If we look at the above plot, we can clearly see that being on board the Titanic with at least 1 family member greatly increased a passenger's odds of survival. Having just one family member on board raises the survival rate by about 25 percentage points. Relative to the original SibSp and Parch columns, a passenger's odds of survival go up with any number of family members on board - the only exception being 5 family members, where the rate drops by about 7 percentage points. In both the SibSp and Parch columns, passengers with 0 family on board survived at a rate of ~34%. However, in the combined Fam_onboard column, that rate is now 30%. This can be seen in the table below.
#frames to concat
frames = [sibsp_pivot, parch_pivot, fam_pivot]
#rename columns in pivot tables
sibsp_pivot.rename(columns={'Survived': 'SibSp'}, inplace=True)
parch_pivot.rename(columns={'Survived': 'Parch'}, inplace=True)
fam_pivot.rename(columns={'Survived': 'Fam_onboard'}, inplace=True)
#concat dataframes
survival_comp = pd.concat(frames, axis=1)
survival_comp.index.name = '#'
survival_comp.fillna(0.0, inplace=True)
survival_comp
| # | SibSp | Parch | Fam_onboard |
|---|---|---|---|
| 0 | 0.345395 | 0.343658 | 0.303538 |
| 1 | 0.535885 | 0.550847 | 0.552795 |
| 2 | 0.464286 | 0.500000 | 0.578431 |
| 3 | 0.250000 | 0.600000 | 0.724138 |
| 4 | 0.166667 | 0.000000 | 0.200000 |
| 5 | 0.000000 | 0.200000 | 0.136364 |
| 8 | 0.000000 | 0.000000 | 0.000000 |
| 6 | 0.000000 | 0.000000 | 0.333333 |
| 7 | 0.000000 | 0.000000 | 0.000000 |
| 10 | 0.000000 | 0.000000 | 0.000000 |
Based on the table above, we can create a new feature - one that simplifies a lot of the information we've been working with: was the passenger alone? This new feature will be a binary column containing the value:

- 1 - if the passenger has zero family members onboard
- 0 - if the passenger has one or more family members onboard

Let's use the process_alone() function defined earlier to create this new feature.
train = process_alone(train)
holdout = process_alone(holdout)
train['Is_alone'].value_counts()
1    537
0    354
Name: Is_alone, dtype: int64
We can use scikit-learn's builtin feature_selection.RFECV class to automate selecting the best-performing features using recursive feature elimination. To make things easier, we can create a function that performs this step for us.
The first parameter of an RFECV() object is an estimator - the model you want to use for training. For this step we will use random forests, as that model seems to be a better fit for this project. Our function will:

- remove non-numeric columns and columns that contain null values
- fit an RFECV selector using a RandomForestClassifier
- return the subset of columns the selector kept
def select_features(df):
    #remove non-numeric columns, columns that have null values
    df = df.select_dtypes([np.number]).dropna(axis=1)
    all_X = df.drop(['Survived', 'PassengerId'], axis=1)
    all_y = df['Survived']
    #create new model with RFECV object
    rf = RandomForestClassifier(random_state=1)
    selector = RFECV(rf, cv=10, n_jobs=-1)
    selector.fit(all_X, all_y)
    #get subset of optimized columns
    optimized_columns = all_X.columns[selector.support_]
    #print columns
    print(f'Optimized columns: {len(optimized_columns)}')
    for col in range(len(optimized_columns)):
        print(f' - {optimized_columns[col]}')
    return optimized_columns
best_columns_1 = select_features(train)
Optimized columns: 6
 - Pclass
 - Age
 - Fare
 - Title_Mr
 - Sex_female
 - Sex_male
Just like we did with feature selection above, we can write a function to perform all of the heavy lifting of model selection and tuning. Our function will (for now) test six different algorithms and use grid search to train with different combinations of hyperparameters to find the best performing model.
It will work like this:

- define a list of models, each with a dictionary of hyperparameter values to search over
- run a GridSearchCV for each model
- store the fitted GridSearchCV objects as well as the results so we can see which was the most accurate

For our project we will be testing the following models:

- LogisticRegression
- KNeighborsClassifier
- RandomForestClassifier
- SupportVectorMachine (SVC)
- StochasticGradientDescent (SGDClassifier)
- LinearPerceptron (Perceptron)
def select_model(df, features, search=True):
    #set up train / target variables
    all_X = df[features]
    all_y = df['Survived']
    #scaling pipeline for the SGDClassifier
    pipe = Pipeline(steps=[('scaler', StandardScaler()),
                           ('sgd', SGDClassifier())])
    if search:
        print('Using GridSearchCV...\n')
        model_params = [
            {
                "name": "LogisticRegression",
                "estimator": LogisticRegression(),
                "hyperparameters":
                {
                    "solver": ["newton-cg", "lbfgs", "liblinear"],
                    "C": np.logspace(-4, 4, 50)
                }
            },
            {
                "name": "KNeighborsClassifier",
                "estimator": KNeighborsClassifier(),
                "hyperparameters":
                {
                    "n_neighbors": range(1, 20, 2),
                    "weights": ["distance", "uniform"],
                    "algorithm": ["ball_tree", "kd_tree", "brute"],
                    "p": [1, 2]
                }
            },
            {
                "name": "RandomForestClassifier",
                "estimator": RandomForestClassifier(random_state=1),
                "hyperparameters":
                {
                    "n_estimators": [4, 6, 9],
                    "criterion": ["entropy", "gini"],
                    "max_depth": [2, 5, 10],
                    "max_features": ["log2", "sqrt"],
                    "min_samples_leaf": [1, 5, 8],
                    "min_samples_split": [2, 3, 5]
                }
            },
            {
                "name": "SupportVectorMachine",
                "estimator": SVC(),
                "hyperparameters":
                {
                    "kernel": ['rbf', 'linear', 'sigmoid'],
                    "gamma": ['scale', 'auto'],
                    "C": [0.1, 0.01, 0.001]
                }
            },
            {
                "name": "StochasticGradientDescent",
                "estimator": pipe,
                "hyperparameters":
                {
                    "sgd__loss": ['hinge', 'log', 'perceptron'],
                    "sgd__alpha": [0.00001, 0.0001, 0.01, 0.1],
                    "sgd__learning_rate": ['constant', 'optimal',
                                           'invscaling', 'adaptive'],
                    "sgd__eta0": [1, 10, 100],
                    "sgd__l1_ratio": [0, 0.2, 0.4, 0.6, 0.8],
                    "sgd__penalty": ['l2', 'l1', 'elasticnet', None]
                }
            },
            {
                "name": "LinearPerceptron",
                "estimator": Perceptron(),
                "hyperparameters":
                {
                    "eta0": [0.0001, 0.001, 0.01, 0.1, 1.0],
                    "penalty": ['l2', 'l1', 'elasticnet', None],
                    "alpha": [0.00001, 0.0001, 0.01, 0.1],
                    "l1_ratio": [0, 0.2, 0.4, 0.6, 0.8]
                }
            }
        ]
    ##############################################################
    # RandomizedSearchCV requires slightly different parameters! #
    # Specify a continuous distribution (rather than a list of   #
    # values) for any continuous parameters.                     #
    ##############################################################
    else:
        print('Using RandomizedSearchCV...\n')
        model_params = [
            {
                "name": "LogisticRegression",
                "estimator": LogisticRegression(),
                "hyperparameters":
                {
                    "solver": ["newton-cg", "lbfgs", "liblinear"],
                    "C": loguniform(1e-5, 100)
                }
            },
            {
                "name": "KNeighborsClassifier",
                "estimator": KNeighborsClassifier(),
                "hyperparameters":
                {
                    "n_neighbors": range(1, 21),
                    "weights": ["distance", "uniform"],
                    "algorithm": ["ball_tree", "kd_tree", "brute"],
                    "p": [1, 2]
                }
            },
            {
                "name": "RandomForestClassifier",
                "estimator": RandomForestClassifier(random_state=1),
                "hyperparameters":
                {
                    "n_estimators": [int(x) for x in np.linspace(4, 15, num=10)],
                    "criterion": ["entropy", "gini"],
                    "max_depth": [int(x) for x in np.linspace(2, 10, num=5)],
                    "max_features": ["log2", "sqrt"],
                    "min_samples_leaf": [int(x) for x in np.linspace(1, 10, num=5)],
                    "min_samples_split": [int(x) for x in np.linspace(2, 5, num=3)]
                }
            },
            {
                "name": "SupportVectorMachine",
                "estimator": SVC(),
                "hyperparameters":
                {
                    "kernel": ['rbf', 'linear', 'sigmoid'],
                    "C": loguniform(1e-5, 100),
                    "gamma": loguniform(0.1, 1)
                }
            },
            {
                "name": "StochasticGradientDescent",
                "estimator": pipe,
                "hyperparameters":
                {
                    "sgd__loss": ['hinge', 'log', 'perceptron'],
                    "sgd__alpha": loguniform(1e-5, 100),
                    "sgd__learning_rate": ['constant', 'optimal',
                                           'invscaling', 'adaptive'],
                    "sgd__eta0": loguniform(1, 100),
                    "sgd__l1_ratio": uniform(0, 1),
                    "sgd__penalty": ['l2', 'l1', 'elasticnet', None]
                }
            },
            {
                "name": "LinearPerceptron",
                "estimator": Perceptron(),
                "hyperparameters":
                {
                    "eta0": loguniform(1e-5, 100),
                    "penalty": ['l2', 'l1', 'elasticnet', None],
                    "alpha": loguniform(1e-5, 100),
                    "l1_ratio": uniform(0, 1)
                }
            }
        ]
    for model in model_params:
        #print name
        print(f'Model: {model["name"]}')
        #hide warnings, then train model
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=ConvergenceWarning)
            if search:
                #instantiate grid search
                grid = GridSearchCV(model['estimator'],
                                    param_grid=model['hyperparameters'],
                                    cv=10,
                                    n_jobs=-1)
            else:
                #instantiate randomized search
                grid = RandomizedSearchCV(model['estimator'],
                                          param_distributions=model['hyperparameters'],
                                          cv=10,
                                          n_iter=20,
                                          n_jobs=-1)
            grid.fit(all_X, all_y)
        #assign to dictionary
        model['best_params'] = grid.best_params_
        model['best_score'] = grid.best_score_
        model['best_model'] = grid.best_estimator_
        #print best score and parameters
        print(f'Best Score: {model["best_score"]}')
        print(f'Best Params: {model["best_params"]}')
        print('-'*(len(model['name'])+7))
    print('Finished!')
    return model_params
#select best model and parameters for train dataframe
best_model_1 = select_model(train, best_columns_1)
results_1 = score_comp(best_model_1, 'Grid')
Using GridSearchCV...

Model: LogisticRegression
Best Score: 0.7934831460674157
Best Params: {'C': 1.7575106248547894, 'solver': 'lbfgs'}
-------------------------
Model: KNeighborsClassifier
Best Score: 0.7722596754057427
Best Params: {'algorithm': 'ball_tree', 'n_neighbors': 9, 'p': 1, 'weights': 'distance'}
---------------------------
Model: RandomForestClassifier
Best Score: 0.8418476903870162
Best Params: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 9}
-----------------------------
Model: SupportVectorMachine
Best Score: 0.786729088639201
Best Params: {'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}
---------------------------
Model: StochasticGradientDescent
Best Score: 0.8024594257178528
Best Params: {'sgd__alpha': 1e-05, 'sgd__eta0': 10, 'sgd__l1_ratio': 0.8, 'sgd__learning_rate': 'invscaling', 'sgd__loss': 'log', 'sgd__penalty': 'l1'}
--------------------------------
Model: LinearPerceptron
Best Score: 0.7698751560549313
Best Params: {'alpha': 0.01, 'eta0': 0.0001, 'l1_ratio': 0, 'penalty': 'l1'}
-----------------------
Finished!
At this point, our previous function returned a score for each of our six models. We can now either submit our results to Kaggle, or go back to feature engineering and data exploration. Since we might be submitting to Kaggle multiple times, it would be helpful to have a function that automates that process.
def save_submission_file(model, features, filename='submission'):
    holdout_data = holdout[features]
    predictions = model.predict(holdout_data)
    #get ids
    holdout_ids = holdout['PassengerId']
    #create dataframe
    submission_df = {'PassengerId': holdout_ids,
                     'Survived': predictions}
    submission = pd.DataFrame(submission_df)
    #create .csv file
    submission.to_csv(f'{filename}.csv', index=False)
    print(f'Created {filename}.csv!')
    return submission
#use best performing model from select_model results
best = max(best_model_1, key=lambda x:x['best_score'])
model = best['best_model']
file = best['name'] + time.strftime("%Y%m%d_%H%M%S")
RF_results = save_submission_file(model, best_columns_1, file)
Created RandomForestClassifier20220224_161549.csv!
While our model may not be perfect, in this project we have created a reproducible workflow to help us iterate through ideas and continue to improve the accuracy of our predictions. We made functions to help with feature selection, model selection/tuning, and submission creation. This is a good place to stop and move on to a new project, as we have achieved our two main goals: becoming familiar with a machine learning workflow, and learning how to use the Kaggle platform.
I felt encouraged to keep going with this project, and thus have included the following as "next steps" to take:

- rescale the numeric data
- remove collinear columns
- use RandomizedSearchCV instead of GridSearchCV to speed up our select_model() function

Everything past this point will be an experimentation on the points made above. I plan to do the following:

- rescale the SibSp, Parch, and Fare columns
- drop redundant columns to reduce collinearity, then re-run feature and model selection
- compare the GridSearchCV results against a RandomizedSearchCV run of the select_model function

If we look at the description of our numeric data again, we can see that each column has a different range of values. SibSp ranges from 0-8, Parch ranges from 0-6, and Fare ranges from 0-512. In order to make sure these values are weighted equally by our models, we can rescale them. This essentially means stretching or shrinking the data so that it falls on the same scale, for example between 0 and 1. This can be a crucial step, as some machine learning models depend on the data being rescaled. The image below helps illustrate this concept.
Within scikit-learn, the preprocessing.minmax_scale() function does exactly this. We can go ahead and rescale our columns of interest. Make sure to do this for both our train and test sets of data!
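As a quick standalone illustration (using made-up fare values, not the project data), min-max scaling maps each value x to (x - min) / (max - min), which squeezes everything into [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# hypothetical fares spanning a wide range (illustrative values only)
fares = np.array([0.0, 7.25, 71.28, 512.33])

scaled = minmax_scale(fares)

# minmax_scale applies (x - min) / (max - min) to each value,
# so the smallest fare maps to 0.0 and the largest to 1.0
manual = (fares - fares.min()) / (fares.max() - fares.min())
assert np.allclose(scaled, manual)
print(scaled)
```

Note how the relative ordering of the fares is preserved; only the range changes.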
#scales data in both sets
train = process_scaling(train)
holdout = process_scaling(holdout)
Because of all of our processing and feature engineering, we can end up with redundant columns - columns whose information is fully captured by others, which leads to collinearity. For example, saying that a passenger is male is the same as saying that the passenger is not female, so we could remove one of the two sex columns. The same idea applies to the passenger's title - the titles already encode sex, which means we could remove both of the sex columns.

It's a good idea to remove one column for each set of dummy variables we have made (these come from the create_dummies() function). So below we will remove all of the aforementioned columns.
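As a minimal sketch on a toy frame (not the project data), pandas can drop the redundant dummy for us via the drop_first parameter of get_dummies:

```python
import pandas as pd

# toy two-category column: its dummies are perfectly anti-correlated
df = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

full = pd.get_dummies(df['Sex'], prefix='Sex')
# Sex_male is always 1 - Sex_female, i.e. the columns are perfectly collinear
assert (full['Sex_male'].astype(int) == 1 - full['Sex_female'].astype(int)).all()

# drop_first=True drops the first category's dummy, removing the redundancy
reduced = pd.get_dummies(df['Sex'], prefix='Sex', drop_first=True)
print(reduced.columns.tolist())  # ['Sex_male']
```

The project's create_dummies() helper keeps all dummies and drops columns manually afterwards; drop_first=True is simply a built-in shortcut for the same idea.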
#cols to ignore to help reduce collinearity (more of an art than a science it seems)
cols_to_drop = ['Age_categories_Teenager', 'Fare_categories_12-50', 'Pclass',
                'Title_Master', 'Cabin_type', 'Sex_male', 'Sex_female',
                'SibSp', 'Parch', 'Fare', 'Age', 'Sex', 'Embarked',
                'Ticket', 'Title']
columns = list(train.columns)
columns = [x for x in columns if x not in cols_to_drop]
plot_correlation_heatmap(train[columns])
Above we can see how strongly our features are correlated to one another. The darker the color, the stronger the correlation.
#select best features again - this time with scaled data, with fewer columns to help reduce collinearity
best_columns_2 = select_features(train[columns])
Optimized columns: 6
 - Title_Miss
 - Title_Mr
 - Title_Mrs
 - Pclass_3
 - SibSp_scaled
 - Fare_scaled
#grab the best performing model from our search
best_model_2 = select_model(train[columns], best_columns_2)
Using GridSearchCV...

Model: LogisticRegression
Best Score: 0.8046816479400748
Best Params: {'C': 0.8286427728546842, 'solver': 'newton-cg'}
-------------------------
Model: KNeighborsClassifier
Best Score: 0.830561797752809
Best Params: {'algorithm': 'ball_tree', 'n_neighbors': 11, 'p': 1, 'weights': 'distance'}
---------------------------
Model: RandomForestClassifier
Best Score: 0.8384269662921348
Best Params: {'criterion': 'entropy', 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 5, 'min_samples_split': 2, 'n_estimators': 9}
-----------------------------
Model: SupportVectorMachine
Best Score: 0.7957178526841447
Best Params: {'C': 0.1, 'gamma': 'scale', 'kernel': 'sigmoid'}
---------------------------
Model: StochasticGradientDescent
Best Score: 0.8181647940074906
Best Params: {'sgd__alpha': 1e-05, 'sgd__eta0': 1, 'sgd__l1_ratio': 0, 'sgd__learning_rate': 'invscaling', 'sgd__loss': 'hinge', 'sgd__penalty': 'l2'}
--------------------------------
Model: LinearPerceptron
Best Score: 0.7575655430711611
Best Params: {'alpha': 0.01, 'eta0': 0.1, 'l1_ratio': 0.2, 'penalty': 'elasticnet'}
-----------------------
Finished!
#create score df
results_2 = score_comp(best_model_2, 'Grid + Scaling')
#compare scores
final_comp = pd.concat([results_1, results_2], axis=1)
final_comp
| Model | Grid | Grid + Scaling |
|---|---|---|
| LogisticRegression | 0.793483 | 0.804682 |
| KNeighborsClassifier | 0.772260 | 0.830562 |
| RandomForestClassifier | 0.841848 | 0.838427 |
| SupportVectorMachine | 0.786729 | 0.795718 |
| StochasticGradientDescent | 0.802459 | 0.818165 |
| LinearPerceptron | 0.769875 | 0.757566 |
From the table above, we can see that scaling and removing several columns had a fairly positive impact on the predicted scores of our models. All of our models except the RandomForestClassifier improved! It is very important to note that these are just our predicted (cross-validated) scores. If you upload your submissions to Kaggle, you should expect slightly different results.
At this point, I was curious to see the Kaggle Scores from each of our models. So I created a submission file for each model and submitted them to Kaggle.
#save each model's submission
for i in range(len(best_model_2)):
    model = best_model_2[i]['best_model']
    file = best_model_2[i]['name'] + time.strftime("%Y%m%d_%H%M%S")
    end_results2 = save_submission_file(model, best_columns_2, file)
Created LogisticRegression20220224_161744.csv!
Created KNeighborsClassifier20220224_161744.csv!
Created RandomForestClassifier20220224_161744.csv!
Created SupportVectorMachine20220224_161744.csv!
Created StochasticGradientDescent20220224_161744.csv!
Created LinearPerceptron20220224_161744.csv!
#need to input this manually - can it be automated?
kaggle = [0.77990, 0.75598, 0.76555, 0.76794, 0.77033, 0.75598]
# kaggle = [0.77033, 0.74162, 0.74162, 0.78708, 0.77511, 0.68181] - most recent
# kaggle = [0.78229, 0.75598, 0.75358, 0.76794, 0.77033, 0.75598] - oldest
plot_scores_vs_kaggle(best_model_2, kaggle)
From the above graph we can see that, surprisingly, our LogisticRegression model outperformed all other models on Kaggle. The lighter bars in the plot represent our predicted values; you can easily see the difference between them and the darker bars, which are the actual Kaggle leaderboard scores. So what can we gather from this? I think the biggest takeaway is that our models are overfitting the data. This means they perform well on the training data, but perform poorly (or not as well) on the testing data. After we rescaled a couple of the columns and excluded some columns from the select_model() function, our predicted scores went up (for the most part). The red and grey dotted lines represent our minimum and goal values. These are a bit arbitrary, but my personal goal is to hit 80% accuracy.
Seeing this plot helps us better understand how our models are fitting the data. It is worth experimenting further with all of our models to see if we can further reduce the overfitting that's occurring.
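One simple way to check for overfitting (a standalone sketch on synthetic data, not the Titanic features) is to compare a model's accuracy on the data it was trained on against its cross-validated accuracy; a large gap suggests the model is memorizing the training set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in data (the real project would use the Titanic features)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

model = RandomForestClassifier(n_estimators=50, random_state=1)
model.fit(X, y)

train_acc = model.score(X, y)                       # accuracy on seen data
cv_acc = cross_val_score(model, X, y, cv=5).mean()  # accuracy on held-out folds
print(f'train: {train_acc:.3f}  cv: {cv_acc:.3f}')

# a training score noticeably above the CV score is a sign of overfitting
```

Shrinking the gap (for example by limiting max_depth or min_samples_leaf) is one way to reduce overfitting in tree-based models.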
In our select_model() function, we use GridSearchCV to help with our hyperparameter tuning. Would we achieve any better scores with RandomizedSearchCV instead? Let's find out. We can achieve this simply by modifying our existing function to accept a "search flag" of sorts that determines which search method to use.
GridSearch and RandomizedSearch are two very popular hyperparameter tuning approaches. Both have the same goal of finding the optimal hyperparameter settings for a given model, however there are clear differences between the two:

- GridSearch exhaustively evaluates every combination of the hyperparameter values you list, so its cost grows multiplicatively with each parameter you add
- RandomizedSearch samples a fixed number of combinations (n_iter) from the lists or distributions you specify

Using RandomizedSearch has the potential to save us lots of computation time, while potentially sacrificing very little in terms of accuracy. I believe it is also critical to know that it is a random search, meaning that there is a small amount of "luck" involved when it comes to picking the optimal combination. It's very possible that it produces worse combinations than the GridSearchCV approach. It's a tradeoff between accuracy, time, and resources. Also, in order to employ RandomizedSearch you need to make sure your set of hyperparameters is configured correctly: if a parameter is continuous, you need to give it a continuous distribution.
"For continuous parameters ... it is important to specify a continuous distribution to take full advantage of the randomization. This way, increasing n_iter will always lead to a finer search." (scikit-learn.org)
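To make this concrete, here is a short sketch of how parameter sets for RandomizedSearchCV mix continuous scipy distributions with plain lists for categorical choices. The specific parameters shown are illustrative, not the project's exact configurations.

```python
from scipy.stats import loguniform, randint, uniform

# continuous parameters get continuous distributions;
# categorical parameters can stay as plain lists
svc_distributions = {
    "C": loguniform(1e-3, 1e2),     # continuous, sampled on a log scale
    "gamma": uniform(0.001, 1.0),   # continuous, uniform on [0.001, 1.001)
    "kernel": ["rbf", "linear"],    # categorical: sampled uniformly from the list
}
knn_distributions = {
    "n_neighbors": randint(1, 30),  # integer-valued parameter
}

# drawing from a distribution yields a fresh value each time,
# so increasing n_iter genuinely refines the search
c_samples = svc_distributions["C"].rvs(size=1000, random_state=0)
```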
Aside from this quirk, you also need a deeper overall knowledge of a model's hyperparameters in order to know which ones matter most. It can be very confusing trying to wrap one's head around every single hyperparameter.
#use same columns as previous attempt - this time with RandomizedSearchCV
best_model_3 = select_model(train, best_columns_2, False)
Using RandomizedSearchCV...
Model: LogisticRegression
Best Score: 0.8046816479400748
Best Params: {'C': 0.7836461049630783, 'solver': 'lbfgs'}
-------------------------
Model: KNeighborsClassifier
Best Score: 0.8294382022471911
Best Params: {'weights': 'distance', 'p': 2, 'n_neighbors': 12, 'algorithm': 'ball_tree'}
-------------------------
Model: RandomForestClassifier
Best Score: 0.8406616729088638
Best Params: {'n_estimators': 7, 'min_samples_split': 3, 'min_samples_leaf': 5, 'max_features': 'sqrt', 'max_depth': 10, 'criterion': 'gini'}
-------------------------
Model: SupportVectorMachine
Best Score: 0.8046691635455681
Best Params: {'C': 2.49818247165343, 'gamma': 0.918322833274782, 'kernel': 'rbf'}
-------------------------
Model: StochasticGradientDescent
Best Score: 0.8159176029962547
Best Params: {'sgd__alpha': 4.0057831169526305e-05, 'sgd__eta0': 34.5815953010011, 'sgd__l1_ratio': 0.280184097174348, 'sgd__learning_rate': 'adaptive', 'sgd__loss': 'hinge', 'sgd__penalty': 'l2'}
-------------------------
Model: LinearPerceptron
Best Score: 0.7397003745318352
Best Params: {'alpha': 0.0011201018954609691, 'eta0': 0.05426010022497572, 'l1_ratio': 0.5136374203004163, 'penalty': 'elasticnet'}
-------------------------
Finished!
#create score df
results_3 = score_comp(best_model_3, 'Random + Scaling')
#compare scores
final_comp = pd.concat([results_1, results_2, results_3], axis=1)
final_comp
| Model | Grid | Grid + Scaling | Random + Scaling |
|---|---|---|---|
| LogisticRegression | 0.793483 | 0.804682 | 0.804682 |
| KNeighborsClassifier | 0.772260 | 0.830562 | 0.829438 |
| RandomForestClassifier | 0.841848 | 0.838427 | 0.840662 |
| SupportVectorMachine | 0.786729 | 0.795718 | 0.804669 |
| StochasticGradientDescent | 0.802459 | 0.818165 | 0.815918 |
| LinearPerceptron | 0.769875 | 0.757566 | 0.739700 |
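The comparison table is assembled by concatenating the three result sets column-wise. The implementation of score_comp() isn't shown here, so the sketch below assumes it returns a pandas Series of best CV scores indexed by model name and named after the run; the values are a subset copied from the table for illustration.

```python
import pandas as pd

# hypothetical stand-ins for what score_comp() returns for each run
results_1 = pd.Series({"LogisticRegression": 0.793483,
                       "RandomForestClassifier": 0.841848}, name="Grid")
results_2 = pd.Series({"LogisticRegression": 0.804682,
                       "RandomForestClassifier": 0.838427}, name="Grid + Scaling")
results_3 = pd.Series({"LogisticRegression": 0.804682,
                       "RandomForestClassifier": 0.840662}, name="Random + Scaling")

# axis=1 places each run in its own column, aligned on the model index
final_comp = pd.concat([results_1, results_2, results_3], axis=1)
final_comp.index.name = "Model"
```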
#save each models submission
#save each model's submission
for i in range(len(best_model_3)):
    model = best_model_3[i]['best_model']
    file = best_model_3[i]['name'] + time.strftime("%Y%m%d_%H%M%S")
    end_results3 = save_submission_file(model, best_columns_2, file)
Created LogisticRegression20220224_163223.csv! Created KNeighborsClassifier20220224_163223.csv! Created RandomForestClassifier20220224_163223.csv! Created SupportVectorMachine20220224_163223.csv! Created StochasticGradientDescent20220224_163223.csv! Created LinearPerceptron20220224_163223.csv!
#need to input this manually - can it be automated?
kaggle_rs = [0.77990, 0.75598, 0.77033, 0.77033, 0.77033, 0.77272]
# kaggle = [0.77990, 0.75598, 0.76555, 0.76794, 0.77033, 0.75598] for quick comparison
plot_scores_vs_kaggle(best_model_3, kaggle_rs)
We can see that by using RandomizedSearchCV instead of GridSearchCV, our Kaggle scores actually improved for most of our models. Five out of six models scored over 77%! There also seems to have been less overfitting when RandomizedSearchCV was used. While I believe it is more than possible to further tweak and toy with all of the hyperparameter tuning, this feels like a good place to end the project. We have seen a lot of different models and the effects that hyperparameter tuning has had on them.
plot_model_progression(final_comp)
Above we can see our final plot that shows how our three iterations of our workflow affected the predicted accuracy of our models. It's clear that some models responded much better to our processing, feature engineering, and hyperparameter tuning than others.
We can successfully say that we've created a relatively simple workflow to be used with Kaggle competitions, as well as other machine learning projects. One of the most challenging things was learning how to configure the hyperparameters for use with RandomizedSearchCV. There is an overwhelming amount of information on not only RandomizedSearchCV, but each individual model as well. I hope this project has served as a good basis for understanding how to create a good workflow, as well as how to pick, modify, and tune machine learning models. While this project is officially finished here, there are a few more additional steps one can take to further improve the results:
GridSearchCV and RandomizedSearchCV