Random forests are an ensemble model built from many decision trees. Each tree is trained on a random sample of the data and considers a random subset of the features at each split, so individual trees specialize on particular features while the forest as a whole maintains an overview of all features.
The root node (the first decision node) of each tree partitions the data using the feature that provides the most information gain.
Information gain tells us how important a given attribute of the feature vector is with respect to the final prediction.
For a more in-depth overview of feature importance and its relation to information gain, specifically for Random Forests, see the following link: Selecting Features by Importance
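If you want to see what information gain actually measures, here is a minimal sketch (illustrative only, not part of the Titanic pipeline below) that computes the entropy-based information gain of a single binary split:
# entropy-based information gain for one candidate split (illustrative sketch)
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    # left_mask marks the rows that the split sends to the left child
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

y = np.array([0, 0, 1, 1, 1, 0])
split = np.array([True, True, True, False, False, False])
print(information_gain(y, split))   # higher values mean a more informative split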
- It is vital that you use your understanding from the previous lessons to analyze each section of code before continuing.
- This lesson assumes you are starting to see patterns in how data is handled and processed before being used to train or test the model.
For this lesson we are going to build our own extracted dataframe from the provided Titanic dataset in order to make feature importance more evident. We will do this by converting text values to numbers, which also makes processing more efficient.
# suppress warnings so the pre-processing output stays readable
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
pd.options.display.max_rows = 100
Since we will be doing a lot of pre-processing for this dataset, we will want to know when cells have completed execution - let's make that extremely obvious by writing a simple function.
# print function for determining when feature processing is complete
def status(feature):
    print('Processing', feature, ':OK')
Now that we have that, let's create another function that loads the training and test data, strips the target column from the training set, and combines the two into a single dataframe so everything can be processed in one place.
def get_combined_data():
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    # extracting and removing the targets from the training data
    targets = train.Survived
    train.drop('Survived', axis=1, inplace=True)
    # stack train and test into one dataframe with a fresh, continuous index
    combined = pd.concat([train, test])
    combined.reset_index(drop=True, inplace=True)
    return combined
With our new get_combined_data function, let's create the combined dataframe and verify that it is working by checking its shape.
combined = get_combined_data()
combined.shape
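If you are using the standard Kaggle Titanic files, the combined dataframe should contain the 891 training rows followed by the 418 test rows, with the Survived column removed. A quick sanity check (the expected shape is an assumption about those standard files):
# 891 train rows + 418 test rows, 11 columns once Survived is dropped
assert combined.shape == (891 + 418, 11)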
Now that we have the dataset extracted, let's extract the passenger titles from the Name column of the dataframe.
def get_titles():
    global combined
    # extract the title from each name (the text between the comma and the period)
    combined['Title'] = combined['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
    # map each raw title onto a small set of categories
    Title_Dictionary = {
        'Capt': 'Officer',
        'Col': 'Officer',
        'Major': 'Officer',
        'Jonkheer': 'Royalty',
        'Don': 'Royalty',
        'Sir': 'Royalty',
        'Dr': 'Officer',
        'Rev': 'Officer',
        'the Countess': 'Royalty',
        'Dona': 'Royalty',
        'Mme': 'Mrs',
        'Mlle': 'Miss',
        'Ms': 'Mrs',
        'Mr': 'Mr',
        'Mrs': 'Mrs',
        'Miss': 'Miss',
        'Master': 'Master',
        'Lady': 'Royalty'
    }
    combined['Title'] = combined.Title.map(Title_Dictionary)
    combined.drop('Name', axis=1, inplace=True)
With our function in place, let's test it and verify the changes in our combined dataframe.
get_titles()
combined.head(5)
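A quick way to confirm that every raw title was mapped is to count the values in the new column; any NaN entries would indicate a title that is missing from Title_Dictionary.
# count each mapped title; NaN means an unmapped title
combined['Title'].value_counts(dropna=False)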
Next, group the passengers by Sex, Pclass, and Title, and use the median() method to display the median values for each group.
grouped = combined.groupby(['Sex','Pclass','Title'])
grouped.median()
With the median data for each 'Title' relative to each 'Sex' and 'Pclass', we can now do our own imputation while processing.
Keeping the above in mind, let's define defaults for the 'Age' values using the median data above to fill in any blanks or indiscernible values.
def process_age():
    global combined

    # a function that fills the missing values of the Age variable
    def fillAges(row):
        if row['Sex'] == 'female' and row['Pclass'] == 1:
            if row['Title'] == 'Miss':
                return 30
            elif row['Title'] == 'Mrs':
                return 45
            elif row['Title'] == 'Officer':
                return 49
            elif row['Title'] == 'Royalty':
                return 39
        elif row['Sex'] == 'female' and row['Pclass'] == 2:
            if row['Title'] == 'Miss':
                return 20
            elif row['Title'] == 'Mrs':
                return 30
        elif row['Sex'] == 'female' and row['Pclass'] == 3:
            if row['Title'] == 'Miss':
                return 18
            elif row['Title'] == 'Mrs':
                return 31
        elif row['Sex'] == 'male' and row['Pclass'] == 1:
            if row['Title'] == 'Master':
                return 6
            elif row['Title'] == 'Mr':
                return 41.5
            elif row['Title'] == 'Officer':
                return 52
            elif row['Title'] == 'Royalty':
                return 40
        elif row['Sex'] == 'male' and row['Pclass'] == 2:
            if row['Title'] == 'Master':
                return 2
            elif row['Title'] == 'Mr':
                return 30
            elif row['Title'] == 'Officer':
                return 41.5
        elif row['Sex'] == 'male' and row['Pclass'] == 3:
            if row['Title'] == 'Master':
                return 6
            elif row['Title'] == 'Mr':
                return 26

    combined.Age = combined.apply(lambda r: fillAges(r) if np.isnan(r['Age']) else r['Age'], axis=1)

    status('age')
- Ensure you understand the full scope of what the function above is doing before continuing!
After executing the function below, note that there are still several features that contain rows which are null, NaN, or otherwise indiscernible.
# execute the function
process_age()
combined.info()
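To see exactly which columns still contain missing values (rather than scanning the info() output), a short check like this works:
# count the remaining missing values per column and show only the non-zero ones
missing = combined.isnull().sum()
missing[missing > 0]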
Next, let's replace the text-based Title column with dummy (one-hot) variables. This will result in faster processing and clearer results.
def process_names():
    global combined
    titles_dummies = pd.get_dummies(combined['Title'], prefix='Title')
    combined = pd.concat([combined, titles_dummies], axis=1)
    combined.drop('Title', axis=1, inplace=True)
    status('Name')
# execute the function
process_names()
combined.head(5)
def process_fare():
    global combined
    # fill the missing Fare values with the mean fare
    combined['Fare'] = combined['Fare'].fillna(combined['Fare'].mean())
    status('fare')
# execute the function
process_fare()
combined.head(5)
def process_embarked():
    global combined
    # fill the missing embarkation ports with the most common value, 'S'
    combined['Embarked'] = combined['Embarked'].fillna('S')
    # dummy encoding
    embarked_dummies = pd.get_dummies(combined['Embarked'], prefix='Embarked')
    combined = pd.concat([combined, embarked_dummies], axis=1)
    combined.drop('Embarked', axis=1, inplace=True)
    status('Embarked')
# execute the function
process_embarked()
def process_cabin():
    global combined
    # fill missing cabins with 'U' for 'Unknown'
    combined['Cabin'] = combined['Cabin'].fillna('U')
    # keep only the first letter of each cabin value (the deck)
    combined['Cabin'] = combined['Cabin'].map(lambda c: c[0])
    # dummy encoding
    cabin_dummies = pd.get_dummies(combined['Cabin'], prefix='Cabin')
    combined = pd.concat([combined, cabin_dummies], axis=1)
    combined.drop('Cabin', axis=1, inplace=True)
    status('Cabin')
# execute the function
process_cabin()
combined.head(5)
def process_sex():
    global combined
    combined['Sex'] = combined['Sex'].map({'male': 0, 'female': 1})
    status('Sex')
# execute the function
process_sex()
def process_pclass():
    global combined
    pclass_dummies = pd.get_dummies(combined['Pclass'], prefix='Pclass')
    combined = pd.concat([combined, pclass_dummies], axis=1)
    combined.drop('Pclass', axis=1, inplace=True)
    status('pclass')
# execute the function
process_pclass()
def process_ticket():
    global combined

    # a function that extracts the prefix of a ticket, returning 'XXX' if there is no prefix
    def cleanTicket(ticket):
        ticket = ticket.replace('.', '')
        ticket = ticket.replace('/', '')
        ticket = ticket.split()
        ticket = list(map(lambda t: t.strip(), ticket))
        ticket = list(filter(lambda t: not t.isdigit(), ticket))
        if len(ticket) > 0:
            return ticket[0]
        else:
            return 'XXX'

    # extracting dummy variables from the ticket prefixes
    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)
    status('Ticket')
# execute the function
process_ticket()
combined.head(5)
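To make the prefix extraction concrete, here is a standalone copy of the cleaning logic applied to a few raw ticket strings (the sample strings below are illustrative examples of the raw format):
# standalone demo of the prefix-cleaning logic used above
def ticket_prefix(ticket):
    ticket = ticket.replace('.', '').replace('/', '')
    parts = [t.strip() for t in ticket.split()]
    parts = [t for t in parts if not t.isdigit()]
    return parts[0] if parts else 'XXX'

print(ticket_prefix('A/5 21171'))   # -> 'A5'
print(ticket_prefix('PC 17599'))    # -> 'PC'
print(ticket_prefix('113803'))      # -> 'XXX' (purely numeric ticket, no prefix)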
def process_family():
    global combined
    # family size = parents/children + siblings/spouses + the passenger themselves
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    # flag passengers travelling alone
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    # flag small families (2 to 4 people)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    # flag large families (5 or more people)
    combined['BigFamily'] = combined['FamilySize'].map(lambda s: 1 if s > 4 else 0)
    status('family')
# execute the function
process_family()
combined.shape
combined.head(5)
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    # scale every feature except PassengerId by dividing it by its maximum value
    combined[features] = combined[features].apply(lambda x: x / x.max(), axis=0)
    print('Features scaled successfully!')
scale_all_features()
combined.head(5)
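Dividing by the column maximum only makes sense because every remaining feature is non-negative after encoding. If you want a quick sanity check that the scaling behaved as expected:
# every scaled column (PassengerId excluded) should now top out at 1.0
combined.drop('PassengerId', axis=1).max().max()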
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
def compute_score(clf, X, y, scoring='accuracy'):
    # determine the cross validation scores
    xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    # return the mean of the cross val scores
    return np.mean(xval)
def recover_train_test_target():
    global combined
    # get the target values
    train0 = pd.read_csv('train.csv')
    targets = train0.Survived
    # split the data
    train = combined.loc[0:890]
    test = combined.loc[891:]
    return train, test, targets
train, test, targets = recover_train_test_target()
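Assuming the standard 891-row training file and 418-row test file, a quick check that the split lines up with the target vector:
# the recovered training rows must align one-to-one with the targets
assert len(train) == len(targets) == 891
assert len(test) == 418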
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
# initialize the model ( set the number of estimators )
clf = ExtraTreesClassifier(n_estimators=400)
# fit the model on the processed training data and the targets
clf = clf.fit(train, targets)
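As a quick sanity check, the compute_score helper defined earlier can report the 5-fold cross-validated accuracy of this feature-ranking model (the exact score will vary from run to run):
# mean 5-fold cross-validated accuracy of the ExtraTreesClassifier
print(compute_score(clf, train, targets))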
This step is enabled by the built-in 'feature_importances_' attribute of the ExtraTreesClassifier() model we just fit.
features = pd.DataFrame()
# add the training data to the 'feature' column in the 'features' dataframe
features['feature'] = train.columns
# add the feature_importances_ attribute data to the 'importance' column in the 'features' dataframe
features['importance'] = clf.feature_importances_
# display the list of features and their importance
# adjust the .head() to see the entire list of features
features.sort_values(['importance'], ascending=False).head(12)
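The ranking is often easier to read as a plot. Here is a minimal sketch using the pyplot import from the top of the lesson:
# plot the ranked importances as a horizontal bar chart
features.sort_values('importance', ascending=True).plot(
    x='feature', y='importance', kind='barh', figsize=(10, 12), legend=False)
plt.title('Feature importance (ExtraTreesClassifier)')
plt.show()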
# use SelectFromModel to load our pre-existing classifier model into the new dataframe
# prefit=True because the classifier has already been fit on the features and targets
model = SelectFromModel(clf, prefit=True)
# create new training data from the pre-existing model
train_new = model.transform(train)
# observe the training dataframe shape
train_new.shape
Let's do the same for our testing data.
# create new testing data
test_new = model.transform(test)
# observe the testing dataframe shape
test_new.shape
forest = RandomForestClassifier(max_features='sqrt')
parameter_grid = {
    'max_depth': [4, 5, 6, 7, 8],
    'n_estimators': [200, 300, 400],
    'criterion': ['gini', 'entropy']
}
# get cross validation table for GridSearch (n_splits=5)
cross_validation = StratifiedKFold(n_splits=5)
grid_search = GridSearchCV(forest, param_grid=parameter_grid, cv=cross_validation)
grid_search.fit(train_new, targets)
print('Best score : {}'.format(grid_search.best_score_))
print('Best parameters : {}'.format(grid_search.best_params_))
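The tuned forest can also be scored with the compute_score helper from earlier (GridSearchCV refits the best estimator on the full training data by default):
# mean 5-fold cross-validated accuracy of the tuned Random Forest
best_forest = grid_search.best_estimator_
print(compute_score(best_forest, train_new, targets))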
Random forests are a popular method for feature ranking, since they are so easy to apply.
In general they require very little feature engineering or parameter tuning, and mean decrease in impurity is exposed by most random forest libraries.
But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories.
As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.
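Here is a minimal, self-contained sketch of the correlated-features gotcha on synthetic data: duplicating a single strong feature splits its importance between the two copies.
# synthetic demonstration: one informative feature, duplicated
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)             # only the first feature carries signal
X_dup = np.column_stack([X, X[:, 0]])     # add an exact copy of that feature

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_dup, y)
print(rf.feature_importances_)            # the signal's importance is shared by columns 0 and 3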
You are not required to complete this final section; it is mainly for observation.
# create output dataframe
# df_output = pd.DataFrame()
# get grid_search prediction results
# pipeline = grid_search
# output = pipeline.predict(test_new).astype(int)
# store the predictions in the 'survived' column
# df_output['Survived'] = output
# store the passenger ids in the corresponding table
# df_output['PassengerId'] = test['PassengerId']
# write the output dataframe to CSV, omitting the index
# df_output[['PassengerId', 'Survived']].to_csv('output.csv', index=False)