Random forests are an ensemble model built from many decision trees. Each tree is trained on a random sample of the data and considers a random subset of the features at each split, so individual trees specialize on particular features while the forest as a whole maintains an overview of all features.
The root node (the first decision node) of each tree partitions the data using the feature that provides the most information gain.
Information gain tells us how important a given attribute of the feature vector is with respect to the final prediction.
For a more in-depth overview of feature importance and its relation to information gain, specifically for Random Forests, see the following link: Selecting Features by Importance
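If you want to see what information gain actually measures, here is a minimal sketch (illustrative only, not part of the Titanic pipeline below) that computes the entropy-based information gain of a single binary split:
# entropy-based information gain for one candidate split (illustrative sketch)
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, left_mask):
    # left_mask marks the rows that the split sends to the left child
    left, right = labels[left_mask], labels[~left_mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

y = np.array([0, 0, 1, 1, 1, 0])
split = np.array([True, True, True, False, False, False])
print(information_gain(y, split))   # higher values mean a more informative split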
- It is vital that you use your understanding from the previous lessons to analyze each section of code before continuing.
- This lesson assumes you are starting to see patterns in how data is handled and processed before being used to train or test the model.
For this lesson we are going to build our own extracted dataframe from the provided Titanic dataset in order to make feature importance more evident. We will do this by converting text values to numbers, which also makes processing more efficient.
# suppress warnings so the pre-processing output stays readable
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
pd.options.display.max_rows = 100
Since we will be doing a lot of pre-processing for this dataset, we will want to know when cells have completed execution - let's make that extremely obvious by writing a simple function.
# print function for determining when feature processing is complete
def status(feature):
    print('Processing', feature, ':OK')
Now that we have that, let's create another function that loads the training and test data, strips the target column from the training set, and combines the two into a single dataframe so everything can be processed in one place.
def get_combined_data():
    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')
    # extracting and removing the targets from the training data
    targets = train.Survived
    train.drop('Survived', axis=1, inplace=True)
    # stack train and test into one dataframe with a fresh, continuous index
    combined = pd.concat([train, test])
    combined.reset_index(drop=True, inplace=True)
    return combined
With our new get_combined_data function, let's create the combined dataframe and verify that it is working by checking its shape.
combined = get_combined_data()
combined.shape
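If you are using the standard Kaggle Titanic files, the combined dataframe should contain the 891 training rows followed by the 418 test rows, with the Survived column removed. A quick sanity check (the expected shape is an assumption about those standard files):
# 891 train rows + 418 test rows, 11 columns once Survived is dropped
assert combined.shape == (891 + 418, 11)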
Now that we have the dataset extracted, let's extract the passenger titles from the Name column of the dataframe.
def get_titles():
    global combined
    # extract the title from each name (the text between the comma and the period)
    combined['Title'] = combined['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
    # map each raw title onto a small set of categories
    Title_Dictionary = {
        'Capt': 'Officer',
        'Col': 'Officer',
        'Major': 'Officer',
        'Jonkheer': 'Royalty',
        'Don': 'Royalty',
        'Sir': 'Royalty',
        'Dr': 'Officer',
        'Rev': 'Officer',
        'the Countess': 'Royalty',
        'Dona': 'Royalty',
        'Mme': 'Mrs',
        'Mlle': 'Miss',
        'Ms': 'Mrs',
        'Mr': 'Mr',
        'Mrs': 'Mrs',
        'Miss': 'Miss',
        'Master': 'Master',
        'Lady': 'Royalty'
    }
    combined['Title'] = combined.Title.map(Title_Dictionary)
    combined.drop('Name', axis=1, inplace=True)
With our function in place, let's test it and verify the changes in our combined dataframe.
get_titles()
combined.head(5)
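A quick way to confirm that every raw title was mapped is to count the values in the new column; any NaN entries would indicate a title that is missing from Title_Dictionary.
# count each mapped title; NaN means an unmapped title
combined['Title'].value_counts(dropna=False)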
Next, group the passengers by Sex, Pclass, and Title, and use the median() method to display the median values for each group.
grouped = combined.groupby(['Sex','Pclass','Title'])
grouped.median()
With the median data for each 'Title' relative to each 'Sex' and 'Pclass', we can now do our own imputation while processing.
Keeping the above in mind, let's define defaults for the 'Age' values using the median data above to fill in any blanks or indiscernible values.
def process_age():
    global combined

    # a function that fills the missing values of the Age variable
    def fillAges(row):
        if row['Sex'] == 'female' and row['Pclass'] == 1:
            if row['Title'] == 'Miss':
                return 30
            elif row['Title'] == 'Mrs':
                return 45
            elif row['Title'] == 'Officer':
                return 49
            elif row['Title'] == 'Royalty':
                return 39
        elif row['Sex'] == 'female' and row['Pclass'] == 2:
            if row['Title'] == 'Miss':
                return 20
            elif row['Title'] == 'Mrs':
                return 30
        elif row['Sex'] == 'female' and row['Pclass'] == 3:
            if row['Title'] == 'Miss':
                return 18
            elif row['Title'] == 'Mrs':
                return 31
        elif row['Sex'] == 'male' and row['Pclass'] == 1:
            if row['Title'] == 'Master':
                return 6
            elif row['Title'] == 'Mr':
                return 41.5
            elif row['Title'] == 'Officer':
                return 52
            elif row['Title'] == 'Royalty':
                return 40
        elif row['Sex'] == 'male' and row['Pclass'] == 2:
            if row['Title'] == 'Master':
                return 2
            elif row['Title'] == 'Mr':
                return 30
            elif row['Title'] == 'Officer':
                return 41.5
        elif row['Sex'] == 'male' and row['Pclass'] == 3:
            if row['Title'] == 'Master':
                return 6
            elif row['Title'] == 'Mr':
                return 26

    combined.Age = combined.apply(lambda r: fillAges(r) if np.isnan(r['Age']) else r['Age'], axis=1)

    status('age')
- Ensure you understand the full scope of what the function above is doing before continuing!
After executing the function below, note that there are still several features that contain rows which are null, NaN, or otherwise indiscernible.
# execute the function
process_age()
combined.info()
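To see exactly which columns still contain missing values (rather than scanning the info() output), a short check like this works:
# count the remaining missing values per column and show only the non-zero ones
missing = combined.isnull().sum()
missing[missing > 0]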
Next, let's replace the text-based Title column with dummy (one-hot) variables. This will result in faster processing and clearer results.
def process_names():
    global combined
    titles_dummies = pd.get_dummies(combined['Title'], prefix='Title')
    combined = pd.concat([combined, titles_dummies], axis=1)
    combined.drop('Title', axis=1, inplace=True)
    status('Name')
# execute the function
process_names()
combined.head(5)
def process_fare():
    global combined
    # fill the missing Fare values with the mean fare
    combined['Fare'] = combined['Fare'].fillna(combined['Fare'].mean())
    status('fare')
# execute the function
process_fare()
combined.head(5)
def process_embarked():
    global combined
    # fill the missing embarkation ports with the most common value, 'S'
    combined['Embarked'] = combined['Embarked'].fillna('S')
    # dummy encoding
    embarked_dummies = pd.get_dummies(combined['Embarked'], prefix='Embarked')
    combined = pd.concat([combined, embarked_dummies], axis=1)
    combined.drop('Embarked', axis=1, inplace=True)
    status('Embarked')
# execute the function
process_embarked()
def process_cabin():
    global combined
    # fill missing cabins with 'U' for 'Unknown'
    combined['Cabin'] = combined['Cabin'].fillna('U')
    # keep only the first letter of each cabin value (the deck)
    combined['Cabin'] = combined['Cabin'].map(lambda c: c[0])
    # dummy encoding
    cabin_dummies = pd.get_dummies(combined['Cabin'], prefix='Cabin')
    combined = pd.concat([combined, cabin_dummies], axis=1)
    combined.drop('Cabin', axis=1, inplace=True)
    status('Cabin')
# execute the function
process_cabin()
combined.head(5)
def process_sex():
    global combined
    combined['Sex'] = combined['Sex'].map({'male': 0, 'female': 1})
    status('Sex')
# execute the function
process_sex()
def process_pclass():
    global combined
    pclass_dummies = pd.get_dummies(combined['Pclass'], prefix='Pclass')
    combined = pd.concat([combined, pclass_dummies], axis=1)
    combined.drop('Pclass', axis=1, inplace=True)
    status('pclass')
# execute the function
process_pclass()
def process_ticket():
    global combined

    # a function that extracts the prefix of a ticket, returning 'XXX' if there is no prefix
    def cleanTicket(ticket):
        ticket = ticket.replace('.', '')
        ticket = ticket.replace('/', '')
        ticket = ticket.split()
        ticket = list(map(lambda t: t.strip(), ticket))
        ticket = list(filter(lambda t: not t.isdigit(), ticket))
        if len(ticket) > 0:
            return ticket[0]
        else:
            return 'XXX'

    # extracting dummy variables from the ticket prefixes
    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)
    status('Ticket')
# execute the function
process_ticket()
combined.head(5)
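To make the prefix extraction concrete, here is a standalone copy of the cleaning logic applied to a few raw ticket strings (the sample strings below are illustrative examples of the raw format):
# standalone demo of the prefix-cleaning logic used above
def ticket_prefix(ticket):
    ticket = ticket.replace('.', '').replace('/', '')
    parts = [t.strip() for t in ticket.split()]
    parts = [t for t in parts if not t.isdigit()]
    return parts[0] if parts else 'XXX'

print(ticket_prefix('A/5 21171'))   # -> 'A5'
print(ticket_prefix('PC 17599'))    # -> 'PC'
print(ticket_prefix('113803'))      # -> 'XXX' (purely numeric ticket, no prefix)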
def process_family():
    global combined
    # family size = parents/children + siblings/spouses + the passenger themselves
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    # flag passengers travelling alone
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    # flag small families (2 to 4 people)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    # flag large families (5 or more people)
    combined['BigFamily'] = combined['FamilySize'].map(lambda s: 1 if s > 4 else 0)
    status('family')
# execute the function
process_family()
combined.shape
combined.head(5)
def scale_all_features():
    global combined
    features = list(combined.columns)
    features.remove('PassengerId')
    # scale every feature except PassengerId by dividing it by its maximum value
    combined[features] = combined[features].apply(lambda x: x / x.max(), axis=0)
    print('Features scaled successfully!')
scale_all_features()
combined.head(5)
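Dividing by the column maximum only makes sense because every remaining feature is non-negative after encoding. If you want a quick sanity check that the scaling behaved as expected:
# every scaled column (PassengerId excluded) should now top out at 1.0
combined.drop('PassengerId', axis=1).max().max()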
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
def compute_score(clf, X, y, scoring='accuracy'):
    # determine the cross validation scores
    xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    # return the mean of the cross val scores
    return np.mean(xval)
def recover_train_test_target():
    global combined
    # get the target values
    train0 = pd.read_csv('train.csv')
    targets = train0.Survived
    # split the data
    train = combined.loc[0:890]
    test = combined.loc[891:]
    return train, test, targets
train, test, targets = recover_train_test_target()
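Assuming the standard 891-row training file and 418-row test file, a quick check that the split lines up with the target vector:
# the recovered training rows must align one-to-one with the targets
assert len(train) == len(targets) == 891
assert len(test) == 418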
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
# initialize the model ( set the number of estimators )
clf = ExtraTreesClassifier(n_estimators=400)
# fit the model on the processed training data and the targets
clf = clf.fit(train, targets)
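As a quick sanity check, the compute_score helper defined earlier can report the 5-fold cross-validated accuracy of this feature-ranking model (the exact score will vary from run to run):
# mean 5-fold cross-validated accuracy of the ExtraTreesClassifier
print(compute_score(clf, train, targets))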
This step is enabled by the built-in 'feature_importances_' attribute of the ExtraTreesClassifier() model we just fit.
features = pd.DataFrame()
# add the training data to the 'feature' column in the 'features' dataframe
features['feature'] = train.columns
# add the feature_importances_ attribute data to the 'importance' column in the 'features' dataframe
features['importance'] = clf.feature_importances_
# display the list of features and their importance
# adjust the .head() to see the entire list of features
features.sort_values(['importance'], ascending=False).head(12)
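The ranking is often easier to read as a plot. Here is a minimal sketch using the pyplot import from the top of the lesson:
# plot the ranked importances as a horizontal bar chart
features.sort_values('importance', ascending=True).plot(
    x='feature', y='importance', kind='barh', figsize=(10, 12), legend=False)
plt.title('Feature importance (ExtraTreesClassifier)')
plt.show()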
# use SelectFromModel to load our pre-existing classifier model into the new dataframe
# prefit=True because the classifier has already been fit on the features and targets
model = SelectFromModel(clf, prefit=True)
# create new training data from the pre-existing model
train_new = model.transform(train)
# observe the training dataframe shape
train_new.shape
Let's do the same for our testing data.
# create new testing data
test_new = model.transform(test)
# observe the testing dataframe shape
test_new.shape
forest = RandomForestClassifier(max_features='sqrt')
parameter_grid = {
    'max_depth': [4, 5, 6, 7, 8],
    'n_estimators': [200, 300, 400],
    'criterion': ['gini', 'entropy']
}
# get cross validation table for GridSearch (n_splits=5)
cross_validation = StratifiedKFold(n_splits=5)
grid_search = GridSearchCV(forest, param_grid=parameter_grid, cv=cross_validation)
grid_search.fit(train_new, targets)
print('Best score : {}'.format(grid_search.best_score_))
print('Best parameters : {}'.format(grid_search.best_params_))
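The tuned forest can also be scored with the compute_score helper from earlier (GridSearchCV refits the best estimator on the full training data by default):
# mean 5-fold cross-validated accuracy of the tuned Random Forest
best_forest = grid_search.best_estimator_
print(compute_score(best_forest, train_new, targets))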
Random forests are a popular method for feature ranking, since they are so easy to apply.
In general they require very little feature engineering or parameter tuning, and mean decrease in impurity is exposed by most random forest libraries.
But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories.
As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.
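Here is a minimal, self-contained sketch of the correlated-features gotcha on synthetic data: duplicating a single strong feature splits its importance between the two copies.
# synthetic demonstration: one informative feature, duplicated
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)             # only the first feature carries signal
X_dup = np.column_stack([X, X[:, 0]])     # add an exact copy of that feature

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_dup, y)
print(rf.feature_importances_)            # the signal's importance is shared by columns 0 and 3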
You are not required to complete this final section; it is mainly for observation.
# create output dataframe
# df_output = pd.DataFrame()
# get grid_search prediction results
# pipeline = grid_search
# output = pipeline.predict(test_new).astype(int)
# store the predictions in the 'survived' column
# df_output['Survived'] = output
# store the passenger ids in the corresponding table
# df_output['PassengerId'] = test['PassengerId']
# write the output dataframe to CSV, omitting the index
# df_output[['PassengerId', 'Survived']].to_csv('output.csv', index=False)