Machine Learning Using Python (MEAFA Workshop)

Lesson 1: Introduction to Machine Learning

In this lesson we will use the linear regression and K-nearest neighbours regression methods to illustrate the key steps of a machine learning project: splitting a dataset into training and test sets, estimating, tuning and selecting models, making predictions, and evaluating generalisation performance. We will also illustrate other important practical aspects of modelling, such as exploratory data analysis.

Credit Card Data
Training and Test Sets
Exploratory Data Analysis
Data Preparation
Linear Regression
K-Nearest Neighbours Regression
Hyperparameter Tuning
Model Selection
Model Evaluation

This notebook relies on the following imports and settings. We will load new functions and libraries in context to make clear what we are using them for.

In [30]:
# Packages
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') # suppress warnings in this notebook; in practice we should usually leave them on
In [31]:
# Plot settings
sns.set_context('notebook') # optimise figures for notebook display
sns.set_style('ticks') # set default plot style
colours = ['#1F77B4', '#FF7F0E', '#2CA02C', '#DB2728', '#9467BD', '#8C564B', '#E377C2','#7F7F7F', '#BCBD22', '#17BECF']
crayon = ['#4E79A7','#F28E2C','#E15759','#76B7B2','#59A14F', '#EDC949','#AF7AA1','#FF9DA7','#9C755F','#BAB0AB']
sns.set_palette(colours) # set custom color scheme
%matplotlib inline
plt.rcParams['figure.figsize'] = (9, 6)

Credit Card Data

In this lesson we will use the Credit dataset from the Introduction to Statistical Learning textbook by James, Witten, Hastie and Tibshirani. The dataset records the average end-of-month credit card balance for customers of a financial services company, as well as other individual characteristics such as age, education, gender, marital status, number of cards, and credit rating.

Business objective: To predict the average monthly credit card balance of customers based on individual characteristics.

We start by loading the data. Since the dataset is in CSV format, we use the read_csv function from pandas (imported above) to read the file. The package will automatically read the column names and infer the variable types. We assign the result to a variable called data, which stores the dataset as an object called a DataFrame.

In [32]:
# We will always assume that the data file is in a subdirectory called "Datasets"

data = pd.read_csv('Datasets/Credit.csv', index_col='Obs')
data.head(10) # display the first ten observations
Income Limit Rating Cards Age Education Gender Student Married Ethnicity Balance
1 14.891 3606 283 2 34 11 Male No Yes Caucasian 333
2 106.025 6645 483 3 82 15 Female Yes Yes Asian 903
3 104.593 7075 514 4 71 11 Male No No Asian 580
4 148.924 9504 681 3 36 11 Female No No Asian 964
5 55.882 4897 357 2 68 16 Male No Yes Caucasian 331
6 80.180 8047 569 4 77 10 Male No No Caucasian 1151
7 20.996 3388 259 2 37 12 Female No No African American 203
8 71.408 7114 512 2 87 9 Male No No Asian 872
9 15.125 3300 266 5 66 13 Female No No Caucasian 279
10 71.061 6819 491 3 41 19 Female Yes Yes African American 1350

Training and Test Sets

We use the Scikit-Learn train_test_split method to split the data into training and test sets.

Below, we specify that the training set will contain 70% of the data. The random_state parameter is an arbitrary number: by fixing it, we will get the same training and test sets every time we run the analysis, even though the split is random.

In [33]:
from sklearn.model_selection import train_test_split

# Randomly split indexes
index_train, index_test  = train_test_split(np.array(data.index), train_size=0.7, random_state=10)

# Write training and test sets 
train = data.loc[index_train,:].copy()
test =  data.loc[index_test,:].copy()

Exploratory Data Analysis

Exploratory data analysis (EDA) is the process of discovering features and patterns in the data that should inform the modelling process and in some cases prevent errors. Here, we conduct a short EDA of the dataset. Remember that we should use only the training set for this purpose.

We start by computing the descriptive statistics and pairwise correlations for the numerical variables (we will discuss categorical predictors next week). When displaying the results, we chain the round method to limit the number of decimal places and facilitate visual inspection.

The count row in the first table shows the number of available observations for each variable. In this case all variables have the same count, reassuring us that there are no missing values. The response has a large standard deviation (458) relative to its mean (521): we should keep these magnitudes in mind when discussing coefficient sizes and prediction errors. The minimum and first quartile of the response indicate that a substantial fraction of the credit card customers have zero or very low balances.

In [34]:
print(len(train)) #  number of observations in the training data frame
train.describe().round(1) # rounding the table to one decimal digit
Income Limit Rating Cards Age Education Balance
count 280.0 280.0 280.0 280.0 280.0 280.0 280.0
mean 44.6 4731.4 354.5 2.9 55.3 13.4 521.1
std 34.8 2343.7 156.6 1.3 17.3 3.2 458.2
min 10.4 855.0 93.0 1.0 23.0 5.0 0.0
25% 19.7 3082.5 247.2 2.0 41.0 11.0 64.0
50% 33.0 4636.0 344.0 3.0 54.0 14.0 459.5
75% 57.5 6034.8 443.5 4.0 69.0 16.0 864.2
max 186.6 13913.0 982.0 9.0 98.0 20.0 1999.0
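
As a quick check (the output of this step is not shown in the original notebook), we can also count the missing values directly:

# Count missing values per column in the training set (should be zero for every variable)
train.isnull().sum()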

The pairwise correlations reveal that the predictors most correlated with the response are limit, rating, and income. The remaining predictors have very low correlations with credit card balances. Another important finding is that limit and rating have a nearly perfect positive correlation and are therefore redundant predictors (using both in a regression would lead to collinearity problems). Also note that income is highly correlated with limit and rating. Limit appears to be the key linear predictor in the data.

In [35]:
train.corr().round(2) # pairwise correlations (numerical variables only)
Income Limit Rating Cards Age Education Balance
Income 1.00 0.80 0.80 -0.01 0.17 0.02 0.49
Limit 0.80 1.00 1.00 -0.01 0.10 0.00 0.87
Rating 0.80 1.00 1.00 0.03 0.10 -0.01 0.88
Cards -0.01 -0.01 0.03 1.00 0.11 -0.11 0.04
Age 0.17 0.10 0.10 0.11 1.00 -0.05 0.02
Education 0.02 0.00 -0.01 -0.11 -0.05 1.00 0.01
Balance 0.49 0.87 0.88 0.04 0.02 0.01 1.00

We now investigate the (unconditional) distribution of the response by plotting a histogram. We find that the response has a pronounced right-skewed distribution. A substantial number of customers have zero (as we confirm below) or low credit card balances.

In [36]:
from statlearning import plot_histogram
# statlearning is our custom library, you need to download the file 
# and have it in the same directory as you are running this notebook from

fig, ax = plot_histogram(train['Balance'])
ax.set_title('Distribution plot for customer credit card balance')
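
We can confirm the number of zero balances directly (a quick check whose output is not shown in the original notebook):

# Number of training customers with a credit card balance of exactly zero
(train['Balance'] == 0).sum()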

A scatter plot of credit limit against credit balance reveals a nonlinear relationship between these two variables. Furthermore, there are several clients with a response value of zero, which is likely to affect the model fit.

In [37]:
fig, ax = plt.subplots(figsize=(8,5.5))
ax.scatter(train['Limit'], train['Balance'], s=25) # the s option sets the marker size
ax.set_xlabel('Credit card limit')
ax.set_ylabel('Credit card balance')

The next cell illustrates how we can explore the relationship between the response and a discrete variable with a small number of distinct values.

In [38]:
rows=train['Cards']<=7 # too few observations for more than 7 cards
sns.boxplot(x=train.loc[rows,'Cards'], y=train.loc[rows,'Balance'], palette='Blues')

Data Preparation

In most cases, we need to perform additional processing to get the data ready for training machine learning models and computing predictions. It is good practice to write a function for this step, for several reasons:

  • It allows you to easily reproduce the data processing steps.
  • It helps you to build a library of data processing steps, which can save you a lot of time in the future.
  • You can use the function to continuously process new data in a live system.
  • It makes it simple for you to try different data transformations.
In [39]:
def prepare_data(df):
    df['Male'] = (df['Gender'] == ' Male').astype(int) # dummy variable for gender (note the leading space in ' Male' in the raw data)
    df['Student'] = (df['Student'] == 'Yes').astype(int)
    df['Married'] = (df['Married'] == 'Yes').astype(int)
    df['Caucasian'] = (df['Ethnicity'] == 'Caucasian').astype(int)
    df['Asian'] = (df['Ethnicity'] == 'Asian').astype(int)
    df = df.loc[:, df.dtypes != 'object'] # discard the columns that are not numerical
    return df

train = prepare_data(train)
test = prepare_data(test)
train.head() # display the first few rows of the processed training data
    Income Limit Rating Cards Age Education Student Married Balance Male Caucasian Asian
    400 18.701 5524 415 5 64 7 0 0 966 0 0 1
    26 14.090 4323 326 5 25 16 0 1 671 0 0 0
    280 54.319 3063 248 3 59 8 1 0 269 0 1 0
    261 67.937 5184 383 4 63 12 0 1 345 1 0 1
    131 23.793 3821 281 4 56 12 1 1 868 0 0 0
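
As an aside (a sketch, not part of the original notebook), pandas can generate this kind of indicator column automatically; the drop_first option avoids keeping redundant categories:

# Alternative: automatic dummy encoding of the categorical columns with pandas
dummies = pd.get_dummies(data[['Gender', 'Student', 'Married', 'Ethnicity']], drop_first=True)
dummies.head()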

    Finally, we need to identify and separate the response and the predictors. For illustrative purposes, we use only one predictor in this lesson, the customer's credit card limit.

    In [40]:
# Constructing the response vector and design matrix (matrix of predictor values)
response = 'Balance'
predictors = ['Limit'] # in general we will work with a list of predictors, even though we only have one here
# If we wanted to use all available predictors:
# predictors = [variable for variable in train.columns if variable != response]
y_train = train.pop(response) # removes the response column from the train dataframe and copies it to a new variable
X_train = train[predictors].copy() # selects the variables in the predictor list
y_test = test.pop(response)
X_test = test[predictors].copy()

    Linear Regression

    Scikit-Learn allows us to train and use a wide range of machine learning algorithms using a simple syntax structure:

    1. Import the method.
    2. Specify the model and options.
    3. Train the model.
    4. Use the estimated model to make predictions.

We start with a basic method, linear regression. We use the LinearRegression class from the sklearn.linear_model module to train the model.

    In [41]:
from sklearn.linear_model import LinearRegression

ols = LinearRegression()
ols.fit(X_train, y_train)
    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
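
Before making predictions, we can inspect the estimated intercept and slope coefficient (a quick check that is not shown in the original output):

# Fitted intercept and coefficient of the linear regression
print(ols.intercept_, ols.coef_)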

    Suppose that we want to predict the credit card balance of a customer that has a limit of 5000 dollars.

In [42]:
ols.predict(np.array([[5000]])) # predicted balance for a customer with a credit limit of 5000
array([ 566.97270565])

    Let's visualise the trained model.

    In [43]:
# Generate input values spanning the range of the predictor in the training data
# (two points are enough to draw a straight line)
x = np.linspace(X_train.min(), X_train.max(), 2).reshape((-1,1))
    # Compute the predicted values for each of these input points
    y_pred = ols.predict(x)
    # Figure
    fig, ax = plt.subplots()
    ax.scatter(X_train.values, y_train.values, s=25) # the s option is the size of the dot
    ax.set_title("Linear Regression Fit")
    ax.set_xlabel('Credit card limit')
    ax.plot(x, y_pred, color='black', alpha = 0.7)

K-Nearest Neighbours Regression

    The previous figure suggests that the linear regression model may not be satisfactory for this data since the relationship between limit and balance appears to be nonlinear. The K-Nearest Neighbours (KNN) method is a flexible algorithm that can approximate complex relationships between the response and predictors without assuming a particular form for the regression function.

    Using the KNN method with Scikit-Learn follows the template from above. Unlike in the linear regression model, however, we need to specify a hyper-parameter, the number of neighbours.

    In [44]:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=10) # we need to specify the number of neighbours
knn.fit(X_train, y_train)
knn.predict(np.array([[5000]])) # predicted balance for a credit limit of 5000
    array([ 566.2])
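
To see what the algorithm is doing, we can reproduce this prediction by hand (a minimal sketch, not part of the original notebook): find the ten training customers whose limits are closest to 5000 and average their balances. Up to tie-breaking, this should match the prediction above.

# Manual K-nearest neighbours prediction for a credit limit of 5000 (illustration only)
distances = (X_train['Limit'] - 5000).abs()  # distance from each training customer to the query point
nearest = distances.nsmallest(10).index      # indices of the 10 closest training observations
print(y_train.loc[nearest].mean())           # average balance of these neighbours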

Let's visualise the relationship fitted by the K-Nearest Neighbours algorithm using K=10 neighbours.

    In [45]:
    # Generate a grid of 500 values ranging from the minimum to the maximum value of the variable in the training data
    x = np.linspace(train['Limit'].min(),train['Limit'].max(), 500).reshape((-1,1))
    # Compute the predicted values for each of these input points
    y_pred = knn.predict(x)
    # Plot figure
    fig, ax = plt.subplots()
    ax.scatter(X_train.values, y_train.values, s=25) # the s option is the size of the dot
    ax.set_xlabel('Credit card limit')
    ax.set_xlim(0, 15000)
    ax.plot(x, y_pred, color='black', alpha = 0.7)

The K-Nearest Neighbours fit depends strongly on the choice of the number of neighbours. Choosing K=2 seems to overfit, while K=50 clearly underfits the training data.

    In [46]:
for K in [2, 50]:
    # Specify and fit the model
    knn = KNeighborsRegressor(n_neighbors=K)
    knn.fit(X_train, y_train)
    # Compute the predicted values for each point in the grid
    y_pred = knn.predict(x)
    # Plot figure
    fig, ax = plt.subplots()
    ax.scatter(X_train.values, y_train.values, s=25)
    ax.set_title(f'Number of Neighbours = {K}', fontsize=13)
    ax.set_xlabel('Credit card limit')
    ax.set_xlim(0, 15000)
    ax.plot(x, y_pred, color='black', alpha=0.7)
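
One way to make the overfitting point concrete (a small sketch, not in the original notebook) is to compare the training-set errors of the two choices. The small-K model achieves a much lower training error because each training point is effectively predicted by itself and its closest neighbour, but this does not translate into better generalisation, which is what cross-validation measures in the next section.

# Training RMSE for K=2 and K=50 (illustration of why training error alone favours overfitting)
from sklearn.metrics import mean_squared_error

for K in [2, 50]:
    knn_k = KNeighborsRegressor(n_neighbors=K)
    knn_k.fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, knn_k.predict(X_train)))
    print(f'K = {K}: training RMSE = {train_rmse:.1f}')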

    Hyperparameter Tuning

    The previous sections leave us with two important questions:

    1) How many neighbours should we use in the KNN method?

    2) Should we use a linear regression or the KNN algorithm as our final model?

We use cross-validation to estimate the generalisation performance of different model choices. For clarity, we refer to the choice of a hyperparameter as tuning, and to the choice of method as model selection, even though both are referred to as model selection in statistics.

To select the number of neighbours, we specify a grid of candidate values and select the value with the best cross-validation performance. The Scikit-Learn GridSearchCV method conveniently automates this process. We will follow the template below.

    In [47]:
    from sklearn.model_selection import GridSearchCV
model = KNeighborsRegressor()
tuning_parameters = {
    'n_neighbors': np.arange(1, 51),
}

knn_search = GridSearchCV(model, tuning_parameters, cv=5, scoring='neg_mean_squared_error',
                          return_train_score=False, n_jobs=4)
knn_search.fit(X_train, y_train)
    GridSearchCV(cv=5, error_score='raise',
           estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform'),
       fit_params=None, iid=True, n_jobs=4,
           param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
           18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
           35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50])},
           pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
           scoring='neg_mean_squared_error', verbose=0)

    Note that we specified the following options:

  • cv=5 specifies the number of folds. We can alternatively provide a cross-validation object, as in the sketch after this list.
  • scoring = 'neg_mean_squared_error' specifies the evaluation criterion (it does not change the outcome here, but it is useful to have this option in the template).
  • return_train_score=False speeds up the computations by avoiding unnecessary calculations (this is set to become the default in future versions).
  • n_jobs=4 splits the task across four processor cores, speeding up the computations.
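
For example, the sketch below (an illustration, not part of the original analysis) shows how an explicit KFold object with shuffling could be passed through the cv argument instead of an integer:

# Sketch: using a cross-validation object rather than cv=5
from sklearn.model_selection import KFold

cv_object = KFold(n_splits=5, shuffle=True, random_state=1)
# knn_search = GridSearchCV(model, tuning_parameters, cv=cv_object,
#                           scoring='neg_mean_squared_error', n_jobs=4)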

We can access the best_params_ attribute to view the selected value of the hyperparameter.

In [48]:
knn_search.best_params_ # hyperparameter value selected by the grid search
{'n_neighbors': 7}

The selected model is stored in the best_estimator_ attribute. Note, however, that we can also use the grid search object directly for prediction.

In [49]:
knn_search.best_estimator_
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=7, p=2,
          weights='uniform')

Performing a full grid search may become too computationally costly when we work with large datasets and/or methods that have multiple tuning hyperparameters. In that case, we can instead use the RandomizedSearchCV function, which evaluates only a random subset of the hyperparameter combinations.

The syntax is very similar to the above, except that we will often want to specify the n_iter option to control the computational budget for the hyperparameter search. Below, we specify that we would like to try twenty values for the number of neighbours.

    In [50]:
    from sklearn.model_selection import RandomizedSearchCV
model = KNeighborsRegressor()
tuning_parameters = {
    'n_neighbors': np.arange(1, 101),
}

knn_rs = RandomizedSearchCV(model, tuning_parameters, cv=5, n_iter=20, scoring='neg_mean_squared_error',
                            return_train_score=False, n_jobs=4)
knn_rs.fit(X_train, y_train)
knn_rs.best_params_ # the result may differ from the grid search above because only a random subset of values is tried
    {'n_neighbors': 7}

    Now, suppose that we want to compute the cross validation score for a given model. The syntax is as follows.

In [51]:
from sklearn.model_selection import cross_val_score

knn = KNeighborsRegressor(n_neighbors=7)
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
scores
    array([-67849.01093294, -45920.26421283, -45435.79846939, -47949.19387755,

    The cross_val_score function returns the scores for each fold. Below, we average the scores and obtain the cross validation root mean squared error.

In [52]:
np.sqrt(-1*np.mean(scores)) # cross-validation root mean squared error

The scoring in Scikit-Learn follows the convention that higher score values are better than lower values. This is why the scoring argument is the negative mean squared error. The Scikit-Learn model evaluation documentation provides a list of the available scoring options; it is worth saving for future reference.

Often, the syntax is simplified by the fact that each method in Scikit-Learn has a default scoring metric. You have to consult the documentation to know what it is; for a KNN regression, the default scoring is the R-squared.
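
For instance (a quick illustration, not in the original notebook), omitting the scoring argument returns the R-squared for each fold:

# Omitting the scoring argument uses the estimator's default metric (R-squared for regressors)
r2_scores = cross_val_score(knn, X_train, y_train, cv=5)
print(r2_scores.round(3))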

We can plot the cross-validation error as a function of the hyperparameter as follows.

    In [53]:
neighbours = np.arange(1, 51)
cv_rmse = []

for k in neighbours:
    model = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    cv_rmse.append(np.sqrt(-1*np.mean(scores))) # store the cross-validation RMSE for this value of k

fig, ax = plt.subplots()
ax.plot(neighbours, cv_rmse)
ax.set_xlabel('Number of neighbours')
ax.set_ylabel('Cross-validation RMSE')
print(f'Lowest CV error: K = {1 + np.argmin(cv_rmse)}')
    Lowest CV error: K = 7

    Model Selection

We now turn to model selection. We use the cross_val_predict function and report the results for several evaluation metrics. We could also build this type of table with the cross_val_score function, but that is slower when considering multiple metrics, since the models would have to be refitted once per metric.

    In [54]:
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import mean_squared_error, r2_score,  mean_absolute_error
    # Re-specifying the two models in context
    ols = LinearRegression()
    knn = knn_search.best_estimator_
    # Initialise table
    columns=['RMSE', 'R-Squared', 'MAE']
    rows=['Linear Regression', 'KNN']
    results =pd.DataFrame(0.0, columns=columns, index=rows)
    # List of algorithms
    methods = [ols, knn] 
# Compute cross-validation predictions and metrics
for i, method in enumerate(methods):
    y_pred = cross_val_predict(method, X_train, y_train, cv=5, n_jobs=4)
    results.iloc[i, 0] = np.sqrt(mean_squared_error(y_train, y_pred))
    results.iloc[i, 1] = r2_score(y_train, y_pred)
    results.iloc[i, 2] = mean_absolute_error(y_train, y_pred)

results.round(2) # display the results table
    RMSE R-Squared MAE
    Linear Regression 225.61 0.76 171.13
    KNN 216.77 0.78 149.48
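
For comparison, the sketch below (not part of the original notebook) computes a cross-validated RMSE for the KNN model with cross_val_score; each additional metric would require another call like this, which is why cross_val_predict is more convenient here. Note that averaging fold-level MSEs is not identical to pooling the predictions, so the number may differ slightly from the table above.

# Alternative route to a cross-validated RMSE for the KNN model
scores_knn = cross_val_score(knn, X_train, y_train, cv=5, scoring='neg_mean_squared_error', n_jobs=4)
print(np.sqrt(-np.mean(scores_knn)))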

Model Evaluation

We now assess the performance of our selected model on the test data, also displaying the results for the linear regression for comparison. The results confirm that the KNN model was a good choice for predicting new data.

    In [55]:
# Train both models on the full training set
ols.fit(X_train, y_train)
knn.fit(X_train, y_train)
    # Initialise table
    columns=['RMSE', 'R-Squared', 'MAE']
    rows=['Linear Regression', 'KNN']
    results =pd.DataFrame(0.0, columns=columns, index=rows)
    # List algorithms
    methods = [ols, knn] 
# Compute test predictions and metrics
for i, method in enumerate(methods):
    y_pred = method.predict(X_test)
    results.iloc[i, 0] = np.sqrt(mean_squared_error(y_test, y_pred))
    results.iloc[i, 1] = r2_score(y_test, y_pred)
    results.iloc[i, 2] = mean_absolute_error(y_test, y_pred)

results.round(2) # display the results table
    RMSE R-Squared MAE
    Linear Regression 254.82 0.70 197.13
    KNN 242.78 0.73 170.94


Build on the preceding analysis by considering more predictors. The following specification works well for the KNN method on this dataset when there are multiple predictors, with the number of neighbours still to be determined.

    In [56]:
# The Mahalanobis distance accounts for the scale and correlation of the predictors;
# k is a placeholder for the number of neighbours, to be chosen by cross-validation
knn = KNeighborsRegressor(n_neighbors=k, metric='mahalanobis', metric_params={'V': X_train.cov()})
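
As a sketch of this suggested extension (one possible approach, not the author's solution), we could tune the number of neighbours for a Mahalanobis-distance KNN that uses all remaining columns as predictors:

# Sketch: tune K for a multi-predictor KNN with the Mahalanobis distance (hypothetical extension)
X_train_all = train.copy() # 'Balance' was already removed from train by pop() above
X_test_all = test.copy()

knn_all = KNeighborsRegressor(metric='mahalanobis', metric_params={'V': X_train_all.cov()})
search = GridSearchCV(knn_all, {'n_neighbors': np.arange(1, 51)}, cv=5,
                      scoring='neg_mean_squared_error', n_jobs=4)
search.fit(X_train_all, y_train)
print(search.best_params_)
print(np.sqrt(mean_squared_error(y_test, search.predict(X_test_all)))) # test RMSE of the tuned model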