#!/usr/bin/env python
# coding: utf-8

# # Machine Learning Using Python (MEAFA Workshop)
# 
# Lesson 1: Introduction to Machine Learning
# 
# In this lesson we will use the linear regression and K-nearest neighbours regression methods to illustrate key steps of a machine learning project: splitting a dataset into training and test sets; estimating, tuning, and selecting models; making predictions; and evaluating generalisation performance. We will also illustrate other important practical aspects of modelling, such as exploratory data analysis.
# 
# Credit Card Data
# Training and Test Sets
# Exploratory Data Analysis
# Data Preparation
# Linear Regression
# K-Nearest Neighbours Regression
# Hyperparameter Tuning
# Model Selection
# Model Evaluation
# 
# This notebook relies on the following imports and settings. We will load new functions and libraries in context to make clear what we are using them for.

# In[30]:

# Packages
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore') # this clears the warnings from this page; usually we should leave them on

# In[31]:

# Plot settings
sns.set_context('notebook') # optimise figures for notebook display
sns.set_style('ticks') # set default plot style
colours = ['#1F77B4', '#FF7F0E', '#2CA02C', '#DB2728', '#9467BD', '#8C564B',
           '#E377C2', '#7F7F7F', '#BCBD22', '#17BECF']
crayon = ['#4E79A7', '#F28E2C', '#E15759', '#76B7B2', '#59A14F',
          '#EDC949', '#AF7AA1', '#FF9DA7', '#9C755F', '#BAB0AB']
sns.set_palette(colours) # set custom colour scheme
get_ipython().run_line_magic('matplotlib', 'inline')
plt.rcParams['figure.figsize'] = (9, 6)

# ## Credit Card Data
# 
# In this lesson we will use the Credit dataset taken from the Introduction to Statistical Learning textbook by James, Witten, Hastie and Tibshirani. The dataset records the average credit card balance at the end of the month for customers of a financial services company, as well as other individual characteristics such as age, education, gender, marital status, number of cards, and credit rating.
# 
# **Business objective**: to predict the average monthly credit card balance of customers based on individual characteristics.
# 
# We start by loading the data. We already imported the pandas package above; since the dataset is in CSV format, we use the read_csv function to import the file. Pandas will automatically read the column names and infer the variable types. We assign the result to a variable called *data*, which stores the dataset as an object called a DataFrame.

# In[32]:

# We will always assume that the data file is in a subdirectory called "Datasets"
data = pd.read_csv('Datasets/Credit.csv', index_col='Obs')
data.head(10)

# ## Training and Test Sets
# 
# We use the Scikit-Learn train_test_split function to split the data into training and test sets.
# 
# Below, we specify that the training set will contain 70% of the data. The random_state parameter is an arbitrary number: by fixing it, we get the same training and test sets every time we rerun the analysis, even though the split itself is random.

# In[33]:

from sklearn.model_selection import train_test_split

# Randomly split the indexes
index_train, index_test = train_test_split(np.array(data.index), train_size=0.7, random_state=10)

# Write the training and test sets
train = data.loc[index_train, :].copy()
test = data.loc[index_test, :].copy()

# ## Exploratory Data Analysis
# 
# Exploratory data analysis (EDA) is the process of discovering features and patterns in the data that should inform the modelling process and, in some cases, prevent errors. Here, we conduct a short EDA of the dataset. Remember that we should use only the training set for this purpose.
# 
# We start by computing the descriptive statistics and pairwise correlations for the numerical variables (we will discuss categorical predictors next week). When displaying the results, we chain a command to round them, which facilitates visual inspection.
# 
# The count row in the first table shows the number of available observations for each variable. In this case all variables have the same count, reassuring us that there are no missing values. The response has a large standard deviation (470) relative to its mean (504): we should keep these magnitudes in mind when discussing coefficient sizes and prediction errors. The first quartile of the response is zero, indicating that a substantial fraction of the credit card customers have zero or low balances.

# In[34]:

print(len(train)) # number of observations in the training data frame
train.describe().round(1) # round the table to one decimal place

# The pairwise correlations reveal that the predictors most correlated with the response are limit, rating, and income. The remaining predictors have very low correlation with credit card balances. Another important finding is that limit and rating have a nearly perfect positive correlation and are therefore redundant predictors (using both in a regression analysis would lead to collinearity problems). Also note that income is highly correlated with limit and rating. Limit seems to be the key linear predictor in the data.

# In[35]:

train.corr().round(2)

# We now investigate the (unconditional) distribution of the response by plotting a histogram. We find that the response has a pronounced right skew. A substantial number of customers have zero or low credit card balances (the zero balances are counted after the histogram).

# In[36]:

# statlearning is our custom library: you need to download the statlearning.py file
# and place it in the same directory as this notebook
from statlearning import plot_histogram

fig, ax = plot_histogram(train['Balance'])
ax.set_title('Distribution plot for customer credit card balance')
plt.show()
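
# As a quick sketch, we can confirm this by counting the training customers whose balance is exactly zero.

# In[ ]:

# Count and report the share of zero balances in the training set (sketch)
n_zero = (train['Balance'] == 0).sum()
print(f'{n_zero} of {len(train)} training customers ({n_zero / len(train):.1%}) have a zero balance')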

# A scatter plot of credit limit against balance reveals a nonlinear relationship between these two variables. Furthermore, there are several clients with a response value of zero, which is likely to affect the model fit.

# In[37]:

fig, ax = plt.subplots(figsize=(8, 5.5))
plt.scatter(train['Limit'], train['Balance'], s=25) # the s option is the size of the dots
ax.set_xlabel('Credit card limit')
ax.set_ylabel('Balance')
sns.despine()
plt.show()

# The next cell illustrates how we can create a figure to explore the relationship between the response and a discrete variable that does not have too many distinct values.

# In[38]:

rows = train['Cards'] <= 7 # too few observations for more than 7 cards
sns.boxplot(x=train.loc[rows, 'Cards'], y=train.loc[rows, 'Balance'], palette='Blues')
sns.despine()
plt.show()

# ## Data Preparation
# 
# In most cases, we need to perform additional processing to get the data ready for training machine learning models and computing predictions. It is good practice to write a function for this step, for several reasons:
# 
# * It allows you to easily reproduce the data processing steps.
# * It helps you to build a library of data processing steps, which can save you a lot of time in the future.
# * You can use the function to continuously process new data in a live system.
# * It makes it simple to try different data transformations.

# In[39]:

def prepare_data(df):
    df['Male'] = (df['Gender'] == ' Male').astype(int) # create a dummy variable for gender
    df['Student'] = (df['Student'] == 'Yes').astype(int)
    df['Married'] = (df['Married'] == 'Yes').astype(int)
    df['Caucasian'] = (df['Ethnicity'] == 'Caucasian').astype(int)
    df['Asian'] = (df['Ethnicity'] == 'Asian').astype(int)
    df = df.loc[:, df.dtypes != 'object'] # discard the columns that are not numerical
    return df

train = prepare_data(train)
test = prepare_data(test)

train.head()
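
# As an aside, pandas can build similar indicator columns automatically. The next cell is only a sketch of this alternative: it is not the approach used in the rest of the lesson, and the resulting column names follow pandas' conventions rather than those created by prepare_data.

# In[ ]:

# Sketch: one-hot encode the categorical columns of the raw data with pandas
dummies = pd.get_dummies(data[['Gender', 'Student', 'Married', 'Ethnicity']], drop_first=True)
dummies.head()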

# Finally, we need to identify and separate the response and the predictors. For illustrative purposes, we use only one predictor in this lesson, the customer's credit card limit.

# In[40]:

# Construct the response vector and design matrix (matrix of predictor values)
response = 'Balance'
predictors = ['Limit'] # in general we work with a list of predictors, even though there is only one here

# If we wanted to use all available predictors:
# predictors = [variable for variable in train.columns if variable != response]

y_train = train.pop(response) # removes the response column from the train dataframe and copies it to a new variable
X_train = train[predictors].copy() # selects the variables in the predictor list

y_test = test.pop(response)
X_test = test[predictors].copy()

# ## Linear Regression
# 
# Scikit-Learn allows us to train and use a wide range of machine learning algorithms with a simple syntax structure:
# 
# 1. Import the method.
# 2. Specify the model and options.
# 3. Train the model.
# 4. Use the estimated model to make predictions.
# 
# We start with a basic method, a linear regression. We use the LinearRegression estimator from the sklearn.linear_model module to train the model.

# In[41]:

from sklearn.linear_model import LinearRegression

ols = LinearRegression()
ols.fit(X_train, y_train)

# Suppose that we want to predict the credit card balance of a customer who has a limit of 5000 dollars.

# In[42]:

ols.predict([[5000]]) # the input must be two-dimensional: one row with one predictor value

# Let's visualise the trained model.

# In[43]:

# Generate a grid of values ranging from the minimum to the maximum of the variable in the training data
# (two points are enough to draw a straight line)
x = np.linspace(X_train.min(), X_train.max(), 2).reshape((-1, 1))

# Compute the predicted values for each of these input points
y_pred = ols.predict(x)

# Figure
fig, ax = plt.subplots()
ax.scatter(X_train.values, y_train.values, s=25) # the s option is the size of the dots
ax.set_title('Linear Regression Fit')
ax.set_xlabel('Credit card limit')
ax.set_ylabel('Balance')
ax.plot(x, y_pred, color='black', alpha=0.7)
sns.despine()
plt.show()

# ## K-Nearest Neighbours Regression
# 
# The previous figure suggests that the linear regression model may not be satisfactory for this data, since the relationship between limit and balance appears to be nonlinear. The K-nearest neighbours (KNN) method is a flexible algorithm that can approximate complex relationships between the response and the predictors without assuming a particular form for the regression function.
# 
# Using the KNN method with Scikit-Learn follows the template from above. Unlike in the linear regression model, however, we need to specify a hyperparameter, the number of neighbours.

# In[44]:

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=10) # we need to specify the number of neighbours
knn.fit(X_train, y_train)
knn.predict([[5000]])

# Let's visualise the relationship fitted by the K-nearest neighbours algorithm with K=10 neighbours.

# In[45]:

# Generate a grid of 500 values ranging from the minimum to the maximum value of the variable in the training data
x = np.linspace(train['Limit'].min(), train['Limit'].max(), 500).reshape((-1, 1))

# Compute the predicted values for each of these input points
y_pred = knn.predict(x)

# Plot figure
fig, ax = plt.subplots()
ax.scatter(X_train.values, y_train.values, s=25) # the s option is the size of the dots
ax.set_xlabel('Credit card limit')
ax.set_ylabel('Balance')
ax.set_xlim(0, 15000)
ax.plot(x, y_pred, color='black', alpha=0.7)
sns.despine()
plt.show()

# The K-nearest neighbours fit depends strongly on the choice of the number of neighbours. Choosing K=2 seems to overfit, while K=50 clearly underfits the training data.

# In[46]:

for K in [2, 50]:
    # Specify and fit the model
    knn = KNeighborsRegressor(n_neighbors=K)
    knn.fit(X_train, y_train)

    # Compute the predicted values for each point in the grid
    y_pred = knn.predict(x)

    # Plot figure
    fig, ax = plt.subplots()
    ax.scatter(X_train.values, y_train.values, s=25)
    ax.set_title(f'Number of Neighbours = {K}', fontsize=13)
    ax.set_xlabel('Credit card limit')
    ax.set_ylabel('Balance')
    ax.set_xlim(0, 15000)
    ax.plot(x, y_pred, color='black', alpha=0.7)
    sns.despine()
    plt.show()
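
# As a rough sketch of why we cannot pick K from the training data alone: the training error almost always favours small K, so it tells us little about how the model will generalise. The next section therefore uses cross validation to choose K.

# In[ ]:

from sklearn.metrics import mean_squared_error

# Training RMSE for a few values of K (sketch)
for K in [2, 10, 50]:
    knn_k = KNeighborsRegressor(n_neighbors=K).fit(X_train, y_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, knn_k.predict(X_train)))
    print(f'K = {K:2d}: training RMSE = {train_rmse:.1f}')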

# ## Hyperparameter Tuning
# 
# The previous sections leave us with two important questions:
# 
# 1) How many neighbours should we use in the KNN method?
# 
# 2) Should we use a linear regression or the KNN algorithm as our final model?
# 
# We use [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html) to estimate the generalisation performance of different model choices. For clarity, we refer to the choice of a hyperparameter as tuning, and to the choice of method as model selection, even though both are referred to as model selection in statistics.
# 
# To select the number of neighbours, we need to specify a grid of candidate values and select the specification with the best cross validation performance. The Scikit-Learn [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) method conveniently automates this process. We will follow the template below.

# In[47]:

from sklearn.model_selection import GridSearchCV

model = KNeighborsRegressor()

tuning_parameters = {
    'n_neighbors': np.arange(1, 51),
}

knn_search = GridSearchCV(model, tuning_parameters, cv=5, scoring='neg_mean_squared_error',
                          return_train_score=False, n_jobs=4)
knn_search.fit(X_train, y_train)

# Note that we specified the following options:
# 
# * cv=5 specifies the number of folds. We can alternatively provide a cross validation object.
# * scoring='neg_mean_squared_error' specifies the evaluation criterion (it does not change the outcome here, but it is useful to have this option in the template).
# * return_train_score=False speeds up the computations by avoiding unnecessary calculations (this is set to become the default in future versions).
# * n_jobs=4 splits the task across four processor cores, speeding up the computations.
# 
# We can access the following attribute to view the selected value of the hyperparameter.

# In[48]:

knn_search.best_params_

# The selected model is stored in the best_estimator_ attribute. Note, however, that we can also use the grid search object directly for prediction.

# In[49]:

knn_search.best_estimator_
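
# GridSearchCV also stores the detailed results for every candidate value in the cv_results_ attribute. The cell below is a small sketch of how to inspect them: it converts the results to a DataFrame, recovers the cross validation RMSE from the negative mean squared error, and lists the best few values of K.

# In[ ]:

# Sketch: inspect the full grid search results
cv_results = pd.DataFrame(knn_search.cv_results_)
cv_results['cv_rmse'] = np.sqrt(-cv_results['mean_test_score'])
cv_results[['param_n_neighbors', 'cv_rmse']].sort_values('cv_rmse').head()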

# Performing a full grid search may become too computationally costly when we work with large datasets and/or methods that have multiple tuning hyperparameters. In this case, we can instead use the [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) function, which evaluates only a random subset of the hyperparameter combinations.
# 
# The syntax is very similar to the above, except that we will often want to specify the n_iter option in order to control the computational budget of the hyperparameter search. Below, we specify that we would like to try twenty values for the number of neighbours.

# In[50]:

from sklearn.model_selection import RandomizedSearchCV

model = KNeighborsRegressor()

tuning_parameters = {
    'n_neighbors': np.arange(1, 101),
}

knn_rs = RandomizedSearchCV(model, tuning_parameters, cv=5, n_iter=20, scoring='neg_mean_squared_error',
                            return_train_score=False, n_jobs=4)
knn_rs.fit(X_train, y_train)

knn_rs.best_params_ # the result may differ from before because of the random search

# Now, suppose that we want to compute the cross validation score for a given model. The syntax is as follows.

# In[51]:

knn = KNeighborsRegressor(n_neighbors=7)

from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
scores

# The [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function returns the scores for each fold. Below, we average the scores and obtain the cross validation root mean squared error.

# In[52]:

np.sqrt(-1 * np.mean(scores))

# Scoring in Scikit-Learn follows the convention that higher score values are better than lower values. This is why the argument in the function is the negative mean squared error. The Scikit-Learn [model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html) documentation provides a list of scoring options; you should save it for future reference.
# 
# Often, the syntax is simplified by the fact that each method in Scikit-Learn has a default scoring method, in which case you have to consult the documentation to know what it is. For a [KNN regression, the default scoring is the R-squared](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor).
# 
# We can proceed as follows to plot the cross validation error as a function of the hyperparameter.

# In[53]:

neighbours = np.arange(1, 51)
cv_rmse = []

for k in neighbours:
    model = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    rmse = np.sqrt(-1 * np.mean(scores))
    cv_rmse.append(rmse)

fig, ax = plt.subplots()
ax.plot(neighbours, cv_rmse)
ax.set_xlabel('Number of neighbours')
ax.set_ylabel('Cross Validation RMSE')
sns.despine()
plt.show()

print(f'Lowest CV error: K = {1 + np.argmin(cv_rmse)}')

# ## Model Selection
# 
# We now turn to model selection. We use the [cross_val_predict](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) function and show the results for different evaluation metrics. We could also apply the cross_val_score function to build this type of table, but that is slower when considering multiple metrics.

# In[54]:

from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Re-specifying the two models in context
ols = LinearRegression()
knn = knn_search.best_estimator_

# Initialise the results table
columns = ['RMSE', 'R-Squared', 'MAE']
rows = ['Linear Regression', 'KNN']
results = pd.DataFrame(0.0, columns=columns, index=rows)

# List of algorithms
methods = [ols, knn]

# Compute cross validation predictions and metrics
for i, method in enumerate(methods):
    y_pred = cross_val_predict(method, X_train, y_train, cv=5, n_jobs=4)
    results.iloc[i, 0] = np.sqrt(mean_squared_error(y_train, y_pred))
    results.iloc[i, 1] = r2_score(y_train, y_pred)
    results.iloc[i, 2] = mean_absolute_error(y_train, y_pred)

results.round(2)

# ## Model Evaluation
# 
# We now assess the performance of our selected model on the test data, also displaying the results for the linear regression for comparison. The results confirm that we made a good choice by selecting the KNN as the model to predict new data.

# In[55]:

# Training
ols.fit(X_train, y_train)
knn.fit(X_train, y_train)

# Initialise the results table
columns = ['RMSE', 'R-Squared', 'MAE']
rows = ['Linear Regression', 'KNN']
results = pd.DataFrame(0.0, columns=columns, index=rows)

# List of algorithms
methods = [ols, knn]

# Compute test predictions and metrics
for i, method in enumerate(methods):
    y_pred = method.predict(X_test)
    results.iloc[i, 0] = np.sqrt(mean_squared_error(y_test, y_pred))
    results.iloc[i, 1] = r2_score(y_test, y_pred)
    results.iloc[i, 2] = mean_absolute_error(y_test, y_pred)

results.round(2)

# ## Practice
# 
# Build on the preceding analysis by considering more predictors. The following specification tends to work best for the KNN method on this dataset when there are multiple predictors, with the number of neighbours k still to be determined.

# In[56]:

# Set k to your chosen number of neighbours before running this cell
knn = KNeighborsRegressor(n_neighbors=k, metric='mahalanobis', metric_params={'V': X_train.cov()})
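
# A possible starting point for the exercise is sketched below: it chooses a set of additional predictors (this particular list is an assumption, not part of the lesson) and tunes the number of neighbours for the Mahalanobis-distance KNN with a grid search, following the same template as before.

# In[ ]:

# Sketch: tune K for a multi-predictor KNN (the predictor list and grid are assumptions)
practice_predictors = ['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Student']
X_train_all = train[practice_predictors].copy()
X_test_all = test[practice_predictors].copy()

knn_multi = KNeighborsRegressor(metric='mahalanobis', metric_params={'V': X_train_all.cov()})
knn_multi_search = GridSearchCV(knn_multi, {'n_neighbors': np.arange(1, 51)},
                                cv=5, scoring='neg_mean_squared_error', n_jobs=4)
knn_multi_search.fit(X_train_all, y_train)

print(knn_multi_search.best_params_)
y_pred_practice = knn_multi_search.predict(X_test_all)
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_practice)):.1f}')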