Building your first real machine learning model

This is the companion notebook for the article available at The DataRevenue Blog. You should read the article and follow along with the code samples provided here.

Each time you come to a piece of code, click on it and then press the >| Run button in the toolbar at the top. That code will run and the results will appear right below it.

Reading the dataset

Here we use the pandas library to read a dataset of Titanic passengers and look at its first few rows.

In [ ]:
import pandas as pd
from DRLearn import DRLearn

titanic_dataset = pd.read_csv("titanic.csv", index_col=0)
titanic_dataset.head()

Exploring the dataset

Let's visualise some aspects of our dataset before we start the machine learning analysis.

Plotting survival rate by class

Here we show how a passenger's ticket class affects their chance of survival. People in 1st class are more likely to survive than those in 2nd or 3rd.

In [ ]:
DRLearn.plot_passenger_class(titanic_dataset)
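
The plotting helper above hides its internals, so the following is only a rough sketch of the kind of calculation such a chart could be based on. The column names `Pclass` and `Survived` and the eight rows of data are made up for illustration; they are not taken from `titanic.csv` or from DRLearn.

```python
import pandas as pd

# Tiny made-up table for illustration only; this is not the real Titanic data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],  # hypothetical ticket class column
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],  # hypothetical 0/1 outcome column
})

# The mean of a 0/1 column is the fraction of ones,
# so grouping by class and averaging gives a survival rate per class.
survival_by_class = toy.groupby("Pclass")["Survived"].mean()
print(survival_by_class)
```

Grouping by a different column, such as gender, would give the survival rate for that attribute in the same way.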

Plotting survival rate by gender

Here we show how the passenger's gender affects their survival chance. Women are far more likely to survive than men.

In [ ]:
DRLearn.plot_passenger_gender(titanic_dataset)

Preparing our data for the algorithm

Before we can train a machine learning model, we need to prepare our data. This means reformatting some of it to be more machine-friendly, and deleting parts that are unlikely to be helpful.

In [ ]:
selected_features, target = DRLearn.extract_features(titanic_dataset)
selected_features.sample(5)

Our data is now more difficult to read for humans, but easier for machines.
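
To give a feel for what "machine-friendly" can mean, here is a minimal sketch that turns text and category columns into 0/1 columns using pandas. It is not the code inside `DRLearn.extract_features`; the input columns `Sex` and `Pclass`, the output name `Is_Female`, and the three rows of data are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up passengers for illustration only.
toy = pd.DataFrame({
    "Sex":    ["female", "male", "male"],
    "Pclass": [2, 3, 1],
})

# Encode gender as a single 0/1 column (hypothetical column name).
features = pd.DataFrame({"Is_Female": (toy["Sex"] == "female").astype(int)})

# One-hot encode the ticket class: one 0/1 column per class value.
class_columns = pd.get_dummies(toy["Pclass"], prefix="Class").astype(int)
features = features.join(class_columns)
print(features)
```

Each passenger ends up with exactly one of the `Class_` columns set to 1, which is harder to read at a glance but gives an algorithm plain numbers to work with.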

Splitting our dataset into two parts: training and test

We need part of our data to train the algorithm, and part of it to evaluate how well the algorithm does. Here we split it into a training and test set.

In [ ]:
X_train, X_test, y_train, y_test = DRLearn.split_dataset(selected_features, target, split=0.2)
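
The idea behind a train/test split can be sketched in a few lines of plain Python. This is not the DRLearn implementation: the function `split_rows` and its parameters are hypothetical, and it assumes that a value like 0.2 means "hold out 20% of the rows for testing".

```python
import random

def split_rows(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle row indices, then hold out `test_fraction` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def take(seq, idx):
        return [seq[i] for i in idx]

    # Rows and labels are selected with the same indices so they stay aligned.
    return (take(rows, train_idx), take(rows, test_idx),
            take(labels, train_idx), take(labels, test_idx))

# Ten made-up rows; each label is derived from its row so alignment can be checked.
rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]
train_rows, test_rows, train_labels, test_labels = split_rows(rows, labels)
print(len(train_rows), len(test_rows))
```

Shuffling before splitting matters because datasets are often stored in a meaningful order, and an unshuffled split could put one kind of passenger mostly in the test set.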

Training our model

The part we have been waiting for. In this step, we feed the data to the algorithm and ask it to find patterns automatically.

In [ ]:
model = DRLearn.train_model(X_train, y_train)

Evaluating the model

We need to know how much the model has learned. Here we give it the 'test' part of the dataset (which it has not seen before) and compute the model's accuracy.

In [ ]:
DRLearn.evaluate_model(model, X_test, y_test)
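
Accuracy is simply the share of predictions that match the true outcome. The sketch below shows that calculation on made-up values; the function name `accuracy` and the example lists are illustrative and are not part of DRLearn.

```python
def accuracy(predictions, actual):
    """Return the fraction of predictions that equal the true labels."""
    if len(predictions) != len(actual):
        raise ValueError("predictions and actual must have the same length")
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)

# Made-up example: 1 = survived, 0 = did not survive.
predicted = [1, 0, 0, 1, 0]
actual = [1, 0, 1, 1, 0]
score = accuracy(predicted, actual)  # 4 of the 5 predictions match
print(score)
```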

Analysing our model

Here we find out which aspects of the data the model considers most important. We can see that gender is very important for predicting survival.

In [ ]:
DRLearn.explain_model(model, X_train)

We can also analyse how it makes predictions for specific passengers.

In [ ]:
model_interpretation = DRLearn.interpret_model(model, X_test, y_test)

Change the number in the next cell to see the analysis for different passengers.

In [ ]:
passenger_number = 3
DRLearn.analyze_passenger_prediction(model_interpretation, X_test, passenger_number)

Passenger 3 has a 93% survival chance based on the fact that she is female and not in 3rd class (Class_3=0). The fact that she is not in 1st class lowers her survival chances slightly (the blue section).

Understanding how the quantity of data affects our model

Here we train our model multiple times with different amounts of data. We can see that the more data the model has, the better it does.

In [ ]:
DRLearn.visualise_training_progress(model, X_train, y_train, X_test, y_test)

Conclusion

You've built and trained your first machine learning model. Congratulations! Now you can:

  • Understand what data science teams do day-to-day
  • Better communicate with your data science or machine learning team
  • Know what kind of problems machine learning is best for
  • See that machine learning is not as intimidating a concept as it may seem

The complex part of machine learning is getting into all the nitty-gritty details of building and scaling a customized solution. That's exactly what we specialise in, so if you need help, let us know.