This is the companion notebook for the article available at The DataRevenue Blog. You should read the article and follow along with the code samples provided here.
Each time you come to a piece of code, click on it and then press the >| Run button in the toolbar at the top. That code will run and the results will appear right below it.
Here we use the pandas library to read a dataset of Titanic passengers and look at the first few rows of the dataset.
import pandas as pd
from DRLearn import DRLearn
titanic_dataset = pd.read_csv("titanic.csv", index_col=0)
titanic_dataset.head()
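If you want to see what `index_col=0` does without the real file, here is a minimal sketch that reads a tiny made-up CSV from memory. The column names and values are assumptions for illustration only; the real `titanic.csv` has more rows and columns.

```python
import io

import pandas as pd

# A tiny made-up CSV (hypothetical columns and values, not the real dataset).
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""

# index_col=0 uses the first column (PassengerId) as the row index
# instead of keeping it as an ordinary data column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.head())
print(sample.index.name)
```

Without `index_col=0`, pandas would add its own 0, 1, 2, ... index and keep `PassengerId` as a regular column.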
Let's visualise some aspects of our dataset before we start the machine learning analysis.
Here we show how a passenger's ticket class affects their survival chance. People in 1st class are more likely to survive than those in 2nd or 3rd.
DRLearn.plot_passenger_class(titanic_dataset)
Here we show how the passenger's gender affects their survival chance. Women are far more likely to survive than men.
DRLearn.plot_passenger_gender(titanic_dataset)
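The plots above summarise survival rates per group. We don't know exactly how the `DRLearn` plotting helpers compute them, but the numbers behind such a chart can be sketched with a pandas `groupby`. The rows and column names below are made up for illustration.

```python
import pandas as pd

# Made-up passengers; the column names are assumptions, not the real schema.
passengers = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "male", "female", "male",
            "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of passengers who survived.
survival_by_class = passengers.groupby("Pclass")["Survived"].mean()
survival_by_gender = passengers.groupby("Sex")["Survived"].mean()

print(survival_by_class)
print(survival_by_gender)
```

Each result is a small table with one survival rate per group, which is the kind of data a bar chart would display.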
Before we can train a machine learning model, we need to prepare our data. This means reformatting some of it to be more machine-friendly, and deleting parts that are unlikely to be helpful.
selected_features, target = DRLearn.extract_features(titanic_dataset)
selected_features.sample(5)
Our data is now more difficult to read for humans, but easier for machines.
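To give a feel for what "machine-friendly" means, here is a minimal sketch of two common preparation steps: dropping a column and turning categories into 0/1 numbers. This is an assumption about the kind of work `DRLearn.extract_features` does, not its actual code, and the data is made up.

```python
import pandas as pd

# Made-up raw data; column names are assumptions for illustration.
raw = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "female"],
    "Name": ["A", "B", "C"],
})

# Drop a column that is unlikely to help the model.
features = raw.drop(columns=["Name"])

# Turn the text column into a number: 1 for female, 0 for male.
features["Sex"] = (features["Sex"] == "female").astype(int)

# One-hot encode the ticket class: one 0/1 column per class.
features = pd.get_dummies(features, columns=["Pclass"], prefix="Class", dtype=int)

print(features)
```

After this step a 3rd class passenger has `Class_3=1` and everyone else has `Class_3=0`.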
We need part of our data to train the algorithm, and part of it to evaluate how well the algorithm does. Here we split it into a training and test set.
X_train, X_test, y_train, y_test = DRLearn.split_dataset(selected_features, target, split=0.2)
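A split like this is commonly done with scikit-learn's `train_test_split`. The sketch below assumes that `split=0.2` means "hold back 20% of the rows for testing"; we have not confirmed that this is how `DRLearn.split_dataset` works, and the data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and target with 10 passengers.
features = pd.DataFrame({
    "Sex": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "Class_3": [0, 1, 1, 0, 0, 1, 1, 0, 0, 1],
})
target = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 0], name="Survived")

# test_size=0.2 keeps 20% of the rows aside for evaluation.
# random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

The rows are shuffled before splitting, so the test set is a random sample rather than just the last rows of the table.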
The part we have been waiting for. In this step, we feed the data to the algorithm and ask it to find patterns automatically.
model = DRLearn.train_model(X_train, y_train)
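The notebook does not say which algorithm `DRLearn.train_model` uses. As a stand-in, here is a minimal sketch of the same "fit, then predict" pattern with a scikit-learn decision tree on made-up data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up training data in which survival follows the "Sex" column exactly.
X_train = pd.DataFrame({
    "Sex": [1, 0, 1, 0, 1, 0, 1, 0],
    "Class_3": [0, 1, 1, 0, 0, 1, 1, 0],
})
y_train = pd.Series([1, 0, 1, 0, 1, 0, 1, 0], name="Survived")

# fit() is where the algorithm looks for patterns in the training data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Ask the trained model about two passengers it has not seen.
new_passengers = pd.DataFrame({"Sex": [1, 0], "Class_3": [1, 0]})
predictions = model.predict(new_passengers)

print(predictions)
```

Real data is never this clean, so a real model will make some mistakes. That is why the next step measures accuracy on held-back data.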
We need to know how much the model has learned. Here we give it the 'test' part of the dataset (which it has not seen before) and compute the model's accuracy.
DRLearn.evaluate_model(model, X_test, y_test)
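Accuracy is simply the fraction of predictions that match the true answers. Here is a small sketch with made-up labels, computed by hand and with scikit-learn's `accuracy_score`. Whether `DRLearn.evaluate_model` reports anything beyond accuracy is not something we can tell from the notebook.

```python
from sklearn.metrics import accuracy_score

# Made-up true outcomes and model predictions for 8 passengers.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# By hand: count the matches and divide by the number of passengers.
matches = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = matches / len(y_true)

# The same calculation using scikit-learn.
accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, accuracy)
```

Six of the eight predictions are correct here, so both values come out at 0.75.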
Here we find out which aspects of the data the model relied on most. We can see that gender is very important for predicting survival.
DRLearn.explain_model(model, X_train)
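One common way to rank features is the impurity-based importance that tree models expose in scikit-learn. The sketch below uses a random forest on made-up data; `DRLearn.explain_model` may well use a different method, so treat this only as an illustration of the idea.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data: survival follows "Sex" exactly, the other columns are noise.
X = pd.DataFrame({
    "Sex": [1, 0] * 8,
    "Class_3": [0, 0, 1, 1] * 4,
    "Age": [20, 20, 20, 20, 40, 40, 40, 40] * 2,
})
y = X["Sex"]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# feature_importances_ holds one score per column; the scores sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Because survival in this toy data depends only on `Sex`, that column ends up at the top of the ranking.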
We can also analyse how it makes predictions for specific passengers.
model_interpretation = DRLearn.interpret_model(model, X_test, y_test)
Change the number in the next line to see the analysis for different passengers.
passenger_number = 3
DRLearn.analyze_passenger_prediction(model_interpretation, X_test, passenger_number)
Passenger 3 has a 93% survival chance based on the fact that she is female and not in 3rd class (Class_3=0). The fact that she is not in 1st class lowers her survival chances slightly (the blue section).
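The notebook does not show how the interpretation helper breaks a prediction into per-feature pieces. For a simple linear model the idea can be sketched directly: each feature's contribution is its coefficient times its value. The logistic regression and the data below are assumptions chosen for illustration, not the method used above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up passengers (hypothetical values).
X = pd.DataFrame({
    "Sex":     [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "Class_1": [1, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    "Class_3": [0, 0, 1, 1, 0, 0, 0, 1, 1, 1],
})
y = pd.Series([1, 1, 1, 0, 1, 0, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)

# Pick one passenger (kept as a one-row DataFrame).
passenger = X.iloc[[2]]

# Contribution of each feature to the log-odds of survival.
contributions = pd.Series(
    model.coef_[0] * passenger.iloc[0].to_numpy(), index=X.columns
)

# Intercept plus all contributions gives the log-odds;
# the logistic function turns that into a probability.
log_odds = model.intercept_[0] + contributions.sum()
probability = 1.0 / (1.0 + np.exp(-log_odds))

print(contributions)
print(probability)
```

Positive contributions push the predicted survival chance up and negative ones push it down, which is the same reading as the coloured sections in the plot above.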
Here we train our model multiple times with different amounts of data. We can see that the more data the model has, the better it does.
DRLearn.visualise_training_progress(model, X_train, y_train, X_test, y_test)
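Training "multiple times with different amounts of data" can be sketched as a simple loop: fit on a growing slice of the training set and score each model on the same test set. The synthetic data and the logistic regression below are assumptions for illustration; we don't know how `DRLearn.visualise_training_progress` does this internally.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends (noisily) on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Keep the last 100 rows aside as a fixed test set.
X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

train_sizes = [20, 50, 100, 200, 300]
scores = []
for n in train_sizes:
    # Train a fresh model on the first n training rows only.
    model = LogisticRegression().fit(X_train[:n], y_train[:n])
    # score() returns accuracy on the held-back test set.
    scores.append(model.score(X_test, y_test))

for n, s in zip(train_sizes, scores):
    print(n, round(s, 3))
```

Scores typically rise and then level off as more data is added, though with a small test set they can wobble from one step to the next.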
You've built and trained your first machine learning model. Congratulations! Now you can
The complex part of machine learning is getting into all the nitty-gritty details of building and scaling a customized solution. And that’s exactly what we specialise in, so if you need help let us know.