# Jupyter Notebook Showcase

by Ian McLoughlin ([email protected])

This notebook demonstrates the analysis of a data set using the Python programming language.

The notebook is hosted on GitHub.

You can take this interactive notebook and play around with it: https://goo.gl/SbYMqr.

Note: unfortunately, Google haven't updated their packages yet, so you'll have to uncomment and run the following code if you're running the notebook on Colaboratory.

In :
#!pip install --upgrade seaborn


## A bit of context

#### My objectives in this talk are to:

1. Demonstrate that programming is accessible to non-computing students.
2. Discuss the pros and cons of Graphical User Interfaces in data analytics.
3. Widen understanding of the formal sciences.

#### A few talking points

• Notebooks are useful for students of all disciplines (formal sciences, natural sciences, social sciences, engineering, etc.)
• Notebooks are documents that blend text, mathematical notation and runnable code, and can be run from a browser.
• We'll soon have our first wave of incoming students having taken Computer Science at second level.
• That might foster a discussion about what programming is - maybe it's just a skill.

## About the data set

We'll look at the well-known Iris data set.

It was collected by the botanist Edgar Anderson and made famous by Ronald Fisher.

Fisher is famous for, amongst other things, The Design of Experiments, which includes the Lady Tasting Tea problem, and for ANOVA.

## Load a data set

Using Python, we can easily load a comma-separated values (CSV) file to analyse it.

In :
# pandas is a Python package for investigating data sets.
import pandas as pd

# We can load a CSV file directly from a URL.
df = pd.read_csv("https://github.com/ianmcloughlin/datasets/raw/master/iris.csv")

# Have a look at the first five rows of the data set.
df.head(5)

Out:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
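Once loaded, pandas can also summarise the numeric columns in one line, which makes for a handy first look at any new data set.

In :
# Summary statistics (count, mean, std, quartiles) for each numeric column.
df.describe()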

## Plot the data

The data set contains five variables, and so is difficult to visualise. Luckily, the Python package seaborn provides a lovely plot suited to it.

In :
# Set up our Jupyter notebook to display plots nicely.
%matplotlib inline

In :
# Set the default plot size.
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 8)

In :
# seaborn is named after Rob Lowe's character in The West Wing.
import seaborn as sns

# A pair plot will create a matrix of scatter plots.
pp = sns.pairplot(df, hue="class", palette="husl")

## Formulate a problem

Fisher was interested in knowing whether the class could be predicted from the other variables.

$$f(sl, sw, pl, pw) = class$$

Can we figure out a good $f$ from the data set?
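To make the idea concrete, a minimal hand-written $f$ might look like the sketch below; the 2.5 cm petal-length threshold is just an eyeballed guess from the pair plot, not fitted to the data.

In :
# A hand-written guess at f: setosa petals look much shorter than the rest.
def f(sl, sw, pl, pw):
    # The 2.5 threshold is an illustrative assumption only.
    return 'setosa' if pl < 2.5 else 'versicolor or virginica'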

## Try a simpler problem

The setosa class looks quite a bit different to the other two.

In :
# Let's single out two of the numeric variables.
sns.scatterplot(x="petal_length", y="sepal_width", hue="class", data=df)

Out:
<matplotlib.axes._subplots.AxesSubplot at 0x2899d16a198>

In :
# Adapted from https://stackoverflow.com/questions/22491359/least-squares-linear-classifier-in-matlab
# Note the code here is a little more involved, but it can likely be simplified.

# Numerical package for Python
import numpy as np

# A will have three columns: petal lengths, sepal widths and a column of 1's.
A = df[['petal_length', 'sepal_width']].values
A = np.hstack([A, np.ones([A.shape[0], 1])])

# b is a column matrix that contains a 1 everywhere there's a setosa and -1 elsewhere.
b = (df['class'] == 'setosa').map({True: 1, False: -1})

# Find the best x in Ax=b; lstsq returns a tuple whose first element is the solution.
x = np.linalg.lstsq(A, b, rcond=None)[0]

# Now we can plot the line on top of the previous plot.
sns.scatterplot(x="petal_length", y="sepal_width", hue="class", data=df)

u = np.array([min(df['petal_length']), max(df['petal_length'])])
# The decision boundary is x[0]*u + x[1]*v + x[2] = 0, solved here for v.
v = -x[2]/x[1] - (x[0]/x[1])*u
plt.plot(u, v, 'k-')


Out:
[<matplotlib.lines.Line2D at 0x2899f18a6a0>]

The idea is now that when you come across a new iris, you can decide whether it's a setosa or not by plotting its petal length and sepal width to see which side of the line it is on.
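As a quick sanity check on the fitted line, we can classify every sample by which side of the boundary it falls on and compare with b; this sketch reuses the A, b and x from above.

In :
# The sign of Ax puts each sample on one side of the line or the other.
predictions = np.sign(A @ x)

# Proportion of samples falling on the correct side.
print((predictions == b).mean())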

## Train a neural network

Using Python, we can get up and running with quite sophisticated concepts very quickly. For instance, we can create a small neural network.

In :
# For building neural networks.
import keras as kr

# For encoding categorical variables.
import sklearn.preprocessing as pre

# For splitting into training and test sets.
import sklearn.model_selection as mod

inputs = df[['petal_length', 'petal_width', 'sepal_length', 'sepal_width']]

encoder = pre.LabelBinarizer()
encoder.fit(df['class'])
outputs = encoder.transform(df['class'])

# Start a neural network, building it by layers.
model = kr.models.Sequential()

# Add a hidden layer with 64 neurons and an input layer with 4.
model.add(kr.layers.Dense(units=64, activation='relu', input_dim=4))
# Add a three neuron output layer.
model.add(kr.layers.Dense(units=3, activation='softmax'))

# Build the graph.
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Train the neural network.
model.fit(inputs, outputs, epochs=10, batch_size=5)

Using TensorFlow backend.

Epoch 1/10
150/150 [==============================] - 1s 5ms/step - loss: 0.8309
Epoch 2/10
150/150 [==============================] - 0s 327us/step - loss: 0.6564
Epoch 3/10
150/150 [==============================] - 0s 347us/step - loss: 0.5740
Epoch 4/10
150/150 [==============================] - 0s 480us/step - loss: 0.5263
Epoch 5/10
150/150 [==============================] - 0s 500us/step - loss: 0.4801
Epoch 6/10
150/150 [==============================] - 0s 493us/step - loss: 0.4451
Epoch 7/10
150/150 [==============================] - 0s 347us/step - loss: 0.4241
Epoch 8/10
150/150 [==============================] - 0s 320us/step - loss: 0.4005
Epoch 9/10
150/150 [==============================] - 0s 440us/step - loss: 0.3717
Epoch 10/10
150/150 [==============================] - 0s 340us/step - loss: 0.3715

Out:
<keras.callbacks.History at 0x289a522fa90>
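Note that the sklearn.model_selection import above was never actually used. Below is a sketch of how a held-out test set might be added (the 0.33 split size is an arbitrary choice); we would then fit on the training portion only and evaluate on the rest.

In :
# Hold back a third of the samples so the network can be judged on unseen data.
train_in, test_in, train_out, test_out = mod.train_test_split(inputs, outputs, test_size=0.33)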
In :
# As an example, take the average values for a versicolor.
df[df['class'] == 'versicolor'].mean()

Out:
sepal_length    5.936
sepal_width     2.770
petal_length    4.260
petal_width     1.326
dtype: float64
In :
# Ask the neural network to predict what class of iris it is.
mean_versicolor = df[df['class'] == 'versicolor'].mean().values.reshape(1,4)
prediction = model.predict([mean_versicolor])
encoder.inverse_transform(prediction)

Out:
'virginica'

## A learning opportunity

You can see that the neural network predicts the average versicolor to be a virginica.

When something like that happens, it's a good troubleshooting/learning opportunity for students.

You can let the students come up with ideas as to why the neural network thinks those values represent a virginica:

• Maybe the neural network is not set up correctly.
• Maybe taking the average of each variable does not represent the average versicolor.
• Maybe the inputs are not being fed to the network in the order it was trained on - see the sketch below.
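One way to test that last idea: the network was trained on the columns in the order petal_length, petal_width, sepal_length, sepal_width, but mean() returns values in the data frame's own column order. Re-ordering before predicting shows whether that was the culprit.

In :
# Select the columns in the same order the network was trained on.
cols = ['petal_length', 'petal_width', 'sepal_length', 'sepal_width']
mean_versicolor = df[df['class'] == 'versicolor'][cols].mean().values.reshape(1, 4)

# Ask the network again with the re-ordered inputs.
encoder.inverse_transform(model.predict(mean_versicolor))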

## Concluding remarks

• Notebooks are useful for discussing and illuminating concepts.
• Python is an easy programming language to learn.
• We've touched on numerous different disciplines today.
• The processes, algorithms and techniques that enabled the analysis come largely from the formal sciences. 