Scientific Computing with Hy: Linear Regressions

A tutorial adventure with Python 3, cytoolz, NumPy, and matplotlib

This tutorial was adapted from the Linear Regression tutorial for the Clojure scientific computing platform Incanter. It has been significantly modified to take advantage of the following, however:

  • Python 3
  • Hy - Clojure-Flavoured Python
  • cytoolz - a Cython implementation of the famous pytoolz suite
  • NumPy - a full-featured numeric library for Python
  • matplotlib - a Python mathematical plotting library, originally inspired by MATLAB

Okay, to be honest, we don't really use cytoolz in the linear regression exercise; we really wanted to, but the opportunity just didn't arise :-) Seriously, though, cytoolz has some pretty sweet stuff, providing fast utility functions you may be missing if you're coming (or returning!) to Python after living in the land of functional programming. In particular, it borrows heavily from the Clojure API (which is, quite bluntly, awesome).
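Though we don't use it below, here's a small taste of the Clojure-flavoured helpers cytoolz provides. The stdlib fallback definitions are just stand-ins so this sketch runs even where cytoolz isn't installed:

```python
# A taste of cytoolz's Clojure-style helpers; the fallback definitions
# are minimal stand-ins so the sketch runs without cytoolz installed.
try:
    from cytoolz import frequencies, partition_all
except ImportError:
    from collections import Counter

    def frequencies(seq):
        # Count occurrences of each element, like Clojure's `frequencies`.
        return dict(Counter(seq))

    def partition_all(n, seq):
        # Chunk a sequence into tuples of at most n items.
        seq = list(seq)
        return (tuple(seq[i:i + n]) for i in range(0, len(seq), n))

print(frequencies("abracadabra"))        # {'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1}
print(list(partition_all(3, range(8))))  # [(0, 1, 2), (3, 4, 5), (6, 7)]
```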

Introduction

Curve Fitting and Linear Regressions

From wikipedia:

Curve fitting is the process of constructing a curve, or mathematical function, that has the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth" function is constructed that approximately fits the data.
A related topic is regression analysis, which focuses more on questions of statistical inference such as how much uncertainty is present in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for data visualization, to infer values of a function where no data are available, and to summarize the relationships among two or more variables. Extrapolation refers to the use of a fitted curve beyond the range of the observed data, and is subject to a degree of uncertainty since it may reflect the method used to construct the curve as much as it reflects the observed data.

Scientific Computing in Python

The suite of tools available for scientific computing in Python is pretty stunning. Here's an abridged list:

  • NumPy
  • SciPy
  • matplotlib
  • SymPy
  • Pandas
  • IPython
  • scikit-learn
  • AstroPy

There are so many more, so much to enjoy.

About Hy

Hy is a Lisp dialect that compiles to Python's abstract syntax tree. It is to Python what LFE is to Erlang. This provides developers from many backgrounds with the following:

  • A lisp that feels very Pythonic
  • A great way to use Lisp’s crazy powers but in the wide world of Python’s libraries
  • A great way to start exploring Lisp, from the comfort of Python
  • A pleasant language that has a lot of neat ideas :-)

To support different languages in IPython notebooks, one needs either to create an IPython kernel for the desired language or to create a cell magic command. To the best of our knowledge, there is as yet no Hy IPython kernel; however, yardsale8 has created HyMagic for hylang, and this entire notebook depends upon it. Thanks, @yardsale8!

Preparation

Getting the Code

If you are reading this with the IPython nbviewer and you want to run the examples on your machine, just do the following:

$ git clone https://github.com/oubiwann/linear-regression-tutorial.git
$ cd linear-regression-tutorial
$ make

That will:

  • create a virtualenv
  • download all the dependencies, and then
  • start up this notebook using a local IPython HTTP server

Using Hy

Note that since this tutorial is written in Hy and there is no Hy IPython kernel yet, we use the %%hylang cell magic for each cell, which will look like this:

In [1]:
%%hylang

(* 2 (reduce + (range 1 7) 0))
Out[1]:
42
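For comparison, the same expression in plain Python (in Python 3, reduce lives in functools):

```python
from functools import reduce
from operator import add

# (* 2 (reduce + (range 1 7) 0)) in plain Python:
result = 2 * reduce(add, range(1, 7), 0)
print(result)  # 42
```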

Let's get started with the notebook prep, now.

IPython Setup

This notebook is started with a custom IPython profile (defined in a make include file), and that profile has already loaded the Hy cell magic extension, so we don't need to load it here.

We do need to set up matplotlib for IPython, though:

In [2]:
import matplotlib
matplotlib.use('nbagg')
%matplotlib inline

Imports and Color Palette Config

Next, switching to Hy, we can do all our imports:

In [3]:
%%hylang

(import [functools [reduce]]
        [pprint [pprint]]
        [cytoolz [itertoolz]]
        [numpy :as np]
        [matplotlib :as mpl]
        [matplotlib.pyplot :as plt]
        [seaborn :as sns])

Now we can configure the color map we'll use in our plots:

In [4]:
%%hylang

(def palette_name "husl")
(def colors (sns.color_palette palette_name 8))
(colors.reverse)
(def cmap (mpl.colors.LinearSegmentedColormap.from_list palette_name colors))
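In plain Python, from_list works the same way: give it a name and a list of colors, and it interpolates a colormap between them. Here two hand-picked RGB endpoints stand in for seaborn's "husl" palette (an assumption purely for illustration):

```python
from matplotlib.colors import LinearSegmentedColormap

# Two hand-picked RGB endpoints standing in for seaborn's "husl" palette
# (illustration only); from_list interpolates smoothly between them.
colors = [(0.97, 0.46, 0.43), (0.22, 0.64, 0.95)]
cmap = LinearSegmentedColormap.from_list("demo", colors)

# Sampling the map at a position in [0, 1] returns an RGBA tuple:
print(cmap(0.0))  # first colour
print(cmap(1.0))  # last colour
```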

Loading the Observed Data

In the case of this tutorial, the "observed data" is the same data set as that used in the Incanter linear regression tutorial: the NIST Filip.dat file. In the ./notebooks directory we've saved a local copy of Filip.dat as well as a CSV conversion of the data (filip.csv).

Let's read it in:

In [5]:
%%hylang

(def data (apply np.genfromtxt ["filip.csv"] {"dtype" float "delimiter" "," "names" True}))
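The Hy (apply ...) form above corresponds to an ordinary keyword-argument call in Python. Here's a sketch of the same call using an inline two-row sample (values taken from the outputs shown later) via StringIO, so it runs without the file on disk:

```python
import io
import numpy as np

# Equivalent Python call to the Hy (apply np.genfromtxt ...) form,
# demonstrated on a tiny inline sample rather than the real filip.csv.
sample = io.StringIO("y,x\n0.8116,-6.86012091\n0.9072,-4.32413005\n")
data = np.genfromtxt(sample, dtype=float, delimiter=",", names=True)

# names=True yields a structured array with columns accessible by name:
print(data["x"])
print(data["y"])
```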

Let's set some of that data to variables for use later on:

In [6]:
%%hylang

(def x (get data "x"))
(def y (get data "y"))
(def [x-min x-max] [(x.min) (x.max)])
Out[6]:
[-8.7814644949999998, -3.1320024900000001]

Here's what it looks like:

In [7]:
%%hylang

x
Out[7]:
array([-6.86012091, -4.32413005, -4.35862506, -4.35842675, -6.95585238,
       -6.66114525, -6.35546294, -6.11810203, -7.11514802, -6.81530857,
       -6.51999306, -6.20411998, -5.85387196, -6.10952309, -5.79832982,
       -5.48267212, -5.17179139, -4.8517059 , -4.51712642, -4.14357323,
       -3.70907544, -3.49948909, -6.3007695 , -5.95350484, -5.64206515,
       -5.03137698, -4.6806857 , -4.32984695, -3.9284862 , -8.56735134,
       -8.36321131, -8.10768274, -7.82390874, -7.52287874, -7.21881928,
       -6.92081875, -6.62893214, -6.32394687, -5.99139983, -8.78146449,
       -8.66314018, -8.47353149, -8.24733706, -7.97142875, -7.67612939,
       -7.3528127 , -7.07206532, -6.77417401, -6.47886192, -6.15951751,
       -6.83564714, -6.53165267, -6.22409842, -5.91009489, -5.59859946,
       -5.29064522, -4.97428462, -4.64454848, -4.29056043, -3.88505558,
       -3.40837896, -3.13200249, -8.72676717, -8.66695597, -8.51102647,
       -8.16538858, -7.88605665, -7.58804376, -7.28341242, -6.99567863,
       -6.69186262, -6.39254498, -6.06737406, -6.68402965, -6.37871983,
       -6.06585519, -5.75227217, -5.13241467, -4.8113527 , -4.09826931,
       -3.66174277, -3.2644011 ])
In [8]:
%%hylang

y
Out[8]:
array([ 0.8116,  0.9072,  0.9052,  0.9039,  0.8053,  0.8377,  0.8667,
        0.8809,  0.7975,  0.8162,  0.8515,  0.8766,  0.8885,  0.8859,
        0.8959,  0.8913,  0.8959,  0.8971,  0.9021,  0.909 ,  0.9139,
        0.9199,  0.8692,  0.8872,  0.89  ,  0.891 ,  0.8977,  0.9035,
        0.9078,  0.7675,  0.7705,  0.7713,  0.7736,  0.7775,  0.7841,
        0.7971,  0.8329,  0.8641,  0.8804,  0.7668,  0.7633,  0.7678,
        0.7697,  0.77  ,  0.7749,  0.7796,  0.7897,  0.8131,  0.8498,
        0.8741,  0.8061,  0.846 ,  0.8751,  0.8856,  0.8919,  0.8934,
        0.894 ,  0.8957,  0.9047,  0.9129,  0.9209,  0.9219,  0.7739,
        0.7681,  0.7665,  0.7703,  0.7702,  0.7761,  0.7809,  0.7961,
        0.8253,  0.8602,  0.8809,  0.8301,  0.8664,  0.8834,  0.8898,
        0.8964,  0.8963,  0.9074,  0.9119,  0.9228])

Our preparations are complete -- let's get started!

Viewing the Data

Let's take a look at our data with a quick plot:

In [9]:
%%hylang

(apply plt.scatter [x y])
(plt.show)

That's not the most beautiful graph in the world, so perhaps we can make it a little better. Let's do the following:

  • increase the size of the plot
  • also make the plot points bigger
  • use the color map we defined earlier in the notebook
  • and give the points some transparency

Let's start with sizing the points. We'll create an array that's the same size and shape as one of the dimensions from our data set, and then we'll calculate an arbitrary display area for each point:

In [10]:
%%hylang

(def ones (np.ones_like x))
(def size (* np.pi (** (* 20 ones) 2)))
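The two Hy forms above look like this in plain Python (with a few made-up x values standing in for the real data):

```python
import numpy as np

x = np.array([-6.86, -4.32, -4.36])  # a few sample x values

# An array of ones matching x's shape, then an arbitrary area per point:
ones = np.ones_like(x)
size = np.pi * (20 * ones) ** 2

print(size)  # every point gets area pi * 20^2, about 1256.6
```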

Let's do some more sizes: one for the figure, and another for the label fonts:

In [11]:
%%hylang

(def figsize {"figsize" [18 18]})
(def fontsize {"fontsize" 24})

We'll use the $x$-axis values as the basis for color value differences. For the palette, we'll use our previously-defined cmap variable:

In [12]:
%%hylang

(def plot-options {"s" size "c" x "alpha" 0.5 "cmap" cmap})

Let's set up our labels, create our plot, and then display it:

In [13]:
%%hylang

(apply plt.figure [] figsize)
(apply plt.title ["NIST Data Set for Linear Regression Demo"] fontsize)
(apply plt.xlabel ["$x$ values"] fontsize)
(apply plt.ylabel ["$y$ values"] fontsize)
(apply plt.scatter [x y] plot-options)
(plt.show)

Our graph is looking much better. Now let's get mathematical!

Curve-Fitting

The NIST data set provided a polynomial describing this data:

y = B0 + B1*x + B2*(x**2) + ... + B9*(x**9) + B10*(x**10) + e

Or, if you prefer LaTeX:

\begin{equation} y = B_0 + B_1{x} + B_2{x^2} + ... + B_9{x^9} + B_{10}{x^{10}} + e \end{equation}

Using NumPy, we can easily fit a 10th-degree polynomial curve to this data. We will use numpy.polyfit to find a least-squares polynomial fit, passing it the $x$ and $y$ values for the data to fit as well as the degree of our polynomial:

In [14]:
%%hylang

(def coeffs (np.polyfit x y 10))
In [15]:
coeffs
Out[15]:
array([ -4.02962521e-05,  -2.46781075e-03,  -6.70191147e-02,
        -1.06221497e+00,  -1.08753179e+01,  -7.51242009e+01,
        -3.54478230e+02,  -1.12797393e+03,  -2.31637105e+03,
        -2.77217956e+03,  -1.46748960e+03])
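To see polyfit at work on data where we know the answer, here's a quick sanity check in plain Python, fitting a quadratic to noiseless samples of y = 2x² + 3x + 1:

```python
import numpy as np

# Fit a degree-2 polynomial to exact samples of 2x^2 + 3x + 1;
# polyfit returns coefficients from highest degree to lowest.
xs = np.linspace(-5, 5, 30)
ys = 2 * xs**2 + 3 * xs + 1
coeffs = np.polyfit(xs, ys, 2)

print(coeffs)  # approximately [2. 3. 1.]
```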

np.polyfit can return more data, if you are so inclined:

In [16]:
%%hylang

(def [coeffs residuals rank singular-values rcond]
     (apply np.polyfit [x y 10] {"full" true}))
Out[16]:
[array([ -4.02962521e-05,  -2.46781075e-03,  -6.70191147e-02,
         -1.06221497e+00,  -1.08753179e+01,  -7.51242009e+01,
         -3.54478230e+02,  -1.12797393e+03,  -2.31637105e+03,
         -2.77217956e+03,  -1.46748960e+03]),
 array([ 0.00079585]),
 11,
 array([  3.12889471e+00,   1.06454867e+00,   2.71803240e-01,
          5.29682154e-02,   8.38710833e-03,   1.01575660e-03,
          9.58303055e-05,   7.60511579e-06,   4.64910447e-07,
          1.98714214e-08,   6.00922228e-10]),
 1.8207657603852567e-14]

NumPy provides a convenience class, numpy.poly1d, which, once instantiated with our fit coefficients, we can use to evaluate the polynomial at any given point. Let's try it out:

In [17]:
%%hylang

(def model (np.poly1d coeffs))

Let's call this function against several values as a sanity check:

In [18]:
%%hylang

[(model -9) (model -7) (model -6) (model -5) (model -4) (model -3)]
Out[18]:
[0.77668860985022548,
 0.79905917870519261,
 0.88604832190185334,
 0.89263439047817883,
 0.90943486799233142,
 0.88930227733135325]

Looking back at our graph, we can see that these check out fine.
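The idea behind poly1d in miniature, with a small made-up quadratic: instantiate it with coefficients (highest degree first), then call the result like a function:

```python
import numpy as np

# poly1d wraps coefficients (highest degree first) as a callable polynomial.
p = np.poly1d([2, 3, 1])  # represents 2x^2 + 3x + 1

print(p(0))   # 1
print(p(1))   # 6
print(p(-1))  # 0
```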

Polynomial Linear Regression

Next, let's see if our linear model matches up with what NIST provided. We're going to need to calculate the coefficient of determination, or $R^2$, a value that indicates how well a statistical model fits with measured data. We'll start by feeding our $x$ values into our model:

In [19]:
%%hylang

(def y-predicted (model x))

We will also need several other values in order to calculate $R^2$, per the equation given on the Wikipedia page linked above:

  • The mean value of our observed (original) data: $\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i$
  • The total sum of squares : $SS_\text{tot}=\sum_i (y_i-\bar{y})^2$
  • The regression sum of squares : $SS_\text{reg}=\sum_i (f_i -\bar{y})^2$
  • The sum of squares of residuals : $SS_\text{res}=\sum_i (y_i - f_i)^2$

We already have the following:

  • $y_i$: the $y$ values from the observed (NIST) data
  • $f_i$: the values generated by our model

With these, we will be able to calculate $R^2$:

\begin{equation} R^2 \equiv 1 - {SS_{\rm res}\over SS_{\rm tot}} \end{equation}
In [20]:
%%hylang

(def y-mean (/ (np.sum y) y.size))
(def sstot (np.sum (** (- y y-mean) 2)))
(def ssreg (np.sum (** (- y-predicted y-mean) 2)))
(def ssres (np.sum (** (- y y-predicted) 2)))
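The same $R^2$ bookkeeping in plain Python, on a small made-up dataset with hand-picked predictions:

```python
import numpy as np

# Observed values and model predictions (made-up for illustration):
y = np.array([1.0, 2.0, 3.0, 4.0])
y_predicted = np.array([1.1, 1.9, 3.2, 3.8])

y_mean = np.sum(y) / y.size                    # mean of observed data
ss_tot = np.sum((y - y_mean) ** 2)             # total sum of squares
ss_res = np.sum((y - y_predicted) ** 2)        # sum of squared residuals
r_squared = 1 - ss_res / ss_tot                # R^2 = 1 - SS_res / SS_tot

print(r_squared)  # approximately 0.98
```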

Now we're ready to get the $R^2$ value for our model:

In [21]:
%%hylang

(def rsquared (- 1 (/ ssres sstot)))
In [22]:
rsquared
Out[22]:
0.99672741617239946

If we compare this to the value from NIST, 0.996727416185620, we see that our model did pretty well:

In [23]:
%%hylang

(- 0.99672741617239946  0.996727416185620)
Out[23]:
-1.3220535777236364e-11

That's a tiny number ...

Mathematical!

A Linear Model Class

This is great for an IPython notebook, but kind of awkward for reuse. So let's create a linear model class that gives us everything we need in one shot:

In [24]:
%%hylang

(defclass PolynomialLinearModel []
  "A pretty sweet utility Python-Lisp class for creating
  a polynomial curve fitting model")

(defun PolynomialLinearModel.--init-- [self x y degree]
  (setv self.x x)
  (setv self.y y)
  (setv self.degree degree)
  (setv self.results None)
  (setv self.model None)
  (setv [self.coeffs self.residuals self.rank self.singular-values self.rcond]
        [None None None None None])
  (self.polyfit)
  None)

(defun PolynomialLinearModel.get-y-mean [self]
  "Get the mean value of the observed data"
  (/ (np.sum self.y) self.y.size))

(defun PolynomialLinearModel.get-ss-tot [self]
  "Get total sum of the squares"
  (np.sum (** (- self.y (self.get-y-mean)) 2)))

(defun PolynomialLinearModel.get-ss-reg [self]
  "Get the regression sum of squares"
  (np.sum (** (- self.y-predicted (self.get-y-mean)) 2)))

(defun PolynomialLinearModel.get-ss-res [self]
  "Get the sum of squares of residuals"
  (np.sum (** (- self.y self.y-predicted) 2)))

(defun PolynomialLinearModel.get-r-squared [self]
  "Get the R^2 value for the polynomial fit"
  (- 1 (/ (self.get-ss-res) (self.get-ss-tot))))

(defun PolynomialLinearModel.polyfit [self]
  "Do all the business"
  (setv [self.coeffs self.residuals self.rank self.singular-values self.rcond]
        (apply np.polyfit [self.x self.y self.degree] {"full" true}))
  (setv self.model (np.poly1d self.coeffs))
  (setv self.y-predicted (self.model self.x))
  (setv self.r-squared (self.get-r-squared))
  (setv self.results {
        "coeffs" (self.coeffs.tolist)
        "residuals" (self.residuals.tolist)
        "rank" self.rank
        "singular-values" (self.singular-values.tolist)
        "rcond" self.rcond
        "r-squared" self.r-squared}))

(defun PolynomialLinearModel.--str-- [self]
  "Provide a string representation of the data"
  (str self.results))

(defun PolynomialLinearModel.--repr-- [self]
  "Provide a representation of the data"
  (self.--str--))

(defun PolynomialLinearModel.predict [self xs]
  "Given a set of input values, produce outputs using the model"
  (self.model xs))
Out[24]:
<function __main__._hy_anon_fn_10>

This approach to adding methods to a class is different from the standard Hy approach (which mirrors the standard Python approach). I chose this form as it more closely matches that used by Common Lisp and, to a certain extent, Clojure. In fact, this led me to ponder the creation of a defmethod macro for Hy (as I'm sure has happened to many who use Hy). Instead of cluttering up this notebook, though, I've moved that over to a sibling notebook. But we digress ...
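For readers more comfortable in Python, here's a condensed sketch of the same class: the same structure and flow, though not a line-for-line translation, exercised on a made-up quadratic rather than the NIST data:

```python
import numpy as np

class PolynomialLinearModel:
    """Polynomial curve-fitting model: fits on init, then predicts."""

    def __init__(self, x, y, degree):
        self.x, self.y, self.degree = x, y, degree
        self.polyfit()

    def get_r_squared(self):
        # R^2 = 1 - SS_res / SS_tot
        y_mean = np.sum(self.y) / self.y.size
        ss_tot = np.sum((self.y - y_mean) ** 2)
        ss_res = np.sum((self.y - self.y_predicted) ** 2)
        return 1 - ss_res / ss_tot

    def polyfit(self):
        (self.coeffs, self.residuals, self.rank,
         self.singular_values, self.rcond) = np.polyfit(
            self.x, self.y, self.degree, full=True)
        self.model = np.poly1d(self.coeffs)
        self.y_predicted = self.model(self.x)
        self.r_squared = self.get_r_squared()

    def predict(self, xs):
        return self.model(xs)

# Quick check against a known quadratic, 2x^2 + 3x + 1:
xs = np.linspace(-5, 5, 30)
ys = 2 * xs**2 + 3 * xs + 1
model = PolynomialLinearModel(xs, ys, 2)
print(model.r_squared)  # essentially 1.0 for a noiseless fit
```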

Let's take it for a spin!

In [25]:
%%hylang

(def model (PolynomialLinearModel x y 10))

When we created our first model, we ran it against several values to see if the outputs fit our measured data. One of those was the value -9, which returned 0.77668860985022548. Let's try that again with our new object:

In [26]:
%%hylang

(model.predict -9)
Out[26]:
0.77668860985022548

Looking good!

Let's see what our results data look like:

In [27]:
%%hylang

(pprint model.results)
{'coeffs': [-4.029625205186532e-05,
            -0.0024678107546401437,
            -0.06701911469215643,
            -1.0622149736864719,
            -10.875317910262362,
            -75.12420087227511,
            -354.47822960532113,
            -1127.973927925715,
            -2316.371054759451,
            -2772.1795597594755,
            -1467.4895971641686],
 'r-squared': 0.99672741617239946,
 'rank': 11,
 'rcond': 1.8207657603852567e-14,
 'residuals': [0.0007958513839371895],
 'singular-values': [3.128894711145785,
                     1.064548669029962,
                     0.27180324022363517,
                     0.05296821542551952,
                     0.008387108325776571,
                     0.0010157565988992792,
                     9.583030547029836e-05,
                     7.605115790256685e-06,
                     4.6491044714423815e-07,
                     1.9871421381342612e-08,
                     6.009222284310632e-10]}

Plotting the Model with the Observed Data

We're going to need some data to feed to our model's predict method so that it can create the smooth polynomial curve that we will overlay on our scatter plot. Let's create a linear space between our minimum and maximum $x$ values (200 points should give us a nice, smooth curve). Then let's use predict to generate the $y$ values:

In [28]:
%%hylang

(def fitted-xs (np.linspace x-min x-max 200))
(def fitted-ys (model.predict fitted-xs))
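The same pattern in plain Python, with a hypothetical quadratic standing in for our degree-10 model:

```python
import numpy as np

# A stand-in model (the notebook's real model is the degree-10 NIST fit):
model = np.poly1d([2, 3, 1])

# 200 evenly spaced x values across the data range give a smooth curve:
fitted_xs = np.linspace(-5, 5, 200)
fitted_ys = model(fitted_xs)

print(fitted_xs.shape, fitted_ys.shape)  # (200,) (200,)
```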

Now we're ready to put them together:

In [29]:
%%hylang

(def [figure axes] (apply plt.subplots [] figsize))
(plt.hold True)
(apply axes.set_title ["NIST Data Set and Polynomial Fit"] fontsize)
(apply axes.set_xlabel ["$x$ values"] fontsize)
(apply axes.set_ylabel ["$y$ values"] fontsize)
(apply axes.scatter [x y] plot-options)
(axes.plot fitted-xs fitted-ys)
(plt.show)