#!/usr/bin/env python
# coding: utf-8

# # MultiNorm
#
# This is a tutorial notebook for the ``MultiNorm`` class from the ``multinorm`` Python package ([Github](https://github.com/cdeil/multinorm)).
#
# - A quick tutorial introduction (~ 10 min read)
# - Explains and illustrates some of the theory / statistical aspects
# - A hands-on introduction to working with the ``MultiNorm`` class
#
# Suggestions:
#
# - Read this tutorial and judge whether `multinorm` is useful for you.
# - If yes, install it with `pip install multinorm`, download this notebook, and execute / play with these examples on your machine.
# - Learn more at https://multinorm.readthedocs.io (complete description of features and theory).
#
# We will use [pandas](http://pandas.pydata.org), so after reading this you will be a data scientist!
# In the future we will probably change the example model to a deep neural network and use [tensorflow](https://www.tensorflow.org/), because likelihood fitting is so 1990s.

# ## Introduction
#
# A [multivariate normal distribution](https://en.wikipedia.org/wiki/Multivariate_normal_distribution) (MVN) is the generalisation of the one-dimensional normal distribution to several variables. It is completely specified by a mean vector and a covariance matrix, which is exactly the information a likelihood fit gives you about the parameters of a model.
#
# The `MultiNorm` class wraps such a mean vector and covariance matrix, together with parameter names, and gives convenient access to derived quantities such as errors and correlations.
#
# We will use [scipy.optimize.curve_fit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html) to fit a model to data and obtain an example covariance matrix. Note that most analysis problems are not a simple least-squares curve fit, but involve custom data and likelihood functions. How to obtain parameter error and covariance estimates with other popular Python packages is described in the `multinorm` documentation.

# ## Imports
#
# Let's start by importing everything we'll use in this notebook.

# In[1]:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.optimize

from multinorm import MultiNorm

# ## Example Setup
#
# As mentioned in the introduction, the kind of analysis we consider here involves a **model** with **parameters**, some **data**, and a **likelihood function**.
#
# We will use the [Rat42](https://www.itl.nist.gov/div898/strd/nls/data/ratkowsky2.shtml) dataset from the [NIST Standard Reference Database](https://www.itl.nist.gov/div898/strd/). To quote from [here](https://www.itl.nist.gov/div898/strd/nls/data/LINKS/i-ratkowsky2.shtml):
#
# > This model and data are an example of fitting sigmoidal growth curves taken from Ratkowsky (1983).
# > The response variable is pasture yield, and the predictor variable is growing time.
#
# Did I hear you say
#
# > Why? pasture yield? WTF?
#
# Well, the data is small (9 observations `x, y`), and the model is simple (one line in Python) and has 3 parameters, which is nice for illustrating properties of the correlation matrix and the multivariate normal distribution. And the fit uses a least-squares likelihood, which is built into `scipy`, available in pretty much any fitting package, and can also be implemented in one line of Python.

# In[2]:

data = pd.DataFrame({
    "x": [9, 14, 21, 28, 42, 57, 63, 70, 79],
    "y": [8.93, 10.80, 18.59, 22.33, 39.35, 56.11, 61.73, 64.62, 67.08],
})
data.T

# Having a DOI (http://dx.doi.org/10.18434/T43G6C) and expected reference results for the analysis is also nice. This is the result we should find:

# In[3]:

expected = pd.DataFrame.from_records(
    [
        ("b1", 7.2462237576e01, 1.7340283401e00),
        ("b2", 2.6180768402e00, 8.8295217536e-02),
        ("b3", 6.7359200066e-02, 3.4465663377e-03),
    ],
    columns=["par", "value", "error"],
    index="par",
)
expected

# ### Model
#
# Our model is the Rat42 sigmoidal growth curve.
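# Written out, this is the function we are about to implement (the same model
# as given on the NIST Rat42 page linked above):
#
# $$ y = \frac{b_1}{1 + e^{b_2 - b_3 x}} $$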
# In[4]:

def model(x, b1, b2, b3):
    # Rat42 sigmoidal growth curve
    return b1 / (1 + np.exp(b2 - b3 * x))

# ### Fit
#
# If we fit the model to the data, we obtain a parameter estimate (`popt`, the "optimal" or "best" parameter values) and a covariance matrix (`pcov`).

# In[5]:

# Starting values for the parameters (b1, b2, b3)
p0 = (100, 1, 0.1)

popt, pcov = scipy.optimize.curve_fit(
    f=model,
    xdata=data["x"],
    ydata=data["y"],
    p0=p0,
)

# Let's make a plot to see the best-fit model and the data:

# In[6]:

plt.plot(data["x"], data["y"], "o")
x = np.linspace(0, 100)
plt.plot(x, model(x, *popt));

# Finally, this is where `MultiNorm` comes in!

# ## MultiNorm
#
# In the last section we fit our three-parameter model to the data and obtained the best-fit parameter vector `popt` and the covariance matrix `pcov`. Using this information, we can create a `MultiNorm` object.
#
# ### Create

# In[7]:

from multinorm import MultiNorm

# In[8]:

mn = MultiNorm(mean=popt, cov=pcov, names=["b1", "b2", "b3"])

# In[9]:

mn

# `MultiNorm` is very simple: it just stores the `mean` vector and `cov` matrix
# and exposes them as pandas Series and DataFrame objects, with `names` as index.
#
# Of course the `mean` and `cov` are still available as Numpy arrays, via the `.values` property of the pandas objects.

# In[10]:

mn.names

# In[11]:

mn.mean

# In[12]:

mn.cov

# ### Errors
#
# In the printout of `mn` above, we saw that in addition to the `names`, `mean` and `cov` we passed in, another property `error` exists.
#
# This is the vector of "1 sigma" parameter error estimates:

# In[13]:

mn.error

# Usually you would state your measurement by giving the estimated value and error for each parameter:
#
# - `b1 = 72.46 +/- 1.73`
# - `b2 = 2.618 +/- 0.088`
# - `b3 = 0.0673 +/- 0.0034`
#
# The relationship between the covariance matrix and the errors is very simple: the diagonal entries of the covariance matrix contain the squared errors, so you can compute the errors as the square roots of the diagonal entries:

# In[14]:

np.sqrt(pcov.diagonal())

# ### Parameters
#
# Possibly the property you'll use most is `parameters`, a `pandas.DataFrame` with `names` as index and columns containing the `mean` and `error` for all parameters. It has a nice table printout, and you can use the usual pandas methods to select rows, columns and entries.

# In[15]:

mn.parameters

# In[16]:

pars = mn.parameters

# Access parameter info as a Series, by position or by name
# (note that only the last expression in a cell is displayed)
pars.iloc[0]
pars.loc["b1"]

# All the power of pandas at your fingertips.
# Just as an example:
pars["mean"].to_frame().T

# ### Correlations
#
# The `correlation` property contains the matrix of correlation coefficients. It is the covariance matrix normalised by the parameter errors, `corr_ij = cov_ij / (error_i * error_j)`: all entries are dimensionless, lie between -1 and +1, and the diagonal is always 1.

# In[17]:

mn.correlation

# In[18]:

mn.correlation.loc["b1", "b3"]

# In[19]:

mn.correlation["b3"].to_frame()

# ### Plot
#
# TODO: plotting ellipses with MPL is horrible. Add a plot helper method!

# In[20]:

fig, ax = plt.subplots()

# Fix `b3` to obtain a 2D distribution in (b1, b2) and draw its error ellipse
ellipse = mn.fix("b3").to_matplotlib_ellipse()
ax.add_artist(ellipse)

ax.set_xlabel("b1")
ax.set_ylabel("b2")
# `add_artist` does not autoscale, so set limits by hand
# (roughly best-fit value +/- 4 errors, so the ellipse is visible)
ax.set_xlim(65, 80)
ax.set_ylim(2.25, 3.0)
ax.grid(True)

# Mark the best-fit point
ax.plot(mn.mean["b1"], mn.mean["b2"], "o")

plt.tight_layout()
plt.show()

# ## Summary
#
# Let us recap the most important points concerning multivariate normal distributions and the `MultiNorm` class.
#
# To learn more: https://multinorm.readthedocs.io
#
# ### Theory
#
# - A multivariate normal distribution is completely described by its mean vector and covariance matrix.
# - The diagonal of the covariance matrix contains the squared parameter errors, and normalising the covariance by those errors gives the correlation matrix.
#
# ### Code
#
# - The `MultiNorm` class stores `mean`, `cov` and `names`, exposes them as pandas objects, and derives quantities such as `error`, `parameters` and `correlation` from them.

# ## Exercises
#
# - TODO
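# As a starting point for your own exploration, here is a small sketch that
# compares our fit results to the NIST reference values. It only uses the `mn`
# and `expected` objects defined above (and assumes `mn.parameters` behaves as
# the ordinary `pandas.DataFrame` described in the Parameters section):

# In[21]:

# Side-by-side comparison of our fit with the NIST reference results;
# the columns align automatically via the shared (b1, b2, b3) index.
comparison = mn.parameters.copy()
comparison["expected_value"] = expected["value"]
comparison["expected_error"] = expected["error"]
comparison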