While a bit of a "toy" example, the notebook below uses real data: the last quarter of the prior year (already shifted into the lastqu column) is used to "project" the upcoming year via a regression line fit to prior observations. It actually works surprisingly well.
Since it is something I know and have the data for, I'm using it as a learning base for the various Python stats capabilities, including a recent foray into PyMC3 and statsmodels, along with pandas. This notebook can be freely downloaded, modified, and redistributed (attribution appreciated). The PyMC3 code here was largely copied, with small modifications, from http://www.databozo.com/2014/01/17/Exploring_PyMC3.html, http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/, and http://nbviewer.ipython.org/github/twiecki/pymc3_talk/blob/master/bayesian_pymc3.ipynb
This is also quite helpful for OLS analysis: http://www.datarobot.com/blog/ordinary-least-squares-in-python/
Questions:
- Any comments to help interpret the partial regression graph in cell #6?
- How can I better use the PyMC3 models (see the questions above cell #10)?
- Any other comments on how to improve this for others?
import pandas as pd
import io
import statsmodels.api as sm
%matplotlib inline
import matplotlib.pyplot as plt
content2 = '''\
Units lastqu
2000-12-31 19391 NaN
2001-12-31 35068 5925
2002-12-31 39279 8063
2003-12-31 47517 9473
2004-12-31 51439 11226
2005-12-31 59674 11667
2006-12-31 58664 14016
2007-12-31 55698 13186
2008-12-31 42235 11343
2009-12-31 40478 7867
2010-12-31 38722 8114
2011-12-31 36965 8361
2012-12-31 39132 8608
2013-12-31 43160 9016
2014-12-31 NaN 9785
'''
df2 = pd.read_table(io.BytesIO(content2), sep=r'\s+')  # whitespace-separated; the unnamed first column becomes the date index
# cast to int where possible (the NaN rows keep the columns float overall)
df2['Units'] = df2['Units'][:-1].astype('int')
df2['lastqu'] = df2['lastqu'][1:].astype('int')
df2  # note: sample data is from Statewide
| | Units | lastqu |
|---|---|---|
| 2000-12-31 | 19391 | NaN |
| 2001-12-31 | 35068 | 5925 |
| 2002-12-31 | 39279 | 8063 |
| 2003-12-31 | 47517 | 9473 |
| 2004-12-31 | 51439 | 11226 |
| 2005-12-31 | 59674 | 11667 |
| 2006-12-31 | 58664 | 14016 |
| 2007-12-31 | 55698 | 13186 |
| 2008-12-31 | 42235 | 11343 |
| 2009-12-31 | 40478 | 7867 |
| 2010-12-31 | 38722 | 8114 |
| 2011-12-31 | 36965 | 8361 |
| 2012-12-31 | 39132 | 8608 |
| 2013-12-31 | 43160 | 9016 |
| 2014-12-31 | NaN | 9785 |

15 rows × 2 columns
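For context, a lagged column like lastqu is typically built with pandas' shift, so that each year's predictor is the prior year's fourth-quarter units. A minimal sketch with hypothetical column names (not the construction actually used for this dataset):

import pandas as pd

# Hypothetical illustration: q4_units holds each year's own Q4 units;
# shifting down one row makes each year's predictor the *prior* year's Q4.
yearly = pd.DataFrame({'q4_units': [5925, 8063, 9473]},
                      index=pd.to_datetime(['2000-12-31', '2001-12-31', '2002-12-31']))
yearly['lastqu'] = yearly['q4_units'].shift(1)  # the 2001-12-31 row gets 5925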
def fit_line2(x, y):
    """Fit y on x by OLS and return the fitted results object."""
    X = sm.add_constant(x, prepend=True)  # add a column of ones so the intercept is estimated
    return sm.OLS(y, X, missing='drop').fit()
ols_test = fit_line2(df2['lastqu'][1:-1], df2['Units'][1:-1])  # fit Units on lastqu (so we can forecast the next year)
ols_test.summary()
/usr/local/lib/python2.7/dist-packages/scipy/stats/stats.py:1293: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
| Dep. Variable: | Units | R-squared: | 0.797 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.779 |
| Method: | Least Squares | F-statistic: | 43.18 |
| Date: | Fri, 14 Feb 2014 | Prob (F-statistic): | 4.02e-05 |
| Time: | 17:24:05 | Log-Likelihood: | -125.17 |
| No. Observations: | 13 | AIC: | 254.3 |
| Df Residuals: | 11 | BIC: | 255.5 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 1.368e+04 | 4927.353 | 2.777 | 0.018 | 2837.631 2.45e+04 |
| lastqu | 3.2330 | 0.492 | 6.571 | 0.000 | 2.150 4.316 |

| Omnibus: | 2.563 | Durbin-Watson: | 1.782 |
|---|---|---|---|
| Prob(Omnibus): | 0.278 | Jarque-Bera (JB): | 0.504 |
| Skew: | 0.019 | Prob(JB): | 0.777 |
| Kurtosis: | 3.964 | Cond. No. | 4.45e+04 |
nextq=ols_test.predict(sm.add_constant(df2['lastqu'][-1:], prepend=True)) # the prediction for 2014
nextq
array([ 45317.70961059])
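As a sanity check, the point forecast is just const + coef × lastqu: 1.368e4 + 3.2330 × 9785 ≈ 45,315, which matches the array above up to the rounding of the printed coefficients.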
Assigning the plot to fig fixes the duplicate plot issue
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(ols_test, 'lastqu', fig=fig)
fig.clf()  # clear the figure before drawing the partial-regression grid on it
sm.graphics.plot_partregress_grid(ols_test, fig=fig)
result = ols_test.model.fit()  # refitting the same model reproduces the identical summary
result.summary()
(identical OLS summary to the one shown above)
# this uses the statsmodels formula API (same results)
# import formula api as alias smf
import statsmodels.formula.api as smf
# formula: response ~ predictors
est = smf.ols(formula='Units ~ lastqu', data=df2).fit()
est.summary()
(identical OLS summary to the one above, except the intercept row is labeled Intercept rather than const)
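As an aside (my own sketch, not part of the original run): because the formula API builds the design matrix from named dataframe columns via patsy, the 2014 forecast can be reproduced directly from the last row, where lastqu = 9785 and Units is still NaN:

est.predict(df2[-1:])  # should reproduce the ~45,318 forecast computed above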
fig = plt.figure(figsize=(12,8))
fig=sm.graphics.plot_regress_exog(est,'lastqu',fig=fig)
Using PyMC3 with the same data (excluding the rows with NaN). First with the model specified directly; I found I needed to set the upper bound on sigma to 12000 to get good results. Using the GLM method below (next) I got unintelligible results.
Questions:
How can I determine the best regression line using the generated Bayes data? Shouldn't I be able to find the maximum likelihood for two points on the line, i.e. find the standard parameters of the regression line, and thereby also perform a prediction with a new value of x (and ideally say something about its probability)? (A sketch attempting this appears after the plotting loop below.)
#Remove the NaN rows
df2 = df2[1:-1]  # drop the first row (Units only) and the last row (lastqu only)
import pymc as pm  # the PyMC3 development branch imports as `pymc` at this point
import numpy as np
x = df2['lastqu']
y = df2['Units']
trace = None
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=20)            # intercept prior
    beta = pm.Normal('beta', mu=0, sd=20)              # slope prior
    sigma = pm.Uniform('sigma', lower=0, upper=12000)  # noise scale; upper bound found by trial and error
    y_est = alpha + beta * x
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    start = pm.find_MAP()                              # use the MAP estimate as the sampler's starting point
    step = pm.NUTS(state=start)
    trace = pm.sample(2000, step, start=start, progressbar=False)
pm.traceplot(trace);
I'd like to draw the most likely regression line from the MCMC samples rather than via the standard frequentist formula (as mentioned above):
def graph(formula, x_range, color='black', alpha=1):
    """Evaluate a formula string in x and plot it as a line."""
    x = np.array(x_range)
    y = eval(formula)  # the formula string is built locally from trace values, so eval is safe here
    plt.plot(x, y, color=color, alpha=alpha)

plt.scatter(x, y)  # the observed data
for i in xrange(0, 2000):
    point = trace.point(i)
    graph('{0} + {1}*x'.format(point['alpha'], point['beta']),
          range(6000, 15000), color='black', alpha=.0098035)
plt.show()
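One way to answer the question above (a minimal sketch, assuming the trace supports indexing by variable name, as in the linked posts): work with the posterior samples of alpha and beta directly. Their means give a point-estimate line, and the spread of alpha + beta * x_new at a new x gives a credible interval for the prediction.

# Sketch: point estimates and a credible interval from the posterior samples.
# x_new = 9785 is the 2014 lastqu value used in the OLS forecast above.
alphas = trace['alpha']
betas = trace['beta']
x_new = 9785
print 'posterior-mean line: %.1f + %.4f * x' % (alphas.mean(), betas.mean())
y_new = alphas + betas * x_new  # posterior of the mean response at x_new
print 'prediction:', y_new.mean()
print '95% interval:', np.percentile(y_new, 2.5), np.percentile(y_new, 97.5)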
The process below uses the new GLM function but seems to have issues, since I could not specify the maximum for sigma; one would hope that the formula interface could find it on its own?
x=df2['lastqu']
y=df2['Units']
data = dict(x=x, y=y)
data
(output: a dict with 'x' = the lastqu series and 'y' = the Units series for 2001-12-31 through 2013-12-31, both float64)
with pm.Model() as model:
    # specify the GLM and pass in the data. The resulting linear model, its
    # likelihood, and all its parameters are automatically added to our model.
    pm.glm.glm('y ~ x', data)
    step = pm.NUTS()                                  # instantiate the MCMC sampling algorithm
    trace = pm.sample(2000, step, progressbar=False)  # draw 2000 posterior samples using NUTS
plt.figure(figsize=(7, 7))
pm.traceplot(trace)
plt.tight_layout();
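To visualize the GLM fit the way the twiecki post does, sampled regression lines can be overlaid on the data. A hedged sketch, assuming this PyMC3 build ships pm.glm.plot_posterior_predictive and that its eval keyword sets the x-grid:

plt.figure(figsize=(7, 7))
plt.scatter(x, y)  # the observed data
# overlay 100 regression lines drawn from the posterior
pm.glm.plot_posterior_predictive(trace, eval=np.linspace(6000, 15000, 100), samples=100)
plt.show()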
point['sigma']  # `point` is still the last sample from the plotting loop above
7436.1618154575099
point['beta']
4.2154869439146863
point['alpha']
4.6985952299713603
#!pip list