While a bit of a "toy" example, the notebook below uses real data: the last quarter of the prior year (already shifted into the lastqu column) is used to "project" the upcoming year via a regression line fit to prior observations. It actually works surprisingly well.
Since it is something I know and have the data for, I'm using it as a learning base for the various Python stats capabilities, including a recent foray into PyMC3 and statsmodels, along with pandas. This notebook can be freely downloaded, modified, and redistributed (attribution appreciated). The PyMC3 code here was largely copied, with small modifications, from http://www.databozo.com/2014/01/17/Exploring_PyMC3.html, http://twiecki.github.io/blog/2013/08/12/bayesian-glms-1/, and http://nbviewer.ipython.org/github/twiecki/pymc3_talk/blob/master/bayesian_pymc3.ipynb
This is also quite helpful for OLS analysis: http://www.datarobot.com/blog/ordinary-least-squares-in-python/
Questions:
- Any comments to help interpret the partial regression graph in cell #6?
- How can I better use the PyMC3 models (see the questions above cell #10)?
- Any other comments on how to improve this for others?
import pandas as pd
import io
import statsmodels.api as sm
%matplotlib inline
import matplotlib.pyplot as plt
content2 = '''\
Units lastqu
2000-12-31 19391 NaN
2001-12-31 35068 5925
2002-12-31 39279 8063
2003-12-31 47517 9473
2004-12-31 51439 11226
2005-12-31 59674 11667
2006-12-31 58664 14016
2007-12-31 55698 13186
2008-12-31 42235 11343
2009-12-31 40478 7867
2010-12-31 38722 8114
2011-12-31 36965 8361
2012-12-31 39132 8608
2013-12-31 43160 9016
2014-12-31 NaN 9785
'''
df2 = pd.read_table(io.BytesIO(content2), sep=r'\s+')  # whitespace-separated; the unnamed first column becomes the date index
# cast to int where possible (the NaN rows keep the columns float overall)
df2['Units'] = df2['Units'][:-1].astype('int')
df2['lastqu'] = df2['lastqu'][1:].astype('int')
df2  # note: sample data is from Statewide
| | Units | lastqu |
|---|---|---|
| 2000-12-31 | 19391 | NaN |
| 2001-12-31 | 35068 | 5925 |
| 2002-12-31 | 39279 | 8063 |
| 2003-12-31 | 47517 | 9473 |
| 2004-12-31 | 51439 | 11226 |
| 2005-12-31 | 59674 | 11667 |
| 2006-12-31 | 58664 | 14016 |
| 2007-12-31 | 55698 | 13186 |
| 2008-12-31 | 42235 | 11343 |
| 2009-12-31 | 40478 | 7867 |
| 2010-12-31 | 38722 | 8114 |
| 2011-12-31 | 36965 | 8361 |
| 2012-12-31 | 39132 | 8608 |
| 2013-12-31 | 43160 | 9016 |
| 2014-12-31 | NaN | 9785 |

15 rows × 2 columns
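For context, a lagged column like lastqu is typically built with pandas' shift, so that each year's predictor is the prior year's fourth-quarter units. A minimal sketch with hypothetical column names (not the construction actually used for this dataset):

import pandas as pd

# Hypothetical illustration: q4_units holds each year's own Q4 units;
# shifting down one row makes each year's predictor the *prior* year's Q4.
yearly = pd.DataFrame({'q4_units': [5925, 8063, 9473]},
                      index=pd.to_datetime(['2000-12-31', '2001-12-31', '2002-12-31']))
yearly['lastqu'] = yearly['q4_units'].shift(1)  # the 2001-12-31 row gets 5925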
def fit_line2(x, y):
    """Fit y on x by OLS and return the fitted results object."""
    X = sm.add_constant(x, prepend=True)  # add a column of ones so the intercept is estimated
    return sm.OLS(y, X, missing='drop').fit()
ols_test = fit_line2(df2['lastqu'][1:-1], df2['Units'][1:-1])  # fit Units on lastqu (so we can forecast the next year)
ols_test.summary()
/usr/local/lib/python2.7/dist-packages/scipy/stats/stats.py:1293: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=13
| Dep. Variable: | Units | R-squared: | 0.797 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.779 |
| Method: | Least Squares | F-statistic: | 43.18 |
| Date: | Fri, 14 Feb 2014 | Prob (F-statistic): | 4.02e-05 |
| Time: | 17:24:05 | Log-Likelihood: | -125.17 |
| No. Observations: | 13 | AIC: | 254.3 |
| Df Residuals: | 11 | BIC: | 255.5 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 1.368e+04 | 4927.353 | 2.777 | 0.018 | 2837.631 2.45e+04 |
| lastqu | 3.2330 | 0.492 | 6.571 | 0.000 | 2.150 4.316 |

| Omnibus: | 2.563 | Durbin-Watson: | 1.782 |
|---|---|---|---|
| Prob(Omnibus): | 0.278 | Jarque-Bera (JB): | 0.504 |
| Skew: | 0.019 | Prob(JB): | 0.777 |
| Kurtosis: | 3.964 | Cond. No. | 4.45e+04 |
nextq=ols_test.predict(sm.add_constant(df2['lastqu'][-1:], prepend=True)) # the prediction for 2014
nextq
array([ 45317.70961059])
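As a sanity check, the point forecast is just const + coef × lastqu: 1.368e4 + 3.2330 × 9785 ≈ 45,315, which matches the array above up to the rounding of the printed coefficients.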
Assigning the plot to fig fixes the duplicate plot issue
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(ols_test, 'lastqu', fig=fig)
fig.clf()  # clear the figure before drawing the partial-regression grid on it
sm.graphics.plot_partregress_grid(ols_test, fig=fig)
result = ols_test.model.fit()  # refitting the same model reproduces the identical summary
result.summary()
(identical OLS summary to the one shown above)
# this uses the statsmodels formula API (same results)
# import formula api as alias smf
import statsmodels.formula.api as smf
# formula: response ~ predictors
est = smf.ols(formula='Units ~ lastqu', data=df2).fit()
est.summary()
(identical OLS summary to the one above, except the intercept row is labeled Intercept rather than const)
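As an aside (my own sketch, not part of the original run): because the formula API builds the design matrix from named dataframe columns via patsy, the 2014 forecast can be reproduced directly from the last row, where lastqu = 9785 and Units is still NaN:

est.predict(df2[-1:])  # should reproduce the ~45,318 forecast computed above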
fig = plt.figure(figsize=(12,8))
fig=sm.graphics.plot_regress_exog(est,'lastqu',fig=fig)
Using PyMC3 with the same data (excluding the rows with NaN). First with the model specified directly; I found I needed to set the upper bound on sigma to 12000 to get good results. Using the GLM method below (next) I got unintelligible results.
Questions:
How can I determine the best regression line using the generated Bayes data? Shouldn't I be able to find the maximum likelihood for two points on the line, i.e. find the standard parameters of the regression line, and thereby also perform a prediction with a new value of x (and ideally say something about its probability)? (A sketch attempting this appears after the plotting loop below.)
#Remove the NaN rows
df2 = df2[1:-1]  # drop the first row (Units only) and the last row (lastqu only)
import pymc as pm  # the PyMC3 development branch imports as `pymc` at this point
import numpy as np
x = df2['lastqu']
y = df2['Units']
trace = None
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=20)            # intercept prior
    beta = pm.Normal('beta', mu=0, sd=20)              # slope prior
    sigma = pm.Uniform('sigma', lower=0, upper=12000)  # noise scale; upper bound found by trial and error
    y_est = alpha + beta * x
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    start = pm.find_MAP()                              # use the MAP estimate as the sampler's starting point
    step = pm.NUTS(state=start)
    trace = pm.sample(2000, step, start=start, progressbar=False)
pm.traceplot(trace);
I'd like to draw the most likely regression line from the MCMC samples rather than via the standard frequentist formula (as mentioned above):
def graph(formula, x_range, color='black', alpha=1):
    """Evaluate a formula string in x and plot it as a line."""
    x = np.array(x_range)
    y = eval(formula)  # the formula string is built locally from trace values, so eval is safe here
    plt.plot(x, y, color=color, alpha=alpha)

plt.scatter(x, y)  # the observed data
for i in xrange(0, 2000):
    point = trace.point(i)
    graph('{0} + {1}*x'.format(point['alpha'], point['beta']),
          range(6000, 15000), color='black', alpha=.0098035)
plt.show()
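One way to answer the question above (a minimal sketch, assuming the trace supports indexing by variable name, as in the linked posts): work with the posterior samples of alpha and beta directly. Their means give a point-estimate line, and the spread of alpha + beta * x_new at a new x gives a credible interval for the prediction.

# Sketch: point estimates and a credible interval from the posterior samples.
# x_new = 9785 is the 2014 lastqu value used in the OLS forecast above.
alphas = trace['alpha']
betas = trace['beta']
x_new = 9785
print 'posterior-mean line: %.1f + %.4f * x' % (alphas.mean(), betas.mean())
y_new = alphas + betas * x_new  # posterior of the mean response at x_new
print 'prediction:', y_new.mean()
print '95% interval:', np.percentile(y_new, 2.5), np.percentile(y_new, 97.5)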
The process below uses the new GLM function but seems to have issues, since I could not specify the maximum for sigma; one would hope that the formula interface could find it on its own?
x=df2['lastqu']
y=df2['Units']
data = dict(x=x, y=y)
data
(output: a dict with 'x' = the lastqu series and 'y' = the Units series for 2001-12-31 through 2013-12-31, both float64)
with pm.Model() as model:
    # specify the GLM and pass in the data. The resulting linear model, its
    # likelihood, and all its parameters are automatically added to our model.
    pm.glm.glm('y ~ x', data)
    step = pm.NUTS()                                  # instantiate the MCMC sampling algorithm
    trace = pm.sample(2000, step, progressbar=False)  # draw 2000 posterior samples using NUTS
plt.figure(figsize=(7, 7))
pm.traceplot(trace)
plt.tight_layout();
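To visualize the GLM fit the way the twiecki post does, sampled regression lines can be overlaid on the data. A hedged sketch, assuming this PyMC3 build ships pm.glm.plot_posterior_predictive and that its eval keyword sets the x-grid:

plt.figure(figsize=(7, 7))
plt.scatter(x, y)  # the observed data
# overlay 100 regression lines drawn from the posterior
pm.glm.plot_posterior_predictive(trace, eval=np.linspace(6000, 15000, 100), samples=100)
plt.show()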
point['sigma']  # `point` is still the last sample from the plotting loop above
7436.1618154575099
point['beta']
4.2154869439146863
point['alpha']
4.6985952299713603
#!pip list