import pandas as pd
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Poisson regression is the kind of regression we use when the dependent variable is count data and we want to estimate the effect that our explanatory variables have on it. When we model the counts with a linear combination of the explanatory variables passed through a log link, Poisson regression is a member of the family of generalized linear models.
It's called "Poisson" because we use the Poisson distribution as the likelihood for the dependent variable.
What we get out of this type of model is the relative contribution of each explanatory variable to the value of the dependent variable.
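Concretely, the model ties the linear predictor to the count outcome through an exponential (log) link: the rate is $\lambda = \exp(X\beta)$, and the counts are drawn from $\text{Poisson}(\lambda)$. A minimal simulation sketch of that generative story (the design matrix and coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical design matrix: 100 observations, 3 explanatory variables
X = rng.normal(size=(100, 3))
betas = np.array([0.5, -0.2, 1.0])  # made-up "true" coefficients

# Log link: the linear combination lives on the log scale, so
# exponentiating it gives a strictly positive Poisson rate.
lam = np.exp(X @ betas)

# The dependent variable is count data drawn from Poisson(lam)
y = rng.poisson(lam)

print(y[:5])  # small non-negative integers
```

The inference problem below is the reverse of this sketch: given `X` and `y`, recover a posterior over `betas`.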
To use this model, the data should be structured with one row per observation: a column of counts for the dependent variable, plus one column per explanatory variable.
Here is an example of how to summarize the findings:

For every unit increase in $X_i$, we expect to see an increase in $Y$ by $\text{mean}$ (95% HPD: [$\text{lower}$, $\text{upper}$]).
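Given posterior samples of a coefficient, the `mean`, `lower`, and `upper` in that sentence can be computed directly. A sketch with fabricated samples standing in for a column of the trace (in practice they would come from something like `trace["betas"][:, i]`; note that `np.percentile` gives an equal-tailed interval, which approximates but is not identical to the HPD):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 4,000 posterior samples of one coefficient
beta_samples = rng.normal(loc=0.8, scale=0.1, size=4000)

mean = beta_samples.mean()
lower, upper = np.percentile(beta_samples, [2.5, 97.5])

print(
    f"For every unit increase in X_i, we expect an increase in log(Y) by "
    f"{mean:.2f} (95% interval: [{lower:.2f}, {upper:.2f}])"
)
```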
df = pd.read_csv("../datasets/ship-damage.txt")
# Log10 transform months
df["months"] = df["months"].apply(lambda x: np.log10(x))
df["period_op"] = df["period_op"] - 1
df.head()
df.shape
plt.scatter(x=df["months"], y=df["n_damages"])
plt.xlabel("months")
plt.ylabel("n_damages")
import theano.tensor as tt
with pm.Model() as model:
    xs = pm.floatX(df[["yr_construction", "period_op", "months"]].values)
    # Weakly-informative priors on the coefficients; shape 3 (not (3, 1)) so
    # that tt.dot(xs, betas) yields a vector matching the observed data's shape.
    betas = pm.Normal("betas", mu=0, sd=100 ** 2, shape=3)
    n_damages = tt.dot(xs, betas)
    # Exponentiate on the Theano graph (tt.exp, not np.exp) to get a positive rate.
    n_damages_like = pm.Poisson(
        "likelihood", mu=tt.exp(n_damages), observed=df["n_damages"]
    )
    trace = pm.sample(draws=2000)  # , start=pm.find_MAP(), step=pm.Metropolis())
pm.traceplot(trace)
pm.forestplot(trace, ylabels=["yr_construction", "period_op", "months"])
The best interpretation of this is that the log10 number of months that a boat has been used is the strongest positive contributor to the number of damages that a ship takes.
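Because the likelihood exponentiates the linear predictor, a coefficient $\beta$ translates into a multiplicative effect of $e^{\beta}$ on the expected count per unit increase of its predictor. Since `months` was log10-transformed above, one unit on that scale means a 10-fold increase in months. A sketch with a made-up coefficient value (substitute the posterior mean from `pm.summary(trace)`):

```python
import numpy as np

beta_months = 1.2  # made-up stand-in for the log10(months) coefficient

# A one-unit increase on the log10(months) scale (i.e. 10x more months)
# multiplies the expected number of damages by exp(beta).
effect = np.exp(beta_months)
print(f"A 10-fold increase in months multiplies expected damages by {effect:.2f}")
```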
pm.summary(trace)
Let's see what the PPC looks like. We will sample 10,000 predicted values for each row in the dataframe, and plot the 95% HPD of the values.
with model:
    ppc = pm.sample_ppc(trace, samples=10000)
ppc["likelihood"].shape
Let's plot the posterior distribution of predictions vs. actual.
y_true = df["n_damages"].values
lower, med, upper = np.percentile(ppc["likelihood"], [2.5, 50, 97.5], axis=0)
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
# yerr takes distances from the plotted point, not absolute interval bounds
ax.errorbar(y_true, med, yerr=[med - lower, upper - med], fmt="o")
def x_eq_y(ax):
    """Overlay the line y = x across the current axis limits."""
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    lo, hi = min(xmin, ymin), max(xmax, ymax)
    ax.plot([lo, hi], [lo, hi])
    return ax
ax = x_eq_y(ax)
ax.set_xlabel("true values")
ax.set_ylabel("predicted values")