import pandas as pd
import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
Poisson regression is the kind of regression we use when the dependent variable is count data and we want to estimate the effect that our explanatory variables have on it. When we model the counts with a linear combination of the explanatory variables passed through a log link, Poisson regression is a member of the family of generalized linear models.
It's called "Poisson" because we use the Poisson distribution as the likelihood for the dependent variable.
What we get out of this type of model is the relative contribution of each explanatory variable to the value of the dependent variable.
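Concretely, the model ties the linear predictor to the count outcome through an exponential (log) link: the rate is $\lambda = \exp(X\beta)$, and the counts are drawn from $\text{Poisson}(\lambda)$. A minimal simulation sketch of that generative story (the design matrix and coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical design matrix: 100 observations, 3 explanatory variables
X = rng.normal(size=(100, 3))
betas = np.array([0.5, -0.2, 1.0])  # made-up "true" coefficients

# Log link: the linear combination lives on the log scale, so
# exponentiating it gives a strictly positive Poisson rate.
lam = np.exp(X @ betas)

# The dependent variable is count data drawn from Poisson(lam)
y = rng.poisson(lam)

print(y[:5])  # small non-negative integers
```

The inference problem below is the reverse of this sketch: given `X` and `y`, recover a posterior over `betas`.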
To use this model, the data should be structured with one row per observation: a column of counts for the dependent variable, plus one column per explanatory variable.
Here is an example of how to summarize the findings:

For every unit increase in $X_i$, we expect to see an increase in $Y$ by $\text{mean}$ (95% HPD: [$\text{lower}$, $\text{upper}$]).
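Given posterior samples of a coefficient, the `mean`, `lower`, and `upper` in that sentence can be computed directly. A sketch with fabricated samples standing in for a column of the trace (in practice they would come from something like `trace["betas"][:, i]`; note that `np.percentile` gives an equal-tailed interval, which approximates but is not identical to the HPD):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 4,000 posterior samples of one coefficient
beta_samples = rng.normal(loc=0.8, scale=0.1, size=4000)

mean = beta_samples.mean()
lower, upper = np.percentile(beta_samples, [2.5, 97.5])

print(
    f"For every unit increase in X_i, we expect an increase in log(Y) by "
    f"{mean:.2f} (95% interval: [{lower:.2f}, {upper:.2f}])"
)
```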
df = pd.read_csv("../datasets/ship-damage.txt")
# Log10 transform months
df["months"] = df["months"].apply(lambda x: np.log10(x))
df["period_op"] = df["period_op"] - 1
df.head()
df.shape
plt.scatter(x=df["months"], y=df["n_damages"])
plt.xlabel("months")
plt.ylabel("n_damages")
import theano.tensor as tt
with pm.Model() as model:
    xs = pm.floatX(df[["yr_construction", "period_op", "months"]].values)
    # Weakly-informative priors on the coefficients; shape 3 (not (3, 1)) so
    # that tt.dot(xs, betas) yields a vector matching the observed data's shape.
    betas = pm.Normal("betas", mu=0, sd=100 ** 2, shape=3)
    n_damages = tt.dot(xs, betas)
    # Exponentiate on the Theano graph (tt.exp, not np.exp) to get a positive rate.
    n_damages_like = pm.Poisson(
        "likelihood", mu=tt.exp(n_damages), observed=df["n_damages"]
    )
    trace = pm.sample(draws=2000)  # , start=pm.find_MAP(), step=pm.Metropolis())
pm.traceplot(trace)
pm.forestplot(trace, ylabels=["yr_construction", "period_op", "months"])
The best interpretation of this is that the log10 number of months that a boat has been used is the strongest positive contributor to the number of damages that a ship takes.
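Because the likelihood exponentiates the linear predictor, a coefficient $\beta$ translates into a multiplicative effect of $e^{\beta}$ on the expected count per unit increase of its predictor. Since `months` was log10-transformed above, one unit on that scale means a 10-fold increase in months. A sketch with a made-up coefficient value (substitute the posterior mean from `pm.summary(trace)`):

```python
import numpy as np

beta_months = 1.2  # made-up stand-in for the log10(months) coefficient

# A one-unit increase on the log10(months) scale (i.e. 10x more months)
# multiplies the expected number of damages by exp(beta).
effect = np.exp(beta_months)
print(f"A 10-fold increase in months multiplies expected damages by {effect:.2f}")
```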
pm.summary(trace)
Let's see what the PPC looks like. We will sample 10,000 predicted values for each row in the dataframe, and plot the 95% HPD of the values.
with model:
    ppc = pm.sample_ppc(trace, samples=10000)
ppc["likelihood"].shape
Let's plot the posterior distribution of predictions vs. actual.
y_true = df["n_damages"].values
lower, med, upper = np.percentile(ppc["likelihood"], [2.5, 50, 97.5], axis=0)
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
# yerr takes distances from the plotted point, not absolute interval bounds
ax.errorbar(y_true, med, yerr=[med - lower, upper - med], fmt="o")
def x_eq_y(ax):
    """Overlay the line y = x across the current axis limits."""
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    lo, hi = min(xmin, ymin), max(xmax, ymax)
    ax.plot([lo, hi], [lo, hi])
    return ax
ax = x_eq_y(ax)
ax.set_xlabel("true values")
ax.set_ylabel("predicted values")