You are seeing the notebook output generated by Literate.jl from the Julia source file. The corresponding HTML output can be viewed here.
using AlgebraOfGraphics
using CairoMakie
using CalibrationErrors
using DataFrames
using Distributions
using MLJ
using MLJNaiveBayesInterface
using PalmerPenguins
using Random
# Plotting settings
set_aog_theme!()
CairoMakie.activate!(; type="svg")
In this example we study the calibration of different models that classify three penguin species based on measurements of their bill and flipper lengths.
We use the Palmer penguins dataset to to train and validate the models.
penguins = dropmissing(DataFrame(PalmerPenguins.load()))
penguins_mapping =
data(penguins) * mapping(
:bill_length_mm => "bill length (mm)", :flipper_length_mm => "flipper length (mm)"
)
draw(penguins_mapping * mapping(; color=:species) * visual(; alpha=0.7))
We split the data randomly into a training and validation dataset. The training dataset contains around 60% of the samples.
Random.seed!(1234)
n = nrow(penguins)
k = floor(Int, 0.7 * n)
Random.seed!(100)
penguins.train = shuffle!(vcat(trues(k), falses(n - k)))
# Plot the training and validation data
dataset = :train => renamer(true => "training", false => "validation") => "Dataset"
plt = penguins_mapping * mapping(; color=:species, col=dataset) * visual(; alpha=0.7)
draw(plt; axis=(height=300,))
For each species, we fit independent normal distributions to the observations of the bill and flipper length in the training data, using maximum likelihood estimation.
y, X = unpack(
penguins,
==(:species),
x -> x === :bill_length_mm || x === :flipper_length_mm;
:species => Multiclass,
:bill_length_mm => MLJ.Continuous,
:flipper_length_mm => MLJ.Continuous,
)
model = fit!(machine(GaussianNBClassifier(), X, y); rows=penguins.train);
[ Info: Training machine(GaussianNBClassifier(), …).
We plot the estimated normal distributions.
# plot datasets
fg = draw(plt; axis=(height=300,))
# plot Gaussian distributions
xgrid = range(extrema(penguins.bill_length_mm)...; length=100)
ygrid = range(extrema(penguins.flipper_length_mm)...; length=100)
let f = (x, y, dist) -> pdf(dist, [x, y])
for (class, color) in zip(classes(y), Makie.wong_colors())
pdfs = f.(xgrid, ygrid', Ref(model.fitresult.gaussians[class]))
contour!(fg.figure[1, 1], xgrid, ygrid, pdfs; color=color)
contour!(fg.figure[1, 2], xgrid, ygrid, pdfs; color=color)
end
end
fg
Let us assume that the bill and flipper length are conditionally independent given the penguin species. Then Bayes' theorem implies that $$ \begin{aligned} \mathbb{P}(\mathrm{species} \,|\, \mathrm{bill}, \mathrm{flipper}) &= \frac{\mathbb{P}(\mathrm{species}) \mathbb{P}(\mathrm{bill}, \mathrm{flipper} \,|\, \mathrm{species})}{\mathbb{P}(\mathrm{bill}, \mathrm{flipper})} \\ &= \frac{\mathbb{P}(\mathrm{species}) \mathbb{P}(\mathrm{bill} \,|\, \mathrm{species}) \mathbb{P}(\mathrm{flipper} \,|\, \mathrm{species})}{\mathbb{P}(\mathrm{bill}, \mathrm{flipper})}. \end{aligned} $$ This predictive model is known as naive Bayes classifier.
In the section above, we estimated $\mathbb{P}(\mathrm{species})$, $\mathbb{P}(\mathrm{bill} \,|\, \mathrm{species})$, and $\mathbb{P}(\mathrm{flipper} \,|\, \mathrm{species})$ for each penguin species from the training data. For the conditional distributions we used a Gaussian approximation.
predictions = MLJ.predict(model)
train_predict = predictions[penguins.train]
val_predict = predictions[.!penguins.train]
# Plot datasets
fg = draw(plt; axis=(height=300,))
# Plot predictions
predictions_grid = reshape(
MLJ.predict(model, reduce(hcat, vcat.(xgrid, ygrid'))'), length(xgrid), length(ygrid)
)
for (class, color) in zip(classes(y), Makie.wong_colors())
p = pdf.(predictions_grid, class)
contour!(fg.figure[1, 1], xgrid, ygrid, p; color=color)
contour!(fg.figure[1, 2], xgrid, ygrid, p; color=color)
end
fg
val_y = y[.!penguins.train]
-mean(cross_entropy(val_predict, val_y))
-0.12188703745583586
The average log-likelihood is also equivalent to the logarithmic score. The Brier score is another strictly proper scoring rule that can be used for evaluating probabilistic predictions.
mean(brier_score(val_predict, val_y))
-0.07385286949971775
As all proper scoring rules, the logarithmic and the Brier score can be decomposed in three terms that quantify the sharpness and calibration of the predictive model and the irreducible uncertainty of the targets that is inherent to the prediction problem. The calibration term in this decomposition is the expected calibration error (ECE) $$ \mathbb{E} d\big(P_X, \mathrm{law}(Y \,|\, P_X)\big) $$ with respect to the score divergence $d$.
Scoring rules, however, include also the sharpness and the uncertainty term. Thus models can trade off calibration for sharpness and therefore scoring rules are not suitable for specifically evaluating calibration of predictive models.
The score divergence to the logarithmic and the Brier score are the Kullback-Leibler (KL) divergence $$ d\big(P_X, \mathrm{law}(Y \,|\, P_X)\big) = \sum_{y} \mathbb{P}(Y = y \,|\, P_X) \log\big(\mathbb{P}(Y = y \,|\, P_X) / P_X(\{y\})\big) $$ and the squared Euclidean distance $$ d\big(P_X, \mathrm{law}(Y \,|\, P_X)\big) = \sum_{y} \big(P_X - \mathrm{law}(Y \,|\, P_X)\big)^2(\{y\}), $$ respectively. The KL divergence is defined only if $\mathrm{law}(Y \,|\, P_X)$ is absolutely continuous with respect to $P_X$, i.e., if $P_X(\{y\}) = 0$ implies $\mathbb{P}(Y = y \,|\, P_X) = 0$.
We estimate the ECE by binning the probability simplex of predictions $P_X$ and computing the weighted average of the distances between the mean prediction and the distribution of targets in each bin.
One approach is to use bins of uniform size.
ece = ECE(UniformBinning(10), (μ, y) -> kl_divergence(y, μ));
We have to work with a numerical encoding of the true penguin species and a
corresponding vector of predictions. We use RowVecs
to indicate that the rows in the matrix of probabilities returned by pdf
are the predictions. If we would provide predictions as columns of a matrix, we would have
to use ColVecs
.
val_yint = map(MLJ.levelcode, val_y)
val_probs = RowVecs(pdf(val_predict, MLJ.classes(y)));
We compute the estimate on the validation data:
ece(val_probs, val_yint)
0.04860861700674836
For the squared Euclidean distance we obtain:
ece = ECE(UniformBinning(10), SqEuclidean())
ece(val_probs, val_yint)
0.02426469201343113
Alternatively, one can use a data-dependent binning scheme that tries to split the predictions in a way that minimizes the variance in each bin.
With the KL divergence we get:
ece = ECE(MedianVarianceBinning(5), (μ, y) -> kl_divergence(y, μ))
ece(val_probs, val_yint)
0.027874966150111966
For the squared Euclidean distance we obtain:
ece = ECE(MedianVarianceBinning(5), SqEuclidean())
ece(val_probs, val_yint)
0.012238423729555838
We see that the estimates (of the same theoretical quantity!) are highly dependent on the chosen binning scheme.
As an alternative to the ECE, we estimate the kernel calibration error (KCE). We keep it simple here, and use the tensor product kernel $$ k\big((\mu, y), (\mu', y')\big) = \delta_{y,y'} \exp{\bigg(-\frac{{\|\mu - \mu'\|}_2^2}{2\nu^2} \bigg)} $$ with length scale $\nu > 0$ for predictions $\mu,\mu'$ and corresponding targets $y, y'$. For simplicity, we estimate length scale $\nu$ with the median heuristic.
distances = pairwise(SqEuclidean(), RowVecs(pdf(train_predict, MLJ.classes(y))))
ν = sqrt(median(distances[i] for i in CartesianIndices(distances) if i[1] < i[2]))
kernel = with_lengthscale(GaussianKernel(), ν) ⊗ WhiteKernel();
We obtain the following biased estimate of the squared KCE (SKCE):
skce = SKCE(kernel; unbiased=false)
skce(val_probs, val_yint)
0.0007888007181424227
Similar to the biased estimates of the ECE, the biased estimates of the SKCE are always non-negative. The unbiased estimates can be negative as well, in particular if the model is (close to being) calibrated:
skce = SKCE(kernel)
skce(val_probs, val_yint)
5.0779821358820946e-5
When the datasets are large, the quadratic sample complexity of the standard biased and unbiased estimators of the SKCE can become prohibitive. In these cases, one can resort to an estimator that averages estimates of non-overlapping blocks of samples. This estimator allows to trade off computational cost for increased variance.
Here we consider the extreme case of blocks with two samples, which yields an estimator with linear sample complexity:
skce = SKCE(kernel; blocksize=2)
skce(val_probs, val_yint)
0.01903306352441053
Status `~/work/CalibrationErrors.jl/CalibrationErrors.jl/examples/classification/Project.toml`
[cbdf2221] AlgebraOfGraphics v0.6.16
[13f3f980] CairoMakie v0.10.6
[33913031] CalibrationErrors v0.6.4
[a93c6f00] DataFrames v1.5.0
[31c24e10] Distributions v0.25.98
[add582a8] MLJ v0.19.2
[33e4bacb] MLJNaiveBayesInterface v0.1.6
[8b842266] PalmerPenguins v0.1.4
Julia Version 1.9.1
Commit 147bdf428cd (2023-06-07 08:27 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 2 × Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, broadwell)
Threads: 1 on 2 virtual cores
Environment:
JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
JULIA_DEBUG = Documenter
JULIA_LOAD_PATH = :/home/runner/work/CalibrationErrors.jl/CalibrationErrors.jl/docs
To reproduce the project environment of this example you can download the full Manifest.toml.
This notebook was generated using Literate.jl.