%load_ext rpy2.ipython
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
from scipy import stats
import sympy as sym
from IPython.display import display, Image
plt.rcParams['figure.figsize'] = (15, 5)
Markov chains taking real values: Discrete time process (Xn)n≥0 where Xn∈R), with the Markov property:
P(Xn≤x∣X0=x0,X1=x1,…,Xn−1=xn−1)=P(Xn≤x∣Xn−1=xn−1)for x∈R.
Markov chains taking vector values:
Discrete time process (Xn)n≥0, where the process takes values inside a state space Xn∈Rn, with the Markov property:
P(Xn∈A∣X0=x0,X1=x1,…,Xn−1=xn−1)=P(Xn∈A∣Xn−1=xn−1)for A∈B(Rn).
Simplify notation: Write the chain as X1,X2,… even if they are vector-valued. Say that they take values inside the state space S which can either be R or Rn or something else.
Say p(y∣x) is the MC's transition kernel, which gives the density of proposing a jump to y given we're currently at x. Specifically,
fXi∣Xi−1(y∣x)=P(Xi∈[y+dy]∣Xi−1=x)=p(y∣x).So, for every x, then ∫Sp(y∣x)dy=1 .
Some MCs have a stationary distribution, π(⋅), so if
X0∼π(⋅)⇒X1∼π(⋅)⇒X2∼π(⋅)…To be specific, if we have a MC with transition kernel p(y∣x) and which takes values in S,
fX1(x1)=∫x0∈SfX0,X1(x0,x1)dx0=∫x0∈SfX0(x0)fX1∣X0(x1∣x0)dx0=∫x0∈SfX0(x0)p(x1∣x0)dx0.If X0 has p.d.f. fX0(x0)=π(x0) and
fX1(x1)=∫x0∈Sπ(x0)p(x1∣x0)dx0=π(x1)then π is a stationary distribution.
Some MCs have a limiting distribution, that is, for all starting positions x0∈R as n→∞
(Xn∣X0=x0)D⟶π(⋅).Written another way,
limn→∞fXn∣X0(x∣x0)=π(x).The limiting distribution π is the stationary distribution.
So, if we want to simulate from π, just start X0 anywhere and run the MC for a while, then after N≈∞ steps we have
XNapprox.∼π(⋅)⇒XN+1approx.∼π(⋅)⇒XN+2approx.∼π(⋅)…Image("waiting-mcmc-to-converge.jpg")
The secret relates to the fact that α makes the chain reversible with respect to the target density fX.
A Markov chain with transition probability p(y∣x) is reversible with respect to a density g if
g(x)p(y∣x)=g(y)p(x∣y)for all x,y∈S. These are called the detailed balance equations.
This next part is a bit complicated:
Want to show that if a MC is reversible w.r.t. some distribution g, i.e.
g(x)p(y∣x)=g(y)p(x∣y)then g must be a stationary distribution for the MC, i.e. X0∼g⇒X1∼g.
Proof:
fX1(x1)=∫x0∈SfX0(x0)p(x1∣x0)dx0=∫x0∈Sg(x0)p(x1∣x0)dx0=∫x0∈Sg(x1)p(x0∣x1)dx0=g(x1)∫x0∈Sp(x0∣x1)dx0=g(x1).First, what is the transition kernel for the MCMC chain? Is it q(y∣x)? No.
For y≠x it is
p(y∣x)=q(y∣x)α(x,y)and if y=x it is
p(y∣x)=∫Sq(z∣x)[1−α(x,z)]dzHowever, proposing points using q(y∣x) does make the chain reversible w.r.t. the target density fX.
Remember, for x≠y we have p(y∣x)=q(y∣x)α(x,y) and
α(x,y)=min{fX(y)q(x∣y)fX(x)q(y∣x),1}andα(y,x)=min{fX(x)q(y∣x)fX(y)q(x∣y),1}.Do we have fX(x)p(y∣x)=fX(y)p(x∣y)? If x=y, then obviously yes. For x≠y
fX(x)p(y∣x)=fX(x)q(y∣x)α(x,y)=fX(x)q(y∣x)min{fX(y)q(x∣y)fX(x)q(y∣x),1}=min{fX(y)q(x∣y),fX(x)q(y∣x)}×fX(y)q(x∣y)fX(y)q(x∣y)=fX(y)q(x∣y)min{1,fX(x)q(y∣x)fX(y)q(x∣y)}=fX(y)q(x∣y)α(y,x)=fX(y)p(x∣y).Say that we have data x={x1,…,xn} and want to fit an Exponential(λ) distribution to it.
Frequentist approach: Find λ∗=argmaxλ∏ni=1f(xi,λ). Result is one number.
Image("vader.png")
Bayesian modelling: Treat λ as a random variable. First guess for it's value is λ∼Pareto(α), the prior distribution. After seeing the data we get a more accurate estimate for the distribution of λ, the posterior distribution.
f(λ∣x)=f(x∣λ)f(λ)∫f(x∣λ)f(λ)dλ∝f(x∣λ)f(λ)∝(n∏i=1λe−λxi)λ−(α+1)=λn−α−1exp(−λn∑i=1xi).%%R
library(ks)
run_mcmc <- function(R, logF, start) {
burnIn <- 5000
Rbig <- burnIn + R
jumpLambda <- 10
Xstart <- start
Xs <- rep(NA, Rbig)
Xs[1] <- Xstart
for (r in 2:Rbig) {
U1 <- (runif(1) < 0.5)
Y <- Xs[r-1] + (-1)^U1 * rexp(1, jumpLambda)
if (log(runif(1)) < logF(Y) - logF(Xs[r-1])) {
Xs[r] <- Y
} else {
Xs[r] <- Xs[r-1]
}
}
return(Xs[-(1:burnIn)])
}
%%R
set.seed(1337)
trueLambda <- 15
n <- 1000; xs <- rexp(n, trueLambda)
alpha <- -1
# Frequentist approach
lambdaPointEstimate <- 1/mean(xs)
# Bayesian way
targetLogF <- function(lambda) {
(n-alpha-1)*log(lambda) -lambda * sum(xs)
}
lambdas <- run_mcmc(R <- 10^4, targetLogF, 10)
plot(lambdas)
%%R
hist(lambdas, 40)
lines(c(trueLambda, trueLambda), c(0, 1e6), col="green")
lines(c(lambdaPointEstimate, lambdaPointEstimate), c(0, 1e6), col="blue")
%%R
plot(kde(lambdas))
lines(c(trueLambda, trueLambda), c(0, 1e6), col="green")
lines(c(lambdaPointEstimate, lambdaPointEstimate), c(0, 1e6), col="blue")
Two main fields:
Supervised learning: have ((x[1],y[1]),…,(x[n],y[n])) and want to find the function f(x)=y for all possible x.
Unsupervised learning: just have (x[1],…,x[n]) and want to find clusters inside them.
Inputs: data (x[1],…,x[n]), where x[i]∈Rm and K number of clusters to find.
Randomly choose K points (without repetition) and call them c[1],…,c[K]
For iteration=1,…,MaxIterations:
Calculate the distance from all x[i]'s to each c[j]'s. E.g. D=(Dij)∈Rn×K where
Assign each point x[i] to the closest c[k]. I.e.
P(i)=argminkd(x[i],c[k])=argminkDikfor i=1,…,nSo, for each cluster C(k)={i:P(i)=k}, update the cluster center
c[k]=1|C(k)|∑i∈C(k)x[i]rnd.seed(1)
X1s = stats.multivariate_normal([-2,-1], [[1, -0.9], [-0.9,1]]).rvs(100)
X2s = stats.multivariate_normal([2,-1], [[1, 0.9], [0.9,1]]).rvs(100)
X3s = stats.multivariate_normal([0,1], [[1, 0], [0,0.1]]).rvs(100)
plt.scatter(X1s[:,0], X1s[:,1])
plt.scatter(X2s[:,0], X2s[:,1])
plt.scatter(X3s[:,0], X3s[:,1]);
Xs = np.concatenate([X1s,X2s,X3s])
plt.scatter(Xs[:,0], Xs[:,1]);
from ipywidgets import interact_manual
from numpy.linalg import norm
rnd.seed(1337)
K = 3; n = Xs.shape[0]
clusterInds = rnd.choice(n, K)
Cs = Xs[clusterInds,:]
def update():
global Cs, firstIter
# Assign the points to their nearest cluster centre.
D = [[norm(Xs[i,:] - Cs[k,:]) for k in range(K)] for i in range(n)]
P = np.argmin(np.array(D), axis=1)
# Plot the current clusters
plt.figure(2)
for k in range(K):
plt.scatter(Xs[P==k,0], Xs[P==k,1], marker="*")
plt.scatter(Cs[:,0], Cs[:,1])
plt.show()
# Update the cluster center points.
for k in range(K):
Cs[k,:] = np.mean(Xs[P==k,:], axis=0)
interact_manual(update);
interactive(children=(Button(description='Run Interact', style=ButtonStyle()), Output()), _dom_classes=('widge…
Image("mathematica_classifiers.png")
Image("mathematica_classifiers2.png")
((x[1],y[1]),…,(x[n],y[n])).
some true function f∗, for i=1,…,n.
which is flexible (f∗(⋅)≈f(⋅;β∗) for some β∗∈RP) and efficient (evaluating f(x;β) is fast/simple).
Need to be able to numerically rate how badly some β is for our purposes. This is the loss function L(β).
Need to be able to run optimize the loss function quickly. We take
Image("multi-layer.png")
Example:
neural_network <- function(x1, x2, x3, beta) {
# Pull out all the parameters for the NN.
w10 <- beta[1]; w11 <- beta[2]; ...
# First layer
a1 <- w10 + w11*x1 + w12*x2 + w13*x3
if (a1 < 0) {
a1 <- 0
}
a2 <- w20 + w21*x1 + w22*x2 + w23*x3
if (a2 < 0) {
a2 <- 0
}
# Output layer
o <- wout0 + wout1 * a1 + wout2 * a2
return( o )
}
They are very fast to evaluate (when coded properly)!
Image("nn.png")
Image("universal_function.png") # from https://nthu-datalab.github.io/ml/slides/10_NN_Design.pdf
In classification, instead of y∈R we have y∈{True,False}, or y∈{Group 1,…,Group K}.
This is gross, let's turn it into a real-valued problem again, i.e., a regression problem.
y[1]=False,y[2]=True,…ML people call this a one-hot vector, but better to think of
y=(y1,…,yK) where yk=ˆP(x belongs to category k).Normally, if a NN gives an output o=(o1,…,oK) then they won't make sense as a probability; need
ok≥0 for k=1,…,K and K∑k=1ok=1.But we can make any vector satisfy this by
(˜o1,…,˜oK)↪(exp{˜o1},…,exp{˜oK})↪(exp{˜o1}∑Kk=1exp{˜ok},…,exp{˜oK}∑Kk=1exp{˜ok})so-called soft-max function.
Each training point x[i] we get a distribution of the predicted category f(x[i];β) and the true category y[i] (a degenerate distribution). What's the difference between these two distributions?
CE(f,g)=−∫f(x)log[g(x)]dx⇒CE(p,q)=−K∑k=1pilog[qi]Find minxf(x) by seeing the function as a computer would see it (blind).
import ipywidgets
from ipywidgets import interactive_output, FloatSlider, Checkbox, HBox
f = lambda x: (1/100)**2 * (x**2) * np.sin(x/5)
fDash = lambda x: 2 * (1/100)**2 *(x/100) * np.sin(x/5) + (1/5) * (x/100)**2 * np.cos(x/5)
xRange = (0, 100)
xGrid = np.linspace(xRange[0], xRange[1], 200)
yGrid = f(xGrid)
xs = []
ys = []
yDashs = []
eps = 5
def update(x, showDerivs, showFunction):
global xs, ys, yDashs
if x == 0:
return
plt.figure(2)
xs.append(x)
ys.append(f(x))
yDashs.append(fDash(x))
plt.scatter(xs, ys)
if showFunction:
plt.plot(xGrid, yGrid, "--")
if showDerivs:
for i in range(len(xs)):
plt.plot([xs[i]-eps/2, xs[i]+eps/2], [ys[i]-yDashs[i]*eps/2, ys[i]+yDashs[i]*eps/2], "--")
plt.show()
w1=FloatSlider(min=xRange[0], max=xRange[1], step=0.01, continuous_update=False)
w2=Checkbox(description='Show derivatives')
w3=Checkbox(description='Reveal function');
display(HBox((w1,w2,w3)))
interactive_output(update, {"x": w1, "showDerivs": w2, "showFunction": w3})
HBox(children=(FloatSlider(value=0.0, continuous_update=False, step=0.01), Checkbox(value=False, description='…
Output()
Input: objective function f(x), derivative function f′(x), starting point x0, and some learning rate η>0.
For iteration=1,…,MaxIterations:
Calculate the derivative ∇←f′(x)
Update our point to go some distance downhill: x←x−η∇
from ipywidgets import ToggleButton
xs = []
ys = []
yDashs = []
eps = 1
eta = 25
firstIteration = True
def update(x, showFunction, go):
global xs, ys, yDashs, firstIteration
if firstIteration:
firstIteration = False
return
plt.figure(2)
if showFunction:
plt.plot(xGrid, yGrid, "--")
if len(xs) > 0:
x = xs[-1] - eta*fDash(xs[-1])
xs.append(x)
ys.append(f(x))
yDashs.append(fDash(x))
plt.scatter(xs, ys)
for i in range(len(xs)):
plt.plot([xs[i]-eps, xs[i]+eps], [ys[i]-yDashs[i]*eps, ys[i]+yDashs[i]*eps], "--")
plt.show()
w1=FloatSlider(min=xRange[0], max=xRange[1], step=0.01, value=50, continuous_update=False)
w2=Checkbox(description='Reveal function')
w3=ToggleButton(description='Update')
display(HBox((w1,w2,w3)))
interactive_output(update, {"x": w1, "showFunction": w2, "go": w3})
HBox(children=(FloatSlider(value=50.0, continuous_update=False, step=0.01), Checkbox(value=False, description=…
Output()
Replace f(x) with L(β).
Input: predictors X=(x1,…,xn) and true classifications Y=(y1,…,yn), some initial values for the β parameters (important!!), and some learning rate η>0.
For iteration=1,…,MaxIterations:
(Optional) Randomly shuffle the data inside X, Y
For i=1,…,n:
Make a prediction ˆyi←f(xi;β)
Calculate the loss L(β∣ˆyi,yi) based on prediction ˆyi and real value yi
For p=1,…,P:
Find ∇p←(d/dβp)L(β∣ˆyi,yi)
Create the derivative vector ∇[i]=(∇1,…,∇P)
Finally update the parameters β←β−η∇[i]
Problem: Typically run for ≥103 iterations, have P≥106 parameters, and n≥106 data points... So we have to calculate ∇p at least 1015 times??
Input: predictors X=(x1,…,xn) and true classifications Y=(y1,…,yn), some initial values for the β parameters, and some learning rate η>0.
For iteration=1,…,MaxIterations:
(Optional) Randomly shuffle the data inside X, Y
For i=1,…,n:
Make a prediction ˆyi←f(xi;β)
Calculate the loss L(β∣ˆyi,yi) based on prediction ˆyi and real value yi
Get all derivatives ∇[i]←((d/dβ1)L(β),…,(d/dβp)L(β)) at once by back propagation
Finally update the parameters β←β−η∇[i]
How does it work? Lots of the chain rule. Check out https://arxiv.org/abs/1502.05767 for all the details.
Input: predictors X=(x1,…,xn) and true classifications Y=(y1,…,yn), some initial values for the β parameters, and some learning rate η>0. Also, choose some mini batch size, e.g., B=103.
For iteration=1,…,MaxIterations:
(Optional) Randomly shuffle the data inside X, Y
Split the n data points into batches of size B
For each batch b=1,…,B:
Make the predictions ˆyi←f(xi;β) for all the samples in batch b in parallel
Calculate the loss for each sample in the batch in parallel
Get derivative vector ∇[i] for each sample in batch b in parallel using back propagation
Average out these gradients to get ∇b
Update the parameters β←β−η∇b
With GPUs, parallel work becomes easy.
Previously, spent a lot of time in pre-processing the input using expert domain knowledge.
Examples:
Nowadays, a tendency to throw the raw data into a very deep network and let it figure out the best pre-processing steps. Called end-to-end learning.
Now, using a proper train/validation/test split of the data and regularisation it isn't such a problem.
Regularisation is the idea that we should penalise complexity of the model. A super simple way is lasso regularisation, which adds a term to our loss function:
⇒L(β;λ)=−n∑i=1K∑k=1y[i]klog(f(x[i];β)k)+λP∑p=1|βp|.What does this do? As λ↗∞, many βp parameters become 0, so the neural network becomes simpler (it has less connections).