Fabrizzio Sanchez suggested bootstrapping a statistical model as a non-trivial task on which to demonstrate the speed of Julia.
In this notebook I will demonstrate a parametric bootstrap, which means that samples are drawn from the probability model with parameters set at their estimated values from the original model, and the model refit.
A simple example is a linear model which can be fit using the GLM
package.
using GLM, RDatasets
ls = dataset("datasets","LifeCycleSavings")
WARNING: Base.String is deprecated, use AbstractString instead. likely near /home/bates/.julia/v0.4/RDatasets/src/dataset.jl:1 WARNING: Base.String is deprecated, use AbstractString instead. likely near /home/bates/.julia/v0.4/RDatasets/src/dataset.jl:1 WARNING: Base.String is deprecated, use AbstractString instead. likely near /home/bates/.julia/v0.4/RDatasets/src/datasets.jl:1
Country | SR | Pop15 | Pop75 | DPI | DDPI | |
---|---|---|---|---|---|---|
1 | Australia | 11.43 | 29.35 | 2.87 | 2329.68 | 2.87 |
2 | Austria | 12.07 | 23.32 | 4.41 | 1507.99 | 3.93 |
3 | Belgium | 13.17 | 23.8 | 4.43 | 2108.47 | 3.82 |
4 | Bolivia | 5.75 | 41.89 | 1.67 | 189.13 | 0.22 |
5 | Brazil | 12.88 | 42.19 | 0.83 | 728.47 | 4.56 |
6 | Canada | 8.79 | 31.72 | 2.85 | 2982.88 | 2.43 |
7 | Chile | 0.6 | 39.74 | 1.34 | 662.86 | 2.67 |
8 | China | 11.9 | 44.75 | 0.67 | 289.52 | 6.51 |
9 | Colombia | 4.98 | 46.64 | 1.06 | 276.65 | 3.08 |
10 | Costa Rica | 10.78 | 47.64 | 1.14 | 471.24 | 2.8 |
11 | Denmark | 16.85 | 24.42 | 3.93 | 2496.53 | 3.99 |
12 | Ecuador | 3.59 | 46.31 | 1.19 | 287.77 | 2.19 |
13 | Finland | 11.24 | 27.84 | 2.37 | 1681.25 | 4.32 |
14 | France | 12.64 | 25.06 | 4.7 | 2213.82 | 4.52 |
15 | Germany | 12.55 | 23.31 | 3.35 | 2457.12 | 3.44 |
16 | Greece | 10.67 | 25.62 | 3.1 | 870.85 | 6.28 |
17 | Guatamala | 3.01 | 46.05 | 0.87 | 289.71 | 1.48 |
18 | Honduras | 7.7 | 47.32 | 0.58 | 232.44 | 3.19 |
19 | Iceland | 1.27 | 34.03 | 3.08 | 1900.1 | 1.12 |
20 | India | 9.0 | 41.31 | 0.96 | 88.94 | 1.54 |
21 | Ireland | 11.34 | 31.16 | 4.19 | 1139.95 | 2.99 |
22 | Italy | 14.28 | 24.52 | 3.48 | 1390.0 | 3.54 |
23 | Japan | 21.1 | 27.01 | 1.91 | 1257.28 | 8.21 |
24 | Korea | 3.98 | 41.74 | 0.91 | 207.68 | 5.81 |
25 | Luxembourg | 10.35 | 21.8 | 3.73 | 2449.39 | 1.57 |
26 | Malta | 15.48 | 32.54 | 2.47 | 601.05 | 8.12 |
27 | Norway | 10.25 | 25.95 | 3.67 | 2231.03 | 3.62 |
28 | Netherlands | 14.65 | 24.71 | 3.25 | 1740.7 | 7.66 |
29 | New Zealand | 10.67 | 32.61 | 3.17 | 1487.52 | 1.76 |
30 | Nicaragua | 7.3 | 45.04 | 1.21 | 325.54 | 2.48 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
m1 = lm(SR ~ Pop15 + Pop75 + DPI + DDPI, ls)
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Float64}: Coefficients: Estimate Std.Error t value Pr(>|t|) (Intercept) 28.5661 7.35452 3.88416 0.0003 Pop15 -0.461193 0.144642 -3.18851 0.0026 Pop75 -1.6915 1.0836 -1.561 0.1255 DPI -0.000336902 0.000931107 -0.361829 0.7192 DDPI 0.409695 0.196197 2.08818 0.0425
We will take the fitted values and the estimate of the scale parameter from this model as the mean and scale parameter of a multivariate normal distribution from which to simulate new response vectors.
Getting the fitted values turned out to be more complicated than I thought it would be because I hadn't written the appropriate method. For now we will use the somewhat mysterious extractor m1.model.rr.mu
to get this.
We will use the IsoNormal
type of multivariate normal distribution from the Distributions
package for simulating the responses.
fitted(m1)
LoadError: fitted is not defined for DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Float64}. while loading In[6], in expression starting on line 1 in error at ./error.jl:21 in fitted at /home/bates/.julia/v0.4/StatsBase/src/statmodels.jl:18
d = IsoNormal(m1.model.rr.mu,scale(m1.model))
LoadError: MethodError: `convert` has no method matching convert(::Type{PDMats.ScalMat{Float64}}, ::Float64) This may have arisen from a call to the constructor PDMats.ScalMat{Float64}(...), since type constructors fall back to convert methods. Closest candidates are: PDMats.ScalMat{T<:AbstractFloat}(::Any, !Matched::Any, !Matched::Any) call{T}(::Type{T}, ::Any) convert{T}(::Type{T}, !Matched::T) while loading In[3], in expression starting on line 1 in call at /home/bates/.julia/v0.4/Distributions/src/multivariate/mvnormal.jl:64
A simple method of obtaining the coefficients from fitting the model matrix to a new simulated response is the lm
method for a model matrix and a response.
X = copy(m1.model.pp.X) # copy the X matrix from the predictor field, pp
50x5 Array{Float64,2}: 1.0 29.35 2.87 2329.68 2.87 1.0 23.32 4.41 1507.99 3.93 1.0 23.8 4.43 2108.47 3.82 1.0 41.89 1.67 189.13 0.22 1.0 42.19 0.83 728.47 4.56 1.0 31.72 2.85 2982.88 2.43 1.0 39.74 1.34 662.86 2.67 1.0 44.75 0.67 289.52 6.51 1.0 46.64 1.06 276.65 3.08 1.0 47.64 1.14 471.24 2.8 1.0 24.42 3.93 2496.53 3.99 1.0 46.31 1.19 287.77 2.19 1.0 27.84 2.37 1681.25 4.32 ⋮ 1.0 21.44 4.54 3299.49 3.01 1.0 23.49 3.73 2630.96 2.7 1.0 43.42 1.08 389.66 2.96 1.0 46.12 1.21 249.87 1.13 1.0 23.27 4.46 1813.93 2.01 1.0 29.81 3.43 4001.89 2.45 1.0 46.4 0.9 813.39 0.53 1.0 45.25 0.56 138.33 5.14 1.0 41.12 1.73 380.47 10.23 1.0 28.13 2.72 766.54 1.88 1.0 43.69 2.07 123.58 16.71 1.0 47.2 0.66 242.69 5.08
srand(1234321)
lm(X,rand(d))
LinearModel{DensePredQR{Float64}}: Coefficients: Estimate Std.Error t value Pr(>|t|) x1 28.5583 8.82106 3.23751 0.0023 x2 -0.445019 0.173485 -2.56518 0.0137 x3 -2.39471 1.29968 -1.84255 0.0720 x4 0.000373479 0.00111678 0.334426 0.7396 x5 0.449379 0.23532 1.90965 0.0626
Suppose that we want the coefficient estimates, $\widehat{\beta}$, and the scale parameter estimate, $\widehat{\sigma}$, from, say, 10000 simulated responses for this model.
For a single fit we can extract both of these and concatenate them as a vector
sm1 = lm(X,rand(d))
[coef(sm1), scale(sm1)]
6-element Array{Float64,1}: 30.153 -0.509894 -1.85234 -0.00174832 0.737757 4.1995
Let's express this as a function
samp1(X::Matrix,d) = (sm = lm(X,rand(d)); [coef(sm), scale(sm)])
samp1 (generic function with 1 method)
samp1(X,d)
6-element Array{Float64,1}: 38.8541 -0.657691 -2.45837 -0.00162179 0.277361 3.63202
The simplest way to create a parametric bootstrap sample of these parameter estimates is as a comprehension.
srand(1234321)
bss = [samp1(X,d) for i in 1:10000]
10000-element Array{Any,1}: [28.5583,-0.445019,-2.39471,0.000373479,0.449379,4.56095] [35.944,-0.621564,-2.35589,-0.0012479,0.598529,4.14837] [21.9009,-0.343171,-0.86067,-9.93608e-5,0.338381,4.10657] [33.1855,-0.601662,-1.77152,-0.000105169,0.398659,3.27316] [17.9789,-0.231693,-0.386903,-0.000723807,0.371328,3.69092] [31.442,-0.516094,-2.32955,0.000473851,0.274775,3.83084] [17.6364,-0.292026,-0.440675,-0.000368788,0.828189,3.94549] [33.2162,-0.51244,-2.8588,8.82325e-5,0.373106,4.17757] [12.508,-0.148361,-0.244755,-0.000232026,0.877372,4.6156] [33.1391,-0.538018,-3.09807,-0.000228105,0.362138,3.62479] [29.9126,-0.476695,-1.169,-0.000781248,0.175824,4.16407] [30.0718,-0.479997,-2.0634,-0.000215239,0.455825,4.23013] [21.1611,-0.370234,-1.25906,0.000697624,0.827424,3.91826] ⋮ [29.7568,-0.503739,-1.68198,-9.8295e-5,0.514151,4.23564] [40.0399,-0.673688,-2.69232,-0.00111253,0.312692,3.84436] [26.1465,-0.357683,-1.81777,-0.000553048,0.157578,4.28691] [32.3954,-0.548962,-2.91014,0.000659737,0.51586,2.50244] [23.1756,-0.327187,-0.69978,7.07523e-5,0.20963,3.70527] [29.086,-0.471942,-1.49713,-0.000662799,0.399184,2.99092] [37.6333,-0.625821,-3.35728,0.000467882,0.255695,4.19263] [28.9012,-0.483732,-2.56316,0.000625508,0.708579,3.96592] [26.5054,-0.439392,-2.0225,0.000442202,0.61712,4.27044] [17.5618,-0.255279,-1.47908,0.00119981,0.885358,3.06848] [21.5199,-0.329395,-1.27995,0.000940134,0.233601,3.5673] [28.1422,-0.431297,-1.82568,-0.000431981,0.423766,3.95991]
@time [samp1(X,d) for i in 1:10000];
elapsed time: 2.263262407 seconds (329511888 bytes allocated, 16.73% gc time)
As we see, this produces the sample in the form of an array of vectors, as opposed to, say, a matrix. Furthermore this process is not blazingly fast, allocates a lot of memory and spends about 1/6 of the time in garbage collection.
What is happening here is that all the work of creating the model object and factoring the model matrix is (unnecessarily) being done for every sample.
As indicated by the rather long name of the object type for the fitted models, a dense QR decomposition is used to obtain the coefficients.
qrd = m1.model.pp.qr;
typeof(qrd)
QRCompactWY{Float64} (constructor with 1 method)
srand(1234321)
y1 = rand(d)
β = qrd\y1
5-element Array{Float64,1}: 28.5583 -0.445019 -2.39471 0.000373479 0.449379
We see we get the same coefficient estimates using the pre-computed QR decomposition of the model matrix. To get $\widehat{\sigma}$ we could use these coefficient estimates, generate the fitted values and residuals and assumulate the sum-of-squared residuals. However, there is a faster way using the vector $\bf Q^\prime y$. The coefficient estimates satisfy $\bf R\widehat{\beta}=Q_1^\prime y$ and the residual sum of squares is $\|\bf Q_2^\prime y\|^2$ where $\bf Q_1$ is the first $p$ columns of the $n\times n$ matrix $\bf Q$ and $\bf Q_2$ is the last $n-p$ columns.
This method would not be practical for large $n$ if we needed to generate and store $\bf Q$ explicitly. But we don't.
Q = qrd[:Q]
typeof(Q)
QRCompactWYQ{Float64} (constructor with 1 method)
R = Triangular(qrd[:R],:U)
5x5 Triangular{Float64,Array{Float64,2},:U,false}: -7.07107 -248.121 -16.214 -7825.96 -26.5702 0.0 64.0621 -8.20847 -5244.98 -0.960775 0.0 0.0 -3.77617 -1659.93 0.871339 0.0 0.0 0.0 4224.22 -5.12174 0.0 0.0 0.0 0.0 -19.3819
We can create both $\widehat{\sigma}$ and $\widehat{\beta}$ in a simple function.
function betasigmahat(qrd,y::Vector)
n,p = size(qrd)
length(y) == n || throw(DimensionMismatch(""))
qty = qrd[:Q]'y
sigmahat = sqrt(Base.sumabs2(sub(qty,(p+1):n))/(n-p))
Triangular(qrd[:R],:U)\sub(qty,1:p), sigmahat
end
betasigmahat (generic function with 1 method)
betasigmahat(qrd,y1)
([28.5583,-0.445019,-2.39471,0.000373479,0.449379],4.560948146915455)
We can wrap all this in a function to create the bootstrap sample
function boot(m::LinearModel,nsim::Integer)
qrd = m.pp.qr
qmat = qrd[:Q]
rmat = Triangular(qrd[:R],:U)
n,p = size(qrd)
df = float(n-p)
d = IsoNormal(m.rr.mu, scale(m))
betas = Array(Float64,(size(qrd,2),nsim))
sigmas = Array(Float64,nsim)
for j in 1:nsim
qty = qmat'rand(d)
betas[:,j] = rmat\sub(qty,1:p)
sigmas[j] = sqrt(Base.sumabs2(sub(qty,(p+1):n))/df)
end
betas, sigmas
end
boot (generic function with 1 method)
srand(1234321)
boot(m1.model,1) # check that the results are calculated properly
( 5x1 Array{Float64,2}: 28.5583 -0.445019 -2.39471 0.000373479 0.449379 , [4.56095])
srand(1234321)
boot(m1.model,10000)
( 5x10000 Array{Float64,2}: 28.5583 35.944 21.9009 … 21.5199 28.1422 -0.445019 -0.621564 -0.343171 -0.329395 -0.431297 -2.39471 -2.35589 -0.86067 -1.27995 -1.82568 0.000373479 -0.0012479 -9.93608e-5 0.000940134 -0.000431981 0.449379 0.598529 0.338381 0.233601 0.423766 , [4.56095,4.14837,4.10657,3.27316,3.69092,3.83084,3.94549,4.17757,4.6156,3.62479 … 4.28691,2.50244,3.70527,2.99092,4.19263,3.96592,4.27044,3.06848,3.5673,3.95991])
gc();@time boot(m1.model,10000);
elapsed time: 0.105056267 seconds (24635424 bytes allocated)
We are doing much better in the boot function except that we are allocating a lot of memory. We can avoid this by using mutating functions. For example each call to rand(d)
allocates a vector of length n
which eventually must be garbage collected. This is not a terrible burden if $n=50$ but when $n=100000$ you don't want to have this wasted allocation occurring nsim
times.
The mutating version of rand
avoids this.
function boot1(m::LinearModel,nsim::Integer)
qrd = m.pp.qr
qmat = qrd[:Q]
rmat = Triangular(qrd[:R],:U)
n,p = size(qrd)
df = float(n-p)
d = IsoNormal(m.rr.mu, scale(m))
betas = Array(Float64,(size(qrd,2),nsim))
sigmas = Array(Float64,nsim)
y = Array(Float64,n)
for j in 1:nsim
qty = qmat'rand!(d,y)
betas[:,j] = rmat\sub(qty,1:p)
sigmas[j] = sqrt(Base.sumabs2(sub(qty,(p+1):n))/df)
end
betas, sigmas
end
boot1 (generic function with 1 method)
srand(1234321)
boot1(m1.model,1)
( 5x1 Array{Float64,2}: 28.5583 -0.445019 -2.39471 0.000373479 0.449379 , [4.56095])
srand(1234321);gc();@time boot1(m1.model,10000)
elapsed time: 0.103414702 seconds (20153008 bytes allocated)
( 5x10000 Array{Float64,2}: 28.5583 35.944 21.9009 … 21.5199 28.1422 -0.445019 -0.621564 -0.343171 -0.329395 -0.431297 -2.39471 -2.35589 -0.86067 -1.27995 -1.82568 0.000373479 -0.0012479 -9.93608e-5 0.000940134 -0.000431981 0.449379 0.598529 0.338381 0.233601 0.423766 , [4.56095,4.14837,4.10657,3.27316,3.69092,3.83084,3.94549,4.17757,4.6156,3.62479 … 4.28691,2.50244,3.70527,2.99092,4.19263,3.96592,4.27044,3.06848,3.5673,3.95991])
We have cut down on the memory allocated (24 Mb total down to 20 Mb total) but only by about 1/6. The other allocations are occurring in the multiplication by $\bf Q$ and the solutions with respect to $\bf R$. There are mutating versions of both these operations. A multiplication of the form $\bf Q^\prime y$ ends up calling the relevant method for At_mul_B
. The general function is Ac_mul_B
indicating the conjugate transpose. The two functions are identical for real matrices and we will use the more general name Ac_mul_B
and its mutating version Base.Ac_mul_B!
. Similarly, R\qty
ends up calling A_ldiv_B
with mutating version Base.A_ldiv_B!
.
function boot2(m::LinearModel,nsim::Integer)
qrd = m.pp.qr
qmat = qrd[:Q]
rmat = Triangular(qrd[:R],:U)
n,p = size(qrd)
df = float(n-p)
d = IsoNormal(m.rr.mu, scale(m))
betas = Array(Float64,(size(qrd,2),nsim))
sigmas = Array(Float64,nsim)
y = Array(Float64,n)
for j in 1:nsim
Base.Ac_mul_B!(qmat,rand!(d,y))
Base.A_ldiv_B!(rmat,copy!(sub(betas,:,j),sub(y,1:p)))
sigmas[j] = sqrt(Base.sumabs2(sub(y,(p+1):n))/df)
end
betas, sigmas
end
boot2 (generic function with 1 method)
srand(1234321)
boot2(m1.model,1)
( 5x1 Array{Float64,2}: 28.5583 -0.445019 -2.39471 0.000373479 0.449379 , [4.56095])
srand(1234321); gc(); @time samp2 = boot2(m1.model,10000);
elapsed time: 0.099737582 seconds (14835392 bytes allocated)
It doesn't look like we have saved a lot but remember that we are allocating 4 Mb of that 14 Mb for the result.
A common idiom in Julia is to create a mutating version of a function that overwrites one or more arguments and a non-mutating version that allocates the object to be overwritten and calls the mutating version.
We first create the mutating boot2!
method then define the non-mutation version to allocate the storage and call the mutating version.
function boot2!(betas::Matrix, sigmas::Vector, m::LinearModel)
qrd = m.pp.qr
n,p = size(qrd)
size(betas,2) == length(sigmas) && size(betas,1) == p || throw(DimensionMismatch(""))
qmat = qrd[:Q]
rmat = Triangular(qrd[:R],:U)
df = float(n-p)
d = IsoNormal(m.rr.mu, scale(m))
y = Array(Float64,n)
for j in 1:length(sigmas)
Base.Ac_mul_B!(qmat,rand!(d,y))
Base.A_ldiv_B!(rmat,copy!(sub(betas,:,j),sub(y,1:p)))
sigmas[j] = sqrt(Base.sumabs2(sub(y,(p+1):n))/df)
end
betas, sigmas
end
function boot2(m::LinearModel, nsim::Integer)
boot2!(Array(Float64,(size(m.pp.qr,2),nsim)),Array(Float64,nsim),m)
end
boot2 (generic function with 1 method)
srand(1234321)
boot2!(zeros(5,1),zeros(1),m1.model)
( 5x1 Array{Float64,2}: 28.5583 -0.445019 -2.39471 0.000373479 0.449379 , [4.56095])
betas = zeros(5,10000)
sigmas = zeros(10000)
srand(1234321); gc(); @time boot2!(betas,sigmas,m1.model);
elapsed time: 0.099889196 seconds (14355200 bytes allocated)
So despite all these heroic efforts we seem to be stuck allocating about 14 Mb in this case.
We should profile to see what is happening. To get a reasonable execution time we up the number of samples from 10,000 to 100,000.
betas = zeros(5,100_000)
sigmas = zeros(100_000)
srand(1234321)
gc()
@profile boot2!(betas,sigmas,m1.model);
using ProfileView
At this point we would typically run ProfileViews.view()
but the resulting plot doesn't display well in an IJulia notebook. Also, when @profile
is called through IJulia there are several functions specific to the IJulia interface that show up.
Running this in the REPL does produce a profile flame plot which shows, as expected, that essentially all the time is being used in the inner loop. Roughly 1/3 in the Ac_mul_B!
call, nearly 2/3 in the A_ldiv_B!
call and a small amount in the sumabs2
call.
Furthermore, only a small fraction of the time in the A_ldiv_B!
call is actually spent in the calculation. Most of the time is spent in the sub
function. I misunderstood how it worked. This frequently happens to me, which is why @profile
and ProfileView.view()
are so valuable.
It is likely that view
from the ArrayViews
package will be faster and allocate less memory. (Because the capabilities in ArrayViews
are likely to break existing code, they will not be introduced into the base system until after release 0.3.0 is out.)
using ArrayViews
function boot3!(betas::Matrix, sigmas::Vector, m::LinearModel)
qrd = m.pp.qr
n,p = size(qrd)
size(betas,2) == length(sigmas) && size(betas,1) == p || throw(DimensionMismatch(""))
qmat = qrd[:Q]
rmat = Triangular(qrd[:R],:U)
df = float(n-p)
d = IsoNormal(m.rr.mu, scale(m))
y = Array(Float64,n)
for j in 1:length(sigmas)
Base.Ac_mul_B!(qmat,rand!(d,y))
Base.A_ldiv_B!(rmat,copy!(view(betas,:,j),view(y,1:p)))
sigmas[j] = sqrt(Base.sumabs2(view(y,(p+1):n))/df)
end
betas, sigmas
end
function boot3(m::LinearModel, nsim::Integer)
boot3!(Array(Float64,(size(m.pp.qr,2),nsim)),Array(Float64,nsim),m)
end
boot3 (generic function with 1 method)
srand(1234321)
boot3!(zeros(5,1),[0.],m1.model)
( 5x1 Array{Float64,2}: 28.5583 -0.445019 -2.39471 0.000373479 0.449379 , [4.56095])
betas = zeros(5,10_000)
sigmas = zeros(10_000)
srand(1234321)
gc()
@time boot3!(betas,sigmas,m1.model);
elapsed time: 0.062589559 seconds (5361040 bytes allocated)
The profile indicates that we are unlikely to make this much faster as almost all the time is being spent in the BLAS or LAPACK functions, plus the simulation from the multivariate normal. We are still allocating over 5 Mb during the execution but I'm not sure exactly where this is occuring.
In addition to allowing memory allocation to be more finely controlled, having a mutating version of the simulation function can help when parallelizing the simulation using distributed arrays or memory-mapped arrays.
A memory-mapped array corresponds to a stored file. Numbers are stored in their native binary representation.
Suppose that we were simulating a huge number of replications for this model or for more complex models.
nsim = 10_000_000
n, p = size(m1.model.pp.X)
s = open("/tmp/mmapbetas.bin","w+")
betas = mmap_array(Float64,(p,nsim),s)
typeof(betas)
Array{Float64,2}