library(lme4)
options(rgl.useNULL=TRUE)
library(LMERConvenienceFunctions)
library(ggplot2)
options(jupyter.plot_mimetypes = 'image/png')
library(repr)
options(repr.plot.width=6, repr.plot.height=6)
#no zeros in the outcome variable
data = read.csv("/home/psinger/tmp/data2.csv", header=TRUE)
nrow(data)
Let us start with a simple linear mixed-effects model where we have the body length as the dependenten variable and the index as the fixed effect we are interested in. Additionally, we allow the intercept to vary for each author (mandatory random effect) as well as for each length of a session. Latter was shown to be necessary to include.
m_lmer = lmer(body_length~1+index+(1|author)+(1|length), data=data, REML=FALSE)
summary(m_lmer)
Linear mixed model fit by maximum likelihood ['lmerMod'] Formula: body_length ~ 1 + index + (1 | author) + (1 | length) Data: data AIC BIC logLik deviance df.resid 14220344 14220403 -7110167 14220334 999995 Scaled residuals: Min 1Q Median 3Q Max -7.359 -0.409 -0.248 0.080 32.633 Random effects: Groups Name Variance Std.Dev. author (Intercept) 12886.3 113.52 length (Intercept) 485.3 22.03 Residual 76579.1 276.73 Number of obs: 1000000, groups: author, 528126; length, 162 Fixed effects: Estimate Std. Error t value (Intercept) 231.56709 3.69632 62.65 index -0.66291 0.09791 -6.77 Correlation of Fixed Effects: (Intr) index -0.414
mcp.fnc(m_lmer)
What we can see in above plot is, that the residuals do not appear to be normally distributed and we can see clear heteroskedasticity; thus we need to address this.
One way of doing so, is to remove outliers and refit as suggested in:
Baayen, R.H. (2008). Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge, UK: Cambridge University Press.
data_limit <- romr.fnc(m_lmer, data, trim = 2.5)$data
n.removed = 23020 percent.removed = 2.302
m_lmer_limit = lmer(body_length~1+index+(1|author)+(1|length), data=data_limit, REML=FALSE)
summary(m_lmer_limit)
Linear mixed model fit by maximum likelihood ['lmerMod'] Formula: body_length ~ 1 + index + (1 | author) + (1 | length) Data: data_limit AIC BIC logLik deviance df.resid 12712426 12712485 -6356208 12712416 976975 Scaled residuals: Min 1Q Median 3Q Max -3.5603 -0.5758 -0.2994 0.2490 11.0214 Random effects: Groups Name Variance Std.Dev. author (Intercept) 4963.9 70.46 length (Intercept) 259.3 16.10 Residual 22025.8 148.41 Number of obs: 976980, groups: author, 521054; length, 162 Fixed effects: Estimate Std. Error t value (Intercept) 191.90956 2.48666 77.18 index -0.22224 0.05504 -4.04 Correlation of Fixed Effects: (Intr) index -0.397
mcp.fnc(m_lmer_limit)
We can see that the residuals look more normal and more to indicate homoscedasticity. However, this model is far from perfect and we need to look for alternatives.
The first thing I do, is that I just log transform the dependent variable. I am familiar that this is usually not appropriate: http://www.r-bloggers.com/do-not-log-transform-count-data-bitches/
m_lmer_log = lmer(log(body_length)~1+index+(1|author)+(1|length), data=data, REML=FALSE)
Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00275697 (tol = 0.002, component 1)
summary(m_lmer_log)
Linear mixed model fit by maximum likelihood ['lmerMod'] Formula: log(body_length) ~ 1 + index + (1 | author) + (1 | length) Data: data AIC BIC logLik deviance df.resid 2944014 2944073 -1472002 2944004 999995 Scaled residuals: Min 1Q Median 3Q Max -4.2890 -0.6169 -0.0093 0.6059 4.4680 Random effects: Groups Name Variance Std.Dev. author (Intercept) 0.247666 0.49766 length (Intercept) 0.007848 0.08859 Residual 0.908644 0.95323 Number of obs: 1000000, groups: author, 528126; length, 162 Fixed effects: Estimate Std. Error t value (Intercept) 4.848252 0.014316 338.7 index -0.002480 0.000351 -7.1 Correlation of Fixed Effects: (Intr) index -0.403 convergence code: 0 Model failed to converge with max|grad| = 0.00275697 (tol = 0.002, component 1)
mcp.fnc(m_lmer_log)
That does look pretty good, but as said, log transforming might not be appropriate here.
Next, let us focus in generalized mixed effects models that are appropriate for count data.
m_lognormal = glmer(body_length~1+index+(1|author)+(1|length),data=data,family=gaussian(link=log))
Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.689011 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
summary(m_lognormal)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod] Family: gaussian ( log ) Formula: body_length ~ 1 + index + (1 | author) + (1 | length) Data: data AIC BIC logLik deviance df.resid 19070896 19070955 -9535443 19070886 999995 Scaled residuals: Min 1Q Median 3Q Max -31.658 -0.196 0.000 0.054 44.670 Random effects: Groups Name Variance Std.Dev. author (Intercept) 37932 194.76 length (Intercept) 8292 91.06 Residual 37193 192.86 Number of obs: 1000000, groups: author, 528126; length, 162 Fixed effects: Estimate Std. Error t value Pr(>|z|) (Intercept) 5.0011889 0.0087127 574.0 <2e-16 *** index -0.0054638 0.0001428 -38.3 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Correlation of Fixed Effects: (Intr) index -0.033 convergence code: 0 Model failed to converge with max|grad| = 0.689011 (tol = 0.001, component 1) Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
plot(resid(m_lognormal,type="deviance") ~ fitted(m_lognormal), pch = ".")
qqnorm(resid(m_lognormal,type="deviance"), pch = ".", main = "")
qqline(resid(m_lognormal,type="deviance"))
So a sinple log link does not seem to do the job. Let's continue with appropriate glmers for count data.
m_poisson = glmer(body_length~1+index+(1|author)+(1|length),data=data,family=poisson)
Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.0276912 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
The first thing we can see here, are the failed convergence messages. I get these for all models that follow. I have followed the advice at https://rstudio-pubs-static.s3.amazonaws.com/33653_57fc7b8e5d484c909b615d8633c01d51.html without resolving the issue.
summary(m_poisson)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod] Family: poisson ( log ) Formula: body_length ~ 1 + index + (1 | author) + (1 | length) Data: data AIC BIC logLik deviance df.resid 85549876 85549923 -42774934 85549868 999996 Scaled residuals: Min 1Q Median 3Q Max -73.617 -3.708 -0.069 1.088 287.146 Random effects: Groups Name Variance Std.Dev. author (Intercept) 0.99250 0.9962 length (Intercept) 0.08607 0.2934 Number of obs: 1000000, groups: author, 528126; length, 162 Fixed effects: Estimate Std. Error z value Pr(>|z|) (Intercept) 4.950e+00 1.531e-02 323.3 <2e-16 *** index -3.421e-03 2.851e-05 -120.0 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Correlation of Fixed Effects: (Intr) index -0.039 convergence code: 0 Model failed to converge with max|grad| = 0.0276912 (tol = 0.001, component 1) Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
The first thing, we want to do here, is to check for overdispersion (see http://glmm.wikidot.com/faq):
overdisp_fun <- function(model) {
## number of variance parameters in
## an n-by-n variance-covariance matrix
vpars <- function(m) {
nrow(m)*(nrow(m)+1)/2
}
model.df <- sum(sapply(VarCorr(model),vpars))+length(fixef(model))
rdf <- nrow(model.frame(model))-model.df
rp <- residuals(model,type="pearson")
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}
overdisp_fun(m_poisson)
plot(scale(resid(m_poisson,type="deviance")) ~ fitted(m_poisson), pch = ".")
As expected, we have to deal with overdispersion. There are generally two ways of doing so: (i) adding an individual-level random effect or (ii) using a negative binomial model.
data$obs<-seq.int(nrow(data))
m_poisson2 = glmer(body_length~1+index+(1|author)+(1|length)+(1|obs),data=data,family=poisson)
Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GQmat, compDev = compDev, : Downdated VtV is not positive definite
Somehow I end up with this error, unsure how to resolve it. So let's try to model the length as fixed effect.
m_poisson2 = glmer(body_length~1+index+length+(1|author)+(1|obs),data=data,family=poisson)
Error in eval(expr, envir, enclos): (maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
Just another error...
Let's use less data.
m_poisson2 = glmer(body_length~1+index+length+(1|author)+(1|obs),data=data[1:10000,],family=poisson)
Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
Convergence warnings but at least no error.
summary(m_poisson2)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod] Family: poisson ( log ) Formula: body_length ~ 1 + index + length + (1 | author) + (1 | obs) Data: data[1:10000, ] AIC BIC logLik deviance df.resid 123135.1 123171.1 -61562.5 123125.1 9995 Scaled residuals: Min 1Q Median 3Q Max -1.08944 -0.08929 -0.00292 0.04186 0.07614 Random effects: Groups Name Variance Std.Dev. obs (Intercept) 0.9402 0.9696 author (Intercept) 0.2057 0.4535 Number of obs: 10000, groups: obs, 10000; author, 9880 Fixed effects: Estimate Std. Error z value Pr(>|z|) (Intercept) 4.635946 0.012166 381.0 < 2e-16 *** index -0.010424 0.005052 -2.1 0.0391 * length 0.013426 0.003372 4.0 6.86e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Correlation of Fixed Effects: (Intr) index index -0.057 length -0.149 -0.898 convergence code: 0 Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
overdisp_fun(m_poisson2)
Okay, overdispersion is gone.
plot(resid(m_poisson2,type="deviance") ~ fitted(m_poisson2), pch = ".")
qqnorm(resid(m_poisson2,type="deviance"), pch = ".", main = "")
qqline(resid(m_poisson2,type="deviance"))
Okay, that doesn't look too good. Let's give a final try and use a negative binomial model with the small data sample as the large one results in the same error as above.
m_nbinom = glmer.nb(body_length~1+index+length+(1|author),data=data[1:10000,])
Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00131538 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00493279 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00135842 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00199486 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00215237 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00152596 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00290011 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.0018277 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00216571 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00160043 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00121982 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00268541 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00205914 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00151131 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00210434 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00223369 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00231365 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00236301 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.00252922 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model failed to converge with max|grad| = 0.0025292 (tol = 0.001, component 1)Warning message: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
Convergence doesn't look too good here...
summary(m_nbinom)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) [glmerMod] Family: Negative Binomial(10.32) ( log ) Formula: body_length ~ 1 + index + length + (1 | author) Data: data[1:10000, ] AIC BIC logLik deviance df.resid 123782.2 123818.3 -61886.1 123772.2 9995 Scaled residuals: Min 1Q Median 3Q Max -2.9496 -0.2250 -0.0117 0.1984 3.3812 Random effects: Groups Name Variance Std.Dev. author (Intercept) 1.044 1.022 Number of obs: 10000, groups: author, 9880 Fixed effects: Estimate Std. Error z value Pr(>|z|) (Intercept) 4.643606 0.012054 385.2 < 2e-16 *** index -0.009904 0.003821 -2.6 0.00955 ** length 0.012627 0.002691 4.7 2.7e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Correlation of Fixed Effects: (Intr) index index -0.046 length -0.191 -0.850 convergence code: 0 Model failed to converge with max|grad| = 0.0025292 (tol = 0.001, component 1) Model is nearly unidentifiable: very large eigenvalue - Rescale variables?
Similar results as for the poisson with observational random effect.
plot(resid(m_nbinom,type="deviance") ~ fitted(m_nbinom), pch = ".")
qqnorm(resid(m_nbinom,type="deviance"), pch = ".", main = "")
qqline(resid(m_nbinom,type="deviance"))
Still not satisfying (apart from not converging and not being able to fit more data).