Survey Analysis in R & Stata

R-centric exercises that accompany workshop https://ccpr.ucla.edu/event/analysis-complex-surveys-using-r-stata/

In [1]:
# expect non-root permission error when installing R package in "default location" 

install.packages("mice",repos="http://cran.r-project.org")
Installing package into ‘/usr/local/spark-1.6.1-bin-hadoop2.6/R/lib’
(as ‘lib’ is unspecified)
Warning message:
In install.packages("mice", repos = "http://cran.r-project.org"): 'lib = "/usr/local/spark-1.6.1-bin-hadoop2.6/R/lib"' is not writable
Error in install.packages("mice", repos = "http://cran.r-project.org"): unable to install packages
Traceback:

1. install.packages("mice", repos = "http://cran.r-project.org")
2. stop("unable to install packages")
In [2]:
# non-root permission workaround to install R package in "custom location"

# create a writable library directory (a 'cd' in a separate system() call would not persist)
dir.create("~/work/myr", recursive = TRUE, showWarnings = FALSE)

install.packages("mice",lib="~/work/myr/",repos="http://cran.r-project.org")
install.packages("mitools",lib="~/work/myr/",repos="http://cran.r-project.org")
install.packages("survey",lib="~/work/myr/",repos="http://cran.r-project.org")


# survey packages
library(mice, lib.loc="~/work/myr/")
library(mitools, lib.loc="~/work/myr/")
library(survey, lib.loc="~/work/myr/")

# data wrangling packages (already installed)
library(RCurl)
library(car)
library(dplyr)

sessionInfo()
Loading required package: grid
Loading required package: Matrix
Loading required package: survival

Attaching package: ‘survey’

The following object is masked from ‘package:graphics’:

    dotchart

Loading required package: bitops

Attaching package: ‘RCurl’

The following object is masked from ‘package:mice’:

    complete


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] dplyr_0.4.3     car_2.1-2       RCurl_1.95-4.8  bitops_1.0-6   
[5] survey_3.31-5   survival_2.39-4 Matrix_1.2-6    mitools_2.3    
[9] mice_2.30      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5        nloptr_1.0.4       base64enc_0.1-3    tools_3.3.1       
 [5] rpart_4.1-10       digest_0.6.9       lme4_1.1-12        uuid_0.1-2        
 [9] jsonlite_0.9.22    evaluate_0.9       nlme_3.1-128       lattice_0.20-33   
[13] mgcv_1.8-12        IRdisplay_0.3      DBI_0.4-1          IRkernel_0.6      
[17] parallel_3.3.1     SparseM_1.7        repr_0.7           stringr_1.0.0     
[21] MatrixModels_0.4-1 nnet_7.3-12        R6_2.1.2           pbdZMQ_0.2-3      
[25] minqa_1.2.4        magrittr_1.5       MASS_7.3-45        splines_3.3.1     
[29] assertthat_0.1     pbkrtest_0.4-6     quantreg_5.26      stringi_1.1.1     

Step 0) Understand Survey Data

Total Survey Error: "Measurement" and "Representation"

https://academic.oup.com/poq/article/74/5/849/1817502/Total-Survey-Error-Past-Present-and-Future

Receive Processed Data: Behavioral Risk Factor Surveillance System (2011)

https://www.cdc.gov/brfss/annual_data/annual_2011.htm

  • Find (Ctrl + F) keyword "design" in data documentation

"Get" survey data from "url"

In [3]:
########################################
# get brfss data example adapted from corey_sparks
########################################

# library(RCurl)
# library(dplyr)
# library(car)

url = "https://raw.githubusercontent.com/coreysparks/data/master/brfss_11.csv"
brfss_11 = getURL(url)
brfss_11 = read.csv(textConnection(brfss_11), header=T, dec=",")
nams = names(brfss_11)
newnames = gsub("_", "", nams)
names(brfss_11) = tolower(newnames)

brfss_11$statefip = sprintf("%02d", brfss_11$state )
brfss_11$cofip = sprintf("%03d", brfss_11$cnty )
brfss_11$cofips = paste(brfss_11$statefip, brfss_11$cofip, sep="")


brfss_11$weight = as.numeric(as.character(brfss_11$cntywt))
# for brevity, keep only TX respondents with non-missing county weights
brfss_11 = brfss_11[is.na(brfss_11$cntywt)==F,]

dim(brfss_11)


################################################
# wrangle 'measurement' variables
################################################

# ?car::recode

# smoking currently
brfss_11$smoke = car::recode(brfss_11$smoker3, recodes="1:2=1; 3:4=0; else=NA")

# low physical activity
brfss_11$lowact = car::recode(brfss_11$pacat, recodes="1:2=0; 3:4=1; else=NA")

# high cholesterol
brfss_11$hc = car::recode(brfss_11$toldhi2, recodes="1=1; 2=0; else=NA")

# marital status
brfss_11$marst = car::recode(brfss_11$marital, recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA", as.factor.result=T)
brfss_11$marst = relevel(brfss_11$marst, ref='married')

# income
brfss_11$inc = as.factor(ifelse(brfss_11$incomg==9, NA, brfss_11$incomg))

# age z-standardized
brfss_11$agez = scale(brfss_11$age, center=T, scale=T)

brfss_11 = brfss_11[complete.cases(brfss_11[, c("agez")]),]
# names(brfss_11)

# parec
# what is cryptically named "parec" ? 
# exercise: look up in codebook
summary(brfss_11$parec)


# the 'measurement' variables to use
names_measure_var = c("agez",'hc','marst','lowact','smoke','inc')

brfss_11[,names_measure_var] %>% summary
brfss_11[,names_measure_var] %>% head


################################################
# wrangle 'design' variables
################################################

brfss_11$cntywt = as.numeric(as.character(brfss_11$cntywt))
brfss_11$sampleid = paste(brfss_11$state, brfss_11$ststr)

# within each sampling unit, sum the weights
wts = tapply(brfss_11$cntywt,brfss_11$sampleid,sum)

# make a data frame from this
wts = data.frame(id=names(unlist(wts)), wt=unlist(wts))

# get the unique sampling location ids and their respondent counts
t1 = as.data.frame(table(brfss_11$sampleid))

# put all of this into a data set
wts2 = data.frame(ids=wts$id, sumwt=wts$wt, jn=t1$Freq)

# merge all of this back to the original data file
brfss_11 = merge(brfss_11, wts2, by.x="sampleid", by.y="ids", all.x=T)
rm(wts2)

# In the new data set, multiply the original weight by the fraction of the
# sampling unit total population each person represents
brfss_11$swts = brfss_11$cntywt*(brfss_11$jn/brfss_11$sumwt)

hist(brfss_11$swts)
# head(brfss_11)

# names(brfss_11)

# the 'design' variables to use
names_design_var = c("ststr","swts")
brfss_11[,names_design_var] %>% head
brfss_11[,names_design_var] %>% summary


#################################
# gather 'measurement' and 'design' variables
#################################
dat_imp_ex = brfss_11[,c(names_design_var,names_measure_var)]

dat_imp_ex %>% head
dat_imp_ex %>% summary


## optional: export original BRFSS data from R to stata

# library(haven)
# names(brfss_11) %>% gsub(pattern='\\.',replacement='_') %>% sort
# 
# names(brfss_11) = gsub(x=names(brfss_11),pattern='\\.',replacement='_')
# write_dta(brfss_11,path='~/projects/survey_complex/data/proc/brfss_11.DTA')
# # read_dta('~/projects/survey_complex/data/proc/brfss_11.DTA')
# 
# write_dta(dat_imp_ex,path='~/projects/survey_complex/data/proc/dat_imp_ex.DTA')
# # read_dta('~/projects/survey_complex/data/proc/dat_imp_ex.DTA')
[1] 7594  212
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   2.000   3.000   3.277   4.000   9.000 
       agez.V1              hc               marst          lowact      
 Min.   :-2.7375082   Min.   :0.0000   married  :4250   Min.   :0.0000  
 1st Qu.:-0.7110235   1st Qu.:0.0000   cohab    : 159   1st Qu.:0.0000  
 Median : 0.0770539   Median :0.0000   divorced :1041   Median :0.0000  
 Mean   : 0.0000000   Mean   :0.4769   nm       : 845   Mean   :0.4905  
 3rd Qu.: 0.7525488   3rd Qu.:1.0000   separated: 180   3rd Qu.:1.0000  
 Max.   : 2.4412861   Max.   :1.0000   widowed  :1062   Max.   :1.0000  
                      NA's   :1117     NA's     :  57   NA's   :676     
     smoke         inc      
 Min.   :0.000   1   : 753  
 1st Qu.:0.000   2   :1092  
 Median :0.000   3   : 650  
 Mean   :0.131   4   : 857  
 3rd Qu.:0.000   5   :3181  
 Max.   :1.000   NA's:1061  
 NA's   :54                 
                agez hc    marst lowact smoke inc
1  0.302218886400022  1  married      1     0   5
2 -0.936188435301162  1 divorced      0     1  NA
3  -1.72426582183828  0       nm      0     0   5
4   -1.2176446447787  1 divorced      1     0   5
5  -1.94943078942031 NA  married      0     0   3
6 -0.823605951510145  0       nm      1     1   5
         ststr         swts
1 48011.000000     4.632437
2 48011.000000     1.011813
3 4.801100e+04 2.891841e-01
4 48011.000000     1.222201
5 4.801100e+04 7.281387e-01
6 48011.000000     1.004207
     ststr            swts        
 Min.   :48011   Min.   :0.08207  
 1st Qu.:48031   1st Qu.:0.45894  
 Median :48071   Median :0.78172  
 Mean   :48154   Mean   :1.00000  
 3rd Qu.:48101   3rd Qu.:1.25624  
 Max.   :48999   Max.   :8.46653  
  ststr              swts                agez hc   marst lowact smoke inc
1 48011  4.63243668832793   -1.94943078942031 NA      nm      0     0   5
2 48011  1.01181301185751   0.471092612086548  0 married      1     0   5
3 48011 0.289184139738545   0.977713789146123  1 widowed      1     0   2
4 48011  1.22220072541493  -0.485858500137095 NA married      1     0   4
5 48011 0.728138702400464 -0.0918198068685361  1 married      0     0   5
6 48011  1.00420670567116   0.358510128295531  0 married      0     0   5
     ststr            swts               agez.V1              hc        
 Min.   :48011   Min.   :0.08207   Min.   :-2.7375082   Min.   :0.0000  
 1st Qu.:48031   1st Qu.:0.45894   1st Qu.:-0.7110235   1st Qu.:0.0000  
 Median :48071   Median :0.78172   Median : 0.0770539   Median :0.0000  
 Mean   :48154   Mean   :1.00000   Mean   : 0.0000000   Mean   :0.4769  
 3rd Qu.:48101   3rd Qu.:1.25624   3rd Qu.: 0.7525488   3rd Qu.:1.0000  
 Max.   :48999   Max.   :8.46653   Max.   : 2.4412861   Max.   :1.0000  
                                                        NA's   :1117    
       marst          lowact           smoke         inc      
 married  :4250   Min.   :0.0000   Min.   :0.000   1   : 753  
 cohab    : 159   1st Qu.:0.0000   1st Qu.:0.000   2   :1092  
 divorced :1041   Median :0.0000   Median :0.000   3   : 650  
 nm       : 845   Mean   :0.4905   Mean   :0.131   4   : 857  
 separated: 180   3rd Qu.:1.0000   3rd Qu.:0.000   5   :3181  
 widowed  :1062   Max.   :1.0000   Max.   :1.000   NA's:1061  
 NA's     :  57   NA's   :676      NA's   :54                 

Step 1) Declare the Survey Design

In [4]:
# library(survey)

########################################
# set survey design
# non-imputation example
########################################

dim(brfss_11)
dim(dat_imp_ex)

options(survey.lonely.psu = "adjust")

# ?svydesign

des_non_impute = svydesign(ids=~1, 
                           strata=~ststr, 
                           weights=~swts, 
                           data=brfss_11)

attributes(des_non_impute)
[1] 7594  218
[1] 7594    8
$names
  1. 'cluster'
  2. 'strata'
  3. 'has.strata'
  4. 'prob'
  5. 'allprob'
  6. 'call'
  7. 'variables'
  8. 'fpc'
  9. 'pps'
$class
  1. 'survey.design2'
  2. 'survey.design'

Step 2a) Conduct Survey Analysis (without Missing Data)

Descriptives

In [5]:
########################
# descriptives
########################

# head(des_non_impute)
# str(des_non_impute)

# summary(des_non_impute$variables)

# ?svymean
message('Note: age does not have missing cases')
svymean(~age, design = des_non_impute)

message('Note: smoke has missing cases, hence NA svymean() result')
svymean(~as.factor(smoke), design = des_non_impute)

message('Note: supply svymean(...,na.rm=TRUE) option to svymean() for case deletion')
svymean(~as.factor(smoke), design = des_non_impute,na.rm=TRUE)
Note: age does not have missing cases
      mean     SE
age 49.377 0.2488
Note: smoke has missing cases, hence NA svymean() result
                  mean SE
as.factor(smoke)0   NA NA
as.factor(smoke)1   NA NA
Note: supply svymean(...,na.rm=TRUE) option to svymean() for case deletion
                     mean     SE
as.factor(smoke)0 0.85476 0.0055
as.factor(smoke)1 0.14524 0.0055

In the code cell below, try computing the total count of smokers with case deletion.
Hint: ?svytotal uses similar syntax to svymean().

In [6]:
message('Note: supply svytotal(...,na.rm=TRUE) option to svytotal() for case deletion')
Note: supply svytotal(...,na.rm=TRUE) option to svytotal() for case deletion
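One possible solution (a minimal sketch; the same call also appears later, in Step 2b2):

svytotal(~as.factor(smoke), design = des_non_impute, na.rm = TRUE)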

Regression

In [7]:
########################
# regression
########################

# codebook says the cryptically named 'parec' variable is
# parec: Aerobic and Strengthening Guideline

fit_non_impute_age = svyglm(design=des_non_impute,
                        formula=agez ~ as.factor(parec),
                        family='gaussian')
summary(fit_non_impute_age)

# note: by default, svyglm() drops rows with NAs (casewise deletion)
fit_non_impute_smoke = svyglm(design=des_non_impute,
                        formula=smoke ~ agez + inc + lowact,
                        family=quasibinomial())

summary(fit_non_impute_smoke)
Call:
svyglm(design = des_non_impute, formula = agez ~ as.factor(parec), 
    family = "gaussian")

Survey design:
svydesign(ids = ~1, strata = ~ststr, weights = ~swts, data = brfss_11)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.46248    0.03459 -13.370  < 2e-16 ***
as.factor(parec)2  0.23211    0.04412   5.261 1.47e-07 ***
as.factor(parec)3 -0.07634    0.06376  -1.197 0.231182    
as.factor(parec)4  0.15360    0.04113   3.735 0.000189 ***
as.factor(parec)9 -0.04430    0.06479  -0.684 0.494128    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.9359618)

Number of Fisher Scoring iterations: 2
Call:
svyglm(design = des_non_impute, formula = smoke ~ agez + inc + 
    lowact, family = quasibinomial())

Survey design:
svydesign(ids = ~1, strata = ~ststr, weights = ~swts, data = brfss_11)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.26656    0.13886  -9.121  < 2e-16 ***
agez        -0.24541    0.04689  -5.234 1.72e-07 ***
inc2        -0.32622    0.16661  -1.958   0.0503 .  
inc3        -0.38498    0.19005  -2.026   0.0428 *  
inc4        -0.75757    0.18955  -3.997 6.50e-05 ***
inc5        -1.19397    0.14582  -8.188 3.21e-16 ***
lowact       0.25195    0.10040   2.509   0.0121 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 1.004407)

Number of Fisher Scoring iterations: 4

Subpopulation (Subdomain) Analysis

In [8]:
########################
# subpopulation (subdomain) analysis
########################

# library(survey)
methods(subset)
# ?subset

# ?subset.survey.design
# names(des_non_impute$variables) %>% sort

message('mean of age on whole sample')
svymean(~age, design = des_non_impute)

message('mean of age on Age 50 and older sample')
des_sub1 = survey:::subset.survey.design(age>=50,x=des_non_impute)
svymean(~age,design=des_sub1)

message('mean of age on Age under 50 sample')
des_sub2 = survey:::subset.survey.design(age<50,x=des_non_impute)
svymean(~age,design=des_sub2)
 [1] subset.data.frame           subset.DBIsvydesign*       
 [3] subset.default              subset.matrix              
 [5] subset.ODBCsvydesign*       subset.survey.design*      
 [7] subset.svyDBimputationList* subset.svyimputationList*  
 [9] subset.svyrep.design*       subset.twophase*           
[11] subset.twophase2*          
see '?methods' for accessing help and source code
mean of age on whole sample
      mean     SE
age 49.377 0.2488
mean of age on Age 50 and older sample
      mean     SE
age 63.464 0.1814
mean of age on Age under 50 sample
      mean     SE
age 35.242 0.2379
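Because subset.survey.design is a registered S3 method (see the methods(subset) output above), an equivalent and more idiomatic call uses the subset() generic, which dispatches on the design object:

# equivalent to the survey:::subset.survey.design(...) calls above; 'des_sub1_alt' is just an illustrative name
des_sub1_alt = subset(des_non_impute, age >= 50)
svymean(~age, design = des_sub1_alt)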

Step 2b1) Impute Missing Data + Propagate into Design

Impute via 'mice'

In [9]:
########################################
# impute
########################################
# library(dplyr)

# gather 'survey design variables' and 'measure variables'


# names(brfss_11)
dim(brfss_11)
# summary(brfss_11)


dim(dat_imp_ex)
summary(dat_imp_ex)


# note: later, we will even use vars in 'names_design_var' to impute
message('design variables')
names_design_var

message('measurement variables')
names_measure_var

message('dataset for imputation')
names(dat_imp_ex)


# use mice to impute and pool
# library(mice)
?mice
[1] 7594  218
[1] 7594    8
     ststr            swts               agez.V1              hc        
 Min.   :48011   Min.   :0.08207   Min.   :-2.7375082   Min.   :0.0000  
 1st Qu.:48031   1st Qu.:0.45894   1st Qu.:-0.7110235   1st Qu.:0.0000  
 Median :48071   Median :0.78172   Median : 0.0770539   Median :0.0000  
 Mean   :48154   Mean   :1.00000   Mean   : 0.0000000   Mean   :0.4769  
 3rd Qu.:48101   3rd Qu.:1.25624   3rd Qu.: 0.7525488   3rd Qu.:1.0000  
 Max.   :48999   Max.   :8.46653   Max.   : 2.4412861   Max.   :1.0000  
                                                        NA's   :1117    
       marst          lowact           smoke         inc      
 married  :4250   Min.   :0.0000   Min.   :0.000   1   : 753  
 cohab    : 159   1st Qu.:0.0000   1st Qu.:0.000   2   :1092  
 divorced :1041   Median :0.0000   Median :0.000   3   : 650  
 nm       : 845   Mean   :0.4905   Mean   :0.131   4   : 857  
 separated: 180   3rd Qu.:1.0000   3rd Qu.:0.000   5   :3181  
 widowed  :1062   Max.   :1.0000   Max.   :1.000   NA's:1061  
 NA's     :  57   NA's   :676      NA's   :54                 
design variables
  1. 'ststr'
  2. 'swts'
measurement variables
  1. 'agez'
  2. 'hc'
  3. 'marst'
  4. 'lowact'
  5. 'smoke'
  6. 'inc'
dataset for imputation
  1. 'ststr'
  2. 'swts'
  3. 'agez'
  4. 'hc'
  5. 'marst'
  6. 'lowact'
  7. 'smoke'
  8. 'inc'
mice {mice}                                                        R Documentation

Multivariate Imputation by Chained Equations (MICE)

Description

Generates Multivariate Imputations by Chained Equations (MICE)

Usage

mice(data, m = 5, method = vector("character", length = ncol(data)),
  predictorMatrix = (1 - diag(1, ncol(data))),
  visitSequence = (1:ncol(data))[apply(is.na(data), 2, any)],
  form = vector("character", length = ncol(data)),
  post = vector("character", length = ncol(data)), defaultMethod = c("pmm",
  "logreg", "polyreg", "polr"), maxit = 5, diagnostics = TRUE,
  printFlag = TRUE, seed = NA, imputationMethod = NULL,
  defaultImputationMethod = NULL, data.init = NULL, ...)

Arguments

data

A data frame or a matrix containing the incomplete data. Missing values are coded as NA.

m

Number of multiple imputations. The default is m=5.

method

Can be either a single string, or a vector of strings with length ncol(data), specifying the elementary imputation method to be used for each column in data. If specified as a single string, the same method will be used for all columns. The default imputation method (when no argument is specified) depends on the measurement level of the target column and are specified by the defaultMethod argument. Columns that need not be imputed have the empty method ''. See details for more information.

predictorMatrix

A square matrix of size ncol(data) containing 0/1 data specifying the set of predictors to be used for each target column. Rows correspond to target variables (i.e. variables to be imputed), in the sequence as they appear in data. A value of '1' means that the column variable is used as a predictor for the target variable (in the rows). The diagonal of predictorMatrix must be zero. The default for predictorMatrix is that all other columns are used as predictors (sometimes called massive imputation). Note: For two-level imputation codes '2' and '-2' are also allowed.

visitSequence

A vector of integers of arbitrary length, specifying the column indices of the visiting sequence. The visiting sequence is the column order that is used to impute the data during one pass through the data. A column may be visited more than once. All incomplete columns that are used as predictors should be visited, or else the function will stop with an error. The default sequence 1:ncol(data) implies that columns are imputed from left to right. It is possible to specify one of the keywords 'roman' (left to right), 'arabic' (right to left), 'monotone' (sorted in increasing amount of missingness) and 'revmonotone' (reverse of monotone). The keyword should be supplied as a string and may be abbreviated.

form

A vector of strings with length ncol(data), specifying formulae. Each string is parsed and executed within the sampler() function to create terms for the predictor. The default is to do nothing, indicated by a vector of empty strings ''. The main value lies in the easy specification of interaction terms. The user must ensure that the set of variables in the formula match those in predictors.

post

A vector of strings with length ncol(data), specifying expressions. Each string is parsed and executed within the sampler() function to postprocess imputed values. The default is to do nothing, indicated by a vector of empty strings ''.

defaultMethod

A vector of three strings containing the default imputation methods for numerical columns, factor columns with 2 levels, and columns with (unordered or ordered) factors with more than two levels, respectively. If nothing is specified, the following defaults will be used: pmm, predictive mean matching (numeric data) logreg, logistic regression imputation (binary data, factor with 2 levels) polyreg, polytomous regression imputation for unordered categorical data (factor >= 2 levels) polr, proportional odds model for (ordered, >= 2 levels)

maxit

A scalar giving the number of iterations. The default is 5.

diagnostics

A Boolean flag. If TRUE, diagnostic information will be appended to the value of the function. If FALSE, only the imputed data are saved. The default is TRUE.

printFlag

If TRUE, mice will print history on console. Use print=FALSE for silent computation.

seed

An integer that is used as argument by the set.seed() for offsetting the random number generator. Default is to leave the random number generator alone.

imputationMethod

Same as method argument. Included for backwards compatibility.

defaultImputationMethod

Same as defaultMethod argument. Included for backwards compatibility.

data.init

A data frame of the same size and type as data, without missing data, used to initialize imputations before the start of the iterative process. The default NULL implies that starting imputation are created by a simple random draw from the data. Note that specification of data.init will start the m Gibbs sampling streams from the same imputations.

...

Named arguments that are passed down to the elementary imputation functions.

Details

Generates multiple imputations for incomplete multivariate data by Gibbs sampling. Missing data can occur anywhere in the data. The algorithm imputes an incomplete column (the target column) by generating 'plausible' synthetic values given other columns in the data. Each incomplete column must act as a target column, and has its own specific set of predictors. The default set of predictors for a given target consists of all other columns in the data. For predictors that are incomplete themselves, the most recently generated imputations are used to complete the predictors prior to imputation of the target column.

A separate univariate imputation model can be specified for each column. The default imputation method depends on the measurement level of the target column. In addition to these, several other methods are provided. You can also write their own imputation functions, and call these from within the algorithm.

The data may contain categorical variables that are used in a regressions on other variables. The algorithm creates dummy variables for the categories of these variables, and imputes these from the corresponding categorical variable. The extended model containing the dummy variables is called the padded model. Its structure is stored in the list component pad.

Built-in elementary imputation methods are:

pmm

Predictive mean matching (any)

norm

Bayesian linear regression (numeric)

norm.nob

Linear regression ignoring model error (numeric)

norm.boot

Linear regression using bootstrap (numeric)

norm.predict

Linear regression, predicted values (numeric)

mean

Unconditional mean imputation (numeric)

2l.norm

Two-level normal imputation (numeric)

2l.pan

Two-level normal imputation using pan (numeric)

2lonly.mean

Imputation at level-2 of the class mean (numeric)

2lonly.norm

Imputation at level-2 by Bayesian linear regression (numeric)

2lonly.pmm

Imputation at level-2 by Predictive mean matching (any)

quadratic

Imputation of quadratic terms (numeric)

logreg

Logistic regression (factor, 2 levels)

logreg.boot

Logistic regression with bootstrap

polyreg

Polytomous logistic regression (factor, >= 2 levels)

polr

Proportional odds model (ordered, >=2 levels)

lda

Linear discriminant analysis (factor, >= 2 categories)

cart

Classification and regression trees (any)

rf

Random forest imputations (any)

ri

Random indicator method for nonignorable data (numeric)

sample

Random sample from the observed values (any)

fastpmm

Experimental: Fast predictive mean matching using C++ (any)

These corresponding functions are coded in the mice library under names mice.impute.method, where method is a string with the name of the elementary imputation method name, for example norm. The method argument specifies the methods to be used. For the j'th column, mice() calls the first occurence of paste('mice.impute.',method[j],sep='') in the search path. The mechanism allows uses to write customized imputation function, mice.impute.myfunc. To call it for all columns specify method='myfunc'. To call it only for, say, column 2 specify method=c('norm','myfunc','logreg',...{}).

Passive imputation: mice() supports a special built-in method, called passive imputation. This method can be used to ensure that a data transform always depends on the most recently generated imputations. In some cases, an imputation model may need transformed data in addition to the original data (e.g. log, quadratic, recodes, interaction, sum scores, and so on).

Passive imputation maintains consistency among different transformations of the same data. Passive imputation is invoked if ~ is specified as the first character of the string that specifies the elementary method. mice() interprets the entire string, including the ~ character, as the formula argument in a call to model.frame(formula, data[!r[,j],]). This provides a simple mechanism for specifying determinstic dependencies among the columns. For example, suppose that the missing entries in variables data$height and data$weight are imputed. The body mass index (BMI) can be calculated within mice by specifying the string '~I(weight/height^2)' as the elementary imputation method for the target column data$bmi. Note that the ~ mechanism works only on those entries which have missing values in the target column. You should make sure that the combined observed and imputed parts of the target column make sense. An easy way to create consistency is by coding all entries in the target as NA, but for large data sets, this could be inefficient. Note that you may also need to adapt the default predictorMatrix to evade linear dependencies among the predictors that could cause errors like Error in solve.default() or Error: system is exactly singular. Though not strictly needed, it is often useful to specify visitSequence such that the column that is imputed by the ~ mechanism is visited each time after one of its predictors was visited. In that way, deterministic relation between columns will always be synchronized.

Value

Returns an S3 object of class mids (multiply imputed data set)

Author(s)

Stef van Buuren [email protected], Karin Groothuis-Oudshoorn [email protected], 2000-2010, with contributions of Alexander Robitzsch, Gerko Vink, Shahab Jolani, Roel de Jong, Jason Turner, Lisa Doove, John Fox, Frank E. Harrell, and Peter Malewski.

References

Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/

van Buuren, S. (2012). Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC Press.

Van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn C.G.M., Rubin, D.B. (2006) Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76, 12, 1049–1064.

Van Buuren, S. (2007) Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16, 3, 219–242.

Van Buuren, S., Boshuizen, H.C., Knook, D.L. (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine, 18, 681–694.

Brand, J.P.L. (1999) Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets. Dissertation. Rotterdam: Erasmus University.

See Also

mids, with.mids, set.seed, complete

Examples


# do default multiple imputation on a numeric matrix
imp <- mice(nhanes)
imp

# list the actual imputations for BMI
imp$imputations$bmi

# first completed data matrix
complete(imp)


# imputation on mixed data with a different method per column

mice(nhanes2, meth=c('sample','pmm','logreg','norm'))


[Package mice version 2.30 ]

Use 'full' PredictorMatrix (default)

Each column is imputed from all the others, i.e. $[X_j \mid X_{-j}]$ for every $j$; because the 'design' variables are among those columns, they too are used as predictors in the imputation models.

In [10]:
########################################
# use 'full' PredictorMatrix (default)
# eg X_j|X_{-j} for all j
# if 'design' variables part of j, then they are used to impute
########################################

imp_pred_mat_all <- mice(dat_imp_ex, m=5, seed=1234)  # m = number of imputed datasets
summary(imp_pred_mat_all)

# 'imp' on lhs will be what's used in design+analysis going forward
imp = imp_pred_mat_all

# note: interpret 1s/0s of 'PredictorMatrix'
 iter imp variable
  1   1  hc  marst  lowact  smoke  inc
  1   2  hc  marst  lowact  smoke  inc
  1   3  hc  marst  lowact  smoke  inc
  1   4  hc  marst  lowact  smoke  inc
  1   5  hc  marst  lowact  smoke  inc
  2   1  hc  marst  lowact  smoke  inc
  2   2  hc  marst  lowact  smoke  inc
  2   3  hc  marst  lowact  smoke  inc
  2   4  hc  marst  lowact  smoke  inc
  2   5  hc  marst  lowact  smoke  inc
  3   1  hc  marst  lowact  smoke  inc
  3   2  hc  marst  lowact  smoke  inc
  3   3  hc  marst  lowact  smoke  inc
  3   4  hc  marst  lowact  smoke  inc
  3   5  hc  marst  lowact  smoke  inc
  4   1  hc  marst  lowact  smoke  inc
  4   2  hc  marst  lowact  smoke  inc
  4   3  hc  marst  lowact  smoke  inc
  4   4  hc  marst  lowact  smoke  inc
  4   5  hc  marst  lowact  smoke  inc
  5   1  hc  marst  lowact  smoke  inc
  5   2  hc  marst  lowact  smoke  inc
  5   3  hc  marst  lowact  smoke  inc
  5   4  hc  marst  lowact  smoke  inc
  5   5  hc  marst  lowact  smoke  inc
Multiply imputed data set
Call:
mice(data = dat_imp_ex, m = 5, seed = 1234)
Number of multiple imputations:  5
Missing cells per column:
 ststr   swts   agez     hc  marst lowact  smoke    inc 
     0      0      0   1117     57    676     54   1061 
Imputation methods:
    ststr      swts      agez        hc     marst    lowact     smoke       inc 
       ""        ""        ""     "pmm" "polyreg"     "pmm"     "pmm" "polyreg" 
VisitSequence:
    hc  marst lowact  smoke    inc 
     4      5      6      7      8 
PredictorMatrix:
       ststr swts agez hc marst lowact smoke inc
ststr      0    0    0  0     0      0     0   0
swts       0    0    0  0     0      0     0   0
agez       0    0    0  0     0      0     0   0
hc         1    1    1  0     1      1     1   1
marst      1    1    1  1     0      1     1   1
lowact     1    1    1  1     1      0     1   1
smoke      1    1    1  1     1      1     0   1
inc        1    1    1  1     1      1     1   0
Random generator seed value:  1234 

Interpret the 1s/0s of the 'PredictorMatrix' output: rows are imputation targets, columns are predictors.
The $agez$ row is all 0s because $agez$ has no missing values, so it is not imputed from the other variables.
The $inc$ row has 1s everywhere except its own (last, diagonal) column, meaning $inc$ is imputed using all the other variables as predictors.
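As an aside (not run in the original notebook), the predictor matrix that mice() used is also stored on the returned mids object, so it can be inspected or reused directly:

# the PredictorMatrix shown above, as an R matrix
imp_pred_mat_all$predictorMatrix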

Use 'custom' PredictorMatrix

Exclude the variables in columns 1 and 2 from the predictor set. This is just a demo; in practice you SHOULD include design variables in your imputation model.

http://www2.stat.duke.edu/~jerry/Papers/SM06.pdf

In [11]:
########################################
# exclude variables in column 1 and 2
########################################

names(dat_imp_ex)
# columns 1 and 2 were 'design' variables
predictorMatrix_cust = (1 - diag(1, ncol(dat_imp_ex)))
predictorMatrix_cust[,1] = 0
predictorMatrix_cust[,2] = 0

predictorMatrix_cust

imp_pred_mat_custom <- mice(dat_imp_ex, m=5, seed=1234,
                            predictorMatrix=predictorMatrix_cust)
summary(imp_pred_mat_custom)

# this is just a demo on how to toggle custom predictorMatrix
# note: 'imp_pred_mat_custom' is not used going forward
  1. 'ststr'
  2. 'swts'
  3. 'agez'
  4. 'hc'
  5. 'marst'
  6. 'lowact'
  7. 'smoke'
  8. 'inc'
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    0    1    1    1    1    1    1
[2,]    0    0    1    1    1    1    1    1
[3,]    0    0    0    1    1    1    1    1
[4,]    0    0    1    0    1    1    1    1
[5,]    0    0    1    1    0    1    1    1
[6,]    0    0    1    1    1    0    1    1
[7,]    0    0    1    1    1    1    0    1
[8,]    0    0    1    1    1    1    1    0
 iter imp variable
  1   1  hc  marst  lowact  smoke  inc
  1   2  hc  marst  lowact  smoke  inc
  1   3  hc  marst  lowact  smoke  inc
  1   4  hc  marst  lowact  smoke  inc
  1   5  hc  marst  lowact  smoke  inc
  2   1  hc  marst  lowact  smoke  inc
  2   2  hc  marst  lowact  smoke  inc
  2   3  hc  marst  lowact  smoke  inc
  2   4  hc  marst  lowact  smoke  inc
  2   5  hc  marst  lowact  smoke  inc
  3   1  hc  marst  lowact  smoke  inc
  3   2  hc  marst  lowact  smoke  inc
  3   3  hc  marst  lowact  smoke  inc
  3   4  hc  marst  lowact  smoke  inc
  3   5  hc  marst  lowact  smoke  inc
  4   1  hc  marst  lowact  smoke  inc
  4   2  hc  marst  lowact  smoke  inc
  4   3  hc  marst  lowact  smoke  inc
  4   4  hc  marst  lowact  smoke  inc
  4   5  hc  marst  lowact  smoke  inc
  5   1  hc  marst  lowact  smoke  inc
  5   2  hc  marst  lowact  smoke  inc
  5   3  hc  marst  lowact  smoke  inc
  5   4  hc  marst  lowact  smoke  inc
  5   5  hc  marst  lowact  smoke  inc
Multiply imputed data set
Call:
mice(data = dat_imp_ex, m = 5, predictorMatrix = predictorMatrix_cust, 
    seed = 1234)
Number of multiple imputations:  5
Missing cells per column:
 ststr   swts   agez     hc  marst lowact  smoke    inc 
     0      0      0   1117     57    676     54   1061 
Imputation methods:
    ststr      swts      agez        hc     marst    lowact     smoke       inc 
       ""        ""        ""     "pmm" "polyreg"     "pmm"     "pmm" "polyreg" 
VisitSequence:
    hc  marst lowact  smoke    inc 
     4      5      6      7      8 
PredictorMatrix:
       ststr swts agez hc marst lowact smoke inc
ststr      0    0    0  0     0      0     0   0
swts       0    0    0  0     0      0     0   0
agez       0    0    0  0     0      0     0   0
hc         0    0    1  0     1      1     1   1
marst      0    0    1  1     0      1     1   1
lowact     0    0    1  1     1      0     1   1
smoke      0    0    1  1     1      1     0   1
inc        0    0    1  1     1      1     1   0
Random generator seed value:  1234 

Use other "Imputation methods"

Interpret the "Imputation methods" output.

Imputation methods:
    ststr      swts      agez        hc     marst    lowact     smoke       inc 
       ""        ""        ""     "pmm" "polyreg"     "pmm"     "pmm" "polyreg"

Here's where R shines: what if I want to use a random forest for $[X_j \mid X_{-j}]$?
On your own, type ?mice in the code cell below and read the help manual.
Hint: look at the method and defaultMethod arguments.
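For instance, here is a minimal sketch (not run in this notebook; it assumes the randomForest package, which mice's 'rf' method relies on, is installed) that swaps in random forest imputation for every incomplete column:

# hypothetical example: '' = column not imputed, 'rf' = random forest imputation
meth_rf = c(ststr="", swts="", agez="", hc="rf", marst="rf", lowact="rf", smoke="rf", inc="rf")
imp_rf  = mice(dat_imp_ex, m=5, seed=1234, method=meth_rf)
summary(imp_rf)

The rest of the workflow (imputationList(), svydesign(), MIcombine()) is unchanged.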

Propagate Imputation Uncertainty into Design via 'mitools'

In [12]:
########################################
# intermediate processing step
# mice::complete + imputationList()
# 'survey' needs list structure
########################################
# ?mice::complete  # fills in the imputed values to yield a 'completed' dataset
# ?mitools::imputationList  # structures the completed datasets as a list, one item per imputation

# library(mitools)

data_imputed <- mitools::imputationList(lapply(1:5, FUN=mice::complete, x=imp))
str(data_imputed)

message('the list of imputations')
summary(data_imputed)

########################################
# use library(survey)
# 1) respect the survey design
# 2) later: conduct survey analysis
########################################
# library(survey)

# respect survey design
# output of imputationList(), 'data_imputed', becomes the data argument
# svydesign(data=data_imputed, ...)
# previously 'brfss_11' was the data argument when no imputation was used
# svydesign(data=brfss_11, ...)

# ?survey::svydesign

options(survey.lonely.psu = "adjust")

# push the completed datasets through the declared survey design
des_imputed <- svydesign(data=data_imputed,
                         ids=~1, strata=~ststr, weights=~swts, 
                         nest=TRUE)

message('the list of imputations propagated into the survey design')

class(des_imputed)
summary(des_imputed)
List of 2
 $ imputations:List of 5
  ..$ :'data.frame':	7594 obs. of  8 variables:
  .. ..$ ststr : int [1:7594] 48011 48011 48011 48011 48011 48011 48011 48011 48011 48011 ...
  .. ..$ swts  : num [1:7594(1d)] 4.632 1.012 0.289 1.222 0.728 ...
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:7594] "48 48011" "48 48011" "48 48011" "48 48011" ...
  .. ..$ agez  : num [1:7594, 1] -1.9494 0.4711 0.9777 -0.4859 -0.0918 ...
  .. ..$ hc    : num [1:7594] 0 0 1 1 1 0 1 0 0 1 ...
  .. ..$ marst : Factor w/ 6 levels "married","cohab",..: 4 1 6 1 1 1 1 1 1 1 ...
  .. .. ..- attr(*, "contrasts")= num [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:6] "married" "cohab" "divorced" "nm" ...
  .. .. .. .. ..$ : chr [1:5] "2" "3" "4" "5" ...
  .. ..$ lowact: num [1:7594] 0 1 1 1 0 0 0 0 0 0 ...
  .. ..$ smoke : num [1:7594] 0 0 0 0 0 0 0 1 0 0 ...
  .. ..$ inc   : Factor w/ 5 levels "1","2","3","4",..: 5 5 2 4 5 5 5 5 5 4 ...
  .. .. ..- attr(*, "contrasts")= num [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. .. .. .. ..$ : chr [1:4] "2" "3" "4" "5"
  ..$ :'data.frame':	7594 obs. of  8 variables:
  .. ..$ ststr : int [1:7594] 48011 48011 48011 48011 48011 48011 48011 48011 48011 48011 ...
  .. ..$ swts  : num [1:7594(1d)] 4.632 1.012 0.289 1.222 0.728 ...
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:7594] "48 48011" "48 48011" "48 48011" "48 48011" ...
  .. ..$ agez  : num [1:7594, 1] -1.9494 0.4711 0.9777 -0.4859 -0.0918 ...
  .. ..$ hc    : num [1:7594] 0 0 1 0 1 0 1 0 0 1 ...
  .. ..$ marst : Factor w/ 6 levels "married","cohab",..: 4 1 6 1 1 1 1 1 1 1 ...
  .. .. ..- attr(*, "contrasts")= num [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:6] "married" "cohab" "divorced" "nm" ...
  .. .. .. .. ..$ : chr [1:5] "2" "3" "4" "5" ...
  .. ..$ lowact: num [1:7594] 0 1 1 1 0 0 0 0 0 0 ...
  .. ..$ smoke : num [1:7594] 0 0 0 0 0 0 0 1 0 0 ...
  .. ..$ inc   : Factor w/ 5 levels "1","2","3","4",..: 5 5 2 4 5 5 5 5 5 4 ...
  .. .. ..- attr(*, "contrasts")= num [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. .. .. .. ..$ : chr [1:4] "2" "3" "4" "5"
  ..$ :'data.frame':	7594 obs. of  8 variables:
  .. ..$ ststr : int [1:7594] 48011 48011 48011 48011 48011 48011 48011 48011 48011 48011 ...
  .. ..$ swts  : num [1:7594(1d)] 4.632 1.012 0.289 1.222 0.728 ...
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:7594] "48 48011" "48 48011" "48 48011" "48 48011" ...
  .. ..$ agez  : num [1:7594, 1] -1.9494 0.4711 0.9777 -0.4859 -0.0918 ...
  .. ..$ hc    : num [1:7594] 0 0 1 1 1 0 1 0 0 1 ...
  .. ..$ marst : Factor w/ 6 levels "married","cohab",..: 4 1 6 1 1 1 1 1 1 1 ...
  .. .. ..- attr(*, "contrasts")= num [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:6] "married" "cohab" "divorced" "nm" ...
  .. .. .. .. ..$ : chr [1:5] "2" "3" "4" "5" ...
  .. ..$ lowact: num [1:7594] 0 1 1 1 0 0 0 0 0 0 ...
  .. ..$ smoke : num [1:7594] 0 0 0 0 0 0 0 1 0 0 ...
  .. ..$ inc   : Factor w/ 5 levels "1","2","3","4",..: 5 5 2 4 5 5 5 5 5 4 ...
  .. .. ..- attr(*, "contrasts")= num [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. .. .. .. ..$ : chr [1:4] "2" "3" "4" "5"
  ..$ :'data.frame':	7594 obs. of  8 variables:
  .. ..$ ststr : int [1:7594] 48011 48011 48011 48011 48011 48011 48011 48011 48011 48011 ...
  .. ..$ swts  : num [1:7594(1d)] 4.632 1.012 0.289 1.222 0.728 ...
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:7594] "48 48011" "48 48011" "48 48011" "48 48011" ...
  .. ..$ agez  : num [1:7594, 1] -1.9494 0.4711 0.9777 -0.4859 -0.0918 ...
  .. ..$ hc    : num [1:7594] 0 0 1 1 1 0 1 0 0 1 ...
  .. ..$ marst : Factor w/ 6 levels "married","cohab",..: 4 1 6 1 1 1 1 1 1 1 ...
  .. .. ..- attr(*, "contrasts")= num [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:6] "married" "cohab" "divorced" "nm" ...
  .. .. .. .. ..$ : chr [1:5] "2" "3" "4" "5" ...
  .. ..$ lowact: num [1:7594] 0 1 1 1 0 0 0 0 0 0 ...
  .. ..$ smoke : num [1:7594] 0 0 0 0 0 0 0 1 0 0 ...
  .. ..$ inc   : Factor w/ 5 levels "1","2","3","4",..: 5 5 2 4 5 5 5 5 5 4 ...
  .. .. ..- attr(*, "contrasts")= num [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. .. .. .. ..$ : chr [1:4] "2" "3" "4" "5"
  ..$ :'data.frame':	7594 obs. of  8 variables:
  .. ..$ ststr : int [1:7594] 48011 48011 48011 48011 48011 48011 48011 48011 48011 48011 ...
  .. ..$ swts  : num [1:7594(1d)] 4.632 1.012 0.289 1.222 0.728 ...
  .. .. ..- attr(*, "dimnames")=List of 1
  .. .. .. ..$ : chr [1:7594] "48 48011" "48 48011" "48 48011" "48 48011" ...
  .. ..$ agez  : num [1:7594, 1] -1.9494 0.4711 0.9777 -0.4859 -0.0918 ...
  .. ..$ hc    : num [1:7594] 0 0 1 0 1 0 1 0 0 1 ...
  .. ..$ marst : Factor w/ 6 levels "married","cohab",..: 4 1 6 1 1 1 1 1 1 1 ...
  .. .. ..- attr(*, "contrasts")= num [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:6] "married" "cohab" "divorced" "nm" ...
  .. .. .. .. ..$ : chr [1:5] "2" "3" "4" "5" ...
  .. ..$ lowact: num [1:7594] 0 1 1 1 0 0 0 0 0 0 ...
  .. ..$ smoke : num [1:7594] 0 0 0 0 0 0 0 1 0 0 ...
  .. ..$ inc   : Factor w/ 5 levels "1","2","3","4",..: 5 5 2 4 5 5 5 5 5 4 ...
  .. .. ..- attr(*, "contrasts")= num [1:5, 1:4] 0 1 0 0 0 0 0 1 0 0 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:5] "1" "2" "3" "4" ...
  .. .. .. .. ..$ : chr [1:4] "2" "3" "4" "5"
 $ call       : language mitools::imputationList(lapply(1:5, FUN = mice::complete, x = imp))
 - attr(*, "class")= chr "imputationList"
the list of imputations
            Length Class  Mode
imputations 5      -none- list
call        2      -none- call
the list of imputations propagated into the survey design
'svyimputationList'
        Length Class  Mode
designs 5      -none- list
call    6      -none- call

Step 2b2) Conduct Survey Analysis (with Missing Data)

Descriptives

In [13]:
########################
# descriptives
########################
summary(des_imputed)

message('proportion of smoke after imputation')
MIcombine(with(des_imputed, svymean(~as.factor(smoke))))

message('total counts of smoke after imputation')
MIcombine(with(des_imputed, svytotal(~as.factor(smoke))))

message('compare with non-imputed design ')

message('proportion of smoke on non-imputed')
svymean(~as.factor(smoke),design=des_non_impute)

message('proportion of smoke on case-deleted')
svymean(~as.factor(smoke),design=des_non_impute,na.rm=TRUE)

message('total counts of smoke on case-deleted')
svytotal(~as.factor(smoke),design=des_non_impute,na.rm=TRUE)
        Length Class  Mode
designs 5      -none- list
call    6      -none- call
proportion of smoke after imputation
Multiple imputation results:
      with(des_imputed, svymean(~as.factor(smoke)))
      MIcombine.default(with(des_imputed, svymean(~as.factor(smoke))))
                    results          se
as.factor(smoke)0 0.8545753 0.005507924
as.factor(smoke)1 0.1454247 0.005507924
total counts of smoke after imputation
Multiple imputation results:
      with(des_imputed, svytotal(~as.factor(smoke)))
      MIcombine.default(with(des_imputed, svytotal(~as.factor(smoke))))
                   results       se
as.factor(smoke)0 6489.645 71.29230
as.factor(smoke)1 1104.355 43.89727
compare with non-imputed design 
proportion of smoke on non-imputed
                  mean SE
as.factor(smoke)0   NA NA
as.factor(smoke)1   NA NA
proportion of smoke on case-deleted
                     mean     SE
as.factor(smoke)0 0.85476 0.0055
as.factor(smoke)1 0.14524 0.0055
total counts of smoke on case-deleted
                   total     SE
as.factor(smoke)0 6438.5 70.982
as.factor(smoke)1 1094.1 43.478

Regression

In [14]:
########################################
# smoke as outcome in logistic regression
# run model with svyglm using 5 imputed data sets contained in design
# NOTE: regression estimates / inference are 'design-based' 
########################################

fit_imputed_smoke <- with(des_imputed, svyglm(smoke ~ agez + inc + lowact,
                                        family=quasibinomial))

message('glm results from 1st imputed dataset')
fit_imputed_smoke[[1]]

message('glm results from 5th imputed dataset')
fit_imputed_smoke[[5]]

message('glm results from combining all imputed datasets')
summary(MIcombine(fit_imputed_smoke))


message('compare glm results with non-imputed design')
message('note: by default, svyglm() (like glm()) omits rows with NAs (case deletion)')

fit_non_impute_smoke = svyglm(design=des_non_impute,
                              formula=smoke ~ agez + inc + lowact,
                              family=quasibinomial())

summary(fit_non_impute_smoke)
glm results from 1st imputed dataset
Stratified Independent Sampling design (with replacement)
svydesign(ids = ids, probs = probs, strata = strata, variables = variables, 
    fpc = fpc, nest = nest, check.strata = check.strata, weights = weights, 
    data = d, pps = pps, ...)

Call:  svyglm(formula = smoke ~ agez + inc + lowact, family = quasibinomial, 
    design = .design)

Coefficients:
(Intercept)         agez         inc2         inc3         inc4         inc5  
    -1.3418      -0.2405      -0.2631      -0.3877      -0.6893      -1.2173  
     lowact  
     0.3280  

Degrees of Freedom: 7593 Total (i.e. Null);  7558 Residual
Null Deviance:	    6291 
Residual Deviance: 5981 	AIC: NA
glm results from 5th imputed dataset
Stratified Independent Sampling design (with replacement)
svydesign(ids = ids, probs = probs, strata = strata, variables = variables, 
    fpc = fpc, nest = nest, check.strata = check.strata, weights = weights, 
    data = d, pps = pps, ...)

Call:  svyglm(formula = smoke ~ agez + inc + lowact, family = quasibinomial, 
    design = .design)

Coefficients:
(Intercept)         agez         inc2         inc3         inc4         inc5  
    -1.3563      -0.2416      -0.1349      -0.3616      -0.7495      -1.1490  
     lowact  
     0.2588  

Degrees of Freedom: 7593 Total (i.e. Null);  7558 Residual
Null Deviance:	    6282 
Residual Deviance: 5987 	AIC: NA
glm results from combining all imputed datasets
Multiple imputation results:
      with(des_imputed, svyglm(smoke ~ agez + inc + lowact, family = quasibinomial))
      MIcombine.default(fit_imputed_smoke)
               results        se      (lower      upper) missInfo
(Intercept) -1.3072243 0.1406189 -1.58546226 -1.02898643     19 %
agez        -0.2431934 0.0406350 -0.32283712 -0.16354962      1 %
inc2        -0.2596865 0.1771862 -0.61513972  0.09576678     30 %
inc3        -0.3769678 0.1790426 -0.72820138 -0.02573428      6 %
inc4        -0.7739785 0.1975982 -1.16782327 -0.38013366     25 %
inc5        -1.1990073 0.1435594 -1.48171122 -0.91630333     13 %
lowact       0.2705990 0.1010996  0.07062494  0.47057300     19 %
compare glm results with non-imputed design
note: by default, svyglm() (like glm()) omits rows with NAs (case deletion)
Call:
svyglm(design = des_non_impute, formula = smoke ~ agez + inc + 
    lowact, family = quasibinomial())

Survey design:
svydesign(ids = ~1, strata = ~ststr, weights = ~swts, data = brfss_11)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.26656    0.13886  -9.121  < 2e-16 ***
agez        -0.24541    0.04689  -5.234 1.72e-07 ***
inc2        -0.32622    0.16661  -1.958   0.0503 .  
inc3        -0.38498    0.19005  -2.026   0.0428 *  
inc4        -0.75757    0.18955  -3.997 6.50e-05 ***
inc5        -1.19397    0.14582  -8.188 3.21e-16 ***
lowact       0.25195    0.10040   2.509   0.0121 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasibinomial family taken to be 1.004407)

Number of Fisher Scoring iterations: 4

Appendix

Useful resources

https://www.jstatsoft.org/article/view/v009i08

http://www.asdfree.com/search/label/american%20community%20survey%20%28acs%29

http://stats.idre.ucla.edu/other/mult-pkg/faq/faq-choosing-the-correct-analysis-for-various-survey-designs/

NOTE: MICE is easy to implement and easy to abuse, especially if the user does not want to get bogged down in the potential outcomes literature.

joint vs conditional imputation https://www.ncbi.nlm.nih.gov/pubmed/17621469

software for multivariate "joint model" imputation https://www.jstatsoft.org/article/view/v070i01