Sites for more info:
- Jon Helm

Notes from Jon’s Lecture Slides

Patterns of missingness

  • Univariate (uncommon)
  • General (missing on more than one variable, typically non-monotonic)

Causes and Mechanisms of missingness

CAUSES

  • True reason (not the assumed reason) that data are missing

MECHANISMS

MCAR: Missing completely at random

  • Missingness on Y is unrelated to the expected Y; unrelated to any observed Xs.
  • When we run analyses without first accounting for missingness, we are inherently assuming MCAR (which may be wrong).
  • Unfortunately, MCAR can never be truly known, because we would need the actual complete data to test the first part of the assumption (that Ymiss is unrelated to Ycomplete).

MAR: Missing at random

  • Missingness on Y is unrelated to the expected Y, after adjusting for Xs (which are related; i.e., there is some variable in the analysis that can account for the missingness).
  • We can test this by seeing whether missingness indicators are related to the observed Xs in the data.
  • This has shortcomings too however, particularly when dealing with multivariate missingness (it creates exponential patterns, which drastically reduces power and sample size for any MAR testing, and inflates Type 1 error due to multiple tests [e.g., even with 3 variables, we can have 8 patterns of missingness - i.e., 8 subgroups to test against each other])
  • This can never be known either, because we still need the complete data to know whether Ymiss is unrelated to Ycomplete.

NMAR/MNAR: Not Missing at Random/Missing Not at Random

  • Missingness on Y is related to the expected Y, after adjusting for Xs (which are related; i.e., regardless of other variables in the analysis, Y is still missing because of what Y would have been).
  • This can never be known for certain either (in all cases you need complete data).

QUESTION – Is there any empirical way to test/estimate the standard (or expected) rate of missingness for a particular variable in a particular population?

This is partly an issue of power, but it matters more WHY the data are missing. Even if you expect the missingness, you are still making assumptions about why it occurs (one of the mechanisms above) - and so are all the prior studies from which you derive the expected missingness rate (e.g., diary studies generally assume 80% retention/20% attrition due to boredom or fatigue).


Imputation Strategies

Single Imputation

  • Replacing missing values with a best guess
  • Different approaches to this process include (but are not limited to):
    :::: Mean imputation
    :::: Hot deck imputation, Cold deck imputation
    :::: LOCF, NOCB, or Sandwich methods
    :::: Regression-based imputation

  • Not as widely used; it presents issues - generally more drawbacks than benefits (see the sketch after this list)
    :::: Regression-based single imputation doesn't include stochastic elements such as the standard errors of the estimates and the residual of the model.
    :::: Multiple imputation helps include some of this stochastic information.
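
Neither approach is shown in code in the slides; a minimal sketch of mean and regression-based single imputation (df, x, and y are hypothetical names, not from the course data) might look like:

# Single imputation sketches; df, x, and y are hypothetical
df_mean <- df
df_mean$y[is.na(df_mean$y)] <- mean(df$y, na.rm = TRUE)     # mean imputation

fit <- lm(y ~ x, data = df)                                 # fit on complete cases
df_reg <- df
idx <- is.na(df_reg$y) & !is.na(df_reg$x)                   # rows we can predict
df_reg$y[idx] <- predict(fit, newdata = df_reg[idx, ])      # regression-based best guess

Both replacements are deterministic - exactly the missing stochastic element noted above.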

Multiple Imputation

  • Averaging betas across multiple imputed data sets.
  • Standard errors are handled differently:
    :::: variability WITHIN imputations - the average of the squared standard errors (the variances within each imputation)
    :::: variability BETWEEN imputations - the variance of the estimates across imputations
    :::: We calculate them as follows (see slides for an example):

W = (SE_1^2 + SE_2^2 + ... + SE_M^2) / M

B = var(b0_1, b0_2, ..., b0_M)

V_b0 = W + (1 + 1/M) * B

SE_b0 = SQRT(V_b0)

where M is the number of imputations and b0_m is the estimate from the m-th imputed data set.
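
These rules are easy to check by hand; a minimal R sketch (pool_rubin, betas, and ses are my own hypothetical names, not from the slides):

# Rubin's rules: pool one coefficient across M imputed data sets
# betas = per-imputation estimates; ses = their standard errors
pool_rubin <- function(betas, ses) {
  M <- length(betas)
  W <- mean(ses^2)                # within-imputation variance
  B <- var(betas)                 # between-imputation variance
  V <- W + (1 + 1/M) * B          # total variance of the pooled estimate
  c(estimate = mean(betas), se = sqrt(V))
}

pool_rubin(betas = c(1.02, 0.97, 1.05), ses = c(0.21, 0.20, 0.22))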


QUESTION – Would it be appropriate to conduct beta comparisons across imputed vs. observed groups (assuming sufficiently large subgroups) to determine whether you are getting particularly different estimates from the original data (e.g., a Wald test)? If we don’t test, how do we know that imputation is giving us the results we actually want?

QUESTION – Do you get different results when using a bootstrapping approach (as MI normally does) vs. a simulation approach (i.e., guesses based on the population estimates derived from the data)? This can be tested, no?


In practice, multiple imputation software RARELY uses multiple regression to conduct the procedure. The actual algorithm is a bit more technical and is based on multivariate normality.

Typically, multiple imputation will perform well if one of the variables in the imputation model accounts for the missingness (i.e., the data are MAR). When data are MAR, one of the X variables really does explain scores on Y, so if that X is in the MI model, the best guesses will be pretty good.

Typically, it also works if missingness is not related to any variable in the data set (i.e., the data are MCAR).

It does NOT work well if you have NMAR data. MI will produce biased guesses with NMAR data because your complete observations are a non-random sample of a random sample.

Data Analysis

Part 1. Imputing the Data

Start by importing the data.

ID Math01 Math02 Math03 Portu01 Portu02 Portu03 Sex Rel_status Guardian Reason
1 7 NA NA 13 13 13 F no mother home
2 8 NA 5 13 NA 11 F yes mother reputation
3 14 13 13 NA 13 12 F no mother reputation
4 10 9 NA 10 11 NA F no mother course
5 10 10 10 13 13 13 F yes other reputation
6 NA NA 11 11 NA NA F no mother course

The mice package (Multivariate Imputation by Chained Equations) can be used for basic multiple imputation procedures.

We will use it with the Academic Achievement data (data set ‘aa’)

To perform the MI use:

library(mice)
imp_data = mice(aa, m = 40, seed = 142)

We can view the results of one of the imputations using the complete() function (nested within the head() function so we don’t get the whole thing)
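
Concretely (the exact line isn't shown in the notes), that call would be:

head(complete(imp_data, 1))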

ID Math01 Math02 Math03 Portu01 Portu02 Portu03 Sex Rel_status Guardian Reason
1 7 10 8 13 13 13 F no mother home
2 8 0 5 13 10 11 F yes mother reputation
3 14 13 13 13 13 12 F no mother reputation
4 10 9 10 10 11 11 F no mother course
5 10 10 10 13 13 13 F yes other reputation
6 13 10 11 11 11 13 F no mother course

Part 2. Conducting the Analyses

t-tests using the car package

Now we can perform a t-test (via ANOVA) using some of the variables from the Academic Achievement data set. We will need the car package to calculate type 3 sums of squares.

As an initial step, we need to make sure we are using an effect coding strategy. This detail is specific to ANOVA, not multiple imputation.

Now let’s fit the model that tests for biological sex differences across Math scores that were collected from the first semester. After fitting the model like a regression, we can use the Anova() function to get the ANOVA table from the output; a sketch of these calls follows.
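
The code itself isn't reproduced here, but calls consistent with the output below would be (lm_sex is my own name):

options(contrasts = c("contr.sum", "contr.poly"))   # effect coding, needed for Type 3 SS
library(car)
lm_sex <- lm(Math01 ~ Sex, data = aa)
Anova(lm_sex, type = 3)                             # ANOVA table
summary(lm_sex)                                     # regression-style output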

Sum Sq Df F value Pr(>F)
(Intercept) 31987.14071 1 2919.262676 0.0000000
Sex 31.36543 1 2.862523 0.0918406
Residuals 2903.67577 265 NA NA
## 
## Call:
## lm(formula = Math01 ~ Sex, data = aa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6087 -2.6087 -0.6087  2.3913  8.3913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.9516     0.2027  54.030   <2e-16 ***
## Sex1         -0.3429     0.2027  -1.692   0.0918 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.31 on 265 degrees of freedom
##   (115 observations deleted due to missingness)
## Multiple R-squared:  0.01069,    Adjusted R-squared:  0.006953 
## F-statistic: 2.863 on 1 and 265 DF,  p-value: 0.09184

Just to follow up, we can also perform a t-test using the sample data.
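
The corresponding call (not shown in the notes) would be:

t.test(Math01 ~ Sex, data = aa)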

## 
##  Welch Two Sample t-test
## 
## data:  Math01 by Sex
## t = -1.6896, df = 261.96, p-value = 0.09229
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4851991  0.1134431
## sample estimates:
## mean in group F mean in group M 
##        10.60870        11.29457

Performing a t-test (via ANOVA) with Multiply Imputed Data Sets

First we impute the data using:

imp_data2 = mice(aa, m = 40, seed = 142)

Quick check of the imputed data:

ID Math01 Math02 Math03 Portu01 Portu02 Portu03 Sex Rel_status Guardian Reason
1 7 8 8 13 13 13 F no mother home
2 8 0 5 13 12 11 F yes mother reputation
3 14 13 13 14 13 12 F no mother reputation
4 10 9 9 10 11 11 F no mother course
5 10 10 10 13 13 13 F yes other reputation
6 16 12 11 11 12 14 F no mother course

Perform the analysis on each of the imputed data sets with the ‘mi.anova()’ function from the ‘miceadds’ library.
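
The call isn't shown in the notes; assuming mi.anova()'s mi.res/formula/type interface (check the miceadds documentation), it would look like:

library(miceadds)
mi.anova(mi.res = imp_data2, formula = "Math01 ~ Sex", type = 3)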

## Univariate ANOVA for Multiply Imputed Data (Type 3)  
## 
## lm Formula:  Math01 ~ 1 + Sex
## R^2=0.02 
## ..........................................................................
## ANOVA Table 
##                 SSQ df1      df2 F value  Pr(>F)    eta2 partial.eta2
## Sex        85.94995   1 6839.481   7.088 0.00778 0.01998      0.01998
## Residual 4215.18304  NA       NA      NA      NA      NA           NA

Practice problem: Using MI with salary data

Using the salary data set:

  • Perform a t-test on salary using biological sex as a predictor
  • Create multiply imputed data sets (use 40 imputations, set the seed equal to 806)
  • Perform the t-test across all data sets
  • Combine the results across data sets

## 
##  Welch Two Sample t-test
## 
## data:  salary by BioSex
## t = -0.42543, df = 58.688, p-value = 0.6721
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8758.271  5687.361
## sample estimates:
## mean in group F mean in group M 
##        98438.25        99973.70
The complete-case ANOVA (Type 3) on the same data gives:

Sum Sq Df F value Pr(>F)
(Intercept) 652934818077 1 2912.7660488 0.0000000
BioSex 39102846 1 0.1744392 0.6775122
Residuals 15243094325 68 NA NA

Impute the data using: imp_data.prac <- mice(salary, m=40, seed=806)

Perform the t-test across all imputed data sets (via Anova()), and combine them using the mi.anova() function:
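
As before, a call consistent with the output below (same interface assumption) would be:

mi.anova(mi.res = imp_data.prac, formula = "salary ~ BioSex", type = 3)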

## Univariate ANOVA for Multiply Imputed Data (Type 3)  
## 
## lm Formula:  salary ~ 1 + BioSex
## R^2=0.0038 
## ..........................................................................
## ANOVA Table 
##                  SSQ df1      df2 F value  Pr(>F)    eta2 partial.eta2
## BioSex      82139648   1 3481.781  0.2335 0.62899 0.00385      0.00385
## Residual 21270000828  NA       NA      NA      NA      NA           NA

Part 3. Doing other ANOVAs (including two-way ANOVA) in MI

As an initial step, we need to make sure we are using an effect coding strategy. This detail is specific to ANOVA, not multiple imputation.

Now let’s fit the two-way model that tests for guardian differences by Sex across Math scores that were collected from the first semester.
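
Calls consistent with the output below would be (lm_two is my own name):

options(contrasts = c("contr.sum", "contr.poly"))   # effect coding again
lm_two <- lm(Math01 ~ Guardian * Sex, data = aa)
Anova(lm_two, type = 3)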

Sum Sq Df F value Pr(>F)
(Intercept) 10157.636245 1 936.7339101 0.0000000
Guardian 62.423324 2 2.8783293 0.0580154
Sex 25.022233 1 2.3075422 0.1299579
Guardian:Sex 4.810075 2 0.2217918 0.8012330
Residuals 2830.198663 261 NA NA

Now to run the MI and compare the results.

Using the following code: imp_data3 <- mice(aa, m=40, seed=142)

Perform the analysis on each of the imputed data sets with the ‘mi.anova()’ function from the ‘miceadds’ library.
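
The combining call, under the same mi.anova() interface assumption, would be:

mi.anova(mi.res = imp_data3, formula = "Math01 ~ Guardian*Sex", type = 3)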

## Univariate ANOVA for Multiply Imputed Data (Type 3)  
## 
## lm Formula:  Math01 ~ 1 + Guardian + Sex + Guardian*Sex
## R^2=0.0284 
## ..........................................................................
## ANOVA Table 
##                     SSQ df1      df2 F value  Pr(>F)    eta2 partial.eta2
## Guardian       50.70465   2 30568.98  2.1910 0.11182 0.01192      0.01212
## Sex            54.89645   1 37976.04  4.7974 0.02851 0.01291      0.01311
## Guardian:Sex   15.10919   2 49344.50  0.6391 0.52776 0.00355      0.00364
## Residual     4131.92524  NA       NA      NA      NA      NA           NA


Part 4. Correlations with MI

We can start by calculating a correlation between two variables.
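
The call producing the output below is:

cor.test(aa$Math01, aa$Math02)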

## 
##  Pearson's product-moment correlation
## 
## data:  aa$Math01 and aa$Math02
## t = 24.486, df = 177, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8403011 0.9082913
## sample estimates:
##       cor 
## 0.8786774

Now we can run a combined correlation across the MI data sets we generated in Part 3 above.
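
The output format below matches miceadds::micombine.cor(); a call like the following (an assumption on my part, since the notes don't show it) would produce it:

library(miceadds)
micombine.cor(mi.res = imp_data3, variables = c("Math01", "Math02"))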

variable1 variable2 r rse fisher_r fisher_rse fmi t p lower95 upper95
Math01 Math02 0.8811656 0.0171152 1.380958 0.0766292 0.5574908 18.02131 0 0.8428018 0.9106209
Math02 Math01 0.8811656 0.0171152 1.380958 0.0766292 0.5574908 18.02131 0 0.8428018 0.9106209

Part 5. Regression with MI

Same basic process as above.

First, test the model with the observed data:

## 
## Call:
## lm(formula = Math02 ~ 1 + Math01 + Portu01, data = aa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4305  -0.8456  -0.0402   0.9966   4.5778 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.38681    0.77343  -0.500   0.6179    
## Math01       0.86714    0.05984  14.490   <2e-16 ***
## Portu01      0.14116    0.07807   1.808   0.0731 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.672 on 122 degrees of freedom
##   (257 observations deleted due to missingness)
## Multiple R-squared:  0.7629, Adjusted R-squared:  0.759 
## F-statistic: 196.3 on 2 and 122 DF,  p-value: < 2.2e-16

We use the with() function to run this model across all the MI data sets stored in the imp_data3 object we made earlier in Part 3, then pool the results (see below).

results = with(imp_data3, lm(Math02 ~ 1 + Math01 + Portu01))
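
The pooled coefficients below come from combining the per-imputation fits with Rubin's rules, which mice does via pool():

summary(pool(results))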

estimate std.error statistic df p.value
(Intercept) -0.8268730 0.5957985 -1.38784 86.40788 0.1687541
Math01 0.9075514 0.0429691 21.12103 106.55834 0.0000000
Portu01 0.1368116 0.0599034 2.28387 80.92820 0.0250000