Sites for more info:
- Jon Helm
The scientific process involves random sampling. When we have missingness, the random sampling assumption is violated, and generalizability is undermined as a result.
Performing deletion methods essentially gives us a non-random sample of a random sample. Once we get to this point, we can no longer generalize to the population. Listwise deletion makes this a problem because it is too aggressive (though pairwise deletion presents issues as well).
Most data analytic programs will perform listwise deletion before conducting standard analytic procedures (and will not necessarily inform you, or will bury it in the footnotes of the output). Be careful, because listwise deletion always assumes MCAR. If the data are MAR or NMAR, you obtain biased estimates.
MCAR: Missing completely at random
MAR: Missing at random
NMAR/MNAR: Not Missing at Random/Missing Not at Random
QUESTION – Is there any empirical way to test/estimate the standard (or expected) rate of missingness for a particular variable in a particular population?
This is an issue related to power, but it matters more WHY the data are missing: even if you expect the missingness, you are still making assumptions about why (one of the mechanisms above) - and so are all the prior studies from which you derive the expected missingness value (e.g., diary studies generally assume 80% retention/20% attrition, usually attributed to boredom or fatigue).
Different approaches to this process include (but are not limited to):
:::: Mean imputation
:::: Hot deck imputation, Cold deck imputation
:::: LOCF, NOCB, or Sandwich methods
:::: Regression-based imputation
These single-imputation approaches are not as widely used and present some issues, though they generally have more benefits than drawbacks.
:::: Regression-based single imputation doesn't include stochastic elements such as the standard errors of the estimates and the residuals of the model.
:::: Multiple imputation helps include some of this stochastic information.
within (W) = (SE1^2 + SE2^2 + … + SE_M^2) / M, where M = number of imputations
between (B) = var(b0_1, b0_2, …, b0_M), the variance of the estimate across the imputed data sets
V_b0 = W + (1 + 1/M) * B
se_b0 = sqrt(V_b0)
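The pooling formulas above can be sketched in base R; the estimates and standard errors below are made-up illustrative numbers, not values from the course data.

```r
# Rubin's rules for pooling an estimate (here, an intercept b0) across
# M imputed data sets.
b0 <- c(10.2, 10.5, 10.1)   # per-imputation estimates (illustrative)
se <- c(0.30, 0.28, 0.33)   # per-imputation standard errors (illustrative)
M  <- length(b0)

W <- mean(se^2)              # within-imputation variance
B <- var(b0)                 # between-imputation variance
V_b0  <- W + (1 + 1/M) * B   # total variance of the pooled estimate
se_b0 <- sqrt(V_b0)          # pooled standard error

pooled_b0 <- mean(b0)        # pooled point estimate
```

Note that the between-imputation term carries the extra (1 + 1/M) factor, which is why the pooled SE is larger than a naive average of the per-imputation SEs.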
QUESTION – Would it be appropriate to conduct beta comparisons across imputed vs. observed groups (assuming sufficiently large subgroups), to determine whether you are getting particularly different estimates from original data (e.g., a Wald test)? If we don’t test, how do we know that imputation is giving us the results we actually want?
QUESTION – Do you get different results when using a bootstrapping approach (as MI normally does) vs. a simulation approach (i.e., guesses based on the population estimates derived from the data)? This can be tested, no?
In practice, Multiple Imputation software RARELY uses multiple regression to conduct the procedure. The approach is a bit more technical and is based on multivariate normality.
Typically, Multiple imputation will perform well if one of the variables in the imputation accounts for the missingness (i.e., the data are MAR). - When data are MAR, one of the X variables really does explain scores on Y, so if that X is in the MI model, the best guesses will be pretty good.
Typically, it also works if missingness is not related to any variable in the data set (i.e., the data are MCAR)
It does NOT work well if you have NMAR data. - MI will produce biased guesses with NMAR data because your complete observations are a non-random sample of a random sample.
Start by importing the data.
salary <- read.csv("3. Data Sets/data_salary.csv", header=T)
aa <- read.csv("3. Data Sets/data_AcadAchiev.csv", header=T)
# set.seed(806)
head(aa)
ID | Math01 | Math02 | Math03 | Portu01 | Portu02 | Portu03 | Sex | Rel_status | Guardian | Reason |
---|---|---|---|---|---|---|---|---|---|---|
1 | 7 | NA | NA | 13 | 13 | 13 | F | no | mother | home |
2 | 8 | NA | 5 | 13 | NA | 11 | F | yes | mother | reputation |
3 | 14 | 13 | 13 | NA | 13 | 12 | F | no | mother | reputation |
4 | 10 | 9 | NA | 10 | 11 | NA | F | no | mother | course |
5 | 10 | 10 | 10 | 13 | 13 | 13 | F | yes | other | reputation |
6 | NA | NA | 11 | 11 | NA | NA | F | no | mother | course |
The MICE program (Multiple Imputation by Chained Equations) can be used for basic multiple imputation procedures.
We will use it with the Academic Achievement data (data set ‘aa’)
To perform the MI (with the mice library loaded), use:
imp_data = mice(aa, m = 40, seed = 142)
We can view the results of one of the imputations using the complete() function (nested within the head() function so we don’t get the whole thing)
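A sketch of that call, assuming the imp_data object created above:

```r
# complete() extracts one filled-in data set (here, imputation 1);
# head() keeps the printout to the first six rows.
head(complete(imp_data, 1))
```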
ID | Math01 | Math02 | Math03 | Portu01 | Portu02 | Portu03 | Sex | Rel_status | Guardian | Reason |
---|---|---|---|---|---|---|---|---|---|---|
1 | 7 | 10 | 8 | 13 | 13 | 13 | F | no | mother | home |
2 | 8 | 0 | 5 | 13 | 10 | 11 | F | yes | mother | reputation |
3 | 14 | 13 | 13 | 13 | 13 | 12 | F | no | mother | reputation |
4 | 10 | 9 | 10 | 10 | 11 | 11 | F | no | mother | course |
5 | 10 | 10 | 10 | 13 | 13 | 13 | F | yes | other | reputation |
6 | 13 | 10 | 11 | 11 | 11 | 13 | F | no | mother | course |
Now we can perform a t-test (via ANOVA) using some of the variables from the Academic Achievement data set. We will need the car package to calculate type 3 sums of squares.
As an initial step, we need to make sure we are using an effect coding strategy. This detail is specific to ANOVA, not multiple imputation.
#adding this statement makes the results of ANOVA similar to those generated in standard packages like SPSS.
options(contrasts = c('contr.sum', 'contr.poly'))
Now let's fit the model that tests for biological sex differences in Math scores collected during the first semester. After fitting the model like a regression, we can use the Anova() function to get the ANOVA table from the output.
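The fitting step itself is not printed above; a sketch, assuming the aa data and the car library (the object name model.01 is ours):

```r
library(car)  # provides Anova() with type 3 sums of squares

# One-way model: first-semester math score predicted by biological sex
model.01 <- lm(Math01 ~ Sex, data = aa)
Anova(model.01, type = 3)
```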
Sum Sq | Df | F value | Pr(>F) | |
---|---|---|---|---|
(Intercept) | 31987.14071 | 1 | 2919.262676 | 0.0000000 |
Sex | 31.36543 | 1 | 2.862523 | 0.0918406 |
Residuals | 2903.67577 | 265 | NA | NA |
##
## Call:
## lm(formula = Math01 ~ Sex, data = aa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6087 -2.6087 -0.6087 2.3913 8.3913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9516 0.2027 54.030 <2e-16 ***
## Sex1 -0.3429 0.2027 -1.692 0.0918 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.31 on 265 degrees of freedom
## (115 observations deleted due to missingness)
## Multiple R-squared: 0.01069, Adjusted R-squared: 0.006953
## F-statistic: 2.863 on 1 and 265 DF, p-value: 0.09184
Just to follow up, we can also perform a t-test using the sample data.
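A sketch of that call on the observed data (t.test() drops the missing cases on its own):

```r
# Welch two-sample t-test of first-semester math by biological sex
t.test(Math01 ~ Sex, data = aa)
```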
##
## Welch Two Sample t-test
##
## data: Math01 by Sex
## t = -1.6896, df = 261.96, p-value = 0.09229
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.4851991 0.1134431
## sample estimates:
## mean in group F mean in group M
## 10.60870 11.29457
First we impute the data using:
imp_data2 = mice(aa, m = 40, seed = 142)
Quick check of the imputed data:
ID | Math01 | Math02 | Math03 | Portu01 | Portu02 | Portu03 | Sex | Rel_status | Guardian | Reason |
---|---|---|---|---|---|---|---|---|---|---|
1 | 7 | 8 | 8 | 13 | 13 | 13 | F | no | mother | home |
2 | 8 | 0 | 5 | 13 | 12 | 11 | F | yes | mother | reputation |
3 | 14 | 13 | 13 | 14 | 13 | 12 | F | no | mother | reputation |
4 | 10 | 9 | 9 | 10 | 11 | 11 | F | no | mother | course |
5 | 10 | 10 | 10 | 13 | 13 | 13 | F | yes | other | reputation |
6 | 16 | 12 | 11 | 11 | 12 | 14 | F | no | mother | course |
Perform the analysis on each of the imputed data sets with the ‘mi.anova()’ function from the ‘miceadds’ library.
options(contrasts = c('contr.sum', 'contr.poly'))
anova2<-mi.anova(mi.res = imp_data2,
formula = "Math01 ~ 1 + Sex",
type = 3)
## Univariate ANOVA for Multiply Imputed Data (Type 3)
##
## lm Formula: Math01 ~ 1 + Sex
## R^2=0.02
## ..........................................................................
## ANOVA Table
## SSQ df1 df2 F value Pr(>F) eta2 partial.eta2
## Sex 85.94995 1 6839.481 7.088 0.00778 0.01998 0.01998
## Residual 4215.18304 NA NA NA NA NA NA
Using the salary data set:
:::: Perform a t-test on salary using biological sex as a predictor
:::: Create multiply imputed data sets (use 40 imputations, set the seed equal to 806)
:::: Perform the t-test across all data sets
:::: Combine the results across data sets
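A sketch of the first step, assuming the salary data frame has columns named salary and BioSex (per the output that follows):

```r
# Welch two-sample t-test of salary by biological sex on the observed data
t.test(salary ~ BioSex, data = salary)
```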
##
## Welch Two Sample t-test
##
## data: salary by BioSex
## t = -0.42543, df = 58.688, p-value = 0.6721
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8758.271 5687.361
## sample estimates:
## mean in group F mean in group M
## 98438.25 99973.70
Sum Sq | Df | F value | Pr(>F) | |
---|---|---|---|---|
(Intercept) | 652934818077 | 1 | 2912.7660488 | 0.0000000 |
BioSex | 39102846 | 1 | 0.1744392 | 0.6775122 |
Residuals | 15243094325 | 68 | NA | NA |
Impute the data using: imp_data.prac <- mice(salary, m=40, seed=806)
Perform the t-test across all imputed data sets (using Anova()), and combine them using the mi.anova() function:
options(contrasts = c('contr.sum', 'contr.poly'))
anovapracMI<-mi.anova(mi.res = imp_data.prac,
formula = "salary ~ 1 + BioSex",
type = 3)
## Univariate ANOVA for Multiply Imputed Data (Type 3)
##
## lm Formula: salary ~ 1 + BioSex
## R^2=0.0038
## ..........................................................................
## ANOVA Table
## SSQ df1 df2 F value Pr(>F) eta2 partial.eta2
## BioSex 82139648 1 3481.781 0.2335 0.62899 0.00385 0.00385
## Residual 21270000828 NA NA NA NA NA NA
As an initial step, we need to make sure we are using an effect coding strategy. This detail is specific to ANOVA, not multiple imputation.
Now let's fit the two-way model that tests for Guardian differences by Sex in Math scores collected during the first semester.
model.03 = lm(Math01 ~ Guardian + Sex +
Guardian * Sex,
data = aa)
#this drops 115 cases
Anova(model.03, type = 3)
Sum Sq | Df | F value | Pr(>F) | |
---|---|---|---|---|
(Intercept) | 10157.636245 | 1 | 936.7339101 | 0.0000000 |
Guardian | 62.423324 | 2 | 2.8783293 | 0.0580154 |
Sex | 25.022233 | 1 | 2.3075422 | 0.1299579 |
Guardian:Sex | 4.810075 | 2 | 0.2217918 | 0.8012330 |
Residuals | 2830.198663 | 261 | NA | NA |
Now to run the MI and compare the results.
Using the following code: imp_data3 <- mice(aa, m=40, seed=142)
Perform the analysis on each of the imputed data sets with the ‘mi.anova()’ function from the ‘miceadds’ library.
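The call itself is not shown; a sketch, assuming imp_data3 from above (the object name anova3 is ours):

```r
library(miceadds)

# Pool the two-way type 3 ANOVA across the 40 imputed data sets
anova3 <- mi.anova(mi.res = imp_data3,
                   formula = "Math01 ~ 1 + Guardian + Sex + Guardian*Sex",
                   type = 3)
```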
## Univariate ANOVA for Multiply Imputed Data (Type 3)
##
## lm Formula: Math01 ~ 1 + Guardian + Sex + Guardian*Sex
## R^2=0.0284
## ..........................................................................
## ANOVA Table
## SSQ df1 df2 F value Pr(>F) eta2 partial.eta2
## Guardian 50.70465 2 30568.98 2.1910 0.11182 0.01192 0.01212
## Sex 54.89645 1 37976.04 4.7974 0.02851 0.01291 0.01311
## Guardian:Sex 15.10919 2 49344.50 0.6391 0.52776 0.00355 0.00364
## Residual 4131.92524 NA NA NA NA NA NA
We can start by calculating a correlation across two variables
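A sketch of the call on the observed data (cor.test() uses only the cases complete on both variables):

```r
# Pearson correlation between the two math scores
cor.test(aa$Math01, aa$Math02)
```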
##
## Pearson's product-moment correlation
##
## data: aa$Math01 and aa$Math02
## t = 24.486, df = 177, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8403011 0.9082913
## sample estimates:
## cor
## 0.8786774
Now we can run a combined correlation across the MI data sets we generated in Part 3 above.
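A sketch using the micombine.cor() function from miceadds, assuming imp_data3 from Part 3; restricting the variables argument to the two math scores is our choice:

```r
library(miceadds)

# Pool the Math01-Math02 correlation across imputations
# (correlations are Fisher-z transformed before combining)
micombine.cor(mi.res = imp_data3, variables = c("Math01", "Math02"))
```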
variable1 | variable2 | r | rse | fisher_r | fisher_rse | fmi | t | p | lower95 | upper95 |
---|---|---|---|---|---|---|---|---|---|---|
Math01 | Math02 | 0.8811656 | 0.0171152 | 1.380958 | 0.0766292 | 0.5574908 | 18.02131 | 0 | 0.8428018 | 0.9106209 |
Math02 | Math01 | 0.8811656 | 0.0171152 | 1.380958 | 0.0766292 | 0.5574908 | 18.02131 | 0 | 0.8428018 | 0.9106209 |
Same basic process as above.
First, test the model with the observed data:
##
## Call:
## lm(formula = Math02 ~ 1 + Math01 + Portu01, data = aa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4305 -0.8456 -0.0402 0.9966 4.5778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.38681 0.77343 -0.500 0.6179
## Math01 0.86714 0.05984 14.490 <2e-16 ***
## Portu01 0.14116 0.07807 1.808 0.0731 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.672 on 122 degrees of freedom
## (257 observations deleted due to missingness)
## Multiple R-squared: 0.7629, Adjusted R-squared: 0.759
## F-statistic: 196.3 on 2 and 122 DF, p-value: < 2.2e-16
We use the with() function to run this model across all the MI data sets stored in the imp_data3 object we made earlier in Part 3.
results = with(imp_data3, lm(Math02 ~ 1+ Math01 + Portu01))
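The 40 per-imputation fits are then combined with Rubin's rules via mice's pool() function:

```r
# Pool the regression results across imputed data sets and summarize
pooled <- pool(results)
summary(pooled)
```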
estimate | std.error | statistic | df | p.value | |
---|---|---|---|---|---|
(Intercept) | -0.8268730 | 0.5957985 | -1.38784 | 86.40788 | 0.1687541 |
Math01 | 0.9075514 | 0.0429691 | 21.12103 | 106.55834 | 0.0000000 |
Portu01 | 0.1368116 | 0.0599034 | 2.28387 | 80.92820 | 0.0250000 |