Sites for more info:
- Jon Helm
The scientific process involves random sampling. When we have missingness, the random sampling assumption is violated, and generalizability is undermined as a result.
Performing deletion methods essentially gives us a non-random sample of a random sample. Once we get to this point, we can no longer generalize to the population. Listwise deletion makes this a problem because it is too aggressive (though pairwise deletion presents issues as well).
Most data analytic programs will perform listwise deletion before conducting standard analytic procedures (and will not necessarily inform you, or will bury it in the footnotes of the output). Be careful, because listwise deletion always assumes MCAR. If the data are MAR or NMAR, you obtain biased estimates.
MCAR: Missing completely at random
MAR: Missing at random
NMAR/MNAR: Not Missing at Random/Missing Not at Random
QUESTION – Is there any empirical way to test/estimate the standard (or expected) rate of missingness for a particular variable in a particular population?
This is an issue related to power, but it matters more WHY the data are missing: even if you expect the missingness, you are still making assumptions about why (one of the mechanisms above) - and so are all the prior studies from which you derive the expected missingness value (e.g., diary studies generally assume 80% retention/20% attrition, usually attributed to boredom or fatigue).
Different approaches to this process include (but are not limited to):
:::: Mean imputation
:::: Hot deck imputation, Cold deck imputation
:::: LOCF, NOCB, or Sandwich methods
:::: Regression-based imputation
These single-imputation approaches are not as widely used and present some issues, though they generally have more benefits than drawbacks.
:::: Regression-based single imputation doesn't include stochastic elements such as the standard errors of the estimates and the residuals of the model.
:::: Multiple imputation helps include some of this stochastic information.
within (W) = (SE1^2 + SE2^2 + … + SE_M^2) / M, where M = number of imputations
between (B) = var(b0_1, b0_2, …, b0_M), the variance of the estimate across the imputed data sets
V_b0 = W + (1 + 1/M) * B
se_b0 = sqrt(V_b0)
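The pooling formulas above can be sketched in base R; the estimates and standard errors below are made-up illustrative numbers, not values from the course data.

```r
# Rubin's rules for pooling an estimate (here, an intercept b0) across
# M imputed data sets.
b0 <- c(10.2, 10.5, 10.1)   # per-imputation estimates (illustrative)
se <- c(0.30, 0.28, 0.33)   # per-imputation standard errors (illustrative)
M  <- length(b0)

W <- mean(se^2)              # within-imputation variance
B <- var(b0)                 # between-imputation variance
V_b0  <- W + (1 + 1/M) * B   # total variance of the pooled estimate
se_b0 <- sqrt(V_b0)          # pooled standard error

pooled_b0 <- mean(b0)        # pooled point estimate
```

Note that the between-imputation term carries the extra (1 + 1/M) factor, which is why the pooled SE is larger than a naive average of the per-imputation SEs.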
QUESTION – Would it be appropriate to conduct beta comparisons across imputed vs. observed groups (assuming sufficiently large subgroups), to determine whether you are getting particularly different estimates from original data (e.g., a Wald test)? If we don’t test, how do we know that imputation is giving us the results we actually want?
QUESTION – Do you get different results when using a bootstrapping approach (as MI normally does) vs. a simulation approach (i.e., guesses based on the population estimates derived from the data)? This can be tested, no?
In practice, Multiple Imputation software RARELY uses multiple regression to conduct the procedure. The approach is a bit more technical and is based on multivariate normality.
Typically, Multiple imputation will perform well if one of the variables in the imputation accounts for the missingness (i.e., the data are MAR). - When data are MAR, one of the X variables really does explain scores on Y, so if that X is in the MI model, the best guesses will be pretty good.
Typically, it also works if missingness is not related to any variable in the data set (i.e., the data are MCAR)
It does NOT work well if you have NMAR data. - MI will produce biased guesses with NMAR data because your complete observations are a non-random sample of a random sample.
Start by importing the data.
salary <- read.csv("3. Data Sets/data_salary.csv", header=T)
aa <- read.csv("3. Data Sets/data_AcadAchiev.csv", header=T)
# set.seed(806)
head(aa)
ID | Math01 | Math02 | Math03 | Portu01 | Portu02 | Portu03 | Sex | Rel_status | Guardian | Reason |
---|---|---|---|---|---|---|---|---|---|---|
1 | 7 | NA | NA | 13 | 13 | 13 | F | no | mother | home |
2 | 8 | NA | 5 | 13 | NA | 11 | F | yes | mother | reputation |
3 | 14 | 13 | 13 | NA | 13 | 12 | F | no | mother | reputation |
4 | 10 | 9 | NA | 10 | 11 | NA | F | no | mother | course |
5 | 10 | 10 | 10 | 13 | 13 | 13 | F | yes | other | reputation |
6 | NA | NA | 11 | 11 | NA | NA | F | no | mother | course |
The MICE program (Multiple Imputation by Chained Equations) can be used for basic multiple imputation procedures.
We will use it with the Academic Achievement data (data set ‘aa’)
To perform the MI (with the mice library loaded), use:
imp_data = mice(aa, m = 40, seed = 142)
We can view the results of one of the imputations using the complete() function (nested within the head() function so we don’t get the whole thing)
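A sketch of that call, assuming the imp_data object created above:

```r
# complete() extracts one filled-in data set (here, imputation 1);
# head() keeps the printout to the first six rows.
head(complete(imp_data, 1))
```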
ID | Math01 | Math02 | Math03 | Portu01 | Portu02 | Portu03 | Sex | Rel_status | Guardian | Reason |
---|---|---|---|---|---|---|---|---|---|---|
1 | 7 | 10 | 8 | 13 | 13 | 13 | F | no | mother | home |
2 | 8 | 0 | 5 | 13 | 10 | 11 | F | yes | mother | reputation |
3 | 14 | 13 | 13 | 13 | 13 | 12 | F | no | mother | reputation |
4 | 10 | 9 | 10 | 10 | 11 | 11 | F | no | mother | course |
5 | 10 | 10 | 10 | 13 | 13 | 13 | F | yes | other | reputation |
6 | 13 | 10 | 11 | 11 | 11 | 13 | F | no | mother | course |
Now we can perform a t-test (via ANOVA) using some of the variables from the Academic Achievement data set. We will need the car package to calculate type 3 sums of squares.
As an initial step, we need to make sure we are using an effect coding strategy. This detail is specific to ANOVA, not multiple imputation.
#adding this statement makes the results of ANOVA similar to those generated in standard packages like SPSS.
options(contrasts = c('contr.sum', 'contr.poly'))
Now let's fit the model that tests for biological sex differences in Math scores collected during the first semester. After fitting the model like a regression, we can use the Anova() function to get the ANOVA table from the output.
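The fitting step itself is not printed above; a sketch, assuming the aa data and the car library (the object name model.01 is ours):

```r
library(car)  # provides Anova() with type 3 sums of squares

# One-way model: first-semester math score predicted by biological sex
model.01 <- lm(Math01 ~ Sex, data = aa)
Anova(model.01, type = 3)
```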
Sum Sq | Df | F value | Pr(>F) | |
---|---|---|---|---|
(Intercept) | 31987.14071 | 1 | 2919.262676 | 0.0000000 |
Sex | 31.36543 | 1 | 2.862523 | 0.0918406 |
Residuals | 2903.67577 | 265 | NA | NA |
##
## Call:
## lm(formula = Math01 ~ Sex, data = aa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6087 -2.6087 -0.6087 2.3913 8.3913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9516 0.2027 54.030 <2e-16 ***
## Sex1 -0.3429 0.2027 -1.692 0.0918 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.31 on 265 degrees of freedom
## (115 observations deleted due to missingness)
## Multiple R-squared: 0.01069, Adjusted R-squared: 0.006953
## F-statistic: 2.863 on 1 and 265 DF, p-value: 0.09184
Just to follow up, we can also perform a t-test using the sample data.
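A sketch of that call on the observed data (t.test() drops the missing cases on its own):

```r
# Welch two-sample t-test of first-semester math by biological sex
t.test(Math01 ~ Sex, data = aa)
```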
##
## Welch Two Sample t-test
##
## data: Math01 by Sex
## t = -1.6896, df = 261.96, p-value = 0.09229
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.4851991 0.1134431
## sample estimates:
## mean in group F mean in group M
## 10.60870 11.29457
First we impute the data using:
imp_data2 = mice(aa, m = 40, seed = 142)
Quick check of the imputed data:
ID | Math01 | Math02 | Math03 | Portu01 | Portu02 | Portu03 | Sex | Rel_status | Guardian | Reason |
---|---|---|---|---|---|---|---|---|---|---|
1 | 7 | 8 | 8 | 13 | 13 | 13 | F | no | mother | home |
2 | 8 | 0 | 5 | 13 | 12 | 11 | F | yes | mother | reputation |
3 | 14 | 13 | 13 | 14 | 13 | 12 | F | no | mother | reputation |
4 | 10 | 9 | 9 | 10 | 11 | 11 | F | no | mother | course |
5 | 10 | 10 | 10 | 13 | 13 | 13 | F | yes | other | reputation |
6 | 16 | 12 | 11 | 11 | 12 | 14 | F | no | mother | course |
Perform the analysis on each of the imputed data sets with the ‘mi.anova()’ function from the ‘miceadds’ library.
options(contrasts = c('contr.sum', 'contr.poly'))
anova2<-mi.anova(mi.res = imp_data2,
formula = "Math01 ~ 1 + Sex",
type = 3)
## Univariate ANOVA for Multiply Imputed Data (Type 3)
##
## lm Formula: Math01 ~ 1 + Sex
## R^2=0.02
## ..........................................................................
## ANOVA Table
## SSQ df1 df2 F value Pr(>F) eta2 partial.eta2
## Sex 85.94995 1 6839.481 7.088 0.00778 0.01998 0.01998
## Residual 4215.18304 NA NA NA NA NA NA
Using the salary data set:
:::: Perform a t-test on salary using biological sex as a predictor
:::: Create multiply imputed data sets (use 40 imputations, set the seed equal to 806)
:::: Perform the t-test across all data sets
:::: Combine the results across data sets
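A sketch of the first step, assuming the salary data frame has columns named salary and BioSex (per the output that follows):

```r
# Welch two-sample t-test of salary by biological sex on the observed data
t.test(salary ~ BioSex, data = salary)
```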
##
## Welch Two Sample t-test
##
## data: salary by BioSex
## t = -0.42543, df = 58.688, p-value = 0.6721
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8758.271 5687.361
## sample estimates:
## mean in group F mean in group M
## 98438.25 99973.70
Sum Sq | Df | F value | Pr(>F) | |
---|---|---|---|---|
(Intercept) | 652934818077 | 1 | 2912.7660488 | 0.0000000 |
BioSex | 39102846 | 1 | 0.1744392 | 0.6775122 |
Residuals | 15243094325 | 68 | NA | NA |
Impute the data using: imp_data.prac <- mice(salary, m=40, seed=806)
Perform the t-test across all imputed data sets (using Anova()), and combine them using the mi.anova() function:
options(contrasts = c('contr.sum', 'contr.poly'))
anovapracMI<-mi.anova(mi.res = imp_data.prac,
formula = "salary ~ 1 + BioSex",
type = 3)
## Univariate ANOVA for Multiply Imputed Data (Type 3)
##
## lm Formula: salary ~ 1 + BioSex
## R^2=0.0038
## ..........................................................................
## ANOVA Table
## SSQ df1 df2 F value Pr(>F) eta2 partial.eta2
## BioSex 82139648 1 3481.781 0.2335 0.62899 0.00385 0.00385
## Residual 21270000828 NA NA NA NA NA NA
As an initial step, we need to make sure we are using an effect coding strategy. This detail is specific to ANOVA, not multiple imputation.
Now let's fit the two-way model that tests for Guardian differences by Sex in Math scores collected during the first semester.
model.03 = lm(Math01 ~ Guardian + Sex +
Guardian * Sex,
data = aa)
#this drops 115 cases
Anova(model.03, type = 3)
Sum Sq | Df | F value | Pr(>F) | |
---|---|---|---|---|
(Intercept) | 10157.636245 | 1 | 936.7339101 | 0.0000000 |
Guardian | 62.423324 | 2 | 2.8783293 | 0.0580154 |
Sex | 25.022233 | 1 | 2.3075422 | 0.1299579 |
Guardian:Sex | 4.810075 | 2 | 0.2217918 | 0.8012330 |
Residuals | 2830.198663 | 261 | NA | NA |
Now to run the MI and compare the results.
Using the following code: imp_data3 <- mice(aa, m=40, seed=142)
Perform the analysis on each of the imputed data sets with the ‘mi.anova()’ function from the ‘miceadds’ library.
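The call itself is not shown; a sketch, assuming imp_data3 from above (the object name anova3 is ours):

```r
library(miceadds)

# Pool the two-way type 3 ANOVA across the 40 imputed data sets
anova3 <- mi.anova(mi.res = imp_data3,
                   formula = "Math01 ~ 1 + Guardian + Sex + Guardian*Sex",
                   type = 3)
```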
## Univariate ANOVA for Multiply Imputed Data (Type 3)
##
## lm Formula: Math01 ~ 1 + Guardian + Sex + Guardian*Sex
## R^2=0.0284
## ..........................................................................
## ANOVA Table
## SSQ df1 df2 F value Pr(>F) eta2 partial.eta2
## Guardian 50.70465 2 30568.98 2.1910 0.11182 0.01192 0.01212
## Sex 54.89645 1 37976.04 4.7974 0.02851 0.01291 0.01311
## Guardian:Sex 15.10919 2 49344.50 0.6391 0.52776 0.00355 0.00364
## Residual 4131.92524 NA NA NA NA NA NA
We can start by calculating a correlation across two variables
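A sketch of the call on the observed data (cor.test() uses only the cases complete on both variables):

```r
# Pearson correlation between the two math scores
cor.test(aa$Math01, aa$Math02)
```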
##
## Pearson's product-moment correlation
##
## data: aa$Math01 and aa$Math02
## t = 24.486, df = 177, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8403011 0.9082913
## sample estimates:
## cor
## 0.8786774
Now we can run a combined correlation across the MI data sets we generated in Part 3 above.
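A sketch using the micombine.cor() function from miceadds, assuming imp_data3 from Part 3; restricting the variables argument to the two math scores is our choice:

```r
library(miceadds)

# Pool the Math01-Math02 correlation across imputations
# (correlations are Fisher-z transformed before combining)
micombine.cor(mi.res = imp_data3, variables = c("Math01", "Math02"))
```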
variable1 | variable2 | r | rse | fisher_r | fisher_rse | fmi | t | p | lower95 | upper95 |
---|---|---|---|---|---|---|---|---|---|---|
Math01 | Math02 | 0.8811656 | 0.0171152 | 1.380958 | 0.0766292 | 0.5574908 | 18.02131 | 0 | 0.8428018 | 0.9106209 |
Math02 | Math01 | 0.8811656 | 0.0171152 | 1.380958 | 0.0766292 | 0.5574908 | 18.02131 | 0 | 0.8428018 | 0.9106209 |
Same basic process as above.
First, test the model with the observed data:
##
## Call:
## lm(formula = Math02 ~ 1 + Math01 + Portu01, data = aa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4305 -0.8456 -0.0402 0.9966 4.5778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.38681 0.77343 -0.500 0.6179
## Math01 0.86714 0.05984 14.490 <2e-16 ***
## Portu01 0.14116 0.07807 1.808 0.0731 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.672 on 122 degrees of freedom
## (257 observations deleted due to missingness)
## Multiple R-squared: 0.7629, Adjusted R-squared: 0.759
## F-statistic: 196.3 on 2 and 122 DF, p-value: < 2.2e-16
We use the with() function to run this model across all the MI data sets stored in the imp_data3 object we made earlier in Part 3.
results = with(imp_data3, lm(Math02 ~ 1+ Math01 + Portu01))
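The 40 per-imputation fits are then combined with Rubin's rules via mice's pool() function:

```r
# Pool the regression results across imputed data sets and summarize
pooled <- pool(results)
summary(pooled)
```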
estimate | std.error | statistic | df | p.value | |
---|---|---|---|---|---|
(Intercept) | -0.8268730 | 0.5957985 | -1.38784 | 86.40788 | 0.1687541 |
Math01 | 0.9075514 | 0.0429691 | 21.12103 | 106.55834 | 0.0000000 |
Portu01 | 0.1368116 | 0.0599034 | 2.28387 | 80.92820 | 0.0250000 |