This document will show how to perform regression across multiply imputed data sets.
Let’s begin by reading in the data set.
data_AcadAchiev = read.csv('/Users/jhelm/Desktop/data_AcadAchiev.csv')
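Before fitting any model, it can help to count the missing values in each variable. The following is a minimal sketch in base R. The data frame `toy` is a hypothetical stand-in for data_AcadAchiev with invented values, since the original file is not available here.

```r
# Hypothetical stand-in for data_AcadAchiev (values are invented)
toy = data.frame(
  Math01  = c(10, 12, NA, 14, 9, NA),
  Portu01 = c(11, NA, 13, 12, 10, 12),
  Math02  = c(11, 13, 12, NA, 9, 10)
)

# Number of missing values in each variable
n_missing = colSums(is.na(toy))
print(n_missing)

# Proportion of missing values in each variable
print(round(colMeans(is.na(toy)), 2))

# Number of rows with no missing values at all
n_complete = sum(complete.cases(toy))
print(n_complete)
```

The same three calls should work on data_AcadAchiev once it has been read in.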
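We can start by performing regression with the observed data. By default, lm() drops every row that has a missing value on any variable in the model (listwise deletion), so only the complete cases are analyzed.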
model.01 = lm(Math02 ~ 1 + Math01 + Portu01, data = data_AcadAchiev)
summary(model.01)
##
## Call:
## lm(formula = Math02 ~ 1 + Math01 + Portu01, data = data_AcadAchiev)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11.4305  -0.8456  -0.0402   0.9966   4.5778
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.38681    0.77343  -0.500   0.6179
## Math01       0.86714    0.05984  14.490   <2e-16 ***
## Portu01      0.14116    0.07807   1.808   0.0731 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.672 on 122 degrees of freedom
## (257 observations deleted due to missingness)
## Multiple R-squared: 0.7629, Adjusted R-squared: 0.759
## F-statistic: 196.3 on 2 and 122 DF, p-value: < 2.2e-16
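The "observations deleted due to missingness" line in the output reflects lm()'s default handling of missing values. As a small sketch of how to check this directly, the example below uses `airquality`, a data set that ships with base R and contains missing values. It is an assumption chosen for illustration and is not the author's data.

```r
# airquality ships with base R and has missing values in Ozone and Solar.R
fit = lm(Ozone ~ 1 + Solar.R + Wind, data = airquality)

# Rows used by lm() versus rows in the data set
n_total = nrow(airquality)
n_used = nobs(fit)
n_dropped = n_total - n_used
print(c(total = n_total, used = n_used, dropped = n_dropped))

# The rows used are exactly the complete cases on the model variables
model_vars = c("Ozone", "Solar.R", "Wind")
n_complete = sum(complete.cases(airquality[, model_vars]))
print(n_used == n_complete)
```

Applied to model.01, nobs(model.01) would report how many rows of data_AcadAchiev entered the regression.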
Next, to make use of the incomplete cases, we need to create the multiply imputed data sets. Load the required packages:
library(mice)
library(miceadds)
Create the imputed data sets:
imp_data = mice(data_AcadAchiev, m = 40, seed = 142)
# This creates 40 imputed data sets, each filling in the missing
# values of the data set 'data_AcadAchiev'.
# If we set the seed value (Jon recommends this), then we will
# reproduce the same imputed values if we rerun the imputation.
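It can be useful to look inside the object that mice() returns before analyzing it. The sketch below is an illustration under stated assumptions: it uses `nhanes`, a small example data set shipped with the mice package, in place of data_AcadAchiev, and only 5 imputations to keep it quick.

```r
library(mice)

# nhanes is an example data set from mice (not the author's data)
imp = mice(nhanes, m = 5, seed = 142, printFlag = FALSE)

# Imputation method used for each variable ("" means nothing to impute)
print(imp$method)

# Extract the first completed data set and count remaining missing values
first_complete = complete(imp, 1)
n_still_missing = sum(is.na(first_complete))
print(n_still_missing)

# Rerunning with the same seed reproduces the same imputed values
imp_again = mice(nhanes, m = 5, seed = 142, printFlag = FALSE)
same = identical(complete(imp, 1), complete(imp_again, 1))
print(same)
```

Changing the second argument of complete() selects a different one of the imputed data sets.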
Perform the analysis on each of the imputed data sets. We can use the ‘with()’ function.
results = with(imp_data, lm(Math02 ~ Math01 + Portu01))
Now we can pool the estimates across these analyses. The pool() function combines them using Rubin's rules.
summary(pool(results))
##               estimate  std.error statistic        df    p.value
## (Intercept) -0.8441735 0.53849558 -1.567652 120.98786 0.11957422
## Math01       0.9246537 0.04768383 19.391349  75.45568 0.00000000
## Portu01      0.1228634 0.05989967  2.051154  79.47602 0.04354663
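To see where the pooled estimates and standard errors come from, they can be recomputed by hand with Rubin's rules. The pooled estimate is the average of the per-imputation coefficients, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below again assumes the `nhanes` example data from mice rather than the author's data, and the model bmi ~ age + chl is chosen only for illustration.

```r
library(mice)

# Example data from mice (not the author's data), 5 imputations
imp = mice(nhanes, m = 5, seed = 142, printFlag = FALSE)
fits = with(imp, lm(bmi ~ age + chl))
m = length(fits$analyses)

# Coefficients and squared standard errors from each imputed data set
est = sapply(fits$analyses, coef)                        # terms x m
se2 = sapply(fits$analyses, function(f) diag(vcov(f)))   # terms x m

# Rubin's rules
pooled_est = rowMeans(est)        # pooled point estimate
w_var = rowMeans(se2)             # within-imputation variance
b_var = apply(est, 1, var)        # between-imputation variance
pooled_se = sqrt(w_var + (1 + 1 / m) * b_var)

print(cbind(estimate = pooled_est, std.error = pooled_se))

# Compare with what pool() reports
from_mice = summary(pool(fits))
print(from_mice[, c("estimate", "std.error")])
```

The two printed tables should agree up to rounding, which is a quick way to confirm what summary(pool(results)) is reporting.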
Using the salary data set: