This document shows how to perform a regression analysis across multiply imputed data sets.

Let’s begin by reading in the data set.

data_AcadAchiev = read.csv('/Users/jhelm/Desktop/data_AcadAchiev.csv')

Perform a Regression

We can start by performing the regression on the observed data. Rows with a missing value on any variable in the model are dropped from this analysis.

model.01 = lm(Math02 ~ 1 + Math01 + Portu01, data = data_AcadAchiev)
summary(model.01)
## 
## Call:
## lm(formula = Math02 ~ 1 + Math01 + Portu01, data = data_AcadAchiev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.4305  -0.8456  -0.0402   0.9966   4.5778 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.38681    0.77343  -0.500   0.6179    
## Math01       0.86714    0.05984  14.490   <2e-16 ***
## Portu01      0.14116    0.07807   1.808   0.0731 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.672 on 122 degrees of freedom
##   (257 observations deleted due to missingness)
## Multiple R-squared:  0.7629, Adjusted R-squared:  0.759 
## F-statistic: 196.3 on 2 and 122 DF,  p-value: < 2.2e-16
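
The output above notes that 257 observations were deleted due to missingness, so it can help to count the missing values before imputing. The lines below are a minimal sketch using a small made-up data frame (the names 'toy', 'y', 'x1', and 'x2' are hypothetical and are not part of the academic achievement data); the same calls should work on 'data_AcadAchiev'.

```r
# Hypothetical toy data, used only for illustration
set.seed(1)
toy = data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
toy$x1[c(2, 5, 9)] = NA
toy$y[c(5, 14)] = NA

# Number of missing values in each variable
n_missing = colSums(is.na(toy))
n_missing

# Number of rows with no missing values on any variable
n_complete = sum(complete.cases(toy))
n_complete

# lm() fits the model using only the complete rows
fit_toy = lm(y ~ 1 + x1 + x2, data = toy)
nobs(fit_toy)
```

Comparing nobs() for the fitted model with nrow() for the data shows how many cases the analysis lost.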

Performing Regression across Multiply Imputed Data Sets

First, we need to load the packages used to create the multiply imputed data sets.

library(mice)
library(miceadds)

Create the imputed data sets

imp_data = mice(data_AcadAchiev, m = 40, seed = 142)

    # This creates 40 imputed data sets, each filling in the missing
    # values in the data set 'data_AcadAchiev'

    # Setting the seed (Jon recommends this) makes the imputation
    # reproducible: rerunning the code gives the same imputed values
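
To see what the imputation produced, we can pull out individual completed data sets with the complete() function from mice. The sketch below uses 'nhanes', a small example data set that ships with mice, rather than the academic achievement data; the object names and m = 5 are assumptions chosen to keep the example quick.

```r
library(mice)

# 'nhanes' is an example data set included with mice (illustration only)
imp_demo = mice(nhanes, m = 5, seed = 142, printFlag = FALSE)

# Extract the first completed data set and check that nothing is missing
first_set = complete(imp_demo, 1)
sum(is.na(nhanes))
sum(is.na(first_set))

# Stack all completed data sets; '.imp' identifies the imputation number
all_sets = complete(imp_demo, action = "long")
table(all_sets$.imp)

# Rerunning with the same seed reproduces the same imputed values
imp_again = mice(nhanes, m = 5, seed = 142, printFlag = FALSE)
identical(complete(imp_again, 1), first_set)
```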

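Next, perform the analysis on each of the imputed data sets. The with() function fits the same model separately to each of the 40 data sets. The formula below omits the '1 +' term, but lm() includes the intercept by default, so the model is the same as before.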

results = with(imp_data, lm(Math02 ~ Math01 + Portu01))

Now we can pool the estimates across these analyses. The pool() function combines the 40 sets of results using Rubin's rules.

summary(pool(results))
##               estimate  std.error statistic        df    p.value
## (Intercept) -0.8441735 0.53849558 -1.567652 120.98786 0.11957422
## Math01       0.9246537 0.04768383 19.391349  75.45568 0.00000000
## Portu01      0.1228634 0.05989967  2.051154  79.47602 0.04354663
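
To make the pooling step less of a black box, the sketch below applies Rubin's rules by hand: the pooled estimate is the average of the estimates, and the pooled variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. It uses the 'nhanes' example data set from mice and m = 5 rather than the academic achievement data; the object names are assumptions made for illustration.

```r
library(mice)

# 'nhanes' is an example data set included with mice (illustration only)
imp_demo = mice(nhanes, m = 5, seed = 142, printFlag = FALSE)
fit_demo = with(imp_demo, lm(bmi ~ age))

# Number of imputed data sets
m = length(fit_demo$analyses)

# One column per imputed data set, one row per coefficient
est = sapply(fit_demo$analyses, coef)
se2 = sapply(fit_demo$analyses, function(f) diag(vcov(f)))

# Rubin's rules
pooled_est  = rowMeans(est)          # average estimate
within_var  = rowMeans(se2)          # average squared standard error
between_var = apply(est, 1, var)     # variance of estimates across data sets
total_var   = within_var + (1 + 1/m) * between_var
pooled_se   = sqrt(total_var)

# The hand-computed values should match the pool() output
cbind(estimate = pooled_est, std.error = pooled_se)
summary(pool(fit_demo))
```

The between-imputation variance is what carries the uncertainty due to the missing data, which is why analyzing a single imputed data set would understate the standard errors.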

Practice Problem

Using the salary data set:

  1. Perform a regression analysis on the observed data that predicts salary from number of publications and years on the job
  2. Create multiply imputed data sets (use 40 imputations, set the seed equal to 806)
  3. Perform the same regression analysis on each of the imputed data sets
  4. Pool the results across the imputed data sets