Is more delirium (in-hospital brain dysfunction) associated with scores on a cognitive test after hospital discharge?
Then we might want to somehow summarize that exposure and see if it's related to some long-term outcome. This could include
SLIDE
BRAIN-ICU cohort:
Other possible examples:
BRAIN-ICU Cohort Study, NEJM 2013
Describe visdat:
And then...
So, how do we deal with the missingness in a way that gives us the least bias?
Four strategies:
NA = unexposed:
Only count the exposure we know about
NA = exposed:
All missing time points get exposure
Pros: Straightforward to implement; plausible, if we know a lot about data collection
Cons: Prone to bias
In our case, the exposure is usually bad, so I've called our first strategy assuming the best: We only count the exposure we know about, and assume all the missing days have no exposure. For example, here we have a patient with two known days of exposure, so that's what they get - we ignore these two missing days.
Of course, we can also assume the worst, meaning that we assume every missing day does have the exposure. So here our hypothetical patient would get a value of four.
These approaches are straightforward to implement, and in certain cases they might make a lot of sense. For example, we use an organ dysfunction score which incorporates bilirubin levels, but even in our prospective studies, bilirubin is generally not measured every day unless it's clinically helpful. Working with our clinical collaborators, we've usually decided that a missing bilirubin means there was no clinical reason to suspect liver dysfunction.
But of course, if you don't know for certain whether or how your missingness is informative, these approaches can be quite prone to bias.
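The two single-value strategies can be sketched in a few lines. This is an illustrative Python stand-in (the original analyses were done in R), and the function name and the example patient are hypothetical:

```python
import numpy as np

def summarize_exposure(days, assume="unexposed"):
    """Summarize a daily 0/1 exposure where NaN marks a missing day.

    assume="unexposed": missing days count as 0 (assume the best).
    assume="exposed":   missing days count as 1 (assume the worst).
    """
    days = np.asarray(days, dtype=float)
    fill = 0.0 if assume == "unexposed" else 1.0
    return int(np.where(np.isnan(days), fill, days).sum())

# The hypothetical patient: two known exposed days, two missing days
patient = [1, np.nan, 1, np.nan]
summarize_exposure(patient, "unexposed")  # 2
summarize_exposure(patient, "exposed")    # 4
```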
NA = missing:
If any day is missing, treat the whole summary value as missing
Pros: Acknowledges uncertainty; straightforward to implement
Cons: Discards the patient-days we do know about; likely overestimates uncertainty
Of course both of those approaches make some pretty big assumptions, so in this next approach we could assume nothing: If the patient has even one missing value, we can't know for certain what the overall summary value would be, so we consider that whole value missing.
In this case, we use multiple imputation prior to regression modeling; in this example, I've "used" five imputations, and our patient gets values all over the map. This might be OK if we have good covariate data with which to impute, and lots of missing patient-days; it is pretty straightforward to implement, and it at least acknowledges the uncertainty, which our simple approaches do not.
However, it's essentially throwing out all the data we do have available - in this case, these four patient-days where we know the exposure value count for nothing. And while it at least acknowledges the uncertainty, this approach will likely overestimate it.
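The assume-nothing strategy can be sketched as follows. This is an illustrative Python stand-in: in a real analysis the missing summaries would be multiply imputed from covariates (for example with mice in R); here uniform random draws stand in for that model, and the seed and function name are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2024)  # hypothetical seed

def summary_or_nan(days):
    """If any patient-day is missing, the whole summary value is treated
    as missing, no matter how many days we did observe."""
    days = np.asarray(days, dtype=float)
    return float("nan") if np.isnan(days).any() else float(days.sum())

# Four known exposed days count for nothing once one day is missing
summary_or_nan([1, 1, 1, 1, np.nan])   # nan

# Stand-in for m = 5 imputations of a 5-day summary: the imputed values
# can land "all over the map" between 0 and 5 days
imputed = rng.integers(0, 6, size=5)
```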
Pros: Maximizes use of the available data, including daily covariates
Cons: Computationally intensive; requires gnarly wrangling between imputing and summarizing
In our final approach, we try to assume as little as possible, using the data from the days we know about and imputing at this lowest hierarchy the days we don't. Once we've multiply imputed missing patient-days, we summarize each of those imputed datasets, then use those summarized datasets in multiple imputation when we model.
This approach, of course, maximizes the use of the data we have available, including daily covariate data to help predict our missing values. But it's the most computationally intensive approach, which can matter especially in large EMR studies, and it may involve some gnarly data wrangling as you go back and forth between imputing and summarizing.
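The impute-then-summarize strategy can be sketched like this. Again an illustrative Python stand-in: a plain Bernoulli draw replaces the covariate-based day-level imputation model used in practice, and the seed, probability, and function name are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)  # hypothetical seed

def impute_days_then_summarize(days, p_exposed, m=5):
    """Fill each missing patient-day with a draw from a day-level model
    (here Bernoulli(p_exposed) stands in for a covariate-based model),
    then summarize each completed dataset. Returns m imputed summaries."""
    days = np.asarray(days, dtype=float)
    missing = np.isnan(days)
    summaries = []
    for _ in range(m):
        filled = days.copy()
        filled[missing] = rng.binomial(1, p_exposed, size=missing.sum())
        summaries.append(int(filled.sum()))
    return summaries

patient = [1, np.nan, 1, np.nan]
impute_days_then_summarize(patient, p_exposed=0.5)
# every imputed summary respects the two known exposed days,
# so each value falls between 2 and 4
```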
We assumed that this approach was our best bet; it makes intuitive sense. However, what kind of statisticians would we be if we didn't test that assumption and simulate some things? So that's exactly what we did.
We used our BRAIN-ICU cohort data as the inspiration for these simulations: the exposure was daily delirium, yes or no, and the outcome was a simulated cognitive impairment score.
We set a range of our individual patient-days to be missing, and set different types of missingness. For our exposures that were missing at random, we set a weak, moderate, and strong association between that missingness and a daily severity of illness score. For exposures that were missing not at random, we set a weak, moderate, and strong association between missingness and the true exposure value.
In the two methods which use imputation, we again used severity of illness to help predict either the overall duration of delirium, or delirium on a daily basis.
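The missingness mechanisms described above can be sketched with a simple logistic model. This is an illustrative Python stand-in for the actual simulation code, and the seed, intercept, and coefficient values are hypothetical; `beta` plays the role of the weak/moderate/strong association:

```python
import numpy as np

rng = np.random.default_rng(42)  # hypothetical seed

def set_missing(exposure, severity, mech="MAR", beta=1.0):
    """Blank out some daily exposures.

    MAR:  missingness depends on an observed daily severity score.
    MNAR: missingness depends on the true (unobserved) exposure value.
    beta sets the strength of the association with missingness.
    """
    driver = severity if mech == "MAR" else exposure
    p_miss = 1 / (1 + np.exp(-(-2 + beta * driver)))  # logistic model
    out = exposure.astype(float).copy()
    out[rng.random(exposure.shape) < p_miss] = np.nan
    return out

severity = rng.normal(size=1000)               # daily severity of illness
exposure = rng.binomial(1, 0.3, size=1000)     # true daily delirium status
days_mar = set_missing(exposure, severity, "MAR", beta=1.5)
```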
First I'll describe the setup:
Main points:
Moving to standard errors, we see the same pattern with imputation of the summary value: it performs poorly even with very little missingness, and it gets worse as the true effect size grows.
This slide is for those of you who prefer your variance estimates in terms of confidence intervals.
More generally, we see that assuming missingness indicates exposure leads to somewhat smaller variance estimates than the other methods. When data are missing at random or missing completely at random, there's not much difference until a lot of data are missing; but again, when missingness is substantially associated with the likelihood of the exposure, the difference becomes much more apparent.
Looking at coverage, the same patterns emerge:
This slide is very boring and is just here to show you that all the methods performed about equally well in terms of statistical power.
We're going to come full circle now, and apply each strategy to our two real-world studies. These analyses are simpler than the ones we ran in real life, but they do adjust for severity of illness, which is likely a major confounder.
As we might expect, when there is little missing data, and it's likely to be missing more or less at random, our results look very, very similar. When we have more missingness, we can see that the results might be quite different. We can also see that there is less variance when we assume missingness is indicative of having the exposure, though it's likely that that result is biased.
In the future, I hope to repeat these analyses with a continuous exposure, which has the added twist of being able to be summarized many different ways, and incorporating more complex relationships between covariates, exposures, and missingness. We purposely kept these straightforward, but as we all know, real life is not always that clear cut!
I want to thank a few R developers for making this work easier and more fun:
And of course, we'd love to thank the principal investigators and research staff at the Vanderbilt Center for Critical Illness, Brain Dysfunction and Survivorship, who come up with these ideas and work incredibly hard to design these studies and collect the data.
Finally, here's where you can find a link to these slides as well as the code I used for simulation and creating the slides and visuals.