"Science should be ‘show me’, not ‘trust me’; it should be ‘help me if you can’, not ‘catch me if you can’."
-- Philip B. Stark, Nature 2018
This quote gets at my informal definition of what we mean by computational reproducibility. We all want the work we do to be both meaningful and correct, and we want to be able to discern whether other research is trustworthy. Reproducible practices like sharing organized code scripts can help us with that.
"Did I mention that subjects with IDs > 100 are actually kangaroos and should be excluded?"
I won the lottery 👋; now my coworker is taking over
Journal reviews back after 8 months! Time for revisions!
"How does that patient have 15 months of treatment when we only followed people for 12 months?"
Our reasons for pursuing reproducibility don't have to be entirely altruistic or philosophical. The same tools that make it easier for others to reproduce your analyses also make it easier for you to reproduce them, and have more confidence in your own results.
True story: Someone asked me last month for a script I started writing in 2009
Even worse: This script was related to a project with five different data sources and multiple data management scripts
Past me was not living her most reproducible life :( - she had not considered this scenario nine years ago when starting this project, and it was a bit of a mess. Current me is a bit more experienced and knows to expect the unexpected.
All these tools we'll talk about today benefit from forethought - the more we can think ahead, the better off we'll be.
“It’s not thinking, ‘This is easiest for myself right now.’ It’s thinking, ‘When I’m working on this next week, next month, right before I graduate — how do I set myself up so that it’s easier later?’”
-- Julia Stewart Lowndes in "A toolkit for data transparency takes shape," Nature, 20 August 2018
I love this quote from Julia Lowndes - reproducibility is partially about making life easier for our future selves.
But sometimes we may feel a bit like Rory Gilmore in her first semester here at Yale - it can be a little overwhelming to think years ahead when you have an abstract deadline next week.
With all the tools and "best practices" that exist for reproducible research, sometimes it can be overwhelming or feel like an all-or-nothing proposition - if you haven't given someone a full Docker image and every iota of code and data, what's the point? While this is certainly the ideal, it can be intimidating depending on your starting point and the time you have available. (In the case of my hypothetical pulmonary fellow who needs a conference abstract in a week, creating a fully reproducible masterpiece is not my immediate priority.)
Demonstrate several R tools & practices which can help us not only improve our scientific rigor, but make our lives more pleasant throughout the course of a project.
BUT! We can use these tools incrementally or all together - anything we do will help, and I've learned this from my own experience. My goal today: show you how these tools built for increasing the level or ease of reproducibility have been helpful in my work in clinical research. Your goal: Take what applies to and interests you and mold it to your own life (then share those ideas so we can learn from you!). Some of you may be using some or all of these tools, but hopefully we can all learn at least one new thing! I am kind of assuming that everyone is aware of and using literate programming tools like RMarkdown; if you're not, come talk to me later and I'm happy to try to convince you. :)
So the first thing I'll talk about is how to organize your project. This may feel kind of basic, but it has been really key for my sanity. And setting up our workflow well will come back to help us in a later segment, so stay tuned.
RStudio Projects + version control can help!
Projects
packrat
version control

Here I've shown an example file structure with deidentified components for a real study that we have ongoing. This is pretty typical of what I need for my projects; you may need something different, but the point is to find an organized structure that works for you. The most important aspect is that one project equals one directory.
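Purely as an illustration (these directory names are hypothetical, not the ones from my actual study), a one-project-one-directory layout might look something like:

my-study/
  my-study.Rproj
  data-raw/   ## untouched raw data, or scripts that pull it via API
  data/       ## cleaned analysis datasets
  R/          ## data management & analysis scripts
  reports/    ## RMarkdown reports and their output
  README.md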
When things are organized well, it's not only calming, but makes things easier:
If you're an RStudio user, using Projects can be helpful here, as Mine mentioned yesterday. The first reason I started using them was the ability to have more than one RStudio session open at once with no headaches, and as Mine discussed yesterday, they can help make sure we're in the correct directory. They also encourage a "one project, one directory" mindset, since each analysis gets its own directory, which gets one project file. Even more helpful is that this is also the structure that works best with version control, and Projects have several features that help facilitate getting git or other version control set up.
If you haven't yet drunk the version control Kool-Aid, I'll give you some very practical ways that it's helped me:
You, [data person], Should Use Version Control

Basically, I use version control because I'm an optimistic cynic: One day, something will go wrong or change, or someone will question me, and this is a really helpful tool for managing those situations.
usethis for This

usethis is particularly helpful for creating R packages, but has plenty of functions that will help you set up a general project:

create_project()
use_r(), use_data_raw()
use_git(), use_github()

Bonus points: Use usethis to turn your project into a fully transportable package!
The usethis package is basically built to make setting up all of these details, like git repositories and file structures, a bit easier. Getting set up with git, for example, can be a little tricky (I know this from personal experience). And usethis is particularly helpful if you build R packages, as it really shines when setting up things like documentation and licensing.
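As a rough sketch (the project path and script name here are hypothetical), setting up a new analysis project with usethis might look something like:

library(usethis)

## Create a new project directory with an .Rproj file
create_project("~/analyses/my-new-study")

## Add a script stub under R/ for data management code
use_r("data_management")

## Put the project under git version control;
## use_github() can then create a matching remote repository if you want one
use_git()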
And of course, read anything Jenny Bryan has written on these topics.
Now that our workflow is set up we're ready to actually work with some data. By "accessing data programmatically" I mean keeping data stored in some sort of database that is not CSV files or spreadsheets floating around, and accessing it via code of some kind rather than manually exporting or saving email attachments.
This is an ideal, of course, and depending on your situation may or may not be tenable, but I hope to inspire you to pursue it whenever possible!
Many files, potentially with different versions, from disparate sources, manually created/exported
vs. accessing your data programmatically (REDCap API, SQL database...)
Our group used to use a system that involved me manually exporting about 25 forms as separate CSVs every time I went to update the data. It was, as you might imagine, not the most fun part of my day, and more importantly it was rife with opportunity for error. What if I accidentally skipped one or saved one with the wrong name?
It's hard to overcome institutional inertia if you're used to getting data in spreadsheets, and APIs might be a bit intimidating to folks who are used to receiving data in formats that they can open and see. I empathize with that, but after having Excel reformat dates and not being able to remember years later which spreadsheet contained what, I've become an API believer.
Writing code to extract data via APIs allows you to track exactly which data points you downloaded. All your original data remains untouched in a safe location, and there's a complete record of what you did to transform or summarize it. So anyone with access to that database can reproduce your work later if needed.
library(httr)

monthly_post <- httr::POST(
  url = "https://redcap.vanderbilt.edu/api/",
  body = list(
    token = Sys.getenv("MYTOKEN"), ## API token gives you permission
    content = "record",            ## export *records*
    format = "csv",                ## export as *CSV*
    forms = "monthly_data",        ## which form(s)?
    fields = c("study_id"),        ## additional fields
    events = paste(sprintf("month_%s_arm_1", 1:3), collapse = ","),
                                   ## all 3 monthly visit events
    rawOrLabel = "label"           ## export factor *labels* v codes
  )
)

monthly_df <- read.csv(
  text = as.character(monthly_post),
  stringsAsFactors = FALSE,
  na.strings = ""
)
Our group uses REDCap basically exclusively, so here I've shown an example of how I use the httr package to extract data via the REDCap API.
My toy example here is from a longitudinal study where we had a baseline visit; three monthly visits; and a study completion visit. I'm exporting this data in order to summarize some of the monthly values.
If you use a different data capture system your code might look a bit different. My goal is less about the specifics of the code than to demonstrate some key points:
The final argument in the httr call gives an example of an option you can give REDCap using the API - here I'm telling it I want text labels for categorical variables and not the numeric codes, but there may be times when you want the opposite. You can choose based on your needs for any given export.

Once the initial script or report is built, we can - with one line of code* - do the following, always assured that we're using the most recently updated data:

Run monitoring reports (flexdashboards)

More resources: the httr vignettes; an intro to APIs by Lucy D'Agostino McGowan; redcapAPI (Benjamin Nutter); REDCapR (Will Beasley)

Once we're using the API to export data, the world is our oyster - we don't have to worry about waiting on updated spreadsheets or manually extracting data. I use this capability to do several things, including a report in the flexdashboard format of RMarkdown to monitor our studies, making sure we're hitting enrollment targets and looking for potential pain points with protocol aspects like wearing devices or drawing blood. Again, I can run those reports once a week with one line of code, assured that it's current with the database (a minimal sketch of that one-liner follows below).

If you're new to working with APIs, the httr package has some great vignettes; if you do use REDCap, there are at least two packages specifically for working with that API. If you're using other sources of web data, rOpenSci has a collection of packages for working with APIs that you should investigate.
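That "one line of code" sketch (the file name here is made up) can be as simple as:

## Re-render the monitoring flexdashboard; the Rmd pulls fresh data
## via the API each time it's knit, so the report matches the database
rmarkdown::render("enrollment_dashboard.Rmd")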
Once our data is in our hands, we want to make sure it reflects our understanding of it and our needs. If you develop software, you're likely already familiar with the concept of testing; if you do any data wrangling, you've definitely done this informally - checked to see if your calculated age is negative, for example. So what I'll encourage in the next few slides is a formal way to incorporate those practices into data exporting and wrangling, so that we're more confident in our final data and therefore our final conclusions.
assertr
Going far too long without realizing you've made a major yet sneaky error in your data wrangling.
With assertr (author: Tony Fischetti), we can proactively make assertions about our data.
Hopefully I'm not the only one who's had this experience: You've done your data wrangling, everything looks OK, and then one day someone asks the exact right question that allows you to uncover a mistake you made.
The assertr package helps us avoid this by providing ways to proactively make assertions about our data. We can do this when reading in raw data, to make sure we understand the data we're dealing with and that the data fits the conventions we've established.
We can also do this after doing some data management, to make sure our raw data and our code are working together as well as we'd expect. Using these assertions can help us discover special cases or unmet assumptions earlier.
library(assertr)

## Creatinine must be <= 20
monthly_df %>% verify(creat_m <= 20)
Pipeline stops and prints an informative message
verification [creat_m <= 20] failed! (1 failure)

    verb redux_fn     predicate column index value
1 verify       NA creat_m <= 20     NA     5    NA
monthly_df[5, c("study_id", "redcap_event_name", "creat_m")]
  study_id redcap_event_name creat_m
5        3           Month 2      25
In our example study, we believe that any creatinine above 20 must be some kind of data issue. So we use assertr's verify function at the beginning of a data management pipeline to "verify" that that assumption is true. If that condition is met, the function will return the original data.frame, and we can continue with our data management.
Of course, this being an example, the condition is not met. verify throws an error and stops our pipeline in its tracks, and prints a message showing us that the fifth row in our data.frame violates this assumption. In this case, my most likely guess is that someone forgot a decimal point. We can work with the PI or the data entry team to resolve the issue. Once it's resolved and we're using the updated data, this same pipeline will run with no issue.
One note: If you're not a pipeline user, assertr certainly works with base R as well! It's built to work nicely within pipelines, but you can use it no matter your preference.
verify(monthly_df, is.na(creat_m) | creat_m <= 20)
monthly_df %>%
  ## Visit date *must* be entered;
  ## HDL, LDL must be within range
  assert(not_na, date_visit_m) %>%
  assert(within_bounds(25, 95), hdl_m) %>%
  assert(within_bounds(1, 200), ldl_m)
Returns our original data frame so we can move on down the pipeline!
It's similar to verify, but assertr's assert function comes with built-in predicate functions which perform common checks; in this case, we want to make sure that every monthly visit has a recorded date, using not_na(), and that all our cholesterol values fall within a reasonable range, using within_bounds().
Here, our assertion discovers no problems, and we can continue with confidence.
error_fun
## Maybe it's OK if creatinine is super high
newdf <- monthly_df %>%
  verify(creat_m <= 20, error_fun = error_append)

attr(newdf, "assertr_errors")
[[1]]
verification [creat_m <= 20] failed! (1 failure)

    verb redux_fn     predicate column index value
1 verify       NA creat_m <= 20     NA     5    NA
Sometimes you might want to know about an issue but still keep the pipeline going. The error_fun argument gives you control over how assertr behaves when the conditions are met or unmet.
You have several options; in this case, I've run our original assertion, but set this argument to error_append. Using this option will allow the pipeline to keep going, and will return our original data.frame as though there were no problems, but it will add an attribute to that data.frame which includes all our errors.
The tradeoff to this approach is that you do need to be paying attention somehow: sink those error messages from the attribute to a text file, or take some other tactic, so you don't defeat the purpose and miss a potential problem.
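One possible tactic, sketched here using the error_append example above (the log file name is made up): pull the accumulated errors out of the attribute and write them to a file you'll actually look at.

library(assertr)

newdf <- verify(monthly_df, creat_m <= 20, error_fun = error_append)

## Write any accumulated assertr errors to a log file for review
err_list <- attr(newdf, "assertr_errors")
if (!is.null(err_list)) {
  err_text <- unlist(lapply(err_list, function(e) capture.output(print(e))))
  writeLines(err_text, "assertion_log.txt")
}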
Hopefully these basic examples have intrigued you enough to investigate further. If you're interested, I'd really recommend the assertr tutorial that Tony has put together with some great examples.
Now that we've tested our data, we're finally ready to perform our analysis and report our conclusions. In my experience projects can get complex quickly, and in this case, there are some powerful tools that can help us take advantage of that structure we worked on earlier.
drake

Built to encourage and enable efficient, reproducible workflows:

Define a set of components, or "targets"
Describe a plan that will create/update these targets
make() the plan
drake knows which components are up to date, which to update

Much more at: the drake manual

I'm new to drake myself, and as soon as I sat down to try it out, I wished I'd had it available two years ago, when I first started work on a project that involves bootstrapping about twelve separate models.
The point of drake is to know when something needs to be updated and when it needs to be left in peace. You set up your workflow and define a set of components, or "targets"; describe a "plan" that will either create or update these targets as needed; and then "make" the plan. (If you're familiar with the concept of a makefile, that terminology will be familiar to you, but here we can do everything right within R.) drake knows which components need to be updated due to changes you've made since the last make, and which are current and can be left alone. So it takes care of any changes without wasting time or computing power on things that are already perfectly fine.
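To make that concrete, here's a minimal, hypothetical plan - the file names, variables, and model are made up, but the pattern (define targets, then make the plan) is what drake is built around:

library(drake)

plan <- drake_plan(
  ## Each named argument is a target
  raw_data   = read.csv(file_in("monthly_data.csv")),
  clean_data = subset(raw_data, !is.na(creat_m)),
  fit        = lm(creat_m ~ hdl_m + ldl_m, data = clean_data),
  report     = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

make(plan)  ## first run builds everything
## Edit the model or the Rmd and run make(plan) again:
## only the affected targets are rebuilt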
drake has some fantastic documentation, including a web site and a full manual. To get started, I'd recommend checking out Will Landau's recent talk at R/Pharma, which was all about drake. It's much more powerful than I have time to go into here.
So! You may have noticed that several times I've referenced rOpenSci, so... what exactly is rOpenSci?
rOpenSci is an organization that creates the technical and social infrastructure to empower a community of people, of which I'm a part, who are interested in improving scientific research via increased reproducibility and transparency.
So, how exactly do we create that infrastructure?
The first piece is through software. If you're a package developer, rOpenSci has an excellent manual to guide you through best practices in development and maintenance. This is really useful whether or not you submit your package for review with rOpenSci.
If you do successfully submit your package for review, you'll get the benefits of more discoverability: for example, rOpenSci's web site has a searchable list of all its packages, and houses material like the assertr tutorial I linked to earlier.
That peer review process is the second piece of the rOpenSci puzzle. We review packages that fall within the scope of enabling and encouraging reproducible research and managing the data life cycle.
Developers get the benefit of feedback on design and usability, as well as additional visibility from rOpenSci blog posts. Reviewers get to contribute to the community, help improve a package, and almost certainly learn a lot as they do so. I was one of the reviewers for the skimr package, which is really fantastic, by the way, and learned a lot during the process by thinking about things like package API design.
Welcoming, diverse community
"rOpenSci combines expertise and approachability, and its community inspires people to collaborate as the best versions of themselves."
-- Will Landau, rOpenSci community member & drake developer
Finally, we create the social infrastructure by putting together this fantastic community of people with really diverse social and professional backgrounds, all interested in making their work more reproducible and more open. By facilitating these relationships and creating a welcoming, friendly, approachable culture, we increase what we're able to produce and create more and stronger collaboration.
This quote from Will Landau sums up my feelings so well that I asked if I could just quote him: "rOpenSci combines expertise and approachability, and its community inspires people to collaborate as the best versions of themselves."
BONUS SLIDES
I love this diagram from Jeff Leek so much:
Saying "we did multivariable logistic regression" only tells the reader one of many, many decisions you made in your analysis, which can be affected by all kinds of factors as Jeff points out. So I'm starting off by encouraging you to use R's literate programming tools for not only reporting, but also for things like exploratory data analysis that informs your modeling decisions, or writing simulations that get summarized in your final report.
"Science should be ‘show me’, not ‘trust me’; it should be ‘help me if you can’, not ‘catch me if you can’."
-- Philip B. Stark, Nature 2018
This is my informal definition for what we mean by computational reproducibility, which is. We all want the work we do to be both meaningful and correct, and want to be able to discern whether other research is trustworthy. Reproducible practices like sharing organized code scripts can help us with that.