Measuring notebook reproducibility with repo2docker

Min Ragan-Kelley, Vilde Dille Øvreeide

Audience level:
Intermediate

Brief Summary

repo2docker powers mybinder.org, aiming to reliably turn repositories into interactive environments where notebooks can be executed, enabling reproducible interactive publications. We set out to validate repo2docker's mission of "automating existing best practices" by sampling GitHub repositories that contain notebooks and testing whether repo2docker creates an environment in which those notebooks can be executed.

Outline

repo2docker's guiding principle is to "automate and encourage existing community best practices for reproducible computational environments". It does this by generating Dockerfiles with installation commands based on standard files in a repository, such as requirements.txt or environment.yml.
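As an illustration of what "based on standard files" means in practice, here is a minimal sketch that lists which well-known configuration files a checked-out repository provides. It is not repo2docker's actual buildpack logic: the function name find_config_files and the file list are assumptions made for this example, and the list is only a subset of the files repo2docker documents. The one rule it does borrow from the repo2docker documentation is that a binder/ or .binder/ folder, when present, is consulted instead of the repository root.

```python
from pathlib import Path

# A non-exhaustive subset of configuration files that repo2docker documents.
# The tuple itself is an assumption made for this sketch.
KNOWN_CONFIG_FILES = (
    "environment.yml",
    "requirements.txt",
    "Pipfile",
    "setup.py",
    "apt.txt",
    "runtime.txt",
    "postBuild",
    "Dockerfile",
)


def find_config_files(repo_dir):
    """Return (config_dir, names) for the known config files in a checkout.

    Hypothetical helper: if binder/ or .binder/ exists, only that folder
    is inspected; otherwise the repository root is used.
    """
    root = Path(repo_dir)
    config_dir = root
    for sub in ("binder", ".binder"):
        candidate = root / sub
        if candidate.is_dir():
            config_dir = candidate
            break
    # Keep the order of KNOWN_CONFIG_FILES so results are deterministic.
    found = [name for name in KNOWN_CONFIG_FILES if (config_dir / name).is_file()]
    return config_dir, found
```

A repository for which the returned list is empty gives repo2docker nothing to install beyond its defaults, which is one plausible reason a notebook might later fail on a missing import.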

Reproducibility can be challenging to measure. A repository that is reproducible at publication time may stop being reproducible later, and this can only be observed once time has actually passed. Notebooks have now been around long enough that this happens with some regularity.

To measure reproducibility with repo2docker, we sampled repositories containing notebooks on GitHub and executed their notebooks using nbconvert. We used the lowest bar of "does it execute without errors" to explore the following questions:

We used two sources of repositories for testing:

  1. the mybinder.org events archive, for repositories that have been used with mybinder.org, and thus are known to have been tested with repo2docker, and
  2. sampling open data from a 2019 study of hundreds of thousands of notebook-containing repositories on GitHub (DOI).
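The "does it execute without errors" bar described above can be sketched as two steps: build an image for the repository with repo2docker, then run nbconvert on a notebook inside that image and check the exit code. This is a minimal sketch under stated assumptions, not the code in repo2docker-checker: the function names, the image name, and the timeout are hypothetical, and it assumes Docker is available and that nbconvert is installed in the built image. The command-line flags used (--no-run, --image-name, --ref for repo2docker; --to notebook, --execute, --stdout for nbconvert) are existing options of those tools.

```python
import subprocess


def build_command(repo, image_name, ref=None):
    """Command that builds an image for `repo` without starting a server."""
    cmd = ["jupyter-repo2docker", "--no-run", "--image-name", image_name]
    if ref:
        # Pin a specific commit, branch, or tag.
        cmd += ["--ref", ref]
    cmd.append(repo)
    return cmd


def execute_command(image_name, notebook, timeout=600):
    """Command that executes one notebook inside the built image."""
    return [
        "docker", "run", "--rm", image_name,
        "jupyter", "nbconvert", "--to", "notebook", "--execute",
        f"--ExecutePreprocessor.timeout={timeout}",
        "--stdout", notebook,
    ]


def notebook_executes(image_name, notebook, timeout=600, run=subprocess.run):
    """Return True if nbconvert exits with status 0 for this notebook.

    `run` is injectable so the check can be exercised without Docker.
    """
    result = run(
        execute_command(image_name, notebook, timeout),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

For example, running subprocess.run(build_command("https://github.com/minrk/repo2docker-checker", "r2d-test")) followed by notebook_executes("r2d-test", "example.ipynb") would report a single pass or fail, where "r2d-test" and "example.ipynb" are placeholder names. A real study would also need to record why a build or execution failed, which this sketch does not attempt.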

Because of the prior study, we are able to compare the results of our repo2docker-based approach with another group's approach to measuring the reproducibility of the same repositories, evaluated at a different point in time.

We will present key differences between repo2docker's approach and those of other groups, as well as trends in failures that suggest common pitfalls to reproducibility, even for repositories that may have been reproducible in the past.

Finally, we use these findings to inform proposals for new features in repo2docker to improve the likelihood of reproducing a working environment from a given repository.

repo2docker testing code: https://github.com/minrk/repo2docker-checker

Study data: https://github.com/Vildeeide/repo2docker-reproducibility