repo2docker powers mybinder.org, aiming to reliably turn repositories into interactive environments where notebooks can be executed, enabling reproducible interactive publications. We set out to validate repo2docker's mission of "automating existing best practices" by sampling GitHub repositories containing notebooks and testing whether repo2docker creates an environment in which those notebooks can be executed.
repo2docker's guiding principle is to "automate and encourage existing community best practices for reproducible computational environments", which it does by generating Dockerfiles with installation commands based on standard files, such as requirements.txt or environment.yml, found in a repository.
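For instance, a repository that ships one of these files can be built into a runnable image with a single repo2docker invocation. The sketch below is illustrative rather than the harness used in this study; the repository URL and image name are placeholder values.

```python
# Illustrative only: build (but don't launch) an image for a repository whose
# environment is described by requirements.txt or environment.yml.
import subprocess

repo_url = "https://github.com/example/notebook-repo"  # placeholder repository

subprocess.run(
    [
        "jupyter-repo2docker",
        "--no-run",                  # build the image without starting a container
        "--image-name", "r2d-test",  # placeholder tag for the built image
        repo_url,
    ],
    check=True,  # raise if the build fails
)
```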
Reproducibility can be challenging to measure: we can only observe repositories that were reproducible at publication time becoming irreproducible once enough time has actually passed. Notebooks have now been around long enough that this is happening with some regularity.
To measure reproducibility with repo2docker, we sampled repositories containing notebooks on GitHub and executed their notebooks using nbconvert. We used the lowest bar of "does it execute without errors" to explore the following questions:
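The check itself is deliberately simple. A minimal sketch is shown below, assuming notebooks are executed inside the environment built by repo2docker; paths and the timeout are illustrative, and the actual harness is in the repo2docker-checker repository linked at the end.

```python
# Illustrative sketch of the "does it execute without errors" criterion.
import subprocess
from pathlib import Path

def notebook_executes(notebook: Path, timeout: int = 600) -> bool:
    """Run a notebook with nbconvert; True if every cell executes without error."""
    result = subprocess.run(
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute",
            f"--ExecutePreprocessor.timeout={timeout}",
            str(notebook),
        ],
        capture_output=True,
    )
    # nbconvert exits non-zero when a cell raises an error
    return result.returncode == 0

notebooks = sorted(
    nb for nb in Path(".").rglob("*.ipynb")
    if ".ipynb_checkpoints" not in nb.parts  # skip editor checkpoints
)
failures = [nb for nb in notebooks if not notebook_executes(nb)]
print(f"{len(failures)} of {len(notebooks)} notebooks failed to execute")
```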
We used two sources of repositories for testing:
Because of the prior study, we are able to compare the results of our repo2docker-based approach with another group's approach to measuring the reproducibility of the same repositories, evaluated at a different point in time.
We will present key differences between repo2docker's approach and the approaches of other groups, as well as trends in failures that suggest common pitfalls for reproducibility, even for repositories that may have been reproducible in the past.
Finally, we will use these findings to inform proposals for new features in repo2docker that improve the likelihood of reproducing a working environment from a given repository.
repo2docker testing code: https://github.com/minrk/repo2docker-checker
Study data: https://github.com/Vildeeide/repo2docker-reproducibility