Co-founder at 2i2c.org. Ex-Wikimedia, ex-GNOME. On a motorcycle, watching Star Trek, or texting someone when not on a computer. Death to accidental complexity.
Building & maintaining the Docker image used whenever a user logs into a JupyterHub is one of the most time-consuming, difficult, yet rewarding parts of running a JupyterHub. Done right, it can wow users: "YOU CAN DO THAT?!". Done poorly, something as simple as adding a new Python package can turn into a multi-week ordeal that breaks everything and makes the maintainer of the image hate computers. Much of the industry's Docker image advice needs to be modified when building images for use with JupyterHub, because these images are meant to have arbitrary code executed in them. General Docker advice often does not cover our use cases at all (who else is putting Fortran into Docker?).
This talk summarizes lessons learnt in building a wide variety of images for JupyterHubs over the years. It will cover:
- Building the simplest possible image that can work with a JupyterHub
- Best practices for installing Python & most Python-related packages inside the image, and why (hint: use
- When to base your image on a community-built Docker image (such as rocker, pangeo, or jupyter-stacks), and the tradeoffs involved
- Basic maintenance that must be performed on an image periodically so a simple package install doesn't turn into a three-week nightmare
- Suggestions for automatically building & testing images with CI/CD and mybinder.org
- Best practices for including R in your image
- Best practices for running non-Jupyter frontends (such as RStudio, a virtual Linux desktop, code-server, etc.) in your JupyterHub
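As a taste of the first bullet, a minimal single-user image can be sketched roughly as follows. The base image, versions, and username here are illustrative assumptions, not prescriptions; the one hard requirement is that `jupyterhub-singleuser` ends up on the `PATH`:

```dockerfile
# Minimal sketch of a JupyterHub-compatible single-user image.
# Base image and pinned versions are illustrative, not prescriptive.
FROM python:3.11-slim

# Create a non-root user; spawners typically run the container as a
# regular user rather than root.
RUN useradd --create-home --shell /bin/bash jovyan

# The jupyterhub package provides the jupyterhub-singleuser entrypoint
# the Hub needs; its major version should match the Hub's.
RUN pip install --no-cache-dir jupyterhub jupyterlab

USER jovyan
WORKDIR /home/jovyan
```

Everything beyond this (domain packages, R, non-Jupyter frontends) layers on top of the same skeleton.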
Attendees will walk away knowing how to maintain a Docker image for use with JupyterHub in a way that keeps both the users and the maintainers of the image happy.
The International Interactive Computing Collaboration (2i2c) manages the configuration and deployment of multiple Kubernetes clusters and JupyterHubs for a range of research and education purposes, spanning not only domains but the globe. To optimise our engineering team's operations, we manage these deployments from a single, open infrastructure repository. This presents a challenging problem, since we need to centralise information about a number of independent cloud vendors and independent JupyterHubs whose user communities are not necessarily related. Because each hub has an independent user base, this centralisation must not come at the cost of a community being unable to extricate their JupyterHub configuration from 2i2c's infrastructure and deploy it elsewhere, as detailed by our Right to Replicate.
In this talk, we will discuss a recent overhaul of 2i2c's tooling that facilitates the centralisation of information and optimal operation of the engineering team, whilst protecting a community's Right to Replicate their infrastructure. Critical to protecting the Right to Replicate is a configuration schema for both clusters and JupyterHubs: which files should live in the repository, and how their contents should be structured. Each JupyterHub we deploy is defined by its own set of configuration files, which enables simple extrication from the repository, and each hub can be deployed independently with a basic command. There is no added magic in the rest of 2i2c's specific tooling that would prevent this.
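To illustrate the idea, one hub's configuration might look something like the fragment below. The keys and names are hypothetical, not 2i2c's actual schema; the point is that everything a community needs to redeploy their hub elsewhere lives in its own small set of files, independent of central tooling:

```yaml
# Hypothetical per-hub configuration file (names and keys are illustrative).
# A community leaving 2i2c's repository takes these files and deploys them
# with standard tools; nothing here depends on 2i2c-specific machinery.
name: example-hub
domain: example-hub.2i2c.cloud
helm_chart: basehub
helm_chart_values_files:
  - example-hub.values.yaml
```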
Further tooling to optimise the deployment and management of these JupyterHubs for 2i2c’s engineering team includes:
- A Python “deployer” module that knows how to read the configuration for a given JupyterHub on a given cluster and can perform an upgrade action
- A function within the deployer module that can infer which JupyterHubs on which clusters require an upgrade from a list of changed files in the repository (e.g. from a Pull Request)
- A GitHub Actions workflow that can deploy to multiple clusters in parallel, deploy production JupyterHubs in parallel, implement Canary deployments using staging JupyterHubs, and intelligently prevent a Canary deployment failure from affecting deployments on an unrelated cluster
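The changed-files-to-hubs mapping in the second bullet can be sketched roughly like this. The file layout, function name, and return shape are assumptions for illustration, not 2i2c's actual deployer implementation:

```python
from pathlib import PurePosixPath

def hubs_to_upgrade(changed_files):
    """Given paths changed in a PR, return the (cluster, hub) pairs to redeploy.

    Assumes a hypothetical layout of config/clusters/<cluster>/<hub>.values.yaml,
    plus shared directories whose changes trigger a redeploy of everything.
    """
    targets = set()
    for f in changed_files:
        parts = PurePosixPath(f).parts
        if parts[:2] == ("config", "clusters") and len(parts) == 4:
            # A single hub's config file changed: redeploy just that hub.
            cluster, hub_file = parts[2], parts[3]
            targets.add((cluster, hub_file.removesuffix(".values.yaml")))
        elif parts[0] in ("helm-charts", "deployer"):
            # Shared code changed: everything needs redeploying.
            return "all"
    return sorted(targets)
```

The GitHub Actions workflow then fans these targets out into parallel jobs, gating each cluster's production deploys on its own staging (Canary) hub succeeding.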
Details of these efforts were first published in the “Tech update: Multiple JupyterHubs, multiple clusters, one repository” blog post.