Sarah Gibson
Sarah Gibson is an Open Source Infrastructure Engineer at 2i2c, and an open source contributor and advocate. She has more than two years of experience as a Research Engineer at a national institute for data science and artificial intelligence, and holds core contributor roles in the open source projects Binder, JupyterHub, and The Turing Way. Sarah is passionate about working with domain experts to leverage cloud computing to accelerate cutting-edge, data-intensive research and to disseminate the results in an open, reproducible and reusable manner. Sarah holds a Fellowship with the Software Sustainability Institute and advocates for best software practices in research. She is a member of the mybinder.org operating team and maintains infrastructure supporting a global community in sharing reproducible computational environments. She has also mentored projects through two cohorts of the Open Life Science programme, sharing her lived experience of participating in and leading open science projects.
Sessions
In 2021, JupyterHub was awarded a CZI EOSS grant to improve community practices around inclusion within the project, and that work began in earnest in 2022. An important part of this work involves developing pathways into the community that cater for i) contributors that are diverse and bring a new perspective that is not already represented in our community; and ii) contributors beyond the “burnt out PhD” archetype that is prevalent throughout the landscape of open source scientific software.
One strategy we employed from the start of the grant-writing process was to secure funding for four rounds of Outreachy, with two interns per round, over the grant duration of two years. Outreachy is a mission-aligned organisation dedicated to placing interns from backgrounds that are underrepresented in tech into open source projects. The mentorship these interns receive is the bedrock on which sustainable entry-level pathways into the community can be built. Since Outreachy supports more than only coding projects, we can also provide other pathways into the community that do not rely on being a “coder” or “software developer”.
This kind of “Mountain of Engagement” work is important to any community-led project, whether within the Jupyter ecosystem or beyond, and as such we have been capturing lessons learned in a guide as we go. This will ensure that the process of participating in Outreachy as a community is a little more repeatable with each round, and provide clear pathways for other community members to become involved in the processes after the term of the grant. We also hope that by sharing our experiences, this resource becomes usable by other Jupyter subprojects, or elsewhere, to begin their own internship initiatives.
- Repository: https://github.com/jupyterhub/outreachy
- Website: https://jupyterhub-outreachy.readthedocs.io
By the time JupyterCon 2023 arrives, JupyterHub will have completed the first Outreachy round funded by the CZI grant. We have already learned, and will continue to learn, a great deal about the processes required for running these internships, which we have captured in the above guide. During this talk, we will discuss some strategies the JupyterHub team implemented during this initial round, such as:
- Establishing partnerships with other mentoring organisations, such as Open Life Science, to deliver support through mentor training and cohort calls for interns
- Developing processes during the Outreachy contribution period to manage and evaluate applications
The International Interactive Computing Collaboration (2i2c) manages the configuration and deployment of multiple Kubernetes clusters and JupyterHubs for a range of research and education purposes, spanning not only domains, but the globe. For the sake of optimising our engineering team’s operations, we manage these deployments from a single, open infrastructure repository. This presents a challenging problem since we need to centralise information about a number of independent cloud vendors, and independent JupyterHubs whose user communities are not necessarily related. Given that each hub has an independent user base, this centralisation must not come at the cost of a community being unable to extricate their JupyterHub configuration from 2i2c’s infrastructure and deploy it elsewhere, as detailed by our Right to Replicate.
In this talk, we will discuss a recent overhaul of 2i2c’s tooling that facilitates the centralisation of information and optimal operation of the engineering team, whilst protecting a community’s Right to Replicate their infrastructure. Critical to protecting the Right to Replicate is a configuration schema for both clusters and JupyterHubs that specifies where these files should live in the repository and how their contents should be structured. Each JupyterHub we deploy is defined by its own set of configuration files, which enables simple extrication from the repository, and each can be deployed independently with a basic command. There is no added magic in the rest of 2i2c’s specific tooling that would prevent this.
Further tooling to optimise the deployment and management of these JupyterHubs for 2i2c’s engineering team includes:
- A Python “deployer” module that knows how to read the configuration for a given JupyterHub on a given cluster and can perform an upgrade action
- A function within the deployer module that can extrapolate which JupyterHubs on which cluster require an upgrade from a list of changed files in the repository (e.g. from a Pull Request)
- A GitHub Actions workflow that can deploy to multiple clusters in parallel, deploy production JupyterHubs in parallel, implement Canary deployments using staging JupyterHubs, and intelligently prevent a Canary deployment failure affecting the deployments on an unrelated cluster
Details of these efforts were first published in the “Tech update: Multiple JupyterHubs, multiple clusters, one repository” blog post.
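As an illustration of the second item in the list above, here is a minimal Python sketch of how a list of changed files could be mapped to the JupyterHubs that need an upgrade. It is not the deployer module's actual code. The directory layout (`config/clusters/<cluster>/cluster.yaml` plus one `<hub>.values.yaml` file per hub), the function name `hubs_requiring_upgrade`, and the `ALL_HUBS` marker are all assumptions made for this example.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical marker meaning "every hub on this cluster needs an upgrade".
ALL_HUBS = "*"
# Hypothetical suffix for per-hub configuration files.
HUB_SUFFIX = ".values.yaml"


def hubs_requiring_upgrade(changed_files):
    """Map each affected cluster to the hubs that need an upgrade.

    Assumes a layout of config/clusters/<cluster>/cluster.yaml for
    cluster-wide settings and config/clusters/<cluster>/<hub>.values.yaml
    for each hub. Files outside that layout are ignored.
    """
    upgrades = defaultdict(set)
    for name in changed_files:
        parts = PurePosixPath(name).parts
        if len(parts) != 4 or parts[:2] != ("config", "clusters"):
            continue  # not a cluster or hub configuration file
        cluster, filename = parts[2], parts[3]
        if filename == "cluster.yaml":
            # A cluster-wide change affects every hub on that cluster.
            upgrades[cluster] = {ALL_HUBS}
        elif filename.endswith(HUB_SUFFIX) and ALL_HUBS not in upgrades[cluster]:
            upgrades[cluster].add(filename[: -len(HUB_SUFFIX)])
    return {cluster: sorted(hubs) for cluster, hubs in upgrades.items()}


changed = [
    "config/clusters/alpha/staging.values.yaml",
    "config/clusters/alpha/prod.values.yaml",
    "config/clusters/beta/cluster.yaml",
    "config/clusters/beta/staging.values.yaml",
    "docs/index.md",
]
print(hubs_requiring_upgrade(changed))
# {'alpha': ['prod', 'staging'], 'beta': ['*']}
```

Keeping one set of files per hub is what makes this mapping simple: a change to a hub's own file touches only that hub, while a change to the cluster-level file is treated as affecting every hub on that cluster.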
JupyterHub has a range of documentation that covers both developer and user audiences in order to help them deploy, maintain, and use their own instance of a JupyterHub. The success of an open source software project in (i) being adopted by users and (ii) receiving meaningful contributions relies heavily on the quality, navigability and accessibility of its documentation, so that users and developers have all the information they need to achieve what they want to do.
A framework for organising technical documentation has arisen called Diátaxis. It takes a systematic approach to understanding user requirements of documentation throughout the lifecycle of interaction with a product and posits that different user needs require different approaches in creation of the documentation, as well as a layout to navigate these different “modes” of documentation.
Between December 2022 and March 2023, the JupyterHub project is participating in the December 2022 round of Outreachy internships with the aim of improving its documentation. The project focuses on refactoring the documentation for the JupyterHub package. As the intern in charge of this process, I began by reviewing the present documentation, categorising it according to the Diátaxis framework, and then restructuring the documentation files in the repository. Once the documentation is transformed into this framework, it will be easier to identify missing and unclear documentation (the parts that were difficult to categorise). Subsequently, the JupyterHub team can curate resources that fill the gaps and improve documentation that is not specific enough.
This undertaking is not without its challenges, joys, and lessons, which can be extrapolated to other open-source documentation projects. The proposed talk will focus on highlighting these areas from the point of view of the JupyterHub Outreachy intern as well as their lead mentor. Specifically, it will seek to cover three main points within the allocated talk time:
- What is the importance of well-written and -structured documentation to an open-source project and to JupyterHub, specifically?
- What is the Diátaxis framework and why did JupyterHub choose it to restructure its documentation?
- What lessons can other open-source projects learn from JupyterHub’s experience to make clear and well-structured documentation?
The talk targets anyone who authors or contributes to open-source software documentation, including technical writers and team leads. The audience can be working with any programming language but should have intermediate knowledge of technical writing practices, including what it entails and some of the tools used.