
End-to-end notebook life cycle at Astro Data Lab

Robert Nikutta

Audience level:
Intermediate

Brief Summary

Astro Data Lab (DL) is NOIRLab's science platform, providing free access to ~100 TB of astronomical catalogs and over 2 PB of associated images, plus co-located compute. The main access mode is via Jupyter notebooks using IVOA protocols and DL APIs. Our poster shows the life cycle of DL-curated notebooks, from development on GitHub through deployment, launching at scale, and testing in our environment.

Outline

The Astro Data Lab [1] science platform is developed and operated at the Community Science and Data Center (CSDC) of NSF's National Optical-Infrared Astronomy Research Laboratory (NOIRLab). Free and open access to its enormous volumes of astronomical data is facilitated in several ways: through web and programmatic APIs, through our own set of authentication/query/virtual-storage clients, or through third-party clients aware of IVOA protocols such as TOPCAT. However, the major mode for our users is a dedicated Jupyter notebook server, where all necessary DL APIs and packages relevant for astronomy are pre-loaded with the kernel.
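As an illustration, a minimal catalog query from a notebook could look like the sketch below. It assumes the Data Lab Python client (the dl package pre-loaded in the kernel) and an illustrative catalog table; exact names and signatures should be checked against the current client documentation.

    # Minimal sketch of a synchronous catalog query from a Data Lab notebook.
    # Assumes the Data Lab Python client ("dl" package) pre-loaded in the kernel;
    # the table name is illustrative and signatures may differ in detail.
    from io import StringIO
    import pandas as pd
    from dl import queryClient as qc

    sql = "SELECT ra, dec, gmag FROM smash_dr1.object LIMIT 10"  # example table
    csv_text = qc.query(sql=sql, fmt='csv')   # result returned as CSV text
    df = pd.read_csv(StringIO(csv_text))
    print(df)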

To lower the entry barrier and to train our users, the Data Lab team continually develops and curates a large suite of example notebooks that showcase the capabilities of the science platform. These notebooks range from introductory level, through technical how-to notebooks, to complete end-to-end science examples. The latter typically reproduce a scientific result from the literature, starting with the science question and then showing how to access the required data, carry out the analysis, and present the results. Data Lab also encourages its users to contribute interesting science notebooks. We provide a detailed contribution guide and a notebook template to follow, and offer support in polishing and testing contributed notebooks.

All development is version controlled on GitHub [2]. The entire notebook suite is provided to all newly registered users in their account space, and a user can obtain the latest copy at any time from a local repository via a dedicated shell function. This local repository must be kept in sync with the GitHub master branch. In the past this was a manual and time-consuming process, but it can be fully automated with GitHub webhooks. Now, whenever a pull request is merged into master by a Data Lab team member, a webhook fires and sends a payload to a service endpoint on the Data Lab servers. Our service verifies the payload's authenticity and pulls from GitHub into the local repository. Since this local cache is mounted read-only in every user's notebook space, the newest notebooks are immediately available to all users.
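A minimal sketch of such a receiver is shown below. It assumes a Flask endpoint, a shared webhook secret, and an illustrative repository path; the actual Data Lab service differs in its details.

    # Sketch of a webhook receiver that keeps the local notebook cache in sync.
    # Endpoint name, repository path, and secret handling are illustrative only.
    import hashlib
    import hmac
    import os
    import subprocess

    from flask import Flask, abort, request

    app = Flask(__name__)
    SECRET = os.environ["WEBHOOK_SECRET"].encode()   # shared with GitHub
    REPO_DIR = "/data/notebooks-latest"              # hypothetical local cache

    def signature_ok(payload: bytes, header: str) -> bool:
        """Check GitHub's HMAC-SHA256 signature of the request body."""
        if not header or not header.startswith("sha256="):
            return False
        expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(header.split("=", 1)[1], expected)

    @app.route("/notebook-sync", methods=["POST"])
    def notebook_sync():
        sig = request.headers.get("X-Hub-Signature-256", "")
        if not signature_ok(request.get_data(), sig):
            abort(403)                               # reject unauthenticated beacons
        event = request.get_json(silent=True) or {}
        if event.get("ref") == "refs/heads/master":  # only act on merges to master
            subprocess.run(["git", "-C", REPO_DIR, "pull", "--ff-only"], check=True)
        return "ok", 200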

The final challenge is ensuring that all notebooks continue to execute correctly at all times; this is a serious promise to our user community, and quite difficult to guarantee manually, since changes are frequent in the DL-hosted datasets, the database backend engine and host configurations, and the middleware code base that powers all of Data Lab. We solved this problem with nbconvert and a custom workflow script, which globs for all notebooks to be tested (exclusion patterns are permitted), executes them with nbconvert, and records success or failure, including any error tracebacks. The script can be run either from a terminal or, for convenience, as a "super"-notebook. Automation is then only a cronjob away.
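A condensed sketch of such a workflow script is given below, using nbconvert's ExecutePreprocessor; the exclusion patterns, paths, and report format are illustrative rather than Data Lab's actual script.

    # Sketch of a notebook test runner in the spirit of the workflow above.
    # Exclusion patterns, paths, and the report format are illustrative.
    import glob
    import os
    import traceback

    import nbformat
    from nbconvert.preprocessors import CellExecutionError, ExecutePreprocessor

    EXCLUDE = ("Template", "drafts/")   # example exclusion patterns

    def run_notebook(path, timeout=600):
        """Execute one notebook; return (success, error traceback)."""
        nb = nbformat.read(path, as_version=4)
        ep = ExecutePreprocessor(timeout=timeout, kernel_name="python3")
        try:
            ep.preprocess(nb, {"metadata": {"path": os.path.dirname(path) or "."}})
            return True, ""
        except CellExecutionError:
            return False, traceback.format_exc()

    results = {}
    for path in sorted(glob.glob("**/*.ipynb", recursive=True)):
        if any(pattern in path for pattern in EXCLUDE):
            continue
        ok, err = run_notebook(path)
        results[path] = (ok, err)
        print(("PASS" if ok else "FAIL") + "  " + path)

    # A nightly cron entry can then drive the script, e.g.
    # 0 3 * * *  python /path/to/test_notebooks.py >> /var/log/nbtest.log 2>&1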

Our poster lays out the entire notebook life cycle at Data Lab, and explains in detail the technical challenges and solutions that Data Lab has arrived at.

Links:
[1] https://datalab.noao.edu
[2] https://github.com/noaodatalab/notebooks-latest/