Science abounds with large, complex datasets that are shared by many researchers. Publishing these datasets in an analysis-ready, cloud-optimized format opens up new possibilities for scientific discovery. This talk will describe emerging best practices for creating and maintaining cloud-native scientific data repositories using open-source tech, and an implementation by the Pangeo Project.
This talk is for anyone who is interested in using technology to help science advance. Contemporary science abounds with large, complex datasets that are shared by many researchers. For example, thousands of climate scientists work with the same multi-petabyte climate model simulation dataset (CMIP6). The Human Cell Atlas and ESA’s Gaia star database play similar roles for biologists and astronomers, respectively. These datasets offer exciting potential for new discoveries on important scientific problems and also represent an ideal target for exploitation by emerging machine-learning approaches. However, the science community’s approach to infrastructure may be holding us back from realizing this potential.
Traditionally, scientific data has been distributed via a download model, wherein scientists download many individual data files to local computers for analysis. Yet the download model poses several challenges. After downloading all these files, scientists typically have to do extensive processing and organizing to make them useful for data analysis; this creates a barrier to reproducibility, since a scientist’s analysis code must account for this unique “local” organization. Furthermore, the sheer size of the datasets (many TB to PB) can make downloading effectively impossible. Finally, this model reinforces inequality between privileged institutions that have the resources to host local copies of the data and those that don’t, which restricts who can participate in science.
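To put the size argument in rough numbers, the short sketch below estimates how long one full download would take. The dataset sizes and the sustained 1 Gbit/s link speed are illustrative assumptions, not figures from the talk, and the calculation ignores protocol overhead and retries.

```python
def download_days(size_bytes: float, link_bits_per_s: float) -> float:
    """Return the idealized transfer time in days, ignoring overhead."""
    seconds = size_bytes * 8 / link_bits_per_s
    return seconds / 86_400  # seconds per day


TB = 10**12  # decimal terabyte
PB = 10**15  # decimal petabyte
LINK = 1e9   # assumed sustained throughput: 1 Gbit/s (hypothetical)

for label, size in [("10 TB", 10 * TB), ("1 PB", PB)]:
    print(f"{label}: {download_days(size, LINK):.1f} days")
```

Under these assumptions, a single petabyte takes about three months to transfer, before any local processing begins.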
Cloud computing, with its ability to place large datasets and massive computational resources in close proximity, seems to offer an ideal solution to these problems. However, there are many different possible ways to organize and structure cloud computing for data-driven science. In this talk, we will outline the difference between closed platforms and open architectures. Closed platforms, like Google Earth Engine, are one-stop-shop solutions that provide both data and computing. They are very powerful but generally controlled by a single company, with limited flexibility and modularity. Open architectures assume data will be distributed over the internet and seek interoperability between different data catalogs and computational tools. Although open architectures are less polished, we argue that they are the best path forward for big-data scientific infrastructure.
Motivated by the vision of open architecture, the Pangeo Project has begun to build a prototype cloud-native repository for big climate data. This repository consists of:
We also identify the biggest current challenge in operating this repository: the need to automate the production of analysis-ready, cloud-optimized (ARCO) data from diverse primary data repositories. We suspect that the collaborative production of datasets is an important emerging frontier in data-driven science, and we describe our nascent efforts to build open-source tools to ease this process. We conclude by describing how other institutions and disciplines can collaborate around these tools and adapt the Pangeo approach to meet their needs.