In this talk, we will first reflect on the common practices data scientists currently use to collaborate, as well as tools designed to facilitate collaboration in different scenarios. We will then present Callisto, an extension to computational notebooks that captures and stores contextual links between discussion messages and notebook elements with minimal effort from users.
Our talk will last for 30 minutes and we will cover the following items:
We will begin by introducing the design of computational notebooks: what computational notebooks are, why they are popular among data scientists, how Project Jupyter is designed, and how Jupyter Notebook has been widely used for writing and sharing computational narratives in various contexts. The goal of this part is to situate the audience. Even if attendees are not familiar with computational notebooks, we hope that by introducing the design through verbal descriptions, screenshots, and video recordings, we can give them a general understanding of computational notebooks.
Data science involves a large amount of experimentation and subjective decision making. Thus, it is important for data scientists to document the story behind the computation of results (e.g., reporting alternative solutions and explaining their limitations). Data scientists often create computational narratives, which combine data, code to process those data, and natural language explanations to form a narrative. Some even consider computational narratives to be the engine of collaborative data science. Computational notebooks allow data scientists to create and share computational narratives.
Jupyter Notebook is the most popular computational notebook platform and supports more than 40 programming languages. Project Jupyter evolved from IPython, a terminal-based interactive shell that was originally designed for creating interactive visualizations for scientific computing. Wrapping IPython as the kernel, Project Jupyter is designed as a web-based platform for authoring a single document that combines code cells and intermediate results.
Jupyter notebooks consist of "cells", which are typically small chunks of code or narrative text in the Markdown format. Users can execute cells (typically, but not necessarily, from top to bottom) and observe their outputs, which can include visualizations, data frames, or rendered narrative text.
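To make the cell structure concrete, the following is a minimal sketch of how a two-cell notebook might look on disk, assuming the JSON layout of notebook format version 4. The cell contents and the `summarize_cells` helper are invented for illustration and are not part of Jupyter itself.

```python
import json

# A hand-built notebook following the nbformat 4 JSON layout.
# The cell contents below are made up purely for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",  # narrative text written in Markdown
            "metadata": {},
            "source": "# Load the data\nWe start by reading the CSV file.",
        },
        {
            "cell_type": "code",      # executable code
            "metadata": {},
            "execution_count": None,  # None until the cell has been run
            "outputs": [],            # populated by the kernel after execution
            "source": "import pandas as pd\ndf = pd.read_csv('data.csv')",
        },
    ],
}


def summarize_cells(nb):
    """Hypothetical helper: return (cell_type, first source line) per cell."""
    return [(c["cell_type"], c["source"].splitlines()[0]) for c in nb["cells"]]


# Round-trip through JSON, as would happen when saving and reopening a file.
serialized = json.dumps(notebook, indent=1)
summary = summarize_cells(json.loads(serialized))

for cell_type, first_line in summary:
    print(f"{cell_type:8} | {first_line}")
```

Because code and narrative live side by side in one list of cells, a reader can follow the explanation and the computation in the same order, which is what makes the format suitable for computational narratives.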
Jupyter Notebook has been widely used for writing and sharing computational narratives in various contexts. Kross and Guo interviewed practitioners who taught data science and found that Jupyter notebooks have been widely used by instructors to deliver course materials. They also found that Jupyter notebooks allow students to easily write computational narratives with a low cost for setting up an environment. Kery et al. studied how professional data scientists used Jupyter notebooks in their daily work to create computational narratives. Randles et al. investigated how Jupyter notebooks can be used for open science under the principles of FAIR (Findable, Accessible, Interoperable, Reusable). In fact, some academic venues encourage paper authors to include notebooks with their submissions (e.g., the Distill Journal in the area of machine learning).
In this part, we will reflect on limitations and challenges with current computational notebooks, as well as tools and innovations that are designed to better support exploration and collaboration in notebooks.
Studies have identified several limitations with computational notebooks. Rule et al. conducted a large-scale analysis of over 1 million open-source computational notebooks and found that only one in four contained explanatory text. Kery et al. interviewed 21 data scientists to study their coding behaviors using computational notebooks and highlighted the challenges of tracking the history of experimentation. Both studies revealed the tension between using computational notebooks for rapid exploration and instructive explanation. For quick exploration, data scientists sometimes generate messy and informal notebooks, which can be difficult for others (or even the author) to read later on. Data scientists have to use strategies like actively pausing the experiment to curate and clean notebooks into narratives, which may hinder the exploration process. The tension between quick exploration and instructive explanation can be contextually sensitive depending on how exploratory and open-ended the task is.
Building upon current computational notebooks, other systems have explored designs to better support non-linear exploration in notebooks. For instance, Kery et al. integrated a lightweight local versioning mechanism to help data scientists keep track of their exploration history, while Rule et al. took a different approach, enabling data scientists to fold content blocks with annotations in notebooks. Head et al. explored code gathering extensions that help data scientists manage and navigate cluttered and inconsistent notebooks, and Zhang and Guo created DS.js to transform any webpage into a computational notebook, lowering the barrier for novices to retrieve data for exploration.
Companies and the notebook community have also built innovations on the notebook infrastructure to support various data-related tasks in practice. For example, the Netflix data team developed nteract, with extra features for data exploration; Papermill, which facilitates rapid exploration by spawning notebooks for different parameter sets; and Commuter, a platform for curating and sharing notebooks. Project ReviewNB enables better diff rendering for notebook projects using git. It also allows cell-level commenting for better team collaboration.
Recently, tools like Google Colab and DeepNote have further fostered collaborative data science by allowing multiple users to edit the same notebook in real time. However, prior studies of computational notebooks have examined only individuals authoring notebooks, not multiple data scientists authoring them collaboratively. Only recently, Koesten et al. interviewed data practitioners about their collaborative practices with structured data. They synthesized collaboration needs across a wide range of scenarios, from co-creating data analyses to reusing others’ data in a new context.
We are a team of researchers from the University of Michigan interested in redesigning notebooks for better collaboration. Our recent study investigates real-time collaborative editing in computational notebooks. We aim to understand whether collaborative editing helps collaborators maintain a shared understanding or instead intensifies the tension between exploration and explanation. Based on the results, we built Callisto, an extension to computational notebooks that captures and stores contextual links between discussion messages and notebook elements with minimal effort from users.
In this part, we will present a recent study on how data scientists use computational notebooks for real-time collaboration. We will introduce the study procedure and report the main findings. We would like to attach the slides for this paper, which is being presented at the ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) this year.
To understand the tools and strategies data scientists currently use in practice, we first conducted a survey with 195 data scientists and data science students from diverse backgrounds. In particular, we identified two approaches to collaboration: 1) the traditional collaboration setting, where team members work on individual Jupyter notebooks and update each other’s work asynchronously, and 2) the emerging collaboration setting, where team members work closely together on a shared Jupyter notebook and all edits are synchronized in real time.
To further compare how data scientists’ collaboration styles varied between the two approaches, we conducted an observational study with 24 intermediate data scientists working remotely in pairs to solve a predictive modeling problem. We summarized the common collaboration styles that emerged in the two collaboration settings. We reported the comparison of communication styles, performance, and perceptions of the collaboration experience. We also analyzed the challenges that participants faced in using real-time collaborative editing features.
Our main findings indicate that synchronous editing helps data scientists maintain a shared understanding while reducing communication costs, thus improving the overall efficiency of collaboration. However, current synchronous editing features can be challenging to use and require collaborators to be strategic with respect to coordination.
In this part, we will present the design of Callisto with a set of features to make chat messages more useful for understanding the past exploration process. We will give a live demo of the tool (if the live demo does not work for any reason, we will have a backup video demo).
When teams of data scientists collaborate on computational notebooks, their discussions often contain valuable insight into their design decisions. These discussions explain not only the analysis in the current notebook but also alternative paths, which are often poorly documented. However, these discussions are disconnected from the notebooks for which they could provide valuable context. We designed Callisto to improve collaborative data science by better connecting discussions with notebook content.
Callisto extends the Jupyter Notebook platform in several ways. First, it allows users to share notebooks, collaborate in real-time, and discuss with collaborators. Second, it enables users to connect discussions with elements in the shared notebook, including code, output, individual cells, or edits. Third, it leverages these connections to make it easier to navigate discussions and notebook content—for example, to find discussions about a particular part of the notebook.
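The idea of linking discussion messages to notebook elements can be sketched as a small data model. The sketch below is purely illustrative and is not Callisto's actual implementation: the `ElementRef`, `Message`, and `Discussion` names, their fields, and the cell identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical data model for illustration only; NOT Callisto's real code.
@dataclass(frozen=True)
class ElementRef:
    """A pointer from a chat message to some part of the shared notebook."""
    kind: str         # assumed values: "cell", "code", "output", or "edit"
    cell_id: str      # identifier of the cell the element belongs to
    detail: str = ""  # optional extra context, e.g. a line range


@dataclass
class Message:
    author: str
    text: str
    refs: List[ElementRef] = field(default_factory=list)


class Discussion:
    """Stores chat messages and answers 'what was said about this cell?'."""

    def __init__(self):
        self.messages = []

    def post(self, author, text, refs=()):
        msg = Message(author, text, list(refs))
        self.messages.append(msg)
        return msg

    def about_cell(self, cell_id):
        # Keep only messages with at least one link into the given cell.
        return [m for m in self.messages
                if any(r.cell_id == cell_id for r in m.refs)]


chat = Discussion()
chat.post("ana", "Why drop rows with missing age?",
          [ElementRef("code", "c3", "lines 2-4")])
chat.post("ben", "Let's meet at 3pm.")  # no link to the notebook
chat.post("ana", "The ROC plot looks overfit.", [ElementRef("output", "c7")])
chat.post("ben", "Imputing instead of dropping now.", [ElementRef("edit", "c3")])

related = chat.about_cell("c3")
for m in related:
    print(f"{m.author}: {m.text}")
```

Storing the links on the message rather than in the notebook keeps the notebook file unchanged in this sketch; a real system would also need to keep links valid as cells are edited, moved, or deleted.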
Callisto allows notebook readers to better understand the current notebook content and the overall problem-solving process that led to it, by making it possible to browse the discussions and code history relevant to any part of the notebook. This is particularly helpful for onboarding new notebook collaborators to avoid misinterpretations and duplicated work, as we found in a two-stage evaluation with 32 data science students.