A software engineer who does something meaningful
Audience level: Intermediate
- Project Jupyter users, looking to build Jupyter Extensions
- Developers hoping to take a deep dive into Project Jupyter
In this session, we will introduce 2 challenges we faced on Project Jupyter while developing our extension, MakinaRocks Link, and explain how we solved these challenges and implemented our ideas:
- Handling how to display cell outputs
- Handling how to display error messages
We hope that sharing the lessons learnt from our deep dive into Project Jupyter can help other developers building similar Project Jupyter-related extensions.
Background & Introduction
MakinaRocks Link is a JupyterLab extension that allows you to create pipelines on Jupyter by converting cells into components and setting parent-child relationships between them. We built this extension so that users keep access to all features originally available in the Jupyter environment. They say a picture is worth a thousand words - click below to find out more about how to create and run pipelines.
2 challenges we wanted to solve
- 1 Displaying only the outputs of a child cell, when multiple parent cells are executed internally
- When we run a certain component in a pipeline, we execute all dependent components as well. Then, what would be the ideal output when that child component is run?
- Before we solved the issue - The outputs of all parent components are also displayed
- 2 Displaying error messages like Jupyter
- We hoped to replicate the usability of JupyterLab as much as possible. So, when an error occurs, we wished to replicate the JupyterLab error message and not expose our source code operating in the background, which would act as unnecessary noise for the user.
- Before we solved the issue - When an error occurs, the extension source code is also displayed, making it difficult for users to find the error messages about their own source code.
#1. Displaying only the outputs of a child cell, when multiple parent cells are executed internally
- Our goal was to display only the output from the “executed component” and store the outputs from the other components internally
- Description of the original logic for displaying the output of a child cell with multiple parent cells on JupyterLab
- Development steps to display only the output from the “executed child component”
- Inspired by
- Learnt that stdout & stderr are related to the `OutStream` object, and the displays (images etc.) are related to the `DisplayPublisher` object
- Our solution - switched the `OutStream` and `DisplayPublisher` objects and modified the `publish` methods respectively to implement our goal.
- Inspired by
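A minimal sketch of the idea above, using simplified stand-ins: `CapturingPublisher` mimics a display publisher's `publish` interface and `FakeShell` stands in for the kernel shell. These names are ours for illustration, not Link's or ipykernel's actual classes.

```python
import io
import sys
from contextlib import contextmanager

class CapturingPublisher:
    """Stand-in for a DisplayPublisher: stores payloads instead of displaying them."""
    def __init__(self):
        self.captured = []

    def publish(self, data, metadata=None):
        # Keep the display payload internally rather than emitting it to the frontend.
        self.captured.append((data, metadata))

class FakeShell:
    """Minimal stand-in for the kernel shell holding a display publisher."""
    def __init__(self):
        self.display_pub = None

@contextmanager
def capture_outputs(shell):
    """Swap in capturing objects while parent cells run internally, then restore."""
    saved_pub, saved_stdout = shell.display_pub, sys.stdout
    capture = CapturingPublisher()
    shell.display_pub = capture
    sys.stdout = io.StringIO()          # swallow stdout from parent cells
    try:
        yield capture
    finally:
        shell.display_pub, sys.stdout = saved_pub, saved_stdout

shell = FakeShell()
orig_stdout = sys.stdout
with capture_outputs(shell) as cap:
    print("noise from a parent cell")                        # stored, not shown
    shell.display_pub.publish({"text/plain": "a parent figure"})
```

Only output produced outside the context manager reaches the user; everything captured inside can be stored for later inspection.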
#2. Displaying error messages like Jupyter
- Our goal was to provide the identical Jupyter experience that users are familiar with, minimizing the time lost to learning something new. Hence, displaying error messages in the following way allows users to experience the same Jupyter error-message environment.
- Description of the original logic of displaying error messages on JupyterLab
- Development steps for displaying the error message in the ‘right’ way
- Learnt that the `showtraceback` method, which is used on the `ZMQInteractiveShell` object, displays error messages.
- Eureka! Found that the `showtraceback` method allows customizations.
- Our solution - wrote an algorithm to implement `_render_traceback_()` to display only the actual error message with zero noise.
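IPython's `showtraceback` uses an exception's `_render_traceback_()` method, when it is defined, instead of formatting the traceback itself. A hedged sketch of how that hook can filter noise; `LinkExecutionError` and the filtered path are illustrative, not Link's actual code:

```python
import traceback

class LinkExecutionError(Exception):
    """Hypothetical wrapper that re-renders an underlying error without extension frames."""
    def __init__(self, original):
        super().__init__(str(original))
        self.original = original

    def _render_traceback_(self):
        # IPython calls this (if present) to obtain the traceback lines to display.
        lines = traceback.format_exception(
            type(self.original), self.original, self.original.__traceback__
        )
        # Drop frames belonging to the extension's own machinery (path is illustrative).
        return [ln.rstrip("\n") for ln in lines if "/our_extension/" not in ln]

# Simulate a user error raised while the extension runs a cell internally.
try:
    raise ValueError("bad input in the user's cell")
except ValueError as err:
    wrapped = LinkExecutionError(err)
rendered = wrapped._render_traceback_()
```

Raising `wrapped` inside IPython would then show only the filtered lines, so the user sees their own error rather than the extension's internals.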
We have shared our case of deep diving into Project Jupyter and how we solved the challenges of showing the ‘right’ cell outputs and the ‘right’ error messages. We hope that Project Jupyter users can enhance their understanding of Jupyter and be inspired by our story to solve more challenges.
Audience level: Novice
- Everyone who knows Project Jupyter
JupyterLab is an IDE loved by many in the fields of data science and machine learning. Jupyter provides an outstanding interactive feature that allows REPL-based execution and review of cell-level code, facilitating data exploration and machine learning experiments. Its users range from students to experts who rely on Jupyter for their work.
Data science and machine learning code generally requires large amounts of computing. Running this code on personal laptops or local environments may take excessive amounts of time or fail due to a memory shortage. These issues can be resolved by installing JupyterLab on a high-compute workstation and accessing it via port forwarding, or by deploying it on a Kubernetes cluster. However, using a remote workstation’s JupyterHub or JupyterLab can lead to issues with shared resources. If the IPython kernel connected to a Jupyter notebook is not terminated, resources such as memory and the GPU are not returned, which means other users of the workstation cannot use those resources when they need them.
We thought of new ways to execute code remotely on JupyterLab while avoiding these issues, and implemented a remote execution feature that runs code in remote environments per the user request. Link allows each pipeline component (i.e. each Jupyter cell) to run either locally or in a designated remote environment. Moreover, the resources used for execution are returned automatically, leading to more efficient shared-resource management. In the next section, we explain the design of Link’s remote execution feature.
Remote execution on Link
A Link pipeline consists of one or more components, and each component corresponds to one Jupyter cell. Each component has properties, which contain execution information for the local or remote environment. Depending on this information, each component can be executed in an independent environment. Link executes code in the specified environment per the user request and returns the resources used. As a result, users can efficiently use and manage the workstation’s shared resources.
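The component-and-properties structure described above might be modeled roughly as follows; field names such as `execution_env` are our assumptions for illustration, not Link’s actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Component:
    """One pipeline component, corresponding to one Jupyter cell."""
    name: str
    code: str
    parents: List[str] = field(default_factory=list)  # names of parent components
    execution_env: str = "local"  # "local" or the alias of a registered remote worker

# A two-component pipeline where only the child runs remotely.
load = Component("load", "df = read_data()")
train = Component("train", "model = fit(df)", parents=["load"], execution_env="gpu-box")
remote = [c.name for c in (load, train) if c.execution_env != "local"]
```

Each component carries its own execution target, so the same pipeline can mix local and remote cells.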
Figure 1: Design of per-cell remote execution
Per-cell remote execution is composed of a message queue, a data store, and remote workers, as shown in Figure 1. The local Link and remote Link workers communicate with each other through the message queue and the data store. The message queue manages running tasks, and the store holds data such as code and objects. Remote execution of each component operates in the following process.
- Serialize and transfer the selected cell’s code and its parent cells’ data to the remote worker via the message queue and data store.
- The remote worker receives the task from the message queue, deserializes the code and data from the data store, and executes the code.
- The execution results and output data are serialized and transferred back to the local environment via the message queue and data store.
- The local environment receives the results from the message queue and imports the output data from the data store.
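The four steps above can be sketched with in-process stand-ins: a `queue.Queue` for the message queue and a plain dict for the data store. A real deployment would use networked services; all names here are illustrative, not Link’s actual implementation.

```python
import pickle
import queue

message_queue = queue.Queue()   # stand-in for the message queue
data_store = {}                 # stand-in for the data store: key -> serialized bytes

def submit_component(task_id, code, parent_data):
    """Step 1: serialize the cell's code and parent data, then enqueue the task."""
    data_store[task_id + "/in"] = pickle.dumps({"code": code, "data": parent_data})
    message_queue.put(task_id)

def remote_worker_step():
    """Steps 2-3: take a task, deserialize, execute, and store the serialized result."""
    task_id = message_queue.get()
    payload = pickle.loads(data_store[task_id + "/in"])
    scope = dict(payload["data"])
    exec(payload["code"], scope)                      # run the component's code
    result = {k: v for k, v in scope.items() if not k.startswith("__")}
    data_store[task_id + "/out"] = pickle.dumps(result)
    return task_id

def fetch_result(task_id):
    """Step 4: import the output data back into the local environment."""
    return pickle.loads(data_store[task_id + "/out"])

submit_component("t1", "y = x * 2", {"x": 21})
done = remote_worker_step()
out = fetch_result(done)
```

The local side only touches the queue and the store, so the worker can live on any machine that can reach those two services.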
Figure 2: Add a remote worker
Figure 3: Select components to execute remotely
Link can connect to a remote worker using the message queue and data store access information, and users can register the worker under an easy-to-understand alias. After successfully connecting to the remote worker, users can select certain components to run on this worker, and the selected components will be executed remotely. This configuration persists even when the computer is turned off and on again, even after several days.
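A rough sketch of alias-based worker registration that survives restarts, assuming a simple JSON file as the persistence layer; the file path, parameter names, and URLs are purely illustrative:

```python
import json
import os
import tempfile

# Illustrative registry location; a real extension would use its own settings dir.
REGISTRY_PATH = os.path.join(tempfile.gettempdir(), "link_workers.json")

def load_workers():
    """Read the persisted alias -> access-info registry (empty if none yet)."""
    if not os.path.exists(REGISTRY_PATH):
        return {}
    with open(REGISTRY_PATH) as f:
        return json.load(f)

def register_worker(alias, queue_url, store_url):
    """Persist a worker's message-queue and data-store access info under an alias."""
    workers = load_workers()
    workers[alias] = {"queue": queue_url, "store": store_url}
    with open(REGISTRY_PATH, "w") as f:
        json.dump(workers, f)

# Start from a clean registry for this demo, then register one worker.
if os.path.exists(REGISTRY_PATH):
    os.remove(REGISTRY_PATH)
register_worker("gpu-box", "amqp://mq.example.com:5672", "s3://link-bucket/data")
```

Because the registry lives on disk rather than in kernel memory, the alias and its access information remain available across restarts.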
JupyterLab is an IDE loved by many developers, ranging from junior students to experts in the fields of data science and machine learning. Data science and machine learning code requires large amounts of computing, and executing this code in individual local environments may take a lot of time or fail due to a lack of memory. These issues can be overcome by installing JupyterLab on a high-compute workstation and utilizing that environment. However, using a remote JupyterLab may lead to shared resources not being returned correctly, causing problems for the different users who share them. To avoid these problems, we have implemented a remote execution feature that runs only parts of the code in the remote environment, as requested by the user. Link allows users to designate and run individual components (i.e. cells) in either local or remote environments. Link enhances efficiency even further by automatically returning the shared resources upon completion of the execution.