A software engineer who does something meaningful
I work at MakinaRocks, where we develop MLOps products called "Link" and "Runway".
I'm eager to enable machine learning to have a real-world impact.
Audience level: Intermediate
- Project Jupyter users, looking to build Jupyter Extensions
- Developers hoping to deep dive into Project Jupyter
In this session, we introduce two challenges we faced in Project Jupyter while developing our extension, MakinaRocks Link, and explain how we solved them and implemented our ideas:
- Handling how to display cell outputs
- Handling how to display error messages
We hope that sharing the lessons learnt from our deep dive into Project Jupyter will help other developers who are building similar Project Jupyter extensions.
Background & Introduction
MakinaRocks Link is a JupyterLab extension that allows you to create pipelines on Jupyter by converting cells into components and setting parent-child relationships between components. We have built this extension to allow users access to all features originally available in the Jupyter environment. They say a picture is worth a thousand words - click below to find out more about how to create pipelines and run these pipelines.
2 challenges we wanted to solve
- 1 Displaying only the outputs of a child cell, when multiple parent cells are executed internally
- When we run a certain component in a pipeline, we execute all dependent components as well. Then, what would be the ideal output when that child component is run?
- Before we solved the issue - The outputs of all parent components are also displayed
- 2 Displaying error messages like Jupyter
- We hoped to replicate the usability of JupyterLab as much as possible. So, when an error occurs, we wanted to reproduce the JupyterLab error message without exposing our own source code running in the background, which is unnecessary noise to the user.
- Before we solved the issue - When an error occurs, the extension source code is also displayed, making it difficult for users to find the error messages about their own source code.
#1. Displaying only the outputs of a child cell, when multiple parent cells are executed internally
- Our goal was to display only the output from the “executed component” and store the other outputs from other components internally
- Description of the original logic for displaying the output of a child cell with multiple parent cells on JupyterLab
- Development steps to display only the output from the “executed child component”
- Inspired by
- Learnt that stdout & stderr are related to the `OutStream` object, and that displays (images etc.) are related to the `DisplayPublisher` object.
- Our solution - intercepted the `OutStream` and `DisplayPublisher` objects and modified their output methods (such as `publish`) to implement our goal.
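The interception idea can be illustrated with a minimal sketch that uses only the standard library. It is not Link's implementation: `SelectiveStream`, `silenced`, and `run_component` are hypothetical names, and a plain wrapper around `sys.stdout` stands in for IPython's `OutStream` and `DisplayPublisher` objects. Parent code runs with forwarding switched off, so its output is stored internally, while the child's output passes through.

```python
import io
import sys
from contextlib import contextmanager


class SelectiveStream:
    """Wrap a text stream and either forward writes or store them."""

    def __init__(self, target):
        self.target = target
        self.forward = True
        self.stored = []  # output kept internally instead of being shown

    def write(self, text):
        if self.forward:
            return self.target.write(text)
        self.stored.append(text)
        return len(text)

    def flush(self):
        self.target.flush()


@contextmanager
def silenced(stream):
    """Temporarily store writes instead of forwarding them."""
    previous = stream.forward
    stream.forward = False
    try:
        yield
    finally:
        stream.forward = previous


def run_component(child_code, parent_codes, namespace, stream):
    """Run the parents silently, then the child with visible output."""
    original = sys.stdout
    sys.stdout = stream
    try:
        with silenced(stream):
            for code in parent_codes:
                exec(code, namespace)
        exec(child_code, namespace)
    finally:
        sys.stdout = original
```

For example, with `stream = SelectiveStream(io.StringIO())`, running a child `print(x + 1)` after a parent that prints and sets `x = 1` leaves only the child's output in the visible stream; the parent's output ends up in `stream.stored`.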
#2. Displaying error messages like Jupyter
- Our goal was to provide the identical Jupyter experience that users are familiar with. (We wanted to minimize the time lost to learning something new.) Hence, displaying error messages in the following way allows users to experience the same Jupyter error message environment.
- Description of the original logic of displaying error messages on JupyterLab
- Development steps for displaying the error message in the ‘right’ way
- The `showtraceback` method, which is used on the `ZMQInteractiveShell` object, displays error messages.
- Eureka! Found that the `showtraceback` method allows customizations.
- Our solution - wrote an algorithm that implements `_render_traceback_()` to display only the actual error message with zero noise.
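As a rough illustration of this customization point: IPython's `showtraceback` calls `_render_traceback_()` on an exception that defines it and expects a list of strings back. The sketch below is an assumption-laden example, not Link's algorithm. `ComponentError`, `run_user_code`, and the `<user-cell>` filename marker are hypothetical, and the filtering rule (keep only frames that come from the user's cell) is one possible way to drop background frames.

```python
import traceback

USER_FILENAME = "<user-cell>"  # hypothetical marker for code written by the user


class ComponentError(Exception):
    """Wrap an exception raised by user code and render it without extension frames."""

    def __init__(self, original):
        super().__init__(str(original))
        self.original = original

    def _render_traceback_(self):
        # Keep only the frames that belong to the user's cell.
        frames = [
            frame
            for frame in traceback.extract_tb(self.original.__traceback__)
            if frame.filename == USER_FILENAME
        ]
        lines = ["Traceback (most recent call last):\n"]
        lines += traceback.format_list(frames)
        lines += traceback.format_exception_only(type(self.original), self.original)
        return [line.rstrip("\n") for line in lines]


def run_user_code(source):
    """Stand-in for the extension's background runner."""
    try:
        exec(compile(source, USER_FILENAME, "exec"), {})
    except Exception as exc:
        raise ComponentError(exc) from None
```

Outside IPython, the method can be called directly on the caught `ComponentError` to inspect the lines that would be shown.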
We have shared our case of deep diving into Project Jupyter, and how we solved the challenges of showing the ‘right’ cell outputs and the ‘right’ error messages. We hope that Project Jupyter users can enhance their understanding of Jupyter and be inspired to solve more challenges with our case story.
Audience level: Novice
- Everyone who knows Project Jupyter
JupyterLab is an IDE that is loved by many in the fields of data science and machine learning. Jupyter provides an outstanding interactive feature: REPL-based execution and review of code at the cell level, which facilitates data exploration and machine learning experiments. It is used by many people, from students to experts, in their work.
Data science and machine learning code generally requires large amounts of computing. Running this code on personal laptops or in local environments may take an excessive amount of time or fail because of a memory shortage. These issues can be resolved by installing JupyterLab on a workstation with high computing power and accessing it via port forwarding, or by deploying it on a Kubernetes cluster. However, using JupyterHub or JupyterLab on a remote workstation can cause problems with shared resources. If the IPython kernel connected to the Jupyter notebook is not terminated, resources such as memory and the GPU are not returned. This means that other users of the workstation cannot use those resources when they need them.
We thought of new ways to execute code remotely on JupyterLab while avoiding these issues. We implemented a remote execution feature that runs code on remote environments at the user’s request. Link allows each pipeline component (i.e. each Jupyter cell) to run either locally or on a designated remote environment. Moreover, the resources used for the execution are returned automatically, leading to more efficient management of shared resources. In the next section of this note, we explain the design of Link’s remote execution feature.
Remote execution on Link
A Link pipeline consists of one or more components, and each component corresponds to one Jupyter cell. Each component has properties, which include information about the local or remote environment. Depending on this execution information, components can be executed in independent environments. Link executes code in a specific environment according to the user’s request and returns the resources it used. As a result, users can efficiently use and manage the shared resources of a workstation.
Figure 1: Design of per-cell remote execution
Per-cell remote execution is built from a message queue, a data storage, and a remote worker, as shown in Figure 1. The local Link and the remote Link workers communicate with each other through the message queue and the data storage. The message queue manages running tasks, and the data storage holds data such as code and objects. Remote execution of each component proceeds as follows.
- The local environment serializes the selected cell’s code and the parent cells’ data and transfers them to the remote worker via the message queue and the data storage.
- The remote worker receives the task from the message queue, deserializes the code and data from the data storage, and executes the code.
- The execution results and the output data are serialized and transferred to the local environment via the message queue and the data storage.
- The local environment receives the results from the message queue and imports the output data from the data storage.
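The four steps above can be sketched in a single process. This is only an illustration under stated assumptions: `queue.Queue` stands in for the message queue, a dictionary for the data storage, `pickle` for serialization, and a thread for the remote worker. All names (`put_object`, `get_object`, `remote_worker`, `run_remotely`) are hypothetical and do not describe Link's actual components.

```python
import pickle
import queue
import threading
import uuid

task_queue = queue.Queue()    # stand-in for the message queue (tasks)
result_queue = queue.Queue()  # stand-in for the message queue (results)
data_store = {}               # stand-in for the data storage


def put_object(obj):
    """Serialize an object into the data storage and return its key."""
    key = uuid.uuid4().hex
    data_store[key] = pickle.dumps(obj)
    return key


def get_object(key):
    """Deserialize an object from the data storage."""
    return pickle.loads(data_store[key])


def remote_worker():
    """Step 2 and 3: receive a task, run the code, publish the outputs."""
    task = task_queue.get()
    payload = get_object(task["payload_key"])
    namespace = dict(payload["inputs"])
    exec(payload["code"], namespace)
    outputs = {name: namespace[name] for name in payload["output_names"]}
    result_queue.put({"task_id": task["task_id"], "result_key": put_object(outputs)})


def run_remotely(code, inputs, output_names):
    """Step 1 and 4: send code and parent data, then import the outputs."""
    key = put_object({"code": code, "inputs": inputs, "output_names": output_names})
    worker = threading.Thread(target=remote_worker)
    worker.start()
    task_queue.put({"task_id": uuid.uuid4().hex, "payload_key": key})
    message = result_queue.get(timeout=5)
    worker.join()
    return get_object(message["result_key"])
```

Because the worker ends as soon as the task is done, whatever it held is released without the user having to shut anything down, which mirrors the automatic return of resources described earlier.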
Figure 2: Add a remote worker
Figure 3: Select components to execute remotely
Link can connect to a remote worker using the access information for the message queue and the data storage. Users can register the worker under an easy-to-understand alias. After successfully connecting to the remote worker, users can select certain components to run on this worker, and the selected components will be executed remotely. This information persists when the computer is turned off and on again, even after several days.
JupyterLab is an IDE loved by many developers, ranging from students to experts, in the fields of data science and machine learning. Data science and machine learning code requires large amounts of computing, and running it in individual local environments may take a lot of time or fail because of a lack of memory. These issues can be overcome by installing JupyterLab on a workstation with high computing power and using that environment. However, using a remote JupyterLab may lead to shared resources not being returned correctly, which causes problems when different users share them. To avoid these problems, we have implemented a remote execution feature that runs only parts of the code on the remote environment, as requested by the user. Link allows users to designate individual components (i.e. cells) to run on either local or remote environments. Link improves efficiency further by automatically returning the shared resources once the code has been executed.
- 'Intermediate' level of programming
- Jupyter users
- Data scientists
Development is done not only by individuals but also within enterprises and organizations, where efficient collaboration is of great importance. Most developers share and manage source code with GitHub. Most files are managed efficiently on GitHub, but Jupyter files (ipynb) are stored as JSON text, making it difficult to identify diffs and resolve conflicts between versions. ipynb files consist of cells containing code, so a merge conflict requires manual amendments to the file text. While Jupyter supports the Jupyter-git extension, which allows either deleting conflicting files or selecting one file over the other, it does not directly resolve the conflicts within conflicting files or let users view the diffs easily. There are also many cases where users collaborate by sharing ipynb files. However, when opening other users’ files, it is often difficult to understand the flow of their code and the order in which the cells should be executed. Using comments or markdown syntax can alleviate these problems, but describing detailed changes in text is not the most efficient method.
Link is a JupyterLab extension that allows users to create pipelines on Jupyter by connecting different cells. The user can connect cells in their desired order into a DAG structure to run the code. Link also includes a Link-git extension, which provides git features on Jupyter for ipynb files with pipelines.
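To make the DAG idea concrete, here is a minimal sketch of how an execution order could be derived for one component and its ancestors, using the standard library's `graphlib` (Python 3.9+). The pipeline and the `execution_order` helper are hypothetical examples, not Link's internal representation.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each component maps to the set of its parents.
parents = {
    "load": set(),
    "clean": {"load"},
    "features": {"clean"},
    "train": {"features", "clean"},
    "report": {"load"},
}


def execution_order(target, parents):
    """Return the target and all of its ancestors, with parents first."""
    needed, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(parents[node])
    # Sort only the part of the graph that the target depends on.
    subgraph = {node: parents[node] for node in needed}
    return list(TopologicalSorter(subgraph).static_order())
```

Calling `execution_order("train", parents)` yields the four components that `train` depends on (including itself) in a valid order, and leaves out `report`, which is not an ancestor.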
Git diff check: Users can visualize the commit history of an ipynb file. The feature shows all changes made at the code level for each commit, and also how the structure of the pipelines changed. When collaborating, users can review the history of previous work before moving on to the next stage, and decide which commit they wish to start from.
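A cell-level diff of this kind can be approximated with `difflib`. The sketch below assumes each notebook version has already been reduced to a dictionary that maps a cell id to its source; that representation and the `diff_cells` helper are assumptions for illustration, not Link-git's format.

```python
import difflib


def diff_cells(old_cells, new_cells):
    """Return a unified diff per cell id for cells that were added, removed, or changed."""
    diffs = {}
    for cell_id in sorted(set(old_cells) | set(new_cells)):
        old = old_cells.get(cell_id, "")
        new = new_cells.get(cell_id, "")
        if old != new:
            diffs[cell_id] = list(
                difflib.unified_diff(
                    old.splitlines(),
                    new.splitlines(),
                    fromfile=f"{cell_id} (old)",
                    tofile=f"{cell_id} (new)",
                    lineterm="",
                )
            )
    return diffs
```

Unchanged cells are omitted from the result, so a reviewer only sees the cells a commit actually touched.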
Merge conflict management: Link-git contains a merge driver, which resolves merge conflicts at the cell level, for both the code and the pipeline structure, when a conflict occurs between users working in their respective local environments. Using this feature, a team of developers can first agree on an overall code framework in the form of a pipeline, write code at the cell level in their own local environments, and then merge their pipelines more easily.
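One common way to merge at the cell level is a three-way merge keyed by cell id. The following sketch shows that general technique under the assumption that each version is a dictionary of cell id to source. `merge_cells` is a hypothetical helper, it ignores cell ordering and pipeline edges, and it is not a description of how the Link-git merge driver works.

```python
def merge_cells(base, ours, theirs):
    """Three-way merge per cell; returns (merged cells, conflicting cell ids)."""
    merged, conflicts = {}, []
    for cell_id in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(cell_id), ours.get(cell_id), theirs.get(cell_id)
        if o == t:
            result = o          # both sides agree
        elif o == b:
            result = t          # only their side changed the cell
        elif t == b:
            result = o          # only our side changed the cell
        else:
            conflicts.append(cell_id)  # both sides changed it differently
            continue
        if result is not None:  # None means the cell was deleted
            merged[cell_id] = result
    return merged, conflicts
```

Only cells that both sides edited differently are reported as conflicts; edits to different cells merge cleanly.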
Sharing pipelines and cache
Link provides features that make it easier to share code as files. Conventionally, Jupyter users often share code as ipynb files in order to modify or re-use existing code. However, in Jupyter, code cells are listed linearly and are often not well organized, which makes it difficult to reproduce results or to make changes and additions. Link provides the following file export and import features to resolve these complications.
Pipeline export, import: Link users can export entire pipelines or component cells as JSON files. When another Link user imports these JSON files, they can re-open the pipeline or component cell on Jupyter. As the file includes the DAG structure, users can re-open the ordered pipeline together with the code cells, allowing them to understand the flow of the code more efficiently without any additional textual explanation. Users can also share only the code cells that are required for execution.
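As an illustration of what such a round trip might look like, the sketch below writes component code and DAG edges to JSON and reads them back. The schema (`components`, `edges`) and both function names are invented for this example; Link's actual file format is not described here.

```python
import json


def export_pipeline(components, edges):
    """Serialize component code and parent-child edges to a JSON string."""
    return json.dumps({"components": components, "edges": edges}, indent=2)


def import_pipeline(text):
    """Load a pipeline and check that every edge points at a known component."""
    document = json.loads(text)
    components = document["components"]
    edges = [tuple(edge) for edge in document["edges"]]
    unknown = {name for edge in edges for name in edge} - set(components)
    if unknown:
        raise ValueError(f"edges reference unknown components: {sorted(unknown)}")
    return components, edges
```

Keeping the edges next to the code is what lets the importing user recover the execution order without extra explanation.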
Cache export, import: Users can store the cache of each component after executing a whole pipeline, and export the cache as an archive file (.tar.gz). When another user imports this file, they can use the pipeline results without having to re-run the pipeline. This feature saves time when users need to repeat certain jobs on their own pipelines, or when different users need to reproduce the same results.
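A cache archive of this kind could be produced with the standard `tarfile` and `pickle` modules, as in the sketch below. The layout (one pickled `.pkl` entry per component) and the function names are assumptions made for the example, not Link's cache format.

```python
import io
import pickle
import tarfile


def export_cache(cache, path):
    """Write one pickled entry per component into a .tar.gz archive."""
    with tarfile.open(path, "w:gz") as archive:
        for name, value in cache.items():
            data = pickle.dumps(value)
            info = tarfile.TarInfo(name=f"{name}.pkl")
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))


def import_cache(path):
    """Read the archive back into a {component name: cached value} dictionary."""
    cache = {}
    with tarfile.open(path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(".pkl"):
                # Unpickling runs code, so only import archives from trusted sources.
                name = member.name[: -len(".pkl")]
                cache[name] = pickle.load(archive.extractfile(member))
    return cache
```

Since `pickle` can execute arbitrary code on load, a shared cache file should be treated with the same care as shared source code.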
In short, Link provides a useful framework for easier collaboration with its Link-git feature and the pipeline/cache export and import features.