
Wednesday Oct. 14, 2020, 4 p.m.–Oct. 14, 2020, 4:30 p.m. in Jupyter Community: Practices

Structuring Notebooks Around Their Outputs

David Koop

Brief Summary

Because notebooks provide the opportunity to examine and evaluate intermediate outputs, it is important to both provide intuitive interactions with those outputs and enhance methods to use and recall them. ipycollections and dfnotebook are JupyterLab extensions that respectively improve Jupyter's display of rich output collections and help users recall and reproduce past outputs.



Notebooks have shifted the coding paradigm: instead of a single long program that transforms input through many steps into a final output, code is divided into smaller cells, each with its own output. A user examines those intermediate outputs to help decide what the next step is, potentially drilling down into specific features to gain understanding. If an output looks correct, the user plans the next steps; if it looks suspect, the user may edit one or more cells and re-execute them. These tasks may be performed immediately or well in the future, perhaps by a different user, so supporting them spans many concerns. In any setting, it is important to provide (a) intuitive interaction with output and (b) improved methods to use and recall output. ipycollections is a JupyterLab extension that seeks to improve Jupyter's display of rich output collections. The dataflow notebook (dfnotebook) is a JupyterLab extension that provides enhanced methods for referring to, recalling, and reproducing output. Together, they enhance the JupyterLab ecosystem by improving the way output is handled.



Because this code is being developed at Argonne National Laboratory, it is currently undergoing open-source review. It will be made available as open-source code when that review is completed, and that is anticipated to be complete before the conference. Previous versions of the dataflow notebook code are available on GitHub: https://github.com/dataflownb/dfkernel

Displaying Output

There are many good solutions for displaying outputs in notebooks, ranging from pretty-printed text to static figures to interactive, JavaScript-enabled graphics. The mimebundle support in IPython, coupled with Jupyter's extensible renderers, provides an easy way to register and support better output displays. However, these outputs don't always play nicely with Python's data structures. If you've ever embedded an IPython Image inside of a list, or output a dictionary of pandas dataframes, you know this pain. Instead of nicely formatted tables or graphical displays, you might see a class name or a plaintext description. Similarly, when browsing a deeply nested collection of dictionaries, it can be difficult to determine the actual structure without writing more code to pull things apart.

ipycollections and ipycollections-renderer seek to address this problem by extending the standard output of Python collections to take advantage of the excellent display/renderer support that exists in Jupyter and IPython. Specifically, the IPython extension follows the same pretty-printing recursion strategy used in the core code, but instead of generating nicely formatted text, it formats the information in JavaScript Object Notation (JSON) with a couple of extra fields providing type information. On the Jupyter side, a custom renderer produces interactive widgets to navigate the collections while delegating the display of rich objects to their existing custom renderers.
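The kernel-side idea can be sketched as a recursive walk that tags each node with its Python type and hands rich objects off to their own MIME bundles. This is an illustrative sketch, not the actual ipycollections API; the function name and JSON field names are assumptions.

```python
import json

def to_tagged_json(obj):
    """Recursively convert nested collections into a JSON-friendly
    structure, tagging each node with its Python type.

    Sketch only: field names ("type", "items", "value") are
    hypothetical, not the real ipycollections format.
    """
    # Rich objects that already know how to display themselves are
    # delegated to their own MIME bundles (hedge: real code also
    # handles _repr_html_ and the include/exclude arguments).
    if hasattr(obj, "_repr_mimebundle_"):
        return {"type": "mimebundle", "data": obj._repr_mimebundle_()}
    if isinstance(obj, dict):
        return {"type": "dict",
                "items": [[to_tagged_json(k), to_tagged_json(v)]
                          for k, v in obj.items()]}
    if isinstance(obj, (list, tuple, set)):
        return {"type": type(obj).__name__,
                "items": [to_tagged_json(v) for v in obj]}
    # Scalars pass through as plain JSON values.
    return {"type": type(obj).__name__, "value": obj}

payload = to_tagged_json({"scores": [1, 2, 3]})
print(json.dumps(payload))
```

The frontend renderer then only needs to understand this one tagged-JSON MIME type; anything wrapped as a mimebundle is handed back to Jupyter's existing renderer registry.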

The extension also provides opportunities to enable more navigation strategies than (frustrating) scrolling. Specifically, items in a nested collection can be expanded or collapsed. Similarly, the number of items may be limited, with the ability to show more if a user desires. Future work may couple this with messaging strategies that allow the kernel to provide on-demand access to portions of the output, reducing notebook size.
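The truncation idea can be sketched as a small kernel-side payload: ship only the first few items plus a count of what was held back, so the frontend can offer a "show more" control. The function and field names here are hypothetical, not part of ipycollections.

```python
def preview(items, limit=3):
    """Return a truncated display payload: at most `limit` items plus
    a count of how many were held back.

    Hypothetical sketch: a real on-demand protocol would also carry a
    handle so the frontend can request the remaining items later.
    """
    items = list(items)
    return {"items": items[:limit],
            "remaining": max(0, len(items) - limit)}

p = preview(range(10), limit=3)
print(p)  # {'items': [0, 1, 2], 'remaining': 7}
```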

Recalling Outputs

In IPython notebooks, it can be confusing to know which "version" of a variable you are referencing. For example, if a variable is assigned a value in three different cells, we need to know which cell was last run to determine its value. An oft-cited rule is to make sure the notebook is linear, running from top to bottom, but that isn't necessarily the way people work when developing code, especially during exploration. Furthermore, when an output is reused, any reordering of the cells might lead to side effects that cause differences in outputs. When a variable has been overwritten and we want to go back and modify a step, we may be required to run a whole sequence of cells (or all of them) to get back to the way things were when that variable was originally defined.

Dataflow notebooks seek to provide more structure to notebooks by emphasizing links between cells through outputs. In this way, the dependencies between cells are explicit, providing enhanced reproducibility. We can recursively execute cells that have changed to ensure that the whole chain is up-to-date. In a dataflow notebook, cell outputs are named so that scrolling through the notebook allows a user to browse the available references. However, a major tradeoff in globally referring to an output by name is that the name must be unique. If you create an output foo and then create a second output that you want to be the "new" foo, you need a new name. The following code would not be valid in past versions of the dataflow notebook:

[1]: foo = 12

[2]: foo = foo + 30

This is a large shift in the way many people are accustomed to writing code. For example, pandas dataframes are often referred to as df, and even as the dataframe is transformed, it is useful to always know that the data you wish to refer to is in the variable df. Yes, you could create df1, df2, df3, ... or df_orig, df_with_col_X, df_pivot, but recalling which variable is most recent is problematic. Some may argue that this recall problem is bigger than the benefit of being able to trace the code back.

Instead, consider a solution where we associate a new tag with each "version" of a variable. Most of the time, we want to refer to the latest instance of a variable while retaining the explicit cell dependencies in the dataflow notebook. If a cell needs updating, it is likely not with respect to the names of the variables. What if we wait to bake dependencies into the code until the code is executed? Thus, a user would write

[1]: df = pd.read_csv('test.csv')

[2]: df = df.drop(columns=['a','b'])

but this becomes

[1]: df = pd.read_csv('test.csv')

[2]: df = df$1.drop(columns=['a','b'])

where the suffix $1 indicates that the version of df being referenced is the one defined in cell 1.

We can enhance this process using cell tags. Instead of associating cells with meaningless integer or hexadecimal identifiers, we can let users tag the cells with meaningful names.

[orig]: df = pd.read_csv('test.csv')

   [2]: df = df$orig.drop(columns=['a','b'])

These names can be used when coding but are translated to the permanent identifiers when the cell is executed, so that the code will be reproducible even if tags change in the future. Further enhancements may include modes to switch the code in a cell from the reproducible version to the perhaps more meaningful tagged version, or even back to the unscoped version.
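The tag-to-permanent-identifier translation described above can be illustrated with a simplified rewrite. This is a sketch under assumptions: the real dfnotebook locates references via kernel machinery rather than a bare regex, and the mapping and function names here are invented for illustration.

```python
import re

# Matches scoped references like df$orig (name, then "$", then a tag).
REF = re.compile(r"(?P<name>[A-Za-z_]\w*)\$(?P<tag>\w+)")

def freeze_tags(code, tag_to_id):
    """Rewrite tagged references like df$orig into permanent
    identifiers like df$1, so the saved code stays reproducible even
    if the tag is later renamed or removed.

    Hypothetical helper; not the actual dfnotebook implementation.
    """
    def sub(m):
        tag = m.group("tag")
        # Tags not in the map (e.g. already-permanent ids) pass through.
        return "{}${}".format(m.group("name"), tag_to_id.get(tag, tag))
    return REF.sub(sub, code)

print(freeze_tags("df = df$orig.drop(columns=['a','b'])", {"orig": "1"}))
# df = df$1.drop(columns=['a','b'])
```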


Because these extensions are somewhat non-standard, this work produced some interesting lessons about the extension process. Both the ipycollections renderer and the dataflow notebook have been built as JupyterLab extensions, though both also have libraries that live kernel-side. Throughout the process, the extensible architecture of JupyterLab along with good documentation has made these implementations possible. Along the way, there was a lot of trial-and-error, and there are some interesting takeaways that may be useful for other developers.

The ipycollections module is an IPython extension because it hooks into the formatters that IPython provides. It follows a similar style as pretty-printing, recursively building JSON from nested objects. When nested objects already have a MIME bundle or HTML serialization, that serialization is used. This JSON is tagged with a particular MIME type, and a formatter for that type is registered. If the extension is disabled, the formatter is removed.

On the Jupyter side, we basically have a rendermime plugin, but in order to get access to other renderers as delegates for our renderer, we need to augment the existing plugin to pull in the rendermime registry. The core rendering code takes the JSON structure and renders it with expandable fields, delegating rendering of mimebundles to their respective renderers. Lists and dictionaries show indices and keys, and while truncated by default, can be expanded by users.

The dataflow notebook extension is built on the existing dfkernel kernel with a few updates to mesh with IPython 7.x. There are a few tweaks in how information is delivered from the kernel, but the core changes concern how variables are referenced. Specifically, we must parse identifiers of the form id$1, where id is the variable name and 1 is the cell identifier. Because the symbol "$" is illegal in Python, we use the SyntaxError to locate such identifiers and expand them for the kernel to access the stored outputs (e.g. Out[1]['id']). If an identifier is not scoped to a cell, we must decide which cell it refers to. Because we use a most-recently-used strategy, we can ask the kernel which cell last generated an output with the specific identifier and add that information to the code. It is important that this information is returned to the user and the notebook so it can be saved.
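The expansion step can be illustrated with a simplified rewrite. Note the hedge: the real kernel locates these references through the SyntaxError machinery described above, not a regex, and the exact Out-history access shape is taken from the example in the text.

```python
import re

# Matches scoped references like df$1 (name, then "$", then a cell id).
SCOPED = re.compile(r"(?P<name>[A-Za-z_]\w*)\$(?P<cell>\w+)")

def expand_refs(code):
    """Rewrite scoped references like df$1 into history lookups the
    kernel can execute, e.g. Out[1]['df'].

    Illustrative sketch only; the real dfkernel handles this during
    compilation rather than by string substitution.
    """
    return SCOPED.sub(
        lambda m: "Out[{}][{!r}]".format(m.group("cell"), m.group("name")),
        code)

print(expand_refs("result = df$1.describe()"))
# result = Out[1]['df'].describe()
```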

On the Jupyter side, we replace the existing notebook-extension with a similar dfnotebook-extension, and replace the cells, notebook, and outputarea packages with df variants. We tried to make notebook-extension and dfnotebook-extension coexist, but after a significant number of headaches, it made more sense to disable the former and enable the latter. A major headache, perhaps due to the author's TypeScript ability, was an inability to get Jupyter to resolve tokens to the updated classes; this required some hacks to make the compiler ignore mismatches in the token interfaces. Because a single cell execution may trigger other cells to execute, the code involves redirecting messages to the corresponding cells.


The vision here is to better facilitate decisions during tasks like data analysis by improving the quality and recall of output while also improving future reproducibility. The enhanced displays of ipycollections improve the users' ability to understand outputs, and dataflow notebooks help improve access to previous outputs when writing code while maintaining reproducibility.