Monday Oct. 12, 2020, 4:45 p.m.–Oct. 12, 2020, 5 p.m. in Jupyter Community: Tools

Streamline your Data Science projects with Ploomber

Eduardo Blancas

Audience level:

Brief Summary

This talk showcases Ploomber, a tool that allows Data Science teams to adopt better software development practices without requiring members to be trained on any tool-specific details. By adopting the convention over configuration philosophy, Ploomber streamlines pipeline execution, allowing teams to confidently push changes to a remote repository.


Problem statement (3 minutes)

Developing reproducible data pipelines is of paramount importance to scientific research and industry applications, but this is easier said than done. Without proper development processes in place, pipelines do not evolve well.

Solo developers sometimes code their projects as gigantic monoliths (e.g. a Jupyter notebook with 1000+ cells) because it's convenient. To "test" such code, you only have to click on "run all cells". But the longer the file, the more difficult it is to maintain.

For teams, the situation gets worse. A pipeline is likely to evolve into a set of disparate scripts where it isn't clear how to do an end-to-end run. This causes team friction, as members have to go around asking others how to execute their code.

Current approaches (2 minutes)

Current workflow management tools address this by providing a framework to stitch parts together (e.g. GNU Make). For individual projects where the author is proficient in any of these tools, this alleviates the problem.

But given the myriad of options, it is unlikely that all members of a given team will be proficient in the same tool. This leads to a tough choice: choose one (and train people) or let the mess take over.

Introducing Ploomber (5 minutes)

Ploomber adheres to a convention-over-configuration philosophy: it acts as an invisible orchestrator that brings order without the team even knowing the tool is there.

There are three simple conventions:

  1. Each task in the pipeline is a script
  2. At the top, declare an "upstream" variable, a list of dependencies (other scripts)
  3. Declare a "product" variable with the desired output location
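A task script following the three conventions above might look like this minimal sketch (file names and the dict layout of "product" are illustrative, not Ploomber's exact API):

```python
# raw.py -- a hypothetical root task following the conventions above
# (file names and the shape of "product" are illustrative)
from pathlib import Path

upstream = None  # a root task: no dependencies on other scripts
product = {'data': 'output/raw.csv'}  # where this task writes its output

# Task body: produce the raw data file
Path('output').mkdir(exist_ok=True)
Path(product['data']).write_text('x,y\n1,2\n3,4\n')

# A downstream script would instead declare, e.g.:
#     upstream = ['raw']
#     product = {'data': 'output/clean.csv'}
```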

How it works (10 minutes)

Ploomber collects all scripts and runs static analysis to extract the "upstream" and "product" variables. It then assembles a directed acyclic graph (DAG), representing scripts as nodes and "upstream" dependencies as edges, and applies topological sorting to determine execution order (dependencies always run first).
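The DAG-plus-topological-sort step can be sketched with the standard library's graphlib (task names are illustrative and Ploomber's internals may differ):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key maps to the tasks it depends on,
# mirroring the "upstream" variable extracted from each script
upstream = {
    'raw': [],
    'clean': ['raw'],
    'features': ['clean'],
    'train': ['features'],
}

# static_order() yields a valid execution order: every task appears
# only after all of its dependencies
order = list(TopologicalSorter(upstream).static_order())
print(order)  # ['raw', 'clean', 'features', 'train']
```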

A code-preparation stage follows. Ploomber converts each script to a Jupyter notebook (via jupytext); this provides the team with rich logs (with tables and charts) for each execution. Then, a new "upstream" cell is injected to pass the location of input files. Unlike the original "upstream" cell, which only contains task names, this one maps names to their products. Finally, the pipeline is executed. The following diagram shows this workflow for a simple pipeline:
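Conceptually, the injected cell replaces the author's declaration with a mapping from task names to their products (paths and dict layout are illustrative):

```python
# What the author wrote in the original "upstream" cell:
#     upstream = ['raw', 'clean']

# What Ploomber injects at runtime in its place: each task name
# mapped to that task's products (paths are illustrative)
upstream = {
    'raw': {'data': 'output/raw.csv'},
    'clean': {'data': 'output/clean.csv'},
}

# The task body can now locate its inputs by name:
input_path = upstream['clean']['data']
```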

Given that Ploomber relies on Jupyter (via papermill) to execute scripts, it is easy to support any programming language that has a Jupyter kernel available. Currently, our proof-of-concept only supports Python.
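The execution stage can be sketched as a loop over the topologically sorted tasks. In Ploomber the real entry point is papermill's execute_notebook; it is stubbed here so the sketch runs without a Jupyter kernel, and all names are illustrative:

```python
# Stub standing in for papermill.execute_notebook(input, output):
# it only records what would run, so the sketch is self-contained
def execute_notebook(source, output):
    print(f'executing {source} -> {output}')
    return output

# Tasks in a topologically sorted order (names are illustrative);
# each converted notebook is executed once its dependencies are done
order = ['raw', 'clean', 'train']
logs = [execute_notebook(f'{name}.ipynb', f'output/{name}.ipynb')
        for name in order]
```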

Integration with Jupyter (5 minutes)

Ploomber does not sacrifice interactivity. Users can open their scripts as regular notebooks and develop them interactively. Through a Jupyter server extension, Ploomber provides the appropriate execution context: If the currently open script belongs to a pipeline, Ploomber performs the aforementioned cell injection. This way, the notebook reflects the exact code that the pipeline executes.

Continuous Integration for Data Science (5 minutes)

Adding Continuous Integration on top of this is simple. Once the pipeline is structured this way, Ploomber can orchestrate an end-to-end execution with a single command. This streamlined process has dramatically increased our team's productivity and allows us to quickly iterate with confidence. CI gives us automatic feedback if things break.


Summary diagram:


External resources:

Note: some important features such as task parallelization, incremental runs, and SQL support were omitted for brevity.