
Thursday Oct. 15, 2020, 5 p.m.–Oct. 15, 2020, 5:15 p.m. in Enterprise Jupyter Infrastructure

Building distributed pipelines with Jupyter Notebooks

Weijun Qian, Karthik Banala

Audience level:
Intermediate

Brief Summary

This talk covers how multiple notebooks can be orchestrated as a pipeline and executed in a distributed way on Kubernetes. With the Airflow integration, a notebook is no longer a standalone piece of the development process, but an integral component that serves a specific purpose in a complex pipeline. Furthermore, notebook kernels are built as Docker images with all dependencies included.
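To make the idea of a notebook as a pipeline component concrete, here is a minimal sketch of executing a parameterized notebook with Papermill. The notebook paths and parameter names are hypothetical placeholders, not the actual PayPal notebooks.

```python
# Minimal sketch: running a parameterized notebook with Papermill.
# Paths and parameters below are illustrative placeholders.
import papermill as pm

pm.execute_notebook(
    "prepare_features.ipynb",           # input notebook (the reusable component)
    "runs/prepare_features_out.ipynb",  # executed copy, kept as a run artifact
    parameters={"run_date": "2020-10-15", "sample_frac": 0.1},
)
```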

Outline

At PayPal, notebooks are used by data analysts, data scientists, data engineers, and machine learning engineers. Thousands of notebooks and tens of thousands of jobs run through notebooks at PayPal each day. The traditional way of scheduling notebooks has many complexities, including:

  1. Scheduling is time-consuming and the setup is hard to maintain

  2. Expressing dependencies between notebooks is even more complex

  3. It is difficult to use different environments for different notebooks

We needed a simple and robust solution to address these challenges. Weijun and Karthik will explain how they were solved by running the Airflow scheduler on Kubernetes with notebook integration.

Topics include:

  1. Airflow on Kubernetes – for scheduling notebooks

  2. How the scheduler API can be used to orchestrate notebook jobs as pipelines (see the sketch after this list)

  3. How different compute resources (e.g., CPU, GPU) can be coordinated in a pipeline to run a complex ML job

  4. How a Docker image can be used as a notebook kernel
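The sketch below illustrates the general shape of such a pipeline: an Airflow DAG in which each task runs a notebook via Papermill inside its own Docker image on Kubernetes, with the dependency between notebooks expressed as a DAG edge. It is a hypothetical example, not PayPal's actual setup; the DAG id, image names, namespaces, and notebook paths are placeholders.

```python
# Hypothetical Airflow DAG that chains two notebooks into a pipeline.
# Each task runs papermill inside a purpose-built Docker image on Kubernetes.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="notebook_pipeline",        # placeholder pipeline name
    start_date=datetime(2020, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Step 1: feature-preparation notebook, executed with papermill inside a
    # CPU-only image that already bundles its Python dependencies.
    prepare = KubernetesPodOperator(
        task_id="prepare_features",
        name="prepare-features",
        namespace="notebooks",
        image="registry.example.com/kernels/cpu-pandas:latest",
        cmds=["papermill"],
        arguments=["/notebooks/prepare.ipynb", "/output/prepare_out.ipynb"],
        get_logs=True,
    )

    # Step 2: model-training notebook, executed in a GPU-enabled image.
    # A GPU would be requested through the pod's resource limits
    # (e.g. nvidia.com/gpu) in the operator's resource settings.
    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="notebooks",
        image="registry.example.com/kernels/gpu-tensorflow:latest",
        cmds=["papermill"],
        arguments=["/notebooks/train.ipynb", "/output/train_out.ipynb"],
        get_logs=True,
    )

    # The dependency between the two notebooks becomes an explicit DAG edge.
    prepare >> train
```

Because each task is its own pod, the CPU-bound and GPU-bound steps can land on different node pools, and each notebook gets exactly the environment baked into its image.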

Background knowledge for attendees:

  1. Basic understanding of notebooks and Python

  2. Basic understanding of Papermill, Docker, and Kubernetes (useful but not required)

  3. Basic understanding of any scheduler (useful but not required)