This talk covers how multiple notebooks can be orchestrated as a pipeline and executed in a distributed way on Kubernetes. With the Airflow integration, a notebook is no longer a standalone piece of the development process but an integral component serving a specific purpose in a complex pipeline. Furthermore, notebook kernels are built as Docker images with all dependencies included.
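As a taste of what this looks like, here is a minimal sketch (not PayPal's actual code) of two notebooks chained as an Airflow DAG, each executed by Papermill inside its own Kubernetes pod whose image carries that notebook's dependencies. The image names and notebook paths are illustrative, and the operator import path varies by cncf.kubernetes provider version.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="notebook_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task runs "papermill <input> <output>" in its own pod; the image
    # acts as the notebook's kernel, with all dependencies baked in.
    prepare = KubernetesPodOperator(
        task_id="prepare_data",
        name="prepare-data",
        image="registry.example.com/kernels/pandas:latest",  # hypothetical kernel image
        cmds=["papermill"],
        arguments=["/notebooks/prepare.ipynb", "/output/prepare-out.ipynb"],
    )
    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        image="registry.example.com/kernels/ml:latest",  # hypothetical kernel image
        cmds=["papermill"],
        arguments=["/notebooks/train.ipynb", "/output/train-out.ipynb"],
    )
    prepare >> train  # dependency between notebooks, expressed in the DAG
```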
At PayPal, notebooks are used by data analysts, data scientists, data engineers, and machine learning engineers. There are thousands of notebooks, and tens of thousands of jobs run through notebooks each day. The traditional way of scheduling notebooks has many complexities, some of which are:
Time consuming and hard to maintain
Managing dependencies between notebooks is even more complex
Difficult to use different environments for different notebooks
We needed a simplified and robust solution to address these challenges. Weijun and Praveen will explain how they were addressed using the Airflow scheduler on Kubernetes with notebook integration, covering:
Airflow on Kubernetes for scheduling notebooks
How the scheduler API can be used to orchestrate notebook jobs as pipelines (first sketch below)
How different compute resources (e.g., CPU, GPU) can be coordinated in a pipeline to run a complex ML job (second sketch below)
How a Docker image can be used as a notebook kernel
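For the scheduler API point, a hedged sketch of triggering such a pipeline through Airflow 2's stable REST API. The base URL, credentials, and conf keys below are placeholders for deployment-specific values, and the example assumes an auth backend that accepts basic auth.

```python
import requests

AIRFLOW_BASE = "https://airflow.example.com"  # hypothetical endpoint

# POST /api/v1/dags/{dag_id}/dagRuns starts a new run of the notebook pipeline.
resp = requests.post(
    f"{AIRFLOW_BASE}/api/v1/dags/notebook_pipeline/dagRuns",
    auth=("user", "password"),  # placeholder credentials
    json={
        # Parameters the DAG can forward to papermill-parameterized notebooks.
        "conf": {"run_date": "2021-06-01", "model_version": "v2"},
    },
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```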
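And for resource coordination, a sketch of mixing a CPU-bound and a GPU-bound notebook task in one DAG by attaching Kubernetes resource requirements to each pod. The container_resources parameter follows recent cncf.kubernetes provider versions (older versions expose resources differently), and the images are again hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)
from kubernetes.client import models as k8s

with DAG(
    dag_id="ml_pipeline_mixed_resources",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    # CPU-heavy preprocessing notebook: request CPU and memory only.
    preprocess = KubernetesPodOperator(
        task_id="preprocess",
        name="preprocess",
        image="registry.example.com/kernels/etl:latest",  # hypothetical
        cmds=["papermill"],
        arguments=["/notebooks/preprocess.ipynb", "/output/preprocess-out.ipynb"],
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "8Gi"},
        ),
    )
    # GPU-bound training notebook: its pod asks the cluster for one GPU.
    train = KubernetesPodOperator(
        task_id="train",
        name="train",
        image="registry.example.com/kernels/gpu-train:latest",  # hypothetical
        cmds=["papermill"],
        arguments=["/notebooks/train.ipynb", "/output/train-out.ipynb"],
        container_resources=k8s.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},
        ),
    )
    preprocess >> train
```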
Background knowledge for attendees:
Basic understanding of notebooks and Python
Basic understanding of Papermill, Docker, and Kubernetes (useful but not required)
Basic understanding of any scheduler (useful but not required)