Oct. 15, 2020, 4:30 p.m.–Oct. 15, 2020, 5 p.m.
in Enterprise Jupyter Infrastructure
NotebookOps: A pattern for building notebook-centric data platforms
In 2018, Netflix and PayPal wrote about how they set up powerful data platforms centered around Jupyter notebooks. This talk will look at the open-source components required for building such data platforms, illustrate how they all tie together, and reflect on some learnings from setting up a notebook-centric data platform at one of India's largest online grocery delivery companies.
Over the past few years, we've seen large organizations adopt Jupyter at scale to set up their internal DIY ("do it yourself") analytics notebook infrastructure. In 2018, Netflix and PayPal wrote about how they set up powerful data platforms centered around Jupyter notebooks to fuel experimentation and innovation at scale. In this talk, we'll look at the components required for building a notebook-centric data platform along with all the open-source tools involved, understand how the components tie together, and reflect on some learnings from setting up such a platform at one of India's largest online grocery delivery companies.
This talk is aimed at data engineers, but it's also relevant to data analysts and data scientists. Basic knowledge of Python, Jupyter notebooks, and the Jupyter ecosystem will be useful but not required. After this talk, the audience will understand the open-source components that can help them build notebook-centric data platforms.
- The enterprise notebook infrastructure trend
- Large scale adoption of Jupyter inside organizations
- How Netflix put notebooks at the core of their data platforms in 2018
- Experience building this at Grofers, one of India's largest online grocery services
- Talk briefly about how it was all set up before (a single JupyterLab server, other standalone servers)
- Highlight individual components of a notebook-centric data platform, and then dive into each component
- Component 1: A multi-user notebook environment
- "Where do I run my notebook? Help! I need more resources!"
  - JupyterHub: A multi-user version of JupyterLab designed for large user groups
  - The setup on Kubernetes: each JupyterLab server launches as a Kubernetes pod
- Pre-building different environments for different types of users / use-cases, different kernels and packages
  - Letting users select the CPU and memory for their environment
- The setup with a home directory for each user on a network file system for persistence (for example: AWS EFS)
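The pieces above can be tied together in JupyterHub's configuration file. The sketch below is a minimal, illustrative `jupyterhub_config.py` using KubeSpawner; the image names, resource sizes, and mount paths are placeholders, not the values used at Grofers.

```python
# jupyterhub_config.py -- minimal sketch of JupyterHub on Kubernetes.
# The `c` config object is provided by JupyterHub when it loads this file.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

# Pre-built environments: each profile maps to a Docker image with its own
# kernels and packages, and lets the user pick CPU/memory at spawn time.
c.KubeSpawner.profile_list = [
    {
        "display_name": "Analytics (Python, 2 CPU / 4 GB)",
        "default": True,
        "kubespawner_override": {
            "image": "example.registry/notebooks/analytics:latest",
            "cpu_limit": 2,
            "mem_limit": "4G",
        },
    },
    {
        "display_name": "ML (Python + ML libraries, 4 CPU / 16 GB)",
        "kubespawner_override": {
            "image": "example.registry/notebooks/ml:latest",
            "cpu_limit": 4,
            "mem_limit": "16G",
        },
    },
]

# Persist each user's home directory on a network file system (for example,
# an EFS-backed PersistentVolumeClaim) so work survives pod restarts.
c.KubeSpawner.storage_pvc_ensure = True
c.KubeSpawner.pvc_name_template = "home-{username}"
c.KubeSpawner.volumes = [
    {"name": "home", "persistentVolumeClaim": {"claimName": "home-{username}"}}
]
c.KubeSpawner.volume_mounts = [{"name": "home", "mountPath": "/home/jovyan"}]
```

Each entry in `profile_list` appears as an option on the spawn page, so users pick an environment with one click instead of managing packages themselves.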
- "I need some data from a table in some database. How do I get it?"
- Tools that can be built to let users access data from various systems and databases in the organization
- Fetching credentials from a database or secret store, maintained by the data platform team (for example: Vault)
- Building libraries or magics that let users run queries by just specifying a unique id for the target system
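One way to sketch such a library: users name a target system by id, and the platform resolves credentials behind the scenes. The example below is self-contained and hypothetical; it uses an in-memory dict as a stand-in for a real secret store like Vault, and SQLite as the only backend, purely for illustration.

```python
import sqlite3

# Toy stand-in for the platform's secret store (e.g. Vault). In production,
# the data platform team maintains these entries and users never see them.
_SECRET_STORE = {
    "analytics_db": {"driver": "sqlite", "dsn": ":memory:"},
}

def run_query(target_id: str, sql: str):
    """Run `sql` against the system registered under `target_id`.

    Users only specify the target system's unique id; credentials are
    fetched from the secret store and never appear in the notebook.
    """
    creds = _SECRET_STORE[target_id]
    if creds["driver"] != "sqlite":
        raise NotImplementedError("only sqlite is shown in this sketch")
    conn = sqlite3.connect(creds["dsn"])
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()

rows = run_query("analytics_db", "SELECT 1 + 1")
```

The same resolution logic can be wrapped in an IPython magic so a user writes something like `%%query analytics_db` followed by SQL in a cell.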
- Component 2: A notebook scheduling environment
- "How do I schedule my notebook to send out a report every day?"
- "How do I train this model on new data every day?"
- What are Airflow and papermill?
- The concept of DAGs, Airflow as a notebook scheduler
- Airflow macros as input parameters to papermill
- Parameterized notebook execution using the PapermillOperator
  - The setup on Kubernetes with KubernetesExecutor: each notebook launches as a Kubernetes pod
- Running notebooks in the same environment they were written in, with the same kernels and packages
  - Re-using JupyterHub environment information: CPU, memory, and Docker image selected by the user
  - Running multiple notebooks sequentially or in parallel
- "My notebook job failed. How do I look at the error?"
- Writing notebook outputs to S3 using papermill
  - Running Commuter on the S3 bucket to browse executed notebooks
  - Sending relevant info and tagging people on alert channels
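Putting Component 2 together, a DAG file might look like the sketch below. It assumes Apache Airflow with the papermill provider installed; the DAG id, notebook paths, and S3 bucket are illustrative placeholders.

```python
# dags/daily_report.py -- sketch of scheduling a notebook with Airflow
# and papermill. Paths and bucket names below are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="daily_report",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_report = PapermillOperator(
        task_id="run_report",
        input_nb="notebooks/daily_report.ipynb",
        # Writing the executed notebook to S3 preserves cell outputs --
        # including tracebacks on failure -- for inspection via Commuter.
        output_nb="s3://example-notebook-outputs/daily_report/{{ ds }}.ipynb",
        # Airflow macros ({{ ds }}) are rendered and injected into the
        # notebook's `parameters` cell by papermill.
        parameters={"run_date": "{{ ds }}"},
    )
```

With the KubernetesExecutor, each task instance runs in its own pod, so the pod spec can reuse the Docker image and resource limits the user already selected on JupyterHub.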
- Component 3: A collaboration tool
- "I need someone to review my notebook before we can merge and deploy it!"
- Using GitHub for notebook projects
- Issues and pull requests for notebook reviews
- Building tools around git and the GitHub API to automate common tasks
- Cloning and opening existing notebook projects on JupyterHub
- How notebooks in GitHub repos can act as immutable input to Airflow
- CLI tool to schedule notebooks directly from JupyterHub
- How GitHub enables continuous deployment for notebook projects and how to use that to your advantage
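A small example of the kind of helper such a CLI can build on the GitHub REST API: opening a pull request for a notebook branch so it gets reviewed before merge and deployment. The org, repo, branch, and token below are placeholders, and the request is only constructed, not sent.

```python
import json
import urllib.request

API = "https://api.github.com"

def pr_request(org: str, repo: str, branch: str, token: str) -> urllib.request.Request:
    """Build a GitHub REST API request that opens a pull request
    merging `branch` into main. Sending it (urllib.request.urlopen)
    requires a valid token and network access."""
    payload = {
        "title": f"Review notebook changes on {branch}",
        "head": branch,
        "base": "main",
    }
    return urllib.request.Request(
        f"{API}/repos/{org}/{repo}/pulls",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )

req = pr_request("example-org", "notebooks", "feature/new-report", "PLACEHOLDER_TOKEN")
```

Wrapping calls like this behind a single CLI command means a user never has to leave JupyterHub to get a notebook into review.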
- Learnings and improvements
- Removing accidental complexity
  - Not forcing software tools (for example: git) on a data team and expecting them to pick them up quickly
  - Giving users single-click solutions, automating common tasks like pushing a notebook to GitHub or scheduling a notebook on Airflow
- Actively monitoring the whole system using Prometheus and Grafana
The 2018 blog posts by Netflix and PayPal: