Thursday Oct. 15, 2020, 4:45 p.m.–Oct. 15, 2020, 5:15 p.m. in Data Science Applications

Distributed Computing using Jupyter Notebooks with Spark

Jason Yang

Brief Summary

Jupyter + Spark is a powerful combination in your data science toolkit to handle big data. In this talk, we explore running Jupyter on a cluster of machines in the cloud. Using PySpark, we construct a pipeline for machine learning (LDA Topic Modeling) on 2.8 million news articles on COVID-19.


In recent years, cloud providers (AWS, Azure, GCP) have simplified deploying and managing Jupyter Notebooks on Spark clusters. Pairing notebooks with the computational power of a Spark cluster frees data scientists to handle today’s explosion of big data. As an example of how to use Jupyter + Spark to tackle real-world big data problems, I will show you how I analyzed and built machine learning models on all news articles available online relating to COVID-19.

In this talk, we will cover: 1) How to spin up a Spark cluster with Jupyter Notebook on Google Cloud; 2) Using a 16-node cluster with 1TB+ RAM for LDA topic modeling; 3) Visualizing Spark results in Jupyter Notebook.

This talk is designed for intermediate Jupyter users, with novices in mind. To get the most out of it, attendees should have a general knowledge of distributed computing and cloud computing.