Matthew is an open source software developer in the numeric Python ecosystem. He maintains several PyData libraries, but today focuses mostly on Dask, a library for scalable computing. Matthew worked for Anaconda Inc for several years, then built out the Dask team at NVIDIA for RAPIDS, and most recently founded Coiled to improve Python's scalability with Dask for large organizations.
Learn how to run the Python data science ecosystem in parallel with Dask in this hands-on tutorial.
Dask is an open source library for parallel computing in Python that integrates deeply with common Python libraries like NumPy, pandas, XGBoost, PyTorch, Xarray, and of course Jupyter. In this hands-on tutorial we will launch clusters of distributed machines and use those clusters to process and analyze data on the cloud.
Students should have some familiarity with Python and pandas syntax, and be interested in the challenges of large-scale computation.
Distributed computing is great! Unfortunately, distributed computing is also hard and often heavyweight. This friction gets in the way of the human+computer joint data exploration process that we value so dearly in the Jupyter ecosystem.
Dask is a popular library for parallel and distributed computing in Python that was co-developed alongside Jupyter with human interaction and interactivity in mind. In this talk we'll discuss Dask in the context of interactive data science, highlighting the ways in which Dask and Jupyter leverage each other to achieve a powerful and scalable user experience that fits easily into your hand. In particular, we'll highlight rich notebook outputs, JupyterLab dashboard extensions, and JupyterHub deployment integrations, and show how leveraging the extensibility of Jupyter can result in a first-class open source experience.