JupyterCon 2023

Machine learning with dirty tables: encoding, joining and deduplicating
05-11, 15:30–16:00 (Europe/Paris), Gaston Berger

Data scientists and analysts working with Jupyter are too often forced to deal with dirty data (typos, abbreviations, duplicates, missing values...) coming from various sources.

Let us step into the shoes of a data scientist and, working in a Jupyter Notebook, try to perform a classification or regression task on data coming from a collection of raw tables.

In this tutorial, we will demonstrate how dirty_cat, an open-source Python package developed by our team, can help prepare tables for machine learning and improve prediction results in the presence of dirty data.

Some of the common problems we will be tackling are:
- joining groups of tables on inexact matches;
- de-duplicating values;
- encoding dirty categories with interpretable results.

All of this is done on dirty categorical columns, which are transformed into numerical arrays ready for machine learning.
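To make the three tasks above concrete, here is a minimal sketch in plain Python. It is not the dirty_cat API: the function names (`similarity`, `fuzzy_match`, `deduplicate_values`, `similarity_encode`), the thresholds, and the country-name data are all hypothetical, chosen only for illustration. It assumes a simple notion of string closeness, the Jaccard similarity between sets of character 3-grams, and reuses it for inexact matching, de-duplication, and encoding.

```python
from collections import Counter


def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def fuzzy_match(value, candidates, threshold=0.3):
    """Return the candidate closest to `value`, or None if none is close enough.

    Assumes `candidates` is non-empty. This is the core of an inexact join:
    each key of one table is matched to its nearest key in the other table.
    """
    best = max(candidates, key=lambda c: similarity(value, c))
    return best if similarity(value, best) >= threshold else None


def deduplicate_values(values, threshold=0.4):
    """Replace each value by the most frequent spelling among its close variants."""
    counts = Counter(values)
    canonical = {}
    for value in counts:
        close = [v for v in counts if similarity(value, v) >= threshold]
        canonical[value] = max(close, key=lambda v: counts[v])
    return [canonical[v] for v in values]


def similarity_encode(values, prototypes):
    """Encode each string as its vector of similarities to reference categories."""
    return [[similarity(v, p) for p in prototypes] for v in values]


# Hypothetical dirty column and clean reference list.
reference = ["France", "Germany", "Italy"]
dirty = ["France", "France", "Frannce", "Germany", "Germany", "Germny"]

matched = [fuzzy_match(v, reference) for v in ["Frannce", "germany", "Spain"]]
cleaned = deduplicate_values(dirty)
encoded = similarity_encode(["Frannce"], ["France", "Germany"])

print(matched)
print(cleaned)
print(encoded)
```

Each row of `encoded` has one column per reference category, so a misspelled value lands close to its intended category instead of becoming a brand-new one-hot column. The de-duplication step here is deliberately naive (pairwise comparison, no clustering), so it only suits small sets of distinct values.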

Examples of individual features can be seen here:
- https://dirty-cat.github.io/stable/
- https://github.com/dirty-cat/dirty_cat/tree/main/examples (link to Jupyter notebooks)

I am currently working as a Software Engineer at Inria Saclay, in the Soda team, which focuses on research in data science and machine learning applied to health and society.
I am also a maintainer of the dirty_cat package, an open-source Python tool that facilitates machine learning on dirty data.
I previously worked as a data scientist for two international organizations, Eurostat and the OECD.