JupyterCon 2023

Predictive survival analysis and competing risk modeling with scikit-learn, scikit-survival, lifelines, Ibis, and DuckDB (Part 1)
05-12, 10:30–13:00 (Europe/Paris), Room 3 (Tutorial)

Tutorial attendance is included in the conference pass, but seats are limited, so we ask you to register for this tutorial at https://www.jupytercon.com/tickets.

Tutorial notebooks:

According to Wikipedia:

Survival analysis is a branch of statistics for analyzing the expected duration of time until one event occurs, such as deaths in biological organisms and failure in mechanical systems. [...]. Survival analysis attempts to answer certain questions, such as what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?

In this two-part tutorial (morning and afternoon), we will take a deep dive into a practical case study of predictive maintenance using tools from the scientific Python ecosystem. Here is a tentative agenda:

Part 1 (Morning)
- What time-censored data is and why it is a problem when training time-to-event regression models.
- Single event survival analysis with Kaplan-Meier using scikit-survival.
- Competing risks modeling with the Nelson-Aalen and Aalen-Johansen estimators using lifelines.
- Evaluation of the calibration of survival analysis estimators using the integrated Brier score (IBS) metric.
- Predictive survival analysis modeling with Cox Proportional Hazards and Survival Forests using scikit-survival, and with GradientBoostedIBS implemented from scratch with scikit-learn.
- Estimation of the cause-specific cumulative incidence function (CIF) using our GradientBoostedIBS model.
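
To give a feel for the first two agenda items, here is a minimal sketch of the Kaplan-Meier estimator in pure Python. It is not the tutorial's code: the function name `kaplan_meier` and the toy data are assumptions made for illustration, and in practice you would use the estimators shipped with scikit-survival or lifelines. A censored observation (`observed=False`) only tells us that the event had not happened yet at that time, so it leaves the risk set without counting as an event.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time (hypothetical helper)."""
    # Number of observed events at each time.
    events = Counter(t for t, e in zip(durations, observed) if e)
    # Number of individuals leaving the risk set at each time (event or censoring).
    exits = Counter(durations)
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Conditional probability of surviving past t given being at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data (made up): durations in arbitrary time units, False means censored.
durations = [2, 3, 3, 5, 6, 8, 8, 10]
observed = [True, True, False, True, False, True, False, False]

curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

Simply dropping the censored rows, or treating their durations as event times, would bias the estimated survival downwards, which is the problem the first agenda item refers to.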

Part 2 (Afternoon)
- How to use a trained GradientBoostedIBS model to estimate the median survival time and the probability of survival at a fixed time horizon.
- Measuring the statistical association between input features and survival probabilities using partial dependence plots and permutation feature importance.
- Presentation of the results of a benchmark of various survival analysis estimators on the KKBox dataset.
- Extracting implicit failure data from operation logs using sessionization with Ibis and DuckDB.
- Hands-on wrap-up exercise.
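
As a rough illustration of the first Part 2 item, the sketch below reads a median survival time and a survival probability at a fixed horizon off a step-shaped survival curve. It does not use the GradientBoostedIBS model: the helper names `survival_at` and `median_survival_time` and the example curve are assumptions, and the curve is taken to be a list of `(time, survival probability)` steps such as the one a Kaplan-Meier estimate or a fitted model might produce for one individual.

```python
def survival_at(curve, horizon):
    """S(horizon) for a right-continuous step curve; S is 1.0 before the first step."""
    prob = 1.0
    for t, s in curve:
        if t > horizon:
            break
        prob = s
    return prob


def median_survival_time(curve):
    """Smallest time at which S(t) drops to 0.5 or below, or None if it never does."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical survival curve, sorted by time.
curve = [(2, 0.875), (3, 0.75), (5, 0.6), (8, 0.4)]

print(survival_at(curve, 6))        # probability of surviving past t=6
print(median_survival_time(curve))  # first time where S(t) <= 0.5
```

Returning `None` when the curve never reaches 0.5 is a deliberate choice in this sketch: with heavy censoring the median may simply not be identifiable from the data.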

It is not recommended to attend Part 2 without having attended Part 1.

Target audience: good familiarity with machine learning concepts, with prior experience using scikit-learn (you know what cross-validation means and how to fit a Random Forest on a Pandas dataframe).

Olivier is a software engineer at Inria and works as a maintainer for the scikit-learn project, a popular open source machine learning library for Python. Olivier also teaches applied Deep Learning and Machine Learning at UBS Vannes.
