
Tuesday Oct. 13, 2020, 4 p.m.–Oct. 13, 2020, 4:30 p.m. in Data Science Applications

Optimizing model performance with feature engineering and hyperparameter optimisation

Nanthini Balasubramanian, Devin Robison

Audience level:

Brief Summary

Optimizing the performance of a machine learning model can be a labor-intensive process, and it is often overlooked in real-life applications. In this talk, we'll walk through a Jupyter Notebook that combines the GPU-accelerated RAPIDS libraries with Optuna and xfeat as a potential way to address some of the constraints of feature engineering and hyperparameter optimization, and uses MLflow for experiment tracking.


This talk walks through a demo Jupyter notebook that uses RAPIDS, Optuna, xfeat, and MLflow to apply feature engineering and hyperparameter optimization to a classification problem, together with experiment tracking and eventual production deployment.

Feature engineering is the process of transforming raw data into features that better represent the underlying patterns in the data. Hyperparameter optimization complements a good model by tuning its hyperparameters, the settings that are fixed before training. Both can significantly improve a model's accuracy. The RAPIDS framework provides a suite of libraries that can execute end-to-end data science pipelines entirely on GPUs. Optuna is a lightweight framework for automatic hyperparameter optimization, and xfeat is a feature engineering and exploration library that uses GPUs and Optuna. MLflow is a framework for tracking experiment state, ensuring reproducibility, and storing and deploying models.
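To make the idea of feature engineering concrete, here is a minimal plain-Python sketch of two common transformations: an arithmetic combination of two numeric columns, and a target encoding of a categorical column. This is not xfeat's API; the toy rows, the column names (`city`, `price`, `qty`, `label`), and both helper functions are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical toy dataset: one categorical column, two numeric columns, one binary label.
rows = [
    {"city": "A", "price": 2.0, "qty": 3, "label": 1},
    {"city": "A", "price": 4.0, "qty": 1, "label": 0},
    {"city": "B", "price": 1.0, "qty": 5, "label": 1},
    {"city": "B", "price": 3.0, "qty": 2, "label": 1},
]


def add_arithmetic_feature(rows):
    """Return copies of the rows with the product price * qty added as a new feature."""
    return [{**r, "price_x_qty": r["price"] * r["qty"]} for r in rows]


def add_target_encoding(rows, column, target="label"):
    """Return copies of the rows with the per-category mean of the target added."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r[column]] += r[target]
        counts[r[column]] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [{**r, f"{column}_te": means[r[column]]} for r in rows]


engineered = add_target_encoding(add_arithmetic_feature(rows), "city")
```

This sketch computes the category means over all rows for brevity. In practice, target encoding is usually computed out-of-fold so that a row's own label does not leak into its encoded feature.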

We'll use xfeat to perform feature engineering operations that add more features to the dataset, using numerical and categorical encoding strategies such as arithmetic combinations and target encoding. cuML, a library in RAPIDS, provides a set of GPU-accelerated machine learning models. Optuna will be used to select the most pertinent features among the original and newly added ones, along with the hyperparameters for the model we use. Lastly, MLflow will be used to record the entire process and publish the final model as a REST service.
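The joint search over feature subsets and hyperparameters described above can be sketched in plain Python. This is an illustration of the technique only, under several assumptions: it uses uniform random search rather than Optuna's samplers, a tiny hand-written k-nearest-neighbour classifier rather than a cuML model, and an in-memory `history` list standing in for experiment tracking. The dataset, `loo_accuracy`, and `random_search` are all hypothetical names.

```python
import random

# Hypothetical toy dataset: "signal" separates the two classes, "noise" does not.
DATA = [
    ({"signal": 0.1, "noise": 0.9}, 0),
    ({"signal": 0.2, "noise": 0.1}, 0),
    ({"signal": 0.3, "noise": 0.5}, 0),
    ({"signal": 0.7, "noise": 0.8}, 1),
    ({"signal": 0.8, "noise": 0.2}, 1),
    ({"signal": 0.9, "noise": 0.6}, 1),
]
FEATURES = ["signal", "noise"]


def loo_accuracy(features, k):
    """Leave-one-out accuracy of a k-nearest-neighbour majority vote."""
    correct = 0
    for i, (x, y) in enumerate(DATA):
        others = [d for j, d in enumerate(DATA) if j != i]
        # Sort the remaining points by squared distance on the selected features.
        others.sort(key=lambda d: sum((d[0][f] - x[f]) ** 2 for f in features))
        votes = [label for _, label in others[:k]]
        predicted = 1 if sum(votes) * 2 > len(votes) else 0
        correct += int(predicted == y)
    return correct / len(DATA)


def random_search(n_trials, seed=0):
    """Sample a feature subset and a value of k per trial; keep the best trial."""
    rng = random.Random(seed)
    best, history = None, []
    for trial in range(n_trials):
        mask = [rng.random() < 0.5 for _ in FEATURES]
        # Fall back to the first feature if the sampled subset is empty.
        features = [f for f, keep in zip(FEATURES, mask) if keep] or FEATURES[:1]
        k = rng.choice([1, 3])
        record = {"trial": trial, "features": features, "k": k,
                  "score": loo_accuracy(features, k)}
        history.append(record)  # stand-in for logging a run to a tracker
        if best is None or record["score"] > best["score"]:
            best = record
    return best, history


best, history = random_search(n_trials=20)
```

Each trial treats the feature mask and `k` as one combined search space, which is the same idea as letting a single optimizer choose both the features and the model's hyperparameters. A dedicated framework would typically replace the uniform sampling with a smarter strategy and persist each trial instead of keeping it in a list.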

By combining these libraries, we can measure the improvement in the model's performance and compare the results. With minimal effort, we can improve the model's performance and run the end-to-end pipeline much faster than if it ran entirely on the CPU.

This talk can serve as a starting point for anyone looking to optimize data science pipelines on the GPU, and as an introduction to the RAPIDS framework.