Addressing data errors such as wrong, missing, noisy, biased, or out-of-distribution values has become a crucial part of the machine learning (ML) development lifecycle. Unfortunately, traditional approaches have relied either on treating the symptoms by refining the model architecture, or on improving data quality by repairing incorrect values regardless of their significance to the downstream model. Both strategies tackle the problem in isolation and disregard the structure of modern ML pipelines, which involve a series of steps for data preprocessing, model training, and prediction processing. Consequently, they miss the opportunity to consider how different data errors propagate through the pipeline and how they impact its ability to perform downstream tasks. In recent years, the research community has made significant strides towards more holistic approaches for identifying the most harmful data errors, performing the most beneficial repairs, and ensuring reliable performance even when some data errors remain. This tutorial will survey prominent work published in this space and showcase several tools that have been developed. By combining theoretical foundations with practical demonstrations, the tutorial will equip attendees with actionable strategies to diagnose and mitigate data quality issues, improving the reliability, fairness, and transparency of ML systems in real-world settings.