Addressing data errors—such as missing, incorrect, noisy, biased, or out-of-distribution values—is essential for building reliable machine learning (ML) systems. Traditional methods often focus either on refining the training process to minimize error symptoms or on repairing data errors indiscriminately, without addressing their root causes. These isolated approaches ignore how errors originate and propagate through the interconnected stages of ML pipelines—data preprocessing, model training, and prediction—and therefore yield superficial fixes and suboptimal solutions. Consequently, they miss the opportunity to understand how data errors affect downstream tasks and to implement targeted, effective interventions. In recent years, the research community has made significant progress on holistic approaches that identify the most harmful data errors, prioritize the most impactful repairs, and reason about the effects of errors that cannot be fully resolved. This tutorial surveys prominent work in this area and introduces practical tools for addressing data quality issues across the ML lifecycle. By combining theoretical insights with hands-on demonstrations, attendees will gain actionable strategies to diagnose, repair, and manage data errors, enhancing the reliability, fairness, and transparency of ML systems in real-world applications.
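
To make the idea of impact-driven repair concrete, the following is a minimal, hypothetical sketch (not a specific system covered by the tutorial): candidate repairs are scored by how much each improves a downstream validation metric, so the most harmful errors are fixed first. The synthetic data, the mean-imputation repairs, and validation accuracy as the downstream metric are all illustrative assumptions.

```python
# Hypothetical sketch: rank candidate data repairs by their measured
# effect on a downstream validation metric, rather than repairing blindly.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic training data: two informative features, labels from a linear rule.
n = 500
X_clean = rng.normal(size=(n, 2))
y = (X_clean @ np.array([1.5, -2.0]) > 0).astype(int)

# Inject missing values (a common data error) into each column.
X_dirty = X_clean.copy()
for col in range(2):
    mask = rng.random(n) < 0.3
    X_dirty[mask, col] = np.nan

# Held-out validation set for measuring downstream impact.
X_val = rng.normal(size=(200, 2))
y_val = (X_val @ np.array([1.5, -2.0]) > 0).astype(int)

def train_and_score(X_train):
    """Zero-fill any remaining NaNs, train, and return validation accuracy."""
    X_filled = np.where(np.isnan(X_train), 0.0, X_train)
    model = LogisticRegression().fit(X_filled, y)
    return accuracy_score(y_val, model.predict(X_val))

baseline = train_and_score(X_dirty)

# Candidate repairs: mean-impute one column at a time; score each repair
# by the change in validation accuracy relative to the dirty baseline.
impacts = {}
for col in range(2):
    X_repaired = X_dirty.copy()
    nan_mask = np.isnan(X_repaired[:, col])
    X_repaired[nan_mask, col] = np.nanmean(X_repaired[:, col])
    impacts[f"impute column {col}"] = train_and_score(X_repaired) - baseline

# Apply (or report) repairs in order of measured downstream benefit.
for repair, gain in sorted(impacts.items(), key=lambda kv: -kv[1]):
    print(f"{repair}: validation accuracy change {gain:+.3f}")
```

In this sketch, a repair that barely moves the validation metric can be deprioritized or skipped, which is the essence of targeting root causes by their downstream effect rather than treating all errors as equally harmful.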