SAGA++: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications

Abstract

In the exploratory data science lifecycle, data scientists often spend the majority of their time finding, integrating, validating, and cleaning relevant datasets. Despite recent work on data validation and numerous error detection and correction algorithms, in practice, data cleaning for ML remains a largely manual, unpleasant, and labor-intensive trial-and-error process, especially in large-scale, distributed computation settings. The target ML application, such as a classification or regression model, can, however, provide valuable feedback for selecting effective data cleaning strategies. In this paper, we introduce SAGA++, a framework for automatically generating the top-K most effective data cleaning pipelines. SAGA++ adopts ideas from AutoML, feature selection, and hyper-parameter tuning. Our framework is extensible with user-provided constraints, new data cleaning primitives, and ML applications; automatically generates hybrid runtime plans of local and distributed operations; and prunes the search space using interesting properties (e.g., monotonicity). Furthermore, we exploit guided sampling on the input dataset to enable enumeration on a smaller subset, reducing the time required to discover the top-K pipelines. As a post-processing step, we also prune the selected top-K pipelines, removing redundant and less effective cleaning primitives. Rather than aiming for full automation, which is unrealistic in practice, SAGA++ simplifies the mechanical aspects of data cleaning. Our experiments show that SAGA++ yields robust accuracy improvements over the state of the art, and good scalability with increasing data sizes and numbers of evaluated pipelines.
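To make the core idea of the abstract concrete, the following is a minimal Python sketch of top-K pipeline enumeration with guided sampling and a monotonicity-style pruning criterion. It is an illustration of the general technique only, not SAGA++'s actual algorithm or API: the primitives, the score proxy (a stand-in for training and evaluating the target ML model), and the pruning rule are all hypothetical.

```python
# Hypothetical sketch: enumerate cleaning pipelines, evaluate them on a
# sample of the data, keep the top-K, and only extend pipelines that stay
# competitive (a toy stand-in for monotonicity-based pruning).
import heapq
import random

# Toy cleaning primitives: each maps a dataset (list of rows) to a cleaned
# dataset. Real systems would use detector/repair algorithms instead.
PRIMITIVES = {
    "drop_nulls": lambda rows: [r for r in rows if None not in r],
    "dedup": lambda rows: list({tuple(r): r for r in rows}.values()),
    "clip_outliers": lambda rows: [
        [min(max(v, 0), 100) if isinstance(v, (int, float)) else v for v in r]
        for r in rows
    ],
}

def score(rows):
    """Proxy for downstream ML accuracy; a real system would train and
    cross-validate the target model on the cleaned data."""
    return sum(1 for r in rows if None not in r) / max(len(rows), 1)

def top_k_pipelines(rows, k=3, max_len=3, sample_frac=0.5, seed=42):
    # Guided sampling: enumerate on a smaller subset to cut evaluation cost.
    random.seed(seed)
    sample = random.sample(rows, max(1, int(len(rows) * sample_frac)))

    heap = []        # min-heap of (score, pipeline) holding the current top-K
    frontier = [()]  # pipeline prefixes (tuples of primitive names) to extend
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for name in PRIMITIVES:
                if name in prefix:
                    continue
                pipe = prefix + (name,)
                cleaned = sample
                for p in pipe:
                    cleaned = PRIMITIVES[p](cleaned)
                s = score(cleaned)
                if len(heap) < k:
                    heapq.heappush(heap, (s, pipe))
                elif s > heap[0][0]:
                    heapq.heapreplace(heap, (s, pipe))
                # Toy pruning rule: only extend a pipeline if it is at least
                # as good as the current k-th best candidate.
                if len(pipe) < max_len and (len(heap) < k or s >= heap[0][0]):
                    next_frontier.append(pipe)
        frontier = next_frontier
    return sorted(heap, reverse=True)

if __name__ == "__main__":
    data = [[1, 2], [1, 2], [None, 3], [150, 4], [5, None], [7, 8]]
    for s, pipe in top_k_pipelines(data):
        print(f"{s:.2f}  {' -> '.join(pipe)}")
```

The sketch mirrors the abstract's structure: pipelines are evaluated on a sample rather than the full dataset, a bounded heap maintains the top-K, and prefixes that fall behind the current k-th best are never extended, which is where a monotonicity property would justify the cutoff.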

Publication
ACM Transactions on Database Systems (TODS)