Towards Automated Task-Aware Data Validation

Abstract

Data is a central resource for modern enterprises and institutions, and data validation is essential for ensuring the reliability of downstream applications. However, a major limitation of existing automated data unit testing frameworks is that they ignore the specific requirements of the tasks that consume the data. This paper introduces a task-aware approach to data validation that leverages large language models to generate customized data unit tests based on the semantics of downstream code. We present ‘tadv’, a prototype system that analyzes task code and dataset profiles to identify data access patterns, infer implicit data assumptions, and produce executable data unit tests. We evaluate our prototype with a novel benchmark comprising over 100 downstream tasks across two datasets, including annotations of their column access patterns and support for assessing the impact of synthetically injected data errors. We demonstrate that tadv outperforms task-agnostic baselines in detecting the data columns accessed by downstream tasks and generating data unit tests that account for the end-to-end impact of data errors. We make our benchmark and prototype code publicly available.
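To make the idea concrete, below is a minimal sketch of the kind of task-aware data unit test the paper describes, assuming a hypothetical downstream task that reads only the `age` and `income` columns of a pandas DataFrame. The column names, inferred assumptions, and thresholds are illustrative, not tadv's actual output.

```python
import pandas as pd

# Hypothetical: columns the downstream task code actually accesses,
# as identified from its data access patterns.
ACCESSED_COLUMNS = ["age", "income"]

def test_task_aware_validation(df: pd.DataFrame) -> None:
    """Illustrative data unit test in the style tadv aims to generate:
    it validates only the columns the downstream task reads, using
    assumptions inferred from that task's code."""
    # The columns the task accesses must be present.
    for col in ACCESSED_COLUMNS:
        assert col in df.columns, f"missing column accessed by task: {col}"
    # Implicit assumption inferred from the task code: it performs no
    # null handling, so accessed columns must be non-null.
    for col in ACCESSED_COLUMNS:
        assert df[col].notna().all(), f"nulls in {col} would break the task"
    # Implicit range assumption (e.g., the task bins age into [0, 120]).
    assert df["age"].between(0, 120).all(), "age outside expected range"
    # Columns the task never touches are deliberately left unvalidated,
    # which is what distinguishes this from task-agnostic checks.

# Example usage:
# df = pd.read_csv("data.csv")
# test_task_aware_validation(df)
```

The key design point, per the abstract, is scoping: validation effort is spent only on the data assumptions that the consuming task actually depends on, so errors in unused columns do not trigger spurious failures.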

Publication
Workshop on Data Management for End-to-End Machine Learning (DEEM) at SIGMOD