Software systems that learn from data with machine learning (ML) are increasingly used to automate impactful decisions. However, the resulting ML pipelines suffer from many unsolved data management challenges, for example with respect to handling personal and security-critical data and complying with legal regulations. We argue that this is due to shortcomings in existing ML pipeline abstractions and the messy imperative code produced by data scientists. We propose a new approach for ML pipelines that leverages the code generation capabilities of large language models to extract declarative logical query plans from messy data science code. We envision this approach as a foundation for managing deployed ML pipelines and their data artifacts in upcoming Data-AI systems. We discuss a challenging example scenario and present initial experiments with a prototype to validate our vision.
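To make the idea concrete, the following minimal sketch contrasts a small imperative pandas pipeline with the kind of declarative logical plan such an extraction could target. The operator classes, column names, and the `prepare_training_data` function are hypothetical illustrations chosen for this example, not the representation used by our prototype.

```python
# Hypothetical sketch: messy imperative pipeline code (top) and a declarative
# logical query plan with equivalent semantics (bottom). All names are illustrative.
from dataclasses import dataclass
from typing import List
import pandas as pd


# --- Imperative data science code, as typically written by data scientists ---
def prepare_training_data(customers: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    customers = customers[customers["country"] == "DE"]      # filter on personal data
    joined = customers.merge(orders, on="customer_id")       # join two sources
    joined["total"] = joined["price"] * joined["quantity"]   # derived feature
    return joined[["customer_id", "total", "churned"]]       # projection


# --- Operators of a simple logical query plan representation ---
@dataclass
class Source:
    name: str

@dataclass
class Filter:
    predicate: str
    child: object

@dataclass
class Join:
    key: str
    left: object
    right: object

@dataclass
class Map:
    column: str
    expression: str
    child: object

@dataclass
class Project:
    columns: List[str]
    child: object


# The plan an LLM-based extractor could produce from the code above.
extracted_plan = Project(
    columns=["customer_id", "total", "churned"],
    child=Map(
        column="total",
        expression="price * quantity",
        child=Join(
            key="customer_id",
            left=Filter(predicate="country == 'DE'", child=Source("customers")),
            right=Source("orders"),
        ),
    ),
)

if __name__ == "__main__":
    print(extracted_plan)
```

Such a plan makes the data flow explicit, so that, for instance, the use of personal data from the `customers` source can be traced through every downstream operator.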