Declarative machine learning pipeline management via logical query plans

Abstract

Machine learning (ML) systems are increasingly used to automate impactful decisions. However, the resulting ML pipelines suffer from many unsolved data management challenges, including ensuring correctness, reliability, and compliance with legal regulations. We argue that this is because current ML pipeline libraries and ML cloud services lack fundamental data-centric abstractions comparable to logical query plans in databases. In this thesis, we propose a new approach for managing ML pipelines: extracting ‘logical query plans’ from ML pipeline code and automatically inferring pipeline semantics. Based on this declarative pipeline abstraction, we show how to enhance ML applications and tooling with provenance tracking and automatic rewriting capabilities, which enables us to manage ML pipelines and their data artifacts in novel ways. We present five algorithmic and methodological contributions, each embodied in a library, and organize them into three main parts. The first part focuses on the extraction of logical query plans from ML pipelines. mlinspect enables lightweight inspection and efficient instrumentation of pipelines. Lester automatically rewrites messy imperative code into clean declarative pipelines before deployment, enabling a high degree of automation for production use cases such as compliance with the right to be forgotten. The second part addresses automatic rewriting of ML pipelines. mlwhatif enables data-centric what-if analysis and optimises what-if workloads via multi-query optimisation. mlidea assists with interactively improving ML data preparation code via automatically generated ‘shadow pipelines’ and incremental view maintenance. The final part covers provenance tracking and reasoning about the input and output data of ML pipelines: ArgusEyes provides provenance-based screening in continuous integration workflows.
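
To make the central idea more concrete, the following is a minimal illustrative sketch of what a ‘logical query plan’ with row-level provenance could look like for a toy pipeline. It is not the API of mlinspect or of any other library named above; the `Operator` class, the helper functions, and the example data are all hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Operator:
    """One node of a toy logical plan (hypothetical, not a real library API)."""
    name: str
    kind: str  # e.g. "source", "selection", "projection"
    inputs: list = field(default_factory=list)


def plan_to_lines(op, depth=0):
    """Render the plan as an indented tree, root operator first."""
    lines = ["  " * depth + f"{op.kind}: {op.name}"]
    for child in op.inputs:
        lines.extend(plan_to_lines(child, depth + 1))
    return lines


def select_with_provenance(rows, predicate):
    """Apply a selection and keep each surviving row's source index."""
    return [(idx, row) for idx, row in enumerate(rows) if predicate(row)]


# Toy pipeline: read a table, filter adults, keep two feature columns.
source = Operator("patients", "source")
selection = Operator("age >= 18", "selection", [source])
projection = Operator("[age, income]", "projection", [selection])

rows = [
    {"age": 17, "income": 0},
    {"age": 42, "income": 50},
    {"age": 30, "income": 40},
]
kept = select_with_provenance(rows, lambda r: r["age"] >= 18)
provenance_ids = [idx for idx, _ in kept]

print("\n".join(plan_to_lines(projection)))
print("input rows that reach the model:", provenance_ids)
```

In this sketch, the plan makes the pipeline's structure explicit as relational-style operators, and the retained source indices indicate which input records contributed to the output. Under these assumptions, that is the kind of information a tool would need to reason about, for example, which records a deletion request affects.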

Publication
PhD Thesis, University of Amsterdam