Recent Publications

(2024). AnyMatch - Efficient Zero-Shot Entity Matching with a Small Language Model. Workshop on Preparing Good Data for Generative AI at AAAI.

(2024). Navigating Data Errors in Machine Learning Pipelines: Identify, Debug, and Learn. IEEE International Conference on Data Engineering (ICDE, tutorial).

(2024). Towards Query Optimizer as a Service (QOaaS) in a Unified LakeHouse Platform: Can One QO Rule Them All?. Conference on Innovative Data Systems Research (CIDR).

(2024). Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code. [arxiv preprint].

(2024). Efficient and accurate forecasting in large-scale settings. PhD Thesis, University of Amsterdam.

Team

Faculty, PhD Students & Staff
Prof. Dr. Sebastian Schelter
Stefan Grafberger
Hao Chen
Olga Ovcharenko
Pierre Lubitzsch
External PhD Students & Guests
Zeyu Zhang
(University of Amsterdam)
Shubha Guha
(University of Amsterdam)
Till Doehmen
(Motherduck)
Yichun Wang
(University of Amsterdam)
David Campos
(Aalborg University)
(External) Master Students
Aynaz Abdollahzadeh
(University of Amsterdam)
Leonardo Dominci
(University of Amsterdam)

Alumni (name, role and first employment)

Prof. Dr.-Ing. Sebastian Schelter

Sebastian Schelter is a Full Professor at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and Technische Universität Berlin. His research focuses on the intersection of data management and machine learning, with the goal of fostering the responsible management of data and democratising data science technologies.

The research of his group is accompanied by efficient and scalable open source implementations, many of which are applied in real world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms.

In the past, he has been an assistant professor at the University of Amsterdam, a faculty fellow at New York University, a senior applied scientist at Amazon Research and a research intern at Twitter and IBM Almaden in California. His research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award, and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS.

Scientific Service
  • Editorial duties: Associate Editor for PVLDB Volume 15, Action Editor for the Journal of Data-Centric Machine Learning Research (DMLR), Action Editor for the open source track of the Journal of Machine Learning Research (JMLR), Guest editor for the IEEE Data Engineering Bulletin
  • Organisation: Founder and co-organiser (until 2020) of the workshop series on “Data Management for End-to-End Machine Learning (DEEM)” at SIGMOD, workshop chair of EDBT 2026, co-chair of the industry track of EDBT 2022, web chair of SIGMOD 2025, co-chair of the BOSS workshop at VLDB 2016, co-organiser of the “Dutch Data Systems Design Seminar” series with CWI Amsterdam
  • Program Committee: SIGMOD 2017 & 2019-2026, VLDB 2021, ICDE 2018-2021 & 2023-2024, EDBT 2017 & 2021, CIKM 2020, PhD Symposium at VLDB 2021, DEEM workshop at SIGMOD 2021-2024, aiDM workshop at SIGMOD 2019, LSRS workshop at RecSys 2013-2015, AIDB workshop at VLDB 2020, DBML workshop at ICDE 2021, 2024 & 2025, TRL workshop at NeurIPS 2022-2024, Provenance Week 2020
  • Awards: ACM SIGMOD Systems Award 2023, ACM SIGMOD Best Demo Runner Up Award 2023, Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS
  • Keynotes: Workshop on Online Recommender Systems and User Modeling at RecSys'20, Workshop on Data Management for End-to-End Machine Learning at SIGMOD'21, Data Centric AI Workshop from ETH Zürich/Stanford 2021, Workshop on Quality in Databases at VLDB'24
  • Panelist: Systems for ML at VLDB 2021, PhD symposium at ICDE 2021, Data management challenges for LLM-powered solutions at DEEM@SIGMOD'23
Completed PhD dissertations as advisor
  • Olivier Sprangers, Efficient and accurate forecasting in large-scale settings, University of Amsterdam, 2024
    (with Maarten de Rijke)
  • Mozhdeh Ariannezhad, User-oriented recommender systems in retail, University of Amsterdam, 2023
    (with Maarten de Rijke)
Completed PhD dissertations as a committee member
  • Andra Ionescu, Feature Discovery for Data-Centric AI, TU Delft, 2025
  • Gerardo Vitagliano, Modeling the structure of tabular files for data preparation, HPI Potsdam, 2024
  • Madelon Hulsebos, Table representation learning, University of Amsterdam, 2024
  • Bojan Karlaš, Data systems for managing and debugging machine learning workflows, ETH Zürich, 2023
  • Cedric Renggli, Building data-centric systems for machine learning development and operations, ETH Zürich, 2023
  • Amir Pouya Aghasadeghi, Generating and querying temporal property graphs, New York University, 2022
  • Ke Yang, Fairness, diversity, and interpretability in ranking, New York University, 2021
Past employments
Professional Memberships
  • Apache Software Foundation (emeritus)
  • Association for Computing Machinery
  • Electronic Frontier Foundation
  • Deutscher Hochschulverband

Teaching

Winter semester 24-25:

Details on registration and seat assignment will be discussed in the first in-person class of each course.

EDML – Engineering Data for Machine Learning

This highly technical course focuses on the engineering and life-cycle management of data for production machine learning deployments. The course starts by recapping the fundamentals of relational data processing and dataflow systems. Subsequently, students learn about encoding, storing and managing vectorised feature representations of heterogeneous input data sources for machine learning applications, and about the architecture of current state-of-the-art systems for this task, such as Google’s TensorFlow Extended (TFX) platform. Concurrently, students will be exposed to foundational theory for this problem space, such as incremental view maintenance for relational data, fine-grained data provenance tracking via provenance semirings, and differential computation.

In addition, students will learn to identify, quantify and address common quality issues with respect to the completeness and consistency of the data. They will also be exposed to ongoing research efforts in this space, such as ML pipeline debugging and error detection techniques from data-centric AI. Finally, they will have the opportunity to discuss the practical implications of the covered technologies with invited practitioners.

Course page available at https://isis.tu-berlin.de/course/view.php?id=40407

RDSEM – Seminar on Responsible Data Engineering

In this seminar, students will learn how to: (a) critically read and interpret scientific papers drawn from the literature on responsible data management and responsible data science; (b) give a good scientific presentation that is technically precise, concentrated on the relevant topics, and also enjoyable; (c) write a scientific survey based on papers drawn from varying sources, such as contemporary computer science journals and conference proceedings. In addition, students will learn about state-of-the-art and current research topics in responsible data science and data management.

Course page available at https://isis.tu-berlin.de/course/view.php?id=40406

Programming Practical – Machine Learning Pipelines

At the beginning of every semester, we will define one or more new projects related to the implementation of machine learning pipelines (which prepare data from heterogeneous sources, encode it as features and then train one or more ML models). Students receive specifications of these components as well as selected implementation goals. The task is then to create, in self-organised teams of four, correct implementations of these components in Python, Scala or Rust. Apart from developing the source code of the prototypes, other important aspects include the use of version control tools, test-driven development, design documentation, as well as runtime experiments and improvements. At the same time, this project allows for a deeper understanding of methods related to machine learning pipelines and data analysis, as well as algorithms and data structures. The focus is, however, on the problem-oriented use of programming skills to solve a practical problem, not on comprehensive coverage of the functionality of machine learning applications.

Course page available at https://isis.tu-berlin.de/enrol/index.php?id=39392

Job Openings

PhD Positions in Data Engineering (via the BIFOLD Graduate School)

The “Berlin Institute for the Foundations of Learning and Data” (BIFOLD) offers several positions for doctoral students as part of its graduate school. Joining the graduate school also allows you to conduct research with the DEEM Lab.

If you are interested in this opportunity, please send your application via email (quoting the job reference number IV-63124) as a single PDF file to Prof. Dr. Volker Markl and Prof. Dr. Klaus-Robert Müller at gsapplication@bifold.tu-berlin.de by the 3rd of February.

Your application for the DEEM Lab has to include the following documents:

  • Filled-in application form. To apply for our lab, please enter the following information under “2.5 Research interests”:
    • Research group lead's name: Prof. Dr. Sebastian Schelter
    • Research areas of interest: Pick one from “Data-centric debugging and testing of machine learning pipelines”, “Data processing in compliance with legal regulations”, “Automated validation of data at scale”
  • Letter of motivation and CV
  • Copies of your degree certificates, academic transcripts, and a list of publications
  • Names and contact details of 2 referees (whose letters should be available by the deadline of this call)

Contact

Email: schelter [at] tu-berlin [dot] de

Technische Universität Berlin
FG Management of Data Science Processes
Sekr. TEL 9-2
Ernst-Reuter-Platz 7
10587 Berlin
Germany

Responsibility under the German Press Law §55 Sect. 2 RStV:
Prof. Dr.-Ing. Sebastian Schelter