DEEM Lab

Data Engineering for ML

BIFOLD & TU Berlin

The DEEM Lab is a cross-organisational research group uniting the chair for the management of data science processes at Technische Universität Berlin with external members from multiple universities and industry. The lab is led by Prof. Dr.-Ing. Sebastian Schelter and is part of the Berlin Institute for the Foundations of Learning and Data (BIFOLD).

Our lab conducts fundamental research at the intersection of data engineering and machine learning, which addresses data-related problems in AI-driven applications that cause negative economic, societal or scientific impact. Our goal is to foster the responsible and resilient management of data and to lower the technical bar for working with data and AI.

Our research is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms. The focus areas of the lab are:

Data-centric debugging of machine learning pipelines and systems
Data processing in compliance with legal regulations, such as the “right-to-be-forgotten”
Automated validation of data at scale

Our research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS. Furthermore, we received awards from industry such as an Amazon Education Research Grant Award and the Olympus Mons Pioneer Award. We have ongoing collaborations with researchers from the University of Amsterdam and CWI in the Netherlands, and New York University and Harvard in the US.

News

Sep 26, 2025 ‐ Barrie and Pierre are at RecSys in Prague this week! Barrie presents our work on Scalable Data Debugging for Neighborhood-based Recommendation with Data Shapley Values, which was selected as a spotlight oral. Pierre gives a talk on his initial ideas Towards a Real-World Aligned Benchmark for Unlearning in Recommender Systems at the Responsible Recommendation workshop.
Jul 15, 2025 ‐ Meet Olga from our lab at ICML in Vancouver this week! She will present a research paper on scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data, which was selected as a spotlight poster, as well as a second paper on Towards Cross-Modal Error Detection with Tables and Images at the DataWorld workshop.
Jul 14, 2025 ‐ Our external PhD student Zeyu from the University of Amsterdam is starting his internship with the Amazon Q team of AWS Berlin this week!
Jul 7, 2025 ‐ Our external PhD student Barrie Kersbergen (co-supervised with Maarten de Rijke) has successfully defended his PhD at the University of Amsterdam! Barrie’s research on recommender systems has been deployed to millions of users at the European e-commerce platform bol.com.
Jun 10, 2025 ‐ Meet our lab at the SIGMOD conference in Berlin next week! We are part of the organizing committee of the conference and co-organise the DEEM workshop as well. Furthermore, we will present a workshop paper on Towards Automated Task-Aware Data Validation and run a tutorial on Navigating Data Errors in Machine Learning Pipelines on Friday.

Recent Publications

Fatemeh Sarvi (2025). Learning to rank for e-commerce search. PhD Thesis, University of Amsterdam.

Shafaq Siddiqi, Arnab Phani, Roman Kern, Matthias Boehm (2025). Saga++: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications. ACM Transactions on Database Systems (TODS).

Stefan Grafberger (2025). Declarative machine learning pipeline management via logical query plans. PhD Thesis, University of Amsterdam.

Pierre Lubitzsch, Olga Ovcharenko, Hao Chen, Maarten de Rijke, Sebastian Schelter (2025). Towards a Real-World Aligned Benchmark for Unlearning in Recommender Systems. FAccTRec Workshop: Responsible Recommendation at the ACM Conference on Recommender Systems (RecSys).

Barrie Kersbergen, Olivier Sprangers, Bojan Karlaš, Maarten de Rijke, Sebastian Schelter (2025). Scalable Data Debugging for Neighborhood-based Recommendation with Data Shapley Values. ACM Conference on Recommender Systems (RecSys), spotlight oral.

Team

Faculty, Postdocs & Staff: Prof. Dr. Sebastian Schelter, Dr. Arnab Phani, Celia Bohnhardt-Schneider, Wilbert Wiryadi
PhD Students: Hao Chen, Olga Ovcharenko, Pierre Lubitzsch, Anna Richter, Leonardo Dominici, Zeyu Zhang (University of Amsterdam)
Master Students: Elias Strauss, Luciano Duarte
Guest Researchers: Dr. Stefan Grafberger (Snowflake), Shubha Guha (University of Amsterdam), Till Doehmen (Motherduck), Yichun Wang (University of Amsterdam)

Alumni (name, role and first employment)

Dr. Stefan Grafberger, PhD student, Software engineer at Snowflake
Dr. Barrie Kersbergen, PhD student, Staff scientist at bol.com
Dr. Olivier Sprangers, PhD student, Applied scientist at Nixtla
Dr. Mozhdeh Ariannezhad, PhD student, ML scientist at booking.com
Dr. Arezoo Sarvi, PhD student, Data scientist search at Albert Heijn
Aynaz Abdollahzadeh, Master student, Data engineer at Churned
Radu Geacu, Master student, Data engineer at the Royal Schiphol Group
David Vos, Master student, PhD student at the University of Amsterdam
Benjamin Wang, Master student, ML scientist at booking.com
Dr. Ji Zhang, Guest researcher, Research scientist at Huawei

Prof. Dr.-Ing. Sebastian Schelter

Sebastian Schelter is a Full Professor at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and Technische Universität Berlin. His research focuses on the intersection of data management and machine learning, with the goal of fostering the responsible management of data and democratising data science technologies.

The research of his group is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms.

In the past, he has been an assistant professor at the University of Amsterdam, a faculty fellow at New York University, a senior applied scientist at Amazon Research and a research intern at Twitter and IBM Almaden in California. His research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award, and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS.

Email: schelter[at]tu-berlin[dot]de
Profiles: Google Scholar, Bluesky, Twitter, LinkedIn, DBLP, ORCID

Scientific Service

Editorial duties: Associate Editor for SIGMOD 2027, Associate Editor for PVLDB Volume 15, Action Editor for the Journal of Data-Centric Machine Learning Research (DMLR), Action Editor for the open source track of the Journal of Machine Learning Research (JMLR) 2022-2025, Guest editor for the IEEE Data Engineering Bulletin
Organisation: Founder and co-organiser (until 2020) of the workshop series on “Data Management for End-to-End Machine Learning (DEEM)” at SIGMOD, co-chair of the industry track of EDBT 2022, web chair of SIGMOD 2025, workshop co-chair of EDBT 2026, co-chair of the BOSS workshop at VLDB 2016, co-organiser of the “Dutch Data Systems Design Seminar” series with CWI Amsterdam
Program Committee: SIGMOD 2017 & 2019-2026, VLDB 2021, ICDE 2018-2021 & 2023-2024, NeurIPS'25, EDBT 2017 & 2021, CIKM 2020, PhD Symposium at VLDB 2021, DEEM workshop at SIGMOD 2021-2024, aiDM workshop at SIGMOD 2019, LSRS workshop at RecSys 2013-2015, AIDB workshop at VLDB 2020, DBML workshop at ICDE 2021, 2024 and 2025, TRL workshop at NeurIPS 2022-2025, Provenance Week 2020
Awards: ACM SIGMOD Systems Award 2023, ACM SIGMOD Best Demo Runner Up Award 2023, Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS 2022; Amazon Education Research Grant Award, Olympus Mons Pioneer Award
Keynotes: Workshop on Online Recommender Systems and User Modeling at RecSys'20, Workshop on Data Management for End-to-End Machine Learning at SIGMOD'21, Data Centric AI Workshop from ETH Zürich/Stanford 2021, Workshop on Quality in Databases at VLDB'24
Panelist: Systems for ML at VLDB 2021, PhD symposium at ICDE 2021, Data management challenges for LLM-powered solutions at DEEM@SIGMOD'23, Panel on Open Science and AI at the Weizenbaum Institute 2025
Reviewer for Grant Proposals: Open Competition ENW (Dutch Research Council NWO), Binational Science Foundation (United States - Israel)

Completed PhD dissertations as advisor

Stefan Grafberger, Declarative machine learning pipeline management via logical query plans, University of Amsterdam, 2025
Barrie Kersbergen, Expanding boundaries in scalable session-based recommendations, University of Amsterdam, 2025
Olivier Sprangers, Efficient and accurate forecasting in large-scale settings, University of Amsterdam, 2024 (with Maarten de Rijke)
Mozhdeh Ariannezhad, User-oriented recommender systems in retail, University of Amsterdam, 2023 (with Maarten de Rijke)

Completed PhD dissertations as a committee member

Andra Ionescu, Feature discovery for data-centric AI, TU Delft, 2025
Gerardo Vitagliano, Modeling the structure of tabular files for data preparation, HPI Potsdam, 2024
Madelon Hulsebos, Table representation learning, University of Amsterdam, 2024
Bojan Karlaš, Data systems for managing and debugging machine learning workflows, ETH Zürich, 2023
Cedric Renggli, Building data-centric systems for machine learning development and operations, ETH Zürich, 2023
Amir Pouya Aghasadeghi, Generating and querying temporal property graphs, New York University, 2022
Ke Yang, Fairness, diversity, and interpretability in ranking, New York University, 2021

Past employments

since 2024
- Full Professor at Technische Universität Berlin and BIFOLD
2020-2024
- Assistant Professor of Data Engineering at the University of Amsterdam
- Manager of the AI for Retail Lab as part of the Innovation Center for Artificial Intelligence
- Research Fellow at Ahold Delhaize, a large retailer based in the Netherlands
2018-2020
- Data Science/Faculty Fellow at the Center for Data Science at New York University
- Senior Applied Scientist at Amazon Research in New York
2015-2018
- Applied Scientist at the CoreML team of Amazon Research in Berlin
- Senior researcher / guest lecturer at the database systems group of TU Berlin
2011-2015
- Ph.D. student at the database systems group of TU Berlin
- Research intern at Twitter, California (2014)
- Research intern at IBM Research, California (2013)
2009-2011
- Software engineer at Rocket Internet and Zalando in Berlin

Professional Memberships

Apache Software Foundation (emeritus)
Association for Computing Machinery
Electronic Frontier Foundation
Deutscher Hochschulverband

Teaching

Winter semester 2025-2026

We offer the following courses during the winter semester 2025-2026:

RDPRO - Responsible Data Engineering Project
RDSEM - Seminar on Responsible Data Engineering
PPDS - Programmierpraktikum Datensysteme: Machine Learning Pipelines
Research Seminar on Data Engineering for ML (available to students writing a thesis with us)

To take one of our courses, please sign up on the corresponding course page on ISIS and attend the first lecture, where we will discuss the details of the formal registration.

Theses

If you are interested in writing a bachelor's or master's thesis with us, please check out our list of available topics and consult our information on the logistics of writing a thesis with us. Below is a selection of recently completed theses in our lab:

End-to-end machine unlearning through data preparation pipelines and feature stores (Leonardo Dominici)
Towards fair database access: detecting and mitigating social bias in NL2SQL systems using reinforcement learning (Aynaz Abdollahzadeh)
Robust and dialect-independent parsing of schema information from SQL scripts (Radu Geacu)
Language control prefixes: conditional prefix-tuning for efficient multilingual data-to-text generation in low-resource languages (David Vos)
Understanding the characteristics and robustness of data preparation pipelines for machine learning (Jialin Dong)
Towards amnesiac recommender systems: efficient incremental and decremental updates for next basket recommendation (Benjamin Wang)
Efficient lineage-based inspection of machine learning pipelines (Stefan Grafberger)

Job Openings

Contact

Email: sekr[at]deem[dot]tu-berlin[dot]de

Technische Universität Berlin
FG Management of Data Science Processes
Sekr. TEL 9-2
Ernst-Reuter-Platz 7
10587 Berlin
Germany

Responsibility under the German Press Law §55 Sect. 2 RStV:
Prof. Dr.-Ing. Sebastian Schelter