DEEM Lab

Data Engineering for ML

BIFOLD & TU Berlin

The DEEM Lab is a cross-organisational research group uniting the chair for the management of data science processes at Technische Universität Berlin with external members from multiple universities and industry. The lab is led by Prof. Dr.-Ing. Sebastian Schelter and is part of the Berlin Institute for the Foundations of Learning and Data (BIFOLD).

Our lab conducts fundamental research at the intersection of data management and machine learning, which addresses data-related problems in ML applications that cause negative economic, societal or scientific impact. Our goal is to foster the responsible management of data and to lower the technical bar for working with data science technologies.

Our research is accompanied by efficient and scalable open-source implementations, many of which are applied in real-world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms. The focus areas of the lab are:

Data-centric debugging and testing of machine learning pipelines
Data processing in compliance with legal regulations, such as the “right-to-be-forgotten”
Automated validation of data at scale

Our research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS. We have ongoing collaborations with the University of Amsterdam and CWI in the Netherlands, as well as with the Center for Responsible AI at New York University.

News

Previous News

Jun 10, 2025 ‐ Meet our lab at the SIGMOD conference in Berlin next week! We are part of the organizing committee of the conference and also co-organise the DEEM workshop. Furthermore, we will present a workshop paper titled “Towards Automated Task-Aware Data Validation” and run a tutorial on “Navigating Data Errors in Machine Learning Pipelines” on Friday.
Jun 10, 2025 ‐ Sebastian has been invited to give a keynote at the AI for Data Editing workshop at KDD’25 in Toronto, Canada.
May 2, 2025 ‐ Olga and Sebastian took part in the seminar on the Challenges and Opportunities of Table Representation Learning in Dagstuhl, which aims to connect the communities of data management, machine learning, and natural language processing to discuss the future of learning on tabular data.
Mar 25, 2025 ‐ Zeyu gave an invited talk about the efficient utilization of language models for table data preparation at the industry event on Next-Generation Data Management Systems at EDBT 2025 in Barcelona, and subsequently presented our paper “A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models”.
Jan 11, 2025 ‐ Stefan will be co-organising the workshop on Data Management for End-to-End Machine Learning (DEEM) at SIGMOD 2025 in Berlin.

Recent Publications

All Publications

Olga Ovcharenko, Sebastian Schelter (2025). Towards Cross-Modal Error Detection with Tables and Images. Workshop on Unifying Data Curation Frameworks Across Domains (DataWorld) at ICML.

PDF

Stefan Grafberger, Paul Groth, Sebastian Schelter (2025). mlidea: Interactively Improving ML Data Preparation Code via 'Shadow Pipelines'. International Conference on Very Large Databases (VLDB, demo).

Video

Olga Ovcharenko, Philip Toma, Imant Daunhawer, Julia E Vogt, Sebastian Schelter, Florian Barkmann, Valentina Boeva (2025). scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data. International Conference on Machine Learning (ICML), spotlight.

PDF

Hao Chen, Sebastian Schelter (2025). Towards Automated Task-Aware Data Validation. Workshop on Data Management for End-to-End Machine Learning (DEEM) at SIGMOD.

PDF Code

Bojan Karlaš, Babak Salimi, Sebastian Schelter (2025). Navigating Data Errors in Machine Learning Pipelines: Identify, Debug, and Learn. ACM SIGMOD (tutorial).

PDF Code Slides

Team

Faculty, Postdocs & Staff
Prof. Dr. Sebastian Schelter	Dr. Arnab Phani	Celia Bohnhardt-Schneider
PhD Students & Guests
Hao Chen	Olga Ovcharenko	Pierre Lubitzsch
Zeyu Zhang (University of Amsterdam)	Shubha Guha (University of Amsterdam)	Till Doehmen (MotherDuck)
Yichun Wang (University of Amsterdam)	Stefan Grafberger (Snowflake)
Master Students
Aynaz Abdollahzadeh (University of Amsterdam)	Leonardo Dominici (University of Amsterdam)

Alumni (name, role and first employment)

(Dr.) Stefan Grafberger, PhD student, Software engineer at Snowflake
(Dr.) Barrie Kersbergen, PhD student, Staff scientist at bol.com
Dr. Olivier Sprangers, PhD student, Applied scientist at Nixtla
Dr. Mozhdeh Ariannezhad, PhD student, ML scientist at booking.com
(Dr.) Arezoo Sarvi, PhD student, Data scientist search at Albert Heijn
Radu Geacu, Master student, Data engineer at the Royal Schiphol Group
David Vos, Master student, PhD student at the University of Amsterdam
Benjamin Wang, Master student, ML scientist at booking.com
Dr. Ji Zhang, Guest researcher, Research scientist at Huawei

Prof. Dr.-Ing. Sebastian Schelter

Sebastian Schelter is a Full Professor at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and Technische Universität Berlin. His research focuses on the intersection of data management and machine learning, with the goal of fostering the responsible management of data and democratising data science technologies.

The research of his group is accompanied by efficient and scalable open-source implementations, many of which are applied in real-world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms.

In the past, he has been an assistant professor at the University of Amsterdam, a faculty fellow at New York University, a senior applied scientist at Amazon Research and a research intern at Twitter and IBM Almaden in California. His research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award, and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS.

Email: schelter[at]tu-berlin[dot]de
Profiles: Google Scholar, Bluesky, Twitter, LinkedIn, DBLP, ORCID

Scientific Service

Editorial duties: Associate Editor for PVLDB Volume 15, Action Editor for the Journal of Data-Centric Machine Learning Research (DMLR), Action Editor for the open source track of the Journal of Machine Learning Research (JMLR) 2022-2025, Guest editor for the IEEE Data Engineering Bulletin
Organisation: Founder and co-organiser (until 2020) of the workshop series on “Data Management for End-to-End Machine Learning (DEEM)” at SIGMOD, workshop chair of EDBT 2026, co-chair of the industry track of EDBT 2022, web chair of SIGMOD 2025, co-chair of the BOSS workshop at VLDB in 2016, co-organiser of the “Dutch Data Systems Design Seminar” series with CWI Amsterdam
Program Committee: SIGMOD 2017 & 2019-2026, VLDB 2021, ICDE 2018-2021 & 2023-2024, NeurIPS'25, EDBT 2017 & 2021, CIKM 2020, PhD Symposium at VLDB 2021, DEEM workshop at SIGMOD 2021-2024, aiDM workshop at SIGMOD 2019, LSRS workshop at RecSys 2013-2015, AIDB workshop at VLDB 2020, DBML workshop at ICDE 2021, 2024, 2025, TRL workshop at NeurIPS 2022-2025, Provenance Week 2020
Awards: ACM SIGMOD Systems Award 2023, ACM SIGMOD Best Demo Runner Up Award 2023, Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS
Keynotes: Workshop on Online Recommender Systems and User Modeling at RecSys'20, Workshop on Data Management for End-to-End Machine Learning at SIGMOD'21, Data Centric AI Workshop from ETH Zürich/Stanford 2021, Workshop on Quality in Databases at VLDB'24
Panelist: Systems for ML at VLDB 2021, PhD symposium at ICDE 2021, Data management challenges for LLM-powered solutions at DEEM@SIGMOD'23, Panel on Open Science and AI at the Weizenbaum Institute 2025
Reviewer for Grant Proposals: Open Competition ENW (Dutch Research Council NWO), Binational Science Foundation (United States - Israel)

Completed PhD dissertations as advisor

Olivier Sprangers, Efficient and accurate forecasting in large-scale settings, University of Amsterdam, 2024
(with Maarten de Rijke)
Mozhdeh Ariannezhad, User-oriented recommender systems in retail, University of Amsterdam, 2023
(with Maarten de Rijke)

Completed PhD dissertations as a committee member

Andra Ionescu, Feature discovery for data-centric AI, TU Delft, 2025
Gerardo Vitagliano, Modeling the structure of tabular files for data preparation, HPI Potsdam, 2024
Madelon Hulsebos, Table representation learning, University of Amsterdam, 2024
Bojan Karlaš, Data systems for managing and debugging machine learning workflows, ETH Zürich, 2023
Cedric Renggli, Building data-centric systems for machine learning development and operations, ETH Zürich, 2023
Amir Pouya Aghasadeghi, Generating and querying temporal property graphs, New York University, 2022
Ke Yang, Fairness, diversity, and interpretability in ranking, New York University, 2021

Employment history

since 2024
- Full Professor at Technische Universität Berlin and BIFOLD
2020-2024
- Assistant Professor of Data Engineering at the University of Amsterdam
- Manager of the AI for Retail Lab as part of the Innovation Center for Artificial Intelligence
- Research Fellow at Ahold Delhaize, a large retailer based in the Netherlands
2018-2020
- Data Science/Faculty Fellow at the Center for Data Science at New York University
- Senior Applied Scientist at Amazon Research in New York
2015-2018
- Applied Scientist at the CoreML team of Amazon Research in Berlin
- Senior researcher / guest lecturer at the database systems group of TU Berlin
2011-2015
- Ph.D. student at the database systems group of TU Berlin
- Research intern at Twitter, California (2014)
- Research intern at IBM Research, California (2013)
2009-2011
- Software engineer at Rocket Internet and Zalando in Berlin

Professional Memberships

Apache Software Foundation (emeritus)
Association for Computing Machinery
Electronic Frontier Foundation
Deutscher Hochschulverband

Teaching

Summer semester 2025

We offer the following courses during the summer semester 2025:

To take one of our courses, please sign up on the corresponding course page on ISIS and attend the first lecture, where we will discuss the details of the formal registration.

Theses

If you are interested in writing a bachelor or master thesis with us, please check out our list of available topics at theses.tu-berlin.de.

Job Openings

No current openings.

Contact

Email: sekr[at]deem[dot]tu-berlin[dot]de

Technische Universität Berlin
FG Management of Data Science Processes
Sekr. TEL 9-2
Ernst-Reuter-Platz 7
10587 Berlin
Germany

Responsibility under the German Press Law §55 Sect. 2 RStV:
Prof. Dr.-Ing. Sebastian Schelter