The DEEM Lab is a cross-organisational research group uniting the chair for the management of data science processes at Technische Universität Berlin with external members from multiple universities and industry. The lab is led by Prof. Dr.-Ing. Sebastian Schelter and is part of the Berlin Institute for the Foundations of Learning and Data (BIFOLD).
Our lab conducts fundamental research at the intersection of data management and machine learning, which addresses data-related problems in ML applications that cause negative economic, societal or scientific impact. Our goal is to foster the responsible management of data and to lower the technical bar for working with data science technologies.
Our research is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms. The focus areas of the lab are:
Our research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS. We have ongoing collaborations with the University of Amsterdam and CWI in the Netherlands, as well as with the Center for Responsible AI at New York University.
Faculty, PhD Students & Staff

External PhD Students & Guests (affiliations: University of Amsterdam, University of Amsterdam, Motherduck, University of Amsterdam, Aalborg University)

(External) Master Students (affiliations: University of Amsterdam, University of Amsterdam)
Sebastian Schelter is a Full Professor at the Berlin Institute for the Foundations of Learning and Data (BIFOLD) and Technische Universität Berlin. His research focuses on the intersection of data management and machine learning, with the goal of fostering the responsible management of data and democratising data science technologies.
The research of his group is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms.
In the past, he has been an assistant professor at the University of Amsterdam, a faculty fellow at New York University, a senior applied scientist at Amazon Research and a research intern at Twitter and IBM Almaden in California. His research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award, and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS.
Details on registration and seat assignment will be discussed in the first in-person class of each course.
This highly technical course focuses on the engineering and life-cycle management of data for production machine learning deployments. The course starts by recapping the fundamentals of relational data processing and dataflow systems. Subsequently, students learn about encoding, storing and managing vectorised feature representations of heterogeneous input data sources for machine learning applications, and about the architecture of current state-of-the-art systems for this task, such as Google’s TensorFlow Extended platform. Concurrently, students will be exposed to foundational theory for this problem space, such as incremental view maintenance for relational data, fine-grained data provenance tracking via provenance semirings, and differential computation.
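As a rough illustration of the idea behind incremental view maintenance named above, the following minimal sketch keeps a grouped-sum view up to date by applying insert and delete deltas instead of recomputing it from the base data. It is not taken from the course material; the class `GroupedSumView`, the helper `recompute` and the toy rows are hypothetical names chosen for this example.

```python
from collections import defaultdict


class GroupedSumView:
    """Hypothetical materialised view: SELECT key, SUM(value) FROM base GROUP BY key."""

    def __init__(self):
        self.sums = defaultdict(int)
        self.counts = defaultdict(int)

    def apply_delta(self, key, value, multiplicity):
        """Apply one change; multiplicity is +1 for an insert and -1 for a delete."""
        self.counts[key] += multiplicity
        self.sums[key] += multiplicity * value
        # Drop groups that no longer have any contributing rows.
        if self.counts[key] == 0:
            del self.counts[key]
            del self.sums[key]

    def result(self):
        return dict(self.sums)


def recompute(rows):
    """Full recomputation from the base data, used here only as a reference."""
    out = defaultdict(int)
    for key, value in rows:
        out[key] += value
    return dict(out)


# Toy base relation of (key, value) rows.
base = [("a", 1), ("b", 2), ("a", 3)]
view = GroupedSumView()
for k, v in base:
    view.apply_delta(k, v, +1)

# Delete ("a", 1) and insert ("b", 5), touching only the affected groups.
base.remove(("a", 1))
view.apply_delta("a", 1, -1)
base.append(("b", 5))
view.apply_delta("b", 5, +1)

incremental_result = view.result()
reference_result = recompute(base)
```

The incrementally maintained result matches the full recomputation, while each update only does work proportional to the size of the change rather than the size of the base data.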
In addition, students will learn to identify, quantify and address common quality issues with respect to the completeness and consistency of the data. Furthermore, students will be exposed to ongoing research efforts in this space, such as ML pipeline debugging and error detection techniques from data-centric AI. They will also have the opportunity to discuss the practical implications of the covered technologies with invited practitioners.
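To make "quantifying completeness and consistency" more concrete, here is a minimal sketch under simple assumptions: completeness is measured as the fraction of non-missing values in a column, and consistency is checked against a single functional dependency (zip code determines city). The function names, the choice of metrics and the toy records are illustrative assumptions, not the course's definitions.

```python
def completeness(rows, column):
    """Fraction of rows with a non-missing value in `column` (None or "" count as missing)."""
    if not rows:
        return 1.0
    present = sum(1 for r in rows if r.get(column) not in (None, ""))
    return present / len(rows)


def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value,
    i.e. the violations of the functional dependency lhs -> rhs."""
    seen = {}
    for r in rows:
        seen.setdefault(r[lhs], set()).add(r[rhs])
    return {k: v for k, v in seen.items() if len(v) > 1}


# Hypothetical toy records with one missing age and one misspelled city.
rows = [
    {"zip": "10587", "city": "Berlin", "age": 31},
    {"zip": "10587", "city": "Berlin", "age": None},
    {"zip": "1012", "city": "Amsterdam", "age": 45},
    {"zip": "1012", "city": "Amsterdm", "age": 27},
]

age_completeness = completeness(rows, "age")
zip_city_violations = fd_violations(rows, "zip", "city")
```

On this toy data, `age_completeness` is 0.75 and the only violating key is the zip code "1012", which maps to two different city spellings.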
Course page available at https://isis.tu-berlin.de/course/view.php?id=40407
In this seminar, students will learn how to: (a) critically read and interpret scientific papers drawn from the literature on responsible data management and responsible data science; (b) give a good scientific presentation that is technically precise, concentrated on the relevant topics, and also enjoyable; and (c) write a scientific survey based on papers drawn from varying sources, such as contemporary computer science journals and conference proceedings. In addition, students will learn about state-of-the-art and current research topics in responsible data science and data management.
Course page available at https://isis.tu-berlin.de/course/view.php?id=40406
At the beginning of every semester, we will define one or more new projects related to the implementation of machine learning pipelines (which prepare data from heterogeneous sources, encode it as features and then train one or more ML models). Students receive specifications of these components as well as selected implementation goals. The task is then to create, in self-organised teams of four, correct implementations of these components in Python, Scala or Rust. Apart from developing the source code of the prototypes, other important aspects include the use of version control tools, test-driven development, design documentation, as well as runtime experiments and improvements. At the same time, this project allows for a deeper understanding of methods related to machine learning pipelines and data analysis, as well as algorithms and data structures. The focus is, however, on the problem-oriented use of programming skills to solve a practical problem, not on holistic coverage of the functionality of machine learning applications.
Course page available at https://isis.tu-berlin.de/enrol/index.php?id=39392
We are looking for a PhD student to work on responsible data engineering. The research will focus on data preparation and data pipelines for complex machine learning (ML) systems. Such ML systems are increasingly used to automate impactful decisions but suffer from many unsolved data management challenges with respect to their correctness, reliability, and compliance with legal regulations.
The goal of the research will be to design and efficiently implement data-centric methods to make ML systems guarantee their users control over their personal data (e.g., with respect to the "right-to-be-forgotten" from GDPR) and adhere to legal regulations such as the upcoming European AI Act.
This will be achieved via novel declarative methods to create, maintain and assess datasets for ML use cases. These methods will assist non-expert users with data-centric tasks, such as evaluating the robustness of their ML pipelines to data errors, and will potentially leverage the code generation capabilities of large language models. The resulting methods will be accompanied by efficient and scalable implementations and made publicly available as open source libraries.
Requirements
Desirable
How to apply
Please send your application with the usual documents by e-mail to Prof. Dr. Sebastian Schelter at schelter [at] tu-berlin [dot] de, quoting the reference number IV-22/25, by the 14th of February 2025.
We are looking for a postdoc to conduct independent research in responsible data engineering. The research direction should be compatible with the themes of our lab, such as data-centric debugging and testing of machine learning applications, data processing in compliance with legal regulations, or the automation of data validation and preparation for ML. Software and data artifacts resulting from the research should be made available under open source licenses or contributed to existing open source projects.
Further tasks of the position include collaboration with PhD students, coordination with other research groups in BIFOLD and with external partners, the supervision of master's and bachelor's theses, and teaching activities.
Requirements
Desirable
How to apply
Please send your application with the usual documents by e-mail to Prof. Dr. Sebastian Schelter at schelter [at] tu-berlin [dot] de, quoting the reference number IV-576/24, by the 25th of February 2025.
Email: schelter [at] tu-berlin [dot] de
Technische Universität Berlin
FG Management of Data Science Processes
Sekr. TEL 9-2
Ernst-Reuter Platz 7
10587 Berlin
Germany
Responsibility under the German Press Law §55 Sect. 2 RStV:
Prof. Dr.-Ing. Sebastian Schelter