News

Recent Publications

(2024). Instrumentation and Analysis of Native ML Pipelines via Logical Query Plans. PhD Workshop at VLDB.

(2024). Snapcase - Regain Control over Your Predictions with Low-Latency Machine Unlearning. VLDB (demo).

(2024). Towards Interactively Improving ML Data Preparation Code via 'Shadow Pipelines'. Data Management for End-to-End Machine Learning workshop at ACM SIGMOD.

(2024). Directions Towards Efficient and Automated Data Wrangling with Large Language Models. Databases and Machine Learning workshop at ICDE.

(2024). SchemaPile: A Large Collection of Relational Database Schemas. ACM SIGMOD.

Team

Faculty, PhD Students & Staff
Prof. Dr. Sebastian Schelter
Stefan Grafberger
Hao Chen
External PhD Students & Guests
Barrie Kersbergen
(bol.com)
Zeyu Zhang
(University of Amsterdam)
Shubha Guha
(University of Amsterdam)
Till Doehmen
(MotherDuck)
Yichun Wang
(University of Amsterdam)
(External) Master Students
Aynaz Abdollahzadeh
(University of Amsterdam)
Leonardo Dominci
(University of Amsterdam)

Alumni (name, role and first employment)

Prof. Dr.-Ing. Sebastian Schelter

Sebastian Schelter is a Full Professor at the Berlin Institute on the Foundations of Learning and Data (BIFOLD) and Technische Universität Berlin. His research focuses on the intersection of data management and machine learning, with the goal of fostering the responsible management of data and democratising data science technologies.

The research of his group is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon Web Services cloud and in large European e-commerce platforms.

In the past, he has been an assistant professor at the University of Amsterdam, a faculty fellow at New York University, a senior applied scientist at Amazon Research and a research intern at Twitter and IBM Almaden in California. His research contributions have been recognized with an ACM SIGMOD Systems Award, an ACM SIGMOD Best Demo Runner Up Award, and a Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS.

Scientific Service
  • Editorial duties: Associate Editor for PVLDB Volume 15, Action Editor for the Journal of Data-Centric Machine Learning Research (DMLR), Action Editor for the open source track of the Journal of Machine Learning Research (JMLR), Guest editor for the IEEE Data Engineering Bulletin
  • Organisation: Founder and co-organiser (until 2020) of the workshop series on “Data Management for End-to-End Machine Learning (DEEM)” at SIGMOD, workshop chair of EDBT 2026, co-chair of the industry track of EDBT 2022, web chair of SIGMOD 2025, co-chair of the BOSS workshop at VLDB in 2016, co-organiser of the “Dutch Data Systems Design Seminar” series with CWI Amsterdam
  • Program Committee: SIGMOD 2017 & 2019-2025, VLDB 2021, ICDE 2018-2021 & 2023-2024, EDBT 2017 & 2021, CIKM 2020, PhD Symposium at VLDB 2021, DEEM workshop at SIGMOD 2021-2024, aiDM workshop at SIGMOD 2019, LSRS workshop at RecSys 2013-2015, AIDB workshop at VLDB 2020, DBML workshop at ICDE 2021, TRL workshop at NeurIPS 2022-2023, Provenance Week 2020
  • Awards: ACM SIGMOD Systems Award 2023, ACM SIGMOD Best Demo Runner Up Award 2023, Best Paper Runner Up Award from the Table Representation Learning workshop at NeurIPS
  • Keynotes: Workshop on Online Recommender Systems and User Modeling at RecSys'20, Workshop on Data Management for End-to-End Machine Learning at SIGMOD'21, Data Centric AI Workshop from ETH Zuerich/Stanford 2021, Workshop on Quality in Databases at VLDB'24
  • Panelist: Systems for ML at VLDB 2021, PhD symposium at ICDE 2021, Data management challenges for LLM-powered solutions at DEEM@SIGMOD'23
Past employments
Professional Memberships
  • Apache Software Foundation (emeritus)
  • Association for Computing Machinery
  • Electronic Frontier Foundation
  • Deutscher Hochschulverband

Teaching

Summer 2024:

IMSEM – Seminar on Hot Topics in Information Management

This year, we are running IMSEM as a block seminar together with our colleagues from the database group. Please use the corresponding ISIS page for registration and course logistics.

The paper list for the seminar is available at https://github.com/deem-teaching/2024-imsem/.

Upcoming Courses:

We will soon post an overview of the courses we offer in the winter semester here.

Job Openings

PhD Position - Responsible Data Engineering for ML

The research will focus on complex ML applications, which include data integration and data pre-processing pipelines. Such applications are difficult to build as they utilise data from heterogeneous sources that needs to be integrated and transformed into features before the data can be consumed by ML models. This requires combining different relational and linear algebra operations, which often leads to performance problems and the loss of important information about the origin of the processed data.

The goal of the research is twofold. First, we want to make the creation of complex ML applications easier for non-expert users, for example when they want to integrate domain-specific knowledge into their applications or evaluate the robustness of their ML applications. Second, we aim to develop the foundations for ML applications that guarantee users control over their personal data (e.g. with regard to the "right to be forgotten" from the GDPR) and comply with legal regulations such as the upcoming European AI Act.

This will be achieved through novel declarative methods for the automatic design, testing and debugging of ML applications that potentially utilise the code generation capabilities of Large Language Models. The research will lead to efficient and scalable implementations that will be made publicly available as open source libraries.
Requirements
  • Successfully completed university degree (Master, Diplom or equivalent) in Computer Science or Artificial Intelligence
  • Creativity and independent thinking, self-motivated working style
  • Strong programming skills in Python and at least one other language (Java/Scala/Rust/C++)
  • Knowledge of data processing with dataflow systems, relational databases and/or dataframe libraries (e.g., Apache Spark, DuckDB, Pandas, etc.)
  • Experience with improving the efficiency, scalability and correctness of data-centric programs
  • Basic knowledge of machine learning and knowledge of common libraries (e.g., Pandas, scikit-learn, PyTorch, SparkML, etc.)
Not required, but helpful:
  • Experience with real-world data processing systems and/or ML deployments (e.g. from internships, jobs or entrepreneurial experience)
  • Experience with regulations such as GDPR and the EU AI Act
  • Contributions to open source projects
Applications should contain the usual documents (motivation letter, CV, ...) and can be sent to schelter@tu-berlin.de. Please do not hesitate to reach out if you have any questions.

PhD Position - Enforcing the Right-to-be-Forgotten in Recommender Systems

The research project revolves around enforcing the “right-to-be-forgotten” in recommender systems to empower users to efficiently remove their personal data from such systems. This is especially important in recommender systems that assist people in critical use cases such as finding medical supplies, food or care products for their children. Unfortunately, existing recommender systems are designed to maximise prediction quality and lack “unlearning” and data removal functionality, which can have devastating consequences. Imagine a person struggling with alcohol addiction who decides to stop consuming alcohol. This person will still be exposed to recommendations for alcoholic products online, since the underlying ML models have learned their preference for alcohol.

On a technical level, the research project will focus on the following questions: How can we benchmark unlearning methods in recommender systems with respect to unlearning guarantees and response latency? How can we augment existing state-of-the-art recommendation algorithms with unlearning functionality? How can we efficiently execute the unlearning operations at scale? The goal of the research project is to develop the algorithmic foundations and their corresponding efficient execution strategies for sub-second unlearning in recommender systems. Furthermore, we aim to implement an open source recommendation library with unlearning functionality to contribute to the public infrastructure for responsible data management.

The research will be conducted in close collaboration with Prof. Maarten de Rijke from the Information Retrieval Lab at the University of Amsterdam.
Requirements
  • Successfully completed university degree (Master, Diplom or equivalent) in Computer Science or Artificial Intelligence
  • Creativity and independent thinking, self-motivated working style
  • Strong programming skills in Python and at least one other language (Java/Scala/Rust/C++)
  • Knowledge of machine learning and recommender systems
  • Experience working with relational databases and/or data processing systems such as Apache Spark, DuckDB, etc.
Not required, but helpful:
  • Practical experience with real-world ML applications and MLOps
  • Experience with regulations such as GDPR and the EU AI Act
  • Contributions to open source projects
Applications should contain the usual documents (motivation letter, CV, ...) and can be sent to schelter@tu-berlin.de.

Contact

Phone: +49 30 314 23555
Email: schelter [at] tu-berlin [dot] de

Technische Universität Berlin
Management of Data Science Processes, EN-726
Einsteinufer 17
10587 Berlin
Germany

Responsibility under the German Press Law §55 Sect. 2 RStV:
Prof. Dr.-Ing. Sebastian Schelter