Publications

. Towards Query Optimizer as a Service (QOaaS) in a Unified LakeHouse Platform: Can One QO Rule Them All? Conference on Innovative Data Systems Research (CIDR), 2024.

. Messy Code Makes Managing ML Pipelines Difficult? Just Let LLMs Rewrite the Code. arXiv preprint, 2024.

. AnyMatch - Efficient Zero-Shot Entity Matching with a Small Language Model. arXiv preprint, 2024.

. A Flexible Forecasting Stack. International Conference on Very Large Databases (VLDB), 2024.

. Snapcase - Regain Control over Your Predictions with Low-Latency Machine Unlearning. International Conference on Very Large Databases (VLDB, demo), 2024.

. Towards Interactively Improving ML Data Preparation Code via 'Shadow Pipelines'. Data Management for End-to-End Machine Learning workshop at ACM SIGMOD, 2024.

. Directions Towards Efficient and Automated Data Wrangling with Large Language Models. Databases and Machine Learning workshop at ICDE, 2024.

. SchemaPile: A Large Collection of Relational Database Schemas. ACM SIGMOD, 2024.

. Red Onions, Soft Cheese and Data: From Food Safety to Data Traceability for Responsible AI. IEEE Data Engineering Bulletin (Special Issue on Data-Centric Responsible AI), 2024.

. Canonpipe: Data Debugging with Shapley Importance over Machine Learning Pipelines. International Conference on Learning Representations (ICLR), 2024.

. Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making. IEEE Transactions on Knowledge and Data Engineering (TKDE), Special Issue for Best and Innovation Papers from ICDE’23, 2024.

. Assisted Design of Data Science Pipelines. The VLDB Journal — The International Journal on Very Large Data Bases, 2024.

. Etude - Evaluating the Inference Latency of Session-Based Recommendation Models at Scale. International Conference on Data Engineering (ICDE), 2024.

. Domain Generalization in Time Series Forecasting. ACM Transactions on Knowledge Discovery from Data (TKDD), 2024.

. Hierarchical Forecasting at Scale. International Journal of Forecasting, 2024.

. Improving Retrieval-Augmented Large Language Models via Data Importance Learning. arXiv preprint, 2023.

. Towards Declarative Systems for Data-Centric Machine Learning. Data-Centric Machine Learning Research (DMLR) Workshop at ICML, 2023.

. mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses Over and Over? VLDB (demo), 2023.

. Forget Me Now - Fast and Exact Unlearning in Neighborhood-Based Recommendation. ACM SIGIR, 2023.

. On the Impact of Outlier Bias on User Clicks. ACM SIGIR, 2023.

. Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines. ACM SIGMOD, 2023.

. Proactively Screening Machine Learning Pipelines with ArgusEyes. ACM SIGMOD (demo), 2023.

. Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making. International Conference on Data Engineering (ICDE), 2023.

. How to Make an Outlier? Studying the Effect of Presentational Features on the Outlierness of Items in Product Search Results. ACM Conference on Human Information Interaction and Retrieval (CHIIR), 2022.

. Reconstructing and Querying ML Pipeline Intermediates. Conference on Innovative Data Systems Research (CIDR, abstract), 2022.

. Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning. Table Representation Learning workshop at NeurIPS, 2022.

. A Personalized Neighborhood-based Model for Within-basket Recommendation in Grocery Shopping. ACM International Conference on Web Search and Data Mining (WSDM), 2022.

. DORIAN in action: Assisted Design of Data Science Pipelines. VLDB (demo), 2022.

. Responsible Data Management. Communications of the ACM, 2022.

. Letter from the Special Issue Editor. Special issue on “Directions Towards GDPR-Compliant Data Systems and Applications” of the IEEE Data Engineering Bulletin (Vol 45, Issue 1), 2022.

. Towards Data-Centric What-If Analysis for Native Machine Learning Pipelines. Data Management for End-to-End Machine Learning workshop at ACM SIGMOD, 2022.

. ReCANet: A Repeat Consumption-Aware Neural Network for Next Basket Recommendation in Grocery Shopping. ACM SIGIR, 2022.

. GitSchemas: A Schema Dataset for Automating Relational Data Preparation Tasks. Databases for Machine Learning workshop at ICDE, 2022.

. Serving Low-Latency Session-Based Recommendations at bol.com. ECIR (industry talk), 2022.

. Serenade - Low-Latency Session-Based Recommendation in e-Commerce at Scale. ACM SIGMOD, 2021.

. Data Distribution Debugging in Machine Learning Pipelines. The VLDB Journal — The International Journal on Very Large Data Bases (Special Issue on Data Science for Responsible Data Management), 2021.

. Parameter Efficient Deep Probabilistic Forecasting. International Journal of Forecasting, 2021.

. Screening Native Machine Learning Pipelines with ArgusEyes. Conference on Innovative Data Systems Research (CIDR, abstract), 2021.

. Understanding and Mitigating the Effect of Outliers in Fair Ranking. ACM International Conference on Web Search and Data Mining (WSDM), 2021.

. Efficiently Maintaining Next Basket Recommendations under Additions and Deletions of Baskets and Items. Workshop on Online Recommender Systems and User Modeling at ACM RecSys, 2021.

. Understanding Multi-channel Customer Behavior in Retail. ACM Conference on Information and Knowledge Management (CIKM), 2021.

. Towards Efficient Machine Unlearning via Incremental View Maintenance. Workshop on Challenges in Deploying and Monitoring ML Systems at the International Conference on Machine Learning (ICML), 2021.

. DuckDQ: Data Quality Assertions for Machine Learning Pipelines. Workshop on Challenges in Deploying and Monitoring ML Systems at the International Conference on Machine Learning (ICML), 2021.

. Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression. ACM SIGKDD, 2021.

. Letter from the Special Issue Editor. Special issue on “Data validation for machine learning models and applications” of the IEEE Data Engineering Bulletin (Vol 44, Issue 1), 2021.

. HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning. ACM SIGMOD, 2021.

. Learnings from a Retail Recommendation System on Billions of Interactions at bol.com. International Conference on Data Engineering (ICDE), 2021.

. mlinspect: a Data Distribution Debugger for Machine Learning Pipelines. ACM SIGMOD (demo), 2021.

. Automating Data Quality Validation for Dynamic Data Ingestion. International Conference on Extending Database Technology (EDBT), 2021.

. Jenga - A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models. International Conference on Extending Database Technology (EDBT), 2020.

. Taming Technical Bias in Machine Learning Pipelines. IEEE Data Engineering Bulletin (Special Issue on Interdisciplinary Perspectives on Fairness and Artificial Intelligence Systems), 2020.

. Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines. Conference on Innovative Data Systems Research (CIDR), 2020.

. RetaiL: Open your own grocery store to reduce waste. NeurIPS (demonstration track), 2020.

. Technical Perspective: Query Optimization for Faster Deep CNN Explanations. ACM SIGMOD Record (Vol 49, Issue 1), 2020.

. Demand Forecasting in the Presence of Privileged Information. Workshop on Advanced Analytics and Learning on Temporal Data at ECML/PKDD, 2020.

. A Comparison of Supervised Learning to Match Methods for Product Search. eCommerce workshop at SIGIR, 2020.

. Analyzing and Predicting Purchase Intent in E-commerce: Anonymous vs. Identified Customers. eCommerce workshop at SIGIR, 2020.

. Apache Mahout: Machine Learning on Distributed Dataflow Systems. Journal of Machine Learning Research (JMLR), open source software track, 2020.

. AlphaJoin: Join Order Selection à la AlphaGo. PhD workshop at VLDB, 2020.

. Fairness-Aware Instrumentation of Preprocessing Pipelines for Machine Learning. Human-In-the-Loop Data Analytics workshop at ACM SIGMOD, 2020.

. HDDse: Enabling High-Dimensional Disk State Embedding for Generic Failure Detection of Heterogeneous Disks in Large Data Centers. USENIX Annual Technical Conference (ATC), 2020.

. Elastic Machine Learning Algorithms in Amazon SageMaker. ACM SIGMOD, 2020.

. Tier-Scrubbing: An Adaptive and Tiered Disk Scrubbing Scheme with Improved MTTD and Reduced Cost. Design Automation Conference (DAC), 2020.

. Towards Unsupervised Data Quality Validation on Dynamic Data. Workshop on Explainability for Trustworthy ML Pipelines at EDBT, 2020.

. Towards Automated ML Model Monitoring: Measure, Improve and Quantify Data Quality. ML Ops workshop at the Conference on Machine Learning and Systems (MLSys), 2020.

. Tier-Scrubbing: An Adaptive and Tiered Disk Scrubbing Scheme. USENIX Conference on File and Storage Technologies (FAST), work-in-progress track, 2020.

. Exploring Monte Carlo Tree Search for Join Order Selection. North East Database Day, 2020.

. Learning to Validate the Predictions of Black Box Classifiers on Unseen Data. ACM SIGMOD, 2020.

. FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions. International Conference on Extending Database Technology (EDBT), 2020.

. Zooming Out on an Evolving Graph. International Conference on Extending Database Technology (EDBT), 2020.

. 'Amnesia' - A Selection of Machine Learning Models That Can Forget User Data Very Fast. Conference on Innovative Data Systems Research (CIDR), 2020.

. DataWig - Missing Value Imputation for Tables. Journal of Machine Learning Research (JMLR), open source software track, 2019.

. An Intermediate Representation for Optimizing Machine Learning Pipelines. International Conference on Very Large Databases (VLDB), 2019.

. AdaBench - Towards an Industry Standard Benchmark for Advanced Analytics. TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC), 2019.

. 'Amnesia' - Towards Machine Learning Models That Can Forget User Data Very Fast. Workshop on Applied AI for Database Systems and Applications (AIDB) at VLDB, 2019.

. Efficient Incremental Cooccurrence Analysis for Item-Based Collaborative Filtering. International Conference on Scientific and Statistical Database Management (SSDBM), 2019.

. Learning to Validate the Predictions of Black Box Machine Learning Models on Unseen Data. Human-In-the-Loop Data Analytics workshop at ACM SIGMOD, 2019.

. DEEM 2019: Workshop on Data Management for End-to-End Machine Learning. ACM SIGMOD (workshop summary), 2019.

. Differential Data Quality Verification on Partitioned Data. International Conference on Data Engineering (ICDE), 2019.

. Unit Testing Data with Deequ. ACM SIGMOD (demo), 2019.

. Data-Related Challenges in End-to-End Machine Learning. North East Database Day, 2019.

. On Challenges in Machine Learning Model Management. IEEE Data Engineering Bulletin, 2018.

. Deequ - Data Quality Validation for Machine Learning Pipelines. Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NeurIPS), 2018.

. Deep Learning for Missing Value Imputation in Tables with Non-Numerical Data. ACM Conference on Information and Knowledge Management (CIKM), 2018.

. Benchmarking Distributed Data Processing Systems for Machine Learning Workloads. TPC Technology Conference on Performance Evaluation & Benchmarking (TPCTC), 2018.

. BlockJoin: Efficient Matrix Partitioning Through Joins. International Conference on Very Large Databases (VLDB), 2018.

. Automating Large-Scale Data Quality Verification. International Conference on Very Large Databases (VLDB), 2018.

. On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl. Journal of Web Science (JWS), 2018.

. Automatically Tracking Metadata and Provenance of Machine Learning Experiments. Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NIPS), 2017.

. Dark Germany: Hidden Patterns of Participation in Online Far-Right Protests Against Refugee Housing. International Conference on Social Informatics (SocInfo), 2017.

. Probabilistic Demand Forecasting at Scale. International Conference on Very Large Databases (VLDB), 2017.

. Dark Germany: Hidden Patterns of Participation in Online Far-Right Protests Against Refugee Housing. ACM Web Science Conference (WebSci), 2017.

. Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Systems. Fachtagung für Business, Technologie und Web (BTW), 2017.

. Structural Patterns in the Rise of Germany’s New Right on Facebook. Data Mining in Politics workshop at the International Conference on Data Mining (ICDM), 2016.

. Samsara: Declarative Machine Learning on Distributed Dataflow Systems. Machine Learning Systems workshop at the conference on Neural Information Processing Systems (NIPS), 2016.

. Doubly stochastic large scale kernel learning with the empirical kernel map. arXiv preprint, 2016.

. Predicting Political Party Affiliation from Text. International Conference on the Advances in Computational Analysis of Political Text (PolText), 2016.

. Tracking The Trackers: A Large-Scale Analysis of Embedded Web Trackers. AAAI International Conference on Web and Social Media (ICWSM), 2016.

. Scaling Data Mining in Massively Parallel Dataflow Systems. Technische Universität Berlin, 2015.

. Optimistic Recovery for Iterative Dataflows in Action. ACM SIGMOD (demo), 2015.

. Efficient Sample Generation for Scalable Meta Learning. IEEE International Conference on Data Engineering (ICDE), 2015.

. Factorbird - a Parameter Server Approach to Distributed Matrix Factorization. Distributed Machine Learning and Matrix Computations workshop at the conference on Neural Information Processing Systems (NIPS), 2014.

. The Stratosphere platform for big data analytics. The VLDB Journal — The International Journal on Very Large Data Bases, 2014.

. Scaling Data Mining in Massively Parallel Dataflow Systems. PhD Symposium at ACM SIGMOD, 2014.

. 'All Roads Lead to Rome:' Optimistic Recovery for Distributed Iterative Data Processing. ACM Conference on Information and Knowledge Management (CIKM), 2013.

. Distributed Matrix Factorization with MapReduce using a series of Broadcast-Joins. ACM Conference on Recommender Systems (RecSys), 2013.

. Iterative Parallel Data Processing with Stratosphere: An Inside Look. ACM SIGMOD (demo), 2013.

. Collaborative Filtering with Apache Mahout. Recommender Systems Challenge Workshop in conjunction with ACM RecSys, 2012.

. Scalable Similarity-Based Neighborhood Methods with MapReduce. ACM Conference on Recommender Systems (RecSys), 2012.
