Scalable Data Debugging for Neighborhood-based Recommendation with Data Shapley Values

Barrie Kersbergen, Olivier Sprangers, Bojan Karlaš, Maarten de Rijke, Sebastian Schelter

Abstract

Machine learning-powered recommendation systems help users find items they like. Issues in the interaction data processed by these systems frequently lead to problems, e.g., to the accidental recommendation of low-quality products or dangerous items. Such data issues are hard to anticipate upfront, and are typically detected post-deployment after they have already impacted the user experience. We argue that a principled data debugging process is required during which human experts identify potentially hurtful data issues and preemptively mitigate them. Recent notions of ‘data importance’, such as the Data Shapley value (DSV), represent a promising direction to identify training data points likely to cause issues. However, the scale of real-world interaction datasets makes it infeasible to apply existing techniques to compute the DSV in recommendation scenarios. We tackle this problem by introducing the KMC-Shapley algorithm for the scalable estimation of Data Shapley values in neighborhood-based recommendation on sparse interaction data. We conduct an experimental evaluation of the efficiency and scalability of our algorithm on both public and proprietary datasets with millions of interactions, and showcase that the DSV identifies impactful data points for two recommendation tasks in e-commerce. Furthermore, we discuss applications of the DSV on real-world click and purchase data in e-commerce, such as identifying dangerous and low-quality products as well as improving the ecological sustainability of product recommendations.

Type

Conference paper

Publication

ACM Conference on Recommender Systems (RecSys), spotlight oral

Date

July 2025
