Illoominate – Data Shapley Values for Debugging Neighborhood-based Recommender Systems

Abstract

Machine learning-powered recommender systems help users find items they like. Issues in the interaction data processed by these systems frequently lead to problems, such as the accidental recommendation of low-quality products or dangerous items. Such data issues are hard to anticipate upfront, and are typically detected only post-deployment, after they have already impacted the user experience. We argue that a principled data debugging process is required, during which human experts identify potentially harmful data issues and preemptively mitigate them. Recent notions of ‘data importance’, such as the Data Shapley value, represent a promising direction for identifying training data points that are likely to cause issues. However, the scale of real-world interaction datasets makes it infeasible to apply existing techniques for computing the Data Shapley value in recommendation scenarios. We tackle this problem by introducing the KMC-Shapley algorithm for the scalable estimation of Data Shapley values in neighborhood-based recommendation on sparse interaction data. We conduct an experimental evaluation of the efficiency and scalability of our algorithm on both public and proprietary datasets with millions of interactions, and show that Data Shapley values identify impactful data points for two recommendation tasks in e-commerce. Furthermore, we discuss applications of Data Shapley values on real-world click and purchase data in e-commerce, such as identifying dangerous products or improving the ecological sustainability of product recommendations. We implement our approach in the Illoominate library and release it under an open license.

Publication
ACM Transactions on Recommender Systems (Special Issue on Highlights of RecSys’25)