A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models

Abstract

Entity matching (EM) is the problem of determining whether two data records refer to the same real-world entity. A particularly challenging scenario is cross-dataset entity matching, where the matcher has to work with an unseen target dataset for which no labelled examples are available. Cross-dataset EM is crucial in scenarios where a high level of automation is required, and where it is unlikely or impractical to have a domain expert manually label training data. Recently, approaches based on language models have become popular for EM, and often promise impressive transfer capabilities. However, a comprehensive and systematic study of the cross-dataset EM capabilities of these recent approaches is lacking. It is unclear which categories of language models are actually applicable in a cross-dataset EM setting, how well current EM approaches perform when they are evaluated systematically in a cross-dataset setting, and what the relationship is between the prediction quality and the deployment cost of various large language model-based EM approaches. We address these open questions with the first comprehensive and systematic study on cross-dataset entity matching, in which we evaluate eight matchers on 11 benchmark datasets, cover a wide variety of model sizes and transfer learning approaches, and explore and quantify the relation between prediction quality and deployment cost of the matching approaches. We find that fine-tuned small models can perform on par with prompted large models, that data-centric approaches outperform model-centric approaches, and that approaches using well-performing small models can be deployed at a cost that is orders of magnitude lower than that of comparably performing approaches with large commercial models.

Publication
International Conference on Extending Database Technology (EDBT)