Performance metrics like precision, recall, and F1 score are commonly used to characterize the accuracy of entity resolution systems. Computing these metrics requires results to evaluate (our predicted entity clusters) and a benchmark dataset to serve as the ground-truth reference. It's a deceptively simple process that, unfortunately, often yields misleading results.
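To make the naive computation concrete, here is a minimal sketch of one common way to score predicted clusters against a benchmark: pairwise precision, recall, and F1. The function names and the toy record identifiers are hypothetical, and the pairwise definition is only one of several in use. This is the baseline computation the rest of this article argues should not be trusted on its own.

```python
from itertools import combinations


def matching_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        # Sort so each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_metrics(predicted, reference):
    """Pairwise precision, recall, and F1 of predicted vs. reference clusters."""
    pred = matching_pairs(predicted)
    ref = matching_pairs(reference)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(ref) if ref else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with made-up record identifiers.
predicted = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]
reference = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]
precision, recall, f1 = pairwise_metrics(predicted, reference)
print(precision, recall, f1)
```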
The reasons why computing performance metrics on benchmark datasets does not give representative results are known by some experts but are often ignored in practice. This leads to several issues, such as:
Over-optimistic results, with computed precision often above 95% even when the true system performance may be much lower.
Performance rank reversals, meaning that poor algorithms can be ranked higher than better alternatives.
Poor system design, with performance metrics underrating the importance of essential features (as first noticed by Wang et al. in 2022).
Why You Can't Just Compute Performance Metrics on Benchmark Datasets
1. Entity resolution performance is linked to the amount of data you have.
In a large dataset, there is much more opportunity for errors. Imagine, for example, that you were to sample a few individuals at random in the United States. It is very unlikely that any two of them will share the same name, even though many names are each shared by thousands of people in the US. As such, it is much more difficult to disambiguate between individuals in a large dataset than in a small subset. It's easy to get good performance on a small benchmark dataset, but it's hard to get good performance for a large population.
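The name example can be sketched numerically. The snippet below assumes a purely hypothetical, Zipf-like distribution over 10,000 names (not real US data) and computes the probability that a sampled individual shares a name with at least one other individual in the sample. Under these assumptions, that probability grows with the sample size, which is the effect described above.

```python
def name_probabilities(num_names=10_000, exponent=1.0):
    """Hypothetical Zipf-like name frequencies (an assumption, not real data)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_names + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def share_of_ambiguous_records(probs, sample_size):
    """Probability that a sampled record shares its name with at least
    one of the other (sample_size - 1) independently sampled records."""
    return sum(p * (1.0 - (1.0 - p) ** (sample_size - 1)) for p in probs)


probs = name_probabilities()
small = share_of_ambiguous_records(probs, 10)
medium = share_of_ambiguous_records(probs, 1_000)
large = share_of_ambiguous_records(probs, 100_000)

for n, share in ((10, small), (1_000, medium), (100_000, large)):
    print(f"sample size {n:>7}: {share:.3f} of records have a namesake")
```

A system that only has to separate the few namesakes in a small sample faces a much easier task than one that must separate the namesakes in the full population.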
2. Benchmark datasets are rarely representative of the full population.
Benchmark datasets are typically observational datasets obtained by convenience, and they may cover only a very specific sub-population. Furthermore, the ratio of matching to non-matching pairs is likely to be different (and much higher) in the benchmark dataset than in the full population.
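One reason the match ratio differs is size alone, which a small calculation can illustrate. The sketch below assumes, hypothetically, that every entity has exactly three records, and compares the fraction of record pairs that are true matches in a 100-entity benchmark and in a 100,000-entity population. The cluster sizes and entity counts are illustrative assumptions.

```python
from math import comb


def match_ratio(cluster_sizes):
    """Fraction of all record pairs that belong to the same entity."""
    total_records = sum(cluster_sizes)
    matching = sum(comb(size, 2) for size in cluster_sizes)
    return matching / comb(total_records, 2)


benchmark = [3] * 100        # hypothetical: 100 entities, 3 records each
population = [3] * 100_000   # hypothetical: 100,000 entities, 3 records each

benchmark_ratio = match_ratio(benchmark)
population_ratio = match_ratio(population)
print(f"benchmark:  {benchmark_ratio:.6f}")
print(f"population: {population_ratio:.8f}")
```

With equal cluster sizes m and k entities, the ratio simplifies to (m - 1) / (k * m - 1), so it shrinks roughly in proportion to the number of entities. Because false positives are drawn from the non-matching pairs, precision measured at the benchmark's match ratio does not carry over to the population.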
3. There is data leakage with the benchmark dataset.
Often, entities in a benchmark dataset are also represented in the general population, outside of the benchmark dataset. In terms of a train/test split for training a machine learning model, this means that you will have data leakage: your train and test splits are not independent. This is a common cause of failure for embedding-based entity resolution systems.
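A simple way to check for this kind of leakage, when ground-truth entity labels are available, is to look for entities that have records on both sides of a split. The sketch below uses a hypothetical `entity_of` mapping from record identifier to entity identifier, with made-up data.

```python
def leaked_entities(train_records, test_records, entity_of):
    """Return the entities that have records in both train and test."""
    train_entities = {entity_of[r] for r in train_records}
    test_entities = {entity_of[r] for r in test_records}
    return train_entities & test_entities


# Hypothetical ground truth: record identifier -> entity identifier.
entity_of = {"r1": "A", "r2": "A", "r3": "B", "r4": "C", "r5": "C"}

# A record-level split that ignores entities.
train = ["r1", "r3", "r4"]
test = ["r2", "r5"]

leaks = leaked_entities(train, test, entity_of)
print(sorted(leaks))
```

Any entity reported here means the model has already seen records of that entity during training, so test performance on it is not an independent measurement.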
What Can You Do Instead?
When estimating performance metrics, you need to:
Account for population size bias. You need to account for the fact that performance is going to be lower on the full dataset.
Account for sampling processes. You need to account for the way that the data was obtained, and for the representation differences between the benchmark dataset and the full population.
Use the right data for the right purposes. You need to have a carefully crafted test set that is known to be completely independent from your training dataset, especially when evaluating embedding-based and deep learning models.
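As an illustration of the last point only, one way to keep train and test sets independent at the entity level is to assign whole entities, rather than individual records, to a split. The sketch below is a minimal example under the assumption that entity labels are known for the labeled data; the function name, the hashing scheme, and the test fraction are hypothetical choices. It does not address population size bias or sampling processes, and it cannot detect records of the same entities that sit outside the labeled data.

```python
import hashlib


def split_by_entity(entity_of, test_fraction=0.2):
    """Assign every record of an entity to the same side of the split.

    entity_of maps record identifier -> entity identifier (assumed known).
    Hashing the entity identifier makes the assignment deterministic.
    """
    train, test = [], []
    for record, entity in entity_of.items():
        digest = hashlib.sha256(str(entity).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
        (test if bucket < test_fraction else train).append(record)
    return train, test


# Hypothetical labeled data: 50 entities with 3 records each.
entity_of = {f"r{e}_{i}": f"entity{e}" for e in range(50) for i in range(3)}
train, test = split_by_entity(entity_of)

train_entities = {entity_of[r] for r in train}
test_entities = {entity_of[r] for r in test}
print(len(train), len(test), len(train_entities & test_entities))
```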
Conclusion
Accurately estimating performance metrics in entity resolution is crucial for developing and implementing effective solutions. However, naively using benchmark datasets can lead to misleading results, over-optimistic assessments, and poorly ranked algorithms. To overcome these issues, it is essential to account for population size bias, sampling processes, and data independence when evaluating performance metrics.
How Valires Can Help
At Valires, we have developed principled performance estimators that account for all of the above issues. Our methodology has been published in peer-reviewed scientific journals, implemented, and tested. Reach out to learn more.