Estimating the Performance of Entity Resolution Algorithms:
Lessons Learned Through PatentsView.org
Olivier Binette1,2, Sokhna A York2, Emma Hickerson2, Youngsoo Baek1, Sarvo Madhavan2,
and Christina Jones2
1Duke University
2American Institutes for Research
April 19, 2023
Abstract
This paper introduces a novel evaluation methodology for entity resolution algorithms. It is motivated
by PatentsView.org, a public-use patent data exploration platform that disambiguates patent inventors
using an entity resolution algorithm. We provide a data collection methodology and tailored performance
estimators that account for sampling biases. Our approach is simple, practical and principled – key
characteristics that allow us to paint the first representative picture of PatentsView’s disambiguation
performance. The results are used to inform PatentsView’s users of the reliability of the data and to
allow the comparison of competing disambiguation algorithms.
1 Introduction
Entity resolution (also called record linkage, deduplication, or disambiguation) is the task of identifying
records in a database that refer to the same entity. An entity may be a person, a company, an object or an
event. Records are assumed to contain partially identifying information about these entities. When there
is no unique identifier (such as a social security number) available for all records, entity resolution becomes
a complex problem which requires sophisticated algorithmic solutions (Herzog et al., 2007; Christen, 2012;
Dong and Srivastava, 2015; Ilyas and Chu, 2019; Christophides et al., 2021; Christen, 2019; Papadakis et al.,
2021; Binette and Steorts, 2022).
Specifically, we are interested in the entity resolution system used by PatentsView.org, a public-use
patent data exploration platform maintained by the American Institutes for Research (AIR), to disambiguate
inventor mentions in patent data. The U.S. Patent and Trademark Office (USPTO) makes available patent
data dating back to 1790 (digitized full-text data is available from 1976). However, there is no standard for
uniquely identifying inventors on patent applications. The result is a set of ambiguous mentions of inventors,
where a single person’s name may be spelled in different ways on two applications and where two different
inventors with the same name may be difficult to distinguish. Inventors moving between locations and
employers further complicates their identification. Following seminal works (Trajtenberg and Shiff, 2008;
Ferreira et al., 2012; Ventura et al., 2013; Li et al., 2014), a disambiguation competition was held in 2015
leading to the disambiguation system currently used for PatentsView.org. Since then, disambiguated inventor
data has been one of PatentsView’s most popular data products, complementing its data visualizations,
search tools, Application Programming Interface (API), and other data products that serve a wide
variety of audiences, including students, educators, researchers, policymakers, small business owners, and the
public (Toole et al., 2021). Given challenges associated with the disambiguation process, the topic continues
to be an active area of research (Ventura et al., 2015; Kim et al., 2016; Yang et al., 2017; Morrison et al.,
2017; Müller, 2017; Traylor et al., 2017; Balsmeier et al., 2018; Tam et al., 2019; Monath et al., 2019; Doherr,
2021).
One key challenge in using, maintaining, and improving entity resolution systems is to evaluate their
performance. In the case of PatentsView’s disambiguation, no principled evaluation methodology is avail-
able to measure performance, to inform users of the reliability of the data, and to support methodological
research to improve upon PatentsView’s disambiguation algorithms. The state-of-the-art in entity resolution
evaluation, namely computing performance evaluation metrics (precision, recall, etc.) on benchmark datasets,
leads to misleading and highly biased performance estimates, as shown in section 1.1. This is concerning given
the many scientific uses of PatentsView's data: prior to June 2021, 179 research studies cited PatentsView
as a data source, including around 25% from the field of economics (Toole et al., 2021). Furthermore, a
common research theme is the study of the effects of public policy, inventor mobility, and inventor
demographics on innovation and patenting. This requires accurate inventor disambiguation to track
inventors and entities through the breadth of patent data.
Our paper addresses this challenge, thus informing users of the reliability of disambiguated data and sup-
porting methodological research to improve disambiguation algorithms. We propose novel evaluation method-
ology that is principled and cost-effective, and we demonstrate its effectiveness to evaluate PatentsView’s
disambiguation.
Before continuing with the rest of this introduction, we review terminology used throughout the paper in section
1.0.1. In section 1.1, we then discuss the challenges of evaluation, past work, and our contributions.
1.0.1 Terminology
We consider a database of records, where each record represents a mention of a given inventor (e.g., the first
inventor of Patent number 12345). In this context, the records are also referred to as inventor mentions.
The goal of entity resolution is to cluster inventor mentions according to the entity (real-world inventor)
to which they refer. Clusterings obtained from algorithms are referred to as predicted clusters or predicted
disambiguations, whereas the (unknown) clustering corresponding to the true set of inventors is referred to
as the ground truth. Two inventor mentions are said to match or to be a true match if they refer to the same
inventor. If two inventor mentions are in the same predicted cluster, then they are said to be linked, or to be a predicted match. The
proportion of true matches among all predicted matches is called the pairwise precision, while the proportion
of predicted matches among all true matches is called the pairwise recall.
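These pairwise quantities are straightforward to compute when a ground truth is available. The following minimal sketch (in Python, with made-up record identifiers and clusterings used purely for illustration) enumerates the record pairs implied by each clustering and computes pairwise precision and recall.

from itertools import combinations

def matching_pairs(clusters):
    # Set of unordered record pairs that co-occur in a cluster.
    return {frozenset(pair)
            for cluster in clusters
            for pair in combinations(cluster, 2)}

# Hypothetical toy data: each inner list is a cluster of inventor mention ids.
truth = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
predicted = [["a1", "a2"], ["a3", "c1", "c2"], ["b1"]]

true_matches = matching_pairs(truth)
predicted_matches = matching_pairs(predicted)

precision = len(true_matches & predicted_matches) / len(predicted_matches)
recall = len(true_matches & predicted_matches) / len(true_matches)
print(precision, recall)  # 0.5 and 0.5 in this toy example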
1.1 The Evaluation Problem
The entity resolution evaluation problem is to extrapolate from observed performance in small samples
to real performance in a database with millions of records. Wang et al. (2022) refer to this as bridging
the reality-ideality gap in entity resolution, where high performance on benchmark datasets often does not
translate into the real world. Here, performance may be defined as any combination of commonly used
evaluation metrics for entity resolution, such as precision and recall, cluster homogeneity and completeness,
the Rand index, or the generalized merge distance (Maidasani et al., 2012). These metrics can be computed on
benchmark datasets for which we have a ground truth disambiguation. However, the key evaluation problem
is to obtain estimates that are representative of performance on the full data, for which no ground truth
disambiguation is available. This is challenging for the following reasons.
First, entity resolution problems do not scale linearly. While it may be easy to disambiguate a small
dataset, the opportunity for errors grows quadratically in the number of records. As such, we may observe
good performance of an algorithm on a small benchmark dataset, while the true performance on the entire
dataset may be something else entirely. This particular effect of dataset size in entity resolution is explored
in Draisbach and Naumann (2013) in the context of choosing similarity thresholds. This is a problem
that PatentsView.org currently faces. Despite encouraging performance evaluation metrics on benchmark
datasets, with nearly perfect precision and recall reported in the latest methodological report (Monath et al.,
2021), the data science team at AIR observes lower real-world accuracy. This phenomenon is illustrated in
example 1 below.
A second problem is the large class imbalance in entity resolution (Marchant and Rubinstein, 2017). Viewing
entity resolution as a classification problem, the task is to classify record pairs as being a match or a non-
match. However, among all pairs of records, only a small fraction (typically far less than a fraction of a
percent) refer to the same entity. The vast majority of record pairs are not a match. This makes it difficult
to evaluate performance through random sampling of record pairs.
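A back-of-the-envelope calculation makes the imbalance concrete. The numbers below are hypothetical orders of magnitude chosen only for illustration; they are not PatentsView figures.

n = 10_000_000                             # records (inventor mentions)
s = 3                                      # assumed average true cluster size
total_pairs = n * (n - 1) // 2             # candidate pairs: grows quadratically
match_pairs = (n // s) * s * (s - 1) // 2  # pairs within true clusters: grows linearly
print(match_pairs / total_pairs)           # roughly 2e-07, far below a fraction of a percent

At this level of imbalance, uniform random sampling of record pairs would require very large samples before enough true matches are observed to estimate performance with useful accuracy.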
A third problem is the multiplicity of sampling mechanisms used to obtain benchmark datasets. To
construct hand-disambiguated datasets, blocks, entity clusters, or predicted clusters may be sampled with
various probability weights. These sampling approaches must be accounted for in order to obtain represen-
tative performance estimates (Fuller, 2011).
Our approach, detailed in sections 1.1.3 and 2.3, addresses these challenges by putting forward novel
cluster-based expressions for performance metrics that reflect various sampling schemes. Each of these
representations immediately suggests simple estimators that properly account for the above issues.
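To fix ideas, here is a minimal sketch of how such a cluster-based precision estimator can be organized. It is an illustration under simplifying assumptions (ground-truth clusters sampled with known inclusion probabilities and predicted labels available for every record in the full data), not necessarily the exact estimator derived in section 2.3. The number of correct links and the number of predicted links are both decomposed into per-cluster contributions, which are then reweighted by inverse sampling probabilities.

from collections import Counter
from itertools import combinations

def precision_estimate(sampled_true_clusters, predicted_labels, sampling_probs):
    # sampled_true_clusters: list of lists of record ids (sampled ground-truth clusters)
    # predicted_labels: dict mapping every record id in the full data to a predicted cluster id
    # sampling_probs: inclusion probability of each sampled cluster (same order)
    pred_sizes = Counter(predicted_labels.values())  # predicted cluster sizes in the full data
    num = den = 0.0
    for cluster, p in zip(sampled_true_clusters, sampling_probs):
        # Correct links: pairs within this true cluster that share a predicted label.
        correct = sum(predicted_labels[a] == predicted_labels[b]
                      for a, b in combinations(cluster, 2))
        # Predicted links touching this cluster: within-cluster links, plus half of each
        # link to an outside record (the other record's true cluster claims the other
        # half), so that summing over all true clusters counts every link exactly once.
        counts = Counter(predicted_labels[r] for r in cluster)
        links = sum(k * (k - 1) / 2 + k * (pred_sizes[c] - k) / 2
                    for c, k in counts.items())
        num += correct / p
        den += links / p
    return num / den  # ratio estimator of pairwise precision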
Example 1 (Bias of precision computed on benchmark datasets). To exemplify the problem with the trivial
use of performance evaluation metrics on benchmark datasets, we carried out a toy experiment that is
described in detail in appendix A.1. In short, we evaluated a disambiguation algorithm by sampling ground
truth clusters and computing pairwise precision on this set of sampled clusters. This is analogous to the
way that many real-world benchmark datasets are obtained and typically used. In this experiment, we know
that the disambiguation algorithm has a precision of 52% for the entire dataset.
Figure 1: Distribution of precision estimates versus the true precision of 52% (shown as a dotted vertical
line). Panel A shows the trivial precision estimates computed for sampled records. Panel B shows our
proposed precision estimates, which account for the sampling mechanism. Sample bias and root mean squared
error (RMSE) are reported in each panel.
In panel A of figure 1, we see the distribution of precision estimates versus the true precision of 52%,
shown as a dotted vertical line. Precision estimates are usually very close to 100% and always higher than
80%, despite the truth being a precision of only 52%. In contrast, panel B shows the distribution of our
proposed precision estimator which is nearly unbiased. Both precision estimators rely on exactly the same
data. They only differ in how they account for the underlying sampling process and the extrapolation from
small benchmark datasets to the full data.
The same phenomenon can be observed in PatentsView’s data, where naive precision is nearly 1 on all
benchmark datasets. In our simulation studies (see Figure 3, for instance), naive precision estimates are
always nearly 1, despite the true precision ranging from 60% to 90%.
The reason why naive performance estimation performs disastrously is that it is much easier to disam-
biguate a small benchmark dataset than a large population with millions of records. Indeed, as a dataset
grows, the opportunity for erroneous links grows quadratically. False links between similarly named inventors,
which are common in the full data, disappear when the benchmark dataset only contains a random sample
of inventors. Our performance estimators extrapolate from performance observed on a small benchmark to
true performance on the full data.
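This effect is easy to reproduce in a small, self-contained simulation. The setup below is hypothetical and much smaller than PatentsView's data (it is not the experiment of appendix A.1): consecutive entities share a name, so a name-based linker wrongly merges their records in the full data, but those collisions mostly vanish in a benchmark built from a random sample of entities.

import random
from itertools import combinations

random.seed(0)

# Hypothetical population: 1,000 entities with 3 records each; entities 2i and 2i+1
# share a "name", so linking records by name merges their records in the full data.
records = [(entity, entity // 2) for entity in range(1000) for _ in range(3)]

def precision(ids):
    # Pairwise precision of the name-based linker, restricted to the records in ids.
    links = [(a, b) for a, b in combinations(ids, 2)
             if records[a][1] == records[b][1]]                      # same name
    correct = sum(records[a][0] == records[b][0] for a, b in links)  # same entity
    return correct / len(links)

# True precision on the full data: 6 correct links out of 15 per name group.
print(precision(range(len(records))))    # 0.4

# "Benchmark" evaluation: sample 50 entities and keep only their records.
sampled = set(random.sample(range(1000), 50))
ids = [i for i, (entity, _) in enumerate(records) if entity in sampled]
print(precision(ids))    # usually far above 0.4, often close to 1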
1.1.1 Why Bother With Evaluation?
There are two main uses for accurate and statistically rigorous evaluation methodology.
The first is model selection and comparison. PatentsView.org continually works at improving disam-
biguation methodology. This requires choosing between alternative methods and evaluating the results of
methodological experiments. Without sound evaluation methodology, decisions regarding the disambigua-
tion algorithm may not align with real-world use and real-world performance. Notably, for a performance
metric such as the pairwise f-score, one algorithm may perform better than another on a small benchmark
dataset, while the opposite may hold true for performance on the entire data. This problem arises with
typical benchmark datasets obtained from randomly sampling blocks or randomly sampling clusters (see
section 2.3 for a definition of different sampling mechanisms).
The second is adequate use of disambiguated data. PatentsView.org’s disambiguation results have been
used in numerous scientific studies (Toole et al., 2021). For example, Choudhury and Kim (2019) studied
the effect of skilled worker immigration on patenting at U.S. companies and institutions, using PatentsView’s
inventor disambiguation to track individual immigrant inventors across time and location. These studies
make assumptions about the reliability of the data that need to be validated and upheld. In short, users of
disambiguated data need to understand its reliability in order to make scientifically appropriate use of it.
Evaluation aims to provide this rigorous reliability information.
1.1.2 Past Work
Much of the past literature has focused on defining and using relevant clustering evaluation metrics. The topic
of estimating performance from samples has received much less attention, usually focusing on importance
sampling estimators based on record pairs. We review the contributions to these two main topics below.
Metrics Pairwise precision and recall metrics were first reported in Newcombe et al. (1959), with Bilenko
and Mooney (2003) and Christen and Goiser (2007) emphasizing the importance of precision-recall curves
for algorithm evaluation. However, there are issues with the use of pairwise precision and recall in entity
resolution applications, such as the large relative importance of large clusters. As such, other clustering
metrics have been proposed, including cluster precision and recall, cluster homogeneity and completeness,
the B³ metric (Bagga and Baldwin, 1998), and generalized merge distances (Michelson and Macskassy, 2009;