Estimating the Performance of Entity Resolution Algorithms:
Lessons Learned Through PatentsView.org
Olivier Binette1,2, Sokhna A York2, Emma Hickerson2, Youngsoo Baek1, Sarvo Madhavan2,
and Christina Jones2
1Duke University
2American Institutes for Research
April 19, 2023
Abstract
This paper introduces a novel evaluation methodology for entity resolution algorithms. It is motivated
by PatentsView.org, a public-use patent data exploration platform that disambiguates patent inventors
using an entity resolution algorithm. We provide a data collection methodology and tailored performance
estimators that account for sampling biases. Our approach is simple, practical and principled – key
characteristics that allow us to paint the first representative picture of PatentsView’s disambiguation
performance. The results are used to inform PatentsView’s users of the reliability of the data and to
allow the comparison of competing disambiguation algorithms.
1 Introduction
Entity resolution (also called record linkage, deduplication, or disambiguation) is the task of identifying
records in a database that refer to the same entity. An entity may be a person, a company, an object or an
event. Records are assumed to contain partially identifying information about these entities. When there
is no unique identifier (such as a social security number) available for all records, entity resolution becomes
a complex problem which requires sophisticated algorithmic solutions (Herzog et al., 2007; Christen, 2012;
Dong and Srivastava, 2015; Ilyas and Chu, 2019; Christophides et al., 2021; Christen, 2019; Papadakis et al.,
2021; Binette and Steorts, 2022).
Specifically, we are interested in the entity resolution system used by PatentsView.org, a public-use
patent data exploration platform maintained by the American Institutes for Research (AIR), to disambiguate
inventor mentions in patent data. The U.S. Patent and Trademark Office (USPTO) makes available patent
data dating back to 1790 (digitized full-text data is available from 1976). However, there is no standard for
uniquely identifying inventors on patent applications. The result is a set of ambiguous mentions of inventors,
where a single person’s name may be spelled in different ways on two applications and where two different
inventors with the same name may be difficult to distinguish. Inventors moving between locations and
employers further complicates their identification. Following seminal works (Trajtenberg and Shiff, 2008;
Ferreira et al., 2012; Ventura et al., 2013; Li et al., 2014), a disambiguation competition was held in 2015
leading to the disambiguation system currently used for PatentsView.org. Since then, disambiguated inventor
data has been one of PatentsView’s most popular data products, complementing its data visualizations,
search tools, Application Programming Interface (API), and other data products that serve a wide
variety of audiences, including students, educators, researchers, policymakers, small business owners, and the
public (Toole et al., 2021). Given challenges associated with the disambiguation process, the topic continues
to be an active area of research (Ventura et al., 2015; Kim et al., 2016; Yang et al., 2017; Morrison et al.,
2017; Müller, 2017; Traylor et al., 2017; Balsmeier et al., 2018; Tam et al., 2019; Monath et al., 2019; Doherr,
2021).
One key challenge in using, maintaining, and improving entity resolution systems is to evaluate their
performance. In the case of PatentsView’s disambiguation, no principled evaluation methodology is avail-
able to measure performance, to inform users of the reliability of the data, and to support methodological
research to improve upon PatentsView’s disambiguation algorithms. The state-of-the-art in entity resolution
evaluation, namely computing performance evaluation metrics (precision, recall, etc.) on benchmark datasets,
leads to misleading and highly biased performance estimates, as shown in section 1.1. This is concerning given
the many scientific uses of PatentsView's data: prior to June 2021, 179 research studies cited PatentsView
as a data source, including around 25% from the field of economics (Toole et al., 2021). Furthermore, a
common research theme is the study of the effects of public policy, inventor mobility, and inventor
demographics on innovation and patenting. This requires accurate inventor disambiguation to track
inventors and entities through the breadth of patent data.
Our paper addresses this challenge, thus informing users of the reliability of disambiguated data and sup-
porting methodological research to improve disambiguation algorithms. We propose novel evaluation method-
ology that is principled and cost-effective, and we demonstrate its effectiveness to evaluate PatentsView’s
disambiguation.
Before continuing with the rest of this introduction, we review terminology used throughout the paper in section
1.0.1. In section 1.1, we then discuss the challenges of evaluation, past work, and our contributions.
1.0.1 Terminology
We consider a database of records, where each record represents a mention of a given inventor (e.g., the first
inventor of Patent number 12345). In this context, the records are also referred to as inventor mentions.
The goal of entity resolution is to cluster inventor mentions according to the entity (real-world inventor)
to which they refer. Clusterings obtained from algorithms are referred to as predicted clusters or predicted
disambiguations, whereas the (unknown) clustering corresponding to the true set of inventors is referred to
as the ground truth. Two inventor mentions are said to match or to be a true match if they refer to the same
inventor. If two inventor mentions are in the same predicted cluster, then they are said to be linked, or to be a predicted match. The
proportion of true matches among all predicted matches is called the pairwise precision, while the proportion
of predicted matches among all true matches is called the pairwise recall.
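These pairwise quantities are straightforward to compute when a ground truth is available. The following minimal sketch (in Python, with made-up record identifiers and clusterings used purely for illustration) enumerates the record pairs implied by each clustering and computes pairwise precision and recall.

from itertools import combinations

def matching_pairs(clusters):
    # Set of unordered record pairs that co-occur in a cluster.
    return {frozenset(pair)
            for cluster in clusters
            for pair in combinations(cluster, 2)}

# Hypothetical toy data: each inner list is a cluster of inventor mention ids.
truth = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
predicted = [["a1", "a2"], ["a3", "c1", "c2"], ["b1"]]

true_matches = matching_pairs(truth)
predicted_matches = matching_pairs(predicted)

precision = len(true_matches & predicted_matches) / len(predicted_matches)
recall = len(true_matches & predicted_matches) / len(true_matches)
print(precision, recall)  # 0.5 and 0.5 in this toy example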
1.1 The Evaluation Problem
The entity resolution evaluation problem is to extrapolate from observed performance in small samples
to real performance in a database with millions of records. Wang et al. (2022) refer to this as bridging
the reality-ideality gap in entity resolution, where high performance on benchmark datasets often does not
translate into the real world. Here, performance may be defined as any combination of commonly used
evaluation metrics for entity resolution, such as precision and recall, cluster homogeneity and completeness,
the Rand index, or the generalized merge distance (Maidasani et al., 2012). These metrics can be computed on
benchmark datasets for which we have a ground truth disambiguation. However, the key evaluation problem
is to obtain estimates that are representative of performance on the full data, for which no ground truth
disambiguation is available. This is challenging for the following reasons.
First, entity resolution problems do not scale linearly. While it may be easy to disambiguate a small
dataset, the opportunity for errors grows quadratically in the number of records. As such, we may observe
good performance of an algorithm on a small benchmark dataset, while the true performance on the entire
dataset may be something else entirely. This particular effect of dataset size in entity resolution is explored
in Draisbach and Naumann (2013) in the context of choosing similarity thresholds. This is a problem
that PatentsView.org currently faces. Despite encouraging performance evaluation metrics on benchmark
datasets, with nearly perfect precision and recall reported in the latest methodological report (Monath et al.,
2021), the data science team at AIR observes lower real-world accuracy. This phenomenon is illustrated in
example 1 below.
A second problem is the large class imbalance in entity resolution (Marchant and Rubinstein, 2017). Viewing
entity resolution as a classification problem, the task is to classify record pairs as being a match or a non-
match. However, among all pairs of records, only a small fraction (typically far less than a fraction of a
percent) refer to the same entity. The vast majority of record pairs are not a match. This makes it difficult
to evaluate performance through random sampling of record pairs.
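A back-of-the-envelope calculation makes the imbalance concrete. The numbers below are hypothetical orders of magnitude chosen only for illustration; they are not PatentsView figures.

n = 10_000_000                             # records (inventor mentions)
s = 3                                      # assumed average true cluster size
total_pairs = n * (n - 1) // 2             # candidate pairs: grows quadratically
match_pairs = (n // s) * s * (s - 1) // 2  # pairs within true clusters: grows linearly
print(match_pairs / total_pairs)           # roughly 2e-07, far below a fraction of a percent

At this level of imbalance, uniform random sampling of record pairs would require very large samples before enough true matches are observed to estimate performance with useful accuracy.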
A third problem is the multiplicity of sampling mechanisms used to obtain benchmark datasets. To
construct hand-disambiguated datasets, blocks, entity clusters, or predicted clusters may be sampled with
various probability weights. These sampling approaches must be accounted for in order to obtain represen-
tative performance estimates (Fuller, 2011).
Our approach, detailed in sections 1.1.3 and 2.3, addresses these challenges by putting forward novel
cluster-based expressions for performance metrics that reflect various sampling schemes. Each of these
representations immediately suggests simple estimators that properly account for the above issues.
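To fix ideas, here is a minimal sketch of how such a cluster-based precision estimator can be organized. It is an illustration under simplifying assumptions (ground-truth clusters sampled with known inclusion probabilities and predicted labels available for every record in the full data), not necessarily the exact estimator derived in section 2.3. The number of correct links and the number of predicted links are both decomposed into per-cluster contributions, which are then reweighted by inverse sampling probabilities.

from collections import Counter
from itertools import combinations

def precision_estimate(sampled_true_clusters, predicted_labels, sampling_probs):
    # sampled_true_clusters: list of lists of record ids (sampled ground-truth clusters)
    # predicted_labels: dict mapping every record id in the full data to a predicted cluster id
    # sampling_probs: inclusion probability of each sampled cluster (same order)
    pred_sizes = Counter(predicted_labels.values())  # predicted cluster sizes in the full data
    num = den = 0.0
    for cluster, p in zip(sampled_true_clusters, sampling_probs):
        # Correct links: pairs within this true cluster that share a predicted label.
        correct = sum(predicted_labels[a] == predicted_labels[b]
                      for a, b in combinations(cluster, 2))
        # Predicted links touching this cluster: within-cluster links, plus half of each
        # link to an outside record (the other record's true cluster claims the other
        # half), so that summing over all true clusters counts every link exactly once.
        counts = Counter(predicted_labels[r] for r in cluster)
        links = sum(k * (k - 1) / 2 + k * (pred_sizes[c] - k) / 2
                    for c, k in counts.items())
        num += correct / p
        den += links / p
    return num / den  # ratio estimator of pairwise precision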
Example 1 (Bias of precision computed on benchmark datasets). To exemplify the problem with the trivial
use of performance evaluation metrics on benchmark datasets, we carried out a toy experiment that is
described in detail in appendix A.1. In short, we evaluated a disambiguation algorithm by sampling ground
truth clusters and computing pairwise precision on this set of sampled clusters. This is analogous to the
way that many real-world benchmark datasets are obtained and typically used. In this experiment, we know
that the disambiguation algorithm has a precision of 52% for the entire dataset.
Figure 1: Distribution of precision estimates versus the true precision of 52% (shown as a dotted vertical
line). Panel A shows the trivial precision estimates computed for sampled records. Panel B shows our
proposed precision estimates, which account for the sampling mechanism. Sample bias and root mean squared
error (RMSE) are reported in each panel.
In panel A of figure 1, we see the distribution of precision estimates versus the true precision of 52%,
shown as a dotted vertical line. Precision estimates are usually very close to 100% and always higher than
80%, despite the truth being a precision of only 52%. In contrast, panel B shows the distribution of our
proposed precision estimator which is nearly unbiased. Both precision estimators rely on exactly the same
data. They only differ in how they account for the underlying sampling process and the extrapolation from
small benchmark datasets to the full data.
The same phenomenon can be observed in PatentsView’s data, where naive precision is nearly 1 on all
benchmark datasets. In our simulation studies (see Figure 3, for instance), naive precision estimates are
always nearly 1, despite the true precision ranging from 60% to 90%.
The reason why naive performance estimation performs disastrously is that it is much easier to disam-
biguate a small benchmark dataset than a large population with millions of records. Indeed, as a dataset
grows, the opportunity for erroneous links grows quadratically. False links between similarly named inventors,
which are common in the full data, disappear when the benchmark dataset only contains a random sample
of inventors. Our performance estimators extrapolate from performance observed on a small benchmark to
true performance on the full data.
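This effect is easy to reproduce in a small, self-contained simulation. The setup below is hypothetical and much smaller than PatentsView's data (it is not the experiment of appendix A.1): consecutive entities share a name, so a name-based linker wrongly merges their records in the full data, but those collisions mostly vanish in a benchmark built from a random sample of entities.

import random
from itertools import combinations

random.seed(0)

# Hypothetical population: 1,000 entities with 3 records each; entities 2i and 2i+1
# share a "name", so linking records by name merges their records in the full data.
records = [(entity, entity // 2) for entity in range(1000) for _ in range(3)]

def precision(ids):
    # Pairwise precision of the name-based linker, restricted to the records in ids.
    links = [(a, b) for a, b in combinations(ids, 2)
             if records[a][1] == records[b][1]]                      # same name
    correct = sum(records[a][0] == records[b][0] for a, b in links)  # same entity
    return correct / len(links)

# True precision on the full data: 6 correct links out of 15 per name group.
print(precision(range(len(records))))    # 0.4

# "Benchmark" evaluation: sample 50 entities and keep only their records.
sampled = set(random.sample(range(1000), 50))
ids = [i for i, (entity, _) in enumerate(records) if entity in sampled]
print(precision(ids))    # usually far above 0.4, often close to 1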
1.1.1 Why Bother With Evaluation?
There are two main uses for accurate and statistically rigorous evaluation methodology.
The first is model selection and comparison. PatentsView.org continually works at improving disam-
biguation methodology. This requires choosing between alternative methods and evaluating the results of
methodological experiments. Without sound evaluation methodology, decisions regarding the disambigua-
tion algorithm may not align with real-world use and real-world performance. Notably, for a performance
metric such as the pairwise f-score, one algorithm may perform better than another on a small benchmark
dataset, while the opposite may hold true for performance on the entire data. This problem arises with
typical benchmark datasets obtained from randomly sampling blocks or randomly sampling clusters (see
section 2.3 for a definition of different sampling mechanisms).
The second is adequate use of disambiguated data. PatentsView.org’s disambiguation results have been
used in numerous scientific studies (Toole et al., 2021). For example, Choudhury and Kim (2019) studied
the effect of skilled worker immigration on patenting at U.S. companies and institutions, using PatentsView’s
inventor disambiguation to track individual immigrant inventors across time and location. These studies
make assumptions about the reliability of the data that need to be validated and upheld. In short, users of
disambiguated data need to understand its reliability in order to make scientifically appropriate use of it.
Evaluation aims to provide this rigorous reliability information.
1.1.2 Past Work
Much of the past literature has focused on defining and using relevant clustering evaluation metrics. The topic
of estimating performance from samples has received much less attention, usually focusing on importance
sampling estimators based on record pairs. We review the contributions to these two main topics below.
Metrics Pairwise precision and recall metrics were first reported in Newcombe et al. (1959), with Bilenko
and Mooney (2003) and Christen and Goiser (2007) emphasizing the importance of precision-recall curves
for algorithm evaluation. However, there are issues with the use of pairwise precision and recall in entity
resolution applications, such as the large relative importance of large clusters. As such, other clustering
metrics have been proposed, including cluster precision and recall, cluster homogeneity and completeness,
the B³ metric (Bagga and Baldwin, 1998), and generalized merge distances (Michelson and Macskassy, 2009;