NOISE-ROBUST DE-DUPLICATION AT SCALE
Emily Silcock1, Luca D’Amico-Wong2, Jinglin Yang3, Melissa Dell4
1Department of Economics, Harvard University; Cambridge, MA, USA.
2Harvard College; Cambridge, MA, USA.
3Department of Economics, University of California Berkeley; Berkeley, CA, USA.
4Department of Economics, Harvard University and NBER; Cambridge, MA, USA.
Corresponding author: melissadell@fas.harvard.edu.
ABSTRACT
Identifying near duplicates within large, noisy text corpora has a myriad of ap-
plications that range from de-duplicating training datasets, reducing privacy risk,
and evaluating test set leakage, to identifying reproduced news articles and liter-
ature within large corpora. Across these diverse applications, the overwhelming
majority of work relies on N-grams. Limited efforts have been made to evaluate
how well N-gram methods perform, in part because it is unclear how one could
create an unbiased evaluation dataset for a massive corpus. This study uses the
unique timeliness of historical news wires to create a 27,210 document dataset,
with 122,876 positive duplicate pairs, for studying noise-robust de-duplication.
The time-sensitivity of news makes comprehensive hand labelling feasible - de-
spite the massive overall size of the corpus - as duplicates occur within a nar-
row date range. The study then develops and evaluates a range of de-duplication
methods: hashing and N-gram overlap (which predominate in the literature), a
contrastively trained bi-encoder, and a “re-rank” style approach combining a bi-
and cross-encoder. The neural approaches significantly outperform hashing and
N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10
million article corpus on a single GPU card in a matter of hours. We also apply
our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean
Crawled Corpus), illustrating that a neural approach can identify many near du-
plicates missed by hashing, in the presence of various types of noise. The public
release of our NEWS-COPY de-duplication dataset, codebase, and the pre-trained
models will facilitate further research and applications.
1 INTRODUCTION
Robust identification of near-duplicate texts in large, noisy corpora is important for a variety of appli-
cations. Duplication in training data degrades model performance (Lee et al., 2021), can raise serious
privacy risks (Kandpal et al., 2022), and can degrade performance on downstream tasks (Schofield
et al., 2017; Liu et al., 2022; Allamanis, 2019). Additionally, the presence of test set leakage com-
plicates evaluation of model performance, concerns that are elevated with large language models
that have greater capacity to memorize training data or can consult an external database. Patterns
of duplication are also themselves of interest, for studying the dissemination of reproduced content
such as literature or news (Cordell, 2015; Smith et al., 2015; Vesanto et al., 2017) and for reducing
noise in datasets used for statistical analyses.
In contrast to the literature on semantic textual similarity, where deep neural architectures predomi-
nate - e.g. Reimers & Gurevych (2019) - text de-duplication overwhelmingly uses N-gram methods.
There have been few efforts to formally evaluate the adequacy of N-gram based de-duplication or
to explore potential performance gains from neural text de-duplication. This study builds a large de-
duplication dataset and develops neural methods for robust textual de-duplication that significantly
outperform N-gram based methods and scale efficiently.
A major hurdle to overcome in systematically studying text de-duplication is the lack of data for an
unbiased evaluation of different methods. Typically, there is no way to exhaustively identify all du-
plicates of a given example in a large corpus, complicating comparisons of recall. To circumvent this
challenge, we examine duplication in historical news. Reproduction from news wires and syndicate
services was widespread, forming over half the content of U.S. local newspapers. Media historian
Julia Guarneri (2017) writes: “by the 1910s and 1920s, most of the articles that Americans read in
their local papers had either been bought or sold on the national news market... This constructed a
broadly understood American ‘way of life’ that would become a touchstone of U.S. domestic politics
and international relations throughout the twentieth century.” Because news is timely, reproduction
happens within a narrow time window, and hence annotators can exhaustively identify all dupli-
cates despite the massive overall size of the corpus. To build an unbiased evaluation sample, highly
skilled human annotators manually reviewed every front page article from 973 newspapers on four
randomly chosen days in 1930, 1955, and 1974 to create clusters of duplicated articles (including
all singletons). Additional data, spanning the period from 1920 to 1977, were compiled for model
training. The resulting public NEWS-COPY dataset - which contains 27,210 articles, comprising
122,876 positive duplicate pairs - aims to encourage further study of robust de-duplication.
In the absence of evaluation data, the literature has largely assumed that text de-duplication is suf-
ficiently simple that neural methods are not required. However, noise is an integral feature of large
text datasets, resulting from OCR errors, abridgement, news aggregators, plagiarism, or machine
translation, to name a few reasons. This can lead near duplicate documents to have low N-gram
similarity. Amongst duplicated pairs of articles in the NEWS-COPY test set, the average Jaccard
similarity using 3-grams (4-grams, 5-grams) between pairs of reproduced articles is 30% (26%,
23%). 19% of duplicates have no 10-grams in common and 31% have no 15-grams in common,
often as a result of minor text noise. Neural methods are plausibly more robust.
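
To make these statistics concrete, the sketch below (a minimal Python illustration, not the implementation used for the reported figures) computes word N-gram Jaccard similarity between two hypothetical versions of the same wire story; modest OCR noise and abridgement quickly erode overlap as N grows.

# Minimal sketch of word N-gram Jaccard similarity between two noisy duplicates.
# Tokenization and normalization are deliberately simplistic for illustration.

def ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(text_a, text_b, n):
    """Jaccard similarity of the n-gram sets of two texts."""
    a, b = ngrams(text_a, n), ngrams(text_b, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical reproductions of the same article, one abridged and OCR-noised.
original = "the president said on tuesday that the treaty would be signed in washington next month"
reprint = "the presldent said on tuesday that the treatv would be signed in washington"

for n in (3, 4, 5, 10):
    print(n, round(jaccard(original, reprint, n), 3))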
Using the NEWS-COPY dataset, we examine different text de-duplication methods that vary along
two key dimensions: whether or not the method is neural and computational cost. Drawing inspira-
tion from work on semantic textual similarity and on retrieval, we develop two approaches for neural
text de-duplication: a contrastively trained bi-encoder plus clustering method and a ‘reranking’ style
method, which uses a computationally cheap transformer bi-encoder to measure the pairwise sim-
ilarity between all articles and then passes each article’s nearest neighbors to a cross-encoder, at
an additional computational cost. We also examine N-gram overlap and locally sensitive hashing,
the latter of which is highly scalable. The neural methods significantly outperform the non-neural
approaches. The Adjusted Rand Index (ARI) for the re-rank model is 93.7 and for the bi-encoder
model is 91.5, versus 73.7 for LSH and 75.0 for N-gram overlap.
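
For reference, the ARI compares predicted and ground-truth cluster assignments over all pairs of articles, with singletons treated as their own clusters; a minimal sketch with scikit-learn (the label vectors here are purely illustrative) is:

# Sketch: scoring predicted de-duplication clusters against ground truth with ARI.
# Each position is an article; values are cluster IDs (singletons get unique IDs).
from sklearn.metrics import adjusted_rand_score

true_clusters = [0, 0, 1, 1, 2, 3, 3, 4]
pred_clusters = [0, 0, 1, 2, 2, 3, 3, 4]

print(adjusted_rand_score(true_clusters, pred_clusters))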
While the primary advantage of hashing - and a central motivation for its frequent usage - is its
scalability, massive scale similarity search (Johnson et al., 2019) is sufficiently cheap on modern
GPUs to make neural de-duplication highly scalable. We use our contrastively-trained bi-encoder
and a single NVIDIA 40GB A6000 GPU card to de-duplicate a 10 million document, 19 GB corpus
in 11 hours and 45 minutes. While this cost is already marginal in the context of working with large
text corpora, it could be reduced significantly further by using a lighter weight language model, as
the majority of the time cost is embedding the 10M articles.
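
As a rough sketch of this workflow (with the public all-mpnet-base-v2 checkpoint standing in for our released bi-encoder and a toy corpus in place of the 10M articles), the embedding step is a single batched call whose output can be handed directly to a similarity index:

# Sketch: batched embedding of an article corpus with a bi-encoder on one GPU.
# The checkpoint and corpus are placeholders; batch size is a throughput knob.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2", device="cuda")
corpus = ["text of article one ...", "text of article two ..."]  # in practice, millions of articles

embeddings = model.encode(
    corpus,
    batch_size=256,
    convert_to_numpy=True,
    normalize_embeddings=True,  # unit-norm vectors so inner product equals cosine similarity
    show_progress_bar=True,
)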
The publicly available neural de-duplication models, available at https://github.com/
dell-research-harvard/NEWS-COPY, can be applied to novel de-duplication problems. To
evaluate off-the-shelf performance, we apply our bi-encoder model to two subsets of C4 (Colossal
Clean Crawled Corpus), a massive dataset created by applying a series of filters to a single snapshot
of Common Crawl (Raffel et al., 2019; Dodge et al., 2021): RealNews - which consists of around 13
million digital news articles - and all 90,671 patents scraped from Google’s online patent database.
We also examine test set leakage between SuperGlue (Sarlin et al., 2020) and RealNews. While
there is not an unbiased ground truth measure for these datasets, an analysis of predicted duplicates
shows that the bi-encoder detects a variety of noisy duplicates that hashing overlooks, which result
from aggregators of digital news, machine translation, and other sources of noise.
The rest of this paper is organized as follows: Section 2 provides an overview of the relevant liter-
ature. Section 3 describes the NEWS-COPY dataset, and Section 4 develops neural de-duplication
methods and their non-neural comparisons. Section 5 evaluates the performance of different de-
duplication methods, Section 6 explores scaling, and Section 7 applies de-duplication to a subset of
C4. Finally, Section 8 concludes.
2 LITERATURE
De-Duplication: Textual de-duplication is a fundamental task for curating the large text corpora
that support the deep learning revolution. Lee et al. (2021) review the de-duplication literature,
providing evidence that duplication in training datasets is widespread: e.g. Dodge et al. (2021) find
up to 14.4% of test examples of various standard benchmarks verbatim in C4 and Bandy & Vincent
(2021) document that the Books Corpus (Zhu et al., 2015) - used in training BERT (Devlin et al.,
2018), GPT (Brown et al., 2020), and other large language models - contains 4,255 unique books
and 2,930 books that are exactly duplicated at least once.
Lee et al. (2021) document that models trained on deduplicated data regenerate approximately 10
times less training data, and Kandpal et al. (2022) find a superlinear relationship between the number
of times a sequence is present in training data and regeneration, with a sequence present 10 times
being regenerated 1000 times more often than a sequence present once. Carlini et al. (2022) find that
the likelihood of a model generating exact continuations from the training data scales with model
size, training data duplicates, and prefix length. This could raise plagiarism risks (Lee et al., 2022).
There is also a literature showing that duplicates adversely affect downstream tasks. Schofield et al.
(2017) study the impact of text duplication on semantic models, documenting that substantial over-
representation can overwhelm meaningful topical patterns. Allamanis (2019) show that duplica-
tion in code datasets worsens performance on code understanding. Liu et al. (2022) show that
de-duplication of an open electronic health record database significantly improves clinical natural
language processing models. Moreover, when training LMs that can consult a massive external
database - as in a retrieval enhanced transformer language setup (Borgeaud et al., 2022) - test set
leakage becomes a particularly salient concern. Borgeaud et al. (2022) conclude: “Further work is
yet needed to better understand the role of test set leakage in the performance of LMs.”
Non-neural methods predominate in textual de-duplication (Leskovec et al., 2020). Borgeaud et al.
(2022) compute 13-gram Jaccard similarity between train and test documents using MinHashing
and remove all training documents with 0.8 similarity or higher to validation/test documents. Rad-
ford et al. (2019) use 8-gram overlaps for post-hoc identification of duplication between GPT-2’s
training data and evaluation datasets, and Brown et al. (2020) remove from the GPT-3 training data
any example with a 13-gram overlap with an evaluation example. Other de-duplication contexts in-
clude large datasets of medical notes (Shenoy et al., 2017) and scholarly articles (which can include
updates) (Gyawali et al., 2020), both of which have been examined with locally sensitive hashing.
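
A hedged sketch of this style of N-gram MinHash/LSH de-duplication, using the datasketch library with illustrative parameters (the shingle size, number of permutations, and threshold are not taken from any one of the papers above):

# Sketch: MinHash-based locality sensitive hashing for near-duplicate detection.
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, n=5, num_perm=128):
    """MinHash over word n-gram shingles of a lowercased text."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        m.update(" ".join(tokens[i:i + n]).encode("utf8"))
    return m

docs = {
    "doc_a": "the president said on tuesday that the treaty would be signed next month",
    "doc_b": "the president said on tuesday that the treaty would be signed",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash_signature(text) for key, text in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

# Candidate duplicates of doc_a at (approximately) Jaccard >= 0.8:
print(lsh.query(signatures["doc_a"]))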
Identifying reproduced texts within historical newspapers is itself an application that has generated
considerable interest. The Viral Texts Project (Cordell, 2015; Smith et al., 2015) uses N-gram
comparisons to track the dissemination of reproduced literature in antebellum newspapers. Viral
Texts utilizes the Chronicling America (Culpepper, 2007) OCR, which does not recognize individual
articles, headlines, captions, etc., leading to scrambled texts. In contrast, we first apply object detection
methods to the document layouts (He et al., 2017; Shen et al., 2021) to extract structured texts of
individual articles that allow us to capture performance gains from the language understanding of
neural methods.
Vesanto et al. (2017) use NCBI BLAST, software for comparing and aligning biological sequences,
to quantify text reproduction at scale in Finnish newspapers from 1771 to 1910. They remove all
characters besides the 23 most common letters from an uncased corpus of Finnish newspapers, and
then convert these to the alphabet of 23 amino acids recognized by BLAST. BLAST is used to
make pairwise comparisons between all documents in the corpus, indicating which pairs have text
overlap. To scale the problem, we use hashing - which avoids the need to convert texts into amino
acid sequences - or a contrastively trained bi-encoder - which leverages the power of deep learning.
Semantic Textual Similarity: There are important parallels between semantic textual similarity
(STS) and textual de-duplication. Notably, our bi-encoder method draws inspiration from Sentence
BERT (S-BERT) (Reimers & Gurevych, 2019), and we use an S-BERT pre-trained bi-encoder as
our base language model. S-BERT adds a pooling operation to BERT/RoBERTa embeddings - that
takes the mean of all output vectors - to derive a fixed-size sentence embedding that can then be
examined with clustering methods.
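
As a concrete picture of this pooling step (a generic Hugging Face sketch rather than S-BERT's optimized implementation, assuming the public all-mpnet-base-v2 checkpoint), attention-mask-weighted mean pooling of token embeddings looks like:

# Sketch: deriving a fixed-size document embedding by mean pooling token embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

texts = ["text of the first article ...", "text of the second article ..."]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**enc).last_hidden_state      # (batch, seq_len, hidden)

mask = enc["attention_mask"].unsqueeze(-1).float()          # zero out padding positions
doc_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(doc_embeddings.shape)                                  # (batch, hidden)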
Retrieval: We draw inspiration for our reranking approach from the literature on open domain
retrieval and question answering (Wang et al., 2018; Lin et al., 2018; Karpukhin et al., 2020; Thakur
et al., 2021; Wu et al., 2019), which avoids the infeasible quadratic cost of applying a cross-encoder
to a massive corpus by first ranking documents with a bi-encoder (or with sparse methods). In our
re-ranking model, instead of a passage encoder and a query encoder, there is a symmetric bi-encoder.
3 THE NEWS-COPY DATASET
3.1 REPRODUCTION IN NEWS
Reproduction is an important feature of news. News wire services distribute stories written by their
own news bureaus and by member newspapers to member news outlets, whereas syndicates dissem-
inate to their subscribers columns written by freelance journalists or purchased from newspapers.
The nation’s largest newspapers also ran syndicate services to redistribute their own stories. The
main news wire services in the United States historically were the Associated Press (AP), the United
Press (UP), and the International News Service (INS), the latter two of which merged to form United
Press International (UPI) in 1958.
Editing could take place at multiple places along the news transmission chain. Wire staff verified
and edited stories after receiving them from members, and then stories could be edited again by local
wire bureaus, of which there were around 100 for the Associated Press. Finally, local newspapers
could abridge content to fit space requirements. This leads to a range of near duplicates in the
presence of abridgement and OCR noise. Noisy duplicates in news are not limited to the historical
context, with digital news aggregators today leading to a similar phenomenon (Coddington, 2019).
3.2 DESCRIPTION OF THE NEWS-COPY DATASET
Table 1 summarizes the key features of the NEWS-COPY dataset. It consists of 27,210 articles, drawn
from 973 newspapers between 1920 and 1977.1 NEWS-COPY contains two types of data: data for
training and four full-day, exhaustively labeled evaluation samples, constructed from two consecutive
days of content in 1930 and single days in 1955 and 1974, selected at random. The 1955 sample
is a validation set used to select hyperparameters for both the N-gram and neural methods. The 1930
and 1974 samples are pooled to form the test set and used only to produce the results shown in this paper.
In the full day samples, there are far more negative than positive pairs, as is generally the case in
de-duplication problems, whereas the training data contain a more balanced sample.
3.3 PROCEDURE FOR BUILDING THE DATASET
To build NEWS-COPY, we first apply Layout Parser (Shen et al., 2021) with a custom-trained ob-
ject detection model (He et al., 2017) to front page scans of off-copyright historical newspapers to
identify individual article bounding boxes. The contents of article bounding boxes are OCR’ed with
Tesseract. When component bounding boxes span multiple columns on the same page, the OCR’ed
texts are associated into full articles using a rule-based association method that exploits the coor-
dinates of headline and article bounding boxes. This pipeline extracts the structured article texts.
Headlines were chosen by local newspapers - not wires - and as a result are rarely reproduced and
not included in the dataset. Weather forecasts are removed by running a distil-RoBERTa classifier
trained on 392 labeled articles (179 positive, 202 negative). This removes 4.4% of the validation set
and 3.3% of the test set. We also hand-removed documents containing incorrectly merged article
bounding boxes from different underlying source articles (as there was no single ground truth cluster
to which these articles belonged), and news summaries, which summarize multiple news stories in
a single article and hence also have no clear cluster with which they are associated. These represent
3.4% and 3.3% of the validation and test sets, respectively.
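
A minimal sketch of the weather-forecast filtering step described above, assuming a locally saved fine-tuned classifier (the path and label name are hypothetical placeholders, not released artifacts):

# Sketch: filtering weather forecasts with a fine-tuned binary text classifier.
# "weather-classifier/" and the "WEATHER" label are hypothetical placeholders.
from transformers import pipeline

classifier = pipeline("text-classification", model="weather-classifier/")

articles = ["FAIR AND WARMER tonight, with light winds ...", "The treaty was signed ..."]
# Keep only articles the classifier does not label as weather forecasts.
# (Long articles would need truncation to the model's maximum sequence length.)
kept = [a for a in articles if classifier(a)[0]["label"] != "WEATHER"]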
Duplicates are defined as articles that came from the same original source article, regardless of the
degree of abridgement or OCR noise. Articles from different source articles that contain the same
quote are labeled as non-duplicated. Likewise, articles updated to reflect breaking news are labeled
as different, as are different articles on the same overarching story.
1A copyright law change effective January 1, 1978 resulted in nearly all newspapers from that date forward
being under copyright by default.
                       Positive     Negative        Reproduced   Singleton   Total
                       Pairs        Pairs           Articles     Articles    Articles
Training Data
  Training             36,291       37,637          891          -           7,728
  Validation           3,042        3,246           20           -           283
Full Day Evaluation
  Validation           28,547       12,409,031      447          2,162       4,988
  Test                 54,996       100,914,159     1,236        8,046       14,211
Full Dataset           122,876      113,364,073     2,594        10,208      27,210
Table 1: This table provides summary statistics from the NEWS-COPY dataset, decomposed into the
training sample and the full day evaluation data.
To construct the full-day samples, we first ran 5-gram overlap with a very conservative N-gram
overlap threshold of 1% to create large candidate duplicate clusters. Highly trained student re-
search assistants carefully reviewed these clusters, breaking false positive links. A sub-sample was
double-labeled to ensure that our definition of a duplicated article was coherent and that labeling was
consistent across annotators. Interannotator agreement on a subset of 8,512 pairs was 98.1% (Cohen's
Kappa of 90.9). Next, annotators reviewed each of the resulting clusters, merging clusters together
as needed. Finally, annotators exhaustively reviewed every singleton article, associating them with
article clusters as needed. Articles were sorted by byline (recognized with a custom-trained named
entity recognition model) to facilitate this process. The approach for building the training data was
similar; the false positive links that annotators break provide hard negatives. We did not review all
singletons, as the aim was to produce labeled batches for contrastive training. About two thirds of the
negative pairs in the training data
are hard negatives, with the remaining third coming from randomly selected article pairs.
4 MODEL ARCHITECTURES
4.1 THE BI-ENCODER MODEL
We contrastively train a symmetric bi-encoder to learn similar representations for near duplicate ar-
ticles and dissimilar representations for non-duplicated articles. We use an S-BERT MPNET model
(Reimers & Gurevych, 2019; Song et al., 2020) contrastively trained on over a billion sentence pairs
- drawn from STS datasets - as the base language model. The S-BERT architecture pools represen-
tations for up to the first 512 tokens in each article, using mean pooling, to construct a document
level representation. Like Reimers & Gurevych (2019), we found when experimenting with vanilla
RoBERTa embeddings - which also perform well on de-duplication - that mean pooling of the token
representations significantly outperforms using the [CLS] token to represent the document. S-
BERT provides a speed-optimized implementation of this pooling strategy. We chose the MPNET
S-BERT because it performs best overall on STS benchmarks.
We use S-BERT’s online contrastive loss (Hadsell et al., 2006) implementation, with a 0.2 margin
and cosine similarity distance. The learning rate is 2e-5 with 100% warm up and a batch size of 32.
We use an AdamW optimizer, and the model is trained for 16 epochs.
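
A hedged sketch of this training setup with the sentence-transformers API, using toy pairs and the public MPNET checkpoint as stand-ins for the NEWS-COPY training batches and the released model (the warmup schedule reflects one reading of "100% warm up"):

# Sketch: contrastive bi-encoder training with online contrastive loss.
# Hyperparameters follow the text (margin 0.2, cosine distance, lr 2e-5, batch 32, 16 epochs).
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses
from sentence_transformers.losses import SiameseDistanceMetric

model = SentenceTransformer("all-mpnet-base-v2")

train_examples = [
    InputExample(texts=["wire article text ...", "abridged reprint of the same article ..."], label=1),
    InputExample(texts=["wire article text ...", "different article on the same story ..."], label=0),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

train_loss = losses.OnlineContrastiveLoss(
    model=model,
    distance_metric=SiameseDistanceMetric.COSINE_DISTANCE,
    margin=0.2,
)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=16,
    optimizer_params={"lr": 2e-5},
    warmup_steps=16 * len(train_loader),  # warm up over all training steps
)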
The bi-encoder dense document representations can be clustered to identify duplicates. We use
FAISS (Johnson et al., 2019) to compute all embeddings within a given distance range, a hyperpa-
rameter tuned on the full-day validation sample. This output is used to build a graph, where nodes
are articles and edges connect articles within the threshold distance. Connected components can be
extracted to define clusters - which is equivalent to single linkage clustering - or Louvain commu-
nity detection can be applied to the graph to control false positive edges that can merge otherwise
disparate groups of articles.
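
A minimal sketch of this clustering step, assuming unit-normalized embeddings and an illustrative similarity threshold (the actual threshold is tuned on the full-day validation sample; Louvain community detection could replace the connected-components call):

# Sketch: FAISS range search over bi-encoder embeddings, then graph-based clustering.
import faiss
import networkx as nx
import numpy as np

embeddings = np.random.rand(1000, 768).astype("float32")  # stand-in for bi-encoder output
faiss.normalize_L2(embeddings)                            # unit norm: inner product = cosine

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# All pairs with cosine similarity above an illustrative threshold.
threshold = 0.92
lims, sims, neighbors = index.range_search(embeddings, threshold)

graph = nx.Graph()
graph.add_nodes_from(range(len(embeddings)))
for i in range(len(embeddings)):
    for j in neighbors[lims[i]:lims[i + 1]]:
        if int(j) != i:
            graph.add_edge(i, int(j))

# Connected components are single-linkage duplicate clusters (singletons included).
clusters = list(nx.connected_components(graph))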
4.2 THE RE-RANKING MODEL
While a cross-encoder can offer the most flexible, expressive comparisons between texts, it re-
quires N² embeddings to compare N texts, infeasible in large corpora. To make the use of a cross-
encoder feasible, we draw inspiration from the retrieval literature (Wang et al., 2018; Lin et al.,