NOISE-ROBUST DE-DUPLICATION AT SCALE
Emily Silcock1, Luca D’Amico-Wong2, Jinglin Yang3, Melissa Dell4
1Department of Economics, Harvard University; Cambridge, MA, USA.
2Harvard College; Cambridge, MA, USA.
3Department of Economics, University of California Berkeley; Berkeley, CA, USA.
4Department of Economics, Harvard University and NBER; Cambridge, MA, USA.
Corresponding author: melissadell@fas.harvard.edu.
ABSTRACT
Identifying near duplicates within large, noisy text corpora has a myriad of ap-
plications that range from de-duplicating training datasets, reducing privacy risk,
and evaluating test set leakage, to identifying reproduced news articles and liter-
ature within large corpora. Across these diverse applications, the overwhelming
majority of work relies on N-grams. Limited efforts have been made to evaluate
how well N-gram methods perform, in part because it is unclear how one could
create an unbiased evaluation dataset for a massive corpus. This study uses the
unique timeliness of historical news wires to create a 27,210 document dataset,
with 122,876 positive duplicate pairs, for studying noise-robust de-duplication.
The time-sensitivity of news makes comprehensive hand labelling feasible - de-
spite the massive overall size of the corpus - as duplicates occur within a nar-
row date range. The study then develops and evaluates a range of de-duplication
methods: hashing and N-gram overlap (which predominate in the literature), a
contrastively trained bi-encoder, and a “re-rank” style approach combining a bi-
and cross-encoder. The neural approaches significantly outperform hashing and
N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10
million article corpus on a single GPU card in a matter of hours. We also apply
our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean
Crawled Corpus), illustrating that a neural approach can identify many near du-
plicates missed by hashing, in the presence of various types of noise. The public
release of our NEWS-COPY de-duplication dataset, codebase, and the pre-trained
models will facilitate further research and applications.
1 INTRODUCTION
Robust identification of near-duplicate texts in large, noisy corpora is important for a variety of appli-
cations. Duplication in training data degrades model performance (Lee et al., 2021), can raise serious
privacy risks (Kandpal et al., 2022), and can degrade performance on downstream tasks (Schofield
et al., 2017; Liu et al., 2022; Allamanis, 2019). Additionally, the presence of test set leakage com-
plicates evaluation of model performance, concerns that are elevated with large language models
that have greater capacity to memorize training data or can consult an external database. Patterns
of duplication are also themselves of interest, for studying the dissemination of reproduced content
such as literature or news (Cordell, 2015; Smith et al., 2015; Vesanto et al., 2017) and for reducing
noise in datasets used for statistical analyses.
In contrast to the literature on semantic textual similarity, where deep neural architectures predomi-
nate - e.g. Reimers & Gurevych (2019) - text de-duplication overwhelmingly uses N-gram methods.
There have been few efforts to formally evaluate the adequacy of N-gram based de-duplication or
to explore potential performance gains from neural text de-duplication. This study builds a large de-
duplication dataset and develops neural methods for robust textual de-duplication that significantly
outperform N-gram based methods and scale efficiently.
A major hurdle to overcome in systematically studying text de-duplication is the lack of data for an
unbiased evaluation of different methods. Typically, there is no way to exhaustively identify all du-
plicates of a given example in a large corpus, complicating comparisons of recall. To circumvent this
challenge, we examine duplication in historical news. Reproduction from news wires and syndicate
services was widespread, forming over half the content of U.S. local newspapers. Media historian
Julia Guarneri (2017) writes: “by the 1910s and 1920s, most of the articles that Americans read in
their local papers had either been bought or sold on the national news market... This constructed a
broadly understood American ‘way of life’ that would become a touchstone of U.S. domestic politics
and international relations throughout the twentieth century.” Because news is timely, reproduction
happens within a narrow time window, and hence annotators can exhaustively identify all dupli-
cates despite the massive overall size of the corpus. To build an unbiased evaluation sample, highly
skilled human annotators manually reviewed every front page article from 973 newspapers on four
randomly chosen days in 1930, 1955, and 1974 to create clusters of duplicated articles (including
all singletons). Additional data, spanning the period from 1920 to 1977, were compiled for model
training. The resulting public NEWS-COPY dataset - which contains 27,210 articles, comprising
122,876 positive duplicate pairs - aims to encourage further study of robust de-duplication.
In the absence of evaluation data, the literature has largely assumed that text de-duplication is suf-
ficiently simple that neural methods are not required. However, noise is an integral feature of large
text datasets, resulting from OCR errors, abridgement, news aggregators, plagiarism, or machine
translation, to name a few reasons. This can lead near duplicate documents to have low N-gram
similarity. Amongst duplicated pairs of articles in the NEWS-COPY test set, the average Jaccard
similarity using 3-grams (4-grams, 5-grams) between pairs of reproduced articles is 30% (26%,
23%). 19% of duplicates have no 10-grams in common and 31% have no 15-grams in common,
often as a result of minor text noise. Neural methods are plausibly more robust.
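
To make these statistics concrete, the sketch below (a minimal Python illustration, not the implementation used for the reported figures) computes word N-gram Jaccard similarity between two hypothetical versions of the same wire story; modest OCR noise and abridgement quickly erode overlap as N grows.

# Minimal sketch of word N-gram Jaccard similarity between two noisy duplicates.
# Tokenization and normalization are deliberately simplistic for illustration.

def ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(text_a, text_b, n):
    """Jaccard similarity of the n-gram sets of two texts."""
    a, b = ngrams(text_a, n), ngrams(text_b, n)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical reproductions of the same article, one abridged and OCR-noised.
original = "the president said on tuesday that the treaty would be signed in washington next month"
reprint = "the presldent said on tuesday that the treatv would be signed in washington"

for n in (3, 4, 5, 10):
    print(n, round(jaccard(original, reprint, n), 3))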
Using the NEWS-COPY dataset, we examine different text de-duplication methods that vary along
two key dimensions: whether or not the method is neural and computational cost. Drawing inspira-
tion from work on semantic textual similarity and on retrieval, we develop two approaches for neural
text de-duplication: a contrastively trained bi-encoder plus clustering method and a ‘reranking’ style
method, which uses a computationally cheap transformer bi-encoder to measure the pairwise sim-
ilarity between all articles and then passes each article’s nearest neighbors to a cross-encoder, at
an additional computational cost. We also examine N-gram overlap and locally sensitive hashing,
the latter of which is highly scalable. The neural methods significantly outperform the non-neural
approaches. The Adjusted Rand Index (ARI) for the re-rank model is 93.7 and for the bi-encoder
model is 91.5, versus 73.7 for LSH and 75.0 for N-gram overlap.
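
For reference, the ARI compares predicted and ground-truth cluster assignments over all pairs of articles, with singletons treated as their own clusters; a minimal sketch with scikit-learn (the label vectors here are purely illustrative) is:

# Sketch: scoring predicted de-duplication clusters against ground truth with ARI.
# Each position is an article; values are cluster IDs (singletons get unique IDs).
from sklearn.metrics import adjusted_rand_score

true_clusters = [0, 0, 1, 1, 2, 3, 3, 4]
pred_clusters = [0, 0, 1, 2, 2, 3, 3, 4]

print(adjusted_rand_score(true_clusters, pred_clusters))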
While the primary advantage of hashing - and a central motivation for its frequent usage - is its
scalability, massive scale similarity search (Johnson et al., 2019) is sufficiently cheap on modern
GPUs to make neural de-duplication highly scalable. We use our contrastively-trained bi-encoder
and a single NVIDIA 40GB A6000 GPU card to de-duplicate a 10 million document, 19 GB corpus
in 11 hours and 45 minutes. While this cost is already marginal in the context of working with large
text corpora, it could be reduced significantly further by using a lighter weight language model, as
the majority of the time cost is embedding the 10M articles.
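
As a rough sketch of this workflow (with the public all-mpnet-base-v2 checkpoint standing in for our released bi-encoder and a toy corpus in place of the 10M articles), the embedding step is a single batched call whose output can be handed directly to a similarity index:

# Sketch: batched embedding of an article corpus with a bi-encoder on one GPU.
# The checkpoint and corpus are placeholders; batch size is a throughput knob.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2", device="cuda")
corpus = ["text of article one ...", "text of article two ..."]  # in practice, millions of articles

embeddings = model.encode(
    corpus,
    batch_size=256,
    convert_to_numpy=True,
    normalize_embeddings=True,  # unit-norm vectors so inner product equals cosine similarity
    show_progress_bar=True,
)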
The publicly available neural de-duplication models, available at https://github.com/
dell-research-harvard/NEWS-COPY, can be applied to novel de-duplication problems. To
evaluate off-the-shelf performance, we apply our bi-encoder model to two subsets of C4 (Colossal
Clean Crawled Corpus), a massive dataset created by applying a series of filters to a single snapshot
of Common Crawl (Raffel et al., 2019; Dodge et al., 2021): RealNews - which consists of around 13
million digital news articles - and all 90,671 patents scraped from Google’s online patent database.
We also examine test set leakage between SuperGlue (Sarlin et al., 2020) and RealNews. While
there is not an unbiased ground truth measure for these datasets, an analysis of predicted duplicates
shows that the bi-encoder detects a variety of noisy duplicates that hashing overlooks, which result
from aggregators of digital news, machine translation, and other sources of noise.
The rest of this paper is organized as follows: Section 2 provides an overview of the relevant liter-
ature. Section 3 describes the NEWS-COPY dataset, and Section 4 develops neural de-duplication
methods and their non-neural comparisons. Section 5 evaluates the performance of different de-
duplication methods, Section 6 explores scaling, and Section 7 applies de-duplication to a subset of
C4. Finally, Section 8 concludes.
2 LITERATURE
De-Duplication: Textual de-duplication is a fundamental task for curating the large text corpora
that support the deep learning revolution. Lee et al. (2021) review the de-duplication literature,
providing evidence that duplication in training datasets is widespread: e.g. Dodge et al. (2021) find
up to 14.4% of test examples of various standard benchmarks verbatim in C4 and Bandy & Vincent
(2021) document that the Books Corpus (Zhu et al., 2015) - used in training BERT (Devlin et al.,
2018), GPT (Brown et al., 2020), and other large language models - contains 4,255 unique books
and 2,930 books that are exactly duplicated at least once.
Lee et al. (2021) document that models trained on deduplicated data regenerate approximately 10
times less training data, and Kandpal et al. (2022) find a superlinear relationship between the number
of times a sequence is present in training data and regeneration, with a sequence present 10 times
being regenerated 1000 times more often than a sequence present once. Carlini et al. (2022) find that
the likelihood of a model generating exact continuations from the training data scales with model
size, training data duplicates, and prefix length. This could raise plagiarism risks (Lee et al., 2022).
There is also a literature showing that duplicates adversely affect downstream tasks. Schofield et al.
(2017) study the impact of text duplication on semantic models, documenting that substantial over-
representation can overwhelm meaningful topical patterns. Allamanis (2019) show that duplica-
tion in code datasets worsens performance on code understanding. Liu et al. (2022) show that
de-duplication of an open electronic health record database significantly improves clinical natural
language processing models. Moreover, when training LMs that can consult a massive external
database - as in a retrieval enhanced transformer language setup (Borgeaud et al., 2022) - test set
leakage becomes a particularly salient concern. Borgeaud et al. (2022) conclude: “Further work is
yet needed to better understand the role of test set leakage in the performance of LMs.”
Non-neural methods predominate in textual de-duplication (Leskovec et al., 2020). Borgeaud et al.
(2022) compute 13-gram Jaccard similarity between train and test documents using MinHashing
and remove all training documents with 0.8 similarity or higher to validation/test documents. Rad-
ford et al. (2019) use 8-gram overlaps for post-hoc identification of duplication between GPT-2’s
training data and evaluation datasets, and Brown et al. (2020) remove from the GPT-3 training data
any example with a 13-gram overlap with an evaluation example. Other de-duplication contexts in-
clude large datasets of medical notes (Shenoy et al., 2017) and scholarly articles (which can include
updates) (Gyawali et al., 2020), both of which have been examined with locally sensitive hashing.
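
A hedged sketch of this style of N-gram MinHash/LSH de-duplication, using the datasketch library with illustrative parameters (the shingle size, number of permutations, and threshold are not taken from any one of the papers above):

# Sketch: MinHash-based locality sensitive hashing for near-duplicate detection.
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, n=5, num_perm=128):
    """MinHash over word n-gram shingles of a lowercased text."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(len(tokens) - n + 1):
        m.update(" ".join(tokens[i:i + n]).encode("utf8"))
    return m

docs = {
    "doc_a": "the president said on tuesday that the treaty would be signed next month",
    "doc_b": "the president said on tuesday that the treaty would be signed",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash_signature(text) for key, text in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

# Candidate duplicates of doc_a at (approximately) Jaccard >= 0.8:
print(lsh.query(signatures["doc_a"]))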
Identifying reproduced texts within historical newspapers is itself an application that has generated
considerable interest. The Viral Texts Project (Cordell, 2015; Smith et al., 2015) uses N-gram
comparisons to track the dissemination of reproduced literature in antebellum newspapers. Viral
Texts utilizes the Chronicling America (Culpepper, 2007) OCR, which does not recognize individual
articles, headlines, captions, etc., leading to scrambled texts. In contrast, we first apply object detection
methods to the document layouts (He et al., 2017; Shen et al., 2021) to extract structured texts of
individual articles that allow us to capture performance gains from the language understanding of
neural methods.
Vesanto et al. (2017) use NCBI BLAST, software for comparing and aligning biological sequences,
to quantify text reproduction at scale in Finnish newspapers from 1771 to 1910. They remove all
characters besides the 23 most common letters from an uncased corpus of Finnish newspapers, and
then convert these to the alphabet of 23 amino acids recognized by BLAST. BLAST is used to
make pairwise comparisons between all documents in the corpus, indicating which pairs have text
overlap. To scale the problem, we use hashing - which avoids the need to convert texts into amino
acid sequences - or a contrastively trained bi-encoder - which leverages the power of deep learning.
Semantic Textual Similarity: There are important parallels between semantic textual similarity
(STS) and textual de-duplication. Notably, our bi-encoder method draws inspiration from Sentence
BERT (S-BERT) (Reimers & Gurevych, 2019), and we use an S-BERT pre-trained bi-encoder as
our base language model. S-BERT adds a pooling operation to BERT/RoBERTa embeddings - that
takes the mean of all output vectors - to derive a fixed-size sentence embedding that can then be
examined with clustering methods.
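
As a concrete picture of this pooling step (a generic Hugging Face sketch rather than S-BERT's optimized implementation, assuming the public all-mpnet-base-v2 checkpoint), attention-mask-weighted mean pooling of token embeddings looks like:

# Sketch: deriving a fixed-size document embedding by mean pooling token embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-mpnet-base-v2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

texts = ["text of the first article ...", "text of the second article ..."]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**enc).last_hidden_state      # (batch, seq_len, hidden)

mask = enc["attention_mask"].unsqueeze(-1).float()          # zero out padding positions
doc_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1)
print(doc_embeddings.shape)                                  # (batch, hidden)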
Retrieval: We draw inspiration for our reranking approach from the literature on open domain
retrieval and question answering (Wang et al., 2018; Lin et al., 2018; Karpukhin et al., 2020; Thakur
et al., 2021; Wu et al., 2019), which avoids the infeasible quadratic cost of applying a cross-encoder
to a massive corpus by first ranking documents with a bi-encoder (or with sparse methods). In our
re-ranking model, instead of a passage encoder and a query encoder, there is a symmetric bi-encoder.
3 THE NEWS-COPY DATASET
3.1 REPRODUCTION IN NEWS
Reproduction is an important feature of news. News wire services distribute stories written by their
own news bureaus and by member newspapers to member news outlets, whereas syndicates dissem-
inate to their subscribers columns written by freelance journalists or purchased from newspapers.
The nation’s largest newspapers also ran syndicate services to redistribute their own stories. The
main news wire services in the United States historically were the Associated Press (AP), the United
Press (UP), and the International News Service (INS), the latter two of which merged to form United
Press International (UPI) in 1958.
Editing could take place at multiple places along the news transmission chain. Wire staff verified
and edited stories after receiving them from members, and then stories could be edited again by local
wire bureaus, of which there were around 100 for the Associated Press. Finally, local newspapers
could abridge content to fit space requirements. This leads to a range of near duplicates in the
presence of abridgement and OCR noise. Noisy duplicates in news are not limited to the historical
context, with digital news aggregators today leading to a similar phenomenon (Coddington, 2019).
3.2 DESCRIPTION OF THE NEWS-COPY DATASET
Table 1 summarizes the key features of the NEWS-COPY dataset. It consists of 27,210 articles, drawn
from 973 newspapers between 1920 and 1977.1 NEWS-COPY contains two types of data: data for
training and four full-day, exhaustively labeled evaluation samples, constructed from two consecutive
days of content in 1930 and single days in 1955 and 1974, selected at random. The 1955 sample
is a validation set used to select hyperparameters for both the N-gram and neural methods. The 1930
and 1974 samples are pooled to form the test set and used only to produce the results shown in this paper.
In the full day samples, there are far more negative than positive pairs, as is generally the case in
de-duplication problems, whereas the training data contain a more balanced sample.
3.3 PROCEDURE FOR BUILDING THE DATASET
To build NEWS-COPY, we first apply Layout Parser (Shen et al., 2021) with a custom-trained ob-
ject detection model (He et al., 2017) to front page scans of off-copyright historical newspapers to
identify individual article bounding boxes. The contents of article bounding boxes are OCR’ed with
Tesseract. When component bounding boxes span multiple columns on the same page, the OCR’ed
texts are associated into full articles using a rule-based association method that exploits the coor-
dinates of headline and article bounding boxes. This pipeline extracts the structured article texts.
Headlines were chosen by local newspapers - not wires - and as a result are rarely reproduced and
not included in the dataset. Weather forecasts are removed by running a distil-RoBERTa classifier
trained on 392 labeled articles (179 positive, 202 negative). This removes 4.4% of the validation set
and 3.3% of the test set. We also hand-removed documents containing incorrectly merged article
bounding boxes from different underlying source articles (as there was no single ground truth cluster
to which these articles belonged), and news summaries, which summarize multiple news stories in
a single article and hence also have no clear cluster with which they are associated. These represent
3.4% and 3.3% of the validation and test sets, respectively.
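
A minimal sketch of the weather-forecast filtering step described above, assuming a locally saved fine-tuned classifier (the path and label name are hypothetical placeholders, not released artifacts):

# Sketch: filtering weather forecasts with a fine-tuned binary text classifier.
# "weather-classifier/" and the "WEATHER" label are hypothetical placeholders.
from transformers import pipeline

classifier = pipeline("text-classification", model="weather-classifier/")

articles = ["FAIR AND WARMER tonight, with light winds ...", "The treaty was signed ..."]
# Keep only articles the classifier does not label as weather forecasts.
# (Long articles would need truncation to the model's maximum sequence length.)
kept = [a for a in articles if classifier(a)[0]["label"] != "WEATHER"]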
Duplicates are defined as articles that came from the same original source article, regardless of the
degree of abridgement or OCR noise. Articles from different source articles that contain the same
quote are labeled as non-duplicated. Likewise, articles updated to reflect breaking news are labeled
as different, as are different articles on the same overarching story.
1A copyright law change effective January 1, 1978 resulted in nearly all newspapers from that date forward
being under copyright by default.
                       Positive     Negative        Reproduced   Singleton   Total
                       Pairs        Pairs           Articles     Articles    Articles
Training Data
  Training             36,291       37,637          891          -           7,728
  Validation           3,042        3,246           20           -           283
Full Day Evaluation
  Validation           28,547       12,409,031      447          2,162       4,988
  Test                 54,996       100,914,159     1,236        8,046       14,211
Full Dataset           122,876      113,364,073     2,594        10,208      27,210
Table 1: This table provides summary statistics from the NEWS-COPY dataset, decomposed into the
training sample and the full day evaluation data.
To construct the full-day samples, we first ran 5-gram overlap with a very conservative N-gram
overlap threshold of 1% to create large candidate duplicate clusters. Highly trained student re-
search assistants carefully reviewed these clusters, breaking false positive links. A sub-sample was
double-labeled to ensure that our definition of a duplicated article was coherent and that labeling was
consistent across annotators. Interannotator agreement on a subset of 8,512 pairs was 98.1% (Cohen's
Kappa of 90.9). Next, annotators reviewed each of the resulting clusters, merging clusters together
as needed. Finally, annotators exhaustively reviewed every singleton article, associating them with
article clusters as needed. Articles were sorted by byline (recognized with a custom-trained named
entity recognition model) to facilitate this process. The approach for building the training data was
similar; the false positive links that annotators break provide hard negatives. We did not review all
singletons, as the aim was to produce labeled batches for contrastive training. About two thirds of the
negative pairs in the training data
are hard negatives, with the remaining third coming from randomly selected article pairs.
4 MODEL ARCHITECTURES
4.1 THE BI-ENCODER MODEL
We contrastively train a symmetric bi-encoder to learn similar representations for near duplicate ar-
ticles and dissimilar representations for non-duplicated articles. We use an S-BERT MPNET model
(Reimers & Gurevych, 2019; Song et al., 2020) contrastively trained on over a billion sentence pairs
- drawn from STS datasets - as the base language model. The S-BERT architecture pools represen-
tations for up to the first 512 tokens in each article, using mean pooling, to construct a document
level representation. Like Reimers & Gurevych (2019), we found when experimenting with vanilla
RoBERTa embeddings - which also perform well on de-duplication - that mean pooling of the token
representations significantly outperforms using the [CLS] token to represent the document. S-
BERT provides a speed-optimized implementation of this pooling strategy. We chose the MPNET
S-BERT because it performs best overall on STS benchmarks.
We use S-BERT’s online contrastive loss (Hadsell et al., 2006) implementation, with a 0.2 margin
and cosine similarity distance. The learning rate is 2e-5 with 100% warm up and a batch size of 32.
We use an AdamW optimizer, and the model is trained for 16 epochs.
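
A hedged sketch of this training setup with the sentence-transformers API, using toy pairs and the public MPNET checkpoint as stand-ins for the NEWS-COPY training batches and the released model (the warmup schedule reflects one reading of "100% warm up"):

# Sketch: contrastive bi-encoder training with online contrastive loss.
# Hyperparameters follow the text (margin 0.2, cosine distance, lr 2e-5, batch 32, 16 epochs).
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses
from sentence_transformers.losses import SiameseDistanceMetric

model = SentenceTransformer("all-mpnet-base-v2")

train_examples = [
    InputExample(texts=["wire article text ...", "abridged reprint of the same article ..."], label=1),
    InputExample(texts=["wire article text ...", "different article on the same story ..."], label=0),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

train_loss = losses.OnlineContrastiveLoss(
    model=model,
    distance_metric=SiameseDistanceMetric.COSINE_DISTANCE,
    margin=0.2,
)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=16,
    optimizer_params={"lr": 2e-5},
    warmup_steps=16 * len(train_loader),  # warm up over all training steps
)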
The bi-encoder dense document representations can be clustered to identify duplicates. We use
FAISS (Johnson et al., 2019) to compute all embeddings within a given distance range, a hyperpa-
rameter tuned on the full-day validation sample. This output is used to build a graph, where nodes
are articles and edges connect articles within the threshold distance. Connected components can be
extracted to define clusters - which is equivalent to single linkage clustering - or Louvain commu-
nity detection can be applied to the graph to control false positive edges that can merge otherwise
disparate groups of articles.
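
A minimal sketch of this clustering step, assuming unit-normalized embeddings and an illustrative similarity threshold (the actual threshold is tuned on the full-day validation sample; Louvain community detection could replace the connected-components call):

# Sketch: FAISS range search over bi-encoder embeddings, then graph-based clustering.
import faiss
import networkx as nx
import numpy as np

embeddings = np.random.rand(1000, 768).astype("float32")  # stand-in for bi-encoder output
faiss.normalize_L2(embeddings)                            # unit norm: inner product = cosine

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# All pairs with cosine similarity above an illustrative threshold.
threshold = 0.92
lims, sims, neighbors = index.range_search(embeddings, threshold)

graph = nx.Graph()
graph.add_nodes_from(range(len(embeddings)))
for i in range(len(embeddings)):
    for j in neighbors[lims[i]:lims[i + 1]]:
        if int(j) != i:
            graph.add_edge(i, int(j))

# Connected components are single-linkage duplicate clusters (singletons included).
clusters = list(nx.connected_components(graph))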
4.2 THE RE-RANKING MODEL
While a cross-encoder can offer the most flexible, expressive comparisons between texts, it re-
quires N² embeddings to compare N texts, infeasible in large corpora. To make the use of a cross-
encoder feasible, we draw inspiration from the retrieval literature (Wang et al., 2018; Lin et al.,