
plicates of a given example in a large corpus, complicating comparisons of recall. To circumvent this
challenge, we examine duplication in historical news. Reproduction from news wires and syndicate
services was widespread, forming over half the content of U.S. local newspapers. Media historian
Julia Guarneri (2017) writes: “by the 1910s and 1920s, most of the articles that Americans read in
their local papers had either been bought or sold on the national news market... This constructed a
broadly understood American ‘way of life’ that would become a touchstone of U.S. domestic politics
and international relations throughout the twentieth century.” Because news is timely, reproduction
happens within a narrow time window, and hence annotators can exhaustively identify all dupli-
cates despite the massive overall size of the corpus. To build an unbiased evaluation sample, highly
skilled human annotators manually reviewed every front page article from 973 newspapers on four
randomly chosen days in 1930, 1955, and 1974 to create clusters of duplicated articles (including
all singletons). Additional data, spanning the period from 1920 to 1977, were compiled for model
training. The resulting public NEWS-COPY dataset, which contains 27,210 articles comprising
122,876 positive duplicate pairs, aims to encourage further study of robust de-duplication.
In the absence of evaluation data, the literature has largely assumed that text de-duplication is suf-
ficiently simple that neural methods are not required. However, noise is an integral feature of large
text datasets, resulting from OCR errors, abridgement, news aggregators, plagiarism, or machine
translation, to name a few reasons. Such noise can leave near-duplicate documents with low N-gram
similarity. Among duplicated article pairs in the NEWS-COPY test set, the average Jaccard
similarity using 3-grams (4-grams, 5-grams) between pairs of reproduced articles is 30% (26%,
23%). Moreover, 19% of duplicates share no 10-grams and 31% share no 15-grams, often as a result
of minor text noise. Neural methods are plausibly more robust.
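To make this overlap measure concrete, the following minimal Python sketch (illustrative, not the implementation used for the statistics above) computes word N-gram Jaccard similarity between two documents; the function name and example strings are placeholders.

```python
# Minimal sketch of word N-gram Jaccard similarity; names are illustrative.
def ngram_jaccard(text_a: str, text_b: str, n: int = 3) -> float:
    """Jaccard similarity between the sets of word n-grams of two documents."""
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    a, b = ngrams(text_a), ngrams(text_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# A single OCR error ("may0r") breaks every 3-gram that contains the corrupted token.
print(ngram_jaccard("the mayor spoke to reporters on monday",
                    "the may0r spoke to reporters on monday", n=3))  # ~0.43
```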
Using the NEWS-COPY dataset, we examine text de-duplication methods that vary along two key
dimensions: whether the method is neural, and its computational cost. Drawing inspiration from
work on semantic textual similarity and on retrieval, we develop two approaches to neural text
de-duplication: a contrastively trained bi-encoder combined with clustering, and a 'reranking'-style
method that uses a computationally cheap transformer bi-encoder to measure the pairwise similarity
between all articles and then passes each article's nearest neighbors to a cross-encoder, at additional
computational cost. We also examine N-gram overlap and locality-sensitive hashing (LSH), the
latter of which is highly scalable. The neural methods significantly outperform the non-neural
approaches. The Adjusted Rand Index (ARI) for the re-rank model is 93.7 and for the bi-encoder
model is 91.5, versus 73.7 for LSH and 75.0 for N-gram overlap.
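To illustrate the two-stage design, the sketch below uses off-the-shelf sentence-transformers checkpoints as stand-ins for the trained models described in this paper; the model names, the number of retrieved neighbors, and the 0.9 duplicate threshold are illustrative placeholders rather than the settings used in our experiments.

```python
# Hedged sketch of a reranking-style de-duplication pipeline: bi-encoder retrieval,
# cross-encoder rescoring, then clustering by connected components.
import networkx as nx
from sentence_transformers import SentenceTransformer, CrossEncoder, util

articles = [
    "Senator Smith spoke in Washington on Tuesday about the farm bill.",
    "Senator Sm1th spoke in Washington on Tuesday about the farm bill.",  # noisy near-duplicate
    "Local council approves new bridge construction downtown.",
]

# 1) Cheap bi-encoder: embed every article and retrieve its nearest neighbors.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint
embeddings = bi_encoder.encode(articles, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(embeddings, embeddings, top_k=10)

# 2) Costlier cross-encoder: rescore only each article's candidate neighbors.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")  # placeholder checkpoint
candidate_pairs = [(i, h["corpus_id"]) for i, hs in enumerate(hits)
                   for h in hs if h["corpus_id"] != i]
scores = cross_encoder.predict([(articles[i], articles[j]) for i, j in candidate_pairs])

# 3) Cluster: connected components over pairs scored above the duplicate threshold.
graph = nx.Graph()
graph.add_nodes_from(range(len(articles)))
graph.add_edges_from(pair for pair, score in zip(candidate_pairs, scores) if score > 0.9)
clusters = list(nx.connected_components(graph))  # e.g. {0, 1} and {2}
```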
While the primary advantage of hashing, and a central motivation for its frequent use, is its
scalability, massive-scale similarity search (Johnson et al., 2019) is sufficiently cheap on modern
GPUs to make neural de-duplication highly scalable as well. Using our contrastively trained bi-encoder
and a single NVIDIA 40GB A6000 GPU, we de-duplicate a 10 million document, 19 GB corpus
in 11 hours and 45 minutes. While this cost is already marginal in the context of working with large
text corpora, it could be reduced significantly further by using a lighter-weight language model,
since the majority of the time is spent embedding the 10M articles.
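As a concrete illustration of this nearest-neighbor search step, the sketch below uses FAISS (Johnson et al., 2019); the embedding dimension, corpus size, and k are placeholders, and random vectors stand in for the bi-encoder embeddings.

```python
# Hedged sketch: GPU-scalable nearest-neighbor search over article embeddings with FAISS.
import faiss
import numpy as np

d = 768                                                   # embedding dimension (placeholder)
embeddings = np.random.rand(10_000, d).astype("float32")  # stand-in for bi-encoder output
faiss.normalize_L2(embeddings)                            # normalize so inner product = cosine similarity

index = faiss.IndexFlatIP(d)                              # exact inner-product search
if hasattr(faiss, "index_cpu_to_all_gpus"):               # available in the faiss-gpu build
    index = faiss.index_cpu_to_all_gpus(index)
index.add(embeddings)

# Retrieve each article's nearest neighbors; pairs above a similarity threshold
# become edges in the duplicate graph, as in the clustering step sketched above.
scores, neighbors = index.search(embeddings, 11)          # k=11: the self-match plus 10 neighbors
```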
Our neural de-duplication models, publicly available at https://github.com/dell-research-harvard/NEWS-COPY,
can be applied to novel de-duplication problems. To evaluate off-the-shelf performance, we apply
our bi-encoder model to two subsets of C4 (the Colossal Clean Crawled Corpus), a massive dataset
created by applying a series of filters to a single snapshot of Common Crawl (Raffel et al., 2019;
Dodge et al., 2021): RealNews, which consists of around 13 million digital news articles, and all
90,671 patents scraped from Google's online patent database.
We also examine test set leakage between SuperGLUE (Wang et al., 2019) and RealNews. While
there is no unbiased ground truth measure for these datasets, an analysis of predicted duplicates
shows that the bi-encoder detects a variety of noisy duplicates that hashing overlooks, resulting
from digital news aggregators, machine translation, and other sources of noise.
The rest of this paper is organized as follows: Section 2 provides an overview of the relevant liter-
ature. Section 3 describes the NEWS-COPY dataset, and Section 4 develops neural de-duplication
methods and their non-neural comparisons. Section 5 evaluates the performance of different de-
duplication methods, Section 6 explores scaling, and Section 7 applies de-duplication to a subset of
C4. Finally, Section 8 concludes.