
2 Cynical Data Selection
Methods for data selection for language-related
tasks have been widely studied, usually to select
in-domain data (Axelrod et al., 2011; van der Wees
et al., 2017; Dai et al., 2020; Killamsetty et al.,
2020). One such method is Cynical Data Selection
(Axelrod, 2017). The intuition behind cynical selection
is to greedily rank sentences from the text
corpus, scoring each against text representative
of the target domain based on how much
information is gained by selecting it.
Concretely, given representative text from the
target domain, cynical selection uses the cross-entropy
of the selected text against the representative
text and calculates the information gain of each
sentence in the general corpus. It then picks the
sentence that is most useful relative to what has already
been selected and most similar to the representative
text. This leads to a bias toward shorter sentences
and a preference for sentences that contain words
with high probability in the representative text.
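The greedy loop can be made concrete with a minimal sketch. This is an illustration under simplifying assumptions (a unigram model with add-one smoothing; the exact estimator is given in Axelrod (2017) and in Appendix B), not the released implementation:

```python
import math
from collections import Counter

def cynical_select(candidates, representative, budget):
    """Greedily pick up to `budget` sentences (token lists) from
    `candidates` that most reduce the cross-entropy of the
    representative text under a unigram model of the selected text."""
    rep_counts = Counter(w for sent in representative for w in sent)
    rep_total = sum(rep_counts.values())
    p_rep = {w: c / rep_total for w, c in rep_counts.items()}

    sel_counts = Counter()          # token counts of the selected text
    sel_total = 0
    pool = list(candidates)
    selected = []

    for _ in range(budget):
        best, best_delta = None, float("inf")
        for i, sent in enumerate(pool):
            if sent is None:
                continue
            # Length penalty: every added word dilutes the selected model,
            # which biases the method toward shorter sentences.
            penalty = math.log((sel_total + len(sent) + 1) / (sel_total + 1))
            # Gain: words frequent in the representative text reduce its
            # cross-entropy the most (add-one smoothing avoids log(0)).
            gain = 0.0
            for w, c in Counter(sent).items():
                if w in p_rep:
                    gain -= p_rep[w] * math.log(
                        (sel_counts[w] + c + 1) / (sel_counts[w] + 1))
            if penalty + gain < best_delta:
                best, best_delta = i, penalty + gain
        if best is None:
            break                   # pool exhausted
        sent, pool[best] = pool[best], None
        selected.append(sent)
        sel_counts.update(sent)
        sel_total += len(sent)
    return selected
```

The two terms mirror the biases noted above: the penalty grows with sentence length, and the gain rewards words that are probable in the representative text.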
Our work extends cynical selection to document-level
selection. Sentences are still scored
at the sentence level, but the average sentence-level
gain determines the information gain of a document
(a formal explanation of cynical selection and its
extension is in Appendix B). We demonstrate its advantages
in efficiently selecting documents related to the target domain.
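As a hypothetical sketch of this extension (reusing a sentence-level scorer such as the one above; the function name is ours):

```python
# Document-level extension: sentences are scored individually and a
# document's information gain is the mean of its sentence-level gains
# (lower = more useful, matching the sentence-level delta).
def score_document(doc_sentences, sentence_delta):
    """`sentence_delta` maps a tokenized sentence to its selection score."""
    deltas = [sentence_delta(sent) for sent in doc_sentences]
    return sum(deltas) / len(deltas)
```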
3 Experiments and Results
In this work, we set OntoNotes 5.0 (Weischedel
et al., 2013) as our target corpus, and we use a
smaller sample from the training corpus of the
CoNLL 2012 Shared Task (Pradhan et al., 2012)
as the representative corpus for data selection. We
first train an encoder on the selected data and
use the Edge Probing suite (Tenney et al., 2019b)
for downstream task evaluation, a suite that has
previously been used to probe and evaluate language
models (Clark et al., 2019; Tenney et al., 2019a;
Jiang et al., 2020; Zhang et al., 2021).
3.1 Data Selection
Dataset
We adopt the Pile (Gao et al., 2021) for
data selection, which consists of 1250GB of text from
22 domains. Cynical selection naturally prefers
text that resembles the target corpus. To make the
comparison fairer, we exclude 100GB of data
from "DM Mathematics" and "Github" to eliminate
the noise of non-text data in the random selection.
Figure 2: Validation perplexity on the held-out set (left) and OntoNotes (right) at 100k training steps.
Selection Strategy
Encoder pretraining is naturally
a document-level task, as context contributes
critically to improved representations. Thus, we
extend sentence selection to document selection
to achieve better-contextualized representations
at the pretraining stage (we unsurprisingly find that
selection at the document level works better than at
the sentence level; see Appendix A). We apply
our extended document-level cynical selection to
the Pile and extract the top {0.5%, 1%, 2%, 5%}
scored documents; our code repository is publicly
available at https://github.com/jsedoc/DL-CynDS.
We also randomly sample the
same percentage of documents from the Pile to use as
a corresponding baseline. As a baseline for manual
selection, we use 30GB of text from the "Wikipedia" and
"BookCorpus" subsets, following Liu et al. (2019).
3.2 Encoder Pretraining
We set up a BERT-base model and follow the
pretraining objective and settings described in
RoBERTa (Liu et al., 2019); we adopt the training
scripts from FairSeq for encoder pretraining
(https://github.com/facebookresearch/fairseq).
In Figure 2, we plot
the validation perplexity on both the representative
corpus (CoNLL 2012 Shared Task) and a held-out
set of the Pile. The perplexity on the held-out set
decreases with more training data for both
cynical and random selection. Cynical selection
attains a higher perplexity, which shows that while
the selected documents are more adapted to the
target domain, they are not better adapted to the general
corpus. As each encoder needs a different number of
training steps for different corpus sizes, we aim for a
fair comparison by assuming a fixed training budget
of 100k update steps. In Figure 2, we find that
at 100k steps, 2% of the cynically selected data
achieves the lowest perplexity, and more training
data does not help the adaptation to the target corpus.
Also, cynically selected documents consistently
outperform the random selection, demonstrating
the effectiveness of adapting to the target domain.
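The perplexity curves in Figure 2 can be reproduced, in spirit, by exponentiating the mean token-level cross-entropy on each validation set. The sketch below assumes a Hugging Face-style masked-LM interface for brevity (the actual training uses FairSeq scripts), so it reports perplexity over the masked positions only:

```python
import math
import torch

def perplexity(model, batches):
    """`batches` yield (input_ids, labels); returns exp(mean NLL per token).
    Positions with label -100 are ignored, per the HF masking convention."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for input_ids, labels in batches:
            out = model(input_ids, labels=labels)  # HF-style masked-LM call
            n = (labels != -100).sum().item()      # number of scored tokens
            total_nll += out.loss.item() * n       # un-average the batch loss
            total_tokens += n
    return math.exp(total_nll / total_tokens)
```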