Automatic Document Selection for Efficient Encoder Pretraining

Yukun Feng¹  Patrick Xia¹  Benjamin Van Durme¹  João Sedoc²
¹Johns Hopkins University
²New York University
{yfeng55, paxia, vandurme}@jhu.edu, jsedoc@stern.nyu.edu
Abstract

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
1 Introduction

Large pretrained language models have achieved state-of-the-art performance in NLP tasks (Devlin et al., 2019; Liu et al., 2019, i.a.). These studies find that increasing pretraining data size usually leads to better task performance. For many tasks, additional task (in-domain) data helps improve the performance further (Gururangan et al., 2020; Dery et al., 2021; Li et al., 2022). Several studies have found that directly pretraining on task data is more effective: science texts (Beltagy et al., 2019), tweets (Nguyen et al., 2020), legal texts (Chalkidis et al., 2020), or code (Tabassum et al., 2020; Chen et al., 2021). Notably, these domains are known a priori, and identifying data sources for curation is straightforward. In other instances where the domain is less clear, like "offensive online content" (Bai et al., 2021), more complicated data sampling is employed to guess at the desired data distribution suitable for training a downstream classifier.
To address such scenarios, we propose automatically identifying relevant domain-specific training data from a large corpus and subsequently pretraining a model on the selected data. Specifically, we use Cynical Data Selection (Axelrod, 2017), an approach that advanced Moore-Lewis sampling (Moore and Lewis, 2010), to select data from the Pile dataset (Gao et al., 2021). This automatic selection method can include possibly overlooked yet relevant documents from domains that may not be too close to the target domain. Figure 1 illustrates this method, which achieves higher performance on tasks in the target domain by using only 2.5GB (0.5%) of cynically selected data.

Figure 1: This figure highlights the efficiency of the automatic cynical selection of documents in the target domain. Scores are averaged from 8 Edge Probing tasks, comparing the full Pile (~1250GB), random selection (~60GB), manual selection (~30GB), and cynical selection (~2.5GB). Cynically selected 2.5GB data achieves the best score.
Specifically, we experiment with pretraining encoders with varying amounts of data sampled from the Pile.[1] With our "target corpus" of OntoNotes (Weischedel et al., 2013), we compare language models trained with cynical and random selection at various data levels. We find that the cynically selected encoder achieves consistently lower target corpus perplexity than one trained with random selection. We further finetune the encoders on a suite of tasks, some of which are derived from OntoNotes. Again, we find that models pretrained with cynical selection perform best. We suggest this as a viable method for inexpensively pretraining effective domain-specific encoders.
[1] The Pile consists of 800GB raw text, but for this paper we refer to its "effective" size, which is 1250GB.
2 Cynical Data Selection
Methods for data selection for language-related tasks have been widely studied, usually to select in-domain data (Axelrod et al., 2011; van der Wees et al., 2017; Dai et al., 2020; Killamsetty et al., 2020). One such method is Cynical Data Selection (Axelrod, 2017). The intuition behind cynical selection is to greedily rank sentences from the text corpus based on their scores computed against text representative of the target domain, that is, based on how much information is gained by selecting each sentence.
Concretely, given representative text from the target domain, cynical selection uses the cross-entropy of the selected text against the representative text and calculates the information gain of each sentence in the general corpus. It then picks the most useful sentence relative to what has already been selected and its similarity to the representative text. This also leads to a bias towards shorter sentences and a preference for sentences that contain words with high probability in the representative text.
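A sketch of this criterion, in the spirit of Axelrod (2017) (the formal treatment and the document-level extension are given in Appendix B): let $C_{n-1,v}$ and $W_{n-1}$ be the count of word $v$ and the total token count in the text selected so far, let $c_{s,v}$ and $w_s$ be the corresponding counts in a candidate sentence $s$, and let $P_{\mathrm{rep}}(v)$ be the unigram probability of $v$ in the representative text. Under a unigram model, adding $s$ changes the cross-entropy of the representative text by a length penalty plus a word gain,

\[
\Delta H_n(s) \;=\; \log\frac{W_{n-1}+w_s}{W_{n-1}} \;+\; \sum_{v \in s} P_{\mathrm{rep}}(v)\,\log\frac{C_{n-1,v}}{C_{n-1,v}+c_{s,v}},
\]

and the sentence with the lowest $\Delta H_n$ is selected next. The positive penalty term favors shorter sentences, while the non-positive gain term rewards words that are frequent in the representative text, matching the bias noted above.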
Our work extends cynical selection to document-level selection. Sentences are still scored at the sentence level, but the average sentence-level gain determines the information gain of a document.[2] We demonstrate its advantages in efficiently selecting documents related to the target domain.
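To make the document-level procedure concrete, the following is a minimal illustrative sketch: a greedy loop over documents with a unigram model. The variable names, the add-one smoothing, and the choice to seed the selected pool with the representative counts are simplifications for illustration, not the released implementation (https://github.com/jsedoc/DL-CynDS).

```python
# Minimal sketch of document-level cynical selection with a unigram model.
# Names, smoothing, and the seeding of the pool are illustrative assumptions,
# not the released implementation (https://github.com/jsedoc/DL-CynDS).
import math
from collections import Counter
from typing import Dict, List

def unigram_probs(sentences: List[str]) -> Dict[str, float]:
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sentence_delta(sentence: str, pool: Counter, pool_size: int,
                   rep_probs: Dict[str, float]) -> float:
    """Change in cross-entropy of the representative text against the
    selected-pool unigram model if `sentence` were added (lower is better)."""
    words = sentence.split()
    if not words:
        return 0.0
    penalty = math.log((pool_size + len(words)) / pool_size)   # length penalty
    gain = 0.0
    for w, c in Counter(words).items():
        seen = pool.get(w, 0) + 1                              # add-one smoothing
        gain += rep_probs.get(w, 0.0) * math.log(seen / (seen + c))
    return penalty + gain

def select_documents(documents: List[List[str]], rep_sentences: List[str],
                     n_docs: int) -> List[int]:
    """Greedily pick the documents with the lowest average sentence-level
    delta, updating the selected pool after every pick."""
    rep_probs = unigram_probs(rep_sentences)
    pool = Counter(w for s in rep_sentences for w in s.split())  # seed: assumption
    pool_size = sum(pool.values())
    chosen, remaining = [], set(range(len(documents)))
    for _ in range(min(n_docs, len(documents))):
        scores = {i: sum(sentence_delta(s, pool, pool_size, rep_probs)
                         for s in documents[i]) / max(len(documents[i]), 1)
                  for i in remaining}
        best = min(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
        for s in documents[best]:            # document-level update of the pool
            pool.update(s.split())
            pool_size += len(s.split())
    return chosen
```

Rescoring every remaining document after each pick is quadratic in the corpus size; a practical implementation would batch pool updates or rescore lazily, but the sketch mirrors the greedy, information-gain-driven selection described above.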
3 Experiments and Results
In this work, we set OntoNotes 5.0 (Weischedel et al., 2013) as our target corpus, and we use a smaller sample from the training corpus of the CoNLL 2012 Shared Task (Pradhan et al., 2012) as the representative corpus for data selection. We first train an encoder based on the selected data and use the Edge Probing suite (Tenney et al., 2019b) for the downstream task evaluation, which has previously been used to probe and evaluate language models (Clark et al., 2019; Tenney et al., 2019a; Jiang et al., 2020; Zhang et al., 2021).
3.1 Data Selection
Dataset
We adopt the Pile (Gao et al., 2021) for data selection, which consists of 1250GB of text from 22 domains. Cynical selection naturally prefers text data based on the target corpus. To make a fairer comparison, we exclude 100GB of data from "DM Mathematics" and "Github" to eliminate the noise of non-text data in random selection.
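For concreteness, this subset filtering can be done directly on the Pile's jsonl shards, which record the source under a `pile_set_name` metadata field; the snippet below is a sketch under that assumption, with placeholder file paths rather than the exact preprocessing used here.

```python
# Sketch: drop the "DM Mathematics" and "Github" subsets from a Pile shard.
# Assumes the public jsonl format with a meta["pile_set_name"] field;
# the file paths are placeholders.
import json

EXCLUDED_SUBSETS = {"DM Mathematics", "Github"}

def filter_pile_shard(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            if record.get("meta", {}).get("pile_set_name") not in EXCLUDED_SUBSETS:
                fout.write(line)

# e.g. filter_pile_shard("pile/00.jsonl", "pile/00.filtered.jsonl")
```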
[2] A formal explanation of cynical selection and its extension is in Appendix B.
Figure 2: Validation perplexity on the held-out set (left) and OntoNotes (right) at 100k training steps.
Selection Strategy
Encoder pretraining is naturally a document-level task, as context contributes critically to improved representations. Thus, we need to extend sentence selection into document selection to achieve better-contextualized representations at the pretraining stage.[3] We apply our extended document-level cynical selection to the Pile and extract the top {0.5%, 1%, 2%, 5%} scored documents.[4] We also randomly sample the same percentage of documents from the Pile to use as a corresponding baseline. As a baseline for manual selection, we use 30GB of text from the "Wikipedia" and "BookCorpus" subsets, following Liu et al. (2019).
3.2 Encoder Pretraining
We set up a BERT-base model and follow the pretraining objective and settings described in RoBERTa (Liu et al., 2019).[5] In Figure 2, we plot the validation perplexity on both the representative corpus (CoNLL 2012 Shared Task) and a held-out set of the Pile. The perplexity on the held-out set decreases with more training data for both the cynical and random selection. Cynical selection attains a higher perplexity, which shows that while the selected documents are more adapted to the target domain, they are not better adapted to the general corpus. As each encoder needs a different number of training steps for different corpus sizes, we aim for a fair comparison by assuming a fixed training budget of 100k update steps. In Figure 2, we find that at 100k steps, 2% of the cynically selected data achieves the lowest perplexity, and more training data does not help the adaptation to the target corpus. Also, cynically selected documents consistently outperform the random selection, demonstrating the effectiveness of adapting to the target domain.
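As a reminder of what these encoders are trained to do, the RoBERTa-style objective referenced above is masked language modeling with dynamic masking; below is a generic sketch of the standard 15% / 80-10-10 masking scheme, not the FairSeq internals used for our runs.

```python
# Generic sketch of RoBERTa-style dynamic masking: 15% of tokens are selected;
# of those, 80% become [MASK], 10% become a random token, 10% stay unchanged.
# Unselected positions get label -100 (the usual ignore index for
# cross-entropy in PyTorch), so the loss covers only masked positions.
import random
from typing import List, Optional, Tuple

def dynamic_mask(token_ids: List[int], mask_id: int, vocab_size: int,
                 mask_prob: float = 0.15,
                 rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                            # predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```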
[3] We unsurprisingly find that selection at the document level works better than at the sentence level (Appendix A).
[4] Our code repository is publicly available at https://github.com/jsedoc/DL-CynDS.
[5] We adopt the training scripts from FairSeq for encoder pretraining: https://github.com/facebookresearch/fairseq.