Automatic Document Selection for Efficient Encoder Pretraining

Yukun Feng¹  Patrick Xia¹  Benjamin Van Durme¹  João Sedoc²
¹Johns Hopkins University
²New York University
{yfeng55, paxia, vandurme}@jhu.edu, jsedoc@stern.nyu.edu
Abstract

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets by automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as a target domain and pretrain a RoBERTa-like encoder from a cynically selected subset of the Pile. On both perplexity and several downstream tasks in the target domain, it consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
1 Introduction

Large pretrained language models have achieved state-of-the-art performance in NLP tasks (Devlin et al., 2019; Liu et al., 2019, i.a.). These studies find that increasing pretraining data size usually leads to better task performance. For many tasks, additional task (in-domain) data helps improve the performance further (Gururangan et al., 2020; Dery et al., 2021; Li et al., 2022). Several studies have found that directly pretraining on task data is more effective: science texts (Beltagy et al., 2019), tweets (Nguyen et al., 2020), legal texts (Chalkidis et al., 2020), or code (Tabassum et al., 2020; Chen et al., 2021). Notably, these domains are known a priori, and identifying data sources for curation is straightforward. In other instances where the domain is less clear, like "offensive online content" (Bai et al., 2021), more complicated data sampling is employed to guess at the desired data distribution suitable for training a downstream classifier.
To address such scenarios, we propose automatically identifying relevant domain-specific training data from a large corpus and subsequently pretraining a model on the selected data. Specifically, we use Cynical Data Selection (Axelrod, 2017), an approach that advanced Moore-Lewis sampling (Moore and Lewis, 2010), to select data from the Pile dataset (Gao et al., 2021). This automatic selection method can include possibly overlooked yet relevant documents from domains that may not be too close to the target domain. Figure 1 illustrates this method, which achieves higher performance on tasks in the target domain by using only 2.5GB (0.5%) of cynically selected data.

Figure 1: This figure highlights the efficiency of the automatic cynical selection of documents in the target domain. Scores are averaged from 8 Edge Probing tasks, comparing the full Pile (~1250GB), random selection (~60GB), manual selection (~30GB), and cynical selection (~2.5GB). Cynically selected 2.5GB data achieves the best score.
Specifically, we experiment with pretraining encoders with varying amounts of data sampled from the Pile.[1] With our "target corpus" of OntoNotes (Weischedel et al., 2013), we compare language models trained with cynical and random selection at various data levels. We find that the cynically selected encoder achieves consistently lower target corpus perplexity than one trained with random selection. We further finetune the encoders on a suite of tasks, some of which are derived from OntoNotes. Again, we find that models pretrained with cynical selection perform best. We suggest this as a viable method for inexpensively pretraining effective domain-specific encoders.
[1] The Pile consists of 800GB raw text, but for this paper we refer to its "effective" size, which is 1250GB.
2 Cynical Data Selection
Methods for data selection for language-related tasks have been widely studied, usually to select in-domain data (Axelrod et al., 2011; van der Wees et al., 2017; Dai et al., 2020; Killamsetty et al., 2020). One such method is Cynical Data Selection (Axelrod, 2017). The intuition behind cynical selection is to greedily rank sentences from the text corpus based on their scores computed against text representative of the target domain, that is, based on how much information is gained by selecting each sentence.
Concretely, given representative text from the target domain, cynical selection uses the cross-entropy of the selected text against the representative text and calculates the information gain of each sentence in the general corpus. It then picks the most useful sentence relative to what has already been selected and its similarity to the representative text. This also leads to a bias towards shorter sentences and a preference for sentences that contain words with high probability in the representative text.
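A sketch of this criterion, in the spirit of Axelrod (2017) (the formal treatment and the document-level extension are given in Appendix B): let $C_{n-1,v}$ and $W_{n-1}$ be the count of word $v$ and the total token count in the text selected so far, let $c_{s,v}$ and $w_s$ be the corresponding counts in a candidate sentence $s$, and let $P_{\mathrm{rep}}(v)$ be the unigram probability of $v$ in the representative text. Under a unigram model, adding $s$ changes the cross-entropy of the representative text by a length penalty plus a word gain,

\[
\Delta H_n(s) \;=\; \log\frac{W_{n-1}+w_s}{W_{n-1}} \;+\; \sum_{v \in s} P_{\mathrm{rep}}(v)\,\log\frac{C_{n-1,v}}{C_{n-1,v}+c_{s,v}},
\]

and the sentence with the lowest $\Delta H_n$ is selected next. The positive penalty term favors shorter sentences, while the non-positive gain term rewards words that are frequent in the representative text, matching the bias noted above.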
Our work extends cynical selection to document-level selection. Sentences are still scored at the sentence level, but the average sentence-level gain determines the information gain of a document.[2] We demonstrate its advantages in efficiently selecting documents related to the target domain.
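To make the document-level procedure concrete, the following is a minimal illustrative sketch: a greedy loop over documents with a unigram model. The variable names, the add-one smoothing, and the choice to seed the selected pool with the representative counts are simplifications for illustration, not the released implementation (https://github.com/jsedoc/DL-CynDS).

```python
# Minimal sketch of document-level cynical selection with a unigram model.
# Names, smoothing, and the seeding of the pool are illustrative assumptions,
# not the released implementation (https://github.com/jsedoc/DL-CynDS).
import math
from collections import Counter
from typing import Dict, List

def unigram_probs(sentences: List[str]) -> Dict[str, float]:
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def sentence_delta(sentence: str, pool: Counter, pool_size: int,
                   rep_probs: Dict[str, float]) -> float:
    """Change in cross-entropy of the representative text against the
    selected-pool unigram model if `sentence` were added (lower is better)."""
    words = sentence.split()
    if not words:
        return 0.0
    penalty = math.log((pool_size + len(words)) / pool_size)   # length penalty
    gain = 0.0
    for w, c in Counter(words).items():
        seen = pool.get(w, 0) + 1                              # add-one smoothing
        gain += rep_probs.get(w, 0.0) * math.log(seen / (seen + c))
    return penalty + gain

def select_documents(documents: List[List[str]], rep_sentences: List[str],
                     n_docs: int) -> List[int]:
    """Greedily pick the documents with the lowest average sentence-level
    delta, updating the selected pool after every pick."""
    rep_probs = unigram_probs(rep_sentences)
    pool = Counter(w for s in rep_sentences for w in s.split())  # seed: assumption
    pool_size = sum(pool.values())
    chosen, remaining = [], set(range(len(documents)))
    for _ in range(min(n_docs, len(documents))):
        scores = {i: sum(sentence_delta(s, pool, pool_size, rep_probs)
                         for s in documents[i]) / max(len(documents[i]), 1)
                  for i in remaining}
        best = min(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
        for s in documents[best]:            # document-level update of the pool
            pool.update(s.split())
            pool_size += len(s.split())
    return chosen
```

Rescoring every remaining document after each pick is quadratic in the corpus size; a practical implementation would batch pool updates or rescore lazily, but the sketch mirrors the greedy, information-gain-driven selection described above.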
3 Experiments and Results
In this work, we set OntoNotes 5.0 (Weischedel et al., 2013) as our target corpus, and we use a smaller sample from the training corpus of the CoNLL 2012 Shared Task (Pradhan et al., 2012) as the representative corpus for data selection. We first train an encoder based on the selected data and use the Edge Probing suite (Tenney et al., 2019b) for the downstream task evaluation, which has previously been used to probe and evaluate language models (Clark et al., 2019; Tenney et al., 2019a; Jiang et al., 2020; Zhang et al., 2021).
3.1 Data Selection
Dataset
We adopt the Pile (Gao et al., 2021) for data selection, which consists of 1250GB of text from 22 domains. Cynical selection naturally prefers text data based on the target corpus. To make a fairer comparison, we exclude 100GB of data from "DM Mathematics" and "Github" to eliminate the noise of non-text data in random selection.
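For concreteness, this subset filtering can be done directly on the Pile's jsonl shards, which record the source under a `pile_set_name` metadata field; the snippet below is a sketch under that assumption, with placeholder file paths rather than the exact preprocessing used here.

```python
# Sketch: drop the "DM Mathematics" and "Github" subsets from a Pile shard.
# Assumes the public jsonl format with a meta["pile_set_name"] field;
# the file paths are placeholders.
import json

EXCLUDED_SUBSETS = {"DM Mathematics", "Github"}

def filter_pile_shard(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            if record.get("meta", {}).get("pile_set_name") not in EXCLUDED_SUBSETS:
                fout.write(line)

# e.g. filter_pile_shard("pile/00.jsonl", "pile/00.filtered.jsonl")
```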
[2] A formal explanation of cynical selection and its extension is in Appendix B.
Figure 2: Validation perplexity on the held-out set (left) and OntoNotes (right) at 100k training steps.
Selection Strategy
Encoder pretraining is naturally a document-level task, as context contributes critically to improved representations. Thus, we need to extend sentence selection into document selection to achieve better-contextualized representations at the pretraining stage.[3] We apply our extended document-level cynical selection to the Pile and extract the top {0.5%, 1%, 2%, 5%} scored documents.[4] We also randomly sample the same percentage of documents from the Pile to use as a corresponding baseline. As a baseline for manual selection, we use 30GB of text from the "Wikipedia" and "BookCorpus" subsets, following Liu et al. (2019).
3.2 Encoder Pretraining
We set up a BERT-base model and follow the pretraining objective and settings described in RoBERTa (Liu et al., 2019).[5] In Figure 2, we plot the validation perplexity on both the representative corpus (CoNLL 2012 Shared Task) and a held-out set of the Pile. The perplexity on the held-out set decreases with more training data for both the cynical and random selection. Cynical selection attains a higher perplexity, which shows that while the selected documents are more adapted to the target domain, they are not better adapted to the general corpus. As each encoder needs a different number of training steps for different corpus sizes, we aim for a fair comparison by assuming a fixed training budget of 100k update steps. In Figure 2, we find that at 100k steps, 2% of the cynically selected data achieves the lowest perplexity, and more training data does not help the adaptation to the target corpus. Also, cynically selected documents consistently outperform the random selection, demonstrating the effectiveness of adapting to the target domain.
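As a reminder of what these encoders are trained to do, the RoBERTa-style objective referenced above is masked language modeling with dynamic masking; below is a generic sketch of the standard 15% / 80-10-10 masking scheme, not the FairSeq internals used for our runs.

```python
# Generic sketch of RoBERTa-style dynamic masking: 15% of tokens are selected;
# of those, 80% become [MASK], 10% become a random token, 10% stay unchanged.
# Unselected positions get label -100 (the usual ignore index for
# cross-entropy in PyTorch), so the loss covers only masked positions.
import random
from typing import List, Optional, Tuple

def dynamic_mask(token_ids: List[int], mask_id: int, vocab_size: int,
                 mask_prob: float = 0.15,
                 rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    rng = rng or random.Random()
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                            # predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token
    return inputs, labels
```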
[3] We unsurprisingly find that selection at the document level works better than at the sentence level (Appendix A).
[4] Our code repository is publicly available at https://github.com/jsedoc/DL-CynDS.
[5] We adopt the training scripts from FairSeq for encoder pretraining: https://github.com/facebookresearch/fairseq.