Xu and Zhao, 2021; Wang et al., 2020; Guu et al., 2020). For example, masked language modeling (MLM) (Devlin et al., 2019) replaces some input tokens in a sentence with a special symbol. BART uses token deletion, text infilling, and sentence permutation for corruption (Lewis et al., 2020a).
2) Denoising enables a language model to
predict missing or otherwise corrupted tokens in
the input sequences. Recent studies focus on
designing improved language modeling functions
to mitigate discrepancies between the pre-training
phase and the fine-tuning phase. Yang et al. (2019) reformulate MLM in XLNet by restoring the permuted tokens in factorization order, such that the input sequence is generated autoregressively after permutation. In addition, masking with synonyms (Cui et al., 2020) and simple pre-training objectives based on token-level classification tasks (Yamaguchi et al., 2021) have also proved to be effective alternatives to MLM.
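To make the ennoising/denoising split concrete, the sketch below shows the ennoising side of MLM as random token masking; the denoising side then trains the model to recover the tokens at the recorded positions. The 15% masking rate, the [MASK] symbol, and the helper name are illustrative choices, not the exact recipe of any particular model.

```python
import random

MASK = "[MASK]"

def mlm_ennoise(tokens, mask_prob=0.15, seed=None):
    """Corrupt a token sequence by replacing a random subset with [MASK].

    Returns the corrupted sequence together with (position, original token)
    pairs, which serve as the denoising targets.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK
            targets.append((i, tok))
    return corrupted, targets

# Different random draws corrupt the same sentence to different degrees:
sentence = "the movie is not worth watching at all".split()
print(mlm_ennoise(sentence, seed=0))
print(mlm_ennoise(sentence, seed=3))
```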
Most existing studies of PrLMs focus either on better ennoising operations or on more effective denoising strategies. They treat training instances equally throughout the training process, and little attention is paid to the individual contribution of those instances. In standard MLM ennoising, randomly masking different tokens leads to different degrees of corruption, which in turn cause different levels of difficulty in sentence restoration during denoising (as shown in Figure 1) and increase the uncertainty of recovering the original sentence structure. For example, if “not” is masked, the corrupted sentence tends to take on the opposite meaning.
In this work, we are motivated to estimate the complexity of restoring the original sentences from corrupted ones in language model pre-training, so as to provide explicit regularization signals that encourage more effective and robust pre-training. Our approach includes two penalty terms: 1) an ennoising corruption penalty, which measures the distribution disparity between the corrupted sentence and the original sentence and thus captures the degree of corruption introduced in the ennoising process; 2) a denoising prediction penalty, which measures the distribution difference between the restored sequence and the original sentence and thus captures the sentence-level prediction confidence on the denoising side. Experiments show that language models trained with our regularization terms yield better performance and become more robust against adversarial attacks.
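The precise formulation of these penalties is given later; purely as a hypothetical sketch of the idea, both terms can be viewed as divergences between sentence-level distributions. The bag-of-words representation, the KL divergence, the smoothing constant, and the helper names below are illustrative assumptions rather than the formulation used in this work.

```python
import torch

def bow_distribution(token_ids, vocab_size, eps=1e-8):
    """Smoothed bag-of-words distribution of a token sequence over the vocabulary."""
    counts = torch.bincount(token_ids, minlength=vocab_size).float() + eps
    return counts / counts.sum()

def kl_divergence(p, q):
    """KL(p || q) for two dense, already-smoothed probability vectors."""
    return (p * (p / q).log()).sum()

def corruption_penalty(original_ids, corrupted_ids, vocab_size):
    # Ennoising side: how far the corrupted sentence drifts from the original.
    return kl_divergence(bow_distribution(original_ids, vocab_size),
                         bow_distribution(corrupted_ids, vocab_size))

def prediction_penalty(pred_probs, original_ids, vocab_size):
    # Denoising side: how far the restored sequence (here, the model's averaged
    # predictive distribution of shape [seq_len, vocab_size]) is from the original.
    q = pred_probs.mean(dim=0)
    q = (q + 1e-8) / (q + 1e-8).sum()
    return kl_divergence(bow_distribution(original_ids, vocab_size), q)
```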
2 Related Work
Training powerful large-scale language models on large unlabeled corpora with self-supervised objectives has attracted much attention; such pre-training commonly works in two procedures, ennoising and denoising. The most representative pre-training task is MLM, which was introduced by Devlin et al. (2019) to pre-train the bidirectional BERT. A spectrum of ennoising extensions has been proposed to further enhance MLM and alleviate its potential drawbacks; they fall into two categories: 1) mask units and 2) noising schemes. Mask units correspond to the language
modeling units that serve as knowledge carriers
in different granularity. The variants focusing on mask units include the standard subword masking (Devlin et al., 2019), span masking (Joshi et al., 2020), and n-gram masking (Levine et al., 2021; Li and Zhao, 2021). For the noising scheme, BART (Lewis et al., 2020a) corrupts text with arbitrary noising functions, including token deletion, text infilling, and sentence permutation, in conjunction with
MLM. UniLM (Dong et al., 2019) extends mask prediction to generation tasks by adding auto-regressive objectives. XLNet (Yang et al., 2019) proposes permuted language modeling to learn the dependencies among the masked tokens. MacBERT (Cui et al., 2020) suggests using similar words for masking. Yamaguchi et al. (2021) also investigate simple pre-training objectives based on token-level classification tasks as replacements for MLM, which are often computationally cheaper and yield performance comparable to MLM. In addition,
ELECTRA (Clark et al., 2020) proposes a novel training objective called replaced token detection, which is defined over all input tokens.
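As a rough sketch of how such a loss is shaped (omitting ELECTRA's generator that produces the replacements, and using illustrative names), replaced token detection reduces to a per-position binary classification:

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, input_ids, original_ids):
    """Replaced token detection: one binary decision per input position.

    disc_logits: per-token logits from a discriminator head, shape [seq_len].
    input_ids / original_ids: the (possibly replaced) input tokens and the
    original tokens; a position is labeled 1 if its token was replaced, else 0.
    """
    labels = (input_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(disc_logits, labels)
```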
Although the above studies have thoroughly investigated how to reduce the mismatch between pre-training and fine-tuning tasks, an essential problem of the common denoising mechanism has received little attention. Constructing training examples through ennoising operations breaks the sentence structure, whether the noising functions are replacement-, addition-, or deletion-based. In extreme cases, the destruction can lead to completely different sentences, making it difficult for the model to predict the corrupted