Instance Regularization for Discriminative Language Model Pre-training
Zhuosheng Zhang1,2*, Hai Zhao1,2, Ming Zhou3
1Department of Computer Science and Engineering, Shanghai Jiao Tong University
2Key Laboratory of Shanghai Education Commission for Intelligent Interaction
and Cognitive Engineering, Shanghai Jiao Tong University
3Langboat Technology
zhangzs@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn, zhouming@chuangxin.com
Abstract
Discriminative pre-trained language models
(PrLMs) can be generalized as denoising
auto-encoders that work with two procedures,
ennoising and denoising. First, an ennoising
process corrupts texts with arbitrary noising
functions to construct training instances. Then,
a denoising language model is trained to
restore the corrupted tokens. Existing
studies have made progress by optimizing
independent strategies of either ennoising or
denoising. They treat training instances equally throughout the training process, paying little attention to the individual contribution of
those instances. To model explicit signals
of instance contribution, this work proposes
to estimate the complexity of restoring the
original sentences from corrupted ones in
language model pre-training. The estimations
involve the corruption degree in the ennoising
data construction process and the prediction confidence in the denoising counterpart.
Experimental results on natural language
understanding and reading comprehension
benchmarks show that our approach improves
pre-training efficiency, effectiveness, and robustness. Code is publicly available at https://github.com/cooelf/InstanceReg.
1 Introduction
Leveraging self-supervised objectives to pre-train
language models (PrLMs) on massive unlabeled
data has shown success in natural language
processing (NLP) (Peters et al.,2018;Radford
et al.,2018;Devlin et al.,2019;Dong et al.,2019;
Lan et al.,2020;Clark et al.,2020;Luo et al.,
2021;Zhu et al.,2022). A wide landscape of
pre-training objectives has been produced, such
as autoregressive (Radford et al.,2018;Yang
et al.,2019) and autoencoding (Devlin et al.,2019;
* Work done during internship at Langboat. This work
was partially supported by Key Projects of National Natural
Science Foundation of China (U1836222 and 61733011).
[Figure 1: Overview of denoising auto-encoders, showing an original sentence, two corrupted (ennoised) variants, e.g., "[MASK] cute dog [MASK] playing on the [MASK] ..." and "A cute [MASK] is [MASK] on the [MASK] ...", and the restored sentence produced by discriminative training. As the two examples show, the random sampling operation during ennoising would result in training instances of different degrees of difficulty, e.g., in the variety of valid alternatives.]
Joshi et al.,2020) language modeling objectives,
which serve as the principled mechanisms to
teach language models general-purpose knowledge through pre-training, and then those pre-trained PrLMs can be fine-tuned for downstream
tasks. Based on these unsupervised functions,
three classes of PrLMs have been proposed: autoregressive language models (e.g., GPT (Radford et al., 2018)), autoencoding models (e.g., BERT (Devlin et al., 2019)), and encoder-decoder models (e.g., BART (Lewis et al., 2020a) and T5 (Raffel et al., 2020)). In this work,
we focus on the research line of autoencoding
models, also known as discriminative PrLMs, which
have achieved impressive performance on natural
language understanding (NLU).
Although the discriminative PrLMs may vary
in language modeling functions or architectures
as discussed above, they can be generalized
as denoising auto-encoders, which contain two
procedures, ennoising and denoising. The pre-
training procedure is illustrated in Figure 1.
1) Ennoising corrupts texts with arbitrary noising
functions to construct training instances. The
corruption scheme includes edit operations like
insertion, deletion, replacement, permutation, and
retrieval (Devlin et al.,2019;Lewis et al.,2020b;
Xu and Zhao,2021;Wang et al.,2020;Guu et al.,
2020). For example, masked language modeling
(MLM) (Devlin et al.,2019) replaces some input
tokens in a sentence with a special symbol. BART
uses token deletion, text infilling, and sentence
permutation for corruption (Lewis et al.,2020a).
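To make these corruption schemes concrete, here is a minimal Python sketch of three simplified noising functions; the function names, the 15% ratio, and the uniform sampling are our own illustrative assumptions, not the exact implementations of BERT or BART.

```python
import random

def mask_tokens(tokens, ratio=0.15, mask="[MASK]"):
    """MLM-style replacement: swap a random subset of tokens for [MASK]."""
    k = max(1, int(len(tokens) * ratio))
    chosen = set(random.sample(range(len(tokens)), k))
    return [mask if i in chosen else t for i, t in enumerate(tokens)]

def delete_tokens(tokens, ratio=0.15):
    """BART-style token deletion: drop a random subset of tokens."""
    k = max(1, int(len(tokens) * ratio))
    dropped = set(random.sample(range(len(tokens)), k))
    return [t for i, t in enumerate(tokens) if i not in dropped]

def permute_sentences(sentences):
    """BART-style sentence permutation: shuffle the order of sentences."""
    return random.sample(sentences, len(sentences))
```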
2) Denoising enables a language model to
predict missing or otherwise corrupted tokens in
the input sequences. Recent studies focus on
designing improved language modeling functions
to mitigate discrepancies between the pre-training
phase and the fine-tuning phase. Yang et al. (2019) reformulate MLM in XLNet by restoring the permuted tokens in factorization order, such that the input sequence is generated autoregressively after permutation. In addition, using synonyms
for the masking purpose (Cui et al.,2020) and
simple pre-training objectives based on token-level
classification tasks (Yamaguchi et al.,2021) have
also proved effective as an MLM alternative.
Most of the existing studies of PrLMs fall into
the scope of either investigating better ennoising
operations or more effective denoising strategies.
They treat training instances equally throughout
the training process. Little attention is paid
to the individual contribution of those instances.
In standard MLM ennoising, randomly masking different tokens leads to different degrees of corruption, which in turn cause different levels of difficulty in sentence restoration during denoising (as shown in Figure 1) and increase the uncertainty in recovering the original sentence structure. For
example, if “not” is masked, the corrupted sentence
tends to have a contrary meaning.
In this work, we are motivated to estimate
the complexity of restoring the original sentences
from corrupted ones in language model pre-
training, to provide explicit regularization signals
to encourage more effective and robust pre-training.
Our approach introduces two penalties: 1) an ennoising corruption penalty that measures the distribution disparity between the corrupted sentence and the original sentence, capturing the corruption degree in the ennoising process; and 2) a denoising prediction penalty that measures the distribution difference between the restored sequence and the original sentence, capturing the sentence-level prediction confidence in the denoising counterpart. Experiments show that
language models trained with our regularization
terms can yield better performance and become
more robust against adversarial attacks.
2 Related Work
Training powerful large-scale language models on large unlabeled corpora with self-supervised objectives has attracted much attention; such models commonly work in the two procedures of ennoising and denoising. The most representative task
for pre-training is MLM, which was introduced by Devlin et al. (2019) to pre-train a bidirectional
BERT. A spectrum of ennoising extensions has
been proposed to enhance MLM further and
alleviate the potential drawbacks, which fall into
two categories: 1) mask units and 2) noising
scheme. Mask units correspond to the language
modeling units that serve as knowledge carriers
at different granularities. The variants focusing on mask units include standard subword masking (Devlin et al., 2019), span masking (Joshi et al., 2020), and n-gram masking (Levine et al., 2021; Li and Zhao, 2021). For the noising scheme, BART
(Lewis et al.,2020a) corrupts text with arbitrary
noising functions, including token deletion, text infilling, and sentence permutation, in conjunction with
MLM. UniLM (Dong et al.,2019) extends the
mask prediction to generation tasks by adding
the auto-regressive objectives. XLNet (Yang
et al.,2019) proposes the permuted language
modeling to learn the dependencies among the
masked tokens. MacBERT (Cui et al.,2020)
suggests using similar words for the masking
purpose. Yamaguchi et al. (2021) also investigate simple pre-training objectives based on token-level classification tasks as replacements for MLM, which
are often computationally cheaper and result in
comparable performance to MLM. In addition,
ELECTRA (Clark et al.,2020) proposes a novel
training objective called replaced token detection,
which is defined over all input tokens.
Although the above studies thoroughly investigate ways to reduce the mismatch between pre-training and fine-tuning tasks, an essential problem of the common denoising mechanism lacks attention. Constructing training examples with ennoising operations breaks sentence structure, whether through replacement-, addition-, or deletion-based noising functions. In extreme cases, the destruction would
lead to completely different sentences, making it
difficult for the model to predict the corrupted
tokens. Therefore, in this work, we propose to
enhance the pre-training quality by using instance
regularization (IR) terms to estimate the restoration
complexity from both the ennoising and the denoising sides.
The proposed approach is partially related to
some prior studies of hardness measurement in
training deep learning models (Lin et al.,2017;
Kalantidis et al.,2020;Hao et al.,2021), whose
focus is to guide the model to pay special attention
to hard examples and prevent the vast number of
easy negatives from overwhelming the training
process. In contrast to optimizing the training
process by heuristically finding the hard negatives,
this work does not need to distinguish hard
examples from ordinary ones; instead, it measures the corruption degree between the masked sentence and the original sentence and uses that degree as an explicit training signal.
3 Methodology
This section will start by formulating the ennoising
and denoising processes for building PrLMs and
then introduce our instance regularization approach
to estimate the restoration complexity in both
ennoising and denoising views.
3.1 Preliminary: Denoising Auto-Encoders
The training procedure for discriminative language
models includes ennoising and denoising processes,
as described below. For the sake of simplicity, we
take the widely-used MLM as a typical example to
describe the ennoising process.
Ennoising
Given a sentence $W = \{w_1, w_2, \ldots, w_n\}$ with $n$ tokens (we assume that $W$ has already been tokenized into a sequence of subwords), we randomly mask some percentage of the input tokens with a special mask symbol $\texttt{[MASK]}$ and then predict those masked tokens. Suppose that $m$ tokens are replaced by the mask symbol. Let $D = \{k_1, k_2, \ldots, k_m\}$ denote the set of masked positions; then $W'$ is the masked sentence and $M = \{w_{k_1}, w_{k_2}, \ldots, w_{k_m}\}$ are the masked tokens. In the following, we use $w_k$ to denote each masked token for simplicity.
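A minimal sketch of this construction follows; the helper name and uniform sampling are our own simplifications (BERT's actual ennoising additionally replaces some chosen tokens with random tokens or keeps them unchanged).

```python
import random

def ennoise(W, m, mask="[MASK]"):
    """Build one training instance in the notation above: sample the set D
    of m masked positions, corrupt W into W', and record the masked tokens M."""
    D = sorted(random.sample(range(len(W)), m))
    masked = set(D)
    W_prime = [mask if i in masked else w for i, w in enumerate(W)]
    M = [W[k] for k in D]
    return W_prime, D, M

# Example: one possible corruption of a five-token sentence.
W = ["a", "cute", "dog", "is", "playing"]
W_prime, D, M = ennoise(W, m=2)
```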
Denoising
In the denoising process, a language model is trained to predict the masked tokens based on the context. $W'$ is fed into the PrLM to obtain the contextual representations from the last Transformer layer, denoted as $H$.
Training
The model is trained to maximize the following objective:

$\mathcal{L}_{\mathrm{DAE}} = \frac{1}{m} \sum_{k \in D} \log p(w_k \mid W'). \quad (1)$
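In practice, one minimizes the negative of objective (1). Below is a minimal PyTorch sketch for a single sequence; the function name and tensor layout are our own assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, target_ids, masked_positions):
    """Negative of objective (1): average -log p(w_k | W') over the m masked
    positions in D, ignoring unmasked positions.
    logits           : [n, vocab_size] output scores of the PrLM for W'
    target_ids       : [n] token ids of the original sentence W
    masked_positions : [m] LongTensor with the indices in D
    """
    log_probs = F.log_softmax(logits, dim=-1)   # log p(. | W')
    gold = target_ids[masked_positions]         # w_k for each k in D
    picked = log_probs[masked_positions, gold]  # log p(w_k | W')
    return -picked.mean()                       # minimize -L_DAE
```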
3.2 Instance Regularization
In this part, we will introduce our instance
regularization approach, which involves two
sides: corruption degree in the ennoising data
construction process and the sequence-level
prediction confidence in the denoising counterpart.
During denoising, the PrLM trained by MLM is required to predict the original masked tokens $w_k$ given the hidden states $H$ of the corrupted input $W'$. Let $w'_k$ denote the predicted tokens; we replace the mask symbols by filling $w'_k$ back into $W'$. As a result, we obtain the predicted sequence, denoted as $P = \{p_1, p_2, \ldots, p_n\}$, where the tokens at the positions in $D$ are the predicted ones and the others are the same as the original ones in $W$.
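As a small illustrative snippet (the names are ours), assembling $P$ amounts to scattering the predicted tokens back into their original positions:

```python
def restore(W, D, predicted):
    """Build P: predicted tokens w'_k at the positions in D, original tokens
    everywhere else (a sketch in the notation of this section)."""
    pred_at = dict(zip(D, predicted))  # k -> w'_k
    return [pred_at.get(i, w) for i, w in enumerate(W)]
```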
Obviously, the corruption breaks the sentence structure and can easily cause semantic deviation in the sentence representations. According to our observation, the hidden states vary dramatically before and after token corruption; similar findings were reported by Wang et al. (2021), who observed that small disturbances can inveigle PrLMs into making false predictions. From a more general perspective, replacing a modest percentage of tokens may result in a totally different sentence, let alone the imperceptible disturbances used for textual attacks.
Therefore, we propose two approaches called
ennoising corruption penalty (ECP) and denoising
prediction penalty (DPP) as the regularization
terms in the training process to alleviate the issue.
Figure 2 overviews the overall procedure. ECP measures the semantic shift from the original sentence to the corrupted one as an explicit signal that helps the model distinguish easy from hard examples and learn with different weights, which can be seen as instance weighting compared with MLM. As a complement, DPP measures the sequence-level semantic distance between the predicted and the original sentence to supplement the rough token-level matching of MLM, thus transforming the token prediction task into sequence matching so as to pay more attention to sentence-level semantics.
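The paper's exact penalty formulations follow later in this section; as a rough sketch under our own assumptions (mean-pooled hidden states as sentence representations and cosine distance as the disparity measure), ECP and DPP could be instantiated as:

```python
import torch
import torch.nn.functional as F

def sentence_repr(hidden_states):
    """Mean-pool token hidden states [n, d] into one sentence vector [d]
    (an assumed pooling choice, not necessarily the paper's)."""
    return hidden_states.mean(dim=0)

def ennoising_corruption_penalty(h_orig, h_corrupted):
    """ECP sketch: semantic shift from the original sentence W to the
    corrupted sentence W', using cosine distance as an assumed measure."""
    return 1.0 - F.cosine_similarity(
        sentence_repr(h_orig), sentence_repr(h_corrupted), dim=0)

def denoising_prediction_penalty(h_orig, h_predicted):
    """DPP sketch: sequence-level distance between the predicted sequence P
    and the original sentence W, complementing token-level MLM matching."""
    return 1.0 - F.cosine_similarity(
        sentence_repr(h_orig), sentence_repr(h_predicted), dim=0)

# An assumed combination with the MLM objective (alpha and beta are
# hypothetical weights): loss = mlm_loss + alpha * ecp + beta * dpp
```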