N-gram Is Back: Residual Learning of Neural Text Generation
with n-gram Language Model
Huayang Li, Deng Cai, Jin Xu, Taro Watanabe
Nara Institute of Science and Technology; The Chinese University of Hong Kong;
Institute for Interdisciplinary Information Sciences, Tsinghua University
{li.huayang.lh6, taro}@is.naist.jp thisisjcykcd@gmail.com
xujin21@mails.tsinghua.edu.cn
Abstract
N-gram language models (LM) have been largely superseded by neural LMs as the latter exhibit better performance. However, we find that n-gram models can achieve satisfactory performance on a large proportion of testing cases, indicating that they have already captured abundant knowledge of the language at a relatively low computational cost. With this observation, we propose to learn a neural LM that fits the residual between an n-gram LM and the real-data distribution. The combination of n-gram and neural LMs not only allows the neural part to focus on the deeper understanding of language but also provides a flexible way to customize an LM by switching the underlying n-gram model without changing the neural model. Experimental results on three typical language tasks (i.e., language modeling, machine translation, and summarization) demonstrate that our approach consistently attains additional performance gains over popular standalone neural models. We also show that our approach allows for effective domain adaptation by simply switching to a domain-specific n-gram model, without any extra training. Our code is released at https://github.com/ghrua/NgramRes.
1 Introduction
The n-gram language model (LM) was widely adopted in a broad range of natural language processing (NLP) applications, such as input methods (Chen et al., 2019), statistical machine translation (Brown et al., 1990), and automatic speech recognition (Bahl et al., 1983). However, with the development of deep learning, neural LMs have gradually taken the place of n-gram LMs and become the new standard in recent literature (Merity et al., 2017; Vaswani et al., 2017; Radford et al., 2019). One critical reason is the superior performance of neural LMs; e.g., the GPT-2 model (Radford et al., 2019) can generate text near the human level, outperforming n-gram LMs by large margins.
[Figure 1 chart: y-axis PPL on a log scale; x-axis Index of Bins (1-5); two bar series, 5-gram and GPT-2.]
Figure 1: Sentence-level perplexity (PPL) of the 5-gram LM and the GPT-2 LM on the validation dataset of wikitext-103. We sort sentences in the validation dataset according to their 5-gram PPL scores and collect them into 5 bins with an equal number of sentences. The reported PPL score of each bin is the average over the sentences in it, and the y-axis uses a logarithmic scale. Details of the dataset and LMs are shown in Section 5.1.
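For reference, the following is a minimal sketch of the binning procedure described in the caption, assuming the per-sentence PPL scores of both models are already computed; the function and variable names are illustrative and not taken from the released code.

```python
import numpy as np

def bin_ppl(ppl_5gram, ppl_gpt2, num_bins=5):
    """Sort sentences by 5-gram PPL and report each model's mean PPL per bin."""
    ppl_5gram = np.asarray(ppl_5gram, dtype=float)
    ppl_gpt2 = np.asarray(ppl_gpt2, dtype=float)

    # Sort sentence indices by their 5-gram PPL (ascending), as in the caption.
    order = np.argsort(ppl_5gram)
    # Split into `num_bins` bins with (roughly) equal numbers of sentences.
    bins = np.array_split(order, num_bins)

    # Average PPL over the sentences in each bin, for both models.
    return [(ppl_5gram[idx].mean(), ppl_gpt2[idx].mean()) for idx in bins]
```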
Although neural LMs have surpassed n-gram models at the macro level, we find that n-gram LMs are still attractive: they are able to achieve satisfactory performance on a large proportion of testing cases at a much lower cost than neural LMs. As observed in Figure 1, our preliminary experiments show that the performance of the 5-gram LM is close to that of a GPT-2 model trained from scratch on 3 out of 5 bins (1, 2, and 5). Moreover, the performance of the 5-gram LM on the first bin is slightly better than GPT-2. Because training a neural LM is much more expensive, spending effort on learning the knowledge that can be cheaply captured by an n-gram model seems wasteful.
Inspired by the above observation, we propose to learn a neural LM that focuses on the information gap that has not been captured by an n-gram model: F := G − Q, where G and Q are the real-data distribution and the n-gram prediction distribution, respectively, which is in a similar spirit to residual learning (He et al., 2016). More concretely,
we combine the logits (the unnormalized probability scores before the softmax layer) of a neural model and those derived from an n-gram model. The joint neuro-symbolic system brings at least two appealing characteristics. First, since the neural model stands on the shoulders of the shallow n-gram LM, it can concentrate on deeper understanding. Second, the underlying n-gram LM can be purposefully switched without changing the neural model, which offers great flexibility in tackling scenarios such as domain adaptation. That is, we can adapt the model to a specific domain by changing the underlying n-gram LM in a plug-and-play manner, without changing any parameters of the neural model.
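One natural way to realize this combination is to add the log of the n-gram next-word distribution to the neural logits before the softmax. The sketch below illustrates that idea under this assumption; it is not necessarily the exact parameterization used in the paper, and the function name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def combined_next_word_distribution(neural_logits, ngram_probs, eps=1e-8):
    """Combine neural logits with an n-gram LM's predictions in logit space.

    neural_logits: (batch, vocab) unnormalized scores phi(h_k) from the neural LM.
    ngram_probs:   (batch, vocab) next-word distribution from the n-gram LM.
    """
    # Bring the n-gram distribution into log space so it can be added to logits.
    ngram_logits = torch.log(ngram_probs + eps)
    # Residual view: the neural logits only need to model what the fixed
    # n-gram term misses, since that term is added back before normalization.
    return F.softmax(neural_logits + ngram_logits, dim=-1)
```

Because the n-gram LM enters only through this additive term, it can be swapped for a domain-specific one at inference time without retraining the neural part.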
We conduct extensive experiments to evaluate the proposed approach. Experiments on the standard benchmarks of three typical language tasks, including language modeling, machine translation, and summarization, show that our approach can improve the performance of recent state-of-the-art neural models consistently and considerably. For example, our approach outperforms popular baseline models by at least 0.7 PPL scores on the wikitext-103 dataset for language modeling, 0.65 BLEU scores on average on IWSLT datasets for machine translation, and 0.36 ROUGE-L scores on the CNN/DailyMail dataset for summarization. Moreover, on the language modeling task, when switching the underlying n-gram LM to a particular domain-specific one (e.g., IT, Koran, Law, Medical, and Subtitles) in a plug-and-play manner, our model can reduce the PPL by 5.4 points on average without any domain-specific training of the neural part. Remarkably, the performance of our approach is even close to fine-tuning the whole model on domain-specific corpora.
Our contributions are three-fold:
• We propose a residual learning approach for two heterogeneous structures, i.e., n-gram and neural LMs, which forces the neural LM to approximate the information gap that has not been captured by the n-gram LM.
• Our approach is able to improve the performance of recent state-of-the-art neural models consistently and considerably on language modeling, machine translation, and summarization.
• Experiments on domain adaptation demonstrate that our approach can effectively and cheaply adapt the model to a specific domain by changing the used n-gram LM in a plug-and-play manner, without changing any parameters of the neural model.
2 Related Work
Language Model
The n-gram language model (LM) has been widely used in many applications of natural language processing (NLP) for a long time (Jurafsky, 2000). The emergence of advanced smoothing techniques enables the n-gram model to provide a better estimation of human languages (Kneser and Ney, 1995; Chen and Goodman, 1996; Heafield et al., 2013). In statistical machine translation (Brown et al., 1990) and automatic speech recognition (Bahl et al., 1983), the decoder-side n-gram model is critical for estimating the quality of generated candidates. In recent literature on input methods, the n-gram LM is still the most popular choice for providing word suggestions (Huang et al., 2015; Chen et al., 2019), because of its low cost and low latency.
However, with the development of deep neural networks, the macro-level performance of neural LMs has surpassed that of n-gram LMs by a large margin. Compared with the n-gram LM, one big advantage of neural LMs based on recurrent neural networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) and attention-based networks (Vaswani et al., 2017; Radford et al., 2019) is their ability to model long-distance dependencies (Grave et al., 2017). The success of neural LMs can also be observed in the large improvements achieved on many downstream tasks, e.g., text generation (Holtzman et al., 2020; Welleck et al., 2020; Su et al., 2022; Xu et al., 2022; Li et al., 2022; Cai et al., 2022), machine translation (Bahdanau et al., 2015; Luong and Manning, 2015; Vaswani et al., 2017; Cai et al., 2021), and summarization (Li et al., 2017; See et al., 2017; Bi et al., 2020).
Although neural LMs have outperformed n-gram LMs at the macro level, we find that an n-gram LM can achieve satisfactory performance on a large portion of testing cases. Since training a neural LM is much more expensive and its model capacity is fixed, we hypothesize that it is not necessary to train the neural LM to learn the knowledge that can be captured by an n-gram LM at a much lower cost. Therefore, we propose a residual learning method that lets the neural LM learn the gap of knowledge that has not been captured by the n-gram LM.
Residual Learning
Residual learning is a useful technique for many neural networks in computer vision (CV) and natural language processing (NLP). He et al. (2016) propose deep residual learning to alleviate the training difficulties of deep models, which has become the backbone of many tasks in CV. In NLP, Wang and Tian (2016) and Prakash et al. (2016) use residual learning to train deep recurrent neural networks for text generation. Different from previous works that conduct residual learning over different layers, Werlen et al. (2018) propose to aggregate the information of historical predictions using residual learning. He et al. (2021) use residual learning to propagate attention scores across different layers of a Transformer-based model.

Most of these works conduct residual learning over homogeneous model structures, e.g., stacked identical layers of the same model. In our work, we use residual learning to combine neural and symbolic models, i.e., we learn a neural LM that approximates the information that has not been captured by the n-gram model.
3 Background
Models that estimate the probabilities of sequences of words are called language models (LM) (Jurafsky, 2000). Let $\mathbf{x} = \{x_1, x_2, \ldots, x_L\}$ be a sequence of words with length $L$. The probability $P(\mathbf{x})$ can be factorized according to the chain rule of probability:

$$P(\mathbf{x}) = P(x_1) P(x_2 \mid x_1) \cdots P(x_L \mid x_1^{L-1}) = \prod_{k=1}^{L} P(x_k \mid x_1^{k-1}), \quad (1)$$

where $x_1^{k-1}$ is called the prefix or context of $x_k$. In this section we briefly introduce two kinds of language models, the n-gram and neural language models, to compute the probability in Eq. (1).
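As a concrete reading of Eq. (1), the sketch below scores a sequence by accumulating conditional log-probabilities; the `lm.next_word_logprob` interface is hypothetical and stands for any model that returns log P(x_k | x_1^{k-1}).

```python
import math

def sequence_logprob(lm, tokens):
    """Chain-rule decomposition of log P(x), Eq. (1)."""
    total = 0.0
    for k, token in enumerate(tokens):
        prefix = tokens[:k]                            # x_1^{k-1}
        total += lm.next_word_logprob(prefix, token)   # log P(x_k | x_1^{k-1})
    return total

def perplexity(lm, tokens):
    # Sentence-level PPL = exp(-log P(x) / L), as reported in Figure 1.
    return math.exp(-sequence_logprob(lm, tokens) / len(tokens))
```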
3.1 N-gram Language Model
Among the many variants of n-gram LMs, the n-gram LM with modified Kneser-Ney smoothing is widely adopted in related tasks because of its low perplexity and efficiency (Kneser and Ney, 1995; Chen and Goodman, 1996; Heafield et al., 2013). Like most n-gram LMs, the Kneser-Ney LM approximates the entire context $x_1^{k-1}$ in Eq. (1) by the last $n-1$ words in the context:

$$P(x_k \mid x_1^{k-1}) \approx P_{\text{NG}}(x_k \mid x_{k-n+1}^{k-1}). \quad (2)$$
In the Kneser-Ney algorithm, the estimation of $P_{\text{NG}}(x_k \mid x_{k-n+1}^{k-1})$ is defined by a recursive equation:

$$P_{\text{NG}}(x_k \mid x_{k-n+1}^{k-1}) = U(x_k \mid x_{k-n+1}^{k-1}) + b(x_{k-n+1}^{k-1}) \, P_{\text{NG}}(x_k \mid x_{k-n+2}^{k-1}), \quad (3)$$

$$U(x_k \mid x_{k-n+1}^{k-1}) = \frac{c(x_{k-n+1}^{k}) - d}{\sum_{w} c(x_{k-n+1}^{k-1} w)},$$

where $w$ indicates a word that appears after $x_{k-n+1}^{k-1}$, $b(\cdot)$ is the backoff value for the lower-order estimation, $c(\cdot)$ is the adjusted count, and $d$ is the discount for smoothing (Jurafsky, 2000; Heafield et al., 2013)¹.
According to Eq. (3), Kneser-Ney allows us to assign probabilities to unseen n-grams (e.g., 5-grams) using lower-order information (e.g., 4-grams, 3-grams, or even unigrams).
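To make the recursion in Eq. (3) concrete, the sketch below evaluates an interpolated Kneser-Ney estimate, assuming the adjusted counts, backoff values, and a single discount have already been estimated (e.g., by a toolkit such as KenLM). Real implementations use order- and count-dependent discounts, so this is only illustrative.

```python
def kn_prob(xk, context, counts, backoff, d):
    """Recursive Kneser-Ney estimate of P_NG(x_k | context) as in Eq. (3).

    counts  : dict mapping word tuples to adjusted counts c(.)
    backoff : dict mapping context tuples to backoff values b(.)
    d       : discount used for smoothing
    """
    if not context:
        # Base case: unigram estimate over all single-word adjusted counts.
        total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
        return counts.get((xk,), 0) / total

    # Denominator of U: total adjusted count of words observed after `context`.
    denom = sum(c for ngram, c in counts.items()
                if len(ngram) == len(context) + 1 and ngram[:-1] == context)
    # Higher-order term U(x_k | context); zero when the context was never seen.
    u = max(counts.get(context + (xk,), 0) - d, 0) / denom if denom else 0.0
    # Back off to the shorter context x_{k-n+2}^{k-1}, weighted by b(context).
    return u + backoff.get(context, 1.0) * kn_prob(xk, context[1:], counts, backoff, d)
```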
3.2 Neural Language Model
A neural LM typically estimates the probability of $x_k$ based on the whole context $x_1^{k-1}$. The parameters $\theta$ of a neural LM are optimized through the following MLE loss:

$$\mathcal{L}_{\text{NU}} = -\sum_{\mathbf{x} \in \mathcal{D}} \sum_{k=1}^{L} \log P_{\text{NU}}(x_k \mid x_1^{k-1}; \theta), \quad (4)$$

where $\mathcal{D}$ is the training dataset. The probability $P_{\text{NU}}(x_k \mid x_1^{k-1}; \theta)$ is computed by:

$$P_{\text{NU}}(x_k \mid x_1^{k-1}; \theta) = \text{softmax}(\phi(h_k))[x_k], \quad (5)$$
where $h_k$ is the hidden vector output by the last layer of a neural LM, e.g., the GPT-2 model (Radford et al., 2019) or an LSTM model (Grave et al., 2017). The notation $[x_k]$ denotes taking the component corresponding to $x_k$ in a vector, i.e., the probability distribution obtained from the softmax in this equation. The function $\phi(\cdot)$ is a linear layer that transforms the hidden vector $h_k$ to a vector in the vocabulary space, which is also called the logits.
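As a concrete reading of Eqs. (4) and (5), the sketch below maps hidden states to logits and evaluates the negative log-likelihood; the class name and tensor shapes are illustrative and assume the backbone (e.g., a Transformer or LSTM) has already produced the hidden vectors h.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LMHead(nn.Module):
    """Maps hidden vectors h_k to logits phi(h_k) and evaluates Eqs. (4)-(5)."""

    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.phi = nn.Linear(hidden_size, vocab_size)  # phi(.) in Eq. (5)

    def forward(self, h, targets):
        # h:       (batch, length, hidden) last-layer hidden states
        # targets: (batch, length)         gold next tokens x_k
        logits = self.phi(h)                           # phi(h_k), the logits
        log_probs = F.log_softmax(logits, dim=-1)      # log softmax(phi(h_k))
        # Select the component [x_k] at every position, as in Eq. (5).
        token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # Negative log-likelihood summed over positions and sequences, Eq. (4).
        return -token_logp.sum()
```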
¹ More details about adjusting counts and computing the backoff values and discounts are shown in Jurafsky (2000) and Heafield et al. (2013).