we combine the logits (the unnormalized probability scores before the softmax layer) of a neural model with those derived from an n-gram model. The joint neuro-symbolic system brings at least two appealing characteristics. First, since the neural model stands on the shoulders of the shallow n-gram LM, it can concentrate on deeper understanding. Second, the underlying n-gram LM can be purposefully switched without changing the neural model, which offers great flexibility in tackling scenarios such as domain adaptation. That is, we can adapt the model to a specific domain by changing the underlying n-gram LM in a plug-and-play manner, without changing any parameters of the neural model.
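To make the combination concrete, the following minimal sketch (our illustration, not the paper's exact implementation) adds the neural logits to log-probabilities obtained from the n-gram LM before the softmax; the next_token_probs call on the n-gram side is a hypothetical interface.

import torch
import torch.nn.functional as F

def combine_logits(neural_logits, ngram_probs, eps=1e-10):
    # neural_logits: [batch, vocab] unnormalized scores from the neural LM.
    # ngram_probs:   [batch, vocab] next-token probabilities from the n-gram LM.
    # The n-gram distribution acts as a log-space prior, so the neural model
    # only needs to supply what the n-gram LM misses.
    return neural_logits + torch.log(ngram_probs + eps)

# At inference time (hypothetical calls for both components):
# neural_logits = neural_lm(input_ids)[:, -1, :]
# ngram_probs = ngram_lm.next_token_probs(context)   # assumed interface
# next_token_dist = F.softmax(combine_logits(neural_logits, ngram_probs), dim=-1)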
We conduct extensive experiments to evaluate the proposed approach. Experiments on standard benchmarks for three typical language tasks, namely language modeling, machine translation, and summarization, show that our approach consistently and considerably improves the performance of recent state-of-the-art neural models. For example, our approach outperforms popular baseline models by at least 0.7 PPL points on the WikiText-103 dataset for language modeling, by 0.65 BLEU points on average on the IWSLT datasets for machine translation, and by 0.36 ROUGE-L points on the CNN/DailyMail dataset for summarization. Moreover, on the language modeling task, when switching the underlying n-gram LM to a domain-specific one (e.g., IT, Koran, Law, Medical, or Subtitles) in a plug-and-play manner, our model reduces the PPL by 5.4 points on average without any domain-specific training of the neural part. Remarkably, the performance of our approach is even close to that of fine-tuning the whole model on domain-specific corpora.
Our contributions are three-fold:
• We propose a residual learning approach for two heterogeneous structures, i.e., n-gram and neural LMs, which forces the neural LM to approximate the information gap that the n-gram LM has not captured.
• Our approach consistently and considerably improves the performance of recent state-of-the-art neural models on language modeling, machine translation, and summarization.
• Experiments on domain adaptation demonstrate that our approach can effectively and cheaply adapt the model to a specific domain by changing the n-gram LM in a plug-and-play manner, without changing any parameters of the neural model (see the sketch after this list).
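The plug-and-play property can be pictured as a thin wrapper that holds a frozen neural model and a swappable n-gram component; JointLM and its methods are illustrative names rather than the paper's code, and the domain-specific n-gram models are assumed to exist.

class JointLM:
    """Joint neuro-symbolic LM: a frozen neural model plus a swappable n-gram LM."""

    def __init__(self, neural_model, ngram_lm):
        self.neural_model = neural_model  # parameters stay untouched
        self.ngram_lm = ngram_lm          # general-domain n-gram LM by default

    def set_domain(self, domain_ngram_lm):
        # Domain adaptation: replace only the n-gram component (e.g., one
        # trained on IT, Law, or Medical text); no gradient step is taken
        # on the neural model.
        self.ngram_lm = domain_ngram_lm

# joint = JointLM(neural_model, general_ngram)
# joint.set_domain(medical_ngram)   # adapts the joint system to the medical domain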
2 Related Work
Language Model
The n-gram language model (LM) has long been widely used in many applications of natural language processing (NLP) (Jurafsky, 2000). The emergence of advanced smoothing techniques enables the n-gram model to provide a better estimation of human language (Kneser and Ney, 1995; Chen and Goodman, 1996; Heafield et al., 2013). In statistical machine translation (Brown et al., 1990) and automatic speech recognition (Bahl et al., 1983), the decoder-side n-gram model is critical for estimating the quality of generated candidates. In recent literature on input methods, the n-gram LM is still the most popular choice for providing word suggestions (Huang et al., 2015; Chen et al., 2019) because of its low cost and low latency.
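As a point of reference for this low cost, a Kneser-Ney-smoothed n-gram LM of this kind can be queried cheaply; the sketch below assumes the KenLM toolkit (Heafield et al., 2013) and a pre-estimated model file whose path is a placeholder.

import kenlm

lm = kenlm.Model("lm.arpa")  # placeholder path to a model estimated with lmplz
# Total log10 probability of the sentence, with begin/end-of-sentence markers.
print(lm.score("the quick brown fox", bos=True, eos=True))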
However, with the development of deep neural networks, the macro-level performance of neural LMs has surpassed that of n-gram LMs by a large margin. Compared with the n-gram LM, one big advantage of neural LMs based on recurrent neural networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) and attention networks (Vaswani et al., 2017; Radford et al., 2019) is their ability to model long-distance dependencies (Grave et al., 2017). The success of neural LMs can also be observed in the large improvements achieved on many downstream tasks, e.g., text generation (Holtzman et al., 2020; Welleck et al., 2020; Su et al., 2022; Xu et al., 2022; Li et al., 2022; Cai et al., 2022), machine translation (Bahdanau et al., 2015; Luong and Manning, 2015; Vaswani et al., 2017; Cai et al., 2021), and summarization (Li et al., 2017; See et al., 2017; Bi et al., 2020).
Although neural LMs have outperformed n-gram LMs at the macro level, we find that the n-gram LM can achieve satisfactory performance on a large portion of test cases. Since training a neural LM is much more expensive and its model capacity is fixed, we hypothesize that it is unnecessary to train the neural LM to learn knowledge that can be captured by an n-gram LM at a much lower cost. Therefore, we propose a residual learning method that lets the neural LM learn the gap of knowledge that the n-gram LM has not captured.
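A minimal sketch of what such a residual objective could look like, under the assumption that the combination is additive in log space: the n-gram term is treated as a constant offset, so gradients update only the neural LM and push it toward the part of the distribution the n-gram LM misses. The paper's exact objective may differ.

import torch
import torch.nn.functional as F

def residual_lm_loss(neural_logits, ngram_probs, targets, eps=1e-10):
    # neural_logits: [batch, seq, vocab] scores from the neural LM.
    # ngram_probs:   [batch, seq, vocab] next-token probabilities from the n-gram LM.
    # targets:       [batch, seq] gold next-token ids.
    combined = neural_logits + torch.log(ngram_probs + eps).detach()
    # Standard cross-entropy on the combined distribution; only the neural
    # branch receives gradients.
    return F.cross_entropy(combined.view(-1, combined.size(-1)), targets.view(-1))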