we combine the logits (the unnormalized probability scores before the softmax layer) of a neural model with those derived from an n-gram model. The joint neuro-symbolic system brings at least two appealing characteristics. First, since the neural model stands on the shoulders of the shallow n-gram LM, it can concentrate on deeper understanding. Second, the underlying n-gram LM can be purposefully switched without changing the neural model, which offers great flexibility in tackling scenarios such as domain adaptation. That is, we can adapt the model to a specific domain by changing the underlying n-gram LM in a plug-and-play manner, without changing any parameters of the neural model.
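To make the combination concrete, the following minimal sketch (our illustration, not the paper's exact implementation) adds the neural logits to log-probabilities obtained from the n-gram LM before the softmax; the next_token_probs call on the n-gram side is a hypothetical interface.

import torch
import torch.nn.functional as F

def combine_logits(neural_logits, ngram_probs, eps=1e-10):
    # neural_logits: [batch, vocab] unnormalized scores from the neural LM.
    # ngram_probs:   [batch, vocab] next-token probabilities from the n-gram LM.
    # The n-gram distribution acts as a log-space prior, so the neural model
    # only needs to supply what the n-gram LM misses.
    return neural_logits + torch.log(ngram_probs + eps)

# At inference time (hypothetical calls for both components):
# neural_logits = neural_lm(input_ids)[:, -1, :]
# ngram_probs = ngram_lm.next_token_probs(context)   # assumed interface
# next_token_dist = F.softmax(combine_logits(neural_logits, ngram_probs), dim=-1)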
We conduct extensive experiments to evaluate the proposed approach. Experiments on standard benchmarks for three typical language tasks, namely language modeling, machine translation, and summarization, show that our approach consistently and considerably improves the performance of recent state-of-the-art neural models. For example, our approach outperforms popular baseline models by at least 0.7 PPL points on the WikiText-103 dataset for language modeling, by 0.65 BLEU points on average on the IWSLT datasets for machine translation, and by 0.36 ROUGE-L points on the CNN/DailyMail dataset for summarization. Moreover, on the language modeling task, when switching the underlying n-gram LM to a domain-specific one (e.g., IT, Koran, Law, Medical, or Subtitles) in a plug-and-play manner, our model reduces the PPL by 5.4 points on average without any domain-specific training of the neural part. Remarkably, the performance of our approach is even close to that of fine-tuning the whole model on domain-specific corpora.
Our contributions are three-fold:
• We propose a residual learning approach for two heterogeneous structures, i.e., n-gram and neural LMs, which forces the neural LM to approximate the information gap that the n-gram LM has not captured.
• Our approach consistently and considerably improves the performance of recent state-of-the-art neural models on language modeling, machine translation, and summarization.
• Experiments on domain adaptation demonstrate that our approach can effectively and cheaply adapt the model to a specific domain by changing the n-gram LM in a plug-and-play manner, without changing any parameters of the neural model (see the sketch after this list).
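The plug-and-play property can be pictured as a thin wrapper that holds a frozen neural model and a swappable n-gram component; JointLM and its methods are illustrative names rather than the paper's code, and the domain-specific n-gram models are assumed to exist.

class JointLM:
    """Joint neuro-symbolic LM: a frozen neural model plus a swappable n-gram LM."""

    def __init__(self, neural_model, ngram_lm):
        self.neural_model = neural_model  # parameters stay untouched
        self.ngram_lm = ngram_lm          # general-domain n-gram LM by default

    def set_domain(self, domain_ngram_lm):
        # Domain adaptation: replace only the n-gram component (e.g., one
        # trained on IT, Law, or Medical text); no gradient step is taken
        # on the neural model.
        self.ngram_lm = domain_ngram_lm

# joint = JointLM(neural_model, general_ngram)
# joint.set_domain(medical_ngram)   # adapts the joint system to the medical domain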
2 Related Work
Language Model
The n-gram language model (LM) has long been widely used in many applications of natural language processing (NLP) (Jurafsky, 2000). The emergence of advanced smoothing techniques enables the n-gram model to provide a better estimation of human language (Kneser and Ney, 1995; Chen and Goodman, 1996; Heafield et al., 2013). In statistical machine translation (Brown et al., 1990) and automatic speech recognition (Bahl et al., 1983), the decoder-side n-gram model is critical for estimating the quality of generated candidates. In recent literature on input methods, the n-gram LM is still the most popular choice for providing word suggestions (Huang et al., 2015; Chen et al., 2019) because of its low cost and low latency.
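As a point of reference for this low cost, a Kneser-Ney-smoothed n-gram LM of this kind can be queried cheaply; the sketch below assumes the KenLM toolkit (Heafield et al., 2013) and a pre-estimated model file whose path is a placeholder.

import kenlm

lm = kenlm.Model("lm.arpa")  # placeholder path to a model estimated with lmplz
# Total log10 probability of the sentence, with begin/end-of-sentence markers.
print(lm.score("the quick brown fox", bos=True, eos=True))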
However, with the development of deep neural networks, the macro-level performance of neural LMs has surpassed that of n-gram LMs by a large margin. Compared with the n-gram LM, one big advantage of neural LMs based on recurrent neural networks (Hochreiter and Schmidhuber, 1997; Chung et al., 2014) and attention networks (Vaswani et al., 2017; Radford et al., 2019) is their ability to model long-distance dependencies (Grave et al., 2017). The success of neural LMs can also be observed in the large improvements achieved on many downstream tasks, e.g., text generation (Holtzman et al., 2020; Welleck et al., 2020; Su et al., 2022; Xu et al., 2022; Li et al., 2022; Cai et al., 2022), machine translation (Bahdanau et al., 2015; Luong and Manning, 2015; Vaswani et al., 2017; Cai et al., 2021), and summarization (Li et al., 2017; See et al., 2017; Bi et al., 2020).
Although neural LMs have outperformed n-gram LMs at the macro level, we find that the n-gram LM can achieve satisfactory performance on a large portion of test cases. Since training a neural LM is much more expensive and its model capacity is fixed, we hypothesize that it is unnecessary to train the neural LM to learn knowledge that can be captured by an n-gram LM at a much lower cost. Therefore, we propose a residual learning method that lets the neural LM learn the gap of knowledge that the n-gram LM has not captured.
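A minimal sketch of what such a residual objective could look like, under the assumption that the combination is additive in log space: the n-gram term is treated as a constant offset, so gradients update only the neural LM and push it toward the part of the distribution the n-gram LM misses. The paper's exact objective may differ.

import torch
import torch.nn.functional as F

def residual_lm_loss(neural_logits, ngram_probs, targets, eps=1e-10):
    # neural_logits: [batch, seq, vocab] scores from the neural LM.
    # ngram_probs:   [batch, seq, vocab] next-token probabilities from the n-gram LM.
    # targets:       [batch, seq] gold next-token ids.
    combined = neural_logits + torch.log(ngram_probs + eps).detach()
    # Standard cross-entropy on the combined distribution; only the neural
    # branch receives gradients.
    return F.cross_entropy(combined.view(-1, combined.size(-1)), targets.view(-1))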