Categorizing Semantic Representations for Neural Machine Translation
Yongjing Yin1,2, Yafu Li1,2, Fandong Meng3, Jie Zhou3, Yue Zhang2,4
1Zhejiang University
2School of Engineering, Westlake University
3Pattern Recognition Center, WeChat AI, Tencent Inc
4Institute of Advanced Technology, Westlake Institute for Advanced Study
{yinyongjing,liyafu}@westlake.edu.cn {fandongmeng,withtomzhou}@tencent.com
yue.zhang@wias.org.cn
Abstract
Modern neural machine translation (NMT) models have achieved competitive performance on standard benchmarks. However, they have recently been shown to suffer limitations in compositional generalization, failing to effectively learn the translation of atoms (e.g., words) and their semantic composition (e.g., modification) from seen compounds (e.g., phrases), and thus suffering from significantly weakened translation performance on unseen compounds during inference. We address this issue by introducing categorization to the source contextualized representations. The main idea is to enhance generalization by reducing sparsity and overfitting, which is achieved by finding prototypes of token representations over the training set and integrating their embeddings into the source encoding. Experiments on a dedicated MT dataset (i.e., CoGnition) show that our method reduces compositional generalization error rates by 24%. In addition, our conceptually simple method gives consistently better results than the Transformer baseline on a range of general MT datasets.
1 Introduction
Neural machine translation (NMT) has achieved competitive performance on benchmark datasets such as WMT (Vaswani et al., 2017; Edunov et al., 2018; So et al., 2019). However, generalization to low-resource domains (Bapna and Firat, 2019b; Zeng et al., 2019; Bapna and Firat, 2019a; Khandelwal et al., 2021) and robustness to slight input perturbations (Belinkov and Bisk, 2018; Xu et al., 2021b) remain relatively low for NMT models. In addition, recent studies show that NMT systems struggle with compositional generalization (Lake and Baroni, 2018; Raunak et al., 2019; Guo et al., 2020; Li et al., 2021; Dankers et al., 2021; Chaabouni et al., 2021), namely the ability to understand and produce a potentially infinite (formally, exponential in the input size) number of novel combinations of known atoms (Chomsky, 2009; Montague and Thomason, 1975; Janssen and Partee, 1997; Lake and Baroni, 2018; Keysers et al., 2020a).

This work was done as an intern at Pattern Recognition Center, WeChat AI, Tencent Inc, China.

Figure 1: The novel compounds in the CoGnition test set are constructed by composing a few basic semantic atoms (e.g., determiners (DET), nouns (N), and adjectives (ADJ)) according to the composition patterns. The compounds are then put into corresponding source contexts extracted from the training data.
Take CoGnition (Li et al., 2021), a dedicated MT dataset, for example (Figure 1). Although certain instances of translation atoms (e.g., small, large, car, and chair) and their semantic compositions (e.g., small chair and large car) are frequent in the training data, unseen compositions of the same atoms (e.g., large chair) during testing can suffer from large translation error rates. Compositionality is also a fundamental issue in language understanding and a long-standing motivation for translation (Janssen and Partee, 1997; Janssen, 1998), and has been suggested as being essential for robust translation (Raunak et al., 2019; Li et al., 2021) and efficient low-resource learning (Chaabouni et al., 2021).
The currently dominant approach to NMT employs a sequence-to-sequence architecture (Sutskever et al., 2014; Vaswani et al., 2017), where an encoder is used to find a representation of each input token that thoroughly integrates its sequence-level context information, and a decoder refers to such contextualized representations for generating a translation sequence. A key reason for the failure in compositional generalization is that the correspondence between pairs of token sequences is modeled as a whole. Specifically, NMT models are trained end-to-end over large parallel data without disentangling the representations of individual words or phrases from those of whole token sequences. At the sequence level, the source input sample space is highly sparse, mainly due to semantic composition, and small changes to a sentence can lead to out-of-distribution issues (Sagawa et al., 2020; Conklin et al., 2021; Liu et al., 2021).
Intuitively, one way to address this problem is to decouple token-level information from the source sequence by injecting the token-level translation distribution (e.g., $P(\textit{petit}\mid\textit{small})$) into the source representation. Given the fact that the source-side contextualized representations encode rich token-level translation information (Kasai et al., 2021; Xu et al., 2021a), we categorize the sparse token-level contextualized source representations into a few representative prototypes over training instances, and make use of them to enrich the source encoding. In this way, when encoding a sequence, the model observes less sparse prototypes of each token, thus alleviating excessive memorization of sequence-level information.
We propose a two-stage framework to train prototype-based Transformer models (Proto-Transformer). In the first stage, we warm up an initial Transformer model which can generate reasonable representations. In the second stage, for each token, we run the trained model to extract all of its contextualized representations over the training corpus. Then, we perform clustering (e.g., K-Means) to obtain the prototype representations for each token. Take Figure 2 as an example: for the token "Toy", we collect all the contextualized representations and cluster them into 3 prototypes. Finally, we extend the base model by fusing the prototype information back into the encoding process through a prototype-attention module, and continue to train the whole model until convergence.
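To make the second stage concrete, the sketch below shows one way the token-prototype lookup table could be built with K-Means. It is a minimal illustration rather than the released implementation: the encoder interface (`encode_fn`), the number of prototypes per token, and all names are assumptions.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans

def build_prototype_table(encode_fn, corpus, num_prototypes=3):
    """Cluster each token's contextualized representations into prototypes.

    encode_fn: maps a tokenized sentence to a (T, d) array of encoder states
               (a frozen, warmed-up Transformer encoder is assumed).
    corpus:    an iterable of tokenized source sentences.
    Returns a dict: token -> (k, d) array of prototype vectors.
    """
    states = defaultdict(list)
    for sentence in corpus:
        hidden = encode_fn(sentence)            # (T, d) contextualized states
        for token, vec in zip(sentence, hidden):
            states[token].append(vec)

    table = {}
    for token, vecs in states.items():
        vecs = np.stack(vecs)                   # (N_token, d)
        k = min(num_prototypes, len(vecs))      # rare tokens get fewer prototypes
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vecs)
        table[token] = km.cluster_centers_      # (k, d) prototype embeddings
    return table
```

At training and inference time, each input token would then retrieve its prototype vectors from this table, and the prototype-attention module attends over them.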
Experimental results on CoGnition show that our method significantly improves novel compound translation, with over 24% error reduction, demonstrating its effectiveness for tackling the compositional generalization problem. To further verify the effectiveness on more datasets, we conduct experiments on 10 commonly used MT benchmarks, and our method gives consistent BLEU improvements. We also present empirical analysis of the prototypes and quantitative analysis of compositional generalization. The comparison between the one-pass and the two-pass training procedures shows that the one-pass method is both faster and more accurate than the two-pass one, demonstrating that the more generalizable prototypes extracted from the early training phase are more beneficial to compositional generalization. Additionally, quantitative analysis demonstrates that our proposed model is better at handling longer compounds and more difficult composition patterns. The code is publicly available at https://github.com/ARIES-LM/CatMT4CG.git.
2 Related Work
Compositional Generalization
Recent work (Lake and Baroni, 2018; Keysers et al., 2020b) has demonstrated the weak compositionality of neural models using dedicated datasets. Various methods have been proposed to address compositional generalization, such as encoding more inductive bias (Li et al., 2019; Korrel et al., 2019; Baan et al., 2019; Chen et al., 2020a; Gordon et al., 2020; Herzig and Berant, 2021), meta-learning (Lake, 2019; Conklin et al., 2021), and data augmentation (Andreas, 2020; Akyürek et al., 2021). Recently, Ontañón et al. (2021) and Csordás et al. (2021) show that the Transformer architecture can perform better on compositional generalization with some modifications. Although these methods have demonstrated better generalization or interpretability, most of them are limited to semantic parsing datasets with small vocabularies and limited samples. In the context of machine translation, Lake and Baroni (2018) construct a small dataset where the training data contains a word daxy along with its parallel sentences of a single pattern (e.g., I am daxy, je suis daxist), while the test set contains novel patterns (e.g., He is daxy). However, the experiment is limited in that the test set only consists of 8 samples. Different from existing work, Li et al. (2021) propose a large dataset (CoGnition) and construct a large-scale test set that contains newly constructed constituents as novel compounds, so that generalization ability can be evaluated directly based on the compound translation error rate. We propose a method that enhances compositional generalization on the dedicated dataset of Li et al. (2021), while at the same time giving improvements to machine translation quality on practical test cases.
Figure 2: Architecture of Proto-Transformer. The dotted box denotes the prototype-attention introduced in stage 2.
Neural Machine Translation
Recent research on NMT has paid increasing attention to robustness (Cheng et al., 2018, 2020; Xu et al., 2021b), domain adaptation (Bapna and Firat, 2019b; Zeng et al., 2019; Bapna and Firat, 2019a; Khandelwal et al., 2021), and compositional generalization (Lake and Baroni, 2018; Raunak et al., 2019; Fadaee and Monz, 2020; Guo et al., 2020; Li et al., 2021). Lake and Baroni (2018) propose a simple toy experiment to first show the problem of compositionality. Raunak et al. (2019) find that NMT models behave poorly when recombining known parts and when generalizing to samples beyond the lengths observed during training, and Fadaee and Monz (2020) find that NMT models are vulnerable to modifications such as the removal of adverbs and number substitutions. More recently, Li et al. (2021) observe significant compositional generalization issues on CoGnition, and Dankers et al. (2021) argue that MT is a suitable testing ground to ask how compositional models trained on natural data are. Our work is in line with the above methods, but we consider a method to address the issue rather than analyse the problem. Technically, Raunak et al. (2019) propose to use bag-of-words regularization to refine the encoder, and Guo et al. (2020) propose sequence-level mixup to create synthetic samples. Different from them, we propose to enhance models' compositional generalization by categorizing contextualized representations, which turns out to be more effective.
3 Method
3.1 Transformer Baseline
Given a source sentence $X = \{x_1, ..., x_T\}$, where $T$ denotes the number of tokens, the Transformer encoder (Vaswani et al., 2017) first maps $X$ to embeddings, packing them as a matrix $H^0$, and then takes $H^0$ as input and outputs a contextualized sequence representation $H^L \in \mathbb{R}^{d \times T}$, where $d$ and $L$ denote the dimension size and the number of layers, respectively.
Attention. Formally, given a set of packed query, key, and value matrices $Q$, $K$, and $V$, the dot-product attention mechanism is defined as
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\Big(\frac{Q^{\top}K}{\sqrt{d}}\Big)V, \quad (1)$$
where $d$ is the dimension of the key vectors.
A typical extension of the above is multi-head attention (MHA), where multiple linear projections are executed in parallel, and the outputs of all heads are concatenated:
$$\mathrm{MHA}(Q, K, V) = W^{O}[\mathrm{head}_1; ...; \mathrm{head}_h], \quad (2)$$
$$\mathrm{head}_i = \mathrm{Attention}(W^{Q}_{i}Q, W^{K}_{i}K, W^{V}_{i}V), \quad (3)$$
where $W^{O}$, $W^{Q}_{i}$, $W^{K}_{i}$, and $W^{V}_{i}$ are model parameters.
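For reference, here is a minimal NumPy sketch of Equations (1)-(3). For readability it uses the common rows-as-tokens layout (matrices of shape (T, d)), so the scaled dot product appears as $QK^{\top}$ rather than $Q^{\top}K$; the projection shapes and the usage snippet are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def attention(Q, K, V, d_k):
    """Scaled dot-product attention, Eq. (1), with tokens as rows (T, d_k)."""
    scores = Q @ K.T / np.sqrt(d_k)                        # (T, T) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (T, d_v) weighted values

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Multi-head attention, Eqs. (2)-(3): per-head projections, then concat."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):            # one projection set per head
        d_k = Wk_i.shape[1]
        heads.append(attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i, d_k))
    return np.concatenate(heads, axis=-1) @ W_o            # (T, d_model)

# Example shapes: T=5 tokens, d_model=16, h=4 heads of size 4 (self-attention).
T, d_model, h, d_head = 5, 16, 4, 4
rng = np.random.default_rng(0)
H = rng.normal(size=(T, d_model))
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
out = multi_head_attention(H, H, H, W_q, W_k, W_v, W_o)    # (5, 16)
```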
Layer Structure. The Transformer encoder has $L$ identical layers, each of which is composed of two sublayers (i.e., self-attention and feed-forward networks). In the $l$-th self-attention layer, the query, key, and value matrices are all the hidden states from the previous layer $H^{l-1}$:
$$H^{l}_{a} = \mathrm{MHA}(H^{l-1}, H^{l-1}, H^{l-1}). \quad (4)$$