thoroughly integrates its sequence-level context in-
formation, and a decoder refers to such contextu-
alized representations for generating a translation
sequence. A key reason of failure on composi-
tional generalization is that the correspondence
between pairs of token sequences is modeled as
a whole. Specifically, NMT models are trained
end-to-end over large parallel data without disen-
tangling the representation of individual words or
phrases from that of whole token sequences. At the
sequence level, the source input sample space is
highly sparse mainly due to semantic composition,
and small changes to a sentence can lead to out-of-
distribution issues (Sagawa et al., 2020; Conklin
et al., 2021; Liu et al., 2021).
Intuitively, one way to solve this problem is to
decouple token-level information from the source
sequence by injecting token-level translation
distributions (e.g., P(petit|small)) into the source
representation. Given that the source-side
contextualized representations encode rich token-
level translation information (Kasai et al., 2021;
Xu et al., 2021a), we categorize sparse token-level
contextualized source representations into a few
representative prototypes over training instances,
and make use of them to enrich source encoding.
In this way, when encoding a sequence, the model
observes the less sparse prototypes of each token,
which alleviates excessive memorization of
sequence-level information.
We propose a two-stage framework to train
prototype-based Transformer models (Proto-
Transformer). In the first stage, we warm up an
initial Transformer model which can generate rea-
sonable representations. In the second stage, for
each token, we run the trained model to extract all
contextualized representations over the training cor-
pus. Then, we perform clustering (e.g., K-Means)
to obtain the prototype representations for each to-
ken. Take Figure 2as an example, for the token
“Toy”, we collect all the contextualized representa-
tions and cluster them into 3 prototypes. Finally,
we extend the base model by fusing the prototype
information back into the encoding process through
a prototype-attention module, and continue to train
the whole model until convergence.
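As a rough illustration of this second stage, the sketch below collects per-token contextualized encoder states over a corpus, clusters them with K-Means into prototype vectors, and fuses them back through a small prototype-attention layer. All identifiers (build_prototypes, PrototypeAttention, K) and design details such as the gating layer are illustrative assumptions rather than the released implementation.

    # A minimal sketch of the second stage, assuming a warmed-up PyTorch
    # Transformer encoder. Names (build_prototypes, PrototypeAttention, K)
    # and the gating design are assumptions, not the paper's released code.
    import torch
    import torch.nn as nn
    from collections import defaultdict
    from sklearn.cluster import KMeans

    K = 3  # assumed number of prototypes per token

    def build_prototypes(encoder, corpus_loader, device="cuda"):
        """Collect contextualized states per token type and cluster them."""
        states = defaultdict(list)
        encoder.eval()
        with torch.no_grad():
            for src_tokens in corpus_loader:               # [batch, seq_len]
                hidden = encoder(src_tokens.to(device))    # [batch, seq_len, d]
                for sent, vecs in zip(src_tokens.tolist(), hidden.cpu()):
                    for tok, vec in zip(sent, vecs):
                        states[tok].append(vec.numpy())
        prototypes = {}
        for tok, vecs in states.items():
            km = KMeans(n_clusters=min(K, len(vecs)), n_init=10).fit(vecs)
            prototypes[tok] = torch.tensor(km.cluster_centers_, dtype=torch.float)
        return prototypes  # {token_id: [<=K, d]}

    class PrototypeAttention(nn.Module):
        """Fuse each encoder state with the prototypes of its source token."""
        def __init__(self, d_model):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads=1,
                                              batch_first=True)
            self.gate = nn.Linear(2 * d_model, d_model)

        def forward(self, enc_state, token_prototypes):
            # enc_state:        [batch, seq_len, d]
            # token_prototypes: [batch, seq_len, K, d], gathered per source token
            b, s, k, d = token_prototypes.shape
            q = enc_state.reshape(b * s, 1, d)         # each position queries
            kv = token_prototypes.reshape(b * s, k, d) # its own K prototypes
            ctx, _ = self.attn(q, kv, kv)
            ctx = ctx.reshape(b, s, d)
            return self.gate(torch.cat([enc_state, ctx], dim=-1))

Under these assumptions, the prototypes gathered per source token would be kept fixed while the extended model continues training to convergence, mirroring the continued training described above.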
Experimental results on CoGnition show that our
method significantly improves novel composition
translation, with over 24% error reduction, demon-
strating its effectiveness in tackling the composi-
tional generalization problem. To further verify the
effectiveness on more datasets, we conduct experi-
ments on 10 commonly used MT benchmarks and
our method yields consistent BLEU improvements.
We also present an empirical analysis of prototypes
and a quantitative analysis of compositional gener-
alization. The comparison between the one-pass
and the two-pass training procedure shows that the
one-pass method is both faster and more accurate
than the two-pass one, demonstrating that more
generalizable prototypes extracted from the early
training phase are more beneficial to compositional
generalization. Additionally, quantitative analysis
demonstrates that our proposed model is better at
handling longer compounds and more difficult com-
position patterns. The code is publicly available at
https://github.com/ARIES-LM/CatMT4CG.git.
2 Related Work
Compositional Generalization
Recent work
(Lake and Baroni, 2018; Keysers et al., 2020b)
has demonstrated weak compositionality of neural
models using dedicated datasets. Various methods
have been proposed to address compositional
generalization, such as encoding more induc-
tive bias (Li et al., 2019; Korrel et al., 2019; Baan
et al., 2019; Chen et al., 2020a; Gordon et al., 2020;
Herzig and Berant, 2021), meta-learning (Lake,
2019; Conklin et al., 2021), and data augmentation
(Andreas, 2020; Akyürek et al., 2021). Recently,
Ontañón et al. (2021) and Csordás et al. (2021)
show that the Transformer architecture can per-
form better on compositional generalization with
some modifications. Although these methods have
demonstrated better generalization or interpretabil-
ity, most of them are limited to semantic parsing
datasets with small vocabularies and limited samples. In the
context of machine translation, Lake and Baroni
(2018) construct a small dataset where the training
data contains a word daxy along with its parallel
sentences of a single pattern (e.g., I am daxy, je
suis daxist) while the test set contains novel pat-
terns (e.g., He is daxy). However, the experiment
is limited in that the test set only consists of 8 sam-
ples. Different from existing work, Li et al. (2021)
propose a large dataset (CoGnition) and construct a
large-scale test set that contains newly constructed
constituents as novel compounds, so that general-
ization ability can be evaluated directly based on
compound translation error rate. We propose a
method for enhancing compositional generalization on
the dedicated dataset of Li et al. (2021), while at