Low-resource Neural Machine Translation with Cross-modal Alignment
Zhe Yang1,2, Qingkai Fang1,2, Yang Feng1,2
1Key Laboratory of Intelligent Information Processing
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2University of Chinese Academy of Sciences, Beijing, China
{yangzhe22s1,fangqingkai21b,fengyang}@ict.ac.cn
Abstract
How to achieve neural machine translation with limited parallel data? Existing techniques often rely on large-scale monolingual corpora, which are impractical to obtain for some low-resource languages. In this paper, we connect several low-resource languages to a particular high-resource one through the additional visual modality. Specifically, we propose a cross-modal contrastive learning method to learn a shared space for all languages, in which both a coarse-grained sentence-level objective and a fine-grained token-level one are introduced. Experimental results and further analysis show that our method can effectively learn the cross-modal and cross-lingual alignment with a small amount of image-text pairs, and achieves significant improvements over the text-only baseline under both zero-shot and few-shot scenarios. Our code can be found at https://github.com/ictnlp/LNMT-CA.
1 Introduction
Neural machine translation (NMT) has shown excellent performance and has become the dominant paradigm of machine translation. However, NMT is a data-driven approach that requires a large amount of parallel data. When the data is insufficient, it is impractical to train a reasonable NMT model. Unfortunately, there are many languages in the world for which sufficient training data is not available, and sometimes there is no parallel data at all. Therefore, the translation of low-resource languages is a vital challenge for NMT.

In recent years, researchers have attempted to improve the performance of NMT for low-resource languages. Lample et al. (2018a) proposed an unsupervised approach that learns weak mappings between languages from a large amount of monolingual data (>1M), which is also costly for low-resource languages.
Corresponding author: Yang Feng.
[Figure 1 content: example captions "<EN> a dog is running in the snow", "<DE> ein hund rennt im schnee", and "<FR> un chien court dans la neige" are linked through cross-modal alignment; the DE-EN direction has MT supervision, while the FR-EN direction is realized zero-shot or few-shot.]
Figure 1: We aim at realizing zero-shot and few-shot machine translation for the low-resource language. Different languages with the same meanings are projected to a shared space by cross-modal alignment.
Liu et al. (2020), Lin et al. (2020b), and Pan et al. (2021) proposed multilingual NMT models, which learn a shared space of multiple languages to achieve translation between languages that appear in the training set but do not have corresponding parallel data. However, they still require auxiliary parallel data of the source and target languages along with many other languages, which is still infeasible for low-resource languages.
In recent years, with increasing attention to multi-modal tasks, resources of image-text pairs have become more abundant. Inspired by recent efforts on cross-modal alignment (Radford et al., 2021; Li et al., 2021; Fang et al., 2022), in this paper we propose a cross-modal contrastive learning method, which aligns different languages with images as the pivot to enable zero-shot and few-shot translation for low-resource languages. With parallel sentence pairs between one high-resource auxiliary language and the target language, we can achieve translation from low-resource languages to the target language only by obtaining small amounts of image-text pairs (<0.1M) for those languages. The parallel sentence pairs are used to learn the mapping from the high-resource language to the target language, and the image-text pairs are used to learn a shared space for all languages through cross-modal alignment. With images as the pivot, the mapping from the low-resource languages to the target language is learned, thus achieving zero-shot translation without any parallel sentence pairs between them.
As shown in Figure 1, the high-resource language German and the low-resource language French are brought together by cross-modal alignment, which transfers the translation ability from DE→EN to FR→EN. Experiments and analysis show that our method consistently outperforms the baseline under both zero-shot and few-shot scenarios. Furthermore, our method can effectively realize cross-modal and cross-lingual alignment.
2 Method
In this section, we present our proposed cross-modal contrastive learning method, which includes both sentence-level and token-level objectives.
2.1 Task Definition
Our goal is to achieve zero-shot or few-shot translation from $T$ low-resource languages $L_1, L_2, \ldots, L_T$ to the target language $L_y$ with the help of a particular high-resource language $\hat{L}$. For the high-resource language $\hat{L}$, there are triples of data $D_{\hat{L}} = \{(\mathbf{i}, \mathbf{x}, \mathbf{y})\}$, where $\mathbf{i}$ is the image and $\mathbf{x}$ and $\mathbf{y}$ are the descriptions in $\hat{L}$ and $L_y$ respectively. For each low-resource language $L_i$, only paired data $D_{L_i} = \{(\mathbf{i}, \mathbf{x})\}$ are available. Note that different languages never share the same images.
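To make this data setting concrete, the following sketch shows one possible way to organize the triples for the high-resource language and the pairs for a low-resource language. The field names and container types are our own illustrative assumptions, not the authors' actual data format; the example captions are taken from Figure 1.

```python
# Illustrative data layout for D_L_hat (triples) and D_L_i (pairs); not the authors' code.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Example:
    image: str                       # identifier/path of the image i (illustrative)
    src_text: str                    # caption x in the source language
    tgt_text: Optional[str] = None   # caption y in L_y, only present for the high-resource language

# Triples (i, x, y) for the high-resource auxiliary language (DE in Figure 1).
high_resource_de: List[Example] = [
    Example("img_001.jpg", "ein hund rennt im schnee", "a dog is running in the snow"),
]

# Pairs (i, x) for a low-resource language (FR in Figure 1); note that the images
# are disjoint from those of the high-resource language.
low_resource_fr: List[Example] = [
    Example("img_774.jpg", "un chien court dans la neige"),
]
```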
2.2 Model Framework
As shown in Figure 2, our model consists of four sub-modules: the image encoder, the source encoder, the target decoder, and the contrastive module.
We use Vision Transformer (ViT) (Dosovitskiy et al., 2021) as the image encoder to extract visual features. ViT first splits the image into several patches, and then feeds the sequence of embedded patches, together with a special [class] token, into a Transformer (Vaswani et al., 2017). Finally, the image is encoded as a sequence of vectors $\mathbf{v} = (v_0, v_1, \ldots, v_m)$, where $v_0$ is the representation of the [class] token, which can be regarded as the global representation of the image, and $\mathbf{v}^p = (v_1, \ldots, v_m)$ are the patch-level representations. In the following sections, we use $v_0$ for sentence-level contrastive learning and $\mathbf{v}^p$ for token-level contrastive learning.
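As a concrete illustration of how $v_0$ and $\mathbf{v}^p$ could be obtained, the sketch below uses a pre-trained ViT from the HuggingFace transformers library; the specific checkpoint, library, and image path are assumptions for illustration, not necessarily what the authors used.

```python
# Sketch: extract the global [class] vector v_0 and patch vectors v^p from a ViT.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.open("img_001.jpg").convert("RGB")        # path is illustrative
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    hidden = vit(pixel_values).last_hidden_state         # (1, 1 + m, d), [class] token first

v0 = hidden[:, 0]    # (1, d)    -> used for sentence-level contrastive learning
vp = hidden[:, 1:]   # (1, m, d) -> used for token-level contrastive learning
```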
The source encoder consists of $N$ Transformer encoder layers and is shared across all languages ($L_{1 \ldots T}$ and $\hat{L}$). For the input sentence $\mathbf{x} = (x_1, \ldots, x_n)$, the output of the source encoder is denoted as $\mathbf{w} = (w_1, \ldots, w_n)$. The target decoder consists of $N$ Transformer decoder layers. For the sentence pair $(\mathbf{x}, \mathbf{y})$, the cross-entropy loss is defined as:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{|\mathbf{y}|} \log p(y_i \mid \mathbf{y}_{<i}, \mathbf{x}). \quad (1)$$
The contrastive module aims to align the outputs of the image encoder and the source encoder, and contains both sentence-level and token-level parts. We will introduce them in Sections 2.3 and 2.4.
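A minimal PyTorch sketch of Eq. (1) is given below, assuming the decoder already produces per-token logits; the function and argument names are ours, not the authors' code.

```python
# Sketch of the translation cross-entropy loss in Eq. (1).
import torch
import torch.nn.functional as F

def translation_ce_loss(logits: torch.Tensor, tgt_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """logits: (B, |y|, V) scores for y_i given y_<i and x; tgt_ids: (B, |y|) gold token ids."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (B*|y|, V)
        tgt_ids.reshape(-1),                  # flatten to (B*|y|,)
        ignore_index=pad_id,                  # padding positions do not contribute to the loss
    )
```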
2.3 Sentence-level Contrastive Learning
We start with the sentence-level contrastive learning objective, which aims at learning coarse alignment between image and text.
Contrastive Learning
The idea of contrastive learning (Sohn, 2016) is to make the representations of corresponding pairs closer and, on the contrary, to make the irrelevant pairs farther apart.
Given two sets $X = \{x_i\}_{i=1}^{M}$ and $Y = \{y_i\}_{i=1}^{M}$, for each $x_i$, the positive example is $(x_i, y_i)$ and the remaining $M-1$ irrelevant pairs $(x_i, y_j)\,(i \neq j)$ are considered as negative examples. The contrastive loss between $X$ and $Y$ is defined as:
$$\mathcal{L}_{\mathrm{ctr}}(X, Y) = -\sum_{i=1}^{M} \log \frac{\exp(s(x_i, y_i)/\tau)}{\sum_{j=1}^{M} \exp(s(x_i, y_j)/\tau)}, \quad (2)$$
where $s(\cdot)$ is the cosine similarity function $s(a, b) = a^{\top} b / \lVert a \rVert \lVert b \rVert$, and $\tau$ is the temperature hyperparameter that controls the strength of penalties on hard negative samples (Wang and Liu, 2021).
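Eq. (2) corresponds to a standard InfoNCE-style objective. A minimal PyTorch sketch is shown below; the default temperature value and the function name are our own assumptions, and the mean reduction (rather than the sum written in Eq. (2)) is a common implementation choice.

```python
# Sketch of the contrastive loss in Eq. (2): row i of x matches row i of y.
import torch
import torch.nn.functional as F

def contrastive_loss(x: torch.Tensor, y: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """x, y: (M, d) paired representations; returns the loss of Eq. (2) over the M pairs."""
    x = F.normalize(x, dim=-1)                 # unit-normalize so that dot product = cosine similarity
    y = F.normalize(y, dim=-1)
    sim = x @ y.t() / tau                      # (M, M) similarity matrix scaled by temperature
    targets = torch.arange(x.size(0), device=x.device)
    # Negative log-softmax of the diagonal entries (positives), averaged over the batch.
    return F.cross_entropy(sim, targets)
```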
Sentence-level Contrast
Sentence-level contrastive learning aims to align the sentence-level representations across modalities, which are defined as follows:
$$\mathbf{w}^s = \frac{1}{n} \sum_{i=1}^{n} w_i, \quad (3)$$
$$\mathbf{v}^s = v_0. \quad (4)$$
[Figure 2 content: the image encoder (a Vision Transformer over patch embeddings with an extra [class] embedding), the source encoder and target decoder (standard Transformer encoder/decoder stacks with multi-head attention, feed-forward, and add & norm layers), and the contrastive module, which applies average pooling for sentence-level contrastive learning and selective attention for token-level contrastive learning.]
Figure 2: Overview of our proposed model.
We then calculate the contrastive loss within a batch of size $B$, whose textual representations and visual representations are $W^s = \{\mathbf{w}^s_1, \ldots, \mathbf{w}^s_B\}$ and $V^s = \{\mathbf{v}^s_1, \ldots, \mathbf{v}^s_B\}$, respectively. The corresponding pairs of images and captions $(\mathbf{w}^s_i, \mathbf{v}^s_i)$ are positive examples, and other pairs $(\mathbf{w}^s_i, \mathbf{v}^s_j)\,(i \neq j)$ are considered as negative examples. Finally, the loss function of sentence-level contrastive learning is defined as follows:
$$\mathcal{L}_{\mathrm{sctr}}(W^s, V^s) = \mathcal{L}_{\mathrm{ctr}}(W^s, V^s) + \mathcal{L}_{\mathrm{ctr}}(V^s, W^s). \quad (5)$$
Since we have image-text pairs in different languages within a batch, we first separate the batch into several mini-batches according to the language, and then calculate the contrastive loss for every language respectively. It is worth mentioning that we also calculate the contrastive loss for the target language $L_y$ with the paired data $\{(\mathbf{i}, \mathbf{y})\}$ in $D_{\hat{L}}$. We will analyze its effect in Section 4.3.
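Putting Eqs. (3)-(5) together, a sketch of the sentence-level objective with per-language mini-batches could look like the following. It reuses the `contrastive_loss` sketch above, and the padding mask and language-id handling are our own assumptions rather than the authors' implementation.

```python
# Sketch of the sentence-level objective: Eq. (3) pooling, Eq. (4) image vector, Eq. (5) loss,
# computed separately for each language present in the batch.
# Assumes contrastive_loss from the sketch in Section 2.3 is in scope.
import torch

def sentence_representations(w: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """w: (B, n, d) source-encoder outputs; mask: (B, n), 1 for real tokens. Implements Eq. (3)."""
    mask = mask.unsqueeze(-1).float()
    return (w * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

def sentence_level_loss(w_s: torch.Tensor, v_s: torch.Tensor, langs: list, tau: float = 0.1) -> torch.Tensor:
    """w_s: (B, d) pooled text vectors; v_s: (B, d) ViT [class] vectors (Eq. 4); langs: language id per example."""
    total = w_s.new_zeros(())
    for lang in set(langs):
        idx = torch.tensor([i for i, l in enumerate(langs) if l == lang], device=w_s.device)
        ws, vs = w_s[idx], v_s[idx]
        # Symmetric loss of Eq. (5): text-to-image plus image-to-text within this language's mini-batch.
        total = total + contrastive_loss(ws, vs, tau) + contrastive_loss(vs, ws, tau)
    return total
```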
2.4 Token-level Contrastive Learning
Though sentence-level contrastive learning can learn coarse-grained alignment between modalities, it may ignore some detailed information, which is crucial for predicting translations. To achieve better alignment between modalities, we propose token-level contrastive learning to learn fine-grained correspondences between images and text.
Selective Attention
To model the correlations between image patches and words, we use selective attention (Li et al., 2022) to learn the patch-level contribution of images. For the patch-level visual representations $\mathbf{v}^p = (v_1, \ldots, v_m)$ and the word-level textual representations $\mathbf{w} = (w_1, \ldots, w_n)$, the query, key and value of selective attention are $\mathbf{w}$, $\mathbf{v}^p$ and $\mathbf{v}^p$, respectively:
$$\mathbf{v}^t = \mathrm{Softmax}\!\left(\frac{(W_Q \cdot \mathbf{w})(W_K \cdot \mathbf{v}^p)^{\top}}{\sqrt{d_k}}\right)(W_V \cdot \mathbf{v}^p), \quad (6)$$
where $W_Q$, $W_K$ and $W_V$ are learnable parameter matrices.
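Below is a single-head sketch of Eq. (6), with parameter and class names of our own choosing; the exact (possibly multi-head) formulation of Li et al. (2022) may differ in detail.

```python
# Sketch of selective attention (Eq. 6): text tokens attend over image patches,
# producing one visually-grounded vector per token.
import math
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_K
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W_V

    def forward(self, w: torch.Tensor, vp: torch.Tensor) -> torch.Tensor:
        """w: (B, n, d) text states (queries); vp: (B, m, d) patch states (keys/values) -> (B, n, d)."""
        q, k, v = self.w_q(w), self.w_k(vp), self.w_v(vp)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, n, m) scaled dot products
        attn = scores.softmax(dim=-1)
        return attn @ v                                            # v^t: one attended vector per token
```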
Token-level Contrast
After the selective attention, we obtain two sequences $\mathbf{w} = (w_1, \ldots, w_n)$ and $\mathbf{v}^t = (v^t_1, \ldots, v^t_n)$ with the same length $n$. We then calculate the token-level contrastive loss within each pair of sequences. Tokens with the same index $(w_i, v^t_i)$ are positive examples, and other pairs of tokens $(w_i, v^t_j)\,(i \neq j)$ are negative examples. The token-level contrastive loss is as follows:
$$\mathcal{L}_{\mathrm{tctr}}(\mathbf{w}, \mathbf{v}^t) = \mathcal{L}_{\mathrm{ctr}}(\mathbf{w}, \mathbf{v}^t) + \mathcal{L}_{\mathrm{ctr}}(\mathbf{v}^t, \mathbf{w}). \quad (7)$$
The token-level contrastive losses of all image-text pairs will be summed together.
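Finally, a sketch of the token-level objective in Eq. (7), again reusing the `contrastive_loss` helper; the per-sentence loop and padding mask are illustrative assumptions rather than the authors' code.

```python
# Sketch of the token-level objective (Eq. 7): within one sentence, token w_i and its
# attended visual vector v^t_i form the positive pair and other positions are negatives;
# the symmetric losses are summed over all image-text pairs in the batch.
# Assumes contrastive_loss from the sketch in Section 2.3 is in scope.
import torch

def token_level_loss(w: torch.Tensor, v_t: torch.Tensor, mask: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """w, v_t: (B, n, d) token states and attended visual states; mask: (B, n), 1 for real tokens."""
    total = w.new_zeros(())
    for b in range(w.size(0)):
        keep = mask[b].bool()
        wb, vb = w[b][keep], v_t[b][keep]      # drop padding positions before contrasting
        total = total + contrastive_loss(wb, vb, tau) + contrastive_loss(vb, wb, tau)
    return total
```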