Low-resource Neural Machine Translation with Cross-modal Alignment
Zhe Yang1,2, Qingkai Fang1,2, Yang Feng1,2
1Key Laboratory of Intelligent Information Processing
Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2University of Chinese Academy of Sciences, Beijing, China
{yangzhe22s1,fangqingkai21b,fengyang}@ict.ac.cn
Abstract
How to achieve neural machine translation with limited parallel data? Existing techniques often rely on large-scale monolingual corpora, which are impractical to obtain for some low-resource languages. In this paper, we connect several low-resource languages to a particular high-resource one through the additional visual modality. Specifically, we propose a cross-modal contrastive learning method to learn a shared space for all languages, in which both a coarse-grained sentence-level objective and a fine-grained token-level one are introduced. Experimental results and further analysis show that our method can effectively learn the cross-modal and cross-lingual alignment with a small amount of image-text pairs, and achieves significant improvements over the text-only baseline under both zero-shot and few-shot scenarios. Our code can be found at https://github.com/ictnlp/LNMT-CA.
1 Introduction
Neural machine translation (NMT) has shown excellent performance and has become the dominant paradigm of machine translation. However, NMT is a data-driven approach that requires a large amount of parallel data. When the data is insufficient, it is impractical to train a reasonable NMT model. Unfortunately, there are many languages in the world for which sufficient training data is not available, and sometimes there is no parallel data at all. Therefore, the translation of low-resource languages is a vital challenge for NMT.

In recent years, researchers have attempted to improve the performance of NMT for low-resource languages. Lample et al. (2018a) proposed an unsupervised approach that learns weak mappings between languages from a large amount of monolingual data (>1M), which is also costly for low-resource languages.
Corresponding author: Yang Feng.
[Figure 1 content: example captions "<EN> a dog is running in the snow", "<DE> ein hund rennt im schnee", and "<FR> un chien court dans la neige" are linked through cross-modal alignment; the DE-EN direction has MT supervision, while the FR-EN direction is realized zero-shot or few-shot.]
Figure 1: We aim at realizing zero-shot and few-shot machine translation for the low-resource language. Different languages with the same meanings are projected to a shared space by cross-modal alignment.
Liu et al. (2020), Lin et al. (2020b), and Pan et al. (2021) proposed multilingual NMT models, which learn a shared space of multiple languages to achieve translation between languages that appear in the training set but do not have corresponding parallel data. However, they still require auxiliary parallel data of the source and target languages along with many other languages, which is still infeasible for low-resource languages.
In recent years, with increasing attention to multi-modal tasks, resources of image-text pairs have become more abundant. Inspired by recent efforts on cross-modal alignment (Radford et al., 2021; Li et al., 2021; Fang et al., 2022), in this paper we propose a cross-modal contrastive learning method, which aligns different languages with images as the pivot to enable zero-shot and few-shot translation for low-resource languages. With parallel sentence pairs between one high-resource auxiliary language and the target language, we can achieve translation from low-resource languages to the target language only by obtaining small amounts of image-text pairs (<0.1M) for those languages. The parallel sentence pairs are used to learn the mapping from the high-resource language to the target language, and the image-text pairs are used to learn a shared space for all languages through cross-modal alignment. With images as the pivot, the mapping from the low-resource languages to the target language is learned, thus achieving zero-shot translation without any parallel sentence pairs between them.
As shown in Figure 1, the high-resource language German and the low-resource language French are brought together by cross-modal alignment, which transfers the translation ability from DE→EN to FR→EN. Experiments and analysis show that our method consistently outperforms the baseline under both zero-shot and few-shot scenarios. Furthermore, our method can effectively realize cross-modal and cross-lingual alignment.
2 Method
In this section, we present our proposed cross-modal contrastive learning method, which includes both sentence-level and token-level objectives.
2.1 Task Definition
Our goal is to achieve zero-shot or few-shot translation from $T$ low-resource languages $L_1, L_2, \ldots, L_T$ to the target language $L_y$ with the help of a particular high-resource language $\hat{L}$. For the high-resource language $\hat{L}$, there are triples of data $D_{\hat{L}} = \{(\mathbf{i}, \mathbf{x}, \mathbf{y})\}$, where $\mathbf{i}$ is the image and $\mathbf{x}$ and $\mathbf{y}$ are the descriptions in $\hat{L}$ and $L_y$ respectively. For each low-resource language $L_i$, only paired data $D_{L_i} = \{(\mathbf{i}, \mathbf{x})\}$ are available. Note that different languages never share the same images.
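To make this data setting concrete, the following sketch shows one possible way to organize the triples for the high-resource language and the pairs for a low-resource language. The field names and container types are our own illustrative assumptions, not the authors' actual data format; the example captions are taken from Figure 1.

```python
# Illustrative data layout for D_L_hat (triples) and D_L_i (pairs); not the authors' code.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Example:
    image: str                       # identifier/path of the image i (illustrative)
    src_text: str                    # caption x in the source language
    tgt_text: Optional[str] = None   # caption y in L_y, only present for the high-resource language

# Triples (i, x, y) for the high-resource auxiliary language (DE in Figure 1).
high_resource_de: List[Example] = [
    Example("img_001.jpg", "ein hund rennt im schnee", "a dog is running in the snow"),
]

# Pairs (i, x) for a low-resource language (FR in Figure 1); note that the images
# are disjoint from those of the high-resource language.
low_resource_fr: List[Example] = [
    Example("img_774.jpg", "un chien court dans la neige"),
]
```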
2.2 Model Framework
As shown in Figure 2, our model consists of four sub-modules: the image encoder, the source encoder, the target decoder, and the contrastive module.
We use Vision Transformer (ViT) (Dosovitskiy et al., 2021) as the image encoder to extract visual features. ViT first splits the image into several patches, and then feeds the sequence of embedded patches, together with a special [class] token, into a Transformer (Vaswani et al., 2017). Finally, the image is encoded as a sequence of vectors $\mathbf{v} = (v_0, v_1, \ldots, v_m)$, where $v_0$ is the representation of the [class] token, which can be regarded as the global representation of the image, and $\mathbf{v}^p = (v_1, \ldots, v_m)$ are the patch-level representations. In the following sections, we use $v_0$ for sentence-level contrastive learning and $\mathbf{v}^p$ for token-level contrastive learning.
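As a concrete illustration of how $v_0$ and $\mathbf{v}^p$ could be obtained, the sketch below uses a pre-trained ViT from the HuggingFace transformers library; the specific checkpoint, library, and image path are assumptions for illustration, not necessarily what the authors used.

```python
# Sketch: extract the global [class] vector v_0 and patch vectors v^p from a ViT.
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.open("img_001.jpg").convert("RGB")        # path is illustrative
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    hidden = vit(pixel_values).last_hidden_state         # (1, 1 + m, d), [class] token first

v0 = hidden[:, 0]    # (1, d)    -> used for sentence-level contrastive learning
vp = hidden[:, 1:]   # (1, m, d) -> used for token-level contrastive learning
```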
The source encoder consists of $N$ Transformer encoder layers and is shared across all languages ($L_{1 \ldots T}$ and $\hat{L}$). For the input sentence $\mathbf{x} = (x_1, \ldots, x_n)$, the output of the source encoder is denoted as $\mathbf{w} = (w_1, \ldots, w_n)$. The target decoder consists of $N$ Transformer decoder layers. For the sentence pair $(\mathbf{x}, \mathbf{y})$, the cross-entropy loss is defined as:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{i=1}^{|\mathbf{y}|} \log p(y_i \mid \mathbf{y}_{<i}, \mathbf{x}). \quad (1)$$
The contrastive module aims to align the outputs of the image encoder and the source encoder, and contains both sentence-level and token-level parts. We will introduce them in Sections 2.3 and 2.4.
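A minimal PyTorch sketch of Eq. (1) is given below, assuming the decoder already produces per-token logits; the function and argument names are ours, not the authors' code.

```python
# Sketch of the translation cross-entropy loss in Eq. (1).
import torch
import torch.nn.functional as F

def translation_ce_loss(logits: torch.Tensor, tgt_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """logits: (B, |y|, V) scores for y_i given y_<i and x; tgt_ids: (B, |y|) gold token ids."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (B*|y|, V)
        tgt_ids.reshape(-1),                  # flatten to (B*|y|,)
        ignore_index=pad_id,                  # padding positions do not contribute to the loss
    )
```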
2.3 Sentence-level Contrastive Learning
We start with the sentence-level contrastive learning objective, which aims at learning coarse alignment between image and text.
Contrastive Learning
The idea of contrastive learning (Sohn, 2016) is to make the representations of corresponding pairs closer and, on the contrary, to make the irrelevant pairs farther apart.
Given two sets $X = \{x_i\}_{i=1}^{M}$ and $Y = \{y_i\}_{i=1}^{M}$, for each $x_i$, the positive example is $(x_i, y_i)$ and the remaining $M-1$ irrelevant pairs $(x_i, y_j)\,(i \neq j)$ are considered as negative examples. The contrastive loss between $X$ and $Y$ is defined as:
$$\mathcal{L}_{\mathrm{ctr}}(X, Y) = -\sum_{i=1}^{M} \log \frac{\exp(s(x_i, y_i)/\tau)}{\sum_{j=1}^{M} \exp(s(x_i, y_j)/\tau)}, \quad (2)$$
where $s(\cdot)$ is the cosine similarity function $s(a, b) = a^{\top} b / \lVert a \rVert \lVert b \rVert$, and $\tau$ is the temperature hyperparameter that controls the strength of penalties on hard negative samples (Wang and Liu, 2021).
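Eq. (2) corresponds to a standard InfoNCE-style objective. A minimal PyTorch sketch is shown below; the default temperature value and the function name are our own assumptions, and the mean reduction (rather than the sum written in Eq. (2)) is a common implementation choice.

```python
# Sketch of the contrastive loss in Eq. (2): row i of x matches row i of y.
import torch
import torch.nn.functional as F

def contrastive_loss(x: torch.Tensor, y: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """x, y: (M, d) paired representations; returns the loss of Eq. (2) over the M pairs."""
    x = F.normalize(x, dim=-1)                 # unit-normalize so that dot product = cosine similarity
    y = F.normalize(y, dim=-1)
    sim = x @ y.t() / tau                      # (M, M) similarity matrix scaled by temperature
    targets = torch.arange(x.size(0), device=x.device)
    # Negative log-softmax of the diagonal entries (positives), averaged over the batch.
    return F.cross_entropy(sim, targets)
```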
Sentence-level Contrast
Sentence-level contrastive learning aims to align the sentence-level representations across modalities, which are defined as follows:
$$\mathbf{w}^s = \frac{1}{n} \sum_{i=1}^{n} w_i, \quad (3)$$
$$\mathbf{v}^s = v_0. \quad (4)$$
[Figure 2 content: the image encoder (a Vision Transformer over patch embeddings with an extra [class] embedding), the source encoder and target decoder (standard Transformer encoder/decoder stacks with multi-head attention, feed-forward, and add & norm layers), and the contrastive module, which applies average pooling for sentence-level contrastive learning and selective attention for token-level contrastive learning.]
Figure 2: Overview of our proposed model.
We then calculate the contrastive loss within a batch of size $B$, whose textual representations and visual representations are $W^s = \{\mathbf{w}^s_1, \ldots, \mathbf{w}^s_B\}$ and $V^s = \{\mathbf{v}^s_1, \ldots, \mathbf{v}^s_B\}$, respectively. The corresponding pairs of images and captions $(\mathbf{w}^s_i, \mathbf{v}^s_i)$ are positive examples, and other pairs $(\mathbf{w}^s_i, \mathbf{v}^s_j)\,(i \neq j)$ are considered as negative examples. Finally, the loss function of sentence-level contrastive learning is defined as follows:
$$\mathcal{L}_{\mathrm{sctr}}(W^s, V^s) = \mathcal{L}_{\mathrm{ctr}}(W^s, V^s) + \mathcal{L}_{\mathrm{ctr}}(V^s, W^s). \quad (5)$$
Since we have image-text pairs in different languages within a batch, we first separate the batch into several mini-batches according to the language, and then calculate the contrastive loss for every language respectively. It is worth mentioning that we also calculate the contrastive loss for the target language $L_y$ with the paired data $\{(\mathbf{i}, \mathbf{y})\}$ in $D_{\hat{L}}$. We will analyze its effect in Section 4.3.
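Putting Eqs. (3)-(5) together, a sketch of the sentence-level objective with per-language mini-batches could look like the following. It reuses the `contrastive_loss` sketch above, and the padding mask and language-id handling are our own assumptions rather than the authors' implementation.

```python
# Sketch of the sentence-level objective: Eq. (3) pooling, Eq. (4) image vector, Eq. (5) loss,
# computed separately for each language present in the batch.
# Assumes contrastive_loss from the sketch in Section 2.3 is in scope.
import torch

def sentence_representations(w: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """w: (B, n, d) source-encoder outputs; mask: (B, n), 1 for real tokens. Implements Eq. (3)."""
    mask = mask.unsqueeze(-1).float()
    return (w * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

def sentence_level_loss(w_s: torch.Tensor, v_s: torch.Tensor, langs: list, tau: float = 0.1) -> torch.Tensor:
    """w_s: (B, d) pooled text vectors; v_s: (B, d) ViT [class] vectors (Eq. 4); langs: language id per example."""
    total = w_s.new_zeros(())
    for lang in set(langs):
        idx = torch.tensor([i for i, l in enumerate(langs) if l == lang], device=w_s.device)
        ws, vs = w_s[idx], v_s[idx]
        # Symmetric loss of Eq. (5): text-to-image plus image-to-text within this language's mini-batch.
        total = total + contrastive_loss(ws, vs, tau) + contrastive_loss(vs, ws, tau)
    return total
```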
2.4 Token-level Contrastive Learning
Though sentence-level contrastive learning can learn coarse-grained alignment between modalities, it may ignore some detailed information, which is crucial for predicting translations. To achieve better alignment between modalities, we propose token-level contrastive learning to learn fine-grained correspondences between images and text.
Selective Attention
To model the correlations between image patches and words, we use selective attention (Li et al., 2022) to learn the patch-level contribution of images. For the patch-level visual representations $\mathbf{v}^p = (v_1, \ldots, v_m)$ and the word-level textual representations $\mathbf{w} = (w_1, \ldots, w_n)$, the query, key and value of selective attention are $\mathbf{w}$, $\mathbf{v}^p$ and $\mathbf{v}^p$, respectively:
$$\mathbf{v}^t = \mathrm{Softmax}\!\left(\frac{(W_Q \cdot \mathbf{w})(W_K \cdot \mathbf{v}^p)^{\top}}{\sqrt{d_k}}\right)(W_V \cdot \mathbf{v}^p), \quad (6)$$
where $W_Q$, $W_K$ and $W_V$ are learnable parameter matrices.
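Below is a single-head sketch of Eq. (6), with parameter and class names of our own choosing; the exact (possibly multi-head) formulation of Li et al. (2022) may differ in detail.

```python
# Sketch of selective attention (Eq. 6): text tokens attend over image patches,
# producing one visually-grounded vector per token.
import math
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)  # W_Q
        self.w_k = nn.Linear(d_model, d_model, bias=False)  # W_K
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # W_V

    def forward(self, w: torch.Tensor, vp: torch.Tensor) -> torch.Tensor:
        """w: (B, n, d) text states (queries); vp: (B, m, d) patch states (keys/values) -> (B, n, d)."""
        q, k, v = self.w_q(w), self.w_k(vp), self.w_v(vp)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (B, n, m) scaled dot products
        attn = scores.softmax(dim=-1)
        return attn @ v                                            # v^t: one attended vector per token
```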
Token-level Contrast
After the selective attention, we obtain two sequences $\mathbf{w} = (w_1, \ldots, w_n)$ and $\mathbf{v}^t = (v^t_1, \ldots, v^t_n)$ with the same length $n$. We then calculate the token-level contrastive loss within each pair of sequences. Tokens with the same index $(w_i, v^t_i)$ are positive examples, and other pairs of tokens $(w_i, v^t_j)\,(i \neq j)$ are negative examples. The token-level contrastive loss is as follows:
$$\mathcal{L}_{\mathrm{tctr}}(\mathbf{w}, \mathbf{v}^t) = \mathcal{L}_{\mathrm{ctr}}(\mathbf{w}, \mathbf{v}^t) + \mathcal{L}_{\mathrm{ctr}}(\mathbf{v}^t, \mathbf{w}). \quad (7)$$
The token-level contrastive losses of all image-text pairs will be summed together.
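Finally, a sketch of the token-level objective in Eq. (7), again reusing the `contrastive_loss` helper; the per-sentence loop and padding mask are illustrative assumptions rather than the authors' code.

```python
# Sketch of the token-level objective (Eq. 7): within one sentence, token w_i and its
# attended visual vector v^t_i form the positive pair and other positions are negatives;
# the symmetric losses are summed over all image-text pairs in the batch.
# Assumes contrastive_loss from the sketch in Section 2.3 is in scope.
import torch

def token_level_loss(w: torch.Tensor, v_t: torch.Tensor, mask: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """w, v_t: (B, n, d) token states and attended visual states; mask: (B, n), 1 for real tokens."""
    total = w.new_zeros(())
    for b in range(w.size(0)):
        keep = mask[b].bool()
        wb, vb = w[b][keep], v_t[b][keep]      # drop padding positions before contrasting
        total = total + contrastive_loss(wb, vb, tau) + contrastive_loss(vb, wb, tau)
    return total
```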