Distill the Image to Nowhere: Inversion Knowledge Distillation for
Multimodal Machine Translation
Ru Peng1, Yawen Zeng2, Junbo Zhao1†
1Zhejiang University, Zhejiang, China
2Tencent WeChat, Shenzhen, China
pengru709909347@gmail.com, yawenzeng11@gmail.com, j.zhao@zju.edu.cn
Abstract
Past works on multimodal machine translation (MMT) elevate the bilingual setup by incorporating additional aligned vision information. However, an image-must requirement of the multimodal dataset largely hinders MMT's development: it demands an aligned form of [image, source text, target text]. This limitation is generally troublesome during the inference phase, especially when the aligned image is not provided, as in the normal NMT setup. Thus, in this work, we introduce IKD-MMT, a novel MMT framework that supports an image-free inference phase via an inversion knowledge distillation scheme. In particular, a multimodal feature generator is trained with a knowledge distillation module and directly generates the multimodal feature from (only) the source text as input. While there have been a few prior works entertaining the possibility of image-free inference for machine translation, their performance has yet to rival image-must translation. In our experiments, we identify our method as the first image-free approach to comprehensively rival or even surpass (almost) all image-must frameworks, and it achieves state-of-the-art results on the often-used Multi30k benchmark1.
1 Introduction
Multimodal machine translation (MMT) is a worthy task that elevates text-only translation by introducing an additional image modality (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018). Existing works mostly focus on the fusion and alignment of images and texts to improve MMT (Calixto et al., 2017; Ive et al., 2019; Yin et al., 2020), and they have managed to prove, in concept, the effectiveness of the aligned visual information.
Both authors contributed equally to this research.
†Corresponding author.
1Our code and data are available at: https://github.com/pengr/IKD-mmt/tree/master.
Figure 1: Examples of image-must MMT (a) and our image-free MMT (b). During testing, our IKD-MMT does not require the image as input.
Nevertheless, the strict triplet
data form of the dataset, in both the training and inference phases, has prevented the MMT model from generalizing further. In particular, if we consider using an MMT model to conduct translation in the normal bilingual text setting as in NMT, one must still provide the aligned images during inference, and unfortunately this is often not feasible. This general comparison between the image-free and image-must schemes is illustrated in Figure 1. In hindsight, the quantity and quality of attached images become a bottleneck for the development of MMT, as such resources are scarce and expensive to acquire (e.g., Multi30K (Elliott et al., 2016)).
Indeed, there have been a few attempts to resolve
the image-must limitation. For instance, Elliott and
Kádár (2017) present a multi-task learning model
for MMT where they rely on an auxiliary visual
grounding task to obtain the visual feature. Zhang
et al. (2020) introduce an image retrieval paradigm
to find topic-related images from a small-scale
dataset. Further, Long et al. (2021) attempt to
utilize a set of generative adversarial networks to
obtain an imaginary vision feature. We may posit
that a (nearly) common ground for such image-
free frameworks is to learn and further obtain a
generated visual feature representation without
the actual image data provided during inference.
However, none of the aforementioned works has managed to consistently reach the performance of its image-must counterpart. In this work, we hypothesise that this may be caused by inferior learned representations, insufficient coverage of the visual distribution, an improper multimodal fusion stage (Caglayan et al., 2017; Arslan et al., 2018; Helcl et al., 2018; Calixto and Liu, 2017), and/or a lack of training stability.
In this work, we intend to conduct a thorough exploration along this line. As shown in Figure 1(b), unlike prior works that solely target visual feature generation and/or rely on later stages of fusion, our approach directly generates a multimodal feature using only the source text as input. We enable this by proposing an inverse knowledge distillation mechanism that employs pre-trained convolutional neural networks (CNNs).
From our experiments, we find that this
architectural choice has notably enhanced the
training stability as well as the final representation
quality. To this end, we introduce the IKD-
MMT framework, an image-free framework that
systematically rivals or outperforms the image-
must frameworks. To set up the inverse knowledge
distillation flow, we incorporate dual CNNs with
inverted data feeding flow. Of the two, the teacher
network receives the pre-trained weights while the
student CNN is trained from scratch aiming to
provide a high-quality multimodal feature space
by incorporating both inter-modal and intra-modal
distillations.
Our contributions are summarized as follows:
i. The IKD-MMT framework is the first image-free method that systematically rivals or even outperforms existing image-must frameworks, which fully demonstrates the feasibility of the image-free concept;
ii. We pioneer the exploration of knowledge distillation combined with pre-trained models in the regime of MMT, as well as multimodal feature generation. We posit that these techniques shed some light on the representation learning and training stability of MMT.
2 Related Work
2.1 Multi-modal Machine Translation
As an intersection of multimedia and neural
machine translation (NMT), MMT has drawn great
attention in the research community. Technically,
existing methods mainly focus on how to better
integrate visual information into the framework of
NMT. 1) Calixto et al. (2017) propose a doubly-attentive decoder that incorporates two separate attention mechanisms over the source words and visual features. 2) Ive et al. (2019) propose a translate-and-refine approach that refines draft translations with visual features. 3) Yao and Wan (2020) propose the multimodal Transformer, which induces image representations from the text under the guidance of image-aware attention. 4) Yin et al. (2020) employ a unified multimodal graph to capture various semantic interactions between multimodal semantic units.
However, the quantity and quality of the annotated images, which are scarce and expensive to obtain, limit the development of this task. In this work, we aim to perform MMT in an image-free manner, which breaks this data constraint.
2.2 Knowledge Distillation
Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) aims to use a knowledge-rich teacher network to guide the parameter learning of a student network. KD has been investigated in a wide range of fields. Romero et al. (2014) transfer knowledge through an intermediate hidden layer to extend KD.
Yim et al. (2017) define the distilled knowledge
to be transferred in terms of flow between layers,
which is calculated by the inner product between
features from two layers. In the multimedia field,
Gupta et al. (2016) first introduce the technique
that transfers supervision between images from
different modalities. Yuan and Peng (2018) propose symmetric distillation networks for the text-to-image synthesis task.
Inspired by these pioneering efforts, our IKD-MMT framework intends to take full advantage of KD to generate a multimodal feature that overcomes the triplet data constraint.
3 IKD-MMT Model
As illustrated in Figure 2, the proposed framework
consists of two components: an image-free MMT
backbone and a multimodal feature generator.
Figure 2: The framework of our IKD-MMT model. The multimodal feature generator, multimodal student network and visual teacher network are the most critical modules, which help break the image-must dataset constraint.
3.1 Image-Free MMT Backbone
Given a source sentence X = (x_1, ..., x_I), each token x_i is mapped into a word embedding vector E_{x_i} \in R^{d_w} through the textual embedding with position encoding (Gehring et al., 2017), where d_w and t = (E_{x_1}, ..., E_{x_I}) are the word embedding dimension and the textual feature, respectively. Then, we feed the text feature t together with the multimodal feature m (detailed in Section 3.2.1) into the multimodal Transformer encoder (Yao and Wan, 2020). In the multimodal encoder layer, we cascade the multimodal feature m and the text feature t to reorganize a new multimodal feature \tilde{x} as the query vector:

\tilde{x} = [t; m W^m] \in R^{(I+P) \times d},    (1)

where I is the length of the source sentence and P is the size of the multimodal feature. Here, we can
understand this modal fusion from the perspective
of nodes and graphs. If we treat each source token
as a node, each region of the multimodal feature
can also be regarded as a pseudo-token and added
to the source token graph for modal fusion. The key
and value vectors are preserved as the text feature t, and the multimodal encoder layer is calculated as follows:

c_k = \sum_{i=1}^{I} \tilde{\alpha}_{ki} (t_i W^V),    (2)

\tilde{\alpha}_{ki} = softmax( (\tilde{x}_k W^Q)(t_i W^K)^\top / \sqrt{d} ).    (3)
In this paper, we directly adopt the Transformer decoder2 (Vaswani et al., 2017) for translation.
2For details, please refer to the original paper.
Given a target sentence Y = (y_1, ..., y_J), our framework outputs the predicted probability of the target word y_j as follows:

p(y_j | y_{<j}, X, m) \propto exp(W_h H^L_j + b_h),    (4)

where H^L_j represents the top-layer output of the decoder at the j-th decoding time step, W_h and b_h are learnable parameters of a multi-layer perceptron, and exp(·) is a softmax layer.
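As a small worked sketch of Eq. (4), with an assumed model dimension and vocabulary size (not the paper's settings), the output distribution at one decoding step is a learned projection of H^L_j followed by a softmax:

import torch
import torch.nn as nn

d_model, vocab_size = 512, 32000            # illustrative sizes only
out_proj = nn.Linear(d_model, vocab_size)   # holds W_h and b_h

h_top = torch.randn(1, d_model)             # H^L_j: top decoder output at step j
p_yj = torch.softmax(out_proj(h_top), dim=-1)   # p(y_j | y_<j, X, m), sums to 1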
3.2 Multimodal Feature Generation
3.2.1 Preliminaries
In this part, we introduce the framework, symbol definitions, and task goal of multimodal feature generation.
The framework is composed of a multimodal feature generator F, a visual teacher model T, and a multimodal student model S. The detailed architecture of each module is shown in Table 7 of the appendix. The model parameters of S are denoted as \theta^s. When the global text feature t is fed into S, the hidden representation produced by its l-th layer is denoted as \phi^S_l(t, \theta^s_l). F outputs a multimodal feature m, and S produces an inverse feature I_s after the S-conv1 layer. The real image and the inverse feature are {I_s, I_r} \in R^{m \times n \times 3}. Given a feature I as input, the hidden representation produced by the l-th layer of T is denoted as \phi^T_l(I).
Our goal is to generate multimodal features from
the source text to break the image-must restriction
in testing. The visual perception of this multimodal
feature is extracted from the visual distillation
of the teacher-student model, while the textual