dataset. Further, Long et al. (2021) attempt to utilize a set of generative adversarial networks to obtain imaginary visual features. A (nearly) common ground for such image-free frameworks is to learn to generate a visual feature representation so that no actual image data is required during inference. However, none of the aforementioned works has managed to consistently reach the performance of its image-must counterpart. In this work, we hypothesise that this can be caused by an inferior learned representation, insufficient coverage of the visual distribution, an improper multimodal fusion stage (Caglayan et al., 2017; Arslan et al., 2018; Helcl et al., 2018; Calixto and Liu, 2017), and/or a lack of training stability.
In this work, we intend to explore this line of research thoroughly. As shown in Figure 1(b), unlike prior works that solely target visual feature generation and/or rely on later stages of fusion, our approach directly generates a multimodal feature using only the source text input. We enable this by proposing an inverse knowledge distillation mechanism that employs pre-trained convolutional neural networks (CNNs). From our experiments, we find that this architectural choice notably enhances training stability as well as the quality of the final representation. Building on this design, we introduce the IKD-MMT framework, an image-free framework that systematically rivals or outperforms image-must frameworks. To set up the inverse knowledge distillation flow, we incorporate dual CNNs with an inverted data-feeding flow: the teacher network is initialized with pre-trained weights, while the student CNN is trained from scratch and aims to provide a high-quality multimodal feature space by incorporating both inter-modal and intra-modal distillation.
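To give a concrete flavour of the inverse distillation flow, the following is a minimal sketch under simplifying assumptions: the teacher is taken to be a frozen, pre-trained ResNet-50, the distillation objective is a single MSE term between pooled features, and the student consumes source-text encoder states reshaped into a 2D grid. All module names, shapes, and the choice of backbone are illustrative rather than the exact configuration, and intra-modal distillation terms are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative sketch of inverse (text -> image-feature) distillation.
# Assumptions: frozen pre-trained ResNet-50 teacher, MSE feature matching,
# student CNN operating on text-derived feature maps; shapes are hypothetical.

class StudentCNN(nn.Module):
    def __init__(self, in_channels=512, feat_dim=2048):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 1024, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, text_feature_map):
        # text_feature_map: source-text encoder states reshaped to a 2D grid,
        # e.g. (batch, 512, 8, 8); this reshaping step is an assumption.
        return self.conv(text_feature_map).flatten(1)  # (batch, feat_dim)

# Frozen teacher: a pre-trained CNN that sees the real image only during training.
teacher = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
teacher.fc = nn.Identity()            # expose the 2048-d pooled feature
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

student = StudentCNN()
mse = nn.MSELoss()

def distillation_loss(text_feature_map, image):
    with torch.no_grad():
        t_feat = teacher(image)            # (batch, 2048), from the real image
    s_feat = student(text_feature_map)     # (batch, 2048), from text only
    return mse(s_feat, t_feat)             # inter-modal feature matching
```

At inference time, only the student branch is executed, so no image input is required.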
Our contributions are summarized as follows:
i. The IKD-MMT framework is the first image-free method that systematically rivals or even outperforms existing image-must frameworks, which fully demonstrates the feasibility of the image-free concept;
ii. We pioneer the exploration of knowledge distillation combined with pre-trained models in the regime of MMT, as well as multimodal feature generation. We posit that these techniques shed some light on the representation learning and training stability of MMT.
2 Related Work
2.1 Multi-modal Machine Translation
As an intersection of multimedia and neural
machine translation (NMT), MMT has drawn great
attention in the research community. Technically,
existing methods mainly focus on how to better
integrate visual information into the framework of
NMT. 1) Calixto et al. (2017) propose a doubly-attentive decoder that incorporates two separate attention mechanisms over the source words and visual features. 2) Ive et al. (2019) propose a translate-and-refine approach that refines draft translations with visual features. 3) Yao and Wan (2020) propose the multimodal Transformer to induce image representations from the text under the guidance of image-aware attention. 4) Yin et al. (2020) employ a unified multimodal graph to capture various semantic interactions between multimodal semantic units.
However, the development of this task is limited by the quantity and quality of annotated images, which are scarce and expensive to obtain. In this work, we aim to perform MMT in an image-free manner, which breaks this data constraint.
2.2 Knowledge Distillation
Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) aims to use a knowledge-rich teacher network to guide the parameter learning of the student network. In fact, KD
has been investigated in a wide range of fields.
Romero et al. (2014) extend KD by transferring knowledge through an intermediate hidden layer.
Yim et al. (2017) define the distilled knowledge
to be transferred in terms of flow between layers,
which is calculated by the inner product between
features from two layers. In the multimedia field,
Gupta et al. (2016) first introduce the technique
that transfers supervision between images from
different modalities. Yuan and Peng (2018) propose symmetric distillation networks for the text-to-image synthesis task.
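For reference, the classic logit-based formulation of Hinton et al. (2015) softens the teacher logits $z_t$ and the student logits $z_s$ with a temperature $T$ and combines a distillation term with the standard supervised loss:
$$\mathcal{L}_{\mathrm{KD}} = (1-\alpha)\,\mathrm{CE}\big(y, \sigma(z_s)\big) + \alpha\, T^{2}\, \mathrm{KL}\big(\sigma(z_t/T)\,\|\,\sigma(z_s/T)\big),$$
where $\sigma$ denotes the softmax, $y$ the ground-truth label, and $\alpha$ a balancing weight; feature-level variants such as Romero et al. (2014) instead match intermediate representations rather than logits.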
Inspired by these pioneering efforts, our IKD-MMT framework intends to take full advantage of KD to generate a multimodal feature and thereby overcome the triplet data constraint.
3 IKD-MMT Model
As illustrated in Figure 2, the proposed framework
consists of two components: an image-free MMT
backbone and a multimodal feature generator.
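As a rough sketch of how these two components could interact at inference time (the interfaces and names below are hypothetical and for illustration only; the exact designs are described in the following subsections):

```python
# Illustrative image-free inference flow; all interfaces are assumptions.
def translate(source_tokens, text_encoder, feature_generator, decoder):
    text_states = text_encoder(source_tokens)     # encode the source sentence
    mm_feature = feature_generator(text_states)   # multimodal feature generated
                                                  # from text alone; no image needed
    return decoder(text_states, mm_feature)       # produce the target translation
```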