dataset. Further, Long et al. (2021) attempt to utilize a set of generative adversarial networks to obtain imaginary visual features. A (nearly) common ground for such image-free frameworks is to learn to generate a visual feature representation so that no actual image data is required during inference. However, none of the aforementioned works has managed to consistently reach the performance of its image-must counterpart. In this work, we hypothesise that this can be caused by an inferior learned representation, insufficient coverage of the visual distribution, an improper multimodal fusion stage (Caglayan et al., 2017; Arslan et al., 2018; Helcl et al., 2018; Calixto and Liu, 2017), and/or a lack of training stability.
In this work, we intend to explore this line of research thoroughly. As shown in Figure 1(b), unlike prior works that solely target visual feature generation and/or rely on later stages of fusion, our approach directly generates a multimodal feature using only the source text input. We enable this by proposing an inverse knowledge distillation mechanism that employs pre-trained convolutional neural networks (CNNs). From our experiments, we find that this architectural choice notably enhances training stability as well as the quality of the final representation. Building on this design, we introduce the IKD-MMT framework, an image-free framework that systematically rivals or outperforms image-must frameworks. To set up the inverse knowledge distillation flow, we incorporate dual CNNs with an inverted data-feeding flow: the teacher network is initialized with pre-trained weights, while the student CNN is trained from scratch and aims to provide a high-quality multimodal feature space by incorporating both inter-modal and intra-modal distillation.
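To give a concrete flavour of the inverse distillation flow, the following is a minimal sketch under simplifying assumptions: the teacher is taken to be a frozen, pre-trained ResNet-50, the distillation objective is a single MSE term between pooled features, and the student consumes source-text encoder states reshaped into a 2D grid. All module names, shapes, and the choice of backbone are illustrative rather than the exact configuration, and intra-modal distillation terms are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative sketch of inverse (text -> image-feature) distillation.
# Assumptions: frozen pre-trained ResNet-50 teacher, MSE feature matching,
# student CNN operating on text-derived feature maps; shapes are hypothetical.

class StudentCNN(nn.Module):
    def __init__(self, in_channels=512, feat_dim=2048):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 1024, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(1024, feat_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, text_feature_map):
        # text_feature_map: source-text encoder states reshaped to a 2D grid,
        # e.g. (batch, 512, 8, 8); this reshaping step is an assumption.
        return self.conv(text_feature_map).flatten(1)  # (batch, feat_dim)

# Frozen teacher: a pre-trained CNN that sees the real image only during training.
teacher = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
teacher.fc = nn.Identity()            # expose the 2048-d pooled feature
teacher.eval()
for p in teacher.parameters():
    p.requires_grad = False

student = StudentCNN()
mse = nn.MSELoss()

def distillation_loss(text_feature_map, image):
    with torch.no_grad():
        t_feat = teacher(image)            # (batch, 2048), from the real image
    s_feat = student(text_feature_map)     # (batch, 2048), from text only
    return mse(s_feat, t_feat)             # inter-modal feature matching
```

At inference time, only the student branch is executed, so no image input is required.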
Our contributions are summarized as follows:
i. The IKD-MMT framework is the first image-free method that systematically rivals or even outperforms existing image-must frameworks, which fully demonstrates the feasibility of the image-free concept;
ii. We pioneer the exploration of knowledge distillation combined with pre-trained models in the regime of MMT, as well as multimodal feature generation. We posit that these techniques shed some light on the representation learning and training stability of MMT.
2 Related Work
2.1 Multi-modal Machine Translation
As an intersection of multimedia and neural
machine translation (NMT), MMT has drawn great
attention in the research community. Technically,
existing methods mainly focus on how to better
integrate visual information into the framework of
NMT. 1) Calixto et al. (2017) propose a doubly-attentive decoder that incorporates two separate attention mechanisms over the source words and visual features. 2) Ive et al. (2019) propose a translate-and-refine approach that refines draft translations with visual features. 3) Yao and Wan (2020) propose the multimodal Transformer to induce image representations from the text under the guidance of image-aware attention. 4) Yin et al. (2020) employ a unified multimodal graph to capture various semantic interactions between multimodal semantic units.
However, the development of this task is limited by the quantity and quality of annotated images, which are scarce and expensive to obtain. In this work, we aim to perform MMT in an image-free manner, which breaks this data constraint.
2.2 Knowledge Distillation
Knowledge distillation (KD) (Buciluǎ et al., 2006; Hinton et al., 2015) aims to use a knowledge-rich teacher network to guide the parameter learning of the student network. In fact, KD
has been investigated in a wide range of fields.
Romero et al. (2014) extend KD by transferring knowledge through an intermediate hidden layer.
Yim et al. (2017) define the distilled knowledge
to be transferred in terms of flow between layers,
which is calculated by the inner product between
features from two layers. In the multimedia field,
Gupta et al. (2016) first introduce the technique
that transfers supervision between images from
different modalities. Yuan and Peng (2018) propose symmetric distillation networks for the text-to-image synthesis task.
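For reference, the classic logit-based formulation of Hinton et al. (2015) softens the teacher logits $z_t$ and the student logits $z_s$ with a temperature $T$ and combines a distillation term with the standard supervised loss:
$$\mathcal{L}_{\mathrm{KD}} = (1-\alpha)\,\mathrm{CE}\big(y, \sigma(z_s)\big) + \alpha\, T^{2}\, \mathrm{KL}\big(\sigma(z_t/T)\,\|\,\sigma(z_s/T)\big),$$
where $\sigma$ denotes the softmax, $y$ the ground-truth label, and $\alpha$ a balancing weight; feature-level variants such as Romero et al. (2014) instead match intermediate representations rather than logits.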
Inspired by these pioneering efforts, our IKD-MMT framework intends to take full advantage of KD to generate a multimodal feature and thereby overcome the triplet data constraint.
3 IKD-MMT Model
As illustrated in Figure 2, the proposed framework
consists of two components: an image-free MMT
backbone and a multimodal feature generator.
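As a rough sketch of how these two components could interact at inference time (the interfaces and names below are hypothetical and for illustration only; the exact designs are described in the following subsections):

```python
# Illustrative image-free inference flow; all interfaces are assumptions.
def translate(source_tokens, text_encoder, feature_generator, decoder):
    text_states = text_encoder(source_tokens)     # encode the source sentence
    mm_feature = feature_generator(text_states)   # multimodal feature generated
                                                  # from text alone; no image needed
    return decoder(text_states, mm_feature)       # produce the target translation
```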