
Figure 1: Taxonomy of challenges in multimodal learning. Triangles and circles represent
elements from two different modalities. (1) Representation studies how to represent and summarize
multimodal data so as to reflect the heterogeneity and interconnections between individual modality
elements. (2) Alignment aims to identify the connections and interactions across all elements. (3)
Reasoning aims to compose knowledge from multimodal evidence, usually through multiple inferential
steps, for a task. (4) Generation involves learning a generative process to produce raw modalities
that reflect cross-modal interactions, structure, and coherence. (5) Transference aims to transfer
knowledge between modalities and their representations. (6) Quantification involves empirical and
theoretical studies to better understand heterogeneity, interconnections, and the multimodal learning
process.
4. Finally, we look at an approach for generating unimodal labels from multimodal datasets in a
self-supervised fashion, where multitask learning is used to jointly train on both the multimodal
and the unimodal labels (a minimal sketch of this setup is given after the list).
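To make the multitask setup concrete, the following is a minimal sketch under assumed components: a pair of modality encoders, a multimodal classification head, and per-modality heads trained jointly with a weighted sum of cross-entropy losses. The module names, feature dimensions, and loss weights are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class MultitaskMultimodalModel(nn.Module):
    """Hypothetical sketch: joint training on multimodal and unimodal labels."""

    def __init__(self, text_dim=300, image_dim=2048, hidden=128, num_classes=2):
        super().__init__()
        self.text_enc = nn.Linear(text_dim, hidden)    # placeholder text encoder
        self.image_enc = nn.Linear(image_dim, hidden)  # placeholder image encoder
        self.multimodal_head = nn.Linear(2 * hidden, num_classes)
        self.text_head = nn.Linear(hidden, num_classes)
        self.image_head = nn.Linear(hidden, num_classes)

    def forward(self, text_feat, image_feat):
        t = torch.relu(self.text_enc(text_feat))
        v = torch.relu(self.image_enc(image_feat))
        fused = torch.cat([t, v], dim=-1)
        return self.multimodal_head(fused), self.text_head(t), self.image_head(v)

def multitask_loss(outputs, mm_label, text_label, image_label, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the multimodal loss and the two unimodal losses."""
    ce = nn.CrossEntropyLoss()
    mm_out, t_out, v_out = outputs
    return (weights[0] * ce(mm_out, mm_label)
            + weights[1] * ce(t_out, text_label)
            + weights[2] * ce(v_out, image_label))
```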
For brevity, we give a high-level description of each method, omitting the finer details, and
refer the reader to the original work for further reference.
2 Methodology
2.1 Cross-modal generation
The main idea proposed by Gu et al. [2018] is, in addition to the conventional cross-modal feature
embedding at the global semantic level, to introduce an additional cross-modal feature embedding at
the local level, which is grounded by two generative models: image-to-text and text-to-image. Figure 2
illustrates the concept of the proposed cross-modal feature embedding with generative models at a high
level; it includes three learning steps: look, imagine, and match. Given a query in image or text form,
we first look at the query to extract an abstract representation. Then, we imagine what the target item
(text or image) in the other modality should look like, obtaining a more concrete, grounded
representation. This is accomplished by using the representation of one modality (to be estimated) to
generate the item in the other modality and comparing the generated items with gold standards. Finally,
in the match step, the correct image-text pairs are identified using a relevance score calculated from
a combination of the grounded and abstract representations.
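As an illustration only (not the authors' implementation), the match step could be sketched as follows; the encoder modules, their names, and the weighting parameter `alpha` are assumptions made for this example.

```python
import torch.nn.functional as F

def relevance_score(image, text, encoders, alpha=0.5):
    """Match step: score an image-text pair from abstract and grounded features.

    `encoders` is assumed to hold four embedding functions yielding the
    high-level abstract features (v_h, t_h) and the local grounded features
    (v_l, t_l). During training, v_l and t_l would additionally drive the
    image-to-text and text-to-image generators, whose outputs are compared
    with gold standards to ground these local features.
    """
    v_h = encoders["img_high"](image)   # abstract visual feature
    t_h = encoders["txt_high"](text)    # abstract textual feature
    v_l = encoders["img_local"](image)  # grounded visual feature
    t_l = encoders["txt_local"](text)   # grounded textual feature

    abstract_sim = F.cosine_similarity(v_h, t_h, dim=-1)
    grounded_sim = F.cosine_similarity(v_l, t_l, dim=-1)

    # Relevance is a weighted combination of the two similarities.
    return alpha * abstract_sim + (1.0 - alpha) * grounded_sim
```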
Architecture
Figure 3 shows the overall architecture for the proposed generative cross-modal
feature learning framework. The entire system consists of three training paths: multi-modal feature
embedding (the entire upper part), image-to-text generative feature learning (the blue path), and
text-to-image generative adversarial feature learning (the green path). The first path is similar to the
existing cross-modal feature embedding that maps different modality features into a common space.
However, the difference here is that they use two branches of feature embedding, i.e., making the
embedded visual feature $v_h$ (resp. $v_l$) and the textual feature $t_h$ (resp. $t_l$) closer. They
consider ($v_h$, $t_h$) as high-level abstract features and ($v_l$, $t_l$) as detailed grounded
features. The grounded features