
Figure 1: Taxonomy of challenges in multimodal learning. Triangles and circles represent
elements from two different modalities. (1) Representation studies how to represent and summarize
multimodal data so as to reflect the heterogeneity and interconnections between individual modality
elements. (2) Alignment aims to identify the connections and interactions across all elements. (3)
Reasoning aims to compose knowledge from multimodal evidence, usually through multiple inferential
steps, for a task. (4) Generation involves learning a generative process to produce raw modalities
that reflect cross-modal interactions, structure, and coherence. (5) Transference aims to transfer
knowledge between modalities and their representations. (6) Quantification involves empirical and
theoretical studies to better understand heterogeneity, interconnections, and the multimodal learning
process.
4. Finally, we look at an approach for generating unimodal labels from multimodal datasets in a
self-supervised fashion, where multitask learning is used to jointly train on both the multimodal
and the unimodal labels (a minimal sketch of this setup is given after the list).
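To make the multitask setup concrete, the following is a minimal sketch under assumed components: a pair of modality encoders, a multimodal classification head, and per-modality heads trained jointly with a weighted sum of cross-entropy losses. The module names, feature dimensions, and loss weights are illustrative assumptions, not the original implementation.

```python
import torch
import torch.nn as nn

class MultitaskMultimodalModel(nn.Module):
    """Hypothetical sketch: joint training on multimodal and unimodal labels."""

    def __init__(self, text_dim=300, image_dim=2048, hidden=128, num_classes=2):
        super().__init__()
        self.text_enc = nn.Linear(text_dim, hidden)    # placeholder text encoder
        self.image_enc = nn.Linear(image_dim, hidden)  # placeholder image encoder
        self.multimodal_head = nn.Linear(2 * hidden, num_classes)
        self.text_head = nn.Linear(hidden, num_classes)
        self.image_head = nn.Linear(hidden, num_classes)

    def forward(self, text_feat, image_feat):
        t = torch.relu(self.text_enc(text_feat))
        v = torch.relu(self.image_enc(image_feat))
        fused = torch.cat([t, v], dim=-1)
        return self.multimodal_head(fused), self.text_head(t), self.image_head(v)

def multitask_loss(outputs, mm_label, text_label, image_label, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the multimodal loss and the two unimodal losses."""
    ce = nn.CrossEntropyLoss()
    mm_out, t_out, v_out = outputs
    return (weights[0] * ce(mm_out, mm_label)
            + weights[1] * ce(t_out, text_label)
            + weights[2] * ce(v_out, image_label))
```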
For brevity, we give a high-level description of each method, omitting the finer details, and
refer the reader to the original work for further reference.
2 Methodology
2.1 Cross-modal generation
The main idea proposed by Gu et al. [2018] is, in addition to the conventional cross-modal feature
embedding at the global semantic level, to introduce an additional cross-modal feature embedding at
the local level, which is grounded by two generative models: image-to-text and text-to-image. Figure 2
illustrates the concept of the proposed cross-modal feature embedding with generative models at a high
level; it includes three learning steps: look, imagine, and match. Given a query in image or text form,
we first look at the query to extract an abstract representation. Then, we imagine what the target item
(text or image) in the other modality should look like, obtaining a more concrete, grounded
representation. This is accomplished by using the representation of one modality (to be estimated) to
generate the item in the other modality and comparing the generated items with gold standards. Finally,
in the match step, the correct image-text pairs are identified using a relevance score calculated from
a combination of the grounded and abstract representations.
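As an illustration only (not the authors' implementation), the match step could be sketched as follows; the encoder modules, their names, and the weighting parameter `alpha` are assumptions made for this example.

```python
import torch.nn.functional as F

def relevance_score(image, text, encoders, alpha=0.5):
    """Match step: score an image-text pair from abstract and grounded features.

    `encoders` is assumed to hold four embedding functions yielding the
    high-level abstract features (v_h, t_h) and the local grounded features
    (v_l, t_l). During training, v_l and t_l would additionally drive the
    image-to-text and text-to-image generators, whose outputs are compared
    with gold standards to ground these local features.
    """
    v_h = encoders["img_high"](image)   # abstract visual feature
    t_h = encoders["txt_high"](text)    # abstract textual feature
    v_l = encoders["img_local"](image)  # grounded visual feature
    t_l = encoders["txt_local"](text)   # grounded textual feature

    abstract_sim = F.cosine_similarity(v_h, t_h, dim=-1)
    grounded_sim = F.cosine_similarity(v_l, t_l, dim=-1)

    # Relevance is a weighted combination of the two similarities.
    return alpha * abstract_sim + (1.0 - alpha) * grounded_sim
```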
Architecture
Figure 3 shows the overall architecture for the proposed generative cross-modal
feature learning framework. The entire system consists of three training paths: multi-modal feature
embedding (the entire upper part), image-to-text generative feature learning (the blue path), and
text-to-image generative adversarial feature learning (the green path). The first path is similar to the
existing cross-modal feature embedding that maps different modality features into a common space.
However, the difference here is that they use two branches of feature embedding, i.e., making the
embedded visual feature $v_h$ (resp. $v_l$) and the textual feature $t_h$ (resp. $t_l$) closer. They
consider ($v_h$, $t_h$) as high-level abstract features and ($v_l$, $t_l$) as detailed grounded
features. The grounded features