A survey on Self Supervised learning approaches for
improving Multimodal representation learning
Naman Goyal
Department of Computer Science
Columbia University
ng2848@columbia.edu
Abstract
Recently, self-supervised learning has seen explosive growth and use in a variety of machine learning tasks because of its ability to avoid the cost of annotating large-scale datasets. This paper gives an overview of the best self-supervised learning approaches for multimodal learning. The presented approaches have been aggregated through an extensive study of the literature and apply self-supervised learning in different ways. The approaches discussed are cross-modal generation, cross-modal pretraining, cyclic translation, and generating unimodal labels in a self-supervised fashion.
1 Introduction
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design intelli-
gent systems for understanding, reasoning, and learning through integrating multiple communicative
modalities, including linguistic, acoustic, visual, tactile, and physiological messages. Multimodal
learning attracts intensive research interest because of broad applications such as intelligent tutoring
[Petrovica et al., 2017], robotics [Noda et al., 2014], and healthcare [Frantzidis et al., 2010]. Generally
speaking, existing research efforts mainly focus on how to fuse multimodal data effectively and how
to learn a good representation for each modality. Further, the extensive survey by Liang et al. [2022] gives a taxonomy of the challenges in multimodal learning, as shown in Figure 1.
Self-supervised learning [Jaiswal et al., 2020] obtains supervisory signals from the data itself, often
leveraging the underlying structure in the data. The general technique of self-supervised learning is
to predict any unobserved or hidden part (or property) of the input from any observed or unhidden
part of the input. For example, as is common in NLP, we can hide part of a sentence and predict the
hidden words from the remaining words.
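As a concrete illustration of this hide-and-predict recipe, here is a minimal masked-token prediction sketch in PyTorch; the toy vocabulary, model sizes, masking rate, and mask id are illustrative assumptions, not taken from any of the surveyed works.

```python
import torch
import torch.nn as nn

vocab_size, d_model, MASK_ID = 1000, 64, 0

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
to_vocab = nn.Linear(d_model, vocab_size)

tokens = torch.randint(1, vocab_size, (8, 16))  # a batch of token ids
inputs = tokens.clone()
mask = torch.rand(tokens.shape) < 0.15          # hide ~15% of the tokens
inputs[mask] = MASK_ID

logits = to_vocab(encoder(embed(inputs)))       # predict a word at every position
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # score only hidden ones
loss.backward()
```

The supervisory signal here comes entirely from the sentence itself: no external labels are required.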
The aim of this paper is to provide an overview of the best self-supervised learning approaches that tackle the representation challenge in multimodal learning. We present four approaches identified through an extensive survey of the literature in the domain.
1. We first look at cross-modal generation, which, for a given image-text pair, generates image-to-text and text-to-image and then compares the generated text and image samples with the input pair.
2. We then look at the cross-modal transformer, which uses cues from different modalities, namely audio and video, to predict the masked token in masked language modelling.
3. We then look at the approach of cyclic translation between modalities using a Seq2Seq network: a given modality is translated to another modality and then back-translated, and the learned hidden encoding is used for the final prediction (a minimal sketch follows this list).
Figure 1: Taxonomy of the various challenges in multimodal learning. Triangles and circles represent elements from two different modalities. (1) Representation studies how to represent and summarize multimodal data so as to reflect the heterogeneity and interconnections between individual modality elements. (2) Alignment aims to identify the connections and interactions across all elements. (3) Reasoning aims to compose knowledge from multimodal evidence, usually through multiple inferential steps, for a task. (4) Generation involves learning a generative process to produce raw modalities that reflect cross-modal interactions, structure, and coherence. (5) Transference aims to transfer knowledge between modalities and their representations. (6) Quantification involves empirical and theoretical studies to better understand heterogeneity, interconnections, and the multimodal learning process.
4. Finally, we look at the approach of generating unimodal labels from multimodal datasets in a self-supervised fashion, with multitask learning used to jointly train on both multimodal and unimodal labels.
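As a hedged sketch of the cyclic-translation idea from item 3, the PyTorch module below translates modality A to modality B with a GRU Seq2Seq, back-translates to A, and predicts from the learned encoding. The layer sizes, the MSE reconstruction losses, and the simplified decoder (a second GRU reading the encoder states, rather than a full attention-based decoder) are our illustrative assumptions, not the original model.

```python
import torch
import torch.nn as nn

class CyclicTranslator(nn.Module):
    def __init__(self, dim_a, dim_b, hidden=128, n_classes=2):
        super().__init__()
        self.enc_ab = nn.GRU(dim_a, hidden, batch_first=True)  # encode modality A
        self.dec_ab = nn.GRU(hidden, dim_b, batch_first=True)  # decode into modality B
        self.enc_ba = nn.GRU(dim_b, hidden, batch_first=True)  # re-encode the translation
        self.dec_ba = nn.GRU(hidden, dim_a, batch_first=True)  # back-translate into A
        self.clf = nn.Linear(hidden, n_classes)                # final prediction head

    def forward(self, seq_a, seq_b):
        h_ab, _ = self.enc_ab(seq_a)     # A -> hidden
        b_hat, _ = self.dec_ab(h_ab)     # hidden -> B (forward translation)
        h_ba, last = self.enc_ba(b_hat)  # B -> hidden
        a_hat, _ = self.dec_ba(h_ba)     # hidden -> A (back translation)
        pred = self.clf(last[-1])        # predict from the learned hidden encoding
        # the cycle-consistency losses are the self-supervised signal
        loss = nn.functional.mse_loss(b_hat, seq_b) + nn.functional.mse_loss(a_hat, seq_a)
        return pred, loss
```

A useful property of this formulation is that at test time only the source modality is needed: the translation losses are dropped and the prediction comes from the learned encoding alone.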
For brevity, we give a high-level description of each method, omit the finer details, and refer the reader to the original work for further reference.
2 Methodology
2.1 Cross-modal generation
The main idea proposed by Gu et al. [2018] is, in addition to conventional cross-modal feature embedding at the global semantic level, to introduce an additional cross-modal feature embedding at the local level, which is grounded by two generative models: image-to-text and text-to-image. Figure 2 illustrates the concept of the proposed cross-modal feature embedding with generative models at a high level, which includes three learning steps: look, imagine, and match. Given a query in image or text, we first look at the query to extract an abstract representation. Then, we imagine what the target item (text or image) in the other modality should look like, and obtain a more concrete grounded representation. This is accomplished by using the representation of one modality (to be estimated) to generate the item in the other modality, and comparing the generated items with gold standards. After that, the match step finds the right image-text pairs using a relevance score calculated from a combination of grounded and abstract representations.
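The three steps can be condensed into a short scoring sketch; the helper names (embed_image, embed_text) and the weighted cosine combination below are hypothetical placeholders for the components in Gu et al. [2018], not their actual interface or loss.

```python
import torch.nn.functional as F

def relevance(image, text, m, alpha=0.5):
    # look: abstract (global) and grounded (local) embeddings of the query pair
    v_h, v_l = m.embed_image(image)  # high-level and grounded visual features
    t_h, t_l = m.embed_text(text)    # high-level and grounded textual features
    # imagine (training time): v_l must generate the caption and t_l the image,
    # and the generated samples are compared with the gold pair; this is what
    # grounds the local features
    # match: combine abstract and grounded similarities into one relevance score
    return alpha * F.cosine_similarity(v_h, t_h, dim=-1) \
        + (1 - alpha) * F.cosine_similarity(v_l, t_l, dim=-1)
```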
Architecture
Figure 3 shows the overall architecture for the proposed generative cross-modal feature learning framework. The entire system consists of three training paths: multi-modal feature embedding (the entire upper part), image-to-text generative feature learning (the blue path), and text-to-image generative adversarial feature learning (the green path). The first path is similar to the existing cross-modal feature embedding that maps different modality features into a common space. However, the difference here is that they use two branches of feature embedding, i.e., making the embedded visual feature $v_h$ (resp. $v_l$) and the textual feature $t_h$ (resp. $t_l$) closer. They consider ($v_h$, $t_h$) as high-level abstract features and ($v_l$, $t_l$) as detailed grounded features. The grounded features