MMGA: Multimodal Learning with Graph Alignment
Xuan Yang
Zhejiang University
xuany@zju.edu.cn
Quanjin Tao
Zhejiang University
taoquanjin@zju.edu.cn
Xiao Feng
Zhejiang University
3200104919@zju.edu.cn
Donghong Cai
Zhejiang University
donghongcai@zju.edu.cn
Xiang Ren
University of Southern California
xiangren@usc.edu
Yang Yang∗
Zhejiang University
yangya@zju.edu.cn
ABSTRACT
Multimodal pre-training breaks down the barriers between modalities and allows the individual modalities to be mutually augmented with information, resulting in significant advances in representation learning. However, the graph modality, a very general and important form of data, cannot easily interact with other modalities because of its non-regular nature. In this paper, we propose MMGA (Multimodal learning with Graph Alignment), a novel multimodal pre-training framework that incorporates information from the graph (social network), image and text modalities on social media to enhance user representation learning. In MMGA, a multi-step graph alignment mechanism is proposed to add self-supervision from the graph modality to optimize the image and text encoders, while using information from the image and text modalities to guide the graph encoder's learning. We conduct experiments on a dataset crawled from Instagram. The experimental results show that MMGA works well on this dataset and improves performance on the fans prediction task. We release our dataset, the first social media multimodal dataset with a graph, consisting of 60,000 users labeled with specific topics based on 2 million posts, to facilitate future research.
CCS CONCEPTS
• Computing methodologies → Artificial intelligence.
KEYWORDS
social media, multimodal learning, graph pre-training, user representation
ACM Reference Format:
Xuan Yang, Quanjin Tao, Xiao Feng, Donghong Cai, Xiang Ren, and Yang Yang. 2022. MMGA: Multimodal Learning with Graph Alignment. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
∗Corresponding author.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Conference’17, July 2017, Washington, DC, USA
©2022 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Multimodal learning has gained increasing attention in recent years as heterogeneous data become ubiquitous in the real world. Many recent multimodal works, such as CLIP [9], BLIP [6] and FLAVA [10], study how to exploit the relationships between multiple modalities to enhance cross-modal representation learning. However, these works mostly focus on the vision and language modalities; few pay attention to the graph modality. The graph modality commonly exists in the real world and is important in many applications. For example, on social media, besides the texts and images posted by users, there is abundant information in graph data (such as interaction graphs and following graphs) that can help us better capture users' interests and thereby improve user recommendation or advertising strategies. However, it is difficult to deeply integrate the graph modality with the image and text modalities: compared with image and text data, graphs are naturally non-Euclidean structured data, which makes existing modality fusion methods unworkable. Thus, in place of the natural alignment learning methods commonly used for image-text modalities, how to incorporate graph data into multimodal learning becomes a key challenge. Although some works do incorporate graph data into multimodal learning, they mainly treat the graph as an individual modality, using the graph structure only for information aggregation after the multimodal (text-image) fusion process [8, 11] and ignoring the cross-modal information between the graph and the other modalities.
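For contrast, the post-fusion aggregation pattern described above can be summarized in a few lines. The sketch below is our paraphrase of that pattern, not code from [8] or [11], and the tensor names are hypothetical.

```python
# Sketch of the "graph as post-fusion aggregator" pattern: the graph enters
# only after image-text fusion, so its structure never shapes the image or
# text encoders themselves. (Our illustration, not any cited system's code.)
import torch
import torch.nn.functional as F

def fuse_then_aggregate(img_emb, txt_emb, adj, w_fuse):
    # Step 1: image-text fusion, with no graph involvement.
    fused = F.relu(torch.cat([img_emb, txt_emb], dim=-1) @ w_fuse)
    # Step 2: the graph appears only here, as a fixed mean-aggregation
    # operator over each user's neighborhood.
    deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
    return (adj / deg) @ fused
```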
To address the above problem, we propose the Multimodal learning with Graph Alignment (MMGA) framework to better leverage the information in multiple modalities and enlarge their mutual information. Specifically, we conduct our study on a social media dataset. On social media there are user social networks, which contain rich social information, and user posts, which simultaneously contain images and text. We propose to use the graph structure of the user social network to supervise the learned spaces of the image and text modalities. Meanwhile, we utilize users' image and text representations to infer relationships between users, such as their closeness, and then use these to support the graph learning process. In other words, we align the explicit graph structure with the implicit image semantic structure and text semantic structure, respectively. In this way, the graph modality's space is aligned with the semantic spaces of the image and text modalities, and the distance between two nodes in the image and text spaces is in turn used to adjust the graph's edge weights. Thus the graph modality's information adds additional supervision on text and image representation learning, while the text and image modalities' information guides the graph encoder's learning.
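To make this alignment mechanism concrete, the sketch below gives one plausible PyTorch reading of it. It is our illustration under stated assumptions, not the authors' implementation: the linear projection heads, the cosine-similarity link predictor, the binary cross-entropy alignment loss, the stop-gradient on the edge re-weighting, and the one-layer GCN stand-in are all hypothetical choices.

```python
# Illustrative sketch of MMGA-style graph alignment (our assumptions,
# not the paper's released code). Requires: torch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAlignmentSketch(nn.Module):
    """Aligns the image/text embedding spaces with an explicit social graph."""

    def __init__(self, img_dim, txt_dim, dim=128):
        super().__init__()
        # Hypothetical projection heads standing in for the image/text encoders.
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        # One-layer GCN-style transform standing in for the graph encoder.
        self.gcn = nn.Linear(dim, dim)

    def alignment_loss(self, z, adj):
        # Graph -> image/text: pairwise cosine similarity in the semantic
        # space is trained to predict whether an edge exists in the graph.
        z = F.normalize(z, dim=-1)
        sim = z @ z.t()  # implicit semantic structure, [num_users, num_users]
        return F.binary_cross_entropy_with_logits(sim, adj)

    def forward(self, img_feats, txt_feats, adj):
        z_img = self.img_proj(img_feats)  # [num_users, dim]
        z_txt = self.txt_proj(txt_feats)
        # (1) The explicit graph structure supervises both semantic spaces.
        loss = self.alignment_loss(z_img, adj) + self.alignment_loss(z_txt, adj)
        # (2) Image/text -> graph: re-weight edges by the inferred closeness
        # of users in the semantic spaces (stop-gradient keeps the weights
        # from being trivially optimized away).
        with torch.no_grad():
            closeness = torch.sigmoid(
                F.normalize(z_img, dim=-1) @ F.normalize(z_txt, dim=-1).t())
        adj_w = adj * closeness  # down-weight semantically distant neighbors
        deg = adj_w.sum(-1, keepdim=True).clamp(min=1e-6)
        z_graph = F.relu(self.gcn((adj_w / deg) @ ((z_img + z_txt) / 2)))
        return z_graph, loss
```

A training loop would add a downstream objective (e.g., fans prediction from z_graph) to the returned alignment loss; MMGA's actual encoders, loss terms and multi-step schedule may differ.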