Semantics-Consistent Cross-domain Summarization via
Optimal Transport Alignment
Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Franck Dernoncourt,
Zhaowen Wang, Trung Bui, Bo Li, Ding Zhao, Hailin Jin
Carnegie Mellon University, Adobe Research, University of Illinois Urbana-Champaign
{jielinq,jzhu4,mengdixu}@andrew.cmu.edu, {dernonco,zhawang,bui,hljin}@adobe.com, lbo@illinois.edu
Abstract
Multimedia summarization with multimodal output (MSMO) is a recently explored application in language grounding. It plays an essential role in real-world applications, e.g., automatically generating cover images and titles for news articles or providing introductions to online videos. However, existing methods extract features from the whole video and article and use fusion methods to select the representative ones, thus usually ignoring the critical structure and varying semantics. In this work, we propose a Semantics-Consistent Cross-domain Summarization (SCCS) model based on optimal transport alignment with visual and textual segmentation. Specifically, our method first decomposes both the video and the article into segments in order to capture their structural semantics. SCCS then follows a cross-domain alignment objective with optimal transport distance, which leverages multimodal interaction to match and select the visual and textual summary. We evaluate our method on three recent multimodal datasets and demonstrate its effectiveness in producing high-quality multimodal summaries.
1 Introduction
New multimedia content in the form of short videos and corresponding text articles has become a significant trend in influential digital media, including CNN, BBC, Daily Mail, and social media. This popular media type has proven successful in drawing users' attention and delivering essential information efficiently.
Multimedia summarization with multimodal output (MSMO) has recently drawn increasing attention. Different from traditional video or textual summarization (Gygli et al., 2014; Jadon and Jasim, 2020), where the generated summary is either a keyframe or a textual description, MSMO aims at producing both visual and textual summaries simultaneously, making this task more complicated.
Figure 1: We propose a segment-level cross-domain alignment model to preserve the structural semantic consistency within two domains for MSMO. We solve an optimal transport problem to optimize the cross-domain distance, which in turn finds the optimal match.
Previous works addressed the MSMO task by processing the whole video and article together and using fusion or attention methods to generate scores for summary selection, which overlooks the structure and semantics of the different domains (Duan et al., 2022; Haopeng et al., 2022; Sah et al., 2017; Zhu et al., 2018; Mingzhe et al., 2020; Fu et al., 2021, 2020). However, we believe the structure of semantics is a crucial characteristic that cannot be ignored in multimodal summarization tasks. Based on this hypothesis, we propose the Semantics-Consistent Cross-domain Summarization (SCCS) model, which explores segment-level cross-domain representations through Optimal Transport (OT) based multimodal alignment to generate both visual and textual summaries.
The comparison of our approach and previous works is illustrated in Figure 1. We regard the video and article as being composed of several topics related to the main idea, where each topic corresponds to one specific sub-idea. Thus, treating the whole video or article uniformly and learning a general representation ignores these structural semantics and leads to biased summarization.

arXiv:2210.04722v1 [cs.CV] 10 Oct 2022

Figure 2: An illustration of the summarization process given by our SCCS method. Here we conduct OT-based cross-domain alignment on each keyframe-sentence pair; a smaller OT distance means better alignment. (For example, the best-aligned text and image summary (0.08) delivers the flooding content clearly and comprehensively.)

To address this problem, instead of learning averaged representations for the whole video and article, we focus on exploiting the original underlying structure. Our model first decomposes the video and article into segments to discover the content structure, then explores the cross-domain semantic relationship at the segment level. We believe this is a promising approach to exploit the consistency that lies in the structural semantics across different domains. Since MSMO generates both visual and textual summaries, we believe the optimal summary comes from the video and text pair that is both 1) semantically consistent and 2) best matched globally in a cross-domain fashion. In addition, our framework is more computationally efficient, as it conducts cross-domain alignment at the segment level rather than taking whole videos/articles as inputs.
Our contributions can be summarized as follows:

• We propose SCCS (Semantics-Consistent Cross-domain Summarization), a segment-level alignment model for MSMO tasks.

• Our method preserves the structural semantics and explores the cross-domain relationship through optimal transport to match and select the visual and textual summary.

• Our method serves as a hierarchical MSMO framework that provides better interpretability via Optimal Transport alignment.

• We provide both qualitative and quantitative results on three public datasets. Our method outperforms baselines and provides good interpretations of the learned representations.
2 Related Work

Optimal Transport

Optimal Transport (OT) studies the geometry of probability spaces (Villani, 2003), a formalism for finding and quantifying mass movement from one probability distribution to another. OT defines the Wasserstein metric between probability distributions, revealing a canonical geometric structure with rich properties to be exploited. The earliest contribution to OT originated from Monge in the eighteenth century; Kantorovich rediscovered it under a different formalism, namely the Linear Programming formulation of OT. With the development of scalable solvers, OT has been widely applied to many real-world problems (Flamary et al., 2021; Chen et al., 2020a; Yuan et al., 2020; Klicpera et al., 2021; Alqahtani et al., 2021; Lee et al., 2019; Chen et al., 2019; Qiu et al., 2022; Duan et al., 2022; Han et al., 2022; Zhu et al., 2022).
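For intuition (our own illustration, not the paper's solver), the entropic-regularized form of OT can be computed with Sinkhorn iterations in a few lines of numpy; the function name `sinkhorn_ot` and the choices of `eps` and `n_iters` below are ours:

```python
import numpy as np

def sinkhorn_ot(a, b, C, eps=0.05, n_iters=200):
    """Entropic-regularized OT via Sinkhorn iterations.

    a, b: 1-D histograms that each sum to 1; C: pairwise cost matrix.
    Returns the transport plan P and the transport cost <P, C>."""
    K = np.exp(-C / eps)               # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)              # scale columns to match marginal b
        u = a / (K @ v)                # scale rows to match marginal a
    P = u[:, None] * K * v[None, :]    # transport plan with marginals (a, b)
    return P, float((P * C).sum())
```

A smaller returned cost indicates that the two distributions (e.g., segment-level visual and textual embeddings) are easier to transport onto each other, which is the sense in which OT distance measures cross-domain alignment.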
Multimodal Alignment

Aligning representations from different modalities is an important technique in multimodal learning. With recent advances, exploring the explicit relationship across vision and language has drawn significant attention (Wang et al., 2020). Torabi et al. (2016); Yu et al. (2017) adopted attention mechanisms, Dong et al. (2021) composed pairwise joint representations, Chen et al. (2020b); Wray et al. (2019); Zhang et al. (2018) learned fine-grained or hierarchical alignment, Lee et al. (2018); Wu et al. (2019) decomposed the inputs into sub-tokens, Velickovic et al. (2018); Yao et al. (2018) adopted graph attention for reasoning, and Yang et al. (2021) applied contrastive learning algorithms.
Multimodal Summarization

Multimodal summarization explores multiple modalities, e.g., audio signals, video captions, Automatic Speech Recognition (ASR) transcripts, video titles, etc., for summary generation. Otani et al. (2016); Yuan et al. (2019); Wei et al. (2018); Fu et al. (2020) learned the relevance or mapping in the latent space between different modalities. In addition to only generating visual summaries, Li et al. (2017); Atri et al. (2021); Zhu et al. (2018) generated textual summaries by taking audio, transcripts, or documents as input along with videos or images, using a seq2seq model (Sutskever et al., 2014) or an attention mechanism (Bahdanau et al., 2015). The recent trend toward the MSMO task has also drawn much attention (Zhu et al., 2018; Mingzhe et al., 2020; Fu et al., 2021, 2020; Zhang et al., 2022).

Figure 3: (a) The computational framework of our SCCS model, which takes multimedia input (video + text) and generates multimodal summaries. The framework includes five modules: video temporal segmentation, visual summarization, textual segmentation, textual summarization, and multimodal alignment. (b) The structure of the video segmentation encoder. (c) The architecture of the textual segmentation module. (d) The multimodal alignment module for multimodal summaries.
3 Methods

Our SCCS is a segment-level cross-domain semantics alignment model for the MSMO task, where MSMO aims at generating both visual and language summaries. We follow the problem setting in Mingzhe et al. (2020): for a multimedia source with documents/articles and videos, the document X_D = {x_1, x_2, ..., x_d} has d words, and the ground-truth textual summary Y_D = {y_1, y_2, ..., y_g} has g words. The corresponding video X_V is aligned with the document, and there exists a ground-truth cover picture Y_V that represents the most important information describing the video. Our SCCS model generates both the textual summary Y'_D and the video keyframe Y'_V.
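To make this notation concrete, an MSMO instance and its expected output can be sketched as plain containers (the class and field names below are ours, purely illustrative, not from the paper):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MSMOSample:
    document: List[str]        # X_D = {x_1, ..., x_d}: the article, d words
    video_frames: List[bytes]  # X_V: the aligned video (frames as raw bytes here)
    ref_summary: List[str]     # Y_D = {y_1, ..., y_g}: ground-truth textual summary
    ref_cover: bytes           # Y_V: ground-truth cover picture

@dataclass
class MSMOOutput:
    summary: List[str]         # Y'_D: generated textual summary
    keyframe: bytes            # Y'_V: selected video keyframe
```

A model for this task maps an `MSMOSample`'s `document` and `video_frames` to an `MSMOOutput`, which is then scored against `ref_summary` and `ref_cover`.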
Our SCCS model consists of five modules, as shown in Figure 3(a): video temporal segmentation (Section 3.1), visual summarization (Section 3.3), textual segmentation (Section 3.2), textual summarization (Section 3.4), and cross-domain alignment (Section 3.5). Each module will be introduced in the following subsections.
3.1 Video Temporal Segmentation

Video temporal segmentation (VTS) aims at splitting the original video into small segments, upon which the summarization tasks build. VTS is formulated as a binary classification problem on the segment boundaries, similar to Rao et al. (2020). For a video X_V, the video segmentation encoder separates the video sequence into segments [X_v1, X_v2, ..., X_vm], where m is the number of segments.
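As a toy stand-in for this formulation (not the paper's learned classifier), once a binary classifier has produced a probability for each candidate boundary between adjacent shots, the segments follow by cutting wherever the probability exceeds a threshold; `segment_by_boundaries` and its default threshold are our own:

```python
def segment_by_boundaries(boundary_probs, threshold=0.5):
    """Turn predicted boundary probabilities into segments of shot indices.

    boundary_probs[i] is the probability that a segment boundary lies
    between shot i and shot i + 1 (length = number of shots - 1)."""
    segments, current = [], [0]
    for i, p in enumerate(boundary_probs):
        if p > threshold:          # classifier predicts a cut after shot i
            segments.append(current)
            current = []
        current.append(i + 1)
    segments.append(current)       # close the last segment
    return segments
```

For example, boundary probabilities [0.1, 0.9, 0.2] over four shots yield the segments [[0, 1], [2, 3]].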
As shown in Figure 3(b), the video segmentation encoder contains a VTS module and a Bi-LSTM. Video X_V is first split into shots [S_v1, S_v2, ..., S_vn] (Castellano, 2021); then the VTS module takes a clip of the video with 2ω_b shots as input and outputs a boundary representation b_i. The boundary representation captures both the differences and the relations between the shots before and after the boundary. VTS consists of two branches, VTS_d and VTS_r, as shown in