
Semantics-Consistent Cross-domain Summarization via
Optimal Transport Alignment
Jielin Qiu♦, Jiacheng Zhu♦, Mengdi Xu♦, Franck Dernoncourt♠,
Zhaowen Wang♠, Trung Bui♠, Bo Li♥, Ding Zhao♦, Hailin Jin♠
♦Carnegie Mellon University, ♠Adobe Research, ♥University of Illinois Urbana-Champaign
{jielinq,jzhu4,mengdixu}@andrew.cmu.edu, {dernonco,zhawang,bui,hljin}@adobe.com, lbo@illinois.edu
Abstract
Multimedia summarization with multimodal output (MSMO) is a recently explored application in language grounding. It plays an essential role in real-world applications, e.g., automatically generating cover images and titles for news articles or providing introductions to online videos. However, existing methods extract features from the whole video and article and use fusion methods to select representative content, thus usually ignoring the critical structure and varying semantics. In this work, we propose a Semantics-Consistent Cross-domain Summarization (SCCS) model based on optimal transport alignment with visual and textual segmentation. Specifically, our method first decomposes both the video and the article into segments to capture their structural semantics. SCCS then follows a cross-domain alignment objective with an optimal transport distance, which leverages multimodal interaction to match and select the visual and textual summary. We evaluated our method on three recent multimodal datasets and demonstrated its effectiveness in producing high-quality multimodal summaries.
1 Introduction
New multimedia content in the form of short videos
and corresponding text articles has become a sig-
nificant trend in influential digital media, including
CNN, BBC, Daily Mail, social media, etc. This popular media type has proven successful in drawing users' attention and delivering essential information efficiently.
Multimedia summarization with multimodal output (MSMO) has recently drawn increasing attention. Unlike traditional video or textual summarization (Gygli et al., 2014; Jadon and Jasim, 2020), where the generated summary is either a keyframe or a textual description, MSMO aims at producing both visual and textual summaries simultaneously, making the task more complicated.
Figure 1: We propose a segment-level cross-domain alignment model that preserves structural semantic consistency across the two domains for MSMO. We solve an optimal transport problem to optimize the cross-domain distance, which in turn finds the optimal match.
Previous works addressed the MSMO task by processing the whole video and article together and using fusion or attention methods to generate scores for summary selection, which overlooks the structure and semantics of the different domains (Duan et al., 2022; Haopeng et al., 2022; Sah et al., 2017; Zhu et al., 2018; Mingzhe et al., 2020; Fu et al., 2021, 2020). However, we believe the structure of semantics is a crucial characteristic that cannot be ignored in multimodal summarization tasks. Based on this hypothesis, we propose the Semantics-Consistent Cross-domain Summarization (SCCS) model, which explores segment-level cross-domain representations through Optimal Transport (OT) based multimodal alignment to generate both visual and textual summaries.
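As a rough illustration of the alignment step only (not the paper's actual implementation), the sketch below computes an entropy-regularized optimal transport plan via Sinkhorn iterations between hypothetical video-segment and text-segment embeddings; the segment counts, embedding dimension, cosine cost, and function names are all illustrative assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT (Sinkhorn iterations) between two
    uniform marginals, given a pairwise cost matrix."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over video segments
    b = np.full(m, 1.0 / m)  # uniform mass over text segments
    K = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan

# Hypothetical segment embeddings: 3 video segments, 4 text segments.
rng = np.random.default_rng(0)
video = rng.normal(size=(3, 8))
text = rng.normal(size=(4, 8))

# Cosine-distance cost between every cross-domain segment pair.
vn = video / np.linalg.norm(video, axis=1, keepdims=True)
tn = text / np.linalg.norm(text, axis=1, keepdims=True)
cost = 1.0 - vn @ tn.T

plan = sinkhorn_plan(cost)
ot_distance = np.sum(plan * cost)           # cross-domain alignment score
matched_text = plan.argmax(axis=1)          # best text segment per video segment
```

A lower `ot_distance` indicates the two segment sequences are semantically closer, and the rows of `plan` give soft matchings that could drive summary selection.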
The comparison of our approach with previous works is illustrated in Figure 1. We regard the video and the article as being composed of several topics related to the main idea, where each topic corresponds to one sub-idea. Thus, treating the whole video or article uniformly and learning a general representation ignores these structural semantics and leads to biased summarization. To address this problem, instead of learning averaged
arXiv:2210.04722v1 [cs.CV] 10 Oct 2022