
Semantics-Consistent Cross-domain Summarization via
Optimal Transport Alignment
Jielin Qiu♦, Jiacheng Zhu♦, Mengdi Xu♦, Franck Dernoncourt♠,
Zhaowen Wang♠, Trung Bui♠, Bo Li♥, Ding Zhao♦, Hailin Jin♠
♦Carnegie Mellon University, ♠Adobe Research, ♥University of Illinois Urbana-Champaign
{jielinq,jzhu4,mengdixu}@andrew.cmu.edu, {dernonco,zhawang,bui,hljin}@adobe.com, lbo@illinois.edu
Abstract
Multimedia summarization with multimodal output (MSMO) is a recently explored application in language grounding. It plays an essential role in real-world applications, e.g., automatically generating cover images and titles for news articles or providing introductions to online videos. However, existing methods extract features from the whole video and article and use fusion methods to select representative content, thus usually ignoring the critical structure and varying semantics. In this work, we propose a Semantics-Consistent Cross-domain Summarization (SCCS) model based on optimal transport alignment with visual and textual segmentation. Specifically, our method first decomposes both the video and the article into segments to capture their structural semantics. SCCS then follows a cross-domain alignment objective with an optimal transport distance, which leverages multimodal interaction to match and select the visual and textual summary. We evaluated our method on three recent multimodal datasets and demonstrated its effectiveness in producing high-quality multimodal summaries.
1 Introduction
New multimedia content in the form of short videos
and corresponding text articles has become a sig-
nificant trend in influential digital media, including
CNN, BBC, Daily Mail, social media, etc. This popular media type has proven successful in drawing users' attention and delivering essential information efficiently.
Multimedia summarization with multimodal output (MSMO) has recently drawn increasing attention. Unlike traditional video or textual summarization (Gygli et al., 2014; Jadon and Jasim, 2020), where the generated summary is either a keyframe or a textual description, MSMO aims at producing both visual and textual summaries simultaneously, making the task more complicated.
Figure 1: We propose a segment-level cross-domain alignment model that preserves structural semantic consistency across the two domains for MSMO. We solve an optimal transport problem to optimize the cross-domain distance, which in turn finds the optimal match.
Previous works addressed the MSMO task by processing the whole video and article together and using fusion or attention methods to generate scores for summary selection, which overlooks the structure and semantics of the different domains (Duan et al., 2022; Haopeng et al., 2022; Sah et al., 2017; Zhu et al., 2018; Mingzhe et al., 2020; Fu et al., 2021, 2020). However, we believe the structure of semantics is a crucial characteristic that cannot be ignored in multimodal summarization tasks. Based on this hypothesis, we propose the Semantics-Consistent Cross-domain Summarization (SCCS) model, which explores segment-level cross-domain representations through Optimal Transport (OT) based multimodal alignment to generate both visual and textual summaries.
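As a rough illustration of the alignment step only (not the paper's actual implementation), the sketch below computes an entropy-regularized optimal transport plan via Sinkhorn iterations between hypothetical video-segment and text-segment embeddings; the segment counts, embedding dimension, cosine cost, and function names are all illustrative assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT (Sinkhorn iterations) between two
    uniform marginals, given a pairwise cost matrix."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass over video segments
    b = np.full(m, 1.0 / m)  # uniform mass over text segments
    K = np.exp(-cost / reg)  # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan

# Hypothetical segment embeddings: 3 video segments, 4 text segments.
rng = np.random.default_rng(0)
video = rng.normal(size=(3, 8))
text = rng.normal(size=(4, 8))

# Cosine-distance cost between every cross-domain segment pair.
vn = video / np.linalg.norm(video, axis=1, keepdims=True)
tn = text / np.linalg.norm(text, axis=1, keepdims=True)
cost = 1.0 - vn @ tn.T

plan = sinkhorn_plan(cost)
ot_distance = np.sum(plan * cost)           # cross-domain alignment score
matched_text = plan.argmax(axis=1)          # best text segment per video segment
```

A lower `ot_distance` indicates the two segment sequences are semantically closer, and the rows of `plan` give soft matchings that could drive summary selection.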
The comparison of our approach with previous works is illustrated in Figure 1. We regard the video and the article as being composed of several topics related to the main idea, where each topic corresponds to one sub-idea. Thus, treating the whole video or article uniformly and learning a general representation ignores these structural semantics and leads to biased summarization. To address this problem, instead of learning averaged
arXiv:2210.04722v1 [cs.CV] 10 Oct 2022