Hierarchical3D Adapters for Long Video-to-text Summarization
Pinelopi Papalampidi Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
p.papalampidi@sms.ed.ac.uk,mlap@inf.ed.ac.uk
Abstract
In this paper, we focus on video-to-text sum-
marization and investigate how to best utilize
multimodal information for summarizing long
inputs (e.g., an hour-long TV show) into long
outputs (e.g., a multi-sentence summary). We
extend SummScreen (Chen et al.,2021), a
dialogue summarization dataset consisting of
transcripts of TV episodes with reference sum-
maries, and create a multimodal variant by col-
lecting corresponding full-length videos. We
incorporate multimodal information into a pre-
trained textual summarizer efficiently using
adapter modules augmented with a hierarchi-
cal structure while tuning only 3.8% of model
parameters. Our experiments demonstrate that
multimodal information offers superior per-
formance over more memory-heavy and fully
fine-tuned textual summarization methods.
1 Introduction
What happens in the very last episode of “Friends”?
Anyone who has seen this episode can summa-
rize its key moments: Ross confesses his love for
Rachel, they decide to resume their relationship,
while Monica and Chandler adopt twins and move
to the suburbs. TV viewers can naturally perform
this dialogue summarization task having access to
multiple modalities: they not only hear the actors
speak but also see their expressions, actions, and
whereabouts on screen.
Despite recent advances in summarization (Nal-
lapati et al.,2016;See et al.,2017;Liu and Lapata,
2019b) and increasing interest in different types of
dialogue summarization, e.g., from meeting tran-
scripts (Gliwa et al.,2019;Zhong et al.,2021b) or
screenplays (Chen et al.,2021), the contribution
of modalities other than text remains relatively un-
derstudied. This is not entirely surprising given
the challenges associated with the multimodal sum-
marization task illustrated above (e.g., producing a
written summary of a TV episode). Firstly, the input
is long, cannot fit into standard sequence-to-sequence
architectures, and its different modalities have to be
combined somehow; secondly, the output is also long,
as summaries consist of multiple sentences and a rich
vocabulary; and thirdly, the task involves
complex inference over long-range dependencies
between events and characters and common sense
reasoning. At the same time, creating large-scale
multimodal datasets with long videos and aligned
textual data is challenging and time consuming,
limiting the research conducted in this domain.
Previous work on video-to-video summariza-
tion identifies highlights from YouTube videos, TV
shows, or movies (Song et al.,2015;Gygli et al.,
2014;De Avila et al.,2011;Papalampidi et al.,
2021b). However, in most cases, either the videos
are short or the datasets are small with a few hun-
dred examples. There is also limited work on video-
to-text summarization. We are only aware of one
large-scale multimodal dataset for this task, namely
How2 (Sanabria et al.,2018), which again contains
short videos (i.e., 2–3 minutes long) with simple
semantics, and short, single-sentence summaries.
In this paper, we focus on video-to-text summa-
rization and investigate how to best utilize mul-
timodal information for condensing long inputs
(e.g., an hour-long TV show) into long outputs
(e.g., a multi-sentence summary). We create a
multimodal variant of SummScreen (Chen et al.,
2021), a recently released dataset comprising
transcripts of TV episodes and their summaries.
We collect full-length videos for 4,575 episodes
and multiple reference summaries. We build our
model on top of a pre-trained sequence-to-sequence
architecture (i.e., BART; Lewis et al. 2020) fine-
tuned on summarization and capable of generating
fluent long text. We convert its textual encoder
to a multimodal one by adding and tuning only
adapter layers (Rebuffi et al.,2017;Houlsby et al.,
2019), which account for 3.8% of model parameters.
Task           | Modality            | Input | Output | Datasets
text-to-text   | text                | short | short  | XSum (Narayan et al., 2018), CNN-DailyMail (Nallapati et al., 2016), NYT (Durrett et al., 2016), Gigaword (Napoles et al., 2012)
text-to-text   | text                | long  | long   | SamSum (Gliwa et al., 2019), QMSum (Zhong et al., 2021b), SummScreen (Chen et al., 2021)
video-to-video | vision              | short | short  | OVP (De Avila et al., 2011), YouTube (De Avila et al., 2011), SumMe (Gygli et al., 2014)
video-to-video | vision/text         | short | short  | TVSum (Song et al., 2015)
video-to-video | vision/text(/audio) | long  | long   | LoL (Fu et al., 2017), TRIPOD+ (Papalampidi et al., 2021b)
video-to-text  | vision              | long  | short  | TACoS (Rohrbach et al., 2014) [1]
video-to-text  | vision/text/audio   | short | short  | How2 (Sanabria et al., 2018)
video-to-text  | vision/text/audio   | long  | long   | SummScreen3D
Table 1: Summarization datasets grouped based on input/output modality and input/output length.
[1] TACoS contains only 127 cooking videos without corresponding transcripts and hence cannot be used for multimodal summarization.
We also explore strategies for content selec-
tion, since the input is too long to fit into standard
sequence-to-sequence models. Empirical results
across evaluation metrics demonstrate that mul-
timodal information yields superior performance
over just text, both in terms of content selection
and summarization; this is the case even when our
adapter model is compared to fully fine-tuned ap-
proaches and more memory-heavy architectures
(e.g., Longformer; Beltagy et al. 2020) that can
process the entire input.
Our contributions can be summarized as follows:
(1) we augment SummScreen (Chen et al.,2021)
with multimodal information, providing videos
aligned with transcripts and summaries; to the best
of our knowledge, this constitutes the largest avail-
able resource for long video multimodal summa-
rization; (2) we propose a parameter efficient ap-
proach to augment a pre-trained textual summarizer
with multimodal information; and (3) we explore dif-
ferent methods for identifying salient moments in a
long video and show that multimodal information
also improves content selection.
2 Related Work
Video Summarization
Much previous work has
focused on text-to-text or video-to-video summa-
rization. We provide a comprehensive categoriza-
tion of existing datasets according to input/output
length and modality in Table 1. Multimodal
abstractive summarization (video-to-text) has at-
tracted less attention, mainly due to the difficulty
of collecting large-scale datasets. How2 (Sanabria
et al.,2018) is the only publicly available bench-
mark for this task; it includes short instructional
videos with textual transcripts and one-sentence
summaries. We generate multiple-sentence sum-
1
TACoS contains only 127 cooking videos without corre-
sponding transcripts and hence cannot be used for multimodal
summarization.
maries from long videos and their transcripts. Previ-
ous approaches to multimodal summarization have
focused on various modality fusion methods with
small RNN-based models (Palaskar et al.,2019).
We take advantage of large pre-trained LMs (Lewis
et al.,2020;Raffel et al.,2020;Radford et al.,2019)
for generating fluent textual summaries.
Recent years have also witnessed increasing in-
terest in multimodal video captioning, a task related
to multimodal summarization, which aims to gen-
erate one-sentence descriptions for localized events
in short videos (Xu et al.,2016;Rohrbach et al.,
2017;Zhou et al.,2018;Lei et al.,2020b). Exist-
ing methods employ strong language-and-vision
encoders with massive pre-training (Li et al.,2020;
Luo et al.,2020;Xu et al.,2021;Lei et al.,2020a;
Li et al.,2021), while the decoder is typically shal-
low and under-trained. Although good at generat-
ing short descriptions, they cannot maintain fluency
in long outputs with rich vocabularies.
Realizing the importance of large LMs for gener-
ation, recent work has focused on how to efficiently
render pre-trained LMs multimodal. Notably, Tsim-
poukelli et al. (2021) convert a pre-trained LM into
an image captioning model, by giving images as
prompts and training only a vision encoder. Yu
et al. (2021) summarize How2 videos by augment-
ing BART-base with visual information via a new
cross-attention block added to every encoder layer.
However, their approach cannot easily scale to
BART-large and beyond since they add a large num-
ber of new parameters, while the dataset sizes are
relatively small, leading to over-fitting.
Dialogue Summarization
In the context of text-
to-text generation, dialogue summarization is chal-
lenging due to the difficulty of fitting very long
input into pre-trained sequence-to-sequence mod-
els. Longformer (Beltagy et al.,2020) alleviates
this by employing local self-attention in combina-
tion with global tokens for reducing the computational overhead.
Episodes                               4,575
Input: transcript + video + audio
  Shots                                1,048,024
  Shots/episode                        193.64 (109.09)
  Utterances/episode                   322.76 (116.52)
  Tokens/episode                       5720.55 (2223.38)
Output: summaries
  Summaries/episode                    1.53 (0.79)
  TVMegaSite: 4,280 summaries          395.69 (275.84) tokens
  YouTube: 334 summaries               136.22 (45.12) tokens
  IMDb: 946 summaries                  111.21 (82.18) tokens
  tvdb: 1,454 summaries                126.14 (82.14) tokens
Training (unique input-output pairs)   5,199
Validation episodes                    296
Testing episodes                       296
Table 2: SummScreen3D statistics. For summaries, we show their provenance, the number of summaries per site, and the mean number of tokens per summary; standard deviations are shown in parentheses.
Despite recent attempts to make
self-attention more efficient (Kitaev et al.,2019;
Tay et al.,2020;Zaheer et al.,2020), it is still
unclear whether it has an advantage over content
selection with a full-attention mechanism (Zhang
et al.,2021b;Shaham et al.,2022) for long dia-
logue summarization. Zhong et al. (2021a) incor-
porate dialogue-specific objectives for pre-training
summarization models, while Zhang et al. (2021a)
follow a different approach and hierarchically sum-
marize the input chunk-by-chunk.
Parameter-efficient Tuning
Fine-tuning is a
common approach for transferring pre-trained mod-
els to different tasks or domains (Howard and
Ruder,2018). It is customary to fine-tune all the
parameters of the pretrained model which, however,
becomes prohibitive as model size and number of
tasks grow. Recent work has proposed parameter-
efficient transfer learning methods which fine-tune
only a small number of additional parameters. Two
popular approaches include adapter tuning, where
bottleneck layers are added and tuned at every layer
of the model (Rebuffi et al.,2017;Houlsby et al.,
2019) and prompt tuning, where (soft) prompts are
prepended as part of the input (Brown et al.,2020;
Li and Liang,2021). In this work, following recent
adapter-based approaches that efficiently convert
LMs to vision-and-language models (Sung et al.,
2022), we utilize the former method for adapting a
textual summarizer to our multimodal setting and
dialogue input format.
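To make the adapter-tuning idea concrete, here is a minimal PyTorch sketch of a vanilla bottleneck adapter in the spirit of Houlsby et al. (2019); the hidden sizes, activation, and placement are illustrative assumptions rather than the exact configuration used later in this paper.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Vanilla adapter block: LayerNorm -> down-projection -> activation
    -> up-projection, wrapped in a residual connection. Only these few
    parameters are trained; the surrounding Transformer layer stays frozen."""

    def __init__(self, d_model: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model), the output of a frozen sub-layer
        return hidden + self.up(self.act(self.down(self.norm(hidden))))


if __name__ == "__main__":
    adapter = BottleneckAdapter()
    states = torch.randn(2, 128, 1024)
    print(adapter(states).shape)  # torch.Size([2, 128, 1024])
```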
3 The SummScreen3D Dataset
SummScreen (Chen et al.,2021) is a long dialogue
summarization dataset containing transcripts from
TV episodes and human-written abstractive sum-
maries (https://github.com/mingdachen/SummScreen). We extend this dataset to a multimodal
setting by also considering the corresponding full-
length videos. SummScreen is divided into two sub-
sets depending on the series genre: SummScreen-
FD and SummScreen-TMS. We use the latter sub-
set, which mostly covers soap operas from TVMegaSite
(http://tvmegasite.net), as it is easier to obtain full-length videos
and each series has hundreds of episodes.
For each episode in SummScreen-TMS, we au-
tomatically search for the title and release date in
YouTube. If there is a match with a sufficiently long
duration (indicating a full episode rather than a
segment), we download the video and closed cap-
tions (CC). Overall, we collected videos for 4,575
episodes from five different shows in SummScreen-
TMS (we will release scripts for data collection and
processing).
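The paper does not detail the collection tooling beyond the scripts to be released; purely as an illustration of the duration-based matching described above, the sketch below assumes the yt-dlp library and a hypothetical 30-minute threshold for separating full episodes from clips.

```python
from yt_dlp import YoutubeDL

# Hypothetical threshold: soap-opera episodes run well over 30 minutes,
# so anything much shorter is likely a clip rather than a full episode.
MIN_DURATION_SECS = 30 * 60


def find_full_episode(title: str, release_date: str) -> dict | None:
    """Search YouTube for an episode and keep the first hit whose duration
    suggests a full episode rather than a short segment."""
    query = f"ytsearch5:{title} {release_date}"
    with YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        results = ydl.extract_info(query, download=False)
    for entry in results.get("entries", []):
        if entry and (entry.get("duration") or 0) >= MIN_DURATION_SECS:
            return entry  # metadata of the candidate full-length video
    return None


def download_with_captions(url: str, out_dir: str) -> None:
    """Download the matched video together with its closed captions (CC)."""
    opts = {
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "writesubtitles": True,      # human-uploaded CC, when available
        "writeautomaticsub": True,   # fall back to auto-generated captions
        "subtitleslangs": ["en"],
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([url])
```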
In addition to TVMegaSite summaries (dis-
tributed with SummScreen), we further retrieved
summaries from YouTube descriptions, IMDb, and
tvdb, again using the episode title and release date
as search terms. The statistics of our dataset, which
we call SummScreen3D (3D for language, video,
and audio), are shown in Table 2; we provide further
details in Appendix A. As can be seen, each episode
has (on average) multiple references which vary in
length (TVMegaSite summaries are longest).
We split SummScreen3D into training, validation,
and test sets with the same distribution over differ-
ent shows per set. We reserved 296 episodes for val-
idation and the same number for testing, and used
the rest for training. Since we have multiple refer-
ence summaries for some episodes, we increased
the size of the training set by adding $m$ episode-
summary pairs, matching the same episode with
each of its $m$ references. This resulted in 5,199
unique samples for training.
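Concretely, pairing each episode with each of its $m$ references is a per-episode cross product; the field names below are hypothetical and only illustrate the expansion.

```python
from typing import Iterable


def make_training_pairs(episodes: Iterable[dict]) -> list[tuple[str, str]]:
    """Pair each episode's transcript with every one of its m reference
    summaries, yielding m input-output training samples per episode."""
    pairs = []
    for episode in episodes:
        for summary in episode["summaries"]:  # TVMegaSite, YouTube, IMDb, tvdb
            pairs.append((episode["transcript"], summary))
    return pairs


# Toy example: one episode with two references -> two training samples.
demo = [{"transcript": "utterance 1 ... utterance N",
         "summaries": ["TVMegaSite summary ...", "IMDb summary ..."]}]
print(len(make_training_pairs(demo)))  # 2
```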
4 Video-to-Text Summarization
Our approach leverages the generation capabil-
ities of large pre-trained sequence-to-sequence
models (Lewis et al.,2020;Raffel et al.,2020).
As our backbone model, we employ BART-
large (Lewis et al.,2020) which has been fine-tuned
on CNN-DailyMail (Nallapati et al., 2016; Zhang
et al.,2021b) and has thus acquired a summariza-
tion inductive bias. As TV show transcripts are very
long and cannot fit into BART, we select a subset of
utterances (i.e., speaker turns) as input via content selection (see details in Section 5).
[Figure 1 shows two panels: (a) Multimodal augmentation of textual BART; (b) Hierarchical3D adapter for the encoder layers.]
Figure 1: Multimodal augmentation of pre-trained BART. We augment the encoder and decoder layers with adapters which we fine-tune on the target dataset, while the remaining network is frozen. As input, we consider textual tokens and coarse-grained multimodal information which we prepend before each utterance. We also corrupt part of the textual input during training and add an auxiliary MLM loss to the encoder for predicting the corrupted tokens. On the right (panel b), we show the hierarchical adapter added to each encoder layer: after down-projecting all representations, we only consider the multimodal ones and further contextualize them via attention. Then, we combine the representations and up-project again to the original model dimension.
We transfer
this model to our task and domain (i.e., multimodal
dialogue summarization), by adding adapter lay-
ers (Rebuffi et al.,2017;Houlsby et al.,2019) in
both the encoder and decoder, and tuning them
on SummScreen3D while keeping the rest of the
network frozen. We briefly discuss below our back-
bone text-based model and then elaborate on how
we incorporate multimodal information.
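As a rough illustration of this setup, the sketch below loads the publicly available Hugging Face checkpoint of BART-large fine-tuned on CNN-DailyMail, freezes all of its parameters, and estimates the share of parameters that per-layer bottleneck adapters (such as the one sketched in Section 2) would add. The bottleneck size and per-adapter accounting are assumptions; the exact trainable fraction (3.8% in our configuration) depends on the adapter design.

```python
from transformers import BartForConditionalGeneration

# BART-large fine-tuned on CNN-DailyMail (publicly available checkpoint).
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Freeze the entire pre-trained network; only adapters (added separately)
# would be trained.
for param in model.parameters():
    param.requires_grad = False

# Back-of-the-envelope adapter budget, assuming one bottleneck adapter
# (down/up projections with biases plus a LayerNorm) per encoder/decoder layer.
d_model = model.config.d_model                       # 1024 for BART-large
bottleneck = 64                                      # illustrative choice
n_layers = model.config.encoder_layers + model.config.decoder_layers
per_adapter = (2 * d_model * bottleneck              # down + up weights
               + bottleneck + d_model                # down + up biases
               + 2 * d_model)                        # LayerNorm weight + bias
adapter_params = n_layers * per_adapter

total = sum(p.numel() for p in model.parameters()) + adapter_params
print(f"adapter share: {100 * adapter_params / total:.2f}% of all parameters")
```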
4.1 Backbone Textual Model
Our summarizer follows a standard sequence-to-
sequence Transformer architecture (Vaswani et al.,
2017). The encoder maps tokens $[t_1, t_2, \dots, t_N]$
to a sequence of contextualized representations
$[h_1, h_2, \dots, h_N]$, which are then fed to the decoder
for generating the summary. The encoder consists
of $L$ stacked layers, each of which has a self-
attention block for contextualizing the token rep-
resentations, followed by a feed-forward network.
The decoder has a similar architecture; it addition-
ally contains a cross-attention block for identifying
relations between the input and currently gener-
ated text and makes use of masked self-attention
to control access to context for each token. The
decoder is followed by a linear layer (i.e., Lan-
guage Model (LM) head) which projects the out-
put representations onto the vocabulary and a final
softmax layer. The model is optimized for predicting
the next token $s_{t+1}$ in the summary given
$[s_0, s_1, \dots, s_t]$, the context generated so far, and
the transcript $[t_1, t_2, \dots, t_N]$.
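Spelled out, this objective is the standard conditional cross-entropy over the reference summary; the formula below merely restates the description above and is not reproduced from the paper:

\[
\mathcal{L} = -\sum_{t=0}^{K-1} \log p_\theta\big(s_{t+1} \mid s_0, \dots, s_t,\; t_1, \dots, t_N\big),
\]

where $K$ is the summary length and $\theta$ denotes the trainable parameters.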
4.2 Multimodal Augmentation
Our hypothesis is that adding multimodal informa-
tion to a textual summarizer (i.e., converting the
textual encoder to a multimodal one) will increase
the quality of its output summaries. We expect
that the video/audio will compensate for important
non-verbal information typically absent from the
transcript (e.g., who is speaking to whom, who is
present in the same room, who is crying or yelling).
We further expect multimodal information to make
up for the loss of context incurred by content se-
lection. We next describe how we compute multi-
modal representations for an episode and how we
augment BART with these representations.
Multimodal Representations
We use utter-
ances as the unit of representation for multimodal
information. We segment episodes into shots (us-
ing PySceneDetect; https://github.com/Breakthrough/PySceneDetect) and map these to utterances
in the corresponding transcript. Specifically, we
align the closed captions in the video which are
time-stamped to the utterances in the transcript
using Dynamic Time Warping (DTW; Myers and
Rabiner 1981;Papalampidi et al. 2021b). We thus
create a one-to-many alignment where an utter-
ance corresponds to one or more shots. For each
shot, we extract textual, visual, and audio features
(see Appendix B.1 for details), and compute an
utterance-level representation for each modality by
average pooling over all aligned shots.
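As a small illustration of this pooling step, assume shot-level features for one modality and an utterance-to-shot alignment such as the one produced by DTW above; the array shapes and alignment format are assumptions for illustration.

```python
import numpy as np


def utterance_features(shot_feats: np.ndarray,
                       alignment: dict[int, list[int]],
                       n_utterances: int) -> np.ndarray:
    """Average-pool shot-level features of one modality (textual, visual,
    or audio) into one vector per utterance, following a one-to-many
    utterance-to-shot alignment."""
    dim = shot_feats.shape[1]
    utt_feats = np.zeros((n_utterances, dim), dtype=shot_feats.dtype)
    for utt_idx, shot_ids in alignment.items():
        if shot_ids:                           # some utterances may have no shot
            utt_feats[utt_idx] = shot_feats[shot_ids].mean(axis=0)
    return utt_feats


# Toy example: 5 shots with 4-dim features, 3 utterances.
shots = np.random.randn(5, 4)
align = {0: [0, 1], 1: [2], 2: [3, 4]}         # utterance -> aligned shot indices
print(utterance_features(shots, align, n_utterances=3).shape)  # (3, 4)
```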
Given textual $x_i$, visual $v_i$, and audio $a_i$ repre-