Hierarchical3D Adapters for Long Video-to-text Summarization
Pinelopi Papalampidi Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
p.papalampidi@sms.ed.ac.uk,mlap@inf.ed.ac.uk
Abstract
In this paper, we focus on video-to-text sum-
marization and investigate how to best utilize
multimodal information for summarizing long
inputs (e.g., an hour-long TV show) into long
outputs (e.g., a multi-sentence summary). We
extend SummScreen (Chen et al.,2021), a
dialogue summarization dataset consisting of
transcripts of TV episodes with reference sum-
maries, and create a multimodal variant by col-
lecting corresponding full-length videos. We
incorporate multimodal information into a pre-
trained textual summarizer efficiently using
adapter modules augmented with a hierarchi-
cal structure while tuning only 3.8% of model
parameters. Our experiments demonstrate that
multimodal information offers superior per-
formance over more memory-heavy and fully
fine-tuned textual summarization methods.
1 Introduction
What happens in the very last episode of “Friends”?
Anyone who has seen this episode can summa-
rize its key moments: Ross confesses his love for
Rachel, they decide to resume their relationship,
while Monica and Chandler adopt twins and move
to the suburbs. TV viewers can naturally perform
this dialogue summarization task having access to
multiple modalities: they not only hear the actors
speak but also see their expressions, actions, and
whereabouts on screen.
Despite recent advances in summarization (Nal-
lapati et al.,2016;See et al.,2017;Liu and Lapata,
2019b) and increasing interest in different types of
dialogue summarization, e.g., from meeting tran-
scripts (Gliwa et al.,2019;Zhong et al.,2021b) or
screenplays (Chen et al.,2021), the contribution
of modalities other than text remains relatively un-
derstudied. This is not entirely surprising given
the challenges associated with the multimodal sum-
marization task illustrated above (e.g., producing a
written summary of a TV episode). Firstly, the input
is long, cannot fit into standard sequence-to-sequence
architectures, and its different modalities have to be
combined somehow; secondly, the output is also long,
as summaries consist of multiple sentences and a rich
vocabulary; and thirdly, the task involves
complex inference over long-range dependencies
between events and characters and common sense
reasoning. At the same time, creating large-scale
multimodal datasets with long videos and aligned
textual data is challenging and time consuming,
limiting the research conducted in this domain.
Previous work on video-to-video summariza-
tion identifies highlights from YouTube videos, TV
shows, or movies (Song et al.,2015;Gygli et al.,
2014;De Avila et al.,2011;Papalampidi et al.,
2021b). However, in most cases, either the videos
are short or the datasets are small with a few hun-
dred examples. There is also limited work on video-
to-text summarization. We are only aware of one
large-scale multimodal dataset for this task, namely
How2 (Sanabria et al.,2018), which again contains
short videos (i.e., 2–3 minutes long) with simple
semantics, and short, single-sentence summaries.
In this paper, we focus on video-to-text summa-
rization and investigate how to best utilize mul-
timodal information for condensing long inputs
(e.g., an hour-long TV show) into long outputs
(e.g., a multi-sentence summary). We create a
multimodal variant of SummScreen (Chen et al.,
2021), a recently released dataset comprising
transcripts of TV episodes and their summaries.
We collect full-length videos for 4,575 episodes
and multiple reference summaries. We build our
model on top of a pre-trained sequence-to-sequence
architecture (i.e., BART; Lewis et al. 2020) fine-
tuned on summarization and capable of generating
fluent long text. We convert its textual encoder
to a multimodal one by adding and tuning only
adapter layers (Rebuffi et al.,2017;Houlsby et al.,
2019), which account for 3.8% of model parameters.
Task           | Modality            | Input | Output | Datasets
text-to-text   | text                | short | short  | XSum (Narayan et al., 2018), CNN-DailyMail (Nallapati et al., 2016), NYT (Durrett et al., 2016), Gigaword (Napoles et al., 2012)
text-to-text   | text                | long  | long   | SamSum (Gliwa et al., 2019), QMSum (Zhong et al., 2021b), SummScreen (Chen et al., 2021)
video-to-video | vision              | short | short  | OVP (De Avila et al., 2011), YouTube (De Avila et al., 2011), SumMe (Gygli et al., 2014)
video-to-video | vision/text         | short | short  | TVSum (Song et al., 2015)
video-to-video | vision/text(/audio) | long  | long   | LoL (Fu et al., 2017), TRIPOD+ (Papalampidi et al., 2021b)
video-to-text  | vision              | long  | short  | TACoS (Rohrbach et al., 2014) [1]
video-to-text  | vision/text/audio   | short | short  | How2 (Sanabria et al., 2018)
video-to-text  | vision/text/audio   | long  | long   | SummScreen3D
Table 1: Summarization datasets grouped based on input/output modality and input/output length.
[1] TACoS contains only 127 cooking videos without corresponding transcripts and hence cannot be used for multimodal summarization.
We also explore strategies for content selec-
tion, since the input is too long to fit into standard
sequence-to-sequence models. Empirical results
across evaluation metrics demonstrate that mul-
timodal information yields superior performance
over just text, both in terms of content selection
and summarization; this is the case even when our
adapter model is compared to fully fine-tuned ap-
proaches and more memory-heavy architectures
(e.g., Longformer; Beltagy et al. 2020) that can
process the entire input.
Our contributions can be summarized as follows:
(1) we augment SummScreen (Chen et al.,2021)
with multimodal information, providing videos
aligned with transcripts and summaries; to the best
of our knowledge, this constitutes the largest avail-
able resource for long video multimodal summa-
rization; (2) we propose a parameter efficient ap-
proach to augment a pre-trained textual summarizer
with multimodal information; and (3) we explore dif-
ferent methods for identifying salient moments in a
long video and show that multimodal information
also improves content selection.
2 Related Work
Video Summarization
Much previous work has
focused on text-to-text or video-to-video summa-
rization. We provide a comprehensive categoriza-
tion of existing datasets according to input/output
length and modality in Table 1. Multimodal
abstractive summarization (video-to-text) has at-
tracted less attention, mainly due to the difficulty
of collecting large-scale datasets. How2 (Sanabria
et al.,2018) is the only publicly available bench-
mark for this task; it includes short instructional
videos with textual transcripts and one-sentence
summaries. We generate multiple-sentence sum-
1
TACoS contains only 127 cooking videos without corre-
sponding transcripts and hence cannot be used for multimodal
summarization.
maries from long videos and their transcripts. Previ-
ous approaches to multimodal summarization have
focused on various modality fusion methods with
small RNN-based models (Palaskar et al.,2019).
We take advantage of large pre-trained LMs (Lewis
et al.,2020;Raffel et al.,2020;Radford et al.,2019)
for generating fluent textual summaries.
Recent years have also witnessed increasing in-
terest in multimodal video captioning, a task related
to multimodal summarization, which aims to gen-
erate one-sentence descriptions for localized events
in short videos (Xu et al.,2016;Rohrbach et al.,
2017;Zhou et al.,2018;Lei et al.,2020b). Exist-
ing methods employ strong language-and-vision
encoders with massive pre-training (Li et al.,2020;
Luo et al.,2020;Xu et al.,2021;Lei et al.,2020a;
Li et al.,2021), while the decoder is typically shal-
low and under-trained. Although good at generat-
ing short descriptions, they cannot maintain fluency
in long outputs with rich vocabularies.
Realizing the importance of large LMs for gener-
ation, recent work has focused on how to efficiently
render pre-trained LMs multimodal. Notably, Tsim-
poukelli et al. (2021) convert a pre-trained LM into
an image captioning model, by giving images as
prompts and training only a vision encoder. Yu
et al. (2021) summarize How2 videos by augment-
ing BART-base with visual information via a new
cross-attention block added to every encoder layer.
However, their approach cannot easily scale to
BART-large and beyond since they add a large num-
ber of new parameters, while the dataset sizes are
relatively small, leading to over-fitting.
Dialogue Summarization
In the context of text-
to-text generation, dialogue summarization is chal-
lenging due to the difficulty of fitting very long
input into pre-trained sequence-to-sequence mod-
els. Longformer (Beltagy et al.,2020) alleviates
this by employing local self-attention in combina-
tion with global tokens for reducing the computational overhead.
Episodes                               4,575
Input: transcript + video + audio
  Shots                                1,048,024
  Shots/episode                        193.64 (109.09)
  Utterances/episode                   322.76 (116.52)
  Tokens/episode                       5720.55 (2223.38)
Output: summaries
  Summaries/episode                    1.53 (0.79)
  TVMegaSite: 4,280 summaries          395.69 (275.84) tokens
  YouTube: 334 summaries               136.22 (45.12) tokens
  IMDb: 946 summaries                  111.21 (82.18) tokens
  tvdb: 1,454 summaries                126.14 (82.14) tokens
Training (unique input-output pairs)   5,199
Validation episodes                    296
Testing episodes                       296
Table 2: SummScreen3D statistics. For summaries, we show their provenance, the number of summaries per site, and the mean number of tokens per summary; standard deviations are shown in parentheses.
Despite recent attempts to make
self-attention more efficient (Kitaev et al.,2019;
Tay et al.,2020;Zaheer et al.,2020), it is still
unclear whether it has an advantage over content
selection with a full-attention mechanism (Zhang
et al.,2021b;Shaham et al.,2022) for long dia-
logue summarization. Zhong et al. (2021a) incor-
porate dialogue-specific objectives for pre-training
summarization models, while Zhang et al. (2021a)
follow a different approach and hierarchically sum-
marize the input chunk-by-chunk.
Parameter-efficient Tuning
Fine-tuning is a
common approach for transferring pre-trained mod-
els to different tasks or domains (Howard and
Ruder,2018). It is customary to fine-tune all the
parameters of the pretrained model which, however,
becomes prohibitive as model size and number of
tasks grow. Recent work has proposed parameter-
efficient transfer learning methods which fine-tune
only a small number of additional parameters. Two
popular approaches include adapter tuning, where
bottleneck layers are added and tuned at every layer
of the model (Rebuffi et al.,2017;Houlsby et al.,
2019) and prompt tuning, where (soft) prompts are
prepended as part of the input (Brown et al.,2020;
Li and Liang,2021). In this work, following recent
adapter-based approaches that efficiently convert
LMs to vision-and-language models (Sung et al.,
2022), we utilize the former method for adapting a
textual summarizer to our multimodal setting and
dialogue input format.
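To make the adapter-tuning idea concrete, here is a minimal PyTorch sketch of a vanilla bottleneck adapter in the spirit of Houlsby et al. (2019); the hidden sizes, activation, and placement are illustrative assumptions rather than the exact configuration used later in this paper.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Vanilla adapter block: LayerNorm -> down-projection -> activation
    -> up-projection, wrapped in a residual connection. Only these few
    parameters are trained; the surrounding Transformer layer stays frozen."""

    def __init__(self, d_model: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model), the output of a frozen sub-layer
        return hidden + self.up(self.act(self.down(self.norm(hidden))))


if __name__ == "__main__":
    adapter = BottleneckAdapter()
    states = torch.randn(2, 128, 1024)
    print(adapter(states).shape)  # torch.Size([2, 128, 1024])
```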
3 The SummScreen3D Dataset
SummScreen (Chen et al.,2021) is a long dialogue
summarization dataset containing transcripts from
TV episodes and human-written abstractive sum-
maries (https://github.com/mingdachen/SummScreen). We extend this dataset to a multimodal
setting by also considering the corresponding full-
length videos. SummScreen is divided into two sub-
sets depending on the series genre: SummScreen-
FD and SummScreen-TMS. We use the latter sub-
set, which mostly covers soap operas from TVMegaSite
(http://tvmegasite.net), as it is easier to obtain full-length videos
and each series has hundreds of episodes.
For each episode in SummScreen-TMS, we au-
tomatically search for the title and release date in
YouTube. If there is a match with a sufficiently long
duration (indicating a full episode rather than a
segment), we download the video and closed cap-
tions (CC). Overall, we collected videos for 4,575
episodes from five different shows in SummScreen-
TMS (we will release scripts for data collection and
processing).
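The paper does not detail the collection tooling beyond the scripts to be released; purely as an illustration of the duration-based matching described above, the sketch below assumes the yt-dlp library and a hypothetical 30-minute threshold for separating full episodes from clips.

```python
from yt_dlp import YoutubeDL

# Hypothetical threshold: soap-opera episodes run well over 30 minutes,
# so anything much shorter is likely a clip rather than a full episode.
MIN_DURATION_SECS = 30 * 60


def find_full_episode(title: str, release_date: str) -> dict | None:
    """Search YouTube for an episode and keep the first hit whose duration
    suggests a full episode rather than a short segment."""
    query = f"ytsearch5:{title} {release_date}"
    with YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
        results = ydl.extract_info(query, download=False)
    for entry in results.get("entries", []):
        if entry and (entry.get("duration") or 0) >= MIN_DURATION_SECS:
            return entry  # metadata of the candidate full-length video
    return None


def download_with_captions(url: str, out_dir: str) -> None:
    """Download the matched video together with its closed captions (CC)."""
    opts = {
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "writesubtitles": True,      # human-uploaded CC, when available
        "writeautomaticsub": True,   # fall back to auto-generated captions
        "subtitleslangs": ["en"],
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([url])
```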
In addition to TVMegaSite summaries (dis-
tributed with SummScreen), we further retrieved
summaries from YouTube descriptions, IMDb, and
tvdb, again using the episode title and release date
as search terms. The statistics of our dataset, which
we call SummScreen3D (3D for language, video,
and audio), are shown in Table 2; we provide further
details in Appendix A. As can be seen, each episode
has (on average) multiple references which vary in
length (TVMegaSite summaries are longest).
We split SummScreen3D into training, validation,
and test sets with the same distribution over differ-
ent shows per set. We reserved 296 episodes for val-
idation and the same number for testing, and used
the rest for training. Since we have multiple refer-
ence summaries for some episodes, we increased
the size of the training set by adding $m$ episode-
summary pairs, matching the same episode with
each of its $m$ references. This resulted in 5,199
unique samples for training.
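Concretely, pairing each episode with each of its $m$ references is a per-episode cross product; the field names below are hypothetical and only illustrate the expansion.

```python
from typing import Iterable


def make_training_pairs(episodes: Iterable[dict]) -> list[tuple[str, str]]:
    """Pair each episode's transcript with every one of its m reference
    summaries, yielding m input-output training samples per episode."""
    pairs = []
    for episode in episodes:
        for summary in episode["summaries"]:  # TVMegaSite, YouTube, IMDb, tvdb
            pairs.append((episode["transcript"], summary))
    return pairs


# Toy example: one episode with two references -> two training samples.
demo = [{"transcript": "utterance 1 ... utterance N",
         "summaries": ["TVMegaSite summary ...", "IMDb summary ..."]}]
print(len(make_training_pairs(demo)))  # 2
```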
4 Video-to-Text Summarization
Our approach leverages the generation capabil-
ities of large pre-trained sequence-to-sequence
models (Lewis et al.,2020;Raffel et al.,2020).
As our backbone model, we employ BART-
large (Lewis et al.,2020) which has been fine-tuned
on CNN-DailyMail (Nallapati et al., 2016; Zhang
et al.,2021b) and has thus acquired a summariza-
tion inductive bias. As TV show transcripts are very
long and cannot fit into BART, we select a subset of
utterances (i.e., speaker turns) as input via content selection (see details in Section 5).
[Figure 1 shows two panels: (a) Multimodal augmentation of textual BART; (b) Hierarchical3D adapter for the encoder layers.]
Figure 1: Multimodal augmentation of pre-trained BART. We augment the encoder and decoder layers with adapters which we fine-tune on the target dataset, while the remaining network is frozen. As input, we consider textual tokens and coarse-grained multimodal information which we prepend before each utterance. We also corrupt part of the textual input during training and add an auxiliary MLM loss to the encoder for predicting the corrupted tokens. On the right (panel b), we show the hierarchical adapter added to each encoder layer: after down-projecting all representations, we only consider the multimodal ones and further contextualize them via attention. Then, we combine the representations and up-project again to the original model dimension.
We transfer
this model to our task and domain (i.e., multimodal
dialogue summarization), by adding adapter lay-
ers (Rebuffi et al.,2017;Houlsby et al.,2019) in
both the encoder and decoder, and tuning them
on SummScreen3D while keeping the rest of the
network frozen. We briefly discuss below our back-
bone text-based model and then elaborate on how
we incorporate multimodal information.
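As a rough illustration of this setup, the sketch below loads the publicly available Hugging Face checkpoint of BART-large fine-tuned on CNN-DailyMail, freezes all of its parameters, and estimates the share of parameters that per-layer bottleneck adapters (such as the one sketched in Section 2) would add. The bottleneck size and per-adapter accounting are assumptions; the exact trainable fraction (3.8% in our configuration) depends on the adapter design.

```python
from transformers import BartForConditionalGeneration

# BART-large fine-tuned on CNN-DailyMail (publicly available checkpoint).
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Freeze the entire pre-trained network; only adapters (added separately)
# would be trained.
for param in model.parameters():
    param.requires_grad = False

# Back-of-the-envelope adapter budget, assuming one bottleneck adapter
# (down/up projections with biases plus a LayerNorm) per encoder/decoder layer.
d_model = model.config.d_model                       # 1024 for BART-large
bottleneck = 64                                      # illustrative choice
n_layers = model.config.encoder_layers + model.config.decoder_layers
per_adapter = (2 * d_model * bottleneck              # down + up weights
               + bottleneck + d_model                # down + up biases
               + 2 * d_model)                        # LayerNorm weight + bias
adapter_params = n_layers * per_adapter

total = sum(p.numel() for p in model.parameters()) + adapter_params
print(f"adapter share: {100 * adapter_params / total:.2f}% of all parameters")
```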
4.1 Backbone Textual Model
Our summarizer follows a standard sequence-to-
sequence Transformer architecture (Vaswani et al.,
2017). The encoder maps tokens $[t_1, t_2, \dots, t_N]$
to a sequence of contextualized representations
$[h_1, h_2, \dots, h_N]$, which are then fed to the decoder
for generating the summary. The encoder consists
of $L$ stacked layers, each of which has a self-
attention block for contextualizing the token rep-
resentations, followed by a feed-forward network.
The decoder has a similar architecture; it addition-
ally contains a cross-attention block for identifying
relations between the input and currently gener-
ated text and makes use of masked self-attention
to control access to context for each token. The
decoder is followed by a linear layer (i.e., Lan-
guage Model (LM) head) which projects the out-
put representations onto the vocabulary and a final
softmax layer. The model is optimized for predicting
the next token $s_{t+1}$ in the summary given
$[s_0, s_1, \dots, s_t]$, the context generated so far, and
the transcript $[t_1, t_2, \dots, t_N]$.
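Spelled out, this objective is the standard conditional cross-entropy over the reference summary; the formula below merely restates the description above and is not reproduced from the paper:

\[
\mathcal{L} = -\sum_{t=0}^{K-1} \log p_\theta\big(s_{t+1} \mid s_0, \dots, s_t,\; t_1, \dots, t_N\big),
\]

where $K$ is the summary length and $\theta$ denotes the trainable parameters.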
4.2 Multimodal Augmentation
Our hypothesis is that adding multimodal informa-
tion to a textual summarizer (i.e., converting the
textual encoder to a multimodal one) will increase
the quality of its output summaries. We expect
that the video/audio will compensate for important
non-verbal information typically absent from the
transcript (e.g., who is speaking to whom, who is
present in the same room, who is crying or yelling).
We further expect multimodal information to make
up for the loss of context incurred by content se-
lection. We next describe how we compute multi-
modal representations for an episode and how we
augment BART with these representations.
Multimodal Representations
We use utter-
ances as the unit of representation for multimodal
information. We segment episodes into shots (us-
ing PySceneDetect; https://github.com/Breakthrough/PySceneDetect) and map these to utterances
in the corresponding transcript. Specifically, we
align the closed captions in the video which are
time-stamped to the utterances in the transcript
using Dynamic Time Warping (DTW; Myers and
Rabiner 1981;Papalampidi et al. 2021b). We thus
create a one-to-many alignment where an utter-
ance corresponds to one or more shots. For each
shot, we extract textual, visual, and audio features
(see Appendix B.1 for details), and compute an
utterance-level representation for each modality by
average pooling over all aligned shots.
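As a small illustration of this pooling step, assume shot-level features for one modality and an utterance-to-shot alignment such as the one produced by DTW above; the array shapes and alignment format are assumptions for illustration.

```python
import numpy as np


def utterance_features(shot_feats: np.ndarray,
                       alignment: dict[int, list[int]],
                       n_utterances: int) -> np.ndarray:
    """Average-pool shot-level features of one modality (textual, visual,
    or audio) into one vector per utterance, following a one-to-many
    utterance-to-shot alignment."""
    dim = shot_feats.shape[1]
    utt_feats = np.zeros((n_utterances, dim), dtype=shot_feats.dtype)
    for utt_idx, shot_ids in alignment.items():
        if shot_ids:                           # some utterances may have no shot
            utt_feats[utt_idx] = shot_feats[shot_ids].mean(axis=0)
    return utt_feats


# Toy example: 5 shots with 4-dim features, 3 utterances.
shots = np.random.randn(5, 4)
align = {0: [0, 1], 1: [2], 2: [3, 4]}         # utterance -> aligned shot indices
print(utterance_features(shots, align, n_utterances=3).shape)  # (3, 4)
```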
Given textual $x_i$, visual $v_i$, and audio $a_i$ repre-