
Episodes                                  4,575
Input: transcript + video + audio
  Shots                                   1,048,024
  Shots/episode                           193.64 (109.09)
  Utterances/episode                      322.76 (116.52)
  Tokens/episode                          5,720.55 (2,223.38)
Output: summaries
  Summaries/episode                       1.53 (0.79)
  TVMegaSite  (#summaries / #tokens)      4,280    395.69 (275.84)
  YouTube     (#summaries / #tokens)        334    136.22 (45.12)
  IMDb        (#summaries / #tokens)        946    111.21 (82.18)
  tvdb        (#summaries / #tokens)      1,454    126.14 (82.14)
Training (unique input-output pairs)      5,199
Validation episodes                       296
Testing episodes                          296

Table 2: SummScreen3D statistics. For summaries, we show their provenance, the number of summaries per site (second column), and the mean number of tokens per summary; standard deviations are shown in parentheses.
tional overhead. Despite recent attempts to make self-attention more efficient (Kitaev et al., 2019; Tay et al., 2020; Zaheer et al., 2020), it is still unclear whether efficient self-attention has an advantage over content selection with a full-attention mechanism (Zhang et al., 2021b; Shaham et al., 2022) for long dialogue summarization. Zhong et al. (2021a) incorporate dialogue-specific objectives for pre-training summarization models, while Zhang et al. (2021a) follow a different approach and hierarchically summarize the input chunk-by-chunk.
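As an illustration of the latter strategy only, the following is a minimal, generic sketch of chunk-by-chunk hierarchical summarization; the chunking granularity and the `summarize` callable are placeholders, not the exact procedure of Zhang et al. (2021a).

```python
# Generic sketch of hierarchical, chunk-by-chunk summarization.
# `summarize` stands in for any short-input summarizer.
from typing import Callable, List

def hierarchical_summarize(utterances: List[str],
                           summarize: Callable[[str], str],
                           chunk_size: int = 256) -> str:
    """Summarize fixed-size chunks of the dialogue, then summarize the
    concatenation of the chunk summaries into one episode-level summary."""
    chunks = [" ".join(utterances[i:i + chunk_size])
              for i in range(0, len(utterances), chunk_size)]
    chunk_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(chunk_summaries))
```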
Parameter-efficient Tuning
Fine-tuning is a common approach for transferring pre-trained models to different tasks or domains (Howard and Ruder, 2018). It is customary to fine-tune all the parameters of the pre-trained model, which, however, becomes prohibitive as model size and the number of tasks grow. Recent work has proposed parameter-efficient transfer learning methods which fine-tune only a small number of additional parameters. Two popular approaches are adapter tuning, where bottleneck layers are added and tuned at every layer of the model (Rebuffi et al., 2017; Houlsby et al., 2019), and prompt tuning, where (soft) prompts are prepended as part of the input (Brown et al., 2020; Li and Liang, 2021). In this work, following recent adapter-based approaches that efficiently convert LMs to vision-and-language models (Sung et al., 2022), we adopt the former method to adapt a textual summarizer to our multimodal setting and dialogue input format.
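To make the adapter idea concrete, below is a minimal PyTorch sketch of a Houlsby-style bottleneck adapter and of freezing everything except adapter parameters; module names and the bottleneck size are illustrative assumptions, not the exact architecture used in this paper.

```python
# Minimal sketch of bottleneck adapter tuning (Houlsby et al., 2019 style).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, apply a non-linearity, up-project, and add a residual."""
    def __init__(self, hidden_size: int, bottleneck_size: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))

def mark_adapters_trainable(model: nn.Module) -> None:
    """Parameter-efficient tuning: freeze the pre-trained backbone and train
    only parameters whose names mark them as adapter weights."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```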
3 The SummScreen3D Dataset
SummScreen (Chen et al., 2021) is a long dialogue summarization dataset containing transcripts from TV episodes and human-written abstractive summaries.[2] We extend this dataset to a multimodal setting by also considering the corresponding full-length videos. SummScreen is divided into two subsets depending on the series genre: SummScreen-FD and SummScreen-TMS. We use the latter subset, which mostly covers soap operas from TVMegaSite,[3] as it is easier to obtain full-length videos and each series has hundreds of episodes.
For each episode in SummScreen-TMS, we automatically search YouTube for the episode title and release date. If there is a match with a sufficiently long duration (indicating a full episode rather than a segment), we download the video and its closed captions (CC). Overall, we collected videos for 4,575 episodes from five different shows in SummScreen-TMS.[4]
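As an illustration only (the authors' collection scripts are to be released separately, see footnote [4]), a duration-based matching step of this kind could look roughly as follows using yt-dlp; the search template, the 30-minute threshold, and the output layout are assumptions.

```python
# Hypothetical sketch of matching an episode to a full-length YouTube upload
# via a duration check, then downloading the video plus English captions.
from typing import Optional
from yt_dlp import YoutubeDL

MIN_DURATION_SEC = 30 * 60  # assumption: a full episode runs longer than 30 minutes

def find_full_episode(title: str, release_date: str) -> Optional[str]:
    """Search YouTube for `title release_date` and return the URL of the first
    result long enough to be a full episode, or None."""
    query = f"ytsearch5:{title} {release_date}"
    with YoutubeDL({"quiet": True}) as ydl:
        results = ydl.extract_info(query, download=False)
    for entry in results.get("entries", []):
        if entry and (entry.get("duration") or 0) >= MIN_DURATION_SEC:
            return entry["webpage_url"]
    return None

def download_with_captions(url: str, out_dir: str) -> None:
    """Download the video together with English closed captions."""
    opts = {
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
        "writesubtitles": True,       # uploaded closed captions
        "writeautomaticsub": True,    # fall back to auto-generated captions
        "subtitleslangs": ["en"],
    }
    with YoutubeDL(opts) as ydl:
        ydl.download([url])
```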
In addition to TVMegaSite summaries (distributed with SummScreen), we further retrieved summaries from YouTube descriptions, IMDb, and tvdb, again using the episode title and release date as search terms. The statistics of our dataset, which we call SummScreen3D (3D for language, video, and audio), are in Table 2, and we provide further details in Appendix A. As can be seen, each episode has (on average) multiple references which vary in length (TVMegaSite summaries are the longest).
We split SummScreen3D into training, validation, and test sets with the same distribution over different shows per set. We reserved 296 episodes for validation and the same number for testing, and used the rest for training. Since we have multiple reference summaries for some episodes, we increased the size of the training set by adding m episode-summary pairs, matching the same episode with each of its m references. This resulted in 5,199 unique samples for training.
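As a small illustration of this expansion, the following hypothetical snippet pairs each episode with each of its m references; the field names are assumptions, not the released data format.

```python
# Sketch: an episode with m reference summaries yields m training pairs.
from typing import Iterable, List, Tuple

def build_training_pairs(episodes: Iterable[dict]) -> List[Tuple[str, str]]:
    """`episodes` is assumed to hold {'transcript': str, 'summaries': [str, ...]}."""
    pairs = []
    for episode in episodes:
        for summary in episode["summaries"]:
            pairs.append((episode["transcript"], summary))
    return pairs
```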
4 Video-to-Text Summarization
Our approach leverages the generation capabilities of large pre-trained sequence-to-sequence models (Lewis et al., 2020; Raffel et al., 2020). As our backbone model, we employ BART-large (Lewis et al., 2020), which has been fine-tuned on CNN-DailyMail (Nallapati et al., 2016; Zhang et al., 2021b) and has thus acquired a summarization inductive bias. As TV show transcripts are very long and cannot fit into BART, we select a subset of utterances (i.e., speaker turns) as input via content
[2] https://github.com/mingdachen/SummScreen
[3] http://tvmegasite.net
[4]
We will release scripts for data collection and processing.
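To make the backbone described in Section 4 concrete, here is a minimal sketch of loading a BART-large checkpoint fine-tuned on CNN-DailyMail with HuggingFace Transformers and summarizing a pre-selected subset of utterances; the facebook/bart-large-cnn checkpoint, beam size, and length limits are assumptions rather than the authors' exact configuration.

```python
# Sketch: summarize a content-selected subset of speaker turns with a
# CNN-DailyMail-fine-tuned BART-large checkpoint (public example checkpoint).
from typing import List
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def summarize_selected_utterances(utterances: List[str],
                                  max_new_tokens: int = 256) -> str:
    """Summarize a pre-selected subset of utterances that fits within
    BART's 1,024-token input window."""
    inputs = tokenizer(" ".join(utterances), return_tensors="pt",
                       truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```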