
To sum up, the main difficulties in temporally segmenting Livestream videos are:
(1) The visual background remains similar for a considerable time, even though the topic has already changed, making the definition of boundaries ambiguous. In our MultiLive dataset collected from Behance1, the hosts usually teach drawing or painting, so the main background is the drawing board and remains similar for most of the video. In contrast, a movie's background changes dramatically when switching to another scene, so Livestream videos cannot be split directly based on visual scene changes or transitions. Fig. 1 compares the temporal-pairwise cosine distance (the distance between the visual features of the ith and (i+1)th frames of the same video) for a Livestream video and a TVSum video [64]; it shows that the Livestream video's segment boundaries are not aligned with visual scene changes, making it difficult to segment (a minimal computation sketch is given after this list).
(2) The visual change is neither consistent nor clear. As shown in Fig. 1, there are abrupt visual changes caused by the host switching folders or zooming in/out, which makes the visual information extremely noisy.
(3) There is not enough labeled data for this kind of Livestream video, and labeling it manually is challenging, time-consuming, and expensive: annotators must watch the entire video, understand the topics, and then temporally segment it, which is much more complicated than labeling images.
Our contributions are listed as follows:
• We introduce MultiLive, a new large dataset of Livestream videos, among which 1,000 videos are manually segmented and annotated, providing human insights and a reference for evaluation.
• We formulate a new temporal segmentation of long
Livestream videos (TSLLV) task according to the
newly introduced MultiLive dataset.
• We propose LiveSeg, an unsupervised Livestream temporal Segmentation method that explores multimodal visual and language information as a solution to TSLLV. We extract features from both modalities, explore the relationships and dependencies across domains, and generate accurate segmentation results. LiveSeg achieves a 16.8% F1-score improvement over the SOTA method.
2. Related Work
Video Temporal Segmentation Temporal segmentation
aims at generating small segments based on the content or
topics of the video, which is easy to achieve when the video
is short or when the scene change is easy to detect, e.g., in
movie clips. Previous works mainly focused on short videos or videos with clear scene changes, for which it is convenient to manually label a large number of videos as training sets for supervised learning [36, 61, 84, 48, 22, 62, 2].
1 https://www.behance.net/live
Action, Shot, and Scene Segmentation Temporal action segmentation in videos has been widely explored [74, 83, 38, 23, 37, 59, 76]. However, the characteristics of those videos are far from those of Livestream videos: the actions are well defined, the main goal is to group similar actions based on visual change, and the videos are much shorter, so those methods cannot be adopted directly.
The shot boundary detection task is also closely related and has been explored in many previous works [28, 66, 29, 3], where a shot is defined by visual change. However, in Livestream videos, segments are not solely defined by visual information; the topics contained in the language also contribute to the definition of each segment. Video scene detection is the most relevant task, but previous methods only used visual information to detect scene changes [52, 56, 57, 11, 81], so they cannot be adopted directly for Livestream videos either.
Unsupervised Methods Recently, unsupervised methods have also been explored for video temporal segmentation. [34] proposed incorporating multiple feature sources with chunk-and-stride fusion to segment videos, but the datasets used are still short videos [26, 64]. [20] used Livestream videos as materials; however, they used internal software usage as the segmentation reference, which is not available for most videos, making their method highly restricted, since for most videos only the visual and audio/language metadata are accessible.
Summary Although previous models have shown reasonable results, they still suffer from some drawbacks. Most work targeted short videos with clear scene changes instead of long videos, and used only visual information while ignoring other modalities such as language. Due to the characteristics of the Livestream videos in our MultiLive dataset, methods that depend solely on visual features cannot obtain accurate results, so a multimodal approach that incorporates both visual and language information is needed.
3. MultiLive Dataset
We introduce a large Livestream video dataset from Behance2 (the license is obtained and will be provided when the dataset is released), which contains Livestream videos for showcasing and discovering creative work. The dataset includes the video ID, title, video metadata, transcript metadata extracted from the audio signal (by Microsoft ASR [77]), the offset (timestamp) and duration of each sentence, etc. The whole dataset contains 11,285 Livestream videos with a total duration of 15,038.4 hours; the average duration per video is
2 https://www.behance.net/live