LiveSeg: Unsupervised Multimodal Temporal Segmentation
of Long Livestream Videos
Jielin Qiu1,2, Franck Dernoncourt1, Trung Bui1, Zhaowen Wang1, Ding Zhao2, Hailin Jin1
1Adobe Research, 2Carnegie Mellon University
{jielinq,dingzhao}@andrew.cmu.edu, {dernonco,zhawang,bui,hljin}@adobe.com
Abstract
Livestream videos have become a significant part of online learning, where design, digital marketing, creative painting, and other skills are taught by experienced experts, making these sessions valuable learning materials. However, Livestream tutorial videos are usually hours long and are recorded and uploaded to the Internet directly after the live sessions, making it hard for other people to catch up quickly. An outline would be a beneficial solution, which requires the video to be temporally segmented according to topics. In this work, we introduce MultiLive, a large dataset of Livestream videos, and formulate the task of temporal segmentation of long Livestream videos (TSLLV). We propose LiveSeg, an unsupervised Livestream video temporal Segmentation solution, which takes advantage of multimodal features from different domains. Our method achieves a 16.8% F1-score improvement over the state-of-the-art method.
1. Introduction
Video temporal segmentation has become increasingly important since it is the basis for many real-world applications, e.g., video scene detection and shot boundary detection. Video temporal segmentation can be considered an essential pre-processing step, and an accurate temporal segmentation result could benefit many other tasks. Video temporal segmentation methods fall into two directions: unimodal and multimodal approaches. Unimodal approaches only use the visual modality of the videos to learn scene changes or transitions, typically in a supervised manner, while multimodal methods also exploit available textual metadata and learn joint semantic representations, often in an unsupervised way.
A considerable number of long Livestream videos are uploaded to the Internet every day, but it is challenging to quickly understand the main content of such a long video. Traditionally, we can only form a rough guess by reading the video's title or by manually scrubbing through the video with the control bar, which is time-consuming, inaccurate, and very likely to miss valuable information. An advantageous solution is to segment the long video into small segments based on the topics, making it easier for users to navigate the content.

Figure 1. Comparison of temporal-pairwise cosine distance on visual features: (TOP) a Livestream video, (BOTTOM) a TVSum video (Blue & Green: distance; Red: segment boundaries).
Most existing video temporal segmentation work has focused on short videos. Some work explored movie clips extracted from long videos, but these can easily be segmented temporally by scene changes. For example, Jadon et al. [31] proposed a summarization method based on the SumMe dataset [26], which consists of 1-6 minute short videos with clear visual changes. When it comes to long Livestream videos, previous methods do not work well due to the extreme length and new characteristics of Livestream videos. So the critical problem is finding a practical approach to temporally segment Livestream videos, since the quality of the segmentation results can significantly impact further tasks. We therefore propose a new task, TSLLV, temporal segmentation of long Livestream videos, which has not been explored yet. Different from other long videos, e.g., movies, Livestream videos usually contain noisier visual information due to abrupt visual changes, and noisier language information due to random chatting, conversational language, and intermittent sentences, which means the content is neither clear nor well-organized, making it extremely hard to detect segment boundaries. A comparison of the visual noisiness of Livestream videos and other videos, and examples of Livestream transcripts, are introduced in Section 3.
To sum up, the main difficulties in temporally segmenting Livestream videos are:
(1) The visual background remains similar for a considerable time even after the topic has changed, making the definition of boundaries ambiguous. In our MultiLive dataset, collected from Behance (https://www.behance.net/live), the hosts usually teach drawing or painting, where the main background is the board and remains similar for most of the video. In contrast, a movie's background changes dramatically when switching to another scene; therefore, Livestream videos cannot be split directly based on visual scene changes or transitions. Fig. 1 shows an example comparison of the temporal-pairwise cosine distance (the distance between the i-th and (i+1)-th frames of the same video) on visual features between a Livestream video and a TVSum video [64], which shows that the Livestream video's segment boundaries are not aligned with visual scene changes, making it difficult to segment.
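To make the distance curves in Fig. 1 concrete, below is a minimal sketch of how a temporal-pairwise cosine distance curve can be computed from per-frame visual features. The feature extractor, frame sampling rate, and array shapes are illustrative assumptions, not the exact setup used for the figure.

```python
import numpy as np

def temporal_pairwise_cosine_distance(features: np.ndarray) -> np.ndarray:
    """Cosine distance between the i-th and (i+1)-th frame features.

    features: array of shape (num_frames, feature_dim), e.g. one embedding
    per sampled frame from a pre-trained CNN backbone.
    Returns an array of length num_frames - 1.
    """
    a, b = features[:-1], features[1:]
    cos_sim = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    return 1.0 - cos_sim

# Example: random vectors stand in for real frame embeddings.
frame_feats = np.random.rand(500, 2048)   # hypothetical 500 frames, 2048-d
dist_curve = temporal_pairwise_cosine_distance(frame_feats)
```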
(2) The visual change is neither consistent nor clear. As shown in Fig. 1, there are abrupt visual changes due to the host changing folders or zooming in/out, making the visual information extremely noisy.
(3) There is not enough labeled data for this kind of Livestream video, and it is challenging, time-consuming, and expensive to label them manually, because it requires human annotators to watch the entire video, understand the topics, and then temporally segment it, which is much more complicated than labeling images.
Our contributions are listed as follows:
• We introduce MultiLive, a new large dataset of Livestream videos, among which 1,000 videos were manually segmented and annotated, providing human insights and references for evaluation.
• We formulate a new task, temporal segmentation of long Livestream videos (TSLLV), based on the newly introduced MultiLive dataset.
• We propose LiveSeg, an unsupervised Livestream temporal Segmentation method that explores multimodal visual and language information as a solution to TSLLV. We extract features from both modalities, explore the relationships and dependencies across domains, and generate accurate segmentation results. LiveSeg achieves a 16.8% F1-score improvement over the SOTA method.
2. Related Work
Video Temporal Segmentation. Temporal segmentation aims at generating small segments based on the content or topics of the video, which is easy to achieve when the video is short or when the scene change is easy to detect, e.g., in movie clips. Previous works mainly focused on short videos or videos with clear scene changes, where it is feasible to manually label a large number of videos as training sets for supervised learning [36, 61, 84, 48, 22, 62, 2].
Action, Shot, and Scene Segmentation. Temporal action segmentation in videos has been widely explored [74, 83, 38, 23, 37, 59, 76]. However, those videos' characteristics are far different from Livestream videos: the actions are well-defined, the main goal is to group similar actions based on visual change, and the videos are much shorter, so these methods cannot be adopted directly. The shot boundary detection task is also very relevant and has been explored in many previous works [28, 66, 29, 3], where a shot is defined by visual change. However, in Livestream videos, segments are not solely defined by visual information; the topics contained in the language also contribute to the definition of each segment. Video scene detection is the most relevant task. However, previous methods only used visual information to detect scene changes [52, 56, 57, 11, 81], so these methods cannot be adopted directly for Livestream videos either.
Unsupervised Methods. Recently, unsupervised methods have also been explored for video temporal segmentation. [34] proposed incorporating multiple feature sources with chunk and stride fusion to segment the video, but the datasets used are still short videos [26, 64]. [20] used Livestream videos as materials; however, they used internal software usage logs as the segmentation reference, which are not available for most videos, making their method highly restricted, because for most videos we can only access visual and audio/language metadata.
Summary. Although previous models have shown reasonable results, they still suffer from some drawbacks. Most work targeted short videos with clear scene changes instead of long videos, and only used visual information while ignoring other domains, such as language. Due to the characteristics of the Livestream videos in our MultiLive dataset, methods that solely depend on visual features cannot obtain accurate results, so a multimodal approach is needed to incorporate both visual and language information.
3. MultiLive Dataset
We introduce a large Livestream video dataset collected from Behance (the license has been obtained and will be provided when the dataset is released), which contains Livestream videos for showcasing and discovering creative work. For each video, the dataset includes the video ID, title, video metadata, transcript metadata extracted from the audio signal (by Microsoft ASR [77]), and the offset (timestamp) and duration of each sentence. The whole dataset contains 11,285 Livestream videos with a total duration of 15,038.4 hours; the average duration per video is 1.3 hours. The entire transcript contains 8,001,901 sentences, and the average transcript length per video is 709 sentences. (An example transcript is shown in the Appendix.) The detailed statistics of the dataset are shown in Table 1 and Table 2. As Tables 1 and 2 show, most videos are less than 3 hours long, and most videos' transcripts contain fewer than 1,500 sentences. In addition, we show histograms of the video length distribution and transcript length distribution in Fig. 2.
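For illustration, the per-video metadata described above can be pictured as a record like the following sketch. The field names and values here are hypothetical and may differ from the released dataset schema.

```python
# Hypothetical shape of one MultiLive record; field names and values are
# illustrative only and may not match the released dataset schema.
example_video = {
    "video_id": "behance_000123",            # placeholder ID
    "title": "Digital painting livestream",  # placeholder title
    "duration_hours": 1.3,                   # dataset-average duration
    "transcript": [
        # One entry per ASR sentence: text, offset (timestamp), duration.
        {"text": "Welcome back, everyone.", "offset_sec": 12.4, "duration_sec": 1.8},
        {"text": "Today we will work on shading.", "offset_sec": 15.0, "duration_sec": 2.6},
    ],
}
```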
Table 1. Distribution of Livestream video duration.
Video Duration Number Percentage
0-1 h 4,827 42.774%
1-2 h 2,945 26.097%
2-3 h 2,523 22.357%
3-4 h 705 6.247%
4-5 h 210 1.861%
5-6 h 70 0.620%
6-7 h 11 0.097%
Table 2. Distribution of transcript length.
Transcript Length Number Percentage
0-500 5,512 48.844%
500-1,000 2,299 20.372%
1,000-1,500 1,890 16.748%
1,500-2,000 989 8.746%
2,000-2,500 365 3.234%
2,500-3,000 118 1.046%
3,000-3,500 84 0.744%
3,500-4,000 35 0.310%
4,000-4,500 12 0.106%
4,500-5,000 3 0.027%
Figure 2. Histogram of MultiLive video length distribution and
transcript length distribution (y-axis: number of videos).
Besides, for the purpose of evaluation, we provide human annotations for 1,000 videos, with segmentation boundaries annotated manually by human annotators. The annotators are asked to watch and understand the whole video and split each one into several segments based on their understanding of the video content. The annotation of the current 1,000 videos involved 10 annotators from Amazon Mechanical Turk (https://www.mturk.com/; legal agreements signed). The annotators were separated into groups; each group watched part of the videos and then discussed the segmentation results together to ensure that the annotation quality was agreed upon by all annotators. They were instructed to pay particular attention to topic changes, i.e., the moments when the live-streamer starts discussing a different topic.
Table 3. Comparison of MultiLive with existing datasets.
Statistics MultiLive SumMe [26] TVSum [64] OVP [6]
Labeled videos 1,000 25 50 50
Ave. length (min) 78 mins 2.4 mins 4.2 mins 1.5 mins
Ave. scene num 8.8 5.1 52.2 8.8
Ave. SLR (min/scene) 8.86 0.47 0.08 0.17
Ave. SD 0.07 0.22 0.19 0.35
There are several widely used video datasets for temporal segmentation or video summarization tasks [26, 64, 6]; Table 3 compares our dataset with them. Each of the other datasets has fewer than or at most 50 labeled videos, while we provide human annotations for 1,000 videos. The average length of the videos in our dataset is much longer than the others, while the number of segments is in the same order of magnitude or even smaller. As a result, the average SLR (scene length ratio) of the Livestream dataset is much larger. The SLR can be considered a metric representing the average length of each scene in a video, calculated as (ave. length / ave. scene num). The larger the ratio, the more content is contained in each segment, making it more difficult to find the segment boundaries.
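As a quick check, the SLR values in Table 3 follow directly from this ratio; the numbers below are taken from Table 3.

```python
# Scene length ratio (SLR) = average video length / average scene count.
datasets = {
    "MultiLive": (78.0, 8.8),   # avg length (min), avg scene num, from Table 3
    "SumMe":     (2.4, 5.1),
    "TVSum":     (4.2, 52.2),
    "OVP":       (1.5, 8.8),
}
for name, (avg_len_min, avg_scenes) in datasets.items():
    print(f"{name}: SLR = {avg_len_min / avg_scenes:.2f} min/scene")
# Output: MultiLive 8.86, SumMe 0.47, TVSum 0.08, OVP 0.17 (matches Table 3)
```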
Figure 3. (a) Visual features of a Livestream video; (b) visual features of a TVSum video, where different colors represent different segments within one video.
To provide a more precise understanding of the visual information in Livestream videos, we compared the visual features extracted from one example Livestream video and one example TVSum video [64]. We extracted video frames from the raw video sequence, used a ResNet50 model [30] (pre-trained on ImageNet) to extract the visual features of each video frame, and adopted t-SNE [69] to visualize the visual features. Fig. 3(a) shows the Livestream video's visual feature distribution, where different colors with the same marker "o" represent different segments (ten segments in total). We find that feature points belonging to different segments are mixed together and thus hard to separate. As for the TVSum video's result in Fig. 3(b), the feature points from different segments are clearly separated.
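A minimal sketch of this visualization pipeline is shown below, assuming frames have already been extracted as images. It uses a standard torchvision ResNet-50 and scikit-learn t-SNE; details such as the frame sampling rate and preprocessing are assumptions, not the exact configuration used for Fig. 3.

```python
import numpy as np
import torch
from torchvision import models, transforms
from sklearn.manifold import TSNE

# Pre-trained ResNet-50 with the classification head removed, so the
# output is a 2048-d feature vector per frame.
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed_frames(frames):
    """frames: list of PIL images sampled from one video."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return backbone(batch).numpy()      # shape: (num_frames, 2048)

def tsne_projection(features: np.ndarray) -> np.ndarray:
    """Project per-frame features to 2-D for a Fig. 3-style scatter plot."""
    return TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
```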