
To sum up, the main difficulties in temporally segmenting Livestream videos are:
(1) The visual background remains similar for a considerable time, even though the topic has already changed, making the definition of boundaries ambiguous. In our MultiLive dataset collected from Behance1, the hosts usually teach drawing or painting, so the main background is the drawing board and remains similar for most of the video. In contrast, a movie's background changes dramatically when switching to another scene, so Livestream videos cannot be split directly based on visual scene changes or transitions. Fig. 1 compares the temporal-pairwise cosine distance (the distance between the visual features of the ith and (i+1)th frames of the same video) for a Livestream video and a TVSum video [64]; it shows that the Livestream video's segment boundaries are not aligned with visual scene changes, making it difficult to segment (a minimal computation sketch is given after this list).
(2) The visual change is neither consistent nor clear. As shown in Fig. 1, there are abrupt visual changes caused by the host switching folders or zooming in/out, which makes the visual information extremely noisy.
(3) There is not enough labeled data for this kind of Livestream video, and labeling it manually is challenging, time-consuming, and expensive: annotators must watch the entire video, understand the topics, and then temporally segment it, which is much more complicated than labeling images.
Our contributions are listed as follows:
• We introduce MultiLive, a new large dataset of Livestream videos, among which 1,000 videos are manually segmented and annotated, providing human insights and a reference for evaluation.
• We formulate a new temporal segmentation of long
Livestream videos (TSLLV) task according to the
newly introduced MultiLive dataset.
• We propose LiveSeg, an unsupervised Livestream temporal Segmentation method that explores multimodal visual and language information as a solution to TSLLV. We extract features from both modalities, explore the relationships and dependencies across domains, and generate accurate segmentation results. LiveSeg achieves a 16.8% F1-score improvement over the SOTA method.
2. Related Work
Video Temporal Segmentation Temporal segmentation
aims at generating small segments based on the content or
topics of the video, which is easy to achieve when the video
is short or when the scene change is easy to detect, e.g., in
movie clips. Previous works mainly focused on short videos or videos with clear scene changes, for which it is convenient to manually label a large number of videos as training sets for supervised learning [36, 61, 84, 48, 22, 62, 2].
1 https://www.behance.net/live
Action, Shot, and Scene Segmentation Temporal action segmentation in videos has been widely explored [74, 83, 38, 23, 37, 59, 76]. However, the characteristics of those videos are far from those of Livestream videos: the actions are well defined, the main goal is to group similar actions based on visual change, and the videos are much shorter, so those methods cannot be adopted directly.
The shot boundary detection task is also closely related and has been explored in many previous works [28, 66, 29, 3], where a shot is defined by visual change. However, in Livestream videos, segments are not solely defined by visual information; the topics contained in the language also contribute to the definition of each segment. Video scene detection is the most relevant task, but previous methods only used visual information to detect scene changes [52, 56, 57, 11, 81], so they cannot be adopted directly for Livestream videos either.
Unsupervised Methods Recently, unsupervised methods have also been explored for video temporal segmentation. [34] proposed incorporating multiple feature sources with chunk-and-stride fusion to segment videos, but the datasets used are still short videos [26, 64]. [20] used Livestream videos as materials; however, they used internal software usage as the segmentation reference, which is not available for most videos, making their method highly restricted, since for most videos only the visual and audio/language metadata are accessible.
Summary Although previous models have shown reasonable results, they still suffer from some drawbacks. Most work targeted short videos with clear scene changes instead of long videos, and used only visual information while ignoring other modalities such as language. Due to the characteristics of the Livestream videos in our MultiLive dataset, methods that depend solely on visual features cannot obtain accurate results, so a multimodal approach that incorporates both visual and language information is needed.
3. MultiLive Dataset
We introduce a large Livestream video dataset from Behance2 (the license is obtained and will be provided when the dataset is released), which contains Livestream videos for showcasing and discovering creative work. The dataset includes the video ID, title, video metadata, transcript metadata extracted from the audio signal (by Microsoft ASR [77]), the offset (timestamp) and duration of each sentence, etc. The whole dataset contains 11,285 Livestream videos with a total duration of 15,038.4 hours; the average duration per video is
2 https://www.behance.net/live