
VISUAL ANSWER LOCALIZATION WITH CROSS-MODAL
MUTUAL KNOWLEDGE TRANSFER
Yixuan Weng*†, Bin Li*§‡
†National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
‡College of Electrical and Information Engineering, Hunan University
ABSTRACT
The goal of visual answer localization (VAL) in videos is to obtain a relevant and concise time clip from a video as the answer to a given natural language question. Early methods are based on interaction modelling between the video and the text, predicting the visual answer with a visual predictor. Later work shows that a textual predictor operating on subtitles is more precise for VAL. However, these existing methods still suffer from cross-modal knowledge deviations between visual frames and textual subtitles. In this paper, we propose a cross-modal mutual knowledge transfer span localization (MutualSL) method to reduce this knowledge deviation. MutualSL has both a visual predictor and a textual predictor, and we expect the predictions of the two to be consistent, so as to promote semantic knowledge understanding across modalities. On this basis, we design a one-way dynamic loss function to dynamically adjust the proportion of knowledge transfer. We have conducted extensive experiments on three public datasets for evaluation. The experimental results show that our method outperforms other competitive state-of-the-art (SOTA) methods, demonstrating its effectiveness1.
Index Terms—Cross-modal, Mutual Knowledge Transfer, Visual Answer Localization
1. INTRODUCTION
The explosion of online videos has changed the way people obtain information and knowledge [1, 2]. Various video platforms make it more convenient for people to perform video queries [3, 4]. However, people who want to get direct instructions or tutorials from a video often need to browse the video content several times to locate the relevant parts, which takes time and effort [5]. Visual answer localization (VAL) is an emerging technology that addresses this problem [6], and it has received wide attention because of its practical value [7, 8]. As shown in Fig. 1(a), the task of VAL is to find a time clip that can answer a given question.
1All the experimental datasets and code are open-sourced at https://github.com/WENGSYX/MutualSL.
*: These authors contributed equally to this work.
§: Corresponding author.
Fig. 1. Task description of visual answer localization: (a) the task, with the question, the target answer, and the predictions of the textual and visual predictors; (b) the paradigm of the visual predictor; (c) the paradigm of the textual predictor; (d) our method, cross-modal mutual knowledge transfer with a one-way dynamic loss function.
For example, when inputting “How do you begin ‘automatic start’?”, one may need to find a clip according to the voice content (or the transcribed text subtitles) and the visual frames. VAL technology can not only recognize the video clips relevant to the text question but also return the target visual answer (1:32–2:18).
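To make the task input and output concrete, here is a minimal Python sketch of the VAL interface described above; the names (Subtitle, Video, localize_answer) are illustrative assumptions and are not part of MutualSL:

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcribed speech for this span

@dataclass
class Video:
    frames: List[Any]          # sampled visual frames (or frame features)
    subtitles: List[Subtitle]  # time-stamped transcript

def localize_answer(question: str, video: Video) -> Tuple[float, float]:
    """Return the (start, end) time clip of the video that answers the
    question. A real VAL model scores the question against the frames
    and/or subtitles; this stub only fixes the input/output contract."""
    raise NotImplementedError
```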
Existing VAL methods can be mainly divided into visual predictors and textual predictors according to their prediction contents. The paradigm of the visual predictor is shown in Fig. 1(b): frame features are first extracted from the video, and the frame features queried by the question are then used to predict the relevant time points [9, 10]. The paradigm of the textual predictor is shown in Fig. 1(c): it adopts a span-based method to model the cross-modal information, where the predicted span intervals on the subtitle timeline are used as the final results [8, 11], as sketched below.
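As a minimal illustration (not the paper's implementation), the following sketch shows how a span-based textual predictor converts a predicted subtitle span into a time clip; the function name and data layout are assumptions for the example:

```python
from typing import List, Tuple

def span_to_clip(subtitle_times: List[Tuple[float, float]],
                 start_idx: int, end_idx: int) -> Tuple[float, float]:
    """Map a predicted subtitle span [start_idx, end_idx] to a video
    time clip by reading the subtitles' own timeline."""
    clip_start = subtitle_times[start_idx][0]  # start of the first subtitle in the span
    clip_end = subtitle_times[end_idx][1]      # end of the last subtitle in the span
    return clip_start, clip_end

# Example with three timed subtitles (times in seconds):
# span_to_clip([(12.0, 24.0), (105.0, 113.0), (181.0, 197.0)], 1, 2)
# -> (105.0, 197.0)
```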
The textual predictor performs better than the visual one [7] because it uses the additional subtitle information and embeds the visual information into the text feature space as an auxiliary feature. However, as shown in Fig. 1(a), results from both predictors suffer from cross-modal knowledge deviations. For the textual pre-