
Figure 2. Certain regions (i.e., object areas) of the frame are usually more salient and highly overlap with certain phrases carrying semantic meanings. (Query shown in the figure: "The black cat on the left of fox is walking to eat food in the yard.")
This vanilla method still has two overlooked drawbacks:
1) Slow training convergence. DETR-like methods formulate detection/localization as a set prediction problem and use learnable queries to probe and pool frame features. This structure, however, suffers from the notorious slow-convergence issue [34,38,49]. For example, as shown in Figure 1, TubeDETR requires about 10 epochs to reach saturated performance. Such a problem greatly hinders its practical application. 2) Lack of fine-grained alignments.
Empirically, we find that the nouns (i.e., subjects or objects) in a sentence carry much of its overall meaning. Accordingly, certain patches (i.e., object areas) of the frame are usually more salient and highly overlap with these semantic units. For example, in Figure 2, the sentence contains two instances, i.e., "cat" and "fox". The detailed alignment and differentiation between the mentioned query objects and the corresponding visual areas provide strong localization clues. This fine-grained correlation, however, is overlooked in current Transformer-based methods.
Based on the above observations, we argue that the current query design in video REC methods is sub-optimal. To alleviate this, we propose a novel content-aware query design for Transformers (dubbed ContFormer). We contend that the content-independent query design is the main cause of slow convergence. To this end, we propose to use query embeddings conditioned on the image content. Specifically, we set up a fixed number of bounding boxes across the frame. The cropped and pooled regional features are then transformed into the query features of the Transformer decoder. Compared to conventional high-dimensional learnable queries, our region-based features introduce a stronger saliency prior, leading to faster convergence (cf. Figure 1).
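To make this concrete, below is a minimal sketch of content-conditioned query generation, assuming a PyTorch implementation. The grid size, feature dimensions, and helper names (e.g., make_grid_boxes, ContentQueryGenerator) are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: turn a fixed grid of frame regions into decoder queries.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


def make_grid_boxes(h, w, grid=4):
    """Tile the frame with a fixed grid x grid set of bounding boxes."""
    ys = torch.linspace(0, h, grid + 1)
    xs = torch.linspace(0, w, grid + 1)
    boxes = [torch.stack([xs[j], ys[i], xs[j + 1], ys[i + 1]])
             for i in range(grid) for j in range(grid)]
    return torch.stack(boxes)  # (grid*grid, 4) in (x1, y1, x2, y2)


class ContentQueryGenerator(nn.Module):
    def __init__(self, feat_dim=256, d_model=256, grid=4):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(feat_dim, d_model)  # regional feature -> query

    def forward(self, feat_map, frame_hw):
        # feat_map: (B, C, Hf, Wf) backbone features; frame_hw: (H, W) of the frame
        B, C, Hf, Wf = feat_map.shape
        boxes = make_grid_boxes(*frame_hw, self.grid).to(feat_map.device)
        scale = Hf / frame_hw[0]  # frame coords -> feature-map coords (isotropic)
        # Crop and pool each region to a single vector per box.
        pooled = roi_align(feat_map, [boxes] * B, output_size=1,
                           spatial_scale=scale)       # (B*N, C, 1, 1)
        pooled = pooled.flatten(1).view(B, -1, C)     # (B, N, C)
        return self.proj(pooled)  # (B, N, d_model) content-conditioned queries
```

Pooling each region to one vector keeps the number of queries fixed while grounding every query in actual frame content rather than in a content-independent learned embedding.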
Besides, current datasets only contain coarse-grained region-sentence correspondences. In this work, we take one step further and collect the VID-Entity and VidSTG-Entity datasets (cf. Figure 4), which annotate region-phrase labels by grounding specific phrases in sentences to bounding boxes in the video frames. To fully exploit these detailed annotations, we also propose a fine-grained alignment loss. Specifically, we first compute the similarity scores between each query-word pair. Then, we adopt the Hungarian algorithm [31] to select the query matching the target bounding box. Supervised by the annotations of the VID-Entity and VidSTG-Entity datasets, an InfoNCE loss is applied to pull each fine-grained matched pair close in the embedding space.
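The following is a minimal sketch of this loss, assuming PyTorch and SciPy. The cost construction, head-word targets, and temperature are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch: Hungarian matching between annotated phrase boxes and decoder
# queries, followed by an InfoNCE loss over query-word similarities.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def fine_grained_alignment_loss(query_feats, word_feats, cost_matrix,
                                phrase_word_idx, tau=0.07):
    # query_feats: (Nq, D) decoder query embeddings for one frame
    # word_feats:  (Nw, D) word embeddings of the sentence
    # cost_matrix: (Ng, Nq) matching cost between ground-truth phrase boxes
    #              and query-predicted boxes (e.g., L1 + GIoU costs)
    # phrase_word_idx: (Ng,) long tensor, head-word index of each phrase
    gt_idx, q_idx = linear_sum_assignment(cost_matrix.cpu().numpy())

    q = F.normalize(query_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    sim = q @ w.t() / tau                       # (Nq, Nw) similarity scores

    # For each matched query, the target word is the phrase's head word;
    # all other words in the sentence act as InfoNCE negatives.
    matched_sim = sim[torch.as_tensor(q_idx)]   # (Ng, Nw)
    targets = phrase_word_idx[torch.as_tensor(gt_idx)]
    return F.cross_entropy(matched_sim, targets)
```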
We make three contributions in this paper:
• We contend that the current query design leads to slow convergence in Transformer-based video REC methods. To this end, we propose to generate content-conditioned queries based on the frame context.
• Beyond coarse-grained region-sentence correspondences, we build two datasets (i.e., VID-Entity and VidSTG-Entity) and a fine-grained alignment loss to enhance region-phrase alignment.
• Experimental results show that our ContFormer achieves
state-of-the-art performance on both trimmed and
untrimmed video REC benchmarks.
2. Related Work
Video Referring Expression Comprehension. The objective of video REC is to localize a spatial-temporal tube according to a natural language query. Most previous works [19,22,43,45,62] can be divided into two categories, i.e., two-stage methods and one-stage methods. However, both kinds of methods require time-consuming post-processing steps, which hinders their practical applications. Therefore, some recent REC works have started to explore other baselines. Based on the end-to-end detection framework DETR [12], Kamath et al. [28] propose MDETR, an image vision-language multi-modal pre-training framework that benefits various downstream vision-language tasks. Yang et al. [52] propose TubeDETR to conduct spatial-temporal video grounding via a space-time decoder module in a DETR-like manner. However, it still faces several problems:
1) TubeDETR processes each frame independently, which
may lead to the loss of temporal information. 2) As a
DETR-like method, TubeDETR suffers from slow training
convergence. 3) It fuses visual and language features via simple concatenation and ignores detailed vision-language alignments. In contrast, our ContFormer alleviates the above problems by introducing a content-aware query design and a fine-grained region-phrase alignment.
Transformer Query Design. DETR [12] localizes ob-
jects by utilizing learnable queries to probe and filter im-
age regions that contain the target instance. However, this
learnable query mechanism has been shown to suffer from slow training convergence [4,5,34,38,49].
To this end, [49] designs object queries based on anchor