Video Referring Expression Comprehension via Transformer
with Content-aware Query
Ji Jiang1, Meng Cao1, Tengtao Song1, Yuexian Zou1,2
1School of Electronic and Computer Engineering, Peking University 2Peng Cheng Laboratory
Abstract
Video Referring Expression Comprehension (REC) aims to localize a target object in video frames referred to by a natural language expression. Recently, Transformer-based methods have greatly boosted the performance limit. However, we argue that the current query design is sub-optimal and suffers from two drawbacks: 1) slow training convergence; 2) lack of fine-grained alignment. To alleviate this, we aim to couple the purely learnable queries with content information. Specifically, we set up a fixed number of learnable bounding boxes across the frame, and the aligned region features are employed to provide rich content clues. Besides, we explicitly link certain phrases in the sentence to the semantically relevant visual areas. To this end, we introduce two new datasets (i.e., VID-Entity and VidSTG-Entity) by augmenting the VID-Sentence and VidSTG datasets with the explicitly referred words in the whole sentence, respectively. Benefiting from this, we conduct fine-grained cross-modal alignment at the region-phrase level, which ensures more detailed feature representations. Incorporating these two designs, our proposed model (dubbed ContFormer) achieves state-of-the-art performance on widely used benchmark datasets. For example, on the VID-Entity dataset, ContFormer achieves an 8.75% absolute improvement on Accu.@0.6 compared to the previous SOTA. The dataset, code, and models are available at https://github.com/mengcaopku/ContFormer.
1. Introduction
Referring Expression Comprehension (REC) [24,25,57,58] aims to locate the image region described by a natural language query. This task has attracted extensive attention from both academia and industry due to its wide range of applications, such as visual question answering [2,27], image/video analysis [1,8,9], and relationship modeling [24,60]. Over the past years, most previous works have restricted REC to static images [33,37,48,54-56].
∗ denotes equal contributions. † denotes the corresponding author.
Figure 1. Comparison of the convergence curves (vIoU@0.3 vs. training epoch) between TubeDETR and our ContFormer.
Recently, with the increasing number of videos uploaded online, grounding the target object in videos has become an emerging requirement, and some recent attempts [14,18,47,62,63] have begun to conduct REC in the video domain. Different from image REC, video REC is more challenging since it needs to handle both complex temporal and spatial information.
Current video REC methods can be classified into two major categories: two-stage, proposal-driven methods and one-stage, proposal-free methods. Two-stage methods [19,20,26,62] first extract potential spatio-temporal tubes and then align these candidates with the sentence to find the best-matching one. The other stream of one-stage methods [7,13,43,45,59] fuses visual-text features and directly predicts bounding boxes densely at all spatial locations. Both kinds of methods, however, are time-consuming since they require post-processing steps (e.g., non-maximum suppression, NMS). Recently, DETR-like methods [12] have been demonstrated to be effective in the object detection area, getting rid of manually designed rules and dataset-dependent hyper-parameters. Following this pipeline, the pioneering work TubeDETR [52] develops a similar Transformer model for video REC.
Figure 2. Query: "The black cat on the left of fox is walking to eat food in the yard." Certain regions (i.e., object areas) of the frame are usually more salient and highly overlapped with certain phrases containing semantic meanings.

Although noticeable improvements have been achieved, this vanilla method still has two overlooked drawbacks:
1) The slow training convergence process. DETR-like
methods formulate detection/localization as a set prediction
problem and use learnable queries to probe and pool frame
features. This structure, however, suffers from the notorious
slow convergence issue [34,38,49]. For example, in Figure 1, TubeDETR requires about 10 epochs to reach saturated performance. Such a problem greatly hinders its practical application. 2) Lack of fine-grained alignments. Empirically, we find that the nouns (i.e., subject or object) in a sentence carry much of its overall meaning. Accordingly, certain patches (i.e., object areas) of the frame are usually more salient and highly correlated with these semantic meanings. For example, in Figure 2, the sentence contains two instances, i.e., "cat" and "fox". The detailed alignment and differentiation between the mentioned query objects and the corresponding visual areas provide localization clues. This fine-grained correlation, however, is overlooked in current Transformer-based methods.
Based on the above observations, we argue that the current query design in video REC methods is sub-optimal. To alleviate this, we propose a novel content-aware query for the Transformer (the resulting model is dubbed ContFormer). We contend that the content-independent query design is the main cause of the slow convergence. To this end, we propose to use query embeddings conditioned on the image content. Specifically, we set up a fixed number of bounding boxes across the frame. Then the cropped and pooled regional features are transformed into the query features of the Transformer decoder. Compared to the conventional high-dimensional learnable queries, our region-based features introduce a more salient prior, leading to faster convergence (cf. Figure 1).
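To make the content-aware query construction concrete, the following PyTorch-style sketch (our illustration, not the released code) generates decoder queries by RoI-pooling frame features over a fixed grid of boxes; the grid size, backbone stride, and names such as ContentAwareQuery and hidden_dim are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class ContentAwareQuery(nn.Module):
    """Turn pooled region features from a fixed grid of boxes into decoder queries."""

    def __init__(self, feat_dim=2048, hidden_dim=256, grid=4, stride=32):
        super().__init__()
        self.grid = grid        # grid x grid boxes tiled over every frame
        self.stride = stride    # downsampling factor of the visual backbone
        self.proj = nn.Linear(feat_dim, hidden_dim)  # region feature -> query embedding

    def make_boxes(self, img_h, img_w, device):
        # Tile the frame with a fixed number of equally sized boxes (image coordinates).
        ys = torch.linspace(0, img_h, self.grid + 1, device=device)
        xs = torch.linspace(0, img_w, self.grid + 1, device=device)
        boxes = [torch.stack([xs[j], ys[i], xs[j + 1], ys[i + 1]])
                 for i in range(self.grid) for j in range(self.grid)]
        return torch.stack(boxes)  # (N, 4) with N = grid * grid

    def forward(self, feat, img_h, img_w):
        # feat: (B, C, H, W) frame features from the visual backbone.
        B, C = feat.shape[:2]
        boxes = self.make_boxes(img_h, img_w, feat.device)
        region = roi_align(feat, [boxes] * B, output_size=1,
                           spatial_scale=1.0 / self.stride)   # (B*N, C, 1, 1)
        region = region.flatten(1).view(B, -1, C)             # (B, N, C)
        return self.proj(region)                              # (B, N, hidden_dim) content-aware queries
```

Because each query starts from an actual region of the current frame, the decoder receives content priors rather than purely learnable embeddings.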
Besides, current datasets only contain coarse-grained region-sentence correspondences. In this work, we take one step further and collect the VID-Entity and VidSTG-Entity datasets (cf. Figure 4), which provide region-phrase labels by grounding specific phrases in the sentences to bounding boxes in the video frames. To exploit these detailed annotations, we also propose a fine-grained alignment loss. Specifically, we first compute the similarity scores between each query-word pair. Then, we adopt the Hungarian algorithm [31] to select the query matching the target bounding box. Supervised by the annotations of the VID-Entity and VidSTG-Entity datasets, an InfoNCE loss is applied to pull each matched region-phrase pair close in the embedding space.
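As a rough illustration of this loss, the sketch below matches decoder predictions to the annotated entity boxes with the Hungarian algorithm and then applies an InfoNCE loss over the matched region-phrase pairs; the GIoU matching cost, the temperature tau, and the name region_phrase_alignment_loss are our assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou


def region_phrase_alignment_loss(pred_boxes, query_feat, gt_boxes, phrase_feat, tau=0.07):
    """pred_boxes: (Q, 4) decoder box predictions, query_feat: (Q, D) decoder outputs,
    gt_boxes: (G, 4) annotated entity boxes, phrase_feat: (G, D) phrase embeddings."""
    # 1) Hungarian matching: assign each annotated box to its best-fitting query.
    cost = -generalized_box_iou(gt_boxes, pred_boxes)                  # (G, Q), lower is better
    gt_idx, query_idx = linear_sum_assignment(cost.detach().cpu().numpy())

    # 2) InfoNCE: pull every matched query towards the phrase it refers to,
    #    using the remaining phrases in the sentence as negatives.
    q = F.normalize(query_feat[torch.as_tensor(query_idx, device=query_feat.device)], dim=-1)
    p = F.normalize(phrase_feat, dim=-1)                               # (G, D)
    logits = q @ p.t() / tau                                           # (G, G) similarity scores
    targets = torch.as_tensor(gt_idx, device=logits.device)            # positive phrase per matched query
    return F.cross_entropy(logits, targets)
```

In our reading, this term would be added on top of the standard DETR-style box regression and grounding losses during training.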
We make three contributions in this paper:
• We contend that the current query design leads to slow convergence in Transformer-based video REC methods. To this end, we propose to generate content-conditioned queries based on the frame context.
• Beyond the coarse-grained region-sentence correspondence, we build two datasets (i.e., VID-Entity and VidSTG-Entity) and design a fine-grained alignment loss to strengthen region-phrase alignment.
• Experimental results show that our ContFormer achieves state-of-the-art performance on both trimmed and untrimmed video REC benchmarks.
2. Related Work
Video Referring Expression Comprehension. The objec-
tive of video REC is to localize the spatial-temporal tube
according to the natural language query. Most of the pre-
vious works [19,22,43,45,62] can be divided into two
categories, i.e., two-stage methods and one-stage methods.
However, both kinds of methods require time-consuming
post-processing steps, which hinders their practical appli-
cations. Therefore, some recent REC works start to explore
other baselines. Based on the end-to-end detection framework DETR [12], Kamath et al. [28] propose MDETR, an image vision-language multi-modal pre-training framework that benefits various downstream vision-language tasks. Yang et al. [52] propose TubeDETR, which conducts spatial-temporal video grounding via a space-time decoder module in a DETR-like manner. However, it still faces several problems: 1) TubeDETR processes each frame independently, which may lead to the loss of temporal information. 2) As a DETR-like method, TubeDETR suffers from slow training convergence. 3) It fuses visual and language features by simple concatenation and ignores detailed vision-language alignments. In contrast, our ContFormer alleviates the above problems by introducing a content-aware query design and a fine-grained region-phrase alignment.
Transformer Query Design. DETR [12] localizes ob-
jects by utilizing learnable queries to probe and filter im-
age regions that contain the target instance. However, this
learnable query mechanism has been shown to suffer from slow training convergence [4,5,34,38,49]. To this end, [49] designs object queries based on anchor