
Figure 2. Certain regions (i.e., object areas) of the frame are usually more salient and highly overlap with certain phrases carrying semantic meanings. (Query shown in the figure: "The black cat on the left of fox is walking to eat food in the yard.")
This vanilla method still has two overlooked drawbacks:
1) Slow training convergence. DETR-like methods formulate detection/localization as a set prediction problem and use learnable queries to probe and pool frame features. This structure, however, suffers from the notorious slow-convergence issue [34,38,49]. For example, as shown in Figure 1, TubeDETR requires about 10 epochs to reach saturated performance. Such a problem greatly hinders its practical application. 2) Lack of fine-grained alignments.
Empirically, we find that the nouns (i.e., subjects or objects) in a sentence carry much of its overall meaning. Accordingly, certain patches (i.e., object areas) of the frame are usually more salient and highly overlap with these semantic units. For example, in Figure 2, the sentence contains two instances, i.e., "cat" and "fox". The detailed alignment and differentiation between the mentioned query objects and the corresponding visual areas provide strong localization clues. This fine-grained correlation, however, is overlooked in current Transformer-based methods.
Based on the above observations, we argue that the current query design in video REC methods is sub-optimal. To alleviate this, we propose a novel content-aware query design for Transformers (dubbed ContFormer). We contend that the content-independent query design is the main cause of slow convergence. To this end, we propose to use query embeddings conditioned on the image content. Specifically, we set up a fixed number of bounding boxes across the frame. The cropped and pooled regional features are then transformed into the query features of the Transformer decoder. Compared to conventional high-dimensional learnable queries, our region-based features introduce a stronger saliency prior, leading to faster convergence (cf. Figure 1).
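To make this concrete, below is a minimal sketch of content-conditioned query generation, assuming a PyTorch implementation. The grid size, feature dimensions, and helper names (e.g., make_grid_boxes, ContentQueryGenerator) are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: turn a fixed grid of frame regions into decoder queries.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


def make_grid_boxes(h, w, grid=4):
    """Tile the frame with a fixed grid x grid set of bounding boxes."""
    ys = torch.linspace(0, h, grid + 1)
    xs = torch.linspace(0, w, grid + 1)
    boxes = [torch.stack([xs[j], ys[i], xs[j + 1], ys[i + 1]])
             for i in range(grid) for j in range(grid)]
    return torch.stack(boxes)  # (grid*grid, 4) in (x1, y1, x2, y2)


class ContentQueryGenerator(nn.Module):
    def __init__(self, feat_dim=256, d_model=256, grid=4):
        super().__init__()
        self.grid = grid
        self.proj = nn.Linear(feat_dim, d_model)  # regional feature -> query

    def forward(self, feat_map, frame_hw):
        # feat_map: (B, C, Hf, Wf) backbone features; frame_hw: (H, W) of the frame
        B, C, Hf, Wf = feat_map.shape
        boxes = make_grid_boxes(*frame_hw, self.grid).to(feat_map.device)
        scale = Hf / frame_hw[0]  # frame coords -> feature-map coords (isotropic)
        # Crop and pool each region to a single vector per box.
        pooled = roi_align(feat_map, [boxes] * B, output_size=1,
                           spatial_scale=scale)       # (B*N, C, 1, 1)
        pooled = pooled.flatten(1).view(B, -1, C)     # (B, N, C)
        return self.proj(pooled)  # (B, N, d_model) content-conditioned queries
```

Pooling each region to one vector keeps the number of queries fixed while grounding every query in actual frame content rather than in a content-independent learned embedding.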
Besides, current datasets only contain coarse-grained region-sentence correspondences. In this work, we take one step further and collect the VID-Entity and VidSTG-Entity datasets (cf. Figure 4), which annotate region-phrase labels by grounding specific phrases in sentences to bounding boxes in the video frames. To fully exploit these detailed annotations, we also propose a fine-grained alignment loss. Specifically, we first compute the similarity scores between each query-word pair. Then, we adopt the Hungarian algorithm [31] to select the query matching the target bounding box. Supervised by the annotations of the VID-Entity and VidSTG-Entity datasets, an InfoNCE loss is applied to pull each fine-grained matched pair close in the embedding space.
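The following is a minimal sketch of this loss, assuming PyTorch and SciPy. The cost construction, head-word targets, and temperature are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch: Hungarian matching between annotated phrase boxes and decoder
# queries, followed by an InfoNCE loss over query-word similarities.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def fine_grained_alignment_loss(query_feats, word_feats, cost_matrix,
                                phrase_word_idx, tau=0.07):
    # query_feats: (Nq, D) decoder query embeddings for one frame
    # word_feats:  (Nw, D) word embeddings of the sentence
    # cost_matrix: (Ng, Nq) matching cost between ground-truth phrase boxes
    #              and query-predicted boxes (e.g., L1 + GIoU costs)
    # phrase_word_idx: (Ng,) long tensor, head-word index of each phrase
    gt_idx, q_idx = linear_sum_assignment(cost_matrix.cpu().numpy())

    q = F.normalize(query_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    sim = q @ w.t() / tau                       # (Nq, Nw) similarity scores

    # For each matched query, the target word is the phrase's head word;
    # all other words in the sentence act as InfoNCE negatives.
    matched_sim = sim[torch.as_tensor(q_idx)]   # (Ng, Nw)
    targets = phrase_word_idx[torch.as_tensor(gt_idx)]
    return F.cross_entropy(matched_sim, targets)
```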
We make three contributions in this paper:
• We contend that the current query design leads to slow convergence in Transformer-based video REC methods. To this end, we propose to generate content-conditioned queries based on the frame context.
• Beyond coarse-grained region-sentence correspondences, we build two datasets (i.e., VID-Entity and VidSTG-Entity) and a fine-grained alignment loss to enhance region-phrase alignment.
• Experimental results show that our ContFormer achieves
state-of-the-art performance on both trimmed and
untrimmed video REC benchmarks.
2. Related Work
Video Referring Expression Comprehension. The objective of video REC is to localize a spatial-temporal tube according to a natural language query. Most previous works [19,22,43,45,62] can be divided into two categories, i.e., two-stage methods and one-stage methods. However, both kinds of methods require time-consuming post-processing steps, which hinders their practical applications. Therefore, some recent REC works have started to explore other baselines. Based on the end-to-end detection framework DETR [12], Kamath et al. [28] propose MDETR, an image vision-language multi-modal pre-training framework that benefits various downstream vision-language tasks. Yang et al. [52] propose TubeDETR to conduct spatial-temporal video grounding via a space-time decoder module in a DETR-like manner. However, it still faces several problems:
1) TubeDETR processes each frame independently, which
may lead to the loss of temporal information. 2) As a
DETR-like method, TubeDETR suffers from slow training
convergence. 3) It fuses visual and language features via simple concatenation and ignores detailed vision-language alignments. In contrast, our ContFormer alleviates the above problems by introducing a content-aware query design and a fine-grained region-phrase alignment.
Transformer Query Design. DETR [12] localizes ob-
jects by utilizing learnable queries to probe and filter im-
age regions that contain the target instance. However, this
learnable query mechanism has been shown to suffer from slow training convergence [4,5,34,38,49].
To this end, [49] designs object queries based on anchor