
points (§2.2). We estimate measurement error by
annotating 683K additional caption-video pairs,
which we call the FIRE dataset (§3).¹
A core measurement principle is that operationalized metrics should strongly correlate with the quantity they intend to measure (Mathison, 2004; Liao et al., 2021). For example, Recall@K operationalizes the intent to measure retrieval quality. Label errors are a common way that measurements are invalidated (Bowman and Dahl, 2021; Northcutt et al., 2021). Our work shows that, because TVR metrics are computed on data containing false-negative label errors, Recall@K does not accurately reflect retrieval quality, which undermines the measurement's validity. In the remainder of this paper, we posit rationales for why models gain different score boosts (§4.1) and estimate how useful the FIRE dataset is for evaluating future models (§4.2 and §4.3).
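To make the false-negative effect concrete, below is a minimal sketch (ours, not the evaluation code) of Recall@K under the usual one-labeled-positive-per-query convention; the video identifiers and the unlabeled-but-relevant video are hypothetical.

```python
# Hypothetical illustration: Recall@K with a single labeled positive per query.
# A retrieved video that matches the caption but is not the labeled positive
# (a false negative in the annotations) still counts as a miss.

def recall_at_k(ranked_video_ids, positive_id, k):
    """1.0 if the labeled positive appears in the top-k results, else 0.0."""
    return 1.0 if positive_id in ranked_video_ids[:k] else 0.0

# Query: "a man is playing guitar". The labeled positive is video_17, but
# video_42 also shows a man playing guitar (an unlabeled relevant video).
ranking = ["video_42", "video_17", "video_03"]

print(recall_at_k(ranking, positive_id="video_17", k=1))  # 0.0: scored as an error
print(recall_at_k(ranking, positive_id="video_17", k=2))  # 1.0
# With labels that also mark video_42 as relevant, Recall@1 would instead be 1.0.
```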
To conclude, we review the implications of our
findings. Looking to the past, retrieval effective-
ness has been understated for some models, which
gives an overly pessimistic view of recent ad-
vances (Bowman, 2022). Critically, our results
also suggest that the MSR-VTT benchmark is near-
ing saturation and should be retired soon in favor
of a purpose-made benchmark. Looking outward,
we identify structurally similar benchmarks—such as photo retrieval—that likely also have the same False Implicit Relevance problem. A successful
benchmark should avoid the pitfalls we identify in
this paper, be faithful to the real-world user task it
targets (Rowe and Jain, 2005; de Vries et al., 2020),
improve reproducibility, and evolve (§7).
2 Text-to-Video Retrieval Evaluation
This section reviews current TVR evaluation practices using two concepts: internal validity (Campbell, 1957, §2.2.1) and construct validity (Tague-Sutcliffe, 1992, §2.2.2). Internal validity refers to
whether an evaluation reliably establishes a cause-
effect relationship between the measured depen-
dent variable and the independent variable to be
estimated (Brewer and Crano, 2014; Liao et al., 2021). In TVR evaluations, false negatives confound model quality and label errors (i.e., is the model wrong, or is the label wrong?), which makes it difficult to reliably establish cause (model quality) and effect (retrieval score). Construct validity “pertains to the degree to which the measure of a construct sufficiently measures the intended concept” (O’Leary-Kelly and J. Vokurka, 1998)—in
TVR evaluations, an important intended concept is
real-world search quality. Construct validity asks:
can we expect that measuring retrieval quality with
the benchmarks at hand generalizes to real-world
search quality? This section argues that TVR evalu-
ations are neither internally valid nor construct valid.
2.1 Model Evaluation
Multimodal retrieval evaluations typically focus
on two tasks: text-to-video and video-to-text re-
trieval. The first task’s goal is—given a text query—
to retrieve videos that match; the second task’s
goal is—given a video—to retrieve the matching
queries. The applications of text-to-video search
are straightforward: it is useful for searching the
web and personal media.² Since the applications
of TVR are clear, and the false-negative problem is
present in both tasks, here we focus on TVR.
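For reference, the sketch below shows the standard ranking protocol such TVR evaluations follow: score every test video against each caption (e.g., by cosine similarity of embeddings) and rank. The embeddings here are random stand-ins rather than a real model, and the one-positive-per-caption assumption is exactly where false negatives enter.

```python
# Sketch of a standard text-to-video retrieval evaluation (illustrative only):
# rank all test videos for each caption by cosine similarity and score with
# Recall@K under the usual one-positive-per-caption assumption.
import numpy as np

rng = np.random.default_rng(0)
num_captions, num_videos, dim = 1000, 1000, 256  # MSR-VTT-style 1K test split

text_emb = rng.normal(size=(num_captions, dim))   # stand-in caption embeddings
video_emb = rng.normal(size=(num_videos, dim))    # stand-in video embeddings
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
video_emb /= np.linalg.norm(video_emb, axis=1, keepdims=True)

sims = text_emb @ video_emb.T                     # (captions x videos) similarities
# Caption i is paired only with video i, so its rank is the number of videos
# scored at least as high as video i for caption i.
diag = sims[np.arange(num_captions), np.arange(num_captions)]
ranks = (sims >= diag[:, None]).sum(axis=1)

for k in (1, 5, 10):
    print(f"Recall@{k}: {np.mean(ranks <= k):.3f}")
print(f"Median rank: {np.median(ranks):.0f}")
```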
The MSR-VTT and MSVD Datasets: It is standard for TVR evaluations (Zhang et al., 2018; Yu et al., 2018; Gabeur et al., 2020) to report on MSR-VTT and MSVD, so in the interest of comparability, we use these benchmarks too. Although both datasets were originally meant for evaluating video captioning models, they have been repurposed for TVR (Zhang et al., 2018; Gabeur et al., 2020) and are the most prevalent benchmarks in prior work, so we focus our investigation on them. MSR-VTT consists of 10K
videos, 1K of which are in the test split. Each video
has twenty captions, but for evaluation, only one
(arbitrarily chosen) caption is used. MSVD contains
1,970 videos, 960 of which are in the test split.
Videos have about forty captions; unlike with MSR-
VTT, retrieval quality for each caption is evaluated.
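To illustrate this protocol difference, here is a small sketch (hypothetical data structure and field names, not an official loader) of which caption-video pairs each benchmark scores as queries.

```python
# Illustrative only: which caption-video pairs become retrieval queries.
test_videos = [
    {"video_id": "v0", "captions": ["cap_a", "cap_b", "cap_c"]},
    {"video_id": "v1", "captions": ["cap_d", "cap_e"]},
]

def msrvtt_eval_pairs(videos):
    # MSR-VTT protocol: a single (arbitrarily chosen) caption per test video.
    return [(v["captions"][0], v["video_id"]) for v in videos]

def msvd_eval_pairs(videos):
    # MSVD protocol: every caption of every test video is a separate query.
    return [(c, v["video_id"]) for v in videos for c in v["captions"]]

print(len(msrvtt_eval_pairs(test_videos)))  # 2 queries (one per video)
print(len(msvd_eval_pairs(test_videos)))    # 5 queries (all captions)
```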
Fundamentally, both MSR-VTT and MSVD are
video captioning datasets—not retrieval datasets.
MSVD addressed the lack of standard benchmarks
for paraphrasing (Chen and Dolan, 2011). In the
original task, annotators selected short clips from
YouTube, watched the clip, and wrote a sentence
describing its contents. The process was repeated
for each video, with each sentence being written by
a new annotator. This conditional independence—
given the video—resulted in a diverse set of cap-
tions. MSR-VTT captions were collected similarly:
independent annotators captioned the same video.
Videos were sourced from the output of a commer-

¹ Data and Code: pedro.ai/multimodal-retrieval-evaluation.
² The applications of video-to-text retrieval that are not simply captioning are not clear to us.