Fighting FIRe with FIRE:
Assessing the Validity of Text-to-Video Retrieval Benchmarks
Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy Labson, Hardik Shah, Seungwhan Moon
Meta AI
Abstract
Searching troves of videos with textual de-
scriptions is a core multimodal retrieval task.
Owing to the lack of a purpose-built dataset
for text-to-video retrieval, video captioning
datasets have been re-purposed to evaluate
models by (1) treating captions as positive
matches to their respective videos and (2) as-
suming all other videos to be negatives. How-
ever, this methodology leads to a fundamen-
tal flaw during evaluation: captions are marked as
relevant only to their original video, yet many
alternate videos also match the caption, which
introduces false-negative caption-video
pairs. We show that when these false neg-
atives are corrected, a recent state-of-the-art
model gains 25% recall points—a difference
that threatens the validity of the benchmark it-
self. To diagnose and mitigate this issue, we
annotate and release 683K additional caption-
video pairs. Using these, we recompute effec-
tiveness scores for three models on two stan-
dard benchmarks (MSR-VTT and MSVD). We
find that (1) the recomputed metrics are up to
25% recall points higher for the best models,
(2) these benchmarks are nearing saturation for
Recall@10, (3) caption length (generality) is
related to the number of positives, and (4) an-
notation costs can be mitigated through sam-
pling. We recommend retiring these bench-
marks in their current form, and we make
recommendations for future text-to-video re-
trieval benchmarks.
1 Introduction
Text-to-video retrieval (TVR) is a challenging multi-
modal retrieval task (Hu et al.,2011) with practical
applications ranging from web search to organiz-
ing media collections (Lew et al.,2006). To mea-
sure TVR model improvement—despite a dearth
of purpose-built TVR benchmarks—researchers
created benchmarks by re-purposing video cap-
tioning datasets such as MSR-VTT (Xu et al.,
Correspondence to me@pedro.ai
[Figure 1 illustration: the test caption “Cartoon girl is talking” is scored against the MSR-VTT 1K test videos; under the original labels Correct@1 is 0%, while under FIRE labels it is 100% (legend: original label vs. FIRE label; relevant vs. irrelevant).]
Figure 1: MSR-VTT and MSVD have one positive
video per caption (each video’s caption). Captions of-
ten match multiple videos, leading to false negatives.
When models rank false negatives highly, model qual-
ity is understated (full example in Appendix Figure 5).
This leads to evaluations where reported metrics do not
reflect their true value and are therefore not internally
valid (§2.2.1).
2016), MSVD (Chen and Dolan,2011), and Activi-
tyNet (Heilbron et al.,2015;Krishna et al.,2017).
Early work established an evaluation paradigm that
treated captions as search queries over the collec-
tion of captioned videos (Zhang et al.,2018;Yu
et al.,2018;Gabeur et al.,2020); each caption and
their corresponding video are positives (relevant)
during retrieval, and all other caption-video pairs
are negatives (irrelevant).
However, even a cursory inspection of videos
and captions reveals many additional positive
caption-video pairs (§2). In current benchmarks,
true positives that are not the video’s original cap-
tion are falsely assumed to be negatives. Wray
et al. (2021) first identified this fundamental, false-
negative problem in TVR evaluation; our work
builds on this by quantifying the absolute metric
differences that false negatives induce (see discus-
sion in §6). Accurate absolute metrics are cru-
cial in industrial settings where deployment cri-
teria are often defined by minimum quality tar-
gets. These False Implicit Relevance labels introduce measurement error—e.g., CLIP4CLIP's (Luo et al., 2021) Recall@1 is underestimated by 25% points (§2.2). We estimate measurement error by annotating 683K additional caption-video pairs, which we call the FIRE dataset (§3).1

1 Data and Code: pedro.ai/multimodal-retrieval-evaluation.
A core measurement principle is that operational-
ized metrics should strongly correlate to the quan-
tity they intend to measure (Mathison,2004;Liao
et al.,2021). For example, Recall@K operational-
izes the intent to measure retrieval quality. Label
errors are a common way that measurements are
invalidated (Bowman and Dahl,2021;Northcutt
et al.,2021). Our work shows that since TVR met-
rics are computed with false negative label errors,
Recall@K does not accurately reflect retrieval qual-
ity, which negates the measurement’s validity. In
the remainder of this paper, we posit rationales for
why models gain different score boosts (§4.1) and
estimate how useful the FIRE dataset is for evaluat-
ing future models (§4.2 and §4.3).
To conclude, we review the implications of our
findings. Looking to the past, retrieval effective-
ness has been understated for some models, which
gives an overly pessimistic view of recent ad-
vances (Bowman,2022). Critically, our results
also suggest that the MSR-VTT benchmark is near-
ing saturation and should be retired soon in favor
of a purpose-made benchmark. Looking outward,
we identify structurally similar benchmarks—such
as photo retrieval—that likely also have the same
False Implicit Relevance problem. A successful
benchmark should avoid the pitfalls we identify in
this paper, be faithful to the real-world user task it
targets (Rowe and Jain,2005;de Vries et al.,2020),
improve reproducibility, and evolve (§7).
2 Text-to-Video Retrieval Evaluation
This section reviews current TVR evaluation prac-
tices using two concepts: internal validity (Camp-
bell,1957, §2.2.1) and construct validity (Tague-
Sutcliffe,1992, §2.2.2). Internal validity refers to
whether an evaluation reliably establishes a cause-
effect relationship between the measured depen-
dent variable and the independent variable to be
estimated (Brewer and Crano,2014;Liao et al.,
2021). In TVR evaluations, false negatives con-
found model quality and label errors (i.e., is the
model wrong or is the label wrong?) which makes
reliably establishing cause (model quality) and ef-
fect (retrieval score) difficult. Construct validity
“pertains to the degree to which the measure of a
construct sufficiently measures the intended concept” (O’Leary-Kelly and J. Vokurka, 1998)—in
TVR evaluations, an important intended concept is
real-world search quality. Construct validity asks:
can we expect that measuring retrieval quality with
the benchmarks at hand generalizes to real-world
search quality? This section argues that TVR evalu-
ations are not internally valid or construct valid.
2.1 Model Evaluation
Multimodal retrieval evaluations typically focus
on two tasks: text-to-video and video-to-text re-
trieval. The first task’s goal is—given a text query—
to retrieve videos that match; the second task’s
goal is—given a video—to retrieve the matching
queries. The applications of text-to-video search
are straightforward: it is useful for searching the
web and personal media.2 Since the applications
of TVR are clear, and the false-negative problem is
present in both tasks, here we focus on TVR.
The MSR-VTT and MSVD Datasets:
It is
standard for TVR evaluations (Zhang et al.,2018;
Yu et al.,2018;Gabeur et al.,2020) to report on
MSR-VTT and MSVD, so in the interest of compa-
rability, we use these benchmarks too. Although
these datasets were originally meant for evaluating
video captioning models, they have been repur-
posed for TVR (Zhang et al.,2018;Gabeur et al.,
2020). In this paper, we focus our investigation
on MSR-VTT and MSVD since they are the most
prevalent in prior work. MSR-VTT consists of 10K
videos, 1K of which are in the test split. Each video
has twenty captions, but for evaluation, only one
(arbitrarily chosen) caption is used. MSVD contains
1,970 videos, 960 of which are in the test split.
Videos have about forty captions; unlike with MSR-
VTT, retrieval quality for each caption is evaluated.
Fundamentally, both MSR-VTT and MSVD are
video captioning datasets—not retrieval datasets.
MSVD addressed the lack of standard benchmarks
for paraphrasing (Chen and Dolan,2011). In the
original task, annotators selected short clips from
YouTube, watched the clip, and wrote a sentence
describing its contents. The process was repeated
for each video, with each sentence being written by
a new annotator. This conditional independence—
given the video—resulted in a diverse set of cap-
tions. MSR-VTT captions were collected similarly:
independent annotators captioned the same video.
Videos were sourced from the output of a commercial video search engine (Xu et al., 2016). In both datasets, video captions are used as search queries and labeled relevant to the original video.

2 The applications of video-to-text retrieval—that are not simply captioning—are not clear to us.
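To make this repurposed setup concrete, the sketch below builds the single-positive relevance judgments that the paradigm implies: each caption queries the whole collection, its source video is the only labeled positive, and every other video is an implicit negative. This is an illustrative sketch with made-up identifiers, not the benchmarks' released tooling.

```python
# Sketch of the repurposed-captioning evaluation setup described above: every
# caption queries the full video collection, and only its source video is marked
# relevant. All identifiers here are illustrative, not from released code.

captions = [
    {"caption_id": "c1", "video_id": "v1", "text": "cartoon girl is talking"},
    {"caption_id": "c2", "video_id": "v2", "text": "a boy playing the violin"},
]
video_ids = ["v1", "v2", "v3"]

# qrels: query id -> set of relevant video ids (exactly one per caption here).
qrels = {c["caption_id"]: {c["video_id"]} for c in captions}

# Every other caption-video pair is an implicit negative, which is where false
# negatives enter: other videos in the collection may also match the caption.
implicit_negatives = {
    c["caption_id"]: [v for v in video_ids if v not in qrels[c["caption_id"]]]
    for c in captions
}
```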
Metrics:
Previous TVR work (Zhang et al.,
2018;Yu et al.,2018;Gabeur et al.,2020;Luo
et al.,2020;Zhu and Yang,2020;Li et al.,2020;
Xu et al.,2021;Park et al.,2022) reports Recall@K
(R@K)3
and sometimes supplemental metrics such
as median or mean rank of the first correct re-
sult. However, R@K in TVR work differs from
the textbook information retrieval definition (Man-
ning et al.,2008, p. 155) where
R@K = \frac{\text{\# retrieved positives in top } K}{\text{\# total positives in collection}} \quad (1)
In TVR work, query retrieval results are scored one
if a relevant video is in the top K and zero other-
wise. The traditional definition of Recall@K only
reduces to this when there is exactly one positive
in the collection but is not comparable when there
are multiple positives per caption—as in this case.
With the difference now salient, we avoid confu-
sion by defining a new quantity Correct@K (C@K)
which is 1 if at least one positive is in the top K
and 0 otherwise. Correct@K naturally reduces to
Recall@K—as defined in prior work—when there
is exactly one positive, but handles the additional
positives in our work. We recommend reporting
Correct@K as well as mean average precision (Su
et al.,2015;Mitra and Craswell,2018,MAP), a
metric widely used in Information Retrieval.
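To make the distinction concrete, the sketch below contrasts the textbook Recall@K of Eq. (1) with the 0/1 Correct@K scoring; the function names and toy data are ours, not from any released evaluation script.

```python
def recall_at_k(ranked, positives, k):
    """Textbook Recall@K (Eq. 1): fraction of all positives retrieved in the top K."""
    retrieved = set(ranked[:k]) & set(positives)
    return len(retrieved) / len(positives)

def correct_at_k(ranked, positives, k):
    """Correct@K: 1 if at least one positive appears in the top K, else 0."""
    return 1.0 if set(ranked[:k]) & set(positives) else 0.0

# With a single positive, the two definitions coincide...
ranked = ["v7", "v1", "v9", "v4"]
assert recall_at_k(ranked, {"v1"}, k=2) == correct_at_k(ranked, {"v1"}, k=2) == 1.0

# ...but with multiple positives (as after FIRE correction) they diverge.
positives = {"v1", "v4", "v5"}
print(recall_at_k(ranked, positives, k=2))   # 1/3, about 0.33
print(correct_at_k(ranked, positives, k=2))  # 1.0
```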
The drawback of Correct@K—shared by me-
dian (or mean) rank to first positive—is that it does
not directly factor in rank order when there are mul-
tiple positives in retrieved results, only coarsely
factoring in rank via K value. MAP (Mitra and
Craswell,2018, p. 19) is calculated by taking the
mean of
\mathrm{AvgPrec}_q = \frac{\sum_{\langle i, v \rangle \in R_q} \mathrm{Prec}_{q,i} \times \mathrm{rel}_q(v)}{\sum_{v \in V} \mathrm{rel}_q(v)} \quad (2)
for each test query q, where i is a video's position in the ranked list R_q of videos, v is a video in collection V, and rel_q(v) denotes whether query q is relevant to video v. Intuitively, this translates to calculating the mean of Precision@K for every K where a positive occurs in the ranked predictions R_q.
In all experiments, we report Correct@K and MAP.
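As a minimal illustration of Eq. (2), the sketch below computes average precision as the mean of Precision@i over the ranks i that hold a positive, normalized by the query's total positives; MAP is then the mean over test queries. This is our own hedged reading of the formula, not the paper's released code.

```python
def average_precision(ranked, positives):
    """AvgPrec_q from Eq. (2): mean of Precision@i at the ranks i holding a positive,
    normalized by the total number of positives for the query."""
    hits = 0
    precision_sum = 0.0
    for i, video in enumerate(ranked, start=1):
        if video in positives:
            hits += 1
            precision_sum += hits / i  # Precision@i at each rank that holds a positive
    return precision_sum / len(positives) if positives else 0.0

def mean_average_precision(runs, qrels):
    """MAP over test queries; `runs` maps query -> ranked videos, `qrels` -> positives."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)

# Example: positives ranked 1st and 3rd -> AP = (1/1 + 2/3) / 2, about 0.83.
print(average_precision(["v1", "v9", "v4"], {"v1", "v4"}))
```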
2.2 Questioning the Validity of Evaluations
In this section, we experimentally argue that cur-
rent TVR evaluations are not internally valid. Then
we argue that they are not construct valid by considering actual use-cases for video search.

3 Typical K values include 1, 5, 10, and 50.
2.2.1 Internal Validity
If an evaluation metric is internally valid (Liao
et al.,2021), then model effectiveness (cause)
should be accurately and reliably reflected in met-
rics (effect) (Brewer and Crano,2014). A central
hypothesis of this paper is that the prevalence of
false negatives invalidates the cause-effect relation-
ship between measured model effectiveness and
actual effectiveness–i.e., that correcting false nega-
tives will significantly change metrics.4
To test this hypothesis, we build the FIRE dataset,
which Fixes Implicit Relevance Errors. We de-
tail the dataset later (§3), but in short, we take
strong retrieval models from the past few years
and annotate their top ten predictions on both
MSR-VTT and MSVD. This process—called system
pooling—has been used for decades in information
retrieval (Spärck Jones, 1975) and, by construction,
eliminates implicit false negatives.5 For MSR-VTT,
we collect annotations from TeachText (Croitoru
et al.,2021), Support-Set Bottlenecks (Patrick et al.,
2021,SSB), and CLIP4CLIP (Luo et al.,2021) mod-
els; for MSVD, we collect annotations from Teach-
Text and CLIP4CLIP models.6,7 Next, we compute
model scores using the original positives and com-
pare them to scores calculated with both the origi-
nal positives and the new positives in FIRE.
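A minimal sketch of the pooling procedure described above follows: union each model's top ten predictions per caption and annotate every pooled pair that is not already labeled. The data shapes (e.g., a `model_runs` mapping from model name to per-caption ranked lists) are assumptions for illustration, not the authors' pipeline.

```python
# Illustrative sketch of system pooling: union each model's top-10 predictions per
# caption and send every pooled pair without an existing label out for annotation.
# `model_runs` maps model name -> {caption_id: ranked list of video ids} (assumed shape).

def build_annotation_pool(model_runs, original_qrels, depth=10):
    pool = set()
    for run in model_runs.values():
        for caption_id, ranked in run.items():
            for video_id in ranked[:depth]:
                pool.add((caption_id, video_id))
    # Pairs already marked positive in the original data need no new label.
    already_labeled = {
        (caption_id, video_id)
        for caption_id, videos in original_qrels.items()
        for video_id in videos
    }
    return pool - already_labeled
```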
Table 1 clearly demonstrates that FIRE annota-
tions reveal large metric differences in both MSR-
VTT and MSVD. For example, the C@1 score of
CLIP4CLIP is understated by 25% points, and its
C@10 score arguably saturates the benchmark at
95.7%. Even “small” differences such as those for
TeachText and SSB are on par with the differences
used to claim state-of-the-art results. False nega-
tives directly cause high measurement error, which
undermines the internal validity of the benchmark.
4 We do not see rank changes in our three models, but score differences suggest that ranks may change with more models.
5 By implicit, we mean false negatives that arise from the lack of labeling, i.e., from presuming non-positives are (implicitly) negative. There may still be false negatives arising from human error during annotation.
6 We prioritize models that are (1) publicly available and (2) have sufficient documentation to reproduce.
7 Annotating MSR-VTT predictions translates to 1,000 × 10 = 10K annotations since only one caption per video is used. This is easy compared to MSVD annotation, which uses tens of captions per video.
Dataset Metric TeachText SSB CLIP4CLIP
MSR-VTT C@1 24.1 (23.3 + 0.800)% 27.3 (26.8 + 0.500)% 67.4 (42.4 + 25.0)%
MSR-VTT C@5 53.2 (50.9 + 2.30)% 55.9 (54.5 + 1.40)% 90.7 (70.4 + 20.3)%
MSR-VTT C@10 67.0 (64.8 + 2.20)% 68.9 (66.3 + 2.60)% 95.7 (80.2 + 15.5)%
MSR-VTT AP 36.1 (35.8 + 0.296)% 39.3 (39.2 + 0.0374)% 69.5 (54.9 + 14.7)%
MSVD C@1 34.7 (19.6 + 15.2)% Not Annotated 65.3 (46.6 + 18.8)%
MSVD C@5 64.7 (48.9 + 15.8)% Not Annotated 89.6 (76.8 + 12.8)%
MSVD C@10 76.1 (63.9 + 12.2)% Not Annotated 94.0 (85.4 + 8.61)%
MSVD AP 44.3 (33.1 + 11.2)% Not Annotated 71.3 (59.7 + 11.6)%
Table 1: The table shows the impact of FIRE annotations on MSR-VTT and MSVD text-to-video retrieval metrics.
“A (B + C)” has metrics computed with FIRE positives (A), only original positives (B), and the delta (C). The deltas emphasize the deleterious effects of false negatives: CLIP4CLIP's C@1 on MSR-VTT is understated by 25% points.
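In principle, the "A (B + C)" entries can be reproduced by scoring each model twice, once against the original positives only (B) and once against the union of original and FIRE positives (A), and reporting the difference (C). The sketch below illustrates this with toy data; it is not the authors' evaluation pipeline.

```python
def correct_at_k(ranked, positives, k):
    """Correct@K: 1 if at least one positive appears in the top K, else 0."""
    return 1.0 if set(ranked[:k]) & set(positives) else 0.0

def corrected_metric(runs, original_qrels, fire_positives, k=1):
    """Return (A, B, C): C@k with original + FIRE positives, original only, and the delta."""
    queries = list(runs)
    b = sum(correct_at_k(runs[q], original_qrels.get(q, set()), k) for q in queries) / len(queries)
    a = sum(
        correct_at_k(runs[q], original_qrels.get(q, set()) | fire_positives.get(q, set()), k)
        for q in queries
    ) / len(queries)
    return a, b, a - b

# Toy example: a FIRE label turns a "miss" at rank 1 into a hit.
runs = {"c1": ["v3", "v1"]}
a, b, delta = corrected_metric(runs, {"c1": {"v1"}}, {"c1": {"v3"}}, k=1)
print(a, b, delta)  # 1.0 0.0 1.0
```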
2.2.2 Construct Validity
In addition to problems with internal validity, we
posit that TVR evaluations are also not construct
valid (Cronbach and Meehl,1955;O’Leary-Kelly
and J. Vokurka,1998). Construct validity is re-
lated to “how closely our evaluations hit the mark
in appropriately characterizing the actual antici-
pated behaviour of the system in the real world or
progress on stated motivations and goals for the
field” (Raji et al.,2021). What is the real-world
use of text-to-video retrieval (or alternatively, the
field’s motivations)? Consider the most straight-
forward answer: that such systems will be used by
users to search through video collections, whether
on the web or in personal collections. First, search
queries issued by real users are very likely not sim-
ilar to captions written by crowd annotators; this is
easily observed by inspecting captions in Table 5
and Appendix Table 6. Second, the video distribu-
tion is unlikely to reflect real use-cases, as the videos were
selected by annotators or are search results from
seed queries. Due to these problems, it seems un-
likely that the evaluations are construct valid, and
future benchmarks should improve this by building
evaluations that match the intended use of models—
i.e., be ecologically valid (de Vries et al.,2020).
3 FIRE Dataset Collection and
Validation
Next, we describe and analyze the FIRE dataset.
3.1 Annotation Task and Dataset Collection
In the FIRE annotation task, annotators mark
whether the displayed caption is relevant to the
displayed video. Implicitly, the caption’s video is
relevant to it, but how do we judge whether another
arbitrary video is relevant? In other words, how
should annotators mark whether a caption is rele-
vant to a video? In both datasets (§2.1), the caption
must be completely consistent with the video; oth-
erwise, it would not be an accurate caption. There-
fore, we enforce the same condition in our task to
preserve the original relevance semantics.8
Annotators are instructed to mark a caption as
relevant to a video only if every element men-
tioned in the query could be reasonably consid-
ered present. Elements include persons, objects,
locations, and activities, as well as quantifiers, qual-
ifiers, and adjectives. Raters are given some leeway
to use interpretation and inference but instructed to
err in favor of not relevant if the caption is ambigu-
ous or vague. For example, for the caption “a boy
playing the violin,” the video must show a boy who
is playing the violin, not a video of only violins or
a video with only a boy. Screenshots of the anno-
tation interfaces and details of sensitive category
handling are in Appendix B. Complete annotation
guidelines are included in supplemental materials.
To select caption-video pairs to annotate, we
obtain the top ten MSR-VTT and MSVD test set
predictions from three models: CLIP4CLIP (Luo
et al.,2021), SSB (Patrick et al.,2021), and Teach-
Text (Croitoru et al.,2021). For TeachText, we use
model checkpoints available on their webpage. For
CLIP4CLIP and SSB, checkpoints are not available,
so we train new models and verify that retrieval
quality is on par with the literature (see Table 1).
Table 2 summarizes the resulting FIRE dataset.
In total, 683K labels were collected
across a set of 579K unique caption-video pairs.
Some duplication was intentional: we obtained a
second label for 10% of annotations, and if the la-
bels disagreed, we collected a third label to resolve
the disagreement. Elsewhere, duplication was unin-
tentional: for MSVD we did not deduplicate caption-video pairs between two models, so where the predictions overlapped, we obtained additional labels. Fortunately, this provided an unexpected opportunity to further validate dataset quality.

8 Requiring complete matches makes the annotation task easier by eliminating ambiguous partial match cases.
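The multi-label resolution described above (a second label for roughly 10% of pairs and a tie-breaking third label on disagreement) amounts to a simple majority vote over up to three judgments. The sketch below is our illustrative reading of that protocol, not released annotation tooling.

```python
from collections import Counter

def resolve_label(labels):
    """Resolve 1-3 relevance judgments for a caption-video pair by majority vote.
    Returns the winning label, or None if the judgments do not resolve."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label if count > len(labels) / 2 else None

print(resolve_label(["relevant"]))                            # relevant (single label)
print(resolve_label(["relevant", "irrelevant", "relevant"]))  # relevant (2-of-3 majority)
print(resolve_label(["relevant", "irrelevant"]))              # None -> collect a third label
```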
Dataset # Pairs Percent # Labels
MSR-VTT 24,183 100% 24,507
Agreement 24,167 99.9% -
Relevant 2,855 11.8% -
Irrelevant 21,312 88.2% -
Disagreement 16 0.0662% -
MSVD 555,391 100% 659,126
Agreement 553,832 99.7% -
Relevant 39,909 7.21% -
Irrelevant 513,923 92.8% -
Disagreement 1,559 0.281% -
Table 2: The FIRE dataset is composed of labels for
MSR-VTT and MSVD text-video pairs. The positive-to-
negative ratio is skewed, reflecting that queries do not
match most videos. We annotate a subset multiple times to compute annotator agreement rates and Krippendorff's α. Agreement on MSR-VTT was .931 with α = .691 and on MSVD was .958 with α = .798. Appendix C disaggregates agreement rates, which are consistent.
3.2 Dataset Quality Validation
Before, throughout, and after the collection, we
took steps to collect high-quality data and validate
its quality. The annotation task was completed by
a team of one hundred raters specifically trained
to review caption-video pairs and assess relevance.
These annotators completed a 1,000-job training
queue, which was reviewed by data quality leads
and this paper’s authors. This allowed annotators
to learn to annotate according to our guidelines,
request clarification of the guidelines, and request
tooling improvements. Annotators could also es-
calate tasks for being too ambiguous or confusing,
which occurred less than 0.0001% of the time.
After the dataset was collected, we computed
three measures of quality in Table 2: (1) the rate
that judgments resolved to a label (Percent), (2)
the degree to which examples with multiple la-
bels agreed (Agreement), and (3) the Krippendorff
alpha score amongst examples with multiple la-
bels (Krippendorff,2004). Caption-video pairs re-
solved to a label 99.9% of the time in MSR-VTT and 99.6% of the time in MSVD. Agreement in both datasets exceeded 90%, and the Krippendorff score suggests reasonable agreement as well. Based on
this analysis, we see no evidence of data quality
issues. The next section digs deeper into FIRE and
suggests explanations for the observed phenomena.
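For readers who want to reproduce the agreement numbers on the released labels, the sketch below computes percent agreement and Krippendorff's α for nominal data using the standard coincidence-matrix formulation; it is a textbook implementation, not necessarily the exact computation used for Table 2.

```python
from collections import Counter, defaultdict

def percent_agreement(labels_by_pair):
    """Share of multiply-labeled caption-video pairs whose labels all agree."""
    multi = [labels for labels in labels_by_pair.values() if len(labels) >= 2]
    return sum(len(set(labels)) == 1 for labels in multi) / len(multi)

def krippendorff_alpha_nominal(labels_by_pair):
    """Krippendorff's alpha for nominal labels via the coincidence-matrix formulation."""
    units = [labels for labels in labels_by_pair.values() if len(labels) >= 2]
    coincidence = defaultdict(float)  # (value c, value k) -> o_ck
    marginals = Counter()             # value c -> n_c
    n = 0
    for labels in units:
        m = len(labels)
        counts = Counter(labels)
        for c, n_c in counts.items():
            marginals[c] += n_c
            n += n_c
            for k, n_k in counts.items():
                coincidence[(c, k)] += n_c * (n_k - (c == k)) / (m - 1)
    observed = sum(v for (c, k), v in coincidence.items() if c != k)
    expected = sum(
        marginals[c] * marginals[k] for c in marginals for k in marginals if c != k
    ) / (n - 1)
    return 1.0 - observed / expected

labels = {("c1", "v2"): [1, 1], ("c1", "v3"): [0, 0], ("c2", "v1"): [1, 0, 1]}
print(percent_agreement(labels), krippendorff_alpha_nominal(labels))  # ~0.67 and 0.5
```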
Dataset Models Overlap RBO
MSR-VTT C4C & SSB 0.0638 0.0568
MSR-VTT C4C & TT 0.0610 0.0509
MSR-VTT TT & SSB 0.440 0.231
MSVD C4C & TT 0.411 0.211
Table 3: Annotated predictions of one model boost the
score of another model when predictions overlap. In
MSR-VTT, there is little overlap between CLIP4CLIP
and other models; there is far more overlap in MSVD.
Model Data C@1 C@5 C@10
CLIP4CLIP All 0.674 0.907 0.957
CLIP4CLIP New 0.430 0.713 0.812
TeachText All 0.241 0.532 0.670
TeachText New 0.239 0.527 0.663
SSB All 0.273 0.559 0.689
SSB New 0.271 0.553 0.679
Table 4: We compare C@K of an MSR-VTT model: (1)
with all annotations (All) and (2) without the model’s
annotated predictions to emulate model development
(New). CLIP4CLIP exhibits large differences.
4 Analysis Experiments
The difference FIRE makes to metrics (Table 1) is striking, which raises the question: why are there
such large differences? We suggest explanations
for these differences (§4.1) while investigating how
these metrics vary under commonplace evaluation
settings such as new model development (§4.2).
4.1 Why Are Score Boosts Not Uniform?
FIRE-based metrics are interesting for at least two
reasons: (1) the magnitude of difference and (2) the
non-uniformity of boosts. Specifically, CLIP4CLIP
has a larger boost than TeachText and SSB on MSR-
VTT. First, we investigate the degree of prediction
overlap between models. When predictions over-
lap, the models share the boost. Likewise, when
they do not overlap, there is an opportunity for dif-
fering boosts. Table 3 shows this: on MSR-VTT,
CLIP4CLIP and the other two models have little
overlap; in contrast, TeachText and SSB have sub-
stantial overlap and their boosts are of roughly the
same magnitude. Overlap is computed between the
top ten predictions of each model using simple over-
lap and rank-biased overlap (Webber et al.,2010,
RBO).9 As we might expect based on CLIP4CLIP

9 If the ordering of predictions amongst the top ten did not matter, the overlap would be acceptable. However, as in most IR settings, we do care about the order, so we use a rank-aware metric like RBO.
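The two overlap measures reported in Table 3 can be sketched as follows; the persistence parameter p = 0.9 and the truncated (prefix-only) form of RBO are assumptions for illustration, since the paper's exact settings are not specified here.

```python
def simple_overlap(a, b):
    """Fraction of items shared between two top-K prediction lists."""
    k = min(len(a), len(b))
    return len(set(a[:k]) & set(b[:k])) / k

def rbo_at_depth(a, b, p=0.9):
    """Truncated rank-biased overlap (Webber et al., 2010): prefix agreement only,
    with persistence p; p = 0.9 is an assumed value, not necessarily the paper's."""
    depth = min(len(a), len(b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        total += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * total

top10_model_a = ["v1", "v4", "v9", "v2", "v7"]
top10_model_b = ["v4", "v1", "v3", "v8", "v7"]
print(simple_overlap(top10_model_a, top10_model_b))  # 0.6
print(rbo_at_depth(top10_model_a, top10_model_b))
```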