Fighting FIRe with FIRE:
Assessing the Validity of Text-to-Video Retrieval Benchmarks
Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy Labson, Hardik Shah, Seungwhan Moon
Meta AI
Abstract
Searching troves of videos with textual de-
scriptions is a core multimodal retrieval task.
Owing to the lack of a purpose-built dataset
for text-to-video retrieval, video captioning
datasets have been re-purposed to evaluate
models by (1) treating captions as positive
matches to their respective videos and (2) as-
suming all other videos to be negatives. How-
ever, this methodology leads to a fundamen-
tal flaw during evaluation: captions are marked as
relevant only to their original video, yet many
alternate videos also match the caption, which
introduces false-negative caption-video
pairs. We show that when these false neg-
atives are corrected, a recent state-of-the-art
model gains 25% recall points—a difference
that threatens the validity of the benchmark it-
self. To diagnose and mitigate this issue, we
annotate and release 683K additional caption-
video pairs. Using these, we recompute effec-
tiveness scores for three models on two stan-
dard benchmarks (MSR-VTT and MSVD). We
find that (1) the recomputed metrics are up to
25% recall points higher for the best models,
(2) these benchmarks are nearing saturation for
Recall@10, (3) caption length (generality) is
related to the number of positives, and (4) an-
notation costs can be mitigated through sam-
pling. We recommend retiring these bench-
marks in their current form, and we make
recommendations for future text-to-video re-
trieval benchmarks.
1 Introduction
Text-to-video retrieval (TVR) is a challenging multi-
modal retrieval task (Hu et al.,2011) with practical
applications ranging from web search to organiz-
ing media collections (Lew et al.,2006). To mea-
sure TVR model improvement—despite a dearth
of purpose-built TVR benchmarks—researchers
created benchmarks by re-purposing video cap-
tioning datasets such as MSR-VTT (Xu et al.,
Correspondence to me@pedro.ai
[Figure 1 illustration: the test caption “Cartoon girl is talking” is scored against the MSR-VTT 1K test videos; under the original labels Correct@1 is 0%, while under FIRE labels it is 100% (legend: original label vs. FIRE label; relevant vs. irrelevant).]
Figure 1: MSR-VTT and MSVD have one positive
video per caption (each video’s caption). Captions of-
ten match multiple videos, leading to false negatives.
When models rank false negatives highly, model qual-
ity is understated (full example in Appendix Figure 5).
This leads to evaluations where reported metrics do not
reflect their true value and are therefore not internally
valid (§2.2.1).
2016), MSVD (Chen and Dolan,2011), and Activi-
tyNet (Heilbron et al.,2015;Krishna et al.,2017).
Early work established an evaluation paradigm that
treated captions as search queries over the collec-
tion of captioned videos (Zhang et al.,2018;Yu
et al.,2018;Gabeur et al.,2020); each caption and
their corresponding video are positives (relevant)
during retrieval, and all other caption-video pairs
are negatives (irrelevant).
However, even a cursory inspection of videos
and captions reveals many additional positive
caption-video pairs (§2). In current benchmarks,
true positives that are not the video’s original cap-
tion are falsely assumed to be negatives. Wray
et al. (2021) first identified this fundamental, false-
negative problem in TVR evaluation; our work
builds on this by quantifying the absolute metric
differences that false negatives induce (see discus-
sion in §6). Accurate absolute metrics are cru-
cial in industrial settings where deployment cri-
teria are often defined by minimum quality tar-
gets. These False Implicit Relevance labels introduce measurement error—e.g., CLIP4CLIP's (Luo et al., 2021) Recall@1 is underestimated by 25% points (§2.2). We estimate measurement error by annotating 683K additional caption-video pairs, which we call the FIRE dataset (§3).1

1 Data and Code: pedro.ai/multimodal-retrieval-evaluation.
A core measurement principle is that operational-
ized metrics should strongly correlate to the quan-
tity they intend to measure (Mathison,2004;Liao
et al.,2021). For example, Recall@K operational-
izes the intent to measure retrieval quality. Label
errors are a common way that measurements are
invalidated (Bowman and Dahl,2021;Northcutt
et al.,2021). Our work shows that since TVR met-
rics are computed with false negative label errors,
Recall@K does not accurately reflect retrieval qual-
ity, which negates the measurement’s validity. In
the remainder of this paper, we posit rationales for
why models gain different score boosts (§4.1) and
estimate how useful the FIRE dataset is for evaluat-
ing future models (§4.2 and §4.3).
To conclude, we review the implications of our
findings. Looking to the past, retrieval effective-
ness has been understated for some models, which
gives an overly pessimistic view of recent ad-
vances (Bowman,2022). Critically, our results
also suggest that the MSR-VTT benchmark is near-
ing saturation and should be retired soon in favor
of a purpose-made benchmark. Looking outward,
we identify structurally similar benchmarks—such
as photo retrieval—that likely also have the same
False Implicit Relevance problem. A successful
benchmark should avoid the pitfalls we identify in
this paper, be faithful to the real-world user task it
targets (Rowe and Jain,2005;de Vries et al.,2020),
improve reproducibility, and evolve (§7).
2 Text-to-Video Retrieval Evaluation
This section reviews current TVR evaluation prac-
tices using two concepts: internal validity (Camp-
bell,1957, §2.2.1) and construct validity (Tague-
Sutcliffe,1992, §2.2.2). Internal validity refers to
whether an evaluation reliably establishes a cause-
effect relationship between the measured depen-
dent variable and the independent variable to be
estimated (Brewer and Crano,2014;Liao et al.,
2021). In TVR evaluations, false negatives con-
found model quality and label errors (i.e., is the
model wrong or is the label wrong?) which makes
reliably establishing cause (model quality) and ef-
fect (retrieval score) difficult. Construct validity
“pertains to the degree to which the measure of a
construct sufficiently measures the intended concept” (O’Leary-Kelly and J. Vokurka, 1998)—in
TVR evaluations, an important intended concept is
real-world search quality. Construct validity asks:
can we expect that measuring retrieval quality with
the benchmarks at hand generalizes to real-world
search quality? This section argues that TVR evalu-
ations are not internally valid or construct valid.
2.1 Model Evaluation
Multimodal retrieval evaluations typically focus
on two tasks: text-to-video and video-to-text re-
trieval. The first task’s goal is—given a text query—
to retrieve videos that match; the second task’s
goal is—given a video—to retrieve the matching
queries. The applications of text-to-video search
are straightforward: it is useful for searching the
web and personal media.2 Since the applications
of TVR are clear, and the false-negative problem is
present in both tasks, here we focus on TVR.
The MSR-VTT and MSVD Datasets:
It is
standard for TVR evaluations (Zhang et al.,2018;
Yu et al.,2018;Gabeur et al.,2020) to report on
MSR-VTT and MSVD, so in the interest of compa-
rability, we use these benchmarks too. Although
these datasets were originally meant for evaluating
video captioning models, they have been repur-
posed for TVR (Zhang et al.,2018;Gabeur et al.,
2020). In this paper, we focus our investigation
on MSR-VTT and MSVD since they are the most
prevalent in prior work. MSR-VTT consists of 10K
videos, 1K of which are in the test split. Each video
has twenty captions, but for evaluation, only one
(arbitrarily chosen) caption is used. MSVD contains
1,970 videos, 960 of which are in the test split.
Videos have about forty captions; unlike with MSR-
VTT, retrieval quality for each caption is evaluated.
Fundamentally, both MSR-VTT and MSVD are
video captioning datasets—not retrieval datasets.
MSVD addressed the lack of standard benchmarks
for paraphrasing (Chen and Dolan,2011). In the
original task, annotators selected short clips from
YouTube, watched the clip, and wrote a sentence
describing its contents. The process was repeated
for each video, with each sentence being written by
a new annotator. This conditional independence—
given the video—resulted in a diverse set of cap-
tions. MSR-VTT captions were collected similarly:
independent annotators captioned the same video.
Videos were sourced from the output of a commercial video search engine (Xu et al., 2016). In both datasets, video captions are used as search queries and labeled relevant to the original video.

2 The applications of video-to-text retrieval—that are not simply captioning—are not clear to us.
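To make this repurposed setup concrete, the sketch below builds the single-positive relevance judgments that the paradigm implies: each caption queries the whole collection, its source video is the only labeled positive, and every other video is an implicit negative. This is an illustrative sketch with made-up identifiers, not the benchmarks' released tooling.

```python
# Sketch of the repurposed-captioning evaluation setup described above: every
# caption queries the full video collection, and only its source video is marked
# relevant. All identifiers here are illustrative, not from released code.

captions = [
    {"caption_id": "c1", "video_id": "v1", "text": "cartoon girl is talking"},
    {"caption_id": "c2", "video_id": "v2", "text": "a boy playing the violin"},
]
video_ids = ["v1", "v2", "v3"]

# qrels: query id -> set of relevant video ids (exactly one per caption here).
qrels = {c["caption_id"]: {c["video_id"]} for c in captions}

# Every other caption-video pair is an implicit negative, which is where false
# negatives enter: other videos in the collection may also match the caption.
implicit_negatives = {
    c["caption_id"]: [v for v in video_ids if v not in qrels[c["caption_id"]]]
    for c in captions
}
```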
Metrics:
Previous TVR work (Zhang et al.,
2018;Yu et al.,2018;Gabeur et al.,2020;Luo
et al.,2020;Zhu and Yang,2020;Li et al.,2020;
Xu et al.,2021;Park et al.,2022) reports Recall@K
(R@K)3
and sometimes supplemental metrics such
as median or mean rank of the first correct re-
sult. However, R@K in TVR work differs from
the textbook information retrieval definition (Man-
ning et al.,2008, p. 155) where
R@K = \frac{\text{\# retrieved positives in top } K}{\text{\# total positives in collection}} \quad (1)
In TVR work, query retrieval results are scored one
if a relevant video is in the top K and zero other-
wise. The traditional definition of Recall@K only
reduces to this when there is exactly one positive
in the collection but is not comparable when there
are multiple positives per caption—as in this case.
With the difference now salient, we avoid confu-
sion by defining a new quantity Correct@K (C@K)
which is 1 if at least one positive is in the top K
and 0 otherwise. Correct@K naturally reduces to
Recall@K—as defined in prior work—when there
is exactly one positive, but handles the additional
positives in our work. We recommend reporting
Correct@K as well as mean average precision (Su
et al.,2015;Mitra and Craswell,2018,MAP), a
metric widely used in Information Retrieval.
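To make the distinction concrete, the sketch below contrasts the textbook Recall@K of Eq. (1) with the 0/1 Correct@K scoring; the function names and toy data are ours, not from any released evaluation script.

```python
def recall_at_k(ranked, positives, k):
    """Textbook Recall@K (Eq. 1): fraction of all positives retrieved in the top K."""
    retrieved = set(ranked[:k]) & set(positives)
    return len(retrieved) / len(positives)

def correct_at_k(ranked, positives, k):
    """Correct@K: 1 if at least one positive appears in the top K, else 0."""
    return 1.0 if set(ranked[:k]) & set(positives) else 0.0

# With a single positive, the two definitions coincide...
ranked = ["v7", "v1", "v9", "v4"]
assert recall_at_k(ranked, {"v1"}, k=2) == correct_at_k(ranked, {"v1"}, k=2) == 1.0

# ...but with multiple positives (as after FIRE correction) they diverge.
positives = {"v1", "v4", "v5"}
print(recall_at_k(ranked, positives, k=2))   # 1/3, about 0.33
print(correct_at_k(ranked, positives, k=2))  # 1.0
```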
The drawback of Correct@K—shared by me-
dian (or mean) rank to first positive—is that it does
not directly factor in rank order when there are mul-
tiple positives in retrieved results, only coarsely
factoring in rank via K value. MAP (Mitra and
Craswell,2018, p. 19) is calculated by taking the
mean of
\mathrm{AvgPrec}_q = \frac{\sum_{\langle i, v \rangle \in R_q} \mathrm{Prec}_{q,i} \times \mathrm{rel}_q(v)}{\sum_{v \in V} \mathrm{rel}_q(v)} \quad (2)
for each test query q, where i is a video's position in the ranked list R_q of videos, v is a video in collection V, and rel_q(v) denotes whether query q is relevant to video v. Intuitively, this translates to calculating the mean of Precision@K for every K where a positive occurs in the ranked predictions R_q.
In all experiments, we report Correct@K and MAP.
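As a minimal illustration of Eq. (2), the sketch below computes average precision as the mean of Precision@i over the ranks i that hold a positive, normalized by the query's total positives; MAP is then the mean over test queries. This is our own hedged reading of the formula, not the paper's released code.

```python
def average_precision(ranked, positives):
    """AvgPrec_q from Eq. (2): mean of Precision@i at the ranks i holding a positive,
    normalized by the total number of positives for the query."""
    hits = 0
    precision_sum = 0.0
    for i, video in enumerate(ranked, start=1):
        if video in positives:
            hits += 1
            precision_sum += hits / i  # Precision@i at each rank that holds a positive
    return precision_sum / len(positives) if positives else 0.0

def mean_average_precision(runs, qrels):
    """MAP over test queries; `runs` maps query -> ranked videos, `qrels` -> positives."""
    return sum(average_precision(runs[q], qrels[q]) for q in qrels) / len(qrels)

# Example: positives ranked 1st and 3rd -> AP = (1/1 + 2/3) / 2, about 0.83.
print(average_precision(["v1", "v9", "v4"], {"v1", "v4"}))
```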
2.2 Questioning the Validity of Evaluations
In this section, we experimentally argue that cur-
rent TVR evaluations are not internally valid. Then
we argue that they are not construct valid by considering actual use-cases for video search.

3 Typical K values include 1, 5, 10, and 50.
2.2.1 Internal Validity
If an evaluation metric is internally valid (Liao
et al.,2021), then model effectiveness (cause)
should be accurately and reliably reflected in met-
rics (effect) (Brewer and Crano,2014). A central
hypothesis of this paper is that the prevalence of
false negatives invalidates the cause-effect relation-
ship between measured model effectiveness and
actual effectiveness–i.e., that correcting false nega-
tives will significantly change metrics.4
To test this hypothesis, we build the FIRE dataset,
which Fixes Implicit Relevance Errors. We de-
tail the dataset later (§3), but in short, we take
strong retrieval models from the past few years
and annotate their top ten predictions on both
MSR-VTT and MSVD. This process—called system
pooling—has been used for decades in information
retrieval (Spärck Jones, 1975) and, by construction,
eliminates implicit false negatives.5 For MSR-VTT,
we collect annotations from TeachText (Croitoru
et al.,2021), Support-Set Bottlenecks (Patrick et al.,
2021,SSB), and CLIP4CLIP (Luo et al.,2021) mod-
els; for MSVD, we collect annotations from Teach-
Text and CLIP4CLIP models.6,7 Next, we compute
model scores using the original positives and com-
pare them to scores calculated with both the origi-
nal positives and the new positives in FIRE.
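A minimal sketch of the pooling procedure described above follows: union each model's top ten predictions per caption and annotate every pooled pair that is not already labeled. The data shapes (e.g., a `model_runs` mapping from model name to per-caption ranked lists) are assumptions for illustration, not the authors' pipeline.

```python
# Illustrative sketch of system pooling: union each model's top-10 predictions per
# caption and send every pooled pair without an existing label out for annotation.
# `model_runs` maps model name -> {caption_id: ranked list of video ids} (assumed shape).

def build_annotation_pool(model_runs, original_qrels, depth=10):
    pool = set()
    for run in model_runs.values():
        for caption_id, ranked in run.items():
            for video_id in ranked[:depth]:
                pool.add((caption_id, video_id))
    # Pairs already marked positive in the original data need no new label.
    already_labeled = {
        (caption_id, video_id)
        for caption_id, videos in original_qrels.items()
        for video_id in videos
    }
    return pool - already_labeled
```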
Table 1 clearly demonstrates that FIRE annota-
tions reveal large metric differences in both MSR-
VTT and MSVD. For example, the C@1 score of
CLIP4CLIP is understated by 25% points, and its
C@10 score arguably saturates the benchmark at
95.7%. Even “small” differences such as those for
TeachText and SSB are on par with the differences
used to claim state-of-the-art results. False nega-
tives directly cause high measurement error, which
undermines the internal validity of the benchmark.
4 We do not see rank changes in our three models, but score differences suggest that ranks may change with more models.
5 By implicit, we mean false negatives that arise from the lack of labeling, i.e., from presuming non-positives are (implicitly) negative. There may still be false negatives arising from human error during annotation.
6 We prioritize models that are (1) publicly available and (2) have sufficient documentation to reproduce.
7 Annotating MSR-VTT predictions translates to 1,000 × 10 = 10K annotations since only one caption per video is used. This is easy compared to MSVD annotation, which uses tens of captions per video.
Dataset Metric TeachText SSB CLIP4CLIP
MSR-VTT C@1 24.1 (23.3 + 0.800)% 27.3 (26.8 + 0.500)% 67.4 (42.4 + 25.0)%
MSR-VTT C@5 53.2 (50.9 + 2.30)% 55.9 (54.5 + 1.40)% 90.7 (70.4 + 20.3)%
MSR-VTT C@10 67.0 (64.8 + 2.20)% 68.9 (66.3 + 2.60)% 95.7 (80.2 + 15.5)%
MSR-VTT AP 36.1 (35.8 + 0.296)% 39.3 (39.2 + 0.0374)% 69.5 (54.9 + 14.7)%
MSVD C@1 34.7 (19.6 + 15.2)% Not Annotated 65.3 (46.6 + 18.8)%
MSVD C@5 64.7 (48.9 + 15.8)% Not Annotated 89.6 (76.8 + 12.8)%
MSVD C@10 76.1 (63.9 + 12.2)% Not Annotated 94.0 (85.4 + 8.61)%
MSVD AP 44.3 (33.1 + 11.2)% Not Annotated 71.3 (59.7 + 11.6)%
Table 1: The table shows the impact of FIRE annotations on MSR-VTT and MSVD text-to-video retrieval metrics.
“A (B + C)” has metrics computed with FIRE positives (A), only original positives (B), and the delta (C). The deltas emphasize the deleterious effects of false negatives: CLIP4CLIP's C@1 on MSR-VTT is understated by 25% points.
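In principle, the "A (B + C)" entries can be reproduced by scoring each model twice, once against the original positives only (B) and once against the union of original and FIRE positives (A), and reporting the difference (C). The sketch below illustrates this with toy data; it is not the authors' evaluation pipeline.

```python
def correct_at_k(ranked, positives, k):
    """Correct@K: 1 if at least one positive appears in the top K, else 0."""
    return 1.0 if set(ranked[:k]) & set(positives) else 0.0

def corrected_metric(runs, original_qrels, fire_positives, k=1):
    """Return (A, B, C): C@k with original + FIRE positives, original only, and the delta."""
    queries = list(runs)
    b = sum(correct_at_k(runs[q], original_qrels.get(q, set()), k) for q in queries) / len(queries)
    a = sum(
        correct_at_k(runs[q], original_qrels.get(q, set()) | fire_positives.get(q, set()), k)
        for q in queries
    ) / len(queries)
    return a, b, a - b

# Toy example: a FIRE label turns a "miss" at rank 1 into a hit.
runs = {"c1": ["v3", "v1"]}
a, b, delta = corrected_metric(runs, {"c1": {"v1"}}, {"c1": {"v3"}}, k=1)
print(a, b, delta)  # 1.0 0.0 1.0
```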
2.2.2 Construct Validity
In addition to problems with internal validity, we
posit that TVR evaluations are also not construct
valid (Cronbach and Meehl,1955;O’Leary-Kelly
and J. Vokurka,1998). Construct validity is re-
lated to “how closely our evaluations hit the mark
in appropriately characterizing the actual antici-
pated behaviour of the system in the real world or
progress on stated motivations and goals for the
field” (Raji et al.,2021). What is the real-world
use of text-to-video retrieval (or alternatively, the
field’s motivations)? Consider the most straight-
forward answer: that such systems will be used by
users to search through video collections, whether
on the web or in personal collections. First, search
queries issued by real users are very likely not sim-
ilar to captions written by crowd annotators; this is
easily observed by inspecting captions in Table 5
and Appendix Table 6. Second, the video distribu-
tion is unlikely to reflect real use-cases, as the videos were
selected by annotators or are search results from
seed queries. Due to these problems, it seems un-
likely that the evaluations are construct valid, and
future benchmarks should improve this by building
evaluations that match the intended use of models—
i.e., be ecologically valid (de Vries et al.,2020).
3 FIRE Dataset Collection and
Validation
Next, we describe and analyze the FIRE dataset.
3.1 Annotation Task and Dataset Collection
In the FIRE annotation task, annotators mark
whether the displayed caption is relevant to the
displayed video. Implicitly, the caption’s video is
relevant to it, but how do we judge whether another
arbitrary video is relevant? In other words, how
should annotators mark whether a caption is rele-
vant to a video? In both datasets (§2.1), the caption
must be completely consistent with the video; oth-
erwise, it would not be an accurate caption. There-
fore, we enforce the same condition in our task to
preserve the original relevance semantics.8
Annotators are instructed to mark a caption as
relevant to a video only if every element men-
tioned in the query could be reasonably consid-
ered present. Elements include persons, objects,
locations, and activities, as well as quantifiers, qual-
ifiers, and adjectives. Raters are given some leeway
to use interpretation and inference but instructed to
err in favor of not relevant if the caption is ambigu-
ous or vague. For example, for the caption “a boy
playing the violin,” the video must show a boy who
is playing the violin, not a video of only violins or
a video with only a boy. Screenshots of the anno-
tation interfaces and details of sensitive category
handling are in Appendix B. Complete annotation
guidelines are included in supplemental materials.
To select caption-video pairs to annotate, we
obtain the top ten MSR-VTT and MSVD test set
predictions from three models: CLIP4CLIP (Luo
et al.,2021), SSB (Patrick et al.,2021), and Teach-
Text (Croitoru et al.,2021). For TeachText, we use
model checkpoints available on their webpage. For
CLIP4CLIP and SSB, checkpoints are not available,
so we train new models and verify that retrieval
quality is on par with the literature (see Table 1).
Table 2 summarizes the resulting FIRE dataset.
In total, 683K labels were collected
across a set of 579K unique caption-video pairs.
Some duplication was intentional: we obtained a
second label for 10% of annotations, and if the la-
bels disagreed, we collected a third label to resolve
the disagreement. Elsewhere, duplication was unin-
tentional: for MSVD we did not deduplicate caption-video pairs between two models, so where the predictions overlapped, we obtained additional labels. Fortunately, this provided an unexpected opportunity to further validate dataset quality.

8 Requiring complete matches makes the annotation task easier by eliminating ambiguous partial match cases.
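The multi-label resolution described above (a second label for roughly 10% of pairs and a tie-breaking third label on disagreement) amounts to a simple majority vote over up to three judgments. The sketch below is our illustrative reading of that protocol, not released annotation tooling.

```python
from collections import Counter

def resolve_label(labels):
    """Resolve 1-3 relevance judgments for a caption-video pair by majority vote.
    Returns the winning label, or None if the judgments do not resolve."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label if count > len(labels) / 2 else None

print(resolve_label(["relevant"]))                            # relevant (single label)
print(resolve_label(["relevant", "irrelevant", "relevant"]))  # relevant (2-of-3 majority)
print(resolve_label(["relevant", "irrelevant"]))              # None -> collect a third label
```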
Dataset # Pairs Percent # Labels
MSR-VTT 24,183 100% 24,507
Agreement 24,167 99.9% -
Relevant 2,855 11.8% -
Irrelevant 21,312 88.2% -
Disagreement 16 0.0662% -
MSVD 555,391 100% 659,126
Agreement 553,832 99.7% -
Relevant 39,909 7.21% -
Irrelevant 513,923 92.8% -
Disagreement 1,559 0.281% -
Table 2: The FIRE dataset is composed of labels for
MSR-VTT and MSVD text-video pairs. The positive-to-
negative ratio is skewed, reflecting that queries do not
match most videos. We annotate a subset multiple times to compute annotator agreement rates and Krippendorff's α. Agreement on MSR-VTT was .931 with α = .691 and on MSVD was .958 with α = .798. Appendix C disaggregates agreement rates, which are consistent.
3.2 Dataset Quality Validation
Before, throughout, and after the collection, we
took steps to collect high-quality data and validate
its quality. The annotation task was completed by
a team of one hundred raters specifically trained
to review caption-video pairs and assess relevance.
These annotators completed a 1,000-job training
queue, which was reviewed by data quality leads
and this paper’s authors. This allowed annotators
to learn to annotate according to our guidelines,
request clarification of the guidelines, and request
tooling improvements. Annotators could also es-
calate tasks for being too ambiguous or confusing,
which occurred less than 0.0001% of the time.
After the dataset was collected, we computed
three measures of quality in Table 2: (1) the rate
that judgments resolved to a label (Percent), (2)
the degree to which examples with multiple la-
bels agreed (Agreement), and (3) the Krippendorff
alpha score amongst examples with multiple la-
bels (Krippendorff,2004). Caption-video pairs re-
solved to a label 99.9% of the time in MSR-VTT and 99.6% of the time in MSVD. Agreement in both datasets exceeded 90%, and the Krippendorff score suggests reasonable agreement as well. Based on
this analysis, we see no evidence of data quality
issues. The next section digs deeper into FIRE and
suggests explanations for the observed phenomena.
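For readers who want to reproduce the agreement numbers on the released labels, the sketch below computes percent agreement and Krippendorff's α for nominal data using the standard coincidence-matrix formulation; it is a textbook implementation, not necessarily the exact computation used for Table 2.

```python
from collections import Counter, defaultdict

def percent_agreement(labels_by_pair):
    """Share of multiply-labeled caption-video pairs whose labels all agree."""
    multi = [labels for labels in labels_by_pair.values() if len(labels) >= 2]
    return sum(len(set(labels)) == 1 for labels in multi) / len(multi)

def krippendorff_alpha_nominal(labels_by_pair):
    """Krippendorff's alpha for nominal labels via the coincidence-matrix formulation."""
    units = [labels for labels in labels_by_pair.values() if len(labels) >= 2]
    coincidence = defaultdict(float)  # (value c, value k) -> o_ck
    marginals = Counter()             # value c -> n_c
    n = 0
    for labels in units:
        m = len(labels)
        counts = Counter(labels)
        for c, n_c in counts.items():
            marginals[c] += n_c
            n += n_c
            for k, n_k in counts.items():
                coincidence[(c, k)] += n_c * (n_k - (c == k)) / (m - 1)
    observed = sum(v for (c, k), v in coincidence.items() if c != k)
    expected = sum(
        marginals[c] * marginals[k] for c in marginals for k in marginals if c != k
    ) / (n - 1)
    return 1.0 - observed / expected

labels = {("c1", "v2"): [1, 1], ("c1", "v3"): [0, 0], ("c2", "v1"): [1, 0, 1]}
print(percent_agreement(labels), krippendorff_alpha_nominal(labels))  # ~0.67 and 0.5
```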
Dataset Models Overlap RBO
MSR-VTT C4C & SSB 0.0638 0.0568
MSR-VTT C4C & TT 0.0610 0.0509
MSR-VTT TT & SSB 0.440 0.231
MSVD C4C & TT 0.411 0.211
Table 3: Annotated predictions of one model boost the
score of another model when predictions overlap. In
MSR-VTT, there is little overlap between CLIP4CLIP
and other models; there is far more overlap in MSVD.
Model Data C@1 C@5 C@10
CLIP4CLIP All 0.674 0.907 0.957
CLIP4CLIP New 0.430 0.713 0.812
TeachText All 0.241 0.532 0.670
TeachText New 0.239 0.527 0.663
SSB All 0.273 0.559 0.689
SSB New 0.271 0.553 0.679
Table 4: We compare C@K of an MSR-VTT model: (1)
with all annotations (All) and (2) without the model’s
annotated predictions to emulate model development
(New). CLIP4CLIP exhibits large differences.
4 Analysis Experiments
The difference FIRE makes to metrics (Table 1) is striking, which raises the question: why are there
such large differences? We suggest explanations
for these differences (§4.1) while investigating how
these metrics vary under commonplace evaluation
settings such as new model development (§4.2).
4.1 Why Are Score Boosts Not Uniform?
FIRE-based metrics are interesting for at least two
reasons: (1) the magnitude of difference and (2) the
non-uniformity of boosts. Specifically, CLIP4CLIP
has a larger boost than TeachText and SSB on MSR-
VTT. First, we investigate the degree of prediction
overlap between models. When predictions over-
lap, the models share the boost. Likewise, when
they do not overlap, there is an opportunity for dif-
fering boosts. Table 3 shows this: on MSR-VTT,
CLIP4CLIP and the other two models have little
overlap; in contrast, TeachText and SSB have sub-
stantial overlap and their boosts are of roughly the
same magnitude. Overlap is computed between the
top ten predictions of each model using simple over-
lap and rank-biased overlap (Webber et al.,2010,
RBO).9 As we might expect based on CLIP4CLIP

9 If the ordering of predictions amongst the top ten did not matter, the overlap would be acceptable. However, as in most IR settings, we do care about the order, so we use a rank-aware metric like RBO.
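The two overlap measures reported in Table 3 can be sketched as follows; the persistence parameter p = 0.9 and the truncated (prefix-only) form of RBO are assumptions for illustration, since the paper's exact settings are not specified here.

```python
def simple_overlap(a, b):
    """Fraction of items shared between two top-K prediction lists."""
    k = min(len(a), len(b))
    return len(set(a[:k]) & set(b[:k])) / k

def rbo_at_depth(a, b, p=0.9):
    """Truncated rank-biased overlap (Webber et al., 2010): prefix agreement only,
    with persistence p; p = 0.9 is an assumed value, not necessarily the paper's."""
    depth = min(len(a), len(b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        total += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * total

top10_model_a = ["v1", "v4", "v9", "v2", "v7"]
top10_model_b = ["v4", "v1", "v3", "v8", "v7"]
print(simple_overlap(top10_model_a, top10_model_b))  # 0.6
print(rbo_at_depth(top10_model_a, top10_model_b))
```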