The Vendi Score: A Diversity Evaluation Metric
for Machine Learning
Dan Friedman1and Adji Bousso Dieng1, 2, *
1Department of Computer Science, Princeton University
2Vertaix
*Published in Transactions on Machine Learning Research (07/2023),
Reviewed on OpenReview
https://openreview.net/forum?id=g97OHbQyk1
July 4, 2023
Abstract
Diversity is an important criterion for many areas of machine learning (ml),
including generative modeling and dataset curation. However, existing metrics
for measuring diversity are often domain-specific and limited in flexibility. In
this paper, we address the diversity evaluation problem by proposing the Vendi
Score, which connects and extends ideas from ecology and quantum statistical
mechanics to ml. The Vendi Score is defined as the exponential of the Shannon
entropy of the eigenvalues of a similarity matrix. This matrix is induced by
a user-defined similarity function applied to the sample to be evaluated for
diversity. In taking a similarity function as input, the Vendi Score enables
its user to specify any desired form of diversity. Importantly, unlike many
existing metrics in ml, the Vendi Score does not require a reference dataset
or distribution over samples or labels; it is therefore general and applicable
to any generative model, decoding algorithm, and dataset from any domain
where similarity can be defined. We showcase the Vendi Score on molecular
generative modeling where we found it addresses shortcomings of the current
diversity metric of choice in that domain. We also applied the Vendi Score to
generative models of images and decoding algorithms of text where we found
it confirms known results about diversity in those domains. Furthermore, we
used the Vendi Score to measure mode collapse, a known shortcoming of
generative adversarial networks (gans). In particular, the Vendi Score revealed
that even gans that capture all the modes of a labelled dataset can be less
diverse than the original dataset. Finally, the interpretability of the Vendi Score
allowed us to diagnose several benchmark ml datasets for diversity, opening
the door for diversity-informed data augmentation.¹
Keywords: diversity, evaluation, entropy, ecology, quantum statistical mechanics, machine learning

¹Code for calculating the Vendi Score is available at https://github.com/vertaix/Vendi-Score.
[Figure 1 graphic: similarity matrices induced by shape and/or color similarity functions, with a shared color bar for Similarity ranging from 0.0 to 1.0. Panel (a): VS = 2.00, 3.00, 4.00 with IntDiv = 0.50, 0.67, 0.75. Panel (b): VS = 3.00, 3.00, 4.00 with IntDiv = 0.67 throughout. Panel (c): VS = 3.00, 3.78, 4.66 with IntDiv = 0.67 throughout.]
Figure 1: (a) The Vendi Score, vs in the figure, can be interpreted as the effective
number of unique elements in a sample. It increases linearly with the number of
modes in the dataset. IntDiv, the expected dissimilarity, becomes less sensitive as
the number of modes increases, converging to 1. (b) Combining distinct similarity
functions can increase the Vendi Score, as should be expected of a diversity metric,
while leaving IntDiv unchanged. (c) IntDiv does not take into account correlations
between features, but the Vendi Score does. The Vendi Score is highest when the
items in the sample differ in many attributes, and the attributes are not correlated
with each other.
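To make the comparison in the figure concrete, the following sketch (not taken from the paper's code release) computes both quantities for block-structured binary similarity matrices like those in panel (a); the group sizes below are illustrative stand-ins for the shape/color groupings in the figure.

```python
import numpy as np

def vendi_score(K):
    """Exponential of the Shannon entropy of the eigenvalues of K/n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]           # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

def intdiv(K):
    """IntDiv: one minus the average pairwise similarity (all pairs, including self-pairs)."""
    return float(np.mean(1.0 - K))

def block_similarity(group_sizes):
    """Binary similarity matrix: 1 within a group (same shape/color), 0 across groups."""
    labels = np.repeat(np.arange(len(group_sizes)), group_sizes)
    return (labels[:, None] == labels[None, :]).astype(float)

# Twelve items split into 2, 3, and 4 equally sized groups, as in panel (a):
for groups in ([6, 6], [4, 4, 4], [3, 3, 3, 3]):
    K = block_similarity(groups)
    print(groups, round(vendi_score(K), 2), round(intdiv(K), 2))
# -> VS = 2.0, 3.0, 4.0 while IntDiv = 0.5, 0.67, 0.75
```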
1 Introduction
Diversity is a criterion that is sought after in many areas of machine learning (ml),
from dataset curation and generative modeling to reinforcement learning, active
learning, and decoding algorithms. A lack of diversity in datasets and models can
hinder the usefulness of ml in many critical applications, e.g. scientific discovery. It
is therefore important to be able to measure diversity.
Many diversity metrics have been proposed in ML, but these metrics are often
domain-specific and limited in flexibility. These include metrics that define diversity
in terms of a reference dataset (Heusel et al., 2017; Sajjadi et al., 2018), a pre-trained
classifier (Salimans et al., 2016; Srivastava et al., 2017), or discrete features,
like n-grams (Li et al.,2016). In this paper, we propose a general, reference-free
approach that defines diversity in terms of a user-specified similarity function.
Our approach is based on work in ecology, where biological diversity has been
defined as the exponential of the entropy of the distribution of species within a
population (Hill,1973;Jost,2006;Leinster,2021). This value can be interpreted
as the effective number of species in the population. To adapt this approach to ML,
we define the diversity of a collection of elements $x_1, \ldots, x_n$ as the exponential of
the entropy of the eigenvalues of the $n \times n$ similarity matrix $K$, whose entries are
equal to the similarity scores between each pair of elements. This entropy can be
seen as the von Neumann entropy associated with $K$ (Bengtsson and Życzkowski,
2017), so we call our metric the Vendi Score, for the von Neumann diversity.
Contributions. We summarize our contributions as follows:
• We extend ecological diversity to ML, and propose the Vendi Score, a metric for
evaluating diversity in ML. We study the properties of the Vendi Score, which
provides us with a more formal understanding of desiderata for diversity.
• We showcase the flexibility and wide applicability of the Vendi Score, characteristics
that stem from its sole reliance on the sample to be evaluated for diversity and a
user-defined similarity function, and highlight the shortcomings of existing metrics
used to measure diversity in different domains.
2 Are We Measuring Diversity Correctly in ML?
Several existing metrics for diversity rely on a reference distribution or dataset.
These reference-based metrics define diversity in terms of coverage of the reference.
They assume access to an embedding function, such as a pretrained Inception
model (Szegedy et al., 2016), that maps samples to real-valued vectors. One example
of a reference-based metric is Fréchet Inception distance (fid) (Heusel et al.,2017),
which measures the Wasserstein-2 distance between two Gaussian distributions, one
Gaussian fit to the embeddings of the reference sample and another one fit to the
embeddings of the sample to be evaluated for diversity. fid was originally proposed
for evaluating image generative adversarial networks (gans) but has since been
applied to text (Cífka et al.,2018) and molecules (Preuer et al.,2018) using domain-
specific neural network encoders. Sajjadi et al. (2018) proposed a two-metric
evaluation paradigm using precision and recall, with precision measuring quality
and recall measuring diversity in terms of coverage of the reference distribution.
Several other variations of precision and recall have been proposed (Kynkäänniemi
et al.,2019;Simon et al.,2019;Naeem et al.,2020). Compared to these approaches,
the Vendi Score is a reference-free metric, measuring the intrinsic diversity of a set
rather than the relationship to a reference distribution. This means that the Vendi
Score should be used alongside a quality metric, but can be applied in settings
where there is no reference distribution.
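For reference, here is a minimal sketch of the Fréchet distance computation underlying fid, assuming the user has already mapped both samples to embeddings with a pretrained encoder (the encoder itself is not shown).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(ref_embeddings, gen_embeddings):
    """Wasserstein-2 distance between Gaussians fit to two sets of embeddings,
    each of shape (num_samples, embedding_dim)."""
    mu1, mu2 = ref_embeddings.mean(axis=0), gen_embeddings.mean(axis=0)
    cov1 = np.cov(ref_embeddings, rowvar=False)
    cov2 = np.cov(gen_embeddings, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):       # small imaginary parts can appear numerically
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * covmean))
```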
Some other existing metrics evaluate diversity using a pre-trained classifier, therefore
requiring labeled datasets. For example, the Inception score (is) (Salimans et al.,
2016), which is mainly used to evaluate the perceptual quality of image generative
models, evaluates diversity using the entropy of the marginal distribution of class
labels predicted by an ImageNet classifier. Another example is number of modes
(nom) (Srivastava et al.,2017), a metric used to evaluate the diversity of gans. nom
is calculated by using a classifier trained on a labeled dataset and then counting the
number of unique labels predicted by the classifier when using samples from a gan as
input. Both is and nom define diversity in terms of predefined labels, and therefore
require knowledge of the ground truth labels and a separate classifier.
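A minimal sketch of the two classifier-based diversity quantities just described, assuming a matrix of predicted class probabilities produced by some pretrained classifier (the classifier is not shown):

```python
import numpy as np

def marginal_label_entropy(class_probs):
    """Entropy of the marginal distribution over predicted labels, the quantity is uses
    to reflect diversity (the full Inception Score combines it with a per-sample
    confidence term). class_probs has shape (num_samples, num_classes)."""
    p_y = class_probs.mean(axis=0)
    p_y = p_y[p_y > 0]
    return float(-np.sum(p_y * np.log(p_y)))

def number_of_modes(class_probs):
    """nom: number of distinct classes predicted at least once by the classifier."""
    return int(len(np.unique(class_probs.argmax(axis=1))))
```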
In some discrete domains, diversity is often evaluated in terms of the distribution of
unique features. For example in natural language processing (nlp), a standard metric
is n-gram diversity, which is defined as the number of distinct n-grams divided by
the total number of n-grams (e.g. Li et al.,2016). These metrics require an explicit,
discrete feature representation.
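For concreteness, a minimal version of n-gram diversity is sketched below; the tokenization and n-gram order vary across papers, and whitespace tokenization with bigrams is used here only for illustration.

```python
def ngram_diversity(texts, n=2):
    """Number of distinct n-grams divided by the total number of n-grams."""
    total, distinct = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

print(ngram_diversity(["the cat sat", "the cat sat"]))   # 0.5: repeated bigrams
print(ngram_diversity(["the cat sat", "a dog ran by"]))  # 1.0: all bigrams distinct
```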
There are proposed metrics that use similarity scores to define diversity. The most
widely used metric of this form is the average pairwise similarity score or the
complement, the average dissimilarity. In text, variants of this metric include
pairwise-bleu (Shen et al.,2019) and d-lex-sim (Fomicheva et al.,2020), in which
the similarity function is an n-gram overlap metric such as bleu (Papineni et al.,
2002). In biology, average dissimilarity is known as IntDiv (Benhenda,2017),
with similarity defined as the Jaccard (Tanimoto) similarity between molecular
fingerprints. Average similarity has some shortcomings, which we highlight in Figure 1.
The figure shows the similarity matrices induced by a shape similarity function
and/or a color similarity function. Each similarity function is 1 when the items
corresponding to the row and the column share the same shape or color, and 0
otherwise. As shown in Figure 1, the average similarity, here measured by IntDiv, becomes
less sensitive as diversity increases and does not account for correlations between
features. This is not the case for the Vendi Score, which accounts for correlations
between features and is able to capture the increased diversity resulting from
composing distinct similarity functions. Related to the metric we propose here is a
similarity-sensitive diversity metric proposed in ecology by Leinster and Cobbold
(2012), and which was introduced in the context of ml by Posada et al. (2020).
This metric is based on a notion of entropy defined in terms of a similarity profile, a
vector whose entries are equal to the expected similarity scores of each element.
Like IntDiv, it does not account for correlations between features.
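A sketch of the Leinster and Cobbold (2012) similarity-sensitive diversity mentioned above, written here for a uniform distribution over the sample and a precomputed similarity matrix K with ones on the diagonal:

```python
import numpy as np

def similarity_sensitive_diversity(K):
    """Order-1 diversity of Leinster & Cobbold (2012): the exponential of an entropy
    defined from the similarity profile K @ p, the expected similarity score of each
    element under the (here uniform) distribution p."""
    n = K.shape[0]
    p = np.full(n, 1.0 / n)
    profile = K @ p
    return float(np.exp(-np.sum(p * np.log(profile))))
```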
Some other diversity metrics in the ml literature fall outside of these categories.
The Birthday Paradox Test (Arora and Zhang,2018) aims to estimate the size of
the support of a generative model, but requires some manual inspection of samples.
gilbo (Alemi and Fischer,2018) is a reference-free metric but is only applicable
to latent variable generative models. Kviman et al. (2022) measure the diversity
of ensembles of variational approximations using the Jensen-Shannon Divergence
(jsd); this metric is only applicable to sets of probability distributions. Mitchell et al.
(2020) introduce metrics for diversity and inclusion, defining diversity in terms of
the representation of socially relevant attributes like gender and race, and using
the term heterogeneity to refer to variety in arbitrary attributes; in this paper, we
use the term diversity to have the same sense as heterogeneity, meaning variety in
arbitrary (user-specified) attributes. In the context of drug exploration, Xie et al.
(2022) propose a metric based on the size of the largest subset of elements such that
the similarity between any pair of elements is below some threshold, but this metric
requires setting a threshold. Similarly, in the field of evolutionary computation,
quality diversity (qd) algorithms (Pugh et al., 2015) have assessed diversity by
discretizing the feature space into a grid of bins and counting the number of covered
bins, but this approach requires picking a bin size.
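Rough sketches of these last two approaches follow; both depend on a resolution parameter, and the greedy construction below is a stand-in rather than the exact procedure of Xie et al. (2022).

```python
import numpy as np

def threshold_diversity(K, threshold=0.5):
    """Greedy count of elements whose pairwise similarities all fall below `threshold`,
    a lower bound on the size of the largest such subset."""
    selected = []
    for i in range(K.shape[0]):
        if all(K[i, j] < threshold for j in selected):
            selected.append(i)
    return len(selected)

def covered_bins(features, bins_per_dim=10):
    """Quality-diversity-style coverage: number of occupied cells after discretizing
    each feature dimension into `bins_per_dim` equal-width bins."""
    mins, maxs = features.min(axis=0), features.max(axis=0)
    cells = np.floor((features - mins) / (maxs - mins + 1e-12) * bins_per_dim)
    cells = np.clip(cells.astype(int), 0, bins_per_dim - 1)
    return len({tuple(row) for row in cells})
```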
As discussed above, several attempts have been made to measure diversity in ml.
However, the proposed metrics can be limited in their applicability in that they re-
quire a reference dataset or predefined labels, or are domain-specific and applicable
to one class of models. The existing metrics that do not have those applicability
limitations have shortcomings when it comes to capturing diversity, which we have
illustrated in Figure 1.
3 Measuring Diversity with the Vendi Score
We now define the Vendi Score, state its properties, and study its computational
complexity. (We relegate all proofs of lemmas and theorems to the appendix.)
3.1 Defining the Vendi Score
To define a diversity metric in ml we look to ecology, the field that centers diversity
in its work. In ecology, one main way diversity is defined is as the exponential of the
entropy of the distribution of the species under study (Jost,2006;Leinster,2021).
This is a reasonable index for diversity. Consider a population with a uniform
distribution over $n$ species, with entropy $\log(n)$. This population has maximal
ecological diversity $n$, the same diversity as a population with $n$ members, each
belonging to a different species. The ecological diversity decreases as the distribution
over the species becomes less uniform, and is minimized and equal to one when
all members of the population belong to the same species. For a more extensive
mathematical discussion of entropy and diversity in the context of biodiversity, we
refer readers to Leinster (2021).
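A short worked example of this index, with relative species abundances as input: a uniform distribution over n species yields diversity n, and a highly concentrated distribution yields a value close to one.

```python
import numpy as np

def ecological_diversity(abundances):
    """Exponential of the Shannon entropy of relative species abundances
    (the Hill number of order 1)."""
    p = np.asarray(abundances, dtype=float)
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

print(ecological_diversity([0.25, 0.25, 0.25, 0.25]))  # 4.0: four equally common species
print(ecological_diversity([0.97, 0.01, 0.01, 0.01]))  # ~1.18: effectively a single species
```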
How can we extend this way of thinking about diversity to ml? One naive approach
is to define diversity as the exponential of the Shannon entropy of the probability
distribution defined by a machine learning model or dataset. However, this approach
is limiting in that it requires a probability distribution for which entropy is tractable,
which is not available in many ml settings. We would like to define a diversity metric
that only relies on the samples being evaluated for diversity. And we would like for
such a metric to achieve its maximum value when all samples are dissimilar and
its minimum value when all samples are the same. This implies the need to define
a similarity function over the samples. Endowed with such a similarity function,
we can define a form of entropy that only relies on the samples to be evaluated for
diversity. This leads us to the Vendi Score:
Definition 3.1 (Vendi Score). Let $x_1, \ldots, x_n \in \mathcal{X}$ denote a collection of samples, let
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive semidefinite similarity function, with $k(x, x) = 1$ for all
$x$, and let $K \in \mathbb{R}^{n \times n}$ denote the kernel matrix with entry $K_{i,j} = k(x_i, x_j)$. Denote by
$\lambda_1, \ldots, \lambda_n$ the eigenvalues of $K/n$. The Vendi Score ($\mathrm{VS}$) is defined as the exponential
of the Shannon entropy of these eigenvalues,
$$\mathrm{VS}_k(x_1, \ldots, x_n) = \exp\left(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\right),$$
with the convention $0 \log 0 = 0$.
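A minimal NumPy sketch of this definition (the released implementation is in the repository linked above; this standalone version is only illustrative):

```python
import numpy as np

def vendi_score(samples, k):
    """Vendi Score: exp of the Shannon entropy of the eigenvalues of K/n, where
    K[i, j] = k(samples[i], samples[j]) and k(x, x) = 1 for all x."""
    n = len(samples)
    K = np.array([[k(x, y) for y in samples] for x in samples])
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]            # convention: 0 * log(0) = 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# Toy example with a hypothetical similarity function on strings:
same_first_letter = lambda a, b: float(a[0] == b[0])
samples = ["apple", "apricot", "banana", "blueberry", "cherry", "cranberry"]
print(vendi_score(samples, same_first_letter))    # 3.0: three effectively distinct groups
```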