The Vendi Score: A Diversity Evaluation Metric
for Machine Learning
Dan Friedman1and Adji Bousso Dieng1, 2, *
1Department of Computer Science, Princeton University
2Vertaix
*Published in Transactions on Machine Learning Research (07/2023),
Reviewed on OpenReview
https://openreview.net/forum?id=g97OHbQyk1
July 4, 2023
Abstract
Diversity is an important criterion for many areas of machine learning (ml),
including generative modeling and dataset curation. However, existing metrics
for measuring diversity are often domain-specific and limited in flexibility. In
this paper, we address the diversity evaluation problem by proposing the Vendi
Score, which connects and extends ideas from ecology and quantum statistical
mechanics to ml. The Vendi Score is defined as the exponential of the Shannon
entropy of the eigenvalues of a similarity matrix. This matrix is induced by
a user-defined similarity function applied to the sample to be evaluated for
diversity. In taking a similarity function as input, the Vendi Score enables
its user to specify any desired form of diversity. Importantly, unlike many
existing metrics in ml, the Vendi Score does not require a reference dataset
or distribution over samples or labels; it is therefore general and applicable
to any generative model, decoding algorithm, and dataset from any domain
where similarity can be defined. We showcase the Vendi Score on molecular
generative modeling where we found it addresses shortcomings of the current
diversity metric of choice in that domain. We also applied the Vendi Score to
generative models of images and decoding algorithms of text where we found
it confirms known results about diversity in those domains. Furthermore, we
used the Vendi Score to measure mode collapse, a known shortcoming of
generative adversarial networks (gans). In particular, the Vendi Score revealed
that even gans that capture all the modes of a labelled dataset can be less
diverse than the original dataset. Finally, the interpretability of the Vendi Score
allowed us to diagnose several benchmark ml datasets for diversity, opening
the door for diversity-informed data augmentation.¹
Keywords: diversity, evaluation, entropy, ecology, quantum statistical mechanics, machine learning

¹Code for calculating the Vendi Score is available at https://github.com/vertaix/Vendi-Score.
[Figure 1 graphic: similarity matrices induced by shape and/or color similarity functions, with a shared color bar for Similarity ranging from 0.0 to 1.0. Panel (a): VS = 2.00, 3.00, 4.00 with IntDiv = 0.50, 0.67, 0.75. Panel (b): VS = 3.00, 3.00, 4.00 with IntDiv = 0.67 throughout. Panel (c): VS = 3.00, 3.78, 4.66 with IntDiv = 0.67 throughout.]
Figure 1: (a) The Vendi Score, vs in the figure, can be interpreted as the effective
number of unique elements in a sample. It increases linearly with the number of
modes in the dataset. IntDiv, the expected dissimilarity, becomes less sensitive as
the number of modes increases, converging to 1. (b) Combining distinct similarity
functions can increase the Vendi Score, as should be expected of a diversity metric,
while leaving IntDiv unchanged. (c) IntDiv does not take into account correlations
between features, but the Vendi Score does. The Vendi Score is highest when the
items in the sample differ in many attributes, and the attributes are not correlated
with each other.
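To make the comparison in the figure concrete, the following sketch (not taken from the paper's code release) computes both quantities for block-structured binary similarity matrices like those in panel (a); the group sizes below are illustrative stand-ins for the shape/color groupings in the figure.

```python
import numpy as np

def vendi_score(K):
    """Exponential of the Shannon entropy of the eigenvalues of K/n."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]           # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

def intdiv(K):
    """IntDiv: one minus the average pairwise similarity (all pairs, including self-pairs)."""
    return float(np.mean(1.0 - K))

def block_similarity(group_sizes):
    """Binary similarity matrix: 1 within a group (same shape/color), 0 across groups."""
    labels = np.repeat(np.arange(len(group_sizes)), group_sizes)
    return (labels[:, None] == labels[None, :]).astype(float)

# Twelve items split into 2, 3, and 4 equally sized groups, as in panel (a):
for groups in ([6, 6], [4, 4, 4], [3, 3, 3, 3]):
    K = block_similarity(groups)
    print(groups, round(vendi_score(K), 2), round(intdiv(K), 2))
# -> VS = 2.0, 3.0, 4.0 while IntDiv = 0.5, 0.67, 0.75
```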
1 Introduction
Diversity is a criterion that is sought after in many areas of machine learning (ml),
from dataset curation and generative modeling to reinforcement learning, active
learning, and decoding algorithms. A lack of diversity in datasets and models can
hinder the usefulness of ml in many critical applications, e.g. scientific discovery. It
is therefore important to be able to measure diversity.
Many diversity metrics have been proposed in ML, but these metrics are often
domain-specific and limited in flexibility. These include metrics that define diversity
in terms of a reference dataset (Heusel et al., 2017; Sajjadi et al., 2018), a pre-trained
classifier (Salimans et al., 2016; Srivastava et al., 2017), or discrete features,
like n-grams (Li et al.,2016). In this paper, we propose a general, reference-free
approach that defines diversity in terms of a user-specified similarity function.
Our approach is based on work in ecology, where biological diversity has been
defined as the exponential of the entropy of the distribution of species within a
population (Hill,1973;Jost,2006;Leinster,2021). This value can be interpreted
as the effective number of species in the population. To adapt this approach to ML,
we define the diversity of a collection of elements $x_1, \ldots, x_n$ as the exponential of
the entropy of the eigenvalues of the $n \times n$ similarity matrix $K$, whose entries are
equal to the similarity scores between each pair of elements. This entropy can be
seen as the von Neumann entropy associated with $K$ (Bengtsson and Życzkowski,
2017), so we call our metric the Vendi Score, for the von Neumann diversity.
Contributions. We summarize our contributions as follows:
• We extend ecological diversity to ML, and propose the Vendi Score, a metric for
evaluating diversity in ML. We study the properties of the Vendi Score, which
provides us with a more formal understanding of desiderata for diversity.
• We showcase the flexibility and wide applicability of the Vendi Score, characteristics
that stem from its sole reliance on the sample to be evaluated for diversity and a
user-defined similarity function, and highlight the shortcomings of existing metrics
used to measure diversity in different domains.
2 Are We Measuring Diversity Correctly in ML?
Several existing metrics for diversity rely on a reference distribution or dataset.
These reference-based metrics define diversity in terms of coverage of the reference.
They assume access to an embedding function, such as a pretrained Inception
model (Szegedy et al., 2016), that maps samples to real-valued vectors. One example
of a reference-based metric is Fréchet Inception distance (fid) (Heusel et al.,2017),
which measures the Wasserstein-2 distance between two Gaussian distributions, one
Gaussian fit to the embeddings of the reference sample and another one fit to the
embeddings of the sample to be evaluated for diversity. fid was originally proposed
for evaluating image generative adversarial networks (gans) but has since been
applied to text (Cífka et al.,2018) and molecules (Preuer et al.,2018) using domain-
specific neural network encoders. Sajjadi et al. (2018) proposed a two-metric
evaluation paradigm using precision and recall, with precision measuring quality
and recall measuring diversity in terms of coverage of the reference distribution.
Several other variations of precision and recall have been proposed (Kynkäänniemi
et al.,2019;Simon et al.,2019;Naeem et al.,2020). Compared to these approaches,
the Vendi Score is a reference-free metric, measuring the intrinsic diversity of a set
rather than the relationship to a reference distribution. This means that the Vendi
Score should be used alongside a quality metric, but can be applied in settings
where there is no reference distribution.
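For reference, here is a minimal sketch of the Fréchet distance computation underlying fid, assuming the user has already mapped both samples to embeddings with a pretrained encoder (the encoder itself is not shown).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(ref_embeddings, gen_embeddings):
    """Wasserstein-2 distance between Gaussians fit to two sets of embeddings,
    each of shape (num_samples, embedding_dim)."""
    mu1, mu2 = ref_embeddings.mean(axis=0), gen_embeddings.mean(axis=0)
    cov1 = np.cov(ref_embeddings, rowvar=False)
    cov2 = np.cov(gen_embeddings, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):       # small imaginary parts can appear numerically
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * covmean))
```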
Some other existing metrics evaluate diversity using a pre-trained classifier, therefore
requiring labeled datasets. For example, the Inception score (is) (Salimans et al.,
2016), which is mainly used to evaluate the perceptual quality of image generative
models, evaluates diversity using the entropy of the marginal distribution of class
labels predicted by an ImageNet classifier. Another example is number of modes
(nom) (Srivastava et al.,2017), a metric used to evaluate the diversity of gans. nom
is calculated by using a classifier trained on a labeled dataset and then counting the
number of unique labels predicted by the classifier when using samples from a gan as
input. Both is and nom define diversity in terms of predefined labels, and therefore
require knowledge of the ground truth labels and a separate classifier.
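A minimal sketch of the two classifier-based diversity quantities just described, assuming a matrix of predicted class probabilities produced by some pretrained classifier (the classifier is not shown):

```python
import numpy as np

def marginal_label_entropy(class_probs):
    """Entropy of the marginal distribution over predicted labels, the quantity is uses
    to reflect diversity (the full Inception Score combines it with a per-sample
    confidence term). class_probs has shape (num_samples, num_classes)."""
    p_y = class_probs.mean(axis=0)
    p_y = p_y[p_y > 0]
    return float(-np.sum(p_y * np.log(p_y)))

def number_of_modes(class_probs):
    """nom: number of distinct classes predicted at least once by the classifier."""
    return int(len(np.unique(class_probs.argmax(axis=1))))
```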
In some discrete domains, diversity is often evaluated in terms of the distribution of
unique features. For example in natural language processing (nlp), a standard metric
is n-gram diversity, which is defined as the number of distinct n-grams divided by
the total number of n-grams (e.g. Li et al.,2016). These metrics require an explicit,
discrete feature representation.
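For concreteness, a minimal version of n-gram diversity is sketched below; the tokenization and n-gram order vary across papers, and whitespace tokenization with bigrams is used here only for illustration.

```python
def ngram_diversity(texts, n=2):
    """Number of distinct n-grams divided by the total number of n-grams."""
    total, distinct = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0

print(ngram_diversity(["the cat sat", "the cat sat"]))   # 0.5: repeated bigrams
print(ngram_diversity(["the cat sat", "a dog ran by"]))  # 1.0: all bigrams distinct
```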
There are proposed metrics that use similarity scores to define diversity. The most
widely used metric of this form is the average pairwise similarity score or the
complement, the average dissimilarity. In text, variants of this metric include
pairwise-bleu (Shen et al.,2019) and d-lex-sim (Fomicheva et al.,2020), in which
the similarity function is an n-gram overlap metric such as bleu (Papineni et al.,
2002). In biology, average dissimilarity is known as IntDiv (Benhenda,2017),
with similarity defined as the Jaccard (Tanimoto) similarity between molecular
fingerprints. Average similarity has some shortcomings, which we highlight in Figure 1.
The figure shows the similarity matrices induced by a shape similarity function
and/or a color similarity function. Each similarity function is 1 when the items
corresponding to the row and the column share the same shape or color, and 0
otherwise. As shown in Figure 1, the average similarity, here measured by IntDiv, becomes
less sensitive as diversity increases and does not account for correlations between
features. This is not the case for the Vendi Score, which accounts for correlations
between features and is able to capture the increased diversity resulting from
composing distinct similarity functions. Related to the metric we propose here is a
similarity-sensitive diversity metric proposed in ecology by Leinster and Cobbold
(2012), and which was introduced in the context of ml by Posada et al. (2020).
This metric is based on a notion of entropy defined in terms of a similarity profile, a
vector whose entries are equal to the expected similarity scores of each element.
Like IntDiv, it does not account for correlations between features.
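A sketch of the Leinster and Cobbold (2012) similarity-sensitive diversity mentioned above, written here for a uniform distribution over the sample and a precomputed similarity matrix K with ones on the diagonal:

```python
import numpy as np

def similarity_sensitive_diversity(K):
    """Order-1 diversity of Leinster & Cobbold (2012): the exponential of an entropy
    defined from the similarity profile K @ p, the expected similarity score of each
    element under the (here uniform) distribution p."""
    n = K.shape[0]
    p = np.full(n, 1.0 / n)
    profile = K @ p
    return float(np.exp(-np.sum(p * np.log(profile))))
```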
Some other diversity metrics in the ml literature fall outside of these categories.
The Birthday Paradox Test (Arora and Zhang,2018) aims to estimate the size of
the support of a generative model, but requires some manual inspection of samples.
gilbo (Alemi and Fischer,2018) is a reference-free metric but is only applicable
to latent variable generative models. Kviman et al. (2022) measure the diversity
of ensembles of variational approximations using the Jensen-Shannon Divergence
(jsd); this metric is only applicable to sets of probability distributions. Mitchell et al.
(2020) introduce metrics for diversity and inclusion, defining diversity in terms of
the representation of socially relevant attributes like gender and race, and using
the term heterogeneity to refer to variety in arbitrary attributes; in this paper, we
use the term diversity to have the same sense as heterogeneity, meaning variety in
arbitrary (user-specified) attributes. In the context of drug exploration, Xie et al.
(2022) propose a metric based on the size of the largest subset of elements such that
the similarity between any pair of elements is below some threshold, but this metric
requires setting a threshold. Similarly, in the field of evolutionary computation,
quality diversity (qd) algorithms (Pugh et al., 2015) have assessed diversity by
discretizing the feature space into a grid of bins and counting the number of covered
bins, but this approach requires picking a bin size.
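Rough sketches of these last two approaches follow; both depend on a resolution parameter, and the greedy construction below is a stand-in rather than the exact procedure of Xie et al. (2022).

```python
import numpy as np

def threshold_diversity(K, threshold=0.5):
    """Greedy count of elements whose pairwise similarities all fall below `threshold`,
    a lower bound on the size of the largest such subset."""
    selected = []
    for i in range(K.shape[0]):
        if all(K[i, j] < threshold for j in selected):
            selected.append(i)
    return len(selected)

def covered_bins(features, bins_per_dim=10):
    """Quality-diversity-style coverage: number of occupied cells after discretizing
    each feature dimension into `bins_per_dim` equal-width bins."""
    mins, maxs = features.min(axis=0), features.max(axis=0)
    cells = np.floor((features - mins) / (maxs - mins + 1e-12) * bins_per_dim)
    cells = np.clip(cells.astype(int), 0, bins_per_dim - 1)
    return len({tuple(row) for row in cells})
```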
As discussed above, several attempts have been made to measure diversity in ml.
However, the proposed metrics can be limited in their applicability in that they re-
quire a reference dataset or predefined labels, or are domain-specific and applicable
to one class of models. The existing metrics that do not have those applicability
limitations have shortcomings when it comes to capturing diversity, which we have
illustrated in Figure 1.
3 Measuring Diversity with the Vendi Score
We now define the Vendi Score, state its properties, and study its computational
complexity. (We relegate all proofs of lemmas and theorems to the appendix.)
3.1 Defining the Vendi Score
To define a diversity metric in ml we look to ecology, the field that centers diversity
in its work. In ecology, one main way diversity is defined is as the exponential of the
entropy of the distribution of the species under study (Jost,2006;Leinster,2021).
This is a reasonable index for diversity. Consider a population with a uniform
distribution over $n$ species, with entropy $\log(n)$. This population has maximal
ecological diversity $n$, the same diversity as a population with $n$ members, each
belonging to a different species. The ecological diversity decreases as the distribution
over the species becomes less uniform, and is minimized and equal to one when
all members of the population belong to the same species. For a more extensive
mathematical discussion of entropy and diversity in the context of biodiversity, we
refer readers to Leinster (2021).
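A short worked example of this index, with relative species abundances as input: a uniform distribution over n species yields diversity n, and a highly concentrated distribution yields a value close to one.

```python
import numpy as np

def ecological_diversity(abundances):
    """Exponential of the Shannon entropy of relative species abundances
    (the Hill number of order 1)."""
    p = np.asarray(abundances, dtype=float)
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

print(ecological_diversity([0.25, 0.25, 0.25, 0.25]))  # 4.0: four equally common species
print(ecological_diversity([0.97, 0.01, 0.01, 0.01]))  # ~1.18: effectively a single species
```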
How can we extend this way of thinking about diversity to ml? One naive approach
is to define diversity as the exponential of the Shannon entropy of the probability
distribution defined by a machine learning model or dataset. However, this approach
is limiting in that it requires a probability distribution for which entropy is tractable,
which is not available in many ml settings. We would like to define a diversity metric
that only relies on the samples being evaluated for diversity. And we would like for
such a metric to achieve its maximum value when all samples are dissimilar and
its minimum value when all samples are the same. This implies the need to define
a similarity function over the samples. Endowed with such a similarity function,
we can define a form of entropy that only relies on the samples to be evaluated for
diversity. This leads us to the Vendi Score:
Definition 3.1 (Vendi Score). Let $x_1, \ldots, x_n \in \mathcal{X}$ denote a collection of samples, let
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive semidefinite similarity function, with $k(x, x) = 1$ for all
$x$, and let $K \in \mathbb{R}^{n \times n}$ denote the kernel matrix with entry $K_{i,j} = k(x_i, x_j)$. Denote by
$\lambda_1, \ldots, \lambda_n$ the eigenvalues of $K/n$. The Vendi Score ($\mathrm{VS}$) is defined as the exponential
of the Shannon entropy of these eigenvalues,
$$\mathrm{VS}_k(x_1, \ldots, x_n) = \exp\left(-\sum_{i=1}^{n} \lambda_i \log \lambda_i\right),$$
with the convention $0 \log 0 = 0$.
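A minimal NumPy sketch of this definition (the released implementation is in the repository linked above; this standalone version is only illustrative):

```python
import numpy as np

def vendi_score(samples, k):
    """Vendi Score: exp of the Shannon entropy of the eigenvalues of K/n, where
    K[i, j] = k(samples[i], samples[j]) and k(x, x) = 1 for all x."""
    n = len(samples)
    K = np.array([[k(x, y) for y in samples] for x in samples])
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]            # convention: 0 * log(0) = 0
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

# Toy example with a hypothetical similarity function on strings:
same_first_letter = lambda a, b: float(a[0] == b[0])
samples = ["apple", "apricot", "banana", "blueberry", "cherry", "cranberry"]
print(vendi_score(samples, same_first_letter))    # 3.0: three effectively distinct groups
```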