
sitive to small (and often undocumented) choices such as random seeds and hyperparameters (Pineau et al., 2021). Model performance is often not compared with proper statistical testing that takes this variance into account, making many self-reported comparisons unreliable (see the sketch after these points for a variance-aware comparison). Our goal is to standardize this process and thereby improve the reproducibility of ML evaluations.
Centralization: Historically, ML metrics have been poorly documented, exacerbating an already insufficient community-wide understanding of their usage and shortcomings (Post, 2018). As metrics and datasets change, the onus is on the community to keep results up-to-date, causing unnecessary replication of work (Ma et al., 2021) and the proliferation of outdated artifacts (Luccioni et al., 2022).
Coverage: ML as a field still focuses heavily on accuracy-based metrics. While accuracy is important, this narrow focus glosses over other critical facets such as efficiency (Min et al., 2021), bias and fairness (Qian et al., 2022), robustness (Goel et al., 2021), and how these factor into choosing a model (Ethayarajh and Jurafsky, 2020; Ma et al., 2021).
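To make the statistical-testing point above concrete, the following sketch compares two models evaluated across several random seeds with a paired significance test. The per-seed scores and the 0.05 threshold are hypothetical placeholders, and a paired bootstrap or permutation test would serve equally well.

```python
# Sketch: variance-aware comparison of two models across random seeds.
# The per-seed scores below are hypothetical placeholders; in practice they
# come from re-training/evaluating each model under several seeds.
from scipy import stats

model_a_scores = [0.812, 0.807, 0.815, 0.809, 0.811]  # e.g. accuracy per seed
model_b_scores = [0.818, 0.810, 0.816, 0.814, 0.813]

# Paired t-test on per-seed score differences: is the mean difference
# significantly different from zero, given the observed seed variance?
t_stat, p_value = stats.ttest_rel(model_b_scores, model_a_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Only report an improvement if p falls below a pre-registered threshold
# (0.05 here, chosen purely for illustration).
alpha = 0.05
print("significant" if p_value < alpha else "not significant")
```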
We introduce the open-source Evaluate library and the Evaluation on the Hub platform to address many of these problems. We believe that better evaluation can happen if we, as a community, establish best practices and remove hurdles.
2 Related work
Open-Source Tools for Evaluation
There is a long history of open-source projects aiming to capture various measurements, metrics, and statistical testing methods for ML. Torchmetrics (Detlefsen et al., 2022) implements a large number of model evaluation metrics for PyTorch (Paszke et al., 2019), similar to the evaluation metrics found in Keras (Chollet et al., 2015) for TensorFlow. Libraries such as Scikit-learn (Pedregosa et al., 2011), SciPy (Virtanen et al., 2020), Statsmodels (Seabold and Perktold, 2010), NLTK (Bird et al., 2009), TrecTools (Palotti et al., 2019), RL Reliability Metrics (Chan et al., 2020), NetworkX (Hagberg et al., 2008), Scikit-image (Van der Walt et al., 2014), GEM (Gehrmann et al., 2021), and TorchFidelity (Obukhov et al., 2020) also support many evaluation measures across many domains. As integrating metrics into specific frameworks can be difficult, there are also many libraries dedicated to individual evaluations, for example rouge_score,¹ BARTScore (Yuan et al., 2021), or SacreBLEU (Post, 2018). The fragmentation of this ecosystem leads to various problems, such as a wide range of incompatible conventions and APIs, or misreporting due to differing implementations and results.
In Evaluate, we provide a single interface backed by a centralized Hub. Metrics can easily be shared, are version-controlled, expose a standardized interface, and allow for multimodal inputs.
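As a minimal sketch of what that standardized interface looks like in practice, the snippet below loads a metric by name and computes a score; the choice of the accuracy metric and the toy predictions/references are illustrative only.

```python
# Minimal sketch of the shared metric interface in Evaluate.
import evaluate

# Load a metric implementation from the Hub by name.
accuracy = evaluate.load("accuracy")

# Every loaded metric exposes the same compute() entry point.
results = accuracy.compute(
    predictions=[0, 1, 1, 0],  # toy model outputs
    references=[0, 1, 0, 0],   # toy gold labels
)
print(results)  # {'accuracy': 0.75}
```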
Evaluation as a Service
The idea of Evaluation as a Service (Ma et al., 2021; Kiela et al., 2021), whereby models are submitted to another party to be evaluated centrally, has recently gained traction as a more reproducible way to conduct model evaluation. Central evaluation also facilitates holding challenges and competitions around datasets (Yadav et al., 2019; Pavao et al., 2022; Akhbardeh et al., 2021), as opposed to simply evaluating self-reported model results or comparing model scores with benchmark suites (Bajaj et al., 2016; Coleman et al., 2017; Wang et al., 2018, 2019; Kardas et al., 2020; Reddi et al., 2020; Liu et al., 2021; Goel et al., 2021; Dror et al., 2019). Conducting evaluation centrally has multiple advantages, including better reproducibility, forward/backward compatibility, and the ability to measure models along multiple axes of evaluation (e.g., efficiency and fairness, in addition to accuracy), which can contribute to a more systematic approach to evaluation.
Issues with Evaluation
Several studies of ML research and practice have been carried out in recent years on different aspects of ML evaluation, and together they paint a bleak picture of evaluation in our field. For instance, a 2019 large-scale replication study of 255 ML papers found that only 63% of the results they reported could be systematically replicated (Raff, 2019). A complementary survey of 3,800 papers from Papers with Code showed that a large majority of the metrics used do not adequately reflect models' performance and that they largely do not correlate with human judgement (Blagec et al., 2021). Finally, a recent study of 770 machine translation papers from the last decade found that while 108 new metrics have been proposed for the task, 99.8% of papers continue to use BLEU for reporting results (Marie et al., 2021), despite the fact that the
¹ github.com/google-research/google-research/tree/master/rouge