
of model performance on data groups, as shown in Figure 2. It comes pre-loaded with model outputs for the most downloaded HuggingFace (HF) models¹ and datasets, as well as scripts for loading data for any dataset provided by the Datasets API and for extracting embeddings from any HF-compatible model.
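As an illustration of the extraction path, here is a minimal sketch using the standard datasets and transformers APIs; the model/dataset pair mirrors the example in Section 2.1, and the choice of pooling at the [CLS] position is an assumption:

import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

dataset = load_dataset('yelp_polarity', split='test')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased').eval()

def embed(texts):
    # Return one vector per text, taken from the final hidden layer
    # at the [CLS] position.
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors='pt')
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    return hidden[:, 0, :]

embeddings = embed(dataset['text'][:32])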
2.1 Error Discovery and Analysis
Identifying model failures via error discovery is
a crucial step in engineering robust systems that
generalize to diverse subsets of data. SEAL uses
the model’s loss on a datapoint as a proxy for po-
tential bugs or errors. Past work has examined
model behavior on individual datapoints for map-
ping training datasets (Swayamdipta et al., 2020).
We leverage information about model behavior on individual evaluation datapoints in a similar fashion, using quantiles to divide the model-loss range for further analysis. For example, Figure 2 shows the 0.99 loss quantile for the distilbert-base-uncased model (Sanh et al., 2019) on the yelp_polarity sentiment classification dataset (Zhang et al., 2015). The SEAL interface allows
the user to control the loss quantile for fine-grained
analysis using the widget on the side panel.
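A minimal sketch of this selection step, assuming the per-example losses have already been computed into a NumPy array (the function name is illustrative):

import numpy as np

def high_loss_slice(losses, quantile=0.99):
    # Return the indices of datapoints whose loss exceeds the given
    # quantile threshold, i.e. the candidate error region.
    threshold = np.quantile(losses, quantile)
    return np.where(losses > threshold)[0]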
SEAL uses k-means++ for clustering the high-
loss candidate datapoints from the above step.
Meng et al. (2022) used k-means for topic discovery on the entire dataset and showed that the clusters are stable only when k is very large (k >> 100) because of the scale of the embedding space. In contrast, SEAL clusters only the very-high-loss slice (above the 0.98 quantile).
We use the representations of the model's final hidden layer (before the softmax) as embeddings. If the evaluation dataset selected by the user has ground-truth annotations, SEAL groups the clusters by error type (false positives and false negatives for binary classification). The visualization component of the SEAL interface shows the error clusters and their types using colors and symbols, respectively. We use a standard heuristic of setting the number of clusters in k-means++ to approximately √(n/2), where n is the group size.
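A minimal sketch of this clustering step, using scikit-learn's KMeans (which defaults to k-means++ initialization); the function name and the binary error-type rule are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

def cluster_errors(embeddings, preds, labels):
    # embeddings: final-hidden-layer vectors of the high-loss slice;
    # preds, labels: binary predictions and ground-truth annotations.
    n = len(embeddings)
    k = max(1, round((n / 2) ** 0.5))  # heuristic: k ~ sqrt(n/2)
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    cluster_ids = km.fit_predict(embeddings)
    # For binary labels in {0, 1}, a predicted 1 on a true 0 is a false
    # positive; we assume the high-loss slice is misclassified.
    error_type = np.where(preds > labels, 'false positive', 'false negative')
    return cluster_ids, error_type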
2.2 Semantic Error Labeling
Semantic error labeling is important for identifying
the underlying concept or topic connecting the dat-
apoints in an error group. Systematic errors can be mathematically modeled and fixed through data curation; random errors, in contrast, cannot be modeled or fixed this way.
Past work analyzing NLP models has shown systematic errors on various tasks, including sentiment classification, natural language inference, and reading comprehension (McCoy et al., 2019; Kaushik et al., 2020; Jia and Liang, 2017). SEAL uses pretrained LLMs (such as GPT3 (Ouyang et al., 2022) or Bloom (BigScience, 2022)) for semantic labeling of error clusters, which can highlight such possible systematic bugs in model performance. We craft a prompt consisting of an instruction and examples from the clusters extracted in the previous step, as follows.
def build_prompt(content, task):
    # `content` holds the documents of one error cluster as strings;
    # `task` names the underlying task, e.g. 'sentiment classification'.
    instruction = (
        f"In this task, we'll assign a short and precise label to a "
        f"group of documents based on the topics or concepts most "
        f"relevant to these documents. The documents are all subsets "
        f"of a {task} dataset."
    )
    examples = '\n- '.join(content)
    prompt = instruction + '\n- ' + examples + '\nGroup label:'
    return prompt
Here, task names the task under consideration, e.g. ‘sentiment classification’ in our case. The content argument is a dataframe column (or any iterable of strings) holding the dataset content that the model uses for classification. We first experimented with our prompt design in the few-shot setting before adapting it to the zero-shot setting.
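A hypothetical call on one error cluster might look as follows; the example texts are illustrative only:

# Hypothetical usage on one error cluster from yelp_polarity.
cluster_texts = [
    "The pasta was cold and the waiter ignored us all night.",
    "Worst service I have had at any restaurant downtown.",
]
prompt = build_prompt(cluster_texts, task='sentiment classification')
# The completion the LLM produces after 'Group label:' is taken as
# the cluster's semantic label.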
For the results and the use-case discussion in Section 3, we use the OpenAI GPT3 API² via the CLI. The maximum token length is limited to 4,000, so we truncate the prompt to that length before feeding it to the model. We observed that for many larger
groups of high-loss examples (more than 25 datapoints), SEAL's labels degenerate to generic outputs such as "customer reviews of products", "movie reviews", "restaurant reviews", etc. To prevent this and to generate coherent group labels, we sub-cluster the bigger error groups until their size falls below 25, as sketched below.
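A minimal sketch of this sub-clustering step; the recursive binary split (k = 2 at each level) is an illustrative assumption, as the exact splitting rule is not specified above:

import numpy as np
from sklearn.cluster import KMeans

def sub_cluster(embeddings, indices, max_size=25):
    # Recursively split a group until every sub-group holds fewer
    # than `max_size` examples. `indices` is a NumPy integer array
    # indexing rows of `embeddings`.
    if len(indices) < max_size:
        return [indices]
    km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
    assignments = km.fit_predict(embeddings[indices])
    groups = []
    for c in (0, 1):
        groups.extend(sub_cluster(embeddings, indices[assignments == c],
                                  max_size))
    return groups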
We verified the group labels by running the LDA topic model (Blei et al., 2003) on the examples in each cluster after a pre-processing step consisting of tokenizing, lemmatizing, and removing stopwords.
For each dataset domain, we also removed the domain word list (‘movie, watch, film, character’
¹ Based on usage data from July 2022 at https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads
² https://beta.openai.com/playground