SEAL: Interactive Tool for Systematic Error Analysis and Labeling
Nazneen Rajani†, Weixin Liang‡, Lingjiao Chen‡, Meg Mitchell†, James Zou‡
†Hugging Face  ‡Department of Computer Science, Stanford University
{nazneen, meg}@huggingface.co {wxliang,lingjiao,jamesz}@stanford.edu
Abstract
With the advent of Transformers, large language models (LLMs) have saturated well-known NLP benchmarks and leaderboards with high aggregate performance. However, these models often fail systematically on tail data or rare groups that are not obvious in aggregate evaluation. Identifying such problematic data groups is even more challenging when there are no explicit labels (e.g., ethnicity, gender, etc.), and the problem is further compounded for NLP datasets by the lack of visual features to characterize failure modes (e.g., Asian males, animals indoors, waterbirds on land, etc.). This paper introduces an interactive Systematic Error Analysis and Labeling (SEAL) tool that uses a two-step approach: first, identify high-error slices of data; second, give human-understandable semantics to those under-performing slices. We explore a variety of methods for deriving coherent semantics for the error groups, using language models for semantic labeling and a text-to-image model for generating visual features. The SEAL toolkit and a demo screencast are available at https://huggingface.co/spaces/nazneen/seal.
1 Introduction
Machine learning systems that seemingly perform well on average can still make systematic errors on important subsets of data. Examples include such systems performing poorly for marginalized groups in chatbots (Stuart-Ulin, 2018), recruiting tools (Hamilton, 2018), cloud products (Kayser-Bril, 2020), ad targeting (Hao, 2019), credit services (Knight, 2019), and image cropping (Hamilton, 2020). Discovering and labeling systematic errors in ML systems is an open research problem whose solution would enable building robust models that generalize across subpopulations of data.
Figure 1: SEAL interactive tool for discovering systematic errors in model performance. Steps 1 and 2 extract the model embeddings and cluster datapoints with high loss. Steps 3 and 4 perform semantic labeling of error groups and generate visual features to support debugging.

Uncovering underperforming groups of data in an ML system is not straightforward. First, the high-dimensional space of the representations learned by deep learning models makes it difficult to identify such groups of systematic errors. Second, it is difficult to extract and label the hidden semantic information in such high-error groups without a human-in-the-loop setup. Identifying systematic model failures requires practitioners to think creatively about model evaluation (Ribeiro et al., 2020; Wu et al., 2019; Goel et al., 2021b; Kiela et al., 2021; Yuan et al., 2022). However, current approaches are mostly limited to examining and manipulating model mispredictions. The onus of identifying which group or subset of data to evaluate still falls on the practitioner, making the process inefficient and prone to oversight. Recent works on fine-grained error analysis, such as Domino (Eyuboglu et al., 2022) and Spotlight (d'Eon et al., 2022), provide solutions to this problem but focus on image datasets, which are easier to visualize.
Error analysis for text data is less explored and more challenging. It also highlights the need to provide semantic summaries of text, which we tackle in SEAL. For example, NLP models could underperform on hundreds of possible input types: longer inputs, inputs from non-native speakers, inputs with topic domains underrepresented in training, etc. This is a huge barrier to entry for most non-expert ML users who wish to gain a better understanding of their models and datasets with existing tools. Model evaluation should ideally give actionable insights into a model's performance on a dataset in the form of data curation (Liang and Zou, 2022) or model patching (Goel et al., 2021a).

arXiv:2210.05839v1 [cs.CL] 11 Oct 2022

Figure 2: SEAL interface showing high-error groups for the distilbert-base-uncased model evaluated on the yelp_polarity dataset. The interface comprises various components: (a) examples from the dataset in the high-error groups (sorted by loss), (b) statistics of tokens in high-error groups relative to the entire evaluation set, and (c) an interactive 2D visualization of the model embeddings showing error groups in color and low-loss groups in gray. The colors indicate different error clusters. If the dataset has annotated classes, the visualization includes symbols to represent those classes. The panel on the left has multiple widgets that a user can control to interactively understand their model's mispredictions relative to the rest of the model's outputs. Apart from the dataset and model, the user can select the loss quantile they want to examine for systematic errors, whether they want SEAL to group those errors using k-means++ (and with how many clusters), and how many data points they want to visualize at a time in the visual component of the interface, downsampled proportionally to the group size (we use Altair for plotting, which supports a maximum of 5,000 data points visualized at once).
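The proportional downsampling described in the Figure 2 caption can be sketched as follows; the function name and sampling details here are our own illustration of the described behavior, not SEAL's exact implementation:

```python
import numpy as np

def downsample(group_ids, max_points=5000, seed=0):
    """Sample plot indices per error group, proportional to group size,
    so that at most max_points land in the Altair chart."""
    rng = np.random.default_rng(seed)
    group_ids = np.asarray(group_ids)
    n = len(group_ids)
    if n <= max_points:
        return np.arange(n)  # small enough to plot everything
    keep = []
    for g in np.unique(group_ids):
        idx = np.where(group_ids == g)[0]
        # each group's quota is proportional to its share of all points
        quota = max(1, round(len(idx) / n * max_points))
        keep.append(rng.choice(idx, size=min(quota, len(idx)), replace=False))
    return np.sort(np.concatenate(keep))

# e.g., 6,000 points in two clusters get thinned to the 5,000-point budget
plot_idx = downsample([0] * 4000 + [1] * 2000)
```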
Our desideratum is a tool that summarizes the failures of a model on textual data in a concise, coherent, and human-interpretable way. Systematic Error Analysis and Labeling (SEAL) is an interactive tool that (1) identifies candidate groups of data with high systematic errors and (2) generates semantic labels for those groups. For (1), we use k-means++ on the subset of evaluation data with the highest loss. Semantic labeling uses LLMs (like GPT-3) in a zero-shot setting to identify concepts or topics common to the examples in a candidate group. We also explored using a text-to-image model, DALL·E mini (Dayma et al., 2021), to generate visual features for high-error clusters. Semantic descriptions (via labeling or visual features) of such systematic model errors not only enable practitioners to better understand the failure modes of their models during evaluation but also give actionable insights for fixing them via some form of model patching or data augmentation.
2 SEAL
We present Systematic Error Analysis and Labeling (SEAL), an interactive visualization tool that provides rich data point comparison for text classification systems, enabling fine-grained understanding of model performance on data groups, as shown in Figure 2. It comes pre-loaded with model outputs for the most-downloaded HuggingFace (HF) models and datasets[1], as well as scripts for loading data for any dataset provided by the Datasets API and extracting embeddings of any HF-compatible model.
2.1 Error Discovery and Analysis
Identifying model failures via error discovery is a crucial step in engineering robust systems that generalize to diverse subsets of data. SEAL uses the model's loss on a datapoint as a proxy for potential bugs or errors. Past work has examined model behavior on individual datapoints for mapping training datasets (Swayamdipta et al., 2020). We hope to leverage information about model behavior on individual evaluation datapoints in a similar fashion. We use quantiles to divide the model-loss region for further analysis. For example, Figure 2 shows the 0.99 loss quantile for the distilbert-base-uncased model (Sanh et al., 2019) on the yelp_polarity (Zhang et al., 2015) sentiment classification dataset. The SEAL interface allows the user to control the loss quantile for fine-grained analysis using the widget on the side panel.
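As a minimal sketch of this slicing step (with toy loss values rather than real model outputs), the candidates at or above a chosen loss quantile can be selected as follows:

```python
import numpy as np

def high_loss_slice(losses, quantile=0.98):
    """Indices of evaluation examples at or above the chosen loss quantile,
    mirroring SEAL's candidate-selection step."""
    losses = np.asarray(losses)
    return np.where(losses >= np.quantile(losses, quantile))[0]

# toy per-example losses; the two largest values land in the 0.8-quantile slice
candidates = high_loss_slice(
    [0.1, 0.2, 0.15, 3.2, 0.05, 2.9, 0.3, 4.1, 0.12, 0.2], quantile=0.8)
```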
SEAL uses k-means++ to cluster the high-loss candidate datapoints from the above step. Meng et al. (2022) used k-means for topic discovery on an entire dataset and showed that the clusters are stable only when k is very high (k >> 100) because of the scale of the embedding space. In contrast, SEAL only clusters the very-high-loss slice (> 0.98 quantile).

We use the representations of the model's final hidden layer (before the softmax) as embeddings. If the evaluation dataset selected by the user has ground-truth annotations, then SEAL groups the clusters by error type (false positives and false negatives for binary classification). The visualization component of the SEAL interface shows the error clusters and their types using colors and symbols, respectively. We use a standard heuristic of setting the number of clusters in k-means++ to approximately √(n/2), where n is the group size.
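The clustering step can be sketched with scikit-learn's KMeans (whose k-means++ initialization matches what SEAL describes), together with the k ≈ √(n/2) heuristic above; the embeddings below are random stand-ins for real model representations:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_high_loss(embeddings, random_state=0):
    """Cluster high-loss embeddings with k-means++, choosing k ~ sqrt(n/2)."""
    n = len(embeddings)
    k = max(1, round(np.sqrt(n / 2)))
    km = KMeans(n_clusters=k, init="k-means++", n_init=10,
                random_state=random_state)
    return km.fit_predict(embeddings), k

rng = np.random.default_rng(0)
# two well-separated blobs standing in for two error groups (n = 8, so k = 2)
emb = np.vstack([rng.normal(0.0, 0.1, (4, 8)), rng.normal(5.0, 0.1, (4, 8))])
labels, k = cluster_high_loss(emb)
```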
2.2 Semantic Error Labeling
Semantic error labeling is important for identifying the underlying concept or topic connecting the datapoints in an error group. Systematic errors can be mathematically modeled and fixed by data curation; contrast this with random errors, which cannot be mathematically modeled or fixed via data curation. Past work analyzing NLP models has shown systematic errors on various tasks, including sentiment classification, natural language inference, and reading comprehension (McCoy et al., 2019; Kaushik et al., 2020; Jia and Liang, 2017). SEAL uses pretrained LLMs (such as GPT-3 (Ouyang et al., 2022) or BLOOM (BigScience, 2022)) for semantic labeling of error clusters, which can highlight such possible systematic bugs in model performance.

[1] Based on usage data from July 2022 at https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads

We craft a prompt consisting of an instruction and the examples in the clusters extracted in the previous step, as follows.
def build_prompt(content):
    instruction = ("In this task, we'll assign a short and precise label to "
                   "a group of documents based on the topics or concepts most "
                   "relevant to these documents. The documents are all "
                   "subsets of a ${task} dataset.")
    examples = '\n- '.join(content)
    prompt = instruction + '-' + examples + '\nGroup label: '
    return prompt
Here task is the task under consideration, e.g., 'sentiment classification' in our case. The content argument to the function is a dataframe or dataframe column holding, as strings, the dataset content that the model uses for classification. We first experimented with our prompt design in the few-shot setting before adapting it to the zero-shot setting.
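To make the assembly concrete, here is a minimal, self-contained usage sketch of the prompt construction; the group texts are invented, and passing the task as an argument (rather than a ${task} template substitution) is a simplification for this sketch:

```python
def build_prompt(content, task="sentiment classification"):
    # Same structure as the prompt builder above; the task argument fills
    # the ${task} placeholder (a simplification for this sketch).
    instruction = ("In this task, we'll assign a short and precise label to "
                   "a group of documents based on the topics or concepts most "
                   "relevant to these documents. The documents are all "
                   f"subsets of a {task} dataset.")
    examples = '\n- '.join(content)
    return instruction + '-' + examples + '\nGroup label: '

# a hypothetical high-loss group of two restaurant reviews
prompt = build_prompt(["The pasta was cold and the waiter was rude.",
                       "Waited 40 minutes; the food arrived lukewarm."])
```

The LLM's completion after "Group label: " then serves as the semantic label for the group.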
For the results and use-case discussion in Section 3, we use the OpenAI GPT-3 API[2] via the CLI. The maximum token length is limited to 4,000, so we truncate the prompt to that length before feeding it to the model. We observed that for many larger groups of high-loss examples (> 25), SEAL labels degenerate into generic outputs such as "customer reviews of products", "movie reviews", "restaurant reviews", etc. To prevent this and to generate coherent group labels, we sub-cluster the bigger error groups until their size is < 25. We verified the group labels by running the LDA topic model (Blei et al., 2003) on the examples in each cluster after a pre-processing step. The pre-processing included tokenizing, lemmatizing, and removing stopwords. For each dataset domain, we also removed the domain word list ('movie, watch, film, character'
[2] https://beta.openai.com/playground