
of model performance on data groups, as shown in Figure 2. It comes pre-loaded with model outputs for the most downloaded HuggingFace (HF) models¹ and datasets, as well as scripts for loading data for any dataset provided by the Datasets API and for extracting embeddings from any HF-compatible model.
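As an illustration of the extraction path, here is a minimal sketch using the standard datasets and transformers APIs; the model/dataset pair mirrors the example in Section 2.1, and the choice of pooling at the [CLS] position is an assumption:

import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

dataset = load_dataset('yelp_polarity', split='test')
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased').eval()

def embed(texts):
    # Return one vector per text, taken from the final hidden layer
    # at the [CLS] position.
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors='pt')
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    return hidden[:, 0, :]

embeddings = embed(dataset['text'][:32])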
2.1 Error Discovery and Analysis
Identifying model failures via error discovery is
a crucial step in engineering robust systems that
generalize to diverse subsets of data. SEAL uses
the model’s loss on a datapoint as a proxy for po-
tential bugs or errors. Past work has examined
model behavior on individual datapoints for map-
ping training datasets (Swayamdipta et al., 2020).
We leverage information about model behavior on individual evaluation datapoints in a similar fashion, using quantiles to divide the model-loss range for further analysis. For example, Figure 2 shows the 0.99 loss quantile for the distilbert-base-uncased model (Sanh et al., 2019) on the yelp_polarity sentiment classification dataset (Zhang et al., 2015). The SEAL interface allows
the user to control the loss quantile for fine-grained
analysis using the widget on the side panel.
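A minimal sketch of this selection step, assuming the per-example losses have already been computed into a NumPy array (the function name is illustrative):

import numpy as np

def high_loss_slice(losses, quantile=0.99):
    # Return the indices of datapoints whose loss exceeds the given
    # quantile threshold, i.e. the candidate error region.
    threshold = np.quantile(losses, quantile)
    return np.where(losses > threshold)[0]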
SEAL uses k-means++ for clustering the high-
loss candidate datapoints from the above step.
Meng et al. (2022) used k-means for topic discovery on the entire dataset and showed that the clusters are stable only when k is very large (k >> 100) because of the scale of the embedding space. In contrast, SEAL clusters only the very-high-loss slice (above the 0.98 quantile).
We use the representations of the model's final hidden layer (before the softmax) as embeddings. If the evaluation dataset selected by the user has ground-truth annotations, SEAL groups the clusters by error type (false positives and false negatives for binary classification). The visualization component of the SEAL interface shows the error clusters and their types using colors and symbols, respectively. We use a standard heuristic of setting the number of clusters in k-means++ to approximately √(n/2), where n is the group size.
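A minimal sketch of this clustering step, using scikit-learn's KMeans (which defaults to k-means++ initialization); the function name and the binary error-type rule are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

def cluster_errors(embeddings, preds, labels):
    # embeddings: final-hidden-layer vectors of the high-loss slice;
    # preds, labels: binary predictions and ground-truth annotations.
    n = len(embeddings)
    k = max(1, round((n / 2) ** 0.5))  # heuristic: k ~ sqrt(n/2)
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    cluster_ids = km.fit_predict(embeddings)
    # For binary labels in {0, 1}, a predicted 1 on a true 0 is a false
    # positive; we assume the high-loss slice is misclassified.
    error_type = np.where(preds > labels, 'false positive', 'false negative')
    return cluster_ids, error_type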
2.2 Semantic Error Labeling
Semantic error labeling is important for identifying
the underlying concept or topic connecting the dat-
apoints in an error group. Systematic errors can be mathematically modeled and fixed through data curation; random errors, in contrast, cannot be modeled or fixed this way.
Past work analyzing NLP models has shown systematic errors on various tasks, including sentiment classification, natural language inference, and reading comprehension (McCoy et al., 2019; Kaushik et al., 2020; Jia and Liang, 2017). SEAL uses pretrained LLMs (such as GPT3 (Ouyang et al., 2022) or Bloom (BigScience, 2022)) for semantic labeling of error clusters, which can highlight such possible systematic bugs in model performance. We craft a prompt consisting of an instruction and examples from the clusters extracted in the previous step, as follows.
def build_prompt(content, task):
    # `content` holds the documents of one error cluster as strings;
    # `task` names the underlying task, e.g. 'sentiment classification'.
    instruction = (
        f"In this task, we'll assign a short and precise label to a "
        f"group of documents based on the topics or concepts most "
        f"relevant to these documents. The documents are all subsets "
        f"of a {task} dataset."
    )
    examples = '\n- '.join(content)
    prompt = instruction + '\n- ' + examples + '\nGroup label:'
    return prompt
Here, task names the task under consideration, e.g. ‘sentiment classification’ in our case. The content argument is a dataframe column (or any iterable of strings) holding the dataset content that the model uses for classification. We first experimented with our prompt design in the few-shot setting before adapting it to the zero-shot setting.
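A hypothetical call on one error cluster might look as follows; the example texts are illustrative only:

# Hypothetical usage on one error cluster from yelp_polarity.
cluster_texts = [
    "The pasta was cold and the waiter ignored us all night.",
    "Worst service I have had at any restaurant downtown.",
]
prompt = build_prompt(cluster_texts, task='sentiment classification')
# The completion the LLM produces after 'Group label:' is taken as
# the cluster's semantic label.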
For the results and the use-case discussion in Section 3, we use the OpenAI GPT3 API² via the CLI. The maximum token length is limited to 4,000, so we truncate the prompt to that length before feeding it to the model. We observed that for many larger
groups of high-loss examples (more than 25 datapoints), SEAL's labels degenerate to generic outputs such as "customer reviews of products", "movie reviews", "restaurant reviews", etc. To prevent this and to generate coherent group labels, we sub-cluster the bigger error groups until their size falls below 25, as sketched below.
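A minimal sketch of this sub-clustering step; the recursive binary split (k = 2 at each level) is an illustrative assumption, as the exact splitting rule is not specified above:

import numpy as np
from sklearn.cluster import KMeans

def sub_cluster(embeddings, indices, max_size=25):
    # Recursively split a group until every sub-group holds fewer
    # than `max_size` examples. `indices` is a NumPy integer array
    # indexing rows of `embeddings`.
    if len(indices) < max_size:
        return [indices]
    km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
    assignments = km.fit_predict(embeddings[indices])
    groups = []
    for c in (0, 1):
        groups.extend(sub_cluster(embeddings, indices[assignments == c],
                                  max_size))
    return groups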
We verified the group labels by running the LDA topic model (Blei et al., 2003) on the examples in each cluster after a pre-processing step consisting of tokenizing, lemmatizing, and removing stopwords.
For each dataset domain, we also removed the domain word list (‘movie, watch, film, character’
¹ Based on usage data from July 2022 at https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads
² https://beta.openai.com/playground