
Figure 1: An illustration of learning group annotations in the presence of outliers. (a) A toy dataset in two dimensions. There are four groups g = 1, 2, 3, 4 and an outlier. Groups g = 1 and g = 3 are the majority groups, distributed as mixtures of three components each; g = 2 and g = 4 are unimodal minority groups. The y-axis is the decision boundary of a logistic regression classifier. Figures (b, c, d) compare different data views for learning group annotations and detecting outliers via clustering of samples with y = 0: (b) loss values can confuse outliers and minority samples, which both can have high loss; (c) in the original feature space it is difficult to distinguish one of the majority group modes from the minority group; (d) the gradient space (bias gradient omitted for visualization) simplifies the data structure, making it easier to identify the minority group and to detect outliers.
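To make the gradient-space view in panel (d) concrete, the following minimal sketch builds an analogous toy dataset, fits a logistic regression model, and computes the per-sample loss gradients that panels (b)-(d) compare against loss values and raw features. The data construction and all numeric values are illustrative assumptions, not the exact data of Figure 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Illustrative 2-D toy data (an assumption, not the data of Figure 1): for each
# class, a majority group that is a mixture of three components and a unimodal
# minority group, plus one mislabeled outlier.
maj1 = np.concatenate([rng.normal([2.0, c], 0.3, size=(50, 2)) for c in (-2, 0, 2)])   # g=1, y=1
min2 = rng.normal([2.0, 4.0], 0.3, size=(15, 2))                                        # g=2, y=1
maj3 = np.concatenate([rng.normal([-2.0, c], 0.3, size=(50, 2)) for c in (-2, 0, 2)])  # g=3, y=0
min4 = rng.normal([-2.0, 4.0], 0.3, size=(15, 2))                                       # g=4, y=0
outlier = np.array([[2.0, 0.0]])  # labeled y=0 but lies on the y=1 side

X = np.concatenate([maj1, min2, maj3, min4, outlier])
y = np.concatenate([np.ones(len(maj1) + len(min2)),
                    np.zeros(len(maj3) + len(min4) + 1)])

clf = LogisticRegression().fit(X, y)   # decision boundary is roughly the vertical axis

# Per-sample gradient of the logistic loss w.r.t. the weights w:
#   grad_i = (sigma(w^T x_i + b) - y_i) * x_i   (bias gradient omitted, as in panel (d))
p = clf.predict_proba(X)[:, 1]
grads = (p - y)[:, None] * X

# Panels (b)-(d) then compare, for the y=0 samples: per-sample loss values,
# the raw features X, and these per-sample gradients.
losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
```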
identities with toxicity (Dixon et al., 2018; Garg et al., 2019; Yurochkin & Sun, 2020). A related phenomenon is subpopulation shift (Koh et al., 2021), i.e., when the test distribution differs from the train distribution in terms of group proportions. Under subpopulation shift, poor performance on the minority groups in the training data translates into poor overall performance on the test distribution, where these groups are more prevalent or more heavily weighted. Subpopulation shift occurs in many application domains (Tatman, 2017; Beery et al., 2018; Oakden-Rayner et al., 2020; Santurkar et al., 2020; Koh et al., 2021).
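To see why this matters, consider a simple illustrative calculation; the per-group accuracies and group proportions below are assumed numbers, not results from this paper. A model that looks accurate on average in training can still perform poorly on a test distribution where the minority group is more prevalent.

```python
# Illustrative (assumed) per-group accuracies and group proportions.
acc = {"majority": 0.95, "minority": 0.60}
train_mix = {"majority": 0.95, "minority": 0.05}   # train distribution
test_mix = {"majority": 0.50, "minority": 0.50}    # shifted test distribution

train_acc = sum(acc[g] * train_mix[g] for g in acc)  # ~0.93: looks fine on train
test_acc = sum(acc[g] * test_mix[g] for g in acc)    # ~0.78: drops under the shift
print(train_acc, test_acc)
```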
Prior work offers a variety of methods for training models robust to subpopulation shift and spurious correlations, including group distributionally robust optimization (gDRO) (Hu et al., 2018; Sagawa et al., 2019), importance weighting (Shimodaira, 2000; Byrd & Lipton, 2019), subsampling (Sagawa et al., 2020; Idrissi et al., 2022; Maity et al., 2022), and variations of tilted ERM (Li et al., 2020, 2021). These methods succeed in achieving comparable performance across groups in the data, but they require group annotations. Such annotations can be expensive to obtain, e.g., labeling spurious backgrounds in image recognition (Beery et al., 2018) or labeling identity mentions in the toxicity example. It can also be challenging to anticipate all potential spurious correlations in advance: the spurious attribute could be the background, the time of day, the camera angle, or unanticipated identities subject to harassment.
Recently, methods have emerged for learning group annotations (Sohoni et al., 2020; Liu et al., 2021; Creager et al., 2021) and variations of DRO that do not require groups (Hashimoto et al., 2018; Zhai et al., 2021). One common theme is to treat the data points on which an ERM model makes mistakes (i.e., high-loss points) as a minority group (Hashimoto et al., 2018; Liu et al., 2021) and to increase the weighting of these points. Unfortunately, such methods are at risk of overfitting to outliers (e.g., mislabeled data or corrupted images), which are also high-loss points. Indeed, existing methods for outlier-robust training propose to ignore the high-loss points (Shen & Sanghavi, 2019), the direct opposite of the approach in (Hashimoto et al., 2018; Liu et al., 2021).
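This tension can be illustrated with a small sketch; it is a simplified caricature under assumed loss values and weighting factors, not the exact procedures of the cited works. Error-based minority heuristics upweight the highest-loss points, while trimmed-loss training zeroes out the weights of exactly those points.

```python
import numpy as np

def split_by_loss(losses, frac=0.1):
    """Indices of the top-`frac` highest-loss points and the rest.

    `frac` is a hypothetical hyperparameter used only for this illustration.
    """
    k = max(1, int(frac * len(losses)))
    order = np.argsort(losses)
    return order[-k:], order[:-k]   # (highest-loss points, remaining points)

rng = np.random.default_rng(0)
losses = rng.exponential(size=200)  # stand-in for per-sample losses of an ERM model
high, low = split_by_loss(losses)

# Error-based minority heuristics UPWEIGHT the high-loss points...
weights_minority_heuristic = np.ones_like(losses)
weights_minority_heuristic[high] *= 10.0    # upweighting factor is illustrative

# ...while outlier-robust (trimmed-loss) training IGNORES exactly those points.
weights_trimming = np.ones_like(losses)
weights_trimming[high] = 0.0
```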
In this paper, our goal is to learn group annotations in the presence of outliers. Rather than using loss values (which, as discussed above, create opposing tradeoffs), we propose to first represent the data by the gradients of each datum's loss w.r.t. the model parameters. Such gradients tell us how a specific data point wants the parameters of the model to change to fit it better. In this gradient space, we anticipate groups (conditioned on the label) to correspond to clusters of gradients. Outliers, on the other hand, mostly correspond to isolated gradients: they are likely to want the model parameters to change differently from any of the groups and from other outliers. See Figure 1 for an illustration. The gradient space structure allows us to separate out the outliers and learn the group annotations via traditional clustering techniques such as DBSCAN (Ester et al., 1996). We use the learned group annotations to train models with improved worst-group performance (measured w.r.t. the true group annotations).
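As a concrete illustration of this idea, the sketch below clusters per-sample loss gradients of a logistic regression model with DBSCAN, separately within each class. The choice of model, the DBSCAN hyperparameters, and the bookkeeping details are illustrative assumptions rather than the exact procedure developed in the remainder of the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LogisticRegression

def infer_groups(X, y, eps=0.5, min_samples=10):
    """Cluster per-sample loss gradients (within each class) to infer groups.

    Returns pseudo-group labels; -1 marks points DBSCAN leaves unclustered,
    which we treat as outliers. `eps` and `min_samples` are placeholder
    hyperparameters for this sketch.
    """
    clf = LogisticRegression().fit(X, y)              # ERM reference model
    p = clf.predict_proba(X)[:, 1]
    grads = (p - y)[:, None] * X                      # per-sample weight gradients

    groups = np.full(len(y), -1)
    next_id = 0
    for label in np.unique(y):                        # cluster conditioned on the label
        idx = np.where(y == label)[0]
        cluster = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(grads[idx])
        cluster[cluster >= 0] += next_id              # make cluster ids globally unique
        groups[idx] = cluster
        next_id = groups.max() + 1
    return groups

# The pseudo-group labels can then be passed to a group-robust trainer such as
# gDRO, after dropping the points marked -1 (the suspected outliers).
```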
We summarize our contributions below: