OUTLIER-ROBUST GROUP INFERENCE VIA GRADIENT SPACE CLUSTERING

arXiv:2210.06759v1 [cs.LG] 13 Oct 2022
Yuchen Zeng
Department of Computer Science
University of Wisconsin-Madison
yzeng58@wisc.edu
(Work performed while doing an internship at IBM Research.)
Kristjan Greenewald
IBM Research
MIT-IBM Watson AI Lab
kristjan.h.greenewald@ibm.com
Kangwook Lee
Department of Electrical and Computer Engineering
University of Wisconsin-Madison
kangwook.lee@wisc.edu
Justin Solomon
Department of Electrical Engineering & Computer Science
Massachusetts Institute of Technology
jsolomon@mit.edu
Mikhail Yurochkin
IBM Research
MIT-IBM Watson AI Lab
mikhail.yurochkin@ibm.com
ABSTRACT
Traditional machine learning models focus on achieving good performance on the overall training distribution, but they often underperform on minority groups. Existing methods can improve the worst-group performance, but they can have several limitations: (i) they require group annotations, which are often expensive and sometimes infeasible to obtain, and/or (ii) they are sensitive to outliers. Most related works fail to solve these two issues simultaneously as they focus on conflicting perspectives of minority groups and outliers. We address the problem of learning group annotations in the presence of outliers by clustering the data in the space of gradients of the model parameters. We show that data in the gradient space has a simpler structure while preserving information about minority groups and outliers, making it suitable for standard clustering methods like DBSCAN. Extensive experiments demonstrate that our method significantly outperforms the state of the art both in terms of group identification and downstream worst-group performance.
1 Introduction
Empirical Risk Minimization (ERM), i.e., the minimization of average training loss over the set of model parameters, is the standard training procedure in machine learning. It yields models with strong in-distribution performance (i.e., low loss on test data drawn from the same distribution as the training dataset), but it does not guarantee satisfactory performance on minority groups that contribute relatively few data points to the training loss function (Sagawa et al., 2019; Koh et al., 2021). This effect is particularly problematic when the minority groups correspond to socially-protected groups. For example, in toxic text classification, certain identities are overwhelmingly abused in the online conversations that form the data for training toxicity-detection models (Dixon et al., 2018). Such data lacks sufficient non-toxic examples mentioning these identities, yielding problematic and unfair spurious correlations; as a result, ERM learns to associate these identities with toxicity (Dixon et al., 2018; Garg et al., 2019; Yurochkin & Sun, 2020). A related phenomenon is subpopulation shift (Koh et al., 2021), i.e., when the test distribution differs from the train distribution in terms of group proportions. Under subpopulation shift, poor performance on the minority groups in the train data translates into poor overall test-distribution performance, since these groups are more prevalent or more heavily weighted at test time. Subpopulation shift occurs in many application domains (Tatman, 2017; Beery et al., 2018; Oakden-Rayner et al., 2020; Santurkar et al., 2020; Koh et al., 2021).

Figure 1: An illustration of learning group annotations in the presence of outliers. (a) A toy dataset in two dimensions. There are four groups g = 1, 2, 3, 4 and an outlier. Groups g = 1 and g = 3 are the majority groups, distributed as mixtures of three components each; g = 2 and g = 4 are unimodal minority groups. The y-axis is the decision boundary of a logistic regression classifier. Panels (b, c, d) compare different data views for learning group annotations and detecting outliers via clustering of samples with y = 0: (b) loss values can confuse outliers and minority samples, which both can have high loss; (c) in the original feature space, it is difficult to distinguish one of the majority group modes from the minority group; (d) the gradient space (bias gradient omitted for visualization) simplifies the data structure, making it easier to identify the minority group and to detect outliers.
Prior work offers a variety of methods for training models robust to subpopulation shift and spurious correlations, including group distributionally robust optimization (gDRO) (Hu et al., 2018; Sagawa et al., 2019), importance weighting (Shimodaira, 2000; Byrd & Lipton, 2019), subsampling (Sagawa et al., 2020; Idrissi et al., 2022; Maity et al., 2022), and variations of tilted ERM (Li et al., 2020, 2021). These methods succeed in achieving comparable performance across groups in the data, but they require group annotations. The annotations can be expensive to obtain, e.g., labeling spurious backgrounds in image recognition (Beery et al., 2018) or labeling identity mentions in the toxicity example. It can also be challenging to anticipate all potential spurious correlations in advance: the spurious attribute could be background, time of day, camera angle, or unanticipated identities subject to harassment.

Recently, methods have emerged for learning group annotations (Sohoni et al., 2020; Liu et al., 2021; Creager et al., 2021), along with variations of DRO that do not require groups (Hashimoto et al., 2018; Zhai et al., 2021). One common theme is to treat data where an ERM model makes mistakes (i.e., high-loss points) as a minority group (Hashimoto et al., 2018; Liu et al., 2021) and to increase the weight of these points. Unfortunately, such methods are at risk of overfitting to outliers (e.g., mislabeled data, corrupted images), which are also high-loss points. Indeed, existing methods for outlier-robust training propose to ignore the high-loss points (Shen & Sanghavi, 2019), the direct opposite of the approach in (Hashimoto et al., 2018; Liu et al., 2021).
In this paper, our goal is to learn group annotations in the presence of outliers. Rather than using loss values (which, as discussed above, create opposing tradeoffs), we propose to first represent the data using gradients of each datum's loss w.r.t. the model parameters. Such gradients tell us how a specific data point wants the parameters of the model to change to fit it better. In this gradient space, we anticipate groups (conditioned on label) to correspond to clusters of gradients. Outliers, on the other hand, mostly correspond to isolated gradients: they are likely to want the model parameters to change differently from any of the groups and from other outliers. See Figure 1 for an illustration. The gradient space structure allows us to separate out the outliers and to learn the group annotations via traditional clustering techniques such as DBSCAN (Ester et al., 1996). We then use the learned group annotations to train models with improved worst-group performance (measured w.r.t. the true group annotations).
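As a minimal sketch of this idea (not the paper's exact pipeline; the logistic regression model, the toy data, and the DBSCAN hyperparameters eps and min_samples are all illustrative assumptions), one can map each sample to its per-sample loss gradient and cluster within each class, with DBSCAN's label -1 marking isolated points as outlier candidates:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import LogisticRegression

# Toy data: two features, binary labels (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

# Step 1: fit a reference ERM model.
clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(X)[:, 1]  # predicted P(y=1 | x)

# Step 2: per-sample gradient of the log-loss w.r.t. the weights.
# For logistic regression this is (p - y) * x (bias gradient omitted).
grads = (p - y)[:, None] * X

# Step 3: cluster gradients within each class; DBSCAN assigns -1
# to isolated gradients, which serve as outlier candidates.
for c in (0, 1):
    idx = np.where(y == c)[0]
    labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(grads[idx])
    print(f"class {c}: inferred group labels {np.unique(labels)}")
```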
We summarize our contributions below:
- We show that the gradient space simplifies the data structure and makes it easier to learn group annotations via clustering.
- We propose Gradient Space Partitioning (GraSP), a method for learning group annotations in the presence of outliers, for training models robust to subpopulation shift.
- We conduct extensive experiments on one synthetic dataset and three datasets from different modalities, demonstrating that our method achieves state-of-the-art performance both in terms of group identification quality and downstream worst-group performance.
2 Preliminaries and Related Work
In this section, we review the problem of training models in the presence of minority groups. Denote $[N] = \{1, \dots, N\}$. Consider a dataset $D = \{z_i\}_{i=1}^n \subset \mathcal{Z}$ consisting of $n$ samples $z = (x, y) \in \mathcal{Z}$, where $x \in \mathcal{X} = \mathbb{R}^d$ is the input feature and $y \in \mathcal{Y} = \{1, \dots, C\}$ is the class label. The samples from each class $y \in \mathcal{Y}$ are categorized into $K_y$ groups. Denote by $K = \sum_{y \in \mathcal{Y}} K_y$ the total number of groups $\{G_1, \dots, G_K\}$, $G_k \subset \mathcal{Z}$. Denote the group membership of each point in the dataset as $\{g_i\}_{i=1}^n$, where $g_i \in [K]$ for all $i \in [n]$. For example, in toxicity classification, a group could correspond to a toxic comment mentioning a specific identity, or, in image recognition, a group could be an animal species appearing on an atypical background (Beery et al., 2018; Sagawa et al., 2019).
The goal is to learn a model $h \in \mathcal{H}$, $h: \mathcal{X} \to \mathcal{Y}$, parameterized by $\theta \in \Theta$, that performs well on all groups $G_k$, $k \in [K]$. Depending on the application, such a model can alleviate fairness concerns (Dixon et al., 2018), remedy spurious correlations in the data (Sagawa et al., 2019), and promote robustness to subpopulation shift (Koh et al., 2021), i.e., to test data with unknown group proportions.
We divide the approaches for learning in the presence of minority groups into three categories: the group-aware setting, where the group annotations $g_i$ are known; the group-oblivious setting, which does not use the group annotations; and the group-learning setting, where the group annotations are learned from the data and then used as inputs to group-aware methods.
Group-aware setting. Many prior works assume access to the minority group annotations. Among the state-of-the-art methods in this setting is group Distributionally Robust Optimization (gDRO) (Sagawa et al., 2019). Let $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ be a loss function. The optimization problem of gDRO is
$$\min_{\theta \in \Theta} \; \max_{k \in [K]} \; \frac{1}{|G_k|} \sum_{z \in G_k} \ell(y, h_\theta(x)), \tag{gDRO}$$
which aims to minimize the maximum group loss. Besides work assuming clean group annotations, another line of research in this setting considers noisy or partially available group annotations (Jung et al., 2022; Lamy et al., 2019; Mozannar et al., 2020; Celis et al., 2021). Methods in this class achieve meaningful improvements over ERM in terms of worst-group accuracy, but anticipating relevant minority groups and obtaining the annotations is often burdensome.
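To make the objective concrete, here is a hedged sketch of a gDRO loss (PyTorch is an assumed framework here, not necessarily what the authors use). It backpropagates through the largest per-group mean loss, mirroring the displayed objective; Sagawa et al. (2019) optimize a smoothed, online version of this max in practice, so the hard max below is purely illustrative:

```python
import torch

def gdro_loss(losses: torch.Tensor, groups: torch.Tensor) -> torch.Tensor:
    """Worst-group loss: the maximum over groups of the per-group mean loss.

    losses: per-sample losses, shape (n,); groups: group ids in {0, ..., K-1}.
    """
    group_means = [losses[groups == k].mean() for k in groups.unique()]
    return torch.stack(group_means).max()

# Usage sketch with toy numbers: group 1 has the larger mean loss (1.75),
# so it alone drives the gradient at this step.
losses = torch.tensor([0.2, 1.5, 0.3, 2.0])
groups = torch.tensor([0, 1, 0, 1])
print(gdro_loss(losses, groups))  # tensor(1.7500)
```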
Group-oblivious setting. In contrast to the group-aware setting, the group-oblivious setting attempts to improve worst-group performance without group annotations. Methods in this category rely on various forms of DRO (Hashimoto et al., 2018; Zhai et al., 2021) or adversarial reweighting (Lahoti et al., 2020). Algorithmically, this results in up-weighting the contribution of high-loss points and down-weighting that of low-loss points. For example, Hashimoto et al. (2018) optimize a DRO objective with respect to a chi-square divergence ball around the data distribution, which is equivalent to minimizing $\frac{1}{n} \sum_i \left[\ell(y_i, h_\theta(x_i)) - \eta\right]_+^2$, i.e., ERM with low-loss points discounted by a constant $\eta$ depending on the ball radius.
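A small worked example of this surrogate (the threshold eta below is an arbitrary value chosen for illustration):

```python
import numpy as np

def chi2_dro_surrogate(losses: np.ndarray, eta: float) -> float:
    """Hashimoto et al. (2018)-style objective: the mean of squared hinges
    [loss - eta]_+^2, so points with loss below eta contribute nothing."""
    return float(np.mean(np.maximum(losses - eta, 0.0) ** 2))

losses = np.array([0.1, 0.2, 1.5, 3.0])      # toy per-sample losses
print(chi2_dro_surrogate(losses, eta=0.5))   # (1.0**2 + 2.5**2) / 4 = 1.8125
```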
Group-learning setting. The final category corresponds to a two-step procedure, wherein the data points are first assigned group annotations based on various criteria, followed by group-aware training, typically using gDRO. In this category, Just Train Twice (JTT) (Liu et al., 2021) trains an ERM model and designates its high-loss points as the minority group and its low-loss points as the majority group; George (Sohoni et al., 2020) clusters the data to identify groups using a combination of dimensionality reduction, overclustering, and augmenting features with loss values; and Environment Inference for Invariant Learning (EIIL) (Creager et al., 2021) finds a group partition that maximizes the Invariant Risk criterion (Arjovsky et al., 2019).
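For concreteness, a minimal sketch of the JTT-style first step as described above (the loss-quantile threshold is an illustrative assumption; JTT's actual criterion is defined in terms of the ERM model's mistakes):

```python
import numpy as np

def infer_groups_by_loss(erm_losses: np.ndarray, quantile: float = 0.8) -> np.ndarray:
    """JTT-style heuristic: mark points whose ERM loss exceeds the given
    quantile as the inferred minority group (1); the rest are majority (0)."""
    threshold = np.quantile(erm_losses, quantile)
    return (erm_losses > threshold).astype(int)

erm_losses = np.array([0.05, 0.1, 0.2, 2.3, 0.15])
print(infer_groups_by_loss(erm_losses))  # [0 0 0 1 0]
```

Note that an outlier with a large ERM loss would be assigned to the "minority" group by this heuristic, which is precisely the failure mode discussed below.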
Table 1: Summary of methods for learning in the presence of minority groups. "-" indicates that there is no clear evidence in the prior works.

| Method | Setting | Improves worst-group performance? | No training group annotations? | No validation group annotations? | Group inference? | Robust to outliers? |
|---|---|---|---|---|---|---|
| ERM | - | ✗ | ✓ | ✓ | ✗ | ✗ |
| gDRO (Sagawa et al., 2019) | Group-aware | ✓ | ✗ | ✗ | ✗ | - |
| χ²-DRO (Hashimoto et al., 2018) | Group-oblivious | ✓ | ✓ | ✗ | ✗ | ✗ |
| DORO (Zhai et al., 2021) | Group-oblivious | ✓ | ✓ | ✗ | ✗ | ✗ |
| JTT (Liu et al., 2021) | Group-learning | ✓ | ✓ | ✗ | ✓ | ✗ |
| EIIL (Creager et al., 2021) | Group-learning | ✓ | ✓ | ✗ | ✓ | - |
| George (Sohoni et al., 2020) | Group-learning | ✓ | ✓ | ✓ | ✓ | - |
| GraSP (Ours) | Group-learning | ✓ | ✓ | ✓ | ✓ | ✓ |
Our method, Gradient Space Partitioning (GraSP), belongs to this category. GraSP differs from prior works in its ability to account for outliers in the data. In addition, prior methods in this category and in the group-oblivious one typically require validation data with true group annotations for model selection in order to achieve meaningful worst-group performance improvements over ERM, while GraSP does not need these annotations to achieve good performance. In our experiments, this is attributed to GraSP's better recovery of the true group annotations, which makes them suitable for gDRO model selection (see Section 4). We summarize the properties of the most relevant methods in each setting in Table 1.
The challenge of outliers. Outliers, e.g., mislabeled samples or corrupted images, are ubiquitous in applications (Singh & Upadhyaya, 2012), and outlier detection has long been a topic of inquiry in ML (Hodge & Austin, 2004; Wang et al., 2019). Outliers are especially challenging to detect when the data has (unknown) minority groups, which can be hard to distinguish from outliers yet require the opposite treatment: minority groups need to be upweighted, while outliers must be discarded. Hashimoto et al. (2018) write, "it is an open question whether it is possible to design algorithms which are both fair to unknown latent groups and robust [to outliers]."

We provide an illustration of a dataset with minority groups and an outlier in Figure 1(a). Figure 1(b) illustrates the problem with methods relying on loss values. Specifically, Liu et al. (2021) and Hashimoto et al. (2018) upweight high-loss points, thereby overfitting to the outlier. Zhai et al. (2021) optimize Hashimoto et al. (2018)'s objective function after discarding a fraction of the points with the largest loss values to account for outliers. This assumes that outliers have higher loss values than the minority group samples, which can easily be violated, leading to the exclusion of minority samples, as illustrated in Figure 1.
Gradients as data representations. Given a model $h_{\theta_0}(\cdot)$ and a loss function $\ell(\cdot, \cdot)$, one can consider an alternative representation of the data, where each sample is mapped to the gradient, with respect to the model parameters, of the loss on this sample:
$$f_i = \left. \nabla_\theta\, \ell(y_i, h_\theta(x_i)) \right|_{\theta = \theta_0} \quad \text{for } i = 1, \dots, n. \tag{1}$$
We refer to equation (1) as the gradient representation. Prior works considered gradient representations (Mirzasoleiman et al., 2020), as well as loss values (Shen & Sanghavi, 2019), for outlier-robust learning. Gradient representations have also found success in novelty detection (Kwon et al., 2020b), anomaly detection (Kwon et al., 2020a), and detection of out-of-distribution inputs (Huang et al., 2021).

In this work, we show that, unlike loss values, gradient representations are suitable for simultaneously learning group annotations and detecting outliers. Compared to the original feature space, the gradient space simplifies the data structure, making it easier to identify minority groups. Figure 1(c) illustrates a failure of feature-space clustering: here the majority group for class $y = 0$ is a mixture of three components, with one of the components being close to the minority group in the feature space. In the gradient space, for a logistic regression model, the representations of misclassified points remain similar to the original features, while the representations of correctly classified points are pushed towards zero. We illustrate the benefits of the gradient representations in Figure 1(d) and provide additional details in the subsequent section.
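A short sketch makes this behavior concrete for binary logistic regression (the toy weights and inputs are assumptions chosen for illustration; the (p - y) * x form of the per-sample gradient is exact for the log-loss):

```python
import numpy as np

def gradient_representations(X, y, w, b):
    """Gradient of the logistic log-loss w.r.t. the weights, per sample:
    f_i = (sigmoid(w @ x_i + b) - y_i) * x_i  (bias gradient omitted).
    Confidently correct points have sigmoid(...) close to y_i, so f_i -> 0;
    misclassified points have |sigmoid(...) - y_i| near 1, so f_i is close
    to x_i up to sign, i.e., similar to the original features."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return (p - y)[:, None] * X

X = np.array([[2.0, 0.0],    # confidently correct for y = 1
              [-2.0, 0.0]])  # misclassified for y = 1
y = np.array([1.0, 1.0])
w, b = np.array([3.0, 0.0]), 0.0
print(gradient_representations(X, y, w, b))
# First row is roughly [-0.005, 0]: pushed towards zero.
# Second row is roughly [2.0, 0]: magnitude comparable to the feature vector.
```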