Nuisances via Negativa:
Adjusting for Spurious Correlations via Data Augmentation
Aahlad Puli 1*, Nitish Joshi 1, Yoav Wald 2, He He 1,2, Rajesh Ranganath 1,2,3
1 Department of Computer Science, New York University
2 Center for Data Science, New York University
3 Department of Population Health, Langone Health, New York University
Abstract
In prediction tasks, there exist features that are related to the label in the same way across different settings
for that task; these are semantic features or semantics. Features with varying relationships to the label are
nuisances. For example, in detecting cows from natural images, the shape of the head is semantic, but because images of cows often, though not always, have grass backgrounds, the background is a nuisance. Models that exploit
nuisance-label relationships face performance degradation when these relationships change. Building models
robust to such changes requires additional knowledge beyond samples of the features and labels. For example,
existing work uses annotations of nuisances or assumes ERM-trained models depend on nuisances. Approaches to
integrate new kinds of additional knowledge enlarge the settings where robust models can be built. We develop an approach that uses knowledge of the semantics to corrupt them in the data, and then uses the corrupted data to produce models that identify correlations between nuisances and the label. Once these correlations are identified, they can be used to adjust predictions so that nuisances no longer drive them. We study semantic corruptions
in powering different spurious-correlation avoiding methods on multiple out-of-distribution (OOD) tasks like
classifying waterbirds, natural language inference (NLI), and detecting cardiomegaly in chest X-rays.
1 Introduction
Relationships between the label and the covariates can change across data collected at different places and times.
For example, in classifying animals, data collected in natural habitats have cows appear more often on grasslands, while penguins appear more often on backgrounds of snow; these animal-background relationships do not hold outside natural habitats [Beery et al., 2018, Arjovsky et al., 2019]. Some features, like an animal's shape, are predictive of the label across all settings for a task; these are semantic features, or semantics in short. Other features with varying relationships with the label, like the background, are nuisances. Even with semantics present, models trained via empirical risk minimization (ERM) can predict using nuisances and thus fail to generalize [Geirhos et al., 2020]. Models that rely only on the semantic features perform well even when the nuisance-label relationship changes, unlike models that rely on nuisances.
Building models that generalize under changing nuisance-label relationships requires additional knowledge, be-
yond a dataset of features and labels sampled from the training distribution. For example, many works assume
knowledge of the nuisance. In the animal-background example, this would correspond to a feature that speci-
fies the image background, which we may use when specifying our learning algorithm [Mahabadi et al., 2019, Makar et al., 2022, Veitch et al., 2021, Puli et al., 2022]; another common type of assumption is access to multiple datasets over which the nuisance-label correlation varies [Arjovsky et al., 2019, Peters et al., 2016, Wald et al., 2021], and some other forms of knowledge have been explored [Mahajan et al., 2021, Gao et al., 2023, Feder et al., 2023].
*Corresponding author: aahlad@nyu.edu. Published at TMLR 2024: https://openreview.net/forum?id=RIFJsSzwKY.
Semantic Corruptions. In this paper, we explore the use of a different type of knowledge: corruptions of se-
mantic features. Intuitively, imagine trying to predict the label from a corrupted input T(x), where all semantic
information has been removed. Any better-than-chance prediction provides a window into the nuisances, as it must rely on them. We then use the resulting biased models to guide methods that we identify here as biased-model-based spurious-correlation avoiding methods (B-SCAMs).
B-SCAMs. There is a class of methods in the literature that use the predictions of a biased model to adjust for nuisances and learn predictors that are free of spurious correlations. Among others, these include Just Train Twice (JTT) [Liu et al., 2021], EIIL [Creager et al., 2021], Nuisance-Randomized Distillation (NURD) [Puli et al., 2022], and debiased focal loss (DFL) and product of experts (POE) [Mahabadi et al., 2019]. The key question arising from these works is: how can we obtain biased models? In empirical studies, prior works on B-SCAMs either use annotations of the nuisance or an ERM-trained model over the training data as a placeholder for the biased model. The latter approach, based on an ERM-trained model, is successful if that model completely ignores semantic information. In practice, these heuristics are rather fragile. Annotations for nuisances are seldom available, and we lack a principled method to ascertain whether a model trained with ERM relies only on semantic features. Therefore, employing semantic corruptions could serve as a valuable alternative to these heuristics. We claim that semantic corruptions offer a principled and useful approach to obtaining biased models.
Semantic corruptions T(x) must strike a delicate balance between removing semantic information and preserving nuisances. For example, if T(x) replaces all pixels in an image with random noise, it corrupts semantics while simultaneously erasing all information about the nuisances. An ideal T(x) would isolate nuisances by targeting
only the semantic information in the input, e.g., by in-painting the animal for the task of classifying cows and
penguins. Implementing such ideal corruptions is unrealistic, as they are task-specific and may require human
annotations of the semantic features; e.g., one can segment the objects in every image. Doing so for all classifi-
cation problems is extremely laborious. In tasks like NLI, it is unclear even how to annotate semantics, as they do
not correspond to simple features like subsets of words. In summary, after outlining the desired characteristics of
semantic corruptions, we define corruptions that are beneficial across multiple tasks and do not require human
annotation. Our contributions are as follows:
1. We show that acquiring additional knowledge beyond a labeled dataset is necessary for effectively learning robust models (theorem 1). Then, in proposition 1, we formalize sufficient conditions under which additional knowledge in the form of a semantic corruption enables B-SCAMs to learn robust models.
2. We develop multiple semantic corruptions for object recognition and natural language inference. These include patch randomization, n-gram randomization, frequency filtering, and intensity filtering. We then situate existing procedures, such as region-of-interest masking and premise masking, under the umbrella of semantic corruptions.
3. Empirically, we demonstrate that any semantic corruption can power any B-SCAM. The corruption-powered versions of these methods outperform ERM on out-of-distribution (OOD) generalization tasks like Waterbirds, cardiomegaly detection from chest X-rays, and NLI. Corruption-powered NURD, DFL, and POE achieve performance similar to the same methods run with extra observed nuisance variables. Corruption-powered JTT outperforms vanilla JTT.
2 Biased-model-based spurious-correlation avoiding methods
A spurious correlation is a relationship between the covariates x and the label y that changes across settings like time and location [Geirhos et al., 2020]. The features whose relationship with the label changes are called nuisances. With a vector of nuisances z, let p_tr(y, z, x) and p_te(y, z, x) be the training and test distributions.

Achieving robustness to spurious correlations requires additional knowledge. In the presence of spurious correlations, the training distribution p_tr may not equal the test distribution p_te. Without further assumptions, no algorithm that only sees data from p_tr(y, x) can produce a predictor that works well on p_te. To achieve generalization when p_te ≠ p_tr, work in the OOD generalization literature assumes a relationship between the training and test distributions. We follow the work of Makar et al. [2022], Puli et al. [2022] and assume that only the nuisance-label relationship — i.e. the conditional z | y — changes between training and test. Formally, we let
p_tr, p_te come from a family of distributions whose members have different nuisance-label relationships but share the same relationship between the label and the semantics x*:

Definition 1 (Nuisance-varying family with semantic features x* [Makar et al., 2022, Puli et al., 2022]).

    F = { p_D : p_D(y, z, x*, x) = p(y, x*) p_D(z | y) p(x | z, x*) }.    (1)

Many common tasks in OOD generalization, including some from section 4, fit this definition. For example, in classifying natural images, the background type is the nuisance z and its relationship to the label can change across places, each corresponding to a different member of F. The animal shape however is made of semantic features x* that are related to the label in the same way across places. Like in this example, we assume that the semantic features x* equal a function of the covariates, x* = e(x), almost surely under every p_D ∈ F, but neither x* nor e(·) are known. Finally, the semantics and nuisances together account for all the information that x has about y, meaning x ⊥ y | x*, z under every p_D.
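To make Definition 1 concrete, the following is a toy generative sketch of a nuisance-varying family; it is our own illustrative construction, not an example from the paper. The label-semantics relationship p(y, x*) and the mechanism p(x | z, x*) are fixed, while the parameter rho controls how strongly the nuisance z tracks the label in a given member p_D.

```python
import numpy as np

def sample_member(n, rho, rng=None):
    """Draw n samples from one member p_D of a toy nuisance-varying family.
    p(y, x_star) is the same for every member, p_D(z | y) depends on rho
    (how strongly the nuisance tracks the label), and p(x | z, x_star) is fixed."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.integers(0, 2, size=n)
    x_star = y + 0.5 * rng.standard_normal(n)      # semantics: fixed relation to y across members
    agree = rng.random(n) < rho                    # nuisance agrees with the label w.p. rho
    z = np.where(agree, y, 1 - y) + 0.5 * rng.standard_normal(n)
    x = np.stack([x_star, z], axis=1)              # covariates mix semantics and nuisance
    return x, y, z
```

Training on data drawn with rho close to 1 and testing with rho close to 0 reproduces the kind of nuisance-label shift that trips up ERM-trained predictors.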
Building models that are robust to a shifting nuisance-label relationship relies on additional knowledge, such as nuisance annotations, in the training data [Makar et al., 2022, Veitch et al., 2021, Puli et al., 2022, Sagawa et al., 2019, Yao et al., 2022]. Given knowledge of z, work such as Makar et al. [2022], Puli et al. [2022] estimates a distribution, denoted p_⊥, under which the label and nuisance are independent (y ⊥ z under p_⊥):

    p_⊥(y, x) = ∫ p(y, x*) p_tr(z) p(x | z, x*) dz dx*.

Following Puli et al. [2022], we call p_⊥ the nuisance-randomized distribution. The model p_⊥(y = 1 | x) achieves the lowest risk on any member of the family F amongst the set of risk-invariant models (see Proposition 1 of Makar et al. [2022]). However, even when p_tr, p_te ∈ F and optimal risk-invariant predictors can be built with nuisances, it is impossible to always beat random chance when given data {y, x} ∼ p_tr:

Theorem 1. For any learning algorithm, there exists a nuisance-varying family F where predicting with p_⊥(y = 1 | x) achieves 90% accuracy on all members, such that given training data (y, x) from one member p_tr ∈ F, the algorithm cannot achieve better accuracy than 50% (random chance) on some p_te ∈ F.
The proof is in appendix A and proceeds in two steps. With ACC_p(h) as the expected accuracy of a model h on distribution p, the first step of the proof defines two nuisance-varying families F_1, F_2 such that no single model can perform well on both families simultaneously; any h(x) for which ACC_{p_1}(h) > 50% for all p_1 ∈ F_1 will have ACC_{p_2}(h) < 50% for some p_2 ∈ F_2, and vice versa. The second step shows that the two families F_1, F_2 have a member with the same distribution over (y, x); letting the training data come from this distribution means that any learning algorithm that returns a performant model — one that beats 50% accuracy — on one family must fail to return a performant model on the other. Next, we discuss different methods that use additional knowledge beyond (y, x) to build robust predictors.
2.1 Biased-model-based spurious-correlation avoiding methods.
We focus on methods that correct models using knowledge of nuisances or where they might appear in the covariates [Mahabadi et al., 2019, Puli et al., 2022, Liu et al., 2021]. We first establish that the common central
part in these methods is a model that predicts the label using nuisances, which we call the biased model; due to
this commonality, we call these biased-model-based spurious-correlation avoiding methods (B-SCAMs). At a high
level, a B-SCAM has two components. The first is a biased model that is built to predict the label by exploiting the
nuisance-label relationship via extra knowledge or assumptions. The biased model is then used to guide a second
model to predict the label without relying on nuisances.
We briefly summarize the different B-SCAMs here, differentiated by the additional knowledge they use to build
biased models. The differences between the methods are summarized in table 1. We give details for NURD here
and defer algorithmic details about the rest to appendix B.
Biased models from knowledge of the nuisances. The first category of B-SCAMs, from Mahabadi et al. [2019], Puli et al. [2022], assumes additional knowledge in the form of nuisance annotations z. For example, in NLI — where the goal is determining if a premise sentence entails a hypothesis — Mahabadi et al. [2019] compute the fraction of words shared between the hypothesis and the premise for each sample in the training data and use this as one of the nuisance features in building the biased model.

Table 1: Summary of NURD, JTT, POE, and DFL. Each method approximates the biased model p_tr(y|z). This table describes the different biased models, their names, and how they are built.

    Method   | Name                 | What the biased model is   | Assumptions/Knowledge
    ---------|----------------------|----------------------------|---------------------------
    JTT      | Identification model | p_tr(y|x) learned via ERM  | ERM learns biased models.
    POE/DFL  | Biased model         | p_tr(y|z) learned via ERM  | z from domain knowledge.
    NURD     | Weight model         | p_tr(y|z) learned via ERM  | z from domain knowledge.

The biased model in NURD, POE, and DFL is learned by predicting the label from the nuisance annotations in the training data to estimate p_tr(y | z). Using nuisance annotations, Makar et al. [2022], Puli et al. [2022] use the model p_tr(y | z) as the biased model to define importance weights and minimize risk w.r.t. a distribution p_⊥ obtained as

    p_⊥(y, z, x) = p_tr(y) p_tr(z) p(x | y, z) = [ p_tr(y) / p_tr(y | z) ] p_tr(z) p_tr(y | z) p(x | y, z) = [ p_tr(y) / p_tr(y | z) ] p_tr(y, z, x).
The second step in NURD [Puli et al., 2022] trains a model to predict y from a representation r(x) on data from p_⊥ such that z ⊥ y | r(x) under p_⊥; this step is called distillation. Due to y ⊥ z under p_⊥, learning in p_⊥ avoids features that depend only on the nuisance, and due to z ⊥ y | r(x) under p_⊥, distillation avoids features that are mixed functions of the label and the nuisance (e.g. x_1 = y + z). With these insights, NURD builds models of the form p_⊥(y | r(x)) that are most informative of the label. Mechanically, NURD's distillation solves this:

    max_{θ, γ}  E_{p_⊥} [ log p_θ(y | r_γ(x)) ] − λ I_{p_⊥}(y ; z | r_γ(x)).

Puli et al. [2022] show that such models are best in a class of predictors with lower bounds on performance. The mutual information above is zero when y ⊥ z | x under p_⊥; this condition holds for semantic corruptions as we discuss in appendix B. Thus, we run the distillation step as importance-weighted ERM on the training data.
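To make the two steps concrete, below is a minimal PyTorch sketch — our illustration, not the authors' code — of computing the weights p_tr(y)/p_tr(y | z) from a trained biased model and taking one importance-weighted ERM step. The function names are assumptions, `label_marginal` is assumed to hold the empirical p_tr(y), and the mutual-information penalty is dropped because, as noted above, it is zero in the settings targeted here.

```python
import torch
import torch.nn.functional as F

def nuisance_randomization_weights(biased_logits, y, label_marginal):
    """Per-example weights w = p_tr(y) / p_tr(y | z), with p_tr(y | z) read off a biased model."""
    p_y_given_z = torch.softmax(biased_logits, dim=-1).gather(1, y.unsqueeze(1)).squeeze(1)
    p_y = label_marginal[y]                         # empirical label marginal p_tr(y)
    return p_y / p_y_given_z.clamp_min(1e-6)        # clamp guards against tiny probabilities

def weighted_erm_step(predictor, optimizer, x, y, weights):
    """One gradient step of importance-weighted cross-entropy, i.e. risk under the
    nuisance-randomized distribution."""
    losses = F.cross_entropy(predictor(x), y, reduction="none")
    loss = (weights.detach() * losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With a semantic corruption, `biased_logits` would come from a model trained to predict y from T(x, δ) rather than from nuisance annotations.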
Mahabadi et al. [2019] consider two methods to train a biased model and a base predictive model jointly to make the base model predict without relying on the biases. They propose 1) POE, where the loss is the sum of the log loss of the two models, and 2) DFL, where the biased model is used to weight the cross-entropy loss for the base model. For both methods, Mahabadi et al. [2019] build a biased model as p_tr(y | z). Intuitively, the base model focuses on classifying samples that the biased model misclassifies. The methods fine-tune a BERT model [Devlin et al., 2019] and do not propagate the gradients of the biased model to update the common parameters (token embeddings).
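The two losses can be rendered schematically as follows; this is a sketch under our own naming assumptions, not the authors' implementation. POE applies cross-entropy to the sum of the two models' log-probabilities, while DFL down-weights the base loss on examples the biased model already classifies confidently; `gamma` is a focusing hyperparameter we introduce for illustration, and `detach` stands in for not propagating the biased model's gradients.

```python
import torch
import torch.nn.functional as F

def poe_loss(base_logits, biased_logits, y):
    # Product of experts: cross-entropy on the (renormalized) product of the two predictive distributions.
    combined = F.log_softmax(base_logits, dim=-1) + F.log_softmax(biased_logits, dim=-1)
    return F.cross_entropy(combined, y)

def dfl_loss(base_logits, biased_logits, y, gamma=2.0):
    # Debiased focal loss: weight each example by (1 - p_biased(y | z))^gamma so the base
    # model focuses on samples that cannot be solved from the bias alone.
    p_bias = torch.softmax(biased_logits, dim=-1).gather(1, y.unsqueeze(1)).squeeze(1).detach()
    ce = F.cross_entropy(base_logits, y, reduction="none")
    return ((1.0 - p_bias) ** gamma * ce).mean()
```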
Biased models from assumptions on ERM-trained models. The second category of B-SCAMs, like LFF [Nam et al., 2020], UMIX [Han et al., 2022], and JTT [Liu et al., 2021], requires additional knowledge that vanilla ERM builds a biased model that exploits the nuisance-label relationship. Given such a model, these works use it to reduce a second model's dependence on the nuisance. We focus on JTT [Liu et al., 2021], which aims to build models robust to group shift, where the relative mass of a fixed set of disjoint groups of the data changes between training and test times. The groups here are subsets of the data defined by a pair of discrete label and nuisance values. While JTT works without relying on training group annotations, i.e. without nuisances, it assumes ERM's misclassifications are due to a reliance on the nuisance. JTT first builds an "identification" model via ERM to isolate samples that are misclassified. Then, JTT trains a model via ERM on the data with the loss for the misclassified samples upweighted (by a constant λ). The number of epochs used to train the identification model and the upweighting constant are hyperparameters that require tuning using group annotations [Liu et al., 2021].
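A compact sketch of the two stages follows; the loader yielding (index, input, label) triples and the helper names are illustrative assumptions, and the identification model is assumed to have already been trained by ERM for the tuned number of epochs.

```python
import torch

def jtt_error_set(identification_model, loader, device="cpu"):
    """Stage 1: collect the indices of training examples the identification model misclassifies."""
    identification_model.eval()
    error_indices = []
    with torch.no_grad():
        for idx, x, y in loader:                    # loader yields (index, input, label)
            preds = identification_model(x.to(device)).argmax(dim=-1).cpu()
            error_indices.extend(idx[preds != y].tolist())
    return set(error_indices)

def jtt_example_weights(num_examples, error_indices, lam):
    """Stage 2: upweight the misclassified examples by lambda; the rest keep weight 1."""
    weights = torch.ones(num_examples)
    weights[list(error_indices)] = lam
    return weights
```

The resulting weights slot into a weighted-ERM training loop exactly like the importance weights sketched for NURD above.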
The commonality of a biased model. The central part in NURD, DFL, POE, and JTT is a model that predicts the label using nuisances, like p_tr(y | z), which we call the biased model as in He et al. [2019]. The predictive models in each B-SCAM are guided to not depend on nuisances used by the biased model. While B-SCAMs reduce dependence on nuisances, they build biased models using additional nuisance annotations or require assumptions that ERM-trained models predict using the nuisance. In the next section, we describe an alternative: corrupt semantic information with data augmentations to construct biased models.
3 Out-of-distribution generalization via Semantic Corruptions
The previous section summarized how biased models can be built in B-SCAMs using either direct knowledge of
nuisances or knowledge that ERM-trained models rely on the nuisances. We now introduce semantic corruptions
and show how they enable building biased models. Semantic corruptions are transformations of the covariates
that do not retain any knowledge of the semantics, except what may be in the nuisance z:
Definition 2 (Semantic Corruption). A semantic corruption is a transformation of the covariates T(x, δ), where δ is a random variable such that δ ⊥ (y, z, x*, x), if

    for all p_D ∈ F :   T(x, δ) ⊥ x* | z under p_D.
Here, we characterize conditions under which biased models built from semantic corruptions could be used to estimate robust models. As discussed in section 2, p_⊥(y | x) is the optimal risk-invariant predictor, and is the target of ERM when predicting the label y from x under the nuisance-randomized distribution p_⊥. NURD estimates this distribution as part of the algorithm, and methods like JTT aim to approximate p_⊥, for example, by upweighting samples misclassified by a model that relies on z to predict y. We compare p_⊥, which is obtained by breaking the nuisance-label relationship, against the distribution p^T obtained by breaking the relationship between the label and the data augmentation:

    p_⊥(y, x) = ∫ [ p_tr(y) / p_tr(y | z) ] p_tr(y, z, x) dz,        p^T(y, x) = ∫ p(δ) [ p_tr(y) / p_tr(y | T(x, δ)) ] p_tr(y, x) dδ.
We show here that the L1 distance between p_⊥(y, x) and p^T(y, x) is controlled by an L2 distance between the biased models built from the nuisance and the data augmentation respectively:

Proposition 1. Let T : X × R^d → X be a function. Assume the random variable p_tr(y | T(x, δ))^{-1} has a bounded second moment under the distribution p_⊥(y, z, x) p(δ), and that p_tr(y | T(x, δ)) and p_tr(y | z) satisfy

    E_{p_⊥(y,z,x) p(δ)} [ p_tr(y | T(x, δ))^{-2} ] ≤ m^2,        E_{p_⊥(y,z,x) p(δ)} [ | p_tr(y | T(x, δ)) − p_tr(y | z) |^2 ] = ε^2.

Then, the L1 distance between p_⊥(y, x) and p^T(y, x) is bounded: ‖ p_⊥(y, x) − p^T(y, x) ‖_1 ≤ m ε. For a semantic corruption that also satisfies y ⊥ z | T(x, δ) under p_tr, the inequalities hold with ε = 0.
If ε = 0, then p^T(y, x) = p_⊥(y, x), which means that almost surely the conditionals match: p_⊥(y | x) = p^T(y | x). Then, as p_⊥(y | x) is the optimal risk-invariant predictor, so is p^T(y | x). More generally, standard domain adaptation risk bounds that are controlled by the total variation distance between source and target [Ben-David et al., 2010, Theorem 1] bound the risk of a model under p_⊥ using the L1 bound m ε — which upper bounds the total variation — and the risk under p^T.
Without nuisance annotations, one cannot estimate the L2 distance between the two biased models p_tr(y | z) and p_tr(y | T(x, δ)) in proposition 1. This distance can be large when a transformation T(x, δ) retains semantic information. To avoid this, we turn to a complementary source of knowledge: the semantic features. Using this knowledge, we design families of data augmentations that corrupt the semantic information in x to construct semantic corruptions. Focusing on two popular OOD tasks, object recognition and NLI, we use only semantic knowledge to build corruptions that retain some aspects of the covariates. Biased models built on such corruptions will depend on any retained nuisances; more retained nuisances mean better biased models.
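Concretely, plugging a corruption into a B-SCAM amounts to fitting the biased model on (T(x, δ), y) pairs in place of (z, y) pairs. The sketch below assumes a generic PyTorch training loop and a `corruption` callable such as the patch randomization of section 3.1; the function names and loop details are our own illustration.

```python
import torch
import torch.nn.functional as F

def train_biased_model_on_corruption(biased_model, optimizer, loader, corruption, epochs=5):
    """Fit p_tr(y | T(x, delta)) by applying the semantic corruption to every batch."""
    biased_model.train()
    for _ in range(epochs):
        for x, y in loader:
            logits = biased_model(corruption(x))    # predict the label from corrupted covariates
            loss = F.cross_entropy(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return biased_model
```

The fitted model then plays the role that p_tr(y | z) plays in NURD, POE, and DFL, and that the identification model plays in JTT.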
3.1 Semantic corruptions via permutations
We first build corruptions for settings where the semantics appear as global structure. We give an intuitive example of such global semantics. Consider the waterbirds dataset from Sagawa et al. [2019], with waterbirds and landbirds appearing predominantly on backgrounds with water and land respectively. Semantic features like the wing shape and the presence of webbed feet are corrupted by randomly permuting small patches; see fig. 1a. Formally, given subsets of the covariates x_1, ..., x_k extracted in an order, global semantics e(x_1, ..., x_k) change with the order of extraction.
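The following sketch implements such a patch permutation for a batch of images; the tensor layout, the default 14-pixel patch, and applying a single permutation to the whole batch are our illustrative choices. Patch size trades off how much global semantic structure is destroyed against how much local texture is preserved.

```python
import torch

def patch_randomize(images, patch_size=14, generator=None):
    """Randomly permute non-overlapping patches of each image in a batch.
    images: tensor of shape (B, C, H, W) with H and W divisible by patch_size."""
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    # Split into patches: (B, num_patches, C, patch_size, patch_size).
    patches = (images
               .unfold(2, patch_size, patch_size)
               .unfold(3, patch_size, patch_size)
               .permute(0, 2, 3, 1, 4, 5)
               .reshape(b, ph * pw, c, patch_size, patch_size))
    perm = torch.randperm(ph * pw, generator=generator)
    patches = patches[:, perm]                      # same permutation for the whole batch
    # Reassemble the permuted patches into images.
    patches = patches.reshape(b, ph, pw, c, patch_size, patch_size).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(b, c, h, w)
```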