Nuisances via Negativa:
Adjusting for Spurious Correlations via Data Augmentation
Aahlad Puli 1*, Nitish Joshi 1, Yoav Wald 2, He He 1,2, Rajesh Ranganath 1,2,3
1 Department of Computer Science, New York University
2 Center for Data Science, New York University
3 Department of Population Health, Langone Health, New York University
Abstract
In prediction tasks, there exist features that are related to the label in the same way across different settings
for that task; these are semantic features or semantics. Features with varying relationships to the label are
nuisances. For example, in detecting cows from natural images, the shape of the head is semantic, but because images of cows often, though not always, have grass backgrounds, the background is a nuisance. Models that exploit
nuisance-label relationships face performance degradation when these relationships change. Building models
robust to such changes requires additional knowledge beyond samples of the features and labels. For example,
existing work uses annotations of nuisances or assumes ERM-trained models depend on nuisances. Approaches to
integrate new kinds of additional knowledge enlarge the settings where robust models can be built. We develop an approach that uses knowledge of the semantics to corrupt them in the data, and then uses the corrupted data to produce models that identify correlations between nuisances and the label. Once these correlations are identified, they can be used to adjust predictions so that nuisances no longer drive them. We study semantic corruptions
in powering different spurious-correlation avoiding methods on multiple out-of-distribution (OOD) tasks like
classifying waterbirds, natural language inference (NLI), and detecting cardiomegaly in chest X-rays.
1 Introduction
Relationships between the label and the covariates can change across data collected at different places and times.
For example, in classifying animals, data collected in natural habitats have cows appear more often on grasslands, while penguins appear more often on backgrounds of snow; these animal-background relationships do not hold outside natural habitats [Beery et al., 2018, Arjovsky et al., 2019]. Some features, like an animal's shape, are predictive of the label across all settings for a task; these are semantic features, or semantics in short. Other features with varying relationships with the label, like the background, are nuisances. Even with semantics present, models trained via empirical risk minimization (ERM) can predict using nuisances and thus fail to generalize [Geirhos et al., 2020]. Models that rely only on the semantic features perform well even when the nuisance-label relationship changes, unlike models that rely on nuisances.
Building models that generalize under changing nuisance-label relationships requires additional knowledge, be-
yond a dataset of features and labels sampled from the training distribution. For example, many works assume
knowledge of the nuisance. In the animal-background example, this would correspond to a feature that speci-
fies the image background, which we may use when specifying our learning algorithm [Mahabadi et al., 2019, Makar et al., 2022, Veitch et al., 2021, Puli et al., 2022]; another common type of assumption is access to multiple datasets over which the nuisance-label correlation varies [Arjovsky et al., 2019, Peters et al., 2016, Wald et al., 2021], and some other forms of knowledge have been explored [Mahajan et al., 2021, Gao et al., 2023, Feder et al., 2023].
*Corresponding author: aahlad@nyu.edu. Published at TMLR 2024: https://openreview.net/forum?id=RIFJsSzwKY.
Semantic Corruptions. In this paper, we explore the use of a different type of knowledge: corruptions of se-
mantic features. Intuitively, imagine trying to predict the label from a corrupted input T(x), where all semantic
information has been removed. Any better-than-chance prediction provides a window into the nuisances, as it must rely on them. We then use the resulting biased models to guide methods that we identify here as biased-model-based spurious-correlation avoiding methods (B-SCAMs).
B-SCAMs. There is a class of methods in the literature that use the predictions of a biased model to adjust for nuisances and learn predictors that are free of spurious correlations. Among others, these include Just Train Twice (JTT) [Liu et al., 2021], EIIL [Creager et al., 2021], Nuisance-Randomized Distillation (NURD) [Puli et al., 2022], and debiased focal loss (DFL) and product of experts (POE) [Mahabadi et al., 2019]. The key question arising from these works is: how can we obtain biased models? In empirical studies, prior works on B-SCAMs either use annotations of the nuisance or an ERM-trained model over the training data as a placeholder for the biased model. The latter approach, based on an ERM-trained model, is successful if that model completely ignores semantic information. In practice, these heuristics are rather fragile. Annotations for nuisances are seldom available, and we lack a principled method to ascertain whether a model trained with ERM relies only on semantic features. Therefore, employing semantic corruptions could serve as a valuable alternative to these heuristics. We claim that semantic corruptions offer a principled and useful approach to obtaining biased models.
Semantic corruptions T(x) must strike a delicate balance between removing semantic information and preserving nuisances. For example, if T(x) replaces all pixels in an image with random noise, it corrupts semantics while simultaneously erasing all information about the nuisances. An ideal T(x) would isolate nuisances by targeting
only the semantic information in the input, e.g., by in-painting the animal for the task of classifying cows and
penguins. Implementing such ideal corruptions is unrealistic, as they are task-specific and may require human
annotations of the semantic features; e.g., one can segment the objects in every image. Doing so for all classifi-
cation problems is extremely laborious. In tasks like NLI, it is unclear even how to annotate semantics, as they do
not correspond to simple features like subsets of words. In summary, after outlining the desired characteristics of
semantic corruptions, we define corruptions that are beneficial across multiple tasks and do not require human
annotation. Our contributions are as follows:
1. We show that acquiring additional knowledge beyond a labeled dataset is necessary for effectively learning robust models (theorem 1). Then, in proposition 1, we formalize sufficient conditions under which additional knowledge in the form of a semantic corruption enables B-SCAMs to learn robust models.
2. We develop multiple semantic corruptions for object recognition and natural language inference. These include patch randomization, n-gram randomization, frequency filtering, and intensity filtering. We then situate existing procedures, such as region-of-interest masking and premise masking, under the umbrella of semantic corruptions.
3. Empirically, we demonstrate that any semantic corruption can power any B-SCAM. The corruption-powered versions of these methods outperform ERM on out-of-distribution (OOD) generalization tasks like Waterbirds, cardiomegaly detection from chest X-rays, and NLI. Corruption-powered NURD, DFL, and POE achieve performance similar to the same methods run with extra observed nuisance variables. Corruption-powered JTT outperforms vanilla JTT.
2 Biased-model-based spurious-correlation avoiding methods
A spurious correlation is a relationship between the covariates x and the label y that changes across settings like time and location [Geirhos et al., 2020]. The features whose relationship with the label changes are called nuisances. With a vector of nuisances z, let p_tr(y, z, x) and p_te(y, z, x) be the training and test distributions.

Achieving robustness to spurious correlations requires additional knowledge. In the presence of spurious correlations, the training distribution p_tr may not equal the test distribution p_te. Without further assumptions, no algorithm that only sees data from p_tr(y, x) can produce a predictor that works well on p_te. To achieve generalization when p_te ≠ p_tr, work in the OOD generalization literature assumes a relationship between the training and test distributions. We follow the work of Makar et al. [2022], Puli et al. [2022] and assume that only the nuisance-label relationship — i.e. the conditional z | y — changes between training and test. Formally, we let
p_tr, p_te come from a family of distributions whose members have different nuisance-label relationships but share the same relationship between the label and the semantics x*:

Definition 1 (Nuisance-varying family with semantic features x* [Makar et al., 2022, Puli et al., 2022]).

    F = { p_D : p_D(y, z, x*, x) = p(y, x*) p_D(z | y) p(x | z, x*) }.    (1)

Many common tasks in OOD generalization, including some from section 4, fit this definition. For example, in classifying natural images, the background type is the nuisance z and its relationship to the label can change across places, each corresponding to a different member of F. The animal shape however is made of semantic features x* that are related to the label in the same way across places. Like in this example, we assume that the semantic features x* equal a function of the covariates, x* = e(x), almost surely under every p_D ∈ F, but neither x* nor e(·) are known. Finally, the semantics and nuisances together account for all the information that x has about y, meaning x ⊥ y | x*, z under every p_D.
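To make Definition 1 concrete, the following is a toy generative sketch of a nuisance-varying family; it is our own illustrative construction, not an example from the paper. The label-semantics relationship p(y, x*) and the mechanism p(x | z, x*) are fixed, while the parameter rho controls how strongly the nuisance z tracks the label in a given member p_D.

```python
import numpy as np

def sample_member(n, rho, rng=None):
    """Draw n samples from one member p_D of a toy nuisance-varying family.
    p(y, x_star) is the same for every member, p_D(z | y) depends on rho
    (how strongly the nuisance tracks the label), and p(x | z, x_star) is fixed."""
    rng = np.random.default_rng() if rng is None else rng
    y = rng.integers(0, 2, size=n)
    x_star = y + 0.5 * rng.standard_normal(n)      # semantics: fixed relation to y across members
    agree = rng.random(n) < rho                    # nuisance agrees with the label w.p. rho
    z = np.where(agree, y, 1 - y) + 0.5 * rng.standard_normal(n)
    x = np.stack([x_star, z], axis=1)              # covariates mix semantics and nuisance
    return x, y, z
```

Training on data drawn with rho close to 1 and testing with rho close to 0 reproduces the kind of nuisance-label shift that trips up ERM-trained predictors.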
Building models that are robust to a shifting nuisance-label relationship relies on additional knowledge, such as nuisance annotations, in the training data [Makar et al., 2022, Veitch et al., 2021, Puli et al., 2022, Sagawa et al., 2019, Yao et al., 2022]. Given knowledge of z, work such as Makar et al. [2022], Puli et al. [2022] estimates a distribution, denoted p_⊥, under which the label and nuisance are independent (y ⊥ z under p_⊥):

    p_⊥(y, x) = ∫ p(y, x*) p_tr(z) p(x | z, x*) dz dx*.

Following Puli et al. [2022], we call p_⊥ the nuisance-randomized distribution. The model p_⊥(y = 1 | x) achieves the lowest risk on any member of the family F amongst the set of risk-invariant models (see Proposition 1 of Makar et al. [2022]). However, even when p_tr, p_te ∈ F and optimal risk-invariant predictors can be built with nuisances, it is impossible to always beat random chance when given data {y, x} ∼ p_tr:

Theorem 1. For any learning algorithm, there exists a nuisance-varying family F where predicting with p_⊥(y = 1 | x) achieves 90% accuracy on all members, such that given training data (y, x) from one member p_tr ∈ F, the algorithm cannot achieve better accuracy than 50% (random chance) on some p_te ∈ F.
The proof is in appendix A and proceeds in two steps. With ACC_p(h) as the expected accuracy of a model h on distribution p, the first step of the proof defines two nuisance-varying families F_1, F_2 such that no single model can perform well on both families simultaneously; any h(x) for which ACC_{p_1}(h) > 50% for all p_1 ∈ F_1 will have ACC_{p_2}(h) < 50% for some p_2 ∈ F_2, and vice versa. The second step shows that the two families F_1, F_2 have a member with the same distribution over (y, x); letting the training data come from this distribution means that any learning algorithm that returns a performant model — one that beats 50% accuracy — on one family must fail to return a performant model on the other. Next, we discuss different methods that use additional knowledge beyond (y, x) to build robust predictors.
2.1 Biased-model-based spurious-correlation avoiding methods.
We focus on methods that correct models using knowledge of nuisances or where they might appear in the covariates [Mahabadi et al., 2019, Puli et al., 2022, Liu et al., 2021]. We first establish that the common central
part in these methods is a model that predicts the label using nuisances, which we call the biased model; due to
this commonality, we call these biased-model-based spurious-correlation avoiding methods (B-SCAMs). At a high
level, a B-SCAM has two components. The first is a biased model that is built to predict the label by exploiting the
nuisance-label relationship via extra knowledge or assumptions. The biased model is then used to guide a second
model to predict the label without relying on nuisances.
We briefly summarize the different B-SCAMs here, differentiated by the additional knowledge they use to build
biased models. The differences between the methods are summarized in table 1. We give details for NURD here
and defer algorithmic details about the rest to appendix B.
Biased models from knowledge of the nuisances. The first category of B-SCAMs, from Mahabadi et al. [2019], Puli et al. [2022], assumes additional knowledge in the form of nuisance annotations z. For example, in NLI — where the goal is determining if a premise sentence entails a hypothesis — Mahabadi et al. [2019] compute the fraction of words shared between the hypothesis and the premise for each sample in the training data and use this as one of the nuisance features in building the biased model.

Table 1: Summary of NURD, JTT, POE, and DFL. Each method approximates the biased model p_tr(y|z). This table describes the different biased models, their names, and how they are built.

    Method   | Name                 | What the biased model is   | Assumptions/Knowledge
    ---------|----------------------|----------------------------|---------------------------
    JTT      | Identification model | p_tr(y|x) learned via ERM  | ERM learns biased models.
    POE/DFL  | Biased model         | p_tr(y|z) learned via ERM  | z from domain knowledge.
    NURD     | Weight model         | p_tr(y|z) learned via ERM  | z from domain knowledge.

The biased model in NURD, POE, and DFL is learned by predicting the label from the nuisance annotations in the training data to estimate p_tr(y | z). Using nuisance annotations, Makar et al. [2022], Puli et al. [2022] use the model p_tr(y | z) as the biased model to define importance weights and minimize risk w.r.t. a distribution p_⊥ obtained as

    p_⊥(y, z, x) = p_tr(y) p_tr(z) p(x | y, z) = [ p_tr(y) / p_tr(y | z) ] p_tr(z) p_tr(y | z) p(x | y, z) = [ p_tr(y) / p_tr(y | z) ] p_tr(y, z, x).
The second step in NURD [Puli et al., 2022] trains a model to predict y from a representation r(x) on data from p_⊥ such that z ⊥ y | r(x) under p_⊥; this step is called distillation. Due to y ⊥ z under p_⊥, learning in p_⊥ avoids features that depend only on the nuisance, and due to z ⊥ y | r(x) under p_⊥, distillation avoids features that are mixed functions of the label and the nuisance (e.g. x_1 = y + z). With these insights, NURD builds models of the form p_⊥(y | r(x)) that are most informative of the label. Mechanically, NURD's distillation solves this:

    max_{θ, γ}  E_{p_⊥} [ log p_θ(y | r_γ(x)) ] − λ I_{p_⊥}(y ; z | r_γ(x)).

Puli et al. [2022] show that such models are best in a class of predictors with lower bounds on performance. The mutual information above is zero when y ⊥ z | x under p_⊥; this condition holds for semantic corruptions as we discuss in appendix B. Thus, we run the distillation step as importance-weighted ERM on the training data.
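To make the two steps concrete, below is a minimal PyTorch sketch — our illustration, not the authors' code — of computing the weights p_tr(y)/p_tr(y | z) from a trained biased model and taking one importance-weighted ERM step. The function names are assumptions, `label_marginal` is assumed to hold the empirical p_tr(y), and the mutual-information penalty is dropped because, as noted above, it is zero in the settings targeted here.

```python
import torch
import torch.nn.functional as F

def nuisance_randomization_weights(biased_logits, y, label_marginal):
    """Per-example weights w = p_tr(y) / p_tr(y | z), with p_tr(y | z) read off a biased model."""
    p_y_given_z = torch.softmax(biased_logits, dim=-1).gather(1, y.unsqueeze(1)).squeeze(1)
    p_y = label_marginal[y]                         # empirical label marginal p_tr(y)
    return p_y / p_y_given_z.clamp_min(1e-6)        # clamp guards against tiny probabilities

def weighted_erm_step(predictor, optimizer, x, y, weights):
    """One gradient step of importance-weighted cross-entropy, i.e. risk under the
    nuisance-randomized distribution."""
    losses = F.cross_entropy(predictor(x), y, reduction="none")
    loss = (weights.detach() * losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With a semantic corruption, `biased_logits` would come from a model trained to predict y from T(x, δ) rather than from nuisance annotations.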
Mahabadi et al. [2019] consider two methods to train a biased model and a base predictive model jointly to make the base model predict without relying on the biases. They propose 1) POE, where the loss is the sum of the log loss of the two models, and 2) DFL, where the biased model is used to weight the cross-entropy loss for the base model. For both methods, Mahabadi et al. [2019] build a biased model as p_tr(y | z). Intuitively, the base model focuses on classifying samples that the biased model misclassifies. The methods fine-tune a BERT model [Devlin et al., 2019] and do not propagate the gradients of the biased model to update the common parameters (token embeddings).
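The two losses can be rendered schematically as follows; this is a sketch under our own naming assumptions, not the authors' implementation. POE applies cross-entropy to the sum of the two models' log-probabilities, while DFL down-weights the base loss on examples the biased model already classifies confidently; `gamma` is a focusing hyperparameter we introduce for illustration, and `detach` stands in for not propagating the biased model's gradients.

```python
import torch
import torch.nn.functional as F

def poe_loss(base_logits, biased_logits, y):
    # Product of experts: cross-entropy on the (renormalized) product of the two predictive distributions.
    combined = F.log_softmax(base_logits, dim=-1) + F.log_softmax(biased_logits, dim=-1)
    return F.cross_entropy(combined, y)

def dfl_loss(base_logits, biased_logits, y, gamma=2.0):
    # Debiased focal loss: weight each example by (1 - p_biased(y | z))^gamma so the base
    # model focuses on samples that cannot be solved from the bias alone.
    p_bias = torch.softmax(biased_logits, dim=-1).gather(1, y.unsqueeze(1)).squeeze(1).detach()
    ce = F.cross_entropy(base_logits, y, reduction="none")
    return ((1.0 - p_bias) ** gamma * ce).mean()
```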
Biased models from assumptions on ERM-trained models. The second category of B-SCAMs, like LFF [Nam et al., 2020], UMIX [Han et al., 2022], and JTT [Liu et al., 2021], requires additional knowledge that vanilla ERM builds a biased model that exploits the nuisance-label relationship. Given such a model, these works use it to reduce a second model's dependence on the nuisance. We focus on JTT [Liu et al., 2021], which aims to build models robust to group shift, where the relative mass of a fixed set of disjoint groups of the data changes between training and test times. The groups here are subsets of the data defined by a pair of discrete label and nuisance values. While JTT works without relying on training group annotations, i.e. without nuisances, it assumes ERM's misclassifications are due to a reliance on the nuisance. JTT first builds an "identification" model via ERM to isolate samples that are misclassified. Then, JTT trains a model via ERM on the data with the loss for the misclassified samples upweighted (by a constant λ). The number of epochs used to train the identification model and the upweighting constant are hyperparameters that require tuning using group annotations [Liu et al., 2021].
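A compact sketch of the two stages follows; the loader yielding (index, input, label) triples and the helper names are illustrative assumptions, and the identification model is assumed to have already been trained by ERM for the tuned number of epochs.

```python
import torch

def jtt_error_set(identification_model, loader, device="cpu"):
    """Stage 1: collect the indices of training examples the identification model misclassifies."""
    identification_model.eval()
    error_indices = []
    with torch.no_grad():
        for idx, x, y in loader:                    # loader yields (index, input, label)
            preds = identification_model(x.to(device)).argmax(dim=-1).cpu()
            error_indices.extend(idx[preds != y].tolist())
    return set(error_indices)

def jtt_example_weights(num_examples, error_indices, lam):
    """Stage 2: upweight the misclassified examples by lambda; the rest keep weight 1."""
    weights = torch.ones(num_examples)
    weights[list(error_indices)] = lam
    return weights
```

The resulting weights slot into a weighted-ERM training loop exactly like the importance weights sketched for NURD above.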
The commonality of a biased model. The central part in NURD, DFL, POE, and JTT is a model that predicts the label using nuisances, like p_tr(y | z), which we call the biased model as in He et al. [2019]. The predictive models in each B-SCAM are guided to not depend on nuisances used by the biased model. While B-SCAMs reduce dependence on nuisances, they build biased models using additional nuisance annotations or require assumptions that ERM-trained models predict using the nuisance. In the next section, we describe an alternative: corrupt semantic information with data augmentations to construct biased models.
3 Out-of-distribution generalization via Semantic Corruptions
The previous section summarized how biased models can be built in B-SCAMs using either direct knowledge of
nuisances or knowledge that ERM-trained models rely on the nuisances. We now introduce semantic corruptions
and show how they enable building biased models. Semantic corruptions are transformations of the covariates
that do not retain any knowledge of the semantics, except what may be in the nuisance z:
Definition 2 (Semantic Corruption). A semantic corruption is a transformation of the covariates T(x, δ), where δ is a random variable such that δ ⊥ (y, z, x*, x), if

    for all p_D ∈ F :   T(x, δ) ⊥ x* | z under p_D.
Here, we characterize conditions under which biased models built from semantic corruptions could be used to estimate robust models. As discussed in section 2, p_⊥(y | x) is the optimal risk-invariant predictor, and is the target of ERM when predicting the label y from x under the nuisance-randomized distribution p_⊥. NURD estimates this distribution as part of the algorithm, and methods like JTT aim to approximate p_⊥, for example, by upweighting samples misclassified by a model that relies on z to predict y. We compare p_⊥, which is obtained by breaking the nuisance-label relationship, against the distribution p^T obtained by breaking the relationship between the label and the data augmentation:

    p_⊥(y, x) = ∫ [ p_tr(y) / p_tr(y | z) ] p_tr(y, z, x) dz,        p^T(y, x) = ∫ p(δ) [ p_tr(y) / p_tr(y | T(x, δ)) ] p_tr(y, x) dδ.
We show here that the L1 distance between p_⊥(y, x) and p^T(y, x) is controlled by an L2 distance between the biased models built from the nuisance and the data augmentation respectively:

Proposition 1. Let T : X × R^d → X be a function. Assume the random variable p_tr(y | T(x, δ))^{-1} has a bounded second moment under the distribution p_⊥(y, z, x) p(δ), and that p_tr(y | T(x, δ)) and p_tr(y | z) satisfy

    E_{p_⊥(y,z,x) p(δ)} [ p_tr(y | T(x, δ))^{-2} ] ≤ m^2,        E_{p_⊥(y,z,x) p(δ)} [ | p_tr(y | T(x, δ)) − p_tr(y | z) |^2 ] = ε^2.

Then, the L1 distance between p_⊥(y, x) and p^T(y, x) is bounded: ‖ p_⊥(y, x) − p^T(y, x) ‖_1 ≤ m ε. For a semantic corruption that also satisfies y ⊥ z | T(x, δ) under p_tr, the inequalities hold with ε = 0.
If ε = 0, then p^T(y, x) = p_⊥(y, x), which means that almost surely the conditionals match: p_⊥(y | x) = p^T(y | x). Then, as p_⊥(y | x) is the optimal risk-invariant predictor, so is p^T(y | x). More generally, standard domain adaptation risk bounds that are controlled by the total variation distance between source and target [Ben-David et al., 2010, Theorem 1] bound the risk of a model under p_⊥ using the L1 bound m ε — which upper bounds the total variation — and the risk under p^T.
Without nuisance annotations, one cannot estimate the L2 distance between the two biased models p_tr(y | z) and p_tr(y | T(x, δ)) in proposition 1. This distance can be large when a transformation T(x, δ) retains semantic information. To avoid this, we turn to a complementary source of knowledge: the semantic features. Using this knowledge, we design families of data augmentations that corrupt the semantic information in x to construct semantic corruptions. Focusing on two popular OOD tasks, object recognition and NLI, we use only semantic knowledge to build corruptions that retain some aspects of the covariates. Biased models built on such corruptions will depend on any retained nuisances; more retained nuisances mean better biased models.
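Concretely, plugging a corruption into a B-SCAM amounts to fitting the biased model on (T(x, δ), y) pairs in place of (z, y) pairs. The sketch below assumes a generic PyTorch training loop and a `corruption` callable such as the patch randomization of section 3.1; the function names and loop details are our own illustration.

```python
import torch
import torch.nn.functional as F

def train_biased_model_on_corruption(biased_model, optimizer, loader, corruption, epochs=5):
    """Fit p_tr(y | T(x, delta)) by applying the semantic corruption to every batch."""
    biased_model.train()
    for _ in range(epochs):
        for x, y in loader:
            logits = biased_model(corruption(x))    # predict the label from corrupted covariates
            loss = F.cross_entropy(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return biased_model
```

The fitted model then plays the role that p_tr(y | z) plays in NURD, POE, and DFL, and that the identification model plays in JTT.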
3.1 Semantic corruptions via permutations
We first build corruptions for settings where the semantics appear as global structure. We give an intuitive example of such global semantics. Consider the waterbirds dataset from Sagawa et al. [2019], with waterbirds and landbirds appearing predominantly on backgrounds with water and land respectively. Semantic features like the wing shape and the presence of webbed feet are corrupted by randomly permuting small patches; see fig. 1a. Formally, given subsets of the covariates x_1, ..., x_k extracted in an order, global semantics e(x_1, ..., x_k) change with the order of extraction.
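The following sketch implements such a patch permutation for a batch of images; the tensor layout, the default 14-pixel patch, and applying a single permutation to the whole batch are our illustrative choices. Patch size trades off how much global semantic structure is destroyed against how much local texture is preserved.

```python
import torch

def patch_randomize(images, patch_size=14, generator=None):
    """Randomly permute non-overlapping patches of each image in a batch.
    images: tensor of shape (B, C, H, W) with H and W divisible by patch_size."""
    b, c, h, w = images.shape
    ph, pw = h // patch_size, w // patch_size
    # Split into patches: (B, num_patches, C, patch_size, patch_size).
    patches = (images
               .unfold(2, patch_size, patch_size)
               .unfold(3, patch_size, patch_size)
               .permute(0, 2, 3, 1, 4, 5)
               .reshape(b, ph * pw, c, patch_size, patch_size))
    perm = torch.randperm(ph * pw, generator=generator)
    patches = patches[:, perm]                      # same permutation for the whole batch
    # Reassemble the permuted patches into images.
    patches = patches.reshape(b, ph, pw, c, patch_size, patch_size).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(b, c, h, w)
```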