
the COCO 2017 training set (Lin et al., 2014): Faster R-CNN (Ren et al., 2015), Mask R-CNN (He
et al., 2017), Keypoint R-CNN (He et al., 2017), SSD (Liu et al., 2016), RetinaNet (Lin et al., 2017),
and YOLOv5 (Redmon et al., 2016; Jocher et al., 2020). See Figure 1 (left), where we compute
their mean squared errors for bounding box coordinate prediction on the COCO 2017 validation
set and the VOC 2012 training/validation set (Everingham et al., 2010). The models we consider
all perform worse on the out-of-distribution data, and the in- and out-of-distribution performances
are approximately linearly related.
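As a rough illustration of how such a linear fit can be obtained, the sketch below regresses out-of-distribution MSE on in-distribution MSE across models. The numerical values are placeholders introduced here for illustration, not the measurements reported in Figure 1.

```python
import numpy as np

# Hypothetical per-model bounding-box MSEs; placeholders, not the paper's numbers.
mse_in = np.array([0.030, 0.034, 0.036, 0.041, 0.045, 0.052])   # COCO 2017 val (in-distribution)
mse_out = np.array([0.048, 0.055, 0.058, 0.066, 0.072, 0.084])  # VOC 2012 trainval (out-of-distribution)

# Fit the approximately linear relation mse_out ~ a * mse_in + b across models.
slope, intercept = np.polyfit(mse_in, mse_out, deg=1)
residuals = mse_out - (slope * mse_in + intercept)
print(f"mse_out ~= {slope:.2f} * mse_in + {intercept:.3f}")
print(f"max |residual|: {np.abs(residuals).max():.4f}")
```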
It is in general difficult to model distribution shifts analytically. In this work, one aspect of
distribution shifts that we model is the change in the subspaces where the feature vectors lie. To
motivate this model, we next examine the feature space of the YOLOv5 model on the in- and
out-of-distribution data.
The YOLOv5 model, and all other models considered, can be viewed as generating features for an image through several layers and then aggregating those features with a linear layer (or a very shallow neural network) to make a prediction. We consider the 512-dimensional feature vectors from the penultimate layer of YOLOv5 as the features. Let $\{z_j^{(i)} \in \mathbb{R}^{512} : i \in [N_{\mathrm{in}}],\ j \in [K_{\mathrm{in}}^{(i)}]\}$ and $\{z_j^{(i)} \in \mathbb{R}^{512} : i \in [N_{\mathrm{out}}],\ j \in [K_{\mathrm{out}}^{(i)}]\}$ be the sets of feature vectors of the in- and out-of-distribution data, respectively, where $z_j^{(i)}$ is the feature vector of the $j$th true positive prediction on image $i$, $N_{\mathrm{in}}$ and $N_{\mathrm{out}}$ are the numbers of images in the respective datasets, and $K_{\mathrm{in}}^{(i)}$ and $K_{\mathrm{out}}^{(i)}$ are the numbers of true positive predictions on the $i$th image of the respective datasets. We perform principal component analysis on these two sets of feature vectors and plot
the spectra in Figure 1 (middle). We observe that approximating the feature space by the top 100 principal components explains 96.0% and 95.6% of the variance of the COCO 2017 and VOC 2012 features, respectively. This observation demonstrates that the feature vectors approximately lie in low-dimensional subspaces.
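The following sketch illustrates this computation under the assumption that the true-positive feature vectors of each dataset have been stacked into arrays `feats_in` and `feats_out` of shape (number of vectors, 512); these names, and the random placeholder matrices, are ours and only stand in for the real features.

```python
import numpy as np
from sklearn.decomposition import PCA

def explained_variance_top_k(features: np.ndarray, k: int = 100) -> float:
    """Fraction of total variance explained by the top-k principal components."""
    pca = PCA(n_components=k)
    pca.fit(features)
    return float(pca.explained_variance_ratio_.sum())

# Placeholder feature matrices standing in for the true-positive feature vectors;
# each row is one 512-dimensional penultimate-layer feature vector.
rng = np.random.default_rng(0)
feats_in = rng.standard_normal((5000, 512))   # stand-in for COCO 2017 features
feats_out = rng.standard_normal((3000, 512))  # stand-in for VOC 2012 features

print(explained_variance_top_k(feats_in), explained_variance_top_k(feats_out))
```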
Moreover, Figure 1 (right) shows that the feature subspaces of the two distributions overlap substantially. Specifically, Figure 1 (right) shows the subspace similarity defined as $\sqrt{\|\cos(\theta)\|_2^2 / k}$ (Soltanolkotabi and Candès, 2012; Heckel and Bölcskei, 2015), where $\theta$ is the vector of principal angles between the subspaces spanned by the top $k$ principal components of the individual feature vector sets. The subspaces spanned by the top 100 principal components, which account for over 95% of the variance, have a subspace similarity of 0.855 (note that the maximum value of 1 is achieved for $\theta = 0$). More details on the experiment are in Appendix A.1.
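A minimal sketch of this subspace-similarity computation is given below, assuming the same stacked feature arrays as in the previous sketch; the cosines of the principal angles are obtained as the singular values of the product of the two orthonormal bases.

```python
import numpy as np

def top_k_basis(features: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (d x k) of the span of the top-k principal components."""
    centered = features - features.mean(axis=0)
    # Right singular vectors of the centered data matrix are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def subspace_similarity(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 100) -> float:
    """sqrt(||cos(theta)||_2^2 / k) for the principal angles theta between the top-k subspaces."""
    ua, ub = top_k_basis(feats_a, k), top_k_basis(feats_b, k)
    # Singular values of Ua^T Ub equal the cosines of the principal angles.
    cos_theta = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return float(np.sqrt(np.sum(cos_theta ** 2) / k))

# Example with the same placeholder feature matrices as above.
rng = np.random.default_rng(0)
feats_in = rng.standard_normal((5000, 512))
feats_out = rng.standard_normal((3000, 512))
print(subspace_similarity(feats_in, feats_out, k=100))
```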
Because the output of a neural network is simply a linear model applied to this feature space, this observation suggests that the relationship between the in- and out-of-distribution performances of even highly nonlinear models such as neural networks, on data from highly nonlinear spaces, may be modeled by a change in the linear subspaces of a transformed feature space. Therefore, we theoretically study the effect of subspace changes in linear models and the resulting performance relationships.
Our results consider fixed feature spaces, while different deep learning models have different feature
representations at the final layer. However, our study can shed light on performance changes of
models from the same family that share similar feature representations under distribution shifts.
3 Linear Relations in Regression in Finite Dimensions
We begin our theoretical study by considering the linear regression setting under additive noise:
$y = x^T \beta^* + z$, where $\beta^* \in \mathbb{R}^d$ is a fixed parameter vector that determines the model, and $z$ is independent observation noise. We assume that the feature vector $x$ is drawn randomly from a subspace, also known as the hidden manifold model (Goldt et al., 2020). Let $d_P, d_Q \le d$. For data