Monotonic Risk Relationships under Distribution Shifts
for Regularized Risk Minimization
Daniel LeJeune1, Jiayu Liu2, and Reinhard Heckel2
1Department of Statistics, Stanford University
2Department of Electrical and Computer Engineering, Technical University of Munich
Abstract
Machine learning systems are often applied to data that is drawn from a different distribu-
tion than the training distribution. Recent work has shown that for a variety of classification
and signal reconstruction problems, the out-of-distribution performance is strongly linearly cor-
related with the in-distribution performance. If this relationship, or more generally a monotonic one, holds, it has important consequences: for example, it allows optimizing performance on
one distribution as a proxy for performance on the other. In this paper, we study conditions
under which a monotonic relationship between the performances of a model on two distributions
is expected. We prove an exact asymptotic linear relation for squared error and a monotonic
relation for misclassification error for ridge-regularized general linear models under covariate
shift, as well as an approximate linear relation for linear inverse problems.
1 Introduction
Machine learning models are typically evaluated by shuffling a set of labeled data, splitting it into
training and test sets, and evaluating the model trained on the training set on the test set. This
measures how well the model performs on the distribution the model was trained on. However,
in practice a model is most commonly not applied to such in-distribution data, but rather to out-
of-distribution data that is almost always at least slightly different. In order to understand the
performance of machine learning methods in practice, it is therefore important to understand how
out-of-distribution performance relates to in-distribution performance.
While there are settings in which models with similar in-distribution performance have different
out-of-distribution performance (McCoy et al., 2020), a series of recent empirical studies have
shown that often, the in-distribution and out-of-distribution performances of models are strongly
correlated:
- Recht et al. (2019), Yadav and Bottou (2019), and Miller et al. (2020) constructed new test
sets for the popular CIFAR-10, ImageNet, and MNIST image classification problems and
for the SQuAD question answering dataset by following the original data collection and
labeling process as closely as possible. For CIFAR-10 and ImageNet the performance drops
significantly when evaluated on the new test set, indicating that even when following the
original data collection and labeling process, a significant distribution shift can occur. In
addition, for all four distribution shifts, the in- and out-of-distribution errors are strongly
linearly correlated.
Emails: daniel@dlej.net, jiayu.liu@tum.de, reinhard.heckel@tum.de
arXiv:2210.11589v2 [cs.LG] 20 Jul 2023
- Miller et al. (2021) identified a strong linear correlation of the performance of image classifiers
for a variety of natural distribution shifts. Apart from classification, the linear performance
relationship phenomenon is also observed in machine learning tasks where models produce
real-valued output, for example in pose estimation (Miller et al., 2021) and object detection (Caine et al., 2021).
- Darestani et al. (2021) identified a strong linear correlation of the performance of image
reconstruction methods for a variety of natural distribution shifts. This relation between
in- and out-of-distribution performances persisted for image reconstruction methods that are
only tuned, i.e., for which only a small set of hyperparameters is chosen via optimization on
the training data.
An important consequence of a linear, or more generally, a monotonic relationship between
in- and out-of-distribution performances is that a model that performs better in-distribution also
performs better on out-of-distribution data, and thus measuring in-distribution performance can
serve as a proxy for tuning and comparing different models for application on out-of-distribution
data.
It is therefore important to understand when a linear or more generally a monotonic relation-
ship between the performance on two distributions occurs. In this paper we study this question
theoretically and empirically for a class of distribution shifts where the feature or signal models
come from different distributions, also known as covariate shift.
First, we show that for a real-world regression problem, in- and out-of-distribution performances
are linearly correlated. Specifically, we show that for object detection, the performance of models
trained on the COCO 2017 training set and evaluated on the COCO 2017 validation set is linearly
correlated with the performance on the VOC 2012 dataset. This finding establishes that a linear risk
relation also occurs for regression problems, beyond the classification problems studied in prior work.
We then consider a simple linear regression model with a feature vector drawn from a different
subspace for in- and out-of-distribution data. We provide sufficient conditions under which a linear
relation between the in- and out-of-distribution risks occurs for a linear estimator.
Next, we consider a general setup encompassing classification and regression, and consider a
distribution shift model on the feature vectors. We consider a large class of estimators obtained
with regularized empirical risk minimization, and show that as various training parameters change,
including for example the regularization strength or the number of training examples (resulting in
different estimators), the relationship between in- and out-of-distribution performances is mono-
tonic. Different classes of estimators follow different monotonic relations, and we also observe this
in practice (see Figure 3). Interestingly, for a certain class of shifts in classification, we recover a
linear relation for a nonlinear function of the risks that is remarkably similar to that demonstrated
empirically by Miller et al. (2021).
Finally, we study linear inverse problems, to understand when a linear relation occurs in a signal
reconstruction problem. We consider a distribution shift model consisting of a shift in subspace as
well as noise variance, and again characterize conditions under which a linear or near-linear relation
between in- and out-of-distribution performances exists.
Our results suggest that the linear risk relationships observed in regression and classification arise
from different mechanisms: a shift in feature subspace for regression and a shift in feature scaling
for classification.
Code for the experiments and figures in this paper can be found at https://github.com/MLI-lab/monotonic_risk_relationships.
[Figure 1 plots. Left: VOC 2012 MSE vs. COCO 2017 MSE for Faster R-CNN, Mask R-CNN, Keypoint R-CNN, RetinaNet, SSD, and YOLOv5, with a linear fit and the line y = x. Middle: singular values (log scale) of the COCO 2017 and VOC 2012 feature matrices. Right: subspace similarity as a function of k.]
Figure 1: Bounding box prediction on the COCO 2017 and VOC 2012 datasets. Left: There is an
approximate linear relation of mean squared error (MSE) for models trained on COCO 2017. Middle:
The spectrum of the feature spaces of YOLOv5 on the two datasets decays quickly, which suggests
that a feature subspace model could be a reasonable approximation. Right: A principal-angle-based
similarity between the subspaces spanned by the top $k$ principal components on the two datasets.
The subspaces are well aligned, which is a sufficient condition for a linear relation as stated in
Theorem 1.
1.1 Prior Theoretical Work on Characterizing Linear Performance Relations
Classical theory for characterizing out-of-distribution performance ensures that the difference be-
tween in- and out-of-distribution performance of an estimator is bounded by a function of the
distance of the training and test distributions (Quiñonero-Candela et al., 2008; Ben-David et al.,
2010; Cortes and Mohri, 2014). Such bounds often apply to a class of target distributions. In
contrast, we are interested in precise relationships between a fixed source and target distribution.
Regarding characterizing linear relationships, Miller et al. (2021, Sec. 7) proved that for a
distribution shift for a binary mixture model, the in- and out-of-distribution accuracies have a
linear relation if the feature vectors are sufficiently high-dimensional. Mania and Sra (2020)
showed that an approximate linear relationship occurs under a model similarity assumption that
high accuracy models correctly classify most of the data points that are correctly classified by lower
accuracy models.
Most related to our work is that of Tripuraneni et al. (2021), who revealed an exact linear
relation for squared error of a linear random feature regression model under a covariate shift in the
high-dimensional limit. This covariate shift is philosophically similar to the simplifying assumption
we make for the main statement and interpretation of our results, and yields a similar linear relation
for squared error. However, our results apply to a broader class of general linear models and extend
to misclassification error, and we go further to capture how the distribution shift can depend on
the task itself, which captures how classification problems can become easier or harder. Moreover,
our results predict general monotonic relationships as opposed to only linear ones.
2 Linear Relations in Regression and Motivation for the Subspace
Model
Prior work in the distribution shift literature for prediction tasks has focused on either classification
or on problems with real-valued outputs but using discrete performance metrics (for example, pose
estimation (Miller et al., 2021) and object detection (Caine et al., 2021)). Here, we consider a real-
valued squared error metric and show that linear relationships between in- and out-of-distribution
performances also occur in a standard regression setup.
We evaluate a collection of neural network models for object detection, which are trained on
the COCO 2017 training set (Lin et al., 2014): Faster R-CNN (Ren et al., 2015), Mask R-CNN (He
et al., 2017), Keypoint R-CNN (He et al., 2017), SSD (Liu et al., 2016), RetinaNet (Lin et al., 2017),
and YOLOv5 (Redmon et al., 2016; Jocher et al., 2020). See Figure 1 (left), where we compute
their mean squared errors for bounding box coordinate prediction on the COCO 2017 validation
set and the VOC 2012 training/validation set (Everingham et al., 2010). The models we consider
all perform worse on the out-of-distribution data, and the in- and out-of-distribution performances
are approximately linearly related.
It is in general difficult to model distribution shifts analytically. In this work, one aspect of
distribution shifts that we model is the change in the subspaces where the feature vectors lie. To
motivate this model, we next examine the feature space of the YOLOv5 model on the in- and
out-of-distribution data.
The YOLOv5 model, and all other models considered, can be viewed as making a prediction
for an image by generating features through several layers and then aggregating those features
with a linear layer (or a very shallow neural network). We consider the 512-dimensional feature
vectors from the penultimate layer of YOLOv5 as the features. Let $\{z_j^{(i)} \in \mathbb{R}^{512} : i \in [N_{\mathrm{in}}], j \in [K_{\mathrm{in}}^{(i)}]\}$ and $\{z_j^{(i)} \in \mathbb{R}^{512} : i \in [N_{\mathrm{out}}], j \in [K_{\mathrm{out}}^{(i)}]\}$ be the sets of feature vectors
of the in- and out-of-distribution data, respectively, where $z_j^{(i)}$ is the feature vector of the $j$th true
positive prediction on image $i$, $N_{\mathrm{in}}$ and $N_{\mathrm{out}}$ are the numbers of images in the respective datasets,
and $K_{\mathrm{in}}^{(i)}$ and $K_{\mathrm{out}}^{(i)}$ are the numbers of true positive predictions on the $i$th image of the respective
datasets. We perform principal component analysis on these two sets of feature vectors and plot
the spectra in Figure 1 (middle). We observe that approximating the feature space by the top
100 principal components explains 96.0% and 95.6% of the variance of COCO 2017 and VOC
2012, respectively. This observation demonstrates that the feature vectors approximately lie in
low-dimensional subspaces.
Moreover, Figure 1 (right) shows that the feature subspaces for the two distributions overlap
substantially. Specifically, Figure 1 (right) shows the subspace similarity defined as
$\sqrt{\|\cos(\theta)\|_2^2 / k}$ (Soltanolkotabi and Candès, 2012; Heckel and Bölcskei, 2015), where $\theta$ is the vector
of principal angles between the subspaces spanned by the top $k$ principal components of the
individual feature vector sets. The subspaces spanned by the top 100 principal components, which
account for over 95% of the variance, have a subspace similarity of 0.855 (note that the maximum
value 1 is achieved for $\theta = 0$). More details on the experiment are in Appendix A.1.
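As a concrete illustration, this principal-angle similarity can be computed from two feature matrices with two SVDs. The sketch below uses synthetic data in place of the YOLOv5 features; the function name and dimensions are illustrative choices, not part of the original experiment.

```python
import numpy as np

def subspace_similarity(Z_a, Z_b, k):
    """Principal-angle similarity sqrt(||cos(theta)||_2^2 / k) between the
    subspaces spanned by the top-k principal components of two feature sets.
    Z_a, Z_b: (n_samples, d) arrays of feature vectors."""
    # Top-k principal directions: rows of Vh from the SVD of the centered data.
    U1 = np.linalg.svd(Z_a - Z_a.mean(0), full_matrices=False)[2][:k].T  # (d, k)
    U2 = np.linalg.svd(Z_b - Z_b.mean(0), full_matrices=False)[2][:k].T
    # Singular values of U1^T U2 are the cosines of the principal angles.
    cos_theta = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.sqrt(np.sum(cos_theta**2) / k)

rng = np.random.default_rng(0)
Z = rng.standard_normal((1000, 50))
# Identical feature sets give similarity 1 (up to numerical precision);
# independent random sets give a value strictly below 1.
print(subspace_similarity(Z, Z, k=10))
```

The similarity equals 1 exactly when the two top-$k$ subspaces coincide, since every principal angle is then zero.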
Because the output of neural networks is simply a linear model applied to this feature space,
this observation suggests that the relationship between in- and out-of-distribution performances of
even highly nonlinear models such as neural networks on data from highly nonlinear spaces may be
modeled by a change in linear subspaces of a transformed feature space. Therefore, we theoretically
study the effect of changes of subspace in linear models and the resulting performance relationships.
Our results consider fixed feature spaces, while different deep learning models have different feature
representations at the final layer. However, our study can shed light on performance changes of
models from the same family that share similar feature representations under distribution shifts.
3 Linear Relations in Regression in Finite Dimensions
We begin our theoretical study by considering the linear regression setting under additive noise:
$y = x^\top \beta^* + z$, where $\beta^* \in \mathbb{R}^d$ is a fixed parameter vector that determines the model, and $z$ is
independent observation noise. We assume that the feature vector $x$ is drawn randomly from a
subspace, also known as the hidden manifold model (Goldt et al., 2020). Let $d_P, d_Q \le d$. For data
from distribution $P$, the feature vector is given by $x = U_P c_P$, where $U_P \in \mathbb{R}^{d \times d_P}$ has orthonormal
columns and $c_P \in \mathbb{R}^{d_P}$ is zero-mean and has identity covariance. The noise variable is zero-mean
and has variance $\sigma_P^2$. The data from distribution $Q$ is generated in the same manner, but the
signal is from a different subspace with $x = U_Q c_Q$, where $U_Q \in \mathbb{R}^{d \times d_Q}$ has orthonormal columns,
$c_Q \in \mathbb{R}^{d_Q}$ is zero-mean and has identity covariance, and the noise is zero-mean and has variance $\sigma_Q^2$.
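This subspace data model is easy to simulate. The sketch below (dimensions and sample size are arbitrary illustrative choices) draws features $x = U_P c_P$ and checks that their covariance approximates the rank-$d_P$ projection $U_P U_P^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_P, n = 20, 5, 200_000

# Orthonormal basis of a random d_P-dimensional subspace via QR.
U_P, _ = np.linalg.qr(rng.standard_normal((d, d_P)))
C = rng.standard_normal((n, d_P))  # latent coefficients, identity covariance
X = C @ U_P.T                      # feature vectors lie in range(U_P)

# Empirical covariance of x approximates U_P U_P^T, a rank-d_P projection.
cov = X.T @ X / n
print(np.allclose(cov, U_P @ U_P.T, atol=0.05))
```

Because $c_P$ has identity covariance and $U_P$ has orthonormal columns, $\mathbb{E}[xx^\top] = U_P U_P^\top$ exactly; the tolerance only absorbs sampling error.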
For an estimate $\hat{\beta}$ of $\beta^*$, define the risk on distribution $P$ with respect to the squared error metric
as $R_P(\hat{\beta}) = \mathbb{E}_P\big[(y - x^\top \hat{\beta})^2\big]$ (respectively $R_Q(\hat{\beta})$ on distribution $Q$). We are interested in the
relation of those risks for a class of estimators. We consider an estimate of the model parameter
$\beta^*$ assuming knowledge of the distribution for simplicity, equivalent to having large amounts of
training data. The analysis can be extended readily to estimates based on finite samples. We
consider the estimator
$$\hat{\beta}_\lambda = \arg\min_{\beta} \; \mathbb{E}_P\big[(\beta^\top x - y)^2\big] + \lambda \|\beta\|_2^2,$$
parameterized by the regularization parameter $\lambda$. It can be shown that $\hat{\beta}_\lambda = \alpha U_P U_P^\top \beta^*$ for
$\alpha = 1/(1 + \lambda)$, which is the projection of $\beta^*$ onto the subspace, scaled by the factor $\alpha \in [0, 1]$.
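The closed form follows because $\mathbb{E}_P[xx^\top] = U_P U_P^\top$ is a projection, so $(U_P U_P^\top + \lambda I)^{-1} U_P U_P^\top = U_P U_P^\top/(1+\lambda)$. A minimal numerical sketch (dimensions and $\lambda$ are arbitrary choices) comparing the population ridge solution to $\alpha\, U_P U_P^\top \beta^*$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_P, lam = 12, 4, 0.5

U_P, _ = np.linalg.qr(rng.standard_normal((d, d_P)))
beta_star = rng.standard_normal(d)

# Population second moments: E[x x^T] = U_P U_P^T, E[x y] = U_P U_P^T beta_star,
# so the minimizer of the regularized objective is (Sigma + lam*I)^{-1} Sigma beta_star.
Sigma = U_P @ U_P.T
ridge = np.linalg.solve(Sigma + lam * np.eye(d), Sigma @ beta_star)
closed_form = (1.0 / (1.0 + lam)) * U_P @ (U_P.T @ beta_star)
print(np.allclose(ridge, closed_form))
```

Note that the estimate lies in $\mathrm{range}(U_P)$ for every $\lambda$; only its scale changes with the regularization strength.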
The following theorem provides sufficient conditions for a linear relation between the in- and
out-of-distribution risks $R_P(\hat{\beta}_\lambda)$ and $R_Q(\hat{\beta}_\lambda)$ of this class of estimators parameterized by the
regularization parameter $\lambda$. Theorem 1 is a consequence of Theorem 6 in Appendix C, which also
provides a necessary condition for a linear risk relation.

Theorem 1 (Sufficient conditions). The out-of-distribution risk $R_Q(\hat{\beta}_\lambda)$ is an affine function of the
in-distribution risk $R_P(\hat{\beta}_\lambda)$ as a function of the regularization parameter $\lambda$ if one of the following
conditions holds:

(a) $\mathrm{range}(U_Q) \subseteq \mathrm{range}(U_P)$, or $\mathrm{range}(U_P) \subseteq \mathrm{range}(U_Q)$;

(b) $\beta^* \in \mathrm{range}(U_P)$.

Moreover, for random $\beta^*$, the expected out-of-distribution risk $\mathbb{E}_{\beta^*}\big[R_Q(\hat{\beta}_\lambda)\big]$ is an affine function
of the expected in-distribution risk $\mathbb{E}_{\beta^*}\big[R_P(\hat{\beta}_\lambda)\big]$ if

(c) $\mathbb{E}\big[\beta^* \beta^{*\top}\big] = I$.
Condition (a) is a property of the distribution shift itself. When the subspaces are aligned
between the two distributions, we observe a linear risk relationship for the set of estimators parameterized by $\lambda$. Recall from the previous section that the feature subspaces of the object detection
model we evaluate roughly align, as shown in Figure 1 (right). Thus, our theorem suggests a linear
relationship, which in turn sheds light on the linear relationship we observed in practice. We remark
that the linear relationship guaranteed by Theorem 1 is exact assuming full knowledge of the source
distribution, but only approximate in the finite sample regime for an estimate that minimizes the
regularized empirical risk.
Condition (b) is a property of the parameter vector $\beta^*$ and its learnability under distribution
$P$. Under condition (b), $\hat{\beta}_\lambda = \alpha \beta^*$, which greatly simplifies the risks:
$$R_P(\hat{\beta}_\lambda) = (1-\alpha)^2 \beta^{*\top}\beta^* + \sigma_P^2 \quad \text{and} \quad R_Q(\hat{\beta}_\lambda) = (1-\alpha)^2 \beta^{*\top} U_Q U_Q^\top \beta^* + \sigma_Q^2.$$