
the COCO 2017 training set (Lin et al., 2014): Faster R-CNN (Ren et al., 2015), Mask R-CNN (He
et al., 2017), Keypoint R-CNN (He et al., 2017), SSD (Liu et al., 2016), RetinaNet (Lin et al., 2017),
and YOLOv5 (Redmon et al., 2016; Jocher et al., 2020). See Figure 1 (left), where we compute
their mean squared errors for bounding box coordinate prediction on the COCO 2017 validation
set and the VOC 2012 training/validation set (Everingham et al., 2010). The models we consider
all perform worse on the out-of-distribution data, and the in- and out-of-distribution performances
are approximately linearly related.
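As a rough illustration of how such a linear fit can be obtained, the sketch below regresses out-of-distribution MSE on in-distribution MSE across models. The numerical values are placeholders introduced here for illustration, not the measurements reported in Figure 1.

```python
import numpy as np

# Hypothetical per-model bounding-box MSEs; placeholders, not the paper's numbers.
mse_in = np.array([0.030, 0.034, 0.036, 0.041, 0.045, 0.052])   # COCO 2017 val (in-distribution)
mse_out = np.array([0.048, 0.055, 0.058, 0.066, 0.072, 0.084])  # VOC 2012 trainval (out-of-distribution)

# Fit the approximately linear relation mse_out ~ a * mse_in + b across models.
slope, intercept = np.polyfit(mse_in, mse_out, deg=1)
residuals = mse_out - (slope * mse_in + intercept)
print(f"mse_out ~= {slope:.2f} * mse_in + {intercept:.3f}")
print(f"max |residual|: {np.abs(residuals).max():.4f}")
```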
It is in general difficult to model distribution shifts analytically. In this work, one aspect of
distribution shifts that we model is the change in the subspaces where the feature vectors lie. To
motivate this model, we next examine the feature space of the YOLOv5 model on the in- and
out-of-distribution data.
The YOLOv5 model, and all other models considered, can be viewed as generating features for an image through several layers and then aggregating those features with a linear layer (or a very shallow neural network) to make a prediction. We consider the 512-dimensional feature vectors from the penultimate layer of YOLOv5 as the features. Let $\{z_j^{(i)} \in \mathbb{R}^{512} : i \in [N_{\mathrm{in}}],\ j \in [K_{\mathrm{in}}^{(i)}]\}$ and $\{z_j^{(i)} \in \mathbb{R}^{512} : i \in [N_{\mathrm{out}}],\ j \in [K_{\mathrm{out}}^{(i)}]\}$ be the sets of feature vectors of the in- and out-of-distribution data, respectively, where $z_j^{(i)}$ is the feature vector of the $j$th true positive prediction on image $i$, $N_{\mathrm{in}}$ and $N_{\mathrm{out}}$ are the numbers of images in the respective datasets, and $K_{\mathrm{in}}^{(i)}$ and $K_{\mathrm{out}}^{(i)}$ are the numbers of true positive predictions on the $i$th image of the respective datasets. We perform principal component analysis on these two sets of feature vectors and plot
the spectra in Figure 1 (middle). We observe that approximating the feature space by the top 100 principal components explains 96.0% and 95.6% of the variance of the COCO 2017 and VOC 2012 features, respectively. This observation demonstrates that the feature vectors approximately lie in low-dimensional subspaces.
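The following sketch illustrates this computation under the assumption that the true-positive feature vectors of each dataset have been stacked into arrays `feats_in` and `feats_out` of shape (number of vectors, 512); these names, and the random placeholder matrices, are ours and only stand in for the real features.

```python
import numpy as np
from sklearn.decomposition import PCA

def explained_variance_top_k(features: np.ndarray, k: int = 100) -> float:
    """Fraction of total variance explained by the top-k principal components."""
    pca = PCA(n_components=k)
    pca.fit(features)
    return float(pca.explained_variance_ratio_.sum())

# Placeholder feature matrices standing in for the true-positive feature vectors;
# each row is one 512-dimensional penultimate-layer feature vector.
rng = np.random.default_rng(0)
feats_in = rng.standard_normal((5000, 512))   # stand-in for COCO 2017 features
feats_out = rng.standard_normal((3000, 512))  # stand-in for VOC 2012 features

print(explained_variance_top_k(feats_in), explained_variance_top_k(feats_out))
```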
Moreover, Figure 1 (right) shows that the feature subspaces of the two distributions overlap substantially. Specifically, Figure 1 (right) shows the subspace similarity defined as $\sqrt{\|\cos(\theta)\|_2^2 / k}$ (Soltanolkotabi and Candès, 2012; Heckel and Bölcskei, 2015), where $\theta$ is the vector of principal angles between the subspaces spanned by the top $k$ principal components of the individual feature vector sets. The subspaces spanned by the top 100 principal components, which account for over 95% of the variance, have a subspace similarity of 0.855 (note that the maximum value of 1 is achieved for $\theta = 0$). More details on the experiment are in Appendix A.1.
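A minimal sketch of this subspace-similarity computation is given below, assuming the same stacked feature arrays as in the previous sketch; the cosines of the principal angles are obtained as the singular values of the product of the two orthonormal bases.

```python
import numpy as np

def top_k_basis(features: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (d x k) of the span of the top-k principal components."""
    centered = features - features.mean(axis=0)
    # Right singular vectors of the centered data matrix are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def subspace_similarity(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 100) -> float:
    """sqrt(||cos(theta)||_2^2 / k) for the principal angles theta between the top-k subspaces."""
    ua, ub = top_k_basis(feats_a, k), top_k_basis(feats_b, k)
    # Singular values of Ua^T Ub equal the cosines of the principal angles.
    cos_theta = np.linalg.svd(ua.T @ ub, compute_uv=False)
    return float(np.sqrt(np.sum(cos_theta ** 2) / k))

# Example with the same placeholder feature matrices as above.
rng = np.random.default_rng(0)
feats_in = rng.standard_normal((5000, 512))
feats_out = rng.standard_normal((3000, 512))
print(subspace_similarity(feats_in, feats_out, k=100))
```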
Because the output of a neural network is simply a linear model applied to this feature space, this observation suggests that the relationship between the in- and out-of-distribution performances of even highly nonlinear models such as neural networks, on data from highly nonlinear spaces, may be modeled by a change in the linear subspaces of a transformed feature space. Therefore, we theoretically study the effect of subspace changes in linear models and the resulting performance relationships.
Our results consider fixed feature spaces, while different deep learning models have different feature
representations at the final layer. However, our study can shed light on performance changes of
models from the same family that share similar feature representations under distribution shifts.
3 Linear Relations in Regression in Finite Dimensions
We begin our theoretical study by considering the linear regression setting under additive noise:
$y = x^T \beta^* + z$, where $\beta^* \in \mathbb{R}^d$ is a fixed parameter vector that determines the model, and $z$ is independent observation noise. We assume that the feature vector $x$ is drawn randomly from a subspace, also known as the hidden manifold model (Goldt et al., 2020). Let $d_P, d_Q \le d$. For data