WINDOW-BASED DISTRIBUTION SHIFT DETECTION FOR DEEP NEURAL NETWORKS

Guy Bar-Shalom
Department of Computer Science, Technion
guy.b@cs.technion.ac.il

Yonatan Geifman
Deci.AI
yonatan.g@cs.technion.ac.il

Ran El-Yaniv
Department of Computer Science, Technion; Deci.AI
rani@cs.technion.ac.il
ABSTRACT
To deploy and operate deep neural models in production, the quality of their predictions, which might
be contaminated benignly or manipulated maliciously by input distributional deviations, must be
monitored and assessed. Specifically, we study the case of monitoring the healthy operation of a
deep neural network (DNN) receiving a stream of data, with the aim of detecting input distributional
deviations over which the quality of the network’s predictions is potentially damaged. Using selective
prediction principles, we propose a distribution deviation detection method for DNNs. The proposed
method is derived from a tight coverage generalization bound computed over a sample of instances
drawn from the true underlying distribution. Based on this bound, our detector continuously monitors
the operation of the network out-of-sample over a test window and fires off an alarm whenever a
deviation is detected. Our novel detection method performs on par with or better than the state of the art, while consuming substantially less computation time (a five-orders-of-magnitude reduction) and space. Unlike previous methods, which require at least linear dependence on the size of the
source distribution for each detection, rendering them inapplicable to “Google-Scale” datasets, our
approach eliminates this dependence, making it suitable for real-world applications.
1 Introduction
A wide range of artificial intelligence applications and services rely on deep neural models because of their remarkable
accuracy. When a trained model is deployed in production, its operation should be monitored for abnormal behavior,
and a flag should be raised if any is detected. Corrective measures can be taken if the underlying cause of the abnormal
behavior is identified. For example, simple distributional changes may only require retraining with fresh data, while
more severe cases may require redesigning the model (e.g., when new classes emerge).
We are concerned with distribution shift detection in the context of deep neural models and consider the following setting. A pretrained model $f$ is given, and we presume it was trained with data sampled from some distribution $P$. In addition to the dataset used in training $f$, we are also given an additional sample of data from $P$, which is used to train a detector $D$ (we refer to this as the detection-training set or source set). While $f$ is used in production to process a stream of emerging input data, we continually feed $D$ with the most recent window $W_k$ of $k$ input elements. The detector also has access to the final layers of the model $f$ and should be able to determine whether the data contained in $W_k$ came from a distribution different from $P$. We emphasize that in this paper we are not considering the problem of identifying single-instance out-of-distribution or outlier instances [1, 2, 3, 4, 5, 6, 7, 8], but rather the information residing in a population of $k$ instances. While it may seem straightforward to apply single-instance detectors to a window (by applying the detector to each instance in the window), this approach can be computationally expensive since such methods are not designed for window-based tasks; see the discussion in Section 3. Moreover, we demonstrate here that computationally feasible single-instance methods can fail to detect population-based deviations. We emphasize that we are not concerned with characterizing types of distribution shifts, nor do we tackle at all the complementary topic of out-of-distribution robustness.
Although distribution shift detection has been scarcely considered in the context of deep neural networks (DNNs), it is a fundamental topic in machine learning and statistics. The standard method for tackling it is to perform
a dimensionality reduction over both the detection-training (source) and test (target) samples, and then applying a
two-sample statistical test over these reduced representations to detect a deviation. This is further discussed in Section 3.
In particular, deep models can benefit from the semantic representation created by the model itself, which provides
meaningful dimensionality reduction that is readily available at the last layers of the model. Using the embedding layer
(or softmax) along with statistical two-sample tests was recently proposed by [9] and [10], who termed solutions of this structure black-box shift detection (BBSD). Using both the univariate Kolmogorov-Smirnov (KS) test and the maximum mean discrepancy (MMD) method (see details below), [10] achieves impressive detection results when using MNIST and CIFAR-10 as proxies for the distribution $P$. As shown here, the KS test is also very effective over ImageNet when a stronger model is used (ResNet50 vs. ResNet18). However, BBSD methods have the disadvantage of being computationally intensive (and probably inapplicable to real-world datasets) due to the use of two-sample tests between the detection-training set (which can be, and preferably is, as large as possible) and the window $W_k$; a complexity analysis is provided in Section 5.
We propose a different approach based on selective prediction [11, 12], where a model quantifies its prediction uncertainty and abstains from predicting on uncertain instances. First, we develop a method for selective prediction with guaranteed coverage. This method identifies the best abstaining threshold and coverage bound for a given pretrained classifier $f$, such that the resulting empirical coverage will not violate the bound with high probability (when abstention is determined using the threshold). The guaranteed coverage method is of independent interest, and it is analogous to selective prediction with guaranteed risk [12]. Because the empirical coverage of such a classifier is highly unlikely to violate the bound if the underlying distribution remains the same, a systematic violation indicates a distribution shift.
To be more specific, given a detection-training sample $S_m$, our coverage-based detection algorithm computes a fixed number of tight generalization coverage bounds, which are then used to detect a distribution shift in a window $W_k$ of test data. The proposed detection algorithm exhibits remarkable computational efficiency due to its ability to operate independently of the size of $S_m$ during detection, which is crucial when considering “Google-scale” datasets, such as the JFT-3B dataset. In contrast, the currently available distribution shift detectors rely on a framework with significantly higher computational requirements (this framework is illustrated in Figure 3 in Appendix A). A run-time comparison of these methods and ours is provided in Figure 1.
In a comprehensive empirical study, we compared our coverage-based detection algorithm with the best-performing baselines, including the KS approach of [10]. Additionally, we investigated the suitability of single-instance detection methods for identifying distribution shifts. For a fair comparison, all methods used the same (publicly available) underlying models: ResNet50, MobileNetV3-S, and ViT-T. To evaluate the effectiveness of our approach, we employed the ImageNet dataset to simulate the source distribution. We then introduced distribution shifts using a range of methods, starting with simple noise and progressing to more sophisticated techniques such as adversarial examples. These experiments make it evident that our coverage-based detection method is overall significantly more powerful than the baselines across a wide range of test window sizes.
To summarize, the contributions of this paper are: (1) a theoretically justified algorithm (Algorithm 1) that produces a coverage bound, which is of independent interest and allows for the creation of selective classifiers with guaranteed coverage; (2) a theoretically motivated “windowed” detection algorithm (Algorithm 2), which detects a distribution shift over a window and exhibits remarkable efficiency compared to state-of-the-art methods (five orders of magnitude faster than the best baseline); (3) a comprehensive empirical study demonstrating significant improvements relative to existing baselines, and introducing the use of single-instance methods for detecting distribution shifts.
2 Problem Formulation
We consider the problem of detecting distribution shifts in input streams provided to pretrained deep neural models.
Let $P$ denote a probability distribution over an input space $\mathcal{X}$, and assume that a model $f$ has been trained on a set of instances drawn from $P$. Consider a setting where the model $f$ is deployed, and while it is being used in production its input distribution might change or even be attacked by an adversary. Our goal is to detect such events to allow for appropriate action, e.g., retraining the model with respect to the revised distribution.
Inspired by [10], we formulate this problem as follows. We are given a pretrained model $f$ and a detection-training set $S_m \sim P^m$. We would like to train a detection model that can detect a distribution shift; namely, discriminate between windows containing in-distribution (ID) data and alternative-distribution (AD) data. Thus, given an unlabeled test sample window $W_k \sim Q^k$, where $Q$ is a possibly different distribution, the objective is to determine whether $P \neq Q$. We also ask what is the smallest test sample size $k$ required to determine that $P \neq Q$. Since the detection-training set $S_m$ can typically be quite large, we further ask whether it is possible to devise an effective detection procedure whose time complexity is $o(m)$.
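To make the $o(m)$ requirement concrete, the following minimal interface sketch (names and structure are ours, not the paper's) separates the offline stage, which may scale with $m$, from the per-window detection call, which should not:

```python
import numpy as np

class WindowShiftDetector:
    """Illustrative interface only (our naming, not the paper's).

    fit() runs once offline and may take time that scales with m = |S_m|;
    detect() is invoked per test window W_k and should run in o(m) time,
    i.e., its cost must not grow linearly with the detection-training set.
    """

    def fit(self, source_sample: np.ndarray) -> "WindowShiftDetector":
        # Summarize S_m into constant-size state (e.g., thresholds and bounds).
        raise NotImplementedError

    def detect(self, window: np.ndarray) -> bool:
        # True iff the window is flagged as coming from Q != P.
        raise NotImplementedError
```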
3 Related Work
Distribution shift detection methods often comprise the following two steps: dimensionality reduction, and a two-sample
test between the detection-training sample and test samples. In most cases, these methods are “lazy” in the sense that
for each test sample, they make a detection decision based on a computation over the entire detection-training sample.
Their performance will be sub-optimal if only a subset of the train sample is used. Figure 3 in Appendix A illustrates
this general framework.
The use of dimensionality reduction is optional. It can often improve performance by focusing on a less noisy representation of the data. Dimensionality reduction techniques include no reduction, principal component analysis [13], sparse random projection [14], autoencoders [15, 16], domain classifiers [10], and more. In this work we focus on black-box shift detection (BBSD) methods [9], which rely on deep neural representations of the data generated by a pretrained model. The representation we extract from the model will typically utilize either the softmax outputs, acronymed BBSD-Softmax, or the embeddings, acronymed BBSD-Embeddings; for simplicity, we may omit the BBSD acronym. Due to the dimensionality of the final representation, multivariate or multiple univariate two-sample tests can be conducted.
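For concreteness, here is a minimal sketch of extracting the two representations from a pretrained torchvision ResNet50 (our illustration; the paper does not prescribe this exact code):

```python
import torch
import torchvision.models as models

# Pretrained ImageNet classifier used as the black box.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Everything up to (and including) the global average pool; drops the fc head.
embedding_net = torch.nn.Sequential(*list(model.children())[:-1])

@torch.no_grad()
def bbsd_representations(x: torch.Tensor):
    """x: a batch of ImageNet-normalized images, shape (n, 3, 224, 224)."""
    emb = embedding_net(x).flatten(1)          # BBSD-Embeddings: penultimate features, (n, 2048)
    softmax = torch.softmax(model(x), dim=1)   # BBSD-Softmax: class probabilities, (n, 1000)
    return emb, softmax
```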
By combining BBSD-Softmax with a Kolmogorov-Smirnov (KS) statistical test [17] and using the Bonferroni correction [18], [10] achieved state-of-the-art results in distribution shift detection in the context of image classification (MNIST and CIFAR-10). We acronym their method as BBSD-KS-Softmax (or KS-Softmax). The univariate KS test processes individual dimensions separately; its statistic is calculated by computing the largest difference $Z$ between the cumulative distribution functions (CDFs) across all dimensions as follows: $Z = \sup_z |F_P(z) - F_Q(z)|$, where $F_P$ and $F_Q$ are the empirical CDFs of the detection-training and test data (which are sampled from $P$ and $Q$, respectively; see Section 2). The Bonferroni correction rejects the null hypothesis when the minimal p-value among all tests is less than $\alpha/d$, where $\alpha$ is the significance level and $d$ is the number of dimensions. Although less conservative approaches to aggregation exist [19, 20], they usually assume some dependencies among the tests.
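A minimal sketch of this procedure, per-dimension KS tests aggregated with Bonferroni, is given below (our illustration using scipy, not the authors' implementation):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_bonferroni_detect(source: np.ndarray, window: np.ndarray,
                         alpha: float = 0.05) -> bool:
    """source: (m, d) and window: (k, d) representations (e.g., softmax outputs).

    Runs a univariate KS test per dimension and rejects the null hypothesis
    (no shift) if the smallest p-value falls below alpha / d (Bonferroni).
    """
    d = source.shape[1]
    p_values = [ks_2samp(source[:, j], window[:, j]).pvalue for j in range(d)]
    return min(p_values) < alpha / d
```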
The maximum mean discrepancy (MMD) method [21] is a kernel-based multivariate test that can be used to distinguish between probability distributions $P$ and $Q$. Formally, $\mathrm{MMD}^2(\mathcal{F}, P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{F}}^2$, where $\mu_P$ and $\mu_Q$ are the mean embeddings of $P$ and $Q$ in a reproducing kernel Hilbert space $\mathcal{F}$. Given a kernel $K$ and samples $\{x_1, x_2, \ldots, x_m\} \sim P^m$ and $\{x'_1, x'_2, \ldots, x'_k\} \sim Q^k$, an unbiased estimator for $\mathrm{MMD}^2$ can be found in [21, 22]. [23] and [21] used the RBF kernel $K(x, x') = e^{-\frac{1}{2\sigma^2}\|x - x'\|_2^2}$, where $2\sigma^2$ is set to the median of the pairwise Euclidean distances between all samples. The p-value is obtained by performing a permutation test on the kernel matrix. In our experiments (see Section 6.4), we thus use four population-based baselines: KS-Softmax, KS-Embeddings, MMD-Softmax, and MMD-Embeddings.
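The following sketch implements the unbiased $\mathrm{MMD}^2$ estimator with this RBF kernel and median heuristic, and obtains a p-value by a permutation test (our illustrative code; details such as the number of permutations are assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(pooled: np.ndarray) -> np.ndarray:
    """RBF kernel on the pooled sample, with 2*sigma^2 set to the median
    pairwise Euclidean distance (the heuristic described above)."""
    sq = cdist(pooled, pooled, "sqeuclidean")
    med = np.median(np.sqrt(sq[np.triu_indices_from(sq, k=1)]))
    return np.exp(-sq / med)  # exp(-||x - x'||^2 / (2 sigma^2))

def mmd2_unbiased(K: np.ndarray, m: int) -> float:
    """Unbiased MMD^2 from the pooled kernel matrix; first m rows are the source."""
    k = K.shape[0] - m
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (k * (k - 1))
            - 2.0 * Kxy.mean())

def mmd_permutation_pvalue(source: np.ndarray, window: np.ndarray,
                           n_perms: int = 500, seed: int = 0) -> float:
    """p-value for H0: P = Q via a permutation test on the kernel matrix."""
    rng = np.random.default_rng(seed)
    m = len(source)
    K = rbf_kernel(np.vstack([source, window]))
    observed = mmd2_unbiased(K, m)
    count = 0
    for _ in range(n_perms):
        idx = rng.permutation(K.shape[0])          # reshuffle sample labels
        count += mmd2_unbiased(K[np.ix_(idx, idx)], m) >= observed
    return (count + 1) / (n_perms + 1)
```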
As mentioned in the introduction, our work is complementary to the topic of single-instance out-of-distribution (OOD) detection [1, 2, 3, 4, 5, 6, 7, 8]. Although these methods can be applied to each instance in a window, they often fail to capture population statistics within the window, making them inadequate for detecting population-based changes. Additionally, many of these methods are computationally expensive and cannot be applied efficiently to large windows. For example, methods such as those described in [24, 25] extract values from each layer in the network, while others such as [1] require gradient calculations. We note that applying the best single-instance methods, such as [24, 25, 1], in our (large-scale) empirical setting is computationally challenging and precludes an empirical comparison to our method. Therefore, we consider (in Section 6.4) the detection performance of two computationally efficient single-instance baselines: Softmax-Response (abbreviated as Single-instance SR or Single-SR) and Entropy-based (abbreviated as Single-instance Ent or Single-Ent), as described in [26, 27, 2]. Specifically, we apply each single-instance OOD detector to every sample in the window and in the detection-training set. We then use a two-sample t-test to determine the p-value between the uncertainty estimates of the two samples.
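A minimal sketch of this single-instance baseline protocol (our illustration; the use of Welch's variant of the t-test is our choice):

```python
import numpy as np
from scipy.stats import ttest_ind

def softmax_response(probs: np.ndarray) -> np.ndarray:
    """Per-instance confidence: the maximum softmax probability (Single-SR)."""
    return probs.max(axis=1)

def entropy_score(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-instance predictive entropy (Single-Ent); higher = more uncertain."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def single_instance_pvalue(source_probs: np.ndarray, window_probs: np.ndarray,
                           score=softmax_response) -> float:
    """Two-sample t-test between per-instance scores on S_m and W_k."""
    return ttest_ind(score(source_probs), score(window_probs),
                     equal_var=False).pvalue
```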
Finally, we mention [12], who developed a risk generalization bound for selective classifiers [11]. The bound presented in that paper is analogous to the coverage generalization bound we present in Theorem 4.2. The risk bound in [12] can also be used for shift detection; to apply it to this task, however, labels, which are not available in our setting, are required. Our method (Section 4) detects distribution shifts without using any labels.
4 Proposed Method – Coverage-Based Detection
In this section, we present a novel technique for detecting a distribution shift based on selective prediction principles
(definitions follow). We develop a tight generalization coverage bound that holds with high probability for ID data,
sampled from the source distribution. The main idea is that violations of this coverage bound indicate a distribution
shift with high probability.
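As a preview of what follows, here is a minimal sketch of this detection rule, assuming a selection threshold `theta` and a guaranteed coverage lower bound `c_low` have already been computed from the detection-training sample (both names are ours; the actual procedure, Algorithm 2, aggregates several such bounds):

```python
import numpy as np

def empirical_coverage(confidences: np.ndarray, theta: float) -> float:
    """Fraction of window instances the selective classifier accepts (g(x) = 1)."""
    return float(np.mean(confidences >= theta))

def coverage_shift_alarm(window_confidences: np.ndarray,
                         theta: float, c_low: float) -> bool:
    """Fire an alarm when the window's empirical coverage violates a bound
    that holds with high probability under the source distribution P."""
    return empirical_coverage(window_confidences, theta) < c_low
```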
4.1 Selection with Guaranteed Coverage
We begin by introducing basic selective prediction terminology and definitions that are required to describe our method.
Consider a standard multiclass classification problem, where $\mathcal{X}$ is some feature space (e.g., raw image data) and $\mathcal{Y}$ is a finite label set, $\mathcal{Y} = \{1, 2, 3, \ldots, C\}$, representing $C$ classes. Let $P(X, Y)$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$, and define a classifier as a function $f: \mathcal{X} \to \mathcal{Y}$. We refer to $P$ as the source distribution. A selective classifier [11] is a pair $(f, g)$, where $f$ is a classifier and $g: \mathcal{X} \to \{0, 1\}$ is a selection function [28], which serves as a binary qualifier for $f$ as follows:
$$(f, g)(x) \triangleq \begin{cases} f(x), & \text{if } g(x) = 1; \\ \text{don't know}, & \text{if } g(x) = 0. \end{cases}$$
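In code, a selective classifier of this form might look as follows, with $g$ thresholding a confidence score such as the maximum softmax probability (an illustrative choice on our part; the threshold selection is developed in the remainder of this section):

```python
import numpy as np

def selective_classify(probs: np.ndarray, theta: float) -> np.ndarray:
    """Selective classifier (f, g): predict argmax where confidence >= theta,
    abstain ("don't know", encoded here as -1) otherwise."""
    preds = probs.argmax(axis=1)           # f(x)
    accept = probs.max(axis=1) >= theta    # g(x) = 1 iff confidence clears theta
    return np.where(accept, preds, -1)
```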