WINDOW-BASED DISTRIBUTION SHIFT DETECTION FOR DEEP NEURAL NETWORKS

Guy Bar-Shalom
Department of Computer Science, Technion
guy.b@cs.technion.ac.il

Yonatan Geifman
Deci.AI
yonatan.g@cs.technion.ac.il

Ran El-Yaniv
Department of Computer Science, Technion; Deci.AI
rani@cs.technion.ac.il
ABSTRACT
To deploy and operate deep neural models in production, the quality of their predictions, which might
be contaminated benignly or manipulated maliciously by input distributional deviations, must be
monitored and assessed. Specifically, we study the case of monitoring the healthy operation of a
deep neural network (DNN) receiving a stream of data, with the aim of detecting input distributional
deviations over which the quality of the network’s predictions is potentially damaged. Using selective
prediction principles, we propose a distribution deviation detection method for DNNs. The proposed
method is derived from a tight coverage generalization bound computed over a sample of instances
drawn from the true underlying distribution. Based on this bound, our detector continuously monitors
the operation of the network out-of-sample over a test window and fires off an alarm whenever a
deviation is detected. Our novel detection method performs on par with or better than the state of the art, while consuming substantially less computation time (a five-orders-of-magnitude reduction) and space. Unlike previous methods, which require at least linear dependence on the size of the
source distribution for each detection, rendering them inapplicable to “Google-Scale” datasets, our
approach eliminates this dependence, making it suitable for real-world applications.
1 Introduction
A wide range of artificial intelligence applications and services rely on deep neural models because of their remarkable
accuracy. When a trained model is deployed in production, its operation should be monitored for abnormal behavior,
and a flag should be raised if any is detected. Corrective measures can be taken if the underlying cause of the abnormal
behavior is identified. For example, simple distributional changes may only require retraining with fresh data, while
more severe cases may require redesigning the model (e.g., when new classes emerge).
We are concerned with distribution shift detection in the context of deep neural models and consider the following setting. A pretrained model $f$ is given, and we presume it was trained with data sampled from some distribution $P$. In addition to the dataset used in training $f$, we are also given an additional sample of data from $P$, which is used to train a detector $D$ (we refer to this as the detection-training set or source set). While $f$ is used in production to process a stream of emerging input data, we continually feed $D$ with the most recent window $W_k$ of $k$ input elements. The detector also has access to the final layers of the model $f$ and should be able to determine whether the data contained in $W_k$ came from a distribution different from $P$. We emphasize that in this paper we are not considering the problem of identifying single-instance out-of-distribution or outlier instances [1, 2, 3, 4, 5, 6, 7, 8], but rather the information residing in a population of $k$ instances. While it may seem straightforward to apply single-instance detectors to a window (by applying the detector to each instance in the window), this approach can be computationally expensive since such methods are not designed for window-based tasks; see the discussion in Section 3. Moreover, we demonstrate here that computationally feasible single-instance methods can fail to detect population-based deviations. We emphasize that we are not concerned with characterizing types of distribution shifts, nor do we tackle at all the complementary topic of out-of-distribution robustness.
Although distribution shift detection has been scarcely considered in the context of deep neural networks (DNNs), it is a fundamental topic in machine learning and statistics. The standard method for tackling it is to perform
a dimensionality reduction over both the detection-training (source) and test (target) samples, and then applying a
two-sample statistical test over these reduced representations to detect a deviation. This is further discussed in Section 3.
In particular, deep models can benefit from the semantic representation created by the model itself, which provides
meaningful dimensionality reduction that is readily available at the last layers of the model. Using the embedding layer
(or softmax) along with statistical two-sample tests was recently proposed by [9] and [10], who termed solutions of this structure black-box shift detection (BBSD). Using both the univariate Kolmogorov-Smirnov (KS) test and the maximum mean discrepancy (MMD) method (see details below), [10] achieves impressive detection results when using MNIST and CIFAR-10 as proxies for the distribution $P$. As shown here, the KS test is also very effective over ImageNet when a stronger model is used (ResNet50 vs. ResNet18). However, BBSD methods have the disadvantage of being computationally intensive (and probably inapplicable to real-world datasets) due to the use of two-sample tests between the detection-training set (which can be, and preferably is, as large as possible) and the window $W_k$; a complexity analysis is provided in Section 5.
We propose a different approach based on selective prediction [11, 12], where a model quantifies its prediction uncertainty and abstains from predicting on uncertain instances. First, we develop a method for selective prediction with guaranteed coverage. This method identifies the best abstaining threshold and coverage bound for a given pretrained classifier $f$, such that the resulting empirical coverage will not violate the bound with high probability (when abstention is determined using the threshold). The guaranteed coverage method is of independent interest, and it is analogous to selective prediction with guaranteed risk [12]. Because the empirical coverage of such a classifier is highly unlikely to violate the bound if the underlying distribution remains the same, a systematic violation indicates a distribution shift.
To be more specific, given a detection-training sample $S_m$, our coverage-based detection algorithm computes a fixed number of tight generalization coverage bounds, which are then used to detect a distribution shift in a window $W_k$ of test data. The proposed detection algorithm exhibits remarkable computational efficiency due to its ability to operate independently of the size of $S_m$ during detection, which is crucial when considering “Google-scale” datasets, such as the JFT-3B dataset. In contrast, the currently available distribution shift detectors rely on a framework with significantly higher computational requirements (this framework is illustrated in Figure 3 in Appendix A). A run-time comparison of these methods and ours is provided in Figure 1.
In a comprehensive empirical study, we compared our coverage-based detection algorithm with the best-performing baselines, including the KS approach of [10]. Additionally, we investigated the suitability of single-instance detection methods for identifying distribution shifts. For a fair comparison, all methods used the same (publicly available) underlying models: ResNet50, MobileNetV3-S, and ViT-T. To evaluate the effectiveness of our approach, we employed the ImageNet dataset to simulate the source distribution. We then introduced distribution shifts using a range of methods, starting with simple noise and progressing to more sophisticated techniques such as adversarial examples. These experiments make it evident that our coverage-based detection method is overall significantly more powerful than the baselines across a wide range of test window sizes.
To summarize, the contributions of this paper are: (1) a theoretically justified algorithm (Algorithm 1) that produces a coverage bound, which is of independent interest and allows for the creation of selective classifiers with guaranteed coverage; (2) a theoretically motivated “windowed” detection algorithm (Algorithm 2), which detects a distribution shift over a window and exhibits remarkable efficiency compared to state-of-the-art methods (five orders of magnitude faster than the best baseline); (3) a comprehensive empirical study demonstrating significant improvements relative to existing baselines, and introducing the use of single-instance methods for detecting distribution shifts.
2 Problem Formulation
We consider the problem of detecting distribution shifts in input streams provided to pretrained deep neural models.
Let $P$ denote a probability distribution over an input space $\mathcal{X}$, and assume that a model $f$ has been trained on a set of instances drawn from $P$. Consider a setting where the model $f$ is deployed, and while it is being used in production its input distribution might change or even be attacked by an adversary. Our goal is to detect such events to allow for appropriate action, e.g., retraining the model with respect to the revised distribution.
Inspired by [10], we formulate this problem as follows. We are given a pretrained model $f$ and a detection-training set $S_m \sim P^m$. We would like to train a detection model that can detect a distribution shift; namely, discriminate between windows containing in-distribution (ID) data and alternative-distribution (AD) data. Thus, given an unlabeled test sample window $W_k \sim Q^k$, where $Q$ is a possibly different distribution, the objective is to determine whether $P \neq Q$. We also ask what is the smallest test sample size $k$ required to determine that $P \neq Q$. Since the detection-training set $S_m$ can typically be quite large, we further ask whether it is possible to devise an effective detection procedure whose time complexity is $o(m)$.
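To make the $o(m)$ requirement concrete, the following minimal interface sketch (names and structure are ours, not the paper's) separates the offline stage, which may scale with $m$, from the per-window detection call, which should not:

```python
import numpy as np

class WindowShiftDetector:
    """Illustrative interface only (our naming, not the paper's).

    fit() runs once offline and may take time that scales with m = |S_m|;
    detect() is invoked per test window W_k and should run in o(m) time,
    i.e., its cost must not grow linearly with the detection-training set.
    """

    def fit(self, source_sample: np.ndarray) -> "WindowShiftDetector":
        # Summarize S_m into constant-size state (e.g., thresholds and bounds).
        raise NotImplementedError

    def detect(self, window: np.ndarray) -> bool:
        # True iff the window is flagged as coming from Q != P.
        raise NotImplementedError
```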
3 Related Work
Distribution shift detection methods often comprise the following two steps: dimensionality reduction, and a two-sample
test between the detection-training sample and test samples. In most cases, these methods are “lazy” in the sense that
for each test sample, they make a detection decision based on a computation over the entire detection-training sample.
Their performance will be sub-optimal if only a subset of the train sample is used. Figure 3 in Appendix A illustrates
this general framework.
The use of dimensionality reduction is optional. It can often improve performance by focusing on a less noisy representation of the data. Dimensionality reduction techniques include no reduction, principal component analysis [13], sparse random projection [14], autoencoders [15, 16], domain classifiers [10], and more. In this work we focus on black-box shift detection (BBSD) methods [9], which rely on deep neural representations of the data generated by a pretrained model. The representation we extract from the model will typically utilize either the softmax outputs, acronymed BBSD-Softmax, or the embeddings, acronymed BBSD-Embeddings; for simplicity, we may omit the BBSD acronym. Due to the dimensionality of the final representation, multivariate or multiple univariate two-sample tests can be conducted.
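For concreteness, here is a minimal sketch of extracting the two representations from a pretrained torchvision ResNet50 (our illustration; the paper does not prescribe this exact code):

```python
import torch
import torchvision.models as models

# Pretrained ImageNet classifier used as the black box.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Everything up to (and including) the global average pool; drops the fc head.
embedding_net = torch.nn.Sequential(*list(model.children())[:-1])

@torch.no_grad()
def bbsd_representations(x: torch.Tensor):
    """x: a batch of ImageNet-normalized images, shape (n, 3, 224, 224)."""
    emb = embedding_net(x).flatten(1)          # BBSD-Embeddings: penultimate features, (n, 2048)
    softmax = torch.softmax(model(x), dim=1)   # BBSD-Softmax: class probabilities, (n, 1000)
    return emb, softmax
```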
By combining BBSD-Softmax with a Kolmogorov-Smirnov (KS) statistical test [17] and using the Bonferroni correction [18], [10] achieved state-of-the-art results in distribution shift detection in the context of image classification (MNIST and CIFAR-10). We acronym their method as BBSD-KS-Softmax (or KS-Softmax). The univariate KS test processes individual dimensions separately; its statistic is calculated by computing the largest difference $Z$ between the cumulative distribution functions (CDFs) across all dimensions as follows: $Z = \sup_z |F_P(z) - F_Q(z)|$, where $F_P$ and $F_Q$ are the empirical CDFs of the detection-training and test data (which are sampled from $P$ and $Q$, respectively; see Section 2). The Bonferroni correction rejects the null hypothesis when the minimal p-value among all tests is less than $\alpha/d$, where $\alpha$ is the significance level and $d$ is the number of dimensions. Although less conservative approaches to aggregation exist [19, 20], they usually assume some dependencies among the tests.
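A minimal sketch of this procedure, per-dimension KS tests aggregated with Bonferroni, is given below (our illustration using scipy, not the authors' implementation):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_bonferroni_detect(source: np.ndarray, window: np.ndarray,
                         alpha: float = 0.05) -> bool:
    """source: (m, d) and window: (k, d) representations (e.g., softmax outputs).

    Runs a univariate KS test per dimension and rejects the null hypothesis
    (no shift) if the smallest p-value falls below alpha / d (Bonferroni).
    """
    d = source.shape[1]
    p_values = [ks_2samp(source[:, j], window[:, j]).pvalue for j in range(d)]
    return min(p_values) < alpha / d
```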
The maximum mean discrepancy (MMD) method [21] is a kernel-based multivariate test that can be used to distinguish between probability distributions $P$ and $Q$. Formally, $\mathrm{MMD}^2(\mathcal{F}, P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{F}}^2$, where $\mu_P$ and $\mu_Q$ are the mean embeddings of $P$ and $Q$ in a reproducing kernel Hilbert space $\mathcal{F}$. Given a kernel $K$ and samples $\{x_1, x_2, \ldots, x_m\} \sim P^m$ and $\{x'_1, x'_2, \ldots, x'_k\} \sim Q^k$, an unbiased estimator for $\mathrm{MMD}^2$ can be found in [21, 22]. [23] and [21] used the RBF kernel $K(x, x') = e^{-\frac{1}{2\sigma^2}\|x - x'\|_2^2}$, where $2\sigma^2$ is set to the median of the pairwise Euclidean distances between all samples. The p-value is obtained by performing a permutation test on the kernel matrix. In our experiments (see Section 6.4), we thus use four population-based baselines: KS-Softmax, KS-Embeddings, MMD-Softmax, and MMD-Embeddings.
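The following sketch implements the unbiased $\mathrm{MMD}^2$ estimator with this RBF kernel and median heuristic, and obtains a p-value by a permutation test (our illustrative code; details such as the number of permutations are assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(pooled: np.ndarray) -> np.ndarray:
    """RBF kernel on the pooled sample, with 2*sigma^2 set to the median
    pairwise Euclidean distance (the heuristic described above)."""
    sq = cdist(pooled, pooled, "sqeuclidean")
    med = np.median(np.sqrt(sq[np.triu_indices_from(sq, k=1)]))
    return np.exp(-sq / med)  # exp(-||x - x'||^2 / (2 sigma^2))

def mmd2_unbiased(K: np.ndarray, m: int) -> float:
    """Unbiased MMD^2 from the pooled kernel matrix; first m rows are the source."""
    k = K.shape[0] - m
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (k * (k - 1))
            - 2.0 * Kxy.mean())

def mmd_permutation_pvalue(source: np.ndarray, window: np.ndarray,
                           n_perms: int = 500, seed: int = 0) -> float:
    """p-value for H0: P = Q via a permutation test on the kernel matrix."""
    rng = np.random.default_rng(seed)
    m = len(source)
    K = rbf_kernel(np.vstack([source, window]))
    observed = mmd2_unbiased(K, m)
    count = 0
    for _ in range(n_perms):
        idx = rng.permutation(K.shape[0])          # reshuffle sample labels
        count += mmd2_unbiased(K[np.ix_(idx, idx)], m) >= observed
    return (count + 1) / (n_perms + 1)
```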
As mentioned in the introduction, our work is complementary to the topic of single-instance out-of-distribution (OOD) detection [1, 2, 3, 4, 5, 6, 7, 8]. Although these methods can be applied to each instance in a window, they often fail to capture population statistics within the window, making them inadequate for detecting population-based changes. Additionally, many of these methods are computationally expensive and cannot be applied efficiently to large windows. For example, methods such as those described in [24, 25] extract values from each layer in the network, while others such as [1] require gradient calculations. We note that applying the best single-instance methods, such as [24, 25, 1], in our (large-scale) empirical setting is computationally challenging and precludes an empirical comparison to our method. Therefore, we consider (in Section 6.4) the detection performance of two computationally efficient single-instance baselines: Softmax-Response (abbreviated as Single-instance SR or Single-SR) and Entropy-based (abbreviated as Single-instance Ent or Single-Ent), as described in [26, 27, 2]. Specifically, we apply each single-instance OOD detector to every sample in the window and in the detection-training set. We then use a two-sample t-test to determine the p-value between the uncertainty estimates of the two samples.
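A minimal sketch of this single-instance baseline protocol (our illustration; the use of Welch's variant of the t-test is our choice):

```python
import numpy as np
from scipy.stats import ttest_ind

def softmax_response(probs: np.ndarray) -> np.ndarray:
    """Per-instance confidence: the maximum softmax probability (Single-SR)."""
    return probs.max(axis=1)

def entropy_score(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-instance predictive entropy (Single-Ent); higher = more uncertain."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def single_instance_pvalue(source_probs: np.ndarray, window_probs: np.ndarray,
                           score=softmax_response) -> float:
    """Two-sample t-test between per-instance scores on S_m and W_k."""
    return ttest_ind(score(source_probs), score(window_probs),
                     equal_var=False).pvalue
```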
Finally, we mention [12], who developed a risk generalization bound for selective classifiers [11]. The bound presented in that paper is analogous to the coverage generalization bound we present in Theorem 4.2. The risk bound in [12] can also be used for shift detection; to apply it to this task, however, labels, which are not available in our setting, are required. Our method (Section 4) detects distribution shifts without using any labels.
4 Proposed Method – Coverage-Based Detection
In this section, we present a novel technique for detecting a distribution shift based on selective prediction principles
(definitions follow). We develop a tight generalization coverage bound that holds with high probability for ID data,
sampled from the source distribution. The main idea is that violations of this coverage bound indicate a distribution
shift with high probability.
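As a preview of what follows, here is a minimal sketch of this detection rule, assuming a selection threshold `theta` and a guaranteed coverage lower bound `c_low` have already been computed from the detection-training sample (both names are ours; the actual procedure, Algorithm 2, aggregates several such bounds):

```python
import numpy as np

def empirical_coverage(confidences: np.ndarray, theta: float) -> float:
    """Fraction of window instances the selective classifier accepts (g(x) = 1)."""
    return float(np.mean(confidences >= theta))

def coverage_shift_alarm(window_confidences: np.ndarray,
                         theta: float, c_low: float) -> bool:
    """Fire an alarm when the window's empirical coverage violates a bound
    that holds with high probability under the source distribution P."""
    return empirical_coverage(window_confidences, theta) < c_low
```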
4.1 Selection with Guaranteed Coverage
We begin by introducing basic selective prediction terminology and definitions that are required to describe our method.
Consider a standard multiclass classification problem, where $\mathcal{X}$ is some feature space (e.g., raw image data) and $\mathcal{Y}$ is a finite label set, $\mathcal{Y} = \{1, 2, 3, \ldots, C\}$, representing $C$ classes. Let $P(X, Y)$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$, and define a classifier as a function $f: \mathcal{X} \to \mathcal{Y}$. We refer to $P$ as the source distribution. A selective classifier [11] is a pair $(f, g)$, where $f$ is a classifier and $g: \mathcal{X} \to \{0, 1\}$ is a selection function [28], which serves as a binary qualifier for $f$ as follows:
$$(f, g)(x) \triangleq \begin{cases} f(x), & \text{if } g(x) = 1; \\ \text{don't know}, & \text{if } g(x) = 0. \end{cases}$$
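In code, a selective classifier of this form might look as follows, with $g$ thresholding a confidence score such as the maximum softmax probability (an illustrative choice on our part; the threshold selection is developed in the remainder of this section):

```python
import numpy as np

def selective_classify(probs: np.ndarray, theta: float) -> np.ndarray:
    """Selective classifier (f, g): predict argmax where confidence >= theta,
    abstain ("don't know", encoded here as -1) otherwise."""
    preds = probs.argmax(axis=1)           # f(x)
    accept = probs.max(axis=1) >= theta    # g(x) = 1 iff confidence clears theta
    return np.where(accept, preds, -1)
```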