DIAGNOSE: Avoiding Out-of-distribution Data using Submodular Information Measures
Suraj Kothawade1, Akshit Srivastava2, Venkat Iyer2, Ganesh Ramakrishnan2, and Rishabh Iyer1
1 University of Texas at Dallas, USA
2 Indian Institute of Technology, Bombay, India
suraj.kothawade@utdallas.edu
Abstract. Avoiding out-of-distribution (OOD) data is critical for training supervised machine learning models in the medical imaging domain. Furthermore, obtaining labeled medical data is difficult and expensive since it requires expert annotators like doctors, radiologists, etc. Active learning (AL) is a well-known method to mitigate labeling costs by selecting the most diverse or uncertain samples. However, current AL methods do not work well in the medical imaging domain with OOD data. We propose Diagnose (avoiDing out-of-dIstribution dAta usinG submodular iNfOrmation meaSurEs), an active learning framework that can jointly model similarity and dissimilarity, which is crucial in mining in-distribution data and avoiding OOD data at the same time. Particularly, we use a small number of data points as exemplars that represent a query set of in-distribution data points and another set of exemplars that represent a private set of OOD data points. We illustrate the generalizability of our framework by evaluating it on a wide variety of real-world OOD scenarios. Our experiments verify the superiority of Diagnose over the state-of-the-art AL methods across multiple domains of medical imaging.
1 Introduction
Deep learning based models are widely used for medical image computing. However, it is critical to mitigate incorrect predictions to avoid a catastrophe when these models are deployed at a health-care facility. It is known that deep models are data hungry, which leads to two problems before we can train a high quality model. Firstly, procuring medical data is difficult due to limited availability and privacy constraints. Secondly, acquiring the right labeled data to train a supervised model that has minimum dissimilarity with the test (deployment) distribution can be challenging [22]. This difficulty arises mainly because the unlabeled dataset contains out-of-distribution (OOD) data caused by changes in data collection procedures, treatment protocols, demographics of the target population, etc. [5]. In this paper, we study active learning (AL) strategies in order to mitigate both these problems.
Current AL techniques are designed to acquire data points that are either the
most uncertain, or the most diverse, or a mix of both. Unfortunately, this makes
the current techniques susceptible to picking OOD data points, which gives rise to two more problems: 1) Wastage of expensive labeling resources, since expert annotators need to filter out OOD data points rather than focusing on annotating the in-distribution data points. 2) Drop in model performance, since OOD data points may slip into the labeled set due to human errors. To tackle the above problems, we propose Diagnose, an active learning framework that uses submodular information measures [8] as acquisition functions to model similarity with the in-distribution data points and dissimilarity with the OOD data points.
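To give a concrete picture of where such an acquisition function sits in the training loop, the following is a minimal, hypothetical sketch of one active learning round; `embed`, `acquire`, `annotate`, and `train` are caller-supplied placeholders rather than components released with the paper, and the acquisition step corresponds to maximizing a submodular information measure over the unlabeled pool (formalized in Sec. 2).

```python
# Hypothetical sketch of one DIAGNOSE-style active learning round.
# `embed`, `acquire`, `annotate`, and `train` are caller-supplied placeholders;
# `acquire` stands in for maximizing a submodular information measure that
# rewards similarity to ID exemplars and penalizes similarity to OOD exemplars.

def al_round(embed, acquire, annotate, train,
             labeled, unlabeled, id_exemplars, ood_exemplars, budget):
    """Pick `budget` unlabeled points similar to the in-distribution exemplars
    (query set) and dissimilar to the OOD exemplars (private set), label them,
    and retrain the model."""
    feats_unlabeled = embed(unlabeled)       # e.g., penultimate-layer features
    feats_query = embed(id_exemplars)        # exemplars of in-distribution data
    feats_private = embed(ood_exemplars)     # exemplars of OOD data to avoid

    picked = acquire(feats_unlabeled, feats_query, feats_private, budget)
    labeled = labeled + annotate([unlabeled[i] for i in picked])
    unlabeled = [x for i, x in enumerate(unlabeled) if i not in set(picked)]
    return train(labeled), labeled, unlabeled
```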
1.1 Problem Statement: OOD Scenarios in Medical Data
Fig. 1: The out-of-distribution (OOD) images in three scenarios are contrasted with the in-distribution (ID) images. A: Inputs that are unrelated to the task (e.g., airplane and dog images vs. melanoma). B: Inputs which are incorrectly acquired (e.g., blurred, overexposed, underexposed, or incorrectly cropped images vs. correctly acquired CT). C: Inputs that belong to a different view of the anatomy (e.g., coronal and sagittal views vs. the axial view). Note that these scenarios become increasingly difficult as we go from A to C since the semantic similarity between OOD and ID increases.
We consider a diverse set of four OOD scenarios with increasing levels of difficulty. We present three scenarios in Fig. 1 and discuss an additional scenario in Appendix D.1. We present the details for each scenario in the context of image classification below:
Scenario A - Unrelated Images: Avoid images that are completely unassociated with the task. For instance, real-world images mixed with skin lesion images (first column in Fig. 1).
Scenario B - Incorrectly Acquired: Avoid images that are either captured incorrectly or post-processed incorrectly. For instance, incorrectly cropped/positioned images, blurred images, or images captured using a different procedure (second column in Fig. 1). OOD images of this type are harder to filter than those in scenario A since there may be some overlap with the semantics of the in-distribution images.
Scenario C - Mixed View: Avoid images captured with a different view of the anatomy than the deployment scenario. For example, images from a coronal or sagittal view are OOD when the deployment is on axial view images (third column in Fig. 1). Note that this scenario is more challenging than scenario B since only the viewpoint of the same organ determines whether an image is ID or OOD.
1.2 Related work
Uncertainty based Active Learning. Uncertainty based methods aim to select the most uncertain data points according to a model for labeling. The most common techniques are: 1) Entropy [24] selects data points with maximum entropy, and 2) Margin [21] selects data points such that the difference between the top two predictions is minimum.
Diversity based Active Learning. The main drawback of uncertainty based methods is that they lack diversity within the acquired subset. To mitigate this, a number of approaches have been proposed to incorporate diversity. The Coreset method [23] minimizes a coreset loss to form coresets that represent the geometric structure of the original dataset, using a greedy k-center clustering. A recent approach called Badge [2] uses the last linear layer gradients to represent data points and runs K-means++ [1] to obtain centers, each having a high gradient magnitude. Having representative centers with high gradient magnitude ensures uncertainty and diversity at the same time. However, for batch AL, Badge models diversity and uncertainty only within the batch and not across all batches. Another method, BatchBald [12], requires a large number of Monte Carlo dropout samples to obtain reliable mutual information estimates, which limits its application to medical domains where data is scarce.
Active Learning for OOD data. To the best of our knowledge, only a small minority of AL methods tackle OOD data. Our work is closest to and inspired by Similar [14], which uses the Scmi functions (see Sec. 2) for simulated OOD scenarios on toy datasets with thumbnail images (CIFAR-10 [16]) and black-and-white digit images (MNIST [17]). In contrast, Diagnose tackles a wide range of real-world OOD scenarios in the medical imaging domain. Another related AL baseline is Glister-Active [11], whose acquisition formulation maximizes the log-likelihood on a held-out validation set.
1.3 Our contributions
We summarize our contributions as follows: 1) We highlight four diverse OOD data scenarios in the context of medical image classification (see Fig. 1). 2) Given the limitations of current AL methods on medical datasets, we propose Diagnose, a novel AL framework that can jointly model similarity with the in-distribution (ID) data points and dissimilarity with the OOD data points. We observe that the submodular conditional mutual information functions, which jointly model similarity and dissimilarity, acquire the largest number of ID data points (see Figs. 3, 4). 3) We demonstrate the effectiveness of our framework for multiple modalities, namely dermatoscopy, abdominal CT, and histopathology. Furthermore, we show that Diagnose consistently outperforms the state-of-the-art AL methods on all OOD scenarios. 4) Through rigorous ablation studies, we compare the effects of maximizing mutual information and conditional gain functions.
2 Preliminaries
Submodular Functions: We let $\mathcal{V}$ denote the ground set of $n$ data points, $\mathcal{V} = \{1, 2, 3, \ldots, n\}$, and a set function $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}$. The function $f$ is submodular [6] if it satisfies the diminishing marginal returns property, namely $f(j \mid \mathcal{A}) \geq f(j \mid \mathcal{B})$ for all $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$, $j \notin \mathcal{B}$. Different submodular functions model different properties. For example, facility location, $f(\mathcal{A}) = \sum_{i \in \mathcal{V}} \max_{j \in \mathcal{A}} S_{ij}$, selects a representative subset, and log determinant, $f(\mathcal{A}) = \log\det(S_{\mathcal{A}})$, selects a diverse subset [9], where $S$ is a matrix containing pairwise similarity values $S_{ij}$ and $S_{\mathcal{A}}$ is its restriction to the rows and columns indexed by $\mathcal{A}$.
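To make these definitions concrete, the following is a minimal NumPy sketch (not part of the paper's implementation) that evaluates the facility location and log determinant functions on a toy similarity matrix and checks the diminishing-returns property for one choice of sets; the RBF kernel used to build $S$ is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                        # 8 data points, 4-dim features
# Pairwise similarity matrix S (RBF kernel here, an illustrative choice).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = np.exp(-sq_dists)

def facility_location(A, S):
    """f(A) = sum_i max_{j in A} S_ij; selects a representative subset."""
    if not A:
        return 0.0
    return S[:, list(A)].max(axis=1).sum()

def log_det(A, S, eps=1e-6):
    """f(A) = log det(S_A); selects a diverse subset."""
    if not A:
        return 0.0
    idx = list(A)
    S_A = S[np.ix_(idx, idx)] + eps * np.eye(len(idx))  # jitter for numerical stability
    return np.linalg.slogdet(S_A)[1]

def gain(f, j, A, S):
    """Marginal gain f(j | A) = f(A U {j}) - f(A)."""
    return f(set(A) | {j}, S) - f(set(A), S)

# Diminishing marginal returns: for A subset of B and j not in B, f(j|A) >= f(j|B).
A, B, j = {0, 1}, {0, 1, 2, 3}, 5
for f in (facility_location, log_det):
    assert gain(f, j, A, S) >= gain(f, j, B, S) - 1e-9
    print(f.__name__, gain(f, j, A, S), gain(f, j, B, S))
```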
Table 1: Instantiations of Submodular Information Measures (SIM).
(a) SMI and SCG functions.
SMI $I_f(\mathcal{A};\mathcal{Q})$:
FLMI: $\sum_{i \in \mathcal{U}} \min(\max_{j \in \mathcal{A}} S_{ij}, \max_{j \in \mathcal{Q}} S_{ij})$
LogDetMI: $\log\det(S_{\mathcal{A}}) - \log\det(S_{\mathcal{A}} - S_{\mathcal{A},\mathcal{Q}} S_{\mathcal{Q}}^{-1} S_{\mathcal{A},\mathcal{Q}}^{T})$
SCG $f(\mathcal{A} \mid \mathcal{P})$:
FLCG: $\sum_{i \in \mathcal{U}} \max(\max_{j \in \mathcal{A}} S_{ij} - \max_{j \in \mathcal{P}} S_{ij}, 0)$
LogDetCG: $\log\det(S_{\mathcal{A}} - S_{\mathcal{A},\mathcal{P}} S_{\mathcal{P}}^{-1} S_{\mathcal{A},\mathcal{P}}^{T})$
(b) SCMI functions.
SCMI $I_f(\mathcal{A};\mathcal{Q} \mid \mathcal{P})$:
FLCMI: $\sum_{i \in \mathcal{U}} \max(\min(\max_{j \in \mathcal{A}} S_{ij}, \max_{j \in \mathcal{Q}} S_{ij}) - \max_{j \in \mathcal{P}} S_{ij}, 0)$
LogDetCMI: $\log \dfrac{\det(I - S_{\mathcal{P}}^{-1} S_{\mathcal{P},\mathcal{Q}} S_{\mathcal{Q}}^{-1} S_{\mathcal{P},\mathcal{Q}}^{T})}{\det(I - S_{\mathcal{A}\cup\mathcal{P}}^{-1} S_{\mathcal{A}\cup\mathcal{P},\mathcal{Q}} S_{\mathcal{Q}}^{-1} S_{\mathcal{A}\cup\mathcal{P},\mathcal{Q}}^{T})}$
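As an illustration of how the closed forms in Tab. 1(a) translate into code, the sketch below implements the FLMI and FLCG set functions with plain NumPy; the random similarity matrices and variable names are placeholders standing in for kernels computed from model features, so this is an illustrative sketch rather than an official implementation (optimized versions exist in submodular optimization libraries).

```python
import numpy as np

def flmi(A, S_UA, S_UQ):
    """FLMI: I_f(A; Q) = sum_{i in U} min(max_{j in A} S_ij, max_{j in Q} S_ij).
    S_UA: |U| x |candidates| similarities; S_UQ: |U| x |Q| similarities to query exemplars."""
    if not A:
        return 0.0
    rep_A = S_UA[:, list(A)].max(axis=1)   # how well A covers each unlabeled point
    rep_Q = S_UQ.max(axis=1)               # how well Q covers each unlabeled point
    return np.minimum(rep_A, rep_Q).sum()

def flcg(A, S_UA, S_UP):
    """FLCG: f(A | P) = sum_{i in U} max(max_{j in A} S_ij - max_{j in P} S_ij, 0)."""
    if not A:
        return 0.0
    rep_A = S_UA[:, list(A)].max(axis=1)
    rep_P = S_UP.max(axis=1)               # coverage already provided by the private (OOD) set
    return np.maximum(rep_A - rep_P, 0.0).sum()

# Toy usage with random similarity matrices (placeholders for kernel values in [0, 1]).
rng = np.random.default_rng(1)
n_unlabeled, n_query, n_private = 20, 5, 5
S_UA = rng.uniform(size=(n_unlabeled, n_unlabeled))  # candidates come from the unlabeled pool
S_UQ = rng.uniform(size=(n_unlabeled, n_query))
S_UP = rng.uniform(size=(n_unlabeled, n_private))
A = {2, 7, 11}
print("FLMI:", flmi(A, S_UA, S_UQ), "FLCG:", flcg(A, S_UA, S_UP))
```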
Submodular Information Measures (SIM): Given sets of items $\mathcal{A}, \mathcal{Q}, \mathcal{P} \subseteq \mathcal{V}$, the submodular conditional mutual information (Scmi) [8] is defined as $I_f(\mathcal{A}; \mathcal{Q} \mid \mathcal{P}) = f(\mathcal{A} \cup \mathcal{P}) + f(\mathcal{Q} \cup \mathcal{P}) - f(\mathcal{A} \cup \mathcal{Q} \cup \mathcal{P}) - f(\mathcal{P})$. Intuitively, this jointly measures the similarity between $\mathcal{Q}$ and $\mathcal{A}$ and the dissimilarity between $\mathcal{P}$ and $\mathcal{A}$. We refer to $\mathcal{Q}$ as the query set and $\mathcal{P}$ as the private or conditioning set. Kothawade et al. [15] extend the SIM to handle the case where $\mathcal{Q}$ and $\mathcal{P}$ come from a different set $\mathcal{V}'$ that is disjoint from the ground set $\mathcal{V}$. In the context of medical image classification in scenarios with OOD data, $\mathcal{V}$ is the source set of images, whereas $\mathcal{Q}$ contains data points from the in-distribution classes that we are interested in selecting, and $\mathcal{P}$ contains OOD data points that we want to avoid. As discussed in [14], we can use the Scmi formulation to obtain the submodular mutual information (Smi) by setting $\mathcal{Q} \leftarrow \mathcal{Q}$ and $\mathcal{P} \leftarrow \emptyset$. The Smi is defined as: $I_f(\mathcal{A}; \mathcal{Q}) = f(\mathcal{A}) + f(\mathcal{Q}) - f(\mathcal{A} \cup \mathcal{Q})$. Similarly, the submodular conditional gain (Scg) formulation can be obtained by setting $\mathcal{Q} \leftarrow \emptyset$ and $\mathcal{P} \leftarrow \mathcal{P}$. The Scg is defined as: $f(\mathcal{A} \mid \mathcal{P}) = f(\mathcal{A} \cup \mathcal{P}) - f(\mathcal{P})$. To find an optimal subset given $\mathcal{Q}, \mathcal{P} \subseteq \mathcal{V}'$, we can define $g_{\mathcal{Q},\mathcal{P}}(\mathcal{A}) = I_f(\mathcal{A}; \mathcal{Q} \mid \mathcal{P})$, $\mathcal{A} \subseteq \mathcal{V}$, and maximize it. In Tab. 1, we present the instantiations of various Smi, Scg, and Scmi functions, with the naming convention abbreviated as 'function name' + 'CMI/MI/CG'. The submodular functions that we use include 'Facility Location' (FL) and 'Log Determinant' (LogDet) [8, 15].
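To sketch how $g_{\mathcal{Q},\mathcal{P}}(\mathcal{A})$ can be maximized in practice, the snippet below runs a naive greedy maximization of the FLCMI instantiation from Tab. 1(b) over a toy unlabeled pool; the random similarity matrices stand in for kernels derived from model embeddings, and the lazy or accelerated greedy variants used in real implementations are omitted, so this should be read as an illustrative sketch rather than the authors' exact procedure.

```python
import numpy as np

def flcmi(A, S_UU, S_UQ, S_UP):
    """FLCMI: I_f(A; Q | P) =
    sum_{i in U} max(min(max_{j in A} S_ij, max_{j in Q} S_ij) - max_{j in P} S_ij, 0)."""
    if not A:
        return 0.0
    rep_A = S_UU[:, sorted(A)].max(axis=1)
    rep_Q = S_UQ.max(axis=1)
    rep_P = S_UP.max(axis=1)
    return np.maximum(np.minimum(rep_A, rep_Q) - rep_P, 0.0).sum()

def greedy_select(S_UU, S_UQ, S_UP, budget):
    """Naive greedy maximization of g_{Q,P}(A) = I_f(A; Q | P) over the unlabeled pool U."""
    selected, current = set(), 0.0
    for _ in range(budget):
        gains = {}
        for j in range(S_UU.shape[0]):
            if j in selected:
                continue
            gains[j] = flcmi(selected | {j}, S_UU, S_UQ, S_UP) - current
        best = max(gains, key=gains.get)   # pick the element with the largest marginal gain
        selected.add(best)
        current += gains[best]
    return selected

# Toy usage: U is the unlabeled pool, Q the ID exemplars, P the OOD exemplars.
rng = np.random.default_rng(2)
S_UU = rng.uniform(size=(30, 30))   # pairwise similarities within the unlabeled pool
S_UQ = rng.uniform(size=(30, 4))    # similarities to query (in-distribution) exemplars
S_UP = rng.uniform(size=(30, 4))    # similarities to private (OOD) exemplars
batch = greedy_select(S_UU, S_UQ, S_UP, budget=5)
print("Selected batch:", sorted(batch))
```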