DIAGNOSE: Avoiding Out-of-distribution Data using Submodular Information Measures
Suraj Kothawade1, Akshit Srivastava2, Venkat Iyer2, Ganesh Ramakrishnan2, and Rishabh Iyer1
1 University of Texas at Dallas, USA
2 Indian Institute of Technology, Bombay, India
suraj.kothawade@utdallas.edu
Abstract. Avoiding out-of-distribution (OOD) data is critical for training supervised machine learning models in the medical imaging domain. Furthermore, obtaining labeled medical data is difficult and expensive since it requires expert annotators like doctors, radiologists, etc. Active learning (AL) is a well-known method to mitigate labeling costs by selecting the most diverse or uncertain samples. However, current AL methods do not work well in the medical imaging domain with OOD data. We propose Diagnose (avoiDing out-of-dIstribution dAta usinG submodular iNfOrmation meaSurEs), an active learning framework that can jointly model similarity and dissimilarity, which is crucial in mining in-distribution data and avoiding OOD data at the same time. Particularly, we use a small number of data points as exemplars that represent a query set of in-distribution data points and another set of exemplars that represent a private set of OOD data points. We illustrate the generalizability of our framework by evaluating it on a wide variety of real-world OOD scenarios. Our experiments verify the superiority of Diagnose over the state-of-the-art AL methods across multiple domains of medical imaging.
1 Introduction
Deep learning based models are widely used for medical image computing. However, it is critical to mitigate incorrect predictions to avoid a catastrophe when these models are deployed at a health-care facility. It is known that deep models are data hungry, which leads to two problems before we can train a high quality model. Firstly, procuring medical data is difficult due to limited availability and privacy constraints. Secondly, acquiring the right labeled data to train a supervised model that has minimum dissimilarity with the test (deployment) distribution can be challenging [22]. This difficulty arises mainly because the unlabeled dataset contains out-of-distribution (OOD) data caused by changes in data collection procedures, treatment protocols, demographics of the target population, etc. [5]. In this paper, we study active learning (AL) strategies in order to mitigate both these problems.
Current AL techniques are designed to acquire data points that are either the
most uncertain, or the most diverse, or a mix of both. Unfortunately, this makes
the current techniques susceptible to picking OOD data points, which gives rise to two more problems: 1) Wastage of expensive labeling resources, since expert annotators need to filter out OOD data points rather than focusing on annotating the in-distribution data points. 2) Drop in model performance, since OOD data points may slip into the labeled set due to human errors. To tackle the above problems, we propose Diagnose, an active learning framework that uses submodular information measures [8] as acquisition functions to model similarity with the in-distribution data points and dissimilarity with the OOD data points.
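To give a concrete picture of where such an acquisition function sits in the training loop, the following is a minimal, hypothetical sketch of one active learning round; `embed`, `acquire`, `annotate`, and `train` are caller-supplied placeholders rather than components released with the paper, and the acquisition step corresponds to maximizing a submodular information measure over the unlabeled pool (formalized in Sec. 2).

```python
# Hypothetical sketch of one DIAGNOSE-style active learning round.
# `embed`, `acquire`, `annotate`, and `train` are caller-supplied placeholders;
# `acquire` stands in for maximizing a submodular information measure that
# rewards similarity to ID exemplars and penalizes similarity to OOD exemplars.

def al_round(embed, acquire, annotate, train,
             labeled, unlabeled, id_exemplars, ood_exemplars, budget):
    """Pick `budget` unlabeled points similar to the in-distribution exemplars
    (query set) and dissimilar to the OOD exemplars (private set), label them,
    and retrain the model."""
    feats_unlabeled = embed(unlabeled)       # e.g., penultimate-layer features
    feats_query = embed(id_exemplars)        # exemplars of in-distribution data
    feats_private = embed(ood_exemplars)     # exemplars of OOD data to avoid

    picked = acquire(feats_unlabeled, feats_query, feats_private, budget)
    labeled = labeled + annotate([unlabeled[i] for i in picked])
    unlabeled = [x for i, x in enumerate(unlabeled) if i not in set(picked)]
    return train(labeled), labeled, unlabeled
```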
1.1 Problem Statement: OOD Scenarios in Medical Data
Fig. 1: The out-of-distribution (OOD) images in three scenarios are contrasted with the in-distribution (ID) images. A: Inputs that are unrelated to the task (e.g., airplane and dog images vs. melanoma). B: Inputs which are incorrectly acquired (e.g., blurred, overexposed, underexposed, or incorrectly cropped images vs. correctly acquired CT). C: Inputs that belong to a different view of the anatomy (e.g., coronal and sagittal views vs. the axial view). Note that these scenarios become increasingly difficult as we go from A to C since the semantic similarity between OOD and ID increases.
We consider a diverse set of four OOD scenarios with increasing levels of difficulty. We present three scenarios in Fig. 1 and discuss an additional scenario in Appendix D.1. We present the details for each scenario in the context of image classification below:
Scenario A - Unrelated Images: Avoid images that are completely unassociated with the task. For instance, real-world images mixed with skin lesion images (first column in Fig. 1).
Scenario B - Incorrectly Acquired: Avoid images that are either captured incorrectly or post-processed incorrectly. For instance, incorrectly cropped/positioned images, blurred images, or images captured using a different procedure (second column in Fig. 1). OOD images of this type are harder to filter than those in scenario A since there may be some overlap with the semantics of the in-distribution images.
Scenario C - Mixed View: Avoid images captured with a different view of the anatomy than the deployment scenario. For example, images from a coronal or sagittal view are OOD when the deployment is on axial view images (third column in Fig. 1). Note that this scenario is more challenging than scenario B since only the viewpoint of the same organ determines whether an image is ID or OOD.
1.2 Related work
Uncertainty based Active Learning. Uncertainty based methods aim to select the most uncertain data points according to a model for labeling. The most common techniques are: 1) Entropy [24] selects data points with maximum entropy, and 2) Margin [21] selects data points such that the difference between the top two predictions is minimum.
Diversity based Active Learning. The main drawback of uncertainty based methods is that they lack diversity within the acquired subset. To mitigate this, a number of approaches have been proposed to incorporate diversity. The Coreset method [23] minimizes a coreset loss to form coresets that represent the geometric structure of the original dataset, using a greedy k-center clustering. A recent approach called Badge [2] uses the last linear layer gradients to represent data points and runs K-means++ [1] to obtain centers, each having a high gradient magnitude. Having representative centers with high gradient magnitude ensures uncertainty and diversity at the same time. However, for batch AL, Badge models diversity and uncertainty only within the batch and not across all batches. Another method, BatchBald [12], requires a large number of Monte Carlo dropout samples to obtain reliable mutual information estimates, which limits its application to medical domains where data is scarce.
Active Learning for OOD data. To the best of our knowledge, only a small minority of AL methods tackle OOD data. Our work is closest to and inspired by Similar [14], which uses the Scmi functions (see Sec. 2) for simulated OOD scenarios on toy datasets with thumbnail images (CIFAR-10 [16]) and black-and-white digit images (MNIST [17]). In contrast, Diagnose tackles a wide range of real-world OOD scenarios in the medical imaging domain. Another related AL baseline is Glister-Active [11], whose acquisition formulation maximizes the log-likelihood on a held-out validation set.
1.3 Our contributions
We summarize our contributions as follows: 1) We highlight four diverse OOD data scenarios in the context of medical image classification (see Fig. 1). 2) Given the limitations of current AL methods on medical datasets, we propose Diagnose, a novel AL framework that can jointly model similarity with the in-distribution (ID) data points and dissimilarity with the OOD data points. We observe that the submodular conditional mutual information functions, which jointly model similarity and dissimilarity, acquire the largest number of ID data points (see Figs. 3, 4). 3) We demonstrate the effectiveness of our framework for multiple modalities, namely dermatoscopy, abdominal CT, and histopathology. Furthermore, we show that Diagnose consistently outperforms the state-of-the-art AL methods on all OOD scenarios. 4) Through rigorous ablation studies, we compare the effects of maximizing mutual information and conditional gain functions.
2 Preliminaries
Submodular Functions: We let $\mathcal{V}$ denote the ground set of $n$ data points, $\mathcal{V} = \{1, 2, 3, \ldots, n\}$, and a set function $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}$. The function $f$ is submodular [6] if it satisfies the diminishing marginal returns property, namely $f(j \mid \mathcal{A}) \geq f(j \mid \mathcal{B})$ for all $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$, $j \notin \mathcal{B}$. Different submodular functions model different properties. For example, facility location, $f(\mathcal{A}) = \sum_{i \in \mathcal{V}} \max_{j \in \mathcal{A}} S_{ij}$, selects a representative subset, and log determinant, $f(\mathcal{A}) = \log\det(S_{\mathcal{A}})$, selects a diverse subset [9], where $S$ is a matrix containing pairwise similarity values $S_{ij}$ and $S_{\mathcal{A}}$ is its restriction to the rows and columns indexed by $\mathcal{A}$.
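To make these definitions concrete, the following is a minimal NumPy sketch (not part of the paper's implementation) that evaluates the facility location and log determinant functions on a toy similarity matrix and checks the diminishing-returns property for one choice of sets; the RBF kernel used to build $S$ is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                        # 8 data points, 4-dim features
# Pairwise similarity matrix S (RBF kernel here, an illustrative choice).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
S = np.exp(-sq_dists)

def facility_location(A, S):
    """f(A) = sum_i max_{j in A} S_ij; selects a representative subset."""
    if not A:
        return 0.0
    return S[:, list(A)].max(axis=1).sum()

def log_det(A, S, eps=1e-6):
    """f(A) = log det(S_A); selects a diverse subset."""
    if not A:
        return 0.0
    idx = list(A)
    S_A = S[np.ix_(idx, idx)] + eps * np.eye(len(idx))  # jitter for numerical stability
    return np.linalg.slogdet(S_A)[1]

def gain(f, j, A, S):
    """Marginal gain f(j | A) = f(A U {j}) - f(A)."""
    return f(set(A) | {j}, S) - f(set(A), S)

# Diminishing marginal returns: for A subset of B and j not in B, f(j|A) >= f(j|B).
A, B, j = {0, 1}, {0, 1, 2, 3}, 5
for f in (facility_location, log_det):
    assert gain(f, j, A, S) >= gain(f, j, B, S) - 1e-9
    print(f.__name__, gain(f, j, A, S), gain(f, j, B, S))
```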
Table 1: Instantiations of Submodular Information Measures (SIM).
(a) SMI and SCG functions.
SMI $I_f(\mathcal{A};\mathcal{Q})$:
FLMI: $\sum_{i \in \mathcal{U}} \min(\max_{j \in \mathcal{A}} S_{ij}, \max_{j \in \mathcal{Q}} S_{ij})$
LogDetMI: $\log\det(S_{\mathcal{A}}) - \log\det(S_{\mathcal{A}} - S_{\mathcal{A},\mathcal{Q}} S_{\mathcal{Q}}^{-1} S_{\mathcal{A},\mathcal{Q}}^{T})$
SCG $f(\mathcal{A} \mid \mathcal{P})$:
FLCG: $\sum_{i \in \mathcal{U}} \max(\max_{j \in \mathcal{A}} S_{ij} - \max_{j \in \mathcal{P}} S_{ij}, 0)$
LogDetCG: $\log\det(S_{\mathcal{A}} - S_{\mathcal{A},\mathcal{P}} S_{\mathcal{P}}^{-1} S_{\mathcal{A},\mathcal{P}}^{T})$
(b) SCMI functions.
SCMI $I_f(\mathcal{A};\mathcal{Q} \mid \mathcal{P})$:
FLCMI: $\sum_{i \in \mathcal{U}} \max(\min(\max_{j \in \mathcal{A}} S_{ij}, \max_{j \in \mathcal{Q}} S_{ij}) - \max_{j \in \mathcal{P}} S_{ij}, 0)$
LogDetCMI: $\log \dfrac{\det(I - S_{\mathcal{P}}^{-1} S_{\mathcal{P},\mathcal{Q}} S_{\mathcal{Q}}^{-1} S_{\mathcal{P},\mathcal{Q}}^{T})}{\det(I - S_{\mathcal{A}\cup\mathcal{P}}^{-1} S_{\mathcal{A}\cup\mathcal{P},\mathcal{Q}} S_{\mathcal{Q}}^{-1} S_{\mathcal{A}\cup\mathcal{P},\mathcal{Q}}^{T})}$
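As an illustration of how the closed forms in Tab. 1(a) translate into code, the sketch below implements the FLMI and FLCG set functions with plain NumPy; the random similarity matrices and variable names are placeholders standing in for kernels computed from model features, so this is an illustrative sketch rather than an official implementation (optimized versions exist in submodular optimization libraries).

```python
import numpy as np

def flmi(A, S_UA, S_UQ):
    """FLMI: I_f(A; Q) = sum_{i in U} min(max_{j in A} S_ij, max_{j in Q} S_ij).
    S_UA: |U| x |candidates| similarities; S_UQ: |U| x |Q| similarities to query exemplars."""
    if not A:
        return 0.0
    rep_A = S_UA[:, list(A)].max(axis=1)   # how well A covers each unlabeled point
    rep_Q = S_UQ.max(axis=1)               # how well Q covers each unlabeled point
    return np.minimum(rep_A, rep_Q).sum()

def flcg(A, S_UA, S_UP):
    """FLCG: f(A | P) = sum_{i in U} max(max_{j in A} S_ij - max_{j in P} S_ij, 0)."""
    if not A:
        return 0.0
    rep_A = S_UA[:, list(A)].max(axis=1)
    rep_P = S_UP.max(axis=1)               # coverage already provided by the private (OOD) set
    return np.maximum(rep_A - rep_P, 0.0).sum()

# Toy usage with random similarity matrices (placeholders for kernel values in [0, 1]).
rng = np.random.default_rng(1)
n_unlabeled, n_query, n_private = 20, 5, 5
S_UA = rng.uniform(size=(n_unlabeled, n_unlabeled))  # candidates come from the unlabeled pool
S_UQ = rng.uniform(size=(n_unlabeled, n_query))
S_UP = rng.uniform(size=(n_unlabeled, n_private))
A = {2, 7, 11}
print("FLMI:", flmi(A, S_UA, S_UQ), "FLCG:", flcg(A, S_UA, S_UP))
```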
Submodular Information Measures (SIM): Given sets of items $\mathcal{A}, \mathcal{Q}, \mathcal{P} \subseteq \mathcal{V}$, the submodular conditional mutual information (Scmi) [8] is defined as $I_f(\mathcal{A}; \mathcal{Q} \mid \mathcal{P}) = f(\mathcal{A} \cup \mathcal{P}) + f(\mathcal{Q} \cup \mathcal{P}) - f(\mathcal{A} \cup \mathcal{Q} \cup \mathcal{P}) - f(\mathcal{P})$. Intuitively, this jointly measures the similarity between $\mathcal{Q}$ and $\mathcal{A}$ and the dissimilarity between $\mathcal{P}$ and $\mathcal{A}$. We refer to $\mathcal{Q}$ as the query set and $\mathcal{P}$ as the private or conditioning set. Kothawade et al. [15] extend the SIM to handle the case where $\mathcal{Q}$ and $\mathcal{P}$ come from a different set $\mathcal{V}'$ that is disjoint from the ground set $\mathcal{V}$. In the context of medical image classification in scenarios with OOD data, $\mathcal{V}$ is the source set of images, whereas $\mathcal{Q}$ contains data points from the in-distribution classes that we are interested in selecting, and $\mathcal{P}$ contains OOD data points that we want to avoid. As discussed in [14], we can use the Scmi formulation to obtain the submodular mutual information (Smi) by setting $\mathcal{Q} \leftarrow \mathcal{Q}$ and $\mathcal{P} \leftarrow \emptyset$. The Smi is defined as: $I_f(\mathcal{A}; \mathcal{Q}) = f(\mathcal{A}) + f(\mathcal{Q}) - f(\mathcal{A} \cup \mathcal{Q})$. Similarly, the submodular conditional gain (Scg) formulation can be obtained by setting $\mathcal{Q} \leftarrow \emptyset$ and $\mathcal{P} \leftarrow \mathcal{P}$. The Scg is defined as: $f(\mathcal{A} \mid \mathcal{P}) = f(\mathcal{A} \cup \mathcal{P}) - f(\mathcal{P})$. To find an optimal subset given $\mathcal{Q}, \mathcal{P} \subseteq \mathcal{V}'$, we can define $g_{\mathcal{Q},\mathcal{P}}(\mathcal{A}) = I_f(\mathcal{A}; \mathcal{Q} \mid \mathcal{P})$, $\mathcal{A} \subseteq \mathcal{V}$, and maximize it. In Tab. 1, we present the instantiations of various Smi, Scg, and Scmi functions, with the naming convention abbreviated as 'function name' + 'CMI/MI/CG'. The submodular functions that we use include 'Facility Location' (FL) and 'Log Determinant' (LogDet) [8, 15].
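To sketch how $g_{\mathcal{Q},\mathcal{P}}(\mathcal{A})$ can be maximized in practice, the snippet below runs a naive greedy maximization of the FLCMI instantiation from Tab. 1(b) over a toy unlabeled pool; the random similarity matrices stand in for kernels derived from model embeddings, and the lazy or accelerated greedy variants used in real implementations are omitted, so this should be read as an illustrative sketch rather than the authors' exact procedure.

```python
import numpy as np

def flcmi(A, S_UU, S_UQ, S_UP):
    """FLCMI: I_f(A; Q | P) =
    sum_{i in U} max(min(max_{j in A} S_ij, max_{j in Q} S_ij) - max_{j in P} S_ij, 0)."""
    if not A:
        return 0.0
    rep_A = S_UU[:, sorted(A)].max(axis=1)
    rep_Q = S_UQ.max(axis=1)
    rep_P = S_UP.max(axis=1)
    return np.maximum(np.minimum(rep_A, rep_Q) - rep_P, 0.0).sum()

def greedy_select(S_UU, S_UQ, S_UP, budget):
    """Naive greedy maximization of g_{Q,P}(A) = I_f(A; Q | P) over the unlabeled pool U."""
    selected, current = set(), 0.0
    for _ in range(budget):
        gains = {}
        for j in range(S_UU.shape[0]):
            if j in selected:
                continue
            gains[j] = flcmi(selected | {j}, S_UU, S_UQ, S_UP) - current
        best = max(gains, key=gains.get)   # pick the element with the largest marginal gain
        selected.add(best)
        current += gains[best]
    return selected

# Toy usage: U is the unlabeled pool, Q the ID exemplars, P the OOD exemplars.
rng = np.random.default_rng(2)
S_UU = rng.uniform(size=(30, 30))   # pairwise similarities within the unlabeled pool
S_UQ = rng.uniform(size=(30, 4))    # similarities to query (in-distribution) exemplars
S_UP = rng.uniform(size=(30, 4))    # similarities to private (OOD) exemplars
batch = greedy_select(S_UU, S_UQ, S_UP, budget=5)
print("Selected batch:", sorted(batch))
```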