Greedy Modality Selection via Approximate Submodular Maximization
Runxiang Cheng1, Gargi Balasubramaniam1, Yifei He*1, Yao-Hung Hubert Tsai2, Han Zhao1
1University of Illinois Urbana-Champaign, Illinois, USA
2Carnegie Mellon University, Pennsylvania, USA
Accepted for the 38th Conference on Uncertainty in Artificial Intelligence (UAI 2022). arXiv:2210.12562v1 [cs.LG] 22 Oct 2022.
Abstract
Multimodal learning considers learning from multi-modality data, aiming to fuse heterogeneous sources of information. However, it is not always feasible to leverage all available modalities due to memory constraints. Further, training on all the modalities may be inefficient when redundant information exists within the data, such as different subsets of modalities providing similar performance. In light of these challenges, we study modality selection, intending to efficiently select the most informative and complementary modalities under certain computational constraints. We formulate a theoretical framework for optimizing modality selection in multimodal learning and introduce a utility measure to quantify the benefit of selecting a modality. For this optimization problem, we present efficient algorithms when the utility measure exhibits monotonicity and approximate submodularity. We also connect the utility measure with existing Shapley-value-based feature importance scores. Last, we demonstrate the efficacy of our algorithm on synthetic (Patch-MNIST) and two real-world (PEMS-SF, CMU-MOSI) datasets.
1 INTRODUCTION
Multimodal learning considers learning with data from multiple modalities (e.g., images, text, speech) to improve the generalization of the learned models by using complementary information from different modalities.1 In many real-world applications, multimodal learning has shown superior performance [Bapna et al., 2022, Wu et al., 2021] and has demonstrated a stronger capability than learning from a single modality.

*Equal contribution.
1 We use the terms modality/view interchangeably.

The advantages of multimodal learning have also been studied from a theoretical standpoint.
Prior work showed that learning with more modalities achieves a smaller population risk [Huang et al., 2021], and that utilizing cross-modal information can provably improve prediction in multiview learning [Zhang et al., 2019] or semi-supervised learning [Sun et al., 2020]. With the recent advances in training large-scale neural network models from multiple modalities [Devlin et al., 2018, Brown et al., 2020], one emerging challenge lies in the modality selection problem.

From the modeling perspective, it might be tempting to use all the modalities available. However, it is inefficient or even infeasible to learn from all modalities as the total number of input modalities increases. A modality often consists of high-dimensional data, and model complexity can scale linearly or even exponentially with the number of input modalities [Zadeh et al., 2017, Liu et al., 2018], resulting in large consumption of computational and energy resources. The marginal benefit of a new modality may also decrease as more modalities are included; in some cases, learning from fewer modalities is sufficient to achieve the desired outcome, due to the potential overlap in the information provided by these modalities. Furthermore, proactively selecting the modalities most informative for prediction reduces the cost of collecting the inferior ones. For example, in sensor placement problems where each sensor can be treated as a modality, finding the optimal subset of sensors for a learning objective (e.g., temperature or traffic prediction) eliminates the cost of maintaining extra sensors [Krause et al., 2011].

In light of the aforementioned challenges, in this paper we study the optimization problem of modality selection in multimodal learning: given a set of input modalities and a fixed budget on the number of selected modalities, how do we select a subset that optimizes prediction performance? Note that in general this problem is combinatorial in nature, since one may have to enumerate all potential subsets of modalities in order to find the best one. Hence, without further assumptions on the structure of the underlying prediction problem, it is intractable to solve this modality selection problem exactly and efficiently.
To approach these challenges, we propose a utility function that conveniently quantifies the benefit of any set of modalities towards prediction in most typical learning settings. We then identify a proper assumption that is suitable for multimodal/multiview learning, which allows us to develop efficient approximate algorithms for modality selection. We assume that the input modalities are approximately conditionally independent given the target. Since the strength of conditional independence is parameterized, our results generalize to problems on multimodal data with different levels of conditional independence.

We show that our definition of utility for a modality naturally manifests as the Shannon mutual information between the modality and the prediction target in the setting of binary classification with cross-entropy loss. Under approximate conditional independence, mutual information is monotone and approximately submodular. These properties intrinsically describe the empirical advantages of learning with more modalities, and allow us to formulate modality selection as a submodular optimization problem. In this context, we obtain efficient selection algorithms with provable performance guarantees on the selected subset. For example, we show a performance guarantee of the greedy maximization algorithm from Nemhauser et al. [1978] under approximate submodularity. Further, we connect modality selection to marginal-contribution-based feature importance scores in feature selection. We examine the Shapley value and Marginal Contribution Feature Importance (MCI) [Catav et al., 2021] for ranking modality importance, and show that these scores, although intractable in general, can be computed efficiently under our assumptions in the context of modality selection. Lastly, we evaluate our theoretical results on three classification datasets; the experimental results confirm both the utility and the diversity of the selected modalities.
To summarize, we make the following contributions in this paper:
- We propose a general measure of modality utility, and identify a proper assumption that is suitable for multimodal learning and helpful for developing efficient approximate algorithms for modality selection.
- We demonstrate, theoretically and empirically, an algorithm with a performance guarantee on the selected modalities for prediction in classification problems with cross-entropy loss.
- We establish theoretical connections between modality selection and feature importance scores, i.e., the Shapley value and Marginal Contribution Feature Importance.
2 PRELIMINARIES
In this section, we first describe our notation and problem setup, and then provide a brief introduction to submodular function maximization and feature importance scores.
2.1 NOTATION AND SETUP
We use $X$ and $Y$ to denote the random variables that take values in the input space $\mathcal{X}$ and the output space $\mathcal{Y}$, respectively. The instantiations of $X$ and $Y$ are denoted by $x$ and $y$. We use $\mathcal{H}$ to denote the hypothesis class of predictors from the input to the output space, and $\hat{Y}$ to denote the predicted variable. Let $\mathcal{X}$ be multimodal, i.e., $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_k$, where each $\mathcal{X}_i$ is the input space of the $i$-th modality. We use $X_i$ to denote the random variable that takes values in $\mathcal{X}_i$, and $V$ to denote the full set of all input modalities, i.e., $V = \{X_1, \ldots, X_k\}$. Throughout the paper, we often use $S$ and $S'$ to denote arbitrary subsets of $V$. Lastly, we use $I(\cdot\,;\cdot)$ for the Shannon mutual information, $H(\cdot)$ for entropy, $\ell_{ce}(Y, \hat{Y})$ for the cross-entropy loss $-\big(\mathbb{1}(Y=1)\log \hat{Y} + \mathbb{1}(Y=0)\log(1-\hat{Y})\big)$, and $\ell_{01}(Y, \hat{Y})$ for the zero-one loss $\mathbb{1}(Y \neq \hat{Y})$.
For simplicity of discussion, we primarily focus on the setting of binary classification with cross-entropy loss.2 In this setting, a subset of input modalities $S \subseteq V$ and the output $Y \in \{0,1\}$ are observed. The predictor aims to make a prediction $\hat{Y} \in [0,1]$ that minimizes the cross-entropy loss between $Y$ and $\hat{Y}$. The goal of modality selection is to select the subset of input modalities that best serves this loss-minimization goal under certain constraints. Our results rely on the following assumption.

2 We choose the binary classification setting for ease of exposition; our general proofs and results extend directly to the multi-class setting. We only use the binary case to derive the conditional entropy (supplementary material) and to further showcase Corollary 3.1.
Assumption 2.1 ($\epsilon$-Approximate Conditional Independence). There exists a constant $\epsilon \geq 0$ such that, for all $S, S' \subseteq V$ with $S \cap S' = \emptyset$, we have $I(S; S' \mid Y) \leq \epsilon$.
Note that when $\epsilon = 0$, Assumption 2.1 reduces to strict conditional independence between disjoint modalities given the target variable. In fact, this is a common assumption used in prior work on multimodal learning [White et al., 2012, Wu and Goodman, 2018, Sun et al., 2020]. In practice, however, strict conditional independence is often difficult to satisfy. We therefore use the more general assumption above, in which the input modalities are approximately conditionally independent and the strength of the conditional independence relationship is controlled by the constant $\epsilon$, an upper bound on the conditional mutual information between modalities given the target.
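To make the role of $\epsilon$ concrete, $I(S; S' \mid Y)$ can be estimated directly from data when the modalities are discrete or have been discretized. The following sketch is our illustration rather than code from the paper; the helper names (joint_labels, plugin_mi, estimate_cmi) and the plug-in estimator are assumptions we introduce for exposition.

    import numpy as np

    def joint_labels(columns):
        # Encode each joint configuration of the given discrete columns as one integer label.
        stacked = np.column_stack(columns)
        _, labels = np.unique(stacked, axis=0, return_inverse=True)
        return labels

    def plugin_mi(a, b):
        # Plug-in estimate of I(a; b) in nats from empirical joint frequencies.
        mi = 0.0
        for va in np.unique(a):
            for vb in np.unique(b):
                p_ab = np.mean((a == va) & (b == vb))
                if p_ab > 0:
                    mi += p_ab * np.log(p_ab / (np.mean(a == va) * np.mean(b == vb)))
        return mi

    def estimate_cmi(s_cols, sprime_cols, y):
        # I(S; S' | Y) = sum_y p(y) * I(S; S' | Y = y), estimated on each slice of Y.
        s, sp = joint_labels(s_cols), joint_labels(sprime_cols)
        return sum(np.mean(y == vy) * plugin_mi(s[y == vy], sp[y == vy]) for vy in np.unique(y))

On samples where $S$ and $S'$ are generated conditionally independently given $Y$, estimate_cmi should return a value near zero up to sampling noise, corresponding to a small $\epsilon$ in Assumption 2.1.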
Connection to feature selection. It is worth mentioning that modality selection has a natural correspondence with the problem of feature selection. Without loss of generality, a modality can be considered as a group of features; theoretically, the group could even contain a single feature in some settings. A key distinction between the two problems lies in the feasibility of conditional independence. In multimodal learning, where the input data are often heterogeneous, the (approximate) conditional independence assumption is more likely to hold among input modalities, whereas at the feature level such an assumption is quite restrictive [Zhang et al., 2012], as it boils down to asking the data to approximately satisfy the Naive Bayes assumption.
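To spell out the last point (our addition, not part of the paper's proofs): at the level of individual features, the $\epsilon = 0$ case of Assumption 2.1, applied to all disjoint groups of features, is equivalent via the chain rule to the Naive Bayes factorization
$$ P(X_1, \ldots, X_k \mid Y) = \prod_{i=1}^{k} P(X_i \mid Y), $$
i.e., once $Y$ is known, the inputs carry no information about one another; this requirement is far more plausible across heterogeneous modalities than across individual features.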
2.2 SUBMODULAR OPTIMIZATION
Submodularity is a property of set functions with many theoretical implications and applications in computer science. A definition of submodularity is as follows, where $2^V$ denotes the power set of $V$, and the set function $f$ assigns each subset $S \subseteq V$ a value $f(S) \in \mathbb{R}$.

Definition 2.1 (Nemhauser et al. [1978]). Given a finite set $V$, a function $f: 2^V \to \mathbb{R}$ is submodular if for any $A \subseteq B \subseteq V$ and $e \in V \setminus B$, we have $f(A \cup \{e\}) - f(A) \geq f(B \cup \{e\}) - f(B)$.

In other words, adding a new element to a larger set does not yield a larger marginal benefit than adding it to a subset of that set. One common type of optimization on submodular functions is submodular function maximization with a cardinality constraint: find a subset $S \subseteq V$ that maximizes $f(S)$ subject to $|S| \leq q$. Finding the optimal solution to this problem is NP-hard. However, Nemhauser et al. [1978] show that a greedy maximization algorithm provides a solution with an approximation guarantee relative to the optimum in polynomial time. We provide the pseudocode of this greedy algorithm below.
Algorithm 1: Greedy Maximization
Data: full set $V = \{X_1, \ldots, X_k\}$, constraint $q \in \mathbb{Z}^+$.
Input: $f: 2^V \to \mathbb{R}$, and $p \in \mathbb{Z}^+$, where $p \leq q \leq |V|$.
Output: subset $S_p$.
  $S_0 = \emptyset$
  for $i = 0, 1, \ldots, p-1$ do
    $X_i = \arg\max_{X_j \in V \setminus S_i} \big(f(S_i \cup \{X_j\}) - f(S_i)\big)$
    $S_{i+1} = S_i \cup \{X_i\}$
In this algorithm, $V$ is the full set to select elements from, $f$ is the submodular function to be maximized, $p$ is the number of iterations the algorithm runs, and $q$ is the cardinality constraint. The algorithm starts with an empty set $S_0$ and, at each iteration $i$, adds to the current set $S_i$ the element $X_i$ that maximizes the marginal gain $f(S_i \cup \{X_j\}) - f(S_i)$. Algorithm 1 runs in pseudo-polynomial time $O(p|V|)$ and has the following approximation guarantee.
Theorem 2.1 (Nemhauser et al. [1978]). Let $q \in \mathbb{Z}^+$, let $S_p$ be the solution from Algorithm 1 at iteration $p$, and let $e$ be Euler's number. Then
$$ f(S_p) \geq \big(1 - e^{-p/q}\big) \max_{S: |S| \leq q} f(S). \quad (1) $$
Here $\max_{S:|S| \leq q} f(S)$ is the optimal value, attained by an optimal subset of cardinality at most $q$. If $f$ is monotone, $\arg\max_{S:|S| \leq q} f(S)$ has cardinality exactly $q$. By running Algorithm 1 for exactly $q$ iterations, we obtain a greedily selected set whose value is at least $1 - \frac{1}{e}$ of the optimal value.
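To make Algorithm 1 concrete, here is a minimal Python sketch of the greedy loop. It is our illustration rather than the authors' released code; greedy_maximize, the set function f, and the budget p are generic placeholders, and f can be any set function, e.g., an estimate of $I(S;Y)$.

    def greedy_maximize(f, ground_set, p):
        # Algorithm 1: run p iterations, each time adding the element with the
        # largest marginal gain f(S + {e}) - f(S) to the current selection S.
        selected, current = [], f([])
        for _ in range(min(p, len(ground_set))):
            candidates = [e for e in ground_set if e not in selected]
            gains = [f(selected + [e]) - current for e in candidates]
            best = max(range(len(candidates)), key=lambda j: gains[j])
            selected.append(candidates[best])
            current += gains[best]
        return selected

For instance, with ground_set = ['audio', 'video', 'text'] and p = q = 2, Theorem 2.1 guarantees that the returned pair achieves at least a $1 - 1/e \approx 0.63$ fraction of the best two-modality objective value whenever f is monotone and submodular.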
2.3 FEATURE IMPORTANCE SCORES
The feature importance domain in machine learning studies scoring methods that measure the contribution of individual features. A common setting for these scoring methods is to treat each feature as a participant in a coalitional game in which all features contribute to an overall gain; a scoring method then assigns each feature an importance score by evaluating its individual contribution. Many notable feature importance scores are adapted from the Shapley value, defined as follows:
Definition 2.2 (Shapley [1952]). Given the set of all players $F$ in a coalitional game $v: 2^F \to \mathbb{R}$, the Shapley value of player $i$ defined by $v$ is:
$$ \phi_{v,i} = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\big(|F| - |S| - 1\big)!}{|F|!} \big(v(S \cup \{i\}) - v(S)\big). \quad (2) $$
The Shapley value of a player $i$ is the weighted average of its marginal contributions in game $v$ over all possible player subsets excluding $i$. The game $v$ is a set function that computes the gain of a set of players. Computing the exact Shapley value of a player is #P-hard, and its complexity is exponential in the number of players in the coalitional game: there are $O(2^{|F|})$ unique subsets, and each subset $S$ could have a unique marginal contribution $\Delta v(i \mid S)$ [Roth, 1988, Winter, 2002]. Nonetheless, in certain game settings there are approximation methods for the Shapley value, such as Monte Carlo simulation [Faigle and Kern, 1992]. When the Shapley value is adapted to the feature importance domain, each input feature is a player, and $v$ is also called the evaluation function. But $v$ is not unique; it can be a prediction model or a utility measure, and different choices of $v$ may induce different properties in the Shapley value.
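For small player sets, Equation (2) can be computed directly by enumerating all subsets. The brute-force sketch below is our illustration (the names shapley_value, players, and v are ours); its cost is exponential in $|F|$, which is precisely the obstacle the later computational results address.

    from itertools import combinations
    from math import factorial

    def shapley_value(v, players, i):
        # Exact Shapley value of player i (Definition 2.2); v maps a frozenset of
        # players to a real number (the gain of that coalition).
        others = [p for p in players if p != i]
        n = len(players)
        total = 0.0
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (v(s | {i}) - v(s))
        return total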
We also examine another feature importance score from Catav et al. [2021], known as Marginal Contribution Feature Importance (MCI). Kumar et al. [2020] showed that Shapley-value-based feature importance scores [Shapley, 1952, Lundberg and Lee, 2017, Covert et al., 2020] can underestimate the importance of correlated features, assigning them lower scores when these features appear together in the full set. MCI was proposed to overcome this issue. In Definition 2.3, the MCI of a feature $i$ is its maximum marginal contribution in $v$ over all possible feature subsets. The complexity of computing the exact MCI of a feature is also exponential in the number of features.
Definition 2.3 (Catav et al. [2021]). Given the set of all features $F$ and a non-decreasing set function $v: 2^F \to \mathbb{R}$, the MCI of feature $i$ evaluated on $v$ is:
$$ \phi^{\mathrm{mci}}_{v,i} = \max_{S \subseteq F} \big(v(S \cup \{i\}) - v(S)\big). \quad (3) $$
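MCI replaces the weighted average in Equation (2) by a maximum over subsets. A brute-force sketch in the same style as the Shapley example above (again our illustration, exponential in the number of features):

    from itertools import combinations

    def mci(v, features, i):
        # MCI of feature i (Definition 2.3): the largest marginal gain obtained by
        # adding i to any subset of the remaining features.
        others = [f for f in features if f != i]
        best = float("-inf")
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = frozenset(subset)
                best = max(best, v(s | {i}) - v(s))
        return best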
3 MODALITY SELECTION
This section presents our theoretical results. In Section 3.1, we introduce the utility function used to measure the prediction benefit of modalities and present its key properties. In Section 3.2, we present theoretical guarantees for greedy modality selection via maximizing an approximately submodular function. In Section 3.3, we show computational advantages of feature importance scores (i.e., the Shapley value and MCI) in the context of modality selection. Due to space limits, proofs are deferred to the supplementary materials.
3.1 UTILITY FUNCTION
In order to compare the benefit of different sets of input modalities in multimodal learning, we motivate a general definition of a utility function that quantifies the impact of a set of input modalities on prediction.

Definition 3.1. Let $c$ be a constant in the output space and $\ell(\cdot,\cdot)$ be a loss function. For a set of input modalities $S \subseteq V$, the utility of $S$ given by the utility function $f_u: 2^V \to \mathbb{R}$ is defined as:
$$ f_u(S) := \inf_{c \in \mathcal{Y}} \mathbb{E}[\ell(Y, c)] - \inf_{h \in \mathcal{H}} \mathbb{E}[\ell(Y, h(S))]. \quad (4) $$

In other words, the utility $f_u(S)$ of a set of modalities is the reduction in the minimum expected loss of predicting $Y$ obtained by observing $S$ compared to observing only some constant value $c$. The intuition comes from the common observation that multimodal input tends to reduce prediction loss in practice. Definition 3.1 is easily interpretable under different loss functions and learning settings; it is also used in feature selection to measure the universal predictive power of a given feature [Covert et al., 2020]. Under the binary classification setting with cross-entropy loss, $f_u$ is the Shannon mutual information between the output and the multimodal input.
Proposition 3.1. Given $Y \in \{0,1\}$ and $\ell(Y, \hat{Y}) := -\big(\mathbb{1}(Y=1)\log \hat{Y} + \mathbb{1}(Y=0)\log(1-\hat{Y})\big)$, we have $f_u(S) = I(S;Y)$.

The result above is well known, and has also been proven in [Grünwald and Dawid, 2004, Farnia and Tse, 2016].
We can further show that $I(S;Y)$ is monotonically non-decreasing in the set of input modalities $S$.

Proposition 3.2. For all $M \subseteq N \subseteq V$, $I(N;Y) - I(M;Y) = I(N \setminus M;\, Y \mid M) \geq 0$.
A combination of Proposition 3.1 and Proposition 3.2 implies that using more modalities as input leads to equivalent or better prediction. It also shows that Definition 3.1 quantitatively captures, in closed form, the extra prediction benefit from additional modalities (e.g., $I(N \setminus M;\, Y \mid M)$). This extra benefit is most apparent when the test loss reaches convergence (i.e., the infimum). This monotonicity property is also a key indication that Definition 3.1 intrinsically characterizes the advantage of multimodal learning over unimodal learning.
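For completeness, the identity behind Proposition 3.2 is a one-line application of the chain rule of mutual information (the full proof is in the supplementary material):
$$ I(N; Y) = I\big(M, N \setminus M;\, Y\big) = I(M; Y) + I(N \setminus M;\, Y \mid M) \;\geq\; I(M; Y), $$
where the last inequality uses the non-negativity of conditional mutual information.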
Comparison to previous results. Previous work [Amini et al., 2009, Huang et al., 2021] reached similar conclusions, namely that more views/modalities do not lead to a worse optimal population error, in the contexts of multiview and multimodal learning, respectively. Those observations were obtained by analyzing the excess risks of learning from multiple versus single modalities, showing that the excess risk of learning from multiple modalities cannot be larger than that of a single modality. In contrast, our work adopts an information-theoretic characterization, which leads to an easy-to-interpret measure of the benefit of additional modalities. Furthermore, using well-developed entropy estimators, it is relatively straightforward to estimate these measures in practice. By comparison, excess risks are hard to estimate in practice, since they depend on the Bayes optimal errors, which limits their use in many applications.
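To illustrate the remark on entropy estimators, the utility $f_u(S) = I(S;Y)$ admits a simple plug-in estimate when the modalities are discrete or have been discretized (e.g., by clustering learned representations). The sketch below is ours; estimate_utility and the joint-label encoding are illustrative choices, and more careful estimators would be needed for high-dimensional continuous modalities.

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def estimate_utility(modality_columns, y):
        # Plug-in estimate of f_u(S) = I(S; Y) in nats.
        # modality_columns: list of 1-D discrete arrays, one per selected modality.
        if not modality_columns:
            return 0.0  # I(empty set; Y) = 0
        stacked = np.column_stack(modality_columns)
        # Encode each joint configuration of the selected modalities as one label.
        _, joint = np.unique(stacked, axis=0, return_inverse=True)
        return mutual_info_score(joint, y)

Plugging such an estimator (with memoization, since the greedy loop re-evaluates overlapping subsets) into Algorithm 1 gives a simple end-to-end implementation of greedy modality selection.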
Next, we show that $f_u(S) = I(S;Y)$ is approximately submodular under Assumption 2.1. Previously, Krause and Guestrin [2012] showed mutual information to be submodular under strict conditional independence; here we provide a more flexible notion of submodularity for mutual information. There are also other generalizations of submodularity, such as weak submodularity [Khanna et al., 2017] and adaptive submodularity [Golovin and Krause, 2011]; our definition of approximate submodularity is specific to the case of mutual information and Assumption 2.1.
Proposition 3.3. Under Assumption 2.1, $I(S;Y)$ is $\epsilon$-approximately submodular, i.e., for all $A \subseteq B \subseteq V$ and $e \in V \setminus B$,
$$ I(A \cup \{e\};\, Y) - I(A;\, Y) \;\geq\; I(B \cup \{e\};\, Y) - I(B;\, Y) - \epsilon. $$
The above proposition states that if the conditional mutual information between input modalities given the output is below a threshold $\epsilon > 0$, then the utility function $f_u(\cdot) = I(\cdot\,;Y)$ admits a diminishing-gain pattern controlled by $\epsilon$. This diminishing-gain pattern mirrors the definition of submodularity (Definition 2.1), up to the additive slack $\epsilon$. When the conditional mutual information is zero, the input modalities are strictly conditionally independent, and $I(\cdot\,;Y)$ is strictly submodular.
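Proposition 3.3 can also be sanity-checked numerically. The short sketch below is ours; it assumes some estimator mi_of mapping a tuple of modality names to an estimate of $I(S;Y)$ (for instance the plug-in estimator sketched earlier), and reports the observed slack in the diminishing-gain inequality, which the proposition bounds by $\epsilon$ up to estimation error.

    def submodularity_slack(mi_of, A, B, e):
        # Returns [f(B + e) - f(B)] - [f(A + e) - f(A)] for f(S) = I(S; Y).
        # Proposition 3.3 says this quantity is at most eps when A is a subset of B.
        gain_small = mi_of(tuple(A) + (e,)) - mi_of(tuple(A))
        gain_large = mi_of(tuple(B) + (e,)) - mi_of(tuple(B))
        return gain_large - gain_small

A value well above $\epsilon$ for some triple $(A, B, e)$ with $A \subseteq B$ indicates that Assumption 2.1 is violated on that data; under strict conditional independence the slack should be non-positive up to sampling noise.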
3.2 MODALITY SELECTION VIA APPROXIMATE SUBMODULARITY

With Proposition 3.3, we can formulate the problem of modality selection as a submodular function maximization