
To address these challenges, we propose a utility function that conveniently quantifies the benefit of any set of modalities for prediction in typical learning settings. We then identify an assumption that is suitable for multimodal/multiview learning and that allows us to develop efficient approximate algorithms for modality selection: we assume that the input modalities are approximately conditionally independent given the target. Since the strength of conditional independence is parameterized, our results generalize to multimodal problems with different levels of conditional independence.
We show that, in the setting of binary classification with cross-entropy loss, our definition of the utility of a modality naturally manifests as the Shannon mutual information between the modality and the prediction target. Under approximate conditional independence, mutual information is monotone and approximately submodular. These properties intrinsically describe the empirical advantages of learning with more modalities, and allow us to formulate modality selection as a submodular optimization problem. In this context, we obtain efficient selection algorithms with provable performance guarantees on the selected subset. For example, we show a performance guarantee for the greedy maximization algorithm of Nemhauser et al. [1978] under approximate submodularity. Further, we connect modality selection to marginal-contribution-based feature importance scores in feature selection. We examine the Shapley value and Marginal Contribution Feature Importance (MCI) [Catav et al., 2021] for ranking modality importance, and show that these scores, although intractable in general, can be computed efficiently under our assumptions in the context of modality selection. Lastly, we evaluate our theoretical results on three classification datasets. The experimental results confirm both the utility and the diversity of the selected modalities.
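To make the algorithmic side concrete, the following is a minimal sketch of greedy selection under a cardinality budget, in the spirit of the Nemhauser et al. [1978] algorithm referenced above. The set-utility callback (e.g., an empirical estimate of $I(S; Y)$) and the budget $m$ are illustrative placeholders supplied by the caller, not quantities fixed by our analysis.

```python
def greedy_modality_selection(modalities, utility, m):
    """Greedily pick up to m modalities to maximize a monotone,
    (approximately) submodular set utility, e.g. an estimate of I(S; Y).

    `modalities` is any list of modality identifiers; `utility` maps a list
    of identifiers to a real number. This is an illustrative sketch, not the
    paper's implementation.
    """
    selected, remaining = [], list(modalities)
    for _ in range(min(m, len(remaining))):
        # Marginal gain of adding each remaining modality to the current set.
        gains = [utility(selected + [x]) - utility(selected) for x in remaining]
        best = max(range(len(remaining)), key=gains.__getitem__)
        if gains[best] <= 0:  # no remaining modality improves the utility
            break
        selected.append(remaining.pop(best))
    return selected
```

Under exact monotonicity and submodularity, this greedy rule attains the classical $(1 - 1/e)$ approximation guarantee of Nemhauser et al. [1978]; our analysis concerns how such a guarantee behaves under approximate submodularity.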
To summarize, we make the following contributions in this paper:
• Propose a general measure of modality utility, and identify an assumption that is suitable for multimodal learning and helpful for developing efficient approximate algorithms for modality selection.
• Demonstrate, both theoretically and empirically, an algorithm with a performance guarantee on the selected modalities for prediction in classification problems with cross-entropy loss.
• Establish theoretical connections between modality selection and feature importance scores, i.e., the Shapley value and Marginal Contribution Feature Importance.
2 PRELIMINARIES
In this section, we first describe our notation and problem
setup, and we then provide a brief introduction to submodu-
lar function maximization and feature importance scores.
2.1 NOTATION AND SETUP
We use $X$ and $Y$ to denote the random variables that take values in the input space $\mathcal{X}$ and output space $\mathcal{Y}$, respectively. The instantiations of $X$ and $Y$ are denoted by $x$ and $y$. We use $\mathcal{H}$ to denote the hypothesis class of predictors from input to output space, and $\hat{Y}$ to denote the predicted variable. Let $\mathcal{X}$ be multimodal, i.e., $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_k$, where each $\mathcal{X}_i$ is the input space of the $i$-th modality. We use $X_i$ to denote the random variable that takes values in $\mathcal{X}_i$, and $V$ to denote the full set of input modalities, i.e., $V = \{X_1, \dots, X_k\}$. Throughout the paper, we often use $S$ and $S'$ to denote arbitrary subsets of $V$. Lastly, we use $I(\cdot\,;\cdot)$ for the Shannon mutual information, $H(\cdot)$ for entropy, $\ell_{\mathrm{ce}}(Y, \hat{Y}) = -\big(\mathbb{1}(Y{=}1)\log \hat{Y} + \mathbb{1}(Y{=}0)\log(1-\hat{Y})\big)$ for the cross-entropy loss, and $\ell_{01}(Y, \hat{Y}) = \mathbb{1}(Y \neq \hat{Y})$ for the zero-one loss.
For simplicity of discussion, we primarily focus on the setting of binary classification with cross-entropy loss.² In this setting, a subset of input modalities $S \subseteq V$ and the output $Y \in \{0, 1\}$ are observed. The predictor aims to make a prediction $\hat{Y} \in [0, 1]$ that minimizes the cross-entropy loss between $Y$ and $\hat{Y}$. The goal of modality selection is to select a subset of the input modalities for this loss-minimization objective under certain constraints. Our results rely on the following assumption.
Assumption 2.1 ($\epsilon$-Approximate Conditional Independence). There exists a constant $\epsilon \geq 0$ such that, $\forall S, S' \subseteq V$ with $S \cap S' = \emptyset$, we have $I(S; S' \mid Y) \leq \epsilon$.
Note that when $\epsilon = 0$, Assumption 2.1 reduces to strict conditional independence between disjoint modalities given the target variable. In fact, this is a common assumption in prior work on multimodal learning [White et al., 2012, Wu and Goodman, 2018, Sun et al., 2020]. In practice, however, strict conditional independence is often difficult to satisfy. We therefore use the more general assumption above, in which the input modalities are approximately conditionally independent and the strength of the conditional independence relationship is controlled by the constant $\epsilon$, an upper bound on the conditional mutual information between modalities given the target.
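As an illustration of how Assumption 2.1 might be probed empirically, the sketch below computes a plug-in estimate of the conditional mutual information $I(S; S' \mid Y)$ from discrete (or discretized) samples; the estimator and the discretization are our own illustrative choices and are not prescribed by our analysis.

```python
import numpy as np
from collections import Counter

def _entropy(rows):
    """Plug-in (empirical) entropy, in nats, of a list of hashable tuples."""
    n = len(rows)
    probs = np.array([c / n for c in Counter(rows).values()])
    return float(-(probs * np.log(probs)).sum())

def conditional_mutual_information(s, s_prime, y):
    """Plug-in estimate of I(S; S' | Y) from discrete samples.

    s and s_prime are (n, d) arrays of discretized modality features and y is
    an (n,) array of labels. Uses the identity
    I(S; S' | Y) = H(S, Y) + H(S', Y) - H(S, S', Y) - H(Y).
    """
    S  = [tuple(row) for row in np.asarray(s)]
    Sp = [tuple(row) for row in np.asarray(s_prime)]
    Y  = [(label,) for label in np.asarray(y)]
    h_sy   = _entropy([a + c for a, c in zip(S, Y)])
    h_spy  = _entropy([b + c for b, c in zip(Sp, Y)])
    h_sspy = _entropy([a + b + c for a, b, c in zip(S, Sp, Y)])
    h_y    = _entropy(Y)
    return h_sy + h_spy - h_sspy - h_y

# Comparing this estimate against a chosen epsilon gauges how closely a
# dataset satisfies the assumption for a given pair of disjoint modality sets.
```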
Connection to feature selection. It is worth mentioning that modality selection has a natural correspondence with the problem of feature selection. Without loss of generality, a modality can be considered a group of features; theoretically, the group could even contain a single feature in some settings. A key distinction between the two problems, however, lies in the feasibility of conditional independence. In multimodal learning, where the input data is often heterogeneous, the (approximate) conditional independence assumption is
² We choose the binary-class setting for ease of exposition; our general proofs and results directly extend to the multi-class setting. We only use the binary case to derive the conditional entropy (supplementary material) and to further showcase Corollary 3.1.