
To address these challenges, we propose a utility function that conveniently quantifies the benefit of any set of modalities for prediction in typical learning settings. We then identify an assumption that is suitable for multimodal/multiview learning and that allows us to develop efficient approximate algorithms for modality selection: we assume that the input modalities are approximately conditionally independent given the target. Since the strength of conditional independence is parameterized, our results generalize to multimodal problems with different levels of conditional independence.
We show that, in the setting of binary classification with cross-entropy loss, our definition of the utility of a modality naturally manifests as the Shannon mutual information between the modality and the prediction target. Under approximate conditional independence, mutual information is monotone and approximately submodular. These properties intrinsically describe the empirical advantages of learning with more modalities, and allow us to formulate modality selection as a submodular optimization problem. In this context, we obtain efficient selection algorithms with provable performance guarantees on the selected subset. For example, we show a performance guarantee for the greedy maximization algorithm of Nemhauser et al. [1978] under approximate submodularity. Further, we connect modality selection to marginal-contribution-based feature importance scores in feature selection. We examine the Shapley value and Marginal Contribution Feature Importance (MCI) [Catav et al., 2021] for ranking modality importance, and show that these scores, although intractable in general, can be computed efficiently under our assumptions in the context of modality selection. Lastly, we evaluate our theoretical results on three classification datasets. The experimental results confirm both the utility and the diversity of the selected modalities.
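To make the algorithmic side concrete, the following is a minimal sketch of greedy selection under a cardinality budget, in the spirit of the Nemhauser et al. [1978] algorithm referenced above. The set-utility callback (e.g., an empirical estimate of $I(S; Y)$) and the budget $m$ are illustrative placeholders supplied by the caller, not quantities fixed by our analysis.

```python
def greedy_modality_selection(modalities, utility, m):
    """Greedily pick up to m modalities to maximize a monotone,
    (approximately) submodular set utility, e.g. an estimate of I(S; Y).

    `modalities` is any list of modality identifiers; `utility` maps a list
    of identifiers to a real number. This is an illustrative sketch, not the
    paper's implementation.
    """
    selected, remaining = [], list(modalities)
    for _ in range(min(m, len(remaining))):
        # Marginal gain of adding each remaining modality to the current set.
        gains = [utility(selected + [x]) - utility(selected) for x in remaining]
        best = max(range(len(remaining)), key=gains.__getitem__)
        if gains[best] <= 0:  # no remaining modality improves the utility
            break
        selected.append(remaining.pop(best))
    return selected
```

Under exact monotonicity and submodularity, this greedy rule attains the classical $(1 - 1/e)$ approximation guarantee of Nemhauser et al. [1978]; our analysis concerns how such a guarantee behaves under approximate submodularity.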
To summarize, we make the following contributions in this paper:
• Propose a general measure of modality utility, and identify an assumption that is suitable for multimodal learning and helpful for developing efficient approximate algorithms for modality selection.
• Demonstrate, both theoretically and empirically, an algorithm with a performance guarantee on the selected modalities for prediction in classification problems with cross-entropy loss.
• Establish theoretical connections between modality selection and feature importance scores, i.e., the Shapley value and Marginal Contribution Feature Importance.
2 PRELIMINARIES
In this section, we first describe our notation and problem
setup, and we then provide a brief introduction to submodu-
lar function maximization and feature importance scores.
2.1 NOTATION AND SETUP
We use $X$ and $Y$ to denote the random variables that take values in the input space $\mathcal{X}$ and output space $\mathcal{Y}$, respectively. The instantiations of $X$ and $Y$ are denoted by $x$ and $y$. We use $\mathcal{H}$ to denote the hypothesis class of predictors from input to output space, and $\hat{Y}$ to denote the predicted variable. Let $\mathcal{X}$ be multimodal, i.e., $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_k$, where each $\mathcal{X}_i$ is the input space of the $i$-th modality. We use $X_i$ to denote the random variable that takes values in $\mathcal{X}_i$, and $V$ to denote the full set of input modalities, i.e., $V = \{X_1, \dots, X_k\}$. Throughout the paper, we often use $S$ and $S'$ to denote arbitrary subsets of $V$. Lastly, we use $I(\cdot\,;\cdot)$ for the Shannon mutual information, $H(\cdot)$ for entropy, $\ell_{\mathrm{ce}}(Y, \hat{Y}) = -\big(\mathbb{1}(Y{=}1)\log \hat{Y} + \mathbb{1}(Y{=}0)\log(1-\hat{Y})\big)$ for the cross-entropy loss, and $\ell_{01}(Y, \hat{Y}) = \mathbb{1}(Y \neq \hat{Y})$ for the zero-one loss.
For simplicity of discussion, we primarily focus on the setting of binary classification with cross-entropy loss.² In this setting, a subset of input modalities $S \subseteq V$ and the output $Y \in \{0, 1\}$ are observed. The predictor aims to make a prediction $\hat{Y} \in [0, 1]$ that minimizes the cross-entropy loss between $Y$ and $\hat{Y}$. The goal of modality selection is to select a subset of the input modalities for this loss-minimization objective under certain constraints. Our results rely on the following assumption.
Assumption 2.1 ($\epsilon$-Approximate Conditional Independence). There exists a constant $\epsilon \geq 0$ such that, $\forall S, S' \subseteq V$ with $S \cap S' = \emptyset$, we have $I(S; S' \mid Y) \leq \epsilon$.
Note that when $\epsilon = 0$, Assumption 2.1 reduces to strict conditional independence between disjoint modalities given the target variable. In fact, this is a common assumption in prior work on multimodal learning [White et al., 2012, Wu and Goodman, 2018, Sun et al., 2020]. In practice, however, strict conditional independence is often difficult to satisfy. We therefore use the more general assumption above, in which the input modalities are approximately conditionally independent and the strength of the conditional independence relationship is controlled by the constant $\epsilon$, an upper bound on the conditional mutual information between modalities given the target.
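As an illustration of how Assumption 2.1 might be probed empirically, the sketch below computes a plug-in estimate of the conditional mutual information $I(S; S' \mid Y)$ from discrete (or discretized) samples; the estimator and the discretization are our own illustrative choices and are not prescribed by our analysis.

```python
import numpy as np
from collections import Counter

def _entropy(rows):
    """Plug-in (empirical) entropy, in nats, of a list of hashable tuples."""
    n = len(rows)
    probs = np.array([c / n for c in Counter(rows).values()])
    return float(-(probs * np.log(probs)).sum())

def conditional_mutual_information(s, s_prime, y):
    """Plug-in estimate of I(S; S' | Y) from discrete samples.

    s and s_prime are (n, d) arrays of discretized modality features and y is
    an (n,) array of labels. Uses the identity
    I(S; S' | Y) = H(S, Y) + H(S', Y) - H(S, S', Y) - H(Y).
    """
    S  = [tuple(row) for row in np.asarray(s)]
    Sp = [tuple(row) for row in np.asarray(s_prime)]
    Y  = [(label,) for label in np.asarray(y)]
    h_sy   = _entropy([a + c for a, c in zip(S, Y)])
    h_spy  = _entropy([b + c for b, c in zip(Sp, Y)])
    h_sspy = _entropy([a + b + c for a, b, c in zip(S, Sp, Y)])
    h_y    = _entropy(Y)
    return h_sy + h_spy - h_sspy - h_y

# Comparing this estimate against a chosen epsilon gauges how closely a
# dataset satisfies the assumption for a given pair of disjoint modality sets.
```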
Connection to feature selection. It is worth mentioning that modality selection has a natural correspondence with the problem of feature selection. Without loss of generality, a modality can be considered a group of features; theoretically, the group could even contain a single feature in some settings. A key distinction between the two problems, however, lies in the feasibility of conditional independence. In multimodal learning, where the input data is often heterogeneous, the (approximate) conditional independence assumption is
² We choose the binary-class setting for ease of exposition; our general proofs and results directly extend to the multi-class setting. We only use the binary case to derive the conditional entropy (supplementary material) and to further showcase Corollary 3.1.