Stacked Penalized Logistic Regression for
Selecting Views in Multi-View Learning
Wouter van Loon1, Marjolein Fokkema1, Frank de Vos1,2, Marisa
Koini3, Reinhold Schmidt3, and Mark de Rooij1,2
1Department of Methodology and Statistics, Leiden University
2Leiden Institute for Brain and Cognition
3Division of Neurogeriatrics, Department of Neurology, Medical
University of Graz
June 7, 2024
Abstract
Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This may lead to very large quantities of missing data which, especially when combined with high dimensionality, can make the application of conditional imputation methods computationally infeasible. However, the multi-view structure could be leveraged to reduce the complexity and computational load of imputation. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets and a real data application. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.
Keywords: missing data, imputation, multi-view learning, stacked generalization, feature selection
Accepted for publication in Information Fusion at https://doi.org/10.1016/j.inffus.2024.102524.
©2024. This manuscript version is made available under the CC-BY 4.0 license
http://creativecommons.org/licenses/by/4.0/
arXiv:2210.14484v4 [stat.ML] 20 Jun 2024
1 Introduction
Multi-view data refers to any data set where the features have been divided into distinct feature sets [1, 2, 3]¹. Such data sets are particularly common in the biomedical domain where these feature sets, commonly called views, often correspond to different data sources or modalities [4, 5, 6, 7]. Classification models of disease using information from multiple views generally lead to better performance than models using only a single view [8, 9, 10, 11, 12, 13]. Traditionally, information from different views is often combined using simple feature concatenation, where the features corresponding to different views are simply aggregated into a single feature matrix, so that traditional machine learning methods can be deployed [4]. More recently, dedicated multi-view machine learning techniques have been developed, which are specifically designed to handle the multi-view structure of the data [2, 4]. One such multi-view learning technique is stacked penalized logistic regression (StaPLR) [14]. In addition to improving classification performance, StaPLR can automatically select the views that are most relevant for prediction [14, 15, 16]. This ability to select the most relevant views is particularly important in the biomedical sciences [4], where selecting, for example, a subset of brain scan types [16] could drastically reduce costs in future measurements and prevent patients from undergoing unnecessary medical procedures. Furthermore, models which select views rather than individual features tend to be more interpretable [16].
In practice, not all views may be observed for all subjects. When confronted
with missing views, typical approaches are to remove any subjects with at least one
missing value from the data set (called list-wise deletion or complete case analysis
(CCA)), or to replace missing values by some substituted value, a process known
as imputation. In biomedical studies, a single view may consist of thousands or
even millions of features. With the traditional approach of feature concatenation,
in the presence of missing views, CCA leads to a massive loss of information, while
imputation may be computationally infeasible. In this article we propose a new
method for dealing with missing views, based on the StaPLR algorithm. We show
how this method requires much less computation by imputing missing values in a
dimension-reduced space, rather than in the original feature space. We compare
our proposed imputation method with imputation methods applied in the original
feature space.
¹Depending on the research area, multi-view data is sometimes called multi-block, multi-set, multi-group, or multi-table data [3].
2 Methods
Missing values are often divided into three categories: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) [17, 18]. Values are said to be MCAR if the causes of the missingness are unrelated to both missing and observed data [18]. Examples include random machine failure, or missingness introduced by analyzing a random sub-sample of the data. If the missingness is not completely random but depends only on observed data, the missing values are said to be MAR [18]. If the missingness instead depends on unobserved factors, the missing values are said to be MNAR [18]. Here, we will focus on MCAR missing values.
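As an illustration, the following minimal Python sketch (all names and the 20% missingness rate are illustrative choices, not taken from our experiments) generates MCAR missing values by masking each entry with a fixed probability that does not depend on the data:

import numpy as np

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(100, 5))  # a complete 100 x 5 feature matrix

# MCAR: every entry is masked with probability 0.2, independently of both
# the observed and the unobserved values
mcar_mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mcar_mask] = np.nan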
The simplest way of dealing with MCAR missing values is to discard observations with at least one missing value through complete case analysis. However, this approach is potentially very wasteful since a single missing value causes an entire observation to be removed from the data. CCA may therefore remove many more observed values from the data than the number of values initially missing, and drastically reduce the sample size, leading to increased variance and therefore less accurate predictions.
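A short sketch (again with illustrative numbers) makes this wastefulness concrete: with 5 features and 20% MCAR missingness, a row survives complete case analysis only with probability 0.8⁵ ≈ 0.33, so roughly two thirds of the observations are discarded:

import numpy as np

rng = np.random.default_rng(seed=1)
X_missing = rng.normal(size=(100, 5))
X_missing[rng.random(X_missing.shape) < 0.2] = np.nan  # 20% MCAR missingness

# complete case analysis: retain only the rows without any missing value
complete_rows = ~np.isnan(X_missing).any(axis=1)
X_cca = X_missing[complete_rows]
print(f"{X_cca.shape[0]} of {X_missing.shape[0]} rows retained")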
To prevent wasting observed data, missing values can be imputed. The simplest form of imputation is to replace each missing value with a constant. A very common choice is the unconditional mean of the feature, a procedure known as (unconditional) mean imputation (MI). If one is primarily interested in prediction, MI has some favorable properties: Its computational cost is extremely small, and it has been shown that MI is universally consistent for prediction even for MAR data, as long as the learning algorithm used is also universally consistent [19]. Here consistent means that, given an infinite amount of training data, the prediction function achieves the error rate of the best possible prediction function (i.e., the Bayes rate), while universal means that the procedure is consistent for all possible data distributions [19]. However, MI is often criticized because it is known to distort the data distribution by attenuating existing correlations between the features, underestimating the variance, and causing bias in almost any estimate other than the mean [18].
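For concreteness, unconditional mean imputation can be written in a few lines of Python; this is a minimal sketch rather than the implementation used in our experiments:

import numpy as np

def mean_impute(X):
    # replace every missing value by the observed (unconditional) column mean
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)  # per-feature means, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

The same operation is provided, for example, by scikit-learn's SimpleImputer with strategy="mean".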
Many more sophisticated imputation methods have been developed. The literature on the imputation of missing values is vast, and we do not aim to give a complete overview here. However, most of the popular imputation methods can be grouped in a number of categories. The first such category consists of cold deck imputation methods, which impute missing values using observed values from a different data set [20]. However, this requires suitable additional data to be available, which is often not the case. By contrast, hot deck-style imputation methods [21] are more generally applicable. For each observation with missing values, these imputation methods find one or several complete observations in the data which are most similar to the observation with missing values [21]. The observed values of these cases, or some function thereof, are then used to impute the missing values of the incomplete case. The most popular example is imputation based on the k-nearest neighbors (kNN) algorithm [22]. A different category of imputation methods is that of regression-based imputation. This includes the state-of-the-art multiple imputation through chained equations (MICE) [23]. Another category is based on matrix factorization, which includes Adaptive-Impute [24], and various other methods [25] based on, for example, principal component analysis (PCA) [26] or multiple factor analysis (MFA) [27]. More recently, tree-based imputation methods such as missForest [28] have become popular. Finally, there are deep learning imputation methods which are generally based on auto-encoders, such as multiple imputation with denoising autoencoders (MIDAS) [29] or missing data importance-weighted autoencoder (MIWAE) [30], and/or based on generative adversarial networks, such as generative adversarial imputation nets (GAIN) [31] or graph imputation neural networks (GINN) [32]. Some of the most sophisticated imputation methods may combine ideas from several of the aforementioned categories. Predictive mean matching (PMM) [18], for example, uses regression-based imputation to find cases in the data which are most similar in terms of their predicted values. It is worth noting that it is generally preferable to generate not one, but multiple imputed data sets, so that correct variance estimates can be obtained [18]; this is known as multiple imputation [18].
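To give one concrete example from the hot deck category, the sketch below applies kNN imputation via scikit-learn's KNNImputer, which averages each missing entry over the k most similar observations under a NaN-aware Euclidean distance (the data and the choice k = 5 are illustrative):

import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(100, 5))
X[rng.random(X.shape) < 0.2] = np.nan  # 20% MCAR missingness

# impute each missing value from the 5 nearest neighbors of its row
imputer = KNNImputer(n_neighbors=5)
X_imputed = imputer.fit_transform(X)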
We can also categorize the existing imputation methods depending on whether they perform unconditional or conditional imputation. We define an unconditional imputation method as any method in which the imputation of a missing value is based solely on other observations of the same feature, that is, the imputation takes place within a single column of the feature matrix. The aforementioned mean imputation is a classic example of an unconditional imputation method. By contrast, a conditional imputation method is any method in which the imputation of a missing value is based, in part or completely, on observations of other features, that is, the imputation uses different columns of the feature matrix. Most sophisticated imputation methods, such as Bayesian multiple imputation and PMM, are conditional imputation methods. The distinction between unconditional and conditional imputation methods is of particular interest for feature selection. Unconditional imputation methods, such as mean imputation, use only the univariate distributions for imputation, so that the imputed feature remains in some sense ‘pure’ and free from contamination from other features. However, as mentioned earlier, mean imputation is known to distort the data distribution by attenuating existing correlations between the features [18].
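The contrast can be made explicit in code. In the sketch below (with illustrative data, not the simulation design of this paper), mean imputation attenuates the correlation between two correlated features, whereas a chained-equations method such as scikit-learn's IterativeImputer, which regresses each feature on the others, largely preserves it:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(seed=1)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.3 * rng.normal(size=200)])  # two correlated features
X[rng.random(X.shape) < 0.2] = np.nan

# unconditional: each column is imputed using only its own observed values
X_uncond = SimpleImputer(strategy="mean").fit_transform(X)

# conditional: each feature is imputed from the other column(s)
X_cond = IterativeImputer(random_state=1).fit_transform(X)

print(np.corrcoef(X_uncond.T)[0, 1])  # attenuated correlation
print(np.corrcoef(X_cond.T)[0, 1])    # correlation largely preserved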
By contrast, some (but not all) conditional imputation methods preserve the
correlations between features [18]. However, in this case the imputed values depend
on other features in the data. In the event that a selected feature has a large
number of imputed values, this may lead to difficulties in interpretation, since a
large proportion of the selected feature is derived from other features. Nevertheless,
a recent study on the effect of imputation methods on feature selection suggests
sophisticated conditional imputation methods generally lead to better results than
unconditional imputation methods [33]. Because it is not possible to both perform the imputation independently of other features and preserve existing correlations, one has to choose between the two.
It should be noted that other methods for handling missing data exist which
do not explicitly impute missing values. These methods incorporate the missing
data handling directly into the model fitting procedure and include likelihood-
based methods such as full information maximum likelihood (FIML) [34, 35] for
parametric regression models, and missingness incorporated in attributes (MIA)
[36] for decision trees. However, these methods are less broadly applicable than
imputation methods [18, 19, 35] and we do not consider them here.
2.1 From Missing Features to Missing Views
In multi-view data, it is likely that missingness will occur at the view level, rather than at the feature level [37, 38]. Missing views may occur at random and/or by design [37, 38]. In a study where one of the views corresponds to features derived from a magnetic resonance imaging (MRI) scan, factors like the MRI scanner experiencing machine failure, a mistake in the scanning protocol by the researcher administering the scan, or a subject simply not making it to their appointment in time due to heavy traffic, would lead to all features of this view being simultaneously missing. Likewise, if one of the views corresponds to features derived from a sample of blood or cerebrospinal fluid (CSF), a lost or contaminated sample would lead to all derived features being simultaneously missing. Note that in these cases, although the missingness occurs at the view level, the underlying mechanism is still MCAR. Another common example of MCAR data occurs in the case of planned missingness, where the missing values are part of the study design. For example, it may be considered too expensive to administer an MRI scan to all study participants, so instead an MRI scan is administered only to a random sub-sample of the participants. Again the underlying mechanism is MCAR, but all features corresponding to the MRI scan will be missing simultaneously for the unmeasured sub-sample. Throughout the rest of this article we will assume that (1) for each observation, a view is either completely missing or completely observed, and (2) the missingness is completely at random (MCAR).
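Under these two assumptions, missingness can be represented with a single indicator per view and observation. The following sketch (view dimensionalities and the 50% sampling fraction are illustrative) mimics planned missingness by masking an entire view for a random sub-sample:

import numpy as np

rng = np.random.default_rng(seed=1)
n = 100
# three views of different dimensionalities for the same n observations
views = [rng.normal(size=(n, p)) for p in (10, 50, 200)]

# planned missingness: the third view (e.g., MRI features) is measured only
# for a random half of the subjects, so all of its features are missing
# simultaneously for the unmeasured sub-sample
unmeasured = rng.choice(n, size=n // 2, replace=False)
views[2][unmeasured, :] = np.nan

X = np.hstack(views)  # feature concatenation of the three views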
Conceptually, one could impute a missing view by first applying feature concatenation, and then simply applying a chosen imputation method on the concatenated feature set. However, in practice this may be impossible. For example, if