Manifold Alignment with Label Information
Andrés F. Duque*, Myriam Lizotte†, Guy Wolf‡, Kevin R. Moon§
* Utah State University, andres.duque@usu.edu
† Université de Montréal, myriam.lizotte@mila.quebec
‡ Université de Montréal, wolfguy@mila.quebec
§ Utah State University, kevin.moon@usu.edu
Abstract
Multi-domain data is becoming increasingly common and
presents both challenges and opportunities in the data
science community. The integration of distinct data-views
can be used for exploratory data analysis, and benefit
downstream analysis including machine learning related
tasks. With this in mind, we present a novel manifold
alignment method called MALI (Manifold Alignment with
Label Information) that learns a correspondence between two
distinct domains. MALI occupies a middle ground
between the more commonly addressed semi-
supervised manifold alignment problem with some known
correspondences between the two domains, and the purely
unsupervised case, where no known correspondences are
provided. To do this, MALI learns the manifold structure
in both domains via a diffusion process and then leverages
discrete class labels to guide the alignment. By aligning two
distinct domains, MALI recovers a pairing and a common
representation that reveals related samples in both domains.
Additionally, MALI can be used for the transfer learning
problem known as domain adaptation. We show that MALI
outperforms the current state-of-the-art manifold alignment
methods across multiple datasets.
1 Introduction
The data collection process for a given phenomenon may
be affected by different sources of variability, creating
seemingly distinct domains. For instance, natural
images captured with different illumination, contrast, or noise
may degrade the classification performance of a machine
learning model previously trained on a different do-
main. In biology, the modern study of single-cell dy-
namics is conducted via different instruments, condi-
tions and modalities, raising different challenges and
opportunities [28, 29]. In many cases, the relationships
between the different domains are unknown. Hence,
the fusion and integration of multi-domain data have
been extensively studied in the data science commu-
nity for supervised learning as well as data mining and
exploratory data analysis. One of the earliest meth-
ods to do this is Canonical Correlation Analysis (CCA),
which finds a linear projection that maximizes the corre-
lation between the two domains [30]. CCA has been ex-
tended in recent years to different formulations such as sparse
CCA [20, 26] and kernel CCA [8, 15].
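For concreteness, the following is a minimal sketch of this fully paired (supervised) setting using scikit-learn's CCA on small synthetic, hypothetical data; it only illustrates the paired-correlation objective and is not a method from this paper.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Hypothetical paired data: n samples observed in two views of a shared latent space.
rng = np.random.default_rng(0)
n, p, q = 200, 10, 8
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.1 * rng.normal(size=(n, p))
Y = latent @ rng.normal(size=(2, q)) + 0.1 * rng.normal(size=(n, q))

# CCA finds linear projections whose coordinates are maximally correlated across views.
cca = CCA(n_components=2)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# Correlation of the first pair of canonical variates (high for this toy example).
print(np.corrcoef(X_c[:, 0], Y_c[:, 0])[0, 1])
```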
Figure 1: Manifold alignment. Two different
datasets measured from the same underlying phenomenon
are captured under different conditions, instruments,
experimental designs, etc. Manifold alignment assumes
a common latent space (grey) from which the observations
are mapped by functions $f$ and $g$ to the different
ambient spaces. We seek to find the underlying relationship
$h$ between observations living in different spaces $X$
and $Y$ without assuming any pairing known a priori.
Instead, we assume there are labeled observations for
different classes (different shapes).
In many applications, a reasonable assumption to make is that
the data collected in different domains are controlled by a set
of shared underlying modes of variation or latent variables. The manifold assumption is
also often applicable in this case, in which the data
measured in the different domains are assumed to lie
on a low-dimensional manifold embedded in the high-
dimensional ambient spaces, being the result of smooth
mappings of the latent variables (see Fig. 1). With
this in mind, manifold alignment (MA) has become a
common technique for data integration. Some applica-
tions of MA include handling different face poses and
protein structure alignment [1, 37], medical images
for Alzheimer's disease classification [4, 16], multimodal
sensing images [32], graph matching [13], and integrating
single-cell multi-omics data [6].
Multiple MA methods have been proposed under
different prior knowledge assumptions that relate the
two domains. Methods such as CCA or multi-view dif-
fusion maps [23] can be categorized as supervised MA,
since the data is assumed to come in a paired fashion.
More challenging scenarios arise when partial or no
a priori pairing knowledge is considered. Purely un-
supervised algorithms are designed for scenarios where
neither pairings between domains nor any other side-
information is available. As a consequence, they rely
solely on the particular topology of each domain to in-
fer inter-domain similarities (e.g. [6, 10, 11, 34]).
Methods that leverage some additional information
are often categorized as semi-supervised MA. As a spe-
cial case, several methods consider partial correspon-
dence information, where a few one-to-one matching
samples work as anchor points to find a consistent align-
ment for the rest of the data. Some papers leverage the
graph structure of the data [12,18,19,33] and are closely
related to Laplacian eigenmaps [5]. Others resort to
neural networks such as the GAN-based MAGAN [2] or
the autoencoder presented in [3].
However, even partial correspondences can be ex-
pensive or impossible to acquire. This is the case in
biological applications where the measurement process
destroys the cells, making it impossible to measure other
modalities of the exact same cells. But even if there are
no known correspondences between domains, we do not
have to resort to unsupervised MA. If we have access to
side information about the datasets from both domains,
such as discrete class labels, we can leverage this extra
knowledge to perform manifold alignment [31, 35, 36].
Motivated by this, we propose a new semi-supervised
MA algorithm called MALI (Manifold Alignment with
Label Information). MALI leverages the manifold struc-
ture of the data in both domains, combined with the
discrete label information, and it does not require any
known corresponding points in the different domains.
MALI is built upon the widely-used manifold learn-
ing method Diffusion Maps [9] and optimal transport
(OT) [27]. We show via experimentation that MALI
outperforms current state-of-the-art MA algorithms in
this setting across multiple datasets by several metrics.
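As background on the first of these building blocks, the sketch below shows a standard single-domain diffusion-maps construction, assuming a Gaussian affinity kernel and simple row normalization; MALI's actual kernel and normalization choices may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_map(X, n_components=2, epsilon=1.0, t=1):
    """Basic single-domain diffusion-map embedding (background sketch, not MALI itself)."""
    # Gaussian affinities, row-normalized into a Markov transition matrix P.
    K = np.exp(-cdist(X, X, "sqeuclidean") / epsilon)
    P = K / K.sum(axis=1, keepdims=True)

    # Spectral decomposition of P, eigenvalues sorted in decreasing order.
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]

    # Drop the trivial constant eigenvector and scale by eigenvalues^t (diffusion time t).
    return vecs[:, 1:n_components + 1] * (vals[1:n_components + 1] ** t)
```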
The setting described above is similar to the domain
adaptation (DA) [14] problem. In traditional machine
learning, the training set and the test set are assumed to
be sampled from the same distribution and to share the
same features. But in practice these assumptions may
not hold, for example, due to the different collection
circumstances mentioned previously. When data is ex-
pensive or time-consuming to label, it may be desirable
to train a model on existing related datasets and then
adapt it to the new task. It is of interest to leverage
the knowledge acquired from training on one dataset to
improve performance on the same task on a different
dataset, or potentially even a different task. One pos-
sible approach to tackle DA is to use MA, since knowl-
edge can be transferred through MA via the learned
inter-domain correspondences or by training on a shared
latent representation of both domains.
2 Preliminaries
2.1 Problem Description. Assume we have two
datasets $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{n \times p}$ and
$Y = \{y_1, y_2, \ldots, y_m\} \in \mathbb{R}^{m \times q}$. We assume that all of the
points in $X$ are labeled with discrete (i.e., class) labels
$L_x = \{\ell^x_1, \ldots, \ell^x_n\}$, while the points in $Y$ may be partially
or fully labeled with discrete labels $L_y = \{\ell^y_1, \ldots, \ell^y_r\}$,
with $r \leq m$. In the domain adaptation problem, $X$ is
measured from the source domain while $Y$ is measured
from the target domain.
The problem consists of learning an alignment be-
tween both data manifolds, by leveraging their respec-
tive geometric structures as well as the label knowledge
available from both domains. There are several pos-
sible ways to represent such an alignment using MA
algorithms. One way is to directly learn hard or soft
correspondences between points in $X$ and $Y$. A regression
model could then be trained using these correspondences
to learn a parametric mapping between domains.
In the domain adaptation problem, unlabeled data in
the target domain can then be labeled by combining this
regression model with the richer label information available
in the source domain.
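As an illustration of this first representation, the sketch below assumes hard correspondences (an index pairing) have already been produced by some alignment method; the function name, regressor size, and nearest-neighbor label transfer are illustrative choices, not part of MALI.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsClassifier

def label_target_via_mapping(X, Lx, Y, pairs):
    """Sketch: use learned cross-domain correspondences to label the target domain.

    X: (n, p) labeled source data with labels Lx; Y: (m, q) target data.
    pairs: list of (i, j) correspondences between rows of X and rows of Y,
           e.g. produced by a manifold alignment method.
    """
    src_idx, tgt_idx = map(list, zip(*pairs))

    # Parametric mapping from the source to the target feature space,
    # trained only on the matched pairs.
    mapper = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
    mapper.fit(X[src_idx], Y[tgt_idx])

    # Mapped (and labeled) source points now live in the target space;
    # each target point is labeled by its nearest mapped neighbors.
    clf = KNeighborsClassifier(n_neighbors=5).fit(mapper.predict(X), Lx)
    return clf.predict(Y)
```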
A second way to represent the alignment is to
learn a shared embedding space which can be used
for downstream analysis. For the domain adaptation
problem, a classifier could be trained on the shared
embedding space using the labels from X. Direct
correspondences can be learned by, for example, using
a nearest neighbor approach in the shared space.
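A corresponding sketch for this second representation, assuming hypothetical shared-space coordinates Zx and Zy have already been produced by an alignment method:

```python
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

def downstream_from_shared_embedding(Zx, Lx, Zy, n_neighbors=5):
    """Sketch: downstream uses of a shared embedding (Zx: source rows, Zy: target rows)."""
    # (a) Domain adaptation: a classifier trained on source coordinates labels the target.
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(Zx, Lx)
    Ly_pred = clf.predict(Zy)

    # (b) Direct correspondences: nearest source point for every target point.
    nn = NearestNeighbors(n_neighbors=1).fit(Zx)
    matches = nn.kneighbors(Zy, return_distance=False)[:, 0]
    return Ly_pred, matches
```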
As we show in this work, MALI is suited for any
of these scenarios. We first find pairwise cross-domain
distances, which are then leveraged to find hard or soft
assignments between the domains via optimal transport.
If required, a shared embedding can be learned using
these assignments.
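The assignment step can be illustrated as follows, assuming a cross-domain distance matrix D is already available: a hard one-to-one matching via the Hungarian algorithm and a soft coupling via a few hand-rolled Sinkhorn iterations with uniform marginals. This is a generic optimal-transport sketch, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hard_assignment(D):
    """One-to-one matching that minimizes total cross-domain distance (Hungarian algorithm)."""
    rows, cols = linear_sum_assignment(D)
    return list(zip(rows, cols))

def soft_assignment(D, reg=0.1, n_iter=200):
    """Entropically regularized OT coupling via Sinkhorn iterations, uniform marginals."""
    n, m = D.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-D / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    # Coupling matrix: rows sum to a, columns sum to b.
    return u[:, None] * K * v[None, :]
```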
2.2 Related work Here we summarize two existing
methods that perform manifold alignment using discrete
label information without assuming prior known corre-
spondences. In [35] both datasets are concatenated in a
new block matrix
$$Z = \begin{bmatrix} X & 0 \\ 0 & Y \end{bmatrix} \in \mathbb{R}^{(n+m) \times (p+q)}.$$
Domain-specific similarity matrices $W_X$ and $W_Y$ are created
from the data, e.g. via a kernel function. These
matrices are then similarly combined in a new block matrix
$W_Z$. To leverage the label information, the authors
create a label-similarity matrix with entries $W_s(i, j) = 1$
if samples $z_i$ and $z_j$ share the same label, and $0$ otherwise. A
label-dissimilarity matrix $W_d(i, j) = |W_s(i, j) - 1|$ is also defined.
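A small sketch of the block-matrix construction just described, assuming fully labeled data in both domains and a hypothetical Gaussian-kernel similarity; the subsequent optimization in [35] is not reproduced here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_block_matrices(X, Y, Lx, Ly, epsilon=1.0):
    """Sketch of the block matrices used by the label-based MA approach of [35]."""
    n, m = X.shape[0], Y.shape[0]

    # Z stacks the two datasets block-diagonally (the feature spaces are not shared).
    Z = np.zeros((n + m, X.shape[1] + Y.shape[1]))
    Z[:n, :X.shape[1]] = X
    Z[n:, X.shape[1]:] = Y

    # Domain-specific similarities (Gaussian kernel here), combined block-diagonally as well.
    WX = np.exp(-cdist(X, X, "sqeuclidean") / epsilon)
    WY = np.exp(-cdist(Y, Y, "sqeuclidean") / epsilon)
    WZ = np.zeros((n + m, n + m))
    WZ[:n, :n], WZ[n:, n:] = WX, WY

    # Label similarity Ws(i, j) = 1 iff samples share a label, and its complement Wd.
    labels = np.concatenate([Lx, Ly])
    Ws = (labels[:, None] == labels[None, :]).astype(float)
    Wd = np.abs(Ws - 1)
    return Z, WZ, Ws, Wd
```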