a priori pairing knowledge is considered. Purely un-
supervised algorithms are designed for scenarios where
neither pairings between domains nor any other side-
information is available. As a consequence, they rely
solely on the particular topology of each domain to in-
fer inter-domain similarities (e.g. [6, 10, 11, 34]).
Methods that leverage some additional information
are often categorized as semi-supervised MA. As a spe-
cial case, several methods consider partial correspon-
dence information, where a few one-to-one matching
samples work as anchor points to find a consistent align-
ment for the rest of the data. Some papers leverage the
graph structure of the data [12, 18, 19, 33] and are closely
related to Laplacian eigenmaps [5]. Others resort to
neural networks such as the GAN-based MAGAN [2] or
the autoencoder presented in [3].
However, even partial correspondences can be ex-
pensive or impossible to acquire. This is the case in
biological applications where the measurement process
destroys the cells, making it impossible to measure other
modalities of the exact same cells. But even if there are
no known correspondences between domains, we do not
have to resort to unsupervised MA. If we have access to
side information about the datasets from both domains,
such as discrete class labels, we can leverage this extra
knowledge to perform manifold alignment [31, 35, 36].
Motivated by this, we propose a new semi-supervised
MA algorithm called MALI (Manifold Alignment with
Label Information). MALI leverages the manifold struc-
ture of the data in both domains, combined with the
discrete label information, and it does not require any
known corresponding points in the different domains.
MALI is built upon the widely-used manifold learn-
ing method Diffusion Maps [9] and optimal transport
(OT) [27]. We show experimentally that MALI outperforms current state-of-the-art MA algorithms in this setting across multiple datasets and on several metrics.
The setting described above is similar to the domain
adaptation (DA) [14] problem. In traditional machine
learning, the training set and the test set are assumed to
be sampled from the same distribution and to share the
same features. But in practice these assumptions may
not hold, for example, due to the different collection
circumstances mentioned previously. When data is ex-
pensive or time-consuming to label, it may be desirable
to train a model on existing related datasets and then
adapt it to the new task. It is of interest to leverage
the knowledge acquired from training on one dataset to
improve performance on the same task on a different
dataset, or potentially even a different task. One pos-
sible approach to tackle DA is to use MA, since knowledge can be transferred via the learned inter-domain correspondences or by training on a shared latent representation of both domains.
2 Preliminaries
2.1 Problem Description Assume we have two datasets $X = \{x_1, x_2, \dots, x_n\} \in \mathbb{R}^{n \times p}$ and $Y = \{y_1, y_2, \dots, y_m\} \in \mathbb{R}^{m \times q}$. We assume that all of the points in $X$ are labeled with discrete (i.e. class) labels $L_x = \{\ell_1^x, \dots, \ell_n^x\}$, while the points in $Y$ may be partially or fully labeled with discrete labels $L_y = \{\ell_1^y, \dots, \ell_r^y\}$, with $r \leq m$. In the domain adaptation problem, $X$ is measured from the source domain while $Y$ is measured from the target domain.
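For concreteness, the following minimal Python sketch instantiates the setup above on synthetic data; the sizes, feature dimensions, and number of classes are hypothetical and only serve to fix notation.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 5   # source domain: n points with p features
    m, q = 80, 8    # target domain: m points with q features
    X = rng.normal(size=(n, p))       # source data, fully labeled
    Y = rng.normal(size=(m, q))       # target data, partially labeled
    L_x = rng.integers(0, 3, size=n)  # class labels for all of X
    r = 20                            # only the first r points of Y are labeled
    L_y = rng.integers(0, 3, size=r)  # partial class labels for Y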
The problem consists of learning an alignment be-
tween both data manifolds, by leveraging their respec-
tive geometric structures as well as the label knowledge
available from both domains. There are several pos-
sible ways to represent such an alignment using MA
algorithms. One way is to directly learn hard or soft
correspondences between points in Xand Y. A regres-
sion model could then be trained using these correspon-
dences to learn a parametric mapping between domains.
In the domain adaptation problem, unlabeled data in the target domain can then be labeled via this regression model, transferring the richer label information available in the source domain.
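As an illustration, the sketch below propagates source labels to target points directly through a soft correspondence (coupling) matrix by a weighted vote; this is one generic option and is not presented as MALI's exact transfer rule. The regression-based route described above would instead fit a parametric map on matched pairs.

    import numpy as np

    def label_target_from_coupling(gamma, source_labels, n_classes):
        # gamma: (n, m) soft correspondence matrix between X and Y
        # votes[c, j] = total mass sent from class-c source points to target j
        votes = np.zeros((n_classes, gamma.shape[1]))
        for c in range(n_classes):
            votes[c] = gamma[source_labels == c].sum(axis=0)
        # each target point takes the label receiving the most mass
        return votes.argmax(axis=0)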
A second way to represent the alignment is to
learn a shared embedding space which can be used
for downstream analysis. For the domain adaptation
problem, a classifier could be trained on the shared
embedding space using the labels from X. Direct
correspondences can be learned by, for example, using
a nearest neighbor approach in the shared space.
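A minimal sketch of both uses of a shared embedding, assuming emb_x and emb_y are the source and target coordinates produced by some MA method (scikit-learn is used here for brevity):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

    def classify_and_match(emb_x, emb_y, labels_x, k=5):
        # 1) train a classifier on the embedded source points, label the target
        clf = KNeighborsClassifier(n_neighbors=k).fit(emb_x, labels_x)
        target_labels = clf.predict(emb_y)
        # 2) hard correspondences as cross-domain nearest neighbors
        nn = NearestNeighbors(n_neighbors=1).fit(emb_x)
        _, match = nn.kneighbors(emb_y)  # match[j] = closest source index
        return target_labels, match.ravel()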
As we show in this work, MALI is suited to both of these scenarios. We first find pairwise cross-domain
distances, which are then leveraged to find hard or soft
assignments between the domains via optimal transport.
If required, a shared embedding can be learned using
these assignments.
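The OT step can be illustrated with the POT library: given a precomputed cross-domain distance matrix (how MALI constructs it from diffusion geometry is described later), an entropy-regularized coupling yields soft assignments, while ot.emd would give an unregularized, sparser one. A sketch, with uniform marginals assumed:

    import numpy as np
    import ot  # POT: Python Optimal Transport

    def soft_assignments(cross_dist, reg=1e-1):
        # cross_dist: (n, m) pairwise cross-domain distances
        n, m = cross_dist.shape
        a = np.full(n, 1.0 / n)  # uniform weights on source points
        b = np.full(m, 1.0 / m)  # uniform weights on target points
        # entropy-regularized OT; returns an (n, m) coupling matrix
        return ot.sinkhorn(a, b, cross_dist, reg)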
2.2 Related work Here we summarize two existing
methods that perform manifold alignment using discrete
label information without assuming prior known corre-
spondences. In [35] both datasets are concatenated in a
new block matrix
$$Z = \begin{pmatrix} X & 0 \\ 0 & Y \end{pmatrix} \in \mathbb{R}^{(n+m) \times (p+q)}.$$
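In code, this concatenation is a block-diagonal stack, e.g. (assuming X and Y are NumPy arrays):

    from scipy.linalg import block_diag

    # X: (n, p), Y: (m, q)  ->  Z: (n + m, p + q), as in the equation above
    Z = block_diag(X, Y)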
Domain-specific similarity matrices $W_X$ and $W_Y$ are created from the data, e.g. via a kernel function. These matrices are then similarly combined in a new block matrix $W_Z$. To leverage the label information, the authors create a label-similarity matrix with entries $W_s(i, j) = 1$ if samples $z_i$ and $z_j$ share the same label, and 0 otherwise. A label dissimilarity matrix $W_d(i, j) = |W_s(i, j) - 1|$ is also