A Spectral Method for Assessing and Combining Multiple Data Visualizations

Rong Ma^1, Eric D. Sun^2 and James Zou^2

^1 Department of Statistics, Stanford University

^2 Department of Biomedical Data Science, Stanford University

Abstract

Dimension reduction and data visualization aim to project a high-dimensional dataset to a low-dimensional space while capturing the intrinsic structures in the data. They are an indispensable part of modern data science, and many dimension reduction and visualization algorithms have been developed. However, different algorithms have their own strengths and weaknesses, making it critically important to evaluate their relative performance for a given dataset, and to leverage and combine their individual strengths. In this paper, we propose an efficient spectral method for assessing and combining multiple visualizations of a given dataset produced by diverse algorithms. The proposed method provides a quantitative measure – the visualization eigenscore – of the relative performance of the visualizations for preserving the structure around each data point. Then it leverages the eigenscores to obtain a consensus visualization, which has much improved quality over the individual visualizations in capturing the underlying true data structure. Our approach is flexible and works as a wrapper around any visualizations. We analyze multiple simulated and real-world datasets from diverse applications to demonstrate the effectiveness of the eigenscores for evaluating visualizations and the superiority of the proposed consensus visualization. Furthermore, we establish rigorous theoretical justification of our method based on a general statistical framework, yielding fundamental principles behind the empirical success of consensus visualization along with practical guidance.

KEY WORDS: data visualization; dimension reduction; high-dimensional data; manifold learning; spectral method.

1 INTRODUCTION

Data visualization and dimension reduction is a central topic in statistics and data science, as it facilitates intuitive understanding and global views of high-dimensional datasets and their underlying structural patterns through a low-dimensional embedding of the data (Donoho, 2017; Chen et al., 2020). The past decades have witnessed an explosion in machine learning algorithms for data visualization and dimension reduction. Many of them, such as Laplacian eigenmap (Belkin and Niyogi, 2003), kernel principal component analysis (kPCA) (Schölkopf et al., 1997), t-SNE (van der Maaten and Hinton, 2008), and UMAP (McInnes et al., 2018), have been regarded as indispensable tools and state-of-the-art techniques for generating graphics in academic and professional writing (Chen et al., 2007), and for exploratory data analysis and pattern discovery in many research disciplines, such as astrophysics (Traven et al., 2017), computer vision (Cheng et al., 2015), genetics (Platzer, 2013), molecular biology (Olivon et al., 2018), and especially single-cell transcriptomics (Kobak and Berens, 2019), among others.

arXiv:2210.13711v1 [stat.ML] 25 Oct 2022

However, the wide availability and functional diversity of data visualization methods also bring forth new challenges to data analysts and practitioners (Nonato and Aupetit, 2018; Espadoto et al., 2019). On the one hand, it is critically important to determine which visualization method in the extensive list is most suitable and reliable for embedding a given dataset. In fact, even for a single visualization method, such as t-SNE or UMAP, there are often multiple tuning parameters to be determined by the user, and different tuning parameters may lead to distinct visualizations (Kobak and Linderman, 2021; Cai and Ma, 2021). Thus, for a given dataset, selecting the most suitable visualization method along with its tuning parameters calls for a method that provides quantitative and objective assessment of different visualizations of the dataset. On the other hand, as different methods are usually based on distinct ideas and heuristics, they generate qualitatively diverse visualizations of a dataset, each containing important features of the data that are possibly unique to the visualization method. Meanwhile, due to the noisiness and high-dimensionality of many real-world datasets, their low-dimensional visualizations necessarily contain distortions from the underlying true structures, which again may vary from one visualization to another. It is therefore of substantial practical interest to combine strengths and reach a consensus among multiple data visualizations, in order to obtain an even better “meta-visualization” of the data that captures the most information and is least susceptible to the distortions. Naturally, a meta-visualization would also save practitioners from painstakingly selecting a single visualization method among many.

In this paper, we propose an efficient spectral approach for simultaneously assessing and combining multiple data visualizations produced by diverse dimension reduction/visualization algorithms, allowing for different settings of tuning parameters for individual algorithms. Specifically, the proposed method takes as input a collection of visualizations, or low-dimensional embeddings of a dataset, hereafter referred to as “candidate visualizations,” and summarizes each visualization by a normalized pairwise-distance matrix among the samples. For each sample in the dataset, we construct a comparison matrix from these normalized distance matrices, characterizing the local concordance between each pair of candidate visualizations. Based on the eigen-decomposition of the comparison matrices, we propose a quantitative measure, referred to as the “visualization eigenscore,” that quantifies the relative performance of the candidate visualizations in a sample-wise manner, reflecting their local concordance with the underlying low-dimensional structure contained in the data. To obtain a meta-visualization, the candidate visualizations are combined into a meta-distance matrix, defined as a row-wise weighted average of those normalized distance matrices, using the corresponding eigenscores as the weights. The meta-distance matrix is then used to produce a meta-visualization, based on an existing method such as UMAP or kPCA, which is shown to be more reliable and more informative than the individual candidate visualizations. Our method is schematically summarized in Figure 1 and Algorithm 1, and detailed in Section 2.1. The meta-visualization thus obtained reflects a joint perspective aggregating various aspects of the data that are often captured separately by individual candidate visualizations.
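The steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the comparison matrix is assumed to hold inner products between corresponding rows of the normalized distance matrices, the eigenscores are taken as the absolute entries of its leading eigenvector, and the weights are rescaled to sum to one so that each meta-distance row is literally a weighted average.

```python
import numpy as np

def normalized_distance_matrix(X):
    """Row-normalized Euclidean distance matrix of an (n, d) embedding."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    P = np.sqrt((diff ** 2).sum(axis=-1))           # P[i, j] = ||X_i - X_j||_2
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def eigenscores_and_meta_distance(embeddings):
    """Sample-wise eigenscores (n, K) and meta-distance matrix (n, n).

    Assumption: similarity between candidate visualizations at sample i is the
    inner product of their i-th normalized distance rows.
    """
    P_bars = np.stack([normalized_distance_matrix(X) for X in embeddings])
    K, n, _ = P_bars.shape
    scores = np.zeros((n, K))
    meta = np.zeros((n, n))
    for i in range(n):
        rows = P_bars[:, i, :]                      # (K, n): row i of each matrix
        G = rows @ rows.T                           # (K, K) comparison matrix
        _, eigvecs = np.linalg.eigh(G)              # eigenvalues in ascending order
        u = np.abs(eigvecs[:, -1])                  # leading eigenvector, sign fixed
        scores[i] = u
        meta[i] = (u / u.sum()) @ rows              # eigenscore-weighted average row
    return scores, meta

# Toy data (hypothetical): two faithful candidates and one uninformative one.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
truth = np.repeat(centers, 20, axis=0) + 0.5 * rng.normal(size=(60, 2))
theta = 1.0
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
viz1 = truth + 0.3 * rng.normal(size=(60, 2))
viz2 = 4.0 * (truth @ R.T) + 0.3 * rng.normal(size=(60, 2))   # rotated and rescaled
viz3 = rng.normal(size=(60, 2))                                # pure noise

scores, meta = eigenscores_and_meta_distance([viz1, viz2, viz3])
print(scores.mean(axis=0))   # average eigenscore per candidate visualization
```

In this toy setting the noise-only candidate should receive a lower average eigenscore than the two faithful ones. The resulting `meta` matrix would then be passed to an existing method such as UMAP or kPCA, as described above, to produce the meta-visualization.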

Numerically, through extensive simulations and analysis of multiple real-world datasets with diverse underlying structures, we show the effectiveness of the proposed eigenscores in assessing and ranking a collection of candidate visualizations, and demonstrate the superiority of the final meta-visualization over all the candidate visualizations in terms of identification and characterization of these structural patterns. To achieve a deeper understanding of the proposed method, we also develop a formal statistical framework that rigorously justifies the proposed scoring and meta-visualization method, providing theoretical insights into the fundamental principles behind the empirical success of the method, along with its proper interpretation and guidance on practice.

Figure 1: A graphical illustration of the proposed method. The algorithm takes as input the normalized pairwise distance matrices associated with a collection of candidate visualizations (viz1 to viz4) of a dataset. For each sample of the dataset, we compute the similarity matrix between the rows of the normalized distance matrices associated with the sample (rows highlighted in the same color), and then define the corresponding eigenscores as the first eigenvector of the similarity matrix. The size of the circles in the similarity matrices and the vectors of eigenscores indicate the magnitude of the entries (assumed to be non-negative). The meta-distance matrix is defined such that its rows are the eigenscore-weighted average of the rows in the normalized distance matrices. The meta-distance leads to a meta-visualization, expected to be more concordant with the underlying true structure than individual candidate visualizations.
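In symbols, the pipeline in Figure 1 can be written as below. This is a hedged restatement of the caption, not a formula quoted from the paper: $\bar{P}^{(k)}$ denotes the normalized distance matrix of the $k$-th candidate visualization, $\bar{P}^{(k)}_{i\cdot}$ is its $i$-th row, and the symbols $G_i$, $\hat{u}_i$ and $\bar{P}^{\mathrm{meta}}$ are introduced here for illustration. The caption does not specify how similarity between rows is measured; the inner product is assumed.

```latex
% Similarity (comparison) matrix at sample i, assuming inner-product similarity
G_i = \Big( \big\langle \bar{P}^{(k)}_{i\cdot},\, \bar{P}^{(l)}_{i\cdot} \big\rangle \Big)_{1 \le k, l \le K} \in \mathbb{R}^{K \times K},
\qquad i = 1, \ldots, n.

% Eigenscores at sample i: first eigenvector of G_i, entries taken non-negative
\hat{u}_i = \big( \hat{u}_i(1), \ldots, \hat{u}_i(K) \big)^{\top}, \qquad
G_i \hat{u}_i = \lambda_{\max}(G_i)\, \hat{u}_i .

% i-th row of the meta-distance matrix: eigenscore-weighted combination of rows
\bar{P}^{\mathrm{meta}}_{i\cdot} \;\propto\; \sum_{k=1}^{K} \hat{u}_i(k)\, \bar{P}^{(k)}_{i\cdot} .
```

Because each row $\bar{P}^{(k)}_{i\cdot}$ is a unit vector with non-negative entries, the diagonal of $G_i$ equals one and its off-diagonal entries are cosine similarities in $[0, 1]$, which is consistent with the non-negative entries depicted in the figure.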

1.1 Related Works

Quantitative assessment of dimension reduction and data visualization algorithms is of substantial practical interest, and has been extensively studied in the past two decades. For example, many evaluation methods are based on distortion measures from metric geometry (Abraham et al., 2006, 2009; Chennuru Vankadara and von Luxburg, 2018; Bartal et al., 2019), whereas some other methods rely on information-theoretic precision-recall measures (Venna et al., 2010; Arora et al., 2018), co-ranking structure (Mokbel et al., 2013), or graph-based criteria (Wang et al., 2021; Cai and Ma, 2021). See also Bertini et al. (2011), Nonato and Aupetit (2018) and Espadoto et al. (2019) for recent reviews. However, most of these existing methods evaluate data visualizations by comparing them directly with the original dataset, without accounting for its noisiness. The assessment thus obtained may suffer from intrinsic bias because it ignores the underlying true structures, which are only approximately represented by the noisy observations; see Section 2.3.5 and Supplementary Figure 21 for more discussion. To address this issue, the proposed eigenscores, in contrast, provide provably consistent assessment and ranking of visualizations, reflecting their relative concordance with the underlying noiseless structures in the data.

Compared to the quantitative assessment of data visualizations, there is a scarcity of meta-visualization methods that combine the strengths of multiple data visualizations. In Pagliosa et al. (2015), an interactive method is developed that assesses and combines different multidimensional projection methods via a convex combination technique. In contrast, for supervised learning tasks such as classification, there is a long history of research on designing and developing meta-classifiers that combine multiple classifiers (Woods et al., 1997; Tax et al., 2000; Parisi et al., 2014; Liu et al., 2017; Mohandes et al., 2018). Compared with meta-classification, the main difficulty of meta-visualization lies in the identification of a common space to properly align multiple visualizations, or low-dimensional embeddings, whose scales and coordinate bases may differ drastically from one to another (see, for example, Figures 3-5(a)). Moreover, unlike many meta-classifiers, which combine presumably independent classifiers trained over different datasets, a meta-visualization procedure typically relies on multiple visualizations of the same dataset, and therefore has to deal with a more complicated correlation structure among the visualizations. The current study provides the first meta-visualization method that can flexibly combine any number of visualizations, and has interpretable and provable performance guarantees.

1.2 Main Contributions

The main contributions of the current study can be summarized as follows:

• We propose a computationally efficient spectral method for assessing and combining multiple data visualizations. The method is generic and easy to implement: it does not require knowledge of the original dataset, and can be applied to a large number of data visualizations generated by diverse methods.

• For any collection of visualizations of a dataset, our method provides a quantitative measure – the eigenscore – of the relative performance of the visualizations for preserving the structure around each data point. The eigenscores are useful in their own right for assessing the local and global reliability of a visualization in representing the underlying structures of the data, and in guiding the selection of hyper-parameters.

• The proposed method automatically combines the strengths and ameliorates the weaknesses (distortions) of the candidate visualizations, leading to a meta-visualization which is provably better than all the candidate visualizations under a wide range of settings. We show that the meta-visualization is able to capture diverse intrinsic structures, such as clusters, trajectories, and mixed low-dimensional structures, contained in noisy and high-dimensional datasets.

• We establish rigorous theoretical justifications of the method under a general signal-plus-noise model (Section 2.3) in the large-sample limit. We prove the convergence of the eigenscores to certain underlying true concordance measures, the guaranteed performance of the meta-visualization and its advantages over alternative methods, and its robustness against possible adversarial candidate visualizations, along with their conditions, interpretations, and practical implications.

The proposed method is described in detail in Section 2.1, and empirically illustrated and evaluated in Section 2.2, through extensive simulation studies and analyses of three real-world datasets with diverse underlying structures. In Section 2.3, we show results from our theoretical analysis, which unveils fundamental principles associated with the method, such as the benefits of including qualitatively and functionally diverse candidate visualizations.

2 RESULTS

2.1 Eigenscore and Meta-Visualization Methodology

Throughout, without loss of generality, we assume that for visualization purposes the target embedding is two-dimensional, although our discussion applies to any finite-dimensional embedding.

We consider visualizing a $p$-dimensional dataset $\{Y_i\}_{1 \le i \le n}$ containing $n$ samples. From $\{Y_i\}_{1 \le i \le n}$, suppose we obtain a collection of $K$ (candidate) visualizations of the data, produced by various visualization methods. We denote these visualizations as two-dimensional embeddings $\{X_i^{(k)}\}_{1 \le i \le n} \subset \mathbb{R}^2$ for $k \in \{1, 2, \ldots, K\}$. Our approach only needs access to the low-dimensional embeddings $\{X_i^{(k)}\}_{1 \le i \le n}$ rather than the raw data $\{Y_i\}_{1 \le i \le n}$; as a result, users can apply our method even if they do not have access to the original data, which is often the case.

2.1.1 Measuring Normalized Distances From Each Visualization

To make the proposed method invariant to the respective scale and coordinate basis (i.e., directionality) of the low-dimensional embeddings generated by different visualization methods, we start by considering the normalized pairwise-distance matrix for each visualization.

Specifically, for each $k \in \{1, 2, \ldots, K\}$, we define the normalized pairwise-distance matrix

\[ \bar{P}^{(k)} = [D^{(k)}]^{-1} P^{(k)} \in \mathbb{R}^{n \times n}, \tag{2.5} \]

where

\[ P^{(k)} = \big( \| X_i^{(k)} - X_j^{(k)} \|_2 \big)_{1 \le i, j \le n} \in \mathbb{R}^{n \times n} \tag{2.6} \]

is the un-normalized Euclidean distance matrix, and $D^{(k)} = \mathrm{diag}\big( \|P^{(k)}_{1\cdot}\|_2, \|P^{(k)}_{2\cdot}\|_2, \ldots, \|P^{(k)}_{n\cdot}\|_2 \big)$ is a diagonal matrix whose diagonal entries are the $\ell_2$-norms of the rows $\{P^{(k)}_{1\cdot}, \ldots, P^{(k)}_{n\cdot}\}$ of $P^{(k)}$. As a result, the normalized distance matrix $\bar{P}^{(k)}$ has unit vectors as its rows, and is invariant to any scaling and rotation of the visualization $\{X_i^{(k)}\}_{1 \le i \le n}$.

The normalized distance matrices $\{\bar{P}^{(k)}\}_{1 \le k \le K}$ summarize the candidate visualizations in a compact and efficient way. Their scale- and rotation-invariance properties are particularly useful for comparing visualizations produced by distinct methods.
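The normalization and its invariance can be checked numerically. The following is a minimal NumPy sketch of the normalized pairwise-distance matrix defined above; the function name `normalized_distance_matrix` and the toy data are hypothetical and serve only as an illustration.

```python
import numpy as np

def normalized_distance_matrix(X):
    """Return the row-normalized Euclidean distance matrix of an (n, d) embedding.

    Each row of the pairwise-distance matrix P is divided by its l2-norm,
    which corresponds to left-multiplying P by the inverse of the diagonal
    matrix of row norms. Assumes the n points are not all identical, so
    that no row norm is zero.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    P = np.sqrt((diff ** 2).sum(axis=-1))                 # P[i, j] = ||X_i - X_j||_2
    row_norms = np.linalg.norm(P, axis=1, keepdims=True)  # diagonal entries of D
    return P / row_norms

# Toy embedding and a rotated, rescaled copy of it.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_transformed = 3.5 * (X @ R.T)

P_bar = normalized_distance_matrix(X)
P_bar_transformed = normalized_distance_matrix(X_transformed)

print(np.allclose(np.linalg.norm(P_bar, axis=1), 1.0))   # rows are unit vectors
print(np.allclose(P_bar, P_bar_transformed))             # unchanged by rotation and scaling
```

Rotation leaves every pairwise distance unchanged and scaling multiplies each row of the distance matrix by the same constant, so the row normalization removes both effects.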
