A Spectral Method for Assessing and Combining
Multiple Data Visualizations
Rong Ma1, Eric D. Sun2 and James Zou2
Department of Statistics, Stanford University1
Department of Biomedical Data Science, Stanford University2
Abstract
Dimension reduction and data visualization aim to project a high-dimensional dataset to a
low-dimensional space while capturing the intrinsic structures in the data. They are an
indispensable part of modern data science, and many dimension reduction and visualization
algorithms have been developed. However, different algorithms have their own strengths and
weaknesses, making it critically important to evaluate their relative performance for a given
dataset, and to leverage and combine their individual strengths. In this paper, we propose an
efficient spectral method for assessing and combining multiple visualizations of a given dataset
produced by diverse algorithms. The proposed method provides a quantitative measure, the
visualization eigenscore, of the relative performance of the visualizations for preserving the
structure around each data point. It then leverages the eigenscores to obtain a consensus
visualization, which has much improved quality over the individual visualizations in capturing
the underlying true data structure. Our approach is flexible and works as a wrapper around any
visualizations. We analyze multiple simulated and real-world datasets from diverse applications
to demonstrate the effectiveness of the eigenscores for evaluating visualizations and the
superiority of the proposed consensus visualization. Furthermore, we establish rigorous
theoretical justification of our method based on a general statistical framework, yielding
fundamental principles behind the empirical success of consensus visualization along with
practical guidance.
KEY WORDS: data visualization; dimension reduction; high-dimensional data; manifold learning;
spectral method.
1 INTRODUCTION
Data visualization and dimension reduction are central topics in statistics and data science, as
they facilitate intuitive understanding and global views of high-dimensional datasets and their
underlying structural patterns through a low-dimensional embedding of the data (Donoho, 2017;
Chen et al., 2020). The past decades have witnessed an explosion in machine learning algorithms
for data visualization and dimension reduction. Many of them, such as Laplacian eigenmap (Belkin
and Niyogi, 2003), kernel principal component analysis (kPCA) (Schölkopf et al., 1997), t-SNE
(van der Maaten and Hinton, 2008), and UMAP (McInnes et al., 2018), have been regarded as
indispensable tools and state-of-the-art techniques for generating graphics in academic and
professional writings (Chen et al., 2007), and for exploratory data analysis and pattern
discovery in many research disciplines, such as astrophysics (Traven et al., 2017), computer
vision (Cheng et al., 2015), genetics (Platzer, 2013), molecular biology (Olivon et al., 2018),
and especially single-cell transcriptomics (Kobak and Berens, 2019), among others.
arXiv:2210.13711v1 [stat.ML] 25 Oct 2022
However, the wide availability and functional diversity of data visualization methods also bring
forth new challenges to data analysts and practitioners (Nonato and Aupetit, 2018; Espadoto et al.,
2019). On the one hand, it is critically important to determine which visualization method among
the extensive list is most suitable and reliable for embedding a given dataset. In fact, even
for a single visualization method, such as t-SNE or UMAP, there are often multiple tuning
parameters to be determined by the users, and different tuning parameters may lead to distinct
visualizations (Kobak and Linderman, 2021; Cai and Ma, 2021). Thus, for a given dataset, selecting
the most suitable visualization method along with its tuning parameters calls for a method
that provides quantitative and objective assessment of different visualizations of the dataset. On
the other hand, as different methods are usually based on distinct ideas and heuristics, they
generate qualitatively diverse visualizations of a dataset, each containing important features of
the data that are possibly unique to the visualization method. Meanwhile, due to the noisiness and
high dimensionality of many real-world datasets, their low-dimensional visualizations necessarily
contain distortions from the underlying true structures, which again may vary from one
visualization to another. It is therefore of substantial practical interest to combine strengths
and reach a consensus among multiple data visualizations, in order to obtain an even better
“meta-visualization” of the data that captures the most information and is least susceptible to
the distortions. Naturally, a meta-visualization would also save practitioners from painstakingly
selecting a single visualization method among many.
In this paper, we propose an efficient spectral approach for simultaneously assessing and
combining multiple data visualizations produced by diverse dimension reduction/visualization
algorithms, allowing for different settings of tuning parameters for individual algorithms.
Specifically, the proposed method takes as input a collection of visualizations, or
low-dimensional embeddings of a dataset, hereafter referred to as “candidate visualizations,” and
summarizes each visualization by a normalized pairwise-distance matrix among the samples. With
respect to each sample in the dataset, we construct a comparison matrix from these normalized
distance matrices, characterizing the local concordance between each pair of candidate
visualizations. Based on eigen-decomposition of the comparison matrices, we propose a quantitative
measure, referred to as the “visualization eigenscore,” that quantifies the relative performance
of the candidate visualizations in a sample-wise manner, reflecting their local concordance with
the underlying low-dimensional structure contained in the data. To obtain a meta-visualization,
the candidate visualizations are combined into a meta-distance matrix, defined as a row-wise
weighted average of those normalized distance matrices, using the corresponding eigenscores as the
weights. The meta-distance matrix is then used to produce a meta-visualization, based on an
existing method such as UMAP or kPCA, which is shown to be more reliable and more informative
than the individual candidate visualizations. Our method is schematically summarized in Figure 1
and Algorithm 1, and detailed in Section 2.1. The resulting meta-visualization reflects a joint
perspective aggregating various aspects of the data that are often captured separately by
individual candidate visualizations.
Numerically, through extensive simulations and analysis of multiple real-world datasets with
diverse underlying structures, we show the effectiveness of the proposed eigenscores in assessing
and ranking a collection of candidate visualizations, and demonstrate the superiority of the final
meta-visualization over all the candidate visualizations in terms of identification and
characterization of these structural patterns. To achieve a deeper understanding of the proposed
method, we also develop a formal statistical framework that rigorously justifies the proposed
scoring and meta-visualization method, providing theoretical insights into the fundamental
principles behind the empirical success of the method, along with its proper interpretation and
guidance on practice.
Figure 1: A graphical illustration of the proposed method. The algorithm takes as input the
normalized pairwise distance matrices associated with a collection of candidate visualizations
(viz1 to viz4) of a dataset. For each sample of the dataset, we compute the similarity matrix
between the rows of the normalized distance matrices associated with the sample (rows highlighted
in the same color), and then define the corresponding eigenscores as the first eigenvector of the
similarity matrix. The size of the circles in the similarity matrices and the vectors of
eigenscores indicate the magnitude of the entries (assumed to be non-negative). The meta-distance
matrix is defined such that its rows are the eigenscore-weighted average of the rows in the
normalized distance matrices. The meta-distance leads to a meta-visualization, expected to be more
concordant with the underlying true structure than individual candidate visualizations.
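The per-sample computation described in the caption can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation: it assumes the similarity between two rows is their inner product, takes the leading eigenvector by power iteration, and divides the weights by their sum so that each meta-distance row is a weighted average. The function names and the toy embeddings are hypothetical, and only the Python standard library is used so that the block runs as written.

```python
import math

def normalized_distance_matrix(points):
    """Row-normalized Euclidean distance matrix of a 2-D embedding."""
    P = [[math.dist(a, b) for b in points] for a in points]
    out = []
    for row in P:
        norm = math.sqrt(sum(d * d for d in row))
        out.append([d / norm for d in row])
    return out

def top_eigenvector(S, iters=500):
    """Leading eigenvector of a symmetric non-negative matrix (power iteration)."""
    K = len(S)
    v = [1.0 / math.sqrt(K)] * K  # positive start keeps every entry non-negative
    for _ in range(iters):
        w = [sum(S[a][b] * v[b] for b in range(K)) for a in range(K)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def eigenscores_and_meta_distance(P_bars):
    """P_bars: list of K row-normalized n x n distance matrices.

    Returns (scores, meta): scores[i][k] is the eigenscore of visualization k
    at sample i, and meta[i] is the eigenscore-weighted average of row i.
    """
    K, n = len(P_bars), len(P_bars[0])
    scores, meta = [], []
    for i in range(n):
        rows = [P_bars[k][i] for k in range(K)]
        # Assumption: similarity = inner product between the unit-norm rows.
        S = [[sum(x * y for x, y in zip(rows[a], rows[b])) for b in range(K)]
             for a in range(K)]
        u = top_eigenvector(S)
        total = sum(u)  # assumption: normalize weights to sum to one
        scores.append(u)
        meta.append([sum(u[k] * rows[k][j] for k in range(K)) / total
                     for j in range(n)])
    return scores, meta

# Toy input: two concordant visualizations and one distorted visualization.
base = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0), (2.0, 3.0)]
rotated = [(-y, x) for x, y in base]          # same structure, rotated by 90 degrees
stretched = [(x, 10.0 * y) for x, y in base]  # distorted along one axis
P_bars = [normalized_distance_matrix(v) for v in (base, rotated, stretched)]
scores, meta = eigenscores_and_meta_distance(P_bars)
```

In this toy setting the two concordant visualizations receive equal eigenscores at every sample, and the stretched one receives a smaller score, so it contributes less to each row of the meta-distance matrix.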
1.1 Related Works
Quantitative assessment of dimension reduction and data visualization algorithms is of substantial
practical interest, and has been extensively studied in the past two decades. For example, many
evaluation methods are based on distortion measures from metric geometry (Abraham et al., 2006,
2009; Chennuru Vankadara and von Luxburg, 2018; Bartal et al., 2019), whereas other methods rely
on information-theoretic precision-recall measures (Venna et al., 2010; Arora et al., 2018),
co-ranking structure (Mokbel et al., 2013), or graph-based criteria (Wang et al., 2021; Cai and Ma,
2021). See also Bertini et al. (2011), Nonato and Aupetit (2018) and Espadoto et al. (2019) for
recent reviews. However, most of these existing methods evaluate data visualizations by comparing
them directly with the original dataset, without accounting for its noisiness. The resulting
assessment may suffer from intrinsic bias because it ignores the underlying true structures, which
are only approximately represented by the noisy observations; see Section 2.3.5 and Supplementary
Figure 21 for more discussion. To address this issue, the proposed eigenscores, in contrast,
provide provably consistent assessment and ranking of visualizations reflecting their relative
concordance with the underlying noiseless structures in the data.
Compared to the quantitative assessment of data visualizations, there is a scarcity of
meta-visualization methods that combine strengths of multiple data visualizations. In Pagliosa
et al. (2015), an interactive method is developed that assesses and combines different
multidimensional projection methods via a convex combination technique. By contrast, for
supervised learning tasks such as classification, there is a long history of research on designing
and developing meta-classifiers that combine multiple classifiers (Woods et al., 1997; Tax et al.,
2000; Parisi et al., 2014; Liu et al., 2017; Mohandes et al., 2018). Compared with
meta-classification, the main difficulty of meta-visualization lies in the identification of a
common space to properly align multiple visualizations, or low-dimensional embeddings, whose
scales and coordinate bases may differ drastically from one to another (see, for example, Figures
3-5(a)). Moreover, unlike many meta-classifiers, which combine presumably independent classifiers
trained over different datasets, a meta-visualization procedure typically relies on multiple
visualizations of the same dataset, and therefore has to deal with a more complicated correlation
structure among the visualizations. The current study provides the first meta-visualization
method that can flexibly combine any number of visualizations, and has interpretable and provable
performance guarantees.
1.2 Main Contributions
The main contributions of the current study can be summarized as follows:
• We propose a computationally efficient spectral method for assessing and combining multiple
  data visualizations. The method is generic and easy to implement: it does not require
  knowledge of the original dataset, and can be applied to a large number of data visualizations
  generated by diverse methods.
• For any collection of visualizations of a dataset, our method provides a quantitative measure,
  the eigenscore, of the relative performance of the visualizations for preserving the structure
  around each data point. The eigenscores are useful in their own right for assessing the local
  and global reliability of a visualization in representing the underlying structures of the data,
  and in guiding selection of hyper-parameters.
• The proposed method automatically combines strengths and ameliorates weaknesses (distortions)
  of the candidate visualizations, leading to a meta-visualization which is provably better
  than all the candidate visualizations under a wide range of settings. We show that the
  meta-visualization is able to capture diverse intrinsic structures, such as clusters,
  trajectories, and mixed low-dimensional structures, contained in noisy and high-dimensional
  datasets.
• We establish rigorous theoretical justifications of the method under a general signal-plus-noise
  model (Section 2.3) in the large-sample limit. We prove the convergence of the eigenscores
  to certain underlying true concordance measures, the guaranteed performance of the
  meta-visualization and its advantages over alternative methods, and its robustness against
  possible adversarial candidate visualizations, along with their conditions, interpretations,
  and practical implications.
The proposed method is described in detail in Section 2.1, and empirically illustrated and
evaluated in Section 2.2, through extensive simulation studies and analyses of three real-world
datasets with diverse underlying structures. In Section 2.3, we show results from our theoretical
analysis, which unveils fundamental principles associated with the method, such as the benefits of
including qualitatively and functionally diverse candidate visualizations.
2 RESULTS
2.1 Eigenscore and Meta-Visualization Methodology
Throughout, without loss of generality, we assume that for visualization purposes the target
embedding is two-dimensional, although our discussion applies to any finite-dimensional embedding.
We consider visualizing a $p$-dimensional dataset $\{Y_i\}_{1\le i\le n}$ containing $n$ samples.
From $\{Y_i\}_{1\le i\le n}$, suppose we obtain a collection of $K$ (candidate) visualizations of
the data, produced by various visualization methods. We denote these visualizations as
two-dimensional embeddings $\{X_i^{(k)}\}_{1\le i\le n} \subset \mathbb{R}^2$ for
$k \in \{1, 2, \dots, K\}$. Our approach only needs access to the low-dimensional embeddings
$\{X_i^{(k)}\}_{1\le i\le n}$ rather than the raw data $\{Y_i\}_{1\le i\le n}$; as a result, the
users can use our method even if they don’t have access to the original data, which is often the
case.
2.1.1 Measuring Normalized Distances From Each Visualization
To make the proposed method invariant to the respective scale and coordinate basis (i.e.,
directionality) of the low-dimensional embeddings generated by different visualization methods,
we start by considering the normalized pairwise-distance matrix for each visualization.
Specifically, for each $k \in \{1, 2, \dots, K\}$, we define the normalized pairwise-distance matrix
$$\bar{P}^{(k)} = [D^{(k)}]^{-1} P^{(k)} \in \mathbb{R}^{n \times n}, \tag{2.5}$$
where
$$P^{(k)} = \big(\|X_i^{(k)} - X_j^{(k)}\|_2\big)_{1\le i,j\le n} \in \mathbb{R}^{n \times n} \tag{2.6}$$
is the un-normalized Euclidean distance matrix, and
$D^{(k)} = \mathrm{diag}(\|P^{(k)}_{1\cdot}\|_2, \|P^{(k)}_{2\cdot}\|_2, \dots, \|P^{(k)}_{n\cdot}\|_2)$
is a diagonal matrix with its diagonal entries being the $\ell_2$-norms of the rows
$\{P^{(k)}_{1\cdot}, \dots, P^{(k)}_{n\cdot}\}$ of $P^{(k)}$.
As a result, the normalized distance matrix $\bar{P}^{(k)}$ has its rows being unit vectors, and is
invariant to any scaling and rotation of the visualization $\{X_i^{(k)}\}_{1\le i\le n}$.
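The invariance can be verified in three lines. As a short worked derivation, consider a hypothetical transformed embedding $\tilde{X}_i^{(k)} = c R X_i^{(k)}$, where the scale $c > 0$ and the orthogonal matrix $R \in \mathbb{R}^{2\times 2}$ are symbols introduced here for illustration only, and tildes mark the quantities computed from the transformed embedding:

```latex
\begin{align*}
\tilde{P}^{(k)}_{ij} &= \big\| cR X_i^{(k)} - cR X_j^{(k)} \big\|_2
  = c\,\big\| X_i^{(k)} - X_j^{(k)} \big\|_2 = c\,P^{(k)}_{ij}, \\
\tilde{D}^{(k)} &= \mathrm{diag}\big(\|c P^{(k)}_{1\cdot}\|_2, \dots, \|c P^{(k)}_{n\cdot}\|_2\big)
  = c\,D^{(k)}, \\
[\tilde{D}^{(k)}]^{-1} \tilde{P}^{(k)} &= [c\,D^{(k)}]^{-1}\, c\,P^{(k)} = \bar{P}^{(k)}.
\end{align*}
```

The first line uses the fact that an orthogonal matrix preserves Euclidean norms; the factor $c$ then cancels between the distance matrix and its row norms.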
The normalized distance matrices $\{\bar{P}^{(k)}\}_{1\le k\le K}$ summarize the candidate
visualizations in a compact and efficient way. Their scale- and rotation-invariance properties are
particularly useful for comparing visualizations produced by distinct methods.
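As a concrete illustration of this normalization step, the sketch below builds the row-normalized distance matrix for a small two-dimensional embedding and checks numerically that rotating and rescaling the embedding leaves the matrix unchanged. It is a minimal example under stated assumptions, not the authors' code: the function names, the toy coordinates, and the chosen angle and scale are all hypothetical, and only the Python standard library is used.

```python
import math

def normalized_distance_matrix(points):
    """Row-normalized Euclidean distance matrix of a 2-D embedding.

    `points` is a list of (x, y) tuples. Each row of the returned n x n
    matrix has unit Euclidean norm.
    """
    # Un-normalized pairwise Euclidean distances.
    P = [[math.dist(a, b) for b in points] for a in points]
    P_bar = []
    for row in P:
        norm = math.sqrt(sum(d * d for d in row))  # row norm (diagonal entry of D)
        if norm == 0.0:
            raise ValueError("all points coincide; row norm is zero")
        P_bar.append([d / norm for d in row])
    return P_bar

def rotate_and_scale(points, angle, scale):
    """Rotate every point by `angle` radians, then multiply by `scale`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(scale * (c * x - s * y), scale * (s * x + c * y)) for x, y in points]

# Hypothetical toy embedding with four samples.
embedding = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
A = normalized_distance_matrix(embedding)
B = normalized_distance_matrix(rotate_and_scale(embedding, angle=0.7, scale=5.0))

# Largest entry-wise difference between the two normalized matrices.
max_gap = max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))
row_norms = [math.sqrt(sum(d * d for d in row)) for row in A]
```

Here `max_gap` is zero up to floating-point error and every entry of `row_norms` equals one, matching the unit-row and invariance properties stated above.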