Review of Clustering Methods for Functional Data
Mimi Zhang1,3, Andrew Parnell2,3
1School of Computer Science and Statistics, Trinity College Dublin, Ireland
2Hamilton Institute, Maynooth University, Ireland
3I-Form Advanced Manufacturing Research Centre, Science Foundation Ireland, Ireland
Abstract
Functional data clustering aims to identify heterogeneous morphological patterns in
the continuous functions underlying discrete measurements/observations. Applications
of functional data clustering have appeared in many publications across various
fields of science, including but not limited to biology, (bio)chemistry, engineering,
environmental science, medical science, psychology, and social science. The phenomenal
growth of the application of functional data clustering indicates the urgent need for a
systematic approach to developing efficient clustering methods and scalable algorithmic
implementations. On the other hand, there is abundant literature on the cluster analysis
of time series, trajectory data, spatio-temporal data, etc., which are all related to
functional data. Therefore, an overarching structure of existing functional data
clustering methods will enable the cross-pollination of ideas across various research fields.
We here conduct a comprehensive review of original clustering methods for functional
data. We propose a systematic taxonomy that explores the connections and differences
among the existing functional data clustering methods and relates them to the
conventional multivariate clustering methods. The structure of the taxonomy is built
on three main attributes of a functional data clustering method and therefore is more
reliable than existing categorizations. The review aims to bridge the gap between the
functional data analysis community and the clustering community and to generate new
principles for functional data clustering.
Keywords: curve registration, dependent functional data, multivariate functional
data, shape analysis
arXiv:2210.00847v1 [stat.ME] 3 Oct 2022
1 Introduction
With the advancement of data-collection technology, a wide range of industry and business
sectors are now able to collect functional data. According to Ramsay and Silverman [1], a
functional datum is not an individual value but rather a set of measurements/observations
along a continuum that, taken together, are to be regarded as a single entity. Functional
data come in many forms, but their defining quality is that they consist of functions – often,
but not always, curves. For example, spectroscopic techniques obtain spectral information
by probing each sample with electromagnetic radiation that varies in a range of wavelengths,
and hence the calculated absorption coefficient is a function of wavelength. When a
sample is probed at different wavelengths, the resulting set of absorption coefficients forms one data unit. Another
example of functional data is an fMRI time series, consisting of a time series of 3D images
of the living human brain, where each 3D image consists of a large number of voxels (3D
pixels). For example, the prevalent BOLD fMRI detects the blood-oxygen-level-dependent
signal that reflects changes in deoxyhemoglobin, driven by localized changes in brain blood
flow and blood oxygenation. Each 3D image is a functional datum (or, equivalently, a
random field). Paradigmatic formats of functional data include time series, trajectories,
spatio-temporal data, etc. However, the term “functional” is not the defining quality of time
series, trajectories, or spatio-temporal data. Ansari et al. [2] classified spatio-temporal data
into five types, according to which certain types of spatio-temporal data are not functional
data. Apart from the difference in the definitions of data format, the main difference is in
the focus of statistical analysis: the focus of functional data analysis is on analyzing relations
among the random elements, rather than properties of individual random elements.
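To make the notion of a functional datum concrete, the following is a minimal sketch of one possible in-memory representation: each data unit stores the whole set of observations along the continuum together with the grid on which they were taken. The class name `FunctionalDatum`, its `evaluate` method, the use of linear interpolation, and the synthetic spectra are all assumptions made for illustration; they are not taken from the reviewed literature.

```python
import bisect
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FunctionalDatum:
    """One data unit: discrete observations of a single underlying function."""
    grid: Sequence[float]    # strictly increasing points on the continuum (e.g. wavelengths)
    values: Sequence[float]  # measurement taken at each grid point

    def __post_init__(self):
        if len(self.grid) != len(self.values) or len(self.grid) < 2:
            raise ValueError("grid and values need equal length of at least 2")

    def evaluate(self, x: float) -> float:
        """Linearly interpolate between neighbouring observations (illustrative only)."""
        if not self.grid[0] <= x <= self.grid[-1]:
            raise ValueError("x lies outside the observed range")
        i = bisect.bisect_left(self.grid, x)
        if i == 0:
            return self.values[0]
        x0, x1 = self.grid[i - 1], self.grid[i]
        w = (x - x0) / (x1 - x0)
        return (1 - w) * self.values[i - 1] + w * self.values[i]


# Synthetic spectra (made-up numbers): absorbance peaks at different wavelengths.
wavelengths = [400.0 + 10.0 * i for i in range(31)]  # 400 to 700 nm
sample = [
    FunctionalDatum(wavelengths, [math.exp(-((w - peak) / 40.0) ** 2) for w in wavelengths])
    for peak in (480.0, 550.0, 620.0)
]
```

Here `sample` holds three data units, not 93 scalar observations: each spectrum is treated as a single entity, which is the defining quality described above.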
While functional data analysis has received attention from statisticians since the 1980s,
there has been comparatively little advancement in the area of functional data clustering.
Within two databases, Scopus and Web of Science, we found only about 100 articles on
developing clustering methods for functional data. (In the appendix, we give the details
on the identification of relevant literature and the article selection process. We also
provide a table that implements the classification of the reviewed articles according to
our taxonomy.) Moreover, nearly all documented methods tackle only the functional-data
part of the problem, not the clustering part. For example, many studies mainly concern
extracting a tabular-data proxy for functional data, ignoring the synergy between the
feature-learning (a.k.a. representation-learning) step and the clustering step. The main
objective of our review is to develop an overarching structure of existing functional data
clustering methods, which highlights the similarities and differences among them and
their connections with conventional multivariate clustering methods.
We point to a few good references that give excellent coverage of state-of-the-art clustering
methods for relevant data types (i.e., time series, trajectory data, and spatio-temporal data).
We also suggest a new methodological framework that addresses the primary deficiency in
the current tandem approach. The review will also help connect the machine learning and
computer science communities with the challenges and opportunities in analyzing functional
data.
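The tandem approach mentioned above proceeds in sequence: sample paths are smoothed, a tabular-data proxy is extracted, and a conventional clustering method is applied to that proxy. The sketch below illustrates one such pipeline under stated assumptions: a truncated Fourier basis for smoothing and feature extraction, and plain k-means (Lloyd's algorithm with a deterministic farthest-point initialization) for clustering. These choices, the function names, and the synthetic curves are illustrative only and are not prescribed by the review.

```python
import math
import random


def fourier_basis(t, n_harmonics):
    """Evaluate 1, sqrt(2)cos(2*pi*k*t), sqrt(2)sin(2*pi*k*t), k = 1..n_harmonics."""
    row = [1.0]
    for k in range(1, n_harmonics + 1):
        row.append(math.sqrt(2) * math.cos(2 * math.pi * k * t))
        row.append(math.sqrt(2) * math.sin(2 * math.pi * k * t))
    return row


def basis_coefficients(values, n_harmonics):
    """Smoothing and feature extraction for a curve observed on the grid j/n.

    On this uniform grid the basis is orthonormal under the discrete inner
    product (for n_harmonics < n/2), so least-squares coefficients are averages.
    """
    n = len(values)
    rows = [fourier_basis(j / n, n_harmonics) for j in range(n)]
    return [sum(values[j] * rows[j][c] for j in range(n)) / n
            for c in range(2 * n_harmonics + 1)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, n_iter=20):
    """Clustering step: Lloyd's algorithm on the extracted tabular data."""
    centroids = [list(points[0])]
    while len(centroids) < k:  # farthest-point initialization (deterministic)
        far = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Synthetic sample paths: two underlying shapes observed with noise.
rng = random.Random(0)
n_points, n_per_group = 50, 10
grid = [j / n_points for j in range(n_points)]
shapes = [lambda t: math.sin(2 * math.pi * t), lambda t: math.cos(4 * math.pi * t)]
curves = [[f(t) + rng.gauss(0.0, 0.1) for t in grid]
          for f in shapes for _ in range(n_per_group)]

features = [basis_coefficients(y, n_harmonics=2) for y in curves]  # tabular data
labels = kmeans(features, k=2)
```

Note that the feature-extraction step here is chosen without any reference to the clustering step that follows, which is exactly the lack of synergy that the review identifies as the primary deficiency of the tandem approach.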
[Figure 1 (diagram): sample paths → step 1 → smooth functions → step 2 → tabular data → step 3 → clustering (upper line); smooth functions → step 2* → clustering (bottom line).]
Figure 1: Functional data clustering methods can be categorized into two major groups,
according to whether the clustering method is applied to the extracted tabular data (steps
1, 2 and 3) or to the estimated smooth functions (step 1 and step 2*). In the upper line
approach, cluster analysis is performed in a finite-dimensional space, while in the bottom
line approach, cluster analysis is performed in an infinite-dimensional space.
Figure 1 depicts the tandem approach adopted in the current practice of functional data
clustering, and Figure 2 illustrates our taxonomy. Functional data clustering methods can
be categorized (Tier 1 categorization) according to whether the clustering method is applied
to the extracted tabular data (i.e., in a finite-dimensional space) or to the estimated smooth
functions (i.e., in an infinite-dimensional space). Then within each major category, clustering
methods can be further categorized (Tier 2 categorization) according to the definition
of (dis)similarity, the definition of cluster, and/or algorithmic features. In particular, in the
upper pipeline of Figure 1, clustering methods can be classified into “hierarchical clustering”,
“model-based clustering”, “centroid-based clustering”, “density-based clustering”, “spectral
clustering”, etc. In the bottom pipeline, clustering methods can be classified into “subspace
clustering”, “nonparametric Bayesian”, “density-based clustering”, “new (dis)similarity”,
etc. Finally, in the Tier 3 categorization, clustering methods are grouped according to the
way they deal with phase variation and/or amplitude variation. In the random-effects
category, phase variation and amplitude variation are characterized by a few random parameters
in the function expression; for example, y = y(at + b), where t is the argument, and the
random parameters a and b capture the phase variation. In the (non)parametric category,
the time-warping functions admit either a parametric model or a nonparametric model. In
the equivalence-relation category, two functions are equivalent if they can be transformed
into each other by, e.g., a linear time-warping function. Our three-tier categorization
provides a well-conceived and useful taxonomy in that it frames the three defining features of
functional data clustering methods: dimensionality reduction, clustering strategy, and curve
registration.
Figure 2: The three-tier categorization of existing functional data clustering methods. The
first tier categorization concerns the dimension of the direct input to a clustering method, the
second tier categorization is based on the characteristics of the clustering method, and the
third tier categorization is to highlight the different strategies that deal with phase variation
and/or amplitude variation. Methods highlighted in green and blue constitute the vast
majority of the literature and are respectively reviewed in Section 3 and Section 4. Methods
highlighted in grey explicitly address the phase variation and/or amplitude variation in their
clustering methods and are reviewed in Section 7.
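The difference between phase variation and amplitude variation can be illustrated numerically. The sketch below, which is only an illustration under assumed settings, takes a hypothetical bump-shaped template, applies the affine warp y(at + b) from the random-effects example above with a = 1 and b = 0.2, and compares the resulting L2 distance with the distance caused by a 20% amplitude reduction. The template, the parameter values, and the function names are assumptions chosen for the example.

```python
import math


def template(t):
    """A hypothetical common shape: a narrow bump centred at t = 0.5."""
    return math.exp(-((t - 0.5) ** 2) / 0.01)


def warped(t, a, b):
    """Random-effects form y(a*t + b): same shape, different timing."""
    return template(a * t + b)


def shifted(t):
    """Phase variation only: the bump now peaks at t = 0.3."""
    return warped(t, a=1.0, b=0.2)


def damped(t):
    """Amplitude variation only: same timing, 80% of the height."""
    return 0.8 * template(t)


def l2_distance(f, g, n=1000):
    """Approximate the L2 distance on [0, 1] by a midpoint Riemann sum."""
    h = 1.0 / n
    total = sum((f((j + 0.5) * h) - g((j + 0.5) * h)) ** 2 for j in range(n))
    return math.sqrt(total * h)


d_phase = l2_distance(template, shifted)  # identical shape, shifted in time
d_amp = l2_distance(template, damped)     # identical timing, scaled in height
```

In this setting the time-shifted curve lies farther from the template than the damped curve does, even though its shape is unchanged. A clustering method that relies on the plain L2 distance may therefore separate curves that differ only in timing, which is why the handling of phase variation is treated as a defining feature in the third tier.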
There have been a few attempts at devising taxonomic categories for functional data clustering
methods. The short survey given by Jacques and Preda [3] classifies a few conventional
functional data clustering methods into three categories. Chamroukhi and Nguyen [4] reviewed
a few articles that differ in the way of extracting tabular data but all apply the model-based
clustering technique to the extracted tabular data. Cheam and Fredette [5] reviewed a few
functional data clustering methods and categorized them according to whether they allow
amplitude variation and/or phase variation within clusters. We note that, while a few
functional data clustering methods explicitly deal with phase variation, the majority of functional
data clustering methods adopt the convention that phase variation, whether relevant or not
to the clustering problem, will be identified in the pre-processing step. Hence, the categories
provided by [5] are too broad to guide future work. By contrast, our three-tier
categorization provides considerably more information. Moreover, none of the above surveys
is as comprehensive as this review. Ullah and Finch [6] conducted a systematic
overview of applications of functional data analysis, covering all articles published during
1995 – 2010. Cuevas [7] provided a good survey of the current theory and statistics of
functional data analysis. Finally, while there is limited literature in the field of functional data
clustering, there is abundant literature on clustering time series, trajectory data, or
spatio-temporal data. Readers are referred to the following recent surveys for cross-pollination of
insights and ideas: Zheng [8] for trajectory data, Aghabozorgi et al. [9] for time series, and
Atluri et al. [10], Ansari et al. [2] and Wang et al. [11] for spatio-temporal data.
The novelty of functional data clustering obliges us to start by clarifying the terminology
in Section 2. The majority of the different functional data clustering methods are explained
in Sections 3 and 4, while Sections 5 and 6 are respectively dedicated to the clustering methods
for vector-valued functional data and dependent functional data, which are two demanding
tasks in this field. All the methods reviewed in Sections 3 to 6 belong to the “pre-processing”
category in Tier 3 categorization. Only a few articles, reviewed in Section 7, explicitly
address the phase variation problem in their clustering methods. We conclude our review
by presenting in Section 8 a new methodological framework that aims at maximizing the
synergy among the sequential steps in a functional data clustering method. The layout of
our overview in each section is consistent with the hierarchy of our taxonomy. However, we
may explain an original work and its follow-up or relevant works together, to avoid repeating
the problem context and to provide an integrated view. Table 2 in the appendix delineates
the classification of all the reviewed publications according to our taxonomy.
2 Preliminaries
The notion “random function” is a natural generalization of the notion “random variable”.
Let T denote a compact set in a topological space of dimension d (≥ 1). For example, T can
be an interval or a manifold. A random function Y is defined on a probability space (Ω, ℱ, P)
and takes values in an infinite-dimensional space 𝒴. Most theoretical developments require