Clustering Plasma Concentration-Time Curves:
Applications of Unsupervised Learning in
Pharmacogenomics
Jackson P. Lautier∗, Stella Grosser†, Jessica Kim†, Hyewon Kim‡,
Junghi Kim†§
Abstract
Pharmaceutical researchers are continually searching for techniques to improve
both drug development processes and patient outcomes. An area of recent interest
is the potential for machine learning (ML) applications within pharmacology. One
such application not yet given close study is the unsupervised clustering of plasma
concentration-time curves, hereafter, pharmacokinetic (PK) curves. In this paper, we
present our findings on how to cluster PK curves by their similarity. Specifically, we
find clustering to be effective at identifying similar-shaped PK curves and informative
for understanding patterns within each cluster of PK curves. Because PK curves are
time series data objects, our approach utilizes the extensive body of research related
to the clustering of time series data as a starting point. As such, we examine many
dissimilarity measures between time series data objects to find those most suitable for
PK curves. We identify Euclidean distance as generally most appropriate for clustering
PK curves, and we further show that dynamic time warping, Fréchet, and
structure-based measures of dissimilarity like correlation may produce unexpected results. As
an illustration, we apply these methods in a case study with 250 PK curves used in
a previous pharmacogenomic study. Our case study finds that an unsupervised ML
clustering with Euclidean distance, without any subject genetic information, is able to
independently validate the same conclusions as the reference pharmacogenomic results.
To our knowledge, this is the first such demonstration. Further, the case study demon-
strates how the clustering of PK curves may generate insights that could be difficult
to perceive solely with population level summary statistics of PK metrics.
Keywords: CYP2C19, distance metrics, hierarchical clustering, precision medicine
∗Department of Mathematical Sciences, Bentley University, Waltham, MA, USA
†Office of Biostatistics, Center for Drug Evaluation and Research, U.S. Food and Drug Administration,
Silver Spring, MD, USA
‡Office of Clinical Pharmacology, Center for Drug Evaluation and Research, U.S. Food and Drug
Administration, Silver Spring, MD, USA
§Corresponding to junghi.kim@fda.hhs.gov, 10903 New Hampshire Avenue, Silver Spring, MD, USA
arXiv:2210.13310v2 [stat.AP] 4 Sep 2023
1 Introduction
Plasma concentration-time curves, or pharmacokinetic (PK) curves, are generated by plotting
drug concentration levels in plasma samples at various time intervals after the administration
of a drug product (Shargel and Yu, 2016). As such, they are an essential component of
characterizing drug disposition, which is an important prerequisite to determining or modifying
dosing regimens for individuals and groups of patients (Shargel and Yu, 2016). Because of recent
machine learning (ML) and artificial intelligence (AI) work related generally to drug
development (Zhang et al., 2022) and patient outcomes (Thirunavukkarasu and Karuppasamy,
2022), it is natural to suppose that any ML and AI applications with the potential to enhance
the understanding or interpretation of PK curves would be of significant interest within the
broader field of pharmaceutical research. Indeed, this is the case (e.g., Koch et al., 2020;
Zame et al., 2020; McComb et al., 2021). Absent from these studies, however, is the use of
clustering techniques on data sets of PK curves, and we thus present what is, to our knowledge,
the first applied study of clustering techniques for use on PK data.
We may use the observation that PK curves are time series data to utilize existing
literature related to the ML clustering of time series data as a starting point to cluster similar
PK curves. Broadly, time series clustering is a technique for grouping time series data based
on their similarity. Clustering techniques for time series data have grown rapidly and have
been applied successfully in a wide range of domains including medicine (e.g., personalized
drug design), environmental science, and many more (Javed et al., 2020). In particular, there
exists a large set of well-established time series dissimilarity (or distance) measures (Montero
and Vilar, 2014). It is not straightforward to define dissimilarity between PK curves, and
we illustrate a potential shape-based distance interpretation for two PK curves with five
concentration sampling points in Figure 1. The novelty of this paper is twofold. First,
we narrow down the broad field of potential time series dissimilarity measures to select
a suitable one for use on PK curves. Specifically, we find the following five dissimilarity
measures most applicable: correlation, Fréchet, dynamic time warping (DTW), temporal
correlation coefficient (CORT), and Euclidean. Of these five, we identify Euclidean distance
as generally most appropriate for clustering PK concentration curves, and we summarize the
pros and cons of all five of these dissimilarity measures in Table 1.
[Figure 1: Illustrative Example of Distance Between Two PK Curves. It is not straightforward
to measure the dissimilarity between two PK curves for use within an ML clustering exercise.
The figure gives a geometric interpretation of “distance” between two PK curves (Curve 1 and
Curve 2; horizontal axis: Time (t); vertical axis: Concentration) with five plasma concentration
sampling time points t1, ..., t5 and pointwise distances d1, ..., d5.]
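As a minimal sketch of the Euclidean measure discussed above, the following Python snippet computes the distance between two curves observed at the same five sampling times. The function name and the concentration values are hypothetical and chosen only for illustration; they are not taken from the case study data.

```python
import math


def euclidean_distance(curve_a, curve_b):
    """Euclidean distance between two curves sampled at the same time points."""
    if len(curve_a) != len(curve_b):
        raise ValueError("curves must share the same sampling time points")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(curve_a, curve_b)))


# Hypothetical concentrations at five shared sampling times t1, ..., t5.
curve1 = [0.0, 4.0, 6.0, 3.0, 1.0]
curve2 = [0.0, 1.0, 2.0, 3.0, 1.0]

# The pointwise gaps are 0, 3, 4, 0, 0, so the distance is sqrt(9 + 16).
distance = euclidean_distance(curve1, curve2)
print(distance)
```

Under this measure, every sampling time contributes equally, and the curves must be aligned on a common set of time points.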
Second, we present a novel case study to illustrate the merits of clustering PK curves.
The case study objective is to use ML clustering to independently analyze data from a
concluded pharmacogenomic study that attempted to tailor treatment strategies to the in-
dividual patient level. The case study data set consists of PK curves, along with genetic and
demographic information from 250 observations, and it spans nine Phase 1 studies. As such,
the concentration sampling time points vary between PK curve data observations. This is
a well-known issue in time series clustering (Montero and Vilar, 2014). While there are
dissimilarity measures in Section 2 that are able to handle PK curves with a differing number
of sampling times, we find such methods may not yield desirable results (see Section 2 and
the Appendix for details). Hence, we find Euclidean distance using only the shared
concentration sampling time points to be most effective. Lastly, the case study demonstrates how
clustering PK curves with Euclidean distance can provide additional insights in comparison
to PK analysis based solely on PK parameters (e.g., area-under-the-time-concentration curve
(AUC), maximum concentration (Cmax), time-until-maximum concentration (Tmax), etc.).
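One way to picture the restriction to shared sampling times is sketched below. The snippet assumes each curve is stored as a mapping from sampling time to concentration; the helper names, subjects, and values are hypothetical, and the case study's actual preprocessing may differ.

```python
import math


def shared_times(curves):
    """Sorted sampling times that are present in every curve."""
    common = set(curves[0])
    for curve in curves[1:]:
        common &= set(curve)
    return sorted(common)


def euclidean_on_shared(curve_a, curve_b, times):
    """Euclidean distance using only the given (shared) sampling times."""
    return math.sqrt(sum((curve_a[t] - curve_b[t]) ** 2 for t in times))


# Hypothetical curves (hours -> concentration) from studies with
# different sampling schedules.
subject_a = {0.5: 2.0, 1.0: 5.0, 2.0: 4.0, 4.0: 2.0, 8.0: 1.0}
subject_b = {1.0: 2.0, 2.0: 4.0, 3.0: 3.0, 4.0: 2.0, 8.0: 1.0}
subject_c = {1.0: 5.0, 2.0: 8.0, 4.0: 2.0, 6.0: 1.5, 8.0: 1.0}

curves = [subject_a, subject_b, subject_c]
times = shared_times(curves)  # only times observed for all three subjects
d_ab = euclidean_on_shared(subject_a, subject_b, times)
d_ac = euclidean_on_shared(subject_a, subject_c, times)
print(times, d_ab, d_ac)
```

The trade-off in this sketch is that observations at non-shared times (0.5, 3.0, and 6.0 hours here) are discarded in exchange for a distance that compares like with like.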
The paper proceeds as follows. The methodological review occurs in Section 2. The
case study then follows in Section 3, and Section 4 concludes. For reference, the Appendix
provides extended details on the dissimilarity measures of Section 2, and the Supplementary
Material provides additional details on clustering methods, cluster selection criteria, and the
case study results.
2 Methods
Clustering algorithms are usually categorized by families, such as hierarchical,
partition-based, model-based, and density-based clustering (e.g., Hastie et al., 2009; James
et al., 2013). We situate our paper within hierarchical clustering with a particular focus on
similarity measures for PK curves and defer discussion of other algorithms to the
Supplementary Material. (Hierarchical clustering will also be used exclusively within the case
study; see Section 3.2.)
An important step in hierarchical clustering is to find an appropriate distance or
dissimilarity measure between data objects to be clustered, which is done in Section 2.3. Section 2.4
then briefly reviews the well-known problem of deciding on the final number of clusters (e.g.,
James et al., 2013).
2.1 Hierarchical Clustering
Hierarchical clustering is a “bottom-up” clustering technique with an attractive feature of
producing a tree-based representation of observations called a dendrogram. Colloquially, data
objects that are relatively similar are to be grouped into the same cluster, while data objects
that are relatively dissimilar are to be grouped into separate clusters. The dendrogram,
therefore, represents the relationships of similarity among all objects in a data set (see
bottom of Figure 2).
At the beginning of the algorithm, each data object is treated as a single cluster. The
two most similar clusters are then merged into a new single cluster, and this new cluster
becomes an updated data object with a value that is determined by averaging its now two
members. There are other methods to determine the new cluster value besides averaging, but
we defer this discussion at present for ease of exposition (the precise vernacular is linkage, see
James et al. (2013) for details). The merging process continues until all original data objects
eventually merge into a single cluster, as represented by the very top of the dendrogram.
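The merging procedure just described can be sketched in a few lines of Python. This is an illustrative simplification, not the implementation used in the case study: it assumes Euclidean distance, and it represents each merged cluster by the pointwise mean of all curves it contains (a size-weighted version of the averaging described above). The function names and the four toy curves are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(curves):
    """Merge curves bottom-up and return the merge history.

    Each history entry is (left_members, right_members, height), where
    height is the distance between the two cluster values when they merge.
    """
    # Start with every curve as its own cluster: (member indices, value).
    clusters = [((i,), list(c)) for i, c in enumerate(curves)]
    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (members_i, value_i), (members_j, value_j) = clusters[i], clusters[j]
        n_i, n_j = len(members_i), len(members_j)
        # Assumption: the new cluster value is the mean of all member curves.
        merged = [(n_i * x + n_j * y) / (n_i + n_j)
                  for x, y in zip(value_i, value_j)]
        merges.append((members_i, members_j, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members_i + members_j, merged))
    return merges


# Four hypothetical curves observed at three shared sampling times.
curves = [
    [0.0, 4.0, 2.0],  # subject 0
    [0.0, 4.0, 3.0],  # subject 1
    [0.0, 9.0, 6.0],  # subject 2
    [0.0, 9.0, 8.0],  # subject 3
]
merges = agglomerate(curves)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

In this toy data set, subjects 0 and 1 merge first, subjects 2 and 3 merge second, and the two resulting clusters fuse last, which is the order a dendrogram would display from bottom to top.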
The dendrogram itself does not report an optimal number of clusters; it is better thought
of as a visualization of similarity (or dissimilarity) within a data set given a particular
measure of dissimilarity. (We discuss dissimilarity measures more thoroughly in Section 2.3.)
To interpret the amount of similarity between two PK curves on a dendrogram, it is necessary
to find the vertical point where the two curves first fuse. It is an error to associate horizontal
proximity of two curves on the x-axis of the dendrogram with similarity. For example,
consider the labeled subjects X98, X100, and Y8 on the bottom of Figure 2. Although
subjects X100 and Y8 are quite close in terms of horizontal labels, they do not fuse until the
very top of the dendrogram. Thus, X100 and Y8 should be considered quite dissimilar, on a
relative basis, among all PK curves within the complete data set. On the other hand, X98
and X100 are much further apart in horizontal labels than X100 and Y8, but they fuse much
sooner vertically. Hence, X98 and X100 should be interpreted as relatively more similar
than X100 and Y8. Finally, horizontal ordering of labeled subjects has no bearing on the
interpretation of the clustering outcome; it is akin to the horizontal ordering of bars on a
bar chart.
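The reading rule above, that similarity is judged by the height at which two curves first fuse, can be made concrete with a short sketch. The merge history below is entirely hypothetical: the heights are invented and are not read from Figure 2, and the label Y9 is added only so that Y8 has a partner to merge with. The function name is likewise an assumption.

```python
def fusion_height(merges, a, b):
    """Height at which labels a and b first fall into the same cluster.

    merges is a list of (left_members, right_members, height) entries
    listed in the order the merges occurred.
    """
    for left, right, height in merges:
        members = set(left) | set(right)
        if a in members and b in members:
            return height
    raise ValueError("labels never fuse in this merge history")


# Hypothetical merge history; heights are made up for illustration.
merges = [
    (("X98",), ("X100",), 1.5),
    (("Y8",), ("Y9",), 2.0),
    (("X98", "X100"), ("Y8", "Y9"), 9.0),
]

h_x98_x100 = fusion_height(merges, "X98", "X100")
h_x100_y8 = fusion_height(merges, "X100", "Y8")
print(h_x98_x100, h_x100_y8)
```

The function never consults the left-to-right position of a label, only the merge at which two labels first share a cluster, which mirrors why horizontal ordering carries no information.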
2.2 An Illustrative Example
We demonstrate the potential effectiveness of hierarchical clustering for PK curves with an
illustrative example. Consider two PK curves, C1 and C2, from a one-compartment linear PK
model, assuming first-order absorption and first-order elimination after oral administration.
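For readers who wish to generate curves of this type, a one-compartment model with first-order absorption and elimination is commonly written as C(t) = F D ka / (V (ka - ke)) * (exp(-ke t) - exp(-ka t)), for ka not equal to ke. The sketch below evaluates that standard expression. All parameter values, sampling times, and names are hypothetical and are not the values behind C1 and C2 in this example.

```python
import math


def one_compartment_oral(t, dose, ka, ke, volume, bioavailability=1.0):
    """Plasma concentration at time t after a single oral dose.

    ka: first-order absorption rate constant (1/h)
    ke: first-order elimination rate constant (1/h)
    """
    if ka == ke:
        raise ValueError("this closed form requires ka != ke")
    scale = bioavailability * dose * ka / (volume * (ka - ke))
    return scale * (math.exp(-ke * t) - math.exp(-ka * t))


# Hypothetical sampling times (hours) and parameters for two curves that
# differ only in their elimination rate.
times = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0]
c1 = [one_compartment_oral(t, dose=100.0, ka=1.0, ke=0.1, volume=10.0)
      for t in times]
c2 = [one_compartment_oral(t, dose=100.0, ka=1.0, ke=0.3, volume=10.0)
      for t in times]

for t, a, b in zip(times, c1, c2):
    print(f"t={t:5.1f}  C1={a:6.3f}  C2={b:6.3f}")
```

With these illustrative settings, the faster-eliminating curve sits at or below the other at every sampled time, so the two curves differ in shape as well as in summary metrics such as Cmax.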