Clustering Plasma Concentration-Time Curves: Applications of Unsupervised Learning in Pharmacogenomics

Jackson P. Lautier∗, Stella Grosser†, Jessica Kim†, Hyewon Kim‡, Junghi Kim†§

Abstract

Pharmaceutical researchers are continually searching for techniques to improve both drug development processes and patient outcomes. An area of recent interest is the potential for machine learning (ML) applications within pharmacology. One such application not yet given close study is the unsupervised clustering of plasma concentration-time curves, hereafter, pharmacokinetic (PK) curves. In this paper, we present our findings on how to cluster PK curves by their similarity. Specifically, we find clustering to be effective at identifying similar-shaped PK curves and informative for understanding patterns within each cluster of PK curves. Because PK curves are time series data objects, our approach utilizes the extensive body of research related to the clustering of time series data as a starting point. As such, we examine many dissimilarity measures between time series data objects to find those most suitable for PK curves. We identify Euclidean distance as generally most appropriate for clustering PK curves, and we further show that dynamic time warping, Fréchet, and structure-based measures of dissimilarity like correlation may produce unexpected results. As an illustration, we apply these methods in a case study with 250 PK curves used in a previous pharmacogenomic study. Our case study finds that an unsupervised ML clustering with Euclidean distance, without any subject genetic information, is able to independently validate the same conclusions as the reference pharmacogenomic results. To our knowledge, this is the first such demonstration. Further, the case study demonstrates how the clustering of PK curves may generate insights that could be difficult to perceive solely with population level summary statistics of PK metrics.

Keywords: CYP2C19, distance metrics, hierarchical clustering, precision medicine

∗Department of Mathematical Sciences, Bentley University, Waltham, MA, USA
†Office of Biostatistics, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
‡Office of Clinical Pharmacology, Center for Drug Evaluation and Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
§Correspondence to junghi.kim@fda.hhs.gov, 10903 New Hampshire Avenue, Silver Spring, MD, USA

arXiv:2210.13310v2 [stat.AP] 4 Sep 2023

1 Introduction

Plasma concentration-time curves, or pharmacokinetic (PK) curves, are generated by plotting drug concentration levels in plasma samples at various time intervals after the administration of a drug product (Shargel and Yu, 2016). As such, they are an essential component of characterizing drug disposition, which is an important prerequisite to determine or modify dosing regimens for individuals and groups of patients (Shargel and Yu, 2016). Because of recent machine learning (ML) and artificial intelligence (AI) work related generally to drug development (Zhang et al., 2022) and patient outcomes (Thirunavukkarasu and Karuppasamy, 2022), it is natural to suppose that any ML and AI applications with the potential to enhance the understanding or interpretation of PK curves would be of significant interest within the broader field of pharmaceutical research. Indeed, this is the case (e.g., Koch et al., 2020; Zame et al., 2020; McComb et al., 2021). Absent from these studies, however, is the use of clustering techniques on data sets of PK curves, and we thus present, to our knowledge, the first applied study of clustering techniques for use on PK data.

We may use the observation that PK curves are time series data to utilize existing literature related to the ML clustering of time series data as a starting point to cluster similar PK curves. Broadly, time series clustering is a technique for grouping time series data based on their similarity. Clustering techniques for time series data have grown rapidly and have been applied successfully in a wide range of domains including medicine (e.g., personalized drug design), environmental science, and many more (Javed et al., 2020). In particular, there exists a large set of well-established time series dissimilarity (or distance) measures (Montero and Vilar, 2014). It is not straightforward to define dissimilarity between PK curves, and we illustrate a potential shape-based distance interpretation for two PK curves with five concentration sampling points in Figure 1. The novelty of this paper is twofold. First, we narrow down the broad field of potential time series dissimilarity measures to select a suitable one for use on PK curves. Specifically, we find the following five dissimilarity measures most applicable: correlation, Fréchet, dynamic time warping (DTW), temporal correlation coefficient (CORT), and Euclidean. Of these five, we identify Euclidean distance as generally most appropriate for clustering PK concentration curves, and we summarize the pros and cons of all five of these dissimilarity measures in Table 1.

[Figure 1 plot: Curve 1 and Curve 2, concentration against time (t) at sampling points t1, ..., t5, with pointwise distances d3, d4, d5 marked between the curves.]

Figure 1: Illustrative Example of Distance Between Two PK Curves. It is not straightforward to measure the dissimilarity between two PK curves for use within an ML clustering exercise. The above is a geometric interpretation of “distance” between two PK curves with five plasma concentration sampling time points.
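As a rough illustration of why the choice of dissimilarity measure matters, the sketch below compares Euclidean distance with one common correlation-based dissimilarity, sqrt(2(1 − ρ)), on two curves that share a shape but differ in magnitude. The concentration values and function names are hypothetical, and the correlation formula is an assumed form chosen for illustration; it is not taken from this paper.

```python
import numpy as np


def euclidean_distance(x, y):
    """Euclidean distance between two curves sampled at the same time points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))


def correlation_dissimilarity(x, y):
    """Pearson-correlation-based dissimilarity, sqrt(2 * (1 - rho)) (assumed form)."""
    rho = np.corrcoef(x, y)[0, 1]
    # Guard against rho marginally exceeding 1 through floating-point rounding.
    return float(np.sqrt(max(0.0, 2.0 * (1.0 - rho))))


# Hypothetical concentrations at five shared sampling times.
curve_1 = np.array([0.0, 4.0, 6.0, 3.0, 1.0])
curve_2 = 3.0 * curve_1  # same shape, three times the concentration

d_euclid = euclidean_distance(curve_1, curve_2)
d_corr = correlation_dissimilarity(curve_1, curve_2)
print(f"Euclidean: {d_euclid:.3f}, correlation-based: {d_corr:.3g}")
```

Under these assumptions the correlation-based measure treats the two curves as essentially identical, while Euclidean distance separates them by magnitude, which is the kind of difference in exposure a PK analysis would usually want to keep.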

Second, we present a novel case study to illustrate the merits of clustering PK curves. The case study objective is to use ML clustering to independently analyze data from a concluded pharmacogenomic study that attempted to tailor treatment strategies to the individual patient level. The case study data set consists of PK curves, along with genetic and demographic information, from 250 observations, and it spans nine Phase 1 studies. As such, the concentration sampling time points vary between PK curve data observations. This is a well-known issue in time series clustering (Montero and Vilar, 2014). While there are dissimilarity measures in Section 2 that are able to handle PK curves with a differing number of sampling times, we find such methods may not yield desirable results (see Section 2 and the Appendix for details). Hence, we find Euclidean distance using only the shared concentration sampling time points to be most effective. Lastly, the case study demonstrates how clustering PK curves with Euclidean distance can provide additional insights in comparison to PK analysis based solely on PK parameters (e.g., area-under-the-time-concentration curve (AUC), maximum concentration (Cmax), time-until-maximum concentration (Tmax), etc.).
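One way to read "Euclidean distance using only the shared concentration sampling time points" is sketched below: intersect the two sampling schedules and compute the distance on the common times only. The function name, sampling times, and concentrations are hypothetical, and the sketch assumes each curve has unique sampling times recorded on an identical scale.

```python
import numpy as np


def shared_time_euclidean(times_a, conc_a, times_b, conc_b):
    """Euclidean distance restricted to the sampling times both curves share."""
    shared, idx_a, idx_b = np.intersect1d(times_a, times_b, return_indices=True)
    if shared.size == 0:
        raise ValueError("curves share no sampling times")
    diff = np.asarray(conc_a, dtype=float)[idx_a] - np.asarray(conc_b, dtype=float)[idx_b]
    return float(np.sqrt(np.sum(diff ** 2))), shared


# Hypothetical schedules from two studies (hours) and their concentrations.
times_a = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0])
conc_a = np.array([1.0, 3.0, 5.0, 4.0, 2.0, 1.0])
times_b = np.array([1.0, 2.0, 3.0, 4.0, 8.0])
conc_b = np.array([2.0, 5.0, 5.5, 1.0, 2.0])

dist, shared = shared_time_euclidean(times_a, conc_a, times_b, conc_b)
print(shared, round(dist, 4))
```

Restricting to shared times discards observations, so the number of common points is worth checking before comparing distances across pairs of curves.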

The paper proceeds as follows. The methodological review occurs in Section 2. The case study then follows in Section 3, and Section 4 concludes. For reference, the Appendix provides extended details on the dissimilarity measures of Section 2, and the Supplementary Material provides additional details on clustering methods, cluster selection criteria, and the case study results.

2 Methods

Clustering algorithms are usually categorized by families, such as hierarchical clustering, partition-based, model-based, and density-based clustering (e.g., Hastie et al., 2009; James et al., 2013). We situate our paper within hierarchical clustering with a particular focus on similarity measures for PK curves and defer discussion of other algorithms to the Supplementary Material. (Hierarchical clustering will also be used exclusively within the case study; see Section 3.2.)

An important step in hierarchical clustering is to find an appropriate distance or dissimilarity measure between data objects to be clustered, which is done in Section 2.3. Section 2.4 then briefly reviews the well-known problem of deciding on the final number of clusters (e.g., James et al., 2013).

2.1 Hierarchical Clustering

Hierarchical clustering is a “bottom-up” clustering technique with an attractive feature of producing a tree-based representation of observations called a dendrogram. Colloquially, data objects that are relatively similar are to be grouped into the same cluster, while data objects that are relatively dissimilar are to be grouped into separate clusters. The dendrogram, therefore, represents the relationships of similarity among all objects in a data set (see bottom of Figure 2).

At the beginning of the algorithm, each data object is treated as a single cluster. The two most similar clusters are then merged into a new single cluster, and this new cluster becomes an updated data object with a value that is determined by averaging its now two members. There are other methods to determine the new cluster value besides averaging, but we defer this discussion at present for ease of exposition (the precise term is linkage; see James et al. (2013) for details). The merging process continues until all original data objects eventually merge into a single cluster, as represented by the very top of the dendrogram.
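The merging procedure described above can be sketched with SciPy's agglomerative clustering routines. The curves below are hypothetical. The sketch uses average linkage, which scores two clusters by the mean of their pairwise distances; this is one linkage choice and is related to, but not the same as, the simplified "average the members" description above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical concentrations: rows are subjects, columns are shared sampling times.
curves = np.array([
    [1.0, 4.0, 6.0, 3.0, 1.0],
    [1.2, 4.1, 5.8, 3.2, 0.9],
    [3.0, 9.0, 12.0, 8.0, 4.0],
    [2.8, 9.5, 11.5, 7.6, 4.2],
])

# Pairwise Euclidean distances in condensed form.
distances = pdist(curves, metric="euclidean")

# Each row of `merges` records one fusion: the two clusters joined,
# the height (dissimilarity) at which they fuse, and the new cluster size.
merges = linkage(distances, method="average")

# Cut the tree into two clusters, purely for illustration.
labels = fcluster(merges, t=2, criterion="maxclust")
print(merges)
print(labels)
```

With n curves there are always n − 1 merges, the last of which joins everything into a single cluster at the top of the dendrogram.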

The dendrogram itself does not report an optimal number of clusters; it is better thought of as a visualization of similarity (or dissimilarity) within a data set given a particular measure of dissimilarity. (We discuss dissimilarity measures more thoroughly in Section 2.3.) To interpret the amount of similarity between two PK curves on a dendrogram, it is necessary to find the vertical point where the two curves first fuse. It is an error to associate horizontal proximity of two curves on the x-axis of the dendrogram with similarity. For example, consider the labeled subjects X98, X100, and Y8 on the bottom of Figure 2. Despite the fact that subjects X100 and Y8 are quite close in terms of horizontal labels, they do not fuse until the very top of the dendrogram. Thus, X100 and Y8 should be considered quite dissimilar, on a relative basis, among all PK curves within the complete data set. On the other hand, X98 and X100 are much further apart in horizontal labels than X100 and Y8, but they fuse much sooner vertically. Hence, X98 and X100 should be interpreted as relatively more similar than X100 and Y8. Finally, horizontal ordering of labeled subjects has no bearing on the interpretation of the clustering outcome; it is akin to the horizontal ordering of bars on a bar chart.
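The vertical point at which two curves first fuse is commonly called their cophenetic distance, and it can be read off numerically rather than from the plot. The sketch below does this with SciPy on hypothetical curves and subject labels (S1 to S4 are placeholders, not the subjects of Figure 2), using average linkage as an assumed choice.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist, squareform

subjects = ["S1", "S2", "S3", "S4"]  # hypothetical labels
curves = np.array([
    [1.0, 4.0, 6.0, 3.0, 1.0],
    [1.2, 4.1, 5.8, 3.2, 0.9],
    [3.0, 9.0, 12.0, 8.0, 4.0],
    [2.8, 9.5, 11.5, 7.6, 4.2],
])

merges = linkage(pdist(curves, metric="euclidean"), method="average")

# fusion_height[i, j] is the dendrogram height at which subjects i and j first fuse.
fusion_height = squareform(cophenet(merges))

# Left-to-right leaf order of the dendrogram, computed without drawing it.
leaf_order = [subjects[i] for i in dendrogram(merges, no_plot=True)["leaves"]]

height_s1_s2 = fusion_height[0, 1]
height_s2_s3 = fusion_height[1, 2]
print(leaf_order, round(height_s1_s2, 3), round(height_s2_s3, 3))
```

Whatever the leaf order turns out to be, two neighbouring labels that belong to different top-level branches still fuse only at the final merge, so the fusion height, not the position along the x-axis, carries the similarity information.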

2.2 An Illustrative Example

We demonstrate the potential effectiveness of hierarchical clustering for PK curves with an illustrative example. Consider two PK curves, C1 and C2, from a one-compartment linear PK model, assuming first-order absorption and first-order elimination after oral administration.
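For readers who want to generate curves of this type, the sketch below uses the standard closed-form solution of a one-compartment model with first-order absorption and elimination, C(t) = F·D·ka / (V·(ka − ke)) · (exp(−ke·t) − exp(−ka·t)) for ka ≠ ke. All parameter values, the sampling times, and the variable names are hypothetical and chosen only for illustration; they are not the settings used for C1 and C2 in this paper.

```python
import numpy as np


def one_compartment_oral(t, dose, f, v, ka, ke):
    """Plasma concentration at time t for first-order absorption and elimination.

    dose: administered dose, f: bioavailable fraction, v: volume of distribution,
    ka: absorption rate constant, ke: elimination rate constant (requires ka != ke).
    """
    t = np.asarray(t, dtype=float)
    return (f * dose * ka) / (v * (ka - ke)) * (np.exp(-ke * t) - np.exp(-ka * t))


# Hypothetical sampling times in hours.
times = np.array([0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 12.0, 24.0])

# Two hypothetical curves that differ only in the elimination rate constant.
c1 = one_compartment_oral(times, dose=100.0, f=1.0, v=20.0, ka=1.5, ke=0.2)
c2 = one_compartment_oral(times, dose=100.0, f=1.0, v=20.0, ka=1.5, ke=0.1)

# Time of maximum concentration for the first curve: ln(ka / ke) / (ka - ke).
t_max_c1 = np.log(1.5 / 0.2) / (1.5 - 0.2)
print(np.round(c1, 3), np.round(c2, 3), round(t_max_c1, 2))
```

Curves simulated this way can be stacked row by row and passed to a distance and linkage routine to check whether a clustering recovers the parameter settings that generated them.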
