
Data-Driven SMILES Tokenization to Identify Binding Moieties
A PREPRINT
1 Introduction
The last decade has witnessed the rise of machine learning. Models have learned to sing [1], write [2], and paint [3]
from large unlabeled datasets, despite the absence of well-defined rules for these tasks. Drug discovery is also an
excellent application domain for machine learning models. However, unlike images and audio tracks, chemicals are
non-numeric and need intermediate representations upon which machines can learn.
Text-based representations of chemicals [4, 5, 6] can serve as such intermediate representations: they are easily available,
simple, and as powerful as more complex 2D or 3D representations [7]. The power of text-based representations in
machine learning is partially due to the chemical language perspective, which views text-based representations of
chemicals as documents written in a chemical language and borrows advanced approaches from the natural language
processing domain to build computational drug discovery models [8]. Successful applications of the chemical language
perspective in drug discovery include de novo drug design [9, 10, 11, 12], binding site detection [13, 14, 15, 16], and
drug-target interaction prediction [17, 18, 19, 20].
At the core of the chemical language perspective are chemical words, which correspond to the small building blocks
of chemicals, similar to the words in natural languages. Defining the chemical words, however, poses a research
problem of its own, since the chemical language, unlike natural (human) languages, has no pre-defined collection of
words, i.e., a vocabulary. Several approaches are available to build a chemical vocabulary for state-of-the-art models on
downstream tasks such as drug-target affinity prediction, similar compound identification, and de novo drug design
[21, 22, 23]. However, to the best of our knowledge, the chemical vocabularies obtained from text-based representations
have not yet been studied from a chemical perspective, and it is therefore unknown whether these vocabularies
capture chemical information. This raises the question: do models utilizing chemical words rely on chemically
meaningful building blocks, or on arbitrary chemical subsequences? Here we seek an answer to this question in order to
define the role of these models in new drug design and development from a medicinal chemist’s perspective.
The chemical word is an integral concept for studies that rely on the chemical language hypothesis [8]. The chemical
language hypothesis interprets the text-based representations (i.e., molecular strings) of chemicals as sentences and
borrows language processing methods from natural languages. Chemical words are the building blocks of these
“sentences” and, unlike the words in natural languages, need to be discovered.
Subword tokenization algorithms allow learning the chemical words from large unlabeled corpora of molecular strings.
In addition to discovering chemical words, they can segment the chemical sentences into chemical words from the
discovered vocabulary. One commonly used algorithm in this context is Byte Pair Encoding (BPE) [24], which is widely
adopted in various applications such as de novo drug design [23], molecular property prediction [25], and protein-ligand
binding affinity prediction [21]. BPE starts with an initial vocabulary of individual characters and iteratively merges the
most frequently occurring pairs until a desired vocabulary size is reached. Another subword tokenization method used
in molecular design [26] is WordPiece [27], which is similar to BPE in that it also begins with an initial vocabulary and
continues merging pairs of tokens. However, unlike BPE, WordPiece selects the pair that maximizes the likelihood of
the training data, rather than simply the most frequent one. In contrast to both BPE and WordPiece, Unigram [28] is a
top-down tokenization method that starts with an initial vocabulary of all possible words and reduces it based on the
likelihood of a unigram language model; it has recently been used in molecular fingerprinting [29].
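To make the bottom-up merging procedure concrete, the following is a minimal, illustrative sketch of BPE training and segmentation on SMILES strings in pure Python. The toy corpus and function names are our own, not part of the tokenizers used in this work, and a production tokenizer would additionally pre-tokenize multi-character atoms such as Cl and Br before merging.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a corpus of SMILES strings."""
    # Initial vocabulary: individual characters of each SMILES string.
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges

def tokenize(smiles, merges):
    """Segment a SMILES string by replaying the learned merges in order."""
    seq = list(smiles)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

# Toy corpus: "CC" is the most frequent pair, so it becomes the first chemical word.
merges = train_bpe(["CCO", "CCN", "CCC"], num_merges=1)
print(merges)                   # [('C', 'C')]
print(tokenize("CCN", merges))  # ['CC', 'N']
```

WordPiece and Unigram differ only in the criterion used to grow or shrink the vocabulary (training-data likelihood instead of raw pair frequency), while the segmentation step remains analogous.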
We adopt three widely utilized subword tokenization algorithms, namely Byte Pair Encoding (BPE), WordPiece, and
Unigram, which have been successfully applied in computational drug discovery tasks [23, 25, 21, 26, 29], and compare
the vocabularies generated by these methods. However, due to the large number of chemical words in the resulting
vocabularies, it is impractical to interpret them all manually. Therefore, we propose a novel language-inspired pipeline
to identify key chemical words that are strongly associated with protein-ligand binding. In natural languages, each
genre or period has a certain distribution of words. Similarly, each protein family can be expected to have a certain
distribution of substructures (or chemical words) to which it would bind. Therefore, we focused on the high affinity
ligands of protein families to build protein family specific vocabularies and analyzed the chemical significance of
subwords for the protein family. Protein-ligand binding is selected as the chemical investigation perspective since
strong binding relationships are fundamental in drug discovery pipelines, are widely available in the literature, and
naturally group chemicals by their binding targets. The pipeline processes a protein-ligand binding affinity dataset and
identifies ten chemical words for each protein or protein family in the dataset. We find that while the vocabularies
generated by different subword tokenization algorithms differ in words, lengths, and validity, the identified key chemical
words are similar, as measured by the mean edit distance between the protein- or family-specific top-ranking subwords
of different vocabularies. We also observe that the selected words are protein- or family-specific, often associated
with only one protein/family, and significantly different from the words identified for weak binders. As a case study,
we examine the selected chemical words for a number of important drug target families and find that the top-ranking
chemical words are associated with the known pharmacophores and functional groups of the protein families. Notably,