Exploring Data-Driven Chemical SMILES Tokenization Approaches to Identify Key Protein-Ligand Binding Moieties

A PREPRINT
Asu Busra Temizer∗
Faculty of Pharmacy, Department of Pharmaceutical Chemistry
İstanbul University

Gökçe Uludoğan∗
Department of Computer Engineering
Bogazici University

Rıza Özçelik∗
Department of Computer Engineering
Bogazici University

Taha Koulani
Faculty of Pharmacy, Department of Pharmaceutical Chemistry
İstanbul University

Elif Ozkirimli
Data and Analytics Chapter, Pharma International Informatics
F. Hoffmann-La Roche AG

Kutlu O. Ulgen
Department of Chemical Engineering
Bogazici University

Nilgün Karalı
Department of Pharmaceutical Chemistry
İstanbul University
karalin@istanbul.edu.tr

Arzucan Özgür
Department of Computer Engineering
Bogazici University
arzucan.ozgur@boun.edu.tr
ABSTRACT
Machine learning models have found numerous successful applications in computational drug
discovery. A large body of these models represents molecules as sequences since molecular sequences
are easily available, simple, and informative. The sequence-based models often segment molecular
sequences into pieces called chemical words (analogous to the words that make up sentences in
human languages) and then apply advanced natural language processing techniques for tasks such
as de novo drug design, property prediction, and binding affinity prediction. However, the chemical
characteristics and significance of these building blocks, chemical words, remain unexplored. This
study aims to investigate the chemical vocabularies generated by popular subword tokenization
algorithms, namely Byte Pair Encoding (BPE), WordPiece, and Unigram, and identify key chemical
words associated with protein-ligand binding. To this end, we build a language-inspired pipeline that
treats high affinity ligands of protein targets as documents and selects key chemical words making
up those ligands based on tf-idf weighting. Further, we conduct case studies on a number of protein
families to analyze the impact of key chemical words on binding. Through our analysis, we find that
these key chemical words are specific to protein targets and correspond to known pharmacophores
and functional groups. Our findings will help shed light on the chemistry captured by the chemical
words, and by machine learning models for drug discovery at large.
Keywords chemical words · chemoinformatics · machine learning · medicinal chemistry · subword tokenization

∗These authors contributed equally to this work.
arXiv:2210.14642v2 [q-bio.BM] 25 Sep 2023
Data-Driven SMILES Tokenization to Identify Binding Moieties A PREPRINT
1 Introduction
The last decade has witnessed the rise of machine learning. The models learned to sing [1], write [2], and paint [3] through large unlabeled datasets, despite the absence of well-defined rules for the tasks. Drug discovery is also an excellent application domain for machine learning models. However, unlike images and audio tracks, chemicals are non-numeric and need intermediate representations upon which machines can learn.
Text-based representations of chemicals [4, 5, 6] can be used as intermediate representations: they are easily available, simple, and as powerful as more complex representations such as 2D or 3D representations [7]. The power of the text-based representations in machine learning is partially due to the chemical language perspective. The chemical language perspective views text-based representations of chemicals as documents written in a chemical language and borrows advanced approaches from the natural language processing domain to build computational drug discovery models [8]. Successful applications of the chemical language perspective in drug discovery include de novo drug design [9, 10, 11, 12], binding site detection [13, 14, 15, 16], and drug-target interaction prediction [17, 18, 19, 20].
At the core of the chemical language perspective are chemical words, which correspond to the smaller building blocks
of chemicals, similar to the words in natural languages. On the other hand, defining the chemical words poses another
research problem since the chemical language, unlike natural (human) languages, has no pre-defined collection of
words, i.e., a vocabulary. Several approaches are available to build a chemical vocabulary for state-of-the-art models for downstream tasks such as drug-target affinity prediction, similar compound identification, and de novo drug design [21, 22, 23]. However, to the best of our knowledge, the chemical vocabularies obtained from text-based representations have not yet been studied from a chemical perspective, and it is therefore unknown whether they capture chemical information. This raises the question: do models utilizing chemical words rely on chemically meaningful building blocks, or on arbitrary chemical subsequences? Here we seek an answer to this question in order to define the role of these models in new drug design and development from a medicinal chemist's perspective.
The chemical word is an integral concept for studies that rely on the chemical language hypothesis [8]. The chemical language hypothesis interprets the text-based representations (i.e., molecular strings) of chemicals as sentences and borrows language processing methods from natural languages. Chemical words are the building blocks of these "sentences" and, unlike the words in natural languages, need to be discovered.
Subword tokenization algorithms allow learning the chemical words from large unlabeled corpora of molecular strings. In addition to discovering chemical words, they can segment the chemical sentences into chemical words in the discovered vocabulary. One commonly used algorithm in this context is Byte Pair Encoding (BPE) [24], which is widely adopted in various applications such as de novo drug design [23], molecular property prediction [25], and protein-ligand binding affinity prediction [21]. BPE starts with an initial vocabulary of individual characters and iteratively merges the most frequently occurring pairs until a desired vocabulary size is reached. Another subword tokenization method used in molecular design [26] is WordPiece [27], which is similar to BPE in that it also begins with an initial vocabulary and continues merging pairs of tokens. However, unlike BPE, WordPiece selects the pair that maximizes the likelihood of the training data, rather than simply the most frequent one. In contrast to both BPE and WordPiece, Unigram [28] is a top-down tokenization method that starts with an initial vocabulary of all possible words and reduces it based on the likelihood of the unigram model; it has recently been used in molecular fingerprinting [29].
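The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a simplified, character-level illustration on a toy SMILES corpus, not the implementation used in this study (production tokenizers handle multi-character atoms such as Cl and Br, and train on millions of strings):

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Each SMILES string starts as a tuple of single characters; every
    iteration counts all adjacent pairs across the corpus and fuses the
    most frequent one into a single new symbol (a chemical word).
    """
    words = [tuple(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(tuple(out))
        words = new_words
    return merges, words

# Tiny corpus of sulfonamide-like SMILES fragments (illustrative only).
corpus = ["NS(=O)(=O)c1ccccc1", "CS(=O)(=O)N", "NS(=O)(=O)C"]
merges, segmented = learn_bpe(corpus, 5)
```

WordPiece runs the same loop but scores candidate pairs by the likelihood gain on the training data rather than by raw frequency.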
We adopt three widely utilized subword tokenization algorithms, namely Byte Pair Encoding (BPE), WordPiece, and Unigram, which have been successfully applied in computational drug discovery tasks [23, 25, 21, 26, 29], and compare
the vocabularies generated by these methods. However, due to the large number of chemical words in the resulting
vocabularies, it is impractical to interpret them all manually. Therefore, we propose a novel language-inspired pipeline
to identify key chemical words that are strongly associated with protein-ligand binding. In natural languages, each
genre or period has a certain distribution of words. Similarly, each protein family can be expected to have a certain
distribution of substructures (or chemical words) to which it would bind. Therefore, we focused on the high affinity
ligands of protein families to build protein family specific vocabularies and analyzed the chemical significance of
subwords for the protein family. Protein-ligand binding is selected as the chemical investigation perspective since
strong binding relationships are fundamental in drug discovery pipelines, are widely available in the literature, and
naturally group chemicals by their binding targets. The pipeline processes a protein-ligand binding affinity dataset and
identifies ten chemical words for each protein or protein family in the dataset. We find that while the vocabularies
generated by different subword tokenization algorithms differ in words, lengths, and validity, the identified key chemical
words are similar, as measured by the mean edit distance between protein or family-specific top ranking subwords
with different vocabularies. We also observe that the selected words are protein or family-specific, often associated
with only one protein/family, and significantly different from the words identified for weak binders. As a case study,
we examine the selected chemical words for a number of important drug target families and find that the top-ranking
chemical words are associated with the known pharmacophores and functional groups of the protein families. Notably,
for the aldehyde dehydrogenase 1 enzyme family, the chemical words selected by the proposed algorithm were found to
improve the drug-likeness of the molecules. Our results corroborate the chemical word-based models in computational
drug discovery and are a step toward interpreting the computational drug discovery models from a pharmaceutical
chemistry perspective.
2 Materials and methods
In this work, we build a pipeline to analyze the chemical significance of the chemical words of high affinity ligands
of proteins or protein families. The vocabularies are created for each protein family based on the hypothesis that the
important subwords of each protein family will be specific to the protein family. The top ranking words for the ligands
of carbonic anhydrases and casein kinase 1 gamma in BindingDB (BDB), as well as pyruvate kinase M2 and aldehyde dehydrogenase 1 from LIT-PCBA, are examined for chemical relevance.
2.1 Segmenting Chemicals into Chemical Words
In this study, we adopt three commonly used subword tokenization algorithms: BPE [24], WordPiece [27], and Unigram [28], and learn vocabularies with sizes of 8K, 16K, and 32K. The vocabularies are identified by applying the tokenization algorithms on the SMILES representations of 2.3M compounds in ChEMBLv27 [30], and then the vocabularies are used to segment the compounds into their chemical words.
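At segmentation time, a learned vocabulary can be applied with a greedy longest-match scan (the scheme WordPiece uses at inference time). A minimal stdlib-only sketch with a small hand-picked vocabulary, hypothetical rather than one of the 8K-32K vocabularies learned from ChEMBL:

```python
def segment(smiles, vocab):
    """Greedy longest-match segmentation of a SMILES string.

    At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches, so every
    string can always be segmented.
    """
    tokens, i = [], 0
    max_len = max(len(w) for w in vocab)
    while i < len(smiles):
        for j in range(min(len(smiles), i + max_len), i, -1):
            if smiles[i:j] in vocab:
                tokens.append(smiles[i:j])
                i = j
                break
        else:
            tokens.append(smiles[i])  # out-of-vocabulary character
            i += 1
    return tokens

# Hypothetical vocabulary containing a sulfonamide chemical word.
vocab = {"NS(=O)(=O)", "c1ccccc1", "C", "N", "O"}
print(segment("NS(=O)(=O)c1ccccc1", vocab))  # ['NS(=O)(=O)', 'c1ccccc1']
```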
2.2 Characterizing Chemical Words
Subword tokenization algorithms, adopted from the field of natural language processing, have recently been widely
used to identify chemical words and represent compounds in computational drug discovery studies. However, the
identified vocabularies have not been characterized and a comparison of the different methods has not been performed
yet. To investigate the chemical words learned by different tokenization methods, we analyze various statistics, such as word length, validity, vocabulary overlaps across the algorithms, and word similarity to the most similar extended functional group, a generalized version of the traditional functional group introduced by Lu et al. [31].
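Length and overlap statistics are straightforward to compute once the vocabularies are loaded; a sketch under the assumption that each vocabulary is available as a plain set of strings (chemical validity checking, which requires a cheminformatics toolkit such as RDKit, is omitted here):

```python
def vocab_stats(vocabs):
    """Compare chemical-word vocabularies.

    `vocabs` maps an algorithm name to its set of chemical words.
    Returns per-algorithm mean word length and the pairwise Jaccard
    overlap (|A ∩ B| / |A ∪ B|) between vocabularies.
    """
    mean_len = {name: sum(map(len, v)) / len(v) for name, v in vocabs.items()}
    names = sorted(vocabs)
    overlap = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            inter = len(vocabs[a] & vocabs[b])
            union = len(vocabs[a] | vocabs[b])
            overlap[(a, b)] = inter / union
    return mean_len, overlap

# Toy vocabularies standing in for BPE / WordPiece output.
vocabs = {
    "bpe": {"c1ccccc1", "NS(=O)(=O)", "C", "N"},
    "wordpiece": {"c1ccccc1", "S(=O)(=O)", "C", "N"},
}
mean_len, overlap = vocab_stats(vocabs)
```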
2.3 Term Frequency - Inverse Document Frequency
Term Frequency - Inverse Document Frequency (TF-IDF) [32] is a document vectorization algorithm originally introduced in the field of information retrieval. TF-IDF represents documents based on the importance of each word in the document's vocabulary. TF-IDF postulates that the importance of a word for a document is proportional to the word's frequency in the document and the inverse number of documents in the corpus in which the word appears. The TF-IDF vector for a document $D$ is formulated as:

$$D = [\mathrm{tf}_{w_1,D}\,\mathrm{idf}_{w_1}, \cdots, \mathrm{tf}_{w_V,D}\,\mathrm{idf}_{w_V}]_{1 \times V} \quad (1)$$

where $\mathrm{tf}_{w_i,D}$ is the count of the $i$th word in the vocabulary, $w_i$, in $D$; $\mathrm{idf}_{w_i}$ is the natural logarithm of the number of documents in the corpus divided by the number of documents in which $w_i$ is present; and $V$ is the number of different words in the corpus, i.e., the vocabulary size. However, the naive TF-IDF weighting may consider twenty occurrences of a word in a document twenty times more significant than a single occurrence, which may not be accurate. To address this issue, a sublinear term frequency scaling variant of the algorithm has been introduced [32], which uses the logarithm of the term frequency and assigns a weight given by:

$$\mathrm{tf}_s(w, D) = \begin{cases} 1 + \log \mathrm{tf}_{w,D} & \text{if } \mathrm{tf}_{w,D} > 0 \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

where $\mathrm{tf}_{w,D}$ is the count of word $w$ in document $D$ and $\mathrm{tf}_s(w, D)$ is its scaled counterpart, used in the importance computation instead.
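The weighting scheme above translates directly into code. A minimal stdlib-only sketch using the sublinear variant, applied to hypothetical per-family documents of chemical words:

```python
import math

def tfidf(docs):
    """Sublinear TF-IDF over tokenized documents.

    `docs` maps a document name (e.g. a protein family) to its list of
    chemical words.  The term weight is (1 + ln tf) * ln(N / df), where
    N is the number of documents and df the word's document frequency.
    """
    n_docs = len(docs)
    df = {}
    for words in docs.values():
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    vectors = {}
    for name, words in docs.items():
        tf = {}
        for w in words:
            tf[w] = tf.get(w, 0) + 1
        vectors[name] = {
            w: (1 + math.log(c)) * math.log(n_docs / df[w])
            for w, c in tf.items()
        }
    return vectors

# Hypothetical documents: chemical words of strong binders per family.
docs = {
    "carbonic_anhydrase": ["NS(=O)(=O)", "NS(=O)(=O)", "c1ccccc1"],
    "kinase": ["c1ccccc1", "C#N"],
}
vecs = tfidf(docs)
```

Note how the word shared by both documents receives a zero weight, while family-specific words score highest, which is exactly the behavior the pipeline exploits.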
2.4 Datasets
We used three different datasets of protein-ligand affinity.
The BDB dataset contains affinity information for 31K interactions of 924 compounds and 490 proteins from 81 protein families. The BDB dataset reports the protein-compound affinities in terms of $pK_d$ and has been used in previous protein-compound interaction prediction studies [19, 33]. To identify protein family-specific words, we map each protein in this dataset to its corresponding PFAM families [34] using the InterPro API [35].
LIT-PCBA [36] includes 15 target proteins and 7,844 active and 407,381 inactive compounds. The dataset is specifically curated for virtual screening and machine learning, with efforts made to ensure it is unbiased and realistic.
ProtBENCH [37] contains protein family-specific bioactivity datasets. These datasets include interactions belonging to different protein superfamilies, including membrane receptors, ion channels, transporters, transcription factors, epigenetic regulators, and enzymes with five subgroups (i.e., transferases, proteases, hydrolases, oxidoreductases, and other enzymes). The family subsets have varying sizes in terms of interactions (from 19K to 220K), number of proteins (from 100 to 1K), and number of compounds (from 10K to 120K).
2.5 Identifying the Key Chemical Words for Strong Protein-Ligand Binding
Here we propose a novel pipeline to identify key chemical words for strong binding to both individual proteins and to
protein families. The proposed algorithm is language-inspired and postulates, similar to TF-IDF, that if a chemical
word is common and unique to the strong binders of a protein or protein family, then it signifies a key chemical
substructure for binding. Accordingly, the algorithm first identifies strong binders for each protein or protein family.
While LIT-PCBA already distinguishes the interactions as strong and weak binders, strong binders in the other datasets are identified using predefined thresholds: a $pK_d$ value higher than 7 for BDB and a bioactivity score (pChEMBL) greater than the median score for ProtBENCH. Next, the strong binders of each protein or protein family are represented as a document, in which each strong binding compound is a sentence composed of chemical words identified via the algorithms described in the Segmenting Chemicals into Chemical Words section. Finally, documents containing the SMILES representations of the high affinity ligands of each protein or protein family are vectorized with TF-IDF and the chemical words are ranked based on their TF-IDF scores. The proposed pipeline is illustrated in Figure 1.
[Figure 1 diagram: three pipeline stages, High Affinity Ligand Search → Subword Tokenization → Key Chemical Words, illustrated with segmented SMILES fragments such as NS(=O)(=O), NS(=O)(=O)c1ccc(, and S(=O)(=O).]
Figure 1: The proposed pipeline to identify key chemical words for high affinity to a protein family. For each protein
family, the pipeline first extracts the SMILES representations of compounds that bind to the protein family with high
affinity. Next, the SMILES strings are segmented into their chemical words via the tokenization algorithms such as
BPE, and each high-affinity compound list is modeled as a document composed of SMILES "sentences". Last, the
compound documents are vectorized via TF-IDF, which assigns an importance score to each chemical word per protein
family, and the ten chemical words with the highest TF-IDF scores are identified as key chemical words. Key chemical
words for carbonic anhydrase, casein kinase 1 gamma, pyruvate kinase M2, and aldehyde dehydrogenase 1 enzyme
systems are analyzed further to interpret their chemical significance.
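The first stage of the pipeline, collecting strong binders into per-family documents, can be sketched as follows. The threshold of 7 mirrors the $pK_d$ cut-off used for BDB; the interaction triples and their tokenized SMILES are hypothetical:

```python
def build_documents(interactions, threshold=7.0):
    """Group strong binders into per-family chemical-word documents.

    `interactions` is a list of (family, smiles_tokens, affinity)
    triples, where the SMILES are pre-segmented into chemical words.
    Only interactions above the affinity threshold contribute.
    """
    docs = {}
    for family, tokens, affinity in interactions:
        if affinity > threshold:
            docs.setdefault(family, []).extend(tokens)
    return docs

interactions = [
    ("carbonic_anhydrase", ["NS(=O)(=O)", "c1ccc("], 8.2),
    ("carbonic_anhydrase", ["CC", "c1ccccc1"], 5.1),  # weak binder: excluded
    ("kinase", ["C#N", "c1ccccc1"], 7.9),
]
docs = build_documents(interactions)
```

Each resulting document is then vectorized with TF-IDF, and the ten highest-scoring words per family are kept as the key chemical words.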
2.6 Comparing Key Chemical Words
While the proposed pipeline aims at identifying key chemical words for strong binders of individual proteins or protein
families, it can also be used to compute word importance scores for weak binders. To compare the importance scores of
weak binders with those of strong binders, we apply the pipeline on weak binders by considering weak interactions
as documents associated with proteins or protein families. Next, we conduct Wilcoxon rank-sum tests (p < 0.05) on
a target level to compare the importance scores of chemical words associated with strong binding to each protein or
family with those associated with weak binding to that particular target.
To investigate whether the chemical words identified by using different vocabularies are similar, we compute mean edit
distance scores between selected chemical words using the following equation:

$$\frac{1}{|A_p|} \sum_{h_{A_p} \in A_p} \min_{h_{B_p} \in B_p} d\left(h_{A_p}, h_{B_p}\right) \quad (3)$$

where $A_p$ and $B_p$ are the sets of key chemical words selected for protein or family $p$ with two different vocabularies and $d$ is the edit distance.
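This comparison pairs each key word selected with one vocabulary to its nearest neighbor among the words selected with another, then averages those minimum distances. A stdlib-only sketch using the standard Levenshtein distance as $d$, on hypothetical word sets:

```python
def edit_distance(a, b):
    """Levenshtein distance between two chemical words (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def mean_min_edit_distance(words_a, words_b):
    """For each word from vocabulary A, distance to its closest word
    from vocabulary B, averaged over all words from A."""
    return sum(min(edit_distance(a, b) for b in words_b)
               for a in words_a) / len(words_a)

# Hypothetical top-ranking words from two vocabularies.
wa = ["NS(=O)(=O)", "c1ccccc1"]
wb = ["S(=O)(=O)", "c1ccccc1"]
score = mean_min_edit_distance(wa, wb)
```

A low score indicates that the different tokenization algorithms converge on similar key chemical words, which is what the study reports.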