
Data-Driven SMILES Tokenization to Identify Binding Moieties
A PREPRINT
1 Introduction
The last decade has witnessed the rise of machine learning. Models have learned to sing [1], write [2], and paint [3]
from large unlabeled datasets, despite the absence of well-defined rules for these tasks. Drug discovery is also an
excellent application domain for machine learning models. However, unlike images and audio tracks, chemicals are
non-numeric and need intermediate representations upon which machines can learn.
Text-based representations of chemicals [4, 5, 6] can serve as such intermediate representations: they are easily available,
simple, and as powerful as more complex 2D or 3D representations [7]. The power of text-based representations in
machine learning is partially due to the chemical language perspective, which views text-based representations of
chemicals as documents written in a chemical language and borrows advanced approaches from the natural language
processing domain to build computational drug discovery models [8]. Successful applications of the chemical language
perspective in drug discovery include de novo drug design [9, 10, 11, 12], binding site detection [13, 14, 15, 16], and
drug-target interaction prediction [17, 18, 19, 20].
At the core of the chemical language perspective are chemical words, which correspond to the small building blocks
of chemicals, similar to the words in natural languages. Defining the chemical words, however, poses a research
problem of its own, since the chemical language, unlike natural (human) languages, has no pre-defined collection of
words, i.e., a vocabulary. Several approaches are available to build a chemical vocabulary for state-of-the-art models on
downstream tasks such as drug-target affinity prediction, similar compound identification, and de novo drug design
[21, 22, 23]. However, to the best of our knowledge, the chemical vocabularies obtained from text-based representations
have not yet been studied from a chemical perspective, and it is therefore unknown whether these vocabularies
capture chemical information. This raises the question: do models utilizing chemical words rely on chemically
meaningful building blocks, or on arbitrary chemical subsequences? Here we seek an answer to this question in order to
define the role of these models in new drug design and development from a medicinal chemist’s perspective.
The chemical word is an integral concept for studies that rely on the chemical language hypothesis [8]. The chemical
language hypothesis interprets the text-based representations (i.e., molecular strings) of chemicals as sentences and
borrows language processing methods from natural languages. Chemical words are the building blocks of these
“sentences” and, unlike the words in natural languages, need to be discovered.
Subword tokenization algorithms allow learning the chemical words from large unlabeled corpora of molecular strings.
In addition to discovering chemical words, they can segment the chemical sentences into chemical words from the
discovered vocabulary. One commonly used algorithm in this context is Byte Pair Encoding (BPE) [24], which is widely
adopted in various applications such as de novo drug design [23], molecular property prediction [25], and protein-ligand
binding affinity prediction [21]. BPE starts with an initial vocabulary of individual characters and iteratively merges the
most frequently occurring pairs until a desired vocabulary size is reached. Another subword tokenization method used
in molecular design [26] is WordPiece [27], which is similar to BPE in that it also begins with an initial vocabulary and
continues merging pairs of tokens. However, unlike BPE, WordPiece selects the pair that maximizes the likelihood of
the training data, rather than simply the most frequent one. In contrast to both BPE and WordPiece, Unigram [28] is a
top-down tokenization method that starts with an initial vocabulary of all possible words and reduces it based on the
likelihood of a unigram language model; it has recently been used in molecular fingerprinting [29].
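To make the bottom-up merging procedure concrete, the following is a minimal, illustrative sketch of BPE training and segmentation on SMILES strings in pure Python. The toy corpus and function names are our own, not part of the tokenizers used in this work, and a production tokenizer would additionally pre-tokenize multi-character atoms such as Cl and Br before merging.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a corpus of SMILES strings."""
    # Initial vocabulary: individual characters of each SMILES string.
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        seqs = [merge_pair(seq, best) for seq in seqs]
    return merges

def tokenize(smiles, merges):
    """Segment a SMILES string by replaying the learned merges in order."""
    seq = list(smiles)
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq

# Toy corpus: "CC" is the most frequent pair, so it becomes the first chemical word.
merges = train_bpe(["CCO", "CCN", "CCC"], num_merges=1)
print(merges)                   # [('C', 'C')]
print(tokenize("CCN", merges))  # ['CC', 'N']
```

WordPiece and Unigram differ only in the criterion used to grow or shrink the vocabulary (training-data likelihood instead of raw pair frequency), while the segmentation step remains analogous.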
We adopt three widely utilized subword tokenization algorithms, namely Byte Pair Encoding (BPE), WordPiece, and
Unigram, which have been successfully applied in computational drug discovery tasks [23, 25, 21, 26, 29], and compare
the vocabularies generated by these methods. However, due to the large number of chemical words in the resulting
vocabularies, it is impractical to interpret them all manually. Therefore, we propose a novel language-inspired pipeline
to identify key chemical words that are strongly associated with protein-ligand binding. In natural languages, each
genre or period has a certain distribution of words. Similarly, each protein family can be expected to have a certain
distribution of substructures (or chemical words) to which it would bind. Therefore, we focused on the high affinity
ligands of protein families to build protein family specific vocabularies and analyzed the chemical significance of
subwords for the protein family. Protein-ligand binding is selected as the chemical investigation perspective since
strong binding relationships are fundamental in drug discovery pipelines, are widely available in the literature, and
naturally group chemicals by their binding targets. The pipeline processes a protein-ligand binding affinity dataset and
identifies ten chemical words for each protein or protein family in the dataset. We find that while the vocabularies
generated by different subword tokenization algorithms differ in words, lengths, and validity, the identified key chemical
words are similar, as measured by the mean edit distance between the protein- or family-specific top-ranking subwords
of different vocabularies. We also observe that the selected words are protein- or family-specific, often associated
with only one protein/family, and significantly different from the words identified for weak binders. As a case study,
we examine the selected chemical words for a number of important drug target families and find that the top-ranking
chemical words are associated with the known pharmacophores and functional groups of the protein families. Notably,