
formulate the task as finding the most accurate semantically and syntactically valid solution among the candidate solutions generated by SHR. Aided by the substantially pruned exponential search space provided by SHR and by linguistically involved feature engineering, these lexicon-driven systems (Krishna et al., 2020b, 2018) report close to state-of-the-art performance on the SWS task.
However, these approaches rely on the completeness assumption of SHR, which is optimistic given that SHR does not use domain-specific lexicons; these models are handicapped whenever this preliminary step fails. On the other hand, purely engineering-based, knowledge-lean, data-centric approaches (Hellwig and Nehrdich, 2018; Reddy et al., 2018; Aralikatte et al., 2018) perform surprisingly well without any explicit hand-crafted features or external linguistic resources, and are known for their ease of scalability and deployment during training and inference. However, a drawback of these approaches is that they are blind to the latent word information available through external resources.
There are also lattice-structured approaches (Zhang and Yang, 2018; Gui et al., 2019; Li et al., 2020), originally proposed for Chinese Named Entity Recognition (NER), which incorporate lexical information into a character-level sequence labelling architecture. However, these approaches cannot be directly applied to SWS, since acquiring word-level information is non-trivial due to the sandhi phenomenon. To overcome these shortcomings, we propose the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST).
TransLIST blends the strengths of purely engineering and lexicon-driven approaches to the SWS task and provides the following advantages: (1) Similar to purely engineering approaches, it facilitates ease of scalability and deployment during training/inference. (2) Similar to lexicon-driven approaches, it is capable of utilizing the candidate solutions generated by SHR, which further improves performance. (3) Contrary to lexicon-driven approaches, TransLIST is robust and can function even when the candidate solution space is only partly available or entirely unavailable.
Our key contributions are as follows: (a) We propose a linguistically informed tokenization module (§ 2.1) which accommodates the language-specific sandhi phenomenon and adds inductive bias for the SWS task. (b) We propose a novel soft-masked attention (§ 2.2) that adds inductive bias for prioritizing potential candidates while keeping the mutual interactions between all candidates intact. (c) We propose a novel path ranking algorithm (§ 2.3) to rectify corrupted predictions. (d) We report an average absolute gain of 7.2 points in perfect match (§ 3) over the current state-of-the-art system (Hellwig and Nehrdich, 2018).
We elucidate our findings by first describing TransLIST and its key components (§ 2), followed by an evaluation of TransLIST against strong baselines on a test bed of two benchmark datasets for the SWS task (§ 3). Finally, we investigate and delve deeper into the capabilities of the proposed components and their corresponding modules (§ 4).
2 Methodology
In this section, we examine the key components of TransLIST, which include a linguistically informed tokenization module that encodes character input with latent word information while accounting for the SWS-specific sandhi phenomenon (§ 2.1), a novel soft-masked attention to prioritise potential candidate words (§ 2.2), and a novel path ranking algorithm to correct mispredictions (§ 2.3).
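To make the soft-masking idea concrete, the following is a generic sketch of attention with an additive candidate bias: rather than hard-masking (zeroing out) non-candidate positions, a bias is added to the attention logits so candidate positions are prioritised while all pairwise interactions remain intact. The bias matrix and its values are hypothetical, and this is a simplified illustration rather than TransLIST's exact formulation.

```python
# Sketch of soft-masked attention: candidate positions receive a logit
# bonus instead of a hard mask, so every position can still attend to
# every other position. Illustrative only; not the paper's exact design.
import numpy as np

def soft_masked_attention(q, k, v, candidate_bias):
    """q, k, v: (n, d) arrays; candidate_bias: (n, n) logit offsets,
    e.g. a positive bonus where a key position lies on a likely
    candidate word, zero elsewhere."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + candidate_bias
    # Numerically stable softmax over each row.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = rng.normal(size=(3, n, d))
bias = np.zeros((n, n))
bias[:, 2] = 2.0  # hypothetical: position 2 lies on a strong candidate
out = soft_masked_attention(q, k, v, bias)
print(out.shape)  # (4, 8)
```

With a moderate bias, attention merely shifts toward candidate positions; as the bias grows large, it degenerates into a hard mask.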
2.1 Linguistically Informed Sanskrit
Tokenizer (LIST)
Lexicon-driven approaches for SWS are brittle in realistic scenarios, while purely engineering-based approaches do not consider the potentially useful latent word information. We propose a robust, best-of-both-worlds solution by formulating SWS as character-level sequence labelling integrated with latent word information from SHR as and when available. TransLIST is illustrated with the example śvetodhāvati in Figure 2. SHR employs a Finite State Transducer (FST) in the form of a lexical juncture system to obtain a compact representation of the candidate solution space aligned with the input sequence. As shown in Figure 2(a), we receive the candidate solution space from the SHR engine.
Here, śvetaḥ dhāvati and śveta ūdhā avati are two syntactically possible splits.³ SHR does not suggest the final segmentation. The candidate space includes words such as śva, śvetaḥ and etaḥ, whose boundaries are modified with respect to the input sequence due to the sandhi phenomenon. SHR gives us the mapping (head and tail positions) of all

³ Only some of the solutions are shown for visualization.
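The head/tail mapping described above can be sketched as follows; the candidate words and span offsets here are hypothetical (diacritics omitted) and merely mimic the kind of character-aligned output SHR provides.

```python
# Illustrative sketch of aligning SHR candidate words to input characters.
# Words and (head, tail) offsets below are hypothetical placeholders.
from collections import defaultdict

text = "svetodhavati"  # input character sequence (diacritics omitted)

# Each candidate comes with the span of input characters it was derived
# from; due to sandhi, its surface form may differ from the raw
# characters in that span (e.g. "svetah" derived from "sveto").
candidates = [
    ("svetah", 0, 4),   # head=0, tail=4 (inclusive)
    ("sveta", 0, 4),
    ("dhavati", 5, 11),
    ("etah", 2, 4),
]

# For character-level sequence labelling, collect for every character
# position the latent candidate words whose span covers it.
char_to_words = defaultdict(list)
for word, head, tail in candidates:
    for pos in range(head, tail + 1):
        char_to_words[pos].append(word)

for pos, ch in enumerate(text):
    print(pos, ch, char_to_words.get(pos, []))
```

Each character is thus associated with the set of latent words covering it, which is the word-level signal the sequence labeller can draw on.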