TransLIST: A Transformer-Based Linguistically Informed Sanskrit
Tokenizer
Jivnesh Sandhan1, Rathin Singha2, Narein Rao1, Suvendu Samanta1,
Laxmidhar Behera1,4 and Pawan Goyal3
1IIT Kanpur, 2UCLA, 3IIT Kharagpur, 4IIT Mandi
jivnesh@iitk.ac.in, rsingha108@g.ucla.edu, nrao20@iitk.ac.in, pawang@cse.iitkgp.ac.in
Abstract
Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks. It is, however, non-trivial because of the sandhi phenomenon, which modifies the characters at word boundaries and needs special treatment. Existing lexicon-driven approaches for SWS make use of the Sanskrit Heritage Reader, a lexicon-driven shallow parser, to generate the complete candidate solution space, over which various methods are applied to produce the most valid solution. However, these approaches fail when encountering out-of-vocabulary tokens. On the other hand, purely engineering methods for SWS have made use of recent advances in deep learning, but cannot make use of latent word information when it is available.
To mitigate the shortcomings of both families of approaches, we propose the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST), consisting of (1) a module that encodes the character input along with latent word information, which takes into account the sandhi phenomenon specific to SWS and is apt to work with partial or no candidate solutions, (2) a novel soft-masked attention to prioritize potential candidate words, and (3) a novel path ranking algorithm to rectify corrupted predictions. Experiments on the benchmark datasets for SWS show that TransLIST outperforms the current state-of-the-art system by an average absolute gain of 7.2 points in terms of the perfect match (PM) metric.1

1 The codebase and datasets are publicly available at: https://github.com/rsingha108/TransLIST
1 Introduction
Sanskrit is considered a cultural heritage and knowledge-preserving language of ancient India. The momentous development in digitization efforts has made ancient manuscripts in Sanskrit readily available in the public domain. However, the usability of these digitized manuscripts is limited due to the linguistic challenges posed by the language. SWS conventionally serves as the most fundamental prerequisite text-processing step for making these digitized manuscripts accessible and for deploying many downstream tasks such as text classification (Sandhan et al., 2019; Krishna et al., 2016b), morphological tagging (Gupta et al., 2020; Krishna et al., 2018), dependency parsing (Sandhan et al., 2021; Krishna et al., 2020a), automatic speech recognition (Kumar et al., 2022), etc. SWS is not straightforward due to the phenomenon of sandhi, which creates phonetic transformations at word boundaries. This not only obscures the word boundaries but also modifies the characters at the juncture point through deletion, insertion and substitution operations. Figure 1 illustrates some of the syntactically possible splits due to the language-specific sandhi phenomenon for Sanskrit. This demonstrates the challenges involved in identifying the location of the split and the kind of transformation performed at word boundaries.
Figure 1: An example to illustrate the challenges posed by the sandhi phenomenon for the SWS task. The input chunk śvetodhāvati admits a set of syntactically possible candidate splits such as śvā ita ūdhā avati, śva itaḥ dhāvati and śveta ūdhā avati, of which śvetaḥ dhāvati is the correct segmentation.
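To make the kind of boundary transformation concrete, the sketch below applies a single, hand-picked visarga sandhi rule (word-final aḥ followed by a voiced consonant becomes o) to fuse śvetaḥ and dhāvati into the surface form śvetodhāvati from Figure 1. The rule set and function names are purely illustrative and are not part of TransLIST or SHR.

```python
# Illustrative sketch of one sandhi rule (visarga sandhi); real Sanskrit
# sandhi involves many more deletion, insertion and substitution rules.
VOICED_ONSETS = set("gjdbnmyrlvh")  # simplified, ASCII-only approximation

def join_with_sandhi(left: str, right: str) -> str:
    """Join two words, applying only the -ah. + voiced consonant -> -o rule."""
    if left.endswith("aḥ") and right and right[0] in VOICED_ONSETS:
        return left[:-2] + "o" + right   # drop "aḥ", insert "o", delete the space
    return left + " " + right            # no rule applies: keep the boundary

print(join_with_sandhi("śvetaḥ", "dhāvati"))  # -> "śvetodhāvati"
```

SWS must invert such fusions, and the forward rules alone do not determine a unique segmentation: Figure 1 lists several splits that are all consistent with the same surface form.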
The recent surge in SWS datasets (Krishna et al., 2017; Krishnan et al., 2020) has led to various methodologies to handle SWS. Existing lexicon-driven approaches rely on a lexicon-driven shallow parser, popularly known as the Sanskrit Heritage Reader (SHR) (Goyal and Huet, 2016a).2 This line of approaches (Krishna et al., 2016a, 2018, 2020b)
formulate the task as finding the most accurate semantically and syntactically valid solution from the candidate solutions generated by SHR. With the help of the significantly reduced exponential search space provided by SHR and linguistically involved feature engineering, these lexicon-driven systems (Krishna et al., 2020b, 2018) report close to state-of-the-art performance for the SWS task. However, these approaches rely on the completeness assumption of SHR, which is optimistic given that SHR does not use domain-specific lexicons. These models are handicapped by the failure of this preliminary step. On the other hand, purely engineering-based, knowledge-lean, data-centric approaches (Hellwig and Nehrdich, 2018; Reddy et al., 2018; Aralikatte et al., 2018) perform surprisingly well without any explicit hand-crafted features and external linguistic resources. These purely engineering-based approaches are known for their ease of scalability and deployment for training/inference. However, a drawback of these approaches is that they are blind to the latent word information available through external resources.

2 https://sanskrit.inria.fr/DICO/reader.fr.html
There are also lattice-structured approaches (Zhang and Yang, 2018; Gui et al., 2019; Li et al., 2020), originally proposed for Chinese Named Entity Recognition (NER), which incorporate lexical information into a character-level sequence labelling architecture. However, these approaches cannot be directly applied to SWS, since acquiring word-level information is not trivial due to the sandhi phenomenon. To overcome these shortcomings, we propose the Transformer-based Linguistically Informed Tokenizer (TransLIST). TransLIST is a perfect blend of purely engineering and lexicon-driven approaches for the SWS task and provides the following advantages: (1) Similar to purely engineering approaches, it facilitates ease of scalability and deployment during training/inference. (2) Similar to lexicon-driven approaches, it is capable of utilizing the candidate solutions generated by SHR, which further improves the performance. (3) Contrary to lexicon-driven approaches, TransLIST is robust and can function even when the candidate solution space is only partly available or unavailable.
Our key contributions are as follows: (a) We propose a linguistically informed tokenization module (§ 2.1) which accommodates the language-specific sandhi phenomenon and adds inductive bias for the SWS task. (b) We propose a novel soft-masked attention (§ 2.2) that helps to add inductive bias for prioritizing potential candidates while keeping the mutual interactions between all candidates intact. (c) We propose a novel path ranking algorithm (§ 2.3) to rectify corrupted predictions. (d) We report an average 7.2 points absolute gain in perfect match (§ 3) over the current state-of-the-art system (Hellwig and Nehrdich, 2018).
We elucidate our findings by first describing TransLIST and its key components (§ 2), followed by an evaluation of TransLIST against strong baselines on a test bed of two benchmark datasets for the SWS task (§ 3). Finally, we investigate and delve deeper into the capabilities of the proposed components and their corresponding modules (§ 4).
2 Methodology
In this section, we examine the key components of TransLIST, which include a linguistically informed tokenization module that encodes character input with latent word information while accounting for the SWS-specific sandhi phenomenon (§ 2.1), a novel soft-masked attention to prioritise potential candidate words (§ 2.2) and a novel path ranking algorithm to correct mispredictions (§ 2.3).
2.1 Linguistically Informed Sanskrit
Tokenizer (LIST)
Lexicon-driven approaches for SWS are brittle in realistic scenarios, and purely engineering based approaches do not consider the potentially useful latent word information. We propose a robust, win-win solution by formulating SWS as character-level sequence labelling integrated with latent word information from SHR as and when available. TransLIST is illustrated with the example śvetodhāvati in Figure 2. SHR employs a Finite State Transducer (FST) in the form of a lexical juncture system to obtain a compact representation of the candidate solution space aligned with the input sequence. As shown in Figure 2(a), we receive the candidate solution space from the SHR engine. Here, śvetaḥ dhāvati and śveta ūdhā avati are two syntactically possible splits.3 SHR does not suggest the final segmentation. The candidate space includes words such as śva, śvetaḥ and etaḥ, whose boundaries are modified with respect to the input sequence due to the sandhi phenomenon. SHR gives us the mapping (head and tail positions) of all the candidate nodes with respect to the input sequence.

3 Only some of the solutions are shown for visualization.
Figure 2: Illustration of TransLIST with a toy example "śvetodhāvati". Translation: "The white (horse) runs." (a) LIST module: we use the candidate solutions from SHR if available (two possible candidate solutions are highlighted in the figure, the latter being the gold standard); in the absence of SHR, we resort to using n-grams (n ≤ 4). (b) TransLIST architecture: in span encoding, each node is represented by the head and tail position index of its characters in the input sequence; coloured markers denote tokens, heads and tails, respectively. SHR helps to include words such as śva, śvetaḥ and etaḥ, whose boundaries are modified with respect to the input sequence due to the sandhi phenomenon. Finally, on top of the Transformer encoder, a classification head learns to predict the gold-standard output for the corresponding input character nodes only.
In case such a mapping is incorrect, we rectify it with the help of a deterministic algorithm, by matching candidate nodes with the input sentence and finding the closest match. In the absence of SHR, we propose to use all possible n-grams (n ≤ 4),4 which helps to add inductive bias about neighboring candidates within a window of size 4.5
We feed the candidate words/n-grams to the Transformer encoder, and the classification head learns to predict the gold standard output for the corresponding input character nodes only. The output vocabulary consists of unigram characters (e.g., ś, v), bigrams and trigrams (e.g., aḥ_), and contains '_' to represent spacing between words. Consequently, TransLIST is capable of using both character-level modelling and latent word information as and when available. By contrast, purely engineering approaches rely only on character-level modelling, and lexicon-driven approaches rely only on word-level information from SHR to handle sandhi.
4 We do not observe significant improvements for n > 4.
5 Our probing analysis (Figure 4) suggests that char-char attention mostly focuses on immediate neighbors. Refer to § 4 for detailed ablations on LIST variants.
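As a rough illustration of the SHR-free fallback described above, the sketch below enumerates all character n-grams with n ≤ 4 and records each candidate's head and tail positions, mirroring the span encoding of Figure 2(b). The function name and the 1-indexed convention are our own assumptions rather than details of the released code, and the transliterated input is treated as a plain character sequence for simplicity.

```python
# Minimal sketch: candidate spans used when SHR output is unavailable.
# Each candidate is a (token, head, tail) triple, where head/tail are the
# 1-indexed positions of its first and last character in the input chunk.
def ngram_candidates(chunk: str, max_n: int = 4):
    spans = []
    for head in range(len(chunk)):
        for tail in range(head, min(head + max_n, len(chunk))):
            spans.append((chunk[head:tail + 1], head + 1, tail + 1))
    return spans

# For "śvetodhāvati" this yields candidates such as ("śv", 1, 2) and
# ("vati", 9, 12).  When SHR is available, its candidate words (e.g. śvetaḥ,
# aligned to the surface characters it covers) are used instead; either way,
# the candidates are fed to the Transformer encoder alongside the character
# nodes.
```

Restricting to n ≤ 4 keeps the candidate set small while still covering the immediate neighbourhood that, per footnote 5, character-character attention mostly relies on.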
2.2 Soft Masked Attention (SMA)
Transformers (Vaswani et al., 2017) have proven effective for capturing long-distance dependencies in a sequence. The self-attention property of a Transformer facilitates effective interaction between characters and the available latent word information. There are two prerequisites for effectively modelling the inductive bias for tokenization: (1) allow interactions between the candidate words/characters within and amongst chunks; (2) prioritize candidate words containing the input character for which a prediction is being made (e.g., in Figure 2(b), śva and śvetaḥ are prioritized amongst the candidate words when predicting for the character ś).6 Vanilla self-attention (Vaswani et al., 2017) can address both requirements; however, it has to self-learn the inductive bias associated with prioritisation, which may not be effective in low-resource settings. On the other hand, if we use hard-masked attention to address the second prerequisite, we lose the mutual interactions between the candidates. Hence, we propose a novel soft-masked attention which addresses both requirements effectively. To the best of our knowledge, there is no existing soft-masked attention similar to ours. We formally discuss this below.
6 We find that failing to meet either of the prerequisites leads to a drop in performance (§ 4).
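Ahead of that formal discussion, the snippet below is only a hedged sketch of one way such a soft mask could work: instead of setting the attention logits of non-prioritized candidates to negative infinity (hard masking), a finite penalty down-weights candidate words that do not contain the query character while still letting all nodes interact. The tensor shapes, the penalty parameter and the single-head formulation are our assumptions and need not match the paper's exact formulation.

```python
# Hedged, single-head sketch of soft-masked attention (not the authors' code).
import torch
import torch.nn.functional as F

def soft_masked_attention(q, k, v, contains, penalty=2.0):
    """
    q, k, v:   (nodes, d) projections for character and candidate-word nodes.
    contains:  (nodes, nodes) boolean; contains[i, j] is True when node j is a
               candidate word covering character i (or j is a character node).
    penalty:   0.0 recovers vanilla self-attention; a very large value behaves
               like hard masking and severs candidate-candidate interactions.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-1, -2) / d ** 0.5        # raw attention logits
    scores = scores - penalty * (~contains).float()    # soft down-weighting
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

A learnable or scheduled penalty would interpolate between the two extremes contrasted above, namely vanilla self-attention and hard masking.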