TransLIST: A Transformer-Based Linguistically Informed Sanskrit
Tokenizer
Jivnesh Sandhan1, Rathin Singha2, Narein Rao1, Suvendu Samanta1,
Laxmidhar Behera1,4 and Pawan Goyal3
1IIT Kanpur, 2UCLA, 3IIT Kharagpur, 4IIT Mandi
jivnesh@iitk.ac.in, rsingha108@g.ucla.edu, nrao20@iitk.ac.in, pawang@cse.iitkgp.ac.in
Abstract
Sanskrit Word Segmentation (SWS) is essential in making digitized texts available and in deploying downstream tasks. It is, however, non-trivial because of the sandhi phenomenon, which modifies the characters at word boundaries and needs special treatment. Existing lexicon-driven approaches for SWS make use of the Sanskrit Heritage Reader, a lexicon-driven shallow parser, to generate the complete candidate solution space, over which various methods are applied to produce the most valid solution. However, these approaches fail when encountering out-of-vocabulary tokens. On the other hand, purely engineering methods for SWS have made use of recent advances in deep learning, but cannot make use of latent word information when it is available.
To mitigate the shortcomings of both families of approaches, we propose the Transformer-based Linguistically Informed Sanskrit Tokenizer (TransLIST), consisting of (1) a module that encodes the character input along with latent word information, which takes into account the sandhi phenomenon specific to SWS and is apt to work with partial or no candidate solutions, (2) a novel soft-masked attention to prioritize potential candidate words, and (3) a novel path ranking algorithm to rectify corrupted predictions. Experiments on the benchmark datasets for SWS show that TransLIST outperforms the current state-of-the-art system by an average absolute gain of 7.2 points in terms of the perfect match (PM) metric.1

1 The codebase and datasets are publicly available at: https://github.com/rsingha108/TransLIST
1 Introduction
Sanskrit is considered a cultural heritage and knowledge-preserving language of ancient India. The momentous development in digitization efforts has made ancient manuscripts in Sanskrit readily available in the public domain. However, the usability of these digitized manuscripts is limited due to the linguistic challenges posed by the language. SWS conventionally serves as the most fundamental prerequisite text-processing step for making these digitized manuscripts accessible and for deploying many downstream tasks such as text classification (Sandhan et al., 2019; Krishna et al., 2016b), morphological tagging (Gupta et al., 2020; Krishna et al., 2018), dependency parsing (Sandhan et al., 2021; Krishna et al., 2020a), automatic speech recognition (Kumar et al., 2022), etc. SWS is not straightforward due to the phenomenon of sandhi, which creates phonetic transformations at word boundaries. This not only obscures the word boundaries but also modifies the characters at the juncture point through deletion, insertion and substitution operations. Figure 1 illustrates some of the syntactically possible splits due to the language-specific sandhi phenomenon for Sanskrit. This demonstrates the challenges involved in identifying the location of the split and the kind of transformation performed at word boundaries.
Figure 1: An example to illustrate the challenges posed by the sandhi phenomenon for the SWS task. The input chunk śvetodhāvati admits a set of syntactically possible candidate splits such as śvā ita ūdhā avati, śva itaḥ dhāvati and śveta ūdhā avati, of which śvetaḥ dhāvati is the correct segmentation.
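To make the kind of boundary transformation concrete, the sketch below applies a single, hand-picked visarga sandhi rule (word-final aḥ followed by a voiced consonant becomes o) to fuse śvetaḥ and dhāvati into the surface form śvetodhāvati from Figure 1. The rule set and function names are purely illustrative and are not part of TransLIST or SHR.

```python
# Illustrative sketch of one sandhi rule (visarga sandhi); real Sanskrit
# sandhi involves many more deletion, insertion and substitution rules.
VOICED_ONSETS = set("gjdbnmyrlvh")  # simplified, ASCII-only approximation

def join_with_sandhi(left: str, right: str) -> str:
    """Join two words, applying only the -ah. + voiced consonant -> -o rule."""
    if left.endswith("aḥ") and right and right[0] in VOICED_ONSETS:
        return left[:-2] + "o" + right   # drop "aḥ", insert "o", delete the space
    return left + " " + right            # no rule applies: keep the boundary

print(join_with_sandhi("śvetaḥ", "dhāvati"))  # -> "śvetodhāvati"
```

SWS must invert such fusions, and the forward rules alone do not determine a unique segmentation: Figure 1 lists several splits that are all consistent with the same surface form.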
The recent surge in SWS datasets (Krishna et al., 2017; Krishnan et al., 2020) has led to various methodologies to handle SWS. Existing lexicon-driven approaches rely on a lexicon-driven shallow parser, popularly known as the Sanskrit Heritage Reader (SHR) (Goyal and Huet, 2016a).2 This line of approaches (Krishna et al., 2016a, 2018, 2020b)
formulate the task as finding the most accurate semantically and syntactically valid solution from the candidate solutions generated by SHR. With the help of the significantly reduced exponential search space provided by SHR and linguistically involved feature engineering, these lexicon-driven systems (Krishna et al., 2020b, 2018) report close to state-of-the-art performance for the SWS task. However, these approaches rely on the completeness assumption of SHR, which is optimistic given that SHR does not use domain-specific lexicons. These models are handicapped by the failure of this preliminary step. On the other hand, purely engineering-based, knowledge-lean, data-centric approaches (Hellwig and Nehrdich, 2018; Reddy et al., 2018; Aralikatte et al., 2018) perform surprisingly well without any explicit hand-crafted features and external linguistic resources. These purely engineering-based approaches are known for their ease of scalability and deployment for training/inference. However, a drawback of these approaches is that they are blind to the latent word information available through external resources.

2 https://sanskrit.inria.fr/DICO/reader.fr.html
There are also lattice-structured approaches (Zhang and Yang, 2018; Gui et al., 2019; Li et al., 2020), originally proposed for Chinese Named Entity Recognition (NER), which incorporate lexical information into a character-level sequence labelling architecture. However, these approaches cannot be directly applied to SWS, since acquiring word-level information is not trivial due to the sandhi phenomenon. To overcome these shortcomings, we propose the Transformer-based Linguistically Informed Tokenizer (TransLIST). TransLIST is a perfect blend of purely engineering and lexicon-driven approaches for the SWS task and provides the following advantages: (1) Similar to purely engineering approaches, it facilitates ease of scalability and deployment during training/inference. (2) Similar to lexicon-driven approaches, it is capable of utilizing the candidate solutions generated by SHR, which further improves the performance. (3) Contrary to lexicon-driven approaches, TransLIST is robust and can function even when the candidate solution space is only partly available or unavailable.
Our key contributions are as follows: (a) We propose a linguistically informed tokenization module (§ 2.1) which accommodates the language-specific sandhi phenomenon and adds inductive bias for the SWS task. (b) We propose a novel soft-masked attention (§ 2.2) that helps to add inductive bias for prioritizing potential candidates while keeping the mutual interactions between all candidates intact. (c) We propose a novel path ranking algorithm (§ 2.3) to rectify corrupted predictions. (d) We report an average 7.2 points absolute gain in perfect match (§ 3) over the current state-of-the-art system (Hellwig and Nehrdich, 2018).
We elucidate our findings by first describing TransLIST and its key components (§ 2), followed by an evaluation of TransLIST against strong baselines on a test bed of two benchmark datasets for the SWS task (§ 3). Finally, we investigate and delve deeper into the capabilities of the proposed components and their corresponding modules (§ 4).
2 Methodology
In this section, we examine the key components of TransLIST, which include a linguistically informed tokenization module that encodes character input with latent word information while accounting for the SWS-specific sandhi phenomenon (§ 2.1), a novel soft-masked attention to prioritise potential candidate words (§ 2.2) and a novel path ranking algorithm to correct mispredictions (§ 2.3).
2.1 Linguistically Informed Sanskrit
Tokenizer (LIST)
Lexicon-driven approaches for SWS are brittle in realistic scenarios, and purely engineering based approaches do not consider the potentially useful latent word information. We propose a robust, win-win solution by formulating SWS as character-level sequence labelling integrated with latent word information from SHR as and when available. TransLIST is illustrated with the example śvetodhāvati in Figure 2. SHR employs a Finite State Transducer (FST) in the form of a lexical juncture system to obtain a compact representation of the candidate solution space aligned with the input sequence. As shown in Figure 2(a), we receive the candidate solution space from the SHR engine. Here, śvetaḥ dhāvati and śveta ūdhā avati are two syntactically possible splits.3 SHR does not suggest the final segmentation. The candidate space includes words such as śva, śvetaḥ and etaḥ, whose boundaries are modified with respect to the input sequence due to the sandhi phenomenon. SHR gives us the mapping (head and tail positions) of all the candidate nodes with respect to the input sequence.

3 Only some of the solutions are shown for visualization.
Figure 2: Illustration of TransLIST with a toy example "śvetodhāvati". Translation: "The white (horse) runs." (a) LIST module: we use the candidate solutions from SHR if available (two possible candidate solutions are highlighted in the figure, the latter being the gold standard); in the absence of SHR, we resort to using n-grams (n ≤ 4). (b) TransLIST architecture: in span encoding, each node is represented by the head and tail position index of its characters in the input sequence; coloured markers denote tokens, heads and tails, respectively. SHR helps to include words such as śva, śvetaḥ and etaḥ, whose boundaries are modified with respect to the input sequence due to the sandhi phenomenon. Finally, on top of the Transformer encoder, a classification head learns to predict the gold-standard output for the corresponding input character nodes only.
In case such a mapping is incorrect, we rectify it with the help of a deterministic algorithm, by matching candidate nodes with the input sentence and finding the closest match. In the absence of SHR, we propose to use all possible n-grams (n ≤ 4),4 which helps to add inductive bias about neighboring candidates within a window of size 4.5
We feed the candidate words/n-grams to the Transformer encoder, and the classification head learns to predict the gold standard output for the corresponding input character nodes only. The output vocabulary consists of unigram characters (e.g., ś, v), bigrams and trigrams (e.g., aḥ_), and contains '_' to represent spacing between words. Consequently, TransLIST is capable of using both character-level modelling and latent word information as and when available. By contrast, purely engineering approaches rely only on character-level modelling, and lexicon-driven approaches rely only on word-level information from SHR to handle sandhi.
4 We do not observe significant improvements for n > 4.
5 Our probing analysis (Figure 4) suggests that char-char attention mostly focuses on immediate neighbors. Refer to § 4 for detailed ablations on LIST variants.
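As a rough illustration of the SHR-free fallback described above, the sketch below enumerates all character n-grams with n ≤ 4 and records each candidate's head and tail positions, mirroring the span encoding of Figure 2(b). The function name and the 1-indexed convention are our own assumptions rather than details of the released code, and the transliterated input is treated as a plain character sequence for simplicity.

```python
# Minimal sketch: candidate spans used when SHR output is unavailable.
# Each candidate is a (token, head, tail) triple, where head/tail are the
# 1-indexed positions of its first and last character in the input chunk.
def ngram_candidates(chunk: str, max_n: int = 4):
    spans = []
    for head in range(len(chunk)):
        for tail in range(head, min(head + max_n, len(chunk))):
            spans.append((chunk[head:tail + 1], head + 1, tail + 1))
    return spans

# For "śvetodhāvati" this yields candidates such as ("śv", 1, 2) and
# ("vati", 9, 12).  When SHR is available, its candidate words (e.g. śvetaḥ,
# aligned to the surface characters it covers) are used instead; either way,
# the candidates are fed to the Transformer encoder alongside the character
# nodes.
```

Restricting to n ≤ 4 keeps the candidate set small while still covering the immediate neighbourhood that, per footnote 5, character-character attention mostly relies on.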
2.2 Soft Masked Attention (SMA)
Transformers (Vaswani et al., 2017) have proven effective for capturing long-distance dependencies in a sequence. The self-attention property of a Transformer facilitates effective interaction between characters and the available latent word information. There are two prerequisites for effectively modelling the inductive bias for tokenization: (1) allow interactions between the candidate words/characters within and amongst chunks; (2) prioritize candidate words containing the input character for which a prediction is being made (e.g., in Figure 2(b), śva and śvetaḥ are prioritized amongst the candidate words when predicting for the character ś).6 Vanilla self-attention (Vaswani et al., 2017) can address both requirements; however, it has to self-learn the inductive bias associated with prioritisation, which may not be effective in low-resource settings. On the other hand, if we use hard-masked attention to address the second prerequisite, we lose the mutual interactions between the candidates. Hence, we propose a novel soft-masked attention which addresses both requirements effectively. To the best of our knowledge, there is no existing soft-masked attention similar to ours. We formally discuss this below.
6 We find that failing to meet either of the prerequisites leads to a drop in performance (§ 4).
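Ahead of that formal discussion, the snippet below is only a hedged sketch of one way such a soft mask could work: instead of setting the attention logits of non-prioritized candidates to negative infinity (hard masking), a finite penalty down-weights candidate words that do not contain the query character while still letting all nodes interact. The tensor shapes, the penalty parameter and the single-head formulation are our assumptions and need not match the paper's exact formulation.

```python
# Hedged, single-head sketch of soft-masked attention (not the authors' code).
import torch
import torch.nn.functional as F

def soft_masked_attention(q, k, v, contains, penalty=2.0):
    """
    q, k, v:   (nodes, d) projections for character and candidate-word nodes.
    contains:  (nodes, nodes) boolean; contains[i, j] is True when node j is a
               candidate word covering character i (or j is a character node).
    penalty:   0.0 recovers vanilla self-attention; a very large value behaves
               like hard masking and severs candidate-candidate interactions.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-1, -2) / d ** 0.5        # raw attention logits
    scores = scores - penalty * (~contains).float()    # soft down-weighting
    weights = F.softmax(scores, dim=-1)
    return weights @ v
```

A learnable or scheduled penalty would interpolate between the two extremes contrasted above, namely vanilla self-attention and hard masking.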