Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic
N-Gram Rule Generation for Spelling Normalization in Filipino
Lorenzo Jaime Yu Flores Dragomir Radev
Yale University
lj.flores@yale.edu
Abstract
With 84.75 million Filipinos online, the ability of models to process online text is crucial for developing Filipino NLP applications. To this end, spelling correction is a key preprocessing step for downstream processing. However, the lack of data prevents the use of language models for this task. In this paper, we propose an N-Gram + Damerau-Levenshtein distance model with automatic rule extraction. We train the model on 300 samples, and show that despite limited training data, it achieves good performance and outperforms other deep learning approaches in terms of accuracy and edit distance. Moreover, the model (1) requires little compute power, (2) trains in little time, thus allowing for retraining, and (3) is easily interpretable, allowing for direct troubleshooting. These results highlight the success of traditional approaches over more complex deep learning models in settings where data is unavailable.
1 Introduction
Filipinos are among the most active social media users worldwide (Baclig, 2022). In 2022, roughly 84.75M Filipinos were online (Statista, 2022a), with 96.2% on Facebook (Statista, 2022b). Hence, developing language models that can process online text is crucial for Filipino NLP applications.
Contractions and abbreviations are common in such online text (Salvacion and Limpot, 2022). For example, dito (here) can be written as d2, or nakakatawa (funny) as nkktawa, which are abbreviated based on their pronunciation. However, language models like Google Translate remain limited in their ability to detect and correct such words, as we find later in the paper. Hence, we aim to improve the spelling correction ability of such models.
In this paper, we demonstrate the effectiveness of a simple n-gram based algorithm for this task, inspired by prior work on automatic rule generation by Mangu and Brill (1997). Specifically, we (1) create a training dataset of 300 examples, (2) automatically generate n-gram based spelling rules using the dataset, and (3) use the rules to propose and select candidates. We then demonstrate that this model outperforms seq-to-seq approaches.
Ultimately, the paper aims to highlight the use of traditional approaches in areas where SOTA language models are difficult to apply due to limitations in data availability. Such approaches have the added benefit of (1) requiring little compute power for training and inference, (2) training in very little time (allowing for frequent retraining), and (3) giving researchers full clarity over their inner workings, thereby improving the ease of troubleshooting.
2 Related Work
The problem of online text spelling correction is most closely related to spelling normalization, the subtask of reverting shortcuts and abbreviations into their original form (Nocon et al., 2014). In this paper, we will use correcting to mean normalizing a word. This is useful for low-resource languages like Filipino, wherein spelling is often not standardized across its users (Li et al., 2020).
Many approaches have been tried for word normalization in online Filipino text: (1) predetermined rules using commonly seen patterns (Guingab et al., 2014; Oco and Borra, 2011), (2) dictionary-substitution models for extracting patterns in misspelled words (Nocon et al., 2014), or (3) trigrams combined with Levenshtein or QWERTY distance to select words which share similar trigrams or are close in terms of edit or keyboard distance (Chan et al., 2008; Go et al., 2017).
Each method has its limitations, which we seek to address. Predetermined rules must be manually updated to learn emerging patterns, as is common in the constantly evolving vocabulary of online Filipino text (Salvacion and Limpot, 2022; Lumabi, 2020). Dictionary-substitution models are limited by the constraint of mapping each pattern to only a single substitution, whereas in reality, different substitutions may need to be applied to different words bearing the same pattern (Nocon et al., 2014). Trigrams and distance metrics alone may be successful in the context of correcting typographical errors for which the model was developed (Chan et al., 2008), but may not be as successful on intentionally abbreviated words. Our work uses a combination of these methods to develop a model that can be easily updated, considers multiple possible candidates, and works in the online text setting.
The task is further complicated by the lack of data, which hinders the use of large pretrained language models. Previous supervised modeling approaches require thousands of labeled examples (Etoori et al., 2018), and even unsupervised approaches for similar problems require vocabulary lists containing the desired words for translation (Lample et al., 2018a,b). Since such datasets are not available, our paper revisits simpler models, and finds that they exhibit performance comparable to that of much larger SOTA models.
3 Data
We use a dataset consisting of Facebook comments made on weather advisories of a Philippine weather bureau in 2014. We identified 403 distinct abbreviated and contracted words, and had three Filipino undergraduate volunteers write their “correct” versions. To maximize the data, we removed hyphens and standardized spacing, then filtered out candidates where all annotators gave different answers. We obtained 398 examples (98.7%) with 83.8% inter-annotator agreement. We then created a 298-100 train-test split; we selected test examples that used spelling rules present in the training set to test the ability of our n-gram model to extract and apply such rules. To test generalizability, we also perform cross-validation. The data and code for our experiments are available at https://github.com/ljyflores/Filipino-Slang.
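To make the preprocessing concrete, the sketch below shows one way these steps could be applied. The input format, the normalize helper, and the majority-vote filter are our own assumptions for illustration, not the authors' exact pipeline.

```python
from collections import Counter

def normalize(text):
    """Remove hyphens and standardize spacing, as described above."""
    return " ".join(text.replace("-", "").split())

def build_dataset(raw_annotations):
    """raw_annotations: list of (misspelled_word, [annotator_1, annotator_2, annotator_3]).
    Keep a pair only when at least two annotators agree on the corrected form."""
    dataset = []
    for word, annotations in raw_annotations:
        votes = Counter(normalize(a) for a in annotations)
        best, count = votes.most_common(1)[0]
        if count >= 2:  # drop items where all annotators gave different answers
            dataset.append((normalize(word), best))
    return dataset

# Hypothetical example: "d2" is kept (two annotators agree); the second pair is dropped.
raw = [("d2", ["dito", "dito", "ditto"]),
       ("nkktawa", ["nakakatawa", "nakatawa", "naka"])]
print(build_dataset(raw))  # [('d2', 'dito')]
```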
4 Model
Automatic Rule Generation
We extract spelling rules from pairs (w, c), where w is a misspelled word and c is its corrected version. The rule generation algorithm slides a window of length k over w and c, and records w[i:i+k] → c[j:j+k] as a rule (i, j are pointers); it returns a dictionary mapping each substring to a list of “correct” substrings (see Appendix 1 for the algorithm and an example).
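As a rough illustration of the rule extraction step, the sketch below slides a length-k window over each (w, c) pair and records substring rules in a dictionary. The proportional alignment of the pointer j to i is a simplifying assumption on our part; the paper's appendix specifies the actual pointer updates.

```python
from collections import defaultdict

def extract_rules(pairs, k=2):
    """Map each length-k substring of a misspelled word to the 'correct'
    substrings observed at (approximately) the same position."""
    rules = defaultdict(list)
    for w, c in pairs:
        for i in range(len(w) - k + 1):
            # Simplified alignment: scale i by the relative lengths of c and w.
            j = round(i * (len(c) - k) / max(len(w) - k, 1))
            rules[w[i:i + k]].append(c[j:j + k])
    return rules

rules = extract_rules([("nkktawa", "nakakatawa"), ("d2", "dito")], k=2)
# Under this simplified alignment: rules["nk"] == ["na"], rules["d2"] == ["di"]
```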
We test substrings of length 1 to 4, and find that lengths 1 and 2 work best. This makes sense, as many Filipino words are abbreviated by syllable, and syllables typically have 1-2 letters. This is similar to Indonesian (Batais and Wiltshire, 2015) and Malay (Ramli et al., 2015), suggesting possible extensions.
We further filter candidates to words present in a Filipino vocabulary list developed by Gensaya (2018) (MIT License), except when none of the candidates exist in the vocabulary list, in which case we use all the generated words as candidates.
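A minimal sketch of this filtering step is shown below, assuming the vocabulary list has been loaded into a Python set; the fallback to the full candidate set mirrors the exception described above.

```python
def filter_by_vocabulary(candidates, vocabulary):
    """Keep candidates that appear in the vocabulary; if none do, keep them all."""
    in_vocab = [c for c in candidates if c in vocabulary]
    return in_vocab if in_vocab else candidates

vocabulary = {"dito", "nakakatawa"}                          # placeholder entries
print(filter_by_vocabulary(["dito", "dato"], vocabulary))    # ['dito']
print(filter_by_vocabulary(["dzto", "dqto"], vocabulary))    # ['dzto', 'dqto'] (fallback)
```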
Candidate Generation
We recursively generate candidates by replacing each substring with all possible rules in the rule dictionary. If the substring is not present, we keep the substring as is. An example can be found in Appendix D.
We find that rules involving single-letter substrings often occur at the end of a word. Hence, we test candidate generation algorithms which either allow single-letter rules to be used anywhere when generating (V1), or only for the last letter of a word (V2). We also vary the number of candidates kept at each generation step (ranked by likelihood, see Eq. 2).
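The recursive expansion can be sketched as follows. This roughly corresponds to the V1 variant (rules applied anywhere), without the per-step pruning by likelihood; the left-to-right traversal order is our own choice for illustration.

```python
def generate_candidates(word, rules, k=2):
    """Recursively expand the word: replace the leading length-k substring with
    every right-hand side in the rule dictionary, or keep it unchanged if no
    rule applies, then recurse on the remainder."""
    if len(word) < k:
        return set(rules.get(word, [word])) if word else {""}
    head, tail = word[:k], word[k:]
    replacements = rules.get(head, [head])  # keep the substring as is if not present
    return {r + rest for r in replacements for rest in generate_candidates(tail, rules, k)}

rules = {"d2": ["dito"], "nk": ["nak", "nik"]}   # toy rule dictionary
print(generate_candidates("d2", rules))          # {'dito'}
print(generate_candidates("nkktawa", rules))     # {'nakktawa', 'nikktawa'}
```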
Ranking Candidates
We explore two ways of ranking candidates: (1) Damerau-Levenshtein distance: we rank candidates by their edit distance from the misspelled word, using the pyxDamerauLevenshtein package (https://github.com/lanl/pyxDamerauLevenshtein) with standard settings; and (2) likelihood score: we compute the likelihood of the output word c given the misspelled word w as the product of the probabilities of the rules used to generate it, where the probability of a rule a → b is the number of occurrences of a → b divided by the number of rules starting with a (see Eqs. 1 and 2).
P(a \to b) = \frac{|\{a \to b\}|}{|\{a \to c\}\ \forall c|} \quad (1)
P(w \to c) = \prod_{i=1}^{\mathrm{len}(w)-k} P\big(w[i:i+k] \to c[i:i+k]\big) \quad (2)
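The likelihood score of Eqs. 1 and 2 and the edit-distance ranking can be sketched as follows. The rule_counts structure and the per-candidate list of applied rules are assumed to come from the rule extraction and candidate generation steps; they are our own bookkeeping, not necessarily the paper's exact data structures.

```python
from collections import Counter
from pyxdameraulevenshtein import damerau_levenshtein_distance

def rule_probability(rule_counts, a, b):
    """Eq. 1: count of rule a -> b over the count of all rules with left-hand side a."""
    total = sum(n for (lhs, _), n in rule_counts.items() if lhs == a)
    return rule_counts[(a, b)] / total if total else 0.0

def likelihood(applied_rules, rule_counts):
    """Eq. 2: product of the probabilities of the rules used to generate a candidate."""
    score = 1.0
    for a, b in applied_rules:
        score *= rule_probability(rule_counts, a, b)
    return score

# Hypothetical counts collected during rule extraction.
rule_counts = Counter({("nk", "nak"): 3, ("nk", "nik"): 1})
print(rule_probability(rule_counts, "nk", "nak"))   # 0.75

# Ranking candidates by Damerau-Levenshtein distance from the misspelled word.
candidates = ["nakakatawa", "nakatawa"]
ranked = sorted(candidates, key=lambda c: damerau_levenshtein_distance("nkktawa", c))
```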
5 Evaluation
5.1 Comparison to Language Models
We benchmark the performance of our models
against two seq-to-seq models on the same dataset: