Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic
N-Gram Rule Generation for Spelling Normalization in Filipino
Lorenzo Jaime Yu Flores Dragomir Radev
Yale University
lj.flores@yale.edu
Abstract
With 84.75 million Filipinos online, the ability of models to process online text is crucial for developing Filipino NLP applications. To this end, spelling correction is a key preprocessing step for downstream processing. However, the lack of data prevents the use of language models for this task. In this paper, we propose an N-Gram + Damerau-Levenshtein distance model with automatic rule extraction. We train the model on 300 samples, and show that despite limited training data, it achieves good performance and outperforms other deep learning approaches in terms of accuracy and edit distance. Moreover, the model (1) requires little compute power, (2) trains in little time, thus allowing for retraining, and (3) is easily interpretable, allowing for direct troubleshooting. These results highlight the success of traditional approaches over more complex deep learning models in settings where data is unavailable.
1 Introduction
Filipinos are among the most active social media users worldwide (Baclig, 2022). In 2022, roughly 84.75M Filipinos were online (Statista, 2022a), with 96.2% on Facebook (Statista, 2022b). Hence, developing language models that can process online text is crucial for Filipino NLP applications.
Contractions and abbreviations are common in such online text (Salvacion and Limpot, 2022). For example, dito (here) can be written as d2, or nakakatawa (funny) as nkktawa, which are abbreviated based on their pronunciation. However, language models like Google Translate remain limited in their ability to detect and correct such words, as we find later in the paper. Hence, we aim to improve the spelling correction ability of such models.
In this paper, we demonstrate the effectiveness of a simple n-gram based algorithm for this task, inspired by prior work on automatic rule generation by Mangu and Brill (1997). Specifically, we (1) create a training dataset of 300 examples, (2) automatically generate n-gram based spelling rules using the dataset, and (3) use the rules to propose and select candidates. We then demonstrate that this model outperforms seq-to-seq approaches.
Ultimately, the paper aims to highlight the use of traditional approaches in areas where SOTA language models are difficult to apply due to limitations in data availability. Such approaches have the added benefit of (1) requiring little compute power for training and inference, (2) training in very little time (allowing for frequent retraining), and (3) giving researchers full clarity over their inner workings, thereby improving the ease of troubleshooting.
2 Related Work
The problem of online text spelling correction is most closely related to spelling normalization, the subtask of reverting shortcuts and abbreviations into their original form (Nocon et al., 2014). In this paper, we will use correcting to mean normalizing a word. This is useful for low-resource languages like Filipino, wherein spelling is often not standardized across its users (Li et al., 2020).
Many approaches have been tried for word normalization in online Filipino text: (1) predetermined rules using commonly seen patterns (Guingab et al., 2014; Oco and Borra, 2011), (2) dictionary-substitution models for extracting patterns in misspelled words (Nocon et al., 2014), or (3) trigrams combined with Levenshtein or QWERTY distance to select words which share similar trigrams or are close in terms of edit or keyboard distance (Chan et al., 2008; Go et al., 2017).
Each method has its limitations, which we seek to address. Predetermined rules must be manually updated to learn emerging patterns, as is common in the constantly evolving vocabulary of online Filipino text (Salvacion and Limpot, 2022; Lumabi, 2020). Dictionary-substitution models are limited by the constraint of mapping each pattern to only a single substitution, whereas in reality, different substitutions may need to be applied to different words bearing the same pattern (Nocon et al., 2014). Trigrams and distance metrics alone may be successful in the context of correcting typographical errors for which the model was developed (Chan et al., 2008), but may not be as successful on intentionally abbreviated words. Our work uses a combination of these methods to develop a model that can be easily updated, considers multiple possible candidates, and works in the online text setting.
The task is further complicated by the lack of data, which hinders the use of large pretrained language models. Previous supervised modeling approaches require thousands of labeled examples (Etoori et al., 2018), and even unsupervised approaches for similar problems require vocabulary lists containing the desired words for translation (Lample et al., 2018a,b). Since such datasets are not available, our paper revisits simpler models, and finds that they exhibit performance comparable to that of much larger SOTA models.
3 Data
We use a dataset consisting of Facebook comments made on weather advisories of a Philippine weather bureau in 2014. We identified 403 distinct abbreviated and contracted words, and had three Filipino undergraduate volunteers write their “correct” versions. To maximize the data, we removed hyphens and standardized spacing, then filtered out candidates where all annotators gave different answers. We obtained 398 examples (98.7%) with 83.8% inter-annotator agreement. We then created a 298-100 train-test split; we selected test examples that used spelling rules present in the training set to test the ability of our n-gram model to extract and apply such rules. To test generalizability, we also perform cross-validation. The data and code for our experiments are available at https://github.com/ljyflores/Filipino-Slang.
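To make the preprocessing concrete, the sketch below shows one way these steps could be applied. The input format, the normalize helper, and the majority-vote filter are our own assumptions for illustration, not the authors' exact pipeline.

```python
from collections import Counter

def normalize(text):
    """Remove hyphens and standardize spacing, as described above."""
    return " ".join(text.replace("-", "").split())

def build_dataset(raw_annotations):
    """raw_annotations: list of (misspelled_word, [annotator_1, annotator_2, annotator_3]).
    Keep a pair only when at least two annotators agree on the corrected form."""
    dataset = []
    for word, annotations in raw_annotations:
        votes = Counter(normalize(a) for a in annotations)
        best, count = votes.most_common(1)[0]
        if count >= 2:  # drop items where all annotators gave different answers
            dataset.append((normalize(word), best))
    return dataset

# Hypothetical example: "d2" is kept (two annotators agree); the second pair is dropped.
raw = [("d2", ["dito", "dito", "ditto"]),
       ("nkktawa", ["nakakatawa", "nakatawa", "naka"])]
print(build_dataset(raw))  # [('d2', 'dito')]
```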
4 Model
Automatic Rule Generation
We extract spelling rules from pairs (w, c), where w is a misspelled word and c is its corrected version. The rule generation algorithm slides a window of length k over w and c, and records w[i:i+k] → c[j:j+k] as a rule (i, j are pointers); it returns a dictionary mapping each substring to a list of “correct” substrings (see Appendix 1 for the algorithm and an example).
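As a rough illustration of the rule extraction step, the sketch below slides a length-k window over each (w, c) pair and records substring rules in a dictionary. The proportional alignment of the pointer j to i is a simplifying assumption on our part; the paper's appendix specifies the actual pointer updates.

```python
from collections import defaultdict

def extract_rules(pairs, k=2):
    """Map each length-k substring of a misspelled word to the 'correct'
    substrings observed at (approximately) the same position."""
    rules = defaultdict(list)
    for w, c in pairs:
        for i in range(len(w) - k + 1):
            # Simplified alignment: scale i by the relative lengths of c and w.
            j = round(i * (len(c) - k) / max(len(w) - k, 1))
            rules[w[i:i + k]].append(c[j:j + k])
    return rules

rules = extract_rules([("nkktawa", "nakakatawa"), ("d2", "dito")], k=2)
# Under this simplified alignment: rules["nk"] == ["na"], rules["d2"] == ["di"]
```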
We test substrings of length 1 to 4, and find that lengths 1 and 2 work best. This makes sense, as many Filipino words are abbreviated by syllable, and syllables typically have 1-2 letters. This is similar to Indonesian (Batais and Wiltshire, 2015) and Malay (Ramli et al., 2015), suggesting possible extensions.
We further filter candidates to words present in a Filipino vocabulary list developed by Gensaya (2018) (MIT License), except when none of the candidates exist in the vocabulary list, in which case we use all the generated words as candidates.
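A minimal sketch of this filtering step is shown below, assuming the vocabulary list has been loaded into a Python set; the fallback to the full candidate set mirrors the exception described above.

```python
def filter_by_vocabulary(candidates, vocabulary):
    """Keep candidates that appear in the vocabulary; if none do, keep them all."""
    in_vocab = [c for c in candidates if c in vocabulary]
    return in_vocab if in_vocab else candidates

vocabulary = {"dito", "nakakatawa"}                          # placeholder entries
print(filter_by_vocabulary(["dito", "dato"], vocabulary))    # ['dito']
print(filter_by_vocabulary(["dzto", "dqto"], vocabulary))    # ['dzto', 'dqto'] (fallback)
```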
Candidate Generation
We recursively generate candidates by replacing each substring with all possible rules in the rule dictionary. If the substring is not present, we keep the substring as is. An example can be found in Appendix D.
We find that rules involving single-letter substrings often occur at the end of a word. Hence, we test candidate generation algorithms which either allow single-letter rules to be used anywhere when generating (V1), or only for the last letter of a word (V2). We also vary the number of candidates kept at each generation step (ranked by likelihood, see Eq. 2).
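The recursive expansion can be sketched as follows. This roughly corresponds to the V1 variant (rules applied anywhere), without the per-step pruning by likelihood; the left-to-right traversal order is our own choice for illustration.

```python
def generate_candidates(word, rules, k=2):
    """Recursively expand the word: replace the leading length-k substring with
    every right-hand side in the rule dictionary, or keep it unchanged if no
    rule applies, then recurse on the remainder."""
    if len(word) < k:
        return set(rules.get(word, [word])) if word else {""}
    head, tail = word[:k], word[k:]
    replacements = rules.get(head, [head])  # keep the substring as is if not present
    return {r + rest for r in replacements for rest in generate_candidates(tail, rules, k)}

rules = {"d2": ["dito"], "nk": ["nak", "nik"]}   # toy rule dictionary
print(generate_candidates("d2", rules))          # {'dito'}
print(generate_candidates("nkktawa", rules))     # {'nakktawa', 'nikktawa'}
```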
Ranking Candidates
We explore two ways of ranking candidates: (1) Damerau-Levenshtein distance: we rank candidates by their edit distance from the misspelled word, using the pyxDamerauLevenshtein package (https://github.com/lanl/pyxDamerauLevenshtein) with standard settings; and (2) likelihood score: we compute the likelihood of the output word c given the misspelled word w as the product of the probabilities of the rules used to generate it, where the probability of a rule a → b is the number of occurrences of a → b divided by the number of rules starting with a (see Eqs. 1 and 2).
P(a \to b) = \frac{|\{a \to b\}|}{|\{a \to c\}\ \forall c|} \quad (1)
P(w \to c) = \prod_{i=1}^{\mathrm{len}(w)-k} P\big(w[i:i+k] \to c[i:i+k]\big) \quad (2)
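The likelihood score of Eqs. 1 and 2 and the edit-distance ranking can be sketched as follows. The rule_counts structure and the per-candidate list of applied rules are assumed to come from the rule extraction and candidate generation steps; they are our own bookkeeping, not necessarily the paper's exact data structures.

```python
from collections import Counter
from pyxdameraulevenshtein import damerau_levenshtein_distance

def rule_probability(rule_counts, a, b):
    """Eq. 1: count of rule a -> b over the count of all rules with left-hand side a."""
    total = sum(n for (lhs, _), n in rule_counts.items() if lhs == a)
    return rule_counts[(a, b)] / total if total else 0.0

def likelihood(applied_rules, rule_counts):
    """Eq. 2: product of the probabilities of the rules used to generate a candidate."""
    score = 1.0
    for a, b in applied_rules:
        score *= rule_probability(rule_counts, a, b)
    return score

# Hypothetical counts collected during rule extraction.
rule_counts = Counter({("nk", "nak"): 3, ("nk", "nik"): 1})
print(rule_probability(rule_counts, "nk", "nak"))   # 0.75

# Ranking candidates by Damerau-Levenshtein distance from the misspelled word.
candidates = ["nakakatawa", "nakatawa"]
ranked = sorted(candidates, key=lambda c: damerau_levenshtein_distance("nkktawa", c))
```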
5 Evaluation
5.1 Comparison to Language Models
We benchmark the performance of our models
against two seq-to-seq models on the same dataset: