Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic
N-Gram Rule Generation for Spelling Normalization in Filipino
Lorenzo Jaime Yu Flores
Dragomir Radev
Yale University
lj.flores@yale.edu
Abstract
With 84.75 million Filipinos online, the ability of models to process online text is crucial for developing Filipino NLP applications. To this end, spelling correction is an essential preprocessing step for downstream tasks. However, the lack of data prevents the use of language models for this task. In this paper, we propose an N-Gram + Damerau-Levenshtein distance model with automatic rule extraction. We train the model on 300 samples and show that, despite the limited training data, it achieves good performance and outperforms other deep learning approaches in terms of accuracy and edit distance. Moreover, the model (1) requires little compute power, (2) trains in little time, allowing for frequent retraining, and (3) is easily interpretable, allowing for direct troubleshooting. These results highlight the success of traditional approaches over more complex deep learning models in settings where data is unavailable.
1 Introduction
Filipinos are among the most active social media users worldwide (Baclig, 2022). In 2022, roughly 84.75M Filipinos were online (Statista, 2022a), with 96.2% of them on Facebook (Statista, 2022b). Hence, developing language models that can process online text is crucial for Filipino NLP applications.
Contractions and abbreviations are common in such online text (Salvacion and Limpot, 2022). For example, dito (here) can be written as d2, and nakakatawa (funny) as nkktawa; both are abbreviated based on their pronunciation. However, language models like Google Translate remain limited in their ability to detect and correct such words, as we show later in this paper. Hence, we aim to improve the spelling correction ability of such models.
In this paper, we demonstrate the effectiveness of a simple n-gram-based algorithm for this task, inspired by prior work on automatic rule generation by Mangu and Brill (1997). Specifically, we (1) create a training dataset of 300 examples, (2) automatically generate n-gram-based spelling rules from the dataset, and (3) use the rules to propose and select candidate corrections. We then demonstrate that this model outperforms sequence-to-sequence approaches.
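As a minimal sketch of step (2), the snippet below mines character-level substitution rules from (shortcut, standard) word pairs via sequence alignment. The function name and the use of difflib alignment are illustrative assumptions, not the exact rule-generation procedure of Mangu and Brill (1997).

```python
# Illustrative sketch only: mine candidate substitution rules from
# (shortcut, standard) pairs. The alignment heuristic (difflib) is an
# assumption, not the paper's exact rule-generation algorithm.
from collections import Counter
from difflib import SequenceMatcher

def extract_rules(pairs):
    """Count character-span substitutions observed in aligned word pairs."""
    rules = Counter()
    for shortcut, standard in pairs:
        ops = SequenceMatcher(None, shortcut, standard).get_opcodes()
        for tag, i1, i2, j1, j2 in ops:
            if tag != "equal":
                # e.g. the pair ("d2", "dito") yields the rule "2" -> "ito"
                rules[(shortcut[i1:i2], standard[j1:j2])] += 1
    return rules

pairs = [("d2", "dito"), ("nkktawa", "nakakatawa")]
print(extract_rules(pairs).most_common())
```

At inference time, rules whose left-hand side matches a span of the input word can be applied to propose candidate corrections, which are then ranked, e.g., by edit distance to dictionary entries.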
Ultimately, this paper aims to highlight the value of traditional approaches in settings where SOTA language models are difficult to apply due to limited data availability. Such approaches have the added benefits of (1) requiring little compute power for training and inference, (2) training in very little time (allowing for frequent retraining), and (3) giving researchers full clarity over their inner workings, thereby easing troubleshooting.
2 Related Work
The problem of online text spelling correction is most closely related to spelling normalization, the subtask of reverting shortcuts and abbreviations to their original form (Nocon et al., 2014). In this paper, we use correcting to mean normalizing a word. This is useful for low-resource languages like Filipino, where spelling is often not standardized across users (Li et al., 2020).
Many approaches have been tried for word normalization in online Filipino text: (1) predetermined rules based on commonly seen patterns (Guingab et al., 2014; Oco and Borra, 2011), (2) dictionary-substitution models that extract patterns in misspelled words (Nocon et al., 2014), and (3) trigram models with Levenshtein or QWERTY distance, which select words that share similar trigrams or are close in edit or keyboard distance (Chan et al., 2008; Go et al., 2017).
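To make approach (3) concrete, the following toy example ranks dictionary words by Damerau-Levenshtein distance (the metric our model also uses) and picks the nearest. The small word list and the optimal-string-alignment variant are assumptions for illustration, not the implementation of the cited systems.

```python
# Toy distance-based candidate selection. The lexicon is hypothetical;
# real systems use a full dictionary and may combine trigram overlap or
# QWERTY (keyboard) distance with the edit distance shown here.
def dl_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

lexicon = ["dito", "doon", "nakakatawa"]
print(min(lexicon, key=lambda w: dl_distance("dto", w)))  # -> "dito"
```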
Each method has limitations that we seek to address. Predetermined rules must be manually updated to capture emerging patterns, which arise frequently in the constantly evolving vocabulary of online Filipino text (Salvacion and Limpot, 2022; Lumabi, 2020). Dictionary-substitution models are limited by the constraint of mapping each pattern to a single substitution.