
Modeling the Graphotactics of Low-Resource Languages Using
Sequential GANs
Isaac Wasserman
Haverford College / Haverford, PA
University of Pennsylvania / Philadelphia, PA
isaacrw@seas.upenn.edu
Abstract
Generative Adversarial Networks (GANs) have been shown to aid in the creation of artificial data in situations where large amounts of real data are difficult to come by. This issue is especially salient in computational linguistics, where researchers are often tasked with modeling the complex morphological and grammatical processes of low-resource languages. This paper discusses the implementation and testing of a GAN that attempts to model and reproduce the graphotactics of a language using only 100 example strings. These artificial, yet graphotactically compliant, strings are meant to aid in modeling the morphological inflection of low-resource languages.
1 Introduction
1.1 Task
In 2019, Anastasopoulos and Neubig made waves with their multilingual morphological inflection model for low-resource languages (Anastasopoulos and Neubig, 2019), which they submitted to the SIGMORPHON 2019 shared task (McCarthy et al., 2019). All models submitted were pretrained on high-resource languages of similar ancestry to the target language, allowing many models to greatly exceed the performance of previous attempts at low-resource morphological inflection. However, what allowed Anastasopoulos and Neubig’s model to outperform other submissions was its use of data “hallucination”.
To perform this hallucination, they aligned the lemma with its inflected form, extracted the stem, and generated new artificial examples by replacing this stem with randomly generated strings (in the language’s alphabet) of equal length.¹ Though this random substitution may seem haphazard, the approach yielded an additional 10% accuracy, on average, when tested against versions of the model that used only cross-lingual transfer.
¹ The alignment process assumes that the lemma and inflected form share a common substring.
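The hallucination step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the stem can be approximated by the longest common substring of the lemma and the inflected form (found here with Python's `difflib`), and the names `hallucinate`, `alphabet`, and `rng` are hypothetical.

```python
import random
from difflib import SequenceMatcher


def hallucinate(lemma, inflected, alphabet, rng=random):
    """Replace the shared stem of a (lemma, inflected) pair with a random string.

    Assumption: the stem is the longest substring common to both forms.
    Returns None when the two forms share no substring at all.
    """
    matcher = SequenceMatcher(None, lemma, inflected, autojunk=False)
    m = matcher.find_longest_match(0, len(lemma), 0, len(inflected))
    if m.size == 0:
        # The common-substring assumption fails for this pair, so skip it.
        return None
    # Draw a random stem of equal length from the language's alphabet.
    fake_stem = "".join(rng.choice(alphabet) for _ in range(m.size))
    # Splice the same fake stem into both forms, keeping the affixes intact.
    new_lemma = lemma[:m.a] + fake_stem + lemma[m.a + m.size:]
    new_inflected = inflected[:m.b] + fake_stem + inflected[m.b + m.size:]
    return new_lemma, new_inflected
```

For the pair ("walk", "walked"), the shared stem is "walk", so the function returns a random four-letter string together with that same string followed by "ed". Pairs with no shared substring are skipped, mirroring the assumption stated in the footnote.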
Surely, a better-informed approach to stem generation would further improve the accuracy of the inflectional model. Given the demonstrated ability of GANs to produce photorealistic, yet completely contrived images, they are potentially ideal for such a task. The experiments detailed in this paper attempt to produce a technique for generating fake word stems that provide more relevant information to the inflectional model of Anastasopoulos and Neubig (2019), thereby increasing the accuracy of its inflections. By modeling the graphotactics of the target language using a GAN, it should be possible to produce strings that more accurately depict possible character sequences.
1.2 Generative Adversarial Networks
Generative adversarial networks are a class of unsupervised machine learning architectures, most commonly used for image generation. These networks consist of a generator and a discriminator that are trained simultaneously on a set of data representing a class or domain; this domain could be anything from photos of human faces to time series of hourly temperatures.² The generator is tasked with producing “fake” examples that are within this domain without ever seeing any real examples from the training set. Meanwhile, the discriminator is fed a combination of fake examples (from the generator) and real examples and is tasked with classifying them as real or fake. The respective goals of the generator and discriminator constitute a zero-sum game, in which the generator is constantly trying to outsmart the discriminator, while the discriminator hones its ability to distinguish between in-domain and out-of-domain examples.
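The zero-sum game described above is commonly written as a minimax objective. The formulation below is the standard one from the general GAN literature, given only as an illustration; the symbols $G$ (generator), $D$ (discriminator), $p_{\text{data}}$ (distribution of real examples), and $p_z$ (noise distribution fed to the generator) are not defined in this paper's text, and the model developed here may optimize a different loss.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator raises $V$ by assigning high probability to real examples and low probability to generated ones, while the generator lowers $V$ by producing examples that the discriminator scores as real.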
Though GANs are most often applied to im-
² Technically speaking, the generator and discriminator are most often trained one after another on a repeated basis.
arXiv:2210.14409v1 [cs.CL] 26 Oct 2022