
We propose SilverAlign, a novel algorithm for creating silver evaluation data that guides the choice of an appropriate word alignment method. Our approach is based on a machine translation model and exploits minimal sentence pairs to create parallel corpora with alignment links. Figure 1 illustrates our core idea with minimal pairs in English and Blissymbols. We create alternative sentences that form minimal pairs with the source sentence, rely on machine translation models to track which target words change for each alternative, and then use these changes to align the words of the source sentence.
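The sketch below illustrates this idea in simplified form. The translate argument stands in for an arbitrary MT model, and the placeholder replacement together with the naive positional diff are simplifying assumptions for illustration; they are not the exact procedure described in Section 3.

# Simplified sketch of the minimal-pair idea (illustration only, not the released code).
# translate: any callable that maps a source string to a target string (an MT model).
def diff_positions(reference, variant):
    """Return target positions whose tokens differ between two translations."""
    length = max(len(reference), len(variant))
    ref = reference + [None] * (length - len(reference))
    var = variant + [None] * (length - len(variant))
    return [i for i, (r, v) in enumerate(zip(ref, var)) if r != v]

def silver_alignments(source_tokens, translate, placeholder="X"):
    """Create silver alignment links for one source sentence via minimal pairs."""
    reference = translate(" ".join(source_tokens)).split()
    links = []
    for i in range(len(source_tokens)):
        # Minimal pair: the same sentence with the i-th word replaced.
        variant_src = source_tokens[:i] + [placeholder] + source_tokens[i + 1:]
        variant = translate(" ".join(variant_src)).split()
        # Link source position i to every target position that changed.
        links.extend((i, j) for j in diff_positions(reference, variant))
    return reference, links

With a real MT system plugged in as translate, the resulting links can be aggregated over many sentences to form a silver benchmark.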
In summary, our contributions are:
1. We find that our silver benchmarks rank methods highly consistently with rankings based on gold data. This means that we can identify the best methods from silver data alone when no gold data is available, which is frequently the case in low-resource word alignment scenarios.
2. We conduct an extensive analysis of our silver resource with respect to gold data for 9 language pairs from different language families and with varying resource availability. We evaluate word alignment models with respect to subword tokenization, tokenizer vocabulary size, performance across Part-of-Speech tags, and word frequency.
3. SilverAlign supports a more accurate evaluation and a more in-depth analysis than small gold sets (e.g., English-Hindi has only 90 sentences) because we can automatically create larger evaluation benchmarks. SilverAlign is also robust to domain changes, as it shows a high correlation between gold data and both in- and out-of-domain silver benchmarks.
4. It has been shown that machine translation performance (including NMT performance) can be improved by choosing a tokenization that optimizes compatibility between source and target languages (Deguchi et al., 2020). We show that SilverAlign can be used to find such a compatible tokenization for each language pair (see the sketch after this list).
5. We make our silver data and code available (https://github.com/akoksal/SilverAlign) as a resource for future work that takes advantage of our silver evaluation datasets. Our code can be used to create silver benchmarks for multiple languages, and our silver benchmark can be used out of the box.
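As an illustration of contribution 4, the sketch below shows one simple way a silver benchmark can drive tokenization selection: candidate tokenizers (e.g., BPE models with different vocabulary sizes) are scored against the silver links, and the best-scoring one is kept. The align_fn and score_fn callables are placeholders for an arbitrary word aligner and evaluation metric; they are illustrative assumptions, not part of the released code.

# Illustrative sketch: selecting a tokenization for a language pair with a silver benchmark.
# candidates: tokenizers to compare (e.g., BPE models with different vocabulary sizes).
# align_fn(src, tgt, tokenizer) returns predicted links; score_fn(predicted, silver_links)
# returns a quality score (higher is better, e.g., alignment F1).
def select_tokenization(candidates, silver_pairs, silver_links, align_fn, score_fn):
    """Return the candidate tokenizer that scores best on the silver benchmark."""
    def benchmark_score(tokenizer):
        predicted = [align_fn(src, tgt, tokenizer) for src, tgt in silver_pairs]
        return score_fn(predicted, silver_links)
    return max(candidates, key=benchmark_score)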
The rest of the paper is organized as follows. Section 2 describes related work. The details of the SilverAlign method are explained in Section 3. Section 4 describes the experimental setup, evaluation metrics, and datasets. We compare the results on our silver benchmarks to gold data in Section 5. Finally, we draw conclusions and discuss future work in Section 6.
2 Related Work
2.1 Word alignment analysis
Word alignment performance has been analyzed with respect to many different factors. Ho and Yvon (2019) compare discrete and neural word aligners and show the superiority of the latter. They also compare them with respect to unaligned words, rare words, Part-of-Speech (PoS) tags, and distortion errors. Asgari et al. (2020) study word alignment with subword-level tokenization and show improved performance over word-level tokenization. Sabet et al. (2020) analyze the performance of word aligners across PoS tags for English-German and show that Eflomal performs poorly on links with high distortion. They also analyze alignments by word frequency and show that performance on rare words decreases when aligning at the word level rather than the subword level.
Ho and Yvon (2021) analyze the interaction be-
tween alignment methods and subword tokeniza-
tion (Unigram and Byte Pair Encoding (BPE)).
They observe that tokenizing into smaller units
helps to align rare and unknown words. They also
investigate the effect of different vocabulary sizes
and conclude that word-based segmentation is suboptimal. We also conduct an experiment in this direction in Section 5.3.
2.2 Silver data creation in NLP
Collecting gold data for evaluating or training systems can be impractical due to its cost and the need for human annotators. To address these issues, silver data, i.e., automatically generated data, has been widely exploited for different tasks and domains. For the Named Entity Recognition (NER) task, Rebholz-Schuhmann et al. (2010) introduce