
ments based on the matrix of embedding similarities (Jalili Sabet et al., 2020; Dou and Neubig, 2021). While achieving some improvements in alignment quality and efficiency, we find that existing LM-based aligners capture few interactions between the input source-target sentence pairs. Specifically, SimAlign (Jalili Sabet et al., 2020) encodes the source and target sentences separately, without attending to the context in the other language. Dou and Neubig (2021) further propose Awesome-Align, which considers the cross-lingual context by taking the concatenation of the sentence pairs as input during training, but still encodes them separately during inference.
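
To make this concrete, the following is a minimal sketch of the similarity-matrix alignment extraction that such LM-based aligners share. The exact symmetrization heuristics and thresholds differ between SimAlign and Awesome-Align; the mutual-argmax rule and optional threshold below are simplifying assumptions, not the methods' actual settings.

```python
import numpy as np

def align_from_similarity(sim, threshold=None):
    """Induce word alignments from a (len_src, len_tgt) similarity matrix.

    A pair (i, j) is kept when it is a mutual best match, i.e. the argmax
    of both its row and its column (bidirectional agreement). This is one
    common extraction heuristic; real aligners apply their own variants.
    Returns a set of (i, j) index pairs.
    """
    row_best = sim.argmax(axis=1)   # best target token for each source token
    col_best = sim.argmax(axis=0)   # best source token for each target token
    pairs = {(i, int(j)) for i, j in enumerate(row_best) if col_best[j] == i}
    if threshold is not None:       # optional minimum-similarity filter
        pairs = {(i, j) for i, j in pairs if sim[i, j] >= threshold}
    return pairs

# Toy usage with random stand-ins for contextual embeddings:
rng = np.random.default_rng(0)
src, tgt = rng.normal(size=(5, 8)), rng.normal(size=(6, 8))
print(align_from_similarity(src @ tgt.T))
```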
However, the lack of interaction between the input source-target sentence pairs severely degrades alignment quality, especially for words that are ambiguous in the monolingual context. Figure 1 presents an example of our reproduced results from Awesome-Align. The ambiguous Chinese word “以” has two different meanings: 1) a preposition (“to”, “as”, “for” in English), and 2) the abbreviation of the word “以色列” (“Israel” in English). In this example, the word “以” is misaligned to “to” and “for” because the model does not fully consider the word “Israel” in the target sentence. Intuitively, the cross-lingual context is very helpful for alleviating such meaning confusion in the word alignment task.
Based on the above observation, we propose Cross-Align, which fully considers the cross-lingual context by modeling deep interactions between the input sentence pairs. Specifically, Cross-Align encodes the monolingual information of the source and target sentences independently with shared self-attention modules in the shallow layers, and then explicitly models deep cross-lingual interactions with cross-attention modules in the upper layers. Besides, to train Cross-Align effectively, we propose a two-stage training framework: in the first stage, the model is trained with the simple TLM objective (Conneau and Lample, 2019) to learn cross-lingual representations; in the second stage, it is fine-tuned with a self-supervised alignment objective to bridge the gap between training and inference. We conduct extensive experiments on five different language pairs, and the results show that our approach achieves SOTA performance on four out of five language pairs.² Compared to the existing approaches, which apply many complex training objectives, our approach is simple yet effective.

² In Ro-En, we achieve the best performance among models in the same line of work, but perform slightly worse than the NMT-based models, which have many more parameters than ours.
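
As a concrete illustration of this design, the sketch below is a PyTorch-style approximation of the described layout: shared self-attention in the shallow layers, cross-attention in the upper layers, and a similarity matrix at the top from which alignments can be induced. The layer counts, dimensions, weight sharing of the cross-attention module across directions, and the plain dot-product similarity head are illustrative assumptions, not the actual Cross-Align implementation or its training objectives.

```python
import torch
import torch.nn as nn

class CrossAlignSketch(nn.Module):
    """Illustrative layout only: shared self-attention encoders in the
    shallow layers, cross-attention between the two sentences in the
    upper layers. Hyperparameters here are placeholders."""

    def __init__(self, vocab_size=30000, d_model=256, n_heads=4,
                 n_self_layers=4, n_cross_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Shallow layers: standard Transformer encoder layers, shared
        # between source and target (monolingual encoding).
        self.self_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_self_layers)])
        # Upper layers: explicit cross-lingual interaction. A single
        # attention module is reused in both directions for brevity
        # (an assumption of this sketch).
        self.cross_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True)
             for _ in range(n_cross_layers)])

    def forward(self, src_ids, tgt_ids):
        h_src, h_tgt = self.embed(src_ids), self.embed(tgt_ids)
        for layer in self.self_layers:           # shared self-attention
            h_src, h_tgt = layer(h_src), layer(h_tgt)
        for attn in self.cross_layers:           # deep cross-lingual interaction
            new_src, _ = attn(h_src, h_tgt, h_tgt)   # source attends to target
            new_tgt, _ = attn(h_tgt, h_src, h_src)   # target attends to source
            h_src, h_tgt = new_src, new_tgt
        # Source-target similarity matrix for alignment induction.
        return torch.einsum("bid,bjd->bij", h_src, h_tgt)
```

In a full Transformer layer the cross-attention sublayers would additionally be wrapped with residual connections, layer normalization, and feed-forward blocks; they are reduced to bare attention here for brevity.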
Our main contributions are summarized as follows:

• We propose Cross-Align, a novel word alignment model which utilizes self-attention modules to encode monolingual representations and cross-attention modules to model cross-lingual interactions.

• We propose a two-stage training framework to boost model performance on word alignment, which is simple yet effective.

• Extensive experiments show that the proposed model achieves SOTA performance on four out of five different language pairs.
2 Related Work
2.1 NMT based Aligner
Recently, there has been a surge of interest in studying word alignment based on the attention weights (Vaswani et al., 2017) of NMT systems. However, naive attention may fail to capture clear word alignments (Serrano and Smith, 2019). Therefore, Zenkel et al. (2019) and Garg et al. (2019) extend the Transformer architecture with a separate alignment layer on top of the decoder, and produce competitive results compared to GIZA++. Chen et al. (2020) further improve alignment quality by adapting the alignment induction with the to-be-aligned target token. Recently, Chen et al. (2021) and Zhang and van Genabith (2021) propose self-supervised models that take advantage of the full context on the target side, and achieve SOTA results. Although NMT based aligners achieve promising results, there are still some disadvantages: 1) the inherent discrepancy between the translation task and word alignment is not eliminated, so the reliability of the attention mechanism remains in question (Li et al., 2019); 2) since NMT models are unidirectional, models in both directions are required to obtain the final alignments, which is inefficient.
2.2 LM based Aligner
Recent pre-trained multilingual language models like mBERT (Devlin et al., 2019) and XLM-R (Conneau and Lample, 2019) achieve promising results on many cross-lingual transfer tasks (Liang et al., 2020; Hu et al., 2020; Wang et al., 2022a,b). Jalili Sabet et al. (2020) prove that multilingual LMs are also helpful in word alignment