To simplify the alignment structure, we consider using only one operation in the collapsing function. One option is to collapse only consecutive repeated words in the alignment. The concern is that this would limit the expressive power of the model. For example, the target sentence $y = (A, A, B, C, D)$ would have probability 0, since its consecutive repeated words cannot be produced from any alignment. Therefore, we favor another option, which only removes all blank tokens. In this way, the expressive power stays the same and the alignment space becomes much simpler: alignments of $y$ simply contain the $T$ target words and $T_a - T$ blank tokens.
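To make the contrast concrete, below is a minimal sketch of the two candidate collapsing operations (the function names and the blank marker are ours, purely for illustration): collapsing repeats can never recover a target with consecutive repeated words, while removing blanks can.

```python
BLANK = "<blank>"  # illustrative blank marker

def collapse_repeats(alignment):
    """Merge consecutive repeated tokens (the first candidate operation)."""
    out = []
    for tok in alignment:
        if not out or tok != out[-1]:
            out.append(tok)
    return out

def remove_blanks(alignment):
    """Drop all blank tokens (the collapsing operation favored here)."""
    return [tok for tok in alignment if tok != BLANK]

# No alignment maps to (A, A, B, C, D) under repeat-collapsing alone:
print(collapse_repeats(["A", "A", "B", "C", "D"]))             # ['A', 'B', 'C', 'D']
# Under blank removal, interleaving the target words with blanks works:
print(remove_blanks(["A", BLANK, "A", "B", BLANK, "C", "D"]))  # ['A', 'A', 'B', 'C', 'D']
```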
We denote this simplified loss as SCTC, and illustrate the difference between CTC and SCTC in Figure 2. The alignment space for the target sentence $y$ is defined as $\beta_s(y)$, and the collapsing function is defined as $\beta_s^{-1}$. We can still obtain the translation probability $p(y \mid x, \theta)$ with dynamic programming. We define the forward variable $\alpha_t(s)$ to be the total probability of $y_{1:s}$ considering all possible alignments $a_{1:t}$, which can be calculated recursively from $\alpha_{t-1}(s)$ and $\alpha_{t-1}(s-1)$:
$$\alpha_t(s) = \alpha_{t-1}(s-1)\, p_t(y_s \mid x, \theta) + \alpha_{t-1}(s)\, p_t(\epsilon \mid x, \theta), \tag{5}$$
where $\epsilon$ denotes the blank token. Finally, the total translation probability $p(y \mid x, \theta)$ is given by the forward variable $\alpha_{T_a}(T)$, so we can train the model with the cross-entropy loss $\mathcal{L}_{\mathrm{SCTC}}(\theta) = -\log \alpha_{T_a}(T)$.
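For concreteness, the following sketch implements the forward recursion of Equation 5 in log space; the interface (per-position log-distributions as a $T_a \times V$ matrix and an explicit blank index) and the function name are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sctc_loss(log_probs, target, blank_id):
    """Negative log-likelihood under SCTC via the forward recursion (Eq. 5).

    log_probs: array of shape (T_a, V) holding log p_t(.|x, theta).
    target:    list of T token ids (the target sentence y).
    blank_id:  index of the blank token (hypothetical; vocabulary-dependent).
    """
    T_a = log_probs.shape[0]
    T = len(target)
    # alpha[t, s] = log total probability of y_{1:s} over alignments a_{1:t}
    alpha = np.full((T_a + 1, T + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(1, T_a + 1):
        for s in range(0, min(t, T) + 1):
            # emit a blank at position t: alpha_{t-1}(s) * p_t(blank)
            stay = alpha[t - 1, s] + log_probs[t - 1, blank_id]
            # emit the next target word y_s: alpha_{t-1}(s-1) * p_t(y_s)
            move = alpha[t - 1, s - 1] + log_probs[t - 1, target[s - 1]] if s > 0 else -np.inf
            alpha[t, s] = np.logaddexp(stay, move)
    return -alpha[T_a, T]
```

For example, with $T_a = 4$ positions and a two-word target, the recursion sums over the $\binom{4}{2} = 6$ monotonic alignments that interleave the two words with two blanks.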
3.2 Non-Monotonic Alignments under SCTC
3.2.1 Bipartite Matching
In this section, we explore non-monotonic alignments under the SCTC loss, which are helpful for further analysis under the regular CTC loss. We first extend the alignment space from monotonic alignments $\beta_s(y)$ to non-monotonic alignments $\gamma_s(y)$ to allow for global word reordering in machine translation, where $\gamma_s(y)$ is defined as:
$$\gamma_s(y) = \bigcup_{y' \in P(y)} \beta_s(y'). \tag{6}$$
In the above definition, $P(y)$ represents all permutations of $y$. For example, $P((A, B)) = \{(A, B), (B, A)\}$. By enumerating $P(y)$, we consider all possible word reorderings of the target sentence. Ideally, we want to traverse all alignments in $\gamma_s(y)$ to calculate the log-likelihood loss:
$$\mathcal{L}_{\mathrm{sum}}(\theta) = -\log \sum_{a \in \gamma_s(y)} p(a \mid x, \theta). \tag{7}$$
However, without the monotonic structure, it becomes difficult to marginalize out all latent alignments with dynamic programming. Alternatively, we can minimize the loss of the best alignment, which is an upper bound of Equation 7 since the probability of the best alignment is at most the sum over all alignments:
$$\mathcal{L}_{\max}(\theta) = -\log \max_{a \in \gamma_s(y)} p(a \mid x, \theta). \tag{8}$$
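For toy-sized inputs, both objectives can be evaluated by brute force, using the observation (made explicit in the next paragraph) that $\gamma_s(y)$ consists of permutations of the $T$ target words and $T_a - T$ blank tokens. The sketch below is our own illustration and is exponential in $T_a$, which is precisely why Equation 7 is intractable in general.

```python
import itertools
import numpy as np

def nonmonotonic_losses(log_probs, target, blank_id):
    """Brute-force L_sum (Eq. 7) and L_max (Eq. 8) for toy-sized inputs only.

    gamma_s(y) is enumerated directly as the distinct permutations of the
    T target words plus T_a - T blank tokens.
    """
    T_a = log_probs.shape[0]
    multiset = list(target) + [blank_id] * (T_a - len(target))
    log_p_sum, log_p_max = -np.inf, -np.inf
    for a in set(itertools.permutations(multiset)):  # dedupe repeated tokens
        log_p_a = sum(log_probs[t, tok] for t, tok in enumerate(a))
        log_p_sum = np.logaddexp(log_p_sum, log_p_a)
        log_p_max = max(log_p_max, log_p_a)
    return -log_p_sum, -log_p_max
```

Since the maximum term is at most the full sum, the returned $\mathcal{L}_{\max}$ is always at least $\mathcal{L}_{\mathrm{sum}}$.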
Following prior work [46, 7, 9], we formulate finding the best alignment as a maximum bipartite matching problem and solve it with the Hungarian algorithm [28]. Specifically, we observe that the alignment space $\gamma_s(y)$ simply consists of permutations of the $T$ target words and $T_a - T$ blank tokens. Therefore, finding the best alignment is equivalent to finding the best bipartite matching between the $T_a$ model predictions and the $T$ target words plus $T_a - T$ blank tokens, where the two sets of nodes are connected by edges with the prediction log-probabilities as weights.
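This reduction can be sketched with SciPy's Hungarian solver, `scipy.optimize.linear_sum_assignment`; the cost-matrix construction below follows the description above, but the function and its interface are our own illustration rather than the paper's implementation (the solver minimizes cost, so the log-probabilities are negated).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_alignment(log_probs, target, blank_id):
    """Find the best alignment in gamma_s(y) via maximum bipartite matching.

    Rows are the T_a model predictions; columns are the T target words
    followed by T_a - T blank tokens; edge weights are log-probabilities.
    """
    T_a = log_probs.shape[0]
    columns = list(target) + [blank_id] * (T_a - len(target))
    # cost[t, j] = -log p_t(columns[j] | x, theta); Hungarian minimizes cost.
    cost = -log_probs[:, columns]
    rows, cols = linear_sum_assignment(cost)
    alignment = [columns[j] for j in cols]  # row indices are returned in order
    log_prob = -cost[rows, cols].sum()
    return alignment, log_prob
```

Because the $T_a - T$ blank columns are identical, many matchings share the optimal score; the solver returns one of these equivalent optima.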
3.2.2 N-Gram Matching
Without the monotonic structure required by dynamic programming, calculating Equation 7 becomes difficult. The difficulty also stems from the strict requirement of an exact match between alignments and the target, which makes it intractable to simplify the summation of probabilities. In practice, it is not necessary to force an exact match between alignments and the target sentence, since a large overlap is also favorable. Therefore, we further extend the alignment space by considering all alignments that overlap with the target sentence. Specifically, we are interested in the overlap of n-grams, which is at the core of some evaluation metrics (e.g., BLEU).
We propose a non-monotonic and non-exclusive n-gram matching objective based on SCTC to encourage the overlap of n-grams. Following the underlying idea of probabilistic matching [39], we introduce a probabilistic variant of the n-gram count to make the objective differentiable.
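To convey the expected-count idea only, here is a deliberately simplified sketch of our own (it treats positions independently and ignores blank tokens, so it does not constitute the exact SCTC-based objective proposed here): the hard count of an n-gram is replaced by its expectation under the per-position distributions, which becomes differentiable when implemented with autodiff tensors.

```python
import numpy as np

def expected_ngram_count(probs, ngram):
    """Expected (probabilistic) count of one n-gram under per-position
    distributions p_t(.|x, theta), assuming independent positions and no blanks.

    probs: array of shape (T_a, V) with p_t(w | x, theta).
    ngram: tuple of n token ids.
    """
    T_a = probs.shape[0]
    n = len(ngram)
    count = 0.0
    for t in range(T_a - n + 1):
        # probability that positions t .. t+n-1 emit exactly this n-gram
        count += np.prod([probs[t + i, w] for i, w in enumerate(ngram)])
    return count
```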