
Lang. pair  Source      #parl. sents.   #pairs  %pairs
en-no       WikiMatrix        530,000   10,274    1.94
en-no       CCMatrix        8,000,000   73,394    0.92
en-es       UNPC            2,800,000   28,028    1.00
en-es       WikiMatrix      3,290,000   41,577    1.26
All                        14,620,000  153,273    1.05
Table 1: Number of parallel sentences in the English-
Norwegian and English-Spanish parallel corpora we
work with, and pairs of sentences with negation and
affirmative interpretations we automatically generate
via backtranslation. The yield (%pairs) is low, but as we
shall see these pairs are useful to solve natural language
understanding tasks when negation is present without
hurting results when negation is not present.
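The yield column in Table 1 is simply the number of generated pairs divided by the number of parallel sentences, expressed as a percentage. The short sketch below recomputes it from the counts reported in the table; the dictionary layout and the helper name `yield_percent` are illustrative choices, not part of our pipeline.

```python
# Counts copied from Table 1:
# (language pair, source) -> (#parallel sentences, #generated pairs)
TABLE_1 = {
    ("en-no", "WikiMatrix"): (530_000, 10_274),
    ("en-no", "CCMatrix"): (8_000_000, 73_394),
    ("en-es", "UNPC"): (2_800_000, 28_028),
    ("en-es", "WikiMatrix"): (3_290_000, 41_577),
}


def yield_percent(n_parallel: int, n_pairs: int) -> float:
    """Percentage of parallel sentences that yield a pair (hypothetical helper)."""
    return round(100.0 * n_pairs / n_parallel, 2)


# Totals corresponding to the "All" row.
total_parallel = sum(n_par for n_par, _ in TABLE_1.values())
total_pairs = sum(n_pairs for _, n_pairs in TABLE_1.values())

for (lang_pair, source), (n_par, n_pairs) in TABLE_1.items():
    print(lang_pair, source, yield_percent(n_par, n_pairs))
print("All", total_parallel, total_pairs, yield_percent(total_parallel, total_pairs))
```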
them, we automatically collect pairs of sentences
with negation and their affirmative interpretations.
Additionally, extrinsic evaluations show that, although
our collection procedure is noisy, leveraging our pairs
is beneficial for solving three natural language
understanding tasks.
3 Collecting Sentences with Negation
and Their Affirmative Interpretations
This section outlines our approach to create a large
collection of sentences containing negation and
their affirmative interpretations. First, we present
the sources of parallel corpora we work with. Sec-
ond, we describe our multilingual negation cue
detector to identify negation cues in the parallel
sentences. Third, we describe the backtranslation
step and a few checks to improve quality. Lastly,
we present an analysis of the resulting sentences
with negation and their affirmative interpretations.
3.1 Selecting Parallel Corpora
We select parallel sentences in English and either
Norwegian or Spanish for two reasons: (a) large
parallel corpora are available in these language
pairs and (b) negation cue annotations are available
in monolingual corpora for the three languages.
The latter is a requirement to build a multilingual
cue detector (Section 3.2). We extract the parallel
sentences from three parallel corpora available in the
OPUS portal (Tiedemann, 2012): WikiMatrix (Schwenk
et al., 2021a), CCMatrix (Schwenk et al., 2021b; Fan
et al., 2021), and UNPC (Ziemski et al., 2016). Table 1
(Column 3) shows the number of parallel sentences we
collect from each corpus and language pair (total:
14.6 million).
3.2 Identifying Negation Cues in Multiple
Languages
In order to detect negation in the parallel sentences,
we develop a multilingual negation cue detector
that works with English, Norwegian, and Spanish
texts. To this end, we fine-tune multilingual BERT
(mBERT)² (Devlin et al., 2019) with negation cue
annotations in the three languages we work with:
English (Morante and Daelemans, 2012b), Norwegian
(Mæhlum et al., 2021), and Spanish (Jiménez-Zafra
et al., 2018). We fine-tune
jointly for all three languages by combining the
original training splits into a multilingual training
split. We stop training once the F1 score on the
(combined) development split has not increased for
5 epochs; the final model is the one that yields the
highest F1 score during training. Additional details
regarding the training procedure and hyperparameters
are provided in Appendix A. Our multilingual detector
is not
perfect but obtains competitive results (F1 scores):
English: 91.96 (test split), Norwegian: 93.40 (test
split), and Spanish: 84.41 (dev split, as gold anno-
tations for the test split are not publicly available).
The system detects various negation cue types in-
cluding single tokens (no, never, etc.), affixal, and
lexicalized negations (Section 3.4).
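The stopping rule described above (stop after 5 epochs without improvement in development F1 and keep the best checkpoint) can be sketched as follows. This is a minimal illustration and not our training code: the function name, its arguments, and the F1 values in the usage example are hypothetical, and a real loop would train and evaluate the model at each epoch instead of reading scores from a list.

```python
from typing import Iterable, Tuple


def select_best_epoch(dev_f1_per_epoch: Iterable[float],
                      patience: int = 5) -> Tuple[int, float, int]:
    """Early stopping on development F1.

    Returns (best_epoch, best_f1, epochs_run), with epochs numbered from 1.
    Training stops once `patience` consecutive epochs fail to improve on
    the best F1 seen so far; the best epoch is the checkpoint to keep.
    """
    best_f1 = float("-inf")
    best_epoch = 0
    epochs_run = 0
    for epoch, f1 in enumerate(dev_f1_per_epoch, start=1):
        epochs_run = epoch
        if f1 > best_f1:
            # New best score: this is the checkpoint we would save.
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            break
    return best_epoch, best_f1, epochs_run


# Hypothetical development F1 scores, for illustration only.
scores = [0.80, 0.85, 0.84, 0.85, 0.83, 0.82, 0.81, 0.90]
best_epoch, best_f1, epochs_run = select_best_epoch(scores, patience=5)
print(best_epoch, best_f1, epochs_run)
```

In this toy run the best score is reached at epoch 2, so training stops after epoch 7 and the eighth score is never observed.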
We use our multilingual cue detector to detect
negation in the 14.6 million parallel sentences. In the
English-Norwegian parallel sentences (8.5M), negation
is present in both sentences (WikiMatrix: 7.3%,
CCMatrix: 14.2%), in exactly one of the two sentences
(WikiMatrix: 5.2%, CCMatrix: 5.2%), or in neither
sentence (WikiMatrix: 87.5%, CCMatrix: 80.6%).
Similarly, in the English-Spanish parallel sentences,
negation is present in both sentences (UNPC: 10.7%,
WikiMatrix: 5.7%), in exactly one of the two sentences
(UNPC: 4.6%, WikiMatrix: 4.4%), or in neither
sentence (UNPC: 84.7%, WikiMatrix: 89.9%).
Since we are interested in sentences containing
negation and their affirmative interpretations, we
only keep the sentences in which either the source
or target sentence contains negation.
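The filtering step can be illustrated with the sketch below. It rests on two assumptions that are not part of our pipeline: `has_negation` is a toy keyword lookup standing in for the fine-tuned mBERT cue detector, and "either" is read as "exactly one of the two sentences", in line with the three-way breakdown reported above. The example sentences are invented.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

# Toy cue list for illustration only; the actual detector is a fine-tuned
# mBERT model that also handles affixal and lexicalized negation.
TOY_CUES = {"not", "no", "never", "ikke", "aldri", "nunca"}


def has_negation(sentence: str) -> bool:
    """Hypothetical stand-in for the multilingual negation cue detector."""
    return any(tok in TOY_CUES for tok in re.findall(r"\w+", sentence.lower()))


def negation_bucket(src_neg: bool, tgt_neg: bool) -> str:
    """Classify a parallel pair as 'both', 'one', or 'neither'."""
    if src_neg and tgt_neg:
        return "both"
    if src_neg or tgt_neg:
        return "one"
    return "neither"


def filter_pairs(pairs: Iterable[Pair],
                 detector: Callable[[str], bool]) -> Tuple[List[Pair], Counter]:
    """Keep pairs where exactly one side contains negation (assumed reading)."""
    kept: List[Pair] = []
    counts: Counter = Counter()
    for src, tgt in pairs:
        bucket = negation_bucket(detector(src), detector(tgt))
        counts[bucket] += 1
        if bucket == "one":
            kept.append((src, tgt))
    return kept, counts


examples = [
    ("I did not go.", "No fui."),           # negation on both sides
    ("He stayed away.", "No vino."),        # negation on one side only
    ("She is happy.", "Ella está feliz."),  # no negation
]
kept, counts = filter_pairs(examples, has_negation)
print(kept)
print(dict(counts))
```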
3.3 Generating Affirmative Interpretations
After identifying negation cues in the parallel
sentences, we backtranslate into English the sentence
in the target language (either Norwegian or Spanish;
it may or may not contain a negation). In
² https://github.com/google-research/bert/blob/master/multilingual.md