
| Task | Dataset | #Class | Verbalizations | Mean # Tokens | Std. |
|------|---------|--------|----------------|---------------|------|
| LA | CoLA (Warstadt et al., 2019) | 2 | correct, incorrect (Gao et al., 2021) | 1 | 0 |
| NER | CoNLL03 (Tjong Kim Sang and De Meulder, 2003) | 5 | location, person, not an, ... (Cui et al., 2021) | 1.2 | 0.4 |
| NLI | MNLI (Williams et al., 2018) | 3 | yes, no, maybe (Fu et al., 2022) | 1 | 0 |
| NLI | XNLI (Conneau et al., 2018) | 3 | yes, no, maybe; Evet, ... (Zhao and Schütze, 2021) | 1 | 0 |
| PI | PAWS-X (Yang et al., 2019) | 2 | yes, no (Qi et al., 2022) | 1 | 0 |
| TC | MARC (Keung et al., 2020) | 2 | good, {average, bad} (Huang et al., 2022) | 1 | 0 |
| RC | TACRED (Zhang et al., 2017) | 42 | founded by, city of birth, country of death, ... | 3.23 | 1.99 |
| RC | SemEval (Hendrickx et al., 2010) | 10 | cause effect, entity origin, product producer, ... | 2.50 | 0.81 |
| RC | NYT (Riedel et al., 2010) | 24 | ethnicity, major shareholder of, religion, ... | 2.10 | 1.01 |
| RC | SCIERC (Luan et al., 2018) | 6 | conjunction, feature of, part of, used for, ... | 2.17 | 0.69 |
| RC | SMiLER (EN) (Seganti et al., 2021) | 36 | birth place, starring, won award, ... | 2.58 | 0.68 |
| RC | SMiLER (ALL) (Seganti et al., 2021) | 36 | hat Genre, chef d’organisation, del país, ... | 3.66 | 1.44 |
Table 2: Statistics of the lengths of the verbalizations over several classification tasks. The lengths for non-RC
tasks depend on the tokenizers from the respective PLMs in the cited work. The lengths for RC tasks are based
on the mT5BASE tokenizer. Mean and std. show that the label space of the RC task is more complex than most
few-class classification tasks. The verbalizations of RC datasets are listed in Appendix B. For SemEval, the two
possible directions of a relation are combined. For NYT, we use the version from Zeng et al. (2018). For SMiLER,
"EN" is the English split; "ALL" contains all data from 14 languages.
and in-language (IL) prompting. For English, CS- and IL-prompting are equivalent, since $L$ is English itself.
Word order of prompting
For the RC task, head-relation-tail triples involve three elements, so deriving natural language prompts from them requires deciding where to place the predicate (relation). In SOV languages, filling in a relation that occurs between $e_h$ and $e_t$ seems less intuitive. Therefore, to investigate whether the word order of prompting affects prediction accuracy, we swap the entities and the blank in the SVO-template “$x.$ $e_h$ ____ $e_t$” and obtain “$x.$ $e_h$ $e_t$ ____” as the SOV-template.
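The template swap can be sketched in a few lines. This is an illustrative assumption, not code from the paper: the helper name `build_prompt` is invented, and the blank is represented by an mT5-style sentinel token (`<extra_id_0>`), which is how T5-family models mark a span to be generated.

```python
# Sketch of SVO- vs. SOV-ordered prompt templates for RC.
# Assumption: the blank "____" is realized as an mT5 sentinel token.
BLANK = "<extra_id_0>"

def build_prompt(x: str, e_h: str, e_t: str, order: str = "svo") -> str:
    """Append an entity/blank pattern to the input sentence x."""
    if order == "svo":      # "x. e_h ____ e_t"
        pattern = f"{e_h} {BLANK} {e_t}"
    elif order == "sov":    # "x. e_h e_t ____"
        pattern = f"{e_h} {e_t} {BLANK}"
    else:
        raise ValueError(f"unknown order: {order}")
    return f"{x} {pattern}"

print(build_prompt("Bill Gates founded Microsoft.",
                   "Bill Gates", "Microsoft", "svo"))
# Bill Gates founded Microsoft. Bill Gates <extra_id_0> Microsoft
```

The same input sentence and entity pair thus yield two prompts that differ only in whether the relation slot sits between or after the entities.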
3.4 Training and Inference
The training and inference setups depend on the employed model. Prompting autoencoding language models requires the verbalizations to be of fixed length, since the number of masks, which equals the verbalization length, is unknown during inference. Encoder-decoders naturally handle verbalizations of varying length (Han et al., 2022; Du et al., 2022). Han et al. (2021) adjust all the verbalizations in TACRED to a length of 3, to enable
prompting with RoBERTa for RC. We argue that
for multilingual RC, this fix is largely infeasible,
because: (1) in case of in-language prompting on
SMiLER, the variance of the length of the verbal-
izations increases from 0.68 to 1.44 after translation
(see Table 2) and surpasses that of most of the listed monolingual RC datasets (SemEval, NYT and SCIERC),
making it harder to unify the length; (2) manually
adjusting the translated prompts requires manual
effort per target language, making it much more ex-
pensive than adjusting only English verbalizations.
Therefore, we suggest using an encoder-decoder
PLM for prompting (Song et al.,2022).
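The length statistics underlying argument (1) can be computed with a few lines. A minimal sketch: whitespace splitting stands in for the mT5BASE tokenizer, and the function name is illustrative.

```python
# Compute mean and std of verbalization token lengths (cf. Table 2).
# Assumption: whitespace tokenization as a stand-in for a real
# subword tokenizer such as mT5-base's.
from statistics import mean, pstdev

def length_stats(verbalizations, tokenize=str.split):
    """Return (mean, population std) of token counts."""
    lengths = [len(tokenize(v)) for v in verbalizations]
    return mean(lengths), pstdev(lengths)

# Example verbalizations from the SMiLER (EN) row of Table 2.
m, s = length_stats(["birth place", "starring", "won award"])
```

With a real subword tokenizer, substituting its `tokenize` method for `str.split` reproduces the statistics reported in the table.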
Training objective
For an encoder-decoder PLM $M$, given the prompt input $T(x)$ and the target sequence $\phi(r)$ (i.e., the label verbalization), we denote the output sequence as $y$. The probability of an exact-match decoding is calculated as follows:

$$\prod_{t=1}^{|\phi(r)|} P_\theta\left(y_t = \phi_t(r) \mid y_{<t}, T(x)\right), \tag{5}$$
where $y_t$ and $\phi_t(r)$ denote the $t$-th token of $y$ and $\phi(r)$, respectively, and $y_{<t}$ denotes the previously decoded sequence. $\theta$ represents the set of all learnable parameters, including those of the PLM, $\theta_M$, and those of the soft tokens, $\theta_{sp}$, in the case of the “soft prompt” variant. Hence, the final objective over the training set $\mathcal{X}$ is to minimize the negative log-likelihood:

$$\operatorname*{arg\,min}_{\theta} \; -\frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \sum_{t=1}^{|\phi(r)|} \log P_\theta\left(y_t = \phi_t(r) \mid y_{<t}, T(x)\right). \tag{6}$$
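This objective is the average per-example sequence negative log-likelihood under teacher forcing. A minimal sketch, assuming the model exposes a per-step probability lookup for each vocabulary token (the data layout and function names are illustrative, not the paper's implementation):

```python
# Negative log-likelihood of Eq. (6), sketched for clarity.
# Assumption: step_probs[t] maps tokens to P(y_t = token | y_<t, T(x)),
# i.e. per-step distributions already produced under teacher forcing.
import math

def sequence_nll(step_probs, target_tokens):
    """-sum_t log P(y_t = phi_t(r) | y_<t, T(x)) for one example."""
    return -sum(math.log(p[tok])
                for p, tok in zip(step_probs, target_tokens))

def batch_loss(examples):
    """Average the per-example NLL over the training set X."""
    return sum(sequence_nll(p, t) for p, t in examples) / len(examples)
```

Because the target is the gold verbalization $\phi(r)$, only the probability assigned to each gold token enters the sum, matching the product in Eq. (5) taken in log space.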
Inference
We collect the output logits of the decoder, $L \in \mathbb{R}^{|\mathcal{V}| \times L}$, where $|\mathcal{V}|$ is the vocabulary size of $M$ and $L$ is the maximum decode length. For each relation $r \in \mathcal{R}$, its score is given by (Han et al., 2022):

$$\mathrm{score}_\theta(r) := \frac{1}{|\phi(r)|} \sum_{t=1}^{|\phi(r)|} P_\theta\left(y_t = \phi_t(r)\right), \tag{7}$$
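A minimal sketch of this scoring rule, assuming per-position token probabilities have already been obtained from the decoder logits via a softmax; `token_prob` is a placeholder lookup, not an API from the paper:

```python
# Score each relation by the mean probability its verbalization tokens
# receive from the decoder, as in Eq. (7).
# Assumption: token_prob(t, token) returns P(y_t = token) at position t.

def score(verbalization_tokens, token_prob):
    """1/|phi(r)| * sum_t P(y_t = phi_t(r))."""
    n = len(verbalization_tokens)
    return sum(token_prob(t, tok)
               for t, tok in enumerate(verbalization_tokens)) / n

def predict(relations, verbalize, token_prob):
    """Return the relation whose verbalization scores highest."""
    return max(relations, key=lambda r: score(verbalize(r), token_prob))
```

Dividing by $|\phi(r)|$ normalizes for verbalization length, so relations with longer verbalizations are not penalized relative to shorter ones.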