
tunity to test cross-task knowledge transfer in a setting where one target task depends on the other (HiFeatMTL); this is especially so given the evaluation methods used (detailed in Section 2).
We evaluate the effectiveness of boosting performance on the target tasks through the transfer of information from two related tasks: a) eSNLI, a dataset of explanations associated with NLI labels, and b) IMPLI, an NLI dataset (without explanations) that contains figurative language. More concretely, we set out to answer the following research questions:
1. Can distinct task-specific knowledge be transferred from separate tasks so as to improve performance on a target task? Concretely, can we transfer explanations of literal language from eSNLI and figurative NLI without explanations from IMPLI?
2. Which of the two knowledge transfer techniques (SFT or HiFeatMTL) is more effective in the text-to-text context?
2 The FigLang2022 Shared Task
FigLang2022 is a variation of the NLI task that requires the generation of a textual explanation for the NLI prediction. Additionally, the hypothesis is a sentence that employs one of four kinds of figurative expressions: sarcasm, simile, idiom, or metaphor. A hypothesis can also be a creative paraphrase, which rewords the premise using more expressive, literal terminology. Table 1 shows examples from the task dataset.
Entailment
  Premise:     I respectfully disagree.
  Hypothesis:  I beg to differ. (Idiom)
  Explanation: To beg to differ is to disagree with someone, and in this sentence the speaker is respectfully disagreeing.

Contradiction
  Premise:     She was calm.
  Hypothesis:  She was like a kitten in a den of coyotes. (Simile)
  Explanation: A kitten in a den of coyotes would be scared and not calm.

Table 1: An entailment and a contradiction pair from the FigLang2022 dataset.
FigLang2022 takes into consideration the quality of the generated explanation when assessing the model's performance by means of an explanation score, which is the average of BERTScore and BLEURT and ranges between 0 and 100. The task leaderboard is based on NLI label accuracy at an explanation score threshold of 60, although NLI label accuracy is reported at three explanation score thresholds (0, 50, and 60) so as to provide a glimpse of how the model's NLI and explanation abilities influence each other.
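As a minimal sketch of this evaluation scheme, the snippet below computes label accuracy at a given explanation-score threshold, assuming per-example BERTScore and BLEURT values have already been rescaled to 0-100 and that a prediction counts as correct only when its label matches and its explanation score reaches the threshold; the function name and toy data are illustrative, not the official scorer.

```python
from typing import Sequence

def accuracy_at_threshold(
    gold_labels: Sequence[str],
    pred_labels: Sequence[str],
    bert_scores: Sequence[float],    # per-example BERTScore, rescaled to 0-100
    bleurt_scores: Sequence[float],  # per-example BLEURT, rescaled to 0-100
    threshold: float,
) -> float:
    """NLI label accuracy at an explanation-score threshold.

    The explanation score of an example is the average of its BERTScore
    and BLEURT values; a prediction counts as correct only if the label
    matches and the explanation score reaches the threshold.
    """
    correct = 0
    for gold, pred, bs, bl in zip(gold_labels, pred_labels,
                                  bert_scores, bleurt_scores):
        explanation_score = (bs + bl) / 2  # ranges between 0 and 100
        if gold == pred and explanation_score >= threshold:
            correct += 1
    return correct / len(gold_labels)

# Toy example: two correct labels with explanation scores 80 and 52.
gold = ["Entailment", "Contradiction"]
pred = ["Entailment", "Contradiction"]
bert = [88.0, 55.0]
bleurt = [72.0, 49.0]
for t in (0, 50, 60):
    print(f"Acc@{t}: {accuracy_at_threshold(gold, pred, bert, bleurt, t):.2f}")
# Acc@0: 1.00, Acc@50: 1.00, Acc@60: 0.50
```

Reporting accuracy at threshold 0 isolates pure NLI performance, while the 50 and 60 thresholds progressively require that a correct label also comes with a well-rated explanation.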
3 Related Work
NLI is considered central to the task of Natural Language Understanding, and there has been significant focus on the development of models that perform well on it (Wang et al., 2018). The task has been independently extended to incorporate explanations (Camburu et al., 2018) and figurative language (Stowe et al., 2022) (both detailed below). Chakrabarty et al. (2022) introduced FLUTE, the Figurative Language Understanding and Textual Explanations dataset, which brings together these two aspects.
Previous shared tasks involving figurative language focused on the identification or representation of figurative knowledge: for example, FigLang2020 (Klebanov et al., 2020) and Task 6 of SemEval 2022 (Abu Farha et al., 2022) involved sarcasm detection, and Task 2 of SemEval 2022 (Tayyar Madabushi et al., 2022) involved the identification and representation of idioms.
The generation of textual explanations necessitates the use of generative models such as BART (Lewis et al., 2020) or T5 (Raffel et al., 2019). Narang et al. (2020) introduce WT5, a sequence-to-sequence model that outputs natural-text explanations alongside its predictions, and Erliksson et al. (2021) found T5 to consistently outperform BART in explanation generation.
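To make the text-to-text setup concrete, the sketch below shows one way an NLI example with an explanation could be serialized in the WT5 style, where an "explain" prefix on the input prompts the model to emit "explanation: ..." after its predicted label; the specific field names and ordering here are illustrative assumptions, not the official WT5 preprocessing.

```python
def wt5_style_example(premise: str, hypothesis: str,
                      label: str, explanation: str) -> tuple[str, str]:
    # Build a WT5-style (input, target) pair: the "explain" prefix asks the
    # model to append an explanation after its label. Field names and order
    # are illustrative, not the official preprocessing.
    source = f"explain nli premise: {premise} hypothesis: {hypothesis}"
    target = f"{label} explanation: {explanation}"
    return source, target

# Using the entailment example from Table 1:
src, tgt = wt5_style_example(
    "I respectfully disagree.",
    "I beg to differ.",
    "Entailment",
    "To beg to differ is to disagree with someone, and in this "
    "sentence the speaker is respectfully disagreeing.",
)
print(src)
print(tgt)
```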
Of specific relevance to our work are the IMPLI (Stowe et al., 2022) and eSNLI (Camburu et al., 2018) datasets. IMPLI links a figurative sentence, specifically an idiomatic or metaphoric one, to a literal counterpart, with the NLI relation being either entailment or non-entailment. Stowe et al. (2022) show that idioms are difficult for models to handle, particularly in non-entailment relations. The eSNLI dataset (Camburu et al., 2018) is an explanation dataset for general NLI: it extends the Stanford Natural Language Inference dataset (Bowman et al., 2015) with human-generated textual explanations.
Hierarchical feature pipeline based MTL architectures (HiFeatMTL) use the outputs of one task as a feature in the next and are distinct from hierarchical signal pipeline architectures wherein the