single answer selection task, but this score drops to
20% on the multiple answer selection task. The
difficulty of this task means that models need to
encode rich commonsense knowledge to solve it.
In this work, we attempt to encode commonsense
knowledge into a large pre-trained language model,
T5-Large, by continuing to train it on a dialogue-level
commonsense dataset, CICERO (Ghosal et al., 2022),
with a set of commonsense-aware pre-training objectives.
Large pre-trained language models, such as
GPT-2 (Radford et al., 2019) and T5 (Raffel
et al., 2020b), are attractive frameworks for solving
the contextual commonsense inference task. Through
fine-tuning, these models have achieved state-of-the-art
results on several natural language understanding
benchmarks, such as SuperGLUE (Wang et al., 2019).
Additionally, being trained on several hundred
gigabytes of text may have endowed these models
with substantial commonsense knowledge (Petroni et al., 2019).
However, fine-tuning alone may not suffice for
tasks with limited training samples. Previous
work (Gururangan et al., 2020; Zhou et al., 2021a)
has shown that pre-training with objectives catered
to the target task, prior to fine-tuning, may improve
performance on such tasks. Following this intuition,
we propose a set of self-supervised pre-training
objectives to adapt language models to the contextual
commonsense inference task, specifically addressing
the multi-choice answer selection task.
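Concretely, if the pre-training objectives are cast in T5's text-to-text format, the continued training step reduces to standard sequence-to-sequence learning. The snippet below is a minimal sketch assuming HuggingFace Transformers; the source/target strings are illustrative placeholders, not our actual objectives, which are defined later in the paper.

```python
# Minimal sketch: continuing to train T5-Large on text-to-text examples.
# The (source, target) pair stands in for an instance produced by one of
# the commonsense-aware pre-training objectives (illustrative only).
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tok = T5TokenizerFast.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

source = "dialogue: A: I know I've put on weight this winter. ... cause: <extra_id_0>"
target = "<extra_id_0> lack of physical activity during the winter"

enc = tok(source, return_tensors="pt", truncation=True)
labels = tok(target, return_tensors="pt", truncation=True).input_ids
loss = model(**enc, labels=labels).loss  # standard cross-entropy seq2seq loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```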
Thus, our contribution in this paper is twofold:
i) we curate CICEROv2, containing multiple distinct
contextual commonsense inferences per dimension,
and ii) we propose a set of pre-training objectives
for contextual commonsense inference that improves
over vanilla fine-tuning by about 1.9% on the
multi-choice answer selection task, defined on both
the CICERO and CICEROv2 datasets.
2 Primer on CICERO
The dialogues in CICERO (Ghosal et al., 2022)
are sourced from three datasets: DailyDialog
(Li et al., 2017), MuTual (Cui et al., 2020),
and DREAM (Sun et al., 2019). All dialogues
are dyadic, and their nature is particularly
conducive to qualitatively rich utterance-level
inferences. The annotated inferences are
categorized into five dimensions: cause, subsequent
event, prerequisite, motivation, and emotional
reaction. The tasks proposed on these inferences
require contextual understanding, multi-utterance
reasoning, and commonsense knowledge.
In addition to introducing CICERO, Ghosal et al.
(2022) also define a multi-choice answer selection
task (MCQ), in which the original annotation is
considered the primary correct answer. The
candidates for the remaining correct and incorrect
answers are generated using fine-tuned T5 models
(Raffel et al., 2020a). Adversarial filtering
(Zellers et al., 2018a) is applied to these
candidates to identify hard-to-distinguish answers,
which are then manually labeled as correct or incorrect.
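To make the filtering step concrete, the following is a minimal sketch of an adversarial filtering loop in the spirit of Zellers et al. (2018a); it is not the exact procedure used to build CICERO. The TF-IDF/logistic-regression discriminator, the function name, and the swap heuristic are our illustrative choices.

```python
# Minimal sketch of adversarial filtering: repeatedly train a discriminator
# on the current answer set and swap distractors it separates easily for
# candidate-pool answers it finds harder to distinguish from correct ones.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def adversarial_filter(correct, distractors, pool, rounds=5):
    for _ in range(rounds):
        texts = correct + distractors
        labels = [1] * len(correct) + [0] * len(distractors)
        vec = TfidfVectorizer().fit(texts + pool)
        clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
        # P(correct): low for an easy distractor, high for a confusable one.
        p_dis = clf.predict_proba(vec.transform(distractors))[:, 1]
        p_pool = clf.predict_proba(vec.transform(pool))[:, 1]
        easiest, hardest = int(np.argmin(p_dis)), int(np.argmax(p_pool))
        if p_pool[hardest] > p_dis[easiest]:  # swap only if it raises difficulty
            distractors[easiest], pool[hardest] = pool[hardest], distractors[easiest]
    return distractors
```

In CICERO, the surviving hard candidates are then manually labeled, which is how secondary correct answers enter the dataset.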
Drawbacks of CICERO.
The automatically generated answers that are labeled
as correct are the only source of secondary correct
answers in the CICERO dataset. In total, close to
15% of the instances contain multiple correct answers
(inferences). We empirically analyzed these instances
and found that the adversarial filtering algorithm
favors alternate answers that are lexically close to
the primary correct answer. As such, both correct and
incorrect answers bear a relatively high degree of
token-level and semantic similarity to each other,
as indicated in Table 2 in terms of BLEU, ROUGE-L,
CIDEr, and semantic-similarity metrics. This belies
the multiview nature of commonsense-based explanations,
where multiple independent or related explanations of
the same event may exist. This is demonstrated in
Fig. 1, where the target utterance “I don’t think so. I
know I’ve put on weight this winter.” can be a
consequence of multiple possible events. In particular,
the weight gain can be caused by a lack of physical
activity and exercise, by an unhealthy diet, or by both.
Myriad other factors, such as disease, may also
contribute to the weight gain, but those multitudes of
possibilities or views are not captured in CICERO.
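For reference, the token-level and semantic similarity measurements can be reproduced with standard off-the-shelf tools. The snippet below is a minimal sketch assuming the nltk, rouge-score, and sentence-transformers packages; the two answer strings are invented examples, and CIDEr is omitted because it requires corpus-level statistics.

```python
# Minimal sketch: pairwise similarity between a primary and an alternate
# answer, using BLEU, ROUGE-L, and embedding cosine similarity.
# The example answers below are illustrative, not taken from CICERO.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

primary = "The speaker gained weight due to a lack of exercise in winter."
alternate = "The speaker put on weight because they did not exercise."

bleu = sentence_bleu([primary.split()], alternate.split(),
                     smoothing_function=SmoothingFunction().method1)
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
    primary, alternate)["rougeL"].fmeasure
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode([primary, alternate], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()
print(f"BLEU={bleu:.3f}  ROUGE-L={rouge_l:.3f}  cosine={cosine:.3f}")
```

Averaged over answer pairs, such scores quantify how lexically and semantically interchangeable the correct and incorrect candidates are.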
3 CICEROv2
To address the drawbacks highlighted above, we
introduce CICEROv2, which aims to improve the
generalization ability of models trained on this data.
CICEROv2 contains commonsense inferences over
target utterances of dyadic dialogues sampled from
CICERO. A human annotator is given a dialogue with
a target utterance and is asked a question about that
utterance. The annotator writes multiple distinct
correct answers and two or more incorrect answers
to the question.
We start by sampling (dialogue, target, question)
triplets from CICERO. For these instances, we show
the annotators the original correct answer from
CICERO to avoid duplication. The annotators