
coherent given the context and the discourse connective. Given the commonsense knowledge and reasoning involved, this task is clearly challenging.
Our contributions are four-fold. First, we create DISCOSENSE, a new dataset aimed at testing LMs' commonsense reasoning capabilities through discourse connectives. Second, we employ an adversarial filtering approach based on controlled text generation to produce compelling negatives. Third, we establish baseline results on DISCOSENSE with numerous state-of-the-art discriminator models and show that they struggle to perform well on DISCOSENSE, which makes our dataset an ideal benchmark for next-generation commonsense reasoning systems. Finally, we show the efficacy of using DISCOSENSE as a transfer learning resource through sequential fine-tuning of LMs on DISCOSENSE followed by HELLASWAG, achieving near state-of-the-art results on the HELLASWAG test set. To stimulate work on this task, we make our code and data publicly available at https://github.com/prajjwal1/discosense/.
2 Related Work
In this section, we discuss related work, focusing on the differences between DISCOSENSE and existing commonsense reasoning benchmarks. In addition, we present an overview of Adversarial Filtering, which will facilitate the introduction of the Conditional Adversarial Filtering mechanism we propose in Section 3.
Commonsense reasoning benchmarks. SWAG (Zellers et al., 2018) and HELLASWAG (Zellers et al., 2019b) are arguably the most prominent commonsense reasoning benchmarks. In SWAG, given a partial description along with four candidate endings, the task is to predict the most plausible ending. The synthetic options (a.k.a. distractors) are generated through a process called Adversarial Filtering (AF) (see below). HELLASWAG is an extension of SWAG that seeks to eliminate artifacts in the generated endings. Unlike SWAG and HELLASWAG, DISCOSENSE requires that the discourse connective be taken into account in the reasoning process, thus increasing the number of inference steps and potentially the task complexity. In addition, while the examples in SWAG and HELLASWAG come primarily from ActivityNet (a benchmark focused on dense captioning of temporal events), DISCOSENSE features a more diverse set of examples drawn from varied domains that can only be solved with rich background knowledge.
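To make the task format concrete, here is a minimal sketch (not the authors' released code) of how a DISCOSENSE-style instance, consisting of a context, a discourse connective, and four candidate endings, might be scored with an off-the-shelf multiple-choice discriminator from Hugging Face Transformers; the model choice, example text, and context/ending pairing scheme are illustrative assumptions:

```python
# Sketch: scoring a multiple-choice instance with a discriminator LM.
# The instance fields and the "<connective> <ending>" pairing are assumptions.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMultipleChoice.from_pretrained("roberta-base")
model.eval()

context = "He took his car to the mechanic."
connective = "because"
endings = [
    "the engine had been making a strange noise.",
    "he wanted to sell it to a museum.",
    "the weather was sunny that day.",
    "he had just finished washing it.",
]

# Pair the context with each "<connective> <ending>" candidate; the model
# scores all four pairs jointly and the argmax is the predicted ending.
first = [context] * len(endings)
second = [f"{connective} {e}" for e in endings]
enc = tokenizer(first, second, return_tensors="pt", padding=True)
enc = {k: v.unsqueeze(0) for k, v in enc.items()}  # (batch=1, choices=4, seq)

with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, 4)
print("Predicted ending:", endings[logits.argmax(-1).item()])
```

Note that the multiple-choice head of roberta-base is randomly initialized here; in practice, a discriminator would first be fine-tuned on the DISCOSENSE training set before being evaluated this way.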
There are benchmarks that aim to test different kinds of commonsense reasoning abilities, although none of them focuses on reasoning over discourse connectives. SocialIQA (Sap et al., 2019), for instance, focuses on social and emotional commonsense reasoning. ABDUCTIVE NLI (Bhagavatula et al., 2020) focuses on abductive reasoning. WINOGRANDE (Sakaguchi et al., 2020) contains Winograd schema-inspired problems, which are essentially hard pronoun resolution problems requiring world knowledge. PIQA (Bisk et al., 2020) examines physical commonsense reasoning. MC-TACO (Zhou et al., 2019) and TIMEDIAL (Qin et al., 2021) focus on temporal reasoning in comprehension and dialogue formats, respectively.
More closely related to DISCOSENSE are commonsense reasoning benchmarks that involve reasoning with a particular kind of relation. COPA (Choice of Plausible Alternatives) (Roemmele et al., 2011) focuses exclusively on reasoning with CAUSAL relations and involves choosing the more plausible ending out of two (rather than four) options. P-MCQA (Qasemi et al., 2021) focuses exclusively on reasoning with PRECONDITION relations: given a commonsense fact, select the precondition that makes the fact possible (enabling) or impossible (disabling) out of four options.
δ-NLI (Rudinger et al., 2020), which aims to evaluate defeasible inference, focuses exclusively on reasoning with the STRENGTHEN/WEAKEN relations: given a premise-claim pair where the premise supports the claim, generate a sentence that either strengthens or weakens the support. WINOVENTI (Do and Pavlick, 2021), which is composed of Winograd-style schemas, focuses exclusively on reasoning with ENTAILMENT relations: given two sentences with an entailment relation, such as "Pete says the pear is delicious. The pear is ___", the goal is to fill in the blank with one of two choices (e.g., "edible", "inedible"). There are two key differences
between these datasets and DISCOSENSE. First, rather than focusing on a particular type of relation, DISCOSENSE encompasses 37 discourse connectives signaling different discourse relation types. Second, DISCOSENSE involves reasoning with discourse connectives, which is more complicated than reasoning with discourse relations. Specifically, as some connectives are sense-ambiguous