DISCOSENSE: Commonsense Reasoning with Discourse Connectives
Prajjwal Bhargava
University of Texas at Dallas
prajjwalin@protonmail.com
Vincent Ng
University of Texas at Dallas
vince@hlt.utdallas.edu
Abstract

We present DISCOSENSE, a benchmark for commonsense reasoning via understanding a wide variety of discourse connectives. We generate compelling distractors in DISCOSENSE using Conditional Adversarial Filtering, an extension of Adversarial Filtering that employs conditional generation. We show that state-of-the-art pre-trained language models struggle to perform well on DISCOSENSE, which makes this dataset ideal for evaluating next-generation commonsense reasoning systems.
1 Introduction
Much of the recent work in commonsense reasoning has focused on evaluating a pre-trained language model's (LM) ability to predict the most plausible ending/option given a context. Even after devising bias reduction techniques (Zellers et al., 2019b; Bras et al., 2020) to mitigate the effects of annotation artifacts and make the task difficult, state-of-the-art LMs have managed to achieve or even surpass human performance on numerous commonsense downstream tasks (Zellers et al., 2019b; Sakaguchi et al., 2020; Bhagavatula et al., 2020). Nevertheless, these LMs are still very far from being able to perform commonsense reasoning as well as humans. Hence, the fact that they have begun to ace existing benchmarks implies that the time is ripe to design a new challenging benchmark that can reliably target their limitations.
Motivated by this observation, we present DISCOSENSE, a benchmark for performing commonsense reasoning through understanding a wide variety of discourse connectives. Figure 1 shows an example taken from DISCOSENSE.

Our waitress was very nice, but she kept on forgetting my stuff. For example
a) When I ordered the garlic shrimp, she remembered to add my requested garlic butter.
b) She took forever to bring me my beer and fries.
c) When I told her I wanted to use the free breakfast that was available she was not pleased.
d) For some customers, this is fine.

Figure 1: Example of commonsense reasoning with discourse connectives. The correct (i.e., most plausible) option is b).

As can be seen, an example is composed of a context (e.g., "Our waitress was very nice, but she kept on forgetting my stuff.") and a discourse connective (e.g., "For example"), and the goal is to choose the most plausible ending out of four options.
If we ignore the discourse connective, then all four options may seem plausible because we do not know what the writer's intent is. Once we consider both the context and the discourse connective, then it is clear that only option b) is plausible. The reason is that "For example" signals an EXEMPLIFICATION relation between its arguments, and what follows the discourse connective is expected to be an example of the waitress keeping on forgetting the writer's stuff. Using commonsense knowledge, we know that (1) "my beer and fries" is an example of "my stuff", and (2) her taking forever to bring the writer stuff implies she kept on forgetting his/her stuff.

What if we replace "For example" with "However" in the example? Since "However" signals a CONTRAST relation, options a) and d) both seem viable. Specifically, option a) describes a situation in which she did not forget the writer's stuff. While option d), unlike option a), does not describe any example that signals a contrast, one may infer a contrast between option d) and the context: being forgetful is fine for some customers. Nevertheless, option a) is arguably more plausible than option d) and should be chosen. The reason is that for d) to be sensible, one needs to assume that her forgetting the writer's stuff implies that she is in general forgetful. Without this assumption, it may be strange for other customers to have an opinion on her forgetting the writer's stuff. In general, the most plausible option is the one that makes the smallest number of assumptions and/or is the most coherent given the context and the discourse connective. Considering the commonsense knowledge and the reasoning involved, it should not be difficult to see that this task is challenging.
Our contributions are four-fold. First, we create DISCOSENSE, a new dataset aimed at testing LMs' commonsense reasoning capabilities through discourse connectives. Second, we employ a controlled text generation based adversarial filtering approach to generate compelling negatives. Third, we establish baseline results on DISCOSENSE with numerous state-of-the-art discriminator models and show that they struggle to perform well on DISCOSENSE, which makes our dataset an ideal benchmark for next-generation commonsense reasoning systems. Finally, we show the efficacy of using DISCOSENSE as a transfer learning resource through sequential fine-tuning of LMs on DISCOSENSE followed by HELLASWAG, achieving near state-of-the-art results on the HELLASWAG test set. To stimulate work on this task, we make our code and data publicly available at https://github.com/prajjwal1/discosense/.
2 Related Work
In this section, we discuss related work, focusing our discussion on the differences between DISCOSENSE and existing commonsense reasoning benchmarks. In addition, we present an overview of Adversarial Filtering, which will facilitate the introduction of the Conditional Adversarial Filtering mechanism we propose in Section 3.
Commonsense reasoning benchmarks. SWAG (Zellers et al., 2018) and HELLASWAG (Zellers et al., 2019b) are arguably the most prominent commonsense reasoning benchmarks. In SWAG, given a partial description along with four candidate endings, the task is to predict the most plausible ending. The synthetic options (a.k.a. distractors) are generated through a process called Adversarial Filtering (AF) (see below). HELLASWAG is an extension of SWAG that seeks to eliminate artifacts in the generated endings. Unlike SWAG and HELLASWAG, DISCOSENSE requires that the discourse connective be taken into account in the reasoning process, thus increasing the number of inference steps and potentially the task complexity. In addition, while the examples in SWAG and HELLASWAG come primarily from ActivityNet (a benchmark focused on dense captioning of temporal events), DISCOSENSE features a more diverse set of examples coming from varied domains that may only be solved with rich background knowledge.
There are benchmarks that aim to test different kinds of commonsense reasoning abilities, although none of them focuses on reasoning over discourse connectives. SocialIQA (Sap et al., 2019), for instance, focuses on social and emotional commonsense reasoning. ABDUCTIVE NLI (Bhagavatula et al., 2020) focuses on abductive reasoning. WINOGRANDE (Sakaguchi et al., 2020) contains Winograd schema-inspired problems, which are essentially hard pronoun resolution problems requiring world knowledge. PIQA (Bisk et al., 2020) examines physical commonsense reasoning. MC-TACO (Zhou et al., 2019) and TIMEDIAL (Qin et al., 2021) focus on temporal reasoning in comprehension and dialogue formats.
More closely related to DISCOSENSE are commonsense reasoning benchmarks that involve reasoning with a particular kind of relation. COPA (Choice of Plausible Alternatives) (Roemmele et al., 2011) focuses exclusively on reasoning with CAUSAL relations and involves choosing the more plausible ending out of two (rather than four) options. P-MCQA (Qasemi et al., 2021) focuses exclusively on reasoning with PRECONDITION relations: given a commonsense fact, select the precondition that makes the fact possible (enabling) or impossible (disabling) out of four options. δ-NLI (Rudinger et al., 2020), which aims to evaluate defeasible inference, focuses exclusively on reasoning with the STRENGTHEN/WEAKEN relations: given a premise-claim pair where the premise supports the claim, generate a sentence that either strengthens or weakens the support. WINOVENTI (Do and Pavlick, 2021), which is composed of Winograd-style schemas, focuses exclusively on reasoning with ENTAILMENT relations: given two sentences with an entailment relation, such as "Pete says the pear is delicious. The pear is ___", the goal is to fill in the blank with one of two choices (e.g., "edible", "inedible"). There are two key differences between these datasets and DISCOSENSE. First, rather than focusing on a particular type of relation, DISCOSENSE encompasses 37 discourse connectives signaling different discourse relation types. Second, DISCOSENSE involves reasoning with discourse connectives, which is more complicated than reasoning with discourse relations. Specifically, as some connectives are sense-ambiguous (e.g., the connective "since" may serve as a temporal or a causal connective (Pitler and Nenkova, 2009)), an LM will likely need to (implicitly) perform sense disambiguation in order to perform well on DISCOSENSE.

Dataset                                Model   Human
SWAG (Zellers et al., 2018)            91.71   88
αNLI (Bhagavatula et al., 2020)        91.18   92.9
HellaSwag (Zellers et al., 2019b)      93.85   95.6
CosmosQA (Huang et al., 2019)          91.79   94
PIQA (Bisk et al., 2020)               90.13   94.9
SocialIQa (Sap et al., 2019)           83.15   88.1
MC-TACO (Zhou et al., 2019)            80.87   75.8
WinoGrande (Sakaguchi et al., 2020)    86.64   94
ProtoQA (Boratko et al., 2020)         54.15   74.03
VCR (Zellers et al., 2019a)            63.15   85

Table 1: Status of how competitive current commonsense reasoning benchmarks are for state-of-the-art pre-trained language models.
There are datasets and knowledge bases where the semantic/discourse/commonsense relations are explicitly annotated and which can provide data sources from which commonsense reasoning benchmarks can be derived. Examples include (1) the Penn Discourse TreeBank (Prasad et al., 2008), where two sentences or text segments are annotated with their discourse relation type, if any; (2) COREQUISITE (Qasemi et al., 2021), which is used to provide the commonsense facts and the human-generated preconditions in the P-MCQA dataset mentioned above; (3) SNLI (Bowman et al., 2015), where each premise-hypothesis pair is annotated as ENTAILMENT, CONTRADICTION, or NEUTRAL; (4) ATOMIC 2020 (Hwang et al., 2021), a commonsense knowledge graph whose nodes correspond to propositions and whose edges correspond to social/physical commonsense relations; and (5) SOCIAL-CHEM-101 (Forbes et al., 2020), a collection of statements about commonsense social judgments made given everyday situations.
One of the motivations behind the creation of DISCOSENSE is that state-of-the-art LMs have managed to achieve or even surpass human performance on various commonsense reasoning benchmarks. Table 1 shows the best accuracies achieved by existing LMs on 10 widely used commonsense reasoning benchmarks and the corresponding human performance levels. As can be seen, existing LMs have managed to achieve an accuracy of more than 80% on eight of these benchmarks.
[Figure 2: Components of Adversarial Filtering. Given a context and discourse marker with four options, a discriminator LM finds the easiest option, a generator LM produces a new adversarial option to replace it, and the process repeats until convergence.]
Adversarial Filtering (AF). Originally proposed by Zellers et al. (2018), AF aims to create examples that would be difficult for models to solve, specifically by replacing the easy options in correctly solved examples with difficult ones. As shown in Figure 2, AF has three components: data (i.e., examples with multiple options, one of which is correct), a discriminator LM (a classifier that is used to solve each example), and a generator LM (a model that generates new options for an example). In each AF iteration, the discriminator LM is trained on the training set and used to solve each example in the test set. If a test example is incorrectly solved (i.e., the discriminator LM chooses the wrong option), the example is deemed sufficiently difficult and no change is made to it. On the other hand, if a test example is correctly solved, then AF seeks to increase its difficulty by replacing the easiest option (i.e., the generated option that the discriminator LM classifies with the highest confidence) with a new option generated by the generator LM. Training a new discriminator LM in each AF iteration ensures that the dataset is adversarial not just for one LM but for a class of LMs, as training different instances of the same type of LM results in models with differently learned linguistic representations. This process is repeated on all correctly classified examples in the test set until performance on the test set converges.
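As an illustration, the following is a minimal sketch of the AF loop, with the discriminator training routine, the plausibility scorer, and the option generator left as hypothetical callables; it mirrors the procedure described above rather than the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    context: str        # contextual sentence plus discourse connective
    options: List[str]  # four candidate endings
    gold_index: int     # index of the human-written ending

def adversarial_filtering(
    examples: List[Example],
    train_discriminator: Callable[[List[Example]], Callable[[str, str], float]],
    generate_option: Callable[[str], str],
    max_iters: int = 5,
) -> None:
    """Sketch of the AF loop from Figure 2. train_discriminator returns a
    plausibility scorer; generate_option produces a new candidate ending."""
    for _ in range(max_iters):
        # Retrain the discriminator from scratch each iteration so the
        # dataset ends up adversarial for a class of LMs, not one model.
        scorer = train_discriminator(examples)
        num_replaced = 0
        for ex in examples:
            scores = [scorer(ex.context, o) for o in ex.options]
            if max(range(4), key=scores.__getitem__) != ex.gold_index:
                continue  # incorrectly solved: already hard enough
            # Replace the "easiest" distractor, i.e., the wrong option the
            # discriminator rules out with the highest confidence (read
            # here as the lowest plausibility score).
            distractors = [i for i in range(4) if i != ex.gold_index]
            easiest = min(distractors, key=scores.__getitem__)
            ex.options[easiest] = generate_option(ex.context)
            num_replaced += 1
        if num_replaced == 0:  # test-set performance has converged
            break
```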
3 DISCOSENSE
3.1 Task Description
DISCOSENSE aims to measure the commonsense inference abilities of computational models through the use of discourse connectives: the correct ending can be identified only after understanding the purpose of the given discourse connective. Given a context c = (s, d), which is composed of a contextual sentence s and a discourse connective d, as well as a set of four options O = {o_1, o_2, o_3, o_4}, the task is to predict the most plausible ending o_i ∈ O.
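For concreteness, an instance can be represented as in the sketch below; the field names are illustrative and not necessarily those used in the released data files.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DiscoSenseExample:
    sentence: str       # contextual sentence s
    connective: str     # discourse connective d
    options: List[str]  # candidate endings o_1 ... o_4
    label: int          # index of the most plausible ending

# The example from Figure 1, encoded in this format.
example = DiscoSenseExample(
    sentence="Our waitress was very nice, but she kept on forgetting my stuff.",
    connective="For example",
    options=[
        "When I ordered the garlic shrimp, she remembered to add my "
        "requested garlic butter.",
        "She took forever to bring me my beer and fries.",
        "When I told her I wanted to use the free breakfast that was "
        "available she was not pleased.",
        "For some customers, this is fine.",
    ],
    label=1,  # option b)
)
```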
Data Source              DISCOSENSE Train   DISCOSENSE Test
DISCOVERY Train          Bottom 7%          -
DISCOVERY Validation     -                  100%
DISCOFUSE Train w/ DC    Top 54k            -

Table 2: Data sources for DISCOSENSE and its composition before human verification. DC refers to those samples in DISCOFUSE that are concerned with the discourse connective phenomenon.
3.2 Dataset Creation
To assemble DISCOSENSE, we focus on source datasets that contain two sentences connected through a discourse connective. Specifically, we use two peer-reviewed academic datasets, DISCOVERY (Sileo et al., 2019) and DISCOFUSE (Geva et al., 2019). In DISCOVERY, each example is composed of two sentences connected via a discourse connective, for the purpose of learning joint sentence representations with discourse connectives. DISCOFUSE, on the other hand, is assembled for the task of sentence fusion (i.e., joining several independent sentences into a single coherent sentence). We only consider those examples where a discourse connective is needed for sentence fusion, and include in DISCOSENSE the fused sentences in the Wikipedia split of DISCOFUSE. Since these datasets contain sentences from Common Crawl (https://commoncrawl.org/) and Wikipedia articles, DISCOSENSE is diverse in the topics it covers. Importantly, since by construction the discourse connective is crucial in solving the underlying tasks (i.e., sentence representation learning and sentence fusion), these sentences are well suited to our use case. Details of how the DISCOVERY and DISCOFUSE sentences are used to create DISCOSENSE are shown in Tables 2 and 3.
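As an illustration of the selection step, the sketch below filters a DISCOFUSE split down to rows involving a discourse connective. The column name connective_string follows the public DISCOFUSE release, but the exact criterion and the top-54k selection used for DISCOSENSE are assumptions here.

```python
import csv
from typing import Dict, List

def discofuse_connective_rows(tsv_path: str, limit: int = 54_000) -> List[Dict]:
    """Keep DISCOFUSE fusion examples that involve a discourse connective
    (the 'w/ DC' rows in Table 2), up to the first `limit` matches."""
    kept: List[Dict] = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # A non-empty connective field marks examples where a
            # connective was required to fuse the sentences.
            if row.get("connective_string", "").strip():
                kept.append(row)
                if len(kept) >= limit:
                    break
    return kept
```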
3.3 Generating Options
Next, we describe how we generate challenging options for DISCOSENSE using an improved version of AF that we call Conditional Adversarial Filtering (CAF). CAF follows the AF procedure in Figure 2, differing from AF only in terms of (1) the generator LM (Section 3.3.1), (2) the discriminator LM (Section 3.3.2), and (3) how the generator LMs are used to generate options (Section 3.3.3).
Data               Generator LM
DISCOVERY Train    Last 93%
DISCOVERY Test     100%

Table 3: Data used to train the generator LMs in Conditional Adversarial Filtering.
3.3.1 Conditional Generator LM
Pre-training does not explicitly teach a model how important a particular token or text span is in contributing to the semantics of a sentence. Hence, to generate sentences that are coherent with not only the context but also the discourse connective, we propose to use controllable text generation, which aims to provide more granular control over how generation happens so as to match a particular attribute. In the context of Transformer-based LMs, there are two lines of research on controllable text generation. One examines how to steer generation by fine-tuning an extra set of parameters while keeping the base (unconditionally trained) model fixed (Dathathri et al., 2020; Qin et al., 2020; Zhang et al., 2020; Krause et al., 2020), while the other involves conditionally training a generative model on a control variable to generate text w.r.t. a prompt prefix. We adopt the latter approach, extending CTRL (Keskar et al., 2019) to explicitly steer generation w.r.t. discourse relations by using discourse connectives as control codes, as described below.
Training. The input to CTRL is as follows:

input: [d] + [context] → label: [ending]

where d is a discourse connective. Specifically, each input context for CTRL is prepended with a connective, and the training task for CTRL is to learn the conditional distribution p(e | d, context) over possible endings e. The predicted ending is then compared with the human-generated ending to compute the loss. Since the original CTRL model is pre-trained with control codes suitable for open-ended text generation, we fine-tune CTRL on the portion of DISCOVERY shown in Table 3, using all 174 connectives present in the selected splits. Comparing Tables 2 and 3, we can see that the data the generator LM is fine-tuned on is not part of DISCOSENSE. This ensures that the endings generated by the generator LM are different from the ground truth (i.e., the human-written endings).
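A minimal sketch of this conditional training setup with the Hugging Face CTRL checkpoint is shown below, assuming the connective is simply prepended to the input as a control code; the actual preprocessing and loss masking may differ from what is shown.

```python
from transformers import CTRLLMHeadModel, CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

def training_batch(connective: str, context: str, ending: str):
    """Prepend the discourse connective as a control code so the model
    learns p(ending | connective, context) via standard causal-LM
    fine-tuning (the model shifts the labels internally)."""
    text = f"{connective} {context} {ending}"
    enc = tokenizer(text, return_tensors="pt")
    # For simplicity the loss here covers all tokens; in practice one
    # might mask out the connective/context positions with -100.
    enc["labels"] = enc["input_ids"].clone()
    return enc

batch = training_batch(
    "For example",
    "Our waitress was very nice, but she kept on forgetting my stuff.",
    "She took forever to bring me my beer and fries.",
)
loss = model(**batch).loss  # minimize with any optimizer, e.g. AdamW
```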
Decoding. We use nucleus sampling (Holtzman et al., 2020) to generate options for the training set, with the value of p set to 0.7: at each decoding step, the model samples from the smallest set of tokens whose cumulative probability exceeds p.
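Under the same assumptions as the fine-tuning sketch above, generating a candidate ending with nucleus sampling (p = 0.7) might look like this:

```python
import torch

def generate_ending(model, tokenizer, connective: str, context: str,
                    max_new_tokens: int = 40) -> str:
    """Sample one candidate ending conditioned on the connective control
    code and the context, using nucleus sampling with p = 0.7."""
    prompt = f"{connective} {context}"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,       # sample rather than decode greedily
            top_p=0.7,            # nucleus: smallest token set with mass > p
            max_new_tokens=max_new_tokens,
        )
    # Drop the prompt tokens and return only the generated ending.
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```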