
coherent given the context and the discourse connective. Given the commonsense knowledge and reasoning involved, this task is clearly challenging.
Our contributions are four-fold. First, we create DISCOSENSE, a new dataset aimed at testing LMs' commonsense reasoning capabilities through discourse connectives. Second, we employ an adversarial filtering approach based on controlled text generation to produce compelling negatives. Third, we establish baseline results on DISCOSENSE with numerous state-of-the-art discriminator models and show that they struggle to perform well on DISCOSENSE, which makes our dataset an ideal benchmark for next-generation commonsense reasoning systems. Finally, we show the efficacy of using DISCOSENSE as a transfer learning resource through sequential fine-tuning of LMs on DISCOSENSE followed by HELLASWAG, achieving near state-of-the-art results on the HELLASWAG test set. To stimulate work on this task, we make our code and data publicly available at https://github.com/prajjwal1/discosense/.
2 Related Work
In this section, we discuss related work, focusing on the differences between DISCOSENSE and existing commonsense reasoning benchmarks. In addition, we present an overview of Adversarial Filtering, which will facilitate the introduction of the Conditional Adversarial Filtering mechanism we propose in Section 3.
Commonsense reasoning benchmarks. SWAG (Zellers et al., 2018) and HELLASWAG (Zellers et al., 2019b) are arguably the most prominent commonsense reasoning benchmarks. In SWAG, given a partial description along with four candidate endings, the task is to predict the most plausible ending. The synthetic options (a.k.a. distractors) are generated through a process called Adversarial Filtering (AF) (see below). HELLASWAG is an extension of SWAG that seeks to eliminate artifacts in the generated endings. Unlike SWAG and HELLASWAG, DISCOSENSE requires that the discourse connective be taken into account in the reasoning process, thus increasing the number of inference steps and potentially the task complexity. In addition, while the examples in SWAG and HELLASWAG come primarily from ActivityNet (a benchmark focused on dense captioning of temporal events), DISCOSENSE features a more diverse set of examples drawn from varied domains that can only be solved with rich background knowledge.
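To make the task format concrete, here is a minimal sketch (not the authors' released code) of how a DISCOSENSE-style instance, consisting of a context, a discourse connective, and four candidate endings, might be scored with an off-the-shelf multiple-choice discriminator from Hugging Face Transformers; the model choice, example text, and context/ending pairing scheme are illustrative assumptions:

```python
# Sketch: scoring a multiple-choice instance with a discriminator LM.
# The instance fields and the "<connective> <ending>" pairing are assumptions.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMultipleChoice.from_pretrained("roberta-base")
model.eval()

context = "He took his car to the mechanic."
connective = "because"
endings = [
    "the engine had been making a strange noise.",
    "he wanted to sell it to a museum.",
    "the weather was sunny that day.",
    "he had just finished washing it.",
]

# Pair the context with each "<connective> <ending>" candidate; the model
# scores all four pairs jointly and the argmax is the predicted ending.
first = [context] * len(endings)
second = [f"{connective} {e}" for e in endings]
enc = tokenizer(first, second, return_tensors="pt", padding=True)
enc = {k: v.unsqueeze(0) for k, v in enc.items()}  # (batch=1, choices=4, seq)

with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, 4)
print("Predicted ending:", endings[logits.argmax(-1).item()])
```

Note that the multiple-choice head of roberta-base is randomly initialized here; in practice, a discriminator would first be fine-tuned on the DISCOSENSE training set before being evaluated this way.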
There are benchmarks that aim to test different kinds of commonsense reasoning abilities, although none of them focuses on reasoning over discourse connectives. SocialIQA (Sap et al., 2019), for instance, focuses on social and emotional commonsense reasoning. ABDUCTIVE NLI (Bhagavatula et al., 2020) focuses on abductive reasoning. WINOGRANDE (Sakaguchi et al., 2020) contains Winograd schema-inspired problems, which are essentially hard pronoun resolution problems requiring world knowledge. PIQA (Bisk et al., 2020) examines physical commonsense reasoning. MC-TACO (Zhou et al., 2019) and TIMEDIAL (Qin et al., 2021) focus on temporal reasoning in comprehension and dialogue formats, respectively.
More closely related to DISCOSENSE are commonsense reasoning benchmarks that involve reasoning with a particular kind of relation. COPA (Choice of Plausible Alternatives) (Roemmele et al., 2011) focuses exclusively on reasoning with CAUSAL relations and involves choosing the more plausible ending out of two (rather than four) options. P-MCQA (Qasemi et al., 2021) focuses exclusively on reasoning with PRECONDITION relations: given a commonsense fact, select the precondition that makes the fact possible (enabling) or impossible (disabling) out of four options.
δ-NLI (Rudinger et al., 2020), which aims to evaluate defeasible inference, focuses exclusively on reasoning with the STRENGTHEN/WEAKEN relations: given a premise-claim pair where the premise supports the claim, generate a sentence that either strengthens or weakens the support. WINOVENTI (Do and Pavlick, 2021), which is composed of Winograd-style schemas, focuses exclusively on reasoning with ENTAILMENT relations: given two sentences with an entailment relation, such as "Pete says the pear is delicious. The pear is ___", the goal is to fill in the blank with one of two choices (e.g., "edible", "inedible"). There are two key differences
between these datasets and DISCOSENSE. First, rather than focusing on a particular type of relation, DISCOSENSE encompasses 37 discourse connectives signaling different discourse relation types. Second, DISCOSENSE involves reasoning with discourse connectives, which is more complicated than reasoning with discourse relations. Specifically, as some connectives are sense-ambiguous