CIKQA: Learning Commonsense Inference with a Unified
Knowledge-in-the-loop QA Paradigm
Hongming Zhang1,2, Yintong Huo3, Yanai Elazar4,5, Yangqiu Song1, Yoav Goldberg4,6, Dan Roth2
1HKUST, 2UPenn, 3CUHK, 4AI2, 5University of Washington, 6Bar Ilan University
{hzhangal,yqsong}@cse.ust.hk, ythuo@cse.cuhk.edu.hk
{yanaiela,yoav.goldberg}@gmail.com, danroth@seas.upenn.edu
Abstract
Recently, the community has achieved substantial progress on many commonsense reasoning benchmarks. However, it is still unclear what is learned from the training process: the knowledge, the inference capability, or both? We argue that due to the large scale of commonsense knowledge, it is infeasible to annotate a training set large enough to cover all the commonsense knowledge each task requires. Thus, commonsense knowledge acquisition and inference over commonsense knowledge should be treated as two separate tasks. In this work, we focus on investigating models' commonsense inference capabilities from two perspectives: (1) whether models can tell if the knowledge they have is enough to solve the task; (2) whether models can develop commonsense inference capabilities that generalize across commonsense tasks. We first align commonsense tasks with relevant knowledge from commonsense knowledge bases and ask humans to annotate whether the knowledge is sufficient or not. Then, we convert different commonsense tasks into a unified question answering format to evaluate models' generalization capabilities. We name this benchmark Commonsense Inference with Knowledge-in-the-loop Question Answering (CIKQA).
1 Introduction
Understanding human language requires both linguistic knowledge (e.g., grammar and semantics) and world knowledge, which can be further divided into factual and commonsense knowledge (Katz and Fodor, 1963). Recently, the community has made great progress on helping machines acquire and apply linguistic and factual knowledge. However, how to help machines acquire and infer over commonsense knowledge is still unclear. To answer this question, many commonsense reasoning datasets (Roemmele et al., 2011; Sakaguchi et al., 2020; Talmor et al., 2019; Zellers et al., 2019; Lin et al., 2020) have been proposed. Even though these datasets target different knowledge types and modalities and come in different formats, they often follow a standard supervised learning setting, which aims at helping machines solve a specific task with the associated training data. However, two limitations of this learning paradigm have restricted the development of commonsense reasoning systems.
First, there is no clear separation between knowledge and inference. As discussed by Elazar et al. (2021), a common phenomenon is that larger training sets lead to better performance, mainly because richer knowledge is covered. However, due to the large scale of commonsense knowledge, it is infeasible to annotate a sufficiently large training set for each task, and the responsibility of the training data should be teaching models how to perform inference rather than how to acquire commonsense knowledge. Several recent works have explored using structured knowledge for commonsense reasoning tasks (Lin et al., 2019; Lv et al., 2020; Paul and Frank, 2020). However, as these works did not clearly analyze the coverage of the structured knowledge (i.e., knowledge graphs (KGs)), it is still unclear whether the reported performance reflects better knowledge coverage or better inference capability. To dig into what is behind this learning process, we propose to equip each question with auto-extracted knowledge and ask humans to annotate whether that knowledge is gold (i.e., sufficient to answer the question). By doing so, we can evaluate both whether models can tell if the provided knowledge is gold and how well they can conduct inference over the provided knowledge to solve the task.
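To make this setting concrete, the sketch below illustrates what such a knowledge-in-the-loop instance might look like once a task (here, a Winograd-style pronoun resolution question) is converted into the unified QA format. All field names and the example triples are illustrative assumptions, not the released data schema.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class KnowledgeInTheLoopInstance:
    # Hypothetical instance structure; field names are assumptions for illustration.
    question: str                         # the original task rewritten as a question
    choices: List[str]                    # candidate answers
    answer: int                           # index of the correct choice
    # Knowledge auto-extracted from a commonsense KG as (head, relation, tail) triples
    knowledge: List[Tuple[str, str, str]] = field(default_factory=list)
    # Human annotation: is the extracted knowledge sufficient ("gold") to answer the question?
    knowledge_is_gold: bool = False

example = KnowledgeInTheLoopInstance(
    question="The trophy doesn't fit into the suitcase. What is too small?",
    choices=["the trophy", "the suitcase"],
    answer=1,
    knowledge=[("suitcase", "UsedFor", "holding objects"),
               ("small container", "NotCapableOf", "holding large objects")],
    knowledge_is_gold=True,
)

Under this view, a model is evaluated both on recognizing whether the attached knowledge is sufficient and on inferring the answer from that knowledge, rather than on memorizing knowledge from the training data.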
Second, supervised learning may force the model to learn the distribution of the training data rather than a universal inference model. As a result, the model may perform well on a test set that follows the same distribution but fail on other tasks (Kejriwal and Shen, 2020). Previously, as different tasks have different formats, it is hard