CIKQA: Learning Commonsense Inference with a Unified
Knowledge-in-the-loop QA Paradigm
Hongming Zhang1,2, Yintong Huo3, Yanai Elazar4,5, Yangqiu Song1, Yoav Goldberg4,6, Dan Roth2
1HKUST, 2UPenn, 3CUHK, 4AI2, 5University of Washington, 6Bar Ilan University
{hzhangal,yqsong}@cse.ust.hk,ythuo@cse.cuhk.edu.hk
{yanaiela,yoav.goldberg}@gmail.com,danroth@seas.upenn.edu
Abstract
Recently, the community has achieved substantial progress on many commonsense reasoning benchmarks. However, it is still unclear what is learned from the training process: the knowledge, the inference capability, or both? We argue that due to the large scale of commonsense knowledge, it is infeasible to annotate a training set large enough to cover all the commonsense needed for each task. Thus, we should treat commonsense knowledge acquisition and inference over commonsense knowledge as two separate tasks. In this work, we focus on investigating models' commonsense inference capabilities from two perspectives: (1) whether models can know if the knowledge they have is enough to solve the task; (2) whether models can develop commonsense inference capabilities that generalize across commonsense tasks. We first align commonsense tasks with relevant knowledge from commonsense knowledge bases and ask humans to annotate whether the knowledge is enough or not. Then, we convert different commonsense tasks into a unified question answering format to evaluate models' generalization capabilities. We name the benchmark Commonsense Inference with Knowledge-in-the-loop Question Answering (CIKQA).
1 Introduction
Understanding human language requires both language knowledge (e.g., grammar and semantics) and world knowledge, which can be further divided into factual and commonsense knowledge (Katz and Fodor, 1963). Recently, the community has made great progress on helping machines acquire and apply language and factual knowledge. However, how to help machines acquire and infer over commonsense is still unclear. To answer this question, many commonsense reasoning datasets (Roemmele et al., 2011; Sakaguchi et al., 2020; Talmor et al., 2019; Zellers et al., 2019; Lin et al., 2020) have been proposed. Even though they target different knowledge types and modalities and come in different formats, they typically follow a standard supervised learning setting, which aims at helping machines solve a specific task with the training data. However, two limitations of this learning paradigm have restricted the development of commonsense reasoning systems.
First, there is no clear separation between knowledge and inference. As discussed in Elazar et al. (2021), a common phenomenon is that larger training data lead to better performance, mainly because richer knowledge is covered. However, due to the large scale of commonsense knowledge, it is infeasible to annotate a large enough training set for each task, and the responsibility of the training data should be teaching models how to do inference rather than how to acquire commonsense knowledge. Several recent works have explored using structured knowledge for commonsense reasoning tasks (Lin et al., 2019; Lv et al., 2020; Paul and Frank, 2020). However, as these works did not clearly analyze the coverage of the structured knowledge (i.e., knowledge graphs (KGs)), it is still unclear what the performance gains reflect: better knowledge coverage or better inference capability. To dig into what is behind this learning process, we propose to equip each question with auto-extracted knowledge and ask humans to annotate whether the knowledge is gold (i.e., sufficient to answer the question). By doing so, we can evaluate both whether models can tell if the provided knowledge is gold and how well they can conduct inference over the provided knowledge to solve the task.
Second, supervised learning may force the model to learn the distribution of the training data rather than a universal inference model. As a result, the model may perform well on a test set that follows the same distribution but fail on other tasks (Kejriwal and Shen, 2020). Previously, as different tasks have different formats, it was hard to evaluate the generalization ability of commonsense reasoning models. Motivated by the existing trend of using a unified format (i.e., question answering) for different tasks (Khashabi et al., 2020), we propose to convert various commonsense reasoning tasks into a unified QA format such that we can easily and fairly evaluate the generalization ability of learned commonsense reasoning models.

Figure 1: CIKQA demonstration. All tasks are converted into a unified format such that we can easily evaluate the generalization capability of all models. We also equip all questions with auto-extracted knowledge graphs from existing KGs and ask humans to annotate whether the knowledge is gold or not. In this example, we expect models to first identify the quality of the knowledge and then conduct inference over the knowledge to solve the question.
Combining these two lines of effort, we propose a new commonsense inference evaluation benchmark, Commonsense Inference with Knowledge-in-the-loop QA (CIKQA). An example is shown in Figure 1. We first convert several popular commonsense reasoning tasks into a unified QA format and equip them with relevant knowledge from existing commonsense knowledge graphs. We leverage human annotation to label whether the provided knowledge is gold for answering the question. With CIKQA, we are interested in answering two questions: (1) Can current models distinguish whether the provided knowledge is gold or not? (2) Can current commonsense inference models generalize across different commonsense reasoning tasks?
Experiments with several recent knowledge-based commonsense reasoning models show that even though current deep models can learn to conduct simple inference after training with a few examples when gold knowledge is provided, they still cannot learn to distinguish gold knowledge very well. Moreover, even though current models demonstrate an encouraging generalization ability across the three tasks we consider, they still cannot learn complex inference (e.g., abductive reasoning) very well. We hope that our benchmark, which is available at https://github.com/CogComp/CIKQA, can motivate more advanced commonsense inference methods in the future.
2 Dataset Construction
In CIKQA, to encourage a generalizable commonsense inference model, we follow previous work (Khashabi et al., 2020; Cohen et al., 2020; Wu et al., 2020; Du and Cardie, 2020) and unify all selected tasks as a binary question answering problem, equipping each question with a supporting knowledge graph G retrieved from existing commonsense KGs. We leverage crowdsourcing workers to annotate whether the knowledge is gold (i.e., accurate and sufficient) for answering the question. Details about task selection, format unification, supporting knowledge extraction, and annotation are as follows.
2.1 Task Selection
In CIKQA, we select the following four popular
commonsense reasoning tasks:
1. HardPCR (Zhang et al., 2021): The hard pronoun coreference resolution (HardPCR) task is one of the most famous commonsense reasoning tasks. For each question, a target pronoun and two candidate mentions are provided, and the task is to select the correct mention that the pronoun refers to. Careful expert annotation was conducted to rule out the influence of simple linguistic cues, so models are required to solve the problem with commonsense reasoning. In CIKQA, we include instances from WSC (Levesque et al., 2012), DPR (Rahman and Ng, 2012), and WinoGrande (Sakaguchi et al., 2020). To create a question regarding the target pronoun, we first find the sentence that contains the target pronoun and then determine whether the participating pronoun refers to a person or an object (see the sketch after this list).
2. CommonsenseQA (Talmor et al., 2019): CommonsenseQA is a commonsense question answering dataset. For each question-answer pair, four relevant but wrong concepts are used as the other candidates, and the models are required to select the correct one out of five candidates. In CIKQA, we randomly sample a negative answer to make it a binary choice task, which is consistent with the other datasets.
3. COPA (Roemmele et al., 2011): COPA focuses on evaluating the understanding of event causality. For a target event, two candidate follow-up events are provided, and models are asked to predict the one that is caused by, or is the reason for, the target event.
4. ATOMIC (Sap et al., 2019): The last task is commonsense knowledge base completion. Given a head concept (e.g., “eat food”) and a relation (e.g., “cause”), we want to predict the tail concept. In CIKQA, we focus on predicting edges of ATOMIC.
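As referenced in the HardPCR item above, the pronoun-to-question transformation can be sketched as follows. The helper below is hypothetical: the paper only states that the sentence containing the pronoun is located and that the question word depends on whether the pronoun refers to a person or an object.

```python
def build_pronoun_question(context: str, pronoun: str, refers_to_person: bool) -> str:
    """Turn a HardPCR instance into a question about the target pronoun (hypothetical helper)."""
    # Locate the sentence that contains the target pronoun (naive split on '.').
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    target = next(s for s in sentences if pronoun in s.split())
    # Ask "Who" for persons and "What" for objects, replacing the pronoun.
    wh_word = "Who" if refers_to_person else "What"
    question = target.replace(pronoun, wh_word, 1) + "?"
    return f"{context} {question}"

# Reproduces the HardPCR example in Table 1:
# "The fish ate the worm. It was hungry. What was hungry?"
print(build_pronoun_question("The fish ate the worm. It was hungry.", "It", refers_to_person=False))
```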
In COPA and ATOMIC, the task is to predict the relations between two events or states (e.g., “PersonX eats”-Causes-“PersonX is full”). For each triplet, we randomly sample another event or state as the negative tail and ask the model to select the correct one. To make the task challenging and avoid sampling irrelevant events or states, we require the sampled negative event or state to be connected to the head event or state by a different triplet (e.g., “PersonX is hungry” from the triplet “PersonX eats”-CausedBy-“PersonX is hungry”). For each type of relation, we write a pattern to generate the question. For example, for the “Causes” relation, we ask “What can be caused by ‘PersonX eats’?”. Examples of instances in the original datasets and their transformed questions and candidate answers are presented in Table 1.

Task Name | Original Assertion | Transformed Question | Answer
HardPCR | The fish ate the worm. It was hungry. | The fish ate the worm. It was hungry. What was hungry? | (A) Fish*; (B) Worm
CommonsenseQA | What is a place that someone can go buy a teddy bear? | What is a place that someone can go buy a teddy bear? | (A) Toy store*; (B) Shelf
COPA | I drank from the water fountain. | I drank from the water fountain. What was the cause of this? | (A) I was thirsty.*; (B) I felt nauseous.
ATOMIC | PersonX buys the bike. | Before PersonX buys the bike, what did PersonX want? | (A) To be social.; (B) To have transportation.*

Table 1: Demonstration of the original assertions, transformed questions, and answers. The correct answer in each row is marked with an asterisk (*).
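A minimal sketch of the negative sampling and pattern-based question generation described above is shown below, assuming the KG is available as a simple list of (head, relation, tail) triplets; the pattern dictionary and helper name are illustrative, not the paper's actual implementation.

```python
import random

# Toy KG as (head, relation, tail) triplets; the real COPA/ATOMIC graphs are much larger.
TRIPLETS = [
    ("PersonX eats", "Causes", "PersonX is full"),
    ("PersonX eats", "CausedBy", "PersonX is hungry"),
]

# One hand-written question pattern per relation type (only "Causes" shown here).
PATTERNS = {"Causes": "What can be caused by '{head}'?"}

def make_binary_question(head, relation, tail, kg):
    """Build a binary-choice question whose distractor is connected to the same head."""
    # Candidate negatives: tails linked to the same head via a *different* triplet,
    # so the distractor is related to the context but is not the correct answer.
    negatives = [t for (h, r, t) in kg if h == head and (r, t) != (relation, tail)]
    if not negatives:
        return None  # skip triplets for which no connected distractor exists
    choices = [tail, random.choice(negatives)]
    random.shuffle(choices)
    return {
        "question": PATTERNS[relation].format(head=head),
        "choices": choices,
        "answer": choices.index(tail),
    }

# e.g., {'question': "What can be caused by 'PersonX eats'?", 'choices': [...], 'answer': ...}
print(make_binary_question("PersonX eats", "Causes", "PersonX is full", TRIPLETS))
```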
2.2 Supporting Knowledge Extraction
As discussed in Section 1, a limitation of existing commonsense reasoning benchmarks is that there is no clear boundary between knowledge and inference. As such, it is unclear what is learned from the training data: the knowledge, or how to perform inference. To address this issue and encourage models to learn inference rather than knowledge from the training data, we propose to equip each question with supporting knowledge. A question is selected as part of the dataset only if we can find supporting knowledge to answer it. Note that this procedure serves as an improved evaluation setup compared with pure supervised learning, not as a solution to commonsense reasoning. This section introduces the selected commonsense knowledge graphs and then describes how we extract the corresponding commonsense knowledge for each question.
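The resulting filtering step can be sketched as follows, assuming a retrieval function that returns a (possibly empty) list of supporting triplets for a question; the retrieval itself corresponds to the extraction procedure introduced in the rest of this section and is only stubbed here.

```python
def build_dataset(questions, retrieve_subgraph):
    """Keep a question only if supporting knowledge can be retrieved for it.

    `retrieve_subgraph` is a stand-in for the KG extraction described in this
    section; it should return a list of supporting triplets (possibly empty).
    """
    dataset = []
    for q in questions:
        subgraph = retrieve_subgraph(q)
        if subgraph:  # questions without supporting knowledge are discarded
            dataset.append({**q, "knowledge": subgraph})
    return dataset
```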
2.2.1 Commonsense KG Selection
Many commonsense knowledge graphs were developed to enhance machines' commonsense reasoning abilities, including ConceptNet (Liu and Singh, 2004), ATOMIC (Sap et al., 2019), GLUCOSE (Mostafazadeh et al., 2020), and