CIKQA: Learning Commonsense Inference with a Unified
Knowledge-in-the-loop QA Paradigm
Hongming Zhang1,2, Yintong Huo3, Yanai Elazar4,5, Yangqiu Song1, Yoav Goldberg4,6, Dan Roth2
1HKUST, 2UPenn, 3CUHK, 4AI2, 5University of Washington, 6Bar Ilan University
{hzhangal,yqsong}@cse.ust.hk, ythuo@cse.cuhk.edu.hk
{yanaiela,yoav.goldberg}@gmail.com, danroth@seas.upenn.edu
Abstract
Recently, the community has achieved substantial progress on many commonsense reasoning benchmarks. However, it is still unclear what is learned from the training process: the knowledge, the inference capability, or both? We argue that due to the large scale of commonsense knowledge, it is infeasible to annotate a training set large enough to cover all the commonsense knowledge each task requires. Thus, commonsense knowledge acquisition and inference over commonsense knowledge should be treated as two separate tasks. In this work, we focus on investigating models' commonsense inference capabilities from two perspectives: (1) whether models can tell if the knowledge they have is enough to solve the task; (2) whether models can develop commonsense inference capabilities that generalize across commonsense tasks. We first align commonsense tasks with relevant knowledge from commonsense knowledge bases and ask humans to annotate whether the knowledge is sufficient or not. Then, we convert different commonsense tasks into a unified question answering format to evaluate models' generalization capabilities. We name this benchmark Commonsense Inference with Knowledge-in-the-loop Question Answering (CIKQA).
1 Introduction
Understanding human language requires both linguistic knowledge (e.g., grammar and semantics) and world knowledge, which can be further divided into factual and commonsense knowledge (Katz and Fodor, 1963). Recently, the community has made great progress on helping machines acquire and apply linguistic and factual knowledge. However, how to help machines acquire and infer over commonsense knowledge is still unclear. To answer this question, many commonsense reasoning datasets (Roemmele et al., 2011; Sakaguchi et al., 2020; Talmor et al., 2019; Zellers et al., 2019; Lin et al., 2020) have been proposed. Even though these datasets target different knowledge types and modalities and come in different formats, they often follow a standard supervised learning setting, which aims at helping machines solve a specific task with the associated training data. However, two limitations of this learning paradigm have restricted the development of commonsense reasoning systems.
First, there is no clear separation between knowledge and inference. As discussed by Elazar et al. (2021), a common phenomenon is that larger training sets lead to better performance, mainly because richer knowledge is covered. However, due to the large scale of commonsense knowledge, it is infeasible to annotate a sufficiently large training set for each task, and the responsibility of the training data should be teaching models how to perform inference rather than how to acquire commonsense knowledge. Several recent works have explored using structured knowledge for commonsense reasoning tasks (Lin et al., 2019; Lv et al., 2020; Paul and Frank, 2020). However, as these works did not clearly analyze the coverage of the structured knowledge (i.e., knowledge graphs (KGs)), it is still unclear whether the reported performance reflects better knowledge coverage or better inference capability. To dig into what is behind this learning process, we propose to equip each question with auto-extracted knowledge and ask humans to annotate whether that knowledge is gold (i.e., sufficient to answer the question). By doing so, we can evaluate both whether models can tell if the provided knowledge is gold and how well they can conduct inference over the provided knowledge to solve the task.
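To make this setting concrete, the sketch below illustrates what such a knowledge-in-the-loop instance might look like once a task (here, a Winograd-style pronoun resolution question) is converted into the unified QA format. All field names and the example triples are illustrative assumptions, not the released data schema.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class KnowledgeInTheLoopInstance:
    # Hypothetical instance structure; field names are assumptions for illustration.
    question: str                         # the original task rewritten as a question
    choices: List[str]                    # candidate answers
    answer: int                           # index of the correct choice
    # Knowledge auto-extracted from a commonsense KG as (head, relation, tail) triples
    knowledge: List[Tuple[str, str, str]] = field(default_factory=list)
    # Human annotation: is the extracted knowledge sufficient ("gold") to answer the question?
    knowledge_is_gold: bool = False

example = KnowledgeInTheLoopInstance(
    question="The trophy doesn't fit into the suitcase. What is too small?",
    choices=["the trophy", "the suitcase"],
    answer=1,
    knowledge=[("suitcase", "UsedFor", "holding objects"),
               ("small container", "NotCapableOf", "holding large objects")],
    knowledge_is_gold=True,
)

Under this view, a model is evaluated both on recognizing whether the attached knowledge is sufficient and on inferring the answer from that knowledge, rather than on memorizing knowledge from the training data.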
Second, supervised learning may force the model to learn the distribution of the training data rather than a universal inference model. As a result, the model may perform well on a test set that follows the same distribution but fail on other tasks (Kejriwal and Shen, 2020). Previously, as different tasks have different formats, it is hard