Do Children Texts Hold The Key To Commonsense Knowledge?
Julien Romero
Télécom SudParis
jromero@telecom-sudparis.eu
Simon Razniewski
Max Planck Institute for Informatics
srazniew@mpi-inf.mpg.de
Abstract
Compiling comprehensive repositories of commonsense knowledge is a long-standing problem in AI. Many concerns revolve around the issue of reporting bias, i.e., that frequency in text sources is not a good proxy for relevance or truth. This paper explores whether children's texts hold the key to commonsense knowledge compilation, based on the hypothesis that such content makes fewer assumptions on the reader's knowledge, and therefore spells out commonsense more explicitly. An analysis with several corpora shows that children's texts indeed contain much more, and more typical, commonsense assertions. Moreover, experiments show that this advantage can be leveraged in popular language-model-based commonsense knowledge extraction settings, where task-unspecific fine-tuning on small amounts of children texts (childBERT) already yields significant improvements. This provides a refreshing perspective different from the common trend of deriving progress from ever larger models and corpora.
1 Introduction
Compiling commonsense knowledge (CSK) is a long-standing problem in AI (Lenat, 1995). Automated text-extraction-based approaches to CSK compilation, like Knext (Gordon et al., 2010), TupleKB (Dalvi Mishra et al., 2017), Quasimodo (Romero et al., 2019), COMET (Hwang et al., 2021), or Ascent (Nguyen et al., 2021), typically struggle with reporting bias (Gordon and Van Durme, 2013; Mehrabi et al., 2021), in particular an under-reporting of basic commonsense assertions. This is a crux of commonsense: if knowledge is assumed to be commonplace, such as that rain is wet or cars have wheels, there is little need to utter it explicitly. In contrast, statements that contradict commonsense are reported more frequently, leading to a distorted picture of the real world, e.g., that fires are more often cold than hot (e.g., 238 vs. 173 literal occurrences in the English Wikipedia).
Children's material may partially counter this bias: as children's knowledge is still growing, seemingly obvious assertions may still be frequently expressed explicitly in such material. Note that this is not a binary question of whether some knowledge is expressed or not, but rather a ranking problem: prominent CSK repositories often do not struggle to recall relevant statements (e.g., Ascent (Nguyen et al., 2021) contains 2800 assertions for "elephant"), but struggle to rank them properly. This is especially true for language-model-based approaches to CSK compilation (Hwang et al., 2021; West et al., 2022), which by design can assign every token in the vocabulary a probability, but should do so in a sensible order.

This paper investigates (i) whether children's texts are a promising source for CSK and (ii) whether small corpora can still boost knowledge extraction from large language models. Specifically, we analyze the density and typicality of CSK assertions in children's text corpora and show how fine-tuning existing language models on them can improve CSK compilation. Data and models, including a childBERT variant, can be found at https://www.mpi-inf.mpg.de/children-texts-for-commonsense.
2 Background
Prominent manual efforts towards CSK compilation include ConceptNet (Speer et al., 2017), Atomic (Sap et al., 2019), and the integrated CSKG (Ilievski et al., 2021). Prominent text extraction projects are Knext (Gordon et al., 2010), TupleKB (Dalvi Mishra et al., 2017), Quasimodo (Romero et al., 2019), and Ascent (Nguyen et al., 2022). Each carefully selects extraction corpora, like Wikipedia texts, user query logs, or targeted web search, to minimize extraction noise and maximize salience. Nonetheless, all struggle with extracting very basic CSK that is generally deemed too obvious to state explicitly. The utilized corpora are also small compared to what is typically used in language model pre-training. Therefore, pre-trained language models (PTLMs) have been employed directly for CSK extraction in a setting called prompting/probing (cf. the LAMA benchmark) (Petroni et al., 2019), where the BERT LM showed promising results in predicting ConceptNet assertions. They can also be employed with supervision, as in the COMET and Atomic-10x systems (Hwang et al., 2021; West et al., 2022). However, both PTLM paradigms are grounded in frequencies observed in the original text corpora used for LM training, which are again subject to reporting bias.
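To make the probing setting concrete, here is a minimal sketch of LAMA-style cloze probing with a masked LM. It assumes the Hugging Face transformers library and the bert-base-uncased checkpoint; the prompt template is an illustrative example, not one taken from the LAMA benchmark itself.

```python
# Minimal sketch of LAMA-style cloze probing with a masked language model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# An illustrative ConceptNet-style assertion turned into a cloze prompt.
for prediction in fill_mask("Rain is [MASK].", top_k=5):
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")
```

The ranking of the top-k fillers, rather than any single completion, is what matters for CSK extraction, which is why reporting bias in the pre-training corpus carries over into these predictions.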
3 Children Text Corpora
For understanding the nature of different text corpora, we rely on the Flesch Reading-ease score (FRE) (Flesch, 1979), which is based on the number of syllables, words, and sentences. It generally ranges between 0 and 100, with 0-30 considered difficult to read, 60-70 standard, and above 80 easy.
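For reference, the FRE combines average sentence length with average syllables per word. The following is a minimal sketch of the standard formula; the vowel-group syllable heuristic is a rough approximation, so scores may deviate slightly from dedicated readability libraries, and very simple text can score above 100.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: one syllable per group of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    # FRE = 206.835 - 1.015 * (words per sentence) - 84.6 * (syllables per word)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))
```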
We investigate three children text corpora:

1. Children Book Test (CBT). The CBT dataset (Hill et al., 2016) contains 108 children's books, such as Alice's Adventures in Wonderland, extracted from the Gutenberg Project. It targets children around 12-14 years old and is about 30 MB in total.

2. C4-easy. C4 (Raffel et al., 2020) is a cleaned version of Common Crawl's web crawl corpus that was used to train the T5 language model. It is approximately 305 GB in size. We derive C4-easy by restricting the corpus to documents with an FRE greater than 80, retaining 40,827,011 documents, i.e., 11% of C4 (see the filtering sketch after this list).

3. InfantBooks. We newly introduce the InfantBooks dataset, composed of 496 books targeted at kids from 1-6 years. It is based on ebooks from websites like freekidsbooks.org, monkeypen.com, and kidsworldfun.com, which we collected, transcribed, and cleaned. The final dataset amounts to about 2 MB of text.[1]

[1] The dataset is available at https://www.mpi-inf.mpg.de/children-texts-for-commonsense.
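As a rough illustration of how a C4-easy-style subset could be derived, the sketch below streams C4 through the FRE filter. It assumes access to the corpus via the Hugging Face datasets library (the allenai/c4 configuration) and reuses the flesch_reading_ease helper sketched above; it is not the exact pipeline used for the paper's 40,827,011-document subset.

```python
# Keep only documents with FRE > 80 (C4-easy-style filtering).
# Assumes the public allenai/c4 corpus and the flesch_reading_ease helper above.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
c4_easy = (doc for doc in c4 if flesch_reading_ease(doc["text"]) > 80)

# Inspect a few easy-to-read documents.
for i, doc in enumerate(c4_easy):
    if i == 3:
        break
    print(doc["text"][:100])
```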
As a baseline, and to rule out that observed improvements stem only from general training on more data, we also compare with employing the whole C4 corpus. In Table 1, we compare the corpora according to average document length, vocabulary size, and readability. In Table 2, we make the same comparison with the number of distinct words, the number of frequent words (relative frequency greater than 0.01%), and the cumulative frequency of the top 1,000 words.
Corpus        Avg. doc. len.   Vocab. size   Readability (FRE)
C4            411 words        151k          60 (Standard)
CBT           57k words        63k           62 (Standard)
C4-easy       317 words        106k          86 (Easy)
InfantBooks   659 words        18k           91 (Very Easy)

Table 1: Text corpora considered for pretraining/finetuning, sorted by FRE.
Corpus        Distinct words   Frequent words (rel. freq. > 0.01%)   Cumul. freq. of top 1k words
C4            8M               994                                    68%
CBT           5M               874                                    82%
C4-easy       8M               908                                    75%
InfantBooks   5M               1,031                                  82%

Table 2: Text corpora statistics.
4 Analysis
CSK Density. Although CBT and InfantBooks are too small for comprehensive text extraction, it is informative to see how densely CSK assertions are stated in them, i.e., the relative frequency of CSK assertions per text. We used the CSLB dataset (Devereux et al., 2014), a large crowdsourced set of basic CSK assertions, like alligator: is scary / is long / is green. We focused on the top 4,245 properties for 638 subjects stated at least five times. For each corpus, we computed the relative frequencies with which these statements appear (with lemmatization). Table 3 shows the results: InfantBooks has the highest relative density of CSK assertions, 3x as many as C4 per sentence and 5x more per word.
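The density measurement can be illustrated with the sketch below. The handful of CSLB-style assertions is hypothetical (the actual analysis uses 4,245 crowdsourced properties), and the matching here is naive lowercased substring search, whereas the paper's computation additionally lemmatizes the text.

```python
import re

# Hypothetical mini-sample of CSLB-style assertions (subject + property).
ASSERTIONS = ["alligator is scary", "alligator is long", "alligator is green",
              "rain is wet", "car has wheels"]

def csk_density(text: str) -> tuple[float, float]:
    # Returns (assertions per word, assertions per sentence);
    # naive substring matching, no lemmatization.
    sentences = [s.lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = sum(len(s.split()) for s in sentences)
    hits = sum(1 for s in sentences for a in ASSERTIONS if a in s)
    return hits / max(1, n_words), hits / max(1, len(sentences))

per_word, per_sentence = csk_density("Rain is wet. The alligator is green.")
print(f"per word: {per_word:.3f}, per sentence: {per_sentence:.2f}")
```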
To further explore the relation between text simplicity and CSK density, we grouped C4 documents into buckets based on their FRE. For a sample of 10k documents per bucket, Figure 1 reports the per-word frequencies of CSK assertions, considering all spotted CSK assertions (blue) or only distinct ones (red). As one can see, CSK density increases significantly with easier readability, and only the simplest documents suffer from a lack of diversity (decrease in the blue line).
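The bucketing behind Figure 1 can be sketched in the same spirit, reusing the flesch_reading_ease and csk_density helpers from the earlier sketches; the bucket width of 10 FRE points and the simple averaging are illustrative choices, not the paper's exact setup.

```python
from collections import defaultdict

def density_by_readability(documents: list[str]) -> dict[int, float]:
    # Group documents into FRE buckets of width 10 and average the
    # per-word CSK density within each bucket.
    buckets = defaultdict(list)
    for doc in documents:
        bucket = int(flesch_reading_ease(doc) // 10) * 10
        buckets[bucket].append(csk_density(doc)[0])
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```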