
too obvious to state explicitly. The utilized corpora are also small compared to what is typically used in language model pre-training. Therefore, pre-trained language models (PTLMs) have been employed directly for CSK extraction in a setting called prompting/probing (cf. the LAMA benchmark) (Petroni et al., 2019), where the BERT LM showed promising results in predicting ConceptNet assertions. They can also be employed with supervision, as in the COMET and Atomic-10x systems (Hwang et al., 2021; West et al., 2022). However, both PTLM paradigms are grounded in frequencies observed in the original text corpora used for LM training, which are again subject to reporting bias.
3 Children Text Corpora
For understanding the nature of different text corpora, we rely on the Flesch Reading-Ease score (FRE) (Flesch, 1979), which is based on the number of syllables, words, and sentences. It generally ranges between 0 and 100, with 0-30 considered difficult to read, 60-70 standard, and above 80 easy.
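For reference, the score follows the standard Flesch formula, FRE = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words), so texts with shorter sentences and shorter words receive higher (easier) scores.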
We investigate three children's text corpora:
1. Children's Book Test (CBT). The CBT dataset (Hill et al., 2016) contains 108 children's books, such as Alice's Adventures in Wonderland, extracted from Project Gutenberg. It targets children around 12-14 years old and is about 30 MB in total.
2. C4-easy. C4 (Raffel et al., 2020) is a cleaned version of Common Crawl's web crawl corpus that was used to train the T5 language model. It is approximately 305 GB in size. We derive C4-easy by restricting the corpus to documents with an FRE greater than 80, retaining 40,827,011 documents, i.e., about 11% of C4 (a filtering sketch is given after this list).
3. InfantBooks. We newly introduce the InfantBooks dataset, composed of 496 books targeted at children aged 1-6 years. It is based on e-books from websites like freekidsbooks.org, monkeypen.com, and kidsworldfun.com, which we collected, transcribed, and cleaned. The final dataset comprises 2 MB of text and is available at https://www.mpi-inf.mpg.de/children-texts-for-commonsense.
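The filtering step behind C4-easy referenced in item 2 could look roughly as follows. This is a minimal sketch assuming the Hugging Face allenai/c4 release and the textstat package for the FRE computation; neither is necessarily what was used to produce the reported document counts.

    # Sketch: deriving a C4-easy-style subset by filtering on Flesch Reading-Ease.
    # Assumes the Hugging Face "allenai/c4" release and the textstat package;
    # the exact readability implementation and preprocessing may differ from the paper's.
    import textstat
    from datasets import load_dataset

    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

    def is_easy(text, threshold=80.0):
        # FRE above 80 is conventionally considered easy to read.
        return textstat.flesch_reading_ease(text) > threshold

    c4_easy = (doc for doc in c4 if is_easy(doc["text"]))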
As a baseline, and to rule out that observed improvements stem only from general training on more data, we also compare with employing the whole C4 corpus. Table 1 compares the corpora in terms of average document length, vocabulary size, and readability. Table 2 additionally reports the number of distinct words, the number of frequent words (relative frequency greater than 0.01%), and the cumulative frequency of the top 1,000 words.
Corpus        Avg. doc. len.   Vocab. size   Readability (FRE)
C4            411 words        151k          60 (Standard)
CBT           57k words        63k           62 (Standard)
C4-easy       317 words        106k          86 (Easy)
InfantBooks   659 words        18k           91 (Very Easy)

Table 1: Text corpora considered for pretraining/finetuning, sorted by FRE.
Corpus        Distinct words   Frequent words (>0.01%)   Cumul. freq. of top 1,000 words
C4            8M               994                       68%
CBT           5M               874                       82%
C4-easy       8M               908                       75%
InfantBooks   5M               1,031                     82%

Table 2: Text corpora statistics.
4 Analysis
CSK Density. Although CBT and InfantBooks are too small for comprehensive text extraction, it is informative to see how densely CSK assertions are stated in them, i.e., the relative frequency of CSK assertions per amount of text.
We used the CSLB dataset (Devereux et al., 2014), a large crowdsourced set of basic CSK assertions, like alligator: is scary / is long / is green. We focused on the top 4,245 properties for 638 subjects that were stated at least five times. For each corpus, we computed the relative frequency with which these statements appear (with lemmatization).
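To make the counting procedure concrete, the following is a simplified sketch assuming spaCy for sentence splitting and lemmatization, and a simple criterion that counts a hit when a subject and all property lemmas co-occur in one sentence; the matching used for the reported numbers may differ.

    # Simplified sketch of per-corpus CSK-assertion counting.
    # Assumes spaCy for lemmatization; the matching criterion (subject and all
    # property lemmas co-occur in one sentence) is an illustrative choice.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def csk_density(text, assertions):
        # assertions: list of (subject, property) pairs, e.g. ("alligator", "is green")
        lemmatized = [(s, {t.lemma_.lower() for t in nlp(p)}) for s, p in assertions]
        hits = n_words = 0
        for sent in nlp(text).sents:
            lemmas = {t.lemma_.lower() for t in sent}
            n_words += len(sent)
            hits += sum(1 for s, props in lemmatized if s in lemmas and props <= lemmas)
        return hits / max(n_words, 1)  # relative frequency per word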
Table 3 shows the results. As one can see, InfantBooks has the highest relative density of CSK assertions: 3x as many as C4 per sentence and 5x as many per word.
To further explore the relation between text simplicity and CSK density, we grouped C4 documents into buckets based on their FRE. For a sample of 10k documents per bucket, Figure 1 reports the per-word frequencies of CSK assertions, considering all spotted CSK assertions (blue) or only distinct ones (red). As one can see, CSK density increases significantly with easier readability, and only the simplest documents suffer from a lack of diversity (decrease in blue line).
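A minimal sketch of this bucketing analysis follows, reusing the streamed c4 iterator and the csk_density helper from the earlier sketches and a hypothetical cslb_assertions list of (subject, property) pairs; the bucket width of 10 FRE points is an assumption for illustration, and only the "all assertions" variant is computed.

    # Sketch of the FRE-bucketing analysis: group documents by FRE band, keep up
    # to 10k documents per band, and compute the average per-word CSK density.
    # A proper random sample per band would replace the simple cut-off used here.
    from collections import defaultdict

    buckets = defaultdict(list)
    for doc in c4:  # streamed C4 iterator from the earlier sketch
        fre = textstat.flesch_reading_ease(doc["text"])
        band = min(max(int(fre // 10) * 10, 0), 100)
        if len(buckets[band]) < 10_000:
            buckets[band].append(doc["text"])

    for band in sorted(buckets):
        densities = [csk_density(t, cslb_assertions) for t in buckets[band]]
        print(band, sum(densities) / len(densities))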