
too obvious to state explicitly. The utilized corpora are also small compared to what is typically used in language model pre-training. Therefore, pre-trained language models (PTLMs) have been employed directly for CSK extraction in a setting called prompting/probing (cf. the LAMA benchmark) (Petroni et al., 2019), where the BERT LM showed promising results in predicting ConceptNet assertions. They can also be employed with supervision, as in the COMET and Atomic-10x systems (Hwang et al., 2021; West et al., 2022). However, both PTLM paradigms are grounded in frequencies observed in the original text corpora used for LM training, which are again subject to reporting bias.
3 Children Text Corpora
For understanding the nature of different text corpora, we rely on the Flesch Reading-Ease score (FRE) (Flesch, 1979), which is based on the number of syllables, words, and sentences. It generally ranges between 0 and 100, with 0-30 considered difficult to read, 60-70 standard, and above 80 easy.
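For reference, the score follows the standard Flesch formula, FRE = 206.835 - 1.015 * (total words / total sentences) - 84.6 * (total syllables / total words), so texts with shorter sentences and shorter words receive higher (easier) scores.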
We investigate three children's text corpora:
1. Children's Book Test (CBT). The CBT dataset (Hill et al., 2016) contains 108 children's books, such as Alice's Adventures in Wonderland, extracted from Project Gutenberg. It targets children around 12-14 years old and is about 30 MB in total.
2. C4-easy. C4 (Raffel et al., 2020) is a cleaned version of Common Crawl's web crawl corpus that was used to train the T5 language model. It is approximately 305 GB in size. We derive C4-easy by restricting the corpus to documents with an FRE greater than 80, retaining 40,827,011 documents, i.e., about 11% of C4 (a filtering sketch is given after this list).
3. InfantBooks. We newly introduce the InfantBooks dataset, composed of 496 books targeted at children aged 1-6 years. It is based on e-books from websites like freekidsbooks.org, monkeypen.com, and kidsworldfun.com, which we collected, transcribed, and cleaned. The final dataset comprises 2 MB of text and is available at https://www.mpi-inf.mpg.de/children-texts-for-commonsense.
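The filtering step behind C4-easy referenced in item 2 could look roughly as follows. This is a minimal sketch assuming the Hugging Face allenai/c4 release and the textstat package for the FRE computation; neither is necessarily what was used to produce the reported document counts.

    # Sketch: deriving a C4-easy-style subset by filtering on Flesch Reading-Ease.
    # Assumes the Hugging Face "allenai/c4" release and the textstat package;
    # the exact readability implementation and preprocessing may differ from the paper's.
    import textstat
    from datasets import load_dataset

    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

    def is_easy(text, threshold=80.0):
        # FRE above 80 is conventionally considered easy to read.
        return textstat.flesch_reading_ease(text) > threshold

    c4_easy = (doc for doc in c4 if is_easy(doc["text"]))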
As a baseline, and to rule out that observed improvements stem only from general training on more data, we also compare with employing the whole C4 corpus. Table 1 compares the corpora in terms of average document length, vocabulary size, and readability. Table 2 additionally reports the number of distinct words, the number of frequent words (relative frequency greater than 0.01%), and the cumulative frequency of the top 1,000 words.
Corpus        Avg. doc. len.   Vocab. size   Readability (FRE)
C4            411 words        151k          60 (Standard)
CBT           57k words        63k           62 (Standard)
C4-easy       317 words        106k          86 (Easy)
InfantBooks   659 words        18k           91 (Very Easy)

Table 1: Text corpora considered for pretraining/finetuning, sorted by FRE.
Corpus        Distinct words   Frequent words (>0.01%)   Cumul. freq. of top 1,000 words
C4            8M               994                       68%
CBT           5M               874                       82%
C4-easy       8M               908                       75%
InfantBooks   5M               1,031                     82%

Table 2: Text corpora statistics.
4 Analysis
CSK Density. Although CBT and InfantBooks are too small for comprehensive text extraction, it is informative to see how densely CSK assertions are stated in them, i.e., the relative frequency of CSK assertions per amount of text.
We used the CSLB dataset (Devereux et al., 2014), a large crowdsourced set of basic CSK assertions, like alligator: is scary / is long / is green. We focused on the top 4,245 properties for 638 subjects that were stated at least five times. For each corpus, we computed the relative frequency with which these statements appear (with lemmatization).
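To make the counting procedure concrete, the following is a simplified sketch assuming spaCy for sentence splitting and lemmatization, and a simple criterion that counts a hit when a subject and all property lemmas co-occur in one sentence; the matching used for the reported numbers may differ.

    # Simplified sketch of per-corpus CSK-assertion counting.
    # Assumes spaCy for lemmatization; the matching criterion (subject and all
    # property lemmas co-occur in one sentence) is an illustrative choice.
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def csk_density(text, assertions):
        # assertions: list of (subject, property) pairs, e.g. ("alligator", "is green")
        lemmatized = [(s, {t.lemma_.lower() for t in nlp(p)}) for s, p in assertions]
        hits = n_words = 0
        for sent in nlp(text).sents:
            lemmas = {t.lemma_.lower() for t in sent}
            n_words += len(sent)
            hits += sum(1 for s, props in lemmatized if s in lemmas and props <= lemmas)
        return hits / max(n_words, 1)  # relative frequency per word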
Table 3 shows the results. As one can see, InfantBooks has the highest relative density of CSK assertions: 3x as many as C4 per sentence and 5x as many per word.
To further explore the relation between text simplicity and CSK density, we grouped C4 documents into buckets based on their FRE. For a sample of 10k documents per bucket, Figure 1 reports the per-word frequencies of CSK assertions, considering all spotted CSK assertions (blue) or only distinct ones (red). As one can see, CSK density increases significantly with easier readability, and only the simplest documents suffer from a lack of diversity (decrease in blue line).
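A minimal sketch of this bucketing analysis follows, reusing the streamed c4 iterator and the csk_density helper from the earlier sketches and a hypothetical cslb_assertions list of (subject, property) pairs; the bucket width of 10 FRE points is an assumption for illustration, and only the "all assertions" variant is computed.

    # Sketch of the FRE-bucketing analysis: group documents by FRE band, keep up
    # to 10k documents per band, and compute the average per-word CSK density.
    # A proper random sample per band would replace the simple cut-off used here.
    from collections import defaultdict

    buckets = defaultdict(list)
    for doc in c4:  # streamed C4 iterator from the earlier sketch
        fre = textstat.flesch_reading_ease(doc["text"])
        band = min(max(int(fre // 10) * 10, 0), 100)
        if len(buckets[band]) < 10_000:
            buckets[band].append(doc["text"])

    for band in sorted(buckets):
        densities = [csk_density(t, cslb_assertions) for t in buckets[band]]
        print(band, sum(densities) / len(densities))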