
To assess the difficulty of language modeling, image captioning, and automatic speech recognition (ASR) with the Bloom Library datasets, we trained baseline models on each of these tasks. For certain languages, the Bloom Library datasets facilitate the first known baselines, with performance comparable to the state of the art for higher-resourced languages. We achieve a BLEU score above 10.0 on image captioning for 10 languages using only data from bloom-captioning. For ASR, we demonstrate a Word Error Rate (WER) below 0.5 for 18 languages and a Character Error Rate (CER) below 0.2 for 21 languages.
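For concreteness, the sketch below shows how metrics like these are typically computed with the sacrebleu and jiwer packages; the example strings are placeholders rather than Bloom Library data, and the libraries shown are illustrative, not a record of our exact evaluation tooling.

```python
# Sketch: computing BLEU (captioning) and WER/CER (ASR) with standard libraries.
# The strings below are placeholders, not Bloom Library data.
import sacrebleu
import jiwer

# Image captioning: corpus BLEU between hypothesis captions and reference captions.
hyp_captions = ["a child reads a book under a tree"]
ref_captions = [["a child is reading a book under a tree"]]  # one reference stream
bleu = sacrebleu.corpus_bleu(hyp_captions, ref_captions)
print(f"BLEU: {bleu.score:.1f}")  # reported captioning threshold: above 10.0

# ASR: word and character error rates between reference and hypothesis transcripts.
reference = "the boy reads a small book"
hypothesis = "the boy read a small book"
print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")  # reported threshold: below 0.5
print(f"CER: {jiwer.cer(reference, hypothesis):.2f}")  # reported threshold: below 0.2
```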
2 Related Work
In terms of language coverage, various multilingual and single-modality datasets have emerged recently. These include, by way of example, the JHU Bible Corpus (McCarthy et al., 2020), the CMU Wilderness Multilingual Speech dataset (Black, 2019), Common Voice 9 (Ardila et al., 2019), Multilingual BABEL (Consortium, 2022), and MASSIVE (FitzGerald et al., 2022). The number of languages in these datasets is impressive. However, many are limited in domain (e.g., only including Bible data), accessibility, licensing, or modality (e.g., only focusing on text or read speech). These datasets are also primarily rooted in content from large, dominant languages, like English, and are translated or adapted to other fairly large languages.
Bloom Library data, in contrast, originates from local language communities,3 which are producing Bloom Books to fit their own local language ecology and perspectives. As a result, the data presented here covers languages, language families, and topics that are not covered by any other aligned and prepared datasets.
In terms of modality, the research community is presenting an increasing number of intriguing multimodal datasets. These include, by way of example, Pano-AVQA (Yun et al., 2021), which facilitates question answering regarding various objects, sounds, and their associations in videos, and VIST (Huang et al., 2016), which facilitates sequential vision-to-language tasks. However, recent multimodal datasets are overwhelmingly monolingual.
Datasets representing both multiple modalities and many languages include Multi30k, one of the few multimodal, multilingual datasets in existence, with ~30k images and corresponding text descriptions in several languages, including English and German (Elliott et al., 2016), French (Elliott et al., 2017), and Czech (Barrault et al., 2018). One listing can be found in Kádár (2019), which provides a helpful (and comprehensive) table of multilingual, multimodal resources, dividing them into two categories: (i) "translation" (with captions translated into another language); and (ii) "description" (with annotations independently created for each language). The table reveals that Multi30k was, at the time, the largest translation dataset available in terms of image count, at approximately 31k images and 31k sentences covering 4 languages.
The Bloom Library datasets fit into the "description" category of Kádár (2019). However, with over 90k images and 110k captions covering 351 languages, plus additional speech data in 56 languages, Bloom Library represents a massive increase in language and modality coverage (up to two orders of magnitude wider than previous multilingual, multimodal datasets). Further, the existing datasets referenced by Kádár (2019) focus on large languages in high-resource settings, with no representation of local languages in low-resource settings. In contrast, our datasets include languages in extremely low-resource and non-dominant settings like Bisu [bzi] and Kagayanen [cgc], with estimated populations of 700 and 30,000 users, respectively.
3 Constructing the Datasets
The authors worked directly with the Bloom Library developers to gain access to and understand the raw data behind the Bloom Library website. We parse, clean, deduplicate, and publicly release this data for research use on the Hugging Face Hub4,5 in formats compatible with the Hugging Face datasets Python package.6
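As an illustrative sketch of access (not a full usage guide), the released datasets can be loaded with the datasets package; the per-language configuration name ("hau") and the split name used below are assumptions for illustration, not guaranteed identifiers of the released datasets.

```python
# Sketch: loading Bloom Library datasets from the Hugging Face Hub.
# The "hau" config and the "train" split are illustrative assumptions.
from datasets import load_dataset

captions = load_dataset("sil-ai/bloom-captioning", "hau")  # image-caption pairs
lm_text = load_dataset("sil-ai/bloom-lm", "hau")           # text for language modeling

print(captions)             # DatasetDict listing the available splits
print(lm_text["train"][0])  # inspect a single record
```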
bloom-lm, bloom-captioning, and bloom-vist are created using one data pipeline starting with bloom-vist, because each of these datasets uses some or all of the images and corresponding text within the Bloom Library. A
3 On the use of the term "local" languages, we followed the terminology used in Bird (2022) and related works, which defines the term along the lines of "small, primarily-oral languages, often Indigenous or endangered, including the original and emerging languages of Africa, Asia, Australia, the Americas, the Pacific, and the minority languages of Europe."
4 https://www.ai.sil.org/bloom
5 https://huggingface.co/sil-ai
6 https://huggingface.co/docs/datasets/index