Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

Colin Leong*, Joshua Nemecek†, Jacob Mansdorfer§, Anna Filighera¶, Abraham Owodunni|| and Daniel Whitenack†
*University of Dayton Research Institute, †SIL International, §Independent Contractor, ¶TU Darmstadt and ||Masakhane
cleong1@udayton.edu, joshua_nemecek@sil.org
jacob.mansdorfer@gmail.com, anna.filighera@kom.tu-darmstadt.de
owodunniabraham@gmail.com, dan_whitenack@sil.org
Abstract

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.
1 Introduction

Only a negligible fraction of the 7,100+ living languages (Eberhard et al., 2021) have sufficient, publicly available text, audio, and image data to train state-of-the-art language/speech models and/or models for downstream tasks like Named Entity Recognition (NER) or image captioning. This data scarcity results in systematic inequalities in the performance of NLP tasks across the world's languages (Blasi et al., 2021). Indigenous language ecologies also represent profoundly different understandings of the nature and function of language (Bird, 2022, 2020), which might prioritize, for example, orality or translanguaging (Quakenbush and Simons, 2018) above a single, written mode of communication in all domains.
The Bloom Library¹ is a web-based platform that aims to increase the amount of multimodal material available to communities speaking non-dominant languages. The Bloom Library holds over 12,400 books in 545 languages (at the time this paper is published), covering subjects including agriculture, business, culture, math, science, religion, and health. Many of these books include images aligned with text, and 1,600+ of the books have corresponding audio recordings (called "talking books"). Language communities can create new books, create audio recordings, download existing books, and translate existing books using the open-source "Bloom" software².
To boost language diversity and indigenous perspectives in the NLP research community, we present multimodal datasets post-processed out of the Bloom Library. We anticipate that more task-specific datasets will be created from the Bloom Library. However, as a starting point, we present the following datasets: (1) bloom-lm for language modeling in 351 languages; (2) bloom-captioning for image-to-text or text-to-image tasks in 351 languages; (3) bloom-vist for visual storytelling in 351 languages; and (4) bloom-speech for speech-to-text and text-to-speech tasks in 56 languages.

The languages in these datasets correspond to 32 language families, and many of the included languages are in extremely low-resource settings. Further, to the authors' knowledge, bloom-vist represents the first (and certainly the most) multilingual visual storytelling dataset, and bloom-speech includes more languages in the following language families than any other aligned speech dataset (number of languages in parentheses): Austronesian (8), Mayan (6), Niger-Congo (7), Sepik (2), Tequistlatecan (2), and Trans-New Guinea (3).

¹https://bloomlibrary.org/
²https://github.com/BloomBooks/BloomDesktop
To assess the difficulty of language modeling, image captioning, and automatic speech recognition (ASR) with the Bloom Library datasets, we trained baseline models on each of these tasks. For certain languages, the Bloom Library datasets facilitate the first known baselines, with performance comparable to the state of the art for higher-resourced languages. We achieve a BLEU score above 10.0 on image captioning for 10 languages using only data from bloom-captioning. For ASR, we demonstrate a Word Error Rate (WER) below 0.5 for 18 languages and a Character Error Rate (CER) below 0.2 for 21 languages.
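For readers unfamiliar with these metrics, the following is a minimal sketch of computing WER and CER with the jiwer package; it is purely illustrative (the reference/hypothesis strings are invented placeholders), not the evaluation code used for the baselines.

```python
# Illustrative WER/CER computation with the jiwer package; the strings
# below are invented placeholders, not data from bloom-speech.
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"

wer = jiwer.wer(reference, hypothesis)  # word-level edits / reference words
cer = jiwer.cer(reference, hypothesis)  # character-level edits / reference chars
print(f"WER: {wer:.2f}, CER: {cer:.2f}")
```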
2 Related Work

In terms of language coverage, various multilingual, single-modality datasets have emerged recently. These include, by way of example, the JHU Bible Corpus (McCarthy et al., 2020), the CMU Wilderness Multilingual Speech dataset (Black, 2019), Common Voice 9 (Ardila et al., 2019), Multilingual BABEL (Consortium, 2022), and MASSIVE (FitzGerald et al., 2022). The number of languages in these datasets is impressive. However, many are limited in domain (e.g., only including Bible data), accessibility, licensing, or modality (e.g., only focusing on text or read speech). These datasets are also primarily rooted in content from large, dominant languages, like English, and are translated or adapted to other fairly large languages. Bloom Library data, in contrast, originates from local language communities,³ which are producing Bloom Books to fit their own local language ecology and perspectives. As a result, the data presented here covers languages, language families, and topics that are not covered by any other aligned and prepared datasets.
In terms of modality, the research community is presenting an increasing number of intriguing multimodal datasets. These include, by way of example, Pano-AVQA (Yun et al., 2021), which facilitates question answering regarding various objects, sounds, and their associations in videos, and VIST (Huang et al., 2016), which facilitates sequential vision-to-language tasks. However, recent multimodal datasets are overwhelmingly monolingual.

³On the use of the term "local" languages, we followed the terminology used in Bird (2022) and related works, which defines the term along the lines of "small, primarily-oral languages, often Indigenous or endangered, including the original and emerging languages of Africa, Asia, Australia, the Americas, the Pacific, and the minority languages of Europe."
Datasets representing both multiple modalities and many languages include Multi30k, one of the few multimodal, multilingual datasets in existence, with ~30k images and corresponding text descriptions in several languages including English, German (Elliott et al., 2016), French (Elliott et al., 2017), and Czech (Barrault et al., 2018). One listing can be found in Kádár (2019), which provides a helpful (and comprehensive) table of multilingual, multimodal resources, dividing them into two categories: (i) "translation" (with captions translated into another language); and (ii) "description" (with annotations independently created for each language). The table reveals that Multi30k was, at the time, the largest translation dataset available in terms of image count, at approximately 31k images and 31k sentences covering 4 languages.

The Bloom Library datasets fit into the "description" category of Kádár (2019). However, with 90k+ images and 110k+ captions covering 351 languages, plus speech data in 56 languages, Bloom Library represents a massive increase in language and modality coverage (up to two orders of magnitude wider than previous multilingual, multimodal datasets). Further, the existing datasets referenced by Kádár (2019) focus on large languages in high-resource settings, with no representation of local languages in low-resource settings. In contrast, our datasets include languages in extremely low-resource, non-dominant settings like Bisu [bzi] and Kagayanen [cgc], with estimated populations of 700 and 30,000 users, respectively.
3 Constructing the Datasets

The authors worked directly with the Bloom Library developers to gain access to and understand the raw data behind the Bloom Library website. We parse, clean, deduplicate, and publicly release this data for research use on the Hugging Face Hub⁴,⁵ in formats compatible with the Hugging Face datasets Python package.⁶
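As a concrete illustration, the released datasets can be loaded with the datasets package roughly as follows; the repository ID comes from the footnotes referenced in this paper, but the per-language configuration name ("bzi") is an assumption and should be checked against the configurations listed on the Hub.

```python
# Hedged loading sketch: the repository ID is from this paper's footnotes;
# the per-language configuration name ("bzi") is an illustrative assumption.
from datasets import load_dataset

vist = load_dataset("sil-ai/bloom-vist", "bzi")  # Bisu [bzi] subset
print(vist)
```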
bloom-lm, bloom-captioning, and bloom-vist are created using one data pipeline starting with bloom-vist, because each of these datasets uses some or all of the images and corresponding text within the Bloom Library. A separate data pipeline is used for bloom-speech to process only "talking books."

⁴https://www.ai.sil.org/bloom
⁵https://huggingface.co/sil-ai
⁶https://huggingface.co/docs/datasets/index
3.1 bloom-vist

The Bloom Library books offer the rare possibility of leveraging sequential images for language understanding across many languages. Thus, we first process the Bloom Library data into a format consistent with the VIST task published by Huang et al. (2016). VIST is a dataset of collections of sequential image-caption pairs that form short "stories," set up collaboratively by researchers at Google Research, CMU, and JHU; we structure our data to match this format. We hope this release of VIST-formatted data from Bloom Library catalyzes techniques in both multilingual and multimodal storytelling.
The raw Bloom Library data we received from the Bloom Library team consisted of a folder of files for each "book," which corresponds to one of the pages on the Bloom Library website. The relevant files in this folder include: (1) meta.json, containing important metadata such as the book's translation lineage, alternative titles, copyright, etc.; (2) an *.htm file containing the actual data, particularly text and image links for each page of the book; (3) in certain cases, a number of image files of various types, including *.jpg and *.png; and (4) in certain cases (for talking books), a number of *.mp3 audio files. In order to construct the sequential VIST-type data, we parse the *.htm file with BeautifulSoup⁸ to associate image files with captions and sequence these according to the sequence of pages in the book. We use meta.json to pull out relevant metadata (book title, topics, etc.) and to filter out any books not released under a Creative Commons license.

⁸https://www.crummy.com/software/BeautifulSoup/
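A minimal sketch of this parsing step is shown below; the "bloom-page" class name and the one-image-per-page assumption are simplifications for illustration, as the real Bloom HTML layout is more involved.

```python
# Illustrative parsing sketch for a Bloom book's *.htm file. The
# "bloom-page" class name and the one-image-per-page assumption are
# simplifications for illustration; real Bloom HTML is more involved.
from bs4 import BeautifulSoup

def extract_image_caption_pairs(htm_path):
    with open(htm_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    pairs = []
    # Page divs appear in reading order, so iteration preserves sequence.
    for page in soup.find_all("div", class_="bloom-page"):
        img = page.find("img")
        caption = page.get_text(separator=" ", strip=True)
        if img is not None and caption:
            pairs.append({"image": img.get("src"), "caption": caption})
    return pairs
```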
Figure 1 shows some example data from bloom-vist. The dataset includes "albums," which are ordered sequences of images. An album may be associated with multiple "stories," where each story is an ordered sequence of text captions.
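Concretely, an album with two stories (e.g., the same picture sequence captioned in two languages) has roughly the following shape; the field names here are illustrative assumptions, not the exact bloom-vist schema.

```python
# Rough shape of an album with two stories. Field names are illustrative
# assumptions, not the exact bloom-vist JSON schema.
album = {
    "album_id": "album-0001",
    "images": ["page01.png", "page02.png", "page03.png"],  # ordered
    "stories": [
        {"story_id": "story-0001", "lang": "bzi",
         "captions": ["...", "...", "..."]},  # one caption per image
        {"story_id": "story-0002", "lang": "eng",
         "captions": ["...", "...", "..."]},
    ],
}
```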
Once in the appropriate format, we take various steps to clean up and filter the data. We check for, among other things, irreconcilable inconsistencies in metadata (like conflicting titles or book IDs), duplicate books, duplicate stories, duplicate albums, and similar or identical image-caption pairs. To account for image size or brightness variations during deduplication, we utilize a perceptual hash⁹ to identify albums sharing at least 80% of their images. We also filter out stories where the writing system script (e.g., Latn or Thai) does not match the majority writing system script used for that language. Of the 14,095 stories in the raw data, 2,547 were duplicates and 155 were filtered due to script mismatch.
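A hedged sketch of the perceptual-hash comparison follows, using the imagehash package (footnote 9); the 80% album-overlap threshold is from the text, while the per-image hash-distance cutoff is an assumed parameter.

```python
# Sketch of perceptual-hash deduplication between two albums, using the
# imagehash package (footnote 9). The per-image Hamming-distance cutoff
# (max_dist) is an assumed parameter; the 80% album-overlap threshold is
# from the text.
from PIL import Image
import imagehash

def album_hashes(image_paths):
    return [imagehash.phash(Image.open(p)) for p in image_paths]

def share_most_images(album_a, album_b, overlap=0.8, max_dist=8):
    hashes_a = album_hashes(album_a)
    hashes_b = album_hashes(album_b)
    # Count images in album A that perceptually match some image in B;
    # subtracting two hashes yields their Hamming distance.
    matches = sum(
        any(h_a - h_b <= max_dist for h_b in hashes_b) for h_a in hashes_a
    )
    return matches / max(len(hashes_a), 1) >= overlap
```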
Finally, we follow Kreutzer et al. (2022) and conduct a manual inspection for every language, rejecting any with obvious quality issues "at a glance." As in that work, some of the authors¹⁰ conducted manual inspections on languages they were familiar with (e.g., Mandarin, German), but also on languages they had no familiarity with. These checks provide a "floor" on data quality, allowing the detection of extremely low-quality data that is quite obviously wrong at a glance, even to those who do not speak the language.
For example, in this manual review, we detected a number of books having captions in the wrong language (e.g., "English" text in Devanagari or Arabic script) or obvious "test" stories containing the verbatim phrase "text in a block" or the English text "THIS IS ALSO IN FALI." in a book marked as being in the Fali language. Manual inspection was conducted on at least 50 random stories per language, or fewer if there were fewer stories in a language overall. 85 stories did not pass this manual inspection, some of which were also filtered out by the other quality checks.
Stories that failed any of the checks above are marked as "quarantined" in the JSON file. Downstream data loading scripts can then filter these out when loading the data, along the lines of the sketch below.
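```python
# Hedged sketch of skipping quarantined stories at load time. The
# "quarantined" flag is described in the text; the top-level "stories"
# list is an assumed layout, not the exact file format.
import json

def load_clean_stories(json_path):
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return [s for s in data["stories"] if not s.get("quarantined", False)]
```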
After all filtering and "quarantining" of items in the JSON, we are left with 11,407 stories containing a total of 112,080 image/caption pairs. The bloom-vist dataset is listed on the Hugging Face Hub as sil-ai/bloom-vist.¹¹
3.2 bloom-captioning

Building off of the data produced for bloom-vist, we further process the VIST JSON

⁹https://github.com/JohannesBuchner/imagehash
¹⁰Colin Leong: Native English and L2 Mandarin Chinese, and Anna Filighera: Native German and L2 French.
¹¹https://huggingface.co/datasets/sil-ai/bloom-vist