Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks

Colin Leong*, Joshua Nemecek†, Jacob Mansdorfer§, Anna Filighera¶, Abraham Owodunni|| and Daniel Whitenack†
*University of Dayton Research Institute, †SIL International, §Independent Contractor, ¶TU Darmstadt and ||Masakhane
cleong1@udayton.edu, joshua_nemecek@sil.org
jacob.mansdorfer@gmail.com, anna.filighera@kom.tu-darmstadt.de
owodunniabraham@gmail.com, dan_whitenack@sil.org
Abstract

We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.
1 Introduction

Only a negligible fraction of the 7,100+ living languages (Eberhard et al., 2021) have sufficient, publicly available text, audio, and image data to train state-of-the-art language/speech models and/or models for downstream tasks like Named Entity Recognition (NER) or image captioning. This data scarcity results in systematic inequalities in the performance of NLP tasks across the world's languages (Blasi et al., 2021). Indigenous language ecologies also represent profoundly different understandings of the nature and function of language (Bird, 2022, 2020), which might prioritize, for example, orality or translanguaging (Quakenbush and Simons, 2018) above a single, written mode of communication in all domains.
The Bloom Library¹ is a web-based platform that aims to increase the amount of multimodal material available to communities speaking non-dominant languages. The Bloom Library holds over 12,400 books in 545 languages (at the time this paper is published), covering subjects including agriculture, business, culture, math, science, religion, and health. Many of these books include images aligned with text, and 1,600+ of the books have corresponding audio recordings (called "talking books"). Language communities can create new books, create audio recordings, download existing books, and translate existing books using the open-source "Bloom" software².
To boost language diversity and indigenous perspectives in the NLP research community, we present multimodal datasets post-processed out of the Bloom Library. We anticipate that more task-specific datasets will be created from the Bloom Library. However, as a starting point, we present the following datasets: (1) bloom-lm for language modeling in 351 languages; (2) bloom-captioning for image-to-text or text-to-image tasks in 351 languages; (3) bloom-vist for visual storytelling in 351 languages; and (4) bloom-speech for speech-to-text and text-to-speech tasks in 56 languages.

The languages in these datasets correspond to 32 language families, and many of the included languages are in extremely low-resource settings. Further, to the authors' knowledge, bloom-vist represents the first (and certainly the most) multilingual visual storytelling dataset, and bloom-speech includes more languages in the following language families than any other aligned speech dataset (number of languages in parentheses): Austronesian (8), Mayan (6), Niger-Congo (7), Sepik (2), Tequistlatecan (2), and Trans-New Guinea (3).

¹https://bloomlibrary.org/
²https://github.com/BloomBooks/BloomDesktop
To assess the difficulty of language modeling, image captioning, and automatic speech recognition (ASR) with the Bloom Library datasets, we trained baseline models on each of these tasks. For certain languages, the Bloom Library datasets facilitate the first known baselines, with performance comparable to the state of the art for higher-resourced languages. We achieve a BLEU score above 10.0 on image captioning for 10 languages using only data from bloom-captioning. For ASR, we demonstrate a Word Error Rate (WER) below 0.5 for 18 languages and a Character Error Rate (CER) below 0.2 for 21 languages.
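For readers unfamiliar with these metrics, the following is a minimal sketch of computing WER and CER with the jiwer package; it is purely illustrative (the reference/hypothesis strings are invented placeholders), not the evaluation code used for the baselines.

```python
# Illustrative WER/CER computation with the jiwer package; the strings
# below are invented placeholders, not data from bloom-speech.
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"

wer = jiwer.wer(reference, hypothesis)  # word-level edits / reference words
cer = jiwer.cer(reference, hypothesis)  # character-level edits / reference chars
print(f"WER: {wer:.2f}, CER: {cer:.2f}")
```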
2 Related Work

In terms of language coverage, various multilingual, single-modality datasets have emerged recently. These include, by way of example, the JHU Bible Corpus (McCarthy et al., 2020), the CMU Wilderness Multilingual Speech dataset (Black, 2019), Common Voice 9 (Ardila et al., 2019), Multilingual BABEL (Consortium, 2022), and MASSIVE (FitzGerald et al., 2022). The number of languages in these datasets is impressive. However, many are limited in domain (e.g., only including Bible data), accessibility, licensing, or modality (e.g., only focusing on text or read speech). These datasets are also primarily rooted in content from large, dominant languages, like English, and are translated or adapted to other fairly large languages. Bloom Library data, in contrast, originates from local language communities,³ which are producing Bloom Books to fit their own local language ecology and perspectives. As a result, the data presented here covers languages, language families, and topics that are not covered by any other aligned and prepared datasets.
In terms of modality, the research community is presenting an increasing number of intriguing multimodal datasets. These include, by way of example, Pano-AVQA (Yun et al., 2021), which facilitates question answering regarding various objects, sounds, and their associations in videos, and VIST (Huang et al., 2016), which facilitates sequential vision-to-language tasks. However, recent multimodal datasets are overwhelmingly monolingual.

³On the use of the term "local" languages, we followed the terminology used in Bird (2022) and related works, which defines the term along the lines of "small, primarily-oral languages, often Indigenous or endangered, including the original and emerging languages of Africa, Asia, Australia, the Americas, the Pacific, and the minority languages of Europe."
Datasets representing both multiple modalities and many languages include Multi30k, one of the few multimodal, multilingual datasets in existence, with ~30k images and corresponding text descriptions in several languages including English, German (Elliott et al., 2016), French (Elliott et al., 2017), and Czech (Barrault et al., 2018). One listing can be found in Kádár (2019), which provides a helpful (and comprehensive) table of multilingual, multimodal resources, dividing them into two categories: (i) "translation" (with captions translated into another language); and (ii) "description" (with annotations independently created for each language). The table reveals that Multi30k was, at the time, the largest translation dataset available in terms of image count, at approximately 31k images and 31k sentences covering 4 languages.

The Bloom Library datasets fit into the "description" category of Kádár (2019). However, with 90k+ images and 110k+ captions covering 351 languages, plus speech data in 56 languages, Bloom Library represents a massive increase in language and modality coverage (up to two orders of magnitude wider than previous multilingual, multimodal datasets). Further, the existing datasets referenced by Kádár (2019) focus on large languages in high-resource settings, with no representation of local languages in low-resource settings. In contrast, our datasets include languages in extremely low-resource, non-dominant settings like Bisu [bzi] and Kagayanen [cgc], with estimated populations of 700 and 30,000 users, respectively.
3 Constructing the Datasets

The authors worked directly with the Bloom Library developers to gain access to and understand the raw data behind the Bloom Library website. We parse, clean, deduplicate, and publicly release this data for research use on the Hugging Face Hub⁴,⁵ in formats compatible with the Hugging Face datasets Python package.⁶
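As a concrete illustration, the released datasets can be loaded with the datasets package roughly as follows; the repository ID comes from the footnotes referenced in this paper, but the per-language configuration name ("bzi") is an assumption and should be checked against the configurations listed on the Hub.

```python
# Hedged loading sketch: the repository ID is from this paper's footnotes;
# the per-language configuration name ("bzi") is an illustrative assumption.
from datasets import load_dataset

vist = load_dataset("sil-ai/bloom-vist", "bzi")  # Bisu [bzi] subset
print(vist)
```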
bloom-lm, bloom-captioning, and bloom-vist are created using one data pipeline starting with bloom-vist, because each of these datasets uses some or all of the images and corresponding text within the Bloom Library. A separate data pipeline is used for bloom-speech to process only "talking books."

⁴https://www.ai.sil.org/bloom
⁵https://huggingface.co/sil-ai
⁶https://huggingface.co/docs/datasets/index
3.1 bloom-vist

The Bloom Library books offer the rare possibility of leveraging sequential images for language understanding across many languages. Thus, we first process the Bloom Library data into a format consistent with the VIST task published by Huang et al. (2016). VIST is a dataset of collections of sequential image-caption pairs that form short "stories," set up collaboratively by researchers at Google Research, CMU, and JHU; we structure our data to match this format. We hope this release of VIST-formatted data from Bloom Library catalyzes techniques in both multilingual and multimodal storytelling.
The raw Bloom Library data we received from the Bloom Library team consisted of a folder of files for each "book," which corresponds to one of the pages on the Bloom Library website. The relevant files in this folder include: (1) meta.json, containing important metadata such as the book's translation lineage, alternative titles, copyright, etc.; (2) an *.htm file containing the actual data, particularly text and image links for each page of the book; (3) in certain cases, a number of image files of various types, including *.jpg and *.png; and (4) in certain cases (for talking books), a number of *.mp3 audio files. In order to construct the sequential VIST-type data, we parse the *.htm file with BeautifulSoup⁸ to associate image files with captions and sequence these according to the sequence of pages in the book. We use meta.json to pull out relevant metadata (book title, topics, etc.) and to filter out any books not released under a Creative Commons license.

⁸https://www.crummy.com/software/BeautifulSoup/
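A minimal sketch of this parsing step is shown below; the "bloom-page" class name and the one-image-per-page assumption are simplifications for illustration, as the real Bloom HTML layout is more involved.

```python
# Illustrative parsing sketch for a Bloom book's *.htm file. The
# "bloom-page" class name and the one-image-per-page assumption are
# simplifications for illustration; real Bloom HTML is more involved.
from bs4 import BeautifulSoup

def extract_image_caption_pairs(htm_path):
    with open(htm_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    pairs = []
    # Page divs appear in reading order, so iteration preserves sequence.
    for page in soup.find_all("div", class_="bloom-page"):
        img = page.find("img")
        caption = page.get_text(separator=" ", strip=True)
        if img is not None and caption:
            pairs.append({"image": img.get("src"), "caption": caption})
    return pairs
```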
Figure 1 shows some example data from bloom-vist. The dataset includes "albums," which are ordered sequences of images. An album may be associated with multiple "stories," where each story is an ordered sequence of text captions.
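Concretely, an album with two stories (e.g., the same picture sequence captioned in two languages) has roughly the following shape; the field names here are illustrative assumptions, not the exact bloom-vist schema.

```python
# Rough shape of an album with two stories. Field names are illustrative
# assumptions, not the exact bloom-vist JSON schema.
album = {
    "album_id": "album-0001",
    "images": ["page01.png", "page02.png", "page03.png"],  # ordered
    "stories": [
        {"story_id": "story-0001", "lang": "bzi",
         "captions": ["...", "...", "..."]},  # one caption per image
        {"story_id": "story-0002", "lang": "eng",
         "captions": ["...", "...", "..."]},
    ],
}
```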
Once in the appropriate format, we take various steps to clean up and filter the data. We check for, among other things, irreconcilable inconsistencies in metadata (like conflicting titles or book IDs), duplicate books, duplicate stories, duplicate albums, and similar or identical image-caption pairs. To account for image size or brightness variations during deduplication, we utilize a perceptual hash⁹ to identify albums sharing at least 80% of their images. We also filter out stories where the writing system script (e.g., Latn or Thai) does not match the majority writing system script used for that language. Of the 14,095 stories in the raw data, 2,547 were duplicates and 155 were filtered due to script mismatch.
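A hedged sketch of the perceptual-hash comparison follows, using the imagehash package (footnote 9); the 80% album-overlap threshold is from the text, while the per-image hash-distance cutoff is an assumed parameter.

```python
# Sketch of perceptual-hash deduplication between two albums, using the
# imagehash package (footnote 9). The per-image Hamming-distance cutoff
# (max_dist) is an assumed parameter; the 80% album-overlap threshold is
# from the text.
from PIL import Image
import imagehash

def album_hashes(image_paths):
    return [imagehash.phash(Image.open(p)) for p in image_paths]

def share_most_images(album_a, album_b, overlap=0.8, max_dist=8):
    hashes_a = album_hashes(album_a)
    hashes_b = album_hashes(album_b)
    # Count images in album A that perceptually match some image in B;
    # subtracting two hashes yields their Hamming distance.
    matches = sum(
        any(h_a - h_b <= max_dist for h_b in hashes_b) for h_a in hashes_a
    )
    return matches / max(len(hashes_a), 1) >= overlap
```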
Finally, we follow Kreutzer et al. (2022) and conduct a manual inspection for every language, rejecting any with obvious quality issues "at a glance." As in that work, some of the authors¹⁰ conducted manual inspections on languages they were familiar with (e.g., Mandarin, German), but also on languages they had no familiarity with. These checks provide a "floor" on data quality, allowing the detection of extremely low-quality data that is quite obviously wrong at a glance, even to those who do not speak the language.
For example, in this manual review, we detected a number of books having captions in the wrong language (e.g., "English" text in Devanagari or Arabic script) or obvious "test" stories containing the verbatim phrase "text in a block" or the English text "THIS IS ALSO IN FALI." in a book marked as being in the Fali language. Manual inspection was conducted on at least 50 random stories per language, or fewer if there were fewer stories in a language overall. 85 stories did not pass this manual inspection, some of which were also filtered out by the other quality checks.
Stories that failed any of the checks above are marked as "quarantined" in the JSON file. Downstream data loading scripts can then filter these out when loading the data, along the lines of the sketch below.
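```python
# Hedged sketch of skipping quarantined stories at load time. The
# "quarantined" flag is described in the text; the top-level "stories"
# list is an assumed layout, not the exact file format.
import json

def load_clean_stories(json_path):
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return [s for s in data["stories"] if not s.get("quarantined", False)]
```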
After all filtering and "quarantining" of items in the JSON, we are left with 11,407 stories containing a total of 112,080 image/caption pairs. The bloom-vist dataset is listed on the Hugging Face Hub as sil-ai/bloom-vist.¹¹
3.2 bloom-captioning

Building off of the data produced for bloom-vist, we further process the VIST JSON

⁹https://github.com/JohannesBuchner/imagehash
¹⁰Colin Leong: Native English and L2 Mandarin Chinese, and Anna Filighera: Native German and L2 French.
¹¹https://huggingface.co/datasets/sil-ai/bloom-vist