Are word boundaries useful for unsupervised language learning?
Tu Anh Nguyen1,2, Maureen de Seyssel1, Robin Algayres1, Patricia Roze1,
Ewan Dunbar1, Emmanuel Dupoux1,2
1ENS, INRIA, INSERM, UPEC, PSL Research University
2Meta AI
{nguyentuanh208,emmanuel.dupoux}@gmail.com
Abstract
Word or word-fragment based Language
Models (LM) are typically preferred over
character-based ones in many downstream
applications. This may not be surprising
as words seem more linguistically relevant
units than characters. Words provide at
least two kinds of relevant information:
boundary information and meaningful units.
However, word boundary information may
be absent or unreliable in the case of
speech input (word boundaries are not marked
explicitly in the speech stream). Here, we
systematically compare LSTMs as a function
of the input unit (character, phoneme, word,
word part), with or without gold boundary
information. We probe linguistic knowledge
in the networks at the lexical, syntactic and
semantic levels using three speech-adapted, psycholinguistically-inspired black-box NLP benchmarks (pWUGGY, pBLIMP, pSIMI).
We find that the absence of boundaries
costs between 2% and 28% in relative
performance depending on the task. We show
that gold boundaries can be replaced by
automatically found ones obtained with an
unsupervised segmentation algorithm, and that
even modest segmentation performance yields a gain on two of the three tasks compared to basic character/phone-based models without boundary information.
1 Introduction
Neural language models trained with a self-supervised objective have proven very successful as a pretraining method to learn useful representations. In particular, because they do not require labels, they can be trained on very large corpora taken from the internet, and then fine-tuned with a small amount of labels on downstream tasks (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019). One of the unsolved problems is the optimality of the input units on which these neural models are trained (Sennrich et al., 2016; Bostrom and Durrett, 2020). Larger units like words tend to give better results, although they give rise to out-of-vocabulary (OOV) problems. Small units like characters do not have this problem and may not require boundary information, but give rise to slightly lower performance. Word-part units like BPE (Gage, 1994; Sennrich et al., 2016) seem to be a good compromise, providing larger units while still handling unseen words. Note that BPEs require word boundaries, even if they are modeling subword parts.
To cite this work: Nguyen, T.A., de Seyssel, M., Algayres, R., Roze, P., Dunbar, E., Dupoux, E. (2020). Are word boundaries useful for unsupervised language learning? CoML Technical Report, September 2020.
Recent work has applied self-supervised Language Modeling (LM) or masking objectives to raw audio, totally bypassing text, basing the loss function on automatically discovered quantized speech units (Baevski et al., 2020, 2019). Even though this approach has been shown to be very useful for pretraining an ASR system with few labels, the question remains as to what would be the optimal kinds of units for language modeling from raw speech. This may become even more salient, as quantized speech units tend to be smaller than phonemes (therefore unlikely units to carry meaning or syntactic information), and come without any word boundary (making it difficult to define meaningful higher-order units). In other words, if we want to apply LM approaches to raw audio, a major stumbling block may be the word segmentation problem. The fact that word segmentation from audio is itself a difficult problem may give rise to a circularity issue: we may need accurate word segmentation in order to do proper language modeling from audio without any labels. We may need excellent acoustic units to do accurate word segmentation. We may need very good language
modeling in order to obtain accurate decoding into acoustic units. Back to square one.
Here, we wish to estimate, as a preliminary question, the cost of switching from a word-based representation (with boundaries) to a phoneme-based one, without boundaries. We use the Librispeech corpus (Panayotov et al., 2015), for which we have both the text transcription and a phoneme-based transcription. We use phoneme transcriptions as a proxy for 'accurate' acoustic units, leaving for later the problem of erroneous transcripts when the units are derived from speech. We also test the possibility of replacing gold word boundaries by automatically obtained ones using an unsupervised word segmentation algorithm.
When comparing LMs with widely different kinds of input units, standard metrics like perplexity cannot be used because these metrics scale in complicated ways with the granularity of the input units. Instead, we rely here on three psycholinguistically inspired black-box NLP benchmarks which are independent of unit granularity, and which we adapt to be speech-compatible by phonemizing them and filtering the vocabulary with the Librispeech train set. The first two are based on assigning pseudo-probabilities to input strings, which are used as a proxy for an acceptability score. For the lexical benchmark (pWUGGY), we compare the acceptability of a word (like “brick”) to that of a non-word (like “blick”). The words and non-words are otherwise matched on unigram and bigram probabilities. For the syntactic benchmark (pBLIMP), we adapted and phonetically transcribed the BLIMP dataset (Warstadt et al., 2019), in which the acceptability of pairs of grammatical and ungrammatical sentences is assessed. The semantic test (pSIMI) is based on the distance between embeddings of words, which is correlated with distances obtained from human judgements.
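As an illustration of this scoring scheme, the sketch below shows how a pseudo-probability can be obtained from an autoregressive LSTM by summing the log-probabilities of each symbol given its left context. It is a minimal sketch only: the model class, its hyper-parameters and the encode() function are illustrative placeholders, not the code used in our experiments.

import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Toy autoregressive LSTM language model over character or phone ids."""
    def __init__(self, vocab_size, emb=64, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ids):
        h, _ = self.lstm(self.embed(ids))
        return self.out(h)  # logits for the next symbol at every position

def sequence_logprob(model, ids):
    """Pseudo-probability of a symbol sequence: sum of log P(x_t | x_<t)."""
    ids = torch.tensor(ids).unsqueeze(0)      # shape (1, T)
    with torch.no_grad():
        logits = model(ids[:, :-1])           # predict symbols 2..T
        logp = torch.log_softmax(logits, dim=-1)
        targets = ids[:, 1:].unsqueeze(-1)
        return logp.gather(-1, targets).sum().item()

# A pair is accepted if the attested form outscores the matched foil, e.g.:
# sequence_logprob(model, encode("brick")) > sequence_logprob(model, encode("blick"))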
The structure of the paper is as follows:
after presenting the related work (Section 2) and
methods (datasets, models and metrics, Section 3),
we present the results of baseline character-based
LSTM models with access to word boundaries
(Section 4). We then present experiments where
we change the units to be phones, and remove
the gold boundaries, or replace them with
automatically extracted ones (Section 5).
2 Related work
Units for LSTMs The importance of word boundaries has been investigated by Hahn and Baroni (2019). They compared word-based and character-based LSTMs where the word boundaries (space character) were removed, on a variety of probe tasks. They found that the character-based LSTMs passed a number of linguistic tests, sometimes better than word-based models, which are impaired by the presence of OOVs. Here, we follow the same inspiration, but evaluate more systematically models that are boundary-based yet do not suffer from OOVs (i.e., BPE and fallback models), in order to give word models a fairer comparison point and provide a quantitative measure of the cost of not having boundaries. We also expand the investigation to phoneme representations, which are a step closer to speech.
Black box linguistics Among the variety of black-box linguistic tasks, psycholinguistically inspired ones enable the direct comparison of models and humans. Grammaticality judgments for recurrent networks have been investigated since Allen and Seidenberg (1999), who used closely matched pairs of sentences to investigate grammatical correctness. This approach has been adopted recently to assess the abilities of RNNs, and LSTMs in particular, to capture syntactic structures. For instance, Linzen et al. (2016) and Gulordava et al. (2018) use word probes in minimally different pairs of English sentences to study number agreement. To discriminate grammatical sentences from ungrammatical ones, they retrieve the probabilities of the possible morphological forms of a target word, given the probability of the previous words in the sentence. Practically, in the sentence “the boy is sleeping”, the network has detected number agreement if P(w = is) > P(w = are). This methodology has also been adapted by Goldberg (2019) to models trained with a masked language-modeling objective. Those works find that in the absence of many attractors or complex sentence features, recent language models perform well at the number-agreement problem in English.
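For concreteness, this word-probe methodology reduces to comparing the probabilities assigned to two candidate forms after the same context. The sketch below assumes a word-level autoregressive LM with the same interface as the LSTM sketch in the introduction; the id arguments are illustrative placeholders.

import torch

def prefers_correct_form(model, context_ids, correct_id, wrong_id):
    """True if P(correct | context) > P(wrong | context),
    e.g. 'is' vs 'are' after the context 'the boy'."""
    ids = torch.tensor(context_ids).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids)[0, -1]             # next-word logits after the context
        logp = torch.log_softmax(logits, dim=-1)
    return logp[correct_id].item() > logp[wrong_id].item()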
More closely related to our work, Ravfogel
et al. (2018) use word probes to examine
whether LSTMs understand Basque agreement.
Like German, Basque is a morpho-syntactically
rich language with relatively free word order,
thus providing a challenging setting for the LM.
In contrast to our work, the LM’s ability to
understand verb argument structure is tested on
number-agreement and on suffix recovery tasks,
which involve localized changes rather than whole
sentence perturbations and re-orderings.
3 Methods
3.1 Training set
We used as a training set the transcription of the Librispeech 960h dataset (Panayotov et al., 2015), composed of 281K sentences (9M words, 40M characters or 33M phonemes). We can therefore give a comparative performance with other speech-based work. As it is the transcription of an ASR dataset, the text has originally been cleaned, with all punctuation marks removed and the text uppercased, resulting in a vocabulary size of 90K. For the phonetic transcription, we used the original LibriSpeech lexicon; for words that are not in the lexicon, we used the G2P-seq2seq toolkit (https://github.com/cmusphinx/g2p-seq2seq) to generate their phonetic transcriptions.
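The phonetization step amounts to a lexicon lookup with a G2P fallback for out-of-lexicon words. The sketch below illustrates this pipeline; it uses the g2p_en package purely as a stand-in for the G2P-seq2seq toolkit actually used here, and the lexicon path is a placeholder.

from g2p_en import G2p   # stand-in G2P model, not the G2P-seq2seq toolkit used here

def load_lexicon(path):
    """Parse a LibriSpeech-style lexicon with lines like 'WORD  W ER1 D'."""
    lexicon = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                lexicon[parts[0]] = parts[1:]
    return lexicon

g2p = G2p()

def phonetize(word, lexicon):
    """Return the phone sequence of a word: lexicon entry if present, G2P otherwise."""
    word = word.upper()
    if word in lexicon:
        return lexicon[word]
    return [p for p in g2p(word) if p.strip()]   # drop word-separator blanks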
3.2 Black Box test sets
We set up three tasks to evaluate the LMs at three levels: the lexicon (the pWUGGY benchmark), syntax (the pBLIMP benchmark) and semantics (the pSIMI benchmark). All of these benchmarks are presented in two formats: a character format (in which case the test tokens are in text) and a phonetic format, obtained by using the same G2P-seq2seq toolkit as for the train set.
Lexicon - the pWUGGY benchmark. We built on Godais et al. (2017), who used the 'Spot-the-word' task, in which the networks are presented with a pair composed of an existing word and a matching non-word, and are evaluated on their capacity to attribute a higher probability to the word.
The non-words are generated with the WUGGY software (Keuleers and Brysbaert, 2010), which generates, for a given word, a list of candidate non-words matched in phonotactics, syllabic structure, and other character-based constraints of the English language. We added additional constraints using a stochastic sampler to also match unigram and bigram, character and phoneme frequencies (see Supplementary Material B for more details).
The test dataset is composed of two subsets: a set of pairs built with words present in the LibriSpeech training set and a set of pairs built with words not existing in LibriSpeech (OOV words), with 30K and 10K pairs respectively. We also prepared a small development set containing 10K pairs of words from LibriSpeech, disjoint from the test set, in case it is needed. Each word or non-word in a pair was then preceded and followed by an <EOS> symbol to help the model distinguish a word from a prefix or suffix (e.g., the non-word firew and the word firework).
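Putting this together, the spot-the-word score is simply the fraction of pairs in which the word outscores its matched non-word, with each item wrapped in <EOS> symbols before scoring. The sketch below reuses the sequence_logprob function from the earlier sketch; pairs, encode() and eos_id are placeholders.

def spot_the_word_accuracy(model, pairs, encode, eos_id):
    """pairs: list of (word, nonword) strings; encode() maps a string to symbol ids.
    Wrapping in <EOS> prevents a non-word such as 'firew' from being scored
    as a mere prefix of the word 'firework'."""
    correct = 0
    for word, nonword in pairs:
        w = [eos_id] + encode(word) + [eos_id]
        n = [eos_id] + encode(nonword) + [eos_id]
        correct += sequence_logprob(model, w) > sequence_logprob(model, n)
    return correct / len(pairs)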
Sentence Grammaticality - the pBLIMP benchmark. This benchmark is adapted from BLIMP (Warstadt et al., 2019), a dataset of linguistic minimal sentence pairs of matched grammatical and ungrammatical sentences. As for the preceding test, the task is to decide which of the two members of the pair is grammatical, based on the probability of the sentence.
We adapted the code used to generate the BLIMP dataset (Warstadt et al., 2019) in order to create pBLIMP, specifically tailored for speech purposes. In BLIMP, sentences are divided into twelve broad categories focusing on different linguistic paradigms in the fields of syntax, morphology or semantics. These categories are themselves divided into 67 finer linguistic subcategories, containing 1000 sentence pairs each, automatically generated using expert hand-crafted grammars. One additional subcategory was also subsequently added in the code.
To make this dataset 'speech-ready', we discarded five subcategories and slightly modified the grammar for 9 additional subcategories, in order to avoid any difficulty in generating a prosodic contour for the ungrammatical sentences. We also removed from the vocabulary all words not present in the Librispeech (Panayotov et al., 2015) train set, as well as compound words and homophones that could cause further understanding issues once synthesised. 5000 sentence pairs were then generated for each of the 63 remaining subcategories. We then sampled sentence pairs from the generated pool to create a development and a test set, ensuring that the larger linguistic categories were sampled in terms of n-gram language model scores (see Supplementary Material B). The test and development sets contain 63,000 and 6,300 sentence pairs respectively, with no overlap between sentence pairs.
Semantics: the pSIMI benchmark. Here, the task is to compute the similarity of the representations of pairs of words and compare it to human similarity judgements.
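As an illustration, the sketch below computes such a score as the correlation between model cosine similarities and human ratings; the embed() function and the use of Spearman correlation are illustrative assumptions, not a specification of our evaluation code.

import numpy as np
from scipy.stats import spearmanr

def psimi_score(embed, test_items):
    """embed(word) -> vector; test_items: list of (word1, word2, human_rating)."""
    model_sims, human_sims = [], []
    for w1, w2, rating in test_items:
        v1, v2 = embed(w1), embed(w2)
        model_sims.append(float(np.dot(v1, v2) /
                                (np.linalg.norm(v1) * np.linalg.norm(v2))))
        human_sims.append(rating)
    rho, _ = spearmanr(model_sims, human_sims)   # rank correlation with human judgements
    return rho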
Based on previous work (Chung and Glass, 2018), we used a set of 13 existing semantic similarity/relatedness tests. The similarity-based da-