Are word boundaries useful for unsupervised language learning?
Tu Anh Nguyen1,2, Maureen de Seyssel1, Robin Algayres1, Patricia Roze1,
Ewan Dunbar1, Emmanuel Dupoux1,2
1ENS, INRIA, INSERM, UPEC, PSL Research University
2Meta AI
{nguyentuanh208,emmanuel.dupoux}@gmail.com
Abstract
Word or word-fragment based Language
Models (LM) are typically preferred over
character-based ones in many downstream
applications. This may not be surprising
as words seem more linguistically relevant
units than characters. Words provide at
least two kinds of relevant information:
boundary information and meaningful units.
However, word boundary information may
be absent or unreliable in the case of
speech input (word boundaries are not marked
explicitly in the speech stream). Here, we
systematically compare LSTMs as a function
of the input unit (character, phoneme, word,
word part), with or without gold boundary
information. We probe linguistic knowledge
in the networks at the lexical, syntactic and
semantic levels using three speech-adapted, psycholinguistically-inspired black-box NLP benchmarks (pWUGGY, pBLIMP, pSIMI).
We find that the absence of boundaries
costs between 2% and 28% in relative
performance depending on the task. We show
that gold boundaries can be replaced by
automatically found ones obtained with an
unsupervised segmentation algorithm, and that
even modest segmentation performance yields a gain on two of the three tasks compared to basic character/phone-based models without boundary information.
1 Introduction
Neural language models trained with a self-supervised objective have proven very successful as a pretraining method to learn useful representations. In particular, because they do not require labels, they can be trained on very large corpora taken from the internet, and then fine-tuned with a small amount of labels on downstream tasks (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019). One of the unsolved problems is the optimality of the input units on which these neural models are trained (Sennrich et al., 2016; Bostrom and Durrett, 2020). Larger units like words tend to give better results, although they give rise to out-of-vocabulary (OOV) problems. Small units like characters do not have this problem and may not require boundary information, but give rise to slightly lower performance. Word-part units like BPE (Gage, 1994; Sennrich et al., 2016) seem to be a good compromise, providing larger units while still handling unseen words. Note that BPEs require word boundaries, even if they are modeling subword parts.
To cite this work: Nguyen, T.A., de Seyssel, M., Algayres, R., Roze, P., Dunbar, E., Dupoux, E. (2020). Are word boundaries useful for unsupervised language learning? CoML Technical Report, September 2020.
Recent work has applied self-supervised Language Modeling (LM) or masking objectives to raw audio, totally bypassing text, basing the loss function on automatically discovered quantized speech units (Baevski et al., 2020, 2019). Even though this approach has been shown to be very useful for pretraining an ASR system with few labels, the question remains as to what would be the optimal kinds of units for language modeling from raw speech. This may become even more salient, as quantized speech units tend to be smaller than phonemes (therefore unlikely units to carry meaning or syntactic information), and come without any word boundary (making it difficult to define meaningful higher-order units). In other words, if we want to apply LM approaches to raw audio, a major stumbling block may be the word segmentation problem. The fact that word segmentation from audio is itself a difficult problem may give rise to a circularity issue: we may need accurate word segmentation in order to do proper language modeling from audio without any labels. We may need excellent acoustic units to do accurate word segmentation. We may need very good language
modeling in order to obtain accurate decoding into acoustic units. Back to square one.
Here, we wish to estimate, as a preliminary question, the cost of switching from a word-based representation (with boundaries) to a phoneme-based one, without boundaries. We use the Librispeech corpus (Panayotov et al., 2015), for which we have both the text transcription and a phoneme-based transcription. We use phoneme transcriptions as a proxy for 'accurate' acoustic units, leaving for later the problem of erroneous transcripts when the units are derived from speech. We also test the possibility of replacing gold word boundaries by automatically obtained ones using an unsupervised word segmentation algorithm.
When comparing LMs with widely different kinds of input units, standard metrics like perplexity cannot be used because these metrics scale in complicated ways with the granularity of the input units. Instead, we rely here on three psycholinguistically inspired black-box NLP benchmarks which are independent of unit granularity, and which we adapt to be speech-compatible by phonemizing them and filtering the vocabulary with the Librispeech train set. The first two are based on assigning pseudo-probabilities to input strings, which are used as a proxy for an acceptability score. For the lexical benchmark (pWUGGY), we compare the acceptability of a word (like “brick”) to that of a non-word (like “blick”). The words and non-words are otherwise matched on unigram and bigram probabilities. For the syntactic benchmark (pBLIMP), we adapted and phonetically transcribed the BLIMP dataset (Warstadt et al., 2019), in which the acceptability of pairs of grammatical and ungrammatical sentences is assessed. The semantic test (pSIMI) is based on the distance between embeddings of words, which is correlated with distances obtained from human judgements.
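As an illustration of this scoring scheme, the sketch below shows how a pseudo-probability can be obtained from an autoregressive LSTM by summing the log-probabilities of each symbol given its left context. It is a minimal sketch only: the model class, its hyper-parameters and the encode() function are illustrative placeholders, not the code used in our experiments.

import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Toy autoregressive LSTM language model over character or phone ids."""
    def __init__(self, vocab_size, emb=64, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ids):
        h, _ = self.lstm(self.embed(ids))
        return self.out(h)  # logits for the next symbol at every position

def sequence_logprob(model, ids):
    """Pseudo-probability of a symbol sequence: sum of log P(x_t | x_<t)."""
    ids = torch.tensor(ids).unsqueeze(0)      # shape (1, T)
    with torch.no_grad():
        logits = model(ids[:, :-1])           # predict symbols 2..T
        logp = torch.log_softmax(logits, dim=-1)
        targets = ids[:, 1:].unsqueeze(-1)
        return logp.gather(-1, targets).sum().item()

# A pair is accepted if the attested form outscores the matched foil, e.g.:
# sequence_logprob(model, encode("brick")) > sequence_logprob(model, encode("blick"))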
The structure of the paper is as follows:
after presenting the related work (Section 2) and
methods (datasets, models and metrics, Section 3),
we present the results of baseline character-based
LSTM models with access to word boundaries
(Section 4). We then present experiments where
we change the units to be phones, and remove
the gold boundaries, or replace them with
automatically extracted ones (Section 5).
2 Related work
Units for LSTMs The importance of word boundaries has been investigated by Hahn and Baroni (2019). They compared word-based and character-based LSTMs where the word boundaries (space character) were removed, on a variety of probe tasks. They found that the character-based LSTMs passed a number of linguistic tests, sometimes better than word-based models, which are impaired by the presence of OOVs. Here, we follow the same inspiration, but evaluate more systematically models that are boundary-based yet do not suffer from OOVs (i.e., BPE and fallback models), in order to give word models a fairer comparison point and provide a quantitative measure of the cost of not having boundaries. We also expand the investigation to phoneme representations, which are a step closer to speech.
Black box linguistics Among the variety of black-box linguistic tasks, psycholinguistically inspired ones enable the direct comparison of models and humans. Grammaticality judgments for recurrent networks have been investigated since Allen and Seidenberg (1999), who used closely matched pairs of sentences to investigate grammatical correctness. This approach has been adopted recently to assess the abilities of RNNs, and LSTMs in particular, to capture syntactic structures. For instance, Linzen et al. (2016) and Gulordava et al. (2018) use word probes in minimally different pairs of English sentences to study number agreement. To discriminate grammatical sentences from ungrammatical ones, they retrieve the probabilities of the possible morphological forms of a target word, given the probability of the previous words in the sentence. Practically, in the sentence “the boy is sleeping”, the network has detected number agreement if P(w = is) > P(w = are). This methodology has also been adapted by Goldberg (2019) to models trained with a masked language-modeling objective. Those works find that in the absence of many attractors or complex sentence features, recent language models perform well at the number-agreement problem in English.
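For concreteness, this word-probe methodology reduces to comparing the probabilities assigned to two candidate forms after the same context. The sketch below assumes a word-level autoregressive LM with the same interface as the LSTM sketch in the introduction; the id arguments are illustrative placeholders.

import torch

def prefers_correct_form(model, context_ids, correct_id, wrong_id):
    """True if P(correct | context) > P(wrong | context),
    e.g. 'is' vs 'are' after the context 'the boy'."""
    ids = torch.tensor(context_ids).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids)[0, -1]             # next-word logits after the context
        logp = torch.log_softmax(logits, dim=-1)
    return logp[correct_id].item() > logp[wrong_id].item()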
More closely related to our work, Ravfogel
et al. (2018) use word probes to examine
whether LSTMs understand Basque agreement.
Like German, Basque is a morpho-syntactically
rich language with relatively free word order,
thus providing a challenging setting for the LM.
In contrast to our work, the LM’s ability to
understand verb argument structure is tested on
number-agreement and on suffix recovery tasks,
which involve localized changes rather than whole
sentence perturbations and re-orderings.
3 Methods
3.1 Training set
We used as a training set the transcription of the Librispeech 960h dataset (Panayotov et al., 2015), composed of 281K sentences (9M words, 40M characters or 33M phonemes). We can therefore give a comparative performance with other speech-based work. As it is the transcription of an ASR dataset, the text has originally been cleaned, with all punctuation marks removed and the text uppercased, resulting in a vocabulary size of 90K. For the phonetic transcription, we used the original LibriSpeech lexicon; for words that are not in the lexicon, we used the G2P-seq2seq toolkit (https://github.com/cmusphinx/g2p-seq2seq) to generate their phonetic transcriptions.
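The phonetization step amounts to a lexicon lookup with a G2P fallback for out-of-lexicon words. The sketch below illustrates this pipeline; it uses the g2p_en package purely as a stand-in for the G2P-seq2seq toolkit actually used here, and the lexicon path is a placeholder.

from g2p_en import G2p   # stand-in G2P model, not the G2P-seq2seq toolkit used here

def load_lexicon(path):
    """Parse a LibriSpeech-style lexicon with lines like 'WORD  W ER1 D'."""
    lexicon = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                lexicon[parts[0]] = parts[1:]
    return lexicon

g2p = G2p()

def phonetize(word, lexicon):
    """Return the phone sequence of a word: lexicon entry if present, G2P otherwise."""
    word = word.upper()
    if word in lexicon:
        return lexicon[word]
    return [p for p in g2p(word) if p.strip()]   # drop word-separator blanks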
3.2 Black Box test sets
We set up three tasks to evaluate the LMs at three levels: the lexicon (the pWUGGY benchmark), syntax (the pBLIMP benchmark) and semantics (the pSIMI benchmark). All of these benchmarks are presented in two formats: a character format (in which case the test tokens are in text) and a phonetic format, obtained by using the same G2P-seq2seq toolkit as for the train set.
Lexicon - the pWUGGY benchmark. We built on Godais et al. (2017), who used the 'Spot-the-word' task, in which the networks are presented with a pair composed of an existing word and a matching non-word, and are evaluated on their capacity to attribute a higher probability to the word.
The non-words are generated with the WUGGY software (Keuleers and Brysbaert, 2010), which generates, for a given word, a list of candidate non-words matched in phonotactics, syllabic structure, and other character-based constraints of the English language. We added additional constraints using a stochastic sampler to also match unigram and bigram, character and phoneme frequencies (see Supplementary Material B for more details).
The test dataset is composed of two subsets: a set of pairs built with words present in the LibriSpeech training set and a set of pairs built with words not existing in LibriSpeech (OOV words), with 30K and 10K pairs respectively. We also prepared a small development set containing 10K pairs of words from LibriSpeech, disjoint from the test set, in case it is needed. Each word or non-word in a pair was then preceded and followed by an <EOS> symbol to help the model distinguish a word from a prefix or suffix (e.g., the non-word firew and the word firework).
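Putting this together, the spot-the-word score is simply the fraction of pairs in which the word outscores its matched non-word, with each item wrapped in <EOS> symbols before scoring. The sketch below reuses the sequence_logprob function from the earlier sketch; pairs, encode() and eos_id are placeholders.

def spot_the_word_accuracy(model, pairs, encode, eos_id):
    """pairs: list of (word, nonword) strings; encode() maps a string to symbol ids.
    Wrapping in <EOS> prevents a non-word such as 'firew' from being scored
    as a mere prefix of the word 'firework'."""
    correct = 0
    for word, nonword in pairs:
        w = [eos_id] + encode(word) + [eos_id]
        n = [eos_id] + encode(nonword) + [eos_id]
        correct += sequence_logprob(model, w) > sequence_logprob(model, n)
    return correct / len(pairs)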
Sentence Grammaticality - the pBLIMP benchmark. This benchmark is adapted from BLIMP (Warstadt et al., 2019), a dataset of linguistic minimal sentence pairs of matched grammatical and ungrammatical sentences. As for the preceding test, the task is to decide which of the two members of the pair is grammatical, based on the probability of the sentence.
We adapted the code used to generate the BLIMP dataset (Warstadt et al., 2019) in order to create pBLIMP, specifically tailored for speech purposes. In BLIMP, sentences are divided into twelve broad categories focusing on different linguistic paradigms in the fields of syntax, morphology or semantics. These categories are themselves divided into 67 finer linguistic subcategories, containing 1000 sentence pairs each, automatically generated using expert hand-crafted grammars. One additional subcategory was also subsequently added in the code.
To make this dataset 'speech-ready', we discarded five subcategories and slightly modified the grammar for 9 additional subcategories, in order to avoid any difficulty in generating a prosodic contour for the ungrammatical sentences. We also removed from the vocabulary all words not present in the Librispeech (Panayotov et al., 2015) train set, as well as compound words and homophones that could cause further understanding issues once synthesised. 5000 sentence pairs were then generated for each of the 63 remaining subcategories. We then sampled sentence pairs from the generated pool to create a development and a test set, ensuring that the larger linguistic categories were sampled in terms of n-gram language model scores (see Supplementary Material B). The test and development sets contain 63,000 and 6,300 sentence pairs respectively, with no overlap between sentence pairs.
Semantics: the pSIMI benchmark. Here, the task is to compute the similarity of the representations of pairs of words and compare it to human similarity judgements.
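As an illustration, the sketch below computes such a score as the correlation between model cosine similarities and human ratings; the embed() function and the use of Spearman correlation are illustrative assumptions, not a specification of our evaluation code.

import numpy as np
from scipy.stats import spearmanr

def psimi_score(embed, test_items):
    """embed(word) -> vector; test_items: list of (word1, word2, human_rating)."""
    model_sims, human_sims = [], []
    for w1, w2, rating in test_items:
        v1, v2 = embed(w1), embed(w2)
        model_sims.append(float(np.dot(v1, v2) /
                                (np.linalg.norm(v1) * np.linalg.norm(v2))))
        human_sims.append(rating)
    rho, _ = spearmanr(model_sims, human_sims)   # rank correlation with human judgements
    return rho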
Based on previous work (Chung and Glass, 2018), we used a set of 13 existing semantic similarity/relatedness tests. The similarity-based da-