ACOUSTICALLY-DRIVEN PHONEME REMOVAL THAT
PRESERVES VOCAL AFFECT CUES
Camille Noufi, Jonathan Berger
Stanford University
Center for Computer Research
in Music and Acoustics
Stanford, CA, USA
Karen J. Parker, Daniel L. Bowling
Stanford School of Medicine
Department of Psychiatry
and Behavioral Sciences
Stanford, CA, USA
ABSTRACT
In this paper, we propose a method for removing linguistic information from speech for the purpose of isolating paralinguistic indicators of affect. The immediate utility of this method lies in clinical tests of sensitivity to vocal affect that are not confounded by language, which is impaired in a variety of clinical populations. The method is based on simultaneous recordings of speech audio and electroglottographic (EGG) signals. The speech audio signal is used to estimate the average vocal tract filter response and amplitude envelope. The EGG signal supplies a direct correlate of voice source activity that is mostly independent of phonetic articulation. The dynamic energy of the speech audio and the average vocal tract filter are applied to the EGG signal to create a third signal designed to capture as much paralinguistic information from the vocal production system as possible, maximizing the retention of bioacoustic cues to affect while eliminating phonetic cues to verbal meaning. To evaluate the success of this method, we studied the perception of corresponding speech audio and transformed EGG signals in an affect rating experiment with online listeners. The results show a high degree of similarity in the perceived affect of matched signals, indicating that our method is effective.
Index Terms: speech, paralanguage, affect, voice transformation, electroglottography, phoneme removal
1. INTRODUCTION
Much of the information conveyed by speech is transmitted through paralinguistic cues encoded in the audio signal. These paralinguistic cues are essential to the communication of emotions, intentions, and personality [1–3]. For the majority of our daily interactions, these paralinguistic cues are embedded among phonetic cues encoding linguistic meaning. Although most individuals have no problem parsing linguistic and paralinguistic cues in speech and responding appropriately, this ability is often impaired in clinical populations (e.g., in autism [4] and depression [5]). Focusing on autism, the impairment is assumed to pertain to the reception of paralinguistic cues to speaker affect. However, the tests on which this assumption is based use speech stimuli, and thus confound sensitivity to paralanguage with language functioning. Testing sensitivity to paralinguistic affect directly requires isolating it from speech. This is important for understanding the nature of auditory-vocal contributions to clinical dysfunction, particularly in mental health.

Author correspondence: cnoufi@ccrma.stanford.edu
This work was funded in part by a grant from the National Institute of Mental Health (K01MH122730) and a seed grant from the Wu Tsai Neurosciences Institute at Stanford University.
Existing methods that attempt to isolate paralinguistic cues from speech benefit from economy and efficiency, but they also lose significant amounts of paralinguistic information, particularly concerning affect. For example, one simple and efficient method is to remove phonologic content from speech audio by adaptively low-pass filtering the signal such that the filter roll-off occurs below the second formant peak, thus removing a critical cue to vowel identification (i.e., the ratio between the first and second formants). However, because this method removes high-frequency content (starting at approx. 500–2500 Hz, depending on the vowel [6]), it also destroys important affective content [7].
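For illustration only, the following Python sketch applies a fixed-cutoff version of this low-pass approach using SciPy; the adaptive variant described above would instead track the second formant per frame and move the cutoff accordingly. The file name, cutoff frequency, and filter order are placeholder assumptions, not values from the cited work.

```python
import librosa
import scipy.signal as sig

def lowpass_below_f2(y, sr, cutoff_hz=500.0, order=8):
    """Attenuate content at and above the second-formant region, removing
    the F1/F2 ratio cue to vowel identity (and, with it, much of the
    affect-bearing high-frequency content)."""
    sos = sig.butter(order, cutoff_hz, btype="lowpass", fs=sr, output="sos")
    return sig.sosfiltfilt(sos, y)  # zero-phase low-pass filtering

# Placeholder file name; any mono speech recording works here.
y, sr = librosa.load("speech.wav", sr=None, mono=True)
y_lp = lowpass_below_f2(y, sr)  # cutoff fixed at 500 Hz for simplicity
```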
Another method is to discard phonetic cues by separating the vocal signal into two parts: the signal representing the laryngeal source, and the signal representing the supralaryngeal filter. Whereas the filter is more typically associated with linguistic articulation [8–10], the source is more associated with paralinguistic features that are essential to affect, such as voice pitch, breathiness, roughness, and other varieties of voice quality [1, 3, 11–13]. The most common method for separating the vocal source signal e(t) from the vocal tract impulse response h(t) is linear predictive coding (LPC) [14]. LPC uses a p-th-order linear predictor to estimate the speech signal s̃(t) from the p previous samples. This is done by solving for the optimum predictor coefficients a_k of the filter for a pseudo-stationary frame of speech. Once the filter coefficients are found, the "residual" e(t) is calculated, representing a mixture of both glottal and noise-based phonetic content.
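As a concrete (but purely illustrative) example of this decomposition, the sketch below estimates the prediction-error filter with librosa and inverse-filters the speech to obtain the residual e(t) = s(t) − Σ_k a_k s(t−k). For brevity it fits a single filter over the whole utterance, whereas LPC is normally applied per pseudo-stationary frame as described above; the order heuristic and file name are assumptions.

```python
import librosa
import scipy.signal as sig

def lpc_residual(y, sr, order=None):
    """Return the LPC residual e(t) and inverse-filter coefficients.

    librosa.lpc returns [1, -a_1, ..., -a_p]; filtering the signal with
    these coefficients (the FIR inverse filter A(z)) yields the prediction
    error, i.e. the "residual" mixing glottal and noise-based content.
    """
    if order is None:
        order = int(sr / 1000) + 2  # common rule-of-thumb LPC order
    a = librosa.lpc(y, order=order)
    e = sig.lfilter(a, [1.0], y)    # e(t) = s(t) - sum_k a_k s(t - k)
    return e, a

y, sr = librosa.load("speech.wav", sr=None, mono=True)  # placeholder file
e, a = lpc_residual(y, sr)
```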
Although source-filter separation is useful in many cases, the reality is that both linguistic and paralinguistic information are encoded across the entire range of the frequency spectrum [7] and produced by both source and filter [1, 11–13]. Given that existing methods operating on speech audio alone do not adequately separate linguistic and paralinguistic information, we designed a new method that leverages synchronized audio and electroglottographic (EGG) [15–18] recordings of speech to create a third, transformed EGG ("tEGG") signal made by applying a speech-based spectro-temporal transform to the EGG signal. When played as audio, this stimulus lacks the speech signal's linguistic content but retains significant paralinguistic information across the frequency spectrum.
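The precise algorithm is given in Section 2; purely as a rough illustration of the idea stated above, the sketch below imposes an utterance-level ("average") LPC vocal-tract envelope and a frame-wise energy envelope estimated from the speech onto a synchronized, equal-length EGG recording. All function names, orders, and frame sizes here are placeholder assumptions rather than the published settings.

```python
import numpy as np
import librosa
import scipy.signal as sig

def tegg_sketch(speech, egg, sr, lpc_order=None, frame=1024, hop=256):
    """Shape the EGG signal with the speech signal's average vocal-tract
    filter and dynamic energy (a rough stand-in for the tEGG transform)."""
    if lpc_order is None:
        lpc_order = int(sr / 1000) + 2

    # "Average" vocal-tract filter: a single all-pole LPC fit over the
    # whole utterance, applied to the EGG signal.
    a = librosa.lpc(speech, order=lpc_order)
    shaped = sig.lfilter([1.0], a, egg)

    # Dynamic energy: frame-wise RMS of the speech, interpolated back to
    # sample resolution and imposed on the shaped EGG signal.
    rms = librosa.feature.rms(y=speech, frame_length=frame, hop_length=hop)[0]
    env = np.interp(np.arange(len(shaped)), np.arange(len(rms)) * hop, rms)
    out = shaped * env

    return out / (np.max(np.abs(out)) + 1e-9)  # normalize for playback
```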
Section 2, we describe the algorithm for the transformation method.
In Section 3, we describe the speech/EGG recording process and an
implementation of the algorithm. In Section 4, we evaluate our al-
gorithm in an online experiment comparing ratings of emotional af-
fect in corresponding speech and tEGG signal pairs. We conclude by
briefly summarizing future directions and applications of the method