ACOUSTICALLY-DRIVEN PHONEME REMOVAL THAT
PRESERVES VOCAL AFFECT CUES
Camille Noufi, Jonathan Berger
Stanford University
Center for Computer Research
in Music and Acoustics
Stanford, CA, USA
Karen J. Parker, Daniel L. Bowling
Stanford School of Medicine
Department of Psychiatry
and Behavioral Sciences
Stanford, CA, USA
ABSTRACT
In this paper, we propose a method for removing linguistic information from speech for the purpose of isolating paralinguistic indicators of affect. The immediate utility of this method lies in clinical tests of sensitivity to vocal affect that are not confounded by language, which is impaired in a variety of clinical populations. The method is based on simultaneous recordings of speech audio and electroglottographic (EGG) signals. The speech audio signal is used to estimate the average vocal tract filter response and amplitude envelope. The EGG signal supplies a direct correlate of voice source activity that is mostly independent of phonetic articulation. The dynamic energy of the speech audio and the average vocal tract filter are applied to the EGG signal to create a third signal designed to capture as much paralinguistic information from the vocal production system as possible, maximizing the retention of bioacoustic cues to affect while eliminating phonetic cues to verbal meaning. To evaluate the success of this method, we studied the perception of corresponding speech audio and transformed EGG signals in an affect rating experiment with online listeners. The results show a high degree of similarity in the perceived affect of matched signals, indicating that our method is effective.
Index Terms: speech, paralanguage, affect, voice transformation, electroglottography, phoneme removal
1. INTRODUCTION
Much of the information conveyed by speech is transmitted through paralinguistic cues encoded in the audio signal. These paralinguistic cues are essential to the communication of emotions, intentions, and personality [1–3]. For the majority of our daily interactions, these paralinguistic cues are embedded among phonetic cues encoding linguistic meaning. Although most individuals have no problem parsing linguistic and paralinguistic cues in speech and responding appropriately, this ability is often impaired in clinical populations (e.g., in autism [4] and depression [5]). Focusing on autism, the impairment is assumed to pertain to the reception of paralinguistic cues to speaker affect. However, the tests on which this assumption is based use speech stimuli, and thus confound sensitivity to paralanguage with language functioning. Testing sensitivity to paralinguistic affect directly requires isolating it from speech. This is important for understanding the nature of auditory-vocal contributions to clinical dysfunction, particularly in mental health.

Author correspondence: cnoufi@ccrma.stanford.edu
This work was funded in part by a grant from the National Institute of Mental Health (K01MH122730) and a seed grant from the Wu Tsai Neurosciences Institute at Stanford University.
Existing methods that attempt to isolate paralinguistic cues from speech benefit from economy and efficiency, but they also lose significant amounts of paralinguistic information, particularly concerning affect. For example, one simple and efficient method is to remove phonologic content from speech audio by adaptively low-pass filtering the signal such that the filter roll-off occurs below the second formant peak, thus removing a critical cue to vowel identification (i.e., the ratio between the first and second formants). However, because this method removes high-frequency content (starting at approx. 500–2500 Hz, depending on the vowel [6]), it also destroys important affective content [7].
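For illustration only, the following Python sketch applies a fixed-cutoff version of this low-pass approach using SciPy; the adaptive variant described above would instead track the second formant per frame and move the cutoff accordingly. The file name, cutoff frequency, and filter order are placeholder assumptions, not values from the cited work.

```python
import librosa
import scipy.signal as sig

def lowpass_below_f2(y, sr, cutoff_hz=500.0, order=8):
    """Attenuate content at and above the second-formant region, removing
    the F1/F2 ratio cue to vowel identity (and, with it, much of the
    affect-bearing high-frequency content)."""
    sos = sig.butter(order, cutoff_hz, btype="lowpass", fs=sr, output="sos")
    return sig.sosfiltfilt(sos, y)  # zero-phase low-pass filtering

# Placeholder file name; any mono speech recording works here.
y, sr = librosa.load("speech.wav", sr=None, mono=True)
y_lp = lowpass_below_f2(y, sr)  # cutoff fixed at 500 Hz for simplicity
```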
Another method is to discard phonetic cues by separating the vocal signal into two parts: the signal representing the laryngeal source, and the signal representing the supralaryngeal filter. Whereas the filter is more typically associated with linguistic articulation [8–10], the source is more associated with paralinguistic features that are essential to affect, such as voice pitch, breathiness, roughness, and other varieties of voice quality [1, 3, 11–13]. The most common method for separating the vocal source signal e(t) from the vocal tract impulse response h(t) is linear predictive coding (LPC) [14]. LPC uses a p-th-order linear predictor to estimate the speech signal s̃(t) from the p previous samples. This is done by solving for the optimum predictor coefficients a_k of the filter for a pseudo-stationary frame of speech. Once the filter coefficients are found, the "residual" e(t) is calculated, representing a mixture of both glottal and noise-based phonetic content.
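As a concrete (but purely illustrative) example of this decomposition, the sketch below estimates the prediction-error filter with librosa and inverse-filters the speech to obtain the residual e(t) = s(t) − Σ_k a_k s(t−k). For brevity it fits a single filter over the whole utterance, whereas LPC is normally applied per pseudo-stationary frame as described above; the order heuristic and file name are assumptions.

```python
import librosa
import scipy.signal as sig

def lpc_residual(y, sr, order=None):
    """Return the LPC residual e(t) and inverse-filter coefficients.

    librosa.lpc returns [1, -a_1, ..., -a_p]; filtering the signal with
    these coefficients (the FIR inverse filter A(z)) yields the prediction
    error, i.e. the "residual" mixing glottal and noise-based content.
    """
    if order is None:
        order = int(sr / 1000) + 2  # common rule-of-thumb LPC order
    a = librosa.lpc(y, order=order)
    e = sig.lfilter(a, [1.0], y)    # e(t) = s(t) - sum_k a_k s(t - k)
    return e, a

y, sr = librosa.load("speech.wav", sr=None, mono=True)  # placeholder file
e, a = lpc_residual(y, sr)
```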
Although source-filter separation is useful in many cases, the reality is that both linguistic and paralinguistic information are encoded across the entire range of the frequency spectrum [7] and produced by both source and filter [1, 11–13]. Given that existing methods operating on speech audio alone do not adequately separate linguistic and paralinguistic information, we designed a new method that leverages synchronized audio and electroglottographic (EGG) [15–18] recordings of speech to create a third, transformed EGG ("tEGG") signal made by applying a speech-based spectro-temporal transform to the EGG signal. When played as audio, this stimulus lacks the speech signal's linguistic content but retains significant paralinguistic information across the frequency spectrum.
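The precise algorithm is given in Section 2; purely as a rough illustration of the idea stated above, the sketch below imposes an utterance-level ("average") LPC vocal-tract envelope and a frame-wise energy envelope estimated from the speech onto a synchronized, equal-length EGG recording. All function names, orders, and frame sizes here are placeholder assumptions rather than the published settings.

```python
import numpy as np
import librosa
import scipy.signal as sig

def tegg_sketch(speech, egg, sr, lpc_order=None, frame=1024, hop=256):
    """Shape the EGG signal with the speech signal's average vocal-tract
    filter and dynamic energy (a rough stand-in for the tEGG transform)."""
    if lpc_order is None:
        lpc_order = int(sr / 1000) + 2

    # "Average" vocal-tract filter: a single all-pole LPC fit over the
    # whole utterance, applied to the EGG signal.
    a = librosa.lpc(speech, order=lpc_order)
    shaped = sig.lfilter([1.0], a, egg)

    # Dynamic energy: frame-wise RMS of the speech, interpolated back to
    # sample resolution and imposed on the shaped EGG signal.
    rms = librosa.feature.rms(y=speech, frame_length=frame, hop_length=hop)[0]
    env = np.interp(np.arange(len(shaped)), np.arange(len(rms)) * hop, rms)
    out = shaped * env

    return out / (np.max(np.abs(out)) + 1e-9)  # normalize for playback
```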
Section 2, we describe the algorithm for the transformation method.
In Section 3, we describe the speech/EGG recording process and an
implementation of the algorithm. In Section 4, we evaluate our al-
gorithm in an online experiment comparing ratings of emotional af-
fect in corresponding speech and tEGG signal pairs. We conclude by
briefly summarizing future directions and applications of the method