Modeling Animal Vocalizations through Synthesizers
Masato Hagiwara
Earth Species Project
masato@earthspecies.org
Maddie Cusimano
Earth Species Project
maddie@earthspecies.org
Jen-Yu Liu
Earth Species Project
jenyu@earthspecies.org
[Figure 1 diagram: Target and Prediction are compared via a loss L(x, y); an Optimizer (with an optional Encoder) updates the Parameters fed to TorchSynth; spectrogram panels show Target (cat), Prediction (cat), and Noise, Pitch, and Env. modifications.]

Figure 1: Method overview. Given an animal sound, synthesizer parameters are optimized for reconstruction. The predicted sound can be modified (e.g., pitch shift, denoising, envelope modulation).
1 Introduction
Modeling real-world sound is a fundamental problem in the creative use of machine learning and many other fields, including human speech processing and bioacoustics. Transformer-based generative models [20, 6] are known to produce realistic sound, although they have limited control and are hard to interpret. Recently, lighter-weight models that incorporate structured modules and domain knowledge, notably DDSP [8, 29], have been shown to produce high-quality musical sound. However, a lack of signal-processing knowledge may hinder users from effectively manipulating the synthesis parameters, of which there can be over a hundred per frame.
As an alternative, we aim to use modular synthesizers, i.e., compositional, parametric electronic musical instruments, for modeling non-music sounds¹. Synthesizers are lightweight and designed in part for control, which makes them a plausible candidate model. However, inferring synthesizer parameters given a target sound, i.e., the parameter inference task [15, 11], is not trivial for general sounds. Research utilizing modern optimization techniques has typically focused on musical sound [30, 9, 18, 2, 4]. In this work, we optimize a differentiable synthesizer from TorchSynth [28] in order to model, emulate, and creatively generate animal vocalizations. We compare an array of optimization methods, from gradient-based search to genetic algorithms, for inferring its parameters, and then demonstrate how one can control and interpret the parameters for modeling non-music sounds. The current work is intended as a creative tool for artists to integrate animal sound into their work, but we hope future work in this direction could also enable new synthesis methods in bioacoustics [16].

¹Audio samples are at https://earthspecies.github.io/animalsynth/
2 Parameter Inference
The default synthesizer implemented in TorchSynth, named Voice, has a relatively simple architecture with a total of 78 parameters, consisting of two voltage-controlled oscillators (VCOs) and a noise generator, whose pitch and amplitude envelopes are modulated by a mix of low-frequency oscillators (LFOs).
Method                       Loss (↓)   Acc. (↑)
Random search                0.861      0.068
Gradient (Adam)              1.553      0.027
Variational optimization     0.891      0.034
Genetic algorithm            0.633      0.114
Differential evolution       0.738      0.068
PGPE [23]                    1.000      0.011
CMA-ES [12]                  0.891      0.027
Metropolis MCMC              0.658      0.091
Bayesian optimization [31]   0.737      0.034
Encoder                      2.131      0.034
Original                     —          0.841

Table 1: Comparison of optimization methods. Lower loss (↓) and higher accuracy (↑) are better.

Figure 2: Plot of parameters per category. Parameters are normalized to the [0, 1] range.
For any non-trivial synthesizer, it is extremely difficult to adjust parameters manually to reproduce the target sound, necessitating automated optimization techniques.
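To make the setup concrete, the following is a minimal sketch of driving TorchSynth's Voice from a flat parameter vector. It assumes the torchsynth 1.x API (SynthConfig, Voice) and that the synth exposes its parameters as ordinary PyTorch parameters stored in a normalized [0, 1] range; the render helper and module layout are ours, not the paper's code, and the exact return value of the forward call may differ across torchsynth versions.

    import torch
    from torch.nn.utils import parameters_to_vector, vector_to_parameters
    from torchsynth.config import SynthConfig
    from torchsynth.synth import Voice

    # batch_size=1 with reproducible=False lets us render one sound at a time;
    # buffer_size_seconds=2.0 matches the 2-second clips used in the paper.
    config = SynthConfig(batch_size=1, reproducible=False, buffer_size_seconds=2.0)
    voice = Voice(synthconfig=config)

    # Flat view of all normalized parameters (78 for the default Voice).
    N_PARAMS = parameters_to_vector(voice.parameters()).numel()

    def render(theta: torch.Tensor) -> torch.Tensor:
        """Render a mono waveform from a flat vector of [0, 1] parameters."""
        with torch.no_grad():
            vector_to_parameters(theta.clamp(0.0, 1.0), voice.parameters())
            out = voice()  # no batch index: render with the current parameters
            audio = out[0] if isinstance(out, tuple) else out
        return audio[0]  # first (and only) item in the batch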
We first compare several optimization techniques for parameter inference. As the target sound dataset, we used the animal sounds (e.g., dog, cat, birds) in the first fold of ESC-50 [22], an environmental sound dataset. We used the multi-scale STFT loss implemented in auraloss [25] as the objective. In order to assess the quality of reconstruction, we fine-tuned the pretrained VGGish [14] model on the remaining four folds of ESC-50 on the 50-class sound classification task (which includes both animal and non-animal categories) and used its accuracy on the reconstructed sounds. Intuitively, if an optimization method achieves good reconstruction, it can also “fool” the classifier trained on real sounds.
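As a sketch of the objective, the auraloss multi-resolution STFT loss can be wrapped around the render helper from the previous sketch (the module name synth_render below is hypothetical):

    import torch
    import auraloss

    from synth_render import render  # the render() helper sketched above (hypothetical module)

    # Multi-resolution STFT loss, used as the objective for parameter inference.
    mrstft = auraloss.freq.MultiResolutionSTFTLoss()

    def objective(theta: torch.Tensor, target: torch.Tensor) -> float:
        """Spectral distance between the rendering of `theta` and the target clip."""
        pred = render(theta)
        n = min(pred.shape[-1], target.shape[-1])  # align lengths defensively
        # auraloss expects (batch, channels, time) tensors
        return mrstft(pred[..., :n].reshape(1, 1, -1),
                      target[..., :n].reshape(1, 1, -1)).item()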
Table 1 shows the optimization techniques and their metrics. See the Supplemental Information for the details of individual methods. Overall, evolutionary algorithms (the genetic algorithm and differential evolution) worked relatively well, while gradient-based methods performed poorly, because the objective function is highly nonlinear and complex with respect to the synthesizer parameters. This trend is also observed for other highly complex optimization problems [26]. Predicting parameters directly with a neural network encoder was not successful either, possibly due to the discrepancies between the real target sounds and the training data (sounds artificially generated by TorchSynth itself).
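For illustration, an evolutionary search over the 78-dimensional normalized parameter space can be run with an off-the-shelf implementation; the sketch below uses SciPy's differential evolution rather than the authors' exact setup, reusing the hypothetical helpers from the previous sketches:

    import numpy as np
    import torch
    from scipy.optimize import differential_evolution

    from synth_render import N_PARAMS      # hypothetical helper modules
    from synth_objective import objective  # from the previous sketches

    def fit(target: torch.Tensor, maxiter: int = 100, seed: int = 0) -> np.ndarray:
        """Search the [0, 1]^78 parameter space to reconstruct `target`."""
        result = differential_evolution(
            lambda x: objective(torch.from_numpy(x).float(), target),
            bounds=[(0.0, 1.0)] * N_PARAMS,  # every synth parameter is normalized
            maxiter=maxiter,
            seed=seed,
            polish=False,  # skip the default gradient-based polishing step
        )
        return result.x  # best normalized parameter vector found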
One potential way to improve the gradient-based methods is to use a proxy network that simulates the synthesizer, and to replace the original synthesizer with the proxy network during parameter inference [24]. We trained a MelGAN [17]-based proxy network that takes the TorchSynth parameters as input and the corresponding waveform as the target, but found that its reconstructions were poor. In contrast to modeling audio effects, mapping only 78 parameters to a 2-second waveform (88,200 samples) may pose its own challenges due to the large difference in scale between the input and the output.
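The training loop for such a proxy has this general shape; once it fits well, the differentiable proxy stands in for the synthesizer so the loss can be backpropagated to the parameters. The architecture below is a deliberately simplified MLP stand-in with hypothetical sizes, not the MelGAN-based generator used in the paper:

    import torch
    import torch.nn as nn

    N_PARAMS, N_SAMPLES = 78, 88_200  # 78 parameters in, a 2-second waveform out

    # Simplified stand-in for the MelGAN-style generator (hypothetical sizes).
    proxy = nn.Sequential(
        nn.Linear(N_PARAMS, 1024), nn.ReLU(),
        nn.Linear(1024, 4096), nn.ReLU(),
        nn.Linear(4096, N_SAMPLES), nn.Tanh(),
    )
    opt = torch.optim.Adam(proxy.parameters(), lr=1e-4)

    def train_step(theta: torch.Tensor, waveform: torch.Tensor) -> float:
        """One step of fitting the proxy on a (parameters -> TorchSynth waveform) pair."""
        loss = torch.nn.functional.l1_loss(proxy(theta), waveform)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()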
After inferring parameters for each sample/class, controlling and generating new sounds is straightforward. For example, one can change the pitch, denoise, and/or modify the amplitude envelope by changing just one or a few synthesizer parameters, as shown in the spectrograms in Figure 1.
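Concretely, a modification amounts to nudging one dimension of the fitted parameter vector and re-rendering. In the sketch below, the index is hypothetical; the actual mapping from index to module (e.g., a VCO pitch or the noise mix level) can be read off the synth's named parameters:

    import torch
    from torch.nn.utils import parameters_to_vector

    from synth_render import voice, render  # hypothetical helper module sketched above

    theta = parameters_to_vector(voice.parameters()).detach().clone()  # fitted parameters

    PITCH_IDX = 3  # hypothetical index; print the mapping to find the real one
    for i, (name, _) in enumerate(voice.named_parameters()):
        print(i, name)

    shifted = theta.clone()
    shifted[PITCH_IDX] = (shifted[PITCH_IDX] + 0.1).clamp(0.0, 1.0)  # pitch-shift up
    audio_shifted = render(shifted)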
3 Interpretation and Generation
One benefit of synthesizer parameter inference is that the fitted parameters are interpretable. In Figure 2, we choose three animal categories (cow, pig, cat) for which the classifier accuracy is relatively high and plot the fundamental frequency (f0) and the duration of the note inferred by the genetic algorithm. The sound instances cluster based on their acoustic properties, e.g., cats make short, high-pitched sounds while cow vocalizations are longer and lower-pitched.
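A scatter plot like Figure 2 can be produced directly from the fitted vectors; the data layout below is a placeholder for the (f0, duration) values read from each fitted parameter set:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    # Placeholder data: substitute the fitted (f0, duration) pairs per instance.
    fitted = {c: rng.random((8, 2)) for c in ("cow", "pig", "cat")}

    for (category, points), marker in zip(fitted.items(), ("o", "^", "s")):
        plt.scatter(points[:, 1], points[:, 0], marker=marker, label=category)
    plt.xlabel("note duration (normalized)")
    plt.ylabel("fundamental frequency f0 (normalized)")
    plt.legend()
    plt.show()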
Finally, as a proof of concept, we generated new cat sounds by first fitting a Gaussian distribution over the inferred cat parameters (green squares in Figure 2) in the unnormalized space, then generating 100 new sounds by sampling from this distribution. The fine-tuned VGGish model correctly classified 18 samples out of 100, demonstrating that we can generate plausible new sounds by sampling synthesizer parameters from the fitted distribution.
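The generation step is a straightforward Gaussian fit-and-sample. A sketch, assuming the fitted cat vectors are stacked in a NumPy array (here in the normalized space, whereas the paper fits the Gaussian in the unnormalized space):

    import numpy as np
    import torch

    from synth_render import render, N_PARAMS  # hypothetical helper module sketched above

    rng = np.random.default_rng(0)
    # Placeholder: substitute the parameter vectors fitted to the cat instances.
    cat_params = rng.random((20, N_PARAMS))

    mean = cat_params.mean(axis=0)
    cov = np.cov(cat_params, rowvar=False)  # fit a Gaussian to the fitted vectors

    new_params = rng.multivariate_normal(mean, cov, size=100)  # 100 new candidates
    new_params = np.clip(new_params, 0.0, 1.0)                 # keep parameters valid
    new_sounds = [render(torch.from_numpy(p).float()) for p in new_params]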