
Method                        Loss (↓)   Acc. (↑)
Random search                 0.861      0.068
Gradient (Adam)               1.553      0.027
Variational optimization      0.891      0.034
Genetic algorithm             0.633      0.114
Differential evolution        0.738      0.068
PGPE [23]                     1.000      0.011
CMA-ES [12]                   0.891      0.027
Metropolis MCMC               0.658      0.091
Bayesian optimization [31]    0.737      0.034
Encoder                       2.131      0.034
Original                      —          0.841

Table 1: Comparison of optimization methods.
Figure 2: Plot of parameters per category. Parameters are normalized to the [0,1] range.
generator, whose pitch and amplitude envelopes are modulated by a mix of low-frequency oscillators
(LFOs). For any non-trivial synthesizer, it is extremely difficult to adjust parameters manually to
reproduce the target sound, necessitating automated optimization techniques.
We first compare several optimization techniques for parameter inference. As the target sound dataset,
we used the animal sounds (e.g., dog, cat, birds) in the first fold of ESC-50 [22], an environmental
sound dataset. We used the multi-scale STFT loss implemented in auraloss [25] as the objective. In
order to assess the quality of reconstruction, we fine-tuned the pretrained VGGish [14] model on the
remaining four folds of ESC-50 on the 50-class sound classification task (which includes both animal
and non-animal categories) and used its accuracy on the reconstructed sounds as the evaluation metric.
Intuitively, if an optimization method achieves
good reconstruction, it can also “fool” the classifier trained on real sounds.
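To make the objective concrete, below is a minimal sketch. Only the auraloss loss class is the actual library API; the `render` function is a toy stand-in for the 78-parameter TorchSynth Voice, included only so the sketch is self-contained and runnable.

```python
import torch
import auraloss

# Multi-resolution STFT loss from auraloss, used as the reconstruction objective.
mrstft = auraloss.freq.MultiResolutionSTFTLoss()

def render(params: torch.Tensor) -> torch.Tensor:
    """Toy stand-in for the synthesizer: map a normalized parameter vector to a
    waveform. In the paper this is the 78-parameter TorchSynth Voice; a sine
    oscillator is used here only to keep the sketch self-contained."""
    num_samples = 88_200                        # 2 seconds at 44.1 kHz
    t = torch.linspace(0.0, 2.0, num_samples)
    f0 = 100.0 + 900.0 * params[0]              # toy mapping: first param -> pitch
    return torch.sin(2 * torch.pi * f0 * t).reshape(1, 1, -1)

def objective(params: torch.Tensor, target: torch.Tensor) -> float:
    """Render a sound with `params` and score it against the target waveform
    (shape (1, 1, num_samples)); lower is better."""
    return mrstft(render(params), target).item()
```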
Table 1 shows the optimization techniques and their metrics. See the Supplemental Information for the
details of individual methods. Overall, evolutionary algorithms (the genetic algorithm and differential
evolution) worked relatively well, while gradient-based methods performed poorly because the objective
function is highly nonlinear and complex with respect to the synthesizer parameters. This trend has
also been observed for other highly complex optimization problems [26]. Predicting parameters
directly with a neural network encoder was not successful either, possibly due to the discrepancy
between the real sounds and the training data (sounds artificially generated by TorchSynth itself).
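As an illustration of the evolutionary route, one could run differential evolution over the normalized parameter cube with SciPy, as sketched below. The specific library, the `loss_fn` callable (e.g. a wrapper around the objective sketched above), and the search budgets are assumptions for illustration, not the implementation details of the paper (see the Supplemental Information for those).

```python
import numpy as np
from scipy.optimize import differential_evolution

def fit_parameters(loss_fn, num_params: int = 78, seed: int = 0) -> np.ndarray:
    """Search the normalized [0, 1]^num_params cube for the lowest loss.

    `loss_fn` maps a NumPy parameter vector to a scalar loss, e.g. a wrapper
    around the multi-scale STFT objective sketched earlier."""
    bounds = [(0.0, 1.0)] * num_params
    result = differential_evolution(loss_fn, bounds, maxiter=100, popsize=15,
                                    tol=1e-6, seed=seed)
    return result.x  # best parameter vector found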
One potential way to improve the gradient-based methods is to use a proxy network that simulates the
synthesizer, and replace the original synthesizer with the proxy network in parameter inference [24].
We trained a MelGAN [17]-based proxy network that takes the TorchSynth parameters as input and
the corresponding waveform as the target. However, we found that the resulting reconstructions were
poor. In contrast to audio effects, converting 78 parameters to a 2-second waveform (88,200 samples)
may pose its own challenges due to the large difference in dimensionality between the input and the output.
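To make the proxy idea concrete, a drastically simplified sketch follows: a small MLP (not the MelGAN-based architecture used here) maps parameter vectors to waveforms, and gradients are then taken through the frozen proxy with respect to the parameters. All shapes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxySynth(nn.Module):
    """Toy proxy network: 78 parameters -> 88,200-sample waveform.

    Stand-in for the MelGAN-based proxy; a plain MLP is used only to
    illustrate differentiating through a learned synthesizer surrogate."""
    def __init__(self, num_params: int = 78, num_samples: int = 88_200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_params, 512), nn.ReLU(),
            nn.Linear(512, 2048), nn.ReLU(),
            nn.Linear(2048, num_samples), nn.Tanh(),
        )

    def forward(self, params: torch.Tensor) -> torch.Tensor:
        return self.net(params)

def infer_through_proxy(proxy: ProxySynth, target: torch.Tensor,
                        steps: int = 500, lr: float = 1e-2) -> torch.Tensor:
    """Gradient-based parameter inference using a trained, frozen proxy.

    `target` has shape (1, num_samples); the parameters stay in [0, 1]."""
    params = torch.rand(1, 78, requires_grad=True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(proxy(params), target)
        loss.backward()
        opt.step()
        params.data.clamp_(0.0, 1.0)   # keep the parameters in the valid range
    return params.detach()
```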
After inferring parameters for each sample or class, controlling and generating new sounds is
straightforward. For example, one can change the pitch, de-noise, and/or modify the amplitude envelope
by changing just one or a few synthesizer parameters, as shown in the spectrograms in Figure 1.
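Such edits amount to overwriting an entry of the fitted parameter vector and re-rendering. A small sketch (the parameter index and the `render` callable, e.g. the wrapper sketched earlier, are assumptions):

```python
import torch

def edit_and_rerender(params: torch.Tensor, index: int, new_value: float,
                      render) -> torch.Tensor:
    """Re-render a fitted sound with a single parameter changed.

    `index` is whichever entry of the normalized parameter vector controls the
    property of interest (e.g. oscillator pitch; the exact index depends on the
    TorchSynth Voice layout), and `render` is a callable mapping parameters to
    a waveform."""
    edited = params.clone()
    edited[index] = new_value   # e.g. raise or lower the fundamental frequency
    return render(edited)
```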
3 Interpretation and Generation
One benefit of synthesizer parameter inference is that the fitted parameters are interpretable. In
Figure 2, we choose three animal categories (cow, pig, cat) for which the classifier accuracy is
relatively high and plot the fundamental frequency (f0) and the duration of the note inferred by the
genetic algorithm. The sound instances cluster based on their acoustic properties, i.e., cats make short
and high-pitched sounds while cow vocalizations are longer and lower-pitched.
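A plot of this kind can be produced in a few lines once the relevant parameters have been extracted; the sketch below assumes a dictionary mapping each category name to an (n, 2) array of normalized (f0, duration) values per sound.

```python
import matplotlib.pyplot as plt

def plot_f0_vs_duration(per_class: dict) -> None:
    """Scatter plot of fundamental frequency vs. note duration per category."""
    for name, values in per_class.items():
        plt.scatter(values[:, 1], values[:, 0], label=name)
    plt.xlabel("Note duration (normalized)")
    plt.ylabel("Fundamental frequency f0 (normalized)")
    plt.legend()
    plt.show()
```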
Finally, as a proof of concept, we generated new cat sounds by first fitting a Gaussian distribution over
the inferred cat parameters (green squares in Figure 2) in the unnormalized space, then generating 100
new sounds by sampling from this distribution. The fine-tuned VGGish model correctly classified 18
samples out of 100, demonstrating that we can generate plausible new sounds by sampling synthesizer
parameters from the fitted distribution.
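The generation step admits a direct sketch: fit a multivariate Gaussian to the inferred parameter vectors of one class and draw new vectors from it. The function below is illustrative; the paper fits the Gaussian in the unnormalized parameter space, and range handling and rendering are left to the caller.

```python
import numpy as np

def sample_new_parameters(fitted: np.ndarray, n: int = 100,
                          seed: int = 0) -> np.ndarray:
    """Fit a Gaussian to inferred parameter vectors (one row per sound) and
    draw `n` new parameter vectors from it."""
    rng = np.random.default_rng(seed)
    mean = fitted.mean(axis=0)
    cov = np.cov(fitted, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n)
```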