AVES: ANIMAL VOCALIZATION ENCODER BASED ON SELF-SUPERVISION
Masato Hagiwara
Earth Species Project
ABSTRACT
The lack of annotated training data in bioacoustics hinders the use of large-scale neural network models trained in a supervised way. In order to leverage large amounts of unannotated audio data, we propose AVES (Animal Vocalization Encoder based on Self-Supervision), a self-supervised, transformer-based audio representation model for encoding animal vocalizations. We pretrain AVES on a diverse set of unannotated audio datasets and fine-tune it for downstream bioacoustics tasks. Comprehensive experiments with a suite of classification and detection tasks show that AVES outperforms all the strong baselines and even the supervised “topline” models trained on annotated audio classification datasets. The results also suggest that curating a small training subset related to downstream tasks is an efficient way to train high-quality audio representation models. We open-source our models1.

1 https://github.com/earthspecies/aves
Index Terms: bioacoustics, self-supervision
1. INTRODUCTION
The use of machine learning, especially deep neural network models, has become increasingly popular in bioacoustics in recent years, driven by the increased amount of bioacoustic data recorded by affordable recording devices [1].
However, when trained in a supervised way, deep neural networks typically require a large amount of annotated data, and collecting high-quality annotations in bioacoustics requires deep expertise and is often costly in terms of the time required for manual labeling. This lack of labeled data in bioacoustics hinders the use of large-scale models, and most recent models still rely on species-specific customized CNNs (convolutional neural networks) trained on small amounts of task-specific annotated data [2]. Although a few studies [3, 4] pretrained models on large datasets such as AudioSet [5], this approach requires a large annotated dataset of general sounds.
One solution to this issue is self-supervision, a type of machine learning technique where the training signals come from the data itself in the form of pseudo labels. In recent years, large-scale self-supervised models and transfer learning have been hugely successful in related domains such as NLP [6, 7], computer vision [8, 9], and human speech processing [10, 11]. There has been prior work on training general-domain audio representation models [12, 13], although those models are not designed specifically for bioacoustics and their training data include few animal sounds. These findings lead us to expect that similar self-supervised approaches would also be effective for bioacoustic tasks.
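To make the notion of pseudo labels concrete, here is a toy PyTorch sketch of our own (not from the paper): one frame of the input is hidden and a model is trained to reconstruct it from neighboring frames, so the training target comes entirely from the data itself.

# Toy illustration of self-supervision: the training target (pseudo
# label) is carved out of the input itself, so no human labels are
# needed. All names and shapes here are illustrative.
import torch
import torch.nn.functional as F

x = torch.randn(4, 100, 64)          # (batch, frames, feature_dim)
mask_idx = 50
target = x[:, mask_idx, :].clone()   # pseudo label = the hidden frame

x_masked = x.clone()
x_masked[:, mask_idx, :] = 0.0       # hide the frame from the model

# Predict the hidden frame from its immediate neighbors (context).
context = torch.cat([x_masked[:, mask_idx - 1, :],
                     x_masked[:, mask_idx + 1, :]], dim=-1)
predictor = torch.nn.Linear(128, 64)
loss = F.mse_loss(predictor(context), target)
loss.backward()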
In this paper, we propose AVES (Animal Vocalization Encoder based on Self-supervision2), a self-supervised, transformer-based audio representation model for encoding animal vocalizations for downstream bioacoustic tasks. For modeling raw waveforms with self-supervision, we use HuBERT (Hidden Unit BERT) [14], an audio representation model originally proposed for human speech that obtained one of the top results on the SUPERB benchmark [15]. We pretrain AVES on unannotated audio data that include not only animal vocalizations but also other sounds (e.g., human speech and environmental sound). We emphasize that the pretraining is purely self-supervised and does not rely on any annotation, besides choosing which instances to include. We evaluate AVES on a suite of classification and detection tasks, two of the most common tasks considered in the bioacoustics literature [2]. Our experiments show that AVES outperforms all the baselines (including ResNet and VGGish [16] models pretrained on millions of videos), and even the supervised topline models trained on annotated audio classification datasets.
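To make the pretrain-then-fine-tune recipe concrete, the following is a minimal sketch (not the paper's exact code) of combining a HuBERT-style encoder with a simple downstream head using torchaudio; the checkpoint path and the DownstreamClassifier wrapper are hypothetical illustrations, and the paper's actual downstream configurations may differ.

# Minimal sketch: a HuBERT-style encoder as a feature extractor with
# a mean-pooled linear classification head, in the spirit of AVES.
import torch
import torchaudio

# Build a base HuBERT architecture (12 transformer layers, 768-dim).
encoder = torchaudio.models.hubert_base()
# encoder.load_state_dict(torch.load("aves_checkpoint.pt"))  # hypothetical path

class DownstreamClassifier(torch.nn.Module):
    """Mean-pools encoder frame features and applies a linear classifier."""
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = torch.nn.Linear(768, num_classes)  # base model: 768-dim

    def forward(self, waveforms):  # (batch, samples), 16 kHz mono
        features, _ = self.encoder.extract_features(waveforms)
        pooled = features[-1].mean(dim=1)  # last layer, averaged over time
        return self.head(pooled)

model = DownstreamClassifier(encoder, num_classes=10)
logits = model(torch.randn(2, 16000))  # two 1-second dummy clips
print(logits.shape)  # torch.Size([2, 10])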
Our contributions are as follows: with extensive benchmarking, we show that self-supervised methods originally proposed for human speech processing can be successfully applied to the domain of bioacoustics. This paper is also one of the first studies [2] showing that, given appropriate pretraining, transformers work well for data-scarce domains such as bioacoustics, where CNNs are still by far the most common approach. We open-source our pretrained models.
2. METHOD
AVES heavily relies on HuBERT [14], a self-supervised audio
representation model originally proposed for human speech.
2.1. HuBERT pretraining
Unlike human written language, where text can be used as self-supervision signals, human speech and bioacoustics data are continuous and lack predefined discrete units, so pseudo labels must be induced from the audio itself.

2 Pronounced /ei vi:z/. “Aves” is the class name for birds, although our method supports diverse animal species, not just birds.
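As a rough sketch of how such pseudo labels can be induced (our simplification of HuBERT's first pretraining iteration, here using 13-dimensional MFCCs and 100 k-means clusters; the torchaudio and scikit-learn calls are illustrative, not the paper's exact recipe):

# Sketch of HuBERT-style pseudo-label induction: cluster frame-level
# MFCC features with k-means and use the cluster IDs as discrete
# targets for masked prediction. Simplified from the real recipe,
# which uses richer features and iterative re-clustering.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)

waveforms = torch.randn(8, 16000)                 # stand-in for real audio
feats = mfcc(waveforms)                           # (batch, 13, time)
frames = feats.transpose(1, 2).reshape(-1, 13)    # (num_frames, 13)

kmeans = MiniBatchKMeans(n_clusters=100, n_init=3, batch_size=1024)
pseudo_labels = kmeans.fit_predict(frames.numpy())  # one unit ID per frame
# During pretraining, the transformer predicts these unit IDs for
# masked frames; no human annotation enters the pipeline.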