
AVES: ANIMAL VOCALIZATION ENCODER BASED ON SELF-SUPERVISION
Masato Hagiwara
Earth Species Project
ABSTRACT
The lack of annotated training data in bioacoustics hinders
the use of large-scale neural network models trained in a
supervised way. In order to leverage a large amount of unan-
notated audio data, we propose AVES (Animal Vocaliza-
tion Encoder based on Self-Supervision), a self-supervised,
transformer-based audio representation model for encoding
animal vocalizations. We pretrain AVES on a diverse set
of unannotated audio datasets and fine-tune it for down-
stream bioacoustics tasks. Comprehensive experiments with
a suite of classification and detection tasks have shown that
AVES outperforms all the strong baselines and even the
supervised “topline” models trained on annotated audio clas-
sification datasets. The results also suggest that curating a
small training subset related to downstream tasks is an effi-
cient way to train high-quality audio representation models.
We open-source our models1.
Index Terms—bioacoustics, self-supervision
1. INTRODUCTION
The use of machine learning, especially deep neural network
models, has become increasingly popular in bioacoustics in
recent years, driven by the increased amount of bioacoustic
data recorded by affordable recording devices [1].
However, when trained in a supervised way, deep neu-
ral networks typically require a large amount of annotated
data, and collecting high-quality annotation in bioacoustics
requires deep expertise and is often costly in terms of the time
required for manual labeling. This lack of labeled data in
bioacoustics hinders the use of large-scale models, and most
recent models still rely on species-specific, customized CNNs
(convolutional neural networks) trained on small amounts of
task-specific annotated data [2]. Although a few studies [3, 4]
pretrained models on large datasets such as AudioSet [5], this
approach still requires a large annotated dataset of general-domain sounds.
One solution to this issue is self-supervision, a type of
machine learning technique where the training signals come
from the data itself in the form of pseudo labels. In recent
years, large-scale self-supervised models and transfer learn-
ing have been hugely successful in related domains such
as NLP [6, 7], computer vision [8, 9], and human speech
processing [10, 11]. There has been prior work on train-
ing general-domain audio representation models [12, 13],
although those models are not designed specifically for bioa-
coustics and their pretraining data include few animal sound
datasets. These findings lead us to expect that similar self-
supervised approaches would also be effective for bioacoustic tasks.
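As a concrete illustration, the toy sketch below derives pseudo labels from unannotated audio by clustering MFCC frames with k-means, the kind of discrete targets used in the first iteration of HuBERT-style pretraining (Section 2); the file name, number of clusters, and choice of libraries are illustrative placeholders rather than the exact AVES configuration.

    import torchaudio
    from sklearn.cluster import KMeans

    # Load an unannotated recording (placeholder file name).
    waveform, sr = torchaudio.load("unlabeled_clip.wav")

    # Compute MFCC features: shape (channels, n_mfcc, num_frames).
    mfcc = torchaudio.transforms.MFCC(sample_rate=sr, n_mfcc=13)(waveform)
    frames = mfcc[0].T.numpy()  # (num_frames, n_mfcc)

    # Cluster frames into discrete units; the cluster IDs act as pseudo labels
    # that a model can be trained to predict for masked frames. No human
    # annotation is involved.
    pseudo_labels = KMeans(n_clusters=100).fit_predict(frames)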
In this paper, we propose AVES (Animal Vocalization
Encoder based on Self-supervision2), a self-supervised,
transformer-based audio representation model for encod-
ing animal vocalizations for downstream bioacoustic tasks.
For modeling raw waveforms with self-supervision, we use
HuBERT (Hidden Unit BERT) [14], an audio representation
model originally proposed for human speech. HuBERT ob-
tained one of the top results in the SUPERB benchmark [15].
We pretrain AVES on unannotated audio data that include
not only animal vocalizations, but also other sounds (e.g.,
human speech and environmental sound). We emphasize
that the pretraining is purely self-supervised and does not
rely on any annotation, besides choosing which instances to
include. We evaluate AVES on a suite of classification and
detection tasks, two of the most common tasks considered
in the bioacoustic literature [2]. Our experiments show that
AVES outperforms all the baselines (including ResNet and
VGGish [16] models pretrained on millions of
videos), even the supervised topline models trained on anno-
tated audio classification datasets.
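To make the intended usage concrete, the sketch below extracts frame-level embeddings with a HuBERT-style encoder and attaches a linear classification head for a downstream task. It uses torchaudio's HuBERT-Base bundle as a stand-in for an AVES checkpoint, and the file name and number of classes are placeholders; loading an actual AVES checkpoint may differ (see the repository).

    import torch
    import torchaudio

    # HuBERT-Base from torchaudio as a stand-in for an AVES checkpoint.
    bundle = torchaudio.pipelines.HUBERT_BASE
    model = bundle.get_model().eval()

    # Placeholder input clip; resample to the encoder's expected rate (16 kHz).
    waveform, sr = torchaudio.load("example_clip.wav")
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

    with torch.no_grad():
        # extract_features returns a list of per-layer frame embeddings,
        # each of shape (batch, num_frames, hidden_dim).
        features, _ = model.extract_features(waveform)

    # Mean-pool the last layer into a clip-level embedding.
    clip_embedding = features[-1].mean(dim=1)

    # A linear head on the pooled embedding is one simple fine-tuning setup
    # for a downstream classification task (num_classes is a placeholder).
    num_classes = 10
    head = torch.nn.Linear(clip_embedding.shape[-1], num_classes)
    logits = head(clip_embedding)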
Our contributions are as follows: with extensive bench-
marking, we show that self-supervised methods originally
proposed for human speech processing can be successfully
applied to the domain of bioacoustics. This paper is also one
of the first studies [2] showing that, given appropriate
pretraining, transformers work well for data-scarce domains
such as bioacoustics, where CNNs are still by far the most
common approach. We open-source our pretrained models.
2. METHOD
AVES heavily relies on HuBERT [14], a self-supervised audio
representation model originally proposed for human speech.
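As a schematic preview of the pretraining objective described in Section 2.1, the sketch below computes a HuBERT-style masked-unit-prediction loss: cross-entropy between predicted unit logits and pseudo-label cluster IDs, evaluated only at masked frames. The tensor sizes, random masking, and linear projection head are simplified placeholders; actual HuBERT training masks contiguous spans of the convolutional features before the transformer and predicts units by comparing projected outputs with learned unit embeddings.

    import torch
    import torch.nn.functional as F

    # Schematic tensors standing in for a batch of frame-level encoder outputs
    # and their k-means pseudo labels (all sizes are illustrative).
    batch, num_frames, hidden_dim, num_units = 2, 100, 768, 100
    frame_embeddings = torch.randn(batch, num_frames, hidden_dim)
    pseudo_labels = torch.randint(0, num_units, (batch, num_frames))

    # Randomly mask a subset of frame positions. (Actual HuBERT masks
    # contiguous spans of the convolutional features before the transformer.)
    mask = torch.rand(batch, num_frames) < 0.08

    # Project each frame embedding to logits over the discrete units.
    projection = torch.nn.Linear(hidden_dim, num_units)
    logits = projection(frame_embeddings)

    # The self-supervised loss is cross-entropy on masked positions only.
    loss = F.cross_entropy(logits[mask], pseudo_labels[mask])
    loss.backward()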
2.1. HuBERT pretraining
Unlike written human language, where text itself can be used
as self-supervision signals, human speech and bioacoustics data
1 https://github.com/earthspecies/aves
2 Pronounced /ei vi:z/. “Aves” is the class name for birds, although our
method supports diverse animal species, not just birds.