
ON OUT-OF-DISTRIBUTION DETECTION FOR AUDIO WITH DEEP NEAREST
NEIGHBORS
Zaharah Bukhsh, Aaqib Saeed
Eindhoven University of Technology, Eindhoven, The Netherlands
ABSTRACT
Out-of-distribution (OOD) detection is concerned with identi-
fying data points that do not belong to the same distribution as
the model’s training data. For the safe deployment of predic-
tive models in a real-world environment, it is critical to avoid
making confident predictions on OOD inputs, as these can lead to
potentially dangerous consequences. However, OOD detec-
tion largely remains an under-explored area in the audio (and
speech) domain. This is despite the fact that audio is a central
modality for many tasks, such as speaker diarization, automatic
speech recognition, and sound event detection. To address this,
we propose to leverage the feature space of the model with deep
k-nearest neighbors to detect OOD samples. We show that
this simple and flexible method effectively detects OOD inputs
across a broad category of audio (and speech) datasets. Specif-
ically, it improves the false positive rate (FPR@TPR95) by 17% and the AUROC score by 7% over prior techniques.
Index Terms— out-of-distribution, audio, speech, uncertainty estimation, deep learning, nearest neighbors
1. INTRODUCTION
Out-of-distribution (OOD) detection is the task of identifying
inputs that are not drawn from the same distribution as the
training data or are not truly representative of them. Neural
networks are known to produce overconfident scores even for
samples that do not belong to the training distribution [1].
This is a challenging problem for deploying machine learn-
ing in safety-critical applications, where making confident
predictions on OOD inputs can lead to potentially danger-
ous consequences. Besides the capability to generalize well
for samples from the familiar distribution, a robust machine
learning model should be aware of uncertainty stemming from
unknown examples. It is an important competency for real-
world applications, where the distribution of data can change
over time or vary across different user groups.
A broad range of approaches has been proposed to tackle
the OOD detection issue and develop reliable methods that
successfully distinguish in-distribution (ID) from OOD inputs. A common family of techniques derives uncertainty estimates
around the predictions of the neural network based on model outputs [1, 2, 3, 4], feature space [5, 6], and gradient norms [7].
Similarly, distance-based methods [5] have also gained significant attention recently for identifying OOD inputs with
promising capabilities. Distance-based methods leverage rep-
resentations extracted from a pre-trained model and act on the
assumption that out-of-distribution test samples are isolated
from the ID data. Nevertheless, OOD detection is severely
understudied in the audio domain, although audio recognition
models are being widely deployed in real-world settings. Moreover, audio is a central modality for many tasks, such as
speaker diarization, automatic speech recognition, and sound
event detection. The prior works mainly focus on vision tasks
raising an important question about the efficacy and applica-
bility of existing methods to audio and speech.
Our work follows the same intuition as the distance-based method [5], and we aim to explore the richness of the model
representation space to derive a meaningful signal that can
help solve the task of OOD detection. Formally, we propose
a simple yet effective system for out-of-distribution detec-
tion for audio inputs with deep k-nearest neighbors. In particular, we leverage the nearest-neighbor distance, a non-parametric approach that makes no strong distributional assumptions about the underlying embedding space. To
identify OOD samples, we extract an embedding for the test input, compute its distance to its k nearest neighbors in the training set, and apply a threshold to flag the input, i.e., a sample far away in
representation space is more likely to be OOD.
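The scoring rule described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function names, the choice of L2-normalized embeddings, and the value of k are illustrative assumptions.

```python
import numpy as np

def knn_ood_score(train_feats, test_feat, k=10):
    """Distance from a test embedding to its k-th nearest training embedding.

    Larger scores indicate the sample is farther from the training
    distribution in representation space, i.e., more likely OOD.
    """
    # L2-normalize embeddings (an assumed preprocessing step, common in
    # distance-based OOD detection).
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    z = test_feat / np.linalg.norm(test_feat)
    # Euclidean distance to every training embedding.
    dists = np.linalg.norm(train - z, axis=1)
    # Score = distance to the k-th nearest neighbor.
    return np.sort(dists)[k - 1]

def is_ood(train_feats, test_feat, threshold, k=10):
    # Flag the input as OOD when its kNN distance exceeds the threshold.
    return knn_ood_score(train_feats, test_feat, k) > threshold
```

In practice the threshold would be calibrated on held-out in-distribution data, e.g., so that a desired fraction of ID samples is retained.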
We demonstrate the effectiveness of the kNN-based approach on a broad range of audio recognition tasks and different neural network architectures and provide an extensive comparison
with both recent and classical approaches as baselines. Importantly, to the best of our knowledge, ours is the first attempt at studying out-of-distribution detection and
setting up a benchmark for audio across a variety of datasets
ranging from keyword spotting and emotion recognition to
environmental sounds and more. Empirically, we show that
for a MobileNet [8] model (trained on in-distribution data of human vocal sounds), the non-parametric nearest neighbor method improves FPR@TPR95 by 17% and the AUROC score by 7% over approaches that leverage the output or gradient space
of the model.
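For reference, the two metrics quoted above can be computed as sketched below, assuming the common convention that higher detection scores indicate OOD and that ID samples form the class whose true positive rate is fixed at 95%; this is an illustrative sketch, not the authors' evaluation code.

```python
import numpy as np

def fpr_at_tpr95(id_scores, ood_scores):
    # Choose the threshold at which 95% of ID samples fall below it
    # (TPR on ID = 95%), then report the fraction of OOD samples that
    # also fall below it, i.e., OOD inputs mistaken for ID.
    threshold = np.percentile(id_scores, 95)
    return float(np.mean(np.asarray(ood_scores) <= threshold))

def auroc(id_scores, ood_scores):
    # AUROC via the Mann-Whitney U statistic: the probability that a
    # randomly chosen OOD sample scores higher than a randomly chosen
    # ID sample (ties count half).
    id_s = np.asarray(id_scores)[:, None]
    ood_s = np.asarray(ood_scores)[None, :]
    return float(np.mean(ood_s > id_s) + 0.5 * np.mean(ood_s == id_s))
```

A lower FPR@TPR95 and a higher AUROC both indicate better separation of ID and OOD inputs.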
arXiv:2210.15283v2 [cs.SD] 25 Feb 2023