TEXT-TO-AUDIO GROUNDING BASED NOVEL METRIC FOR EVALUATING AUDIO CAPTION SIMILARITY
Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu
TCS Research, Tata Consultancy Services Limited, India.
ABSTRACT
Automatic Audio Captioning (AAC) refers to the task of translating an audio sample into natural language (NL) text that describes the audio events, the sources of the events, and their relationships. Unlike NL text generation tasks, which rely for evaluation on metrics like BLEU, ROUGE, and METEOR that are based on lexical semantics, an AAC evaluation metric requires the ability to map NL text (phrases) that correspond to similar sounds, in addition to lexical semantics. Current metrics used for the evaluation of AAC tasks lack an understanding of the perceived properties of sound represented by text. In this paper, we propose a novel metric based on Text-to-Audio Grounding (TAG), which is useful for evaluating cross-modal tasks like AAC. Experiments on a publicly available AAC dataset show that our evaluation metric performs better than existing metrics used in the NL text and image captioning literature.
Index Terms—Audio Captioning, Audio Event Detection, Audio Grounding, Encoder-decoder, BERT.
1. INTRODUCTION
Caption generation is an integral part of scene understanding, which involves perceiving the relationships between actors and entities. It has primarily been modeled as generating natural language (NL) descriptions using image or video cues [1]. However, audio based captioning was recently introduced in [2] as the task of generating meaningful textual descriptions for audio clips. Automatic Audio Captioning (AAC) is an inter-modal translation task, where the objective is to generate a textual description for a corresponding input audio signal [2]. Audio captioning is a critical step towards machine intelligence, with multiple applications in daily scenarios, ranging from audio retrieval [3] and scene understanding [4, 5] to assisting the hearing impaired [6] and audio surveillance. Unlike an Automatic Speech Recognition (ASR) task, the output is a description rather than a transcription of the linguistic content in the audio sample. Moreover, in an ASR task any background audio events are considered noise and hence are filtered during pre- or post-processing. A precursor to the AAC task is the Audio Event Detection (AED) [7, 8] problem, with emphasis on categorizing an audio sample (mostly sound) into a set of pre-defined audio event labels. AAC includes, but is not limited to, identifying the presence of multiple audio events ("dog bark", "gun shot", etc.), acoustic scenes ("in a crowded place", "amidst heavy rain", etc.), the spatio-temporal relationships of event sources ("kids playing", "while birds chirping in the background"), and physical properties based on the interaction of the source objects with the environment ("door creaks as it slowly revolves back and forth") [9, 10].
Metrics used for evaluation play a big role when automatically generated (NL text) captions have to be assessed for their accuracy. Word embeddings (or entity representations), like word2vec (w2v), Bidirectional Encoder Representations from Transformers (BERT), etc., are often used for these purposes. These embeddings are machine-learned latent (or vector) spaces that map lexical words having similar contextual and semantic meanings close to each other in the embedded vector space. Formally, let $W = \{w_1, w_2, \cdots, w_n\}$ be the language vocabulary containing $n$ words, where $w_i$ represents the $i$-th word. If $\mathrm{w2v}(w_i)$ is the word embedding of the word $w_i$, then $v_i = \mathrm{w2v}(w_i)$ is a mapping from $W \rightarrow \mathbb{R}^m$, such that $v_i$ is an $m$-dimensional real-valued vector. If $w_i$, $w_j$, and $w_k$ are three words in $W$ such that $w_j$ and $w_k$ are semantically closer, in the language space, than $w_i$ and $w_k$, then the Euclidean distance between $v_i$ and $v_k$ is greater than the distance between $v_j$ and $v_k$.
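A minimal sketch of this distance property is shown below; the gensim model name is one of its published pre-trained downloads, and the example words are illustrative rather than taken from this paper.

```python
# Sketch: semantically close words have closer embeddings.
import numpy as np
import gensim.downloader as api

w2v = api.load("glove-wiki-gigaword-50")  # maps word -> 50-dim vector

def dist(wi, wj):
    # Euclidean distance between the embeddings of two words.
    return float(np.linalg.norm(w2v[wi] - w2v[wj]))

# "dog" and "puppy" are semantically close; "car" and "puppy" are not,
# so we expect dist("car", "puppy") > dist("dog", "puppy").
print(dist("car", "puppy"), dist("dog", "puppy"))
```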
The word embedding w2v(·) is trained on a very large NL text corpus, which results in it learning the occurrences of words in similar semantic contexts. For this reason, w2v(·) seems to create an embedding space that we, as humans, can relate to from the language perspective. As a consequence, almost all NL processing (NLP) tasks that need to compare the text outputs of two different processes use some form of w2v(·) to measure performance.
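One common embedding-based way to compare two text outputs, sketched below under the same illustrative model as above, is to average the word vectors of each sentence and take a cosine similarity.

```python
# Sketch: embedding-based similarity between two text outputs.
import numpy as np
import gensim.downloader as api

w2v = api.load("glove-wiki-gigaword-50")

def sentence_vec(sentence):
    # Mean of the word embeddings; out-of-vocabulary words are skipped.
    vecs = [w2v[w] for w in sentence.lower().split() if w in w2v]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(sentence_vec("a dog is barking"),
             sentence_vec("a canine is yelping"))
print(f"similarity: {sim:.3f}")
```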
Note that AAC is essentially a task of assigning a text caption to an audio signal $a(t)$ without the help of any other cue; that is, $\mathrm{aac}(a(t))$ produces a sequence of lexical words $w_\alpha, w_\beta, w_\gamma, \cdots$ $(\in W)$ that forms a grammatically valid language sentence. Currently, the metrics adopted to measure the performance of an AAC system are those (BLEU [11], ROUGE [12], METEOR [13], CIDEr [14], SPICE [15]) popularly used to compare the outputs of NL generation tasks.
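To make the lexical nature of such metrics concrete, the small illustration below scores an invented caption pair with NLTK's sentence-level BLEU; the two captions describe a similar sound with different words, so their n-gram overlap, and hence the score, is near zero.

```python
# Sketch: BLEU penalizes paraphrases that share few exact n-grams.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a dog is barking loudly".split()
candidate = "the sound of a canine yelping".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.4f}")  # very low despite similar acoustic meaning
```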
It should be noted that NL tasks are expected to give semantically similar outputs, as in