
TEXT-TO-AUDIO GROUNDING BASED NOVEL METRIC FOR EVALUATING AUDIO CAPTION SIMILARITY
Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu
TCS Research, Tata Consultancy Services Limited, India.
ABSTRACT
Automatic Audio Captioning (AAC) refers to the task of translating an audio sample into natural language (NL) text that describes the audio events, the sources of the events, and their relationships. Unlike NL text generation tasks, which rely on metrics like BLEU, ROUGE, and METEOR based on lexical semantics for evaluation, an AAC evaluation metric requires the ability to map NL text (phrases) that correspond to similar sounds, in addition to lexical semantics. Current metrics used for the evaluation of AAC tasks lack an understanding of the perceived properties of sound represented by text. In this paper, we propose a novel metric based on Text-to-Audio Grounding (TAG), which is useful for evaluating cross-modal tasks like AAC. Experiments on a publicly available AAC dataset show that our evaluation metric performs better than existing metrics used in the NL text and image captioning literature.
Index Terms— Audio Captioning, Audio Event Detection, Audio Grounding, Encoder-decoder, BERT.
1. INTRODUCTION
Caption generation is an integral part of scene understanding, which involves perceiving the relationships between actors and entities. It has primarily been modeled as generating natural language (NL) descriptions using image or video cues [1]. However, audio-based captioning was recently introduced in [2] as the task of generating meaningful textual descriptions for audio clips. Automatic Audio Captioning (AAC) is an inter-modal translation task, where the objective is to generate a textual description for a corresponding input audio signal [2]. Audio captioning is a critical step towards machine intelligence, with multiple applications in daily scenarios, ranging from audio retrieval [3] and scene understanding [4, 5] to assisting the hearing impaired [6] and audio surveillance. Unlike an Automatic Speech Recognition (ASR) task, the output is a description rather than a transcription of the linguistic content in the audio sample. Moreover, in an ASR task any background audio events are considered noise and hence are filtered during pre- or post-processing. A precursor to the AAC task is the Audio Event Detection (AED) [7, 8] problem, with emphasis on categorizing an audio sample (mostly sound) into a set of pre-defined audio event labels. AAC includes, but is not limited to, identifying the presence of multiple audio events ("dog bark", "gun shot", etc.), acoustic scenes ("in a crowded place", "amidst heavy rain", etc.), the spatio-temporal relationships of event sources ("kids playing", "while birds chirping in the background"), and physical properties based on the interaction of the source objects with the environment ("door creaks as it slowly revolves back and forth") [9, 10].
Metrics used for evaluation play a big role when automatically generated (NL text) captions have to be assessed for their accuracy. Word embeddings (or entity representations), like word2vec (w2v), Bidirectional Encoder Representations from Transformers (BERT), etc., are often used for these purposes. These embeddings are machine-learned latent or vector spaces that map lexical words having similar contextual and semantic meanings close to each other in the embedded vector space. Formally, let $W = \{w_1, w_2, \cdots, w_n\}$ be the language vocabulary containing $n$ words, where $w_i$ represents the $i$-th word. If $\mathrm{w2v}(w_i)$ is the word embedding of the word $w_i$, then $v_i = \mathrm{w2v}(w_i)$ is a mapping from $W \to \mathbb{R}^m$, such that $v_i$ is an $m$-dimensional real-valued vector. If $w_i$, $w_j$, and $w_k$ are three words in $W$ such that $w_j$ and $w_k$ are semantically closer, in the language space, than $w_i$ and $w_k$, then the Euclidean distance between $v_i$ and $v_k$ is greater than the distance between $v_j$ and $v_k$.
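A minimal sketch of this distance property, assuming a small pre-trained embedding loaded through the gensim downloader (the specific model and word triple are illustrative choices, not from the paper):

import numpy as np
import gensim.downloader as api

# Illustrative pre-trained embedding (not prescribed by the paper):
# 50-dimensional GloVe vectors exposed via gensim's downloader.
w2v = api.load("glove-wiki-gigaword-50")

def euclidean(a: str, b: str) -> float:
    """Euclidean distance between the embeddings of two words."""
    return float(np.linalg.norm(w2v[a] - w2v[b]))

# "drizzle" (w_j) is semantically closer to "rain" (w_k) than
# "clock" (w_i) is, so d(v_i, v_k) should exceed d(v_j, v_k).
assert euclidean("clock", "rain") > euclidean("drizzle", "rain")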
The word embedding $\mathrm{w2v}(\cdot)$ is trained on a very large NL text corpus, which results in the model learning the occurrences of words in similar semantic contexts. For this reason, $\mathrm{w2v}(\cdot)$ seems to create an embedding space that we, as humans, can relate to from the language perspective. As a consequence, almost all NL processing (NLP) tasks that need to compare the text outputs of two different processes use some form of $\mathrm{w2v}(\cdot)$ to measure performance. Note that AAC is essentially a task of assigning a text caption to an audio signal $a(t)$ without the help of any other cue, namely $\mathrm{aac}(a(t))$ produces a sequence of lexical words $w_\alpha, w_\beta, w_\gamma, \cdots (\in W)$ that form a grammatically valid language sentence. Currently, the metrics adopted to measure the performance of an AAC system are the metrics (BLEU [11], ROUGE [12], METEOR [13], CIDEr [14], SPICE [15]) that are popularly used to compare the outputs of NL generation tasks. It should be noted that NL tasks, as in the case of image captioning and language translation, are expected to give semantically similar outputs; however, this is not true for the AAC task.
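One common way such embeddings are used to compare two text outputs is to average the word vectors of each sentence and take a cosine similarity; the sketch below is our illustration of that generic practice, not the paper's method, and reuses the assumed gensim model from the previous snippet:

import numpy as np
import gensim.downloader as api

w2v = api.load("glove-wiki-gigaword-50")  # assumed model, as above

def caption_vector(caption: str) -> np.ndarray:
    """Average the w2v embeddings of the in-vocabulary words."""
    vecs = [w2v[t] for t in caption.lower().split() if t in w2v]
    return np.mean(vecs, axis=0)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

ref = caption_vector("a person walks on a path with leaves on it")
hyp = caption_vector("shoes stepping across dirt twigs and leaves")
print(cosine(ref, hyp))  # reflects lexical semantics only, not acoustics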
1. EXAMPLE 1
(a) A person walks on a path with leaves on it
(b) Heavy footfalls are audible as snow crunches beneath
their boots
(c) Shoes stepping and moving across an area covered with
dirt, twigs and leaves.
2. EXAMPLE 2
(a) A CD player is playing, and the tape is turning, but no
voices or noise on it.
(b) A clock in the foreground with traffic passing by in the
distance.
(c) Vehicle has its turn signal on and off when the vehicle
drives.
Fig. 1. Samples of human-annotated captions.
We argue that these metrics used for NL tasks are not appropriate for the AAC task, even though the outputs of both result in text. It is well known that the same audio signal can result in a variety of captions when annotated by humans [16]. The differences in annotation are more prominent, especially in the absence of other (typically visual) cues. For example, "clock" and "car turn indicator", which are far apart semantically in the $\mathrm{w2v}(\cdot)$ embedding space, could appear interchangeably in an audio caption because they produce similar sounds. Additionally, auditory perception is a complex process consisting of multiple stages, including diarization, perceptual restoration, and selective attention, to name a few [17]. As a result, individuals belonging to different age groups, cultural backgrounds, and experiences might represent sounds differently, especially in the presence of external noise [17, 18].
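The "clock" versus "car turn indicator" mismatch can be reproduced directly in the embedding space; the snippet below is an illustration with an assumed gensim model, not an experiment from the paper, showing that the lexical similarity of the acoustically confusable pair is low:

import gensim.downloader as api

w2v = api.load("glove-wiki-gigaword-50")  # assumed model choice

# Cosine similarity between single words via gensim's KeyedVectors.
print(w2v.similarity("clock", "indicator"))  # low: semantically far apart
print(w2v.similarity("clock", "watch"))      # high: semantically related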
This insight, together with observations from our initial experiments designed to investigate the ambiguity in human annotation of publicly available AAC datasets, motivated us to explore the need for a new metric that can be used to measure the performance of an AAC system. Figure 1 shows examples of human-annotated captions for the same input audio. While all the captions in EXAMPLE 1 (Fig. 1) refer to the sound produced by "footsteps", the caption (item 1b) captures the sound produced by "snow" (when stepped on), whereas (item 1a) and (item 1c) capture the sound produced by "leaves" and "twigs". Clearly, any metric based on NL semantics would mark the caption (item 1b) as incorrect when compared to (item 1a) or (item 1c), because w2v("leaves") and w2v("snow") are not close to each other. A similar variance among reference captions for the same audio, in the Clotho dataset [9], can be observed in EXAMPLE 2 (Fig. 1). Interestingly, the "ticking" sound is perceived differently ("CD player", "clock", "turn indicator") by the three human annotators.
We can broadly categorize the variation in human captioning of the same audio into two types: (i) missed audio events, for example, the "traffic" noise is captured in (item 2b) and (item 2c) but missed by the annotator in (item 2a); and (ii) mis-identified or confused audio events owing to acoustic similarity, for example, "CD player", "clock", and "turn signal" produce similar sounds. Our proposed metric incorporates acoustic similarity in its formulation to enable fair evaluation of AAC systems.
Fig. 2. Audio captions represented in the 2D principal component space. Each color corresponds to the different captions of the same audio.
To explore the variations in human-annotated audio captions in the Clotho dataset, we use a pre-trained Sentence-BERT (sBERT) model [19] to extract sentence-level embeddings. Each point in the cluster plot (Fig. 2) represents the sBERT embedding of a caption projected onto two principal components [20], and the color identifies the captions corresponding to the same audio. It can be observed that same-colored points, representing captions of the same audio, are not necessarily clustered together. Thus, one can conclude that metrics (in this case, sBERT) which rely on semantic information alone are not suitable for evaluating audio captions.
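A minimal sketch of this visualization, assuming the sentence-transformers and scikit-learn libraries and an illustrative sBERT checkpoint (the paper does not name the exact model), with toy captions standing in for Clotho:

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# captions_per_audio: one list of reference captions per audio clip.
captions_per_audio = [
    ["a person walks on a path with leaves on it",
     "heavy footfalls are audible as snow crunches beneath their boots"],
    ["a clock in the foreground with traffic passing by in the distance",
     "vehicle has its turn signal on and off when the vehicle drives"],
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
flat = [c for group in captions_per_audio for c in group]
xy = PCA(n_components=2).fit_transform(model.encode(flat))

# Color each point by the audio clip its caption belongs to.
colors = [i for i, group in enumerate(captions_per_audio) for _ in group]
plt.scatter(xy[:, 0], xy[:, 1], c=colors)
plt.show()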
To overcome this, AAC evaluation has so far relied on computing scores against all available reference captions using existing metrics (for example, sBERT) and then taking the average or the best score. While this is helpful, it fails to address the inherent drawback of the available metrics when used for AAC evaluation.
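The sketch below illustrates this average-or-best strategy over multiple references, using cosine similarity between sBERT embeddings; the checkpoint name and helper function are our assumptions for illustration:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def sbert_scores(candidate: str, references: list[str]) -> tuple[float, float]:
    """Return (average, best) cosine similarity against all references."""
    cand = model.encode([candidate])[0]
    refs = model.encode(references)
    sims = refs @ cand / (np.linalg.norm(refs, axis=1) * np.linalg.norm(cand))
    return float(sims.mean()), float(sims.max())

avg, best = sbert_scores(
    "a clock ticks while cars pass by",
    ["a clock in the foreground with traffic passing by in the distance",
     "a CD player is playing and the tape is turning"],
)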
BERTScore [21], a popular evaluation metric for image captioning, leverages contextual word embeddings to obtain similarity scores. As a result, it can accommodate words that are semantically coherent, for example, {"raining", "drizzling"}.
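For reference, a minimal invocation of the bert-score package looks as follows; the captions are illustrative and the package's default English model is assumed:

from bert_score import score

cands = ["it is drizzling outside"]
refs = [["it is raining outside"]]

# P, R, F1 are tensors with one entry per candidate caption.
P, R, F1 = score(cands, refs, lang="en")
print(F1.item())  # high: "raining" and "drizzling" are semantically coherent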
However, for the AAC task, acoustically similar words like {"heavy