TEXT-TO-AUDIO GROUNDING BASED NOVEL METRIC FOR EVALUATING AUDIO CAPTION SIMILARITY
Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu
TCS Research, Tata Consultancy Services Limited, India.
ABSTRACT
Automatic Audio Captioning (AAC) refers to the task of translating an audio sample into natural language (NL) text that describes the audio events, the sources of the events, and their relationships. Unlike NL text generation tasks, which rely for evaluation on metrics like BLEU, ROUGE, and METEOR that are based on lexical semantics, an AAC evaluation metric requires the ability to map NL text (phrases) that correspond to similar sounds, in addition to lexical semantics. Current metrics used for the evaluation of AAC tasks lack an understanding of the perceived properties of sound represented by text. In this paper, we propose a novel metric based on Text-to-Audio Grounding (TAG), which is useful for evaluating cross-modal tasks like AAC. Experiments on a publicly available AAC dataset show that our evaluation metric performs better than existing metrics used in the NL text and image captioning literature.
Index Terms—Audio Captioning, Audio Event Detection, Audio Grounding, Encoder-decoder, BERT.
1. INTRODUCTION
Caption generation is an integral part of scene understanding, which involves perceiving the relationships between actors and entities. It has primarily been modeled as generating natural language (NL) descriptions using image or video cues [1]. However, audio based captioning was recently introduced in [2] as the task of generating meaningful textual descriptions for audio clips. Automatic Audio Captioning (AAC) is an inter-modal translation task, where the objective is to generate a textual description for a corresponding input audio signal [2]. Audio captioning is a critical step towards machine intelligence, with multiple applications in daily scenarios, ranging from audio retrieval [3] and scene understanding [4, 5] to assisting the hearing impaired [6] and audio surveillance. Unlike an Automatic Speech Recognition (ASR) task, the output is a description rather than a transcription of the linguistic content in the audio sample. Moreover, in an ASR task any background audio events are considered noise and hence are filtered during pre- or post-processing. A precursor to the AAC task is the Audio Event Detection (AED) [7, 8] problem, with emphasis on categorizing an audio sample (mostly sound) into a set of pre-defined audio event labels. AAC includes, but is not limited to, identifying the presence of multiple audio events ("dog bark", "gun shot", etc.), acoustic scenes ("in a crowded place", "amidst heavy rain", etc.), the spatio-temporal relationships of event sources ("kids playing", "while birds chirping in the background"), and physical properties based on the interaction of the source objects with the environment ("door creaks as it slowly revolves back and forth") [9, 10].
Metrics used for evaluation play a big role when automatically generated (NL text) captions have to be assessed for their accuracy. Word embeddings (or entity representations), like word2vec (w2v), Bidirectional Encoder Representations from Transformers (BERT), etc., are often used for these purposes. These embeddings are machine-learned latent (or vector) spaces that map lexical words having similar contextual and semantic meanings close to each other in the embedded vector space. Formally, let $W = \{w_1, w_2, \cdots, w_n\}$ be the language vocabulary containing $n$ words, where $w_i$ represents the $i$-th word. If $\mathrm{w2v}(w_i)$ is the word embedding of the word $w_i$, then $v_i = \mathrm{w2v}(w_i)$ is a mapping from $W \rightarrow \mathbb{R}^m$, such that $v_i$ is an $m$-dimensional real-valued vector. If $w_i$, $w_j$, and $w_k$ are three words in $W$ such that $w_j$ and $w_k$ are semantically closer, in the language space, than $w_i$ and $w_k$, then the Euclidean distance between $v_i$ and $v_k$ is greater than the distance between $v_j$ and $v_k$.
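A minimal sketch of this distance property is shown below; the gensim model name is one of its published pre-trained downloads, and the example words are illustrative rather than taken from this paper.

```python
# Sketch: semantically close words have closer embeddings.
import numpy as np
import gensim.downloader as api

w2v = api.load("glove-wiki-gigaword-50")  # maps word -> 50-dim vector

def dist(wi, wj):
    # Euclidean distance between the embeddings of two words.
    return float(np.linalg.norm(w2v[wi] - w2v[wj]))

# "dog" and "puppy" are semantically close; "car" and "puppy" are not,
# so we expect dist("car", "puppy") > dist("dog", "puppy").
print(dist("car", "puppy"), dist("dog", "puppy"))
```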
The word embedding w2v(·) is trained on a very large NL text corpus, which results in it learning the occurrences of words in similar semantic contexts. For this reason, w2v(·) seems to create an embedding space that we, as humans, can relate to from the language perspective. As a consequence, almost all NL processing (NLP) tasks that need to compare the text outputs of two different processes use some form of w2v(·) to measure performance.
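One common embedding-based way to compare two text outputs, sketched below under the same illustrative model as above, is to average the word vectors of each sentence and take a cosine similarity.

```python
# Sketch: embedding-based similarity between two text outputs.
import numpy as np
import gensim.downloader as api

w2v = api.load("glove-wiki-gigaword-50")

def sentence_vec(sentence):
    # Mean of the word embeddings; out-of-vocabulary words are skipped.
    vecs = [w2v[w] for w in sentence.lower().split() if w in w2v]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(sentence_vec("a dog is barking"),
             sentence_vec("a canine is yelping"))
print(f"similarity: {sim:.3f}")
```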
Note that AAC is essentially a task of assigning a text caption to an audio signal $a(t)$ without the help of any other cue; that is, $\mathrm{aac}(a(t))$ produces a sequence of lexical words $w_\alpha, w_\beta, w_\gamma, \cdots$ $(\in W)$ that forms a grammatically valid language sentence. Currently, the metrics adopted to measure the performance of an AAC system are those (BLEU [11], ROUGE [12], METEOR [13], CIDEr [14], SPICE [15]) popularly used to compare the outputs of NL generation tasks.
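To make the lexical nature of such metrics concrete, the small illustration below scores an invented caption pair with NLTK's sentence-level BLEU; the two captions describe a similar sound with different words, so their n-gram overlap, and hence the score, is near zero.

```python
# Sketch: BLEU penalizes paraphrases that share few exact n-grams.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a dog is barking loudly".split()
candidate = "the sound of a canine yelping".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.4f}")  # very low despite similar acoustic meaning
```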
It should be noted that NL tasks are expected to give semantically similar outputs, as in