tions form the backbone of AURL and enable efficient
learning of dynamic skills with our datasets.
We experimentally evaluate AURL on five dynamic ma-
nipulation tasks including swatting a fly swatter (see Fig. 1),
striking a box, and rattling a child’s toy on a UR10 robot.
Across all tasks, AURL can predict parameters for dynamic
behavior from audio with lower errors than our baselines.
Furthermore, when given desired sounds to emulate, the
dynamic behavior produced by AURL can generate sounds
with high similarity to the desired ones. On our aural simi-
larity metric, we find that self-supervised learning improves
performance by 34.5% over plain supervised learning.
To summarize, this paper makes three contributions. First,
we collect an audio-behavior dataset of 25k examples on
five tasks that capture contact-rich, dynamic interactions.
Second, we show that self-supervised learning techniques can
significantly improve the performance of behavior prediction
from audio. Third, we perform an ablative analysis on various
design choices such as the self-supervised objective used and
the amount of data needed for training. To the best of our
knowledge, AURL represents the first work to demonstrate
that dynamic behaviors can be learned solely from auditory
inputs. The dataset and robot videos from AURL are avail-
able on audio-robot-learning.github.io.
II. RELATED WORK
Our framework builds on prior work in audio-based robotics,
self-supervised learning, and dynamic manipulation. In this
section, we briefly describe the works most relevant to ours.
A. Audio for Robotics
The use of audio information has been extensively ex-
plored in the context of multimodal learning with visual
data [11,12,13,10,14]. Several of our design decisions
like the use of convolutional networks are inspired by this
line of work [11]. In the context of robotics, several works
have used audio for better navigation [15,16,9,17], where
the central idea is that sound provides a useful signal of the
environment around the robot. Recently, the sound gener-
ated by a quadrotor has been shown to provide signal for
visual localization [18]. The use of audio for manipulation
remains sparse [7,8,19]. Prior work in this domain has
looked at using sound to improve manipulation with granular
material [7], connecting sound with robotic planar manipula-
tion [8], and learning imitation-based policies from multimodal
sensory inputs [19]. We draw several points of inspiration
from these works including the use of contact microphones to
record sound [7,8] and the training of behavior models [19]. AURL
differs from these works in two aspects. First, we focus on
dynamic manipulation that generates sound through contact-
rich interaction. Second, we show that such manipulation be-
haviors can be learned without any visual or multimodal data.
B. Self-Supervised Learning on Sensory Data
Self-supervised representation learning has led to impres-
sive results in computer vision [20,21,22,23] and natural
language processing [24,25]. The goal of these methods is
to extract low-dimensional representations that can improve
downstream learning tasks without the need for labeled data.
This is done by first training a model on ‘pretext’ tasks with
an unlabeled dataset (e.g. Internet images). These pretext
tasks often include instance invariance to augmentations such
as color jitters or rotations [26,27,20,21,28]. The use of
such self-supervised learning has recently shown promise in
visual robotics tasks such as door opening [29] and dexterous
manipulation [30].
In the context of audio, several recent works have ex-
plored self-supervised representation learning for speech
and music data [31,32,33,34,35,36,37,38]. AURL
takes inspiration from BYOL-A [31] to use the BYOL [22]
framework for learning auditory representations. However,
unlike prior works that use music or speech data to train
their representations, our data includes multi-channel contact
sound that requires special consideration. For example, we
found that the Mixup augmentation used in BYOL-A does
not work well on our contact sound data.
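To make this style of pretext training concrete, the sketch below shows a BYOL-style objective over two augmented views of an audio spectrogram. It is a minimal illustration only: the encoder architecture, projector sizes, and the SpectrogramEncoder name are placeholder assumptions, not AURL's exact model or augmentation set.

```python
# Minimal BYOL-style pretext sketch for audio spectrograms (PyTorch).
# Illustrative only; architecture and augmentation choices are placeholders.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramEncoder(nn.Module):
    """Small CNN mapping a (channels, mels, time) spectrogram to an embedding."""
    def __init__(self, in_channels=2, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )

    def forward(self, x):
        return self.net(x)

def mlp(dim, hidden=256):
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

online_enc, online_proj, predictor = SpectrogramEncoder(), mlp(128), mlp(128)
target_enc, target_proj = copy.deepcopy(online_enc), copy.deepcopy(online_proj)
for p in list(target_enc.parameters()) + list(target_proj.parameters()):
    p.requires_grad = False  # target branch is updated only by EMA, not gradients

def byol_loss(spec_a, spec_b):
    """spec_a, spec_b: two augmented views of the same audio clip.
    In BYOL the loss is also computed with the two views swapped."""
    q = predictor(online_proj(online_enc(spec_a)))   # online branch prediction
    with torch.no_grad():
        z = target_proj(target_enc(spec_b))          # target branch embedding
    return 2 - 2 * F.cosine_similarity(q, z, dim=-1).mean()

@torch.no_grad()
def ema_update(tau=0.99):
    """Exponential moving average of online parameters into the target network."""
    for o, t in zip(online_enc.parameters(), target_enc.parameters()):
        t.mul_(tau).add_((1 - tau) * o)
    for o, t in zip(online_proj.parameters(), target_proj.parameters()):
        t.mul_(tau).add_((1 - tau) * o)
```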
C. Dynamic Manipulation
Training dynamic manipulation behaviors, i.e. behaviors
that require dynamic properties such as momentum, has
received significant interest in the robotics community [39,
40,41,42,43,44,45,46,47]. Several works in this
domain learn policies that output parameters of predefined
motion primitives [43,45], which makes them amenable to
supervised learning approaches. Our work uses a similar
action parameterization for the dynamic tasks we consider.
However, in contrast to many of these works [44,45,46],
AURL can operate in domains where visual information
does not contain sufficient temporal resolution or contact
information to effectively solve the task.
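As a hedged illustration of this kind of action parameterization, the sketch below regresses the parameters of a predefined motion primitive from a fixed-size audio embedding. The specific parameters (peak velocity, amplitude, duration) and network sizes are illustrative assumptions, not AURL's exact action space.

```python
# Sketch of parameter regression for a predefined motion primitive.
# The parameter set and network sizes are illustrative placeholders.
from dataclasses import dataclass
import torch
import torch.nn as nn

@dataclass
class SwingPrimitive:
    peak_velocity: float   # rad/s, peak joint velocity of the swing
    amplitude: float       # rad, total joint displacement
    duration: float        # s, time to complete the motion

class AudioToPrimitive(nn.Module):
    """Regresses primitive parameters from a fixed-size audio representation."""
    def __init__(self, audio_dim=128, n_params=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(audio_dim, 64), nn.ReLU(), nn.Linear(64, n_params))

    def forward(self, audio_embedding):
        return self.head(audio_embedding)  # (batch, n_params), cf. SwingPrimitive

def parameter_loss(model, audio_embedding, executed_params):
    """Supervised objective: MSE between predicted and executed parameters,
    as in parameter-regression approaches [43,45]."""
    return nn.functional.mse_loss(model(audio_embedding), executed_params)
```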
III. AUDIO-BEHAVIOR DATASET COLLECTION
Since no prior work has released a dataset for studying the
interaction of audio with behavior learning, we create our
own. We select tasks that allow for self-supervised dataset
collection while being dynamic in nature (see Fig. 2 for task
visualizations). For tasks
that involve making contact with the environment, we use
deformable tools to ensure safety. All of our data is collected
on a UR10 robot, with behaviors generated through joint
velocity control, and audio recorded by contact microphones.
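The collection procedure this implies can be sketched as a simple loop: sample random primitive parameters, execute them on the arm, and log the resulting audio together with the parameters. The two helper stubs below are hypothetical placeholders for the robot and contact-microphone interfaces, not actual UR10 or recording APIs.

```python
# Hedged sketch of a self-supervised audio-behavior collection loop.
# execute_primitive / record_contact_audio are placeholder stubs, not real APIs.
import numpy as np

def execute_primitive(params):
    """Placeholder: would command a joint-velocity motion primitive on the robot."""
    pass

def record_contact_audio(run_motion, sample_rate=44100, seconds=2.0):
    """Placeholder: would record from contact microphones while the motion runs."""
    run_motion()
    return np.zeros((2, int(sample_rate * seconds)))  # dummy 2-channel waveform

def collect_dataset(n_episodes, param_low, param_high, seed=0):
    """Sample random primitive parameters, execute, and log (audio, params) pairs."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(param_low), np.asarray(param_high)
    dataset = []
    for _ in range(n_episodes):
        params = rng.uniform(low, high)  # random dynamic behavior
        audio = record_contact_audio(lambda: execute_primitive(params))
        dataset.append({"params": params, "audio": audio})
    return dataset
```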
A. Dynamic Tasks and Setup
Our dataset consists of five tasks: rattling a rattle, shaking
a tambourine, swatting a fly swatter, striking a horizontal sur-
face, and striking a vertical surface. For each task, we attach
a different object to our UR10's end effector with 3D-printed
mounts. Dynamic motion for each task is generated by
using motion primitives that control the robot’s velocity and
acceleration profile. After the execution of the robot motion,
distinctive sounds are produced either from the object itself
(e.g. rattle) or through interaction with the environment (e.g.
swatter). Audio generated from each interaction is recorded