That Sounds Right: Auditory Self-Supervision for
Dynamic Robot Manipulation
Abitha Thankaraj
New York University
abitha@nyu.edu
Lerrel Pinto
New York University
lerrel@cs.nyu.edu
Abstract—Learning to produce contact-rich, dynamic behav-
iors from raw sensory data has been a longstanding challenge
in robotics. Prominent approaches primarily focus on using
visual or tactile sensing, where unfortunately one fails to
capture high-frequency interaction, while the other can be
too delicate for large-scale data collection. In this work, we
propose a data-centric approach to dynamic manipulation
that uses an often ignored source of information: sound. We
first collect a dataset of 25k interaction-sound pairs across
five dynamic tasks using commodity contact microphones.
Then, given this data, we leverage self-supervised learning to
accelerate behavior prediction from sound. Our experiments
indicate that this self-supervised ‘pretraining’ is crucial to
achieving high performance, with a 34.5% lower MSE than
plain supervised learning and a 54.3% lower MSE over visual
training. Importantly, we find that when asked to generate
desired sound profiles, online rollouts of our models on a
UR10 robot can produce dynamic behavior that achieves an
average of 11.5% improvement over supervised learning on
audio similarity metrics. Videos and audio data are best seen on
our project website: audio-robot-learning.github.io.
I. INTRODUCTION
Imagine learning to strike a tennis ball. How can you
tell whether your shot is getting better? Perhaps the most
distinctive feeling is the crisp, springy boom – that just
sounds right. It is not just tennis for which audition
provides a rich and direct signal of success; consider, for
example, everyday tasks like unlocking a door, cracking an
egg, or swatting a fly. In fact, recent works in neuroscience
[1,2,3] have found that sound enhances motor learning of
complex motor skills such as rowing, where being able to
listen to the sound of the rowing machine during learning
improves a person’s eventual rowing ability over those who
could not hear the machine’s sound.
In the context of robotics, the learning of motor skills
has often been centered around using visual observations as
input [4,5,6]. While this has enabled impressive success in a
variety of quasi-static robotics problems, visual observations
are often insufficient in dynamic, contact-rich manipulation
problems. One reason for this is that while visual data
contains high amounts of spatial information, it misses out
on high-frequency temporal information that is crucial for
dynamic tasks. To address this challenge, we will need to
embrace other forms of sensory supervision that can provide
this information.
[Fig. 1: AURL learns to generate dynamic behaviors from contact
sounds produced from a UR10 robot's interaction.]

Using audition holds promise in remedying the temporal
information gap present in vision, and has been explored in
several prior works. For example, Clarke et al. [7] shows that
sound from contact microphones can be used to estimate the
volume of granular material in a container, Gandhi et al. [8]
shows that sound can be used to identify object properties,
and Chen et al. [9,10] show that sound can be used alongside
vision to self-locate. Simultaneously, several works in the
computer vision community have shown that audio can even
be used to extract visual information in a scene [11]. The
foundation these prior works have laid for reasoning with
audio begs the question: can audio also be leveraged to learn
dynamic manipulation skills?
In this work, we present AURL, a new learning-based
framework for auditory perception in robotic manipulation.
AURL takes multi-channel sound data as input and outputs
parameters that control dynamic behavior. We obtain our
sound data from contact microphones placed on and around
the robot. Directly reasoning over raw audio data would
require gathering large quantities of training data, which
would limit the applicability of our framework; to address
this, we use self-supervised learning methods to learn low-
dimensional representations from audio. These representa-
tions form the backbone of AURL and enable efficient
learning of dynamic skills with our datasets.
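To make this pipeline concrete, the sketch below shows one way such an audio-to-behavior predictor could be structured: a convolutional encoder over multi-channel log-mel spectrograms followed by a small regression head over motion-primitive parameters. The layer sizes, channel count, and parameter dimensionality are illustrative assumptions, not the exact AURL architecture.

```python
# Minimal sketch (PyTorch) of an audio-to-behavior predictor in the spirit of AURL.
# Assumptions: 4 contact-microphone channels rendered as log-mel spectrograms, and
# dynamic behaviors described by a small vector of motion-primitive parameters.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Convolutional encoder mapping multi-channel spectrograms to a low-dim representation."""
    def __init__(self, n_channels: int = 4, embed_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, channels, n_mels, time)
        return self.proj(self.conv(spec).flatten(1))

class BehaviorPredictor(nn.Module):
    """Regression head predicting motion-primitive parameters from the audio embedding."""
    def __init__(self, embed_dim: int = 128, n_action_params: int = 6):
        super().__init__()
        self.encoder = AudioEncoder(embed_dim=embed_dim)
        self.head = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_action_params))

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(spec))

# Supervised training would minimize MSE between predicted and executed primitive parameters.
model = BehaviorPredictor()
spec = torch.randn(8, 4, 64, 128)   # a batch of multi-channel log-mel spectrograms
loss = nn.functional.mse_loss(model(spec), torch.randn(8, 6))
```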
We experimentally evaluate AURL on five dynamic ma-
nipulation tasks including swatting a fly swatter (See Fig. 1),
striking a box, and rattling a child’s toy on a UR10 robot.
Across all tasks, AURL can predict parameters for dynamic
behavior from audio with lower errors than our baselines.
Furthermore, when given desired sounds to emulate, the
dynamic behavior produced by AURL can generate sounds
with high similarity to the desired ones. On our aural simi-
larity metric, we find that self-supervised learning improves
performance by 34.5% over plain supervised learning.
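As an illustration of what an aural similarity measure can look like, the snippet below compares log-mel spectrograms of a produced and a desired sound. The mel parameters and the simple MSE distance are our assumptions for exposition, not necessarily the exact metric used in our evaluation.

```python
# Sketch of a spectrogram-space similarity between a desired sound and the sound
# produced by a rollout. The mel parameters and MSE distance are illustrative assumptions.
import torch
import torchaudio

def aural_distance(produced: torch.Tensor, desired: torch.Tensor, sample_rate: int = 16000) -> float:
    """Lower is better: MSE between log-mel spectrograms of two mono waveforms of equal length."""
    melspec = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)
    log_mel = lambda wav: torch.log(melspec(wav) + 1e-6)
    return torch.nn.functional.mse_loss(log_mel(produced), log_mel(desired)).item()

# Example with dummy 1-second waveforms.
produced = torch.randn(16000)
desired = torch.randn(16000)
print(aural_distance(produced, desired))
```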
To summarize, this paper makes three contributions. First,
we collect an audio-behavior dataset of 25k examples on
five tasks that capture contact-rich, dynamic interactions.
Second, we show that self-supervised learning techniques can
significantly improve the performance of behavior prediction
from audio. Third, we perform an ablative analysis on various
design choices such as the self-supervised objective used and
the amount of data needed for training. To the best of our
knowledge, AURL represents the first work to demonstrate
that dynamic behaviors can be learned solely from auditory
inputs. The dataset and robot videos from AURL are avail-
able on audio-robot-learning.github.io.
II. RELATED WORK
Our framework builds on several important works in
audio-based methods in robotics, self-supervised learning,
and dynamic manipulation. In this section, we will briefly
describe works that are most relevant to ours.
A. Audio for Robotics
The use of audio information has been extensively ex-
plored in the context of multimodal learning with visual
data [11,12,13,10,14]. Several of our design decisions
like the use of convolutional networks are inspired by this
line of work [11]. In the context of robotics, several works
have used audio for better navigation [15,16,9,17], where
the central idea is that sound provides a useful signal of the
environment around the robot. Recently, the sound gener-
ated by a quadrotor has been shown to provide signal for
visual localization [18]. The use of audio for manipulation
remains sparse [7,8,19]. Prior work in this domain has
looked at using sound to improve manipulation with granular
material [7], connecting sound with robotic planar manipula-
tion [8], and learning imitation-based policies for multimodal
sensory inputs [19]. We draw several points of inspiration
from these works including the use of contact microphones to
record sound [7,8] and training behavior models [19]. AURL
differs from these works in two aspects. First, we focus on
dynamic manipulation that generates sound through contact-
rich interaction. Second, we show that such manipulation be-
haviors can be learned without any visual or multimodal data.
B. Self-Supervised Learning on Sensory Data
Self-supervised representation learning has led to impres-
sive results in computer vision [20,21,22,23] and natural
language processing [24,25]. The goal of these methods is
to extract low-dimensional representations that can improve
downstream learning tasks without the need for labeled data.
This is done by first training a model on ‘pretext’ tasks with
an unlabeled dataset (e.g. Internet images). These pretext
tasks often include instance invariance to augmentations such
as color jitters or rotations [26,27,20,21,28]. The use of
such self-supervised learning has recently shown promise in
visual robotics tasks such as door opening [29] and dexterous
manipulation [30].
In the context of audio, several recent works have ex-
plored self-supervised representation learning for speech
and music data [31,32,33,34,35,36,37,38]. AURL
takes inspiration from BYOL-A [31] to use the BYOL [22]
framework for learning auditory representations. However,
unlike prior works that use music or speech data to train
their representations, our data includes multi-channel contact
sound that requires special consideration. For example, we
found that the Mixup augmentation used in BYOL-A does
not work well on our contact sound data.
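To illustrate the flavor of this pretraining, the sketch below outlines a BYOL-style objective over two augmented views of a contact-sound spectrogram, using only augmentations that preserve contact structure (a random time crop and small amplitude jitter) and omitting Mixup, in line with the observation above. The specific augmentations, the absence of a projector, and the EMA rate are our simplifying assumptions rather than the exact BYOL-A recipe.

```python
# Minimal BYOL-style pretraining sketch for contact-sound spectrograms.
# The encoder can be any module mapping spectrograms to embed_dim vectors
# (e.g., the hypothetical AudioEncoder sketched earlier).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment(spec: torch.Tensor) -> torch.Tensor:
    """Contact-friendly augmentations: random time crop + amplitude jitter (no Mixup)."""
    t = spec.shape[-1]
    start = torch.randint(0, t // 4 + 1, (1,)).item()
    cropped = spec[..., start:start + 3 * t // 4]
    return cropped * (1.0 + 0.1 * torch.randn(1).item())

class BYOL(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int = 128, tau: float = 0.99):
        super().__init__()
        self.online = encoder
        self.target = copy.deepcopy(encoder)   # updated only by EMA, receives no gradients
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.predictor = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                       nn.Linear(256, embed_dim))
        self.tau = tau

    def loss(self, spec: torch.Tensor) -> torch.Tensor:
        v1, v2 = augment(spec), augment(spec)
        p1, p2 = self.predictor(self.online(v1)), self.predictor(self.online(v2))
        with torch.no_grad():
            z1, z2 = self.target(v1), self.target(v2)
        # Symmetrized negative cosine similarity between online predictions and target embeddings.
        return -(F.cosine_similarity(p1, z2, dim=-1).mean()
                 + F.cosine_similarity(p2, z1, dim=-1).mean()) / 2

    @torch.no_grad()
    def ema_update(self):
        for po, pt in zip(self.online.parameters(), self.target.parameters()):
            pt.mul_(self.tau).add_((1 - self.tau) * po)
```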
C. Dynamic manipulation
Training dynamic manipulation behaviors, i.e. behaviors
that require dynamic properties such as momentum, has
received significant interest in the robotics community [39,
40,41,42,43,44,45,46,47]. Several works in this
domain learn policies that output parameters of predefined
motion primitives [43,45], which makes them amenable to
supervised learning approaches. Our work uses a similar
action parameterization for the dynamic tasks we consider.
However, in contrast to many of these works [44,45,46],
AURL can operate in domains where visual information
does not contain sufficient temporal resolution or contact
information to effectively solve the task.
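As a concrete illustration of this kind of action parameterization, the sketch below defines a simple striking primitive whose low-dimensional parameters (peak joint velocity, ramp time, hold time) fully determine a joint-velocity profile. The trapezoidal profile and parameter names are hypothetical, chosen only to show why predicting such parameters reduces behavior generation to supervised regression.

```python
# Sketch of a parameterized motion primitive: a low-dimensional parameter vector
# fully determines a joint-velocity profile, so behavior prediction becomes regression
# over these parameters. The trapezoidal shape and parameter names are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class StrikePrimitive:
    peak_velocity: float   # rad/s at the striking joint
    accel_time: float      # seconds to ramp up and down
    hold_time: float       # seconds held at peak velocity

    def velocity_profile(self, dt: float = 0.008) -> np.ndarray:
        """Trapezoidal joint-velocity profile sampled at the controller rate."""
        ramp = np.linspace(0.0, self.peak_velocity, max(int(self.accel_time / dt), 1))
        hold = np.full(max(int(self.hold_time / dt), 1), self.peak_velocity)
        return np.concatenate([ramp, hold, ramp[::-1]])

# A behavior model only needs to output (peak_velocity, accel_time, hold_time);
# the resulting profile is then streamed to the robot as joint-velocity commands.
profile = StrikePrimitive(peak_velocity=1.5, accel_time=0.2, hold_time=0.1).velocity_profile()
```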
III. AUDIO-BEHAVIOR DATASET COLLECTION
Since there are no prior works that have released datasets
to study the interaction of audio with behavior learning,
we will need to create our own datasets. We select tasks
that allow for self-supervised dataset collection while being
dynamic in nature (see Fig. 2 for task visualization). For tasks
that involve making contact with the environment, we use
deformable tools to ensure safety. All of our data is collected
on a UR10 robot, with behaviors generated through joint
velocity control, and audio recorded by contact microphones.
A. Dynamic Tasks and Setup
Our dataset consists of five tasks: rattling a rattle, shaking
a tambourine, swatting a fly swatter, striking a horizontal sur-
face, and striking a vertical surface. For each task, we attach
a different object to our UR10’s end effector with 3D printed
mounts. Dynamic motion for each task is generated by
using motion primitives that control the robot’s velocity and
acceleration profile. After the execution of the robot motion,
distinctive sounds are produced either from the object itself
(e.g. rattle) or through interaction with the environment (e.g.
swatter). Audio generated from each interaction is recorded
by the contact microphones.
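To make the data collection procedure concrete, the following sketch pairs each sampled primitive execution with the multi-channel contact-microphone audio it produces. The robot and audio interfaces (execute_velocity_primitive, start_recording, stop_recording) and the parameter bounds are hypothetical placeholders, since the exact control and recording stack is not specified here.

```python
# Sketch of self-supervised audio-behavior data collection. The robot/microphone
# interfaces are hypothetical placeholders for the real UR10 joint-velocity controller
# and contact-microphone capture.
import numpy as np

def sample_primitive_params(rng: np.random.Generator) -> np.ndarray:
    # Uniformly sample low-dimensional primitive parameters within safe (assumed) bounds.
    return rng.uniform(low=[0.5, 0.1, 0.05], high=[2.0, 0.4, 0.3])

def collect_dataset(robot, microphones, n_samples: int = 5000, seed: int = 0) -> list[dict]:
    rng = np.random.default_rng(seed)
    dataset = []
    for _ in range(n_samples):
        params = sample_primitive_params(rng)
        microphones.start_recording()             # begin multi-channel contact-mic capture
        robot.execute_velocity_primitive(params)  # stream the joint-velocity profile
        audio = microphones.stop_recording()      # (n_channels, n_audio_samples) array
        dataset.append({"params": params, "audio": audio})
    return dataset
```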