Learning on the Job: Self-Rewarding Offline-to-Online Finetuning for
Industrial Insertion of Novel Connectors from Vision
Ashvin Nair1, Brian Zhu1,2, Gokul Narayanan2, Eugen Solowjow2, Sergey Levine1
Abstract: Learning-based methods in robotics hold the
promise of generalization, but what can be done if a learned
policy does not generalize to a new situation? In principle,
if an agent can at least evaluate its own success (i.e., with a
reward classifier that generalizes well even when the policy
does not), it could actively practice the task and finetune the
policy in this situation. We study this problem in the setting
of industrial insertion tasks, such as inserting connectors in
sockets and setting screws. Existing algorithms rely on precise
localization of the connector or socket and carefully managed
physical setups, such as assembly lines, to succeed at the task.
But in unstructured environments such as homes or even some
industrial settings, robots cannot rely on precise localization
and may be tasked with previously unseen connectors. Offline
reinforcement learning on a variety of connector insertion tasks
is a potential solution, but what if the robot is tasked with inserting a previously unseen connector? In such a scenario, we
will still need methods that can robustly solve such tasks with
online practice. One of the main observations we make in this
work is that, with a suitable representation learning and domain
generalization approach, it can be significantly easier for the
reward function to generalize to a new but structurally similar
task (e.g., inserting a new type of connector) than for the policy.
This means that a learned reward function can be used to facil-
itate the finetuning of the robot’s policy in situations where the
policy fails to generalize in zero shot, but the reward function
generalizes successfully. We show that such an approach can
be instantiated in the real world, pretrained on 50 different
connectors, and successfully finetuned to new connectors via
the learned reward function. Videos and visualizations can be
viewed at sites.google.com/view/learningonthejob
I. INTRODUCTION
Generalizable policies require broad and diverse datasets,
but for realistic applications, learning policies that can always
generalize in zero shot to new objects and environments is
often infeasible – indeed, even humans do not exhibit such
universal generalization capabilities. Instead, when faced
with a task that we don’t precisely know how to do, we can
quickly learn the task by leveraging our prior knowledge
and a little bit of practice. Reinforcement learning (RL)
provides us with a way to implement this kind of learning
on the job, using online finetuning in the new domain or
task, and potentially even extending it into a lifelong learning
system where the robot improves its generalization capacity
continually with each new task it masters.
However, instantiating this concept in a practical robotics
setting requires overcoming a number of obstacles. The robot
must be able to combine large amounts of diverse offline data
with small amounts of targeted online experience, and do so
First two authors contributed equally. 1University of California, Berkeley. 2Siemens Corporation. Correspondence: anair17@berkeley.edu
in a way that doesn’t require revisiting previously learned
tasks or domains, which means that we need an offline
RL algorithm that supports online finetuning. Perhaps more
importantly, the entire finetuning process must be supported
by the robot’s own sensors, without privileged information
or environment instrumentation, so as to retain the benefits
of autonomous learning. In particular, this means that when
adapting to a new task, the robot must be able to evaluate
on its own whether it is making progress on the task, using
a learned reward function.
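To make this concrete, the sketch below shows one way such a self-rewarding practice loop could be written. It is a minimal illustration with assumed interfaces (the names `env`, `policy`, `reward_classifier`, `replay_buffer`, `agent`, and the 0.5 success threshold are ours, not the paper's): every online transition is labeled by the learned success classifier rather than by an instrumented reward.

```python
# Minimal sketch of self-rewarded online practice (assumed interfaces, not the
# paper's code): the robot labels its own transitions with a learned success
# classifier and finetunes on that self-generated reward signal.
import torch


def practice_episode(env, policy, reward_classifier, replay_buffer, max_steps=50):
    """Run one autonomous practice episode, labeling rewards from vision."""
    obs = env.reset()
    for _ in range(max_steps):
        with torch.no_grad():
            action = policy.act(obs)                     # sample a ~ pi(a | s)
            next_obs = env.step(action)
            # Self-evaluation: estimated probability of a successful insertion.
            reward = reward_classifier(next_obs).item()
        replay_buffer.add(obs, action, reward, next_obs)
        if reward > 0.5:                                 # assumed success threshold
            break                                        # stop once insertion looks done
        obs = next_obs


def finetune_on_the_job(env, policy, reward_classifier, replay_buffer, agent,
                        n_episodes=200):
    """Offline-to-online finetuning driven only by self-generated rewards."""
    for _ in range(n_episodes):
        practice_episode(env, policy, reward_classifier, replay_buffer)
        agent.update(replay_buffer.sample(batch_size=256))  # e.g. IQL update steps
```

The important property is that nothing in this loop requires ground-truth success labels or socket instrumentation; only the reward classifier has to generalize to the new connector.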
We study this problem in the setting of learning a policy
from vision for performing industrial insertion tasks. This
family of assembly tasks, which includes plugging connectors into sockets, keys into locks, screwdrivers into screw intrusions, setting screws, and so on, is found in many stages of manufacturing. When automated in factories today, these
tasks are done by robots with specialized control algorithms
that rely on precise localization of the socket location.
[Fig. 1 graphic: a grid of the 50 training connectors (Type A variants, RCA, USB, VGA, DisplayPort, HDMI, Ethernet, NEMA plugs, keys, and others) and four panels: 1. Offline Dataset (50 domains, D_train); 2. Offline Training; 3. Self-Supervised Finetuning (D_test); 4. Evaluation. The training panel shows a ResNet-18 encoder producing latents z_I and z_S (with adversarial loss L_adv(d, d̂)) that feed the reward R_ψ(s), policy π(a|s), and Q, V heads; the finetuning panel rolls out π(a|s), labels transitions (s, a, r̂, s') with r̂ = R_ψ(s), and updates π, Q, V.]
Fig. 1: We describe how a robot can learn to insert an unseen connector from
prior experience under realistic conditions by evaluating its own reward. (1)
We first collect a diverse dataset with 50 connectors. There is significant
variation in the shape of the connectors, sockets, and background. (2) We
learn a reward function, policy, and value functions using offline RL. We
propose a domain adversarial information bottleneck (DAIB) in order to
generalize to new domains. A domain invariance loss is applied on part
of the latent representation z_I, while a domain-specific latent variable z_S is constrained by an information bottleneck. (3) We finetune online with self-generated rewards to master a new domain. (4) We evaluate π in D_test.
For robots to perform these insertion tasks in industrial
and warehouse settings with less human supervision, or in
unstructured environments such as homes, they must rely on
highly accurate state information of the external world (e.g.,
socket position and in-hand pose estimation). But such state
estimation, using either machine learning or computer vision
approaches, is brittle for unseen connectors. To solve the general problem of inserting a novel connector, one promising approach is to generalize from previously collected connector-insertion experience and learn a policy that inserts connectors from vision. Among these tasks, there is enough variability
to require generalization and adaptation, but also enough
internal structural regularity that we expect transfer between
connectors. We first collected a large offline dataset of insertion data for 50 connectors across 2 robots and diverse backgrounds, containing actions, images, and sparse reward labels.
Offline RL on this data alone generalizes to connectors very similar to those in the training dataset, but we also expect robots to be able to perform tasks in new domains, perhaps after some practice. How can a robot insert test connectors from vision in this setting, using offline RL on prior data to enable active online finetuning on a new connector?
The key insight is that we need to (1) adapt to new tasks
quickly with online finetuning if the zero-shot solution is not
sufficient and (2) generalize to new domains by finding com-
mon structure between domains while preserving important
domain-specific information. Ideally, a policy trained offline
can generalize from vision to new tasks. But if it does not, we
can still finetune in a new domain with minimal supervision
as long as we have a reward function that generalizes instead.
For training policies and reward functions that generalize to
test domains, we propose a split representation that combines
domain adversarial neural networks [1] for domain invariance
and a variational information bottleneck [2] for controlling
the flow of domain-specific information. This representation,
which we call the domain adversarial information bottleneck (DAIB), is used first to learn a robust reward function that detects successful insertions for an unseen connector. Next,
we modify implicit Q-learning (IQL), an offline RL algo-
rithm amenable to online finetuning, to use DAIB. During
online finetuning, DAIB can be used in combination with
online RL to enable fast learning of novel connectors.
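The PyTorch sketch below illustrates the general shape of such a split representation. It is our own illustrative reconstruction from the description above and Fig. 1, not the authors' implementation; the module names, latent dimensions, and loss weighting are assumptions. Image features (e.g., from the ResNet-18 encoder shown in Fig. 1) are split into a domain-invariant latent z_I, trained adversarially against a domain classifier through a gradient-reversal layer, and a domain-specific Gaussian latent z_S, penalized with a KL information bottleneck.

```python
# Illustrative DAIB-style split representation (a sketch, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output.neg()


class DAIB(nn.Module):
    def __init__(self, feat_dim=512, z_inv=32, z_spec=8, n_domains=50):
        super().__init__()
        self.inv_head = nn.Linear(feat_dim, z_inv)        # domain-invariant z_I
        self.spec_head = nn.Linear(feat_dim, 2 * z_spec)   # mean / log-var of z_S
        self.domain_clf = nn.Linear(z_inv, n_domains)      # adversary on z_I

    def forward(self, features, domain_labels=None):
        z_i = self.inv_head(features)
        mu, log_var = self.spec_head(features).chunk(2, dim=-1)
        z_s = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterize

        # Information bottleneck: KL(q(z_S | s) || N(0, I)) limits how much
        # domain-specific information can flow through z_S.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1).mean()

        # Domain-adversarial loss: the gradient-reversal layer trains the encoder
        # to fool this classifier, pushing z_I toward domain invariance.
        adv = torch.tensor(0.0)
        if domain_labels is not None:
            logits = self.domain_clf(GradReverse.apply(z_i))
            adv = F.cross_entropy(logits, domain_labels)

        z = torch.cat([z_i, z_s], dim=-1)   # consumed by reward, policy, Q, V heads
        return z, kl, adv
```

A downstream reward classifier, policy, or Q/V head would then consume the concatenated latent z, with the total objective adding the two regularizers to the task loss, e.g. L_task + λ_adv·adv + β·kl, where λ_adv and β are treated here as assumed hyperparameters.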
We present two main contributions. We demonstrate a
system for finetuning under realistic real-world constraints
with minimal human supervision, and apply it to insert connectors robustly from vision without the need for accurate socket localization, both for observations and rewards. To
accomplish this, we propose a novel representation learn-
ing method that allows better generalization of policies
and reward functions to unseen domains. We outperform
regression-based baselines on the same dataset that combine
localizing the socket with hand-designed control policies, as
well as prior RL methods. We show that the policy can be finetuned to new tasks within 200 trials (about 50 minutes of real-world interaction), given our off-policy dataset of 70,000 trajectories from 50 prior domains. This system allows us
to finetune IQL to a test connector, increasing performance
significantly over the offline performance. Project videos
and our dataset of robotic insertion will be made public at
sites.google.com/view/learningonthejob
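For readers unfamiliar with IQL, the sketch below spells out the standard IQL losses (expectile regression for V, TD learning for Q, and advantage-weighted policy extraction) with the self-generated reward r̂ plugged in where an instrumented reward would otherwise appear. It follows the generic formulation of Kostrikov et al. rather than the authors' modified variant, and the interfaces and hyperparameters (`q_net`, `v_net`, `policy.log_prob`, τ, β) are assumptions; termination flags and target networks are omitted for brevity.

```python
# Sketch of standard IQL losses with a self-generated reward r_hat (assumed
# interfaces; not the authors' modified implementation). In this paper's setup
# the networks would consume the DAIB representation of the image observation.
import torch
import torch.nn.functional as F


def expectile_loss(diff, tau=0.7):
    """Asymmetric L2 that fits V toward an upper expectile of Q."""
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def iql_losses(q_net, v_net, policy, batch, gamma=0.99, beta=3.0):
    s, a, r_hat, s_next = batch                  # r_hat from the learned reward classifier
    with torch.no_grad():
        q_val = q_net(s, a)                      # a target network would be used in practice
        q_target = r_hat + gamma * v_net(s_next)
    v_loss = expectile_loss(q_val - v_net(s))    # V regressed toward an expectile of Q
    q_loss = F.mse_loss(q_net(s, a), q_target)   # Q regressed to r_hat + gamma * V(s')
    adv = q_val - v_net(s).detach()
    weights = torch.clamp(torch.exp(beta * adv), max=100.0)
    pi_loss = -(weights * policy.log_prob(s, a)).mean()   # advantage-weighted regression
    return v_loss, q_loss, pi_loss
```

During online finetuning, the same gradient steps would simply continue on the robot's self-labeled online experience, optionally mixed with the offline data.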
II. RELATED WORK
Reinforcement learning has been applied to a variety of
robotics tasks [3]–[11]. To utilize offline datasets with diverse
data in robotics, algorithms developed for offline RL [12]–
[15] have been studied in the robotics setting [16]–[20]. A
subset of offline RL algorithms are amenable to finetuning
[14], [21]–[25]. Our work builds on the direction of offline
pretraining followed by online finetuning in robotics. But
beyond this line of work, we focus on finetuning from visual
input in realistic settings with multiple domains and without
ground truth reward functions for the new task.
In this respect, our work is closest to prior work on
self-supervised RL that does not assume an external reward
function and instead learns it from data. One class of self-
supervised RL methods uses goal-conditioned RL with self-
supervised rewards [26]–[36]. While general, this class of
methods is a poor fit for industrial insertion, as high precision
is required in both the policy and in evaluating rewards. In-
stead, we train a domain generalizing reward classifier from
prior data. Prior methods have used learned rewards [37], and classifier rewards have previously been proposed as a scalable solution for robotics tasks [38], [39]. However, learned rewards have not previously been shown to be useful for finetuning in novel real-world robotic domains. Because we
focus on applying offline RL and finetuning from vision in
the industrial insertion setting, domain generalization of the
reward function is vital for our method to work in practice.
Many aspects of robotic insertion, or peg-in-hole assembly, have been studied in prior work [40]–[45], often utilizing
geometry and dynamic analysis, force control, tactile sens-
ing, and search, but these methods can be brittle to state
estimation errors. Learning-based methods, including RL,
have also been applied, usually for a single connector from
ground-truth state information [46]–[48]. In these cases, the
RL algorithm must learn to navigate the specific dynamics of
the single connector, but does not generalize across connec-
tors. More recent work has considered using meta-learning
to generalize and improve few-shot between domains [49].
Zhao et al. use offline RL and finetuning combined with
meta-learning to adapt to a new connector [50]. This work
assumes a known position of the socket and consistent
grasping of the connector, and is robust to a small amount
(±1mm) of noise. With known socket position and small
error, the learning algorithm can learn a structured noise
or exploration strategy that can overcome these errors. In
contrast, we initialize connectors within ±20mm of the
socket (20× the variance), which requires the robot to rely on
visual feedback since blind exploration will rarely succeed.
Closest to our work is prior work that also uses pixel input
for robotic insertion. Luo et al. incorporate vision alongside
proprioception, using a VAE to embed pixel input [51].
InsertionNet uses a vision system to localize the object and
socket, operating on a "residual policy" which is learned