2026-02-09
link: https://arxiv.org/pdf/2502.17894
Summary: This work proposes a sim2real framework for training a fetching policy that must reason about collisions under partial observability. The method combines a Unified Voxel-based Scene Generator for generating simulation scenes, depth prediction to bridge the sim2real gap, and an occupancy prediction task used during training. It reaches a 90% zero-shot success rate in sim2real transfer.
Methodology
The FetchBot framework has three parts: scene generation, oracle policy training, and a vision-based model that relies on depth prediction and an occupancy prediction task.
UniVoxGen
For scene generation, the work uses a voxel-based method built on four efficient voxel operations: union (add an object to the scene), intersection (detect voxel collisions), difference (remove objects), and transformation (alter object poses). The generated scenes serve both as the dataset for the fetching task and as the ground truth for the occupancy prediction task.
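To make the four operations concrete, here is a minimal sketch that represents an object as a set of integer voxel coordinates. This representation and the helper names (`box`, `translate`, and so on) are assumptions made for illustration, not the paper's implementation, and the transformation is restricted to integer translation for brevity.

```python
from typing import FrozenSet, Tuple

Voxel = Tuple[int, int, int]
VoxelSet = FrozenSet[Voxel]


def box(origin: Voxel, size: Voxel) -> VoxelSet:
    """Hypothetical helper: an axis-aligned block of occupied voxels."""
    ox, oy, oz = origin
    sx, sy, sz = size
    return frozenset(
        (ox + x, oy + y, oz + z)
        for x in range(sx)
        for y in range(sy)
        for z in range(sz)
    )


def union(scene: VoxelSet, obj: VoxelSet) -> VoxelSet:
    """Add an object to the scene."""
    return scene | obj


def intersection(a: VoxelSet, b: VoxelSet) -> VoxelSet:
    """Voxels shared by both sets; non-empty means a collision."""
    return a & b


def difference(scene: VoxelSet, obj: VoxelSet) -> VoxelSet:
    """Remove an object from the scene."""
    return scene - obj


def translate(obj: VoxelSet, offset: Voxel) -> VoxelSet:
    """Pose change limited to integer translation (rotation omitted here)."""
    dx, dy, dz = offset
    return frozenset((x + dx, y + dy, z + dz) for x, y, z in obj)
```

With sets, each operation is a single built-in call, and the same occupied-voxel set can be reused directly as an occupancy label.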
Scene generation follows one basic step: whenever an object is added, its pose is resampled until it no longer collides with the objects already placed. Additional rules are applied on top of this to keep the scenes feasible in real life.
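The sampling step can be sketched as rejection sampling. The version below is a simplified illustration under stated assumptions: objects are axis-aligned boxes resting on the floor of a voxel grid, poses are integer translations only, and `max_tries` is a hypothetical retry budget. The additional feasibility rules mentioned in the paper are not modeled.

```python
import random
from typing import List, Set, Tuple

Voxel = Tuple[int, int, int]


def box_at(offset: Voxel, size: Voxel) -> Set[Voxel]:
    """Occupied voxels of a box of the given size placed at offset."""
    ox, oy, oz = offset
    sx, sy, sz = size
    return {
        (ox + x, oy + y, oz + z)
        for x in range(sx)
        for y in range(sy)
        for z in range(sz)
    }


def generate_scene(
    sizes: List[Voxel],
    grid: Voxel,
    max_tries: int = 100,
    seed: int = 0,
) -> List[Set[Voxel]]:
    """Place boxes one by one, resampling each pose until it is collision-free."""
    rng = random.Random(seed)
    occupied: Set[Voxel] = set()
    placed: List[Set[Voxel]] = []
    for size in sizes:
        for _ in range(max_tries):
            # Sample a position on the floor that keeps the box inside the grid.
            offset = (
                rng.randint(0, grid[0] - size[0]),
                rng.randint(0, grid[1] - size[1]),
                0,
            )
            candidate = box_at(offset, size)
            if not (candidate & occupied):  # intersection test
                occupied |= candidate  # union into the scene
                placed.append(candidate)
                break
        # An object that never fits within max_tries is skipped.
    return placed
```

Calling `generate_scene([(2, 2, 2), (3, 1, 2), (1, 1, 1)], grid=(8, 8, 4))` returns one voxel set per placed object, with no two sets overlapping.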
Oracle Policy
The observation of the oracle policy consists of proprioception (end-effector pose), the previous action, and a scene representation. The scene representation comes from an encoder that first encodes each object locally and then aggregates globally, similar to PointNet++.
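The local-then-global idea can be illustrated with two levels of max pooling: points are pooled into one feature per object, and object features are pooled into one scene feature. This is only a toy sketch, not the paper's encoder or a full PointNet++; the fixed weight matrix `W` is a hypothetical stand-in for a learned per-point MLP.

```python
from typing import List, Sequence

Point = Sequence[float]

# Hypothetical fixed weights standing in for a learned per-point layer (3 -> 4).
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, -1.0, -1.0],
]


def point_feature(p: Point) -> List[float]:
    """Linear layer followed by ReLU, applied to a single point."""
    return [max(0.0, sum(w * x for w, x in zip(row, p))) for row in W]


def max_pool(features: List[List[float]]) -> List[float]:
    """Channel-wise max, which is invariant to the order of its inputs."""
    return [max(channel) for channel in zip(*features)]


def encode_object(points: List[Point]) -> List[float]:
    """Local stage: pool point features within one object."""
    return max_pool([point_feature(p) for p in points])


def encode_scene(objects: List[List[Point]]) -> List[float]:
    """Global stage: pool object features across the scene."""
    return max_pool([encode_object(obj) for obj in objects])
```

Because both stages use max pooling, the scene feature does not depend on the order of points within an object or of objects within the scene.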
The reinforcement learning reward consists of a task completion reward, a behavior change constraint, and an environment change penalty. The impact on the environment is quantified by the total rotation and the total translation of all obstacles. The task reward is granted only if both totals stay under their respective thresholds.
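A minimal sketch of how such a reward could be assembled is given below. The weights (`w_task`, `w_env`, `w_smooth`), the reading of the behavior change constraint as a penalty on the change in action, and the use of `<=` so that a zero threshold remains satisfiable are all assumptions made here, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ObstacleMotion:
    translation: float  # distance the obstacle moved (e.g. metres)
    rotation: float  # angle the obstacle rotated (e.g. radians)


def disturbance(motions: Sequence[ObstacleMotion]) -> Tuple[float, float]:
    """Total translation and total rotation over all obstacles."""
    total_t = sum(m.translation for m in motions)
    total_r = sum(m.rotation for m in motions)
    return total_t, total_r


def step_reward(
    target_fetched: bool,
    motions: Sequence[ObstacleMotion],
    action_delta: float,
    trans_thresh: float,
    rot_thresh: float,
    w_task: float = 1.0,  # hypothetical weight
    w_env: float = 0.1,  # hypothetical weight
    w_smooth: float = 0.01,  # hypothetical weight
) -> float:
    total_t, total_r = disturbance(motions)
    # Task reward only counts if both totals respect their thresholds.
    within = total_t <= trans_thresh and total_r <= rot_thresh
    task = w_task if (target_fetched and within) else 0.0
    env_penalty = w_env * (total_t + total_r)
    smooth_penalty = w_smooth * action_delta
    return task - env_penalty - smooth_penalty
```

Gating the task term on the thresholds means a successful fetch that pushes obstacles too far earns no completion reward, while the penalty term still discourages smaller disturbances.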
The work uses curriculum training over these thresholds: the policy is first trained with a high threshold, then with a lower one, and finally with a zero threshold. Notably, training failed with a locomotion-style scene-based curriculum, in which each task is assigned a difficulty level. The insight in the paper is that knowledge gained on lower-difficulty tasks barely transfers to higher-difficulty ones.
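The threshold curriculum can be pictured as a short stage list that is tightened over time. In the sketch below, the threshold values, the promotion rule based on success rate, and the `promote_at` bar are all hypothetical; the notes above only record the ordering from high to low to zero.

```python
from typing import List, Tuple

# Hypothetical (translation, rotation) thresholds, from loose to zero.
STAGES: List[Tuple[float, float]] = [(0.10, 0.50), (0.02, 0.10), (0.0, 0.0)]


def advance(stage: int, success_rate: float, promote_at: float = 0.8) -> int:
    """Move to the next, stricter stage once the policy succeeds often enough."""
    if success_rate >= promote_at and stage < len(STAGES) - 1:
        return stage + 1
    return stage


def run_schedule(success_rates: List[float]) -> List[Tuple[float, float]]:
    """Thresholds used in each evaluation round, given measured success rates."""
    stage = 0
    history = []
    for rate in success_rates:
        history.append(STAGES[stage])
        stage = advance(stage, rate)
    return history
```

Note that only the thresholds change between stages; the scenes themselves are drawn from the same distribution throughout, which is what distinguishes this from a scene-based curriculum.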
Vision-based policy
The first problem to solve is depth estimation. The work uses DepthAnything to predict depth, which ensures that the depth inputs in simulation and in the real world lie in the same space.
The more complex part is occupancy prediction. Occupancy is predicted from the ResNet-50 feature map and the DepthAnything depth map through a deformable cross-attention mechanism:
- Initialize a 3D query grid of shape \(C\times H\times W\times Z\).
- For each 3D point and each camera, compute the point's 2D projection using the camera parameters.
- Apply deformable attention. The result for a 3D point is the average of the deformable attention outputs for its query over all cameras in which the point projects inside the frame.
- In addition, for the 3D volume near the end effector, semantic occupancy (including object classification) is predicted.
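The projection and per-camera averaging in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions: a pinhole camera model with world-to-camera extrinsics, single-channel feature maps, and nearest-pixel sampling swapped in for deformable attention (no learned offsets or attention weights). The `Camera` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Camera:
    R: Sequence[Sequence[float]]  # 3x3 world-to-camera rotation
    t: Vec3  # world-to-camera translation
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def project(cam: Camera, p: Vec3) -> Optional[Tuple[float, float]]:
    """Pinhole projection; None if the point is behind the camera or off-frame."""
    xc = [sum(cam.R[i][j] * p[j] for j in range(3)) + cam.t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    u = cam.fx * xc[0] / xc[2] + cam.cx
    v = cam.fy * xc[1] / xc[2] + cam.cy
    if 0 <= u < cam.width and 0 <= v < cam.height:
        return (u, v)
    return None


def aggregate(
    point: Vec3,
    cams: List[Camera],
    feature_maps: List[List[List[float]]],
) -> float:
    """Average the sampled feature over the cameras that actually see the point."""
    samples = []
    for cam, fmap in zip(cams, feature_maps):
        uv = project(cam, point)
        if uv is None:
            continue  # camera does not see this 3D point
        col, row = int(uv[0]), int(uv[1])
        samples.append(fmap[row][col])  # nearest-pixel stand-in for attention
    return sum(samples) / len(samples) if samples else 0.0
```

Averaging only over cameras with a valid projection keeps a point that is visible in one view from being diluted by views in which it falls outside the image.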
FetchBot uses a multi-stage training framework. The vision encoder is trained first, using the occupancy prediction task and scene-class affinity losses; the diffusion-based policy is then trained with the vision encoder frozen.
Result and limitation
In the reported results, FetchBot excels at fetching without disturbing the environment. Notably, sensor noise and the limitations of depth sensors limit the performance of point-cloud-based models.
The limitations listed in the paper are: in scenarios with significant occlusion, the required motion becomes so complex that the arm exceeds its joint limits; dual arms are needed for heavier objects; and the policy fails when the target is initially unreachable, so that other objects would need to be moved away and then restored.
Further notes:
- The voxel-based generation doesn't cover scenes where two objects physically lean on each other. The lack of such layouts may cause an out-of-distribution problem if a temporary failure during execution leaves objects in such a state. A better closed-loop policy should be able to recover from this state and restore the scene.
- The demos seem to include only rigid objects. Objects with complex dynamics are not included in the scenes, and fetching them appears challenging for FetchBot under the current framework.
- Reasoning and language instructions could be added to the framework to make the closed-loop policy more robust, with the aim of fetching initially unreachable objects.