Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets
link: https://arxiv.org/pdf/2601.09605
Image translation: we have many simulated trajectories, and use image translation to turn each simulated frame into a realistic-looking (sim2real) image.
MANGO aims to solve two problems of traditional methods:
- Diffusion-based translation is too slow. MANGO uses a GAN instead.
- Existing methods fail to generalize to different viewpoints when the real target domain only contains fixed-viewpoint images.
Method: pass the simulated image through an encoder-decoder model to get a sim2real translated image, trained with the following losses.
- segNCE loss: use the ground-truth segmentation so that a pixel's feature is pulled toward the features of other pixels in the same segmentation class.
- PatchNCE loss: encode the translated image again and compare its features with those of the input image.
- GAN loss: computed against the fixed-viewpoint real images.
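To make the two contrastive losses concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: these notes do not record the exact formulation, so the function names `seg_nce` and `patch_nce`, the use of cosine similarity, and the temperature `tau=0.07` are all assumptions made for illustration. The segNCE-style loss treats pixels that share a segmentation label as positives. The PatchNCE-style loss treats the same location in the input image as the positive for each translated-image feature.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def seg_nce(features, labels, tau=0.07):
    """Hypothetical segNCE-style loss (assumed form, not taken from the paper).

    For each anchor pixel, pixels with the same segmentation label are
    positives and all remaining pixels are negatives.
    """
    total, count = 0.0, 0
    n = len(features)
    for i in range(n):
        logits = {j: cosine(features[i], features[j]) / tau
                  for j in range(n) if j != i}
        positives = [j for j in logits if labels[j] == labels[i]]
        if not positives:
            continue  # anchor has no same-class partner, skip it
        log_denom = log_sum_exp(list(logits.values()))
        # Mean negative log-softmax probability over the positives.
        total += sum(log_denom - logits[j] for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0

def patch_nce(src_feats, out_feats, tau=0.07):
    """PatchNCE-style loss: out_feats[i] should match src_feats[i],
    with the other source locations acting as negatives."""
    n = len(src_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(out_feats[i], src_feats[j]) / tau for j in range(n)]
        total += log_sum_exp(logits) - logits[i]
    return total / n

# Toy pixel features: two pixels near [1, 0], two near [0, 1].
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = seg_nce(feats, [0, 0, 1, 1])      # labels agree with the clusters
misaligned = seg_nce(feats, [0, 1, 0, 1])   # labels cut across the clusters
same = patch_nce(feats, feats)              # content preserved per location
scrambled = patch_nce(feats, feats[::-1])   # content moved to other locations
```

On this toy input the loss is lower when the features agree with the segmentation labels (`aligned` versus `misaligned`) and when the translated features stay at the same locations as the input (`same` versus `scrambled`). A real implementation would compute both losses on batched encoder feature maps in a deep learning framework rather than on Python lists.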
Result: with a 35M-parameter ACT policy, MANGO performs roughly on par with the 4.5B-parameter VISTA (another viewpoint augmentation method), while being much faster.