Sim2real Image Translation Enables Viewpoint-Robust Policies from Fixed-Camera Datasets

link: https://arxiv.org/pdf/2601.09605

Image translation: we have many simulated trajectories, and use image translation to convert each simulated image into a realistic-looking (sim2real) image.

MANGO aims to solve two problems of traditional methods:

  • Diffusion-based translation is too slow. MANGO uses a GAN instead.
  • They fail to generalize to different viewpoints when the real target domain only contains fixed-viewpoint images.

Method: Pass the simulated image through an encoder-decoder model to get a sim2real translated image, trained with the following losses.

  • Use ground-truth segmentation to compute a segNCE loss, which makes each pixel feature similar to the features of other pixels in the same segmentation class.

  • Encode the translated image again to compute a PatchNCE loss.

  • Apply a GAN loss against fixed-viewpoint real images.
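
To make the two contrastive terms concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names (`seg_nce`, `patch_nce`), the temperature `tau=0.07`, and the exact choice of positives and negatives are assumptions for illustration. `patch_nce` follows the usual PatchNCE idea (the re-encoded output feature at location i should match the input feature at the same location i and differ from input features at other locations), and `seg_nce` treats pixels that share a segmentation label as positives, in the style of a supervised contrastive loss.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # L2-normalize so that dot products are cosine similarities.
    n = math.sqrt(dot(v, v)) or 1.0
    return [a / n for a in v]

def log_sum_exp(xs):
    # Numerically stable log(sum(exp(x))).
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def patch_nce(src_feats, out_feats, tau=0.07):
    """PatchNCE-style loss (assumed form).

    src_feats[i]: encoder feature of the input image at location i.
    out_feats[i]: encoder feature of the translated image at location i.
    Positive pair: same location. Negatives: all other source locations.
    """
    src = [normalize(f) for f in src_feats]
    out = [normalize(f) for f in out_feats]
    total = 0.0
    for i, q in enumerate(out):
        logits = [dot(q, k) / tau for k in src]
        # Cross-entropy with the same-location source feature as the target.
        total += log_sum_exp(logits) - logits[i]
    return total / len(out)

def seg_nce(feats, labels, tau=0.07):
    """Segmentation-guided contrastive loss (assumed form).

    feats[i]: pixel feature, labels[i]: its ground-truth segmentation class.
    Positives for pixel i: other pixels with the same label.
    """
    f = [normalize(x) for x in feats]
    total, count = 0.0, 0
    for i, q in enumerate(f):
        others = [j for j in range(len(f)) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # a class with a single pixel has no positive pair
        logits = {j: dot(q, f[j]) / tau for j in others}
        denom = log_sum_exp(list(logits.values()))
        total += sum(denom - logits[j] for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0

# Toy example: four 2-D "pixel features" forming two clusters.
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
clustered_labels = [0, 0, 1, 1]  # labels agree with the feature clusters
mixed_labels = [0, 1, 0, 1]      # labels cut across the feature clusters

seg_good = seg_nce(feats, clustered_labels)
seg_bad = seg_nce(feats, mixed_labels)
nce_aligned = patch_nce(feats, feats)
nce_shuffled = patch_nce(feats[::-1], feats)
```

In this toy setup the segNCE-style loss is lower when the labels agree with the feature clusters, and the PatchNCE-style loss is lower when output features line up with the source features at the same location. In a full training loop these terms would be added to the GAN loss with some weighting; the weights are not recorded in these notes.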

Result: with a 35M-parameter ACT policy, MANGO reaches performance comparable to the 4.5B-parameter VISTA (another viewpoint augmentation method), while being much faster.