Sim2real Image Translation Enables Viewpoint Robust Policies from Fixed-Camera Datasets

link: https://arxiv.org/pdf/2601.09605

Image translation: we have lots of simulated trajectories, and use image translation to turn each simulated image into a realistic (sim2real) image.

MANGO aims to fix the problems of prior methods:

  • Diffusion-based translation is too slow; MANGO uses a GAN instead.
  • Prior approaches fail to generalize to viewpoints beyond the fixed viewpoint of the real (target) domain.

Methodology: pass the simulated image through an encoder-decoder model to get a sim2real translated image, trained with the following losses.

  • Use ground-truth segmentation to compute a segNCE loss, which pulls each pixel feature toward the features of other pixels in the same segmentation class (a sketch follows this list).

  • Encode the translated image again to compute a PatchNCE loss (a patch-wise contrastive loss against the input features).

  • An adversarial (GAN) loss against fixed-viewpoint real images.
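
To make the segNCE idea concrete, here is a minimal sketch of a segmentation-guided InfoNCE loss, assuming per-pixel features and a ground-truth class id per sampled pixel; this is my interpretation, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def segnce_loss(feats, seg, tau=0.07, n_anchors=256):
    """Pull pixel features toward other pixels of the same segmentation class.

    feats: (N, C) pixel features sampled from the encoder
    seg:   (N,)   ground-truth segmentation class id per sampled pixel
    """
    feats = F.normalize(feats, dim=1)
    idx = torch.randperm(feats.size(0))[:n_anchors]           # random anchor pixels
    anchors, anchor_cls = feats[idx], seg[idx]
    logits = anchors @ feats.t() / tau                         # (A, N) cosine similarities
    pos_mask = (anchor_cls[:, None] == seg[None, :]).float()   # same-class pixels are positives
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()
```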

Result: with a 35M-parameter ACT policy, MANGO achieves performance roughly comparable to the 4.5B-parameter VISTA (another viewpoint-augmentation method), while being much faster.

* ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

link: https://arxiv.org/pdf/2601.08325

Problem: current VLA models only have static vision perception, which might fail when occlusion occurs.

ActiveVLA proposes a framework that actively adjusts the camera viewpoint and zooms in to get higher resolution on critical regions.

Methodology: use a 2-stage strategy.

  • First (coarse) stage: reconstruct a point cloud from the RGB-D images, render orthographic projections, and feed them into a VLM (PaliGemma). An up-sampler then predicts the crucial point and region on each 2D projection, which are lifted back to 3D.

  • After obtaining the crucial point in 3D, select a viewpoint and a zoom-in factor; the resulting view is fed into PaliGemma to get image tokens, which are then passed to the action head to produce the final action output.

    Viewpoint selection considers three criteria: visibility, camera distance, and viewpoint diversity.

    Action prediction works by predicting the grasp position on the 2D image with an up-sampler (sharing weights with the first stage's up-sampler), then lifting it to a 3D translation. For rotation, each Euler angle is discretized into 72 bins and predicted as a classification (see the sketch after this list).
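
As a rough illustration of the rotation head, assuming angles in degrees with a [-180, 180) convention and uniform 5° bins (only the 72-bin count is from the paper; the range convention is my assumption):

```python
import numpy as np

N_BINS = 72                        # 360 deg / 72 bins = 5 deg per bin
BIN_SIZE = 360.0 / N_BINS

def euler_to_bins(euler_deg):
    """Map each Euler angle in degrees to one of 72 classification bins."""
    wrapped = np.mod(np.asarray(euler_deg) + 180.0, 360.0)         # -> [0, 360)
    return np.minimum((wrapped / BIN_SIZE).astype(int), N_BINS - 1)

def bins_to_euler(bins):
    """Recover the bin-center angle in degrees from predicted class ids."""
    return (np.asarray(bins) + 0.5) * BIN_SIZE - 180.0

# e.g. classification targets for (roll, pitch, yaw) = (30, -45, 90) degrees
print(euler_to_bins([30.0, -45.0, 90.0]))    # -> [42 27 54]
```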

Evaluation:

  • Achieves a new SOTA on RLBench with a 91.8% success rate, performs well on precision-demanding tasks, and remains robust under occlusions.
  • COLOSSEUM: surpasses BridgeVLA in most categories to set a new SOTA, and holds up well under clutter, distractors, and viewpoint changes.
  • Real robot: claims a ~90% success rate in occlusion scenarios.

* StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

link: https://arxiv.org/pdf/2512.21970

Summary: This work proposes a Geometric-Semantic feature extraction module that uses existing models to extract and fuse semantic features from a VLM and geometric features from FoundationStereo, enabling higher-precision grasping that requires better geometric understanding.

Methodology:

  • For the left view, extract image features using DINOv2 (capturing fine details) and SigLIP (capturing high-level semantics).
  • Feed both views into FoundationStereo and use its filtered cost volume \(V_c'\), since it provides dense geometric features.
  • Rather than concatenating token sequences, stack the feature maps along the channel dimension and project them to obtain the geometric-semantic feature (a sketch follows this list).
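
A minimal sketch of the channel-stack-and-project fusion, assuming both feature maps are already aligned to the same spatial resolution; channel counts are placeholders, not the paper's dimensions.

```python
import torch
import torch.nn as nn

class GeoSemFusion(nn.Module):
    """Stack semantic and geometric feature maps along channels, then project."""

    def __init__(self, c_sem=1024, c_geo=256, c_out=1024):
        super().__init__()
        self.proj = nn.Conv2d(c_sem + c_geo, c_out, kernel_size=1)

    def forward(self, f_sem, f_geo):
        # f_sem: (B, c_sem, H, W) semantic features (DINOv2 + SigLIP side)
        # f_geo: (B, c_geo, H, W) geometric features from the filtered cost volume V_c'
        x = torch.cat([f_sem, f_geo], dim=1)   # channel-wise stacking
        return self.proj(x)                    # geometric-semantic feature
```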

After obtaining this feature, the rest of the pipeline is similar to pi-0.5.

One thing worth mentioning: they add an auxiliary training task of "estimating the depth of pixel (x, y)", where (x, y) is sampled from the object's bounding box (see the sketch below).
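
A small sketch of how such an auxiliary sample might be formed; the names and the exact question wording are my placeholders:

```python
import random

def sample_depth_query(bbox, depth_map):
    """Pick a pixel inside the object bounding box and pair the question with its depth."""
    x0, y0, x1, y1 = bbox                      # (left, top, right, bottom) in pixels
    x = random.randint(x0, x1 - 1)
    y = random.randint(y0, y1 - 1)
    question = f"Estimate the depth of pixel ({x}, {y})."
    target = float(depth_map[y, x])            # ground-truth depth at that pixel
    return question, target
```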

The work follows GraspVLA and relies heavily on GraspVLA-style synthetic data along with internet video data, since no existing dataset provides stereo vision.

Evaluation: for bar-like objects (pencils, etc.) GraspVLA already performs well; StereoVLA further reaches 80% on medium-sized objects and 30% on small objects that the current SOTA cannot pick up at all.

It is worth mentioning that other models often close the gripper too early, possibly due to the lack of precise spatial perception.

* TwinAligner: Visual-Dynamic Alignment Empowers Physics-aware Real2Sim2Real for Robotic Manipulation

link: https://arxiv.org/pdf/2512.19390

Summary: This work proposes a Real2Sim2Real system that achieves pixel-level visual alignment and dynamic consistency for both objects and robots, using 3DGS for appearance and parameter optimization driven by a "Control-Hit-Slide" procedure.

Methodology:

  • For pixel-level visual alignment, they first build the physical mesh, initialize the 3D Gaussian centers on its vertices, and then optimize the 3DGS with a color loss and a structure loss.

  • For articulated objects, an automated pipeline based on 3DOI separates the object into parts.

  • Dynamic consistency: in the "Control-Hit-Slide" process, the robot is controlled to move toward the object, hit it, and the resulting object trajectory is recorded.

    This process exposes several dynamics parameters: friction, mass, center of mass, and the robot controller parameters.

    The work defines the optimization objectives (the distance between sim and real joint-position vectors, and ADD / ADD-S losses on point-cloud-based object poses) and uses a gradient-free optimizer, so that non-differentiable simulators are supported, to fit these parameters (see the sketch after this list).

  • For policy learning, the work simply uses teleoperation in the simulation environment for data collection, and trains imitation-learning policies (Diffusion Policy and RISE).
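
Since the notes only say "gradient-free optimization", here is a minimal sketch of how the dynamics parameters could be fitted with a simple cross-entropy-method loop; `simulate_and_score` and the parameter layout are placeholders, not the authors' implementation.

```python
import numpy as np

def simulate_and_score(params):
    """Placeholder: run the (non-differentiable) simulator with these dynamics
    parameters (friction, mass, center of mass, controller gains) and return the
    combined joint-distance + ADD/ADD-S error against the real rollout."""
    raise NotImplementedError

def fit_dynamics_cem(init_mean, init_std, iters=20, pop=32, elite_frac=0.25):
    """Gradient-free parameter fitting via a cross-entropy-method loop."""
    mean = np.asarray(init_mean, dtype=float)
    std = np.asarray(init_std, dtype=float)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = mean + std * np.random.randn(pop, mean.size)
        scores = np.array([simulate_and_score(s) for s in samples])
        elites = samples[np.argsort(scores)[:n_elite]]   # keep the lowest-error samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```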

Result: the zero-shot sim2real policy (trained only in simulation) performs comparably to a real2real policy.

Limitation: human effort is still needed for the control-hit-slide data collection, the accuracy and speed of dynamic alignment are limited, and the approach does not handle deformable objects.

* On-the-Fly VLA Adaptation via Test-Time Reinforcement Learning

link: https://arxiv.org/pdf/2601.06748

Summary: This work proposes test-time RL for (non-diffusion) VLA models using value-free PPO and a dense reward.

Methodology:

  • The PPO objective consists of a clipped policy term \(\min\big(r_t(\theta)\hat A_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat A_t\big)\), a value loss, and an entropy regularization term. Because the limited test-time trajectories make the value function hard to learn, TT-VLA simply discards the value loss and the entropy term, leaving only the clipped policy objective.

  • Also, TT-VLA sets \(\lambda=\gamma=0\) in GAE, collapsing PPO into a one-step formulation.

  • Reward: TT-VLA estimates the task "progress" at each timestep from the past observations and the language instruction using the Vision-Language Action-Critic model (VLAC) (Zhai et al., 2025).

    The reward at each step is how much this progress estimate increases or decreases (see the sketch below).
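
A minimal sketch of the resulting value-free update, assuming a per-step progress estimate from VLAC; the advantage normalization and all names are my assumptions, not the released code.

```python
import torch

def value_free_ppo_loss(logp_new, logp_old, progress, clip_eps=0.2):
    """Clipped policy objective only: no value loss, no entropy bonus.

    logp_new / logp_old: (T,)   log-probs of the executed actions under the
                                current policy and the rollout (behavior) policy
    progress:            (T+1,) VLAC-style progress estimates, one per observation
    """
    reward = progress[1:] - progress[:-1]            # dense reward = change in progress
    adv = reward                                     # lambda = gamma = 0: one-step advantage
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)    # normalization (my assumption)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```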

Limitation: the PPO-based RL framework seems hard to port directly to diffusion-based policies. However, the dense reward seems a reasonable component to consider.

* VITA: Vision-to-Action Flow Matching Policy

link: https://arxiv.org/pdf/2507.13231

Summary: this work proposes a faster flow-matching policy that needs no cross-attention conditioning and instead flows directly from latent vision to latent actions.

Methodology:

  • Basic idea: use a ResNet-18 vision encoder to map RGB into a latent space; a flow then transports this vision latent into a latent action, which an auto-encoder's decoder maps back to an action.
  • However, several problems arise:
    • The action space has a much lower dimension than the vision space.
    • Pre-training the auto-encoder (AE) alone gives unreliable flow targets, since action data is limited and sparse.
    • Joint training suffers from latent-space collapse due to a training-inference gap: the \(z_1\) from the action encoder and the \(\hat z_1\) obtained by ODE solving come from different distributions.
  • These problems are addressed with a training framework combining the following objectives (a sketch follows this list):
    • An auto-encoder loss (\(L_1\) loss between the ground-truth action chunk and its reconstruction) and a flow-matching loss (MSE between the ground-truth velocity and the predicted velocity).
    • Flow Latent Decoding (FLD): the MSE between \(D(\hat z_1)\) and the ground-truth actions; and Flow Latent Consistency (FLC): the MSE between \(z_1\) and \(\hat z_1\). The two provide largely equivalent training signals, though combining them gives slightly better results.
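
A rough sketch of how the four objectives might fit together, assuming a straight-line flow from the vision latent to the action latent and an Euler ODE solver; detaching the encoder latent in FLC and the step count are my assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def vita_style_losses(encoder, decoder, velocity_net, z_v, actions, n_steps=5):
    """Combine AE, flow-matching, FLD and FLC losses (illustrative sketch).

    z_v:     (B, D)   latent vision features, used as the flow source at t = 0
    actions: (B, ...) ground-truth action chunk
    """
    # Auto-encoder loss: L1 between the action chunk and its reconstruction
    z1 = encoder(actions)
    l_ae = F.l1_loss(decoder(z1), actions)

    # Flow-matching loss: MSE between predicted and ground-truth velocity (z1 - z_v)
    t = torch.rand(z_v.size(0), 1, device=z_v.device)
    z_t = (1 - t) * z_v + t * z1
    l_fm = F.mse_loss(velocity_net(z_t, t), z1 - z_v)

    # Integrate the ODE from the vision latent to obtain the flow latent z1_hat
    z1_hat, dt = z_v, 1.0 / n_steps
    for k in range(n_steps):
        tk = torch.full_like(t, k * dt)
        z1_hat = z1_hat + dt * velocity_net(z1_hat, tk)

    l_fld = F.mse_loss(decoder(z1_hat), actions)     # Flow Latent Decoding
    l_flc = F.mse_loss(z1_hat, z1.detach())          # Flow Latent Consistency
    return l_ae + l_fm + l_fld + l_flc
```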

Note that:

  • This model is vision-only, and its objective is a deterministic trajectory rather than a multi-modal distribution.
  • Although FLD is just an MSE loss, it is what ensures the final action is precise, and it optimizes both the decoder and the velocity prediction.

* DynamicVLA

link: https://arxiv.org/pdf/2601.22153

Summary: this work presents a 0.4B model that improves grasping of dynamic objects via continuous inference and latency-aware action streaming. It also releases a simulated dataset and benchmark for dynamic-object manipulation.

Model: a VLA using flow matching as the action expert, with the following changes:

  • Uses a convolution-based vision encoder FastViT to save time.
  • Truncate the language backbone to the first 16 transformer layers.

Continuous inference means the model predicts the actions of the next \(n\) timesteps at each inference call, while inference runs every \(m\) timesteps; when \(n > m\), consecutive predictions overlap and inference becomes effectively continuous. Latency-aware action streaming means that when multiple predictions cover the same timestep, the newest prediction is used (see the sketch below).
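
A minimal sketch of the streaming behavior described above (latest prediction wins); the class and method names are placeholders.

```python
class ActionStream:
    """Buffer overlapping action chunks; the newest prediction for a timestep wins."""

    def __init__(self):
        self.actions = {}                      # timestep -> most recently predicted action

    def push_chunk(self, t0, chunk):
        """Register a chunk predicted at timestep t0, covering t0 .. t0 + len(chunk) - 1."""
        for i, action in enumerate(chunk):
            self.actions[t0 + i] = action      # overwrite any older prediction

    def pop(self, t):
        """Return the action to execute at timestep t (None if nothing covers it)."""
        return self.actions.pop(t, None)
```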

The perhaps more valuable contribution is the dataset and benchmark. The DOM benchmark generates episodes in simulation by predicting near-future (~0.23 s) object motion, continuously keeping the gripper 10 cm above the predicted position, then grasping, placing, and resetting. For real-world data, they use an RGB-D camera to approximate the object state and follow a similar procedure.

PS (a small insight from a discussion with a co-author): the Franka arm is quite unsuitable for work that requires this kind of fast motion.

* World-Gymnast: Training Robots with Reinforcement Learning in a World Model

link: https://arxiv.org/pdf/2602.02454

This work proposes fine-tuning a VLA with RL inside a world model rather than a simulator, narrowing the sim2real gap and enabling test-time optimization.

Method: directly use a pretrained world model as the simulator, with a VLM judging the final reward. That is, given an initial frame (or a frame close enough to the initial-frame distribution), repeatedly generate actions with the VLA and next states with the world model, and optimize the policy with GRPO (a rollout sketch follows). Real-world deployment results are then used to further improve the world model.
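
A minimal sketch of the rollout-and-score loop under these assumptions (episodic VLM reward, group-normalized advantages); all callables are placeholders, not the paper's API.

```python
import torch

def rollout_in_world_model(vla, world_model, judge, frame, instruction, horizon=50):
    """VLA proposes actions, the world model imagines the next frame, a VLM judges the end."""
    for _ in range(horizon):
        action = vla(frame, instruction)
        frame = world_model(frame, action)       # imagined next observation
    return judge(frame, instruction)             # scalar reward for this rollout

def grpo_advantages(rewards):
    """Group-relative advantages: normalize rewards within a group of rollouts."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-8)
```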

An image-generation model (Nano Banana) is used to add distractors to the visual frames for robustness. Novel tasks (a new initial frame plus an instruction) can also be synthesized for better generalization.

For test-time optimization, World-Gymnast can run RL training in the world model starting from the test frame, though this risks overfitting.

* FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis

link: https://arxiv.org/pdf/2505.09109

This work proposes a sim2real garment-folding framework built on keypoint-driven asset and demonstration synthesis (for 4 types of garments) and KG-DAgger, a method that improves the policy's robustness through failure recovery.

For each garment type, the authors manually select important keypoints, and a garment is generated by the following steps: randomize the keypoint parameters, generate borders connecting the keypoints, heuristically define the Z and UV coordinates of the mesh, and generate textures with Stable Diffusion, with a VLM choosing the best texture.

The benefit of keypoints is that a keypoint-based algorithm can generate demonstrations automatically, which leads to KG-DAgger (a sketch follows this list):

  • First train a vision-based policy \(M_0\) on the generated demonstrations.
  • \(M_i\) is obtained from \(M_{i-1}\) by spotting errors in trajectories rolled out from \(M_{i-1}\) and adding corrective trajectories generated by the keypoint-based algorithm to the dataset.
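
A minimal sketch of the KG-DAgger loop as described above; every function name is a placeholder rather than the paper's API.

```python
def kg_dagger(train, rollout, detect_failure, keypoint_expert, demos, rounds=3):
    """Iteratively add keypoint-algorithm corrections at the policy's failure states."""
    dataset = list(demos)
    policy = train(dataset)                       # M_0 from the generated demonstrations
    for _ in range(rounds):
        for traj in rollout(policy):
            t_fail = detect_failure(traj)         # index of the first error, or None
            if t_fail is not None:
                # let the keypoint-based expert finish the fold from the failure state
                dataset.append(keypoint_expert(traj[t_fail]))
        policy = train(dataset)                   # M_i trained on the augmented dataset
    return policy
```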