link: https://arxiv.org/pdf/2511.14759
Summary
This work presents RECAP, a method for RL training of general-purpose VLA models that improves their performance, with higher accuracy and throughput.
Method
RECAP follows a standard regularized reinforcement learning formulation, where we have:
- A reward \(r(o_t,a_t)\), with the objective of maximizing cumulative reward (no discount factor). RECAP uses only sparse rewards.
- A value function \(V^{\pi}(o_t)\) and an \(N\)-step advantage \(A^{\pi}(o_t,a_t)=E[\sum_{t'=t}^{t+N-1}r_{t'}+V^{\pi}(o_{t+N})]-V^{\pi}(o_t)\) (a short code sketch follows this list).
- Regularization: the objective is the cumulative reward minus \(\beta E_{o\sim \rho_{\pi_\theta}}[D(\pi_\theta(\cdot|o)\,\|\,\pi_{ref}(\cdot|o))]\), so that the learned policy remains close to the reference policy.
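As a concrete illustration of the advantage estimate above, here is a minimal NumPy sketch; the array layout and function name are assumptions, not details from the paper.

```python
import numpy as np

def n_step_advantage(rewards, values, t, N):
    """N-step advantage estimate A(o_t, a_t) for a single trajectory.

    rewards: per-step rewards r_0, ..., r_{T-1}
    values:  value estimates V(o_0), ..., V(o_T), one per observation
    t, N:    current timestep and number of steps before bootstrapping
    """
    rewards, values = np.asarray(rewards), np.asarray(values)
    end = min(t + N, len(rewards))     # do not run past the end of the episode
    ret = rewards[t:end].sum()         # sum_{t'=t}^{t+N-1} r_{t'}
    if t + N < len(values):
        ret += values[t + N]           # bootstrap with V(o_{t+N})
    return ret - values[t]             # subtract the baseline V(o_t)
```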
The major parts of RECAP are
Data collection: label each episode according to its outcome.
Value function training: train a critic / value model based on a smaller VLA model.
The value function in RECAP is trained as a classification problem, mapping the observation and language instruction \((o_t,l)\) to a distribution over \(B=201\) discrete bins, so a cross-entropy loss is used.
Let \(R_t^B\) denote the discretized return \(R_{t}=\sum_{t'=t}^{T}r_{t'}\); the objective is then \[ \min_{\phi} E \Big[\sum_{o_t\in \tau} H\big(R_t^B(\tau),\,p_{\phi}(V\mid o_t,l)\big)\Big] \]
The final value estimate is \(V(o_t,l)=\sum_b v(b)\,p_{\phi}(V=b\mid o_t,l)\), where \(v(b)\) is the value of the \(b\)-th bin.
Framing value prediction as classification lets the model capture uncertainty and makes training more stable and robust to outliers.
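A minimal PyTorch sketch of such a classification-style value objective; the bin range, the two-hot target encoding, and all names here are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

B = 201                                    # number of discrete value bins
bin_values = torch.linspace(-1.0, 0.0, B)  # assumed bin support; the paper's range may differ

def two_hot(returns, bins):
    """Encode scalar returns as two-hot distributions over adjacent bins."""
    returns = returns.clamp(bins[0].item(), bins[-1].item())
    hi_idx = torch.searchsorted(bins, returns).clamp(1, len(bins) - 1)
    lo, hi = bins[hi_idx - 1], bins[hi_idx]
    w_hi = (returns - lo) / (hi - lo)      # linear interpolation weight
    dist = torch.zeros(returns.shape + (len(bins),))
    dist.scatter_(-1, (hi_idx - 1).unsqueeze(-1), (1 - w_hi).unsqueeze(-1))
    dist.scatter_(-1, hi_idx.unsqueeze(-1), w_hi.unsqueeze(-1))
    return dist

def value_loss(logits, returns):
    """Cross-entropy H(R_t^B, p_phi(V | o_t, l)) against the discretized return."""
    target = two_hot(returns, bin_values)
    return -(target * F.log_softmax(logits, dim=-1)).sum(-1).mean()

def value_estimate(logits):
    """Read out V(o_t, l) = sum_b v(b) * p_phi(V = b | o_t, l)."""
    return (F.softmax(logits, dim=-1) * bin_values).sum(-1)
```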
Advantage-conditioned training: extract an improved policy using the value function.
This is difficult because policy-gradient methods like PPO require computing the likelihood \(\pi(a|o_t,l)\), which is intractable for a flow-matching policy. The authors did try PPO with a specialized implementation, but it performs worse than RECAP.
Lemma: For any monotonically increasing function \(g\), define \(p(I|o,a)=g(A(o,a))/\int g(A(o,a'))\,da'\), i.e. the probability that action \(a\) constitutes an improvement. The lemma states that the new policy \(\hat\pi(a|o) \propto \pi_{ref}(a|o)\cdot p(I|o,a)^{\beta}\) is guaranteed to improve over the reference policy.
Using this lemma together with Bayes' rule \(p(I|a,o)=p(a|I,o)p(I|o)/p(a|o)\), and noting that \(p(I|o)\) does not depend on \(a\), we obtain the key relation \[ \hat\pi(a|o)\propto \pi_{ref}(a|o)\cdot \Big(\frac{\pi_{ref}(a|I,o)}{\pi_{ref}(a|o)}\Big)^{\beta} \]
When \(\beta =1\) we have \(\hat \pi\propto \pi_{ref}(a|I,o)\).
Thus the model is modified slightly by adding an advantage / no-advantage text tag before action generation, allowing it to predict \(\pi(a|I,o)\), which gives the NLL loss \(-\log \pi(a|o)-\alpha \log \pi(a|I,o)\).
For convenience, \(p(I|a,o)\) is set to the indicator \(1[A(o,a)>\epsilon]\).
The log-likelihood is still difficult to compute. However, the flow-matching loss bounds the negative log-likelihood from above, so minimizing it serves as a surrogate. The training also handles the \(\alpha\) weighting implicitly: for each sample,
- Usually, we optimize \(\log \pi(a|I,o)\), conditioning on the advantage tag.
- Sometimes we drop the \(I\) tag and optimize \(\log \pi(a|o)\) instead.
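This mixture of the two log-likelihood terms can be realized, for example, by randomly dropping the advantage tag from the prompt; a rough sketch under that assumption (the tag strings and `drop_prob` are hypothetical):

```python
import random

ADV_TAG, NO_ADV_TAG = "<advantage>", "<no advantage>"

def build_prompt(instruction, improves, drop_prob=0.3):
    """Prepend an advantage / no-advantage tag, or drop it to also train pi(a | o).

    Dropping the tag with probability drop_prob mixes the unconditional term
    -log pi(a | o) with the conditional term -log pi(a | I, o), playing the
    role of the alpha weighting in the loss.
    """
    if random.random() < drop_prob:
        return instruction                 # unconditional: trains pi(a | o)
    tag = ADV_TAG if improves else NO_ADV_TAG
    return f"{tag} {instruction}"          # conditional: trains pi(a | I, o)
```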
At inference time, simply sampling from \(\pi(a|I=1,o)\) recovers the \(\beta=1\) case. When \(\beta>1\), the trick is to replace the score \(d\log\pi(a|o)/da\) with the flow velocity (in the style of classifier-free guidance), giving the guided velocity \(v_{\beta}=v_{a|o}+\beta(v_{a|I,o}-v_{a|o})\).
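A minimal sketch of how this guided velocity could be used in an Euler-integrated flow-matching sampler; `velocity_model`, the tag string, and the step count are assumptions for illustration.

```python
import torch

def sample_action(velocity_model, obs, instruction, beta=3.0, num_steps=10, action_dim=32):
    """Sample an action by integrating the beta-guided flow velocity.

    velocity_model(a, t, obs, prompt) is assumed to return the flow velocity for
    noisy action a at flow time t, conditioned on the observation and a text prompt.
    """
    a = torch.randn(action_dim)            # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_cond = velocity_model(a, t, obs, "<advantage> " + instruction)  # v_{a|I,o}
        v_uncond = velocity_model(a, t, obs, instruction)                 # v_{a|o}
        v_beta = v_uncond + beta * (v_cond - v_uncond)                    # guided velocity
        a = a + dt * v_beta                # Euler integration step
    return a
```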
Reward definition: only sparse rewards are used: a heavy penalty if the task has not succeeded when time runs out, plus a \(-1\) penalty for each timestep in the episode. Rewards are normalized within each task.
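A small sketch of what such a sparse reward could look like; the failure-penalty value and the per-task normalization scheme are illustrative assumptions, not details from the paper.

```python
def episode_rewards(num_steps, succeeded, failure_penalty=-100.0):
    """Sparse reward: -1 per timestep, plus a heavy penalty if the episode ends without success."""
    rewards = [-1.0] * num_steps
    if not succeeded:
        rewards[-1] += failure_penalty     # heavy punishment when time is up without success
    return rewards

def normalize_returns_per_task(returns_by_task):
    """Min-max normalize returns within each task so scales are comparable across tasks."""
    normalized = {}
    for task, returns in returns_by_task.items():
        lo, hi = min(returns), max(returns)
        span = (hi - lo) or 1.0            # guard against a zero range
        normalized[task] = [(r - lo) / span for r in returns]
    return normalized
```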
The procedure of RECAP:
- After obtaining a base \(\pi_{0.6}\) model, we compute \(V\) and \(I\) with \(\epsilon\) set to the 30th percentile (40th for fine-tuning and 10th for long tasks like laundry folding) of the advantage values within the task, and perform policy extraction to obtain a general policy \(\pi^*_{0.6}\) (see the sketch after this list).
- For each task, we first fine-tune the model on expert data for that task with \(I\equiv 1\). We then use this model to collect additional trajectories, which are added to the dataset. Some of these rollouts are monitored by a human and corrected; the corrected actions are labeled \(I\equiv 1\).
- It is worth noting that in each iteration both the value function and the policy are trained from the pre-trained model; what changes across iterations is the growing amount of higher-quality data. This is said to prevent drift across iterations.
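To make the improvement labeling concrete, here is a small NumPy sketch of thresholding advantages at a per-task percentile; the function and variable names are hypothetical.

```python
import numpy as np

def label_improvement(advantages_by_task, percentile=30):
    """Compute the binary indicator I by thresholding advantages at a per-task percentile epsilon."""
    labels = {}
    for task, advantages in advantages_by_task.items():
        eps = np.percentile(advantages, percentile)   # e.g. the 30th percentile for this task
        labels[task] = [a > eps for a in advantages]  # I = 1[A(o, a) > eps]
    return labels
```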
Limitations:
This work proposes a framework for RL on VLA models. However, there is clearly room for improvement:
- RECAP's updates are mostly offline, with only a limited number of iterations during post-training; it could perhaps be extended to online training.
- RECAP uses simple on-policy value fitting for convenience; extending it to off-policy Q-learning may improve data efficiency.
- The process involves a lot of human labor and monitoring; replacing some of this human work with VLMs might help.