Summary

This work presents a VLA foundation model that establishes a scalable pre-train & fine-tune recipe. For pre-training, it employs cross-embodiment training (to leverage the large OXE dataset) and combines flow matching (to control robots at high frequency) with a VLM via a novel action expert.

Challenges

This work aims to address current difficulties with data availability, generalization, and robustness.

Challenges for developing such a model, as directly listed in the introduction: First, it must be done at a very large scale. Second, it requires a model that can both make use of diverse data sources and represent the intricate behaviors needed for physical contact.

Methods

Goal: model the distribution \(p(A_t|o_t)\), where \(A_t\) is the action chunk \([a_t,\dots,a_{t+H-1}]\) with \(H=50\), and \(o_t\) is the concatenation of the current images (multiple views) \(I_t^1,\dots,I_t^n\), the language tokens \(l_t\), and the proprioceptive joint-angle data \(q_t\).
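A minimal sketch of the data interface this implies (the shapes, 224×224 image resolution, and field names are assumptions for illustration, not the paper's code):

```python
from dataclasses import dataclass

import torch

H = 50  # action horizon: length of the action chunk A_t

@dataclass
class Observation:
    images: torch.Tensor           # [n_cams, 3, 224, 224], current RGB views I_t^1 ... I_t^n
    language_tokens: torch.Tensor  # [L], tokenized language prompt l_t
    proprio: torch.Tensor          # [q_dim], joint angles q_t

@dataclass
class ActionChunk:
    actions: torch.Tensor          # [H, action_dim], a_t ... a_{t+H-1}
```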

Model:

  • Use a standard late-fusion VLM (PaliGemma) to process image & language tokens and embed them into the same latent space.
  • Use flow matching to generate the action. In each flow-matching step, use a block-wise causal attention mask with three blocks: \([I_t^1,\dots,I_t^n,l_t]\), \(q_t\), and \([a^\tau_t,\dots,a^\tau_{t+H-1}]\) (the noisy action). Within each block, attention is bidirectional, but tokens cannot attend to future blocks (see the mask sketch after this list).
  • The MLP layer after each transformer block is separated into two expert MLPs: one with frozen weights for visual and language tokens, and a smaller trainable MLP for \(q_t\) and \(a^\tau\).
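A minimal sketch of the block-wise causal mask described above; the helper name and block lengths are illustrative:

```python
import torch

def blockwise_causal_mask(block_sizes):
    """Build a [T, T] boolean attention mask (True = attention allowed):
    tokens attend bidirectionally within their own block and to all
    earlier blocks, but never to later blocks."""
    T = sum(block_sizes)
    mask = torch.zeros(T, T, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        end = start + size
        # queries in this block see every key up to the end of their own block
        mask[start:end, :end] = True
        start = end
    return mask

# three blocks: [image + language tokens], [q_t], [H noisy action tokens]
mask = blockwise_causal_mask([768 + 20, 1, 50])
```

One consequence of this mask is that the image/language keys and values never depend on the action tokens, so they can be computed once and reused across flow-matching steps.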

Detailed Implementation

Forward:

  • Input: images \(I_t^1,\dots,I_t^n\), language tokens \(l_t\), normalized proprioception data \(q_t\), and the normalized noisy action chunk \(a_{[t,t+H-1]}^{\tau}\).
  • Use the pre-trained PaliGemma ViT to encode the images, and embed the language tokens.
  • Embed the noisy actions using an MLP (sketched after this list):
    • \(W_3\cdot \mathrm{swish}(W_2\cdot \mathrm{concat}(W_1\cdot a_{t'}^{\tau},\phi(\tau)))\)
    • \(\phi\) is a sinusoidal positional embedding of the flow-matching time \(\tau\)
  • A multi-query attention block:
    • 8-head, head dim=256, depth=18
    • Block-wise causal mask as described above.
    • Frozen QKV projection matrices for image and language tokens, and trainable ones for proprioception and action.
    • Two experts in the FFN layer: a frozen one for image and language tokens, and a trainable one for proprioception and action (the action expert).
    • The token width for image and language tokens is 2048, with MLP dim=16384; the action expert has width=1024 and MLP dim=4096.
  • Outputs:
    • Take the transformer outputs corresponding to the \(H\) noisy action tokens and decode them into \(v_{\theta}(A_t^{\tau},o_t)\) with a linear projection.
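A minimal sketch (in PyTorch) of the noisy-action embedding MLP and the final linear projection from the bullets above; the widths follow the numbers listed, while `action_dim`, layer names, and the time-embedding details are illustrative assumptions:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_embedding(tau: torch.Tensor, dim: int) -> torch.Tensor:
    """phi(tau): sinusoidal positional embedding of the flow-matching time."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = tau[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class ActionTokenEmbed(nn.Module):
    """Embed each noisy action a_{t'}^tau together with tau:
    W3 . swish(W2 . concat(W1 . a, phi(tau)))."""
    def __init__(self, action_dim: int, width: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(action_dim, width)
        self.w2 = nn.Linear(2 * width, width)
        self.w3 = nn.Linear(width, width)

    def forward(self, noisy_actions: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
        # noisy_actions: [B, H, action_dim]; tau: [B]
        B, H, _ = noisy_actions.shape
        time_emb = sinusoidal_embedding(tau, self.w1.out_features)  # [B, width]
        time_emb = time_emb[:, None, :].expand(B, H, -1)            # broadcast over the chunk
        x = torch.cat([self.w1(noisy_actions), time_emb], dim=-1)   # concat(W1 a, phi(tau))
        return self.w3(F.silu(self.w2(x)))                          # swish == SiLU

action_dim = 7                                 # illustrative
embed = ActionTokenEmbed(action_dim)           # noisy actions -> action-expert width 1024
to_vector_field = nn.Linear(1024, action_dim)  # project the H output tokens to v_theta
```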

Training:

  • Sample \(\epsilon \sim N(0,I)\) and sample \(\tau\) from the time distribution \(p(\tau)=\mathrm{Beta}\!\left(\tfrac{s-\tau}{s};1.5,1\right)\) supported on \([0,s]\), with \(s=0.999\). Let \(A_t^{\tau}=\tau A_t+(1-\tau)\epsilon\) (this sampling and the flow-matching loss are sketched after this list).
  • Forward, then calculate loss \(=||v_\theta(A_t^{\tau},o_t)-(A_t-\epsilon)||^2\).
  • Also calculate the word-prediction loss with standard cross-entropy.
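A minimal sketch of one training step's flow-matching math, interpreting the \(\tau\) distribution above as a rescaled \(\mathrm{Beta}(1.5,1)\) supported on \([0,s]\); `model(obs, noisy_actions, tau)` is a hypothetical callable standing in for the full network:

```python
import torch

def flow_matching_loss(model, obs, actions, s: float = 0.999):
    """actions: [B, H, action_dim] clean action chunk A_t."""
    B = actions.shape[0]
    eps = torch.randn_like(actions)                  # epsilon ~ N(0, I)
    # p(tau) = Beta((s - tau)/s; 1.5, 1) on [0, s]: sample u ~ Beta(1.5, 1), set tau = s * (1 - u)
    u = torch.distributions.Beta(1.5, 1.0).sample((B,))
    tau = s * (1.0 - u)
    tau_b = tau[:, None, None]                       # broadcast to [B, 1, 1]
    noisy = tau_b * actions + (1.0 - tau_b) * eps    # A_t^tau = tau * A_t + (1 - tau) * eps
    v_pred = model(obs, noisy, tau)                  # v_theta(A_t^tau, o_t)
    target = actions - eps                           # A_t - eps
    return ((v_pred - target) ** 2).mean()
```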

Inference

  • Sample \(A_t^0=\epsilon\sim N(0,I)\) and pick an integration step size \(\delta\). Perform Euler integration \(A_t^{\tau+\delta}=A_t^{\tau}+\delta\, v_{\theta}(A_t^{\tau},o_t)\) from \(\tau=0\) to \(\tau=1\) (sketched after this list).
  • A high-level LLM planning policy is needed for complex tasks.
  • Run inference every 0.8 s for the 20 Hz UR5e and Franka robots, and every 0.5 s for the other robots, which run at 50 Hz.
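A minimal sketch of the Euler integration loop, reusing the hypothetical `model` callable from the training sketch; the number of steps and `action_dim` are illustrative choices:

```python
import torch

@torch.no_grad()
def sample_action_chunk(model, obs, H=50, action_dim=7, n_steps=10):
    """Integrate A_t^{tau+delta} = A_t^tau + delta * v_theta(A_t^tau, o_t)
    from tau = 0 (pure noise) to tau = 1 (predicted action chunk)."""
    delta = 1.0 / n_steps
    a = torch.randn(1, H, action_dim)       # A_t^0 = eps ~ N(0, I)
    tau = torch.zeros(1)
    for _ in range(n_steps):
        a = a + delta * model(obs, a, tau)  # Euler step
        tau = tau + delta
    return a                                # approximate sample from p(A_t | o_t)
```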

Data and training

  • For each robot–task combination, let \(n\) be the number of samples of that combination and weight it by \(n^{0.43}\) (see the sketch below).
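A minimal sketch of this weighting, assuming a dict from robot–task combination to sample count (names and counts are illustrative):

```python
def sampling_weights(counts, alpha=0.43):
    """Weight each robot-task combination by n^alpha and normalize, so combinations
    with many samples are down-weighted relative to their raw counts."""
    raw = {combo: n ** alpha for combo, n in counts.items()}
    total = sum(raw.values())
    return {combo: w / total for combo, w in raw.items()}

# a combination with 100x more episodes gets only ~7.2x the weight (100^0.43 ≈ 7.2)
weights = sampling_weights({"ur5e-fold_shirt": 10_000, "franka-bus_table": 100})
```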