Summary
This work presents a VLA (vision-language-action) foundation model and establishes a scalable pre-train & fine-tune recipe. For pre-training, it uses cross-embodiment training (to leverage the large OXE dataset) and combines flow matching (to control robots at high frequency) with a pre-trained VLM through a novel action expert.
Challenges
This work addresses the current difficulties with data availability, generalization, and robustness.
The introduction lists two challenges for developing such a model: first, it must be done at a very large scale; second, it requires a model architecture that can both make use of diverse data sources and represent the intricate behaviors needed for physical contact.
Methods
Goal: model the distribution \(p(A_t|o_t)\), where \(A_t=[a_t,a_{t+1},\dots,a_{t+H-1}]\) is an action chunk (\(H=50\)), and \(o_t\) is the concatenation of the current images (multiple views) \(I_t^1,\dots,I_t^n\), the language tokens \(l_t\), and the proprioceptive joint-angle data \(q_t\).
Model:
- Use a standard late-fusion VLM (PaliGemma) to process image and language tokens and embed them into a shared latent space.
- Use flow matching to generate the actions. In each flow-matching step, use a block-wise causal attention mask with three blocks: \([I_t^1,\dots,I_t^n,l_t]\), \(q_t\), and \([a^\tau_t,\dots,a^\tau_{t+H-1}]\) (the noisy actions). Attention is bidirectional within each block, but tokens cannot attend to future blocks (see the mask sketch after this list).
- The MLP layer after each transformer attention block is split into two expert MLPs: one with frozen weights for the visual and language tokens, and a smaller trainable MLP for \(q_t\) and \(a^\tau\).
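A minimal sketch of how such a block-wise causal mask could be built (plain NumPy; the block sizes below are illustrative, not the model's real token counts):

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """mask[i, j] == True iff query token i may attend to key token j.

    Tokens attend bidirectionally within their own block and to all
    earlier blocks, but never to later blocks.
    """
    # block id of every token, e.g. [0,0,0, 1, 2,2,2,2]
    block_ids = np.concatenate(
        [np.full(size, b) for b, size in enumerate(block_sizes)]
    )
    # query in block i may see key in block j iff j <= i
    return block_ids[:, None] >= block_ids[None, :]

# illustrative sizes: 3 image/language tokens, 1 proprioception token, 4 noisy-action tokens
print(blockwise_causal_mask([3, 1, 4]).astype(int))
```

With this mask, image/language tokens never attend to proprioception or action tokens, while the noisy-action tokens can attend to everything.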
Detailed Implementation
Forward:
- Input: images \(I_t^1,\dots,I_t^n\), language tokens \(l_t\), normalized proprioception data \(q_t\), and the normalized noisy actions \(a_{[t,t+H-1]}^{\tau}\).
- Use the pre-trained PaliGemma ViT to encode the images, and embed the language tokens.
- Embed the noisy actions using an MLP (a sketch appears after this Forward list):
- \(W_3\cdot \mathrm{swish}(W_2\cdot \mathrm{concat}(W_1\cdot a_{t'}^{\tau},\phi(\tau)))\), applied per timestep \(t'\)
- \(\phi(\tau)\) is a sinusoidal positional embedding of the flow-matching time \(\tau\)
- A multi-query-attention transformer:
- 8 heads, head dim = 256, depth = 18 layers
- Block-wise causal mask as described above.
- Frozen QKV matrices for the image and language tokens, and trainable ones for the proprioception and action tokens.
- Two experts in the FFN layer: a frozen one for the image and language tokens, and a trainable one for the proprioception and action tokens (the action expert).
- The token width for image and language tokens is 2048 with MLP dim = 16384; the action expert has width = 1024 and MLP dim = 4096.
- Outputs:
- Take the transformer outputs corresponding to the \(H\) noisy action tokens and decode them into \(v_{\theta}(A_t^{\tau},o_t)\) with a linear projection.
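A sketch of the noisy-action token embedding from the Forward list, \(W_3\cdot\mathrm{swish}(W_2\cdot\mathrm{concat}(W_1\cdot a_{t'}^{\tau},\phi(\tau)))\); the sinusoidal embedding dimension and the random placeholder weights are assumptions for illustration, with the output width set to the action expert's 1024:

```python
import numpy as np

def sinusoidal_embedding(tau, dim=32):
    """phi(tau): sinusoidal embedding of the flow-matching time tau (dim is an assumption)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    return np.concatenate([np.sin(tau * freqs), np.cos(tau * freqs)])

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def embed_noisy_action(a_tau, tau, width=1024, time_dim=32, seed=0):
    """W3 . swish(W2 . concat(W1 . a_tau, phi(tau))); weights are random placeholders."""
    rng = np.random.default_rng(seed)
    action_dim = a_tau.shape[-1]
    W1 = 0.02 * rng.normal(size=(width, action_dim))
    W2 = 0.02 * rng.normal(size=(width, width + time_dim))
    W3 = 0.02 * rng.normal(size=(width, width))
    h = np.concatenate([W1 @ a_tau, sinusoidal_embedding(tau, time_dim)])
    return W3 @ swish(W2 @ h)

token = embed_noisy_action(np.zeros(7), tau=0.3)  # 7-dim action, purely illustrative
print(token.shape)  # (1024,)
```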
Training:
- Sample \(\epsilon \sim N(0,I)\) and \(\tau\) from the shifted beta distribution \(p(\tau)=\mathrm{Beta}\!\left(\frac{s-\tau}{s};1.5,1\right)\) on \([0,s]\) with \(s=0.999\), which puts more mass on noisier (lower) timesteps. Let \(A_t^{\tau}=\tau A_t+(1-\tau)\epsilon\).
- Run the forward pass, then compute the loss \(\|v_\theta(A_t^{\tau},o_t)-(A_t-\epsilon)\|^2\).
- Also compute the word-prediction loss with standard cross-entropy.
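A minimal sketch of one training example's flow-matching objective under the conventions above (sampling \(\tau\) via \(x\sim\mathrm{Beta}(1.5,1)\), \(\tau=s(1-x)\), which matches the stated density); `v_theta` is a placeholder for the full model, which also conditions on \(o_t\):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tau(s=0.999):
    """Sample tau on [0, s] with density Beta((s - tau)/s; 1.5, 1),
    i.e. more mass on low (noisier) timesteps."""
    x = rng.beta(1.5, 1.0)   # x = (s - tau) / s
    return s * (1.0 - x)

def flow_matching_loss(A_t, v_theta):
    """Flow-matching loss for one action chunk A_t of shape (H, action_dim).

    v_theta is a stand-in callable (A_tau, tau) -> predicted vector field.
    """
    eps = rng.normal(size=A_t.shape)        # epsilon ~ N(0, I)
    tau = sample_tau()
    A_tau = tau * A_t + (1.0 - tau) * eps   # interpolate between noise and data
    target = A_t - eps                      # vector field pointing from noise to data
    return np.sum((v_theta(A_tau, tau) - target) ** 2)

# illustrative usage with a dummy "model" that predicts zeros
print(flow_matching_loss(np.zeros((50, 7)), lambda A_tau, tau: np.zeros_like(A_tau)))
```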
Inference
- Sample \(A_t^0=\epsilon\sim N(0,I)\) and pick an integration step size \(\delta\) (the paper uses 10 Euler steps, \(\delta=0.1\)). Iterate \(A_t^{\tau+\delta}=A_t^{\tau}+\delta\, v_{\theta}(A_t^{\tau},o_t)\) from \(\tau=0\) to \(\tau=1\).
- For complex tasks, a high-level LLM planning policy is needed to decompose the task into intermediate language commands.
- Run inference every 0.8 s for the 20 Hz UR5e and Franka robots, and every 0.5 s for the other robots, which run at 50 Hz.
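A sketch of the inference-time Euler integration (step count, horizon, and action dimension are illustrative; `v_theta` again stands in for the full model):

```python
import numpy as np

def sample_action_chunk(v_theta, o_t, H=50, action_dim=7, n_steps=10, seed=0):
    """Integrate the learned vector field from pure noise (tau = 0) to an action
    chunk (tau = 1) with Euler steps of size delta = 1 / n_steps."""
    rng = np.random.default_rng(seed)
    delta = 1.0 / n_steps
    A = rng.normal(size=(H, action_dim))       # A_t^0 = epsilon ~ N(0, I)
    tau = 0.0
    for _ in range(n_steps):
        A = A + delta * v_theta(A, tau, o_t)   # A^{tau+delta} = A^tau + delta * v_theta
        tau += delta
    return A

# illustrative usage with a dummy model and a dummy observation
print(sample_action_chunk(lambda A, tau, o: -A, o_t=None).shape)  # (50, 7)
```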
Data and training
- For each robot-task combination, let \(n\) be the number of samples in that combination; weight the combination by \(n^{0.43}\) when sampling training data, as sketched below.
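A small sketch of the \(n^{0.43}\) weighting (the example counts are made up):

```python
import numpy as np

def dataset_sampling_weights(counts, alpha=0.43):
    """Weight each robot-task combination with n samples by n**alpha, then
    normalize; over-represented combinations are down-weighted relative to
    sampling proportionally to n."""
    counts = np.asarray(counts, dtype=float)
    w = counts ** alpha
    return w / w.sum()

# made-up sample counts for three robot-task combinations
print(dataset_sampling_weights([100_000, 10_000, 1_000]))
```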