Summary

This work presents a VLA foundation model that establishes a scalable pre-train & fine-tune recipe. For pre-training, it employs cross-embodiment training (to leverage the large OXE dataset) and combines flow matching (to control robots at high frequency) with a VLM via a novel action expert.

Challenges

This work aims to address current difficulties with data availability, generalization, and robustness.

Challenges for developing such a model, as directly listed in the introduction: First, it must be done at a very large scale. Second, it requires a model that can both make use of diverse data sources and represent the intricate behaviors needed for physical contact.

Methods

Goal: model the distribution \(p(A_t|o_t)\), where \(A_t\) is the action chunk \([a_t,\dots,a_{t+H-1}]\) with \(H=50\), and \(o_t\) is the concatenation of the current images (multiple views) \(I_t^1,\dots,I_t^n\), the language tokens \(l_t\), and the proprioceptive joint-angle data \(q_t\).
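A minimal sketch of the data interface this implies (the shapes, 224×224 image resolution, and field names are assumptions for illustration, not the paper's code):

```python
from dataclasses import dataclass

import torch

H = 50  # action horizon: length of the action chunk A_t

@dataclass
class Observation:
    images: torch.Tensor           # [n_cams, 3, 224, 224], current RGB views I_t^1 ... I_t^n
    language_tokens: torch.Tensor  # [L], tokenized language prompt l_t
    proprio: torch.Tensor          # [q_dim], joint angles q_t

@dataclass
class ActionChunk:
    actions: torch.Tensor          # [H, action_dim], a_t ... a_{t+H-1}
```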

Model:

  • Use a standard late-fusion VLM (PaliGemma) to process image & language tokens and embed them into the same latent space.
  • Use flow matching to generate the action. In each flow-matching step, use a block-wise causal attention mask with three blocks: \([I_t^1,\dots,I_t^n,l_t]\), \(q_t\), and \([a^\tau_t,\dots,a^\tau_{t+H-1}]\) (the noisy action). Within each block, attention is bidirectional, but tokens cannot attend to future blocks (see the mask sketch after this list).
  • The MLP layer after each transformer block is separated into two expert MLPs: one with frozen weights for visual and language tokens, and a smaller trainable MLP for \(q_t\) and \(a^\tau\).
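A minimal sketch of the block-wise causal mask described above; the helper name and block lengths are illustrative:

```python
import torch

def blockwise_causal_mask(block_sizes):
    """Build a [T, T] boolean attention mask (True = attention allowed):
    tokens attend bidirectionally within their own block and to all
    earlier blocks, but never to later blocks."""
    T = sum(block_sizes)
    mask = torch.zeros(T, T, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        end = start + size
        # queries in this block see every key up to the end of their own block
        mask[start:end, :end] = True
        start = end
    return mask

# three blocks: [image + language tokens], [q_t], [H noisy action tokens]
mask = blockwise_causal_mask([768 + 20, 1, 50])
```

One consequence of this mask is that the image/language keys and values never depend on the action tokens, so they can be computed once and reused across flow-matching steps.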

Detailed Implementation

Forward:

  • Input: images \(I_t^1,\dots,I_t^n\), language tokens \(l_t\), normalized proprioception data \(q_t\), and the normalized noisy action chunk \(a_{[t,t+H-1]}^{\tau}\).
  • Use the pre-trained PaliGemma ViT to encode the images, and embed the language tokens.
  • Embed the noisy actions using an MLP (sketched after this list):
    • \(W_3\cdot \mathrm{swish}(W_2\cdot \mathrm{concat}(W_1\cdot a_{t'}^{\tau},\phi(\tau)))\)
    • \(\phi\) is a sinusoidal positional embedding of the flow-matching time \(\tau\)
  • A multi-query attention block:
    • 8-head, head dim=256, depth=18
    • Block-wise causal mask as described above.
    • Frozen QKV projection matrices for image and language tokens, and trainable ones for proprioception and action.
    • Two experts in the FFN layer: a frozen one for image and language tokens, and a trainable one for proprioception and action (the action expert).
    • The token width for image and language tokens is 2048, with MLP dim=16384; the action expert has width=1024 and MLP dim=4096.
  • Outputs:
    • Take the transformer outputs corresponding to the \(H\) noisy action tokens and decode them into \(v_{\theta}(A_t^{\tau},o_t)\) with a linear projection.
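A minimal sketch (in PyTorch) of the noisy-action embedding MLP and the final linear projection from the bullets above; the widths follow the numbers listed, while `action_dim`, layer names, and the time-embedding details are illustrative assumptions:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_embedding(tau: torch.Tensor, dim: int) -> torch.Tensor:
    """phi(tau): sinusoidal positional embedding of the flow-matching time."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    angles = tau[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class ActionTokenEmbed(nn.Module):
    """Embed each noisy action a_{t'}^tau together with tau:
    W3 . swish(W2 . concat(W1 . a, phi(tau)))."""
    def __init__(self, action_dim: int, width: int = 1024):
        super().__init__()
        self.w1 = nn.Linear(action_dim, width)
        self.w2 = nn.Linear(2 * width, width)
        self.w3 = nn.Linear(width, width)

    def forward(self, noisy_actions: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
        # noisy_actions: [B, H, action_dim]; tau: [B]
        B, H, _ = noisy_actions.shape
        time_emb = sinusoidal_embedding(tau, self.w1.out_features)  # [B, width]
        time_emb = time_emb[:, None, :].expand(B, H, -1)            # broadcast over the chunk
        x = torch.cat([self.w1(noisy_actions), time_emb], dim=-1)   # concat(W1 a, phi(tau))
        return self.w3(F.silu(self.w2(x)))                          # swish == SiLU

action_dim = 7                                 # illustrative
embed = ActionTokenEmbed(action_dim)           # noisy actions -> action-expert width 1024
to_vector_field = nn.Linear(1024, action_dim)  # project the H output tokens to v_theta
```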

Training:

  • Sample \(\epsilon \sim N(0,I)\) and sample \(\tau\) from the time distribution \(p(\tau)=\mathrm{Beta}\!\left(\tfrac{s-\tau}{s};1.5,1\right)\) supported on \([0,s]\), with \(s=0.999\). Let \(A_t^{\tau}=\tau A_t+(1-\tau)\epsilon\) (this sampling and the flow-matching loss are sketched after this list).
  • Forward, then calculate loss \(=||v_\theta(A_t^{\tau},o_t)-(A_t-\epsilon)||^2\).
  • Also calculate the word-prediction loss with standard cross-entropy.
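A minimal sketch of one training step's flow-matching math, interpreting the \(\tau\) distribution above as a rescaled \(\mathrm{Beta}(1.5,1)\) supported on \([0,s]\); `model(obs, noisy_actions, tau)` is a hypothetical callable standing in for the full network:

```python
import torch

def flow_matching_loss(model, obs, actions, s: float = 0.999):
    """actions: [B, H, action_dim] clean action chunk A_t."""
    B = actions.shape[0]
    eps = torch.randn_like(actions)                  # epsilon ~ N(0, I)
    # p(tau) = Beta((s - tau)/s; 1.5, 1) on [0, s]: sample u ~ Beta(1.5, 1), set tau = s * (1 - u)
    u = torch.distributions.Beta(1.5, 1.0).sample((B,))
    tau = s * (1.0 - u)
    tau_b = tau[:, None, None]                       # broadcast to [B, 1, 1]
    noisy = tau_b * actions + (1.0 - tau_b) * eps    # A_t^tau = tau * A_t + (1 - tau) * eps
    v_pred = model(obs, noisy, tau)                  # v_theta(A_t^tau, o_t)
    target = actions - eps                           # A_t - eps
    return ((v_pred - target) ** 2).mean()
```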

Inference

  • Sample \(A_t^0=\epsilon\sim N(0,I)\) and pick an integration step size \(\delta\). Perform Euler integration \(A_t^{\tau+\delta}=A_t^{\tau}+\delta\, v_{\theta}(A_t^{\tau},o_t)\) from \(\tau=0\) to \(\tau=1\) (sketched after this list).
  • A high-level LLM planning policy is needed for complex tasks.
  • Run inference every 0.8 s for the 20 Hz UR5e and Franka robots, and every 0.5 s for the other robots, which run at 50 Hz.
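A minimal sketch of the Euler integration loop, reusing the hypothetical `model` callable from the training sketch; the number of steps and `action_dim` are illustrative choices:

```python
import torch

@torch.no_grad()
def sample_action_chunk(model, obs, H=50, action_dim=7, n_steps=10):
    """Integrate A_t^{tau+delta} = A_t^tau + delta * v_theta(A_t^tau, o_t)
    from tau = 0 (pure noise) to tau = 1 (predicted action chunk)."""
    delta = 1.0 / n_steps
    a = torch.randn(1, H, action_dim)       # A_t^0 = eps ~ N(0, I)
    tau = torch.zeros(1)
    for _ in range(n_steps):
        a = a + delta * model(obs, a, tau)  # Euler step
        tau = tau + delta
    return a                                # approximate sample from p(A_t | o_t)
```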

Data and training

  • For each robot–task combination, let \(n\) be the number of samples of that combination and weight it by \(n^{0.43}\) (see the sketch below).
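A minimal sketch of this weighting, assuming a dict from robot–task combination to sample count (names and counts are illustrative):

```python
def sampling_weights(counts, alpha=0.43):
    """Weight each robot-task combination by n^alpha and normalize, so combinations
    with many samples are down-weighted relative to their raw counts."""
    raw = {combo: n ** alpha for combo, n in counts.items()}
    total = sum(raw.values())
    return {combo: w / total for combo, w in raw.items()}

# a combination with 100x more episodes gets only ~7.2x the weight (100^0.43 ≈ 7.2)
weights = sampling_weights({"ur5e-fold_shirt": 10_000, "franka-bus_table": 100})
```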