link: https://arxiv.org/pdf/2402.15391

This work presents Genie, a generative interactive environment trained on unlabeled internet videos. Genie uses a spatiotemporal (ST) transformer throughout the model and encodes videos with a video tokenizer and a latent action model (LAM); the LAM can also be used to infer action policies from unlabeled video data.

Methodology

Genie consists of three parts: a video tokenizer that encodes the video into discrete tokens, a latent action model (LAM) that infers the action between each pair of consecutive frames, and a dynamics model that combines the outputs of the former two to predict the next frame.

At inference time, the LAM is replaced by the user's action: the LAM is needed only during training, where it supplies pseudo action labels for the otherwise unlabeled video.
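As a rough illustration, here is a minimal sketch of the inference loop with stand-in components; the function names, shapes, and vocabulary sizes below are my own assumptions, not the paper's API. The user picks a discrete latent action each step, the dynamics model predicts the next frame's tokens, and the tokenizer's decoder renders them.

```python
# Minimal sketch of Genie-style inference with stand-in components
# (all function names, shapes, and vocab sizes here are hypothetical).
import torch

H, W = 16, 16                 # tokens per frame (assumed)
VOCAB, NUM_ACTIONS = 1024, 8  # token / latent-action vocabularies (assumed)

def dynamics_model(token_history, action_history):
    # Stand-in for the real dynamics model: returns next-frame token ids.
    return torch.randint(0, VOCAB, (H, W))

def decode_tokens(tokens):
    # Stand-in for the video tokenizer's decoder: tokens -> "pixels".
    return tokens.float() / VOCAB

token_history = [torch.randint(0, VOCAB, (H, W))]   # tokenized prompt frame
action_history = []

for step in range(15):
    user_action = int(torch.randint(0, NUM_ACTIONS, ()))  # e.g. a key press
    action_history.append(user_action)
    next_tokens = dynamics_model(token_history, action_history)
    token_history.append(next_tokens)
    frame = decode_tokens(next_tokens)                     # shown to the player
```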

ST Transformer

Throughout the model, they use an ST transformer instead of a standard transformer.

A standard ViT-style transformer would treat the video as \(T\times H\times W\) tokens and attend over all of them, which is too slow. The ST transformer instead factorizes attention into two blocks: a spatial layer, where self-attention attends over the \(1\times H\times W\) tokens within each frame, and a temporal layer, where self-attention attends over the \(T\times 1\times 1\) tokens at each spatial position with a causal mask. A single FFW layer is added after both attention layers, rather than after each of them.
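A minimal PyTorch sketch of one ST-transformer block, assuming my own module and argument names; the paper only specifies the factorized attention pattern, the temporal causal mask, and the single FFW after both attention layers.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, H*W, D) -- one token per patch, per frame
        B, T, S, D = x.shape

        # Spatial layer: attend over the 1 x H x W tokens within each frame.
        xs = x.reshape(B * T, S, D)
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]

        # Temporal layer: attend over the T x 1 x 1 tokens at each spatial
        # position, with a causal mask so a frame only sees its past.
        xt = xs.reshape(B, T, S, D).transpose(1, 2).reshape(B * S, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.norm2(xt)
        xt = xt + self.temporal_attn(h, h, h, attn_mask=causal)[0]

        # A single FFW after both attention layers (not one per layer).
        out = xt + self.ffw(self.norm3(xt))
        return out.reshape(B, S, T, D).transpose(1, 2)
```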

Model details

Since there is no action-labeled video data available on the internet at scale, the actions are learned in an unsupervised manner.

The LAM is a VQ-VAE, where:

  • the encoder reads in \(x_1,\dots,x_{t+1}\) and outputs \(a_1,\dots,a_t\); each \(a_i\) is limited to a vocabulary of size \(|A|=8\) to permit human playability,
  • the decoder reads in \(x_1,\dots,x_t\) and \(a_1,\dots,a_t\) and outputs \(x_{t+1}\),
  • because the temporal layer uses a causal mask, the whole video can simply be fed in at once during training (see the sketch below).
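Below is a shape-level training sketch of this setup. The tiny linear encoder/decoder, the loss, and the variable names are stand-ins of my own; only the interfaces follow the description above, and the codebook/commitment losses of a real VQ-VAE are omitted.

```python
import torch
import torch.nn.functional as F

B, T, D = 2, 8, 64                     # batch, frame pairs, flattened-frame dim
NUM_ACTIONS, ACTION_DIM = 8, 32        # |A| = 8 codes, latent dim 32

codebook = torch.randn(NUM_ACTIONS, ACTION_DIM, requires_grad=True)
enc = torch.nn.Linear(2 * D, ACTION_DIM)    # stand-in: looks at (x_t, x_{t+1}) pairs
dec = torch.nn.Linear(D + ACTION_DIM, D)    # stand-in: predicts x_{t+1} from (x_t, a_t)

frames = torch.randn(B, T + 1, D)           # x_1 ... x_{T+1}

# Encoder: a continuous action for every adjacent frame pair, quantized to 8 codes.
pairs = torch.cat([frames[:, :-1], frames[:, 1:]], dim=-1)        # (B, T, 2D)
z = enc(pairs)                                                    # (B, T, 32)
dist = ((z.unsqueeze(-2) - codebook) ** 2).sum(-1)                # (B, T, 8)
a = codebook[dist.argmin(-1)]                                     # nearest code
a = z + (a - z).detach()                                          # straight-through

# Decoder: predict x_{t+1} from the past frames and actions (here just (x_t, a_t)).
pred = dec(torch.cat([frames[:, :-1], a], dim=-1))                # (B, T, D)
loss = F.mse_loss(pred, frames[:, 1:])                            # reconstruct x_{t+1}
loss.backward()
```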

The video tokenizer also uses a VQ-VAE, where:

  • the encoder takes in \(T\) frames of video and encodes them into a discrete \(T\times D\) latent representation,
  • and the decoder maps the latents back to raw video.

The dynamics model is a decoder-only MaskGIT.

  • It is still an ST transformer, with a causal mask in the temporal layer.
  • Input tokens are randomly masked at train time according to a Bernoulli distribution.
  • The latent actions are not concatenated to the inputs; instead they are added as embeddings, like a positional embedding (see the sketch after this list).
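A sketch of how the dynamics model's inputs could be prepared under these choices; the masking rate, embedding-table names, and shapes are my own assumptions.

```python
import torch
import torch.nn as nn

B, T, S, D = 2, 8, 256, 512            # batch, frames, tokens per frame, width
VOCAB, NUM_ACTIONS = 1024, 8

tok_embed = nn.Embedding(VOCAB, D)     # video-token embeddings
act_embed = nn.Embedding(NUM_ACTIONS, D)
mask_embed = nn.Parameter(torch.zeros(D))

tokens = torch.randint(0, VOCAB, (B, T, S))
actions = torch.randint(0, NUM_ACTIONS, (B, T))

x = tok_embed(tokens)                                  # (B, T, S, D)

# MaskGIT-style corruption: tokens are masked i.i.d. (Bernoulli) at train time.
mask = torch.rand(B, T, S) < 0.75                      # masking rate is an assumption
x = torch.where(mask.unsqueeze(-1), mask_embed.expand_as(x), x)

# Latent actions are added like a positional embedding (not concatenated),
# broadcast across the spatial tokens of their frame.
x = x + act_embed(actions).unsqueeze(2)                # (B, T, 1, D) -> broadcast

# x then goes into the ST transformer, which predicts the masked token ids.
```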

Metrics:

  • video fidelity: Fréchet Video Distance (FVD)
  • controllability: ΔPSNR, based on the peak signal-to-noise ratio (PSNR). It measures how much the generations differ when conditioned on the latent actions inferred from the ground-truth video versus on random latent actions, i.e. how much the latent actions actually affect the generation (a sketch follows this list).
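A small sketch of the controllability probe, with the generation step replaced by dummy tensors; the function and variable names are mine.

```python
import torch

def psnr(x, y, max_val=1.0):
    mse = torch.mean((x - y) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

gt = torch.rand(3, 64, 64)                        # ground-truth next frame
gen_inferred = gt + 0.05 * torch.randn_like(gt)   # dummy generation, inferred actions
gen_random = gt + 0.30 * torch.randn_like(gt)     # dummy generation, random actions

delta_psnr = psnr(gt, gen_inferred) - psnr(gt, gen_random)
# A large delta_psnr means the latent actions strongly influence the generation.
```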

Implementation Details

For the LAM, the ViT uses patch size = 16, and a latent dim = 32 for the action codes. Both the encoder and the decoder have 20 transformer layers and 16 heads, with width = 1024.

For the video tokenizer, the number of codes is 1024 (instead of the 8 in the LAM), with patch size = 4 (smaller than the LAM's) and latent dim = 32. However, the encoder transformer is smaller, with 12 layers and 8 heads, while the decoder has the same size as the LAM's.

The 2.7B-parameter dynamics model uses 36 layers and 22 heads, with width = 3072.
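For reference, the hyperparameters above gathered into one place (the values are the notes' numbers; the dictionary layout is just for readability):

```python
GENIE_CONFIG = {
    "latent_action_model": {
        "patch_size": 16, "codebook_size": 8, "latent_dim": 32,
        "layers": 20, "heads": 16, "width": 1024,   # both encoder and decoder
    },
    "video_tokenizer": {
        "patch_size": 4, "codebook_size": 1024, "latent_dim": 32,
        "encoder": {"layers": 12, "heads": 8},
        "decoder": {"layers": 20, "heads": 16, "width": 1024},  # same size as the LAM's
    },
    "dynamics_model": {
        "parameters": "2.7B", "layers": 36, "heads": 22, "width": 3072,
    },
}
```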

Video quality scoring is done with a ResNet: they manually labeled some videos for quality and then trained a ResNet on those labels to predict video quality.
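A minimal sketch of that idea, assuming a binary quality label and a standard torchvision ResNet-18; the architecture choice, loss, and training details here are my own stand-ins.

```python
import torch
import torchvision

model = torchvision.models.resnet18()                   # stand-in backbone
model.fc = torch.nn.Linear(model.fc.in_features, 1)     # scalar quality score

opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frames = torch.randn(16, 3, 224, 224)                   # frames from labeled clips
labels = torch.randint(0, 2, (16, 1)).float()           # manual quality labels (assumed binary)

logits = model(frames)
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
opt.step()
```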