World Models for Reinforcement Learning


Figure 1: Architecture of our Latent Autoregressive Flow-Matching (LARF) world model. A simple and scalable approach for fast policy optimization in imagination.

At Tau, we are building a general AI for robots. This requires a massive amount of training data, and learning from human teleoperation demonstrations alone is not sufficient: robots will have to learn from data without labeled actions, through reinforcement learning over trillions of trials.

In the past, researchers have developed physics simulators to generate this data. However, building new environments is labor-intensive, and some aspects, such as deformable objects or other humans, remain hard or impossible to simulate.

Another approach is to learn a simulator from large amounts of unlabeled real-world data. This scales well with more compute and data, and is not bottlenecked by human labor.

Model Architecture

In this report, we introduce a latent autoregressive flow-matching world model (LARF) and scale it to 1B parameters. It predicts real-world images frame by frame in a compressed latent space, using a large causal transformer and a small flow-matching head [1, 19]. This allows us to run inference on the causal transformer once and use its output to predict both the next observation and a future action, which significantly speeds up policy optimization in the learned world model.
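To make the split between the large backbone and the small heads concrete, here is a minimal PyTorch sketch of the interface; the module names, sizes and the action dimension are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the backbone/head split (module names, sizes and the
# action dimension are illustrative assumptions, not the actual implementation).
class LARFSketch(nn.Module):
    def __init__(self, d_model=1024, latent_dim=24 * 8 * 8, action_dim=7):
        super().__init__()
        self.to_model = nn.Linear(latent_dim, d_model)
        # Stand-in for the large causal transformer (causal masking omitted here).
        self.backbone = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, latent_frames):
        # latent_frames: (B, T, latent_dim) flattened latent frames
        h = self.backbone(self.to_model(latent_frames))[:, -1]  # one pass of the expensive backbone
        action = self.action_head(h)                             # future action from the same pass
        # `h` also conditions the small flow-matching head, which is iterated
        # cheaply to produce the next latent frame (see the sections below).
        return h, action

h, a = LARFSketch()(torch.randn(2, 4, 24 * 8 * 8))
```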

Frame-Wise Latent Space

To encode the high-dimensional pixel space of images into a low-dimensional continuous latent space, we design a frame-wise variational autoencoder (VAE) [3]. Given an input video V ∈ ℝ^(T×H×W×C), the tokenizer compresses each frame spatially to [H/32, W/32] while expanding the number of channels C to 24.
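As a shape walk-through, the sketch below assumes a 256×256 RGB input and stands in for the encoder with a plain stack of stride-2 convolutions; it only illustrates the compression described above, not the actual VAE.

```python
import torch
import torch.nn as nn

# Shape walk-through of the frame-wise tokenizer: each frame is compressed
# 32x spatially while the channels grow to 24. The plain conv stack below is
# only a stand-in for the actual VAE encoder.
T, H, W, C = 16, 256, 256, 3
video = torch.randn(T, H, W, C)
frames = video.permute(0, 3, 1, 2)               # (T, C, H, W) for conv layers

channels = [C, 64, 128, 256, 256, 24]            # five stride-2 stages -> 32x downsampling
encoder = nn.Sequential(*[
    nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
    for c_in, c_out in zip(channels[:-1], channels[1:])
])
latents = encoder(frames)
print(latents.shape)                             # torch.Size([16, 24, 8, 8]) == (T, 24, H/32, W/32)
```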

Some of our datasets contain observations from multiple camera angles; we concatenate these images channel-wise before compressing them. We train the VAE to minimize an L1 reconstruction loss, a KL loss, an LPIPS perceptual loss [4, 5], as well as a GAN loss [6].
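A minimal sketch of how these terms might be combined is shown below; the loss weights are made up, and the LPIPS and GAN terms are assumed to come from standard off-the-shelf implementations.

```python
import torch
import torch.nn.functional as F

# Illustrative combination of the tokenizer losses. The weights are made up,
# and `lpips_term` / `gan_term` stand in for values produced by standard
# LPIPS and GAN-loss implementations.
def tokenizer_loss(recon, target, mu, logvar, lpips_term, gan_term,
                   w_kl=1e-6, w_lpips=1.0, w_gan=0.1):
    l1 = F.l1_loss(recon, target)
    # KL divergence of N(mu, sigma^2) against a standard normal prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return l1 + w_kl * kl + w_lpips * lpips_term + w_gan * gan_term

# Multi-view observations are concatenated channel-wise before encoding:
views = [torch.randn(1, 3, 256, 256) for _ in range(2)]   # two camera angles
stacked = torch.cat(views, dim=1)                          # (1, 6, 256, 256)
```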

While increasing the channel dimension improves reconstruction performance measured in PSNR, LPIPS and SSIM, it degrades generation quality. To overcome this optimization dilemma, we semantically regularize the latents towards DINOv2 [7] features by introducing a cosine similarity loss as well as a distance similarity loss [8, 9, 10].
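The sketch below shows one plausible form of these two terms, following our reading of [8, 9, 10]; the linear projection and the DINOv2 feature width of 768 are assumptions.

```python
import torch
import torch.nn.functional as F

# One plausible form of the two regularization terms (our reading of [8-10]);
# the linear projection and the DINOv2 feature width of 768 are assumptions.
def semantic_regularization(latents, dino_feats, proj):
    # latents:    (B, N, 24)  flattened spatial latents of a frame
    # dino_feats: (B, N, 768) patch features from a frozen DINOv2 encoder
    z = proj(latents)                                    # map latents to the DINOv2 width

    # Cosine-similarity loss: align each latent token with its DINOv2 patch.
    cos_loss = 1.0 - F.cosine_similarity(z, dino_feats, dim=-1).mean()

    # Distance-similarity loss: match the pairwise similarity structure of the two token sets.
    z_n, d_n = F.normalize(z, dim=-1), F.normalize(dino_feats, dim=-1)
    dist_loss = F.mse_loss(z_n @ z_n.transpose(-1, -2), d_n @ d_n.transpose(-1, -2))
    return cos_loss, dist_loss

proj = torch.nn.Linear(24, 768)
cos_l, dist_l = semantic_regularization(torch.randn(2, 64, 24), torch.randn(2, 64, 768), proj)
```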

Causal Transformer

We encode the sequence of latents with a causal transformer model that applies attention alternately over the time and space dimensions. The temporal attention treats width and height as batch dimensions and attends only over the time dimension after adding a one-dimensional rotary position embedding [11]. It applies causal masking so that key-value caching can be used at inference time, and we use grouped-query attention [12] to reduce the size of the key-value caches. The spatial attention operates per timestep over the width and height axes and uses a two-dimensional rotary position embedding [13]. We use RMSNorm [14] and SwiGLU [15] for the feedforward layers.
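The block below sketches only the axis reshuffling behind this factorized attention; RoPE, grouped-query attention, RMSNorm and SwiGLU are omitted for brevity, so it illustrates the attention pattern rather than the actual block.

```python
import torch
import torch.nn as nn

# Minimal factorized space-time block (RoPE, GQA, RMSNorm and SwiGLU omitted).
class SpaceTimeBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.t_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, H, W, D) sequence of latent frames
        B, T, H, W, D = x.shape

        # Temporal attention: width/height become batch dims, causal over time.
        xt = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        xt = xt + self.t_attn(xt, xt, xt, attn_mask=causal, need_weights=False)[0]
        x = xt.reshape(B, H, W, T, D).permute(0, 3, 1, 2, 4)

        # Spatial attention: per timestep over the H*W tokens, no masking.
        xs = x.reshape(B * T, H * W, D)
        xs = xs + self.s_attn(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(B, T, H, W, D)

out = SpaceTimeBlock()(torch.randn(1, 4, 8, 8, 512))
```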

Unfortunately, autoregressive prediction is prone to error accumulation. To mitigate this drift, we add a random amount of noise to the latent frames at training time [16]. A learned embedding indicates whether a latent frame is noisy. At inference time, we use the clean embedding for the real context frames and the noisy embedding for the generated latent frames.
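A minimal sketch of this augmentation follows; the uniform noise-level range and the two-entry embedding table for the clean/noisy flag are our assumptions about the general recipe.

```python
import torch

# Sketch of the training-time noise augmentation; the uniform noise-level
# range and the two-entry embedding table are assumptions about the recipe.
def corrupt_context(latents, max_noise=0.7):
    # latents: (B, T, C, H, W) clean latent frames
    B, T = latents.shape[:2]
    level = torch.rand(B, T).view(B, T, 1, 1, 1) * max_noise    # random amount per frame
    return latents + level * torch.randn_like(latents), level

# A learned flag embedding tells the transformer whether a frame is clean (0)
# or noisy/generated (1); it is injected via the AdaLN layers described next.
clean_or_noisy = torch.nn.Embedding(2, 1024)
noisy_latents, level = corrupt_context(torch.randn(2, 8, 24, 8, 8))
```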

This noise embedding, combined with actions when available, is injected into the transformer via AdaLN layers [2].
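The following is a minimal AdaLN-style conditioning layer in the spirit of [2]; the conditioning width and the exact scale/shift parameterization are assumptions.

```python
import torch
import torch.nn as nn

# Minimal AdaLN-style conditioning: the conditioning vector (noise embedding,
# plus an action embedding when available) predicts a per-layer scale and
# shift applied after a parameter-free LayerNorm.
class AdaLN(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

x = torch.randn(2, 64, 512)          # (batch, tokens, dim)
cond = torch.randn(2, 128)           # noise embedding (+ action embedding)
y = AdaLN(512, 128)(x, cond)
```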

Flow-Matching Head

The output of the encoder is normalized and added to the conditioning of the flow-matching head [17, 18], which is implemented as a standard diffusion transformer with only spatial attention [2]. The objective is to predict the next latent frame; however, we noticed that predicting the delta between consecutive frames performs better than direct prediction.

At training time, we sample a noise timestep t independently for each episode timestep i from a logit-normal distribution [19]. We construct the training sample x_t = t · (x_{i+1} − x_i) + (1 − t) · x_0, where x_0 ∼ 𝒩(0, 1) and x_i denotes the latent at episode timestep i after adding a random amount of noise to mitigate error accumulation. The model is then trained to minimize the mean squared error to the velocity v_t = (x_{i+1} − x_i) − x_0. At inference time, we use a simple first-order Euler ODE solver with uniformly spaced timesteps.
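Putting the formulas together, the sketch below constructs the training sample and target and implements the uniform-step Euler sampler; the logit-normal parameters, step count and the placeholder head are assumptions.

```python
import torch

# Worked sketch of the target construction and the Euler sampler. The
# logit-normal parameters, step count and the stand-in `head` are assumptions.
def flow_matching_targets(x_i, x_next):
    # x_i, x_next: (B, ...) current (noise-augmented) and next latent frames
    t = torch.sigmoid(torch.randn(x_i.shape[0]))       # t ~ logit-normal in (0, 1)
    t = t.view(-1, *([1] * (x_i.dim() - 1)))
    x0 = torch.randn_like(x_i)                          # pure noise
    delta = x_next - x_i                                # we predict the frame delta
    x_t = t * delta + (1 - t) * x0                      # training sample
    v_t = delta - x0                                    # velocity target (MSE objective)
    return x_t, t, v_t

@torch.no_grad()
def sample_delta(head, cond, shape, steps=10):
    # First-order Euler ODE solver with uniformly spaced timesteps.
    x = torch.randn(shape)
    for k in range(steps):
        t = torch.full((shape[0],), k / steps)
        x = x + head(x, t, cond) / steps                # head predicts the velocity
    return x                                            # estimate of x_{i+1} - x_i

# Exercise both with dummy tensors and a stand-in head:
x_t, t, v_t = flow_matching_targets(torch.randn(2, 24, 8, 8), torch.randn(2, 24, 8, 8))
delta = sample_delta(lambda x, t, cond: torch.zeros_like(x), cond=None, shape=(2, 24, 8, 8))
```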

Experiments

We curated a dataset from both internal and publicly available sources. Our in-house dataset is collected mostly autonomously by our robots and comprises random exploration data as well as task-specific data such as picking, grasping and sweeping. For our internet-sourced dataset, we use the Ego4D dataset [20], royalty-free videos from Pexels, and RoboSet [21]. The first two sources depict human activities, with Ego4D captured by a head-mounted camera to provide observations from an egocentric perspective.

In total, we train on around 0.5B frames. All videos are sampled at 10 fps with a resolution of 256px after filtering out low-quality clips.

We scale our model up to 1B parameters and pre-train it on the entire dataset for 42 hours on 64 H100 GPUs. Afterwards, we fine-tune exclusively on robot data for 26 hours on 16 H100 GPUs. Inference speed for autoregressive latent-frame prediction on a single H100 is around 110 fps at a batch size of 32. Decoding the latent frames to pixels is only needed for us humans; policy optimization will happen in the latent space. Below you can see some generations on human and robot data.

Video predictions on human egocentric videos. The first column contains ground truth video and the other columns are randomly sampled generations. The first four frames are ground truth reconstructed frames and the last 50 are generated.

Video predictions on multi-view robot data. The first column contains ground truth video and the other columns are randomly sampled generations. The first frames are ground truth reconstructed frames and the last 30 are generated.
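Since policy optimization will happen entirely in the latent space, the sketch below illustrates what an imagination rollout might look like; the step, sampling and policy callables are hypothetical stand-ins, not our actual interfaces.

```python
import torch

# Hedged sketch of an imagination rollout entirely in latent space; the
# `step_fn`, `sample_fn` and `policy` callables are hypothetical stand-ins,
# and decoding to pixels is never needed inside the loop.
@torch.no_grad()
def imagine(step_fn, sample_fn, policy, z_context, horizon=50):
    # step_fn:   one (KV-cached) causal-transformer step -> conditioning h
    # sample_fn: a few Euler steps of the flow head -> predicted frame delta
    # policy:    maps the current latent frame to an action
    latents, actions = [z_context[:, -1]], []
    for _ in range(horizon):
        a = policy(latents[-1])
        h = step_fn(z_context, a)
        z_next = latents[-1] + sample_fn(h)              # delta prediction -> next latent frame
        z_context = torch.cat([z_context, z_next.unsqueeze(1)], dim=1)
        latents.append(z_next)
        actions.append(a)
    return latents, actions

# Exercise the loop with dummy callables:
zs, acts = imagine(lambda ctx, a: ctx[:, -1],
                   lambda h: torch.zeros_like(h),
                   lambda z: torch.zeros(z.shape[0], 7),
                   z_context=torch.randn(2, 4, 1536), horizon=5)
```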

Future Work and Limitations

Our 1B parameter model generates accurate predictions for simple scenes, while generating complex collisions or other humans remains challenging. We expect significant performance improvements by expanding model size and dataset scale.

However, simulating complete visual observations is nearly always unnecessary for reinforcement learning. Images contain a lot of task-irrelevant information. Using a task-relevant latent space is an exciting future direction for improved efficiency of world model and RL training.

Another challenge is long-horizon tasks where autoregressive frame-wise simulation is slow and prone to significant error accumulation. To address this inefficiency, future work should investigate hierarchical approaches that enable both world models and RL policies to operate in temporally abstract spaces.

If you want to help us build this, please check out our open positions.


References

[1] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv preprint arXiv:2112.10752.
[2] Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. arXiv preprint arXiv:2212.09748.
[3] Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
[4] Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv preprint arXiv:1603.08155.
[5] Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arXiv preprint arXiv:1801.03924.
[6] Esser, P., Rombach, R., & Ommer, B. (2021). Taming Transformers for High-Resolution Image Synthesis. arXiv preprint arXiv:2012.09841.
[7] Oquab, M., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. arXiv preprint arXiv:2304.07193.
[8] Xiong, T., Liew, J. H., Huang, Z., Feng, J., & Liu, X. (2025). GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation. arXiv preprint arXiv:2504.08736.
[9] Yao, J., Yang, B., & Wang, X. (2025). Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. arXiv preprint arXiv:2501.01423.
[10] Xu, W., et al. (2025). Exploring Representation-Aligned Latent Space for Better Generation. arXiv preprint arXiv:2502.00359.
[11] Su, J., et al. (2023). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.
[12] Ainslie, J., et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv preprint arXiv:2305.13245.
[13] Heo, B., Park, S., Han, D., & Yun, S. (2024). Rotary Position Embedding for Vision Transformer. arXiv preprint arXiv:2403.13298.
[14] Zhang, B., & Sennrich, R. (2019). Root Mean Square Layer Normalization. arXiv preprint arXiv:1910.07467.
[15] Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.
[16] Valevski, D., Leviathan, Y., Arar, M., & Fruchter, S. (2025). Diffusion Models Are Real-Time Game Engines. arXiv preprint arXiv:2408.14837.
[17] Li, T., Tian, Y., Li, H., Deng, M., & He, K. (2024). Autoregressive Image Generation without Vector Quantization. arXiv preprint arXiv:2406.11838.
[18] Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2023). Flow Matching for Generative Modeling. arXiv preprint arXiv:2210.02747.
[19] Esser, P., et al. (2024). Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. arXiv preprint arXiv:2403.03206.
[20] Grauman, K., et al. (2022). Ego4D: Around the World in 3,000 Hours of Egocentric Video. arXiv preprint arXiv:2110.07058.
[21] Kumar, V., et al. (2023). RoboHive: A Unified Framework for Robot Learning. arXiv preprint arXiv:2310.06828.