raminmohammadi/World_Models

Technical Report: Neural Hallucination and Latent Policy Optimization in CarRacing-v3

A modular PyTorch implementation of World Models (Ha & Schmidhuber, 2018) for the CarRacing-v3 environment. This repository breaks down the architecture into three distinct stages—Perception, Memory, and Action—with integrated experiment tracking via Weights & Biases (W&B).

Date: February 19, 2026

Subject: Implementation and Empirical Analysis of World Models for Autonomous Navigation


World Model: Eval Episode 4

1. Abstract

This project is my implementation and empirical evaluation of the "World Models" architecture in the CarRacing-v3 environment. A core challenge in vision-based autonomous navigation is the high dimensionality of visual data. By decoupling perception from reasoning, we aim to build an agent that models the underlying dynamics of its world rather than just reacting to pixels.

This re-implementation serves as a "baptism by fire" in debugging a seminal AI milestone. Building a z=32 latent bottleneck from scratch on a modern, sometimes unstable stack like Python 3.14 provided a research-driven stress test for the theory. Navigating the cuBLAS initialization minefield and the thermal limits of an RTX 3070 Ti was the price of admission to prove that a neural hallucination can effectively master reality.


2. Introduction: The Problem of High-Dimensional Real-Time Learning

In traditional Reinforcement Learning (RL), agents often struggle with the "curse of dimensionality" when processing raw pixel data. Calculating gradients across thousands of pixels while simultaneously learning physics and long-term strategy is computationally expensive and prone to high variance.

Our goal was to solve the CarRacing-v3 task by building a surrogate environment. By predicting the consequences of its actions, the agent can play the game millions of times "in its head" (the dream state), transferring that learned intuition back to the real track.


3. Component Architecture and Methodology

3.1 Perception ($V$): The Variational Autoencoder

The Vision model is a Variational Autoencoder (VAE) designed to compress 64x64 RGB frames into a 32-dimensional latent vector $z$.

  • Architecture: The encoder uses four convolutional layers with stride-2 to downsample the input, while the decoder utilizes transposed convolutions to reconstruct the original frame.
  • The Latent Bottleneck: We utilize the reparameterization trick to allow backpropagation through the stochastic sampling process.
  • Empirical Findings: Training logs show a validation loss reduction from 18.0 to 14.5. The VAE successfully discards high-frequency noise like grass textures while preserving the structural "signal" necessary for navigation, such as track curvature and car orientation.
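The shape arithmetic and the sampling step described above can be sketched in plain Python (no PyTorch, so it runs anywhere). This is an illustration under stated assumptions, not the repository's code: the 4x4 kernel with no padding is borrowed from the original World Models paper, and `conv_out`, `encoder_sizes`, `reparameterize`, and `kl_to_unit_gaussian` are hypothetical names.

```python
import math
import random

def conv_out(size, kernel=4, stride=2, padding=0):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_sizes(size=64, layers=4):
    """Feature-map side length after each stride-2 layer (assumed 4x4 kernels)."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(conv_out(sizes[-1]))
    return sizes

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1).

    The randomness enters as an input (eps), so z stays a differentiable
    function of mu and logvar.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_to_unit_gaussian(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))

rng = random.Random(0)
mu, logvar = [0.0] * 32, [0.0] * 32   # stand-ins for encoder outputs
z = reparameterize(mu, logvar, rng)

print(encoder_sizes())   # [64, 31, 14, 6, 2]
print(len(z))            # 32
```

Under these assumptions the 64x64 input shrinks to a 2x2 feature map before it is projected to the mean and log-variance of $z$; the KL term is what pulls the latent distribution toward a unit Gaussian during training.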

3.2 Memory ($M$): Stochastic Temporal Forecasting

The Memory component is a Mixture Density Network combined with a Long Short-Term Memory network (MDN-RNN).

  • Predictive Power: The model predicts $P(z_{t+1} \mid a_t, z_t, h_t)$, where $a_t$ is the action, $z_t$ is the current latent vector, and $h_t$ is the hidden state of the LSTM.
  • Hyperparameter Selection: We configured the MDN head with 5 Gaussian components to handle the multi-modal nature of the environment, such as choosing between a left or right turn at a fork.
  • Loss Dynamics: The MDN loss (a negative log-likelihood) dropped to approximately -155. A negative value is possible because a continuous probability density can exceed 1; it indicates that the predicted Gaussians became sharply peaked around the observed next latent vectors, meaning the model predicts the "internal physics" of the track with high confidence.
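To make the negative loss less mysterious, here is a minimal sketch of a mixture negative log-likelihood for a single scalar target, in plain Python. The function names and all numbers are illustrative assumptions, not values from the training logs.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def mdn_nll(x, weights, mus, sigmas):
    """Negative log-likelihood of x under a 1-D Gaussian mixture.

    Uses log-sum-exp so very small component densities do not underflow
    the final logarithm.
    """
    logs = [math.log(w) + gaussian_log_pdf(x, m, s)
            for w, m, s in zip(weights, mus, sigmas)]
    top = max(logs)
    return -(top + math.log(sum(math.exp(v - top) for v in logs)))

target = 0.0

# Five broad, equally weighted components: the density at the target is
# below 1, so the loss is positive.
broad = mdn_nll(target, [0.2] * 5, [-1.0, -0.5, 0.0, 0.5, 1.0], [1.0] * 5)

# One dominant, narrow component centred on the target: the density is far
# above 1, so the loss is negative.
peaked = mdn_nll(target, [0.96, 0.01, 0.01, 0.01, 0.01],
                 [0.0, -1.0, -0.5, 0.5, 1.0], [0.01] * 5)

print(broad > 0, peaked < 0)   # True True
```

If the loss is summed over the 32 latent dimensions (an assumption about the reduction used here), per-dimension terms of this kind add up quickly, so totals far below zero are unsurprising for a confident model.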

4. Training in the Dream State ($C$)

The Controller ($C$) is a minimalist linear layer mapping the concatenated $[z_t, h_t]$ to action vectors. We used CMA-ES (Covariance Matrix Adaptation Evolution Strategy) to optimize its weights.
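A rough sketch of what such a controller and its gradient-free optimization look like, in plain Python. Several things here are assumptions: the controller input is taken to be the latent vector concatenated with the LSTM hidden state (as in the original paper), the hidden size of 256 and the tanh squashing are guesses, and `evolve` is a deliberately simplified evolution strategy standing in for CMA-ES (it has no covariance adaptation). All names are hypothetical.

```python
import math
import random

Z_DIM, H_DIM, A_DIM = 32, 256, 3             # H_DIM = 256 is an assumption
N_PARAMS = A_DIM * (Z_DIM + H_DIM) + A_DIM   # weight matrix plus biases

def act(params, z, h):
    """Linear controller a = W [z; h] + b, squashed into CarRacing's ranges."""
    x = list(z) + list(h)
    n = len(x)
    raw = [sum(params[i * n + j] * x[j] for j in range(n)) + params[A_DIM * n + i]
           for i in range(A_DIM)]
    steer = math.tanh(raw[0])                # steering in [-1, 1]
    gas = 0.5 * (math.tanh(raw[1]) + 1.0)    # gas in [0, 1]
    brake = 0.5 * (math.tanh(raw[2]) + 1.0)  # brake in [0, 1]
    return [steer, gas, brake]

def evolve(fitness, n_params, generations=30, popsize=16, sigma=0.1, seed=0):
    """Simplified evolution strategy (NOT full CMA-ES): sample candidates
    around the mean, keep the best quarter, and average them."""
    rng = random.Random(seed)
    mean = [0.0] * n_params
    for _ in range(generations):
        pop = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
               for _ in range(popsize)]
        pop.sort(key=fitness, reverse=True)   # higher fitness is better
        elite = pop[: popsize // 4]
        mean = [sum(col) / len(elite) for col in zip(*elite)]
    return mean

rng = random.Random(0)
params = [rng.gauss(0.0, 0.1) for _ in range(N_PARAMS)]
action = act(params, [0.0] * Z_DIM, [0.0] * H_DIM)
print(N_PARAMS, len(action))   # 867 3

def toy_fitness(p):
    """Toy objective with a known optimum at 0.5 in every coordinate."""
    return -sum((v - 0.5) ** 2 for v in p)

best = evolve(toy_fitness, n_params=5)
```

In the real pipeline the fitness would be the cumulative reward of a rollout rather than a toy quadratic. The small parameter count is what makes an evolution strategy practical: only the controller is searched, while the perception and memory models stay frozen.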

4.1 The Hallucination Loop

In "Dream Mode," the agent never interacts with the actual Gym environment. Instead, the RNN predicts the next latent vector and reward, which are fed back into the Controller.

  • Hurdle: A significant risk is "model exploitation," where the Controller learns to exploit inaccuracies in the RNN's predictions.
  • Stability: To counter this, we used a deterministic strategy during training, taking the mean of the most likely Gaussian component to stabilize the evolution.
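The loop and the deterministic readout described above might look roughly like this. It is a sketch: `model_step` is a hypothetical stand-in for the trained MDN-RNN, and for simplicity the mixture weights are assumed to be shared across all latent dimensions, which may not match the actual model.

```python
def most_likely_mean(weights, means):
    """Deterministic MDN readout: the mean of the highest-weight component."""
    k = max(range(len(weights)), key=lambda i: weights[i])
    return means[k]

def dream_rollout(policy, model_step, z0, h0, steps=100):
    """Roll the controller forward inside the learned model only.

    model_step(z, h, action) must return (weights, means, reward, next_h),
    where means[k] is component k's prediction of the next latent vector.
    The real environment is never queried.
    """
    z, h, total = z0, h0, 0.0
    for _ in range(steps):
        action = policy(z, h)
        weights, means, reward, h = model_step(z, h, action)
        z = most_likely_mean(weights, means)   # no sampling noise
        total += reward
    return total

def toy_model(z, h, action):
    """Toy dynamics with three components and a constant reward of 1."""
    weights = [0.1, 0.7, 0.2]
    means = [[v - 1.0 for v in z],
             [v + action[0] for v in z],
             [v + 1.0 for v in z]]
    return weights, means, 1.0, h

total = dream_rollout(lambda z, h: [0.5], toy_model, [0.0, 0.0], None, steps=10)
print(total)   # 10.0
```

Removing the sampling noise makes each candidate's fitness repeatable, which helps the evolutionary search compare controllers. The trade-off is that a fully deterministic dream is easier for the controller to exploit, so validating the evolved policy in the real environment remains essential.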

5. Hardware Constraints and Pipeline Bottlenecks

During training, we monitored the RTX 3070 Ti Laptop GPU closely.

  • GPU Utilization: Utilization stayed in the 90-100% range, indicating a well-saturated compute pipeline.
  • The "CUBLAS" Hurdle: We encountered CUBLAS_STATUS_NOT_INITIALIZED errors traced to the TensorFloat-32 (TF32) kernel path. Disabling TF32 (torch.backends.cuda.matmul.allow_tf32 = False) ensured stability at the cost of slight performance overhead.
  • Thermal Constraints: The GPU temperature plateaued at 85°C. This suggests that sequential LSTM operations place a more sustained load on the silicon than "burstier" convolutional workloads.
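For reference, the TF32 workaround can be applied once at the top of each training script. The sketch below wraps it in a hypothetical helper, `disable_tf32`; the cuDNN flag is an addition not mentioned in the text, included on the assumption that convolutions should follow the same precision path. The import guard only exists so the snippet is harmless on machines without PyTorch.

```python
import importlib.util

def disable_tf32():
    """Turn off TF32 for CUDA matmuls and cuDNN convolutions.

    Returns False (and does nothing) if PyTorch is not installed.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    torch.backends.cuda.matmul.allow_tf32 = False   # flag named in the text
    torch.backends.cudnn.allow_tf32 = False         # assumption: cover convs too
    return True

disable_tf32()
```

Calling it before any model or tensor is moved to the GPU keeps the setting consistent for the whole run.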

6. Synthesized Findings

This implementation reveals a high-performing architecture that solves the CarRacing-v3 task by successfully decoupling perception, memory, and control. By compressing environment observations into a z=32 latent vector, the model prioritized essential features over decorative noise. The VAE's stable convergence indicates that this bottleneck is sufficient for low-complexity physics.

The MDN-RNN demonstrated a powerful ability to "hallucinate" transitions, with reward prediction MSE hitting a near-zero 0.00004. This allows the agent to effectively "feel" incoming rewards within its dream state.

The end result shows in the Controller's performance: episode rewards exceeding 780, which this report treats as "solved" (the benchmark's conventional criterion is stricter: an average of 900 over 100 consecutive episodes). The agent developed an aggressive, high-velocity racing style characterized by tight cornering and intentional drifting. However, the hardware constraints and stability issues in the Python 3.14 environment suggest that the next step for this research is moving toward Transformer-based world models to bypass the sequential compute bottlenecks inherent in RNNs.


📂 Project Structure

World_Models/
├── data/                  # Dataset storage
│   ├── recordings/        # Raw rollouts (.npz)
│   ├── processed/         # Preprocessed latent vectors (dataset.pt)
│   └── weights/           # Saved model checkpoints
├── src/
│   ├── action/            # (C) Controller Component
│   ├── memory/            # (M) Memory Component
│   ├── perception/        # (V) Vision Component
│   ├── helpers.py         # Global logging & utilities
│   └── play_world_model.py # Inference/Demo script

🚀 Pipeline & Usage

1. Data Generation

Collect raw rollouts using multiprocessing:

    python generate_data.py

2. Perception (VAE)

Train the VAE to compress 64x64 frames into a z=32 vector:

    python src/perception/train_vae.py --epochs 20

3. Memory Preprocessing

Convert raw images into latent sequences ($z$) to speed up RNN training:

    python src/memory/preprocess_rnn.py

4. Memory Training (MDN-RNN)

Train the MDN-RNN to predict the future latent state:

    python src/memory/train_rnn.py --epochs 20

5. Action (Controller Evolution)

Evolve the Controller using CMA-ES:

    python src/action/train_controller.py --generations 100
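The five stages depend on each other's outputs, so they need to run in order. A small driver script can chain them; the sketch below is not part of the repository, and `STAGES` and `run_pipeline` are hypothetical names. It assumes it is launched from the repository root.

```python
import subprocess
import sys

# Script paths and flags are copied from the steps above.
STAGES = [
    ["generate_data.py"],
    ["src/perception/train_vae.py", "--epochs", "20"],
    ["src/memory/preprocess_rnn.py"],
    ["src/memory/train_rnn.py", "--epochs", "20"],
    ["src/action/train_controller.py", "--generations", "100"],
]

def run_pipeline(dry_run=True):
    """Build the stage commands and, unless dry_run is set, run them in order.

    check=True stops the pipeline at the first stage that fails, so a later
    stage never trains on missing or stale outputs.
    """
    commands = [[sys.executable, *stage] for stage in STAGES]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands

# Preview the commands without executing anything.
for cmd in run_pipeline(dry_run=True):
    print(" ".join(cmd))
```

Using `sys.executable` keeps every stage on the same interpreter and virtual environment as the driver itself.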


🎮 Inference (Play Mode)

Watch your fully trained agent play in real-time:

    python src/play_world_model.py --render_mode human


📊 Experiment Tracking

This project uses Weights & Biases for logging:

  • Loss Curves: Monitor VAE reconstruction and RNN log-likelihood.
  • Video: The train_controller.py script automatically uploads video replays of the best agents.

To disable logging:

    wandb disabled
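Logging can also be switched off per run from inside Python, using the `WANDB_MODE` environment variable that W&B reads at initialization. Whether the training scripts here respect it depends on how they call `wandb.init`, so treat this as a sketch.

```python
import os

# Must be set before wandb.init() is called in the same process.
os.environ["WANDB_MODE"] = "disabled"
```

Unlike the CLI command, this affects only the current process, which is convenient for quick debugging runs.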
