Neural Game Engine

Neural network approach for modeling interactive game environments using a VQ-VAE and Spatio-Temporal Transformer. Trained on Atari Skiing gameplay data.

[Animated comparison: original Skiing gameplay and AI-generated gameplay]

Install

  1. Clone Repo
git clone https://github.com/AndrewBoessen/neural-game-engine.git
cd neural-game-engine
  2. Create Conda Environment
conda create -n engine python=3.10
conda activate engine
  3. Install Dependencies
pip install -r requirements.txt

Load Checkpoints and Data

Pretrained Model Checkpoints

Download the model checkpoints and move them to the repository root directory.

  • VQ-VAE Checkpoint: Download Here
  • Neural Game Engine Checkpoint: Download Here

Gameplay Dataset

The model is trained on a dataset of ~33,000 frames and evaluated on a set of ~8,000 frames.

Before the transformer game-engine model is trained, the gameplay dataset is preprocessed and tokenized.

data.mp4

Download Datasets

  • Gameplay Dataset: Download Here (extract to /gameplay_data/)
  • Token Dataset: Download Here (extract to /token_data/)

Play

An interactive game script, play.py, generates frames in response to user input.

gameplay.mp4

Example gameplay recording, running on an NVIDIA RTX 4070 at 15 fps.

Run Interactive Game Environment

Follow the instructions above to install the dependencies and download the token data and model checkpoints, then run:

python play.py

Architecture

The Neural Game Engine architecture leverages a combination of a Vector Quantized Variational Auto-Encoder (VQ-VAE) for image tokenization and a Spatio-Temporal Transformer (ST-Transformer) for modeling game dynamics. This design enables the simulation of interactive game environments by capturing the causal relationships between user actions and game state transitions.

neural game engine

Interactive Game Engine. An RL agent is used to create a dataset of (observation, action) pairs. The observations are encoded into states with the VAE encoder E. The sequential model (ST-Transformer) takes the encoded (state, action) pairs and predicts the next state s_{t+1}. The state to be predicted, initially represented entirely by mask tokens, [MASK], is generated iteratively in a non-autoregressive manner with MaskGIT and bidirectional attention. Predicted states are projected back to pixel space with the VAE decoder D.

Data Collection (RL Agent)

A reinforcement learning (RL) agent is employed to collect gameplay data, generating a dataset of observation-action pairs. The agent interacts with the environment and records its trajectories, simulating diverse scenarios. This dataset, consisting of approximately 33,000 training frames and 8,000 validation frames, serves as the foundation for training both the image tokenizer and the game engine.

Image Tokenizer (VQ-VAE)

The VQ-VAE encodes game frames into a discrete latent space, forming a tokenized representation of the image. It uses:

  • Encoder: Converts 256×256 RGB images into a 16×16 grid of tokens, reducing spatial complexity.
  • Codebook: Contains 512 unique tokens used for quantization.
  • Decoder: Reconstructs images from the tokenized representations.

This compression allows the ST-Transformer to operate on discrete tokens rather than raw pixel data, enabling efficient sequence modeling.
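
To make the tokenization step concrete, the sketch below shows the nearest-neighbor codebook lookup used by a generic VQ-VAE, along with the token arithmetic implied by the sizes quoted above. It is an illustration only: the toy codebook, the toy latents, and the function names `tokens_per_frame` and `quantize` are assumptions and are not taken from this repository.

```python
import math


def tokens_per_frame(image_size: int = 256, grid_size: int = 16) -> dict:
    """Token arithmetic for the image and grid sizes quoted above."""
    return {
        "downsample_factor": image_size // grid_size,  # pixels per token side
        "tokens": grid_size * grid_size,               # tokens per frame
    }


def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, math.inf
    for index, entry in enumerate(codebook):
        dist = sum((v - e) ** 2 for v, e in zip(vector, entry))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Toy codebook with 4 two-dimensional entries (the README's codebook has 512).
toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical encoder outputs for a 2x2 grid of latent vectors.
toy_latents = [[0.1, -0.2], [0.9, 0.1], [0.2, 0.8], [1.1, 0.9]]
toy_tokens = [quantize(z, toy_codebook) for z in toy_latents]

print(tokens_per_frame())  # {'downsample_factor': 16, 'tokens': 256}
print(toy_tokens)          # [0, 1, 2, 3]
```

With the sizes above, each 256×256 frame becomes 256 token indices, each drawn from a vocabulary of 512, which is what the transformer consumes in place of pixels.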

Game Engine (ST-Transformer)

The ST-Transformer predicts future game states based on past states and user actions. It processes sequences of state-action token pairs, capturing both spatial and temporal dependencies:

  • Spatial Attention: Models relationships between tokens within a single frame.
  • Temporal Attention: Models dependencies across multiple frames.
  • MaskGIT Algorithm: Uses bidirectional attention and iterative refinement to generate states in parallel, reducing computation time while maintaining visual fidelity.
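
The iterative refinement described in the last bullet can be sketched as follows. This is a minimal illustration of MaskGIT-style decoding with a cosine unmasking schedule, not the repository's implementation: `dummy_predictor` is a random stand-in for the ST-Transformer, and the step count, the `MASK` id, and all function names are assumptions.

```python
import math
import random

MASK = -1  # stand-in id for the [MASK] token


def cosine_schedule(step: int, total_steps: int, num_tokens: int) -> int:
    """Number of tokens left masked after `step` (1-indexed) of `total_steps`."""
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return max(0, math.floor(ratio * num_tokens))


def dummy_predictor(tokens, rng, vocab_size=512):
    """Stand-in for the transformer: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def maskgit_decode(num_tokens=256, total_steps=8, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # the frame starts fully masked
    for step in range(1, total_steps + 1):
        predictions = dummy_predictor(tokens, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        keep_masked = cosine_schedule(step, total_steps, num_tokens)
        # Commit the most confident masked positions; the rest stay masked
        # and are predicted again in the next step.
        masked.sort(key=lambda i: predictions[i][1], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = predictions[i][0]
    return tokens


frame_tokens = maskgit_decode()
print(sum(t == MASK for t in frame_tokens))  # 0: every position is filled
```

In this sketch a 256-token frame is filled in 8 forward passes instead of 256 sequential ones, which is the source of the speedup over token-by-token autoregressive decoding.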

State Prediction and Interactivity

The predicted tokens are passed to the VQ-VAE decoder to generate the next game frame. Each user action conditions the prediction, so generated frames respond to input in real time and remain causally consistent with it. The iterative non-autoregressive process limits quality degradation over time, though edge cases (e.g., object collisions) remain challenging.
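
The feedback loop described above can be sketched as a simple rollout. The names `rollout`, `predict_next`, and `decode`, and the context length of 4, are hypothetical placeholders for the ST-Transformer, the VQ-VAE decoder, and whatever context window the actual model uses. For brevity, the sketch keeps only past states in the context and passes the current action separately.

```python
from collections import deque


def rollout(initial_states, actions, predict_next, decode, context_len=4):
    """Generate one frame per action, feeding each prediction back as context."""
    context = deque(initial_states, maxlen=context_len)
    frames = []
    for action in actions:
        next_state = predict_next(list(context), action)
        frames.append(decode(next_state))
        context.append(next_state)  # the oldest state drops out automatically
    return frames


# Toy stand-ins: a "state" is a single int and the action is added to it.
def toy_predict(context, action):
    return context[-1] + action


def toy_decode(state):
    return f"frame<{state}>"


print(rollout([0], [1, 1, -1], toy_predict, toy_decode))
# ['frame<1>', 'frame<2>', 'frame<1>']
```

Because each predicted state re-enters the context, errors can accumulate over long rollouts, which is why limiting degradation over time matters for this design.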

This architecture demonstrates the feasibility of neural networks in simulating dynamic, visually coherent game environments with real-time responsiveness. For further details, refer to the full technical report.
