A real-time benchmarking platform that tests how different Large Language Models (LLMs) solve sliding tile puzzles (n-puzzle problems). The system orchestrates puzzle games, calls LLM APIs, tracks move-by-move performance, and displays live results through an interactive dashboard with WebSocket updates.
- Real-time Dashboard - Live visualization of active puzzle games with WebSocket updates
- Multiple LLM Support - Test various models including GPT-4, GPT-4o, Claude 3.5 Sonnet, and more
- Batch Benchmarking - Run multiple configurations simultaneously for statistical comparison
- Game Replay - Step-by-step playback of completed games with adjustable speed
- Analytics Panel - Charts showing performance by model, board size, and success rates
- Filtering & Search - Filter games by status, model, or search by run ID
- Dark/Light Theme - Toggle between themes for comfortable viewing
- Node.js 18+
- pnpm package manager
- AI Gateway API key (for LLM access)
Create a .env file in the project root:
# Required: API key for Vercel AI Gateway
AI_GATEWAY_API_KEY=your_api_key_here
# Optional: Custom AI Gateway URL (defaults to Vercel AI Gateway)
AI_GATEWAY_BASE_URL=https://ai-gateway.vercel.sh/v1
# Optional: Default model to use (defaults to openai/gpt-4.1-mini)
DEFAULT_MODEL_ID=openai/gpt-4.1-mini
# Optional: Server port (defaults to 5000)
PORT=5000

# Install dependencies
pnpm install
# Run development server
pnpm dev
# Run type checking
pnpm check
# Build for production
pnpm build
# Start production server
pnpm start

├── client/ # Frontend React application
│ ├── src/
│ │ ├── components/ # React components
│ │ │ ├── ui/ # Shadcn UI components
│ │ │ └── ... # App-specific components
│ │ ├── hooks/ # Custom React hooks
│ │ ├── lib/ # Utility functions
│ │ └── pages/ # Page components
│ └── index.html
├── server/ # Backend Express server
│ ├── index.ts # Server entry point
│ ├── routes.ts # API routes & WebSocket setup
│ ├── game-engine.ts # Puzzle game logic
│ ├── llm-client.ts # LLM API integration
│ ├── orchestrator.ts # Benchmark run management
│ └── storage.ts # In-memory data storage
├── shared/ # Shared types & schemas
│ └── schema.ts # TypeScript types & Zod schemas
└── package.json
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/runs | Get all benchmark runs |
| GET | /api/runs/:runId | Get specific run details |
| POST | /api/runs | Create new benchmark run |
| POST | /api/runs/:runId/step | Execute single move step |
| GET | /api/stats | Get aggregate statistics |

Connect to /ws for real-time updates. Message types:
- run_created - New benchmark run started
- run_updated - Run progress updated
- run_completed - Run finished
- stats_updated - Statistics refreshed
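
The four message types above can be handled with a small discriminating parser. This is a minimal sketch, not the project's actual client code: it assumes each message is a JSON object with a `type` field and an optional `payload`, a shape this README does not document. `DashboardMessage`, `parseMessage`, and `describeMessage` are hypothetical names.

```typescript
// The four message types listed for the /ws endpoint.
type MessageType = "run_created" | "run_updated" | "run_completed" | "stats_updated";

// Assumed envelope: the real payload shape is not documented here.
interface DashboardMessage {
  type: MessageType;
  payload?: unknown;
}

const MESSAGE_TYPES: readonly string[] = [
  "run_created",
  "run_updated",
  "run_completed",
  "stats_updated",
];

// Parse a raw WebSocket frame; return null for malformed or unknown messages.
function parseMessage(raw: string): DashboardMessage | null {
  try {
    const data: unknown = JSON.parse(raw);
    if (typeof data !== "object" || data === null) return null;
    const { type, payload } = data as { type?: unknown; payload?: unknown };
    if (typeof type !== "string" || MESSAGE_TYPES.indexOf(type) === -1) return null;
    return { type: type as MessageType, payload };
  } catch {
    return null;
  }
}

// Map each message type to the description given in the list above.
function describeMessage(msg: DashboardMessage): string {
  switch (msg.type) {
    case "run_created":
      return "New benchmark run started";
    case "run_updated":
      return "Run progress updated";
    case "run_completed":
      return "Run finished";
    case "stats_updated":
      return "Statistics refreshed";
  }
}
```

In a browser, a client might open `new WebSocket("ws://localhost:5000/ws")` (assuming the default port) and pass each `event.data` string through `parseMessage` before updating the UI.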
Send a POST request to /api/runs:
{
"modelId": "openai/gpt-4o-mini",
"size": 3,
"maxMoves": 100,
"scrambleDepth": 20
}

| Parameter | Type | Default | Description |
|---|---|---|---|
| modelId | string | DEFAULT_MODEL_ID env | LLM model identifier |
| size | number | 4 | Board size (2-6) |
| maxMoves | number | 200 | Maximum moves before failure (10-500) |
| scrambleDepth | number | 50 | Random moves for scrambling (1-200) |
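
The defaults and ranges in the table can be applied to a request body as sketched below. This is an illustration only: the project validates with Zod schemas in shared/schema.ts, which may differ. `CreateRunRequest`, `ResolvedRunConfig`, `requireIntInRange`, and `resolveRunConfig` are hypothetical names, and treating the numeric fields as integers is an assumption.

```typescript
// Hypothetical request type mirroring the parameter table.
interface CreateRunRequest {
  modelId?: string;
  size?: number;
  maxMoves?: number;
  scrambleDepth?: number;
}

interface ResolvedRunConfig {
  modelId: string;
  size: number;
  maxMoves: number;
  scrambleDepth: number;
}

// Assumption: numeric parameters must be integers within the documented range.
function requireIntInRange(name: string, value: number, min: number, max: number): number {
  if (!Number.isInteger(value) || value < min || value > max) {
    throw new RangeError(`${name} must be an integer between ${min} and ${max}, got ${value}`);
  }
  return value;
}

// Fill in documented defaults, then check each value against its documented range.
function resolveRunConfig(req: CreateRunRequest, defaultModelId: string): ResolvedRunConfig {
  return {
    modelId: req.modelId ?? defaultModelId,
    size: requireIntInRange("size", req.size ?? 4, 2, 6),
    maxMoves: requireIntInRange("maxMoves", req.maxMoves ?? 200, 10, 500),
    scrambleDepth: requireIntInRange("scrambleDepth", req.scrambleDepth ?? 50, 1, 200),
  };
}
```

Calling `resolveRunConfig({}, "openai/gpt-4.1-mini")` yields a 4x4 board with 200 maximum moves and a scramble depth of 50, matching the defaults in the table.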
The dashboard includes presets for:
- openai/gpt-4.1-mini - GPT-4.1 Mini
- openai/gpt-4.1 - GPT-4.1
- openai/gpt-4o-mini - GPT-4o Mini
- openai/gpt-4o - GPT-4o
- anthropic/claude-3.5-sonnet - Claude 3.5 Sonnet
- anthropic/claude-3-haiku - Claude 3 Haiku
Additional models can be used via the AI Gateway by specifying the full model ID.
- React 18 - UI framework
- TypeScript - Type safety
- Vite - Build tool & dev server
- Tailwind CSS - Styling
- Shadcn UI / Radix - Component library
- TanStack Query - Server state management
- Recharts - Data visualization
- Framer Motion - Animations
- Wouter - Routing
- Express - HTTP server
- ws - WebSocket server
- Zod - Schema validation
- TypeScript - Type safety
The sliding puzzle (n-puzzle) consists of a grid of numbered tiles with one empty space. The goal is to arrange tiles in order by sliding tiles into the empty space. The LLM receives the current board state as JSON and must respond with a valid move direction: up, down, left, or right.
Example board state (3x3):
[[1, 2, 3], [4, null, 6], [7, 5, 8]]

Example LLM response:
{"move": "down"}

- Create component in client/src/components/
- Use Shadcn UI primitives from client/src/components/ui/
- Import shared types from @shared/schema
The game engine in server/game-engine.ts handles:
- Board creation and scrambling
- Move validation
- State management
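
The three responsibilities above can be sketched as follows. This is not the contents of server/game-engine.ts; `Board`, `Direction`, `applyMove`, `isSolved`, and `scramble` are hypothetical names. The sketch also assumes that a direction describes where the empty space moves, which this README does not state explicitly.

```typescript
// Hypothetical types; the real ones live in shared/schema.ts and may differ.
type Direction = "up" | "down" | "left" | "right";
type Board = (number | null)[][];

// Assumption: a direction describes where the empty space moves.
const DELTAS: Record<Direction, [number, number]> = {
  up: [-1, 0],
  down: [1, 0],
  left: [0, -1],
  right: [0, 1],
};

// Locate the empty cell (null) as [row, col].
function findEmpty(board: Board): [number, number] {
  for (let r = 0; r < board.length; r++) {
    const c = board[r].indexOf(null);
    if (c !== -1) return [r, c];
  }
  throw new Error("board has no empty cell");
}

// Move validation: return a new board, or null if the move would leave the grid.
function applyMove(board: Board, move: Direction): Board | null {
  const [r, c] = findEmpty(board);
  const [dr, dc] = DELTAS[move];
  const nr = r + dr;
  const nc = c + dc;
  if (nr < 0 || nr >= board.length || nc < 0 || nc >= board[0].length) return null;
  const next = board.map((row) => [...row]);
  next[r][c] = next[nr][nc];
  next[nr][nc] = null;
  return next;
}

// Solved means 1, 2, 3, ... in row-major order with the empty cell last.
function isSolved(board: Board): boolean {
  const flat = board.reduce<(number | null)[]>((acc, row) => acc.concat(row), []);
  return flat.every((v, i) => (i === flat.length - 1 ? v === null : v === i + 1));
}

// Scramble by applying `depth` random legal moves to the solved board,
// which guarantees the result is solvable. Expects size >= 2.
function scramble(size: number, depth: number, random: () => number = Math.random): Board {
  let board: Board = Array.from({ length: size }, (_, r) =>
    Array.from({ length: size }, (_, c) =>
      r === size - 1 && c === size - 1 ? null : r * size + c + 1,
    ),
  );
  const dirs: Direction[] = ["up", "down", "left", "right"];
  let applied = 0;
  while (applied < depth) {
    const next = applyMove(board, dirs[Math.floor(random() * dirs.length)]);
    if (next) {
      board = next;
      applied++;
    }
  }
  return board;
}
```

Under this convention, applying down and then right to the 3x3 example board solves it. Returning a fresh board from `applyMove` instead of mutating in place keeps earlier states intact, which is convenient for move-by-move history.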
Modify server/llm-client.ts to add new providers. The system uses the Vercel AI Gateway, which supports multiple providers through a unified API.
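
As a rough illustration of what a provider integration has to do, the sketch below builds a request for the gateway and parses the model's reply into a move. It is not the code in server/llm-client.ts. It assumes the gateway accepts OpenAI-style chat completion requests at `/chat/completions` under the base URL, and `buildMoveRequest`, `parseMove`, and `GatewayRequest` are hypothetical names.

```typescript
type Direction = "up" | "down" | "left" | "right";
const DIRECTIONS: readonly Direction[] = ["up", "down", "left", "right"];

// Hypothetical container for everything needed to send one request.
interface GatewayRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Assumption: the gateway exposes an OpenAI-compatible chat completions route.
function buildMoveRequest(
  baseUrl: string,
  apiKey: string,
  modelId: string,
  board: (number | null)[][],
): GatewayRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: modelId,
      messages: [
        { role: "system", content: 'Reply only with JSON such as {"move": "up"}.' },
        { role: "user", content: JSON.stringify(board) },
      ],
    }),
  };
}

// Extract a move from raw model text such as '{"move": "down"}'.
// Returns null when no JSON object is found or the direction is unknown.
function parseMove(raw: string): Direction | null {
  // Models sometimes wrap JSON in prose, so take the first flat {...} span.
  const match = raw.match(/\{[^{}]*\}/);
  if (!match) return null;
  try {
    const parsed: unknown = JSON.parse(match[0]);
    if (typeof parsed !== "object" || parsed === null) return null;
    const move = (parsed as { move?: unknown }).move;
    if (typeof move !== "string") return null;
    const idx = DIRECTIONS.indexOf(move.trim().toLowerCase() as Direction);
    return idx === -1 ? null : DIRECTIONS[idx];
  } catch {
    return null;
  }
}
```

The returned `url`, `headers`, and `body` can be passed to `fetch` with method POST. In OpenAI-style responses the reply text sits at `choices[0].message.content`, and that string would then go through `parseMove`.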
MIT