Embodied AI agents that learn through experimentation.
Drop a character into a scene. Claude sees it through vision models, experiments with what's possible, remembers what works, and writes new code when needed. No predefined action lists. No hardcoded behaviors. The character discovers its own capabilities.
Golem is open source because the metaverse should not be owned by one company, and neither should foundational AI character systems. Instead of vendor lock-in, Golem defines an open standard for AI-to-character communication so that AI can control characters in any game engine. Golem characters learn through exploration, not pre-programming. They see their world, experiment, remember what works, and become co-contributors to the virtual worlds they inhabit.
Bring your own AI. No vendor lock-in. Contribute to Golem's codebase.
Traditional AI characters (Convai, Inworld):
- Developer defines 12 actions the character can do
- AI picks from the menu
- Character is limited to what was anticipated
- Locked into their AI, their pricing, their roadmap
Golem:
- Developer provides a character and a scene
- Claude explores through vision and trial-and-error
- Character discovers what's possible
- Claude writes new scripts when needed
- You choose the AI: Claude, GPT, local models, whatever comes next
As AI models improve, Golem characters automatically inherit those improvements. We're not building AI; we're building the embodiment layer for whatever AI becomes.
Golem is MIT licensed. No API keys required to get started. No per-conversation fees. Run it locally, modify it freely, deploy it anywhere.
Not locked into any AI provider. Connect Claude for advanced reasoning, GPT for conversation, a local Llama for privacy, or your own fine-tuned model. Swap backends without changing game code.
A simple, documented WebSocket protocol for AI-to-character communication. Implement it once in any engine: Unity, Unreal, Godot, web. Any AI that speaks the protocol can control any character that implements it. No proprietary SDKs.
Characters discover their capabilities through experimentation, not configuration. Vision models see the scene. Trial-and-error finds what works. Memory retains what's learned. Code generation creates new abilities.
```
┌─────────────────────────────────────────────────────────┐
│                     Your AI Backend                     │
│           Claude • GPT • Llama • Your Fine-tune         │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                  Vision Language Model                  │
│                  Sees the Unity scene                   │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│               Golem Protocol (WebSocket)                │
│             Standard JSON messages over WS              │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                      Golem Runtime                      │
│          Unity • Unreal (soon) • Godot (soon)           │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                      Feedback Loop                      │
│      Did it work? → Memory → Pattern Recognition        │
└─────────────────────────────────────────────────────────┘
```
- Vision: AI sees the scene through vision language models
- Experimentation: Try actions, observe results
- Memory: Remember what works, what doesn't
- Pattern Recognition: Generalize from experience
- Code Generation: Write new capabilities when needed
The character learns its environment like a child learns to walk: through exploration, not instruction.
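The explore-remember-generalize loop above can be sketched as a small outcome memory. This is an illustrative sketch only, with hypothetical names; Golem's actual memory implementation is internal to the runtime:

```python
from collections import defaultdict

class OutcomeMemory:
    """Toy memory of which actions worked in which situations (illustrative, not Golem's real store)."""

    def __init__(self):
        # (situation, action) -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, situation, action, succeeded):
        entry = self.stats[(situation, action)]
        entry[0] += int(succeeded)
        entry[1] += 1

    def best_action(self, situation, candidates):
        # Laplace-smoothed success rate, so untried actions still get explored
        def score(action):
            wins, tries = self.stats[(situation, action)]
            return (wins + 1) / (tries + 2)
        return max(candidates, key=score)

memory = OutcomeMemory()
memory.record("door_closed", "push", False)  # experiment failed
memory.record("door_closed", "pull", True)   # experiment worked
memory.record("door_closed", "pull", True)
print(memory.best_action("door_closed", ["push", "pull"]))  # prints: pull
```

The smoothing term keeps unexplored actions attractive enough to be tried at least once, which is what turns trial-and-error into discovery rather than premature convergence.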
```
git clone https://github.com/TreasureProject/Golem.git
```

Open the project in Unity 2022.3+.
Golem connects to any AI server via WebSocket:
```
ws://localhost:5173/agents/chat/external:{agentId}
```
Your server receives scene state and sends commands. Use Claude, GPT, a local model, or anything else you want.
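As a sketch, the decision step of such a server could be a pure function from a `scene_state` message to a `character_action` message. The message envelopes below follow the protocol examples in this README; the shape of entries inside `objects` (a `"name"` field) is an assumption for illustration, since the docs abbreviate that payload:

```python
import json

def decide(scene_state_msg):
    """Toy policy: walk toward the first visible object (object entry shape is assumed)."""
    data = scene_state_msg["data"]
    objects = data.get("objects", [])
    target = objects[0]["name"] if objects else "idle_spot"
    return {
        "type": "character_action",
        "data": {
            "action": {
                "type": "moveToLocation",
                "parameters": {"location": target},
            }
        },
    }

incoming = {
    "type": "scene_state",
    "data": {
        "character": {"position": [0, 0, 5], "state": "idle"},
        "objects": [{"name": "cafe"}],
    },
}
command = decide(incoming)
print(json.dumps(command))
```

A real backend would hand the screenshot and scene state to an AI model instead of a hardcoded rule, then serialize whatever action the model chooses in the same envelope.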
Press Play. The AI sees the scene, experiments, and learns.
A simple JSON-over-WebSocket protocol. Any AI that produces these messages can control any Golem-compatible character.
```json
{
  "type": "character_action",
  "data": {
    "action": {
      "type": "moveToLocation",
      "parameters": { "location": "cafe" }
    }
  }
}
```

```json
{
  "type": "emote",
  "data": {
    "type": "voice",
    "audioBase64": "<base64-encoded-audio>"
  }
}
```

```json
{
  "type": "emote",
  "data": {
    "type": "animated",
    "animation": { "name": "wave", "duration": 2.0 }
  }
}
```

```json
{
  "type": "facial_expression",
  "data": {
    "expression": "happy",
    "intensity": 0.9
  }
}
```

Expressions: `happy`, `sad`, `surprised`, `angry`, `neutral`, `thinking`

```json
{
  "type": "script",
  "data": {
    "code": "<C# code to execute>",
    "target": "character"
  }
}
```

The AI can write and execute new behaviors at runtime, not limited to predefined actions.

```json
{
  "type": "scene_state",
  "data": {
    "character": { "position": [0, 0, 5], "state": "idle" },
    "objects": [...],
    "screenshot": "<base64-encoded-image>"
  }
}
```

The AI receives visual and structured feedback to close the learning loop.
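Every message above shares the same minimal envelope: a `"type"` string and a `"data"` object. A backend can sanity-check outgoing and incoming messages with a small validator; the set of known types below assumes the examples in this README are the complete list:

```python
# Message types shown in the protocol examples (assumed complete for this sketch)
KNOWN_TYPES = {"character_action", "emote", "facial_expression", "script", "scene_state"}

def validate_message(msg):
    """True if msg has the minimal Golem envelope: a known 'type' and a 'data' object."""
    return (
        isinstance(msg, dict)
        and msg.get("type") in KNOWN_TYPES
        and isinstance(msg.get("data"), dict)
    )

assert validate_message({"type": "facial_expression",
                         "data": {"expression": "happy", "intensity": 0.9}})
assert not validate_message({"type": "teleport", "data": {}})  # unknown type
assert not validate_message({"type": "emote"})                 # missing data
```

Checking only the envelope, rather than every field, keeps the validator forward-compatible as the protocol grows new message payloads.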
| | Convai/Inworld | Golem |
|---|---|---|
| Action space | Predefined by developer | Discovered by AI |
| Vision | None | Vision language models |
| Learning | None | Trial-and-error + memory |
| Code generation | None | Runtime scripting |
| AI backend | Locked to their API | Any (Claude, GPT, local) |
| Protocol | Proprietary SDK | Open WebSocket standard |
| Pricing | Per-API-call | Open source / free |
| Improvement | Their roadmap | Inherits AI advances |
```
Golem/
├── Assets/
│   ├── Scripts/
│   │   ├── Character/
│   │   │   ├── PointClickController.cs      # NavMesh movement
│   │   │   ├── CharacterActionController.cs # Action routing
│   │   │   └── EmotePlayer.cs               # Voice + lip sync
│   │   ├── Systems/
│   │   │   ├── Networking/
│   │   │   │   └── CFConnector.cs           # WebSocket client
│   │   │   └── Camera/
│   │   │       └── CameraStateMachine.cs    # Camera control
│   │   └── Utils/
│   │       └── WavUtility.cs                # Audio decoding
│   ├── Plugins/
│   │   └── SALSA LipSync/                   # Lip sync
│   └── Scenes/
│       └── Main.unity
└── README.md
```
| Component | Purpose |
|---|---|
| `CFConnector.cs` | WebSocket client, connects to any AI backend |
| `CharacterActionController.cs` | Routes AI commands to character |
| `PointClickController.cs` | NavMesh movement + interaction states |
| `EmotePlayer.cs` | Voice playback with SALSA lip sync |
In the Unity Inspector, configure `CFConnector`:

| Setting | Default | Description |
|---|---|---|
| Host | `localhost:5173` | AI server address |
| Agent Id | `character` | Agent identifier |
| Use Secure | `false` | Use `wss://` |
| Query Token | (none) | Auth token |
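Combining these settings with the endpoint format from the quickstart, the connection URL can be derived as below. This is a sketch; the query-parameter name used for the auth token is an assumption, not documented here:

```python
def connection_url(host, agent_id, use_secure=False, query_token=None):
    """Builds the Golem WebSocket endpoint from the CFConnector settings (sketch)."""
    scheme = "wss" if use_secure else "ws"
    url = f"{scheme}://{host}/agents/chat/external:{agent_id}"
    if query_token:
        url += f"?token={query_token}"  # parameter name 'token' is an assumption
    return url

print(connection_url("localhost:5173", "character"))
# prints: ws://localhost:5173/agents/chat/external:character
```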
Test actions manually while developing:
| Key | Action |
|---|---|
| `1` | Move to location |
| `2` | Sit at chair |
| `3` | Stand up |
| `4` | Examine display |
| `5` | Play arcade |
| `6` | Change camera |
| `7` | Idle |
| `Space` | Stand up |
We welcome contributions:
- Protocol improvements
- New runtime implementations (Unreal, Godot, web)
- AI backend adapters
- Documentation
MIT β Use it however you want.
Golem is built by Treasure, which is building the future of interactive IP and AI-driven entertainment experiences.