
evolvent-ai/Terrarium


Terrarium: Multi-turn data engine for evaluating and optimizing LLM agents in living environments.


Terrarium is a framework for generating rich, multi-turn agent rollouts across living and evolving environments. You write task programs in pure Python that orchestrate a living environment — sending emails, populating databases, uploading files — drive LLM agents through it, and evaluate the outcomes. Think of it as a terrarium for agents — a contained world that lives and breathes while you watch how your agents behave.

Use it to benchmark agents, collect training trajectories, or build evaluation datasets that capture the complexity of real-world workflows.

[Video: terrarium-intro.mp4]


Motivation

The way we evaluate and train agents has evolved through three phases:

| | Phase 1: Static QA | Phase 2: Single-turn Agents | Phase 3: Multi-turn Agents & Living Environments |
|---|---|---|---|
| Representative | lm-evaluation-harness, lmms-eval | Harbor | Terrarium |
| Interaction | Single-turn | Single-turn, multi-step | Multi-turn, multi-step |
| Environment | None | Static sandboxes | Composable, mutates between turns |
| Control flow | - | Linear | Loops & branches |
| Verification | Ground-truth matching | Final test script | Programmatic checkers at any stage |
| Proactive agents | - | - | Supported |

Existing frameworks stop at Phase 2. They have limited support for environments that change mid-task, multi-turn agent interactions, or task logic with loops and branches. As agents move beyond coding into personal assistance, workflow automation, and proactive monitoring, these limitations become blocking. We built Terrarium to close this gap.

Core Features

  • Living environments — task programs mutate the environment between agent turns (new emails arrive, database records change, files appear), creating dynamic scenarios that static benchmarks can't express
  • Composable capabilities — mix and match capabilities (email, calendar, postgres, notion, ...) in a single task. The framework handles provisioning, networking, and teardown. Sandbox-backed and API-based capabilities are treated uniformly.
  • Complex control flow — loops, conditional branches, and stage-level checks, all in plain Python. No YAML schemas, no configuration languages.
  • Multi-turn formulation — tasks drive agents through evolving worlds over many interactions, capturing context maintenance and adaptation across turns
  • Proactive agent support — heartbeat and webhook patterns for agents that monitor, anticipate, and act on their own initiative

Demo

A Claude Code agent running the Branch & Loop task — checking email, writing study notes in Notion, looping to expand content, and adapting to a mid-task schedule change.

[Video: terrarium-demo.mp4]

Quick Start

git clone https://github.com/evolvent-ai/Terrarium.git
cd Terrarium
uv sync

Setup Environment Variables

# .env
ANTHROPIC_API_KEY=sk-...
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://...
NOTION_TOKEN=ntn_...                          # if using notion capability
GOOGLE_SHEETS_CREDENTIALS_FILE=creds.json     # if using google_sheets capability

Build Docker Images (optional)

Sandbox-based agents (claude_code, openclaw, codex, hermes) run inside containers. Build the images you need:

docker build -t terrarium/claude-code -f docker/claude-code.Dockerfile docker/
docker build -t terrarium/openclaw -f docker/openclaw.Dockerfile docker/
docker build -t terrarium/codex -f docker/codex.Dockerfile docker/
docker build -t terrarium/hermes -f docker/hermes.Dockerfile docker/

Try It

terrarium run -c demo/run_config.toml

Built-in Agents

Five agents are provided out of the box:

| Agent | Runs in | Description |
|---|---|---|
| claude_code | Docker | Claude Code CLI, stream-json output, multi-turn via session ID |
| openclaw | Docker | OpenClaw CLI, session JSONL output |
| codex | Docker | OpenAI Codex CLI, session JSONL output, multi-turn via session ID |
| hermes | Docker | NousResearch hermes-agent CLI, session JSONL export, multi-turn via session ID |
| mini | In-process | Lightweight agent via litellm, supports tool registration and custom system prompts |

Bring your own agent by implementing BaseAgent and loading via import path:

AgentConfig(name="my_agent", import_path="my_package.agent:MyAgent")
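The import_path string above follows the familiar `package.module:ClassName` convention. As an illustration only, the sketch below shows one way such a string could be resolved with the standard library. `resolve_import_path` is a hypothetical helper, and Terrarium's own loader may work differently.

```python
import importlib


def resolve_import_path(import_path: str):
    """Resolve a 'package.module:Attribute' string to the object it names."""
    module_name, sep, attr_name = import_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# A standard-library class stands in for a custom agent class here.
resolved = resolve_import_path("collections:OrderedDict")
print(resolved.__name__)
```

Keeping the module and attribute in one string lets a config file name a class without importing it up front.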

Sandbox Providers

Capabilities and containerized agents run inside sandboxes provisioned by a SandboxProvider. Two backends are built in:

| Provider | Backend | Notes |
|---|---|---|
| docker | Local Docker daemon (default) | One container per sandbox on a per-session Docker network. |
| k8s | Kubernetes cluster | One Pod per sandbox in a shared namespace. |

Bring your own provider by implementing SandboxProvider and loading via import path:

SandboxProviderConfig(import_path="my_package.providers:MyProvider", kwargs={...})

Benchmark Integration

Writing a Task

A task is a directory with a Python script and metadata:

my_task/
  task.py         # @entry function — the task program
  task.toml       # metadata (name, tags, description)
  resources/      # optional files to upload into the environment

task.toml:

[metadata]
name = "my_task"
author = "your_name"
difficulty = "medium"
category = "data_processing"
tags = ["email", "postgres"]
description = "Agent reads email and imports data into PostgreSQL."

The task.py is where everything happens. Declare what capabilities you need, set up the environment, drive the agent, and verify the results:

from terrarium.task.decorator import entry
from terrarium.task.checking import run_checkers

@entry(capabilities=["postgres", "email"])
def my_task(env, agent):
    # 1. Set up the world
    env.email.send(from_addr="hr@co.com", to="agent@co.com",
                   subject="Data", body="Please import the attached CSV.")

    # 2. Drive the agent
    email_info = env.email.connection_info
    agent.act(
        "Check your email and import the data into PostgreSQL.\n\n"
        f"Email: agent@co.com, "
        f"IMAP at {email_info['imap_host']}:{email_info['imap_port']}, "
        f"SMTP at {email_info['smtp_host']}:{email_info['smtp_port']}, no password"
    )

    # 3. Verify the outcome
    return run_checkers({
        "data_imported": lambda: env.postgres.query("SELECT count(*) FROM t")[0]["count"] == 5,
        "reply_sent": lambda: env.email.count_inbox("hr@co.com") > 0,
    })
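The example passes run_checkers a dictionary of named zero-argument callables. How Terrarium scores them is not spelled out here, so the following is only a sketch of the idea. `run_checkers_sketch`, the fraction-passed score, and the choice to count a raising check as failed are all assumptions, not the library's behavior.

```python
from typing import Callable, Dict


def run_checkers_sketch(checkers: Dict[str, Callable[[], bool]]) -> dict:
    """Evaluate named checks; a check that raises is recorded as failed."""
    results = {}
    for name, check in checkers.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    passed = sum(results.values())
    score = passed / len(results) if results else 0.0
    return {"checks": results, "score": score}


outcome = run_checkers_sketch({
    "data_imported": lambda: 5 == 5,
    "reply_sent": lambda: 1 / 0 > 0,  # raises, so it is recorded as failed
})
print(outcome)
```

Wrapping each check in a lambda defers the query until verification time, so the check observes the environment after the agent has acted.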

Organizing into a Dataset

A dataset groups tasks together with shared metric configuration. Tasks are discovered recursively.

my_benchmark/
  dataset.toml
  import_task/
    task.toml
    task.py
  monitoring/
    heartbeat_task/
      task.toml
      task.py
    webhook_task/
      task.toml
      task.py

dataset.toml:

[metadata]
name = "my_benchmark"

[metrics]
types = ["mean", "max", "pass@5"]

Built-in metrics: mean, max, min, sum, pass@k.
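For readers unfamiliar with pass@k, the sketch below computes the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of attempts and c the number that passed. Whether Terrarium uses this exact estimator is an assumption, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts, c of which succeeded."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) / comb(n, k) is the chance that a random
    # size-k subset of the attempts contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=5, c=2, k=1))
print(pass_at_k(n=5, c=2, k=5))
```

This form of the estimator needs at least k attempts per task, so under this definition a pass@5 metric implies five or more attempts.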

Parameterized Tasks

When many tasks share the same logic and differ only in data (tau2-bench, Harbor-style benchmarks, etc.), one task.py can expand into N independent task instances via @task.parameterize:

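from terrarium.task.decorator import entry

# _load_tasks() is a user-defined helper (not shown) that returns the list of task records.

@entry(capabilities=["workspace"])
def task(env, agent, *, task_index: int):
    task_data = _load_tasks()[task_index]
    # ... shared logic using task_data ...

# Expand this one function into one task instance per record.
@task.parameterize
def params():
    for i in range(len(_load_tasks())):
        yield {"name": f"retail_task_{i}", "params": {"task_index": i}}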

Each yielded item becomes a full Task instance with its own name, metrics, and trial results. Optionally override capabilities or capabilities_config per instance:

yield {"name": "with_db", "params": {...}, "capabilities": ["workspace", "postgres"]}
yield {"name": "alt_db", "params": {...}, "capabilities_config": {"postgres": {"db_name": "alt"}}}

The directory layout stays minimal — one directory, one task.py — instead of N near-identical directories.
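The parameterized example calls `_load_tasks()` both when generating parameters and inside every task instance, so the helper is a natural place for caching. A minimal sketch, assuming the task records live in a JSON array on disk; the file name, the `path` argument, and the record fields are hypothetical, and the demo writes a throwaway file so the block runs on its own.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def _load_tasks(path: str = "resources/tasks.json") -> tuple:
    """Read the task records once and cache them for later calls."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    # A tuple keeps callers from mutating the cached value.
    return tuple(records)


# Demo with a temporary file standing in for resources/tasks.json.
with tempfile.TemporaryDirectory() as tmp:
    tasks_file = Path(tmp) / "tasks.json"
    tasks_file.write_text(json.dumps([{"id": "a"}, {"id": "b"}]), encoding="utf-8")
    tasks = _load_tasks(str(tasks_file))
    names = [f"retail_task_{i}" for i in range(len(tasks))]
print(names)
```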

Running Jobs

CLI — point at a job config and go:

terrarium run -c job.toml

# job.toml
job_name = "my_eval"
datasets = ["my_benchmark"]
n_attempts = 3
n_concurrent_trials = 4

[[agents]]
name = "claude_code"
model_name = "claude-sonnet-4-6"

Python API — for programmatic control:

import asyncio
from terrarium import Job, JobConfig, AgentConfig

result = asyncio.run(Job(JobConfig(
    agents=[AgentConfig(name="claude_code", model_name="claude-sonnet-4-6")],
    datasets=["my_benchmark"],
    n_attempts=3,
    n_concurrent_trials=4,
)).run())

for tr in result.trial_results:
    print(f"{tr.trial_name}: {tr.checker_result.score}")

The engine expands agents x tasks x attempts into individual trials, runs them concurrently, and aggregates metrics (mean, max, min, sum, pass@k).
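The expansion and bounded concurrency described above can be pictured with plain Python. This is a sketch of the idea, not the engine's code: the trial naming scheme and the `run_trial` and `run_all` functions are hypothetical, and the rollout is replaced by a placeholder.

```python
import asyncio
from itertools import product

agents = ["claude_code", "mini"]
tasks = ["import_task", "heartbeat_task", "webhook_task"]
n_attempts = 3

# One trial per (agent, task, attempt) combination.
trials = [
    {"agent": agent, "task": task, "attempt": attempt}
    for agent, task, attempt in product(agents, tasks, range(1, n_attempts + 1))
]


async def run_trial(trial: dict, limiter: asyncio.Semaphore) -> str:
    async with limiter:  # at most n_concurrent_trials run at once
        await asyncio.sleep(0)  # placeholder for the real rollout
        return f"{trial['agent']}/{trial['task']}#{trial['attempt']}"


async def run_all(trials: list, n_concurrent_trials: int) -> list:
    limiter = asyncio.Semaphore(n_concurrent_trials)
    return await asyncio.gather(*(run_trial(t, limiter) for t in trials))


names = asyncio.run(run_all(trials, n_concurrent_trials=4))
print(len(names))  # 2 agents x 3 tasks x 3 attempts
```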

Checking Job Results

Every trial produces a TrialResult with the full trajectory, checker pass/fail, token usage, and timing. Results are persisted as JSON:

outputs/{job_name}/
  result.json                    # aggregated stats and metrics
  {trial_name}/result.json       # full trajectory, checks, timing

Drag any result.json into demo/viewer.html to explore interactively.
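For scripted analysis, the per-trial files can also be read directly. The sketch below relies only on the directory layout shown above; `load_trial_results` is a hypothetical helper, and the JSON written in the demo is placeholder content, since the real result schema is not described here.

```python
import json
import tempfile
from pathlib import Path


def load_trial_results(job_dir: Path) -> dict:
    """Map each trial directory name to its parsed result.json."""
    results = {}
    # The '*/result.json' pattern skips the aggregated file at the top level.
    for result_file in sorted(job_dir.glob("*/result.json")):
        with result_file.open(encoding="utf-8") as handle:
            results[result_file.parent.name] = json.load(handle)
    return results


# Demo against a temporary directory mimicking outputs/{job_name}/.
with tempfile.TemporaryDirectory() as tmp:
    job_dir = Path(tmp) / "outputs" / "my_eval"
    for trial_name in ("trial_a", "trial_b"):
        trial_dir = job_dir / trial_name
        trial_dir.mkdir(parents=True)
        (trial_dir / "result.json").write_text(
            json.dumps({"trial": trial_name}), encoding="utf-8"
        )
    (job_dir / "result.json").write_text("{}", encoding="utf-8")
    loaded = load_trial_results(job_dir)
print(sorted(loaded))
```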

Design Highlights

Living Environments

Real-world environments don't sit still. New emails arrive, files appear, schedules change — often while the agent is working. Terrarium lets the task program mutate the environment between agent turns, creating scenarios that static benchmarks simply can't express.

@entry(capabilities=["postgres"])
def order_monitoring(env, agent):
    env.postgres.execute("CREATE TABLE orders (id SERIAL, status TEXT, item TEXT)")
    env.postgres.execute("INSERT INTO orders (status, item) VALUES ('shipped', 'laptop')")

    agent.act("Check the orders table and summarize any pending orders.")
    # No pending orders — agent reports nothing to do.

    # A new order arrives between turns
    env.postgres.execute("INSERT INTO orders (status, item) VALUES ('pending', 'keyboard')")

    agent.act("Check again for any updates.")
    # Agent should find the new pending order and report it.

Composable Environments

Agent tasks in the real world span multiple services — email, calendars, databases, file systems, cloud APIs. Like assembling a terrarium from soil, water, and plants, you compose an environment from capabilities — declare what you need, and the framework provisions, networks, and tears down everything automatically.

@entry(capabilities=["email", "calendar", "notion", "postgres"])
def composable_capabilities(env, agent):
    env.email.send(...)          # GreenMail container
    env.calendar.add_event(...)  # Radicale CalDAV container
    env.notion.create_page(...)  # Notion API
    env.postgres.execute(...)    # PostgreSQL container

Six capabilities are built in. Sandbox-backed and API-based capabilities are treated uniformly, so tasks use them through the same interface:

| Capability | Type | Backend |
|---|---|---|
| workspace | Sandbox | Linux container (shell + filesystem) |
| email | Sandbox | GreenMail (SMTP/IMAP) |
| postgres | Sandbox | PostgreSQL 16 |
| calendar | Sandbox | Radicale (CalDAV) |
| notion | API | Notion API |
| google_sheets | API | Google API |

Complex Control Flow

Task definitions are pure Python functions. Loops, conditional branches, intermediate checks, dynamic stage generation — anything you can express in Python, you can use to define a task.

@entry(capabilities=["email", "notion", "calendar"])
def branch_and_loop(env, agent):
    # Note: get_text, page_id, EXAM_EMAIL, RESCHEDULE_EMAIL, check_stage0..2,
    # and aggregate_results are not shown in this excerpt.
    # Stage 0: agent reads email, creates calendar event
    env.email.send(from_addr="prof@uni.edu", to="alex@uni.edu", subject="Exam Reminder", body=EXAM_EMAIL)
    agent.act("Check my email and create a calendar event for the exam.")

    # Stage 1: agent writes study guide, loop until it's detailed enough
    agent.act("Read my lecture notes and write a study guide in Notion.")
    original_length = len(get_text(env, page_id))
    for _ in range(5):
        if len(get_text(env, page_id)) >= original_length * 5:
            break
        agent.act("The guide is too brief. Expand it with more details.")

    # Stage 2: branch based on what the agent actually wrote
    if "bellman equation" in get_text(env, page_id).lower():
        env.email.send(from_addr="prof@uni.edu", to="alex@uni.edu", subject="Rescheduled", body=RESCHEDULE_EMAIL)
        agent.act("Check my email for updates.")
    else:
        agent.act("I'm not ready for the exam. Email the professor requesting an extension.")

    return aggregate_results(check_stage0, check_stage1, check_stage2)
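The bounded retry loop in Stage 1 is a pattern that recurs across tasks, so it can be factored into a small helper. A minimal sketch, assuming only that the agent exposes `act(prompt)` as in the examples; `act_until` and `FakeAgent` are hypothetical names, and the fake agent exists solely to make the block runnable without Terrarium.

```python
from typing import Callable


def act_until(agent, prompt: str, done: Callable[[], bool], max_turns: int) -> int:
    """Send `prompt` until `done()` is true or `max_turns` is reached.

    Returns the number of turns actually used.
    """
    turns = 0
    while turns < max_turns and not done():
        agent.act(prompt)
        turns += 1
    return turns


class FakeAgent:
    """Stand-in agent that appends text to a shared document on each turn."""

    def __init__(self, document: list):
        self.document = document

    def act(self, prompt: str) -> None:
        self.document.append("more detail " * 10)


document = ["short guide"]
original_length = len("".join(document))
agent = FakeAgent(document)
turns_used = act_until(
    agent,
    "The guide is too brief. Expand it with more details.",
    done=lambda: len("".join(document)) >= original_length * 5,
    max_turns=5,
)
print(turns_used)
```

Returning the turn count lets a checker distinguish an agent that met the bar immediately from one that needed every allowed retry.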

Multi-Turn Formulation

Terrarium introduces a programmatic DSL for multi-turn agent tasks: each agent.act() call is a turn, each task is a program that drives the agent through an evolving world. This formulation captures what single-turn benchmarks miss — the ability to maintain context, adapt to changes, and execute plans across diverse tools over many interactions.

@entry(capabilities=["email", "notion", "calendar"])
def multi_turn_task(env, agent):
    agent.act("Check my email for the exam details and create a calendar event.")
    # ... environment changes ...
    agent.act("Read my lecture notes and put together a study guide in Notion.")
    # ... check intermediate results, branch ...
    agent.act("I just got a new email. Read it and act accordingly.")

Infrastructure for Proactive Agents

Most benchmarks target reactive agents — give a prompt, get a response. Terrarium is the first data infrastructure designed for proactive agents: agents that monitor, anticipate, and act on their own initiative in response to environmental changes.

# Heartbeat pattern: agent receives periodic signals and must decide what to do
@entry(capabilities=["workspace"])
def proactive_heartbeat(env, agent):
    agent.act("You'll receive periodic heartbeat signals. Check /root/results/ for new files each time.")
    agent.act("[09:30] Heartbeat: check for changes.")  # nothing — agent should do nothing
    env.workspace.fs.upload(...)                          # file appears
    agent.act("[10:00] Heartbeat: check for changes.")  # agent should notice and act

# Webhook pattern: agent receives event notifications and decides how to respond
@entry(capabilities=["email"])
def proactive_webhook(env, agent):
    agent.act("You'll receive email webhook notifications. Decide whether each needs a reply.")

    env.email.send(from_addr="boss@co.com", to="agent@co.com", subject="Urgent: client meeting", body="...")
    agent.act('{"event": "new_email", "from": "boss@co.com", "subject": "Urgent: client meeting"}')
    # Agent should read the email and reply.

    env.email.send(from_addr="newsletter@spam.com", to="agent@co.com", subject="Weekly digest", body="...")
    agent.act('{"event": "new_email", "from": "newsletter@spam.com", "subject": "Weekly digest"}')
    # Agent should ignore this one.
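The webhook example writes its JSON payloads by hand, which becomes fragile once subjects contain quotes or non-ASCII text. One option is to build them with the standard json module. `webhook_message` is a hypothetical helper; the event and field names simply mirror the example above.

```python
import json


def webhook_message(event: str, **fields) -> str:
    """Serialize an event notification as a JSON string for agent.act()."""
    payload = {"event": event, **fields}
    return json.dumps(payload, ensure_ascii=False)


# "from" is a Python keyword, so the fields are passed via dict unpacking.
message = webhook_message(
    "new_email",
    **{"from": "boss@co.com", "subject": 'Re: "Q3" numbers'},
)
print(message)
```

The resulting string can be passed to agent.act() exactly like the hand-written payloads.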

Roadmap

  • More environment capabilities — browser automation, Slack, MySQL/SQLite, cloud storage, and more
  • More agent adapters — Anthropic SDK, OpenAI, LangChain, local models, and others
  • More sandbox providers — support sandbox backends beyond Docker and Kubernetes (e.g., cloud VMs, lightweight runtimes)
  • Benchmark integrations — integrate more existing benchmarks beyond tau2-bench
  • CLI enhancements: terrarium init, terrarium validate, result comparison, and richer output
  • Execution engine — async task submission, distributed execution, and scalable orchestration
  • Documentation site — full API reference, tutorials, and contributor guide

Acknowledgements

Terrarium's CLI and execution infrastructure draw inspiration from Harbor. We thank the Harbor team for their pioneering work on agent evaluation tooling.

License

This project is licensed under the CC BY-NC 4.0 license. See LICENSE for details.
