Skip to content

esagduyu/labrat

Repository files navigation

πŸ€ LabRat

Find the cheese in your maze.

LabRat is a terminal-native AI data agent β€” and the start of "Claude Code for data scientists." Connect to your warehouse, ask a question in plain English, and watch the agent explore your schema, write dialect-correct SQL in real time, and surface the answer β€” all without leaving your terminal.

Note

Status: feature-complete v0 alpha. 659 tests passing, end-to-end against DuckDB. On the public DataAgentBench leaderboard at 51.4% (rank #10/18) and 80% on dbt Labs' ADE-bench β€” see Benchmark records.


Where this is going

The DuckDB demo and the benchmark scores are proof-of-life for the reasoning engine. The product they serve is bigger: an agent and a TUI that capture end-to-end data workflows the way Claude Code captures coding workflows. Three ideas drive the design.

The Rat, the Maze, and the Cheese. The vocabulary is the product, the way Claude Code has its own.

  • The Rat is the agent kernel β€” the reasoning loop, tools, providers, and warehouse adapters. It finds the cheese.
  • The Cheese is the deliverable β€” an answer that carries its own SQL and reasoning, reviewable like a colleague's work, saveable and shareable.
  • The Rat Maze is an optional, additive knowledge layer the colony builds over time: metric definitions, naming, known gotchas, and reusable analysis recipes distilled from real usage. A mapped maze means the next rat β€” or the next teammate β€” never cold-starts. (Anthropic's own data team measured <21% accuracy without this kind of reference layer, >95% with it.)

A Lego-modular, embeddable core. Everything is a self-contained library that composes. The Rat Core (agent loop + tools + adapters + providers) is harness-agnostic β€” it runs in LabRat's TUI or embedded in your own harness via the MCP server. The TUI and the Rat Maze are optional layers that snap on top; the bare-bones product (Core + TUI) must be excellent on its own, with zero knowledge layer required.

Bottoms-up, like Figma. An individual data scientist adopts the Core + TUI on their laptop against their own warehouse β€” no procurement, no platform commitment. Teams adopt the shared Rat Maze later, once the value is already proven.

The three pillars, in order: (1) find the cheese reliably β†’ (2) spread the cheese β†’ (3) map the maze. Pillar 1 is what the benchmarks measure and it's largely built; Pillars 2 and 3 are the road ahead (see Roadmap).

What makes LabRat different

  • The SQL editor is the agent's whiteboard. Watch SQL stream into the editor character-by-character in your warehouse's dialect as the agent thinks. Edit it. Run it. The agent learns from your edits.
  • Learns from you. Per Meta's research, 88% of data scientists' queries hit tables they've used before. LabRat captures your query history, infers your domain, and applies your past corrections automatically. Day-30 LabRat is meaningfully better than day-1 LabRat.
  • Grounded before it guesses. A one-call profile_dataset reads the real schema, row counts, and sample values before planning; link_schema narrows wide schemas to the relevant tables; verify_join mechanically probes a join's match-rate and fan-out before the agent trusts it. An opt-in LLM-as-judge verifier gates the final answer.
  • Audit-ready by default. Every interaction is event-sourced and logged. Pin findings and export polished HTML reports with full provenance β€” query, results, chart, timestamps, lineage.
  • Safe by default. Read-only roles enforced at connection. Mutations and multi-statement injection refused (sqlglot AST-checked). Queries gated by EXPLAIN-estimated cost. Spend tracked per session. Destructive mistakes are physically impossible.
  • Embeddable, not walled. The Rat Core mounts over MCP into Claude Code, Codex, Cursor, OpenCode, or any MCP host β€” fully open (AGPL-3.0), with swappable LLM backends. You ride the tools you already use instead of adopting a new app.
  • Benchmarked, and honest about it. Real public scores, with the methodology and a self-disclosed contamination story documented in full (below).

Benchmark records

LabRat is measured on the two most serious public agentic data benchmarks β€” both execution-based (real databases, real validators, no LLM judges). All runs use claude-sonnet-4-6 unless noted.

DataAgentBench β€” UC Berkeley EPIC (leaderboard)

Natural-language query answering across 12 datasets / 54 queries / 4 database systems (DuckDB, SQLite, PostgreSQL, MongoDB), many requiring cross-database joins. LabRat is on the public leaderboard at a stratified Pass@1 of 51.4% (rank #10 of 18), independently re-validated by the maintainers.

Phase Scope Score What it measures
1b 5 datasets Β· 17 queries 48.5% raw-Claude + prompt-engineering floor (no LabRat tools)
4 5 datasets Β· 17 queries 54.0% same subset with LabRat's tool layer β€” a +5.5pp measured tool-layer lift
5 (full) 12 datasets Β· 54 queries 51.4% full benchmark, on the leaderboard

Strongest single signal: crmarenapro 82% on a 6-database hybrid query set. The headline number has an honest history: our initial submission scored 58.0%, but we found and self-disclosed a harness flaw that let the agent read benchmark answer-key files; the maintainers independently re-validated all 270 answers and added LabRat at the corrected 51.4%. The sandbox that prevents this is now permanent. The full story β€” leak, disclosure, re-validation, sandbox gate, and per-dataset breakdown β€” is in docs/dab-progress-report.md.

ADE-bench β€” dbt Labs (repo)

Analytics-engineering tasks in Docker-sandboxed dbt+DuckDB projects. 80% overall (48/60).

Tier Tasks Score
Easy 15 100% (15/15)
Medium 30 80% (24/30)
Hard 15 60% (9/15)
Overall 60 80% (48/60)

On the 39 tasks shared with Altimate Code's published DuckDB results: LabRat 82% (32/39) vs. Altimate 77% (30/39), same model, same best-of-3 methodology. Full write-up and remaining-failure analysis: docs/ade-bench-progress-report.md.

Status β€” what's built

LabRat is feature-complete for v0 alpha.

Layer Status Details
7 warehouse adapters βœ… DuckDB, Postgres, Snowflake, BigQuery, Redshift, Trino, MySQL
4 LLM providers βœ… Anthropic API Β· Claude Code CLI (Mac OAuth, Max plan) Β· OpenAI-compatible Β· GPT-5.5 via ChatGPT subscription (Codex Responses API β€” personal/dev path)
Agent tool loop βœ… 20 tools: profile_dataset grounding, link_schema / verify_join, schema exploration, SQL execution, safety gates (mutation + statement-stacking refusal), multi-DB routing, attach_database (cross-DB JOINs), load_file (CSV/TSV/JSON/Parquet), load_mongo_collection, search_reference_docs (Scent retrieval), workflow (9-step data-analysis SOP), opt-in LLM-as-judge verifier, configurable turn/tool-call caps
MCP server βœ… python -m labrat.mcp.server mounts the tool registry over MCP stdio β€” drop into Claude Code, Codex, Cursor, OpenCode, or any MCP host
Query history βœ… always-on, PII-redacted JSONL per profile
Personal context engine βœ… table relevance scoring (frequency Γ— recency), LLM-generated descriptions
dbt catalog integration βœ… manifest.json + schema.yml + catalog.json + lineage
MCP catalog integration βœ… generic async client for any MCP-compatible catalog
Self-healing memory βœ… edit-derived + chat-correction memories, retrieval
Custom validations βœ… natural-language rules, warn/block severity
Benchmark harness βœ… unified suite protocol; ADE-bench + DataAgentBench integrations
3-pane TUI βœ… chat + SQL whiteboard + schema browser
Charts Β· HTML export Β· Audit log βœ… unicode + image-protocol charts; provenance-rich HTML findings; JSONL event sourcing

Test coverage: 659 tests (LLM-gated tests skipped without ANTHROPIC_API_KEY / LABRAT_RUN_LLM_TESTS).

Architecture

LabRat is built as composable layers β€” the kernel works without the layers above it.

  • Rat Core (src/labrat/agent/, src/labrat/db/) β€” AgentLoop drives tool-use round-trips against a ToolRegistry and a swappable ModelProvider. Tools subclass a common Tool[InputT] base with Pydantic-validated inputs. Warehouse adapters share a Connection ABC and return Polars DataFrames. run_agent_task() turns a one-shot prompt into a result in-process.
  • MCP server (src/labrat/mcp/) β€” exposes the same tool registry over MCP stdio, so the Core runs inside any MCP host. This is the embeddable seam: the DataAgentBench claude-mcp driver is living proof of LabRat's tools running inside a third-party harness today.
  • TUI (src/labrat/screens/, src/labrat/widgets/) β€” Textual 3-pane layout: chat, a streaming SQL editor (the agent's whiteboard), and a schema browser.
  • Knowledge subsystems (memory/, validations/, context_engine/, catalog/, history/, audit/) β€” the seeds of the Rat Maze: self-healing memory, NL validations, personal table-relevance scoring, dbt/MCP catalog loaders, query history, event sourcing.

Contributor-facing detail lives in CLAUDE.md; architectural decisions in decisions.md.

Install

# Coming soon
uv tool install labrat

Until then, build from source (requires Python 3.12+):

git clone https://github.com/esagduyu/labrat
cd labrat
uv sync
uv run labrat

Quickstart

labrat

On first run, an onboarding wizard walks you through picking a dialect, entering credentials (stored encrypted in the OS keyring), testing the connection, and optionally linking a dbt project or data catalog. Then ask a question:

> show me Q4 revenue by region

LabRat explores your schema, samples data, consults your history and memories, streams dialect-correct SQL into the editor, runs it behind safety gates, renders the results, and offers to chart or pin the finding. Press ? for the in-app keyboard reference.

Supported warehouses: DuckDB, PostgreSQL, Snowflake, BigQuery, Redshift, Trino/Presto, MySQL. New adapters are straightforward via the Connection base class β€” PRs welcome.

Supported providers: Anthropic API Β· Claude Code CLI (Mac OAuth / Max plan) Β· OpenAI-compatible (Azure, LiteLLM, vLLM, Together, Fireworks, Ollama) Β· GPT-5.5 via ChatGPT subscription. Configure per profile; default model claude-sonnet-4-6.

Roadmap

v0 alpha is feature-complete. The path forward follows the three pillars.

What this phase taught us β€” the lever is grounding, not the model. A GPT‑5.5 experiment (write-up) found GPT‑5.5 β‰ˆ Sonnet (not a free win) and the opt-in verifier gave no measured accuracy benefit on DAB. Meanwhile the failures we root-caused were ungrounded data handling (e.g. a query that fails only because a Date column is dirty mixed-format text) β€” fixable for any model by a one-line reference-doc note. That, plus Anthropic's 21%β†’95% result and Altimate's leaderboard-topping AutoContext, points the roadmap squarely at the knowledge/grounding layer. The grounding layer is now shipped. The Cartographer pre-pass (Scent #26a/#26b) and benchmark-safe prompt levers are built and ablated: Cartographer alone +8pp on Sonnet (21%β†’29% on the tuning subset); levers +8pp marginal on top; stacked 21%β†’38% (+17pp). The mechanism is provider-agnostic and proven β€” GPT‑5.5 consults search_reference_docs from the traces β€” but the effect is provider-dependent: +8pp on Sonnet, neutral on GPT‑5.5 (which already self-grounds exhaustively). The leaderboard path is Sonnet + Cartographer + levers.

Pillar 1 β€” find the cheese reliably (within-task reasoning). Shipped: grounding profiler, schema-linking, mechanically-verified joins, the opt-in verifier, benchmark-safe prompt levers (force-query, SQL self-repair via error_category/hint, push-aggregation-into-SQL β€” ablated at +8pp marginal on the tuning subset), and a workflow tool encoding a 9-step data-analysis SOP.

Pillar 2 β€” spread the cheese (the workflow product). The Cheese share artifact: every answer carries its SQL + reasoning, saveable and exportable as a reviewable unit; a provenance footer on every result; notebook (marimo) integration. This is the Figma-style adoption wedge β€” what makes LabRat a workflow tool, not a one-shot.

Pillar 3 β€” map the Rat Maze (the knowledge moat β€” the priority). The Scent layer and Cartographer are now shipped. search_reference_docs (#26a) gives the agent section-level lexical retrieval over reference docs. The Cartographer (#26b) is a deterministic, GT-firewalled first-contact pre-pass (cartograph_prepass) that explores a warehouse and writes Scent docs β€” table grain, columns, verify_join-confirmed joins, observed dimension values; structure is deterministic, an optional LLM "semantics" pass is flagged for a human to own. Wired into DAB via --agent-cartograph flag. The Cartographer is our brand; "AutoContext" is Altimate's (Altimate's AutoContext PR #53 set the precedent on the leaderboard at ~+8pp). Remaining Pillar 3 work: correction-harvesting loop, reusable analysis recipes, and the first-connect TUI path as a second Cartographer caller.

Foundations: extract labrat-core as an installable embeddable package; testcontainers integration tests for the live warehouse adapters; v1 GA after a week of dogfooding.

Why "LabRat"?

We started flirting with ratatui and Rust. We landed on Python + Textual because the agent is the point of the product and Python's iteration speed on prompts and tools dominates. The name stuck. The rat got a lab coat.

License

AGPL-3.0. Use, modify, and redistribute LabRat for any purpose, including commercial. If you distribute a modified version or run it as a service, you must release your modifications under the same license. Want LabRat under a more permissive license for proprietary use? Get in touch.

Development

uv run pytest                        # full test suite (659 tests)
uv run ruff check . && uv run ruff format --check . && uv run pyright   # lint + types

# Evals (see CLAUDE.md for the full matrix)
uv run python scripts/eval_duckdb.py                          # schema/SQL eval, no API key needed
uv run python scripts/eval_dab.py --driver claude-mcp --n-trials 5   # DataAgentBench (LabRat tools via MCP)
cd ~/repos/ade-bench && uv run ade run helixops_saas001 --db duckdb --project-type dbt --agent labrat_local --no-diffs

# Run the standalone agent on any prompt / any provider
uv run python scripts/run_task.py --prompt "How many rows in orders?" \
    --connections '{"main":{"db_type":"duckdb","db_path":"/path.duckdb"}}' --provider anthropic

# Mount the Rat Core over MCP in any host
LABRAT_MCP_CONNECTIONS='{"main":{"db_type":"duckdb","db_path":"/path.duckdb"}}' \
    uv run python -m labrat.mcp.server

Acknowledgments

LabRat stands on shoulders: Textual (mouse-native async TUIs in Python), Harlequin (a terminal SQL editor that feels professional), DuckDB, SQLGlot, Polars, and uv. The dbt Labs ADE-bench and UC Berkeley EPIC DataAgentBench teams build the kind of execution-based benchmarks that actually mean something. Meta's Analytics team and Anthropic's self-service-analytics writeup are the architectural foundation for the Rat Maze. And the folks at SignalPilot, Databao, and Altimate Code set a high bar in adjacent categories β€” we watch their work closely.

Contributing

Issues, discussions, and PRs welcome. The simplest contribution: use LabRat, hit a wall, and open an issue describing what broke.


A small rat in a big maze. Finding cheese, one query at a time.

About

πŸ€ Ask your data warehouse a question in plain English, watch SQL stream into a terminal editor, get results β€” without leaving your shell. dbt-native, 7 adapters, Claude-powered.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors