galderic/mcp-databricks


Databricks SQL Warehouse MCP

Minimal Python MCP server that exposes two tools — run_sql and get_statement — for executing and monitoring SQL against a remote Databricks SQL warehouse.

This server is intentionally narrow:

  • SQL warehouse only
  • PAT authentication only
  • configuration from environment variables and env files
  • two MCP tools: run_sql and get_statement
  • hard-coded production guardrail (see below)
  • inline results for small read queries
  • best-effort lookup of inserted row count from query history using the returned statement ID

What it returns

run_sql returns structured JSON with:

  • statement_id: the Databricks statement execution ID
  • status: final Databricks execution state
  • is_terminal: whether Databricks reports the statement in a terminal state
  • timed_out: whether this MCP stopped polling before the statement reached a terminal state
  • error_message: Databricks statement error for failed/canceled statements when available
  • inserted_row_count: best-effort value from query history when available
  • query_result: inline rows for small read queries when Databricks returns result data
    • columns: ordered column metadata
    • rows: result rows in JSON_ARRAY format
    • row_count: number of rows returned inline
    • total_row_count: total row count when Databricks provides it
    • truncated: whether more result rows exist than were returned inline
    • next_chunk_internal_link: Databricks next-chunk hint when present
    • scalar_value: convenience value for 1x1 queries such as SELECT COUNT(*)
  • warehouse_id
  • catalog and schema actually used
  • history_lookup_status: whether the inserted row count was found

The statement_id can be searched in the Databricks Query History UI.
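For illustration, a successful small SELECT COUNT(*) might come back shaped roughly like the following. Every value here is made up, and exact field spellings and value encodings may differ; the list above is the authoritative description.

```json
{
  "statement_id": "01f0-example-statement-id",
  "status": "SUCCEEDED",
  "is_terminal": true,
  "timed_out": false,
  "error_message": null,
  "inserted_row_count": null,
  "query_result": {
    "columns": [{"name": "cnt", "type_name": "LONG"}],
    "rows": [["42"]],
    "row_count": 1,
    "total_row_count": 1,
    "truncated": false,
    "next_chunk_internal_link": null,
    "scalar_value": "42"
  },
  "warehouse_id": "abc123",
  "catalog": "main",
  "schema": "default",
  "history_lookup_status": "skipped"
}
```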

Environment

Copy .env.example to .env and fill in:

  • DATABRICKS_HOST
  • DATABRICKS_TOKEN
  • DATABRICKS_SQL_WAREHOUSE_ID

Optional:

  • DATABRICKS_CATALOG
  • DATABRICKS_SCHEMA
  • DATABRICKS_STATEMENT_TIMEOUT_SECONDS
  • DATABRICKS_POLL_INTERVAL_SECONDS
  • DATABRICKS_HISTORY_LOOKUP_TIMEOUT_SECONDS
  • DATABRICKS_HISTORY_LOOKUP_POLL_SECONDS

Configuration loading precedence is:

  1. real process environment variables
  2. a local .env discovered from the current working directory
  3. ~/.config/mcp-databricks.env
  4. ~/.config/mcp-databricks/.env
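The precedence above can be sketched as follows. This is a simplified approximation for illustration only, not the package's actual loader; the function names candidate_env_files and resolve are invented here, and the real code may parse env files differently.

```python
import os
from pathlib import Path


def candidate_env_files() -> list[Path]:
    """Env-file candidates in precedence order, checked after real env vars."""
    home = Path.home()
    return [
        Path.cwd() / ".env",                           # 2. local .env from the cwd
        home / ".config" / "mcp-databricks.env",       # 3. machine-level file
        home / ".config" / "mcp-databricks" / ".env",  # 4. directory variant
    ]


def resolve(name: str, env=os.environ):
    """First match wins: process environment, then the env files in order."""
    if name in env:                        # 1. real environment variables
        return env[name]
    for path in candidate_env_files():
        if path.is_file():
            for line in path.read_text().splitlines():
                key, _, value = line.partition("=")
                if key.strip() == name:
                    return value.strip()
    return None
```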

For a stable machine-level setup, prefer ~/.config/mcp-databricks.env:

mkdir -p ~/.config
cat > ~/.config/mcp-databricks.env <<'EOF'
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
DATABRICKS_TOKEN=your-token
DATABRICKS_SQL_WAREHOUSE_ID=your-warehouse-id
EOF

That lets you rotate the token or switch warehouses without re-registering the MCP server in Codex or Claude Code.

After changing a machine-level config file, restart the MCP client or start a new session so the server process picks up the new values.

Authentication Notes

This server uses a Databricks access token in DATABRICKS_TOKEN.

If your Databricks workspace uses the newer scoped PAT flow, create the token with:

  • token type: BI Tools
  • API scope: sql

That matches this server's API usage, which is limited to Databricks SQL endpoints under /api/2.0/sql/*.

The identity behind the token still needs normal Databricks permissions, including:

  • CAN USE on the target SQL warehouse
  • warehouse/data permissions required by the SQL you execute
  • preferably CAN MONITOR on the warehouse if you want the best-effort query-history lookup to populate inserted_row_count

Install

With uv:

uv sync

With pip:

python -m venv .venv
source .venv/bin/activate
pip install -e .

Run locally

python -m mcp_databricks

This starts the server over stdio, which is the recommended transport for a local Codex MCP.

Register In MCP Clients

After installing this package into an environment, point your MCP client at that environment's Python:

/path/to/python -m mcp_databricks

Codex

Recommended registration when using ~/.config/mcp-databricks.env:

codex mcp add databricks-sql \
  -- /path/to/python -m mcp_databricks

If you prefer to keep configuration inside Codex instead, you can still register with explicit environment variables:

codex mcp add databricks-sql \
  --env DATABRICKS_HOST=https://<your-workspace>.cloud.databricks.com \
  --env DATABRICKS_TOKEN=<your-token> \
  --env DATABRICKS_SQL_WAREHOUSE_ID=<your-warehouse-id> \
  -- /path/to/python -m mcp_databricks

Verify:

codex mcp list
codex mcp get databricks-sql

Claude Code

Recommended registration when using ~/.config/mcp-databricks.env:

claude mcp add-json databricks-sql '{
  "type": "stdio",
  "command": "/path/to/python",
  "args": ["-m", "mcp_databricks"]
}'

If you prefer to keep configuration inside Claude Code instead, register with env in the JSON:

claude mcp add-json databricks-sql '{
  "type": "stdio",
  "command": "/path/to/python",
  "args": ["-m", "mcp_databricks"],
  "env": {
    "DATABRICKS_HOST": "https://<your-workspace>.cloud.databricks.com",
    "DATABRICKS_TOKEN": "<your-token>",
    "DATABRICKS_SQL_WAREHOUSE_ID": "<your-warehouse-id>"
  }
}'

Project-scoped example:

claude mcp add-json -s project databricks-sql '{
  "type": "stdio",
  "command": "/path/to/python",
  "args": ["-m", "mcp_databricks"],
  "env": {
    "DATABRICKS_HOST": "https://<your-workspace>.cloud.databricks.com",
    "DATABRICKS_TOKEN": "<your-token>",
    "DATABRICKS_SQL_WAREHOUSE_ID": "<your-warehouse-id>"
  }
}'

Verify:

claude mcp list
claude mcp get databricks-sql

Production Guardrail

This server enforces a hard-coded safety rule: any mutating SQL statement that targets an object whose name contains a component starting with pro_ is refused before it reaches Databricks.

Mutating statements checked: INSERT, UPDATE, DELETE, MERGE, TRUNCATE, COPY INTO, CREATE, DROP, ALTER.

The guardrail inspects the target of the statement, not every identifier in the query. Read-only queries (SELECT, SHOW, DESCRIBE, EXPLAIN, ...) are always allowed, even against pro_ objects.

The check applies to every dot-separated component of the target's qualified name:

SQL                                         Blocked?  Reason
SELECT * FROM pro_catalog.schema.events     No        Read-only
INSERT INTO pro_catalog.schema.events ...   Yes       Catalog starts with pro_
DROP TABLE catalog.pro_schema.table         Yes       Schema starts with pro_
UPDATE catalog.schema.pro_users SET ...     Yes       Table starts with pro_
INSERT INTO dev.schema.events ...           No        No pro_ component

Default catalog/schema resolution is also considered: if the configured DATABRICKS_CATALOG or DATABRICKS_SCHEMA starts with pro_, unqualified mutating targets are blocked too.
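The check can be sketched roughly like this. This is a deliberately simplified approximation for illustration, not the actual src/mcp_databricks/guardrails.py implementation; real SQL target extraction is more involved than two regexes.

```python
import re

PROTECTED_PREFIX = "pro_"

# Statement kinds treated as mutating (everything else is read-only).
MUTATING = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|MERGE|TRUNCATE|COPY\s+INTO|CREATE|DROP|ALTER)\b",
    re.IGNORECASE,
)

# Rough extraction of the statement's target object (not every identifier).
TARGET = re.compile(
    r"\b(?:INSERT\s+INTO|MERGE\s+INTO|COPY\s+INTO|UPDATE|DELETE\s+FROM|"
    r"TRUNCATE\s+TABLE|(?:CREATE|DROP|ALTER)\s+\w+)\s+"
    r"(`?\w+`?(?:\.`?\w+`?){0,2})",
    re.IGNORECASE,
)


def is_blocked(sql: str, defaults: tuple = ()) -> bool:
    """True if a mutating statement targets a pro_-prefixed name component."""
    if not MUTATING.match(sql):
        return False                      # read-only: always allowed
    m = TARGET.search(sql)
    if not m:
        return False
    parts = m.group(1).replace("`", "").split(".")
    if len(parts) < 3:                    # unqualified: prepend configured defaults
        parts = list(defaults) + parts
    return any(p.lower().startswith(PROTECTED_PREFIX) for p in parts)
```

The defaults tuple stands in for the configured DATABRICKS_CATALOG and DATABRICKS_SCHEMA, so an unqualified INSERT under a pro_ default catalog is blocked just like a fully qualified one.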

When blocked, run_sql returns:

{"error": "Blocked: refusing to execute mutating SQL against '...'. Objects with identifiers starting with 'pro_' are protected.", "blocked": true}

This guardrail cannot be disabled via configuration. To change the protected prefix, edit PROTECTED_PREFIX in src/mcp_databricks/guardrails.py.

MCP tools

run_sql

Arguments:

  • sql: SQL statement to execute
  • catalog: optional catalog override
  • schema: optional schema override
  • timeout_seconds: optional per-call timeout override
  • poll_interval_seconds: optional per-call polling override

Notes:

  • Databricks Statement Execution API accepts one statement per request.
  • The submit request is detached from execution because the server sends wait_timeout=0s.
  • Inserted row counts are not reliably present in the immediate execution response, so the server performs a best-effort query-history lookup after successful execution.
  • If query history has not indexed the statement yet, inserted_row_count will be null.
  • For successful read queries, query_result returns the first inline result chunk in JSON_ARRAY format.
  • query_result.rows is capped at 1000 rows. Larger or multi-chunk results are flagged with query_result.truncated = true.
  • For 1x1 results such as SELECT COUNT(*), the value is also exposed as query_result.scalar_value.
  • If polling times out, the tool still returns statement_id plus the last known status so you can resume later with get_statement.
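The detached submit-then-poll flow described above can be sketched like this. This is an illustrative approximation rather than the server's actual code: the helper names are invented, and while the endpoint path and fields such as wait_timeout, disposition, and format follow the public Statement Execution API, verify them against your workspace's API version.

```python
import json
import time
import urllib.request


def build_submit_payload(sql, warehouse_id, catalog=None, schema=None):
    """Body for POST /api/2.0/sql/statements.

    wait_timeout=0s detaches the request from execution, so the caller
    polls for completion by statement_id afterwards.
    """
    body = {
        "statement": sql,
        "warehouse_id": warehouse_id,
        "wait_timeout": "0s",
        "format": "JSON_ARRAY",
        "disposition": "INLINE",
    }
    if catalog:
        body["catalog"] = catalog
    if schema:
        body["schema"] = schema
    return body


def poll_statement(host, token, statement_id, timeout=60, interval=2):
    """Poll GET /api/2.0/sql/statements/{id} until terminal or timed out.

    Returns (last_response, timed_out); on timeout the caller still holds
    statement_id and can resume polling later.
    """
    deadline = time.monotonic() + timeout
    while True:
        req = urllib.request.Request(
            f"{host}/api/2.0/sql/statements/{statement_id}",
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        state = data["status"]["state"]
        if state in ("SUCCEEDED", "FAILED", "CANCELED", "CLOSED"):
            return data, False            # terminal state reached
        if time.monotonic() >= deadline:
            return data, True             # still running; report timed_out
        time.sleep(interval)
```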

get_statement

Arguments:

  • statement_id: Databricks statement execution ID returned by run_sql
  • timeout_seconds: optional extra time to keep polling for completion
  • poll_interval_seconds: optional polling override when timeout_seconds is set

Notes:

  • This tool checks statement status later by statement_id; the original submit connection does not need to stay open.
  • If the statement has succeeded, the tool returns the same small inline query_result shape as run_sql when Databricks includes result data.
  • If the statement is still running and the polling window expires, it returns timed_out = true with the last known status.
  • If the statement failed or was canceled, the tool returns the status and error_message instead of hiding the outcome behind a timeout.

About

My own local, custom Databricks MCP server.
