Natural Language → ClickHouse via GPT-5 Context-Free Grammar

A small Next.js app that turns English questions into ClickHouse SQL using GPT-5's newly-added Context-Free Grammar (CFG) feature, then runs the queries against the NYC Taxi sample dataset (default.nyc_taxi, 20M trips, 2015-07-01 → 2015-09-30) on ClickHouse Cloud.

An exploration of §3 of OpenAI's GPT-5 cookbook: what does grammar-constrained decoding actually buy over an unconstrained baseline, measured with live evals?

Live demo: https://cfg-eval.vercel.app
Try it: https://cfg-eval.vercel.app/query
Long-form writeup: https://cfg-eval.vercel.app/about

NL question  ──▶  /api/query  ──▶  GPT-5 + Lark CFG  ──▶  ClickHouse Cloud  ──▶  result table
   (React)         (route)         (custom tool decode)      (HTTPS / 8443)        (React)

The model's decode is constrained, at every token, by a Lark grammar that defines a safe SELECT subset over nyc_taxi. Anything outside the grammar — DDL, DML, joins, subqueries, comments, lat/long columns, lowercase keywords — is structurally unrepresentable in the output, not just discouraged.

Quick start

Requires Node 20+ (or Bun), an OpenAI API key with GPT-5 access, and a ClickHouse Cloud service with the NYC Taxi sample dataset loaded.

# 1. Install
bun install            # or: npm install

# 2. Configure env
cp .env.example .env.local
#   Fill in OPENAI_API_KEY + CLICKHOUSE_HOST/USER/PASSWORD/DATABASE.

# 3. Run
bun run dev

Open http://localhost:3000/query, type a question (e.g. "Top 5 pickup neighborhoods by trip count in August 2015"), watch GPT-5 emit a grammar-constrained SQL query and ClickHouse return the rows.

The deliverables

1. Deployed app — `/query`

A single-prompt form. The route at app/api/query/route.ts:

validates the body with Zod,
checks an in-memory LRU cache (identical question → instant zero-token hit),
calls generateSQLConstrained — a GPT-5 Responses API request with a custom tool whose format is the Lark grammar,
extracts the SQL from the custom_tool_call output item,
executes it via runQuery against ClickHouse Cloud (HTTPS / 8443, max-execution-time + max-result-rows guards),
returns { sql, columns, rows, generationMs, executionMs, usage }.

2. The CFG — `lib/grammar/taxi.ts`

A Lark grammar accepting only: one SELECT from nyc_taxi; 13 whitelisted columns (lat/long excluded); aggregates count/sum/avg/min/max/uniq/uniqExact; date functions toDate/toHour/toStartOf*/toDayOfWeek/toMonth/toYear; WHERE/GROUP BY/HAVING/ORDER BY/LIMIT; UPPERCASE keywords. Whitespace is threaded explicitly (per the cookbook) rather than via %ignore WS; every terminal is content- and length-bounded.

3. Four evals + a head-to-head — `tests/evals.test.ts`

Gated behind RUN_EVALS=1 because they call GPT-5 and run real queries:

RUN_EVALS=1 bun run test                       # full set, default N=2 trials per case
RUN_EVALS=1 EVAL_N=5 bun run test              # 5 trials per case
RUN_EVALS=1 EVAL_SLICE=headline bun run test   # discriminating slice only (cheap headline run)

The full set splits into a control arm (clean prompts — both modes score ~100%) and a discriminating slice (3 adversarial + 5 phantom-column prompts built to break the baseline). The HEADLINE suite prints that slice as one table and asserts the constrained side is clean. EVAL_SLICE=headline runs just those 8 prompts.

#	Eval	What it asserts
1	Execution correctness	Result matches a live ClickHouse reference run for each of 21 labelled prompts (easy/medium/hard + an adversarial slice). Compares result sets, not SQL strings. pass@N ≥ 0.5.
2	SQL validity	Every constrained output executes on ClickHouse — zero failures. An execution error means the grammar accepted something CH didn't.
3	Schema adherence	Generated SQL references only real columns/tables/functions. Zero violations on constrained output by construction; the adversarial slice is where the unconstrained baseline drifts (`toUnixTimestamp`, `dateDiff`).
4	Refusal handling	10 unanswerable prompts (out-of-domain + phantom columns — real NYC-TLC fields absent from this 13-column subset). Constrained path declines via `cannot_answer` tool; unconstrained baseline emits degenerate answers (`SELECT 0 AS mta_tax`, `WHERE 1=0`).
5	CFG vs no-CFG head-to-head	Per-case + overall + adversarial-slice tables, each with a `schemaClean` column. Hard assertion: constrained `execRate == 1.0`.
★	HEADLINE	The discriminating slice in one table: on the 8 prompts built to break the baseline, `baselineFailed` vs `cfgFailed` (asserted 0). Persisted per run in the runbook.

Cases live in tests/eval-cases.ts and tests/out-of-scope-cases.ts. The SQL introspection behind eval 3 (tests/sql-introspect.ts) has offline unit tests. VERIFY_CASES=1 bun run test asserts each case's distractor SQL differs from the reference answer — so a coincidentally-correct query can't pass.

The adversarial slice design and grading methodology are explained in the long-form writeup.

Architecture

app/
  page.tsx              landing
  about/page.tsx        long-form writeup (context, grammar, evals, decisions)
  query/page.tsx        single-prompt form + Trace sidebar (client)
  evals/page.tsx        in-browser eval runner (same cases + grading as vitest)
  api/query/route.ts    POST: NL -> grammar-constrained SQL -> ClickHouse
  api/evals/route.ts    POST: run ONE eval trial, graded vs a live reference run
lib/
  openai.ts             OpenAI client + reasoning/verbosity defaults
  clickhouse.ts         CH Cloud client + bounded runQuery
  nl-to-sql.ts          generateSQLConstrained (CFG) + generateSQLUnconstrained (baseline)
  sql-guard.ts          runtime allowlist guard for manually edited SQL (/api/execute)
  result-compare.ts     result-set comparison (set/ordered/scalar) — correctness oracle
  eval-run.ts           shared request/verdict contract for the in-app eval runner
  grammar/
    taxi.ts             the canonical Lark grammar + schema description + allowlists
tests/
  eval-cases.ts         21 labelled cases (question, referenceSQL, compareMode, difficulty, distractors)
  out-of-scope-cases.ts 10 unanswerable prompts for the refusal eval
  sql-introspect.ts     identifier extraction for the schema-adherence guard
  evals.test.ts         the four evals + head-to-head (gated by RUN_EVALS=1)
  runbook.ts            persists per-run metrics + diffs vs the previous run
scripts/
  check_grammar.py      extracts NYC_TAXI_LARK from the .ts module and validates via Earley parser
  smoke-gpt5.ts         one-shot end-to-end smoke (bun run scripts/smoke-gpt5.ts)
  build-eval-set.ts     reads negative signals back and prints EvalCase scaffolds

Notable design choices:

Raw openai SDK — the custom-tool + grammar shape isn't first-class in @ai-sdk/openai yet.
Earley parser (not LALR) for local validation — explicit whitespace creates LALR-1 conflicts; LLGuidance isn't LALR-bound and accepts the same grammar.
In-memory LRU cache — the 2015 dataset is static; identical questions are instant zero-token hits, bounded at 256 entries.
reasoning.effort: "low" on the constrained path — the grammar guarantees syntactic validity, so the model doesn't spend reasoning tokens self-checking. Eval-confirmed latency + cost win at no accuracy loss.
LLM observability via Raindrop HTTP ingestion API, emitted from Next's after() (zero user-facing latency). Every /api/query and /api/execute request lands as a trace event; the eval harness logs each trial as nl_to_sql_eval. build-eval-set.ts reads those signals back to fold real failures into the regression set. No key set → silent no-op. See lib/raindrop.ts.

What I'd add before production

Prompt-injection adversarial set — DROP TABLE, ;--, role-spoofing prompts to measure what the grammar forecloses vs prompting alone.
Multi-table support — the grammar pins to one table; the cookbook flags joins/CTEs as a genuine LLGuidance failure mode.
Historical traces in-app — the Trace sidebar shows the current request; past traces live only in the external dashboard.
Auth + durable rate limiting — the in-process limiter (lib/rate-limit.ts) resets on restart; production needs a shared store.
CI eval cron — a scheduled run per model tier would catch silent regressions automatically.

Tech stack

Next.js 16 (Turbopack) · React 19 · TypeScript · Tailwind + shadcn/ui · openai 6.x (Responses API) · @clickhouse/client · Vitest · Zod · Python lark (Earley) for the local grammar validator.

Scripts

bun run dev         # start the dev server
bun run build       # production build (Turbopack)
bun run typecheck   # tsc --noEmit
bun run lint        # eslint
bun run test        # vitest — offline suites always (grammar, sql-guard,
                    # introspection, result-shape); live evals if RUN_EVALS=1

# One-time, for local grammar tests / scripts/check_grammar.py:
python3 -m venv .venv && .venv/bin/pip install lark

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
app		app
components		components
lib		lib
public		public
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
bun.lock		bun.lock
components.json		components.json
eslint.config.mjs		eslint.config.mjs
next-env.d.ts		next-env.d.ts
next.config.ts		next.config.ts
package.json		package.json
postcss.config.mjs		postcss.config.mjs
tailwind.config.ts		tailwind.config.ts
tsconfig.json		tsconfig.json
vitest.config.ts		vitest.config.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Natural Language → ClickHouse via GPT-5 Context-Free Grammar

Quick start

The deliverables

1. Deployed app — `/query`

2. The CFG — `lib/grammar/taxi.ts`

3. Four evals + a head-to-head — `tests/evals.test.ts`

Architecture

What I'd add before production

Tech stack

Scripts

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Natural Language → ClickHouse via GPT-5 Context-Free Grammar

Quick start

The deliverables

1. Deployed app — /query

2. The CFG — lib/grammar/taxi.ts

3. Four evals + a head-to-head — tests/evals.test.ts

Architecture

What I'd add before production

Tech stack

Scripts

About

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

1. Deployed app — `/query`

2. The CFG — `lib/grammar/taxi.ts`

3. Four evals + a head-to-head — `tests/evals.test.ts`

Packages