[RFC][FEATURE] PPL/SQL Query Linter & Static Analyzer #5405

@RyanL1997

Description

1. Problem Statement

1.1 Today: validation is post-execution

In OpenSearch SQL/PPL, every form of feedback the user gets about query quality lands after the query has already been dispatched (or, in the editor, after the parser fails outright). There is no advisory layer that runs before execution to warn the user about queries that are syntactically valid but semantically suboptimal or wrong.

The current state, by stage:

| Stage | Where it runs | What it catches | What it misses |
| --- | --- | --- | --- |
| ANTLR parse | OSD Monaco worker + backend | Syntax errors only — rendered as MarkerSeverity.Error red squigglies | Anything past tokens/grammar |
| SemanticCheckException | Backend Analyzer, at execution time | Hard semantic failures (e.g., "can't resolve field X") | Soft mistakes that don't fail the resolver |
| Cluster execution | Backend executor | Engine-level failures, OOMs, timeouts | Wrong-but-completing results |
| Pre-execution advisory | — (does not exist today) | — | Everything semantically suboptimal |

Everything between "ANTLR parsed it" and "the cluster ran it" is invisible until after the user clicks Run. The Monaco editor's MarkerSeverity.Warning channel is currently unused.

1.2 What that means in practice

The pipeline is forgiving on input but unforgiving on output: the query is dispatched as written, possibly scans a year of data, and returns either an error, an empty result, or a wrong result. Examples we see in the wild:

  • ... | where service = "checkout" | stats count() by user_id on a 5-billion-doc index — runs to completion, OOMs the coordinator at the stats stage.
  • ... | where message like "%5xx%" against a keyword-mapped field — silently returns zero rows. The user assumes "no errors == no matches".
  • source = logs-* | head 100 against a wildcard pattern matching zero indices — empty result, no warning, indistinguishable from "no data in the time range".
  • ... | sort @timestamp after stats — sorts the wrong post-aggregation columns, producing non-deterministic output.
  • A typo in a field name (stauts vs status) — SemanticCheckException thrown, but only after the user submits and waits.

None of these are syntax errors. None necessarily throw SemanticCheckException early enough to be useful. The user discovers the problem only after the query has either run wrong or failed at the cluster.

1.3 Where peers are

| Tool | Pre-execution advisory | Auto-fix | Cost preview |
| --- | --- | --- | --- |
| Azure Data Explorer (Kusto) | Yes — "Best Practices" analyzer | Yes — Quick-fix | Yes |
| BigQuery Studio | Yes — partition-filter warnings | No | Yes — "will process N GB" |
| DataGrip | Yes — named SQL inspections | Yes | No |
| OpenSearch SQL/PPL today | No | No | No |

We are the outlier. The capability exists in every comparable analytics product.

1.4 The gap

Summarized: the project has post-execution validation (parse errors, runtime exceptions, engine failures), but no pre-execution layer that gives the user advisory feedback while they are still composing the query. We need to add one.


2. Proposed Solution

2.1 The shift: pre-execution lint layer

Add a pre-execution lint layer that runs against the parsed query without executing it and returns structured diagnostics the editor can render inline. The layer is purely advisory: it never blocks, never executes data-plane queries, and never mutates state. It complements — not replaces — the existing post-execution validation.

                    ┌───────────────────────────────┐
                    │  PRE-EXECUTION (NEW)          │
  user types        │                               │
   query  ───────►  │   ANTLR parse  →  Lint layer  │  ───►  diagnostics in editor
                    │   (errors)        (warnings)  │
                    └───────────────────────────────┘
                                    │
                                    │ user clicks Run
                                    ▼
                    ┌───────────────────────────────┐
                    │  POST-EXECUTION (TODAY)       │
                    │                               │
                    │   Analyzer  →  Planner  →     │  ───►  results / errors
                    │                Executor       │
                    └───────────────────────────────┘

2.2 What ships

  1. Lint REST endpoints — POST /_plugins/_ppl/_lint and POST /_plugins/_sql/_lint accepting a query plus target index/data source, returning a structured diagnostics array without executing the query.
  2. Pluggable rule framework — adding a new rule should not require changes to the REST handler or the response schema. Rules are independent units that walk the parsed query and emit diagnostics.
  3. Initial rule catalog — at least 8 rules covering correctness, performance, and determinism categories, discovered from the existing PPL/SQL docs and exception sites (see §2.5).
  4. Frontend Monaco integration — debounced calls to the lint endpoint, results rendered as MarkerSeverity.Warning markers alongside the existing Error syntax markers, with hover tooltips and a one-click Apply fix action.
  5. Auto-fix mechanism — diagnostics that are auto-correctable carry a structured fix the editor applies on click.
  6. Cluster settings — operators can disable individual rules, adjust thresholds, or kill-switch the entire feature without a restart.

2.3 Goals & non-goals

Goals

  • G1. Stable REST contract for linting PPL and SQL queries against a target index/data source without executing them.
  • G2. Pluggable rule catalog that grows incrementally — both in-tree and (eventually) by other plugins.
  • G3. Ship at least 8 concrete rules at GA.
  • G4. Diagnostics carry enough metadata (position, severity, category, rule ID, optional fix) for editors to render a useful UX out of the box.
  • G5. OpenSearch Dashboards renders warnings inline within ~500 ms of a keystroke at p50.
  • G6. Linting is safe: never executes user data-plane queries, never mutates state, observes per-cluster opt-out.

Non-goals

  • Replacing SemanticCheckException for hard errors. The lint layer is additive and advisory.
  • Full Calcite EXPLAIN cost-model integration on day one (planned as a later phase).
  • Linting async data sources (Spark/Glue) end-to-end — different planner, different scope.
  • Blocking execution. Lint is advisory; the engine remains the source of truth for refusal.

2.4 Requirements

2.4.1 Must-have

  1. Lint endpoint — POST /_plugins/_ppl/_lint accepting { query, index, dataSourceId? } and returning a structured diagnostics array. The SQL counterpart is symmetric.
  2. Pluggable rule framework — the REST surface and response schema do not change when a new rule is added.
  3. Diagnostic schema — every diagnostic includes:
    • ruleId (stable, kebab-case)
    • severity (error | warning | info | hint)
    • category (correctness | performance | determinism | style)
    • message (user-facing)
    • position (precise start/end line/column and offsets — must be reliable for editor underlines)
    • optional fix (structured edit: start/end offset + replacement text)
    • optional docUrl
  4. Initial rule catalog (≥ 8 rules) — discovered from PPL/SQL docs and exception sites; see §2.5.
  5. Severity-level configuration via cluster settings — operators can disable individual rules and adjust thresholds without a restart.
  6. Master switch — plugins.sql.lint.enabled (default false until GA) so clusters can opt in.
  7. Frontend Monaco integration — debounced calls, results rendered as MarkerSeverity.Warning markers alongside existing Error markers; hover shows message + suggestion + docs link.
  8. Auto-fix payload — diagnostics that can be auto-corrected return a structured fix the editor applies via a code action.
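The diagnostic schema above can be pinned down as a set of types. The following TypeScript sketch mirrors the fields in item 3 (field names follow this RFC; the exact wire shape remains an implementation choice), plus a small runtime check for the "stable, kebab-case" constraint on ruleId:

```typescript
// Illustrative types for the diagnostic schema in §2.4.1 item 3.
// Names mirror the RFC text; the exact wire shape is an implementation choice.

type Severity = "error" | "warning" | "info" | "hint";
type Category = "correctness" | "performance" | "determinism" | "style";

interface Position {
  startLine: number;
  startColumn: number;
  endLine: number;
  endColumn: number;
  startOffset: number;
  endOffset: number;
}

interface FixEdit {
  startOffset: number;
  endOffset: number;
  replacement: string;
}

interface Fix {
  title: string;
  edits: FixEdit[];
}

interface Diagnostic {
  ruleId: string;    // stable, kebab-case
  severity: Severity;
  category: Category;
  message: string;   // user-facing
  position: Position;
  fix?: Fix;         // present only when the diagnostic is auto-correctable
  docUrl?: string;
}

// Runtime guard matching the "stable, kebab-case" constraint on ruleId.
const isKebabCase = (id: string): boolean =>
  /^[a-z][a-z0-9]*(-[a-z0-9]+)*$/.test(id);
```

Keeping fix and docUrl optional lets rules ship without an auto-fix first and grow one later without a schema change.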

2.4.2 Nice-to-have

  • Stretch rules beyond the GA catalog — additional candidates surfaced from the same docs / exception sources (see §2.5).
  • Cost-estimate badge (à la BigQuery's "This query will process X GB") — feasible once Calcite explain-cost is wired through.
  • External plugin SPI so third-party plugins can register rules without forking core.
  • Per-data-source rule subsets (some rules apply only to OpenSearch indices, others to Glue, etc.).
  • Telemetry on which fixes get accepted, to guide which rules earn their keep.

2.5 Rule discovery & catalog

Rather than prescribing a fixed list of rules upfront, this RFC asks the intern to discover the initial catalog from sources of truth that already exist in the project. The PPL and SQL documentation, the existing exception sites in the codebase, and the issue tracker collectively describe — with examples — every footgun the team has cared about enough to write down. Those are the natural rule candidates.

Where to look for rule candidates

  1. PPL command docs — docs/user/ppl/cmd/*.md. Many commands include "Note", "Limitation", or "Caveat" callouts describing wrong-result patterns. Each one is a rule candidate.
  2. PPL function docs — docs/user/ppl/functions/*.md. Function-specific footguns (regex semantics, type coercion, null handling) belong here.
  3. PPL general guides — docs/user/ppl/general/. Cross-cutting topics like wildcards, identifiers, and time filters.
  4. PPL limitations — docs/user/ppl/limitations/ and docs/user/limitations/. By definition these are documented constraints; every entry is a candidate for a pre-execution warning.
  5. SQL docs — docs/user/dql/, docs/user/admin/, docs/user/optimization/. Same pattern as PPL.
  6. SemanticCheckException throw sites — every place the analyzer throws SemanticCheckException is a hard error that fires at execution time today. Most of them have a natural pre-execution counterpart: the lint version surfaces the same condition as a warning while the user is still typing.
  7. Issues labelled bug / wrong-result / confused user in the opensearch-project/sql issue tracker. Recurring user reports point to gaps that lint can close.

What makes a good rule

  • It catches a real, repeatable mistake (cited in docs or issues).
  • It can be detected from the parsed query plus index metadata (mapping, _stats, _field_caps) — no execution required.
  • It has a clear, low-false-positive trigger condition, or a tunable threshold.
  • It belongs to one of: correctness, performance, determinism, style.

Where to look for rule exceptions (carve-outs)

For every rule, there are legitimate uses that should not fire a warning. The same docs are the source of those carve-outs: usage patterns the docs explicitly endorse, examples in the docs, and "this is fine when …" notes. A rule without documented carve-outs is probably under-specified.

For instance, a rule "warn on head without sort" carries the documented carve-out "no warning when the upstream is already deterministic (e.g., top N, rare N, a single-row scalar source)" — a carve-out taken directly from the head command docs.
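To make the carve-out concrete, here is a sketch of such a rule over a deliberately simplified pipeline model (just the ordered command names; a real rule would walk the parsed AST). The set of commands treated as deterministic is an assumption drawn from the example above, not the definitive list:

```typescript
// Sketch of a "head-without-sort" rule over a simplified pipeline model:
// the PPL query reduced to its ordered command names. The real rule would
// inspect the parsed AST; the command-name model is the only assumption.

// Upstream commands assumed to make row order deterministic; these are
// the carve-outs the RFC's example cites (sort, top N, rare N).
const DETERMINISTIC_UPSTREAM = new Set(["sort", "top", "rare"]);

function headWithoutSort(commands: string[]): string[] {
  const warnings: string[] = [];
  commands.forEach((cmd, i) => {
    if (cmd !== "head") return;
    const upstream = commands.slice(0, i);
    // Carve-out: no warning when any upstream command already fixes order.
    if (!upstream.some((c) => DETERMINISTIC_UPSTREAM.has(c))) {
      warnings.push("head-without-sort");
    }
  });
  return warnings;
}
```

Note how the carve-out is the bulk of the rule body: without it, every top-N pipeline ending in head would fire a false positive.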

Catalog target

The minimum bar at GA is at least 8 rules spanning all four categories, with at least one auto-fix-capable rule per category where it makes sense. The intern proposes the actual list in week 1 (see §3) after surveying the sources above.

Each proposed rule is reviewed against:

  • Source citation (which doc entry / exception throw site / issue motivates it).
  • Documented carve-outs.
  • Default severity rationale.
  • Auto-fix design (or rationale for why it has none).

Per-rule cluster-settings example

plugins.sql.lint.enabled: true
plugins.sql.lint.rules.<rule-id>.severity: warning      # off | hint | info | warning | error
plugins.sql.lint.thresholds.<rule-id>.<param>: 100000   # rule-specific threshold

2.6 High-level design

The pre-execution layer sits next to the existing query path and reuses its parser and analyzer:

HTTP request
  → REST handler                        (new)
  → Parser (PPL or SQL)                 (existing ANTLR path, reused)
  → Lint context build                  (mapping, _stats, settings — cached)
  → Tolerant analyzer pass              (catches semantic errors as diagnostics, not exceptions)
  → Rule engine                         (each rule visits the plan + context, emits diagnostics)
  → JSON response

Two design points worth calling out:

  • Reuse the existing parser and analyzer. The lint engine should not re-implement type inference or field resolution — it should run the same code paths as a real query, but in a tolerant mode where what would normally throw becomes a diagnostic. This is where most of the high-value rules (field validation, type mismatch) come from "for free".
  • Diagnostics carry precise positions. Editors need offsets accurate to the offending token, not to the surrounding command. The structured position field is non-negotiable.
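Under the rule-registry design sketched above, the rule-engine stage could look like the following (names such as LintRule and runLint are hypothetical): each rule is an isolated unit, a disabled or throwing rule never affects the others, and adding a rule touches neither the runner nor the response schema.

```typescript
// Hypothetical shape of the pluggable rule engine in §2.6. Each rule
// receives the parsed plan plus a lint context and emits diagnostics;
// the runner enforces best-effort semantics and the per-rule kill switch.

interface LintContext {
  index: string;
  mapping: Record<string, string>;   // field name -> type (cached)
  settings: Record<string, number>;  // per-rule thresholds
}

interface Diag {
  ruleId: string;
  severity: string;
  message: string;
}

interface LintRule {
  id: string;
  check(plan: unknown, ctx: LintContext): Diag[];
}

function runLint(
  plan: unknown,
  ctx: LintContext,
  rules: LintRule[],
  enabled: Set<string>,
): Diag[] {
  const out: Diag[] = [];
  for (const rule of rules) {
    if (!enabled.has(rule.id)) continue; // per-rule cluster-setting off switch
    try {
      out.push(...rule.check(plan, ctx));
    } catch {
      // Best-effort: a failing or timed-out rule is skipped silently
      // rather than failing the whole lint response.
    }
  }
  return out;
}
```

The try/catch per rule is what makes the catalog safely pluggable: one buggy rule degrades to a missing warning, never to a broken endpoint.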

2.7 API contract

Request

POST /_plugins/_ppl/_lint
Content-Type: application/json

{
  "query":  "source=logs-* | where status = 500 | stats count() by user_id",
  "index":  "logs-*",
  "dataSourceId": "my_glue",        // optional, defaults to local cluster
  "rules":  ["field-validation"],   // optional allowlist
  "settings": {                     // optional per-call overrides
    "highCardinalityThreshold": 100000
  }
}

POST /_plugins/_sql/_lint is symmetric.

Response

{
  "took": 14,
  "diagnostics": [
    {
      "ruleId": "high-cardinality-groupby",
      "severity": "warning",
      "category": "performance",
      "message": "Grouping by 'user_id' (cardinality ≈ 4.2M) may exhaust coordinator memory.",
      "command": "stats",
      "position": {
        "startLine": 1, "startColumn": 51,
        "endLine":   1, "endColumn":   58,
        "startOffset": 50, "endOffset": 57
      },
      "fix": {
        "title": "Replace stats-by with `top N`",
        "edits": [
          { "startOffset": 50, "endOffset": 57, "replacement": "user_id | head 1000" }
        ]
      },
      "docUrl": "https://opensearch.org/docs/latest/.../lint/high-cardinality-groupby/"
    }
  ]
}

The endpoint always returns 200 if the request is well-formed — parse errors are surfaced as a single error-severity diagnostic with ruleId: "syntax" so the frontend renders them via the same code path.
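Because fix.edits are plain offset ranges into the original query text, applying a fix reduces to a string splice. This sketch assumes non-overlapping edits and an exclusive endOffset; applying edits right-to-left keeps earlier offsets valid:

```typescript
// Applying a structured fix from the lint response: each edit replaces
// the half-open range [startOffset, endOffset) in the original query.
// Assumes edits do not overlap; sorting descending by startOffset means
// later splices never invalidate earlier offsets.

interface Edit {
  startOffset: number;
  endOffset: number;
  replacement: string;
}

function applyFix(query: string, edits: Edit[]): string {
  const sorted = [...edits].sort((a, b) => b.startOffset - a.startOffset);
  let out = query;
  for (const e of sorted) {
    out = out.slice(0, e.startOffset) + e.replacement + out.slice(e.endOffset);
  }
  return out;
}
```

This is the whole contract the editor-side code action needs: no re-parse, no rule knowledge, just offsets and replacement text.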

2.8 Frontend integration (OpenSearch Dashboards)

  • An OSD server proxy route forwards /api/enhancements/ppl/lint to _plugins/_ppl/_lint (same shape as the existing _grammar proxy), threading dataSourceId for multi-DS clusters.
  • The Monaco PPL language module debounces lint calls (~500 ms) on content change, cancels in-flight requests on new keystrokes, and renders diagnostics as MarkerSeverity.Warning markers using the existing setModelMarkers() infrastructure.
  • A code-action provider reads each diagnostic's structured fix and exposes a one-click Apply fix in the lightbulb menu.
  • When the cluster has lint disabled, the proxy returns an empty diagnostics array; the editor never lights up.

No new UI framework, no new component library — pure addition on the existing Monaco infrastructure.
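The cancel-on-new-keystroke behavior boils down to a stale-response guard. This sketch isolates that guard with a sequence counter; Monaco specifics (setModelMarkers, the proxy route) are omitted, and render() is a stand-in for marker rendering:

```typescript
// Stale-response guard behind the debounced lint loop: each (debounced)
// keystroke bumps a sequence number, and a lint response is rendered only
// if no newer request started in the meantime. render() stands in for
// the Monaco setModelMarkers() call.

function createLintLoop(render: (diags: string[]) => void) {
  let seq = 0;
  return {
    // Call when a debounced content-change fires; returns this request's id.
    start(): number {
      return ++seq;
    },
    // Call when the lint response arrives; drops it if superseded.
    apply(requestSeq: number, diags: string[]): boolean {
      if (requestSeq !== seq) return false; // a newer keystroke won
      render(diags);
      return true;
    },
  };
}
```

In the real module the same guard would pair with an AbortController to cancel the in-flight HTTP request, not just discard its result.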

2.9 Cross-cutting concerns

  • Performance. Mapping and stats lookups must be cached (per cluster + index, short TTL) so repeated keystrokes don't hammer _mapping. Per-rule and per-request timeouts bound worst-case latency. Lint is best-effort: a rule that times out is silently skipped rather than failing the response.
  • Security. The endpoint reads _mapping and _stats for the target index — caller's role must already permit those reads. No privilege elevation. Audit-log impact (queries logged on every keystroke) is documented; the master switch lets operators opt out.
  • Multi-data-source. dataSourceId is plumbed end-to-end. For non-OpenSearch data sources, AST-only rules apply; metadata-dependent rules emit an info diagnostic noting they couldn't run.
  • Settings. plugins.sql.lint.enabled, plugins.sql.lint.rules.<rule-id>.severity, and per-rule thresholds are cluster settings (no restart required to change).
  • i18n. Diagnostic messages are externalized for translation; English ships day one.
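The per-cluster, per-index metadata cache in the Performance bullet can be as simple as a timestamped map. The TTL handling and injected clock in this sketch are illustrative choices, not the plugin's actual cache:

```typescript
// Minimal TTL cache for _mapping/_stats lookups keyed by cluster + index,
// so repeated keystrokes reuse metadata instead of re-fetching _mapping.
// The injected clock makes expiry testable; values here are illustrative.

function createTtlCache<V>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, { value: V; expires: number }>();
  return {
    get(key: string): V | undefined {
      const e = entries.get(key);
      if (!e || e.expires <= now()) {
        entries.delete(key); // expired entries are evicted lazily on read
        return undefined;
      }
      return e.value;
    },
    set(key: string, value: V): void {
      entries.set(key, { value, expires: now() + ttlMs });
    },
  };
}
```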

3. Acceptance Criteria

  • POST /_plugins/_ppl/_lint and POST /_plugins/_sql/_lint return the documented schema for the initial catalog (≥ 8 rules), with positive and negative integration tests for each rule.
  • Per-rule cluster setting flips a rule off without a restart.
  • Adding a new rule does not require changes to the REST handler or the response schema.
  • A sample external-plugin rule is discovered and run end-to-end without modifying the SQL plugin or core.
  • "How to add a rule" guide is published and the intern's own rules serve as canonical examples.
  • OSD Monaco renders warning markers within ~500 ms p50 against a 10-line query.
  • With plugins.sql.lint.enabled: false, no behavior change anywhere; the new endpoints are inert.

4. Open Questions

  1. Default-on or default-off at GA? Lean default-off until P6 ships and feedback is positive, then flip.
  2. Source of cardinality estimates for cardinality-sensitive rules. _field_caps doesn't expose value cardinality; a sampled terms aggregation is one possible path but needs a cost ceiling.
  3. Where do per-rule docs live? Proposal: docs/user/ppl/lint/<rule-id>.md, with each diagnostic's docUrl pointing into the published docs.
  4. Should error-severity diagnostics block submission if the user hits Run anyway? Recommended: no — the engine remains the source of truth for refusal. Lint is advisory by definition.
  5. Telemetry on rule emission and fix-acceptance — useful for prioritizing the catalog, but needs a privacy review before P6.
  6. Streaming / partial parses — could we lint mid-typing? Out of scope for v1; needs partial-AST recovery.

5. References

Primary sources for rule and carve-out discovery (in-repo)

Prior art (external)


Appendix — Illustrative end-to-end example

The rule IDs below are placeholders for shape only — the actual catalog is locked in week 1 from the discovery output (see §2.5).

Input

source=logs-* | where status = 500 | stats count() by user_id | head 100

Diagnostics emitted (truncated for brevity)

[
  {
    "ruleId": "missing-time-filter",
    "severity": "info",
    "category": "performance",
    "message": "Index 'logs-*' has a @timestamp field but no predicate filters on it. Full scans on time-series indices are usually unintended.",
    "fix": {
      "title": "Restrict to last 1 hour",
      "edits": [{ "startOffset": 16, "endOffset": 16, "replacement": "where @timestamp > now() - 1h | " }]
    }
  },
  {
    "ruleId": "high-cardinality-groupby",
    "severity": "warning",
    "category": "performance",
    "message": "Grouping by 'user_id' (cardinality ≈ 4.2M) may exhaust coordinator memory."
  },
  {
    "ruleId": "head-without-sort",
    "severity": "info",
    "category": "determinism",
    "message": "head returns an arbitrary subset of rows; output is non-deterministic."
  }
]

Three warnings on a query the user thought was fine. That is the value proposition.

Labels: RFC (Request For Comments), enhancement (New feature or request), feature