Cost-based query optimizer with statistics collection

## User Story

As a developer running complex queries (joins, subqueries, filtered aggregations) against DecentDB, I want the query planner to choose execution plans based on actual data distribution so that query performance is predictable and efficient regardless of table sizes and data skew.

## Problem

The current planner uses fixed heuristics with no selectivity estimation and processes joins in left-to-right declaration order. For simple queries this works fine, but as EF Core and application queries grow in complexity (multi-table joins, correlated subqueries, filtered aggregates), the lack of statistics leads to suboptimal plans — full scans where index seeks would be better, or wrong join ordering that produces large intermediate results.

## Acceptance Criteria

- [ ] **Statistics collection:** `ANALYZE` command computes per-table row counts and per-index cardinality; statistics stored in catalog
- [ ] **Statistics maintenance:** Row counts updated incrementally during INSERT/UPDATE/DELETE (exact or approximate)
- [ ] **Cost model:** Each plan operator has a cost function based on estimated input cardinality and I/O
- [ ] **Selectivity estimation:** WHERE predicates produce estimated selectivity from statistics (equality, range, LIKE)
- [ ] **Join reordering:** Planner evaluates multiple join orderings and selects lowest-cost plan (at least for ≤6 tables; heuristic fallback for more)
- [ ] **Index selection:** Cost-based choice between table scan vs. index seek based on estimated selectivity
- [ ] **EXPLAIN ANALYZE** shows estimated vs. actual row counts for plan validation
- [ ] **Backward compatibility:** Existing queries produce correct results (plans may change); no database format breakage for users who never run ANALYZE
- [ ] All existing tests pass; differential tests against PostgreSQL continue to match
- [ ] Write benchmarks show no regression; read benchmarks show improvement on multi-join queries

## Technical Notes

- **ADR:** 0038 (currently Deferred/Post-1.0 — needs full design ADR before implementation)
- **Risk:** High — affects every query plan; join reordering could expose latent executor bugs; regression risk requires extensive testing
- **Key files:** `src/planner/planner.nim` (~469 lines today), `src/planner/explain.nim`, `src/exec/exec.nim`, `src/storage/catalog.nim`
- **Current state:** 18 plan operators, zero statistics infrastructure, no cost model, fixed left-to-right join order
- **Format impact:** Yes — catalog needs to persist row counts and cardinality. Requires ADR + format version consideration.
- **Suggested phasing:**
  1. Statistics collection + ANALYZE command
  2. Cost model + selectivity estimation
  3. Index selection improvements
  4. Join reordering (dynamic programming, Selinger-style)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cost-based query optimizer with statistics collection #24

User Story

Problem

Acceptance Criteria

Technical Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Cost-based query optimizer with statistics collection #24

Description

User Story

Problem

Acceptance Criteria

Technical Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions