Skip to content

Cost-based query optimizer with statistics collection #24

@sphildreth

Description

@sphildreth

User Story

As a developer running complex queries (joins, subqueries, filtered aggregations) against DecentDB, I want the query planner to choose execution plans based on actual data distribution so that query performance is predictable and efficient regardless of table sizes and data skew.

Problem

The current planner uses fixed heuristics with no selectivity estimation and processes joins in left-to-right declaration order. For simple queries this works fine, but as EF Core and application queries grow in complexity (multi-table joins, correlated subqueries, filtered aggregates), the lack of statistics leads to suboptimal plans — full scans where index seeks would be better, or wrong join ordering that produces large intermediate results.

Acceptance Criteria

  • Statistics collection: ANALYZE command computes per-table row counts and per-index cardinality; statistics stored in catalog
  • Statistics maintenance: Row counts updated incrementally during INSERT/UPDATE/DELETE (exact or approximate)
  • Cost model: Each plan operator has a cost function based on estimated input cardinality and I/O
  • Selectivity estimation: WHERE predicates produce estimated selectivity from statistics (equality, range, LIKE)
  • Join reordering: Planner evaluates multiple join orderings and selects lowest-cost plan (at least for ≤6 tables; heuristic fallback for more)
  • Index selection: Cost-based choice between table scan vs. index seek based on estimated selectivity
  • EXPLAIN ANALYZE shows estimated vs. actual row counts for plan validation
  • Backward compatibility: Existing queries produce correct results (plans may change); no database format breakage for users who never run ANALYZE
  • All existing tests pass; differential tests against PostgreSQL continue to match
  • Write benchmarks show no regression; read benchmarks show improvement on multi-join queries

Technical Notes

  • ADR: 0038 (currently Deferred/Post-1.0 — needs full design ADR before implementation)
  • Risk: High — affects every query plan; join reordering could expose latent executor bugs; regression risk requires extensive testing
  • Key files: src/planner/planner.nim (~469 lines today), src/planner/explain.nim, src/exec/exec.nim, src/storage/catalog.nim
  • Current state: 18 plan operators, zero statistics infrastructure, no cost model, fixed left-to-right join order
  • Format impact: Yes — catalog needs to persist row counts and cardinality. Requires ADR + format version consideration.
  • Suggested phasing:
    1. Statistics collection + ANALYZE command
    2. Cost model + selectivity estimation
    3. Index selection improvements
    4. Join reordering (dynamic programming, Selinger-style)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions