skills/tidb-query-tuning/how-to/cost-model-and-index-selection.md

# How To: Tune the Cost Model and Index Selection

These variables control the foundational cost model that the optimizer uses to compare plans, and how it evaluates index access paths. Adjusting them changes how the optimizer weighs sequential vs random I/O, which cost model version is active, and how it handles edge cases in index selection.

## Variables at a glance

| Variable | Default | What it controls |
|----------|---------|-----------------|
| `tidb_cost_model_version` | 2 | Which cost model the optimizer uses |
| `tidb_opt_seek_factor` | 20 | Cost multiplier for random I/O seeks |
| `tidb_opt_ordering_index_selectivity_threshold` | 0 | Threshold for preferring an ordering index |
| `tidb_opt_range_max_size` | 67108864 (64 MB) | Max memory for range construction |

## When to use each variable

### Plans are suboptimal after upgrading from an old TiDB version

**Symptom:** After upgrading, some queries get worse plans. The cluster may still be using cost model v1.

**What to do:**

```sql
-- Check current version
SELECT @@tidb_cost_model_version;

-- Switch to v2 (introduced in TiDB v6.2; the default in current versions)
SET GLOBAL tidb_cost_model_version = 2;
```

**Benefit:** Cost model v2 is more accurate for modern TiDB. It better estimates the cost of TiKV vs TiFlash access, handles wide tables more accurately, and produces better plans for most workloads.

**Caution:** Switching cost model versions can change many query plans at once. Test on a non-production environment first, or roll out gradually with session-level testing on critical queries.
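
The session-level testing mentioned above can be done without any cluster-wide impact. A minimal sketch, assuming a hypothetical `orders` table and predicate:

```sql
-- Compare plans for a critical query under both cost models, at session
-- scope only. `orders` and the predicate are placeholders.
SET SESSION tidb_cost_model_version = 1;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

SET SESSION tidb_cost_model_version = 2;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```

If the two plans differ, run both with `EXPLAIN ANALYZE` to confirm v2's choice is actually faster before changing the global setting.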

### The optimizer keeps choosing index lookups over full scans on large analytical queries

**Symptom:** For queries that scan most of a table, the optimizer still picks an index lookup path. The random I/O from index lookups is slower than a sequential full scan would be.

**What to do:**

```sql
-- Increase seek factor to make random I/O more expensive
SET tidb_opt_seek_factor = 40; -- default is 20
```

**Benefit:** A higher seek factor tells the optimizer that random I/O is relatively more expensive compared to sequential I/O. This pushes it toward sequential scans and away from index lookups when a large portion of the table is being read.

**When to lower it:** On fast SSD/NVMe storage where random I/O is nearly as fast as sequential, you might lower the seek factor (e.g., 5-10) to encourage more index-based access paths.
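
Either direction of change can be trialed at session scope first. A sketch, with `t` and the filter as placeholders:

```sql
-- Session-scoped experiment: check whether the plan flips from an
-- IndexLookUp to a TableFullScan as random I/O gets more expensive.
EXPLAIN SELECT * FROM t WHERE created_at > '2024-01-01';

SET SESSION tidb_opt_seek_factor = 40;
EXPLAIN SELECT * FROM t WHERE created_at > '2024-01-01';
```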

### ORDER BY ... LIMIT queries pick a bad ordering index

**Symptom:** A query like `SELECT ... FROM t WHERE <filter> ORDER BY col LIMIT 10` uses an index on `col` to avoid sorting, but the filter is poorly selective on that index, causing it to scan many rows before finding matches.

**What to do:**

```sql
-- Set a selectivity threshold; the optimizer will only use the ordering index
-- if it estimates the filter selectivity is better than this threshold
SET tidb_opt_ordering_index_selectivity_threshold = 0.1; -- 10% selectivity
```

**Benefit:** Prevents the optimizer from choosing an ordering index when the filter would require scanning too many rows through that index. Instead, it uses a more selective index and adds an explicit sort, which is usually faster overall.

**Typical values:** Start with 0 (disabled, the default) and increase gradually. Values between 0.01 and 0.2 are common. The right value depends on your data distribution and query patterns.
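
To see whether a candidate threshold changes the plan, test it at session scope. A sketch, with `t`, `status`, and `col` as placeholders; expect the plan to move from a scan over the index on `col` to a more selective index plus an explicit Sort operator:

```sql
EXPLAIN SELECT * FROM t WHERE status = 'pending' ORDER BY col LIMIT 10;

SET SESSION tidb_opt_ordering_index_selectivity_threshold = 0.1;
EXPLAIN SELECT * FROM t WHERE status = 'pending' ORDER BY col LIMIT 10;
```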

### Queries with huge IN lists fall back to full index scans

**Symptom:** A query like `SELECT * FROM t WHERE id IN (1, 2, 3, ..., 100000)` unexpectedly uses an IndexFullScan instead of point lookups. The optimizer exceeded its memory budget while building the scan ranges and fell back to a less precise access path.

**What to do:**

```sql
-- Increase the memory budget for range construction
SET tidb_opt_range_max_size = 134217728; -- 128 MB (default is 64 MB)
```

**Benefit:** Allows the optimizer to construct larger range sets, so queries with many IN-list values or complex range conditions can still use efficient point lookups instead of falling back to a full scan.

**Caution:** Increasing this too much can cause high memory usage during query optimization. Only raise it if you have specific queries that need it, and monitor TiDB memory usage.
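
Before raising the limit, confirm that range construction is actually the cause. When the budget is exceeded, TiDB typically records a warning that mentions `tidb_opt_range_max_size` (exact wording varies by version). A sketch, with `t` and the IN list as placeholders:

```sql
EXPLAIN SELECT * FROM t WHERE id IN (1, 2, 3 /* ... */);
SHOW WARNINGS;
```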

### Queries with large IN lists on non-critical paths

If raising `tidb_opt_range_max_size` globally is too risky, use per-query scope:

```sql
SELECT /*+ SET_VAR(tidb_opt_range_max_size=134217728) */ *
FROM orders WHERE id IN (/* large list */);
```

## Diagnostic workflow

1. **Check cost model version first** — this is the single highest-impact variable. If you're on v1, switch to v2 before tuning anything else.

2. **Look at EXPLAIN output for unexpected access paths:**
- IndexLookUp on analytical queries → raise `tidb_opt_seek_factor`
- Ordering index with poor filter → set `tidb_opt_ordering_index_selectivity_threshold`
- IndexFullScan with large IN list → check `tidb_opt_range_max_size`

3. **Validate with EXPLAIN ANALYZE** — compare actual rows scanned, execution time, and memory usage before and after the change.
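
Step 3 in practice: run the real query under `EXPLAIN ANALYZE` before and after the variable change and compare the actual row counts, per-operator execution time, and memory columns in the output. The query below is a placeholder:

```sql
EXPLAIN ANALYZE
SELECT * FROM t WHERE created_at > '2024-01-01' ORDER BY col LIMIT 10;
```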

## General guidance

- `tidb_cost_model_version` should be set globally and consistently across the cluster. Mixed versions make diagnosis harder.
- `tidb_opt_seek_factor` reflects your storage characteristics. Set it once based on your hardware and leave it.
- `tidb_opt_ordering_index_selectivity_threshold` and `tidb_opt_range_max_size` are best applied per-query or per-session for specific problem queries rather than globally.

skills/tidb-query-tuning/how-to/memory-and-resource-control.md

# How To: Control Memory and Resource Usage

These variables control how much memory and parallelism TiDB uses when executing queries. Use them to prevent OOM kills, manage resource contention between queries, and tune throughput for batch workloads.

## Variables at a glance

| Variable | Default | What it controls |
|----------|---------|-----------------|
| `tidb_mem_quota_query` | 1073741824 (1 GB) | Memory limit per query |
| `tidb_max_chunk_size` | 1024 | Max rows per execution chunk |
| `tidb_distsql_scan_concurrency` | 15 | Parallel TiKV coprocessor scan requests |

## When to use each variable

### Queries are being killed with "Out Of Memory" errors

**Symptom:** Queries fail with `Out Of Memory Quota!` errors. The TiDB log shows memory quota exceeded.

**What to do:**

```sql
-- Option 1: Increase the per-query memory limit
SET tidb_mem_quota_query = 2147483648; -- 2 GB

-- Option 2: Increase the default for all connections (takes effect for
-- new sessions, not existing ones)
SET GLOBAL tidb_mem_quota_query = 2147483648;
```

**Benefit:** Allows memory-intensive queries (large joins, aggregations, sorts) to complete without hitting the memory limit.

**Caution:** Raising this too high risks TiDB node OOM if multiple large queries run concurrently. Calculate the safe limit as:

```
safe_limit = (TiDB_available_memory * 0.7) / max_concurrent_heavy_queries
```
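
A worked instance of the formula above (the 32 GB and 10-query figures are assumptions for illustration, not recommendations):

```sql
-- Assumed: 32 GB of memory available to queries on the TiDB node,
-- at most 10 heavy queries running concurrently.
SELECT FLOOR(32 * POW(1024, 3) * 0.7 / 10) AS safe_limit_bytes;
-- about 2.4 billion bytes, i.e. roughly 2.2 GiB per heavy query
```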

**Better alternatives to consider first:**
- Optimize the query to use less memory (better indexes, push-down filters)
- Use TiFlash for analytical queries that process large datasets
- Add `ORDER BY ... LIMIT` to reduce result set size

### Queries need more memory temporarily for a batch job

**Symptom:** A scheduled batch job needs more memory than the default allows, but you don't want to raise the limit cluster-wide.

**What to do:**

```sql
-- Set for this session only
SET tidb_mem_quota_query = 4294967296; -- 4 GB

-- Run the batch job
INSERT INTO summary_table SELECT ... FROM large_table GROUP BY ...;

-- Reset (or just close the connection)
SET tidb_mem_quota_query = 1073741824;
```

**Benefit:** Gives the batch job the memory it needs without affecting other connections.

### Scan-heavy queries are too slow

**Symptom:** Analytical queries or batch jobs that scan large tables are slow. TiDB is not fully utilizing TiKV read bandwidth. `EXPLAIN ANALYZE` shows long scan times.

**What to do:**

```sql
-- Increase scan parallelism
SET tidb_distsql_scan_concurrency = 30; -- default is 15
```

**Benefit:** More concurrent coprocessor requests to TiKV mean faster data retrieval. Each request scans a different range of data in parallel.

**When to increase:** When TiKV has spare read bandwidth and the query is scan-bound (most time is spent in TableReader/IndexReader).

**When to decrease:** When TiKV is under pressure from too many concurrent scan requests, or when you want to limit the impact of one query on others. For OLTP workloads with many concurrent small queries, the default of 15 is usually sufficient.
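
To check whether a query is scan-bound before raising concurrency, look at where time is spent in `EXPLAIN ANALYZE` output: most of the execution time should sit in the TableReader or IndexReader operators. A sketch, with `large_table` and the filter as placeholders:

```sql
EXPLAIN ANALYZE
SELECT COUNT(*) FROM large_table WHERE created_at > '2024-01-01';
```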

### Queries use too much memory on wide tables

**Symptom:** Queries on tables with many columns or large column values use more memory than expected, even with reasonable result set sizes.

**What to do:**

```sql
-- Reduce chunk size to lower per-batch memory usage
SET tidb_max_chunk_size = 256; -- default is 1024
```

**Benefit:** Smaller chunks mean less data is held in memory at once during execution. Each processing step works on fewer rows, reducing peak memory usage.

**Trade-off:** Smaller chunks mean more processing iterations, which can reduce throughput. Only lower this when memory is the constraint, not speed.

### High-throughput batch processing needs larger chunks

**Symptom:** A batch ETL job or analytical query has plenty of memory headroom but throughput is lower than expected. CPU utilization is moderate.

**What to do:**

```sql
SET tidb_max_chunk_size = 4096; -- or higher
SET tidb_distsql_scan_concurrency = 30;
```

**Benefit:** Larger chunks amortize per-batch overhead and improve CPU efficiency. Combined with higher scan concurrency, this maximizes throughput for data-intensive operations.

## Common tuning recipes

### Conservative OLTP session (protect the cluster)

```sql
SET tidb_mem_quota_query = 536870912; -- 512 MB
SET tidb_max_chunk_size = 512;
SET tidb_distsql_scan_concurrency = 10;
```

### Aggressive batch/ETL session (maximize throughput)

```sql
SET tidb_mem_quota_query = 4294967296; -- 4 GB
SET tidb_max_chunk_size = 4096;
SET tidb_distsql_scan_concurrency = 30;
```

### Mixed workload — limit impact of one heavy query

```sql
-- Per-query via SET_VAR
SELECT /*+ SET_VAR(tidb_mem_quota_query=2147483648) SET_VAR(tidb_distsql_scan_concurrency=20) */ *
FROM large_table WHERE ...;
```

## General guidance

- `tidb_mem_quota_query` is your primary defense against runaway queries. Don't raise it globally without understanding your concurrency profile.
- `tidb_distsql_scan_concurrency` directly affects TiKV read pressure. Monitor TiKV CPU and I/O when increasing it.
- `tidb_max_chunk_size` is a throughput-vs-memory trade-off. The default of 1024 is a good balance for most workloads.
- In connection-pooled environments, session-level changes persist for the life of the connection. Reset variables after batch jobs or use per-query `SET_VAR`.