Commit 730f9d6

Author: amar-python
Add eval coverage for CSV validator gaps
1 parent ca0c040 commit 730f9d6

60 files changed

Lines changed: 1900 additions & 4 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ test_output/
 *.out
 *.log
 csv/logs/
+evals/reports/
 *_report_*.txt
 *_skipped_*.csv
 *_valid_*.csv

README.md

Lines changed: 2 additions & 2 deletions
@@ -328,8 +328,8 @@ Realistic Australian T&E data is loaded automatically when `include_seed_data` i
 # Against all environments
 ./tests/run_tests.sh

-# Run Python validator tests
-python -m unittest -v tests/test_csv_validator.py
+# Run Python tests
+python -m unittest discover -s tests -p "test*.py" -v

 # Manually via psql
 psql -U postgres -d te_mgmt_dev \

evals/FAILURE_MODES.md

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
# Failure-mode catalogue — PostgreDataMigrationApp

Each row is a real-world data or operational scenario that *could* break the framework. For each: a concrete example, the **expected** behaviour, and the eval scenario number that proves it.

Legend:
- **handles correctly** — framework already produces the right outcome
- ⚠️ **partial** — works but not asserted by an eval today
- **gap** — silently accepts, crashes, or fails noisily without a clean signal

---

## Tier P — Python CSV validator (`csv/validator.py`)

These run without any database. Driven by `evals/runner.py` against the actual `csv/validator.py` script as a subprocess.

| # | Failure mode | Example input | Expected behaviour | Current | Eval ID |
|---|--------------|--------------|--------------------|---------|---------|
| P1 | Missing required env vars | (no CSV_FILE) | exit 1; stderr contains "Missing required environment variables" || 19 |
| P2 | CSV file path doesn't exist | `CSV_FILE=/tmp/nope.csv` | exit 1; stderr contains "CSV file not found" || 20 |
| P3 | Empty file (0 bytes) | `` | exit 1; stderr contains "CSV file is empty" || 02 |
| P4 | Header-only file (no data rows) | `id,name\n` | exit 1; stderr contains "No valid rows found" || 04 |
| P5 | All valid rows | `id,name\n1,Alice\n2,Bob\n3,Carol\n` | exit 0; valid count 3, skip count 0 || 01 |
| P6 | Mixed valid + skipped | header + 2 valid + empty row + col-mismatch row | exit 0; valid 2, skip 2, reasons "empty row" and "column mismatch" || 05 |
| P7 | Completely empty row | `id,name\n1,Alice\n,\n2,Bob\n` | row 3 skipped as "empty row" || 08 |
| P8 | Whitespace-only row | `id,name\n1,Alice\n , \n` | row 3 skipped as "empty row" (cell.strip() returns empty) || 16 |
| P9 | Row with fewer fields than header | header has 3, row has 2 | row skipped as "column mismatch" || 07 |
| P10 | Row with more fields than header | header has 2, row has 3 | row skipped as "column mismatch" || 07b |
| P11 | Duplicate column names | `id,id,name\n…` | exit 0 but stdout contains "Duplicate column names" warning || 06 |
| P12 | Leading/trailing whitespace in headers | ` id , name \n…` | headers normalised to `id`, `name` (stripped) || 17 |
| P13 | UTF-8 BOM at start of file | `id,name\n…` | BOM stripped (file opened with `utf-8-sig`); behaves as if no BOM || 09 |
| P14 | UTF-8 emoji in value | `1,Alice 👋` | preserved through; row valid || 10 |
| P15 | UTF-8 CJK characters | `1,田中花子` | preserved through; row valid || 18 |
| P16 | CRLF line endings | `id,name\r\n1,Alice\r\n` | csv module handles natively; row valid || 11 |
| P17 | Quoted field containing comma | `id,note\n1,"Smith, John"\n` | parsed as single field `Smith, John` || 13 |
| P18 | Quoted field containing newline | `id,note\n1,"line1\nline2"\n` | parsed as single row, note has embedded newline || 14 |
| P19 | Quoted field containing escaped quote | `id,note\n1,"she said ""hi"""\n` | parsed as `she said "hi"` || 15 |
| P20 | Unicode RTL (Arabic) | `1,محمد` | preserved through; row valid || 21 |
| P21 | Very long row (50KB single field) | massive single value | accepted as one row || 22 |
| P22 | Mixed encoding (Latin-1 bytes inside UTF-8) | `\xe9` bytes in middle of file | clean exit 1; stderr reports unexpected decode error; no traceback || 23 |

**Of 22 modes:** all 22 are now covered by Tier P evals. Scenario numbering is 01–23 because the short and long column-mismatch cases are separate eval fixtures.

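Several of the Tier P rows above come down to a handful of standard-library behaviours. The following is a minimal sketch of those rules, not the code in `csv/validator.py`; the function name `classify_rows` and its return shape are assumptions made for illustration.

```python
import csv
import io


def classify_rows(raw_bytes):
    """Split CSV bytes into (header, valid_rows, skipped) using the catalogue's rules.

    Hypothetical helper for illustration; the real validator may differ.
    """
    # "utf-8-sig" drops a leading BOM if one is present (mode P13).
    text = raw_bytes.decode("utf-8-sig")
    # newline="" leaves line endings alone so the csv module can handle
    # CRLF and newlines inside quoted fields itself (modes P16, P18).
    reader = csv.reader(io.StringIO(text, newline=""))
    header = [name.strip() for name in next(reader)]  # mode P12
    valid, skipped = [], []
    for row_no, row in enumerate(reader, start=2):  # the header is row 1
        if all(cell.strip() == "" for cell in row):
            skipped.append((row_no, "empty row"))  # modes P7, P8
        elif len(row) != len(header):
            skipped.append((row_no, "column mismatch"))  # modes P9, P10
        else:
            valid.append(row)
    return header, valid, skipped


sample = (
    b"\xef\xbb\xbf id , name \r\n"   # BOM + padded headers + CRLF
    b'1,"Smith, John"\r\n'           # quoted comma
    b" , \r\n"                       # whitespace-only row
    b"2\r\n"                         # too few fields
    b'3,"she said ""hi"""\r\n'       # escaped quote
)
header, valid, skipped = classify_rows(sample)
```

Here `header` comes back stripped of both the BOM and the padding, the two well-formed rows land in `valid`, and rows 3 and 4 are reported in `skipped` with the same reason strings the table uses.
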
---

## Tier I — Idempotency

| # | Failure mode | Example scenario | Expected behaviour | Current | Eval ID |
|---|--------------|------------------|--------------------|---------|---------|
| I1 | Re-run `deploy_all.sh dev` on a deployed DB | run twice in a row | exit 0 both times; no new rows inserted on 2nd run (`ON CONFLICT DO NOTHING` seed pattern); no `CREATE TABLE` errors || 01 |
| I2 | Re-run after `\set` changes to identifiers (table renamed) | edit env_dev.sql to rename a table, re-run | should fail safely (rename isn't auto-detected) — manual migration needed | ⚠️ | not in initial set |
| I3 | Schema deployed; user manually drops a table; re-run | DROP TABLE te_dev.organisations; re-run env_dev.sql | table re-created; FK-dependent seed inserts pass || not in initial set |
| I4 | Re-run with PG service offline | stop PG, run deploy_all.sh | exit non-zero; stderr contains connection-refused || not in initial set |

Tier I initial scope: only I1 (the core re-runnability claim from the README).

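The re-runnability claim behind I1 can be illustrated without PostgreSQL. The sketch below uses Python's built-in `sqlite3` as a stand-in engine (SQLite 3.24 or newer accepts the same `ON CONFLICT DO NOTHING` clause); the column layout and the seed rows are hypothetical, and this is not how `deploy_all.sh` is implemented.

```python
import sqlite3

# Hypothetical seed rows; the real seed data lives in the project's SQL files.
SEED = [(1, "Org A"), (2, "Org B")]


def deploy(conn):
    """Re-runnable 'deploy': create-if-missing plus conflict-tolerant seed inserts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS organisations (id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO organisations (id, name) VALUES (?, ?) ON CONFLICT DO NOTHING",
        SEED,
    )
    conn.commit()
    # Row count after this run, used for the parity check.
    return conn.execute("SELECT COUNT(*) FROM organisations").fetchone()[0]


conn = sqlite3.connect(":memory:")
first = deploy(conn)   # fresh database
second = deploy(conn)  # second run must not add rows or raise
print(first, second)
```

Both runs complete without error and report the same row count, which is the parity that scenario 01 asserts against a real Dev database.
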
---

## Tier S — SQL suite integration

| # | Failure mode | Example scenario | Expected behaviour | Current | Eval ID |
|---|--------------|------------------|--------------------|---------|---------|
| S1 | Fresh deploy + all 5 suites must pass | `deploy_all.sh dev` then `run_tests.sh dev` | exit 0; "ALL TESTS PASSED" appears in stdout; overall pass_rate 100.0% || 01 |
| S2 | Deploy without seed data; suites should fail predictably | `\set include_seed_data false`; run suites | some suites expecting non-zero row counts fail; runner reports the count drift | ⚠️ | not in initial set |
| S3 | Deploy with corrupted seed (intentionally invalid FK) | manually edit seed insert to break an FK | deploy_all fails at seed step; no schema corruption | ⚠️ | not in initial set |

Tier S initial scope: only S1.

---

## What this catalogue does NOT yet cover

- **Multi-DB equivalence** (cross-engine schema parity) — deferred until PG is locked in.
- **Performance / scale** (1M-row load timing) — separate suite if needed later.
- **Cross-environment structural equivalence** (Dev vs Test vs Staging vs Prod) — Tier E, future.
- **Domain-rule deep dives beyond suite 05** — Tier D, future.
- **Validator behaviour on >128KB single field** — beyond the current 50KB eval and Python `csv` default field-size assumptions.

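The 128KB boundary in the last bullet corresponds to the `csv` module's field-size limit, which is 131072 characters in CPython unless a program changes it. The sketch below probes that limit directly with the standard library; `parses_single_field` is a hypothetical helper, and whether `csv/validator.py` adjusts the limit is not something this catalogue states.

```python
import csv
import io


def parses_single_field(size):
    """Return True if a one-field row of `size` characters parses under the current limit."""
    data = io.StringIO("x" * size + "\n", newline="")
    try:
        rows = list(csv.reader(data))
    except csv.Error:
        # Raised when a field exceeds csv.field_size_limit().
        return False
    return len(rows) == 1 and len(rows[0][0]) == size


limit = csv.field_size_limit()  # called with no argument, this only reads the limit

fits_50kb = parses_single_field(50 * 1024)   # the size the current eval covers
fits_oversize = parses_single_field(limit * 2)  # well past the limit
```

A future eval for this gap would need to decide whether the expected behaviour is a clean rejection or a raised limit via `csv.field_size_limit(new_limit)`.
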
---

## Summary

| Tier | Modes catalogued | Modes in initial eval set | Deferred |
|------|------------------|---------------------------|----------|
| P | 22 | 22 | 0 |
| I | 4 | 1 | 3 |
| S | 3 | 1 | 2 |
| **Total** | **29** | **24** | **5** |

The current eval set covers every catalogued Tier P mode plus the initial Tier I and Tier S operational scenarios. The remaining deferred items are PostgreSQL operational and cross-engine scenarios.

evals/HANDOFF.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# Hand-off summary — evals/ package

**Date:** 2026-05-26
**Scope this round:** PostgreSQL only (Tiers P + I + S). MariaDB / SQLite / etc. deferred.

## What was delivered

| File / folder | What it is |
|--------------|-----------|
| `evals/PLAN.md` | Scope, folder layout, tiering, phases |
| `evals/FAILURE_MODES.md` | 29 catalogued failure modes; all 22 Tier P modes now covered |
| `evals/README.md` | How to run; CLI flags; exit codes |
| `evals/runner.py` | Single-file scenario discovery + diff engine + JSON report writer (606 lines) |
| `evals/datasets/tier_p/01-23/` | 23 CSV scenarios (happy + empty + malformed + unicode + generated edge cases) |
| `evals/datasets/tier_i/01_deploy_dev_twice/` | Idempotency scenario (README only — runner drives the work) |
| `evals/datasets/tier_s/01_fresh_deploy_then_all_tests_pass/` | SQL-suite integration scenario |
| `evals/expected/tier_p/*.json` | 23 expected-outcome files |
| `evals/expected/tier_i/01_deploy_dev_twice.json` | Expected outcome (exit codes + row-count parity) |
| `evals/expected/tier_s/01_fresh_deploy_then_all_tests_pass.json` | Expected outcome (85/85 + ALL TESTS PASSED) |
| `evals/reports/` | Auto-created at runtime; one folder per run with `summary.json` |

## What was executed

| Tier | Result | Notes |
|------|--------|-------|
| **P** (23 scenarios) | **23 PASS / 0 FAIL** | Executed locally against the real `csv/validator.py` |
| **I** (1 scenario) | **SKIP** | No PostgreSQL in the sandbox — clean skip with diagnostic |
| **S** (1 scenario) | **SKIP** | Same |

Latest local run report: `evals/reports/20260531T094230Z-223fa5/summary.json`

## What you should do next (in order)

### 1. Verify Tier P on your machine

```powershell
cd "$env:USERPROFILE\OneDrive\Desktop\Migration using ai\PostgreDataMigrationApp"
python evals\runner.py
```

Expect: `total: 23, passed: 23, failed: 0, skipped: 0` and exit code 0.

### 2. Run Tier I + S against your local PostgreSQL

```powershell
cd "$env:USERPROFILE\OneDrive\Desktop\Migration using ai\PostgreDataMigrationApp"
# Make sure libpq vars are set or peer auth works:
# $env:PGHOST = 'localhost'
# $env:PGUSER = 'postgres'
# $env:PGPASSWORD = '...'
python evals\runner.py --tiers p,i,s
```

Expect with PostgreSQL available: 23 P pass, 1 I pass, 1 S pass = **25 / 25**.

If Tier I fails on the row-count parity, check the `actual.row_counts_first` vs
`actual.row_counts_second` in `summary.json` — that pinpoints the table whose seed
inserts aren't using `ON CONFLICT DO NOTHING`.

If Tier S fails, the runner captures the last 2 KB of the suite's stdout in
`actual.stdout_tail` — that almost always pinpoints which assertion failed.

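When the two row-count blocks are long, a small helper can pick out the drifting tables. This is a sketch that assumes `row_counts_first` and `row_counts_second` are plain table-name to count mappings once loaded from `summary.json`; that shape, the function name, and the sample table names are assumptions, not something the runner is confirmed to emit.

```python
def drifted_tables(first, second):
    """Return {table: (first_count, second_count)} for every table whose count changed.

    Tables present in only one mapping are reported with None on the missing side.
    """
    tables = sorted(set(first) | set(second))
    return {
        table: (first.get(table), second.get(table))
        for table in tables
        if first.get(table) != second.get(table)
    }


# Hypothetical counts: one table doubled on the second deploy.
row_counts_first = {"organisations": 3, "users": 5}
row_counts_second = {"organisations": 6, "users": 5}
drift = drifted_tables(row_counts_first, row_counts_second)
```

With these sample counts, `drift` names only `organisations`, which is where to look for a seed insert missing its conflict clause.
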
### 3. Archive the workspace cruft

The parent folder `Migration using ai\` still contains my earlier Python demo
(`src/`, `tests/`, `evals/` at root, `config.yaml`, etc.) that has nothing to do
with `PostgreDataMigrationApp`. A cleanup script is ready:

```powershell
cd "$env:USERPROFILE\OneDrive\Desktop\Migration using ai"
.\cleanup_workspace.ps1 # preview (dry run)
.\cleanup_workspace.ps1 -Apply # commit the moves
```

Everything is **moved** to `_archive_demo\` (not deleted). Reversible by drag-back.
`PostgreDataMigrationApp\` is on the never-touch list.

### 4. (Optional) Wire evals into CI

Your existing GitHub Actions workflow `python-validator-tests.yml` only runs
the Python unit tests in `tests/`. To add the 23 Tier P evals, add this step to the workflow:

```yaml
- name: Run Tier P evals
  run: python evals/runner.py --tiers p
  working-directory: PostgreDataMigrationApp
```

Tier I and S can be added when you have PostgreSQL in your CI runner.

## What's deferred to a later round

| Item | Why deferred |
|------|--------------|
| Tier X — cross-DB schema equivalence (MariaDB / SQLite / Teradata) | You explicitly said "ignore them for now" |
| Tier E — cross-environment (Dev/Test/Staging/Prod) structural parity | Not needed until you ship beyond Dev |
| Tier D — extended domain-rule evals beyond suite 05 | Existing 85 assertions already cover the high-value rules |
| Performance / scale (1M+ rows) | Would need a fixture generator — separate round |
| AI-assisted anomaly detection | Out of scope for deterministic evals |
| Single-field >128 KB | Current eval covers 50KB; larger-than-default `csv` field-size behavior remains future scope |

Catalogued in `FAILURE_MODES.md` so they're not lost.

## File counts

| Category | Count |
|----------|-------|
| Markdown docs (PLAN, FAILURE_MODES, README, HANDOFF) | 4 |
| Python (`runner.py`) | 1 |
| CSV input fixtures | 20 (scenarios 19, 20, 22, and 23 are generated/no-input scenarios) |
| Expected JSON files | 25 |
| Scenario README/TXT files | 4 |
| **Total files created** | **55** |

## Open items

Nothing blocking. If you want me to take the next step, ask for:

- "extend Tier P to N more scenarios" (e.g. RTL, very-long-row, latin-1)
- "add Tier X — cross-DB equivalence" once you have other DB engines installed
- "wire the evals into the GitHub Actions workflow"
- "build a sample-data generator that produces 100 K / 1 M row CSVs for perf"

evals/PLAN.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Evals package — PostgreDataMigrationApp

This folder holds **data-driven evals** for the T&E Database Framework. Evals complement (but don't replace) the existing SQL test suites under `tests/suites/` and the Python unit tests under `tests/test_csv_validator.py`.

## Why a separate `evals/` folder

| Aspect | `tests/` (existing) | `evals/` (this folder) |
|--------|--------------------|--------------------|
| Driver | Code (SQL assertions + Python unittest) | Data (CSV fixtures + expected JSON) |
| Adding a new scenario | Edit Python or SQL | Drop a new `datasets/<NN>/input.csv` and `expected/<NN>.json` |
| What gets reviewed | Code diffs | Data fixtures (much easier to eyeball) |
| Failure granularity | One Python assert, one SQL assert | exit_code + stderr + stdout + output-file contents — all in one report |
| Scope | Schema correctness, business rules, validator unit behaviour | End-to-end behaviour: validator interface, idempotency, full SQL-suite-passes-after-load |

In short: `tests/` proves the **code is correct**; `evals/` proves the **framework behaves correctly on real-world data and operational scenarios**.

## Scope (this round)

- **Database engine:** PostgreSQL only.
  - MariaDB / SQLite / InfluxDB / Redis / Teradata adapters are deliberately out of scope until Postgres evals are stable. The eval structure is engine-agnostic so they can be added later.
- **Tiers in scope:**
  - **Tier P** — Python CSV validator (`csv/validator.py`). Pure data-in / files-out. No DB.
  - **Tier I** — Idempotency of `deploy_all.sh` against a clean Dev PostgreSQL.
  - **Tier S** — SQL test suite integration: deploy fresh + run all 5 suites and assert 85/85.
- **Tiers deferred:**
  - **Tier X** — Cross-DB schema equivalence (MariaDB/SQLite). Out until Postgres is locked in.
  - **Tier E** — Cross-environment (Dev/Test/Staging/Prod) structural equivalence.
  - **Tier D** — Extended domain-rule evals beyond what suite 05 already covers.

## Folder layout

```
PostgreDataMigrationApp/
└── evals/
    ├── PLAN.md                ← this file
    ├── FAILURE_MODES.md       ← failure-mode catalogue
    ├── README.md              ← how to run it
    ├── runner.py              ← scenario discovery + diff engine + report

    ├── datasets/
    │   ├── tier_p/            ← Python CSV validator scenarios
    │   │   ├── 01_happy_path/
    │   │   │   └── input.csv
    │   │   ├── 02_empty_file/
    │   │   └── … (23 scenarios)
    │   │
    │   ├── tier_i/            ← idempotency scenarios
    │   │   └── 01_deploy_dev_twice/
    │   │       └── README.md  ← what the runner does (no CSV needed)
    │   │
    │   └── tier_s/            ← SQL suite integration
    │       └── 01_fresh_deploy_then_all_tests_pass/
    │           └── README.md

    ├── expected/
    │   ├── tier_p/
    │   │   ├── 01_happy_path.json
    │   │   ├── 02_empty_file.json
    │   │   └── …
    │   ├── tier_i/
    │   │   └── 01_deploy_dev_twice.json
    │   └── tier_s/
    │       └── 01_fresh_deploy_then_all_tests_pass.json

    └── reports/               ← runtime output (gitignored)
        └── <run_id>/
            ├── tier_p_summary.json
            ├── tier_i_summary.json
            ├── tier_s_summary.json
            └── overall.json
```

Each scenario is self-contained. To add a new one: create a new folder under `datasets/<tier>/` and a matching `expected/<tier>/<scenario>.json`. No code edits.

## Expected JSON contract

Each `expected/<tier>/<scenario>.json` declares what the runner should see. Only the fields populated are diffed — everything else is ignored. Example:

```json
{
  "scenario": "05_mixed_valid_skipped",
  "expected": {
    "exit_code": 0,
    "stdout_contains": ["Valid rows : 2", "Skipped rows : 2"],
    "stderr_contains": [],
    "valid_csv_rows": [["id", "name"], ["1", "Alice"], ["3", "Bob"]],
    "skip_csv_rows_count": 2,
    "skip_reasons_contain": ["empty row", "column mismatch"]
  },
  "notes": "Two valid rows, two skipped — one empty, one with too few fields."
}
```

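The "only populated fields are diffed" rule can be sketched as a loop over the expected keys alone. The function below is an illustration under stated assumptions, not the diff engine in `runner.py`: it assumes keys ending in `_contains` are substring checks against a captured text stored under the key without that suffix, and that every other key is compared for equality. The `duration_ms` field in the sample is hypothetical.

```python
def diff_expected(expected, actual):
    """Compare only the keys present in `expected`; return a list of mismatch messages."""
    problems = []
    for key, want in expected.items():
        if key.endswith("_contains"):
            # Assumed convention: "stdout_contains" is checked against actual["stdout"].
            text = actual.get(key[: -len("_contains")], "")
            for fragment in want:
                if fragment not in text:
                    problems.append(f"{key}: missing {fragment!r}")
        elif actual.get(key) != want:
            problems.append(f"{key}: expected {want!r}, got {actual.get(key)!r}")
    return problems


expected = {"exit_code": 0, "stdout_contains": ["Valid rows : 2"], "stderr_contains": []}
actual = {
    "exit_code": 0,
    "stdout": "Valid rows : 2\nSkipped rows : 2\n",
    "stderr": "",
    "duration_ms": 12,  # hypothetical extra field; ignored because `expected` omits it
}
problems = diff_expected(expected, actual)
```

Because the loop is driven by `expected`, adding a new captured field to `actual` never breaks older scenario files.
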
## Runner behaviour

`runner.py`:

1. Discovers every folder under `datasets/<tier>/`.
2. For Tier P: copies the scenario's `input.csv` to a temp file, sets env vars, runs `python3 csv/validator.py`, captures exit + stdout + stderr + output files, diffs against `expected/<tier>/<scenario>.json`.
3. For Tier I and Tier S: runs the orchestration script (`deploy_all.sh dev`, `tests/run_tests.sh dev`) and asserts the documented outcomes.
4. Prints one line per scenario: `PASS / FAIL / SKIPPED`.
5. Writes per-tier and overall JSON summaries under `reports/<run_id>/`.

Exit code: 0 if all scenarios in selected tiers pass, 1 otherwise. CI-friendly.

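The capture part of step 2 can be sketched with `subprocess.run`. This is an illustration only: `run_captured` is a hypothetical name, and the child process here is a stand-in one-liner that imitates a "file not found" exit rather than the real `csv/validator.py`. `CSV_FILE` is the one environment variable this plan's companion catalogue names; any others the validator needs are not shown.

```python
import os
import subprocess
import sys


def run_captured(cmd, extra_env=None, timeout=60):
    """Run `cmd` with extra environment variables; return exit code, stdout and stderr."""
    env = {**os.environ, **(extra_env or {})}
    proc = subprocess.run(
        cmd, env=env, capture_output=True, text=True, timeout=timeout
    )
    return {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}


# Stand-in child process (NOT the real validator): reports a missing file and exits 1.
fake_validator = (
    "import os, sys; "
    "sys.stderr.write('CSV file not found: ' + os.environ['CSV_FILE']); "
    "sys.exit(1)"
)
result = run_captured([sys.executable, "-c", fake_validator], {"CSV_FILE": "/tmp/nope.csv"})
```

The returned dictionary has the same three pieces the list above mentions, so it can be handed straight to whatever compares it with the scenario's expected JSON.
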
## Tiering

- **Tier P** runs anywhere — only needs Python + the validator script. **Fully offline.**
- **Tier I** and **Tier S** need a live PostgreSQL instance + `psql` on PATH + the values in `config.local.env`. They skip cleanly if no DB is reachable.

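A clean skip starts with a cheap precondition check before any deploy script runs. The sketch below covers only the first precondition, `psql` being on PATH, using `shutil.which`; the function name and the reason string are assumptions, and a real check would also need to confirm that the server is reachable with the configured credentials.

```python
import shutil


def db_skip_reason(search_path=None):
    """Return a reason string if Tier I/S should be skipped, else None.

    `search_path` overrides PATH for the lookup; None means the process PATH.
    Only the presence of the psql client is checked here, not connectivity.
    """
    if shutil.which("psql", path=search_path) is None:
        return "psql not found on PATH"
    return None


reason = db_skip_reason()  # None when psql is installed, a message otherwise
```

Returning a reason instead of raising lets the caller print a `SKIPPED` line with a diagnostic and carry on with the remaining tiers.
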
## Phasing

| Phase | What | Status |
|-------|------|--------|
| 0 | This PLAN + FAILURE_MODES + README | **DONE** |
| 1 | Tier P dataset folders (23 scenarios) + expected JSONs | DONE |
| 2 | Tier P runner.py | DONE |
| 3 | Execute Tier P locally; show results | DONE / awaiting your review |
| 4 | Tier I scaffolding + runner extension | next |
| 5 | Tier S scaffolding + runner extension | next |
| 6 | (Future) Tier X across MariaDB/SQLite once Postgres is locked in | deferred |

## What this DOES NOT do

- Doesn't replace the existing 85-assertion SQL suite under `tests/suites/` — Tier S runs those after a fresh deploy and counts the pass total.
- Doesn't replace the Python unit tests in `tests/` — Tier P is broader (23 scenarios), while unit tests keep fast in-Python coverage for the validator and eval runner.
- Doesn't include performance/load testing yet (scope creep — separate suite if needed later).

## Open questions

None blocking — proceeding with the plan above. If you want to change scope or add scenarios after seeing the Tier P results, edit `FAILURE_MODES.md` and let me know.
