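2. KOROAD flat JSON response structure
Symptom: KeyError parsing response; expected nested response.header/body structure
Root cause: KOROAD type=json returns flat structure {"resultCode":"00","items":{...},"totalCount":N}, not nested like KMA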
Fix: Rewrote _parse_response() to read from top-level keys
Why mocks couldn't catch it: Fixtures were authored with assumed (incorrect) structure
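The difference between the two envelopes can be sketched as follows. This is a minimal illustration, not the project's actual `_parse_response()`: the function names are hypothetical, and the inner shape of `items` (an `item` list) and the keys inside `header`/`body` are assumptions, since the report only shows `"items":{...}`.

```python
from typing import Any


def parse_flat_response(payload: dict[str, Any]) -> tuple[str, Any, int]:
    """Read result code, items, and total count from top-level keys (KOROAD style)."""
    result_code = payload["resultCode"]
    items = payload.get("items", {})
    total_count = int(payload.get("totalCount", 0))
    return result_code, items, total_count


def parse_nested_response(payload: dict[str, Any]) -> tuple[str, Any, int]:
    """Read the same fields from a response.header/body envelope (KMA style, assumed keys)."""
    header = payload["response"]["header"]
    body = payload["response"]["body"]
    return header["resultCode"], body.get("items", {}), int(body.get("totalCount", 0))


# A flat payload shaped like the one quoted in the root cause above.
flat = {"resultCode": "00", "items": {"item": []}, "totalCount": 0}

print(parse_flat_response(flat))

# The nested parser fails on the flat payload, which matches the reported symptom.
try:
    parse_nested_response(flat)
except KeyError as exc:
    print("KeyError:", exc)
```

A fixture written against the nested envelope would keep the nested parser green indefinitely, which is why only a live call exposed the mismatch.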
3. KOROAD afos_fid returned as integer
Symptom: pydantic_core.ValidationError: Input should be a valid string for afos_fid
Root cause: Real API returns "afos_fid": 7192978 (int), fixtures had "afos_fid": "0001" (str)
Fix: Added coerce_numbers_to_str=True to AccidentHotspot Pydantic model
Why mocks couldn't catch it: Fixtures only contained string values; real API wire format differs
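A small sketch of the coercion setting, assuming Pydantic v2. Only the `afos_fid` field is shown; the real `AccidentHotspot` model presumably has more fields, and `StrictHotspot` is a hypothetical name used here for contrast.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class StrictHotspot(BaseModel):
    """Default behaviour: an int is not accepted for a str field."""

    afos_fid: str


class AccidentHotspot(BaseModel):
    """With coerce_numbers_to_str, numeric input is converted to its string form."""

    model_config = ConfigDict(coerce_numbers_to_str=True)

    afos_fid: str


# The value the live API actually sent, per the root cause above.
try:
    StrictHotspot(afos_fid=7192978)
except ValidationError as exc:
    print("rejected:", exc.errors()[0]["type"])

hotspot = AccidentHotspot(afos_fid=7192978)
print(repr(hotspot.afos_fid))
```

Setting the option on the model rather than on a single field applies it to every `str` field, which may or may not be desirable for other fields.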
4. KOROAD NODATA_ERROR treated as hard error
Symptom: E2E multi-turn test failed — Turn 2 KOROAD call returned resultCode='03' and adapter threw ToolExecutionError
Root cause: _parse_response() raised on any resultCode != "00", but code 03 (NODATA_ERROR) means "no matching records" — a valid empty result
Fix: Handle resultCode == "03" as empty result set (0 hotspots)
Why mocks couldn't catch it: Fixtures never returned NODATA_ERROR; real queries with specific parameters can legitimately find no data
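The result-code handling described above might look roughly like this. It is a sketch under assumptions: `extract_hotspots` is a hypothetical helper, `ToolExecutionError` is redefined locally as a stand-in for the adapter's exception, and the `items.item` layout is assumed rather than taken from the report.

```python
from typing import Any

RESULT_OK = "00"
RESULT_NODATA = "03"  # NODATA_ERROR: no matching records


class ToolExecutionError(Exception):
    """Local stand-in for the adapter's error type."""


def extract_hotspots(payload: dict[str, Any]) -> list[dict[str, Any]]:
    """Return hotspot records, treating NODATA_ERROR as an empty result."""
    code = payload.get("resultCode")
    if code == RESULT_NODATA:
        return []  # a valid query that matched nothing
    if code != RESULT_OK:
        raise ToolExecutionError(f"KOROAD returned resultCode={code!r}")
    items = payload.get("items") or {}
    item = items.get("item", [])
    # Assumed: a single record may arrive as a bare object instead of a list.
    return item if isinstance(item, list) else [item]


print(extract_hotspots({"resultCode": "03"}))
print(extract_hotspots({"resultCode": "00", "items": {"item": [{"afos_fid": 7192978}]}}))
```

Any code other than `"00"` and `"03"` still raises, so genuine failures remain visible to the caller.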
5. K-EXAONE reasoning token budget exhaustion
Symptom: LLM stream test got 0 content_delta events
Root cause: K-EXAONE uses reasoning_content tokens before content tokens, sharing the same max_tokens budget. With max_tokens=1024, all tokens were consumed by reasoning, leaving none for actual content
Fix: Increased max_tokens from 1024 → 4096 for streaming tests
Why mocks couldn't catch it: MockLLM returns fixed responses immediately; real model has variable reasoning depth
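The shared-budget arithmetic can be illustrated with a toy model. The reasoning-token count below is a made-up figure for illustration, not a measurement from K-EXAONE, and the event dictionaries are a hypothetical shape rather than the project's real stream event type.

```python
def content_tokens_available(max_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for visible content after reasoning draws on the shared budget."""
    return max(0, max_tokens - reasoning_tokens)


def count_content_deltas(events: list[dict]) -> int:
    """Count stream events that carry visible content rather than reasoning."""
    return sum(1 for event in events if event.get("type") == "content_delta")


# Hypothetical reasoning depth; a real model varies from request to request.
HYPOTHETICAL_REASONING_TOKENS = 1500

for budget in (1024, 4096):
    remaining = content_tokens_available(budget, HYPOTHETICAL_REASONING_TOKENS)
    print(f"max_tokens={budget}: {remaining} tokens left for content")

# A stream that spent its whole budget on reasoning yields no content_delta events.
exhausted_stream = [{"type": "reasoning_delta"}, {"type": "reasoning_delta"}]
print(count_content_deltas(exhausted_stream))
```

Because reasoning depth varies, a larger `max_tokens` lowers the chance of an empty stream but does not rule it out.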
6. gu_gun parameter was optional (should be required)
Symptom: KOROAD API returned errors when guGun parameter was missing
Root cause: gu_gun was defined as Optional with a default of None in the Pydantic input model, but the KOROAD API documents it as required (항목구분, "item category": @1)
Fix: Made gu_gun: GugunCode required across KoroadAccidentSearchInput and RoadRiskScoreInput
Why mocks couldn't catch it: Test fixtures didn't validate parameter completeness against the real API spec
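The change from optional to required can be sketched like this, assuming Pydantic v2. `GugunCode` is replaced by a plain `str` alias because its real definition is not shown in the report, `OldSearchInput` is a hypothetical name for the previous shape, and the other input fields are omitted.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError

# Hypothetical stand-in; the project's GugunCode type is not shown in the report.
GugunCode = str


class OldSearchInput(BaseModel):
    """Previous shape: a missing gu_gun passes validation and fails later at the API."""

    gu_gun: Optional[GugunCode] = None


class KoroadAccidentSearchInput(BaseModel):
    """Fixed shape: no default, so a missing gu_gun is rejected before any request."""

    gu_gun: GugunCode


print(OldSearchInput().gu_gun)

try:
    KoroadAccidentSearchInput()
except ValidationError as exc:
    print("rejected:", exc.errors()[0]["type"], exc.errors()[0]["loc"])
```

Making the field required moves the failure from a remote API error to local input validation, where a mock-based test can also observe it.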
Phase 1 Live API Validation Results
Epic: #291 — Phase 1 Final Validation & Stabilization (Live)
PR: #348
Date: 2026-04-13
Commit: 1b5e57e
Test Results Summary
Live-Only Defects Discovered & Fixed
These defects were invisible behind mocks and only surfaced when hitting real APIs:
1. KOROAD API type parameter (not _type)
Symptom: … text/xml;charset=UTF-8)
Root cause: … _type=json (KMA convention) but KOROAD uses type=json
Fix: … "_type": "json" → "type": "json"
Files Changed (15 files, +222/-153)
src/kosmos/tools/koroad/koroad_accident_search.py: type=json, flat parse, coerce, NODATA handling
src/kosmos/tools/composite/road_risk_score.py: gu_gun required
tests/tools/koroad/fixtures/*.json (4 files)
tests/tools/koroad/test_koroad_accident_search.py
tests/live/test_live_*.py (4 files): gu_gun parameter, max_tokens
tests/tools/composite/test_road_risk_score.py: gu_gun required
tests/tools/test_search_integration.py: gu_gun required
tests/e2e/conftest.py: gu_gun in E2E fixtures
tests/live/conftest.py
Validation of Epic #291 Risk Matrix
Key Takeaway
6 out of 8 predicted live-only risk points manifested as real defects. This validates the Epic #291 hypothesis: mock-based tests create a false sense of security for external API integrations. Live validation is essential before closing Phase 1.
Quality Gates
uv run pytest: 870 passed, 15 skipped
uv run pytest -m live: 15 passed
uv run mypy src/kosmos: no issues
uv run ruff check src/ tests/: all checks passed