🔍 FailureLens IQ

Learning Intelligence from Failed ML Experiments

🎥 Demo Video

FailureLens IQ — Foundry IQ-Powered ML Failure Reasoning Agent

Watch the project demo here: https://youtu.be/7vMRFpLqYM8

This demo shows FailureLens IQ diagnosing an ML experiment failure caused by target leakage / future information leakage, using a multi-agent reasoning pipeline with Foundry IQ grounding, Azure AI Search citations, deployment safety gates, remediation planning, and a judge-ready proof payload.

1. One-Line Summary

FailureLens IQ turns failed ML experiments into evidence-grounded diagnosis, remediation plans, and Microsoft certification-aligned team learning.

2. Microsoft Agents League Track

Track: Reasoning Agents
Microsoft IQ layer: Foundry IQ (grounded retrieval over Azure AI Search)
Enterprise scenario: ML experiment failure learning and certification readiness

3. Why This Project Matters

ML teams repeat the same failure modes (leakage, overfitting, imbalance, drift) across experiments.
Failed experiments are rarely converted into reusable, searchable team memory.
Managers need evidence, uncertainty, and concrete learning actions — not a single opaque score.
Certification / upskilling (DP-100, AI-102) should be connected to real failure cases, not generic advice.

FailureLens IQ runs a multi-agent reasoning pipeline that produces a canonical diagnosis, a remediation playbook, certification mapping, and a judge-facing proof payload — while being explicit about what is and is not live Microsoft IQ.

4. Microsoft IQ Layer: Foundry IQ

Selected IQ Layer: Foundry IQ.

How Foundry IQ is used

Grounding layer for agent reasoning (retrieval before final diagnosis).
ML failure-taxonomy retrieval.
Remediation playbook retrieval.
Certification / learning-path retrieval (DP-100 / AI-102).
Citation-backed reasoning with permission-aware source metadata.
A judge proof payload that records exactly which sources grounded the run.

Live Azure setup

Azure AI Search stores the Foundry IQ knowledge sources (knowledge/foundry_iq_sources/).
The backend calls Azure AI Search before the final diagnosis.
Returned references are attached to the reasoning trace and the canonical diagnosis.
Live Microsoft IQ is claimed only when Azure AI Search returns citations with source_type = azure_ai_search and a permission filter was applied.
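
The live-claim rule above can be sketched as a small predicate. This is a minimal illustration, not the project's code: the `Citation` dataclass and the `claims_live_grounding` function are hypothetical names, and treating "at least one `azure_ai_search` citation" as sufficient is an assumption. Only the `source_type = azure_ai_search` value and the permission-filter condition come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record; the field names are assumptions."""
    source_id: str
    source_type: str  # e.g. "azure_ai_search" for live results


def claims_live_grounding(citations: list[Citation], permission_filter_applied: bool) -> bool:
    """Return True only when a run may claim live Microsoft IQ grounding."""
    # Assumption: one citation served by Azure AI Search is enough.
    has_live_citation = any(c.source_type == "azure_ai_search" for c in citations)
    # Both conditions must hold; either one alone is not a live claim.
    return has_live_citation and permission_filter_applied
```

With this shape, a run that returns no citations, or that skipped the permission filter, stays on the non-live side by default.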

Local demo mode

Uses a local Foundry IQ-compatible adapter (FoundryIQLocalAdapter).
Never claims live Microsoft IQ.
Proof level: local_foundry_iq_adapter.
The UI shows: "Local demo mode — not claiming live Microsoft IQ."

Honest mode matrix (proof levels)

| Proof level | Grounding (Azure AI Search) | Reasoning (Microsoft model) | Live Microsoft IQ |
| --- | --- | --- | --- |
| `live_azure_foundry` | live citations | Microsoft Foundry/Azure OpenAI | Yes |
| `azure_search_live_with_local_reasoning` | live citations | local deterministic | partial (grounding only) |
| `foundry_model_live_without_search` | local adapter | Microsoft Foundry/Azure OpenAI | partial (reasoning only) |
| `local_foundry_iq_adapter` | local adapter | local deterministic | No (honest local demo) |
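
The matrix reduces to two booleans: whether grounding was live and whether reasoning was live. A minimal sketch of that mapping follows; the function name `resolve_proof_level` and its parameters are hypothetical, while the four returned strings are the proof levels from the matrix.

```python
def resolve_proof_level(search_live: bool, model_live: bool) -> str:
    """Map the two live/local halves of a run to a proof level."""
    if search_live and model_live:
        return "live_azure_foundry"
    if search_live:
        # Grounding is live, reasoning is local deterministic.
        return "azure_search_live_with_local_reasoning"
    if model_live:
        # Reasoning is live, grounding comes from the local adapter.
        return "foundry_model_live_without_search"
    return "local_foundry_iq_adapter"
```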

5. Proof of Real IQ Usage

All proof routes return honest, run-specific status and never fabricate live claims.

| Route | Purpose |
| --- | --- |
| `GET /health` | Service + configuration health. |
| `GET /version` | Build version, app mode, active reasoning/grounding providers. |
| `GET /iq/status` | Current IQ layer/provider and proof level. |
| `GET /proof/live-iq` | Configuration-only proof snapshot (booleans + honest limitation). |
| `POST /proof/live-iq/run` | Runs the pipeline and returns the actual proof payload for that run. |
| `GET /proof/azure-search/test` | Direct Azure AI Search diagnostic (configured / request_success / result_count). |

Each proof payload includes: selected_iq_layer, proof_level, live_microsoft_iq_grounding, live_microsoft_reasoning, azure_ai_search_configured, azure_ai_search_used_this_run, foundry_or_azure_model_configured, foundry_or_azure_model_used_this_run, active_reasoning_provider, active_grounding_provider, citation/grounding_refs counts, permission scope (via citations), warnings, honest_limitation, run_id, and trace_ids.
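
A reviewer could sanity-check a returned payload along these lines. This is a sketch under assumptions: the key names are the ones listed above, but the function `check_proof_payload` and the two cross-field rules (a live flag requires the matching `*_used_this_run` flag) are illustrative and not confirmed project behavior. Keys for the citation and grounding-ref counts are left out because their exact names are not given.

```python
REQUIRED_KEYS = (
    "selected_iq_layer", "proof_level",
    "live_microsoft_iq_grounding", "live_microsoft_reasoning",
    "azure_ai_search_configured", "azure_ai_search_used_this_run",
    "foundry_or_azure_model_configured", "foundry_or_azure_model_used_this_run",
    "active_reasoning_provider", "active_grounding_provider",
    "warnings", "honest_limitation", "run_id", "trace_ids",
)


def check_proof_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks coherent."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in payload]
    # Assumed rule: a live grounding claim needs Azure AI Search used in this run.
    if payload.get("live_microsoft_iq_grounding") and not payload.get("azure_ai_search_used_this_run"):
        problems.append("grounding claimed live but Azure AI Search was not used this run")
    # Assumed rule: a live reasoning claim needs a Microsoft model used in this run.
    if payload.get("live_microsoft_reasoning") and not payload.get("foundry_or_azure_model_used_this_run"):
        problems.append("reasoning claimed live but no Microsoft model was used this run")
    return problems
```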

Proof-level meanings

live_azure_foundry — both halves of live Microsoft IQ proven (search + Microsoft model).
azure_search_live_with_local_reasoning — grounding is live; reasoning is local.
foundry_model_live_without_search — reasoning is live; grounding is local.
local_foundry_iq_adapter — fully local demo; live Microsoft IQ is honestly false.

6. Architecture

flowchart LR
    UI["React / Vite Frontend"] --> API["FastAPI Backend"]
    API --> ORCH["Reasoning Orchestrator"]
    ORCH --> AGENTS["Reasoning Agents"]
    AGENTS --> IQ["Foundry IQ Grounding Layer"]
    IQ --> SEARCH["Azure AI Search"]
    IQ --> LOCAL["Local Foundry IQ Adapter"]
    AGENTS --> MODEL["Azure OpenAI / Foundry Model"]
    ORCH --> PROOF["Judge Proof Payload"]

FinalConsistencyValidatorAgent and build_final_report_payload together produce one canonical payload that drives the UI, the HTML/JSON export, and the proof. Because a single payload hash covers all of them, judge-facing surfaces cannot disagree.
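
One common way to get a stable hash over a payload is to serialize it canonically and digest the bytes. The sketch below assumes SHA-256 over sorted-key JSON; the project's actual hashing scheme is not described here, and `payload_hash` is a hypothetical name.

```python
import hashlib
import json


def payload_hash(payload: dict) -> str:
    """Hash a payload so that key order and whitespace do not affect the result."""
    canonical = json.dumps(
        payload,
        sort_keys=True,            # stable key order
        separators=(",", ":"),     # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

If the UI, the export, and the proof each report the hash of the payload they rendered, parity can be checked by comparing three strings.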

7. Reasoning Agents

| Agent | Role | Input | Output | Uses Foundry IQ citations | Uncertainty |
| --- | --- | --- | --- | --- | --- |
| ExperimentIntakeAgent | Normalize the experiment record, flag missing fields | raw prompt / experiment log | structured `ExperimentLog`, missing-field list | no | missing-critical-fields list |
| FailureClassifierAgent | Rule-based taxonomy scoring | experiment metrics/notes | candidate category + conflicts + confidence | no | conflict list, requires_review flag |
| RootCauseAnalyzerAgent | LLM/grounded root-cause reasoning | classification + grounding | root cause, evidence, counter-evidence, reflections | yes | counter-evidence + reflection notes |
| ExperimentHistorianAgent | Find comparable past failures | diagnosis + history | recurrence signals | yes | similarity confidence |
| CertificationEvaluatorAgent | Map failure → Microsoft cert | category + gap | DP-100 / AI-102 path | yes | fallback flag when grounding weak |
| ReadinessAssessmentAgent | Build readiness questions | canonical diagnosis | targeted practice questions | yes | - |
| PrescriptiveCoachAgent | 3-day / 7-day remediation | canonical diagnosis | remediation playbook + lab | yes | - |
| IntegrationManagerAgent | Team heatmap + governance | team history | heatmap, cert relevance, recurring alerts | no | vulnerability level |
| FinalConsistencyValidatorAgent | Single source of truth | all agent outputs | canonical diagnosis + consistency gate | reconciles citations | contradictions list, fails on disagreement |
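
The "fails on disagreement" behavior of the final validator can be pictured with a toy gate. This is only a sketch of the idea: `consistency_gate`, its input shape (agent name mapped to the category that agent reported), and the returned keys are all hypothetical, and the real validator reconciles far more than a single category field.

```python
def consistency_gate(agent_categories: dict[str, str]) -> dict:
    """Pass only when every agent reports the same failure category."""
    distinct = sorted(set(agent_categories.values()))
    passed = len(distinct) == 1
    contradictions = []
    if len(distinct) > 1:
        # Record who said what so a reviewer can see the disagreement.
        contradictions = [
            f"{agent} reported {category}"
            for agent, category in sorted(agent_categories.items())
        ]
    return {
        "passed": passed,
        "canonical_category": distinct[0] if passed else None,
        "contradictions": contradictions,
    }
```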

Reasoning is exposed only as structured summaries (evidence, counter-evidence, reflections, calibrated confidence). No hidden chain-of-thought is surfaced.

8. Example Diagnosis (Leakage)

Prompt:

"Validation accuracy jumped to 98% after adding renewal_status_after_30d. Test performance collapsed, and I suspect the feature contains future target information."

Expected output:

| Field | Value |
| --- | --- |
| Primary category | Data Leakage |
| Subtype | Target Leakage / Future Information Leakage |
| Deployment gate | `BLOCKED_PENDING_HUMAN_REVIEW` |
| Report Judge-ready | Yes (when proof + citations + parity + consistency pass) |
| Model Deployment-ready | No |
| Human review required | Yes |
| `leakage_audit_recommended` | true |
| `active_leakage_signal` | true |
| Responsible AI status | `NOT_EVALUATED` (no fairness data provided) |
| Certification path | DP-100 |

Report Judge-ready ≠ Model Deployment-ready. A complete, grounded report can correctly conclude the model must not ship.
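
Keeping the two readiness flags separate might look like the sketch below. The four report checks and the `BLOCKED_PENDING_HUMAN_REVIEW` gate value come from the table above; the names `ReportChecks` and `readiness` are hypothetical, and treating any other gate value as non-blocking is an assumption, since other gate values are not listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportChecks:
    """The four checks the report must pass to be judge-ready."""
    proof_ok: bool
    citations_ok: bool
    parity_ok: bool
    consistency_ok: bool


def readiness(checks: ReportChecks, deployment_gate: str) -> dict:
    """Compute report readiness and model readiness independently."""
    report_judge_ready = all(
        (checks.proof_ok, checks.citations_ok, checks.parity_ok, checks.consistency_ok)
    )
    # Assumption: only the blocked gate value prevents deployment readiness.
    model_deployment_ready = deployment_gate != "BLOCKED_PENDING_HUMAN_REVIEW"
    return {
        "report_judge_ready": report_judge_ready,
        "model_deployment_ready": model_deployment_ready,
    }
```

For the leakage example, all four checks pass while the gate is blocked, so the report is ready and the model is not.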

The reverse case is also enforced: an Overfitting prompt (high train / low validation accuracy) is not flipped to Data Leakage just because the reasoning narrative mentions leakage — leakage stays a secondary, audit-recommended risk unless concrete leakage evidence (leakage columns / explicit future-state features) exists.
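
The guard described above can be sketched as follows. The function name, its parameters, and the exact meaning given to `active_leakage_signal` are assumptions made for illustration; the rule itself (a narrative mention alone never promotes leakage to the primary category) is taken from the paragraph.

```python
def resolve_primary_category(
    rule_based_category: str,
    narrative_mentions_leakage: bool,
    leakage_columns: list[str],
) -> dict:
    """Promote Data Leakage to primary only on concrete evidence."""
    if leakage_columns:
        # Concrete evidence: named leakage columns or future-state features.
        return {
            "primary_category": "Data Leakage",
            "leakage_audit_recommended": True,
            "active_leakage_signal": True,
        }
    # No concrete evidence: keep the rule-based category and, if the
    # narrative mentioned leakage, carry it as an audit recommendation only.
    return {
        "primary_category": rule_based_category,
        "leakage_audit_recommended": narrative_mentions_leakage,
        "active_leakage_signal": False,
    }
```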

9. Local Development (Windows PowerShell)

# 1) Backend virtual environment
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt

# 2) Frontend packages
npm install --prefix frontend

# 3) Run backend only (http://127.0.0.1:8000)
python -m uvicorn backend.api.main:app --reload --host 127.0.0.1 --port 8000

# 4) Run frontend only (http://127.0.0.1:5173)
npm run dev:frontend      # from repo root

# 5) Run BOTH together
npm run dev          # from repo root (scripts/dev_full_stack.py)

Root package.json scripts:

| Script | Action |
| --- | --- |
| `npm run dev` | Start backend + frontend together |
| `npm run dev:backend` | Backend (uvicorn) only |
| `npm run dev:frontend` | Frontend (Vite) only |
| `npm run install:all` | Install frontend packages + backend requirements |
| `npm run test:backend` | `pytest tests -v` |
| `npm run test:frontend` | Frontend Vitest suite |

10. Environment Variables

Copy .env.example → .env (the repo .gitignore excludes .env; no real secrets are committed).

| Variable | Required for live | Notes |
| --- | --- | --- |
| `APP_MODE` | yes | `demo` (local) or `live`/`production`. |
| `FOUNDRY_CALL_MODE` | - | Must not be `mock` in live mode. |
| `REQUIRE_LIVE_FOUNDRY_IQ` | optional | Fail closed: forbids local fallback in live mode. |
| `AZURE_AI_SEARCH_ENDPOINT` | yes (live) | Azure AI Search endpoint (alias: `AZURE_SEARCH_ENDPOINT`). |
| `AZURE_AI_SEARCH_INDEX` | yes (live) | Index holding Foundry IQ sources. |
| `AZURE_AI_SEARCH_KEY` | yes (live) | Search admin/query key. |
| `AZURE_OPENAI_ENDPOINT` / `FOUNDRY_OPENAI_BASE_URL` | yes (live) | Microsoft model endpoint. |
| `AZURE_OPENAI_API_KEY` / `FOUNDRY_API_KEY` | yes (live) | Model key. |
| `AZURE_OPENAI_DEPLOYMENT` / `FOUNDRY_MODEL_DEPLOYMENT` | yes (live) | Deployment name. |
| `MODEL_PROVIDER` | - | `foundry_openai` / `azure_openai` / `openai` / `local`. |

Local demo mode needs none of the Azure values — it runs fully offline with the local Foundry IQ adapter and reports local_foundry_iq_adapter.
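
A pre-flight check for live mode could be written roughly as below. The variable names come from the table; the function `missing_live_settings` is hypothetical, and two points are assumptions: that the paired variables (for example `AZURE_OPENAI_ENDPOINT` / `FOUNDRY_OPENAI_BASE_URL`) are alternatives where either one suffices, and that an unset `APP_MODE` behaves like `demo`.

```python
from typing import Mapping

# Each tuple lists alternative names; one non-empty value satisfies the entry.
LIVE_REQUIREMENTS = (
    ("AZURE_AI_SEARCH_ENDPOINT", "AZURE_SEARCH_ENDPOINT"),
    ("AZURE_AI_SEARCH_INDEX",),
    ("AZURE_AI_SEARCH_KEY",),
    ("AZURE_OPENAI_ENDPOINT", "FOUNDRY_OPENAI_BASE_URL"),
    ("AZURE_OPENAI_API_KEY", "FOUNDRY_API_KEY"),
    ("AZURE_OPENAI_DEPLOYMENT", "FOUNDRY_MODEL_DEPLOYMENT"),
)


def missing_live_settings(env: Mapping[str, str]) -> list[str]:
    """List the live-mode settings that are absent; demo mode needs none."""
    if env.get("APP_MODE", "demo") == "demo":  # assumed default
        return []
    return [
        " / ".join(names)
        for names in LIVE_REQUIREMENTS
        if not any(env.get(name) for name in names)
    ]
```

Calling `missing_live_settings(os.environ)` at startup (after `import os`) would give a readable list of what still needs to be set before a live run.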

11. Testing

# Backend (pytest)
python -m pytest tests -q

# Byte-compile sanity
python -m compileall backend

# Frontend
npm run lint  --prefix frontend     # tsc --noEmit
npm run test  --prefix frontend     # Vitest
npm run build --prefix frontend     # production build

Coverage highlights:

Leakage scenario contract (category, gate, leakage audit, DP-100, RAI status).
No DP-203 for leakage learning paths.
Citation consistency across UI / proof / export / grounding refs / agent trace.
Live IQ honesty — local adapter never claims live Microsoft IQ; azure_ai_search citations required.
Report vs Deployment readiness are separate fields.
UI / export parity — single canonical payload hash.
Confidence calibration — one documented formula object.

12. Judge Demo Script

Open the app (npm run dev).
Check GET /iq/status and GET /proof/live-iq (honest mode + proof level).
Run the leakage prompt from §8.
Show the diagnosis and the blocked deployment gate (Report Judge-ready Yes / Model Deployment-ready No).
Show the Foundry IQ citations (with source_type + permission_scope).
Show the proof payload (POST /proof/live-iq/run or the report's Proof section).
Export the report (HTML / JSON).
Show report / UI parity (same payload hash, same readiness labels).
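
The read-only status checks in the script can also be run from Python with the standard library alone. This sketch assumes the backend is listening on the local address from section 9 and that these routes return JSON; the helper names are hypothetical. `POST /proof/live-iq/run` is left out because its request body is not documented here.

```python
import json
import urllib.request
from urllib.parse import urljoin

BASE_URL = "http://127.0.0.1:8000"  # local backend address from section 9

# Read-only routes listed in section 5.
STATUS_ROUTES = ("/health", "/version", "/iq/status", "/proof/live-iq")


def build_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and a route path without doubling slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def fetch_json(path: str, base_url: str = BASE_URL, timeout: float = 10.0):
    """GET a route and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(build_url(path, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_status(base_url: str = BASE_URL) -> dict:
    """Fetch every read-only status route and key the results by path."""
    return {path: fetch_json(path, base_url) for path in STATUS_ROUTES}
```

With the backend running, `print(json.dumps(collect_status(), indent=2))` dumps the mode and proof level before the leakage prompt is run.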

13. Security, Compliance & Further Reading

docs/MICROSOFT_IQ_HONEST_COMPLIANCE.md — how live-vs-local Microsoft IQ claims are kept honest.
docs/PRODUCTION_HARDENING.md — production hardening checklist.
docs/SECURITY_MODEL.md — auth, CORS, rate limiting, and secret handling.
docs/AZURE_POLICY_BLOCKER.md — documented Azure tenant policy blocker.
docs/OPENAI_FALLBACK_PROVIDER.md — the OpenAI direct API does not replace Microsoft IQ; it is only a reasoning fallback when Microsoft endpoints are unavailable, and never counts as live Microsoft IQ.
docs/JUDGE_REVIEW_GUIDE.md — judge walkthrough.
docs/NO_SECRETS_POLICY.md — no secrets are committed; .env is gitignored.

14. Honesty and Limitations

Local mode is not live Microsoft IQ. It is clearly labeled local_foundry_iq_adapter.
Live Microsoft IQ is only claimed when Azure AI Search returns source_type = azure_ai_search citations and the permission filter is applied.
FailureLens IQ provides decision support, not automatic production approval. Every failure diagnosis blocks deployment pending review.
Human review is required for leakage and low-confidence diagnoses.
Responsible AI is reported as NOT_EVALUATED unless fairness / protected-attribute data and metrics are provided.
Agent reasoning is shown as structured summaries; no hidden chain-of-thought is exposed.
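
The human-review rule can be expressed in a couple of lines. This is a sketch only: the function name is hypothetical and the `0.6` cut-off for "low confidence" is an invented placeholder, since the actual threshold is not stated in this README.

```python
LOW_CONFIDENCE_THRESHOLD = 0.6  # placeholder value, not the project's setting


def human_review_required(
    primary_category: str,
    confidence: float,
    threshold: float = LOW_CONFIDENCE_THRESHOLD,
) -> bool:
    """Require review for leakage diagnoses and for any low-confidence diagnosis."""
    return primary_category == "Data Leakage" or confidence < threshold
```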
