Delphictunic · Delphictunic · Mar 3, 2026 · Mar 3, 2026 · Mar 3, 2026 · Mar 3, 2026
diff --git a/DEBUG_REVIEW.md b/DEBUG_REVIEW.md
@@ -0,0 +1,200 @@
+# ReasonGraph Debug Review: "Available evidence does not sufficiently support a reliable answer."
+
+This document reviews the pipeline file-by-file to find why the verifier keeps rejecting answers and the system falls back to graceful failure.
+
+---
+
+## Flow summary
+
+```
+main.py → initial_state(query)
+  → graph: START → query_optimizer → retriever → generator → verifier
+  → loop_controller:
+      - verifier_passed → END
+      - search_count < max_search → increment_search → query_optimizer (retry)
+      - else → graceful_failure → END
+```
+
+**When you see the generic message:** The graph ran the verifier, it failed (verifier_passed=False), and after `max_search` attempts (default 2, so up to 3 total runs) the loop controller sent the run to **graceful_failure**, which overwrites `draft_answer` with that message. So the root cause is **verifier failing on every attempt**.
+
+---
+
+## 1. app/graph/state.py
+
+**Purpose:** Define `GraphState` and `initial_state()` so every node has a complete state (no missing keys).
+
+**Checks:**
+- All fields are initialized in `initial_state()` — **OK**
+- `search_count=0` and `max_search=2` — **OK** (gives up to 3 attempts: count 0→1→2, then graceful_failure)
+- State is complete — **OK**
+
+**Potential bugs:** None. If `search_count` or `max_search` were ever missing, `loop_controller` would raise `KeyError`; we added `.get()` fallbacks in loop_controller and kept `initial_state` complete.
+
+**Connection:** Passed to graph; query_optimizer reads `user_query`, retriever reads `user_query` + `optimized_queries`, generator reads `reranked_docs`, verifier reads `draft_answer`, `citations`, `reranked_docs`, loop_controller reads `verifier_passed`, `search_count`, `max_search`.
+
+**Could cause graceful failure?** No. Missing/incomplete state could cause crashes, not systematic verifier failure.
+
+---
+
+## 2. app/nodes/query_optimizer.py
+
+**Purpose:** Turn the user query into 3 optimized search queries for better recall.
+
+**Checks:**
+- Produces up to 3 queries — **OK** (falls back to `[user_query]` on parse/API failure)
+- No explicit handling of "BD" / "S&E" variations — **was a gap**. The document says "bd and s & e" (with spaces); if optimized queries stay as "BD and S&E", BM25/vector may not match as well.
+
+**Fix applied:** Prompt now asks to expand abbreviations (e.g. "business development", "s & e") in at least one query for recall.
+
+**Potential bugs:**
+- On exception, returns `[user_query]` only (one query) — retrieval still runs.
+- If the LLM returns a non-array or empty array, we fall back to single query.
+
+**Connection:** Writes `optimized_queries`. Retriever uses them for vector + BM25.
+
+**Could cause graceful failure?** Indirectly. Weak or single query → fewer/better candidates → generator may not see the right chunk → answer unsupported → verifier fails.
+
+---
+
+## 3. app/nodes/retriever.py
+
+**Purpose:** Hybrid (vector + BM25) retrieval, then cross-encoder rerank; fill `reranked_docs` with top K chunks.
+
+**Checks:**
+- Retrieves for each of `optimized_queries` — **OK**
+- `RERANK_TOP_K` was 7 — **increased to 10** so more context reaches the generator
+- Cross-encoder is used — **OK** (`pairs = [[content, user_query]]`, then sort by score)
+- Chroma path is `./chroma_db` resolved from cwd — **OK** if run from ReasonGraph/; otherwise ensure cwd or use absolute path
+
+**Potential bugs:**
+- If the chunk with the answer (e.g. "bd and s & e activities are limited to...") is ranked 8th–10th by the cross-encoder, it was previously dropped; with top 10 it’s more likely to be included.
+- `collection.query(query_embeddings=[emb.tolist()])` — Chroma expects list of lists; we pass one embedding as `[emb.tolist()]` — **OK**
+
+**Connection:** Writes `reranked_docs`. Generator reads them and builds the evidence block.
+
+**Could cause graceful failure?** Yes. If the right chunk never appears in top K, the generator cannot cite it and may say "evidence does not provide" or cite a wrong chunk → verifier fails.
+
+---
+
+## 4. app/nodes/generator.py
+
+**Purpose:** Produce `draft_answer` and `citations` from `reranked_docs` via LLM; citations use `chunk_ids` (list).
+
+**Checks:**
+- Receives `reranked_docs` from state — **OK**
+- Builds evidence block with `chunk_id` per doc — **OK**
+- Prompt asks for JSON with `answer` and `citations` (claim + chunk_ids) — **OK**
+- Parser normalizes to `chunk_ids` list only (no `chunk_id` in output) — **OK**; verifier supports `chunk_ids`
+- Prompt now stresses: when evidence contains the answer (even different wording), answer and cite; do not say "evidence does not provide"; always include citations — **OK**
+
+**Potential bugs:**
+- If the model returns invalid JSON or omits `citations`, we return `draft_answer` with `citations=[]` → verifier then fails (empty citations + non-failure answer).
+- Chunk IDs must match exactly the `chunk_id` in the evidence block (e.g. `2602.15019v2_c1`). Typos or extra spaces → verifier sees MISSING_CHUNK → fails.
+
+**Connection:** Writes `draft_answer`, `citations`. Verifier reads both and `reranked_docs` to build chunk_map.
+
+**Could cause graceful failure?** Yes. Generator saying "evidence does not provide" or citing wrong/missing chunk_id → verifier rejects.
+
+---
+
+## 5. app/nodes/verifier.py
+
+**Purpose:** Check every cited claim against chunk content; set `verifier_passed` and `unsupported_claims`.
+
+**Checks:**
+- Receives `citations` (list of dicts with `claim` and `chunk_ids`) and `reranked_docs` — **OK**
+- `_citation_chunk_ids(c)` supports both `chunk_id` (string) and `chunk_ids` (list) — **OK**
+- Verification prompt describes multi-chunk synthesis and semantic entailment — **OK**
+- User query is filtered out from `unsupported_claims` (so we don’t show the question as a claim) — **OK**
+- System prompt says "fair" and "Do NOT require exact word-for-word" — **OK**
+- User prompt line "Be fair but critical..." was relaxed to "Be FAIR: accept when chunks collectively support..." — **done**
+
+**Potential bugs:**
+- If the verifier LLM is conservative, it may still return `verifier_passed: false` despite fair instructions. Temperature is 0.1 — **OK**
+- When `invalid_chunk_ids` is non-empty (cited chunk not in reranked_docs), we force `passed = False` and add those claims to unsupported — **correct**
+
+**Connection:** Reads `draft_answer`, `citations`, `reranked_docs`; writes `verifier_passed`, `unsupported_claims`. Loop controller reads `verifier_passed`, `search_count`, `max_search`.
+
+**Could cause graceful failure?** Yes. Overly strict interpretation of "support" or multi-chunk synthesis causes valid answers to be rejected.
+
+---
+
+## 6. app/nodes/loop_controller.py
+
+**Purpose:** After verifier, route to END, retry (increment_search → query_optimizer), or graceful_failure.
+
+**Checks:**
+- `verifier_passed` → end — **OK**
+- `search_count < max_search` → increment_search — **OK**
+- Else → graceful_failure — **OK**
+- `increment_search_node` returns `search_count + 1` — **OK**; LangGraph merges into state
+
+**Fix applied:** Use `state.get("search_count", 0)` and `state.get("max_search", 2)` so missing keys don’t cause KeyError.
+
+**Potential bugs:** None. Loop is not infinite: after `max_search` increments we hit graceful_failure.
+
+**Connection:** Conditional edges from verifier; increment_search → query_optimizer.
+
+**Could cause graceful failure?** Only by design: when retries are exhausted we intentionally go to graceful_failure.
+
+---
+
+## 7. app/graph/graph.py
+
+**Purpose:** Define nodes and edges for the LangGraph pipeline.
+
+**Checks:**
+- Topology: START → query_optimizer → retriever → generator → verifier → conditional → end / increment_search / graceful_failure — **OK**
+- increment_search → query_optimizer — **OK**
+- graceful_failure → END — **OK**
+
+**Potential bugs:** None.
+
+**Could cause graceful failure?** No. Topology is correct.
+
+---
+
+## 8. main.py
+
+**Purpose:** Invoke graph, print draft answer, citations, verifier warning, unsupported claims.
+
+**Checks:**
+- Displays multi-chunk citations (chunk_ids list or chunk_id string) — **OK**
+- Shows verifier_passed via "Warning: The answer may be unreliable..." when not passed — **OK**
+
+**Potential bugs:** When the run ends at graceful_failure, `result["draft_answer"]` is the overwritten generic message, so you never see the last generator answer. Citations and unsupported_claims are still from the last verifier run. So debugging "why did verifier fail?" requires either logging the last draft/citations before overwrite or not overwriting draft on graceful_failure (optional).
+
+**Could cause graceful failure?** No. Display only.
+
+---
+
+## Root causes of "Available evidence does not sufficiently support a reliable answer"
+
+1. **Verifier fails** on every attempt (so after max_search retries we go to graceful_failure).
+2. **Why verifier fails (pick one or more):**
+   - **Wrong or missing chunk in retrieval:** The chunk that says "bd and s & e activities are limited to largely manual and time-consuming tasks" isn’t in top K → generator can’t cite it or says "evidence does not provide".
+   - **Query optimizer:** Queries don’t expand BD/S&E → retrieval doesn’t surface that chunk well → same as above.
+   - **Generator:** Answers with "evidence does not provide" or cites a chunk that doesn’t support the claim → verifier correctly fails.
+   - **Verifier too strict:** Even with the right chunk and a reasonable answer, verifier LLM returns `verifier_passed: false` (e.g. requires exact wording or doesn’t accept synthesis).
+
+## Fixes applied in code
+
+- **state.py:** Comments that initial_state must be complete; return plain dict (unchanged behavior).
+- **query_optimizer.py:** Prompt instruction to expand abbreviations (BD, S&E, etc.) for recall.
+- **retriever.py:** `RERANK_TOP_K` increased from 7 to 10.
+- **generator.py:** Stronger prompt: must answer and cite when evidence contains the answer; always include citations.
+- **verifier.py:** User prompt relaxed to "Be FAIR: accept when chunks collectively support...".
+- **loop_controller.py:** Defensive `state.get("search_count", 0)` and `state.get("max_search", 2)`.
+
+## Format mismatch check
+
+- Generator outputs citations with **chunk_ids** (list) only (parser normalizes old `chunk_id` to list).
+- Verifier **_citation_chunk_ids()** reads both **chunk_id** and **chunk_ids** and returns a list of chunk IDs.
+- So there is **no format mismatch** between generator output and verifier input.
+
+## Suggested next steps
+
+1. Re-run ingest if you changed chunk size (so Chroma has 500-token chunks).
+2. Run the same query and confirm query_optimizer returns at least one query with "business development" or "s & e".
+3. Add temporary logging: in verifier_node, log `draft_answer`, `citations`, and verifier LLM response (or at least `passed`) so you can see why it failed.
+4. Optionally: in the graceful_failure node, do *not* overwrite `draft_answer` so the final result still shows what the generator said (and you can see what was rejected).
diff --git a/app/graph/graph.py b/app/graph/graph.py
@@ -7,7 +7,7 @@
 
 from app.graph.state import GraphState
 from app.nodes.generator import generator_node
-from app.nodes.loop_controller import loop_controller
+from app.nodes.loop_controller import increment_search_node, loop_controller
 from app.nodes.query_optimizer import query_optimizer_node
 from app.nodes.retriever import retriever_node
 from app.nodes.verifier import verifier_node
@@ -20,6 +20,7 @@
 builder.add_node("retriever", retriever_node)
 builder.add_node("generator", generator_node)
 builder.add_node("verifier", verifier_node)
+builder.add_node("increment_search", increment_search_node)
 builder.add_node(
     "graceful_failure",
     lambda state: {
@@ -38,10 +39,11 @@
     loop_controller,
     {
         "end": END,
-        "query_optimizer": "query_optimizer",
+        "increment_search": "increment_search",
         "graceful_failure": "graceful_failure",
     },
 )
+builder.add_edge("increment_search", "query_optimizer")
 builder.add_edge("graceful_failure", END)
 
 graph = builder.compile()
diff --git a/app/graph/state.py b/app/graph/state.py
@@ -1,6 +1,7 @@
 from typing import TypedDict, List, Optional
 
 class GraphState(TypedDict):
+    """State for ReasonGraph. initial_state() must set every key so loop_controller never sees missing search_count/max_search."""
     user_query: str
     optimized_queries: List[str]
     retrieved_docs: List[dict]
@@ -12,16 +13,18 @@ class GraphState(TypedDict):
     search_count: int
     max_search: int
 
+
 def initial_state(user_query: str) -> GraphState:
-    return GraphState(
-        user_query=user_query,
-        optimized_queries=[],
-        retrieved_docs=[],
-        reranked_docs=[],
-        draft_answer=None,
-        citations=[],
-        verifier_passed=False,
-        unsupported_claims=[],
-        search_count=0,
-        max_search=3
-    )
+    """Complete initial state. search_count and max_search are required for retry/graceful_failure routing."""
+    return {
+        "user_query": user_query,
+        "optimized_queries": [],
+        "retrieved_docs": [],
+        "reranked_docs": [],
+        "draft_answer": None,
+        "citations": [],
+        "verifier_passed": False,
+        "unsupported_claims": [],
+        "search_count": 0,
+        "max_search": 2,
+    }