Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
200 changes: 200 additions & 0 deletions DEBUG_REVIEW.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,200 @@
# ReasonGraph Debug Review: "Available evidence does not sufficiently support a reliable answer."

This document reviews the pipeline file-by-file to find why the verifier keeps rejecting answers and the system falls back to graceful failure.

---

## Flow summary

```
main.py → initial_state(query)
→ graph: START → query_optimizer → retriever → generator → verifier
→ loop_controller:
- verifier_passed → END
- search_count < max_search → increment_search → query_optimizer (retry)
- else → graceful_failure → END
```

**When you see the generic message:** The graph ran the verifier, it failed (verifier_passed=False), and after `max_search` attempts (default 2, so up to 3 total runs) the loop controller sent the run to **graceful_failure**, which overwrites `draft_answer` with that message. So the root cause is **verifier failing on every attempt**.

---

## 1. app/graph/state.py

**Purpose:** Define `GraphState` and `initial_state()` so every node has a complete state (no missing keys).

**Checks:**
- All fields are initialized in `initial_state()` — **OK**
- `search_count=0` and `max_search=2` — **OK** (gives up to 3 attempts: count 0→1→2, then graceful_failure)
- State is complete — **OK**

**Potential bugs:** None. If `search_count` or `max_search` were ever missing, `loop_controller` would raise `KeyError`; we added `.get()` fallbacks in loop_controller and kept `initial_state` complete.

**Connection:** Passed to graph; query_optimizer reads `user_query`, retriever reads `user_query` + `optimized_queries`, generator reads `reranked_docs`, verifier reads `draft_answer`, `citations`, `reranked_docs`, loop_controller reads `verifier_passed`, `search_count`, `max_search`.

**Could cause graceful failure?** No. Missing/incomplete state could cause crashes, not systematic verifier failure.

---

## 2. app/nodes/query_optimizer.py

**Purpose:** Turn the user query into 3 optimized search queries for better recall.

**Checks:**
- Produces up to 3 queries — **OK** (falls back to `[user_query]` on parse/API failure)
- No explicit handling of "BD" / "S&E" variations — **was a gap**. The document says "bd and s & e" (with spaces); if optimized queries stay as "BD and S&E", BM25/vector may not match as well.

**Fix applied:** Prompt now asks to expand abbreviations (e.g. "business development", "s & e") in at least one query for recall.

**Potential bugs:**
- On exception, returns `[user_query]` only (one query) — retrieval still runs.
- If the LLM returns a non-array or empty array, we fall back to single query.

**Connection:** Writes `optimized_queries`. Retriever uses them for vector + BM25.

**Could cause graceful failure?** Indirectly. Weak or single query → fewer/better candidates → generator may not see the right chunk → answer unsupported → verifier fails.

---

## 3. app/nodes/retriever.py

**Purpose:** Hybrid (vector + BM25) retrieval, then cross-encoder rerank; fill `reranked_docs` with top K chunks.

**Checks:**
- Retrieves for each of `optimized_queries` — **OK**
- `RERANK_TOP_K` was 7 — **increased to 10** so more context reaches the generator
- Cross-encoder is used — **OK** (`pairs = [[content, user_query]]`, then sort by score)
- Chroma path is `./chroma_db` resolved from cwd — **OK** if run from ReasonGraph/; otherwise ensure cwd or use absolute path

**Potential bugs:**
- If the chunk with the answer (e.g. "bd and s & e activities are limited to...") is ranked 8th–10th by the cross-encoder, it was previously dropped; with top 10 it’s more likely to be included.
- `collection.query(query_embeddings=[emb.tolist()])` — Chroma expects list of lists; we pass one embedding as `[emb.tolist()]` — **OK**

**Connection:** Writes `reranked_docs`. Generator reads them and builds the evidence block.

**Could cause graceful failure?** Yes. If the right chunk never appears in top K, the generator cannot cite it and may say "evidence does not provide" or cite a wrong chunk → verifier fails.

---

## 4. app/nodes/generator.py

**Purpose:** Produce `draft_answer` and `citations` from `reranked_docs` via LLM; citations use `chunk_ids` (list).

**Checks:**
- Receives `reranked_docs` from state — **OK**
- Builds evidence block with `chunk_id` per doc — **OK**
- Prompt asks for JSON with `answer` and `citations` (claim + chunk_ids) — **OK**
- Parser normalizes to `chunk_ids` list only (no `chunk_id` in output) — **OK**; verifier supports `chunk_ids`
- Prompt now stresses: when evidence contains the answer (even different wording), answer and cite; do not say "evidence does not provide"; always include citations — **OK**

**Potential bugs:**
- If the model returns invalid JSON or omits `citations`, we return `draft_answer` with `citations=[]` → verifier then fails (empty citations + non-failure answer).
- Chunk IDs must match exactly the `chunk_id` in the evidence block (e.g. `2602.15019v2_c1`). Typos or extra spaces → verifier sees MISSING_CHUNK → fails.

**Connection:** Writes `draft_answer`, `citations`. Verifier reads both and `reranked_docs` to build chunk_map.

**Could cause graceful failure?** Yes. Generator saying "evidence does not provide" or citing wrong/missing chunk_id → verifier rejects.

---

## 5. app/nodes/verifier.py

**Purpose:** Check every cited claim against chunk content; set `verifier_passed` and `unsupported_claims`.

**Checks:**
- Receives `citations` (list of dicts with `claim` and `chunk_ids`) and `reranked_docs` — **OK**
- `_citation_chunk_ids(c)` supports both `chunk_id` (string) and `chunk_ids` (list) — **OK**
- Verification prompt describes multi-chunk synthesis and semantic entailment — **OK**
- User query is filtered out from `unsupported_claims` (so we don’t show the question as a claim) — **OK**
- System prompt says "fair" and "Do NOT require exact word-for-word" — **OK**
- User prompt line "Be fair but critical..." was relaxed to "Be FAIR: accept when chunks collectively support..." — **done**

**Potential bugs:**
- If the verifier LLM is conservative, it may still return `verifier_passed: false` despite fair instructions. Temperature is 0.1 — **OK**
- When `invalid_chunk_ids` is non-empty (cited chunk not in reranked_docs), we force `passed = False` and add those claims to unsupported — **correct**

**Connection:** Reads `draft_answer`, `citations`, `reranked_docs`; writes `verifier_passed`, `unsupported_claims`. Loop controller reads `verifier_passed`, `search_count`, `max_search`.

**Could cause graceful failure?** Yes. Overly strict interpretation of "support" or multi-chunk synthesis causes valid answers to be rejected.

---

## 6. app/nodes/loop_controller.py

**Purpose:** After verifier, route to END, retry (increment_search → query_optimizer), or graceful_failure.

**Checks:**
- `verifier_passed` → end — **OK**
- `search_count < max_search` → increment_search — **OK**
- Else → graceful_failure — **OK**
- `increment_search_node` returns `search_count + 1` — **OK**; LangGraph merges into state

**Fix applied:** Use `state.get("search_count", 0)` and `state.get("max_search", 2)` so missing keys don’t cause KeyError.

**Potential bugs:** None. Loop is not infinite: after `max_search` increments we hit graceful_failure.

**Connection:** Conditional edges from verifier; increment_search → query_optimizer.

**Could cause graceful failure?** Only by design: when retries are exhausted we intentionally go to graceful_failure.

---

## 7. app/graph/graph.py

**Purpose:** Define nodes and edges for the LangGraph pipeline.

**Checks:**
- Topology: START → query_optimizer → retriever → generator → verifier → conditional → end / increment_search / graceful_failure — **OK**
- increment_search → query_optimizer — **OK**
- graceful_failure → END — **OK**

**Potential bugs:** None.

**Could cause graceful failure?** No. Topology is correct.

---

## 8. main.py

**Purpose:** Invoke graph, print draft answer, citations, verifier warning, unsupported claims.

**Checks:**
- Displays multi-chunk citations (chunk_ids list or chunk_id string) — **OK**
- Shows verifier_passed via "Warning: The answer may be unreliable..." when not passed — **OK**

**Potential bugs:** When the run ends at graceful_failure, `result["draft_answer"]` is the overwritten generic message, so you never see the last generator answer. Citations and unsupported_claims are still from the last verifier run. So debugging "why did verifier fail?" requires either logging the last draft/citations before overwrite or not overwriting draft on graceful_failure (optional).

**Could cause graceful failure?** No. Display only.

---

## Root causes of "Available evidence does not sufficiently support a reliable answer"

1. **Verifier fails** on every attempt (so after max_search retries we go to graceful_failure).
2. **Why verifier fails (pick one or more):**
- **Wrong or missing chunk in retrieval:** The chunk that says "bd and s & e activities are limited to largely manual and time-consuming tasks" isn’t in top K → generator can’t cite it or says "evidence does not provide".
- **Query optimizer:** Queries don’t expand BD/S&E → retrieval doesn’t surface that chunk well → same as above.
- **Generator:** Answers with "evidence does not provide" or cites a chunk that doesn’t support the claim → verifier correctly fails.
- **Verifier too strict:** Even with the right chunk and a reasonable answer, verifier LLM returns `verifier_passed: false` (e.g. requires exact wording or doesn’t accept synthesis).

## Fixes applied in code

- **state.py:** Comments that initial_state must be complete; return plain dict (unchanged behavior).
- **query_optimizer.py:** Prompt instruction to expand abbreviations (BD, S&E, etc.) for recall.
- **retriever.py:** `RERANK_TOP_K` increased from 7 to 10.
- **generator.py:** Stronger prompt: must answer and cite when evidence contains the answer; always include citations.
- **verifier.py:** User prompt relaxed to "Be FAIR: accept when chunks collectively support...".
- **loop_controller.py:** Defensive `state.get("search_count", 0)` and `state.get("max_search", 2)`.

## Format mismatch check

- Generator outputs citations with **chunk_ids** (list) only (parser normalizes old `chunk_id` to list).
- Verifier **_citation_chunk_ids()** reads both **chunk_id** and **chunk_ids** and returns a list of chunk IDs.
- So there is **no format mismatch** between generator output and verifier input.

## Suggested next steps

1. Re-run ingest if you changed chunk size (so Chroma has 500-token chunks).
2. Run the same query and confirm query_optimizer returns at least one query with "business development" or "s & e".
3. Add temporary logging: in verifier_node, log `draft_answer`, `citations`, and verifier LLM response (or at least `passed`) so you can see why it failed.
4. Optionally: in the graceful_failure node, do *not* overwrite `draft_answer` so the final result still shows what the generator said (and you can see what was rejected).
6 changes: 4 additions & 2 deletions app/graph/graph.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@

from app.graph.state import GraphState
from app.nodes.generator import generator_node
from app.nodes.loop_controller import loop_controller
from app.nodes.loop_controller import increment_search_node, loop_controller
from app.nodes.query_optimizer import query_optimizer_node
from app.nodes.retriever import retriever_node
from app.nodes.verifier import verifier_node
Expand All @@ -20,6 +20,7 @@
builder.add_node("retriever", retriever_node)
builder.add_node("generator", generator_node)
builder.add_node("verifier", verifier_node)
builder.add_node("increment_search", increment_search_node)
builder.add_node(
"graceful_failure",
lambda state: {
Expand All @@ -38,10 +39,11 @@
loop_controller,
{
"end": END,
"query_optimizer": "query_optimizer",
"increment_search": "increment_search",
"graceful_failure": "graceful_failure",
},
)
builder.add_edge("increment_search", "query_optimizer")
builder.add_edge("graceful_failure", END)

graph = builder.compile()
27 changes: 15 additions & 12 deletions app/graph/state.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
from typing import TypedDict, List, Optional

class GraphState(TypedDict):
"""State for ReasonGraph. initial_state() must set every key so loop_controller never sees missing search_count/max_search."""
user_query: str
optimized_queries: List[str]
retrieved_docs: List[dict]
Expand All @@ -12,16 +13,18 @@ class GraphState(TypedDict):
search_count: int
max_search: int


def initial_state(user_query: str) -> GraphState:
return GraphState(
user_query=user_query,
optimized_queries=[],
retrieved_docs=[],
reranked_docs=[],
draft_answer=None,
citations=[],
verifier_passed=False,
unsupported_claims=[],
search_count=0,
max_search=3
)
"""Complete initial state. search_count and max_search are required for retry/graceful_failure routing."""
return {
"user_query": user_query,
"optimized_queries": [],
"retrieved_docs": [],
"reranked_docs": [],
"draft_answer": None,
"citations": [],
"verifier_passed": False,
"unsupported_claims": [],
"search_count": 0,
"max_search": 2,
}
Loading