Enterprise-Grade Deterministic Reasoning & Citation Verification Pipeline
ReasonGraph is a structured reasoning engine built on LangGraph that enforces citation-backed, verifiable responses over a private document corpus. Unlike prompt-only retrieval-augmented generation (RAG) systems that rely on model compliance for factual grounding, ReasonGraph enforces verification through architectural constraints. Every claim in a generated response is traced to an exact document chunk, audited by an independent verification agent, and either confirmed or flagged as unsupported. The system implements deterministic loop control, graceful failure handling, and a hybrid retrieval pipeline combining dense vector search with sparse BM25 keyword matching, reranked by a cross-encoder model. The result is a trust infrastructure suitable for enterprise QA, compliance, legal research, and internal knowledge management.
Standard RAG systems suffer from a structural trust problem: the same model that generates an answer also decides whether that answer is grounded. This conflation of generation and verification produces systems that are fluent but not reliably factual. Prompt-level instructions to "cite your sources" are insufficient because they rely on model compliance rather than architectural enforcement.
ReasonGraph separates these concerns structurally:
- A Generator Agent produces answers and must cite chunk IDs inline
- A Verifier Agent independently audits every citation against the source chunk
- A Loop Controller decides whether to retry, expand search, or fail gracefully
- A Hybrid Retrieval Pipeline ensures the evidence pool is maximally relevant before generation begins
This design draws from principles in multi-agent debate, self-consistency verification, and structured reasoning systems, applying them to the practical problem of grounded enterprise document QA.
```
User Query
     │
     ▼
┌─────────────────────┐
│   Query Optimizer   │ → Expands query into 3 optimized search variants
└─────────────────────┘
     │
     ▼
┌─────────────────────┐
│   Hybrid Retriever  │ → Vector search + BM25 + Cross-encoder reranking
└─────────────────────┘
     │
     ▼
┌─────────────────────┐
│   Generator Agent   │ → Drafts answer with inline chunk_id citations
└─────────────────────┘
     │
     ▼
┌─────────────────────┐
│   Verifier Agent    │ → Audits every claim against source chunk content
└─────────────────────┘
     │
     ▼
┌─────────────────────┐
│   Loop Controller   │ → Pass → END
│                     │ → Fail + retries remaining → Query Optimizer
│                     │ → Fail + retries exhausted → Graceful Failure
└─────────────────────┘
```
Every query initializes a typed state object that flows through the entire graph:
```python
{
    user_query: str,
    optimized_queries: List[str],   # populated by query optimizer
    retrieved_docs: List[dict],     # populated by retriever (raw)
    reranked_docs: List[dict],      # populated by retriever (reranked)
    draft_answer: Optional[str],    # populated by generator
    citations: List[dict],          # populated by generator
    verifier_passed: bool,          # populated by verifier
    unsupported_claims: List[str],  # populated by verifier
    search_count: int,              # managed by loop controller
    max_search: int                 # default: 3
}
```

This state object is the single source of truth for the entire pipeline. No node has side effects outside of returning partial state updates.
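The schema above maps naturally onto a LangGraph-compatible `TypedDict`. Here is a minimal sketch; field names follow the schema, but the exact `initial_state()` signature in `app/graph/state.py` is an assumption:

```python
from typing import List, Optional, TypedDict


class ReasonGraphState(TypedDict):
    user_query: str
    optimized_queries: List[str]
    retrieved_docs: List[dict]
    reranked_docs: List[dict]
    draft_answer: Optional[str]
    citations: List[dict]
    verifier_passed: bool
    unsupported_claims: List[str]
    search_count: int
    max_search: int


def initial_state(user_query: str) -> ReasonGraphState:
    # Every query starts from the same empty state; nodes return partial updates.
    return ReasonGraphState(
        user_query=user_query,
        optimized_queries=[],
        retrieved_docs=[],
        reranked_docs=[],
        draft_answer=None,
        citations=[],
        verifier_passed=False,
        unsupported_claims=[],
        search_count=0,
        max_search=3,
    )
```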
Node: app/nodes/query_optimizer.py
The Query Optimizer transforms the raw user query into three structured, search-optimized variants using Gemini 2.5 Flash. This improves retrieval recall by capturing different phrasings, extracting version filters and named entities, and reducing retrieval noise.
Example:

Input:

```
"What is the Bioptic Agent?"
```

Output:

```json
[
    "Bioptic Agent drug asset scouting system architecture",
    "Bioptic Agent tree-based self-learning agentic system",
    "Bioptic Agent completeness benchmark F1 score performance"
]
```

If the model response cannot be parsed, the system falls back to three copies of the original query to ensure retrieval always proceeds.
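A minimal sketch of that fallback behavior, assuming the optimizer asks the model for a JSON array of query strings (the actual prompt and parsing in `app/nodes/query_optimizer.py` may differ):

```python
import json


def parse_query_variants(raw_response: str, original_query: str, n: int = 3) -> list[str]:
    """Parse the optimizer's response; fall back to the original query on failure."""
    try:
        variants = json.loads(raw_response)
        if (isinstance(variants, list)
                and len(variants) == n
                and all(isinstance(v, str) for v in variants)):
            return variants
    except json.JSONDecodeError:
        pass
    # Fallback: n copies of the original query, so retrieval always proceeds.
    return [original_query] * n
```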
Node: app/nodes/retriever.py
Ingestion: app/retrieval/ingest.py
Retrieval is a three-stage pipeline:
PDFs are chunked, embedded, and stored in ChromaDB with three components per chunk:
| Component | Purpose |
|---|---|
| Embedding vector | Semantic similarity search |
| Raw text | BM25 keyword search at query time |
| Metadata | Citation tracing and verification |
Metadata schema per chunk:
```
{
    "chunk_id": "{file_stem}_p{page}_c{chunk_index}",
    "source_url": "relative/path/to/file.pdf",
    "page_number": int,
    "version": str,        # extracted from filename pattern, e.g. v1.2
    "char_count": int
}
```

The chunk_id is the backbone of the entire citation and verification system. Every downstream reference — generator citations, verifier lookups — traces back to this identifier.
For each of the three optimized queries:

1. Vector Search — embeds the query using all-MiniLM-L6-v2 and retrieves the top 10 semantically similar chunks from ChromaDB. Captures conceptual matches even when exact terms differ.
2. BM25 Search — fetches all raw document text from ChromaDB, builds a BM25Okapi index, and scores each chunk by keyword overlap. Captures exact matches for version numbers, model names, and technical terms.
3. Merge + Deduplicate — combines results from both search methods, deduplicates by chunk_id, and preserves the highest score from each method per chunk.
The merged candidate pool is reranked using cross-encoder/ms-marco-MiniLM-L-6-v2. Unlike bi-encoder models that score query and document independently, cross-encoders process the query-document pair jointly, producing significantly more accurate relevance scores. The top 5 chunks are selected for the generator.
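The merge-and-deduplicate step can be sketched as follows (score key names like `vector_score` and `bm25_score` are assumptions, not necessarily the keys used in `app/nodes/retriever.py`):

```python
def merge_candidates(vector_hits: list[dict], bm25_hits: list[dict]) -> list[dict]:
    """Union both result sets, dedupe by chunk_id, keep each method's best score."""
    merged: dict[str, dict] = {}
    for score_key, hits in (("vector_score", vector_hits), ("bm25_score", bm25_hits)):
        for hit in hits:
            entry = merged.setdefault(
                hit["chunk_id"], {"chunk_id": hit["chunk_id"], "text": hit["text"]}
            )
            # Preserve the highest score seen from this method for this chunk.
            entry[score_key] = max(entry.get(score_key, float("-inf")), hit["score"])
    return list(merged.values())
```

The deduplicated pool then goes to the cross-encoder, which assigns the final relevance order.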
Node: app/nodes/generator.py
The Generator receives the top 5 reranked evidence chunks and is instructed to:

- Answer the user query using only the provided evidence
- Cite the chunk_id inline for every factual claim
- Return a structured JSON output
Output format:

```json
{
    "answer": "The Bioptic Agent achieves an F1-score of 0.797 [arxiv_p11_c2]...",
    "citations": [
        { "claim": "F1-score of 0.797", "chunk_id": "arxiv_p11_c2" }
    ]
}
```

The structured citation format is what makes downstream verification possible. Vague references like "according to the document" are architecturally impossible — the model must provide a specific chunk_id for every claim, or the verifier will flag it.
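Because the citation format is structured, downstream checks can operate on it mechanically. For example, a hypothetical helper to extract the inline chunk IDs from a draft answer:

```python
import re


def inline_citation_ids(answer: str) -> list[str]:
    # Matches inline markers like "[arxiv_p11_c2]" or "[a_p1_c0, a_p2_c1]".
    ids: list[str] = []
    for group in re.findall(r"\[([^\]]+)\]", answer):
        ids.extend(part.strip() for part in group.split(","))
    return ids
```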
Node: app/nodes/verifier.py
The Verifier is the trust layer of the pipeline. It operates with a distinct system prompt and low temperature (0.1) to minimize shared reasoning bias with the Generator.
System identity:
"You are a strict evidence auditor. You are skeptical by default. You do not infer, assume, or give benefit of the doubt. A claim is only supported if the cited chunk explicitly contains the information. You are not the answer generator. You are its auditor."
Verification checks:

- Does the cited chunk_id exist in the evidence pool?
- Does the chunk content explicitly support the claim semantically?
- Are there factual claims in the answer not covered by any citation?
Output:

```json
{
    "verifier_passed": true,
    "unsupported_claims": [],
    "confidence": 0.94
}
```

Invalid chunk_id references are caught deterministically before the model call — these are treated as immediate failures without requiring LLM judgment.
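The deterministic pre-check reduces to set membership. A sketch (the function name is illustrative):

```python
def invalid_citations(citations: list[dict], evidence: list[dict]) -> list[str]:
    """Return the claims whose chunk_id is absent from the evidence pool.

    Runs before any LLM call; a non-empty result is an immediate failure.
    """
    known_ids = {chunk["chunk_id"] for chunk in evidence}
    return [c["claim"] for c in citations if c["chunk_id"] not in known_ids]
```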
Why separate models matter: Using the same model for generation and verification creates a self-consistency bias — the model tends to verify its own outputs favorably. The distinct system prompt and temperature setting create a behavioral separation between the two agents even when operating on the same underlying model.
Node: app/nodes/loop_controller.py
The Loop Controller is a pure conditional routing function with no LLM calls:
```
IF verifier_passed:
    → END
IF NOT passed AND search_count < max_search:
    search_count += 1
    → query_optimizer   (retry with expanded search)
IF NOT passed AND search_count >= max_search:
    → graceful_failure
```
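As a LangGraph conditional edge, this routing reduces to a pure function over the state (node names follow the diagram above; the exact wiring in `app/nodes/loop_controller.py` is an assumption):

```python
def route(state: dict) -> str:
    # Pure routing: no LLM calls, no side effects, hard retry ceiling.
    if state["verifier_passed"]:
        return "END"
    if state["search_count"] < state["max_search"]:
        return "query_optimizer"   # retry with expanded search
    return "graceful_failure"
```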
This enforces three critical properties:
- No infinite loops — hard ceiling on retry count
- Predictable cost — maximum LLM calls is bounded
- Deterministic latency — worst-case execution time is calculable
When evidence is exhausted and verification still fails, the system responds:
"Available evidence does not sufficiently support a reliable answer."
This is a first-class architectural outcome, not an error condition. It builds user trust by refusing to hallucinate rather than generating a confident but unsupported answer.
```
reasongraph/
├── main.py                      # Entry point
├── docs/                        # PDF source documents
├── chroma_db/                   # Persistent vector store (auto-created)
├── app/
│   ├── graph/
│   │   ├── state.py             # Typed state definition + initial_state()
│   │   └── graph.py             # LangGraph graph topology
│   ├── nodes/
│   │   ├── query_optimizer.py   # Query expansion via Gemini
│   │   ├── retriever.py         # Hybrid retrieval + reranking
│   │   ├── generator.py         # Citation-backed answer generation
│   │   ├── verifier.py          # Independent claim verification
│   │   └── loop_controller.py   # Deterministic routing logic
│   └── retrieval/
│       └── ingest.py            # Offline PDF ingestion pipeline
├── requirements.txt
└── .gitignore
```
Generation and verification are structurally decoupled. The Generator cannot influence the Verifier's judgment — they operate independently on the same evidence pool.
Citations are not a stylistic feature — they are a structural requirement enforced by the output schema. The Verifier cannot function without them, which means the Generator cannot bypass them.
Retry behavior is governed by a pure function with no LLM involvement. The system cannot loop indefinitely regardless of model behavior.
Every chunk is stored with a deterministic chunk_id that encodes its origin. Any claim in any response can be traced back to an exact page and position in a source document.
Verification operates at two levels: deterministic (chunk_id existence check) and semantic (LLM-based claim support check). Invalid references are caught before the model call.
| Component | Model / Library |
|---|---|
| Query Optimization | Gemini 2.5 Flash |
| Embedding | all-MiniLM-L6-v2 (sentence-transformers) |
| Keyword Search | BM25Okapi (rank-bm25) |
| Reranking | cross-encoder/ms-marco-MiniLM-L-6-v2 |
| Generation | Gemini 2.5 Flash |
| Verification | Gemini 2.5 Flash (temperature=0.1, distinct system prompt) |
| Vector Store | ChromaDB (persistent, local) |
| Graph Orchestration | LangGraph |
| PDF Parsing | pypdf |
```shell
git clone <repo>
cd reasongraph
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
export GEMINI_API_KEY="your-api-key-here"
```

Place PDFs in the docs/ directory, then run:

```shell
python app/retrieval/ingest.py
```

Re-running ingestion on existing documents is safe — the pipeline uses upsert semantics.

```shell
python main.py "Your question here"
```

To inspect the vector store:

```shell
python -c "
import chromadb
client = chromadb.PersistentClient(path='./chroma_db')
col = client.get_collection('reasongraph_docs')
print('Total chunks:', col.count())
"
```

Response format:

```
{answer text with inline [chunk_id] citations}

[chunk_id] → claim text
[chunk_id] → claim text

Warning: The answer may be unreliable (verification did not pass).   # only if failed
Unsupported claims:                                                  # only if present
- claim text
```
Query: "What is the Bioptic Agent and how does it work?"

```
The Bioptic Agent is a specialized scouting scaffold and a tree-based self-learning
agent designed for complete and non-hallucinated drug asset scouting
[arxiv_reasoning_p10_c0, arxiv_reasoning_p1_c0]. It operates as a multilingual,
multi-agent system and leverages GPT-5.2 as its underlying model
[arxiv_reasoning_p1_c0, arxiv_reasoning_p11_c2].

The Bioptic Agent's functionality is driven by its scaffold architecture and
tree-structured exploration [arxiv_reasoning_p11_c3, arxiv_reasoning_p11_c0].
It executes directives in parallel, merges their artifacts into a single pool,
and uses this merged pool to generate the next batch of directives for the
subsequent epoch [arxiv_reasoning_p11_c2].

[arxiv_reasoning_p10_c0] → The Bioptic Agent is a specialized scouting scaffold
and a tree-based self-learning agent.
[arxiv_reasoning_p1_c0] → It operates as a multilingual, multi-agent system.
[arxiv_reasoning_p11_c2] → All directives are executed in parallel, their artifacts
are merged into a single pool.
[arxiv_reasoning_p11_c3] → Its scaffold architecture is the primary driver of
improved completeness rather than model strength alone.
```
Query: "What F1 score did Bioptic Agent achieve and how did it compare to Claude?"

```
The Bioptic Agent achieved an F1-score of 0.797 [arxiv_reasoning_p11_c2].
This substantially outperformed Claude Opus 4.6, which had an F1-score
of 0.562 [arxiv_reasoning_p11_c2].

[arxiv_reasoning_p11_c2] → The Bioptic Agent achieved an F1-score of 0.797.
[arxiv_reasoning_p11_c2] → Claude Opus 4.6 had an F1-score of 0.562.
[arxiv_reasoning_p11_c2] → Bioptic Agent substantially outperformed all tested
state-of-the-art search agents.
```
Each chunk_id encodes its exact origin — arxiv_reasoning_p11_c2 maps to file arxiv_reasoning.pdf, page 11, chunk 2. Every claim is independently verified against that chunk before the response is returned.
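That mapping can also be inverted mechanically. A hypothetical helper (assumes the source is a .pdf and the ID follows the `{stem}_p{page}_c{chunk}` pattern):

```python
import re


def decode_chunk_id(chunk_id: str) -> tuple[str, int, int]:
    # "arxiv_reasoning_p11_c2" -> ("arxiv_reasoning.pdf", 11, 2)
    match = re.fullmatch(r"(.+)_p(\d+)_c(\d+)", chunk_id)
    if match is None:
        raise ValueError(f"malformed chunk_id: {chunk_id!r}")
    stem, page, chunk = match.groups()
    return f"{stem}.pdf", int(page), int(chunk)
```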
- Partial verification scoring — allow answers where a subset of claims are verified to pass with a confidence threshold rather than binary pass/fail
- Evidence scoring thresholds — reject retrieval results below a minimum reranker score before generation
- Observability layer — structured logging of search counts, verifier confidence, retry events, and latency per node
- Multi-hop reasoning — chain multiple retrieval-generation cycles for queries requiring evidence synthesis across documents
- Hallucination benchmarking — evaluate system performance against a golden dataset of known document contents and expected answers
- API layer — FastAPI wrapper exposing the graph as a REST endpoint
- Streaming responses — incremental output as generation and verification proceed
v0 — Control plane validated end-to-end. Intelligence layers (LLM + vector DB) integrated and operational. Verified on academic PDF corpus with traceable citations and independent claim verification.
ReasonGraph is not a chatbot. It is a reasoning engine with memory, control, verification, and loop safety. The goal is not to generate fluent text — it is to generate trustworthy text, where trust is enforced structurally rather than assumed from model capability.
In high-stakes environments — legal, compliance, medical, financial — the cost of a confident hallucination exceeds the cost of an honest "I don't know." ReasonGraph is designed around that asymmetry.