Delphictunic/ReasonGraph

ReasonGraph

Enterprise-Grade Deterministic Reasoning & Citation Verification Pipeline


Abstract

ReasonGraph is a structured reasoning engine built on LangGraph that enforces citation-backed, verifiable responses over a private document corpus. Unlike prompt-only retrieval-augmented generation (RAG) systems that rely on model compliance for factual grounding, ReasonGraph enforces verification through architectural constraints. Every claim in a generated response is traced to an exact document chunk, audited by an independent verification agent, and either confirmed or flagged as unsupported. The system implements deterministic loop control, graceful failure handling, and a hybrid retrieval pipeline combining dense vector search with sparse BM25 keyword matching, reranked by a cross-encoder model. The result is a trust infrastructure suitable for enterprise QA, compliance, legal research, and internal knowledge management.


Motivation

Standard RAG systems suffer from a structural trust problem: the same model that generates an answer also decides whether that answer is grounded. This conflation of generation and verification produces systems that are fluent but not reliably factual. Prompt-level instructions to "cite your sources" are insufficient because they rely on model compliance rather than architectural enforcement.

ReasonGraph separates these concerns structurally:

  • A Generator Agent produces answers and must cite chunk IDs inline
  • A Verifier Agent independently audits every citation against the source chunk
  • A Loop Controller decides whether to retry, expand search, or fail gracefully
  • A Hybrid Retrieval Pipeline narrows the evidence pool to the most relevant chunks before generation begins

This design draws from principles in multi-agent debate, self-consistency verification, and structured reasoning systems, applying them to the practical problem of grounded enterprise document QA.


System Architecture

End-to-End Flow

User Query
    │
    ▼
┌─────────────────────┐
│   Query Optimizer   │  → Expands query into 3 optimized search variants
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│  Hybrid Retriever   │  → Vector search + BM25 + Cross-encoder reranking
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│  Generator Agent    │  → Drafts answer with inline chunk_id citations
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│  Verifier Agent     │  → Audits every claim against source chunk content
└─────────────────────┘
    │
    ▼
┌─────────────────────┐
│  Loop Controller    │  → Pass → END
│                     │  → Fail + retries remaining → Query Optimizer
│                     │  → Fail + retries exhausted → Graceful Failure
└─────────────────────┘

Phase 0 — State Initialization

Every query initializes a typed state object that flows through the entire graph:

{
  user_query:         str,
  optimized_queries:  List[str],       # populated by query optimizer
  retrieved_docs:     List[dict],      # populated by retriever (raw)
  reranked_docs:      List[dict],      # populated by retriever (reranked)
  draft_answer:       Optional[str],   # populated by generator
  citations:          List[dict],      # populated by generator
  verifier_passed:    bool,            # populated by verifier
  unsupported_claims: List[str],       # populated by verifier
  search_count:       int,             # managed by loop controller
  max_search:         int              # default: 3
}

This state object is the single source of truth for the entire pipeline. No node has side effects outside of returning partial state updates.
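A minimal sketch of how this state could be declared in Python, assuming a `TypedDict`. The class name `ReasonGraphState`, the helper `apply_update`, the `initial_state` signature, and the starting value of `search_count` (0) are illustrative assumptions, not taken from the repository:

```python
from typing import List, Optional, TypedDict


class ReasonGraphState(TypedDict):
    """Hypothetical typed state mirroring the fields listed above."""
    user_query: str
    optimized_queries: List[str]
    retrieved_docs: List[dict]
    reranked_docs: List[dict]
    draft_answer: Optional[str]
    citations: List[dict]
    verifier_passed: bool
    unsupported_claims: List[str]
    search_count: int
    max_search: int


def initial_state(user_query: str, max_search: int = 3) -> ReasonGraphState:
    """Build a fresh state for one query; nothing is populated yet."""
    return ReasonGraphState(
        user_query=user_query,
        optimized_queries=[],
        retrieved_docs=[],
        reranked_docs=[],
        draft_answer=None,
        citations=[],
        verifier_passed=False,
        unsupported_claims=[],
        search_count=0,  # assumption: the counter starts at zero
        max_search=max_search,
    )


def apply_update(state: ReasonGraphState, update: dict) -> ReasonGraphState:
    """Merge a node's partial update into a new state without mutating the old one."""
    return {**state, **update}
```

Each node would then return only the keys it owns (for example `{"draft_answer": ..., "citations": ...}` from the generator), which keeps the "no side effects" property easy to check.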


Phase 1 — Query Optimization

Node: app/nodes/query_optimizer.py

The Query Optimizer transforms the raw user query into three structured, search-optimized variants using Gemini 2.5 Flash. This improves retrieval recall by capturing different phrasings, extracting version filters and named entities, and reducing retrieval noise.

Example:

Input:

"What is the Bioptic Agent?"

Output:

[
  "Bioptic Agent drug asset scouting system architecture",
  "Bioptic Agent tree-based self-learning agentic system",
  "Bioptic Agent completeness benchmark F1 score performance"
]

If the model response cannot be parsed, the system falls back to three copies of the original query to ensure retrieval always proceeds.
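The fallback can be sketched as follows, assuming the model is asked to return a JSON array of strings (the example output above has that shape, but the actual parsing logic is not shown in this README). The function name `parse_optimized_queries` is hypothetical:

```python
import json
from typing import List


def parse_optimized_queries(raw_response: str, user_query: str, n: int = 3) -> List[str]:
    """Return n search queries from the model response.

    Falls back to n copies of the original query when the response is not
    a JSON array of exactly n non-empty strings.
    """
    fallback = [user_query] * n
    try:
        parsed = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return fallback

    is_valid = (
        isinstance(parsed, list)
        and len(parsed) == n
        and all(isinstance(q, str) and q.strip() for q in parsed)
    )
    return [q.strip() for q in parsed] if is_valid else fallback
```

Whatever the model returns, the retriever always receives exactly three queries.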


Phase 2 — Hybrid Evidence Retrieval

Node: app/nodes/retriever.py
Ingestion: app/retrieval/ingest.py

Retrieval is a three-stage pipeline:

2a. Offline Ingestion (run once)

PDFs are chunked, embedded, and stored in ChromaDB with three components per chunk:

| Component        | Purpose                           |
|------------------|-----------------------------------|
| Embedding vector | Semantic similarity search        |
| Raw text         | BM25 keyword search at query time |
| Metadata         | Citation tracing and verification |

Metadata schema per chunk:

{
  "chunk_id":    "{filename}_p{page}_c{chunk_index}",
  "source_url":  "relative/path/to/file.pdf",
  "page_number": int,
  "version":     str,   # extracted from filename pattern e.g. v1.2
  "char_count":  int
}

The chunk_id is the backbone of the entire citation and verification system. Every downstream reference — generator citations, verifier lookups — traces back to this identifier.
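A small sketch of building and parsing identifiers in this format. It assumes that `{filename}` means the file name without its extension (consistent with `arxiv_reasoning_p11_c2` in the example outputs). The helper names `make_chunk_id`, `parse_chunk_id`, and `ChunkRef` are hypothetical:

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

# Anchored at the end so file names that themselves contain "_p" still parse.
_CHUNK_ID = re.compile(r"^(?P<stem>.+)_p(?P<page>\d+)_c(?P<chunk>\d+)$")


class ChunkRef(NamedTuple):
    filename: str      # file name without extension
    page: int
    chunk_index: int


def make_chunk_id(path: str, page: int, chunk_index: int) -> str:
    """Build '{filename}_p{page}_c{chunk_index}' from a PDF path."""
    return f"{Path(path).stem}_p{page}_c{chunk_index}"


def parse_chunk_id(chunk_id: str) -> Optional[ChunkRef]:
    """Recover (filename, page, chunk_index), or None if the ID is malformed."""
    match = _CHUNK_ID.match(chunk_id)
    if match is None:
        return None
    return ChunkRef(match["stem"], int(match["page"]), int(match["chunk"]))
```

For example, `parse_chunk_id("arxiv_reasoning_p11_c2")` yields file `arxiv_reasoning`, page 11, chunk 2.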

2b. Hybrid Search at Query Time

For each of the three optimized queries:

  1. Vector Search: embeds the query using all-MiniLM-L6-v2 and retrieves the top 10 semantically similar chunks from ChromaDB. This captures conceptual matches even when the exact terms differ.

  2. BM25 Search: fetches all raw document text from ChromaDB, builds a BM25Okapi index, and scores each chunk by BM25 term weighting (term frequency and rarity, normalized by chunk length). This captures exact matches for version numbers, model names, and technical terms.

  3. Merge + Deduplicate: combines the results from both search methods, deduplicates by chunk_id, and keeps the highest score per method for each chunk.
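The merge step can be sketched in plain Python. This is an illustration under assumptions: hits are modeled as `(chunk_id, score)` pairs where a higher score is better, and the names `merge_hits`, `vector_score`, and `bm25_score` are hypothetical. The two scores are kept in separate fields because vector similarity and BM25 are on different scales and are not directly comparable:

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, float]  # (chunk_id, score); higher is assumed to be better


def merge_hits(vector_hits: Iterable[Hit], bm25_hits: Iterable[Hit]) -> List[dict]:
    """Union both result lists, deduplicated by chunk_id.

    For each chunk, keep the best score seen from each method across all
    query variants. A method that never returned the chunk stays None.
    """
    merged: Dict[str, dict] = {}
    for field, hits in (("vector_score", vector_hits), ("bm25_score", bm25_hits)):
        for chunk_id, score in hits:
            entry = merged.setdefault(
                chunk_id,
                {"chunk_id": chunk_id, "vector_score": None, "bm25_score": None},
            )
            if entry[field] is None or score > entry[field]:
                entry[field] = score
    return list(merged.values())
```

The merged list is only a candidate pool; the final ordering comes from the reranker in the next stage.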

2c. Cross-Encoder Reranking

The merged candidate pool is reranked using cross-encoder/ms-marco-MiniLM-L-6-v2. A bi-encoder embeds the query and each document separately and compares the resulting vectors; a cross-encoder instead processes each query-document pair jointly, which typically yields more accurate relevance scores at a higher compute cost per pair. The top 5 chunks are selected for the generator.
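A sketch of the top-k selection, with the scoring function injected so the example runs without downloading a model. The names `rerank`, `score_pairs`, and `toy_overlap_scorer`, and the `text` key on each document, are assumptions for illustration. In a real run, `score_pairs` would presumably wrap the `predict` method of a sentence-transformers `CrossEncoder`, which accepts a list of (query, document) pairs:

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[List[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, docs: List[dict], score_pairs: ScoreFn, top_k: int = 5) -> List[dict]:
    """Score every (query, chunk text) pair jointly and keep the top_k chunks."""
    if not docs:
        return []
    pairs = [(query, doc["text"]) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [{**doc, "rerank_score": float(score)} for doc, score in ranked[:top_k]]


def toy_overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer for the example only: counts shared lowercase words."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]
```

Injecting the scorer also makes the selection logic testable in isolation from the model.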


Phase 3 — Generator Agent

Node: app/nodes/generator.py

The Generator receives the top 5 reranked evidence chunks and is instructed to:

  1. Answer the user query using only the provided evidence
  2. Cite the chunk_id inline for every factual claim
  3. Return a structured JSON output

Output format:

{
  "answer": "The Bioptic Agent achieves an F1-score of 0.797 [arxiv_p11_c2]...",
  "citations": [
    { "claim": "F1-score of 0.797", "chunk_id": "arxiv_p11_c2" }
  ]
}

The structured citation format is what makes downstream verification possible. Vague references like "according to the document" do not satisfy the output schema: the model must provide a specific chunk_id for every claim, and any claim without one is flagged by the verifier.
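One way to enforce this schema on the generator's raw response is sketched below. The function names `parse_generator_output` and `inline_chunk_ids` are hypothetical, and the bracket-matching rule is an assumption based on the example outputs, where several chunk IDs may share one pair of brackets:

```python
import json
import re
from typing import List, Tuple

_INLINE_CITATION = re.compile(r"\[([^\[\]]+)\]")


def parse_generator_output(raw: str) -> Tuple[str, List[dict]]:
    """Parse the generator's JSON; raise ValueError if the schema is violated."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("generator output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("generator output must be a JSON object")

    answer, citations = data.get("answer"), data.get("citations")
    if not isinstance(answer, str) or not isinstance(citations, list):
        raise ValueError("expected 'answer' (str) and 'citations' (list)")
    for item in citations:
        if not (
            isinstance(item, dict)
            and isinstance(item.get("claim"), str)
            and isinstance(item.get("chunk_id"), str)
        ):
            raise ValueError("each citation needs string 'claim' and 'chunk_id'")
    return answer, citations


def inline_chunk_ids(answer: str) -> List[str]:
    """Chunk IDs cited inline, in order of first appearance.

    Handles grouped citations such as '[id_a, id_b]'.
    """
    seen: List[str] = []
    for group in _INLINE_CITATION.findall(answer):
        for chunk_id in (part.strip() for part in group.split(",")):
            if chunk_id and chunk_id not in seen:
                seen.append(chunk_id)
    return seen
```

Comparing `inline_chunk_ids(answer)` with the `chunk_id` values in `citations` gives a cheap consistency check before any verification call.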


Phase 4 — Verifier Agent

Node: app/nodes/verifier.py

The Verifier is the trust layer of the pipeline. It operates with a distinct system prompt and low temperature (0.1) to minimize shared reasoning bias with the Generator.

System identity:

"You are a strict evidence auditor. You are skeptical by default. You do not infer, assume, or give benefit of the doubt. A claim is only supported if the cited chunk explicitly contains the information. You are not the answer generator. You are its auditor."

Verification checks:

  1. Does the cited chunk_id exist in the evidence pool?
  2. Does the chunk content explicitly support the claim semantically?
  3. Are there factual claims in the answer not covered by any citation?

Output:

{
  "verifier_passed": true,
  "unsupported_claims": [],
  "confidence": 0.94
}

Invalid chunk_id references are caught deterministically before the model call — these are treated as immediate failures without requiring LLM judgment.
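The deterministic pre-check can be sketched as a set lookup. The function names `find_invalid_citations` and `precheck` are hypothetical, and reporting the offending claims through `unsupported_claims` is an assumption about how the failure is surfaced:

```python
from typing import Iterable, List, Optional


def find_invalid_citations(citations: List[dict], evidence: Iterable[dict]) -> List[dict]:
    """Return citations whose chunk_id is not in the evidence pool."""
    known_ids = {doc["chunk_id"] for doc in evidence}
    return [c for c in citations if c.get("chunk_id") not in known_ids]


def precheck(citations: List[dict], evidence: Iterable[dict]) -> Optional[dict]:
    """Fail fast on unknown chunk IDs.

    Returns a verifier-style state update when any citation is invalid,
    or None when the LLM-based semantic check should run.
    """
    invalid = find_invalid_citations(citations, evidence)
    if not invalid:
        return None
    return {
        "verifier_passed": False,
        "unsupported_claims": [c.get("claim", "") for c in invalid],
    }
```

Because this step needs no model call, a fabricated chunk ID costs nothing to reject.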

Why separation matters: When a single prompt both generates and judges an answer, the model tends to verify its own output favorably (a self-consistency bias). ReasonGraph runs both agents on the same underlying model (Gemini 2.5 Flash), so the separation is behavioral rather than model-level: the distinct system prompt and temperature setting keep the Verifier from simply re-running the Generator's reasoning.


Phase 5 — Loop Controller

Node: app/nodes/loop_controller.py

The Loop Controller is a pure conditional routing function with no LLM calls:

IF verifier_passed:
    → END

IF NOT passed AND search_count < max_search:
    search_count += 1
    → query_optimizer (retry with expanded search)

IF NOT passed AND search_count >= max_search:
    → graceful_failure

This enforces three critical properties:

  • No infinite loops: the retry count has a hard ceiling (max_search)
  • Predictable cost: the maximum number of LLM calls per query is bounded
  • Bounded latency: the worst-case number of pipeline passes is fixed, so worst-case execution time can be estimated
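The routing rules above can be sketched as a pure function that returns the next node together with a partial state update. The function names and the node-name constants are assumptions, as is the starting value of `search_count` (0); how the actual node is wired into LangGraph is not shown here:

```python
from typing import Tuple

END = "END"
QUERY_OPTIMIZER = "query_optimizer"
GRACEFUL_FAILURE = "graceful_failure"


def route(state: dict) -> Tuple[str, dict]:
    """Decide the next node from state alone; no LLM call, no side effects."""
    if state["verifier_passed"]:
        return END, {}
    if state["search_count"] < state["max_search"]:
        return QUERY_OPTIMIZER, {"search_count": state["search_count"] + 1}
    return GRACEFUL_FAILURE, {}


def worst_case_passes(max_search: int, start_count: int = 0) -> int:
    """Count pipeline passes when verification fails every time."""
    state = {"verifier_passed": False, "search_count": start_count, "max_search": max_search}
    passes = 1  # the initial pass
    while True:
        next_node, update = route(state)
        if next_node != QUERY_OPTIMIZER:
            return passes
        state.update(update)
        passes += 1
```

If the counter starts at 0, the rules as written allow one initial pass plus `max_search` retries, so `worst_case_passes(3)` is 4. Multiplying that by the LLM calls per pass gives the cost ceiling.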

Phase 6 — Graceful Failure

When evidence is exhausted and verification still fails, the system responds:

"Available evidence does not sufficiently support a reliable answer."

This is a first-class architectural outcome, not an error condition. It builds user trust by refusing to hallucinate rather than generating a confident but unsupported answer.


Repository Structure

reasongraph/
├── main.py                          # Entry point
├── docs/                            # PDF source documents
├── chroma_db/                       # Persistent vector store (auto-created)
├── app/
│   ├── graph/
│   │   ├── state.py                 # Typed state definition + initial_state()
│   │   └── graph.py                 # LangGraph graph topology
│   ├── nodes/
│   │   ├── query_optimizer.py       # Query expansion via Gemini
│   │   ├── retriever.py             # Hybrid retrieval + reranking
│   │   ├── generator.py             # Citation-backed answer generation
│   │   ├── verifier.py              # Independent claim verification
│   │   └── loop_controller.py       # Deterministic routing logic
│   └── retrieval/
│       └── ingest.py                # Offline PDF ingestion pipeline
├── requirements.txt
└── .gitignore

Key Design Principles

1. Separation of Concerns

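Generation and verification are structurally decoupled. The Verifier runs as a separate node with its own system prompt and audits the Generator's output against the same evidence pool, so the Generator's only channel into the audit is the answer and citations it submits.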

2. Citation as Architecture

Citations are not a stylistic feature; they are a structural requirement enforced by the output schema. The Verifier checks each claim against its cited chunk and flags any claim without a citation, so the Generator cannot skip citations without failing verification.

3. Deterministic Loop Control

Retry behavior is governed by a pure function with no LLM involvement. The system cannot loop indefinitely regardless of model behavior.

4. Metadata-Driven Traceability

Every chunk is stored with a deterministic chunk_id that encodes its origin. Any claim in any response can be traced back to an exact page and position in a source document.

5. Defense-in-Depth Verification

Verification operates at two levels: deterministic (chunk_id existence check) and semantic (LLM-based claim support check). Invalid references are caught before the model call.


Models & Dependencies

| Component           | Model / Library                                            |
|---------------------|------------------------------------------------------------|
| Query Optimization  | Gemini 2.5 Flash                                           |
| Embedding           | all-MiniLM-L6-v2 (sentence-transformers)                   |
| Keyword Search      | BM25Okapi (rank-bm25)                                      |
| Reranking           | cross-encoder/ms-marco-MiniLM-L-6-v2                       |
| Generation          | Gemini 2.5 Flash                                           |
| Verification        | Gemini 2.5 Flash (temperature=0.1, distinct system prompt) |
| Vector Store        | ChromaDB (persistent, local)                               |
| Graph Orchestration | LangGraph                                                  |
| PDF Parsing         | pypdf                                                      |

Setup & Usage

Installation

git clone <repo>
cd reasongraph
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Configuration

export GEMINI_API_KEY="your-api-key-here"

Ingest Documents

Place PDFs in the docs/ directory, then run:

python app/retrieval/ingest.py

Re-running ingestion on existing documents is safe — the pipeline uses upsert semantics.

Query

python main.py "Your question here"

Verify ChromaDB Contents

python -c "
import chromadb
client = chromadb.PersistentClient(path='./chroma_db')
col = client.get_collection('reasongraph_docs')
print('Total chunks:', col.count())
"

Output Format

{answer text with inline [chunk_id] citations}

[chunk_id] → claim text
[chunk_id] → claim text

Warning: The answer may be unreliable (verification did not pass).  # only if failed

Unsupported claims:                                                  # only if present
  - claim text
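A sketch of a formatter that produces this layout from the final state. The function name `render_response` is hypothetical, and the state keys are those from Phase 0:

```python
def render_response(state: dict) -> str:
    """Render the answer, its citation list, and any verification warnings."""
    lines = [state["draft_answer"] or "", ""]
    for citation in state["citations"]:
        lines.append(f"[{citation['chunk_id']}] → {citation['claim']}")

    if not state["verifier_passed"]:
        lines += ["", "Warning: The answer may be unreliable (verification did not pass)."]

    if state["unsupported_claims"]:
        lines += ["", "Unsupported claims:"]
        lines += [f"  - {claim}" for claim in state["unsupported_claims"]]

    return "\n".join(lines)
```

The warning and the unsupported-claims section appear only when the corresponding state fields call for them, matching the "only if" notes above.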

Example Outputs

Query: "What is the Bioptic Agent and how does it work?"

The Bioptic Agent is a specialized scouting scaffold and a tree-based self-learning 
agent designed for complete and non-hallucinated drug asset scouting 
[arxiv_reasoning_p10_c0, arxiv_reasoning_p1_c0]. It operates as a multilingual, 
multi-agent system and leverages GPT-5.2 as its underlying model 
[arxiv_reasoning_p1_c0, arxiv_reasoning_p11_c2].

The Bioptic Agent's functionality is driven by its scaffold architecture and 
tree-structured exploration [arxiv_reasoning_p11_c3, arxiv_reasoning_p11_c0]. 
It executes directives in parallel, merges their artifacts into a single pool, 
and uses this merged pool to generate the next batch of directives for the 
subsequent epoch [arxiv_reasoning_p11_c2].

[arxiv_reasoning_p10_c0] → The Bioptic Agent is a specialized scouting scaffold 
                            and a tree-based self-learning agent.
[arxiv_reasoning_p1_c0]  → It operates as a multilingual, multi-agent system.
[arxiv_reasoning_p11_c2] → All directives are executed in parallel, their artifacts 
                            are merged into a single pool.
[arxiv_reasoning_p11_c3] → Its scaffold architecture is the primary driver of 
                            improved completeness rather than model strength alone.

Query: "What F1 score did Bioptic Agent achieve and how did it compare to Claude?"

The Bioptic Agent achieved an F1-score of 0.797 [arxiv_reasoning_p11_c2]. 
This substantially outperformed Claude Opus 4.6, which had an F1-score 
of 0.562 [arxiv_reasoning_p11_c2].

[arxiv_reasoning_p11_c2] → The Bioptic Agent achieved an F1-score of 0.797.
[arxiv_reasoning_p11_c2] → Claude Opus 4.6 had an F1-score of 0.562.
[arxiv_reasoning_p11_c2] → Bioptic Agent substantially outperformed all tested 
                            state-of-the-art search agents.

Each chunk_id encodes its exact origin — arxiv_reasoning_p11_c2 maps to file arxiv_reasoning.pdf, page 11, chunk 2. Every claim is independently verified against that chunk before the response is returned.


Future Work

  • Partial verification scoring — allow answers where a subset of claims are verified to pass with a confidence threshold rather than binary pass/fail
  • Evidence scoring thresholds — reject retrieval results below a minimum reranker score before generation
  • Observability layer — structured logging of search counts, verifier confidence, retry events, and latency per node
  • Multi-hop reasoning — chain multiple retrieval-generation cycles for queries requiring evidence synthesis across documents
  • Hallucination benchmarking — evaluate system performance against a golden dataset of known document contents and expected answers
  • API layer — FastAPI wrapper exposing the graph as a REST endpoint
  • Streaming responses — incremental output as generation and verification proceed

Status

v0 — Control plane validated end-to-end. Intelligence layers (LLM + vector DB) integrated and operational. Verified on academic PDF corpus with traceable citations and independent claim verification.


Design Philosophy

ReasonGraph is not a chatbot. It is a reasoning engine with memory, control, verification, and loop safety. The goal is not to generate fluent text — it is to generate trustworthy text, where trust is enforced structurally rather than assumed from model capability.

In high-stakes environments — legal, compliance, medical, financial — the cost of a confident hallucination exceeds the cost of an honest "I don't know." ReasonGraph is designed around that asymmetry.
