mtelang/pageindex

PageIndex Fork with Azure OpenAI & MCTS RAG

Fork of VectifyAI/PageIndex with Azure OpenAI support and enhanced MCTS-based retrieval.

🔥 What's New in This Fork

Azure OpenAI Support

  • Full Azure OpenAI integration alongside standard OpenAI
  • Environment-based configuration (no code changes needed to switch)
  • GPT-5/o1/o3 model compatibility fixes (temperature, max_completion_tokens, tiktoken)
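As a rough illustration of what environment-based switching and the model compatibility fixes could look like, here is a minimal sketch. It is not the fork's actual code in pageindex/utils.py: the function names `resolve_provider` and `completion_params`, the prefix list, and the choice to prefer Azure when both sets of credentials are present are all assumptions. The environment variable names are the ones shown in the Quick Start, and the parameter handling reflects the general rule that reasoning-style models take max_completion_tokens instead of max_tokens and do not accept a custom temperature.

```python
import os

# Assumption: name prefixes used to recognise reasoning-style models.
REASONING_PREFIXES = ("o1", "o3", "gpt-5")


def resolve_provider(env=None):
    """Choose Azure OpenAI or standard OpenAI from environment variables.

    Hypothetical helper: Azure is preferred when both are configured.
    """
    env = os.environ if env is None else env
    if env.get("AZURE_OPENAI_API_KEY") and env.get("AZURE_OPENAI_ENDPOINT"):
        return {
            "provider": "azure",
            "api_key": env["AZURE_OPENAI_API_KEY"],
            "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
            "api_version": env.get("AZURE_OPENAI_API_VERSION"),
            "deployment": env.get("AZURE_OPENAI_DEPLOYMENT"),
        }
    if env.get("OPENAI_API_KEY"):
        return {"provider": "openai", "api_key": env["OPENAI_API_KEY"]}
    raise RuntimeError("No Azure OpenAI or OpenAI credentials found")


def completion_params(model, max_tokens=1024, temperature=0.0):
    """Build request parameters that suit the model family."""
    if model.lower().startswith(REASONING_PREFIXES):
        # Reasoning models: use max_completion_tokens, omit temperature.
        return {"model": model, "max_completion_tokens": max_tokens}
    return {"model": model, "max_tokens": max_tokens, "temperature": temperature}


print(completion_params("o3-mini"))
print(completion_params("gpt-4o"))
```

Keeping the provider decision in one function means the rest of the code only ever sees a plain settings dictionary, which is what lets a switch happen without code changes.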

MCTS-Based RAG (cookbook/mcts_rag.py)

  • Monte Carlo Tree Search for intelligent document exploration
  • Multi-document support - search across multiple PDFs simultaneously
  • UCB1-based exploration/exploitation balancing
  • Iterative relevance scoring with backpropagation
  • Handles large documents without context window overflow
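The search loop described above can be sketched in a few lines. This is a simplified illustration, not the code in cookbook/mcts_rag.py: the `Node` class, the function names, and the keyword-based `keyword_score` stub are assumptions, and in the real tool the relevance score would come from an LLM call. Because the document tree already exists, the sketch has no expansion step; it only selects a leaf with UCB1, scores it, and backpropagates the score.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    title: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_score: float = 0.0

    def add(self, title: str) -> "Node":
        child = Node(title, parent=self)
        self.children.append(child)
        return child

    @property
    def mean_score(self) -> float:
        return self.total_score / self.visits if self.visits else 0.0


def ucb1(node: Node, exploration: float = 1.414) -> float:
    """Mean relevance plus an exploration bonus; unvisited nodes go first."""
    if node.visits == 0:
        return math.inf
    bonus = exploration * math.sqrt(math.log(node.parent.visits) / node.visits)
    return node.mean_score + bonus


def select(root: Node, exploration: float) -> Node:
    """Descend from the root, taking the child with the highest UCB1 value."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: ucb1(c, exploration))
    return node


def backpropagate(node: Optional[Node], score: float) -> None:
    """Add the score to the leaf and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.total_score += score
        node = node.parent


def search(root: Node, score_fn: Callable[[Node], float],
           iterations: int = 20, exploration: float = 1.414) -> None:
    for _ in range(iterations):
        leaf = select(root, exploration)
        backpropagate(leaf, score_fn(leaf))


def relevant_leaves(root: Node, threshold: float = 0.6) -> List[Node]:
    """Visited leaves whose mean relevance reaches the threshold."""
    stack, hits = [root], []
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        elif node.visits and node.mean_score >= threshold:
            hits.append(node)
    return sorted(hits, key=lambda n: n.mean_score, reverse=True)


def keyword_score(node: Node) -> float:
    # Stand-in for an LLM relevance judgment (hypothetical).
    return 0.9 if "revenue" in node.title.lower() else 0.1


root = Node("Annual report")
finance = root.add("Financial results")
finance.add("Revenue by segment")
finance.add("Operating costs")
root.add("Legal notices")

search(root, keyword_score, iterations=20)
print([leaf.title for leaf in relevant_leaves(root)])
```

Only the selected leaf is scored in each iteration, so the amount of text sent to the model per call stays small regardless of document size.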

Local-Only Operation

  • No PageIndex cloud API required
  • All processing happens locally with your Azure/OpenAI credentials

📑 About PageIndex

PageIndex is a vectorless, reasoning-based RAG system that builds a hierarchical tree index from documents and uses LLMs to reason over that index for retrieval.

Key Features:

  • No Vector DB: Uses document structure and LLM reasoning, not vector similarity
  • No Chunking: Documents organized into natural sections
  • Human-like Retrieval: Simulates how experts navigate complex documents
  • Explainable: Traceable reasoning with page/section references
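To make the idea of a hierarchical tree index concrete, the following sketch walks a small hand-written tree and prints an outline with page ranges. The field names (`title`, `start_index`, `end_index`, `nodes`) and the sample data are assumptions chosen for illustration; check a generated structure file for the exact schema.

```python
def iter_sections(nodes, depth=0, path=()):
    """Yield (depth, path, start_page, end_page) for every node, depth-first."""
    for node in nodes:
        here = path + (node["title"],)
        yield depth, here, node.get("start_index"), node.get("end_index")
        yield from iter_sections(node.get("nodes", []), depth + 1, here)


# Hypothetical index for a short report.
toc = [
    {
        "title": "Financial results",
        "start_index": 3,
        "end_index": 10,
        "nodes": [
            {"title": "Revenue by segment", "start_index": 3, "end_index": 6},
            {"title": "Operating costs", "start_index": 7, "end_index": 10},
        ],
    },
    {"title": "Legal notices", "start_index": 11, "end_index": 12},
]

for depth, path, start, end in iter_sections(toc):
    print("  " * depth + f"{path[-1]} (pages {start}-{end})")
```

The path tuple carried alongside each node is what allows an answer to cite a section such as "Financial results > Revenue by segment" together with its pages.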

⚙️ Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Configure Environment

Copy .env.example to .env and configure:

# For Azure OpenAI
AZURE_OPENAI_API_KEY=your-azure-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-12-01-preview
AZURE_OPENAI_DEPLOYMENT=gpt-4o  # or gpt-5, etc.

# Or for standard OpenAI
OPENAI_API_KEY=sk-your-openai-key

3. Generate Document Structure

python run_pageindex.py --pdf_path /path/to/document.pdf

The generated structure is saved to results/<document>_structure.json

4. Query with MCTS RAG

# Single document
python cookbook/mcts_rag.py \
  -s results/document_structure.json \
  -p path/to/document.pdf \
  -q "Your question here" \
  -v

# Multiple documents
python cookbook/mcts_rag.py \
  -s doc1_structure.json -p doc1.pdf \
  -s doc2_structure.json -p doc2.pdf \
  -q "Question across all docs" \
  -v

# Interactive mode
python cookbook/mcts_rag.py \
  -s results/document_structure.json \
  -p path/to/document.pdf \
  -i

📁 Project Structure

pageindex/
├── pageindex/              # Core library (Azure-enhanced)
│   ├── utils.py            # LLM utilities with Azure support
│   ├── page_index.py       # Structure generation
│   └── config.yaml         # Default settings
├── cookbook/
│   ├── mcts_rag.py         # 🔥 MCTS-based RAG (main tool)
│   └── local_RAG_azure.ipynb  # Jupyter notebook alternative
├── run_pageindex.py        # Structure generation CLI
├── results/                # Generated structures
├── tests/pdfs/             # Sample documents
└── tutorials/              # Documentation

🔧 MCTS RAG Options

Usage: python cookbook/mcts_rag.py [options]

Required:
  -s, --structure   Path to structure JSON (can specify multiple)
  -p, --pdf         Path to PDF file (must match structure order)

Query:
  -q, --query       Question to ask
  -i, --interactive Start interactive mode

Options:
  -v, --verbose     Show detailed search progress
  --iterations N    Max MCTS iterations (default: 20)
  --exploration F   UCB1 exploration weight (default: 1.414)
  --threshold F     Relevance threshold 0-1 (default: 0.6)
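For readers wondering how repeated -s and -p flags can be paired up, here is one way such a command line could be declared with argparse. It is a sketch under assumptions, not the parser in cookbook/mcts_rag.py: the `build_parser` and `pair_documents` helpers are hypothetical, while the flag names and defaults are taken from the usage text above.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="mcts_rag.py")
    # action="append" collects every occurrence of a flag into a list.
    parser.add_argument("-s", "--structure", action="append", required=True,
                        help="Path to structure JSON (repeatable)")
    parser.add_argument("-p", "--pdf", action="append", required=True,
                        help="Path to PDF file (same order as --structure)")
    parser.add_argument("-q", "--query", help="Question to ask")
    parser.add_argument("-i", "--interactive", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("--iterations", type=int, default=20)
    parser.add_argument("--exploration", type=float, default=1.414)
    parser.add_argument("--threshold", type=float, default=0.6)
    return parser


def pair_documents(args):
    """Match each structure file with the PDF given in the same position."""
    if len(args.structure) != len(args.pdf):
        raise ValueError("Each --structure needs a matching --pdf")
    return list(zip(args.structure, args.pdf))


args = build_parser().parse_args([
    "-s", "doc1_structure.json", "-p", "doc1.pdf",
    "-s", "doc2_structure.json", "-p", "doc2.pdf",
    "-q", "Question across all docs",
])
print(pair_documents(args))
print(args.iterations, args.exploration, args.threshold)
```

Pairing by position is why the PDFs must be listed in the same order as their structure files.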

🆚 MCTS vs Simple RAG

| Aspect    | Simple RAG              | MCTS RAG                      |
|-----------|-------------------------|-------------------------------|
| Selection | Single LLM call         | Iterative exploration         |
| Strategy  | Pick all relevant nodes | UCB1 explore/exploit          |
| Multi-doc | Limited                 | ✅ Designed for it            |
| LLM Calls | 2-3                     | 10-30 (configurable)          |
| Best for  | Simple queries          | Complex, multi-section queries |

📝 Changes from Original

| Component      | Original           | This Fork             |
|----------------|--------------------|-----------------------|
| OpenAI Client  | Standard only      | Azure + Standard      |
| Model Config   | Hardcoded          | Environment variables |
| Retrieval      | Cloud API or basic | MCTS-based local      |
| GPT-5 Support  |                    | ✅ Full compatibility |
| Multi-document | Via cloud API      | Local MCTS            |

📜 License

Apache 2.0 (same as original) - see LICENSE

Original project: VectifyAI/PageIndex


🔗 Resources
