Fork of VectifyAI/PageIndex with Azure OpenAI support and enhanced MCTS-based retrieval.
- Full Azure OpenAI integration alongside standard OpenAI
- Environment-based configuration (no code changes needed to switch)
- GPT-5/o1/o3 model compatibility fixes (temperature, max_completion_tokens, tiktoken)
- Monte Carlo Tree Search for intelligent document exploration
- Multi-document support - search across multiple PDFs simultaneously
- UCB1-based exploration/exploitation balancing
- Iterative relevance scoring with backpropagation
- Handles large documents without context window overflow
- No PageIndex cloud API required
- All processing happens locally with your Azure/OpenAI credentials
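The model compatibility fixes listed above can be pictured with a minimal sketch. It assumes, as the feature list suggests, that GPT-5/o1/o3-style models expect `max_completion_tokens` and reject a custom `temperature`. The helper name `build_chat_kwargs` and the prefix list are hypothetical and are not taken from this fork's source.

```python
# Hypothetical helper for illustration; not the fork's actual code.
# Assumed naming convention for models that need the newer parameters.
REASONING_PREFIXES = ("o1", "o3", "gpt-5")


def build_chat_kwargs(model: str, max_tokens: int, temperature: float = 0.0) -> dict:
    """Build keyword arguments for a chat completion request."""
    kwargs = {"model": model}
    if model.lower().startswith(REASONING_PREFIXES):
        # Assumption: these models take max_completion_tokens and only
        # accept the default temperature, so temperature is omitted.
        kwargs["max_completion_tokens"] = max_tokens
    else:
        kwargs["max_tokens"] = max_tokens
        kwargs["temperature"] = temperature
    return kwargs


print(build_chat_kwargs("gpt-5", 1024))
print(build_chat_kwargs("gpt-4o", 1024, temperature=0.2))
```

With Azure, the deployment name may differ from the underlying model name, so a real check would likely need to look at the model behind the deployment rather than the name alone.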
PageIndex is a vectorless, reasoning-based RAG system that builds a hierarchical tree index from documents and uses LLMs to reason over that index for retrieval.
Key Features:
- No Vector DB: Uses document structure and LLM reasoning, not vector similarity
- No Chunking: Documents organized into natural sections
- Human-like Retrieval: Simulates how experts navigate complex documents
- Explainable: Traceable reasoning with page/section references
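To make the idea of a hierarchical tree index concrete, here is a small sketch that walks a nested structure and prints each section with its page range. The field names (`title`, `start_index`, `end_index`, `nodes`) and the sample tree are assumptions made for illustration; check a generated structure file for the actual schema.

```python
from typing import Iterator, Tuple


def iter_sections(node: dict, depth: int = 0) -> Iterator[Tuple[int, str, int, int]]:
    """Yield (depth, title, start_page, end_page) for a node and its descendants."""
    yield depth, node["title"], node["start_index"], node["end_index"]
    for child in node.get("nodes", []):
        yield from iter_sections(child, depth + 1)


# Hypothetical tree; real structures are produced by run_pageindex.py.
example_tree = {
    "title": "Annual Report",
    "start_index": 1,
    "end_index": 40,
    "nodes": [
        {"title": "Financial Statements", "start_index": 5, "end_index": 20},
        {"title": "Risk Factors", "start_index": 21, "end_index": 40},
    ],
}

for depth, title, start, end in iter_sections(example_tree):
    print("  " * depth + f"{title} (pages {start}-{end})")
```

Because every node carries its own page range, an answer can be traced back to the section and pages it came from.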
pip install -r requirements.txt
Copy .env.example to .env and configure:
# For Azure OpenAI
AZURE_OPENAI_API_KEY=your-azure-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-12-01-preview
AZURE_OPENAI_DEPLOYMENT=gpt-4o # or gpt-5, etc.
# Or for standard OpenAI
OPENAI_API_KEY=sk-your-openai-key
python run_pageindex.py --pdf_path /path/to/document.pdf
Output saved to results/<document>_structure.json
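As a rough illustration of environment-based switching, the sketch below picks a provider from the variables in the `.env` example. The function name `resolve_provider`, the returned dictionary shape, and the rule that Azure settings take priority when both are present are assumptions, not a description of how `pageindex/utils.py` actually behaves.

```python
import os
from typing import Mapping, Optional


def resolve_provider(env: Optional[Mapping[str, str]] = None) -> dict:
    """Choose Azure OpenAI or standard OpenAI settings from environment variables."""
    env = os.environ if env is None else env
    # Assumption: Azure wins when both sets of credentials are present.
    if env.get("AZURE_OPENAI_API_KEY") and env.get("AZURE_OPENAI_ENDPOINT"):
        return {
            "provider": "azure",
            "api_key": env["AZURE_OPENAI_API_KEY"],
            "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
            "api_version": env.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview"),
            "deployment": env.get("AZURE_OPENAI_DEPLOYMENT"),
        }
    if env.get("OPENAI_API_KEY"):
        return {"provider": "openai", "api_key": env["OPENAI_API_KEY"]}
    raise RuntimeError("Set AZURE_OPENAI_* or OPENAI_API_KEY in .env")


# Passing a plain dict keeps the example independent of the real environment.
print(resolve_provider({"OPENAI_API_KEY": "sk-your-openai-key"})["provider"])
```

The resulting settings could then be passed to whichever client constructor matches the chosen provider.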
# Single document
python cookbook/mcts_rag.py \
-s results/document_structure.json \
-p path/to/document.pdf \
-q "Your question here" \
-v
# Multiple documents
python cookbook/mcts_rag.py \
-s doc1_structure.json -p doc1.pdf \
-s doc2_structure.json -p doc2.pdf \
-q "Question across all docs" \
-v
# Interactive mode
python cookbook/mcts_rag.py \
-s results/document_structure.json \
-p path/to/document.pdf \
-i
pageindex/
├── pageindex/ # Core library (Azure-enhanced)
│ ├── utils.py # LLM utilities with Azure support
│ ├── page_index.py # Structure generation
│ └── config.yaml # Default settings
├── cookbook/
│ ├── mcts_rag.py # 🔥 MCTS-based RAG (main tool)
│ └── local_RAG_azure.ipynb # Jupyter notebook alternative
├── run_pageindex.py # Structure generation CLI
├── results/ # Generated structures
├── tests/pdfs/ # Sample documents
└── tutorials/ # Documentation
Usage: python cookbook/mcts_rag.py [options]
Required:
-s, --structure Path to structure JSON (can specify multiple)
-p, --pdf Path to PDF file (can specify multiple; order must match the -s arguments)
Query:
-q, --query Question to ask
-i, --interactive Start interactive mode
Options:
-v, --verbose Show detailed search progress
--iterations N Max MCTS iterations (default: 20)
--exploration F UCB1 exploration weight (default: 1.414)
--threshold F Relevance threshold 0-1 (default: 0.6)
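The `--exploration` weight corresponds to the constant in the UCB1 formula, which trades off a node's average relevance against how rarely it has been visited. The following is a generic sketch of UCB1 selection and score backpropagation over a section tree; the `Node` class and function names are hypothetical, and `cookbook/mcts_rag.py` may differ in its details.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    title: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_score: float = 0.0


def ucb1(node: Node, exploration: float = 1.414) -> float:
    """Average score plus an exploration bonus; unvisited nodes come first."""
    if node.visits == 0:
        return math.inf
    parent_visits = node.parent.visits if node.parent else node.visits
    exploit = node.total_score / node.visits
    explore = exploration * math.sqrt(math.log(parent_visits) / node.visits)
    return exploit + explore


def select_child(parent: Node, exploration: float = 1.414) -> Node:
    """Pick the child with the highest UCB1 value."""
    return max(parent.children, key=lambda child: ucb1(child, exploration))


def backpropagate(node: Optional[Node], score: float) -> None:
    """Add a relevance score to the node and every ancestor up to the root."""
    while node is not None:
        node.visits += 1
        node.total_score += score
        node = node.parent


root = Node("root")
a = Node("Revenue", parent=root)
b = Node("Risk factors", parent=root)
root.children = [a, b]

backpropagate(a, 0.9)        # pretend the LLM scored section a at 0.9
first = select_child(root)   # b is still unvisited, so it is tried next
backpropagate(b, 0.2)
second = select_child(root)  # a now has the higher UCB1 value
print(first.title, "->", second.title)
```

A larger exploration weight pushes the search toward rarely visited sections, while a smaller one concentrates on sections that have already scored well.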
| Aspect | Simple RAG | MCTS RAG |
|---|---|---|
| Selection | Single LLM call | Iterative exploration |
| Strategy | Pick all relevant nodes | UCB1 explore/exploit |
| Multi-doc | Limited | ✅ Designed for it |
| LLM Calls | 2-3 | 10-30 (configurable) |
| Best for | Simple queries | Complex, multi-section queries |

| Component | Original | This Fork |
|---|---|---|
| OpenAI Client | Standard only | Azure + Standard |
| Model Config | Hardcoded | Environment variables |
| Retrieval | Cloud API or basic | MCTS-based local |
| GPT-5 Support | ❌ | ✅ Full compatibility |
| Multi-document | Via cloud API | Local MCTS |
Apache 2.0 (same as original) - see LICENSE
Original project: VectifyAI/PageIndex
