🚀 RAG PDF System with Gemini AI

A Retrieval-Augmented Generation (RAG) system for processing PDFs with OCR, embeddings, and Q&A, built on Gemini AI, the Milvus vector database, and LangChain.

📋 Table of Contents

✨ Features
🏗️ System Architecture
⚡ Quick Start
📦 Installation
🔧 Configuration
🎯 Usage
🤖 Gemini Multi-Model
📝 Logging
🧪 Testing
📚 Documentation
🛠️ Troubleshooting

✨ Features

🔍 PDF Processing

✅ Extract text from PDFs with pdfplumber
✅ OCR images with Gemini Vision or EasyOCR
✅ Detect and extract tables
✅ Support for multi-page PDFs

🧠 AI & RAG

✅ Multi-model Gemini with auto-fallback (2.0 Flash → 1.5 Flash → 1.5 Flash 8B)
✅ Automatic multi-key rotation when a key runs out of quota
✅ Vector embeddings with SentenceTransformer
✅ Semantic search with the Milvus vector database
✅ Context expansion for more accurate answers
✅ Ollama fallback for local inference

📊 Data Pipeline

✅ Sentence-boundary-aware chunking (NLTK)
✅ Export to Markdown
✅ Automatic sync into Milvus
✅ Detailed logging across the whole pipeline

💬 Q&A Application

✅ Interactive Q&A with streaming
✅ Source references shown with each answer (PDF pages)
✅ Retry and fallback on failures
✅ Token counting to stay within the input limit

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────┐
│                      RAG PDF SYSTEM                         │
└─────────────────────────────────────────────────────────────┘

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   PDF File   │────▶│  read_pdf.py │────▶│ export_md.py │
└──────────────┘     └──────────────┘     └──────────────┘
                            │                      │
                     ┌──────▼──────┐              │
                     │ Gemini OCR  │              │
                     │ EasyOCR     │              │
                     └─────────────┘              │
                                                  │
                     ┌────────────────────────────▼────┐
                     │      Markdown Output            │
                     └────────────────┬────────────────┘
                                      │
                     ┌────────────────▼────────────────┐
                     │   populate_milvus.py            │
                     │   - Chunking (NLTK)             │
                     │   - Embedding (SentenceTransf.) │
                     │   - Insert to Milvus            │
                     └────────────────┬────────────────┘
                                      │
                     ┌────────────────▼────────────────┐
                     │      Milvus Vector DB           │
                     │      (IVF_FLAT Index)           │
                     └────────────────┬────────────────┘
                                      │
                     ┌────────────────▼────────────────┐
                     │         qa_app.py               │
                     │   - User Query                  │
                     │   - Semantic Search             │
                     │   - Context Expansion           │
                     │   - LLM Generation (Gemini)     │
                     └─────────────────────────────────┘

📁 Project Structure

RAG_pdf_new/
├── src/                      # 📄 Core Python modules
│   ├── __init__.py
│   ├── config.py            # 🔧 Configuration
│   ├── gemini_client.py     # 🤖 Gemini API client
│   ├── read_pdf.py          # 📖 PDF extraction
│   ├── export_md.py         # 📝 Markdown export
│   ├── populate_milvus.py   # 📊 ETL pipeline
│   ├── milvus.py            # 🗄️ Vector database
│   ├── llm_handler.py       # 🧠 LLM abstraction
│   ├── qa_app.py            # 💬 Q&A application
│   └── logging_config.py    # 📝 Logging setup
│
├── tests/                    # 🧪 Test files
│   ├── test_gemini_client.py    # Unit tests
│   └── test_gemini_setup.py     # Integration tests
│
├── docs/                     # 📚 Documentation
│   ├── GETTING_STARTED.md
│   ├── QUICK_START_GEMINI.md
│   ├── GEMINI_MODELS.md
│   ├── TESTING.md
│   ├── PROJECT_STRUCTURE.md
│   └── ...
│
├── data/                     # 📁 Data files
│   ├── pdfs/                # 📄 Input PDF files
│   └── outputs/             # 📝 Generated Markdown files
│
├── .env                      # 🔐 Environment variables (create this)
├── .gitignore
├── requirements.txt
└── README.md

📖 Details: see docs/PROJECT_STRUCTURE.md for what each file does.

⚡ Quick Start

1️⃣ Install dependencies

pip install -r requirements.txt

2️⃣ Configure API keys

Create a .env file:

# Gemini API keys (at least 2-3 keys for better uptime)
GEMINI_API_KEY_1=AIzaSy...your_key_here
GEMINI_API_KEY_2=AIzaSy...your_key_here
GEMINI_API_KEY_3=AIzaSy...your_key_here

# Milvus Configuration
MILVUS_HOST=localhost
MILVUS_PORT=19530

💡 Get a free API key: https://aistudio.google.com/

3️⃣ Configure the PDF path

Put your PDF in the data/pdfs/ folder and edit src/config.py:

PDF_PATH = "data/pdfs/your_document.pdf"

4️⃣ Run the pipeline

# Step 1: Sync the PDF into Milvus
python -m src.populate_milvus

# Step 2: Run the Q&A app
python -m src.qa_app

📦 Installation

System requirements

Python: 3.9+
Milvus: 2.3.0+ (Docker or standalone)
RAM: 4 GB minimum (8 GB recommended)
GPU: Optional (for EasyOCR and embeddings)

Full installation

# Clone repository
git clone https://github.com/Klein1411/RAG_pdf_new.git
cd RAG_pdf_new

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Download NLTK data
python -c "import nltk; nltk.download('punkt'); nltk.download('punkt_tab')"

Installing Milvus (Docker)

# Pull Milvus image
docker pull milvusdb/milvus:latest

# Run Milvus standalone
docker run -d --name milvus_standalone \
  -p 19530:19530 -p 9091:9091 \
  -v milvus_data:/var/lib/milvus \
  milvusdb/milvus:latest

🔧 Configuration

File: `src/config.py`

# PDF path
PDF_PATH = "path/to/your/document.pdf"

# Embedding model
EMBEDDING_MODEL_NAME = 'paraphrase-multilingual-mpnet-base-v2'
EMBEDDING_DIM = 768

# Milvus
COLLECTION_NAME = "pdf_rag_collection"

# Gemini models (in priority order)
GEMINI_MODELS = [
    "gemini-2.5-flash",      # Primary model (newest)
    "gemini-2.0-flash-exp",  # Fallback 1
    "gemini-1.5-flash",      # Fallback 2
    "gemini-1.5-flash-8b"    # Fallback 3
]

# Gemini token limit
GEMINI_INPUT_TOKEN_LIMIT = 1000000  # 1M tokens

# Ollama (if using local models)
OLLAMA_API_URL = "http://localhost:11434/api/generate"
OLLAMA_MODELS = ["llama3:latest"]
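
The token limit above is what keeps prompts from growing too large. As a rough sketch of how retrieved chunks might be trimmed to fit that budget: the `estimate_tokens` heuristic (about 4 characters per token) and the `fit_to_budget` helper are illustrative assumptions, and the project's own token counting may work differently.

```python
from typing import List

GEMINI_INPUT_TOKEN_LIMIT = 1_000_000  # same value as in config.py


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption, not an exact count)."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks: List[str], question: str,
                  limit: int = GEMINI_INPUT_TOKEN_LIMIT) -> List[str]:
    """Keep chunks in ranked order until adding one more would exceed the limit."""
    used = estimate_tokens(question)
    kept: List[str] = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > limit:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Chunks are assumed to arrive sorted by relevance, so the least relevant ones are dropped first.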

File: `.env`

# Gemini API keys (add more keys to improve uptime)
GEMINI_API_KEY_1=AIzaSyXXXXXXXXXXXXXXXXXXXXXXXXXXXX
GEMINI_API_KEY_2=AIzaSyYYYYYYYYYYYYYYYYYYYYYYYYYYYY
GEMINI_API_KEY_3=AIzaSyZZZZZZZZZZZZZZZZZZZZZZZZZZZZ

# Milvus (if not running on localhost)
MILVUS_HOST=localhost
MILVUS_PORT=19530
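
A minimal sketch of how numbered keys like these could be collected at startup. The `load_gemini_keys` helper is hypothetical, and reading consecutive `GEMINI_API_KEY_n` variables until the first gap is an assumption about the naming scheme, not a description of what `gemini_client.py` does.

```python
import os
from typing import List, Mapping, Optional


def load_gemini_keys(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Collect GEMINI_API_KEY_1, GEMINI_API_KEY_2, ... until the first missing index."""
    source = os.environ if env is None else env
    keys: List[str] = []
    index = 1
    while True:
        value = source.get(f"GEMINI_API_KEY_{index}", "").strip()
        if not value:
            break  # stop at the first gap in the numbering
        keys.append(value)
        index += 1
    return keys
```

The values in .env must already be present in the process environment for this to find them, for example by calling python-dotenv's `load_dotenv()` first.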

🎯 Usage

1. Extract the PDF and generate Markdown

python -m src.export_md

Output: data/outputs/document.md

2. Sync data into Milvus

python -m src.populate_milvus

Steps:

✅ Read the Markdown file (generated automatically if it does not exist yet)
✅ Chunk the text with NLTK
✅ Create embeddings
✅ Insert into the Milvus collection
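
The chunking step can be pictured roughly as follows. This is a sketch, not the code in `populate_milvus.py`: the regex-based `split_sentences` is a standard-library stand-in for NLTK's `sent_tokenize`, and the `max_chars` size limit is an assumed parameter.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Stand-in splitter; nltk.sent_tokenize(text) would replace this in practice."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_by_sentence(text: str, max_chars: int = 500) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only ever break between sentences, no sentence is cut in half before it is embedded.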

3. Run the Q&A application

python -m src.qa_app

Features:

💬 Ask questions in Vietnamese or English
📄 Source references shown with each answer (PDF pages)
🔄 Automatic retry and fallback
🚀 Context expansion for more accurate answers
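
The README does not spell out how context expansion works. One common reading, assumed here, is that each chunk returned by semantic search is joined by its neighbouring chunks before the prompt is built. The `expand_context` helper and its `window` parameter are hypothetical:

```python
from typing import List, Sequence


def expand_context(hit_indices: Sequence[int], chunks: Sequence[str],
                   window: int = 1) -> List[str]:
    """Return each hit chunk plus `window` neighbours on both sides, in document order."""
    selected = set()
    for idx in hit_indices:
        start = max(0, idx - window)
        end = min(len(chunks), idx + window + 1)
        selected.update(range(start, end))  # overlapping windows are merged
    return [chunks[i] for i in sorted(selected)]
```

Returning chunks in document order, rather than in ranking order, keeps the surrounding text readable for the model.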

Example:

❓ Your question: What is machine learning?

✅ Answer:
Machine learning is a branch of artificial intelligence...

Sources: Artificial-Intelligence.pdf (Page 3, Page 4)

4. Test PDF extraction

python -m src.read_pdf

Choose an option:

1: Gemini Vision (fast, accurate)
2: Manual extraction + OCR (slower)

🤖 Gemini Multi-Model

Auto-Fallback System

Request → Model 2.0 Flash (Key 1,2,3)
            ↓ fail
          Model 1.5 Flash (Key 1,2,3)
            ↓ fail
          Model 1.5 Flash 8B (Key 1,2,3)
            ↓ fail
          Error
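
The diagram above amounts to a nested loop: every key is tried for one model before moving to the next model. A minimal sketch under that reading, where `generate_with_fallback` and the injected `call` function are hypothetical names rather than the real `gemini_client.py` API:

```python
from typing import Callable, Optional, Sequence


def generate_with_fallback(prompt: str,
                           models: Sequence[str],
                           keys: Sequence[str],
                           call: Callable[[str, str, str], str]) -> str:
    """Try every key for a model before moving on to the next model."""
    last_error: Optional[Exception] = None
    for model in models:
        for key in keys:
            try:
                return call(model, key, prompt)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
    # Every model/key combination failed: surface the last underlying error
    raise RuntimeError("All models and keys failed") from last_error
```

Passing `call` in as a parameter keeps the loop independent of any SDK, which also makes it easy to unit-test with a fake.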

Model comparison

Model	Speed	Accuracy	Token Limit	Status
2.0 Flash Exp	⚡⚡⚡	🎯🎯🎯	1M	Experimental
1.5 Flash	⚡⚡	🎯🎯🎯	1M	Stable
1.5 Flash 8B	⚡⚡⚡⚡	🎯🎯	1M	Stable (Fast)

Test Setup

python -m tests.test_gemini_setup

Details: docs/QUICK_START_GEMINI.md

📝 Logging

The whole system uses Python's logging module with centralized configuration.

Configuring the logging level

File: src/logging_config.py

from src.logging_config import get_logger

logger = get_logger(__name__)
logger.info("System information")
logger.warning("Warning")
logger.error("Serious error")

Log levels

DEBUG: Technical details (API calls, intermediate values)
INFO: Main progress information ✅
WARNING: Non-critical warnings ⚠️
ERROR: Serious errors ❌
CRITICAL: Extremely serious errors 🚨

Viewing logs

Logs are written to the console in this format:

2025-10-17 13:01:01,287 - gemini_client - INFO - ✅ Request thành công
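
A console logger that produces lines in this shape can be sketched with the standard library alone. This is not the project's `logging_config.py`; the format string below is inferred from the sample line above, and the `level` parameter is an assumption.

```python
import logging

# Inferred from the sample output: timestamp - logger name - level - message
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a console logger using LOG_FORMAT; repeated calls add no duplicate handlers."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    return logger
```

Passing `level=logging.DEBUG` would surface the API-call details described under DEBUG.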

🧪 Testing

Running unit tests

# All tests
pytest tests/test_gemini_client.py -v

# With coverage
pytest tests/test_gemini_client.py -v --cov=src.gemini_client --cov-report=html

# A specific test
pytest tests/test_gemini_client.py::TestKeyRotation -v

Test coverage

✅ GeminiClient initialization (4 tests)
✅ Key rotation (2 tests)
✅ Content generation (4 tests)
✅ Token counting (2 tests)
✅ Edge cases (2 tests)

Target coverage: > 90%

Quick setup test

python -m tests.test_gemini_setup

Details: docs/TESTING.md

📚 Documentation

📖 Detailed guides

File	Reading time	Description
docs/GETTING_STARTED.md	5 min	🚀 Ultra-short quick start
docs/QUICK_START_GEMINI.md	10 min	🤖 Gemini API setup guide
docs/GEMINI_MODELS.md	15 min	🔧 Details on multi-model fallback
docs/TESTING.md	10 min	🧪 Testing and coverage guide
docs/PROJECT_STRUCTURE.md	15 min	📂 Details on the project structure

📁 Project structure

📖 Full details: docs/PROJECT_STRUCTURE.md

RAG_pdf_new/
├── 📄 src/ (core Python files)
│   ├── config.py                 # Centralized configuration
│   ├── gemini_client.py          # Gemini client with multi-model support
│   ├── logging_config.py         # Logging configuration
│   ├── read_pdf.py               # PDF extraction & OCR
│   ├── export_md.py              # Export to Markdown
│   ├── populate_milvus.py        # ETL pipeline into Milvus
│   ├── milvus.py                 # Milvus connection & collection
│   ├── llm_handler.py            # LLM abstraction (Gemini/Ollama)
│   └── qa_app.py                 # Q&A application
│
├── 🧪 tests/
│   ├── test_gemini_client.py     # Unit tests (pytest + mocking)
│   └── test_gemini_setup.py      # Integration tests (real API)
│
├── 📚 docs/                      # Documentation
│   ├── GETTING_STARTED.md        # 5-minute quick start
│   ├── QUICK_START_GEMINI.md     # Gemini setup guide
│   ├── GEMINI_MODELS.md          # Multi-model fallback
│   ├── TESTING.md                # Testing guide
│   └── PROJECT_STRUCTURE.md      # Structure details
│
└── 📝 Configuration
    ├── .env                      # API keys (do not commit)
    ├── requirements.txt          # Python dependencies
    └── .gitignore                # Git ignore rules

🛠️ Troubleshooting

❌ "Không tìm thấy biến môi trường GEMINI_API_KEY" (GEMINI_API_KEY environment variable not found)

Solution:

Create a .env file in the project root
Add GEMINI_API_KEY_1=your_key_here
Restart your terminal/IDE

❌ "Tất cả các API key đều đã hết quota" (all API keys have run out of quota)

Solution:

Add more API keys to .env
Wait for the quota to reset (usually 24 hours)
Check your quota at: https://aistudio.google.com/

❌ "Không thể kết nối đến Milvus" (cannot connect to Milvus)

Solution:

Check that Milvus is running: docker ps
Check port 19530: netstat -an | findstr 19530 (Windows) or netstat -an | grep 19530 (Linux/Mac)
Restart Milvus: docker restart milvus_standalone
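
The port check can also be done from Python, which behaves the same on every platform. This is a small sketch using only the standard library; `is_port_open` is a hypothetical helper, and a successful connection only shows that something is listening on the port, not that Milvus is healthy.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 19530,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

Calling `is_port_open()` with no arguments checks the default `MILVUS_HOST` and `MILVUS_PORT` values from .env.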

❌ "Model không tồn tại" (model does not exist)

Solution:

Experimental models may be withdrawn
The system automatically falls back to a stable model
Update GEMINI_MODELS in config.py

❌ "CUDA out of memory"

Solution:

Reduce the batch size when encoding
Use the CPU: device='cpu'
Use a lighter embedding model (EMBEDDING_MODEL_NAME in config.py, keeping EMBEDDING_DIM in sync); Gemini models run remotely and do not use local GPU memory

❌ "PDF không có text" (the PDF contains no text)

Solution:

The PDF may be a scanned image
Choose option 1 (Gemini Vision) when running read_pdf.py
Alternatively, option 2 applies OCR automatically

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a new branch: git checkout -b feature/amazing-feature
Commit your changes: git commit -m 'Add amazing feature'
Push: git push origin feature/amazing-feature
Open a Pull Request

📄 License

This project uses the following open-source libraries:

LangChain (MIT)
Milvus (Apache 2.0)
SentenceTransformers (Apache 2.0)
Google Generative AI (Google Terms)

📞 Support

🐛 Issues: GitHub Issues
📧 Email: klein1411@example.com
💬 Discussions: GitHub Discussions

🌟 Acknowledgments

Google Gemini AI - Powerful multimodal AI
Milvus - High-performance vector database
LangChain - LLM application framework
SentenceTransformers - State-of-the-art embeddings

📈 Roadmap

Web UI with Streamlit/Gradio
Multi-PDF support
Cloud deployment (AWS/GCP)
Advanced RAG techniques (HyDE, Query Expansion)
Multi-language support
Document comparison features
Export answers to PDF/DOCX

⭐ If you find this project useful, don't forget to star the repo! ⭐

Made with ❤️ by Klein1411
