A tutorial on finetuning Vision-Language Models to generate better search queries through trajectory learning.
- Introduction
- The Core Idea
- Architecture Overview
- Step 1: Collecting Trajectories
- Step 2: Processing & Filtering
- Step 3: Formatting Training Data
- Step 4: LoRA Finetuning
- Step 5: Evaluation
- Results & Analysis
- Appendix: Infrastructure Setup
Document retrieval is a critical component of Retrieval-Augmented Generation (RAG) systems. However, turning natural language questions into effective search queries remains challenging—especially for sparse (keyword-based) retrieval systems like BM25 or Whoosh.
This tutorial demonstrates how to finetune a Vision-Language Model (VLM) to generate better search queries by learning from its own successful search trajectories. The key insight is that we can use the normalized rank of ground truth documents as a reward signal to train the model to produce more effective keywords.
- Sparse search is fast and interpretable but struggles with vocabulary mismatch
- VLMs can understand documents visually and reason about content
- Iterative refinement allows models to learn from their search successes and failures
- Python 3.10+
- CUDA-capable GPU (8×A100 recommended for training, smaller GPUs work for inference)
- Access to a document QA dataset with ground truth locations
- vLLM for efficient model serving
The method is inspired by STaR-SQL: Self-Taught Reasoner for Text-to-SQL, adapted for multimodal document retrieval.
Given a question like:
"What is the maintenance interval for the Apache helicopter's rotor assembly?"
We need to generate search keywords that will retrieve the correct document page from a corpus. A naive approach might search for "apache helicopter maintenance", but the correct page might use different terminology like "AH-64 rotor service schedule".
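To make the mismatch concrete, here is a toy BM25 demonstration (a sketch using the `rank-bm25` package; the two-document corpus is invented):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "Service schedule for the AH-64 rotor assembly: inspect every 250 flight hours",  # the GT page
    "Apache cockpit avionics overview",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

# The naive query shares no terms with the GT page, so the page scores 0.
print(bm25.get_scores("apache helicopter maintenance".split()))
# A query in the page's own vocabulary ranks the GT page first.
print(bm25.get_scores("ah-64 rotor service schedule".split()))
```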
- Run a VLM agent that iteratively searches and refines queries
- Track which queries successfully retrieve the ground truth document
- Train the model to generate the successful queries directly
The key innovation is the normalized rank score:
rank_score = (top_k - rank + 1) / top_k
Here rank is the position of the ground truth document in the search results; if it falls outside the top k, the score is 0.
| Rank | Score (k=3) | Interpretation |
|---|---|---|
| 1 | 1.00 | Perfect hit |
| 2 | 0.67 | Good |
| 3 | 0.33 | Found but low |
| >3 | 0.00 | Not found |
This score serves as the reward signal for training.
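In code, the score is a one-liner (a minimal sketch; the function name is ours):

```python
def rank_score(rank: int | None, top_k: int = 3) -> float:
    """Normalized rank reward: 1.0 for a top hit, decaying linearly to
    0.0 when the ground truth is outside the top-k or absent (rank=None)."""
    if rank is None or rank > top_k:
        return 0.0
    return (top_k - rank + 1) / top_k
```

For example, `rank_score(2)` returns 0.67, matching the table above.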
┌─────────────────────────────────────────────────────────────────────────┐
│ DRAG Training Pipeline │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ 1. COLLECT │ │ 2. PROCESS │ │ 3. FORMAT │ │
│ │ Trajectories │───►│ & Filter │───►│ Training Data │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ trajectories.jsonl processed.jsonl training_data/ │
│ - Full reasoning - GT in top-k only - sft_*.jsonl │
│ - Search queries - Rank scores - dpo_*.jsonl │
│ - Results & ranks - Best queries - reward_*.jsonl │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ 4. FINETUNE │ │ 5. EVALUATE │ │
│ │ LoRA + SFT │───►│ Compare Base │ │
│ └─────────────────┘ └─────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ checkpoints/ results/*.json │
│ - LoRA adapters - Success rate │
│ - Merged model - Iterations to success │
│ - Hit@k, MRR │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The first step is to run the VLM agent on your document QA dataset and log everything.
For each question, we record:
- Reasoning tokens (`<think>...</think>` blocks)
- Search queries generated by the model
- Search results with ranks
- Ground truth matching (did GT appear? at what rank?)
# Start the vLLM server
vllm serve Qwen/Qwen3-VL-8B-Thinking \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--api-key "your-key" \
--enable-auto-tool-choice \
--tool-call-parser hermes
# Collect trajectories (parallel mode for throughput)
python trajectory_collector.py \
--output trajectories.jsonl \
--ocr-file /path/to/ocr_output.jsonl \
--model Qwen/Qwen3-VL-8B-Thinking \
--sampling-config default \
--concurrency 16 \
--limit 500

Different sampling parameters produce different trajectory diversity:
| Config | Temperature | Top-p | Best For |
|---|---|---|---|
| `greedy` | 0.0 | 1.0 | Deterministic baseline |
| `default` | 0.7 | 0.95 | Balanced exploration |
| `creative` | 1.2 | 0.95 | Maximum diversity |
| `diverse` | 0.9 | 0.98 | Variety with repetition penalty |
Observation: Greedy decoding causes repetition in thinking models. Temperature 0.7 produces much cleaner reasoning traces.
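For reference, the presets translate into OpenAI-compatible request parameters roughly as follows (a sketch; `repetition_penalty` is a vLLM extension passed via `extra_body`, and the 1.05 value is illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")

SAMPLING_CONFIGS = {
    "greedy":   {"temperature": 0.0, "top_p": 1.0},
    "default":  {"temperature": 0.7, "top_p": 0.95},
    "creative": {"temperature": 1.2, "top_p": 0.95},
    "diverse":  {"temperature": 0.9, "top_p": 0.98,
                 "extra_body": {"repetition_penalty": 1.05}},  # illustrative value
}

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Thinking",
    messages=[{"role": "user", "content": "Generate a search query for: CH-47 maximum speed"}],
    **SAMPLING_CONFIGS["default"],
)
```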
Each trajectory is a JSON object:
{
"id": "traj_q123",
"question": "What is the maximum speed of the CH-47?",
"ground_truth": {
"file": "specs/ch47_manual.pdf",
"page": 42,
"answers": ["170 knots"]
},
"iterations": [
{
"step": 1,
"reasoning": "The user is asking about CH-47 speed specifications...",
"action": "search",
"query": "CH-47 maximum speed knots",
"results": [...],
"gt_found": true,
"gt_rank": 1,
"rank_score": 1.0
}
],
"best_rank_score": 1.0,
"gt_ever_found": true
}

Not all trajectories are useful for training. We filter to keep only those where the model successfully retrieved the ground truth document.
If the ground truth document was never in the search results, we have:
- ❌ No positive signal about what worked
- ❌ No way to compute meaningful rank scores
- ❌ Potentially noisy/misleading training data
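The filter itself is simple. A minimal sketch over the trajectory schema shown above (`process_trajectories.py` may additionally deduplicate and compute statistics; the `best_query` field is our own addition):

```python
import json

def filter_trajectories(in_path: str, out_path: str) -> dict:
    """Keep only trajectories where the GT document was retrieved at least once."""
    total = kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            traj = json.loads(line)
            total += 1
            if not traj.get("gt_ever_found"):
                continue  # no positive signal: drop
            # Record the highest-scoring query as the SFT target.
            best = max(traj["iterations"], key=lambda it: it.get("rank_score", 0.0))
            traj["best_query"] = best["query"]
            fout.write(json.dumps(traj) + "\n")
            kept += 1
    return {"total": total, "valid": kept, "filtered": total - kept}
```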
python process_trajectories.py \
--input trajectories.jsonl \
--output processed.jsonl \
--stats stats.json

From our experiments with ~350 collected trajectories:
TRAJECTORY PROCESSING STATISTICS
Trajectories:
Total: 351
Valid (GT found): 209 (59.5%)
Filtered (GT not found): 142 (40.5%)
Iterations to First Success:
Step 1: 189 (90.4%)
Step 2: 15 (7.2%)
Step 3: 5 (2.4%)
Best Rank Distribution:
Rank 1: 195 (93.3%)
Rank 2: 14 (6.7%)
Key Insight: The model is effective on the first try (90%+ find GT at step 1). This means most training signal comes from single-step trajectories.
We convert processed trajectories into multiple training formats to support different finetuning approaches.
| Format | Input | Output | Use Case |
|---|---|---|---|
| `sft_query` | Question | Best query | Direct query generation |
| `sft_trajectory` | Question | Full reasoning trace | Learn reasoning patterns |
| `sft_context` | Question + prev attempts | Better query | Iterative improvement |
| `dpo` | Prompt | (chosen, rejected) pair | Preference learning |
| `reward` | Query | Score | Reward model training |
python format_training_data.py \
--input processed.jsonl \
--output-dir training_data/ \
--formats all \
--val-split 0.1

SFT Query Format (simplest, most direct):
{
"messages": [
{"role": "user", "content": "Search query for: What is the maintenance interval?"},
{"role": "assistant", "content": "<search>AH-64 rotor maintenance schedule</search>"}
],
"weight": 1.0
}

DPO Format (contrastive learning):
{
"prompt": "Search query for: What is the maintenance interval?",
"chosen": "<search>AH-64 rotor maintenance schedule</search>",
"rejected": "<search>helicopter maintenance</search>"
}

A challenge we observed: DPO requires paired comparisons, but most trajectories succeed on the first query. This creates few contrast pairs:
- 351 total trajectories
- Only ~50 have multiple search iterations
- Only ~20 produce valid DPO pairs (same question, different quality queries)
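For illustration, pairs can be mined from multi-iteration trajectories like this (a sketch over the schema above; single-iteration trajectories yield no pairs, hence the small count):

```python
def dpo_pairs(traj: dict) -> list[dict]:
    """Pair the best query against strictly worse queries from the same trajectory."""
    its = sorted(traj["iterations"], key=lambda it: it.get("rank_score", 0.0), reverse=True)
    best, pairs = its[0], []
    for worse in its[1:]:
        if worse.get("rank_score", 0.0) < best.get("rank_score", 0.0):  # require a real quality gap
            pairs.append({
                "prompt": f"Search query for: {traj['question']}",
                "chosen": f"<search>{best['query']}</search>",
                "rejected": f"<search>{worse['query']}</search>",
            })
    return pairs
```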
Recommendation: Start with SFT on successful queries. Use DPO only if you have sufficient comparison data.
We use LoRA (Low-Rank Adaptation) to efficiently finetune the VLM while preserving its base capabilities.
- Memory efficient: ~10% of full finetuning memory
- Fast training: Fewer parameters to update
- Composable: Can serve multiple adapters with vLLM
- Safe: Easy to compare with base model
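With Hugging Face PEFT, the adapter setup behind `finetune.py` looks roughly like this (a sketch: the target modules, alpha, and dropout are typical choices for Qwen-style blocks, not confirmed from the script, and the loading class may differ by transformers version):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen3-VL-8B-Thinking")
config = LoraConfig(
    r=16,                  # matches --lora-r 16 below
    lora_alpha=32,         # assumed
    lora_dropout=0.05,     # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the low-rank adapters train
```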
# Basic SFT with query format
python finetune.py sft \
--train-data training_data/sft_best_train.jsonl \
--val-data training_data/sft_best_val.jsonl \
--output-dir ./checkpoints/sft \
--model Qwen/Qwen3-VL-8B-Thinking \
--epochs 3 \
--batch-size 4 \
--lr 2e-4 \
--lora-r 16

Focus the loss on what matters:
python finetune.py sft \
--train-data training_data/sft_best_train.jsonl \
--output-dir ./checkpoints/sft_masked \
--mask-strategy search-only \
--weight-scheme rank-score

Mask Strategies:
- `none`: Standard loss on all tokens
- `search-only`: Only compute loss on `<search>...</search>` tokens
- `assistant`: Loss on all assistant tokens
Weight Schemes:
- `none`: Equal weight for all samples
- `rank-score`: Weight by normalized rank (better queries weighted more)
- `binary`: 1.0 for rank-1 hits, 0.5 otherwise
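A sketch of `search-only` masking combined with `rank-score` weighting (our own helper; it requires a fast tokenizer for offset mapping, and `finetune.py` may implement this differently):

```python
import torch

def masked_labels(text: str, tokenizer, rank_score: float = 1.0):
    """Labels are -100 (ignored by cross-entropy) outside <search>...</search>;
    the returned weight scales this sample's loss by query quality."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    input_ids = enc["input_ids"][0]
    span_start = text.index("<search>")
    span_end = text.index("</search>") + len("</search>")
    labels = torch.full_like(input_ids, -100)
    for i, (s, e) in enumerate(enc["offset_mapping"][0].tolist()):
        if span_start <= s and e <= span_end and s < e:
            labels[i] = input_ids[i]
    return input_ids, labels, rank_score
```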
For 8×A100 setup:
torchrun --nproc_per_node=8 finetune.py sft \
--train-data training_data/sft_best_train.jsonl \
--output-dir ./checkpoints/sft \
--model Qwen/Qwen3-VL-8B-Thinking

After training, you can merge the LoRA weights into the base model:
python finetune.py merge \
--adapter-path ./checkpoints/sft/final \
--output-path ./merged_model

Or serve with vLLM using dynamic adapter loading (recommended):
vllm serve Qwen/Qwen3-VL-8B-Thinking \
--enable-lora \
--lora-modules my-adapter=checkpoints/sft/final \
--trust-remote-code

Compare the finetuned model against the base model on a held-out test set.
| Metric | Description |
|---|---|
| Success Rate | % of questions answered correctly |
| Iterations to Success | Average searches needed to find GT |
| First-Hit Rank | GT document rank on first search |
| MRR | Mean Reciprocal Rank (1/rank averaged) |
| Hit@k | % of queries with GT in top-k |
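MRR and Hit@k reduce to a few lines given the GT's first-search rank per question (a sketch; `None` means the GT was not retrieved and contributes 0):

```python
def retrieval_metrics(ranks: list[int | None], k: int = 3) -> dict:
    """Mean Reciprocal Rank and Hit@k over first-search GT ranks."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks if r is not None) / n
    hit_at_k = sum(1 for r in ranks if r is not None and r <= k) / n
    return {"mrr": mrr, f"hit@{k}": hit_at_k}
```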
# Evaluate base model
python evaluate.py \
--model Qwen/Qwen3-VL-8B-Thinking \
--include-json splits/test.json \
--ocr-file data/ocr_output.jsonl \
--output results/base_results.json
# Evaluate finetuned model (adapter mode)
python evaluate.py \
--model my-adapter \
--include-json splits/test.json \
--ocr-file data/ocr_output.jsonl \
--output results/adapter_results.json

| Metric | Base Model | Finetuned | Change |
|---|---|---|---|
| Success Rate | 68.0% | 66.0% | -2.0% |
| Avg Iterations | 4.14 | 3.98 | -3.9% |
| Iterations to Success | 1.97 | 1.85 | -6.1% |
| First-Hit Rate | 50.0% | 52.0% | +4.0% |
| First-Hit Rank | 1.24 | 1.15 | -7.3% |
| First Search MRR | 0.893 | 0.936 | +4.8% |
| Hit@1 | 40.0% | 46.0% | +15.0% |
| Hit@3 | 50.0% | 52.0% | +4.0% |
- Improved First Search Quality: The finetuned model generates better initial queries (MRR +4.8%, Hit@1 +15%)
- Fewer Iterations Needed: When successful, the finetuned model needs fewer search iterations (1.85 vs 1.97)
- Trade-off on Success Rate: Slight decrease in overall success rate (-2%), likely due to the model being more "decisive" and less exploratory
- Best Rank Improvement: First-hit rank improved from 1.24 to 1.15, meaning the GT document appears higher in results
The finetuning achieved its primary goal: better first-try search queries. The model learned to generate more effective keywords that retrieve the ground truth document higher in the results.
The slight success rate decrease suggests room for improvement:
- More training data (we only had ~200 valid trajectories)
- Longer training (3 epochs may be insufficient)
- Different training formats (trajectory-based vs query-only)
# Mount ephemeral NVMe storage
sudo mkfs.ext4 -F /dev/nvme1n1
sudo mount /dev/nvme1n1 /data
sudo chown ubuntu:ubuntu /data
# Setup Python environment
source ~/.local/bin/env
mkdir -p /data/cache
export UV_CACHE_DIR=/data/cache
uv venv /data/vllm-env --python 3.11
source /data/vllm-env/bin/activate
uv pip install vllm

# ~/.ssh/config
Host gpu-server
HostName <your-ip>
User ubuntu
IdentityFile ~/.ssh/your-key.pem
LocalForward 8000 localhost:8000
LocalForward 8888 localhost:8888

rsync -avz --exclude '__pycache__' --exclude '.venv' \
./agentic-retrieval-finetuning gpu-server:/data/

Base Model:
vllm serve Qwen/Qwen3-VL-8B-Thinking \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--api-key "abc123" \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--max-num-seqs 32

With LoRA Adapter:
vllm serve Qwen/Qwen3-VL-8B-Thinking \
--tensor-parallel-size 8 \
--enable-lora \
--lora-modules my-adapter=checkpoints/sft/final \
--trust-remote-code

Monitor GPU utilization:

watch -n 1 'nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv'

References:

- STaR-SQL: Self-Taught Reasoner for Text-to-SQL
- Qwen3-VL Model Card
- LoRA: Low-Rank Adaptation of Large Language Models
- TRL: Transformer Reinforcement Learning
License: MIT
Tutorial created as part of the Agentic Document AI benchmark project.