InfoReasoner: Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward
Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model's belief states, establishing guarantees including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model's output distributions using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval.
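The intrinsic estimator can be pictured in miniature: sample answers before and after retrieval, cluster them by bidirectional entailment, and take the drop in cluster entropy as the gain. The sketch below is illustrative only; it substitutes a caller-supplied `entails` predicate for the paper's entailment model and uses simple greedy clustering, not the actual implementation:

```python
import math

def semantic_entropy(answers, entails):
    """Greedily cluster sampled answers by bidirectional entailment, then return
    the entropy (in bits) of the empirical distribution over clusters."""
    clusters = []
    for a in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(a, rep) and entails(rep, a):  # bidirectional entailment
                cluster.append(a)
                break
        else:
            clusters.append([a])
    n = len(answers)
    return -sum(len(c) / n * math.log2(len(c) / n) for c in clusters)

def info_gain(before, after, entails):
    """Information gain as uncertainty reduction: entropy before retrieval minus after."""
    return semantic_entropy(before, entails) - semantic_entropy(after, entails)
```

With an exact-match stand-in for entailment, four pre-retrieval samples {Paris, Lyon, Paris, Marseille} have entropy 1.5 bits; if retrieval collapses all samples to "Paris", the estimated gain is 1.5 bits. Note that this finite-sample estimate can be negative; the non-negativity guarantee in the paper concerns the underlying belief-state quantity.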
This repository contains the code for InfoReasoner, including:
- An information-gain (IG) reward service implemented as a FastAPI server under `IG/`.
- A training pipeline built on top of Search-R1 and veRL, using GRPO and the synthetic semantic IG reward.
- Shell scripts and configs to reproduce GRPO training with IG on retrieval-augmented QA datasets.
At a high level, the workflow is:
- Launch a retriever server (for document retrieval, as in Search-R1).
- Launch the IG service (computes semantic information gain given question / context / answers).
- Run GRPO training with the IG reward via `train_grpo.sh` (which wraps `verl/trainer/config/ppo_trainer.yaml` with overrides).
We recommend three conda environments:
- One for RL training (Search-R1 + veRL).
- One for the retriever service (Faiss / retrieval stack).
- One for the IG service.
```bash
conda create -n searchr1 python=3.9
conda activate searchr1

# Core dependencies
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.6.3

# Install this repo (Search-R1 + InfoReasoner)
pip install -e .

# Optional but recommended
pip install flash-attn --no-build-isolation
pip install wandb
```

You can find more details about veRL itself in VERL_README.md.
The IG service runs as a separate HTTP server and can be hosted on a different machine.
```bash
conda create -n ig-service python=3.10
conda activate ig-service

# Core runtime
pip install "torch>=2.1.0"
pip install transformers accelerate sentencepiece

# Service and networking
pip install fastapi uvicorn requests

# Optional: multi-GPU support
pip install ray
```

Make sure the IG service environment has access to the base model checkpoint used for reward computation (e.g., Qwen/Qwen2.5-3B).
If you run local retrieval (e.g., dense e5 + Faiss), use a dedicated environment:
```bash
conda create -n retriever python=3.10
conda activate retriever

# Recommended for GPU Faiss
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install transformers datasets pyserini
conda install -c pytorch -c nvidia faiss-gpu=1.8.0

# API server deps
pip install fastapi uvicorn
```

This project expects a retriever endpoint like http://127.0.0.1:8000/retrieve.
You have two common options: download prebuilt retrieval assets, or build the index yourself.
```bash
save_path=/path/to/retrieval_assets
python scripts/download.py --save_path "$save_path"
cat "$save_path"/part_* > "$save_path"/e5_Flat.index
gzip -d "$save_path"/wiki-18.jsonl.gz
```

Then place assets where your launch script expects them, or pass explicit paths to the server.
```bash
bash search_r1/search/build_index.sh
```

Customize retriever model / corpus settings in that script before running.
Quick start with the provided launcher:
```bash
conda activate retriever
bash retrieval_launch.sh
```

Or start manually with explicit paths:
```bash
python search_r1/search/retrieval_server.py \
    --index_path /path/to/e5_Flat.index \
    --corpus_path /path/to/wiki-18.jsonl \
    --topk 3 \
    --retriever_name e5 \
    --retriever_model intfloat/e5-base-v2 \
    --faiss_gpu
```

Default server bind is 0.0.0.0:8000.
You can check the server's health and behavior by sending retrieval requests from a training run or with a simple curl:
```bash
curl -X POST http://127.0.0.1:8000/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query":"What is the capital of France?","topk":3}'
```

In training scripts/configs, ensure:
- `retriever.url` points to your running retriever service (typically `http://127.0.0.1:8000/retrieve`).
- `retriever.topk` matches your desired retrieval depth.
For example, train_grpo.sh already sets:
```bash
retriever.url="http://127.0.0.1:8000/retrieve"
retriever.topk=3
```

The following sequence launches the full stack in three terminals.
```bash
conda activate retriever
bash retrieval_launch.sh
```

Expected endpoint: http://127.0.0.1:8000/retrieve
```bash
conda activate ig-service
bash IG_service_launch.sh \
    --port 310 \
    --device cuda:0 \
    --model-path Qwen/Qwen2.5-3B \
    --num-generations 10 \
    --max-concurrent-requests 4 \
    --num-gpus 1
```

Health check:
```bash
curl http://127.0.0.1:310/health
```

```bash
conda activate searchr1

# Optional overrides before launch
export BASE_MODEL=/path/to/base-or-checkpoint
export EXPERIMENT_NAME="$(date +%m%d-%H%M)-nq-train-grpo-ig"

bash train_grpo.sh
```

Troubleshooting:
- Retriever not reachable: verify `retriever.url` in the training config points to `:8000/retrieve`.
- IG timeout: increase `IG_TIMEOUT` and/or reduce `IG_BATCH_SIZE`.
- GPU OOM in IG service: reduce `--max-concurrent-requests` or `--num-generations`.
- Throughput bottleneck: use IG multi-GPU mode (`--num-gpus > 1`) and ensure sufficient request concurrency.
The IG service lives under IG/service/ and is launched via IG_service_launch.sh.
```bash
conda activate ig-service
bash IG_service_launch.sh \
    --port 310 \
    --device cuda:0 \
    --model-path Qwen/Qwen2.5-3B \
    --num-generations 10 \
    --max-concurrent-requests 4 \
    --num-gpus 1
```

Key arguments / environment variables (mapped to `IG/service/config.py`):
- `IG_HOST` / `--host` (default `0.0.0.0`)
- `IG_PORT` / `--port` (default `8000`)
- `IG_MODEL_PATH` / `--model-path` (e.g., `Qwen/Qwen2.5-3B`)
- `IG_DEVICE` / `--device` (e.g., `cuda:0`)
- `IG_NUM_GENERATIONS` / `--num-generations`
- `IG_TEMPERATURE` / `--temperature`
- `IG_MAX_NEW_TOKENS` / `--max-new-tokens`
- `IG_MAX_CONTEXT_WORDS` / `--max-context-words`
- `IG_COMPUTATION_CHUNK_SIZE` / `--computation-chunk-size`
- `IG_MAX_CONCURRENT_REQUESTS` / `--max-concurrent-requests`
- `IG_NUM_GPUS` / `--num-gpus`
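This kind of environment-variable-with-CLI-override mapping can be wired with standard argparse precedence: an explicit flag wins, otherwise the environment variable applies, otherwise a built-in default. The sketch below is hypothetical; the actual logic lives in `IG/service/config.py` and may differ:

```python
import argparse
import os

def env_default(var, fallback):
    """Return the environment variable's value if set, else the fallback."""
    return os.environ.get(var, fallback)

def build_parser():
    # Each flag's default is seeded from the matching IG_* environment variable,
    # so a CLI argument always overrides the environment.
    p = argparse.ArgumentParser(description="IG service configuration (sketch)")
    p.add_argument("--host", default=env_default("IG_HOST", "0.0.0.0"))
    p.add_argument("--port", type=int, default=int(env_default("IG_PORT", "8000")))
    p.add_argument("--model-path", default=env_default("IG_MODEL_PATH", "Qwen/Qwen2.5-3B"))
    p.add_argument("--num-gpus", type=int, default=int(env_default("IG_NUM_GPUS", "1")))
    return p
```

For example, with `IG_PORT=310` exported, `build_parser().parse_args([])` yields port 310, while passing `--port 8001` overrides it.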
After startup, you should see logs indicating the generator and entailment models are loaded and the service is ready.
The service supports multi-GPU deployment via Ray. You can specify the GPU count either with `--num-gpus` on the command line or via the `IG_NUM_GPUS` environment variable.
```bash
conda activate ig-service

# Use 4 GPUs with Ray workers
bash IG_service_launch.sh \
    --port 310 \
    --num-gpus 4 \
    --model-path Qwen/Qwen2.5-3B
```

Or with environment variables:
```bash
export IG_NUM_GPUS=4
export IG_PORT=310
bash IG_service_launch.sh
```

In multi-GPU mode:
- The main FastAPI process forwards requests to multiple Ray workers.
- Each worker owns one GPU and runs its own generator + entailment model.
- Requests are distributed in a round-robin fashion across workers.
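The round-robin dispatch can be pictured in a few lines of Python. This is a schematic sketch with hypothetical names; the real service forwards requests to Ray actor workers:

```python
import itertools

class RoundRobinDispatcher:
    """Distribute incoming requests across a fixed pool of workers in ring order."""

    def __init__(self, workers):
        self._cycle = itertools.cycle(workers)  # endless iterator over the worker ring

    def dispatch(self, request):
        worker = next(self._cycle)  # pick the next worker in round-robin order
        # In the real service this would be an async Ray call on the worker;
        # here we just return the pairing for illustration.
        return worker, request
```

With three workers, four consecutive dispatches go to worker 0, 1, 2, and then wrap back to 0.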
You can verify the deployment:
```bash
curl http://localhost:310/health
```

Multi-GPU responses include fields like:
```json
{
  "status": "healthy",
  "mode": "multi-GPU",
  "num_workers": 4,
  "devices": ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]
}
```

See IG/service/README.md for more details on the service API and configuration.
Once the retriever and IG service are running, you can launch GRPO training from the project root.
Edit train_grpo.sh or export the relevant variables:
```bash
export BASE_MODEL=/path/to/base-or-checkpoint
export WAND_PROJECT='Qwen2.5-7B-GRPO'
export EXPERIMENT_NAME="$(date +%m%d-%H%M)-nq_train-grpo-qwen2.5-7b-ig"
```

train_grpo.sh also sets:
```bash
export IG_SERVICE_URL="http://0.0.0.0:310"
export IG_BATCH_SIZE='512'
export IG_TIMEOUT='120'
```

Make sure IG_SERVICE_URL matches the host/port where you launched the IG service.
```bash
conda activate searchr1
bash train_grpo.sh
```

This script ultimately calls `verl.trainer.main_ppo` with overrides on top of the base Hydra config `verl/trainer/config/ppo_trainer.yaml`.
The actual IG reward computation flows through:
- `verl/utils/reward_score/qa_em.py` (EM + IG integration).
- `verl/utils/reward_score/IG_client.py` (HTTP client wrapper).
- `verl/utils/reward_score/IG_reward.py` (local IG reward calculator, if needed).
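As a rough illustration of how an exact-match outcome reward and the IG score can be folded into one scalar, consider the sketch below. The function names, normalization, and weighting here are hypothetical; the actual combination lives in `qa_em.py`:

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace (standard EM normalization)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def combined_reward(prediction, golds, ig_score, ig_weight=0.5):
    """Exact-match outcome reward plus a weighted semantic-IG term (illustrative only)."""
    em = float(normalize(prediction) in {normalize(g) for g in golds})
    return em + ig_weight * ig_score
```

For example, a correct answer with an IG score of 0.2 would receive roughly 1.0 + 0.5 * 0.2 = 1.1 under this weighting.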
train_grpo.sh configures Weights & Biases logging via:
- `trainer.project_name`
- `trainer.experiment_name`
You can monitor:
- EM accuracy metrics.
- IG reward statistics (mean / max / min).
- Training losses and KL metrics.
The IG service exposes two main endpoints:
- `GET /health` – basic health and mode information (single-GPU / multi-GPU).
- `POST /compute_info_gain` – batch IG computation.
Example request:
```bash
curl -X POST http://localhost:310/compute_info_gain \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {
        "question": "What is the capital of France?",
        "context": "France is a country in Europe. Paris is its capital.",
        "answers": ["Paris"]
      }
    ]
  }'
```

Example response:
```json
{
  "scores": [0.123],
  "errors": [null],
  "details": [...]
}
```

The training code (`qa_em.py`) consumes only the `scores` field; `details` is useful for analysis and ablations.
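The same request can be issued from Python. The minimal client below mirrors the curl example using only the standard library; the endpoint URL and timeout are assumptions to adjust for your deployment:

```python
import json
import urllib.request

IG_URL = "http://localhost:310/compute_info_gain"  # adjust host/port to your deployment

def build_request(items):
    """Assemble the JSON body expected by /compute_info_gain."""
    return {"items": items}

def compute_info_gain(items, url=IG_URL, timeout=120):
    """POST a batch of items to the IG service and return the scores list."""
    data = json.dumps(build_request(items)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["scores"]
```

Each item carries a `question`, `context`, and `answers` list, matching the request schema shown above.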
This project builds upon the Search-R1 framework and the verl reinforcement learning library. We sincerely thank the authors of these projects for their valuable contributions, which have significantly supported and inspired our work.
```bibtex
@misc{hu2026optimizingagenticreasoningretrieval,
  title={Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward},
  author={Senkang Hu and Yong Dai and Yuzhi Zhao and Yihang Tao and Yu Guo and Zhengru Fang and Sam Tak Wu Kwong and Yuguang Fang},
  year={2026},
  eprint={2602.00845},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.00845},
}
```