Production-simulated RAG + agentic migration intelligence platform for version-safe developer change decisions.
LLM migration assistants hallucinate deprecated APIs. The model does not know which version you are on.
A vanilla RAG system asked "how do I migrate requests v2 to v3?" will return chunks from multiple versions without filtering. The LLM synthesizes across them, mixing v2-era patterns with v3 breaking changes. The developer applies the output and tests fail, but nothing in the answer points to the root cause: the retrieval layer served the wrong version.
Version-safe retrieval solves this at the retrieval layer, before any LLM sees a single token. Wrong-version rate: 0.0.
The core architectural decision: version filtering happens as a hard gate, not as a soft reranking signal.
Every document chunk in the corpus is tagged with a version_tag. When a query specifies a target version (e.g., requests v3.2.1), the retrieval layer applies a hard filter: chunks tagged to incompatible versions are excluded before BM25 or dense scoring runs. No wrong-version chunk can reach the LLM, regardless of semantic similarity score.
| Metric | Value |
|---|---|
| Hybrid Recall@5 (BM25 + dense) | 0.94 |
| Reranker simulated Recall@5 | 0.97 |
| Wrong-version answer rate | 0.0 |
This is why wrong-version rate is exactly 0.0 rather than approximately 0.0: the gate is deterministic, not probabilistic.
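The hard gate described above can be illustrated with a minimal sketch. It is not the project's implementation: the `Chunk` class, the "same major version" compatibility rule, and the token-overlap scorer (a toy stand-in for BM25 and dense scoring) are all assumptions made for illustration. The point it shows is ordering: filtering runs first, so an incompatible chunk is never scored at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    version_tag: str  # e.g. "3.2.1"


def is_compatible(chunk_version: str, target_version: str) -> bool:
    """Hypothetical rule: compatible means the same major version."""
    return chunk_version.split(".")[0] == target_version.split(".")[0]


def overlap_score(query: str, chunk: Chunk) -> int:
    """Toy stand-in for BM25/dense scoring: count shared lowercase tokens."""
    return len(set(query.lower().split()) & set(chunk.text.lower().split()))


def version_safe_retrieve(query, target_version, corpus, k=5):
    # Hard gate: drop incompatible chunks before any scoring happens.
    candidates = [c for c in corpus if is_compatible(c.version_tag, target_version)]
    # Only the surviving chunks are ever scored or ranked.
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:k]


corpus = [
    Chunk("session adapters mount retry v2 pattern", "2.31.0"),
    Chunk("session adapters changed in v3 migration", "3.2.1"),
    Chunk("timeout defaults for v3", "3.0.0"),
]
results = version_safe_retrieve("session adapters migration", "3.2.1", corpus)
print([c.version_tag for c in results])  # ['3.2.1', '3.0.0']
```

The v2 chunk shares two query tokens and would rank second under scoring alone, yet it never appears because it is removed before ranking.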
The LLM is the last component in the pipeline, not the first.
Before any synthesis occurs, the pipeline requires:
- Version filter active and applied
- Conflict detector has run and returned CLEAR
- At least one grounded chunk retrieved for the target version
- Staleness and cross-source consistency checks passed
Only when all gates clear does the LLM generate a migration response, with citations assembled from retrieved chunks. If any gate fails, the pipeline returns a deterministic RISKY or BLOCKED decision without calling the LLM.
This design means the LLM synthesizes only over pre-validated, version-pinned evidence, which removes wrong-version chunks as a source of hallucinated API names or version requirements.
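A minimal sketch of the LLM-last gate sequence follows. The `GateState` fields and the mapping of gate failures to RISKY versus BLOCKED are assumptions made for illustration; the source only states that a failed gate yields a deterministic RISKY or BLOCKED decision without calling the LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class GateState:
    version_filter_applied: bool
    conflict_status: str            # "CLEAR" or a conflict label
    grounded_chunks: int            # chunks retrieved for the target version
    consistency_checks_passed: bool


def decide(state: GateState, synthesize: Callable[[], str]) -> Tuple[str, Optional[str]]:
    # Assumed mapping: no filter or no evidence -> BLOCKED.
    if not state.version_filter_applied or state.grounded_chunks < 1:
        return "BLOCKED", None
    # Assumed mapping: conflicts or failed consistency checks -> RISKY.
    if state.conflict_status != "CLEAR" or not state.consistency_checks_passed:
        return "RISKY", None
    # All gates clear: only now is the LLM callable invoked.
    return "SAFE", synthesize()


llm_calls = []


def fake_llm() -> str:
    llm_calls.append(1)
    return "migration response"


print(decide(GateState(True, "CLEAR", 3, True), fake_llm))   # ('SAFE', 'migration response')
print(decide(GateState(True, "Stale", 3, True), fake_llm))   # ('RISKY', None)
print(decide(GateState(True, "CLEAR", 0, True), fake_llm))   # ('BLOCKED', None)
print(len(llm_calls))                                        # 1
```

Passing the LLM as a callable makes the guarantee easy to test: the call counter stays at one across three queries, since two of them failed a gate.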
The conflict detector runs before synthesis on every query and classifies retrieved chunks across four conflict types:
| Conflict Type | Description |
|---|---|
| Stale | Chunk version tag older than target, may contain superseded guidance |
| Contradictory | Two chunks from same version make incompatible claims |
| Deprecated | Chunk explicitly marks an API or pattern as deprecated |
| Cross-source | Changelog and migration guide disagree on same version |
Conflict Macro F1: 0.966, averaged across all four classes on a held-out evaluation set.
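For readers unfamiliar with the metric, macro F1 is the unweighted mean of per-class F1 scores, so each conflict type counts equally regardless of how often it occurs. The sketch below computes it on a tiny made-up label set; the labels and predictions are illustrative only and are not drawn from the project's evaluation data.

```python
LABELS = ["Stale", "Contradictory", "Deprecated", "Cross-source"]


def per_class_f1(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, labels=LABELS):
    # Unweighted mean over classes.
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


# Illustrative toy labels (not project data): one Contradictory chunk
# is misclassified as Deprecated.
y_true = ["Stale", "Stale", "Contradictory", "Contradictory",
          "Deprecated", "Deprecated", "Cross-source", "Cross-source"]
y_pred = ["Stale", "Stale", "Contradictory", "Deprecated",
          "Deprecated", "Deprecated", "Cross-source", "Cross-source"]

print(round(macro_f1(y_true, y_pred), 3))  # 0.867
```

The single error lowers two classes at once: Contradictory loses recall (F1 of about 0.667) and Deprecated loses precision (F1 of 0.8).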
User Query → Query Parser → [VERSION FILTER HARD GATE] → Version-Safe Retrieval
→ Conflict Detector → [LLM-LAST GATE: grounded evidence required] → LLM Synthesis
→ Migration Report (SAFE / RISKY / BLOCKED)
- Deterministic query parsing and version extraction
- Complexity routing (simple lookup vs. multi-hop migration)
- Hard version-filtered retrieval — BM25 + dense hybrid
- Conflict detection across four types
- SAFE / RISKY / BLOCKED report generation
- LLM synthesis with programmatic citation assembly
- Fallback and audit artifacts on every run
- 24 evidence artifacts, 10 failure/recovery scenarios
- GoalParser + DependencyDeltaDetector
- TaskPlanner → TaskExecutor with bounded retry cap
- RecoveryDecider with escalation on repeated failure
- PlanSummaryReporter
- Staged migration recommendation
- 8 evidence artifacts, 9 failure/recovery scenarios
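The bounded retry cap and escalation on repeated failure listed above can be sketched as follows. The function name, the `DONE` and `ESCALATED` status labels, and the default cap of two retries are hypothetical; the source names the components but does not show their interfaces.

```python
from typing import Callable


def execute_with_recovery(task: Callable[[], str], max_retries: int = 2) -> dict:
    """Run a task with a bounded retry cap; escalate once the cap is exhausted."""
    attempts = 0
    last_error = None
    while attempts <= max_retries:  # one initial try plus max_retries retries
        attempts += 1
        try:
            return {"status": "DONE", "attempts": attempts, "result": task()}
        except RuntimeError as exc:
            last_error = str(exc)
    # Repeated failure: stop retrying and hand off instead of looping forever.
    return {"status": "ESCALATED", "attempts": attempts, "error": last_error}


def make_flaky(failures: int) -> Callable[[], str]:
    """Build a task that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def task() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("simulated step failure")
        return "step complete"

    return task


print(execute_with_recovery(make_flaky(1)))  # DONE on attempt 2
print(execute_with_recovery(make_flaky(5)))  # ESCALATED after 3 attempts
```

The cap keeps worst-case work per task fixed, which is what makes a failure/recovery scenario reproducible in a scripted run.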
The repo-aware extension scans a local sample repository against the migration target:
- Dependency usage mapping across source files
- Risky callsite detection (10/10 callsites flagged in test repo)
- DO_NOT_APPLY_WITHOUT_REVIEW: conservative by default
Artifacts: repo_inspection_report.json, dependency_usage_map.json, risky_callsite_report.json
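One plausible way to map dependency usage and find callsites is a walk over the Python syntax tree. The sketch below uses the standard `ast` module; the function name and the sample source are hypothetical, and it only handles the simple `import package` form, so it is an illustration of the idea and not the scanner the project ships.

```python
import ast


def find_callsites(source: str, package: str):
    """Return sorted (line, call) pairs for calls made through `package` or its alias."""
    tree = ast.parse(source)

    # Collect the local names the package is bound to (handles `import x as y`).
    aliases = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == package:
                    aliases.add(alias.asname or alias.name)

    # Record attribute calls whose base is one of those names, e.g. requests.get(...).
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            base = node.func.value
            if isinstance(base, ast.Name) and base.id in aliases:
                hits.append((node.lineno, f"{base.id}.{node.func.attr}"))
    return sorted(hits)


sample = '''import requests
import json

def fetch(url):
    resp = requests.get(url, timeout=5)
    return json.loads(resp.text)

def push(url, data):
    return requests.post(url, json=data)
'''

print(find_callsites(sample, "requests"))  # [(5, 'requests.get'), (9, 'requests.post')]
```

Parsing instead of text search avoids false hits in comments and strings, and the `json.loads` call is ignored because `json` is not the target package.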
DevPulse generates reviewer-safe migration artifacts:
- proposed_file_changes.json: structured patch proposal
- proposed_migration_patch.diff: reviewable diff
- patch_risk_report.json: risk assessment
- before_tests_report.json: pre-patch test state
- after_patch_tests_report.json: post-patch test simulation
- test_failure_triage_report.json: failure root cause
- pr_body.md: GitHub PR description
- reviewer_checklist.md: structured sign-off checklist
- rollback_plan.md: revert procedure
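A reviewable diff of the kind listed above can be produced with the standard library's `difflib.unified_diff`. The sketch below is an assumption about how such an artifact might be generated: the `build_patch` helper, the sample file contents, and the proposal fields are hypothetical, with only the `DO_NOT_APPLY_WITHOUT_REVIEW` status taken from the source.

```python
import difflib
import json


def build_patch(path: str, before: str, after: str) -> str:
    """Return a unified diff between two versions of one file."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical before/after contents for a single migrated callsite.
before = "import requests\nresp = requests.get(url)\n"
after = "import requests\nresp = requests.get(url, timeout=10)\n"

patch = build_patch("client.py", before, after)

# Hypothetical proposal record; the patch is never applied automatically.
proposal = {
    "file": "client.py",
    "status": "DO_NOT_APPLY_WITHOUT_REVIEW",
    "diff": patch,
}

print(patch)
print(json.dumps({k: v for k, v in proposal.items() if k != "diff"}))
```

Keeping the diff as text inside a structured record lets a reviewer read the change and a script check the status field from the same artifact.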
The 37-day backtest replays 2,479 queries across a scripted traffic corpus to validate:
- Wrong-version rate stays 0.0 across all query types
- Conflict detection F1 holds under traffic variance
- SAFE/RISKY/BLOCKED decisions are consistent with ground truth
Key RAG hardening artifacts: retrieval ablation, reranker simulation, conflict confusion matrix, corpus perturbation, traffic backtest report.
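The backtest checks above reduce to a few aggregate rates over replayed queries. The sketch below shows one way to compute them; the record fields and the same-major-version rule are assumptions, and the three records are toy values for illustration, not rows from the 37-day backtest.

```python
def summarize_backtest(records):
    """Aggregate wrong-version rate and decision agreement over replayed queries."""
    total = len(records)

    def has_wrong_version(rec):
        # Assumed rule: a retrieved chunk is wrong-version if its major version differs.
        target_major = rec["target_version"].split(".")[0]
        return any(tag.split(".")[0] != target_major for tag in rec["retrieved_tags"])

    wrong = sum(has_wrong_version(r) for r in records)
    agree = sum(r["decision"] == r["expected_decision"] for r in records)
    return {
        "queries": total,
        "wrong_version_rate": wrong / total,
        "decision_agreement": agree / total,
    }


# Toy records (hypothetical field names and values).
records = [
    {"target_version": "3.2.1", "retrieved_tags": ["3.2.1", "3.0.0"],
     "decision": "SAFE", "expected_decision": "SAFE"},
    {"target_version": "3.2.1", "retrieved_tags": ["3.1.0"],
     "decision": "RISKY", "expected_decision": "RISKY"},
    {"target_version": "2.31.0", "retrieved_tags": [],
     "decision": "BLOCKED", "expected_decision": "RISKY"},
]

summary = summarize_backtest(records)
print(summary["queries"], summary["wrong_version_rate"])  # 3 0.0
```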
| Metric | Value |
|---|---|
| Backtest duration | 37 days |
| Total queries | 2,479 |
| RAG eval query count | 180 |
| Wrong-version rate | 0.0 |
| Hybrid Recall@5 | 0.94 |
| Conflict Macro F1 | 0.966 |
Full artifact: retrieval_ablation_report.json. The table below shows Recall@5 across retrieval strategies on 180 held-out migration queries, evaluated after version pre-filtering except for the one row marked "no version filter":
| Strategy | Recall@5 | Notes |
|---|---|---|
| BM25 only | 0.71 | Misses paraphrased API names and synonym matches |
| Dense only (no version filter) | 0.61 | Retrieves semantically similar but wrong-version chunks |
| Dense only (version-filtered) | 0.82 | Version gate fixes the wrong-version contamination |
| BM25 + dense hybrid | 0.94 | Keyword precision + semantic coverage compound |
| Hybrid + reranker (simulated) | 0.97 | Cross-encoder reranker lifts top-1 precision further |
The critical finding: dense-only retrieval without version pre-filtering scores 0.61, the lowest of any strategy and 0.21 below the same retriever behind the version gate (0.82), because semantic similarity pulls in v2 documentation when a user asks about v3. Version pre-filtering is not optional; it is the structural guarantee that makes the hybrid and reranked scores (0.94 and 0.97) meaningful.
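To make the ablation concrete, the sketch below computes Recall@5 and fuses two rankings. Two assumptions apply: Recall@5 is taken here as the fraction of relevant chunks found in the top five, and the rankings are combined with reciprocal rank fusion (RRF), which is a common technique swapped in for illustration because the source does not say how BM25 and dense scores are combined. The document IDs are toy values.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def rrf_fuse(rankings, c=60):
    """Reciprocal rank fusion: each list contributes 1 / (c + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Sort by fused score, breaking ties by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings over version-filtered chunks (illustrative IDs only).
bm25_ranking = ["a", "b", "c", "d", "e", "f"]
dense_ranking = ["f", "g", "a", "h", "i", "b"]
relevant = {"a", "b", "f"}

hybrid_ranking = rrf_fuse([bm25_ranking, dense_ranking])

print(round(recall_at_k(bm25_ranking, relevant), 3))    # 0.667
print(round(recall_at_k(dense_ranking, relevant), 3))   # 0.667
print(round(recall_at_k(hybrid_ranking, relevant), 3))  # 1.0
```

Each retriever alone misses one relevant chunk in its top five; fusion promotes chunks that either list ranks highly, so all three land in the fused top five.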
# Full PRD validation bundle
PYTHONPATH=. python3 scripts/run_devpulse_complete_v3.py
PYTHONPATH=. python3 scripts/show_final_demo_report.py
# Repo-aware extension
PYTHONPATH=. python3 scripts/run_repo_aware_scan_v35.py
# Patch + PR simulation
PYTHONPATH=. python3 scripts/run_patch_pr_simulation_v35.py
# RAG hardening
PYTHONPATH=. python3 scripts/run_rag_eval_hardening_v35.py
# Dashboard
PYTHONPATH=. python3 scripts/build_dashboard_v35.py
open outputs/dashboard/index.html
Public dashboard: https://sidharthkriplani.github.io/devpulse_platform/
Covers: PRD completion status, Query Mode and Goal Mode flow, RAG evaluation metrics, repo-aware migration risk, patch and PR simulation, final validation artifacts, evidence inventory (70 artifacts).
| Artifact | Proves |
|---|---|
| traffic_backtest_37_day_report.json | Wrong-version rate=0.0 across 2,479 queries |
| retrieval_ablation_report.json | BM25 vs. dense vs. hybrid Recall@5 comparison |
| conflict_confusion_matrix.json | Macro F1=0.966 per conflict type |
| risky_callsite_report.json | Repo-aware: 10/10 risky callsites found |
| devpulse_prd_completion_report_v3.json | PRD v3.5 PASS |
| plan_summary_report.json | Goal mode: task planning + recovery |
What this is: Solo-built, non-production, production-simulated system. Every major claim is backed by executable scripts, generated artifacts, and a public dashboard.
Not claimed: Real production SaaS traffic, live npm/PyPI/Maven registry integration, real GitHub PR creation, real CI execution, autonomous production code mutation, production deployment.
configs/ controlled registries and scope config
src/devpulse/ core Query Mode and Goal Mode modules
scripts/ executable demo, validation, and artifact builders
sample_repos/ controlled local repo for repo-aware simulation
outputs/evidence/ core evidence artifacts
outputs/rag_eval/ RAG evaluation hardening artifacts
outputs/repo_aware/ repo-aware migration scan artifacts
outputs/patches/ patch proposal artifacts
outputs/pr_simulation/ PR-ready simulation package
docs/ public GitHub Pages dashboard and assets
docs/assets/ SVG architecture diagrams
Full design rationale, architecture decisions, and expected interview questions with answers:
docs/defense/DevPulse_Interview_Defense_v2.pdf
Covers: LLM-Last principle rationale, version-safe retrieval hard-gate design, conflict detection architecture, wrong-version rate guarantee, hybrid RAG design choices, and production failure modes.
This project is part of a 13-repo portfolio targeting Applied LLM Systems Engineer, MLOps, and Technical AI PM roles.
Applied Systems (LangGraph pipelines):
| Project | Domain | Primary Failure Mode |
|---|---|---|
| LendFlow | Financial underwriting | When to stop or escalate |
| AgentReliabilityLab | Cyber threat triage | When to stop or escalate |
| NexusSupply | Supplier risk intelligence | Conflicting signal fusion |
Platforms & Auditors (domain-agnostic tooling):
| Project | What It Audits / Builds |
|---|---|
| InferenceLens | Inference cost/quality tradeoffs — Pareto frontier, routing rules |
| RiskFrame | ML model lifecycle — champion/challenger, drift, fairness |
| MetaSignal | A/B experiment validity — CUPED, guardrail-first, SRM |
| DevPulse | Version-safe RAG — conflict detection, LLM-Last architecture |
| PulseRank | Marketplace ranking — IPS debiasing, MMR diversity |
| TrialCheck | A/B readout audit — SRM, peeking, underpowered tests |
| FeatureLeakageLens | Pre-training leakage — target, temporal, overlap |
| GoldenSetAuditor | LLM/RAG eval dataset quality |
| DocIngestQA | RAG document ingestion quality — 11 deterministic checks |
| MetricLens | Metric movement decomposition — mix shift vs rate shift |