Deterministic, Version-Aware Codebase Migration with Structured LLM Reasoning
Large-scale library upgrades introduce breaking API changes, semantic shifts, and dependency conflicts that are difficult to detect reliably. This tool formalizes code migration as a constraint satisfaction problem over a structured intermediate representation (IR) rather than a token-level reasoning task.
The system combines:
- Deterministic AST-based static analysis
- Explicit dependency version modeling
- Version-gated compatibility rules
- Constraint-driven violation detection
- LLM-based structured migration planning
Violations are detected deterministically. LLMs are used only for structured reasoning and strategy synthesis.
Modern software systems depend on deep dependency stacks:
- Numerical and machine-learning libraries (NumPy, pandas, scikit-learn)
- Web frameworks (Django, Flask)
- HTTP clients (requests)
- Serialization and utility libraries
Upgrading these dependencies introduces:
- Removed methods
- Renamed APIs
- Removed keyword arguments
- Deprecated type aliases
- Behavioral changes
- Security-impacting parameter changes
Naive LLM-driven migration approaches suffer from:
- Token-based guessing
- Lack of version awareness
- No dependency topology modeling
- Hallucinated or non-reproducible reasoning
This system instead:
- Extracts structured program information
- Applies version-aware compatibility constraints
- Produces grounded violations
- Uses LLMs only after deterministic analysis
```
Raw Repository
      ↓
AST-Based Parser
      ↓
Structured Intermediate Representation (IR)
      ↓
Dependency Version Context (requirements parser)
      ↓
Versioned Knowledge Base (rules.yaml)
      ↓
Deterministic Rule Engine
      ↓
Structured Violations
      ↓
LLM Strategy Generator (Gemini)
```
AST-Based Parsing (src/parser.py)
The parser uses Python's ast module to extract structured information:
- Import specifications (with alias resolution)
- Fully qualified API calls
- Receiver objects
- Method names
- Resolved modules
- Line numbers and column offsets
- Positional and keyword arguments
No heuristics. No regex. No LLM detection.
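The extraction step can be sketched with nothing but the standard library. The visitor below resolves `import numpy as np` aliases and reconstructs fully qualified call names; the class and field layout here are illustrative assumptions, not the actual src/parser.py API:

```python
import ast

class CallExtractor(ast.NodeVisitor):
    """Collect alias-resolved API calls from one module (illustrative sketch)."""

    def __init__(self):
        self.aliases = {}  # local name -> module path, e.g. "np" -> "numpy"
        self.calls = []    # (qualified_name, line, col, keyword_args)

    def visit_Import(self, node):
        for alias in node.names:
            self.aliases[alias.asname or alias.name] = alias.name
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        for alias in node.names:
            self.aliases[alias.asname or alias.name] = f"{node.module}.{alias.name}"
        self.generic_visit(node)

    def visit_Call(self, node):
        # Resolve attribute calls like np.float(...) to "numpy.float".
        if isinstance(node.func, ast.Attribute) and isinstance(node.func.value, ast.Name):
            base = self.aliases.get(node.func.value.id, node.func.value.id)
            name = f"{base}.{node.func.attr}"
        elif isinstance(node.func, ast.Name):
            name = self.aliases.get(node.func.id, node.func.id)
        else:
            name = None  # dynamic receivers are out of scope for this sketch
        if name:
            kwargs = [kw.arg for kw in node.keywords if kw.arg]
            self.calls.append((name, node.lineno, node.col_offset, kwargs))
        self.generic_visit(node)

source = "import numpy as np\nnp.float(10)\n"
extractor = CallExtractor()
extractor.visit(ast.parse(source))
print(extractor.calls)  # [('numpy.float', 2, 0, [])]
```

Because everything comes from `ast` nodes, line numbers, column offsets, and keyword arguments fall out for free, which is what makes the downstream rule matching deterministic.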
The intermediate representation, defined in src/ir.py, comprises:
- ImportSpec
- APICall
- FileIR
- ProjectIR

This layer converts unstructured source code into structured program data, enabling:
- Alias resolution
- Fully qualified call reconstruction
- Version-aware rule matching
- Cross-file compatibility analysis
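A minimal sketch of what those four types might look like; the field names below are assumptions for illustration, not the real definitions in src/ir.py:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ImportSpec:
    module: str            # e.g. "numpy"
    alias: str             # e.g. "np"

@dataclass
class APICall:
    qualified_name: str    # e.g. "numpy.float"
    receiver: str          # e.g. "np"
    method: str            # e.g. "float"
    module: str            # resolved module, e.g. "numpy"
    line: int
    col: int
    kwargs: List[str] = field(default_factory=list)

@dataclass
class FileIR:
    path: str
    imports: List[ImportSpec] = field(default_factory=list)
    calls: List[APICall] = field(default_factory=list)

@dataclass
class ProjectIR:
    files: List[FileIR] = field(default_factory=list)

    def all_calls(self):
        # Cross-file iteration is what enables project-wide rule matching.
        for f in self.files:
            yield from f.calls

project = ProjectIR(files=[
    FileIR(path="m.py", calls=[APICall("numpy.float", "np", "float", "numpy", 2, 0)]),
])
print(len(list(project.all_calls())))  # 1
```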
The requirements parser (src/dependency/) extracts version context from requirements.txt:
```json
{
  "requests": "==3.0.0",
  "numpy": "==1.21.0",
  "django": "==4.2.0"
}
```

Rules activate only if:
Version(installed) ∈ SpecifierSet(version_range)
This prevents false positives when older versions are used.
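Both steps can be approximated with the standard library alone. The real implementation presumably uses the packaging library's Version and SpecifierSet; this sketch supports only the operators that appear in the examples:

```python
def parse_requirements(text):
    """Map package name -> pinned specifier, e.g. {"numpy": "==1.21.0"}."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for op in ("==", ">=", "<="):
            if op in line:
                name, _, version = line.partition(op)
                deps[name.strip()] = op + version.strip()
                break
    return deps

def _vtuple(v):
    # "1.21.0" -> (1, 21, 0); good enough for simple numeric versions
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def rule_active(installed_spec, version_range):
    """True when the installed version falls inside the rule's range."""
    installed = _vtuple(installed_spec.lstrip("=<>"))
    op, version = version_range[:2], version_range[2:]
    target = _vtuple(version)
    if op == ">=":
        return installed >= target
    if op == "==":
        return installed == target
    if op == "<=":
        return installed <= target
    raise ValueError(f"unsupported operator in {version_range!r}")

deps = parse_requirements("requests==3.0.0\nnumpy==1.21.0\n")
print(rule_active(deps["numpy"], ">=1.20"))  # True: the NumPy rule fires
```

A project pinned to numpy==1.19.5 would fail the `>=1.20` gate, so the `np.float` removal rule stays silent for it.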
Breaking changes are encoded declaratively in src/kb/rules.yaml:
```yaml
- library: numpy
  version_range: ">=1.20"
  breaking_changes:
    - type: removed_method
      method: float
      suggested_fix: "Use numpy.float64 or built-in float()."
      severity: warning
```

The knowledge base is:
- Explicit and inspectable
- Version-aware
- Extensible
- Deterministic in execution
Located in src/engine/rule_engine.py, the engine:
- Iterates over all API calls in ProjectIR
- Matches resolved modules against rule libraries
- Checks version constraints
- Applies breaking-change conditions
- Produces structured violations
This process is deterministic, reproducible, version-aware, and fully auditable.
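With the IR, the dependency context, and the gated rules in hand, the matching loop reduces to a few lines. The names below are illustrative; the actual engine lives in src/engine/rule_engine.py:

```python
def check_project(calls, rules, deps, version_ok):
    """Match every extracted call against the version-gated knowledge base.

    calls:      iterable of dicts like {"module": "numpy", "method": "float", "line": 26}
    rules:      parsed rules.yaml entries
    deps:       {"numpy": "==1.21.0", ...}
    version_ok: predicate (installed_spec, version_range) -> bool
    """
    violations = []
    for rule in rules:
        lib = rule["library"]
        if lib not in deps or not version_ok(deps[lib], rule["version_range"]):
            continue  # rule is gated off for this project's versions
        changes = {c["method"]: c for c in rule["breaking_changes"]}
        for call in calls:
            if call["module"] == lib and call["method"] in changes:
                change = changes[call["method"]]
                violations.append({
                    "line": call["line"],
                    "api": f"{lib}.{call['method']}",
                    "type": change["type"],
                    "fix": change["suggested_fix"],
                    "severity": change.get("severity", "error"),
                })
    return violations

rules = [{
    "library": "numpy",
    "version_range": ">=1.20",
    "breaking_changes": [{
        "type": "removed_method",
        "method": "float",
        "suggested_fix": "Use numpy.float64 or built-in float().",
        "severity": "warning",
    }],
}]
calls = [{"module": "numpy", "method": "float", "line": 26}]
found = check_project(calls, rules, {"numpy": "==1.21.0"}, lambda *_: True)
print(found[0]["api"], found[0]["severity"])  # numpy.float warning
```

No step in this loop consults a model or a heuristic, which is why the same inputs always yield the same violations.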
After violations are detected, the system invokes Gemini to generate:
- Migration scope classification
- Affected subsystem grouping
- Recommended migration ordering
- Testing priorities
- Risk assessment
The LLM receives structured violations, not raw code, ensuring:
- Grounded reasoning
- No hallucinated API detection
- Clear separation of analysis and synthesis
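Grounding the model is largely a matter of prompt construction: serialize the violations, never the source. A sketch of the idea; the prompt wording and the google-generativeai invocation are assumptions, not the actual src/llm/strategy.py:

```python
import json

def build_strategy_prompt(violations):
    """Feed the LLM structured violations only -- never raw source code."""
    payload = json.dumps(violations, indent=2)
    return (
        "You are planning a dependency migration. Based ONLY on the "
        "structured violations below, produce: scope classification, "
        "affected-area grouping, recommended migration order, testing "
        "priorities, and risk notes. Do not guess at APIs not listed.\n\n"
        f"Violations:\n{payload}"
    )

violations = [
    {"api": "numpy.float", "type": "removed_method",
     "fix": "Use np.float64 or built-in float().", "severity": "warning"},
]
prompt = build_strategy_prompt(violations)

# Hypothetical invocation (requires GEMINI_API_KEY and the
# google-generativeai package; the call shape may differ from strategy.py):
# import os, google.generativeai as genai
# genai.configure(api_key=os.environ["GEMINI_API_KEY"])
# model = genai.GenerativeModel("gemini-1.5-flash")
# print(model.generate_content(prompt).text)
```

Because the prompt contains only engine-verified facts, the model can misprioritize but cannot invent a breaking change that the deterministic analysis never found.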
```bash
pip install -e .
```

Requirements:
- Python 3.8+

Set the GEMINI_API_KEY environment variable for strategy generation:

```bash
export GEMINI_API_KEY="your-api-key"
```
Analyze a project:

```bash
python main.py /path/to/project
python main.py /path/to/project --requirements path/to/requirements.txt
python main.py /path/to/project --rules src/kb/rules.yaml
```

File: tests/input_requirements.txt

```
requests==3.0.0
django==4.2.0
numpy==1.21.0
pandas==1.5.0
sklearn==1.2.0
```
File: tests/sample_project/test_input.py

```python
import requests
from django.conf.urls import url
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

BASE = "https://example.com"

def test_requests():
    # removed_kwarg
    requests.get(BASE, verify=False)
    # removed_kwarg
    requests.post(BASE, verify=False)

def test_django():
    # removed_method
    url(r"^home/$", lambda r: None)

def test_numpy():
    # removed_method
    np.float(10)
    # removed_method
    np.int(20)

def test_pandas(df):
    # renamed_method
    df.sort()

def test_sklearn():
    # removed_method
    vec = CountVectorizer()
    vec.get_feature_names()
```

Run the analysis:

```bash
python main.py tests/sample_project --requirements tests/input_requirements.txt
```

```
Parsing project: /Users/abhinavs/code-migration-agent/tests/sample_project
Found 2 Python file(s)
Loaded dependencies from tests/input_requirements.txt (5 package(s))

Found 7 violation(s):

[error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:13
  requests.get: 'verify' parameter removed in requests 3.0.
  → Remove 'verify' kwarg or configure SSL differently.
[error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:16
  requests.post: 'verify' parameter removed in requests 3.0.
  → Remove 'verify' kwarg or configure SSL differently.
[error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:21
  django.url: django.conf.urls.url was removed in Django 4.0.
  → Use path() or re_path() from django.urls.
[warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:26
  numpy.float: np.float was removed in NumPy 1.20.
  → Use np.float64 or built-in float().
[warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:29
  numpy.int: np.int was removed in NumPy 1.20.
  → Use np.int64 or built-in int().
[warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:34
  pandas.sort: DataFrame.sort() renamed to sort_values().
  → Use df.sort_values().
[error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:43
  sklearn.get_feature_names: get_feature_names() removed in sklearn 1.0.
  → Use get_feature_names_out().
```
```
Migration strategy
============================================================
Scope: medium

Affected areas:
  • HTTP client operations (requests)
  • Web routing and URL configuration (Django)
  • Numerical data type handling (NumPy)
  • Data manipulation and sorting (Pandas)
  • Machine learning feature processing (Scikit-learn)

Recommended order:
  1. Address 'requests' library changes: Remove 'verify' parameter from get/post calls.
  2. Address 'django' library changes: Replace 'django.conf.urls.url' with 'path()' or 're_path()'.
  3. Address 'sklearn' library changes: Update 'get_feature_names()' to 'get_feature_names_out()'.
  4. Address 'numpy' library changes: Replace 'np.float' with 'np.float64' or 'float()',
     and 'np.int' with 'np.int64' or 'int()'.
  5. Address 'pandas' library changes: Rename 'DataFrame.sort()' to 'sort_values()'.

Testing focus:
  • All network requests, ensuring proper SSL/TLS handling after 'verify' parameter removal.
  • All application URL routes and navigation.
  • Numerical computations and data type conversions involving NumPy.
  • Dataframe sorting logic and results.
  • Machine learning pipelines, specifically feature extraction and model inputs.
  • End-to-end application functionality to catch any integration issues.

Risk notes:
Multiple libraries have critical 'error' level changes requiring immediate fixes for code
functionality. The removal of 'verify' in 'requests' requires careful consideration of
security implications related to SSL certificate verification. While individual fixes are
mostly straightforward (renames, parameter removals), the breadth across different
functional areas (web, data, ML) necessitates comprehensive testing.
```
- Deterministic API extraction via AST
- Alias-aware module resolution
- Version-gated rule activation
- Constraint-driven violation detection
- Structured compatibility modeling
- LLM grounded on deterministic outputs
- No probabilistic detection of breaking changes
Migration is treated as a compatibility constraint problem, not a token prediction task.
Currently includes breaking-change rules for:
- requests (3.0.0+)
- django (4.0+)
- numpy (1.20+)
- pandas (1.0+)
- scikit-learn (1.0+)
- Flask, FastAPI, and others (extensible via rules.yaml)
```
code-migration-agent/
├── src/
│   ├── parser.py            # AST-based code parser
│   ├── ir.py                # Intermediate representation
│   ├── dependency/          # Requirements parsing
│   ├── engine/
│   │   └── rule_engine.py   # Deterministic rule matching
│   ├── kb/
│   │   └── rules.yaml       # Breaking change knowledge base
│   └── llm/
│       └── strategy.py      # LLM-driven strategy generation
├── tests/
│   ├── sample_project/      # Example project
│   └── input_requirements.txt  # Test dependencies
├── main.py                  # Entry point
└── README.md                # This file
```
- Hybrid static + LLM migration reasoning: combines deterministic AST analysis with LLM-driven strategy synthesis
- Structured IR-based compatibility modeling: an intermediate representation enables version-aware analysis
- Deterministic constraint-based rule matching: version-gated rules prevent false positives
- Version-aware breaking change detection: respects dependency version specifications
- Controlled LLM grounding architecture: LLMs reason over structured data, not raw tokens
Contributions welcome! Please:
- Add new rules to src/kb/rules.yaml
- Include test cases for new breaking changes
- Document your additions in README.md
For issues, questions, or contributions, please open an issue on the project repository.