Migration Assistant

Deterministic, Version-Aware Codebase Migration with Structured LLM Reasoning

Abstract

Large-scale library upgrades introduce breaking API changes, semantic shifts, and dependency conflicts that are difficult to detect reliably. This tool formalizes code migration as a constraint satisfaction problem over a structured intermediate representation (IR) rather than a token-level reasoning task.

The system combines:

Deterministic AST-based static analysis
Explicit dependency version modeling
Version-gated compatibility rules
Constraint-driven violation detection
LLM-based structured migration planning

Violations are detected deterministically. LLMs are used only for structured reasoning and strategy synthesis.

Motivation

Modern software systems depend on deep dependency stacks:

Data and machine learning libraries (NumPy, pandas, scikit-learn)
Web frameworks (Django, Flask)
HTTP clients (requests)
Serialization and utility libraries

Upgrading these dependencies introduces:

Removed methods
Renamed APIs
Removed keyword arguments
Deprecated type aliases
Behavioral changes
Security-impacting parameter changes

Why This Approach?

Naive approaches that ask an LLM to reason over raw source text suffer from:

Token-based guessing
Lack of version awareness
No dependency topology modeling
Hallucinated or non-reproducible reasoning

This system instead:

Extracts structured program information
Applies version-aware compatibility constraints
Produces grounded violations
Uses LLMs only after deterministic analysis

System Architecture

Raw Repository
        ↓
AST-Based Parser
        ↓
Structured Intermediate Representation (IR)
        ↓
Dependency Version Context (requirements parser)
        ↓
Versioned Knowledge Base (rules.yaml)
        ↓
Deterministic Rule Engine
        ↓
Structured Violations
        ↓
LLM Strategy Generator (Gemini)

Key Components

1. Deterministic Static Analysis

AST-Based Parsing (src/parser.py)

The parser uses Python's ast module to extract structured information:

Import specifications (with alias resolution)
Fully qualified API calls
Receiver objects
Method names
Resolved modules
Line numbers and column offsets
Positional and keyword arguments

No heuristics. No regex. No LLM detection.
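
The extraction step described above can be sketched with the standard-library ast module. This is a minimal illustration, not the contents of src/parser.py: the function name extract_calls and the record keys are assumptions, and the sketch only resolves calls whose root name was bound by an import (calls on local variables such as a DataFrame are skipped).

```python
import ast
from typing import Dict, List


def extract_calls(source: str) -> List[dict]:
    """Return one record per call whose root name was bound by an import."""
    tree = ast.parse(source)

    # Map each locally bound name to the fully qualified name it refers to.
    aliases: Dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for item in node.names:
                if item.asname:
                    aliases[item.asname] = item.name  # import numpy as np
                else:
                    root = item.name.split(".")[0]
                    aliases[root] = root  # import os.path binds "os"
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for item in node.names:
                aliases[item.asname or item.name] = f"{node.module}.{item.name}"

    calls = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        # Unwind an attribute chain such as np.linalg.norm from right to left.
        parts = []
        target = node.func
        while isinstance(target, ast.Attribute):
            parts.append(target.attr)
            target = target.value
        if not isinstance(target, ast.Name) or target.id not in aliases:
            continue  # the receiver is not an imported name, e.g. a local variable
        parts.append(aliases[target.id])
        qualified = ".".join(reversed(parts))
        calls.append({
            "qualified_name": qualified,
            "module": qualified.split(".")[0],
            "method": qualified.rsplit(".", 1)[-1],
            "line": node.lineno,
            "col": node.col_offset,
            "num_positional": len(node.args),
            "keywords": [kw.arg for kw in node.keywords if kw.arg is not None],
        })
    # Sort by position so the result does not depend on traversal order.
    return sorted(calls, key=lambda c: (c["line"], c["col"]))


SAMPLE = """import numpy as np
import requests

np.float(10)
requests.get("https://example.com", verify=False)
"""

for call in extract_calls(SAMPLE):
    print(call["qualified_name"], call["line"], call["keywords"])
```

The source is only parsed, never executed, so removed APIs such as np.float can be analyzed without the library being installed.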

2. Intermediate Representation (IR)

Defined in src/ir.py:

ImportSpec
APICall
FileIR
ProjectIR

The IR converts unstructured source code into structured program data, which enables:

Alias resolution
Fully qualified call reconstruction
Version-aware rule matching
Cross-file compatibility analysis
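
One plausible shape for these four types is a set of dataclasses. Only the class names come from src/ir.py; every field, property, and method below (including iter_calls) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class ImportSpec:
    module: str                  # e.g. "numpy" or "django.conf.urls"
    name: Optional[str] = None   # imported symbol for "from module import name"
    alias: Optional[str] = None  # local alias, e.g. "np"

    @property
    def local_name(self) -> str:
        """Name bound in the importing file's namespace."""
        if self.alias:
            return self.alias
        return self.name or self.module.split(".")[0]

    @property
    def qualified_name(self) -> str:
        return f"{self.module}.{self.name}" if self.name else self.module


@dataclass(frozen=True)
class APICall:
    qualified_name: str          # e.g. "numpy.float"
    line: int
    col: int
    args: Tuple[str, ...] = ()
    keywords: Tuple[str, ...] = ()

    @property
    def library(self) -> str:
        return self.qualified_name.split(".")[0]

    @property
    def method(self) -> str:
        return self.qualified_name.rsplit(".", 1)[-1]


@dataclass
class FileIR:
    path: str
    imports: List[ImportSpec] = field(default_factory=list)
    calls: List[APICall] = field(default_factory=list)


@dataclass
class ProjectIR:
    files: List[FileIR] = field(default_factory=list)

    def iter_calls(self) -> Iterator[Tuple[str, APICall]]:
        """Yield (file path, call) pairs in a stable, file-by-file order."""
        for file_ir in self.files:
            for call in file_ir.calls:
                yield file_ir.path, call


project = ProjectIR(files=[
    FileIR(
        path="test_input.py",
        imports=[ImportSpec(module="numpy", alias="np")],
        calls=[APICall("numpy.float", line=26, col=4, args=("10",))],
    )
])
for path, call in project.iter_calls():
    print(path, call.library, call.method, call.line)
```

Frozen dataclasses keep imports and calls hashable and immutable, which makes it easier to compare two analysis runs for equality.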

3. Dependency Version Modeling

The requirements parser (src/dependency/) extracts version context from requirements.txt:

{
  "requests": "==3.0.0",
  "numpy": "==1.21.0",
  "django": "==4.2.0"
}

Rules activate only if:

Version(installed) ∈ SpecifierSet(version_range)

This prevents false positives when the project pins a version that predates a breaking change.
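
The parsing and gating steps can be illustrated with the standard library alone. The membership test above resembles the Version and SpecifierSet types of the packaging library; the sketch below deliberately avoids that dependency and substitutes a much simpler comparison that assumes purely numeric dotted versions (no pre-releases, extras, or environment markers). All function names here are hypothetical.

```python
import operator
from typing import Dict, Optional, Tuple

_COMPARE = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}


def parse_requirements(text: str) -> Dict[str, str]:
    """Map each package name to its raw version specifier ("" if unpinned)."""
    deps = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The specifier starts at the first comparison character.
        idx = next((i for i, ch in enumerate(line) if ch in "=<>!~"), len(line))
        deps[line[:idx].strip().lower()] = line[idx:].replace(" ", "")
    return deps


def pinned_version(spec: str) -> Optional[str]:
    """Return the exact version for a single '==X.Y.Z' pin, else None."""
    return spec[2:] if spec.startswith("==") and "," not in spec else None


def version_tuple(version: str) -> Tuple[int, ...]:
    """'1.20.0' -> (1, 20); trailing zeros are dropped so 1.20 == 1.20.0."""
    parts = [int(part) for part in version.split(".")]
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def satisfies(installed: str, version_range: str) -> bool:
    """True if the installed version meets every comma-separated clause."""
    have = version_tuple(installed)
    for clause in version_range.split(","):
        clause = clause.strip()
        if not clause:
            continue
        op = clause[:2] if clause[:2] in _COMPARE else clause[:1]
        if not _COMPARE[op](have, version_tuple(clause[len(op):])):
            return False
    return True


deps = parse_requirements("requests==3.0.0\nnumpy==1.21.0\ndjango==4.2.0\n")
numpy_version = pinned_version(deps["numpy"])
print(numpy_version, satisfies(numpy_version, ">=1.20"))
```

In a real implementation the packaging library would be the safer choice, since it follows the full version-specifier grammar rather than this simplified subset.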

4. Structured Knowledge Base

Breaking changes are encoded declaratively in src/kb/rules.yaml:

- library: numpy
  version_range: ">=1.20"
  breaking_changes:
    - type: removed_method
      method: float
      suggested_fix: "Use numpy.float64 or built-in float()."
      severity: warning

The knowledge base is:

Explicit and inspectable
Version-aware
Extensible
Deterministic in execution
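
Because rules are plain data, they can be checked before the engine ever runs. The sketch below mirrors the YAML entry above as a Python dictionary and validates its shape. The validator, its name, and the allowed-value sets are assumptions: the change types are taken from the comments in the sample project and the severities from the sample output, and the real rules.yaml schema may accept more.

```python
from typing import List

# Assumed vocabularies, inferred from the examples in this README.
ALLOWED_TYPES = {"removed_method", "renamed_method", "removed_kwarg"}
ALLOWED_SEVERITIES = {"error", "warning"}

# Python mirror of the numpy entry shown above.
NUMPY_RULE = {
    "library": "numpy",
    "version_range": ">=1.20",
    "breaking_changes": [
        {
            "type": "removed_method",
            "method": "float",
            "suggested_fix": "Use numpy.float64 or built-in float().",
            "severity": "warning",
        }
    ],
}


def validate_rule(rule: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key in ("library", "version_range", "breaking_changes"):
        if key not in rule:
            problems.append(f"missing top-level key: {key}")
    for index, change in enumerate(rule.get("breaking_changes", [])):
        where = f"breaking_changes[{index}]"
        if change.get("type") not in ALLOWED_TYPES:
            problems.append(f"{where}: unknown type {change.get('type')!r}")
        if change.get("severity") not in ALLOWED_SEVERITIES:
            problems.append(f"{where}: unknown severity {change.get('severity')!r}")
        for key in ("method", "suggested_fix"):
            if not change.get(key):
                problems.append(f"{where}: missing {key}")
    return problems


print(validate_rule(NUMPY_RULE))
```

Running a check like this when a contributor adds a rule keeps malformed entries from silently never matching.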

5. Rule Engine

Located in src/engine/rule_engine.py, the engine:

Iterates over all API calls in ProjectIR
Matches resolved modules against rule libraries
Checks version constraints
Applies breaking-change conditions
Produces structured violations

This process is deterministic, reproducible, version-aware, and fully auditable.
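
The five steps above can be sketched as a nested loop over calls and rules. This is an illustrative reconstruction, not the code in src/engine/rule_engine.py: calls and rules are plain dictionaries, the version gate understands only a single '>=X.Y' clause, and the "kwarg" key on removed_kwarg rules is a hypothetical field that does not appear in the rule example shown earlier.

```python
from typing import Dict, List


def _version_tuple(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))


def _at_least(installed: str, version_range: str) -> bool:
    """Simplified version gate: understands only a single '>=X.Y' clause."""
    floor = version_range.strip()[2:]  # drop the leading '>='
    return _version_tuple(installed) >= _version_tuple(floor)


def find_violations(calls: List[dict], rules: List[dict],
                    installed: Dict[str, str]) -> List[dict]:
    """Match every call against every active rule, preserving input order."""
    violations = []
    for call in calls:
        for rule in rules:
            if call["library"] != rule["library"]:
                continue
            version = installed.get(rule["library"])
            if version is None or not _at_least(version, rule["version_range"]):
                continue  # rule is inactive for this dependency version
            for change in rule["breaking_changes"]:
                if change["method"] != call["method"]:
                    continue
                if (change["type"] == "removed_kwarg"
                        and change["kwarg"] not in call["keywords"]):
                    continue  # the removed keyword is not used at this call site
                violations.append({
                    "severity": change["severity"],
                    "file": call["file"],
                    "line": call["line"],
                    "api": f"{call['library']}.{call['method']}",
                    "type": change["type"],
                    "suggested_fix": change["suggested_fix"],
                })
    return violations


RULES = [
    {"library": "numpy", "version_range": ">=1.20", "breaking_changes": [
        {"type": "removed_method", "method": "float", "severity": "warning",
         "suggested_fix": "Use numpy.float64 or built-in float()."}]},
    {"library": "requests", "version_range": ">=3.0", "breaking_changes": [
        {"type": "removed_kwarg", "method": "get", "kwarg": "verify",
         "severity": "error",
         "suggested_fix": "Remove 'verify' kwarg or configure SSL differently."}]},
]
CALLS = [
    {"library": "requests", "method": "get", "keywords": ["verify"],
     "file": "test_input.py", "line": 13},
    {"library": "numpy", "method": "float", "keywords": [],
     "file": "test_input.py", "line": 26},
]

for violation in find_violations(CALLS, RULES, {"numpy": "1.21.0", "requests": "3.0.0"}):
    print(violation["severity"], violation["api"], violation["line"])
```

The function has no hidden state and iterates in input order, so the same IR, rules, and versions always produce the same list of violations.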

6. LLM-Driven Migration Strategy

After violations are detected, the system invokes Gemini to generate:

Migration scope classification
Affected subsystem grouping
Recommended migration ordering
Testing priorities
Risk assessment

The LLM receives structured violations, not raw code, ensuring:

Grounded reasoning
No hallucinated API detection
Clear separation of analysis and synthesis
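
The grounding idea can be illustrated by how a prompt might be assembled: the model sees a JSON payload of violations and nothing else. The function name, the payload layout, and the requested response keys are all assumptions for this sketch, and the actual Gemini call is intentionally omitted because its client code is not shown in this README.

```python
import json
from collections import Counter
from typing import List


def build_strategy_prompt(violations: List[dict]) -> str:
    """Serialize violations into a prompt that contains no raw source code."""
    per_library = Counter(v["library"] for v in violations)
    payload = {
        "violation_count": len(violations),
        "by_library": dict(sorted(per_library.items())),
        "violations": violations,
    }
    instructions = (
        "You are planning a dependency migration.\n"
        "Use ONLY the violations in the JSON below; "
        "do not infer additional API usage.\n"
        "Return JSON with the keys: scope, affected_areas, "
        "recommended_order, testing_focus, risk_notes.\n\n"
    )
    # sort_keys keeps the serialized payload identical across runs.
    return instructions + json.dumps(payload, indent=2, sort_keys=True)


example = [
    {"library": "requests", "api": "requests.get", "severity": "error", "line": 13},
    {"library": "numpy", "api": "numpy.float", "severity": "warning", "line": 26},
]
print(build_strategy_prompt(example))
```

Keeping prompt construction a pure function of the violations means the model's input is reproducible even though its output may not be.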

Installation

pip install -e .

Requirements:

Python 3.8+
Set the GEMINI_API_KEY environment variable to enable strategy generation:
```
export GEMINI_API_KEY="your-api-key"
```

Usage

Basic Analysis

Analyze a project:

python main.py /path/to/project

Specify Requirements File

python main.py /path/to/project --requirements path/to/requirements.txt

Specify Custom Rule File

python main.py /path/to/project --rules src/kb/rules.yaml
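
The three invocations above imply a small command-line interface with one positional argument and two options. A minimal argparse sketch of such an interface follows; it is not the contents of main.py, and the help strings and the default values (in particular the default rule path) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Detect version-gated breaking changes in a Python project.",
    )
    parser.add_argument("project", help="path to the project to analyze")
    parser.add_argument(
        "--requirements",
        default=None,
        help="path to a requirements file that supplies dependency versions",
    )
    parser.add_argument(
        "--rules",
        default="src/kb/rules.yaml",  # assumed default, matching the path used above
        help="path to the breaking-change rule file",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["tests/sample_project", "--requirements", "tests/input_requirements.txt"]
)
print(args.project, args.requirements, args.rules)
```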

Example: End-to-End Migration Analysis

Input: Sample Project Structure

File: tests/input_requirements.txt

requests==3.0.0
django==4.2.0
numpy==1.21.0
pandas==1.5.0
sklearn==1.2.0

File: tests/sample_project/test_input.py

import requests
from django.conf.urls import url
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


BASE = "https://example.com"


def test_requests():
    # removed_kwarg
    requests.get(BASE, verify=False)

    # removed_kwarg
    requests.post(BASE, verify=False)


def test_django():
    # removed_method
    url(r"^home/$", lambda r: None)


def test_numpy():
    # removed_method
    np.float(10)

    # removed_method
    np.int(20)


def test_pandas(df):
    # renamed_method
    df.sort()


def test_sklearn():
    # removed_method
    vec = CountVectorizer()
    vec.get_feature_names()

Command

python main.py tests/sample_project --requirements tests/input_requirements.txt

Output: Violation Detection

Parsing project: /Users/abhinavs/code-migration-agent/tests/sample_project
  Found 2 Python file(s)
  Loaded dependencies from tests/input_requirements.txt (5 package(s))

Found 7 violation(s):

  [error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:13
    requests.get: 'verify' parameter removed in requests 3.0.
    → Remove 'verify' kwarg or configure SSL differently.

  [error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:16
    requests.post: 'verify' parameter removed in requests 3.0.
    → Remove 'verify' kwarg or configure SSL differently.

  [error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:21
    django.url: django.conf.urls.url was removed in Django 4.0.
    → Use path() or re_path() from django.urls.

  [warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:26
    numpy.float: np.float was removed in NumPy 1.20.
    → Use np.float64 or built-in float().

  [warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:29
    numpy.int: np.int was removed in NumPy 1.20.
    → Use np.int64 or built-in int().

  [warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:34
    pandas.sort: DataFrame.sort() renamed to sort_values().
    → Use df.sort_values().

  [error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:43
    sklearn.get_feature_names: get_feature_names() removed in sklearn 1.0.
    → Use get_feature_names_out().
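
A report in the layout shown above can be produced from structured violations with a few lines of string formatting. The dictionary keys and function names below are hypothetical; only the layout is taken from the sample output.

```python
from typing import List


def format_violation(violation: dict) -> str:
    """Render one violation as the three-line entry used in the report."""
    return (
        f"  [{violation['severity']}] {violation['file']}:{violation['line']}\n"
        f"    {violation['api']}: {violation['message']}\n"
        f"    → {violation['suggested_fix']}"
    )


def format_report(violations: List[dict]) -> str:
    """Render the header followed by one blank-line-separated entry per violation."""
    header = f"Found {len(violations)} violation(s):"
    return "\n\n".join([header] + [format_violation(v) for v in violations])


report = format_report([
    {
        "severity": "error",
        "file": "tests/sample_project/test_input.py",
        "line": 13,
        "api": "requests.get",
        "message": "'verify' parameter removed in requests 3.0.",
        "suggested_fix": "Remove 'verify' kwarg or configure SSL differently.",
    }
])
print(report)
```

Separating formatting from detection keeps the structured violations available for other consumers, such as the strategy generator.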

Output: LLM-Generated Migration Strategy

Migration strategy
============================================================

Scope: medium

Affected areas:
  • HTTP client operations (requests)
  • Web routing and URL configuration (Django)
  • Numerical data type handling (NumPy)
  • Data manipulation and sorting (Pandas)
  • Machine learning feature processing (Scikit-learn)

Recommended order:
  1. Address 'requests' library changes: Remove 'verify' parameter from get/post calls.
  2. Address 'django' library changes: Replace 'django.conf.urls.url' with 'path()' or 're_path()'.
  3. Address 'sklearn' library changes: Update 'get_feature_names()' to 'get_feature_names_out()'.
  4. Address 'numpy' library changes: Replace 'np.float' with 'np.float64' or 'float()', 
     and 'np.int' with 'np.int64' or 'int()'.
  5. Address 'pandas' library changes: Rename 'DataFrame.sort()' to 'sort_values()'.

Testing focus:
  • All network requests, ensuring proper SSL/TLS handling after 'verify' parameter removal.
  • All application URL routes and navigation.
  • Numerical computations and data type conversions involving NumPy.
  • Dataframe sorting logic and results.
  • Machine learning pipelines, specifically feature extraction and model inputs.
  • End-to-end application functionality to catch any integration issues.

Risk notes:
  Multiple libraries have critical 'error' level changes requiring immediate fixes for code 
  functionality. The removal of 'verify' in 'requests' requires careful consideration of 
  security implications related to SSL certificate verification. While individual fixes are 
  mostly straightforward (renames, parameter removals), the breadth across different 
  functional areas (web, data, ML) necessitates comprehensive testing.

Key Technical Properties

Deterministic API extraction via AST
Alias-aware module resolution
Version-gated rule activation
Constraint-driven violation detection
Structured compatibility modeling
LLM grounded on deterministic outputs
No probabilistic detection of breaking changes

Migration is treated as a compatibility constraint problem, not a token prediction task.

Supported Libraries

Currently includes breaking-change rules for:

requests (3.0.0+)
django (4.0+)
numpy (1.20+)
pandas (1.0+)
scikit-learn (1.0+)
Flask, FastAPI, and others (extensible via rules.yaml)

Project Structure

code-migration-agent/
├── src/
│   ├── parser.py              # AST-based code parser
│   ├── ir.py                  # Intermediate representation
│   ├── dependency/            # Requirements parsing
│   ├── engine/
│   │   └── rule_engine.py     # Deterministic rule matching
│   ├── kb/
│   │   └── rules.yaml         # Breaking change knowledge base
│   └── llm/
│       └── strategy.py        # LLM-driven strategy generation
├── tests/
│   ├── sample_project/        # Example project
│   └── input_requirements.txt # Test dependencies
├── main.py                    # Entry point
└── README.md                  # This file

Research Contributions

Hybrid static + LLM migration reasoning - Combines deterministic AST analysis with LLM-driven strategy synthesis
Structured IR-based compatibility modeling - Intermediate representation enables version-aware analysis
Deterministic constraint-based rule matching - Version-gated rules prevent false positives
Version-aware breaking change detection - Respects dependency version specifications
Controlled LLM grounding architecture - LLMs reason over structured data, not raw tokens

Contributing

Contributions welcome! Please:

Add new rules to src/kb/rules.yaml
Include test cases for new breaking changes
Document your additions in README.md

Contact & Support

For issues, questions, or contributions, please open an issue on the project repository.
