Delphictunic/codebase_migration_reasoning

Migration Assistant

Deterministic, Version-Aware Codebase Migration with Structured LLM Reasoning

Abstract

Large-scale library upgrades introduce breaking API changes, semantic shifts, and dependency conflicts that are difficult to detect reliably. This tool formalizes code migration as a constraint satisfaction problem over a structured intermediate representation (IR) rather than a token-level reasoning task.

The system combines:

  • Deterministic AST-based static analysis
  • Explicit dependency version modeling
  • Version-gated compatibility rules
  • Constraint-driven violation detection
  • LLM-based structured migration planning

Violations are detected deterministically. LLMs are used only for structured reasoning and strategy synthesis.


Motivation

Modern software systems depend on deep dependency stacks:

  • Scientific computing and machine learning libraries (NumPy, pandas, scikit-learn)
  • Web frameworks (Django, Flask)
  • HTTP clients (requests)
  • Serialization and utility libraries

Upgrading these dependencies can introduce:

  • Removed methods
  • Renamed APIs
  • Removed keyword arguments
  • Deprecated type aliases
  • Behavioral changes
  • Security-impacting parameter changes

Why This Approach?

Naive approaches that reason directly over raw source text suffer from:

  • Token-based guessing
  • Lack of version awareness
  • No dependency topology modeling
  • Hallucinated or non-reproducible reasoning

This system instead:

  • Extracts structured program information
  • Applies version-aware compatibility constraints
  • Produces grounded violations
  • Uses LLMs only after deterministic analysis

System Architecture

Raw Repository
        ↓
AST-Based Parser
        ↓
Structured Intermediate Representation (IR)
        ↓
Dependency Version Context (requirements parser)
        ↓
Versioned Knowledge Base (rules.yaml)
        ↓
Deterministic Rule Engine
        ↓
Structured Violations
        ↓
LLM Strategy Generator (Gemini)

Key Components

1. Deterministic Static Analysis

AST-Based Parsing (src/parser.py)

The parser uses Python's ast module to extract structured information:

  • Import specifications (with alias resolution)
  • Fully qualified API calls
  • Receiver objects
  • Method names
  • Resolved modules
  • Line numbers and column offsets
  • Positional and keyword arguments

No heuristics. No regex. No LLM-based detection.
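A minimal sketch of alias-aware call extraction with the standard `ast` module follows. It is not the code in src/parser.py; the function names (`collect_aliases`, `dotted_name`, `extract_calls`) and the tuple layout are assumptions made for illustration, and receiver tracking for instance methods (for example `df.sort()`) is left out.

```python
import ast

SOURCE = '''
import numpy as np
from django.conf.urls import url

np.float(10)
url(r"^home/$", None)
'''


def collect_aliases(tree):
    """Map each imported local name to its fully qualified name."""
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for item in node.names:
                # "import numpy as np" -> {"np": "numpy"}
                aliases[item.asname or item.name] = item.name
        elif isinstance(node, ast.ImportFrom) and node.module:
            for item in node.names:
                # "from a.b import c" -> {"c": "a.b.c"}
                aliases[item.asname or item.name] = f"{node.module}.{item.name}"
    return aliases


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def extract_calls(source):
    """Return (qualified_name, lineno, col_offset, kwarg_names) per call."""
    tree = ast.parse(source)
    aliases = collect_aliases(tree)
    calls = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue  # e.g. a call on a subscript or on another call's result
        head, _, rest = name.partition(".")
        resolved = aliases.get(head, head)  # unknown names are kept as written
        qualified = f"{resolved}.{rest}" if rest else resolved
        kwargs = [kw.arg for kw in node.keywords]
        calls.append((qualified, node.lineno, node.col_offset, kwargs))
    # ast.walk is breadth-first, so sort to get source order.
    return sorted(calls, key=lambda c: (c[1], c[2]))


if __name__ == "__main__":
    for call in extract_calls(SOURCE):
        print(call)
```

Because the alias table is built from import nodes rather than text matching, `np.float` and `numpy.float` resolve to the same qualified name.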

2. Intermediate Representation (IR)

Defined in src/ir.py:

  • ImportSpec
  • APICall
  • FileIR
  • ProjectIR

The IR converts unstructured source code into structured program data, enabling:

  • Alias resolution
  • Fully qualified call reconstruction
  • Version-aware rule matching
  • Cross-file compatibility analysis
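One way the four IR types could be laid out is as plain dataclasses. Only the class names come from src/ir.py; every field name below, and the `all_calls` helper, is a hypothetical choice based on the information the parser is described as extracting.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class ImportSpec:
    module: str                   # e.g. "numpy" or "django.conf.urls"
    name: Optional[str] = None    # imported symbol for "from x import y"
    alias: Optional[str] = None   # local alias, e.g. "np"


@dataclass(frozen=True)
class APICall:
    qualified_name: str           # e.g. "numpy.float"
    receiver: str                 # e.g. "np"
    method: str                   # e.g. "float"
    resolved_module: str          # e.g. "numpy"
    lineno: int
    col_offset: int
    args: Tuple[str, ...] = ()    # positional arguments as source text
    kwargs: Tuple[str, ...] = ()  # keyword argument names


@dataclass
class FileIR:
    path: str
    imports: List[ImportSpec] = field(default_factory=list)
    calls: List[APICall] = field(default_factory=list)


@dataclass
class ProjectIR:
    files: List[FileIR] = field(default_factory=list)

    def all_calls(self) -> Iterator[Tuple[str, APICall]]:
        """Yield (file path, call) pairs across the whole project."""
        for file_ir in self.files:
            for call in file_ir.calls:
                yield file_ir.path, call


if __name__ == "__main__":
    call = APICall("numpy.float", "np", "float", "numpy", lineno=26, col_offset=4)
    project = ProjectIR([FileIR("test_input.py", [ImportSpec("numpy", alias="np")], [call])])
    print(list(project.all_calls()))
```

Frozen dataclasses keep individual calls hashable and immutable, which suits a pipeline whose later stages only read the IR.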

3. Dependency Version Modeling

The requirements parser (src/dependency/) extracts version context from requirements.txt:

{
  "requests": "==3.0.0",
  "numpy": "==1.21.0",
  "django": "==4.2.0"
}

Rules activate only if:

Version(installed) ∈ SpecifierSet(version_range)

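This prevents false positives when the installed version falls outside the range a breaking change applies to, for example when a project still pins an older release.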
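The activation condition can be sketched in a few lines. The `Version` and `SpecifierSet` notation suggests the `packaging` library, but that is an assumption about the implementation; the stand-in below uses only the standard library, handles purely numeric versions, and all function names are hypothetical.

```python
import operator
import re

_OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}
_CLAUSE = re.compile(r"^\s*(==|!=|>=|<=|>|<)\s*([0-9]+(?:\.[0-9]+)*)\s*$")
_REQ = re.compile(r"^([A-Za-z0-9_.\-]+)\s*(.*)$")


def parse_version(text):
    """'1.21.0' -> (1, 21, 0). Numeric release segments only."""
    return tuple(int(part) for part in text.split("."))


def _pad(a, b):
    """Right-pad with zeros so that (1, 20) compares equal to (1, 20, 0)."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(installed, version_range):
    """True if `installed` meets every comma-separated clause of the range."""
    for clause in version_range.split(","):
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"unsupported specifier: {clause!r}")
        op, bound = match.groups()
        left, right = _pad(parse_version(installed), parse_version(bound))
        if not _OPS[op](left, right):
            return False
    return True


def parse_requirements(text):
    """Map package name -> specifier string, skipping blanks and comments."""
    deps = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        match = _REQ.match(line) if line else None
        if match is None:
            continue
        name, spec = match.groups()
        deps[name.lower()] = spec.strip()
    return deps


def pinned_version(spec):
    """Return '1.21.0' for '==1.21.0', or None if the dependency is not pinned."""
    return spec[2:].strip() if spec.startswith("==") else None


if __name__ == "__main__":
    deps = parse_requirements("numpy==1.21.0\ndjango==4.2.0\n")
    installed = pinned_version(deps["numpy"])
    print(satisfies(installed, ">=1.20"))   # rule is active for this project
    print(satisfies("1.19.5", ">=1.20"))    # rule stays inactive
```

Pre-release tags, wildcards, and the `~=` operator are deliberately out of scope here; a production version check would need to cover them.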

4. Structured Knowledge Base

Breaking changes are encoded declaratively in src/kb/rules.yaml:

- library: numpy
  version_range: ">=1.20"
  breaking_changes:
    - type: removed_method
      method: float
      suggested_fix: "Use numpy.float64 or built-in float()."
      severity: warning

The knowledge base is:

  • Explicit and inspectable
  • Version-aware
  • Extensible
  • Deterministic in execution

5. Rule Engine

Located in src/engine/rule_engine.py, the engine:

  1. Iterates over all API calls in ProjectIR
  2. Matches resolved modules against rule libraries
  3. Checks version constraints
  4. Applies breaking-change conditions
  5. Produces structured violations

This process is deterministic, reproducible, version-aware, and fully auditable.
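The five steps can be sketched as a single pure function. This is an illustration under stated assumptions, not the code in src/engine/rule_engine.py: the `Call` record, the `at_least` helper (which understands only `>=` ranges), and the violation dictionary keys are hypothetical, and the rule is the numpy entry from rules.yaml written as a Python dict. Only the `removed_method` change type is handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    library: str   # resolved top-level module, e.g. "numpy"
    method: str    # e.g. "float"
    path: str
    lineno: int


RULES = [
    {
        "library": "numpy",
        "version_range": ">=1.20",
        "breaking_changes": [
            {
                "type": "removed_method",
                "method": "float",
                "suggested_fix": "Use numpy.float64 or built-in float().",
                "severity": "warning",
            },
        ],
    },
]


def _as_tuple(version):
    """'1.20' -> (1, 20, 0); padded so short and long forms compare sanely."""
    parts = [int(p) for p in version.split(".")]
    return tuple((parts + [0, 0, 0])[:3])


def at_least(installed, version_range):
    """Minimal stand-in for a specifier check: supports only '>=X.Y[.Z]'."""
    if not version_range.startswith(">="):
        raise ValueError("this sketch only understands '>=' ranges")
    return _as_tuple(installed) >= _as_tuple(version_range[2:])


def find_violations(calls, rules, versions):
    violations = []
    for call in calls:                                    # 1. iterate over calls
        for rule in rules:
            if rule["library"] != call.library:           # 2. match the library
                continue
            installed = versions.get(call.library)
            if installed is None or not at_least(installed, rule["version_range"]):
                continue                                  # 3. version gate
            for change in rule["breaking_changes"]:       # 4. apply conditions
                if change["type"] == "removed_method" and change["method"] == call.method:
                    violations.append({                   # 5. structured output
                        "library": call.library,
                        "api": f"{call.library}.{call.method}",
                        "file": call.path,
                        "line": call.lineno,
                        "severity": change["severity"],
                        "suggested_fix": change["suggested_fix"],
                    })
    return violations


if __name__ == "__main__":
    calls = [Call("numpy", "float", "test_input.py", 26)]
    print(find_violations(calls, RULES, {"numpy": "1.21.0"}))
```

The function has no hidden state, so the same calls, rules, and versions always yield the same list of violations.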

6. LLM-Driven Migration Strategy

After violations are detected, the system invokes Gemini to generate:

  • Migration scope classification
  • Affected subsystem grouping
  • Recommended migration ordering
  • Testing priorities
  • Risk assessment

The LLM receives structured violations, not raw code, ensuring:

  • Grounded reasoning
  • No hallucinated API detection
  • Clear separation of analysis and synthesis
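A sketch of how structured violations, rather than source code, might be turned into a prompt is shown below. The prompt wording, the `build_strategy_prompt` name, and the violation keys are assumptions; the requested output keys mirror the sections of the strategy report. The Gemini call itself is omitted, since the client code in src/llm/strategy.py is not shown here.

```python
import json

# Keys the model is asked to return, mirroring the strategy report sections.
STRATEGY_KEYS = [
    "scope",
    "affected_areas",
    "recommended_order",
    "testing_focus",
    "risk_notes",
]


def build_strategy_prompt(violations):
    """Serialize violations deterministically and wrap them in instructions."""
    payload = json.dumps(violations, indent=2, sort_keys=True)
    return (
        "You are planning a dependency migration.\n"
        "Use ONLY the violations listed below; do not infer additional "
        "API problems.\n"
        "Return a JSON object with the keys: " + ", ".join(STRATEGY_KEYS) + ".\n\n"
        "Violations:\n" + payload
    )


if __name__ == "__main__":
    sample = [
        {
            "library": "numpy",
            "api": "numpy.float",
            "severity": "warning",
            "file": "test_input.py",
            "line": 26,
            "suggested_fix": "Use numpy.float64 or built-in float().",
        }
    ]
    print(build_strategy_prompt(sample))
```

Sorting keys during serialization keeps the prompt byte-identical across runs for the same violations, which makes the LLM input reproducible even if its output is not.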

Installation

pip install -e .

Requirements:

  • Python 3.8+
  • The GEMINI_API_KEY environment variable, set before running strategy generation:
    export GEMINI_API_KEY="your-api-key"

Usage

Basic Analysis

Analyze a project:

python main.py /path/to/project

Specify Requirements File

python main.py /path/to/project --requirements path/to/requirements.txt

Specify Custom Rule File

python main.py /path/to/project --rules src/kb/rules.yaml
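The three invocations above imply a command-line interface along these lines. This is a hypothetical `argparse` sketch, not the contents of main.py; in particular, the default values are assumptions.

```python
import argparse


def build_arg_parser():
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Detect version-gated breaking changes in a project.",
    )
    parser.add_argument("project", help="path to the project to analyze")
    parser.add_argument(
        "--requirements",
        default=None,  # assumption: no explicit file unless one is given
        help="path to a requirements.txt providing dependency versions",
    )
    parser.add_argument(
        "--rules",
        default="src/kb/rules.yaml",  # assumption: bundled knowledge base
        help="path to a YAML rule file",
    )
    return parser


if __name__ == "__main__":
    args = build_arg_parser().parse_args()
    print(args.project, args.requirements, args.rules)
```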

Example: End-to-End Migration Analysis

Input: Sample Project Structure

File: tests/input_requirements.txt

requests==3.0.0
django==4.2.0
numpy==1.21.0
pandas==1.5.0
sklearn==1.2.0

File: tests/sample_project/test_input.py

import requests
from django.conf.urls import url
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


BASE = "https://example.com"


def test_requests():
    # removed_kwarg
    requests.get(BASE, verify=False)

    # removed_kwarg
    requests.post(BASE, verify=False)


def test_django():
    # removed_method
    url(r"^home/$", lambda r: None)


def test_numpy():
    # removed_method
    np.float(10)

    # removed_method
    np.int(20)


def test_pandas(df):
    # renamed_method
    df.sort()


def test_sklearn():
    # removed_method
    vec = CountVectorizer()
    vec.get_feature_names()

Command

python main.py tests/sample_project --requirements tests/input_requirements.txt

Output: Violation Detection

Parsing project: /Users/abhinavs/code-migration-agent/tests/sample_project
  Found 2 Python file(s)
  Loaded dependencies from tests/input_requirements.txt (5 package(s))

Found 7 violation(s):

  [error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:13
    requests.get: 'verify' parameter removed in requests 3.0.
    → Remove 'verify' kwarg or configure SSL differently.

  [error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:16
    requests.post: 'verify' parameter removed in requests 3.0.
    → Remove 'verify' kwarg or configure SSL differently.

  [error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:21
    django.url: django.conf.urls.url was removed in Django 4.0.
    → Use path() or re_path() from django.urls.

  [warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:26
    numpy.float: np.float was removed in NumPy 1.20.
    → Use np.float64 or built-in float().

  [warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:29
    numpy.int: np.int was removed in NumPy 1.20.
    → Use np.int64 or built-in int().

  [warning] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:34
    pandas.sort: DataFrame.sort() renamed to sort_values().
    → Use df.sort_values().

  [error] /Users/abhinavs/code-migration-agent/tests/sample_project/test_input.py:43
    sklearn.get_feature_names: get_feature_names() removed in sklearn 1.0.
    → Use get_feature_names_out().

Output: LLM-Generated Migration Strategy

Migration strategy
============================================================

Scope: medium

Affected areas:
  • HTTP client operations (requests)
  • Web routing and URL configuration (Django)
  • Numerical data type handling (NumPy)
  • Data manipulation and sorting (Pandas)
  • Machine learning feature processing (Scikit-learn)

Recommended order:
  1. Address 'requests' library changes: Remove 'verify' parameter from get/post calls.
  2. Address 'django' library changes: Replace 'django.conf.urls.url' with 'path()' or 're_path()'.
  3. Address 'sklearn' library changes: Update 'get_feature_names()' to 'get_feature_names_out()'.
  4. Address 'numpy' library changes: Replace 'np.float' with 'np.float64' or 'float()', 
     and 'np.int' with 'np.int64' or 'int()'.
  5. Address 'pandas' library changes: Rename 'DataFrame.sort()' to 'sort_values()'.

Testing focus:
  • All network requests, ensuring proper SSL/TLS handling after 'verify' parameter removal.
  • All application URL routes and navigation.
  • Numerical computations and data type conversions involving NumPy.
  • Dataframe sorting logic and results.
  • Machine learning pipelines, specifically feature extraction and model inputs.
  • End-to-end application functionality to catch any integration issues.

Risk notes:
  Multiple libraries have critical 'error' level changes requiring immediate fixes for code 
  functionality. The removal of 'verify' in 'requests' requires careful consideration of 
  security implications related to SSL certificate verification. While individual fixes are 
  mostly straightforward (renames, parameter removals), the breadth across different 
  functional areas (web, data, ML) necessitates comprehensive testing.

Key Technical Properties

  • Deterministic API extraction via AST
  • Alias-aware module resolution
  • Version-gated rule activation
  • Constraint-driven violation detection
  • Structured compatibility modeling
  • LLM grounded on deterministic outputs
  • No probabilistic detection of breaking changes

Migration is treated as a compatibility constraint problem, not a token prediction task.


Supported Libraries

Currently includes breaking-change rules for:

  • requests (3.0.0+)
  • django (4.0+)
  • numpy (1.20+)
  • pandas (1.0+)
  • scikit-learn (1.0+)
  • Flask, FastAPI, and others (extensible via rules.yaml)

Project Structure

code-migration-agent/
├── src/
│   ├── parser.py              # AST-based code parser
│   ├── ir.py                  # Intermediate representation
│   ├── dependency/            # Requirements parsing
│   ├── engine/
│   │   └── rule_engine.py     # Deterministic rule matching
│   ├── kb/
│   │   └── rules.yaml         # Breaking change knowledge base
│   └── llm/
│       └── strategy.py        # LLM-driven strategy generation
├── tests/
│   ├── sample_project/        # Example project
│   └── input_requirements.txt # Test dependencies
├── main.py                    # Entry point
└── README.md                  # This file

Research Contributions

  1. Hybrid static + LLM migration reasoning - Combines deterministic AST analysis with LLM-driven strategy synthesis
  2. Structured IR-based compatibility modeling - Intermediate representation enables version-aware analysis
  3. Deterministic constraint-based rule matching - Version-gated rules prevent false positives
  4. Version-aware breaking change detection - Respects dependency version specifications
  5. Controlled LLM grounding architecture - LLMs reason over structured data, not raw tokens

Contributing

Contributions welcome! Please:

  1. Add new rules to src/kb/rules.yaml
  2. Include test cases for new breaking changes
  3. Document your additions in README.md

Contact & Support

For issues, questions, or contributions, please open an issue on the project repository.

About

Graph-based dependency analysis and compatibility planning engine for automated large-scale codebase migration.
