
DependencyGraph Architectural Incompatibility #519

@DevLenn

Description


DependencyGraph Architectural Incompatibility with Transformers CausalLM and MistralForCausalLM Models

Issue Overview

The DependencyGraph.build_dependency() method is systematically incompatible with Hugging Face Transformers causal language models: the two libraries differ fundamentally in how the computational graph is constructed and in how model inputs and outputs are passed and returned. This mismatch prevents structural pruning from succeeding on any LLaMA, Mistral, or derivative model family.

Application Context

This incompatibility emerged during development of an interactive neural network pruning pipeline designed for production model compression workflows. The intended system architecture encompasses:

Model Management Layer:

  • Interactive selection interface for locally cached models (argilla/CapybaraHermes-2.5-Mistral-7B, TinyLlama/TinyLlama-1.1B-Chat-v1.0)
  • Hierarchical module enumeration with comprehensive layer identification and indexing
  • Multi-target layer selection via space-delimited numerical interface

Pruning Configuration System:

  • Fine-grained retention ratio specification (0.0-100.0% with arbitrary decimal precision)
  • Retention semantics: 100% preserves the layer unchanged, 0% removes it entirely, and intermediate values proportionally reduce channels/dimensions
  • Per-layer granular control supporting heterogeneous pruning strategies across network topology
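
The retention semantics above map onto torch-pruning's pruning_ratio (the fraction to *remove*) with a one-line conversion; a minimal sketch (the helper name is illustrative, not part of the pipeline):

```python
def keep_to_pruning_ratio(keep_percent: float) -> float:
    """Map a retention percentage (0-100) to a pruning ratio (fraction removed)."""
    if not 0.0 <= keep_percent <= 100.0:
        raise ValueError("keep_percent must be in [0, 100]")
    return 1.0 - keep_percent / 100.0

# 100% keep -> prune nothing; 0% keep -> prune everything; 80% keep -> prune 20%.
```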

Resource Management Framework:

  • Memory-constrained execution with 90% GPU memory allocation ceiling
  • Automatic CPU offloading for overflow allocation with transparent fallback mechanisms
  • CUDA availability detection with seamless CPU-only execution mode

Model Serialization Pipeline:

  • Intermediate checkpoint persistence in HuggingFace-compatible format
  • Integration with llama.cpp quantization toolchain for production deployment
  • Target quantization: Q6_K precision with GGUF container format

Directory Structure Specifications:

  • Local model repository: ../full_models/<model-identifier-without-namespace>/
  • Cross-platform path resolution with consistent behavior across execution environments

Environment

  • Python: 3.13.3
  • PyTorch: 2.8.0 (tested with both CPU and CUDA 12.6)
  • torch-pruning: latest (pip)
  • transformers: latest (pip)
  • accelerate: latest (pip)
  • OS: Windows 11

Affected Models

  • argilla/CapybaraHermes-2.5-Mistral-7B
  • TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • All LlamaForCausalLM and MistralForCausalLM based models

Error Sequence 1: Tuple Input to Embedding Layer

Script Snippet

# Initial torch-pruning approach: raw input_ids tensor passed directly
enc = tokenizer("Hallo Welt", return_tensors="pt")
example_inputs = enc["input_ids"].to(next(model.parameters()).device)

DG = tp.DependencyGraph().build_dependency(
    model, 
    example_inputs=example_inputs  # Direct tensor input
)

Error

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not tuple
    at torch.nn.functional.py:2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

What Triggers This Error

The error occurs when the dependency graph attempts to trace through the model's embedding layer. The build_dependency method internally creates a forward function that passes inputs as positional arguments, but Transformers models expect keyword arguments like input_ids=tensor.

Reproduction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch_pruning as tp

# Load local model
model = AutoModelForCausalLM.from_pretrained("../full_models/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("../full_models/TinyLlama-1.1B-Chat-v1.0")

enc = tokenizer("Test", return_tensors="pt")
example_inputs = enc["input_ids"]

# This triggers the TypeError
DG = tp.DependencyGraph().build_dependency(model, example_inputs=example_inputs)

Expected vs Actual Behavior

  • Expected: The dependency graph should build successfully with standard transformer model inputs
  • Actual: TypeError occurs because the embedding layer receives incompatible argument format

Error Explanation

Transformers models require keyword argument calls (model(input_ids=tensor)) rather than positional calls (model(tensor)). The default dependency graph construction assumes vision model patterns where inputs can be passed positionally, leading to argument mismatch at the embedding layer level.
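
The argument mismatch can be reproduced without any large model; a minimal sketch with a toy keyword-only module (KwargLM and the forward_fn lambda are illustrative stand-ins, not Transformers or torch-pruning code), showing that a forward function which unpacks keyword arguments sidesteps the problem:

```python
import torch
import torch.nn as nn

# Minimal stand-in for a HF causal LM: forward expects input_ids as a keyword.
class KwargLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(10, 4)

    def forward(self, input_ids=None):
        return self.embed(input_ids)

m = KwargLM()
ids = torch.tensor([[1, 2, 3]])

# Passing a tuple positionally reproduces the embedding() TypeError:
try:
    m((ids,))
except TypeError as e:
    print("TypeError:", e)  # embedding(): argument 'indices' ... not tuple

# A forward function that unpacks a dict as keyword arguments avoids it:
forward_fn = lambda model, inputs: model(**inputs)
out = forward_fn(m, {"input_ids": ids})
print(out.grad_fn)  # a real tensor the graph tracer can follow
```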


Error Sequence 2: grad_fn AttributeError in Computational Graph

Script Snippet

# Attempted fix using tuple format and forward function
enc = tokenizer("Hallo Welt", return_tensors="pt")
example_inputs = enc["input_ids"].to(next(model.parameters()).device)

model.train()
dg = tp.DependencyGraph()
dg.build_dependency(
    model,
    example_inputs=(example_inputs,),            # tuple format
    forward_fn=lambda m, x: m(x).logits.float()  # extract logits
)

Error

Traceback (most recent call last):
  File "torch_pruning/dependency/graph.py", line 514, in _trace_computational_graph
    grad_fn_root = output.grad_fn
AttributeError: 'tuple' object has no attribute 'grad_fn'

What Triggers This Error

The error occurs in _trace_computational_graph() when the method attempts to access the grad_fn attribute of the forward function output. However, the forward function lambda m, x: m(x).logits.float() fails because m(x) with tuple x causes the same embedding layer issue, resulting in a tuple output rather than a tensor with gradient information.

Reproduction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch_pruning as tp

model = AutoModelForCausalLM.from_pretrained("../full_models/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("../full_models/TinyLlama-1.1B-Chat-v1.0")

enc = tokenizer("Test", return_tensors="pt")
example_inputs = enc["input_ids"]
model.train()

# This will trigger the AttributeError
dg = tp.DependencyGraph()
dg.build_dependency(
    model,
    example_inputs=(example_inputs,),
    forward_fn=lambda m, x: m(x).logits.float()
)

Expected vs Actual Behavior

  • Expected: The dependency graph should successfully trace the computational graph and extract gradient information
  • Actual: AttributeError because the forward function fails, preventing gradient graph construction

Error Explanation

The combination of tuple input format and incompatible forward function signature creates a cascade failure. The forward function cannot execute successfully due to the underlying embedding layer argument mismatch, which prevents the dependency graph from obtaining a valid tensor output with gradient information.
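
The tracer's failure mode is easy to see in isolation; a tiny sketch (the `output = ...` line mimics what a failed forward call hands back, it is not torch-pruning code):

```python
import torch

# _trace_computational_graph effectively runs: grad_fn_root = output.grad_fn.
# If the forward call hands back a tuple instead of a tensor, that access fails:
output = (torch.ones(1),)
try:
    output.grad_fn
except AttributeError as e:
    print("AttributeError:", e)  # 'tuple' object has no attribute 'grad_fn'
```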


Error Sequence 3: Integer Tensor Gradient Requirement

Script Snippet

# Attempt to manually enable gradients on input tokens
enc = tokenizer("Hallo Welt", return_tensors="pt")
example_inputs = enc["input_ids"].to(next(model.parameters()).device)
example_inputs.requires_grad_(True)  # This line causes the error

model.train()
dg = tp.DependencyGraph()
dg.build_dependency(model, example_inputs=(example_inputs,))

Error

RuntimeError: only Tensors of floating point dtype can require gradients

What Triggers This Error

The error occurs when attempting to call requires_grad_(True) on input_ids tensors, which have dtype=torch.int64. PyTorch's automatic differentiation system only supports gradient computation on floating-point tensors.

Reproduction

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../full_models/TinyLlama-1.1B-Chat-v1.0")
enc = tokenizer("Test", return_tensors="pt")
input_ids = enc["input_ids"]

print(f"input_ids dtype: {input_ids.dtype}")  # torch.int64
input_ids.requires_grad_(True)  # RuntimeError occurs here

Expected vs Actual Behavior

  • Expected: The dependency graph construction should handle input tensor types appropriately without requiring manual gradient enablement
  • Actual: RuntimeError when attempting to enable gradients on discrete token indices

Error Explanation

Token IDs represent discrete vocabulary indices (torch.int64) rather than continuous values. Gradient computation requires continuous, differentiable parameters. The error occurs when code attempts to enable gradient tracking on these discrete input tokens, which is mathematically invalid for automatic differentiation.
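
The dtype constraint, and where gradients actually belong, can be verified in a few lines (a minimal sketch using a toy nn.Embedding, not the model's real embedding):

```python
import torch

ids = torch.tensor([[1, 2, 3]])          # token IDs, dtype torch.int64
print(ids.dtype)                          # torch.int64

# Autograd rejects gradient tracking on integer tensors:
try:
    ids.requires_grad_(True)
except RuntimeError as e:
    print("RuntimeError:", e)

# Gradients belong on the continuous embedding weights instead:
embed = torch.nn.Embedding(10, 4)
embed(ids).sum().backward()
print(embed.weight.grad is not None)      # gradients flow to weights, not IDs
```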


Error Sequence 4: BasePruner Integration Failure

Script Snippet

# Direct BasePruner usage attempt
modules = list(model.modules())
layer = modules[64]  # Linear layer selection
ratio = 0.2

imp = tp.importance.GroupMagnitudeImportance(p=2)
pruner = tp.pruner.BasePruner(
    model,
    {"input_ids": example_inputs},  # Dictionary format attempt
    importance=imp,
    pruning_ratio=ratio,
    pruning_ratio_dict={layer: ratio}
)

Error

Traceback (most recent call last):
  File "torch_pruning/pruner/algorithms/base_pruner.py", line 137, in __init__
    self.DG = dependency.DependencyGraph().build_dependency(...)
  File "torch_pruning/dependency/graph.py", line 514, in _trace_computational_graph
    grad_fn_root = output.grad_fn
AttributeError: 'tuple' object has no attribute 'grad_fn'

What Triggers This Error

The error occurs during BasePruner initialization when it internally constructs a DependencyGraph using default parameters. Even with dictionary input format, the internal dependency graph construction encounters the same computational graph tracing failure described in Error Sequence 2.

Reproduction

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch_pruning as tp

model = AutoModelForCausalLM.from_pretrained("../full_models/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("../full_models/TinyLlama-1.1B-Chat-v1.0")

enc = tokenizer("Test", return_tensors="pt")
example_inputs = enc["input_ids"]
modules = list(model.modules())
target_layer = modules[64]

imp = tp.importance.GroupMagnitudeImportance(p=2)
pruner = tp.pruner.BasePruner(
    model,
    {"input_ids": example_inputs},
    importance=imp,
    pruning_ratio=0.2,
    pruning_ratio_dict={target_layer: 0.2}
)  # Fails during initialization

Expected vs Actual Behavior

  • Expected: BasePruner should initialize successfully and perform structural pruning on Transformers models
  • Actual: Initialization fails due to underlying dependency graph construction incompatibilities

Error Explanation

BasePruner internally creates a dependency graph during initialization using default forward functions and output transforms. Since these defaults are designed for vision models with tensor outputs, they fail when applied to Transformers models that return structured CausalLMOutput objects, preventing the pruner from completing initialization.


Complete Script Evolution and Error Logs


Initial Script Implementation

#!/usr/bin/env python3
from pathlib import Path
import subprocess
import sys
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch_pruning as tp

BASE_DIR = Path(__file__).resolve().parent

CHOICES = {
    "1": "argilla/CapybaraHermes-2.5-Mistral-7B",
    "2": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}

def choose_model():
    print("Choose a model:")
    print("1) argilla/CapybaraHermes-2.5-Mistral-7B")
    print("2) TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    c = input("Selection (1 or 2): ").strip()
    if c not in CHOICES:
        print("Invalid selection. Exiting.")
        sys.exit(1)
    return CHOICES[c]

def get_cache_dir(repo_id: str) -> Path:
    model_name = repo_id.split("/")[-1]
    return (BASE_DIR / ".." / "full_models" / model_name).resolve()

def device_and_mem_settings():
    if torch.cuda.is_available():
        return torch.device("cuda"), "auto", {0: "90%", "cpu": "10%"}
    else:
        return torch.device("cpu"), None, {"cpu": "100%"}

def load_model_and_tokenizer(repo_id: str, cache_dir: Path, device_map, max_memory):
    print(f"Loading local model from {cache_dir} ...")
    if not cache_dir.exists():
        print(f"Error: Local model folder {cache_dir} not found.")
        sys.exit(1)
    load_kwargs = {}
    if device_map is not None:
        load_kwargs.update({"device_map": device_map, "max_memory": max_memory})
    model = AutoModelForCausalLM.from_pretrained(str(cache_dir), **load_kwargs)
    tokenizer = AutoTokenizer.from_pretrained(str(cache_dir))
    return model, tokenizer

def list_all_layers(model):
    mods = list(model.modules())
    print("\n--- All Modules / Layers (numbered) ---")
    for i, m in enumerate(mods):
        print(f"[{i}] {m.__class__.__name__}")
    return mods

def ask_layer_selection():
    s = input("\nEnter layer numbers (space-separated): ").strip()
    return [int(x) for x in s.split()] if s else []

def ask_percentage_for_layer(idx, module):
    raw = input(f"For layer [{idx}] {module.__class__.__name__} -> Keep what percentage? (0-100): ").strip()
    raw = raw.replace(",", ".")
    v = float(raw)
    if v < 0 or v > 100: 
        sys.exit("Invalid value")
    return v

def perform_pruning(model, tokenizer, modules, selected_idxs):
    enc = tokenizer("Hello World", return_tensors="pt")
    example_inputs = enc["input_ids"]
    print("Building DependencyGraph...")
    tp.DependencyGraph().build_dependency(model, example_inputs=example_inputs)
    
    for idx in selected_idxs:
        layer = modules[idx]
        keep = ask_percentage_for_layer(idx, layer)
        if keep == 100:
            print("No change.")
            continue
        ratio = 1 - keep / 100.0
        print(f"Prune {layer.__class__.__name__} ({idx}) with {ratio*100:.2f}% ...")
        imp = tp.importance.GroupMagnitudeImportance(p=2)
        pruner = tp.pruner.BasePruner(
            model,
            example_inputs,
            importance=imp,
            pruning_ratio=ratio,
            pruning_ratio_dict={layer: ratio}
        )
        pruner.step()
    return model

def main():
    repo_id = choose_model()
    model_name = repo_id.split("/")[-1]
    cache_dir = get_cache_dir(repo_id)
    device, device_map, max_memory = device_and_mem_settings()

    model, tokenizer = load_model_and_tokenizer(repo_id, cache_dir, device_map, max_memory)
    mods = list_all_layers(model)
    sel = ask_layer_selection()
    if not sel: 
        sys.exit("No layers selected")

    pruned = perform_pruning(model, tokenizer, mods, sel)

if __name__ == "__main__":
    main()

First Execution and Error Log

Choose a model:
1) argilla/CapybaraHermes-2.5-Mistral-7B
2) TinyLlama/TinyLlama-1.1B-Chat-v1.0
Selection (1 or 2): 2
Loading local model from /path/to/full_models/TinyLlama-1.1B-Chat-v1.0 ...

--- All Modules / Layers (numbered) ---
[0] LlamaForCausalLM
[1] LlamaModel
[2] Embedding
[3] ModuleList
[4] LlamaDecoderLayer
[5] LlamaAttention
[6] Linear
...
[292] Linear

Enter layer numbers (space-separated): 8 16 32 64 128 200 201 202 203 204 205 206 207 208 209 210
Building DependencyGraph...

UserWarning: Unwrapped parameters detected: ['model.layers.0.post_attention_layernorm.weight', ...]
 Torch-Pruning will prune the last non-singleton dimension of these parameters.

Traceback (most recent call last):
  File "script.py", line 156, in <module>
    main()
  File "script.py", line 151, in main
    pruned = perform_pruning(model, tokenizer, mods, sel)
  File "script.py", line 88, in perform_pruning
    tp.DependencyGraph().build_dependency(model, example_inputs=example_inputs)
  File "torch_pruning/dependency/graph.py", line 514, in _trace_computational_graph
    grad_fn_root = output.grad_fn
AttributeError: 'tuple' object has no attribute 'grad_fn'

Second Iteration with Tuple Format

def perform_pruning(model, tokenizer, modules, selected_idxs):
    enc = tokenizer("Hello World", return_tensors="pt")
    example_inputs = enc["input_ids"].to(next(model.parameters()).device)
    example_inputs.requires_grad_(True)

    print("Building DependencyGraph...")
    model.train()
    dg = tp.DependencyGraph()
    dg.build_dependency(
        model,
        example_inputs=(example_inputs,),
        forward_fn=lambda m, x: m(x).logits
    )
    # ... rest unchanged

Second Error Log

Enter layer numbers (space-separated): 64
Building DependencyGraph...

Traceback (most recent call last):
  File "script.py", line 86, in perform_pruning
    example_inputs.requires_grad_(True)
RuntimeError: only Tensors of floating point dtype can require gradients

Third Iteration without requires_grad_

def perform_pruning(model, tokenizer, modules, selected_idxs):
    enc = tokenizer("Hello World", return_tensors="pt")
    example_inputs = enc["input_ids"].to(next(model.parameters()).device)

    print("Building DependencyGraph...")
    model.train()
    dg = tp.DependencyGraph()
    dg.build_dependency(
        model,
        example_inputs=(example_inputs,),
        forward_fn=lambda m, x: m(x).logits.float()
    )
    # ... rest unchanged

Third Error Log

Enter layer numbers (space-separated): 64
Building DependencyGraph...

Traceback (most recent call last):
  File "script.py", line 93, in <lambda>
    forward_fn=lambda m, x: m(x).logits.float()
  File "torch/nn/functional.py", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not tuple

Fourth Iteration with Dictionary Format

def perform_pruning(model, tokenizer, modules, selected_idxs):
    enc = tokenizer("Hello World", return_tensors="pt")
    example_inputs = enc["input_ids"].to(next(model.parameters()).device)

    print("Building DependencyGraph...")
    model.train()
    dg = tp.DependencyGraph()
    dg.build_dependency(
        model,
        example_inputs={"input_ids": example_inputs},
        forward_fn=lambda m, x: m(**x).logits.float()
    )

    for idx in selected_idxs:
        layer = modules[idx]
        keep = ask_percentage_for_layer(idx, layer)
        if keep == 100:
            print("No change.")
            continue
        ratio = 1 - keep / 100.0
        print(f"Prune {layer.__class__.__name__} ({idx}) with {ratio*100:.2f}% ...")
        imp = tp.importance.GroupMagnitudeImportance(p=2)
        pruner = tp.pruner.BasePruner(
            model,
            {"input_ids": example_inputs},
            importance=imp,
            pruning_ratio=ratio,
            pruning_ratio_dict={layer: ratio}
        )
        pruner.step()
    return model

Final Error Log

Enter layer numbers (space-separated): 64
Building DependencyGraph...
For layer [64] Linear -> Keep what percentage? (0-100): 80
Prune Linear (64) with 20.00% ...

Traceback (most recent call last):
  File "script.py", line 106, in perform_pruning
    pruner = tp.pruner.BasePruner(
        model,
        {"input_ids": example_inputs},
        importance=imp,
        pruning_ratio=ratio,
        pruning_ratio_dict={layer: ratio}
    )
  File "torch_pruning/pruner/algorithms/base_pruner.py", line 137, in __init__
    self.DG = dependency.DependencyGraph().build_dependency(...)
  File "torch_pruning/dependency/graph.py", line 514, in _trace_computational_graph
    grad_fn_root = output.grad_fn
AttributeError: 'tuple' object has no attribute 'grad_fn'
