
refactor: move core logics of DPP -> AIC and support static profiling#6285

Merged
tedzhouhk merged 35 commits into main from hzhou/new-dgdr
Feb 26, 2026

@tedzhouhk
Contributor

@tedzhouhk tedzhouhk commented Feb 13, 2026

Changes

New files

  • components/src/dynamo/profiler/__main__.py — New entry point: python -m dynamo.profiler --config <dgdr.yaml>. Parses DGDR spec from JSON/YAML file or inline JSON string, following the same pattern as the planner's __main__.py.

  • components/src/dynamo/profiler/utils/aic_dataframe.py — Helpers to build AIC-compatible DataFrames from real-GPU benchmark results. Only populates the minimal columns actually accessed by AIC's picking functions (traced by reading pick_autoscale_build_disagg_summary_dict).

  • tests/profiler/test_profile_sla_dgdr.py — 11 pytest cases covering all profiler modes. All marked pre_merge + gpu_0 (no GPU required).

  • dpp_test/*.yaml — 10 DGDR config files for manual and automated testing.
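
The config handling described for the new entry point (a file path or an inline JSON string) can be sketched roughly as follows. This is a minimal illustration, not the code in `__main__.py`: the function name `load_spec` is hypothetical, the result is a plain dict rather than a `DynamoGraphDeploymentRequestSpec`, and the YAML branch assumes PyYAML is installed.

```python
import json
from pathlib import Path


def load_spec(config: str) -> dict:
    """Return the raw spec mapping from an inline JSON string or a file path.

    Hypothetical helper: the real entry point builds a typed spec object.
    """
    stripped = config.strip()
    # Inline JSON objects start with "{", so check that before touching the
    # filesystem (a long JSON string is not a usable path anyway).
    if stripped.startswith("{"):
        return json.loads(stripped)

    path = Path(stripped)
    text = path.read_text()
    if path.suffix in {".yaml", ".yml"}:
        import yaml  # PyYAML (third-party); only needed for YAML files

        return yaml.safe_load(text)
    return json.loads(text)
```

With this shape, `load_spec('{"sla": {"ttft": 2000, "itl": 30}}')` and `load_spec("dgdr.json")` return the same kind of mapping, which a caller could then validate against the spec model. The keys in that example are illustrative only.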

Modified files

  • components/src/dynamo/profiler/profile_sla.py — Complete rewrite of run_profile():

    • Input changed from argparse args to (dgdr: DynamoGraphDeploymentRequestSpec, ops: ProfilerOperationalConfig)
    • RAPID path: AIC support check → TaskRunner simulation or naive fallback, with picking mode auto-selected (autoscale/load-match/default)
    • THOROUGH path: enumerate_profiling_configs() → deploy + benchmark each candidate → build DataFrames → AIC picking
    • Interpolation: takes the picked disagg DGD, convert_config() to strip to standalone P/D engines, runs sweep
    • Final assembly: generate_dgd_config_with_planner() adds planner service + ConfigMaps; mocker flag selects mocker DGD
    • Gate checks: thorough+auto backend rejected; AIC-unsupported + throughput planner rejected; rapid sweep falls back to none
    • SLA warnings: unachievable SLA auto-updated; load-match GPU shortage warned
    • Dryrun mode preserved
  • components/src/dynamo/profiler/utils/dgd_generation.py — Rewritten to remove ParallelizationMapping dependency:

    • generate_dgd_config_with_planner() now takes dgdr directly (not args)
    • Planner config passed via PlannerConfig JSON in a dedicated ConfigMap (not CLI args)
    • Profiling data in a separate ConfigMap
    • Removed config_modifier parameter (picked DGD already has correct image/parallelization)
  • components/src/dynamo/profiler/utils/config_modifiers/parallelization_mapping.py — Added PickedParallelConfig dataclass alongside existing ParallelizationMapping. Uses explicit (tp, pp, dp, moe_tp, moe_ep) fields matching AIC's representation.

  • components/src/dynamo/profiler/utils/config_modifiers/protocol.py — Fixed update_model_from_pvc() to mount PVC on all services (not just Frontend), and derive pvc_path from model_path for correct --model-path in workers.

  • components/src/dynamo/profiler/utils/dgdr_v1beta1_types.py — Fixed SLASpec validator (changed from field_validator to model_validator to avoid cross-field ordering issue). Added default SLA values (ttft=2000, itl=30).

AIC-side (separate PR)

  • src/aiconfigurator/generator/enumerate.py — enumerate_profiling_configs() now returns list[EnumeratedCandidate] instead of list[dict]. Each EnumeratedCandidate bundles the DGD config with parallelization metadata (tp, pp, dp, moe_tp, moe_ep, num_gpus).
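
To illustrate why a typed candidate is easier to consume than a bare dict, here is a small sketch of what such a bundle could look like. It is not AIC's actual class: the name `EnumeratedCandidateSketch`, the `dgd_config` field name, the default values, and the `label()` format are all assumptions made for this example. Only the metadata field names (tp, pp, dp, moe_tp, moe_ep, num_gpus) come from the description above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EnumeratedCandidateSketch:
    """Hypothetical stand-in for a candidate returned by the enumerator."""

    dgd_config: dict[str, Any]  # generated DGD config (field name assumed)
    tp: int = 1
    pp: int = 1
    dp: int = 1
    moe_tp: int = 1
    moe_ep: int = 1
    num_gpus: int = 1

    def label(self) -> str:
        # Illustrative label format; the real label scheme may differ.
        return f"tp{self.tp}_pp{self.pp}_dp{self.dp}_moetp{self.moe_tp}_moeep{self.moe_ep}"


candidates = [
    EnumeratedCandidateSketch(dgd_config={}, tp=4, num_gpus=4),
    EnumeratedCandidateSketch(dgd_config={}, tp=2, num_gpus=2),
]
# Metadata is available as attributes, so callers need no dict-key lookups.
smallest_first = sorted(candidates, key=lambda c: c.num_gpus)
```

A benchmarking loop can then read `candidate.tp` or `candidate.num_gpus` directly instead of re-deriving them from the config document.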

Dependency updates

  • container/deps/requirements.txt — AIC pinned to 168a948d
  • benchmarks/pyproject.toml — AIC pinned to 168a948d

Deleted files

  • tests/profiler/test_profile_sla_dryrun.py — Replaced by test_profile_sla_dgdr.py
  • tests/profiler/test_profile_sla_aiconfigurator.py — Replaced by test_profile_sla_dgdr.py

How to review

  1. Start with profile_sla.py — Read run_profile() top to bottom. The flow is: parse DGDR → gate checks → dryrun/RAPID/THOROUGH → SLA warnings → interpolation → final assembly → save.

  2. _run_rapid() — Follows the same pattern as tests/integration/test_picking.py in AIC. Uses build_default_task_configs + _execute_task_configs for default/load-match, TaskRunner.run(autoscale=True) for autoscale. DGD generated via _generate_dgd_from_pick().

  3. _run_thorough() — Uses enumerate_profiling_configs() to get candidates, benchmarks each via DynamoDeploymentClient, builds DataFrames via aic_dataframe.py helpers, then calls pick_autoscale/pick_load_match/pick_default.

  4. aic_dataframe.py — Check the column mapping tables in the plan doc. Only columns actually accessed by AIC picking functions are populated.

  5. dgd_generation.py — Now takes dgdr directly. Planner config via ConfigMap (not CLI args). Two separate ConfigMaps: planner-config and planner-profile-data.

  6. test_profile_sla_dgdr.py — Run with: cd dynamo && PYTHONPATH=components/src:$PYTHONPATH python -m pytest tests/profiler/test_profile_sla_dgdr.py -v --noconftest --override-ini="addopts=". All 11 tests should pass in ~73s.
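
As a reading aid for the gate-check step in the flow above, the two rejection rules listed in the description (thorough + auto backend, and AIC-unsupported + throughput planner) could be expressed like this. Everything here is a sketch under assumptions: `ProfilingRequest`, its flattened fields, `GateCheckError`, and `run_gate_checks` are hypothetical names, not the structures used in `profile_sla.py`. The "rapid sweep falls back to none" rule is left out because its trigger condition is not spelled out in this description.

```python
from dataclasses import dataclass


class GateCheckError(ValueError):
    """Raised when a request combines options that cannot be profiled."""


@dataclass
class ProfilingRequest:
    """Hypothetical, flattened view of the inputs the gate checks look at."""

    search_strategy: str  # "rapid" or "thorough"
    backend: str  # "auto" or a concrete backend name
    aic_supported: bool  # whether AIC supports this model/hardware combination
    throughput_planner: bool  # whether throughput-based planner scaling is requested


def run_gate_checks(req: ProfilingRequest) -> None:
    """Reject unsupported combinations before any profiling work starts."""
    if req.search_strategy == "thorough" and req.backend == "auto":
        raise GateCheckError("thorough profiling requires an explicit backend, not 'auto'")
    if not req.aic_supported and req.throughput_planner:
        raise GateCheckError("throughput planner requires an AIC-supported configuration")
```

Failing early like this keeps invalid requests from reaching the deploy-and-benchmark stage, where a failure would be far more expensive.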

Summary by CodeRabbit

Release Notes

  • New Features

    • Added profiler command-line interface with configuration file and operational parameter support.
    • Introduced support for DynamoGraphDeploymentRequest specification format for profiling workflows.
    • Enabled planner-driven optimization with scaling modes and pre-deployment sweeping configuration.
    • Implemented multiple profiling strategies (rapid, thorough, load-match) for flexible performance tuning.
    • Added persistent volume claim (PVC) support for model caching.
  • Chores

    • Updated external dependency pin to specific commit revision.

close

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk requested a review from a team as a code owner February 13, 2026 18:51
@tedzhouhk tedzhouhk requested a review from a team February 13, 2026 18:51
@coderabbitai
Contributor

coderabbitai Bot commented Feb 13, 2026

Walkthrough

This pull request introduces DGDR (DynamoGraphDeploymentRequest) v1beta1 support with comprehensive Pydantic type definitions, establishes a new profiler CLI entry point with operational configuration, and substantially refactors the profiling pipeline to be planner-aware with support for rapid and thorough profiling paths. Dependency pins are updated to specific commits, and legacy test coverage is replaced with DGDR-focused integration tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Profiler CLI & Main Entry Point: `components/src/dynamo/profiler/__main__.py` | New module establishing argparse-based CLI with config parsing, logging setup, and async profiler invocation via DynamoGraphDeploymentRequestSpec and ProfilerOperationalConfig. |
| Profiling Pipeline Refactor: `components/src/dynamo/profiler/profile_sla.py` | Major rework introducing rapid (AIC-based) and thorough profiling paths, planner-driven DGD generation, ProfilerOperationalConfig class, and multiple helper functions for mode selection, SLA validation, GPU warnings, and interpolation workflows. |
| DGDR Type System: `components/src/dynamo/profiler/utils/dgdr_v1beta1_types.py` | New file defining auto-generated v1beta1 DGDR Pydantic models including DynamoGraphDeploymentRequestSpec, SLASpec, WorkloadSpec, HardwareSpec, enums (DGDRPhase, OptimizationType, PlannerPreDeploymentSweepMode), and cross-field validators. |
| Planner Configuration: `components/src/dynamo/planner/defaults.py`, `components/src/dynamo/planner/utils/planner_config.py` | SLAPlannerDefaults.no_correction default changed from False to True; PlannerConfig field renamed from plannerPreDeploymentSweeping to pre_deployment_sweeping_mode with validation; new scaling_enabled() helper method added. |
| Profiler Utilities: AIC & Parallelization: `components/src/dynamo/profiler/utils/aic_dataframe.py`, `components/src/dynamo/profiler/utils/config_modifiers/parallelization_mapping.py` | New aic_dataframe module with functions to construct AIC-compatible DataFrames (make_parallel_label, build_prefill_row, build_decode_row, build_disagg_df_from_static); new PickedParallelConfig dataclass with num_gpus and tp_size properties plus label() and to_parallelization_mapping() methods. |
| DGD Generation & Protocol Updates: `components/src/dynamo/profiler/utils/dgd_generation.py`, `components/src/dynamo/profiler/utils/config_modifiers/protocol.py` | dgd_generation.py refactored to consume DGDR spec, build PlannerConfig, load profiling data from NPZ, generate ConfigMaps, and support multi-document DGD output; protocol.py updated to mount PVCs to all services and derive pvc_path from effective model path. |
| Dependencies: `benchmarks/pyproject.toml`, `container/deps/requirements.txt` | aiconfigurator dependency pinned to specific commit hash 168a948d5bc32209728fe8639191a9e0d9083d18 across both files. |
| Test Coverage Restructuring: `tests/profiler/test_profile_sla_dgdr.py`, `tests/profiler/test_profile_sla_aiconfigurator.py`, `tests/profiler/test_profile_sla_dryrun.py` | New comprehensive DGDR-based profiler test suite with TestRapidSupported, TestRapidUnsupported, TestThoroughDryRun, TestMockerEnabled, TestGateChecks classes covering planner, load-match, PVC, and mocker scenarios; removed legacy aiconfigurator and dry-run test modules. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Whiskers twitch with profiler delight,
DGDR types now shimmering bright,
Planner paths both rapid and deep,
CLI hops that configuration keep,
Tests refactored, old dust swept clean—
The finest deployment pipeline seen! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title 'refactor: move core logics of DPP -> AIC and support static profiling' directly describes the main changes in the changeset: major refactoring of profiler logic to integrate with AIC and support static profiling workflows. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering new files, modified files, AIC-side changes, dependency updates, deleted files, and review guidance. It exceeds the template requirements with detailed section breakdowns and specific file-level changes. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@components/src/dynamo/profiler/utils/dgdr.py`:
- Around line 56-59: The vram_mb Field is missing a positivity constraint;
update the Field declaration for vram_mb to include gt=0 (matching total_gpus
and num_gpus_per_node) so VRAM must be greater than zero—locate the vram_mb
Field in the DGDR-related model (symbol: vram_mb) and add gt=0 to its Field(...)
arguments and adjust any related validation/tests if needed.
- Around line 150-157: The description string for the Field
planner_pre_deployment_sweeping concatenates adjacent string literals without
spaces, producing run-together sentences; update the description in the
planner_pre_deployment_sweeping Field so each sentence boundary includes a
trailing space (or explicitly add spaces between the quoted fragments) so
phrases like "only load-based scaling in planner is possible.Rapid" become "only
load-based scaling in planner is possible. Rapid" and similarly for
"combination.Thorough" -> "combination. Thorough".
- Around line 81-94: The Field descriptions for the concurrency and request_rate
attributes in dgdr.py have missing spaces between sentences causing
"disabled.Will" to render; update the description strings in the concurrency and
request_rate Field declarations (symbols: concurrency, request_rate) to include
a trailing space after the sentence ending with "disabled." so each sentence is
separated (e.g., "...disabled. Will be ignored...") to fix the rendered text.
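
The run-together sentences flagged above come from Python's implicit concatenation of adjacent string literals, which joins the fragments with nothing in between. The snippet below reproduces the effect using only the fragments quoted in the review ("...possible." followed by "Rapid"); the `has_run_together_sentence` helper and its regex are an illustrative check added for this example, not part of the PR.

```python
import re

# Adjacent literals are joined at compile time with no separator.
broken = (
    "only load-based scaling in planner is possible."
    "Rapid"
)

# A trailing space on the first fragment restores the sentence boundary.
fixed = (
    "only load-based scaling in planner is possible. "
    "Rapid"
)


def has_run_together_sentence(text: str) -> bool:
    """Heuristic: a lowercase letter, a period, then an uppercase letter with no space.

    This can also flag legitimate text (for example dotted names), so treat a
    match as a hint to inspect the string, not as proof of a bug.
    """
    return re.search(r"[a-z]\.[A-Z]", text) is not None


print(has_run_together_sentence(broken))
print(has_run_together_sentence(fixed))
```

A check like this could run over every Field description in a test so that a missing space is caught before the text is rendered in generated documentation.
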
🧹 Nitpick comments (1)
components/src/dynamo/profiler/utils/dgdr.py (1)

232-232: Consider moving model_config to the top of the class body.

Pydantic convention (and the docs examples) place model_config before field declarations so readers immediately see the model's behavior constraints. Currently it sits after the validators at the very end.

Comment thread components/src/dynamo/profiler/utils/dgdr.py Outdated
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk tedzhouhk changed the title from "chores: pydantic template for AugmentedDGDR" to "chore: pydantic template for AugmentedDGDR" Feb 13, 2026
@github-actions github-actions Bot added the chore label Feb 13, 2026
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Comment thread components/src/dynamo/profiler/utils/dgdr.py Outdated
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
.
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
hhzhang16 previously approved these changes Feb 17, 2026
@tedzhouhk tedzhouhk marked this pull request as draft February 17, 2026 21:15
…o it (#6534)

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
@github-actions github-actions Bot added the deployment::k8s Relates to dynamo deployment in kubernetes label Feb 24, 2026
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…/new-dgdr

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
#6538)

Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
.
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…/new-dgdr

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>