refactor: move core logics of DPP -> AIC and support static profiling #6285
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Walkthrough
This pull request introduces DGDR (DynamoGraphDeploymentRequest) v1beta1 support with comprehensive Pydantic type definitions, establishes a new profiler CLI entry point with operational configuration, and substantially refactors the profiling pipeline to be planner-aware with support for rapid and thorough profiling paths. Dependency pins are updated to specific commits, and legacy test coverage is replaced with DGDR-focused integration tests.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@components/src/dynamo/profiler/utils/dgdr.py`:
- Around line 56-59: The vram_mb Field is missing a positivity constraint;
update the Field declaration for vram_mb to include gt=0 (matching total_gpus
and num_gpus_per_node) so VRAM must be greater than zero—locate the vram_mb
Field in the DGDR-related model (symbol: vram_mb) and add gt=0 to its Field(...)
arguments and adjust any related validation/tests if needed.
- Around line 150-157: The description string for the Field
planner_pre_deployment_sweeping concatenates adjacent string literals without
spaces, producing run-together sentences; update the description in the
planner_pre_deployment_sweeping Field so each sentence boundary includes a
trailing space (or explicitly add spaces between the quoted fragments) so
phrases like "only load-based scaling in planner is possible.Rapid" become "only
load-based scaling in planner is possible. Rapid" and similarly for
"combination.Thorough" -> "combination. Thorough".
- Around line 81-94: The Field descriptions for the concurrency and request_rate
attributes in dgdr.py have missing spaces between sentences causing
"disabled.Will" to render; update the description strings in the concurrency and
request_rate Field declarations (symbols: concurrency, request_rate) to include
a trailing space after the sentence ending with "disabled." so each sentence is
separated (e.g., "...disabled. Will be ignored...") to fix the rendered text.
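The fixes requested above can be sketched on a small stand-in model. `HardwareSketch`, its field types, its descriptions, and the `vram_is_accepted` helper are hypothetical and are not the real code in `dgdr.py`; only the `gt=0` constraint and the sentence-boundary space come from the review comments.

```python
from pydantic import BaseModel, Field, ValidationError


class HardwareSketch(BaseModel):
    """Hypothetical stand-in for the DGDR hardware model (not the real class)."""

    # gt=0 rejects zero and negative values, matching the GPU-count fields.
    total_gpus: int = Field(..., gt=0, description="Total number of GPUs.")
    vram_mb: float = Field(..., gt=0, description="VRAM per GPU in MB.")


# Adjacent string literals are joined with no separator, so a sentence that
# ends one fragment runs straight into the next one.
broken = (
    "only load-based scaling in planner is possible."
    "Rapid"
)
# A trailing space on the first fragment restores the sentence boundary.
fixed = (
    "only load-based scaling in planner is possible. "
    "Rapid"
)


def vram_is_accepted(value: float) -> bool:
    """Return True if the sketch model accepts the given VRAM value."""
    try:
        HardwareSketch(total_gpus=8, vram_mb=value)
    except ValidationError:
        return False
    return True
```

The same trailing-space fix applies to the `concurrency` and `request_rate` descriptions, since they are built from adjacent literals in the same way.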
🧹 Nitpick comments (1)
components/src/dynamo/profiler/utils/dgdr.py (1)
232-232: Consider moving `model_config` to the top of the class body. Pydantic convention (and the docs examples) place `model_config` before field declarations so readers immediately see the model's behavior constraints. Currently it sits after the validators at the very end.
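A minimal sketch of the suggested layout, assuming Pydantic v2. `SpecSketch`, the `extra="forbid"` setting, and the `is_accepted` helper are assumptions made for illustration; the real class and its configuration are in `dgdr.py`.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class SpecSketch(BaseModel):
    """Hypothetical model used only to show where model_config goes."""

    # Behavior constraints come first (extra="forbid" is an assumed setting).
    model_config = ConfigDict(extra="forbid")

    # Field declarations follow; validators would come after the fields.
    total_gpus: int = Field(..., gt=0)
    num_gpus_per_node: int = Field(..., gt=0)


def is_accepted(**kwargs: object) -> bool:
    """Return True if SpecSketch validates the given keyword arguments."""
    try:
        SpecSketch(**kwargs)
    except ValidationError:
        return False
    return True
```

With the configuration at the top, a reader sees that unknown keys are rejected before reading any field.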
…o it (#6534) Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…/new-dgdr Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
#6538) Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
…/new-dgdr Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Changes
New files
- `components/src/dynamo/profiler/__main__.py`: New entry point, `python -m dynamo.profiler --config <dgdr.yaml>`. Parses DGDR spec from JSON/YAML file or inline JSON string, following the same pattern as the planner's `__main__.py`.
- `components/src/dynamo/profiler/utils/aic_dataframe.py`: Helpers to build AIC-compatible DataFrames from real-GPU benchmark results. Only populates the minimal columns actually accessed by AIC's picking functions (traced by reading `pick_autoscale` → `_build_disagg_summary_dict`).
- `tests/profiler/test_profile_sla_dgdr.py`: 11 pytest cases covering all profiler modes. All marked `pre_merge` + `gpu_0` (no GPU required).
- `dpp_test/*.yaml`: 10 DGDR config files for manual and automated testing.
Modified files
- `components/src/dynamo/profiler/profile_sla.py`: Complete rewrite of `run_profile()`:
  - `args` to `(dgdr: DynamoGraphDeploymentRequestSpec, ops: ProfilerOperationalConfig)`
  - `TaskRunner` simulation or naive fallback, with picking mode auto-selected (autoscale/load-match/default)
  - `enumerate_profiling_configs()` → deploy + benchmark each candidate → build DataFrames → AIC picking
  - `convert_config()` to strip to standalone P/D engines, runs sweep
  - `generate_dgd_config_with_planner()` adds planner service + ConfigMaps; mocker flag selects mocker DGD
- `components/src/dynamo/profiler/utils/dgd_generation.py`: Rewritten to remove `ParallelizationMapping` dependency:
  - `generate_dgd_config_with_planner()` now takes `dgdr` directly (not `args`)
  - `PlannerConfig` JSON in a dedicated ConfigMap (not CLI args)
  - `config_modifier` parameter (picked DGD already has correct image/parallelization)
- `components/src/dynamo/profiler/utils/config_modifiers/parallelization_mapping.py`: Added `PickedParallelConfig` dataclass alongside existing `ParallelizationMapping`. Uses explicit `(tp, pp, dp, moe_tp, moe_ep)` fields matching AIC's representation.
- `components/src/dynamo/profiler/utils/config_modifiers/protocol.py`: Fixed `update_model_from_pvc()` to mount PVC on all services (not just Frontend), and derive `pvc_path` from `model_path` for correct `--model-path` in workers.
- `components/src/dynamo/profiler/utils/dgdr_v1beta1_types.py`: Fixed `SLASpec` validator (changed from `field_validator` to `model_validator` to avoid cross-field ordering issue). Added default SLA values (ttft=2000, itl=30).
AIC-side (separate PR)
- `src/aiconfigurator/generator/enumerate.py`: `enumerate_profiling_configs()` now returns `list[EnumeratedCandidate]` instead of `list[dict]`. Each `EnumeratedCandidate` bundles the DGD config with parallelization metadata `(tp, pp, dp, moe_tp, moe_ep, num_gpus)`.
Dependency updates
- `container/deps/requirements.txt`: AIC pinned to `168a948d`
- `benchmarks/pyproject.toml`: AIC pinned to `168a948d`
Deleted files
- `tests/profiler/test_profile_sla_dryrun.py`: Replaced by `test_profile_sla_dgdr.py`
- `tests/profiler/test_profile_sla_aiconfigurator.py`: Replaced by `test_profile_sla_dgdr.py`
How to review
- Start with `profile_sla.py`: Read `run_profile()` top to bottom. The flow is: parse DGDR → gate checks → dryrun/RAPID/THOROUGH → SLA warnings → interpolation → final assembly → save.
- `_run_rapid()`: Follows the same pattern as `tests/integration/test_picking.py` in AIC. Uses `build_default_task_configs` + `_execute_task_configs` for default/load-match, `TaskRunner.run(autoscale=True)` for autoscale. DGD generated via `_generate_dgd_from_pick()`.
- `_run_thorough()`: Uses `enumerate_profiling_configs()` to get candidates, benchmarks each via `DynamoDeploymentClient`, builds DataFrames via `aic_dataframe.py` helpers, then calls `pick_autoscale` / `pick_load_match` / `pick_default`.
- `aic_dataframe.py`: Check the column mapping tables in the plan doc. Only columns actually accessed by AIC picking functions are populated.
- `dgd_generation.py`: Now takes `dgdr` directly. Planner config via ConfigMap (not CLI args). Two separate ConfigMaps: planner-config and planner-profile-data.
- `test_profile_sla_dgdr.py`: Run with: `cd dynamo && PYTHONPATH=components/src:$PYTHONPATH python -m pytest tests/profiler/test_profile_sla_dgdr.py -v --noconftest --override-ini="addopts="`. All 11 tests should pass in ~73s.
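The entry point listed under New files accepts the DGDR spec as a JSON/YAML file or an inline JSON string. A minimal sketch of that dispatch follows. It assumes a hypothetical helper named `load_spec_source` and assumes PyYAML is installed for YAML files; the actual parsing code in `__main__.py` may differ.

```python
import json
from pathlib import Path
from typing import Any


def load_spec_source(value: str) -> dict[str, Any]:
    """Hypothetical helper: load a spec dict from a file path or inline JSON."""
    stripped = value.strip()
    # An inline JSON object starts with "{"; anything else is treated as a path.
    if stripped.startswith("{"):
        return json.loads(stripped)

    path = Path(value)
    text = path.read_text(encoding="utf-8")
    if path.suffix.lower() in {".yaml", ".yml"}:
        import yaml  # PyYAML, imported lazily; assumed to be installed

        return yaml.safe_load(text)
    return json.loads(text)
```

The returned dict would then be validated against the DGDR spec model before being passed on to `run_profile()`.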
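The description says the planner configuration is delivered through two ConfigMaps (planner-config and planner-profile-data) rather than CLI args. A rough sketch of what building those manifests could look like is shown here. The helper name, the data keys (`planner_config.json`, `profile_data.json`), and the payload fields are assumptions; only the two ConfigMap names and the default SLA values come from the description.

```python
import json
from typing import Any


def build_config_map(name: str, key: str, payload: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: wrap a JSON payload in a Kubernetes ConfigMap manifest."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        # ConfigMap data values must be strings, so the payload is serialized.
        "data": {key: json.dumps(payload, sort_keys=True)},
    }


# Payload contents are illustrative only, not the real PlannerConfig schema.
planner_config = build_config_map(
    "planner-config", "planner_config.json", {"ttft": 2000, "itl": 30}
)
planner_profile_data = build_config_map(
    "planner-profile-data", "profile_data.json", {"prefill": [], "decode": []}
)
```

Keeping the configuration and the profile data in separate ConfigMaps lets each be mounted or updated on its own.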