feat(deployments): scaffold plugin API and registry (AIRCORE-755) #280
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| # NeMo Deployments Plugin | ||
|
|
||
| Substrate-agnostic deployment lifecycle for the NeMo Platform. This plugin provides | ||
| entity schemas, CRUD APIs, a `DeploymentBackend` ABC, and an executor registry. | ||
|
|
||
| **Scope (this ticket):** scaffold only — entity types, v1 CRUD routes, backend contract, | ||
| and executor registry. Docker/K8s backends and the reconcile controller land in follow-on | ||
| tickets (756–758). | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - NeMo Platform workspace bootstrapped (`make bootstrap`, `nemo setup`) | ||
| - Plugin enabled in root `pyproject.toml` (`enabled-plugins` includes `deployments`) | ||
|
|
||
| ## API base path | ||
|
|
||
| `/apis/deployments/v1/workspaces/{workspace}/...` | ||
|
|
||
| Cross-workspace bulk queries use the entity-store sentinel workspace ``-``: | ||
|
|
||
| ``GET /apis/deployments/v1/workspaces/-/deployments?status_in=pending,starting`` | ||
|
|
||
| ## Next steps | ||
|
|
||
| - **756 / 757:** Docker and Kubernetes `DeploymentBackend` implementations | ||
| - **758:** Reconcile controller wiring status writes and backend lifecycle | ||
|
|
||
| ## Tests | ||
|
|
||
| ```bash | ||
| uv sync | ||
| uv run pytest plugins/nemo-deployments/tests/unit -v | ||
| ``` | ||
|
Comment on lines +1 to +33

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

README violates Diataxis structure and progressive-disclosure guidelines.

This page mixes three Diataxis quadrants (REFERENCE for API path, HOW-TO for commands, EXPLANATION for plugin overview) into 18 lines. Per guidelines, "Each documentation page should fit ONE Diataxis quadrant; do not mix tutorials with reference tables or how-tos with architecture explanations."

Restructure by either:

- narrowing to one quadrant (e.g., HOW-TO: "Set up and test the plugin") and moving API/architecture details to separate pages, or
- expanding into a proper multi-section structure with progressive disclosure: Layer 1 (30s: what/who/value) → Layer 2 (3-5min: core concepts and architecture) → Layer 3 (10min+: API endpoints and advanced features) → Layer 4 (separate reference pages for complete specs).

Source: Coding guidelines
coderabbitai[bot] marked this conversation as resolved.
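For readers of the README's API base path section, a minimal sketch of how a client might build the list URL, including the cross-workspace sentinel `-`. The helper name `deployments_list_path` and the constants are hypothetical and not part of this PR; only the path shape and the `status_in` parameter come from the README.

```python
from urllib.parse import quote, urlencode

API_PREFIX = "/apis/deployments/v1"  # base path from the README
ALL_WORKSPACES = "-"  # entity-store sentinel workspace from the README


def deployments_list_path(workspace: str = ALL_WORKSPACES, statuses: tuple[str, ...] = ()) -> str:
    """Build the deployments list path (hypothetical helper, not in the PR)."""
    # Percent-encode the workspace segment; "-" is left unchanged.
    path = f"{API_PREFIX}/workspaces/{quote(workspace, safe='')}/deployments"
    if statuses:
        # Keep commas literal so the value matches the README example.
        path += "?" + urlencode({"status_in": ",".join(statuses)}, safe=",")
    return path


print(deployments_list_path(statuses=("pending", "starting")))
```

Called without a workspace, the helper targets all workspaces; passing a workspace name scopes the query to it.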
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| [project] | ||
| name = "nemo-deployments-plugin" | ||
| version = "0.1.0" | ||
| description = "NeMo Deployments Plugin — substrate-agnostic deployment lifecycle on the NeMo Platform." | ||
| readme = "README.md" | ||
| requires-python = ">=3.11,<3.14" | ||
| dependencies = [ | ||
| "fastapi>=0.115", | ||
| "nemo-platform", | ||
| "nemo-platform-plugin", | ||
| "pydantic>=2.10.6", | ||
| ] | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
|
|
||
| [project.entry-points."nemo.services"] | ||
| deployments = "nemo_deployments_plugin.service:DeploymentsService" | ||
|
|
||
| [build-system] | ||
| requires = ["hatchling"] | ||
| build-backend = "hatchling.build" | ||
|
|
||
| [tool.hatch.build.targets.wheel] | ||
| packages = ["src/nemo_deployments_plugin"] | ||
|
|
||
| [tool.uv.sources] | ||
| nemo-platform = { workspace = true } | ||
| nemo-platform-plugin = { workspace = true } | ||
|
|
||
| [dependency-groups] | ||
| dev = ["pytest>=8.3.4", "pytest-asyncio>=0.25.3", "httpx>=0.27", "fastapi>=0.115"] | ||
|
|
||
| [tool.pytest.ini_options] | ||
| testpaths = ["tests"] | ||
| asyncio_mode = "auto" | ||
| pythonpath = ["src", "tests/unit"] | ||
|
|
||
| [tool.nemo.openapi] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """FastAPI dependencies for the deployments plugin API.""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| from fastapi import HTTPException, Request | ||
| from nemo_platform_plugin.entity_client import get_entity_client | ||
|
|
||
| __all__ = ["get_entity_client", "require_service_principal"] | ||
|
|
||
| _PRINCIPAL_ID_HEADER = "X-NMP-Principal-Id" | ||
|
|
||
|
|
||
| def require_service_principal(request: Request) -> None: | ||
| """Restrict controller-only status writes to service principals.""" | ||
| principal_id = request.headers.get(_PRINCIPAL_ID_HEADER, "") | ||
| if not principal_id.startswith("service:"): | ||
| raise HTTPException( | ||
| status_code=403, | ||
| detail="Status updates require a service principal.", | ||
| ) |
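A standalone sketch of the check that `require_service_principal` performs, written without FastAPI so it can be run on its own. The function name `is_service_principal` and the example principal IDs are hypothetical; the header name and the `service:` prefix come from the diff. The diff does not show who sets the header, so this sketch assumes a trusted upstream component populates it.

```python
from collections.abc import Mapping

PRINCIPAL_ID_HEADER = "X-NMP-Principal-Id"  # same header name as in the diff


def is_service_principal(headers: Mapping[str, str]) -> bool:
    """Return True when the principal header carries a ``service:`` ID."""
    # A plain dict is case-sensitive, so normalize the keys before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    principal_id = normalized.get(PRINCIPAL_ID_HEADER.lower(), "")
    return principal_id.startswith("service:")


# Hypothetical principal IDs, for illustration only.
print(is_service_principal({"X-NMP-Principal-Id": "service:reconciler"}))
print(is_service_principal({"X-NMP-Principal-Id": "user:alice"}))
```

A missing header is treated the same as a non-service principal, which matches the empty-string default in the dependency above.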
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,137 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """DeploymentConfig CRUD routes.""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import logging | ||
|
|
||
| from fastapi import APIRouter, Depends, HTTPException, Query | ||
| from nemo_deployments_plugin.api.v1.dependencies import get_entity_client | ||
| from nemo_deployments_plugin.entities import DeploymentConfig | ||
| from nemo_deployments_plugin.schema import ( | ||
| CreateDeploymentConfigRequest, | ||
| DeploymentConfigFilter, | ||
| DeploymentConfigPage, | ||
| ) | ||
| from nemo_deployments_plugin.validation import ( | ||
| PrerequisiteCycleError, | ||
| build_existing_prerequisite_map, | ||
| detect_prerequisite_cycle, | ||
| prerequisite_names, | ||
| ) | ||
| from nemo_platform_plugin.api.filters import make_filter_obj_dep | ||
| from nemo_platform_plugin.entity_client import NemoEntitiesClient, NemoEntityConflictError, NemoEntityNotFoundError | ||
| from nemo_platform_plugin.schema import PaginationData | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| router = APIRouter() | ||
|
|
||
| _config_filter_dep = make_filter_obj_dep(DeploymentConfigFilter) | ||
|
|
||
|
|
||
| async def _list_all_deployment_configs( | ||
| entity_client: NemoEntitiesClient, | ||
| workspace: str, | ||
| ) -> list[DeploymentConfig]: | ||
| """Page through all deployment configs for prerequisite graph validation.""" | ||
| page = 1 | ||
| configs: list[DeploymentConfig] = [] | ||
| while True: | ||
| result = await entity_client.list( | ||
| DeploymentConfig, | ||
| workspace=workspace, | ||
| page=page, | ||
| page_size=100, | ||
| ) | ||
| configs.extend(result.data) | ||
| if result.pagination is None or page >= result.pagination.total_pages: | ||
| break | ||
| page += 1 | ||
| return configs | ||
|
|
||
|
|
||
| @router.post("/deployment-configs", response_model=DeploymentConfig, status_code=201, tags=["Deployment Configs"]) | ||
| async def create_deployment_config( | ||
| workspace: str, | ||
| body: CreateDeploymentConfigRequest, | ||
| entity_client: NemoEntitiesClient = Depends(get_entity_client), | ||
| ) -> DeploymentConfig: | ||
| prereq_names = prerequisite_names(body.prerequisites) | ||
| try: | ||
| existing_configs = await _list_all_deployment_configs(entity_client, workspace) | ||
| existing_map = build_existing_prerequisite_map(existing_configs) | ||
| detect_prerequisite_cycle( | ||
| config_name=body.name, | ||
| prerequisites=prereq_names, | ||
| existing=existing_map, | ||
| ) | ||
| except PrerequisiteCycleError as exc: | ||
| raise HTTPException(status_code=400, detail=str(exc)) from exc | ||
|
|
||
| config = DeploymentConfig( | ||
| name=body.name, | ||
| workspace=workspace, | ||
| **body.model_dump(exclude={"name"}, exclude_none=True), | ||
| ) | ||
| try: | ||
| return await entity_client.create(config) | ||
| except NemoEntityConflictError as exc: | ||
| raise HTTPException( | ||
| status_code=409, | ||
| detail=f"DeploymentConfig '{body.name}' already exists in workspace '{workspace}'.", | ||
| ) from exc | ||
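The `validation` module is not part of the visible diff, so the following is only a sketch of how a prerequisite cycle check of this kind could work, not the PR's implementation. It assumes `existing` maps each config name to a list of its prerequisite names and that the existing graph is already acyclic; the function name `find_prerequisite_cycle` is hypothetical.

```python
from typing import Optional


def find_prerequisite_cycle(
    config_name: str,
    prerequisites: list[str],
    existing: dict[str, list[str]],
) -> Optional[list[str]]:
    """Return a cycle path through ``config_name``, or None if there is none."""
    # Add the new config's edges on top of the existing graph.
    graph = {**existing, config_name: list(prerequisites)}
    visited: set[str] = set()
    # Iterative depth-first search; each entry is a node plus the path to it.
    stack = [(dep, [config_name, dep]) for dep in graph[config_name]]
    while stack:
        node, path = stack.pop()
        if node == config_name:
            return path  # walked back to the new config: cycle found
        if node in visited:
            continue
        visited.add(node)
        for dep in graph.get(node, []):
            stack.append((dep, path + [dep]))
    return None


print(find_prerequisite_cycle("a", ["b"], {"b": ["c"], "c": ["a"]}))
```

Because only cycles that pass through the new config are searched for, the check stays sound only while every earlier create went through the same validation.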
|
|
||
|
|
||
| @router.get("/deployment-configs", response_model=DeploymentConfigPage, tags=["Deployment Configs"]) | ||
| async def list_deployment_configs( | ||
| workspace: str, | ||
| page: int = Query(default=1, ge=1), | ||
| page_size: int = Query(default=20, ge=1, le=100), | ||
| sort: str = Query(default="-created_at"), | ||
| filter: DeploymentConfigFilter = Depends(_config_filter_dep), | ||
| entity_client: NemoEntitiesClient = Depends(get_entity_client), | ||
| ) -> DeploymentConfigPage: | ||
| filter_dict = filter if isinstance(filter, dict) else filter.model_dump(exclude_none=True) | ||
| result = await entity_client.list( | ||
| DeploymentConfig, | ||
| workspace=workspace, | ||
| page=page, | ||
| page_size=page_size, | ||
| sort=sort, | ||
| filter_obj=filter_dict or None, | ||
| ) | ||
| pagination = PaginationData.model_validate(result.pagination.model_dump()) if result.pagination else None | ||
| return DeploymentConfigPage(data=result.data, pagination=pagination, sort=sort, filter=filter) | ||
|
|
||
|
|
||
| @router.get("/deployment-configs/{name}", response_model=DeploymentConfig, tags=["Deployment Configs"]) | ||
| async def get_deployment_config( | ||
| workspace: str, | ||
| name: str, | ||
| entity_client: NemoEntitiesClient = Depends(get_entity_client), | ||
| ) -> DeploymentConfig: | ||
| try: | ||
| return await entity_client.get(DeploymentConfig, name=name, workspace=workspace) | ||
| except NemoEntityNotFoundError as exc: | ||
| raise HTTPException( | ||
| status_code=404, | ||
| detail=f"DeploymentConfig '{name}' not found in workspace '{workspace}'.", | ||
| ) from exc | ||
|
|
||
|
|
||
| @router.delete("/deployment-configs/{name}", status_code=204, tags=["Deployment Configs"]) | ||
| async def delete_deployment_config( | ||
| workspace: str, | ||
| name: str, | ||
| entity_client: NemoEntitiesClient = Depends(get_entity_client), | ||
| ) -> None: | ||
| try: | ||
| await entity_client.delete(DeploymentConfig, name=name, workspace=workspace) | ||
|
Contributor: Could we ensure that there are no existing Deployments that reference the config and 4xx with a helpful message if so? Same goes for Volumes (if they're referenced by deployments). |
||
| except NemoEntityNotFoundError as exc: | ||
| raise HTTPException( | ||
| status_code=404, | ||
| detail=f"DeploymentConfig '{name}' not found in workspace '{workspace}'.", | ||
| ) from exc | ||
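The reviewer's request on this handler (reject the delete while Deployments still reference the config) could be sketched roughly as below. This is an illustration under assumptions, not code from the PR: `DeploymentRef`, `referencing_deployments`, and `delete_blocked_detail` are hypothetical names, and only the `name` and `deployment_config_name` fields are taken from the diff. The route would list the workspace's deployments, run the check, and raise an `HTTPException` when a detail string comes back; 409 is one plausible status, since the review only asks for a 4xx.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class DeploymentRef:
    """Hypothetical stand-in for the two Deployment fields the check reads."""

    name: str
    deployment_config_name: str


def referencing_deployments(config_name: str, deployments: Iterable[DeploymentRef]) -> list[str]:
    """Return the sorted names of deployments that point at ``config_name``."""
    return sorted(d.name for d in deployments if d.deployment_config_name == config_name)


def delete_blocked_detail(config_name: str, workspace: str, referrers: list[str]) -> Optional[str]:
    """Build an error detail for a blocked delete, or None if it may proceed."""
    if not referrers:
        return None
    return (
        f"DeploymentConfig '{config_name}' in workspace '{workspace}' is still "
        f"referenced by deployments: {', '.join(referrers)}."
    )
```

The same shape would apply to Volumes, with the referencing field swapped for whichever field links a deployment to a volume.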
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,148 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| """Deployment CRUD routes.""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import logging | ||
| from typing import cast | ||
|
|
||
| from fastapi import APIRouter, Depends, HTTPException, Query | ||
| from nemo_deployments_plugin.api.v1.dependencies import get_entity_client | ||
| from nemo_deployments_plugin.entities import Deployment, DeploymentConfig, DeploymentStatus | ||
| from nemo_deployments_plugin.schema import CreateDeploymentRequest, DeploymentFilter, DeploymentPage | ||
| from nemo_platform_plugin.api.filters import make_filter_obj_dep | ||
| from nemo_platform_plugin.entity_client import NemoEntitiesClient, NemoEntityConflictError, NemoEntityNotFoundError | ||
| from nemo_platform_plugin.filter_ops import ComparisonOperation, FilterOperator | ||
| from nemo_platform_plugin.schema import PaginationData | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| router = APIRouter() | ||
|
|
||
| _deployment_filter_dep = make_filter_obj_dep(DeploymentFilter) | ||
|
|
||
| _VALID_DEPLOYMENT_STATUSES: frozenset[str] = frozenset( | ||
| {"PENDING", "STARTING", "READY", "SUCCEEDED", "FAILED", "LOST", "DELETING"} | ||
| ) | ||
|
|
||
|
|
||
| def _parse_status_in(status_in: str | None) -> list[DeploymentStatus]: | ||
| if not status_in: | ||
| return [] | ||
| values = [part.strip().upper() for part in status_in.split(",") if part.strip()] | ||
| invalid = [value for value in values if value not in _VALID_DEPLOYMENT_STATUSES] | ||
| if invalid: | ||
| raise HTTPException( | ||
| status_code=400, | ||
| detail=f"Invalid deployment status values: {', '.join(invalid)}", | ||
| ) | ||
| return cast(list[DeploymentStatus], values) | ||
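The README example passes lowercase values (`status_in=pending,starting`) while the allowed set is uppercase. The standalone sketch below mirrors the normalization in `_parse_status_in` to show why that works; it raises `ValueError` in place of `HTTPException` so it runs without FastAPI, and the public names `parse_status_in` and `VALID_STATUSES` are hypothetical.

```python
VALID_STATUSES = frozenset(
    {"PENDING", "STARTING", "READY", "SUCCEEDED", "FAILED", "LOST", "DELETING"}
)


def parse_status_in(status_in):
    """Split, trim, and upper-case a comma-separated status list."""
    if not status_in:
        return []
    # Empty segments (for example from a trailing comma) are dropped.
    values = [part.strip().upper() for part in status_in.split(",") if part.strip()]
    invalid = [value for value in values if value not in VALID_STATUSES]
    if invalid:
        raise ValueError(f"Invalid deployment status values: {', '.join(invalid)}")
    return values


print(parse_status_in("pending, starting,"))
```

Unknown values are reported back in their upper-cased form, which is also what the route's 400 detail would contain.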
|
|
||
|
|
||
| @router.post("/deployments", response_model=Deployment, status_code=201, tags=["Deployments"]) | ||
| async def create_deployment( | ||
| workspace: str, | ||
| body: CreateDeploymentRequest, | ||
| entity_client: NemoEntitiesClient = Depends(get_entity_client), | ||
| ) -> Deployment: | ||
| try: | ||
| await entity_client.get(DeploymentConfig, name=body.deployment_config_name, workspace=workspace) | ||
| except NemoEntityNotFoundError as exc: | ||
| raise HTTPException( | ||
| status_code=404, | ||
| detail=(f"DeploymentConfig '{body.deployment_config_name}' not found in workspace '{workspace}'."), | ||
| ) from exc | ||
|
|
||
| deployment = Deployment( | ||
| name=body.name, | ||
| workspace=workspace, | ||
| deployment_config_name=body.deployment_config_name, | ||
| desired_state=body.desired_state, | ||
| executor=body.executor, | ||
| status="PENDING", | ||
| ) | ||
| try: | ||
| return await entity_client.create(deployment) | ||
| except NemoEntityConflictError as exc: | ||
| raise HTTPException( | ||
| status_code=409, | ||
| detail=f"Deployment '{body.name}' already exists in workspace '{workspace}'.", | ||
| ) from exc | ||
|
|
||
|
|
||
| @router.get("/deployments", response_model=DeploymentPage, tags=["Deployments"]) | ||
| async def list_deployments( | ||
| workspace: str, | ||
| page: int = Query(default=1, ge=1), | ||
| page_size: int = Query(default=20, ge=1, le=100), | ||
| sort: str = Query(default="-created_at"), | ||
| status_in: str | None = Query( | ||
| default=None, | ||
| description="Comma-separated deployment statuses for bulk reconciler queries.", | ||
| ), | ||
| filter: DeploymentFilter = Depends(_deployment_filter_dep), | ||
| entity_client: NemoEntitiesClient = Depends(get_entity_client), | ||
| ) -> DeploymentPage: | ||
| filter_dict = filter if isinstance(filter, dict) else filter.model_dump(exclude_none=True) | ||
| statuses = _parse_status_in(status_in) if status_in else [] | ||
| filter_operation = None | ||
| if statuses: | ||
| filter_operation = ComparisonOperation( | ||
| operator=FilterOperator.IN, | ||
| field="status", | ||
| value=statuses, | ||
| ) | ||
| result = await entity_client.list( | ||
| Deployment, | ||
| workspace=workspace, | ||
| page=page, | ||
| page_size=page_size, | ||
| sort=sort, | ||
| filter_obj=filter_dict or None, | ||
| filter_operation=filter_operation, | ||
| ) | ||
| pagination = PaginationData.model_validate(result.pagination.model_dump()) if result.pagination else None | ||
| return DeploymentPage(data=result.data, pagination=pagination, sort=sort, filter=filter) | ||
|
|
||
|
|
||
| @router.get("/deployments/{name}", response_model=Deployment, tags=["Deployments"]) | ||
| async def get_deployment( | ||
| workspace: str, | ||
| name: str, | ||
| entity_client: NemoEntitiesClient = Depends(get_entity_client), | ||
| ) -> Deployment: | ||
| try: | ||
| return await entity_client.get(Deployment, name=name, workspace=workspace) | ||
| except NemoEntityNotFoundError as exc: | ||
| raise HTTPException( | ||
| status_code=404, | ||
| detail=f"Deployment '{name}' not found in workspace '{workspace}'.", | ||
| ) from exc | ||
|
|
||
|
|
||
| @router.delete("/deployments/{name}", status_code=204, tags=["Deployments"]) | ||
| async def delete_deployment( | ||
| workspace: str, | ||
| name: str, | ||
| entity_client: NemoEntitiesClient = Depends(get_entity_client), | ||
| ) -> None: | ||
| try: | ||
| deployment = await entity_client.get(Deployment, name=name, workspace=workspace) | ||
| except NemoEntityNotFoundError as exc: | ||
| raise HTTPException( | ||
| status_code=404, | ||
| detail=f"Deployment '{name}' not found in workspace '{workspace}'.", | ||
| ) from exc | ||
|
|
||
| deployment.status = "DELETING" | ||
| try: | ||
| await entity_client.update(deployment) | ||
| except NemoEntityNotFoundError: | ||
| logger.info("Deployment already deleted before status update") | ||
| except NemoEntityConflictError as exc: | ||
| raise HTTPException( | ||
| status_code=409, | ||
| detail=f"Deployment '{name}' is being modified concurrently.", | ||
| ) from exc |
Nit: We can probs drop everything from this `Scope` line through to the end of the file, save for the `Tests` section.

After the work is done and backends are ready, I'm sure an agent can cook up a good README that's helpful and doesn't have transient info in it.