33 changes: 33 additions & 0 deletions plugins/nemo-deployments/README.md
@@ -0,0 +1,33 @@
# NeMo Deployments Plugin

Substrate-agnostic deployment lifecycle for the NeMo Platform. This plugin provides
entity schemas, CRUD APIs, a `DeploymentBackend` ABC, and an executor registry.

**Scope (this ticket):** scaffold only — entity types, v1 CRUD routes, backend contract,

Review comment (Contributor):

Nit: We can probs drop everything from this Scope line through to the end of the file, save for the Tests section.

After the work is done and backends are ready, I'm sure an agent can cook up a good README that's helpful and doesn't have transient info in it.

and executor registry. Docker/K8s backends and the reconcile controller land in follow-on
tickets (756–758).

## Prerequisites

- NeMo Platform workspace bootstrapped (`make bootstrap`, `nemo setup`)
- Plugin enabled in root `pyproject.toml` (`enabled-plugins` includes `deployments`)

## API base path

`/apis/deployments/v1/workspaces/{workspace}/...`

Cross-workspace bulk queries use the entity-store sentinel workspace ``-``:

``GET /apis/deployments/v1/workspaces/-/deployments?status_in=pending,starting``
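A minimal sketch of how a client might assemble the path above. The helper name `deployments_list_path` and the constants are hypothetical and are not part of the plugin; only the base path, the `-` sentinel, and the `status_in` parameter come from this README.

```python
from __future__ import annotations

from urllib.parse import quote, urlencode

API_PREFIX = "/apis/deployments/v1/workspaces"
ALL_WORKSPACES = "-"  # entity-store sentinel workspace named in the README


def deployments_list_path(workspace: str, statuses: list[str] | None = None) -> str:
    """Build the list-deployments path, optionally filtered by status (hypothetical helper)."""
    path = f"{API_PREFIX}/{quote(workspace, safe='')}/deployments"
    if statuses:
        # Keep commas literal so the value stays a readable comma-separated list.
        path += "?" + urlencode({"status_in": ",".join(statuses)}, safe=",")
    return path


print(deployments_list_path(ALL_WORKSPACES, ["pending", "starting"]))
```

Passing a concrete workspace name instead of `ALL_WORKSPACES` scopes the same query to that workspace.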

## Next steps

- **756 / 757:** Docker and Kubernetes `DeploymentBackend` implementations
- **758:** Reconcile controller wiring status writes and backend lifecycle

## Tests

```bash
uv sync
uv run pytest plugins/nemo-deployments/tests/unit -v
```
Comment on lines +1 to +33

🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

README violates Diataxis structure and progressive-disclosure guidelines.

This page mixes three Diataxis quadrants (REFERENCE for API path, HOW-TO for commands, EXPLANATION for plugin overview) into 18 lines. Per guidelines, "Each documentation page should fit ONE Diataxis quadrant; do not mix tutorials with reference tables or how-tos with architecture explanations."

Restructure by either narrowing to one quadrant (e.g., HOW-TO: "Set up and test the plugin") and moving API/architecture details to separate pages, or expanding into a proper multi-section structure with progressive disclosure: Layer 1 (30s: what/who/value) → Layer 2 (3-5min: core concepts and architecture) → Layer 3 (10min+: API endpoints and advanced features) → Layer 4 (separate reference pages for complete specs).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/README.md` around lines 1 - 18, The README mixes
reference, how-to, and explanatory content; split it into a single-quadrant page
or reorganize with progressive disclosure. Choose either: (A) convert this
README into a HOW-TO "Set up and test the plugin" that keeps the uv sync/pytest
commands and a brief 30s summary, then move the API base path and
architecture/Backend/DeploymentBackend ABC details into separate REFERENCE and
EXPLANATION pages; or (B) expand this README into layered sections (Layer 1: 30s
summary of purpose/value and intended audience; Layer 2: 3–5min core concepts
and mention of DeploymentBackend and executor registry; Layer 3: 10min+ API
endpoints with the `/apis/deployments/v1/workspaces/{workspace}/...` example and
cross-workspace sentinel `-`; Layer 4: links to separate reference pages for
full API spec and schema). Update headings accordingly (e.g., "How to set up and
test", "Concepts", "API reference", "Further reading") and move details about
the DeploymentBackend ABC and executor registry into dedicated files or sections
to avoid mixing quadrants.

Source: Coding guidelines

Comment thread
coderabbitai[bot] marked this conversation as resolved.
36 changes: 36 additions & 0 deletions plugins/nemo-deployments/pyproject.toml
@@ -0,0 +1,36 @@
[project]
name = "nemo-deployments-plugin"
version = "0.1.0"
description = "NeMo Deployments Plugin — substrate-agnostic deployment lifecycle on the NeMo Platform."
readme = "README.md"
requires-python = ">=3.11,<3.14"
dependencies = [
    "fastapi>=0.115",
    "nemo-platform",
    "nemo-platform-plugin",
    "pydantic>=2.10.6",
]
Comment thread
coderabbitai[bot] marked this conversation as resolved.

[project.entry-points."nemo.services"]
deployments = "nemo_deployments_plugin.service:DeploymentsService"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/nemo_deployments_plugin"]

[tool.uv.sources]
nemo-platform = { workspace = true }
nemo-platform-plugin = { workspace = true }

[dependency-groups]
dev = ["pytest>=8.3.4", "pytest-asyncio>=0.25.3", "httpx>=0.27", "fastapi>=0.115"]

[tool.pytest.ini_options]
testpaths = ["tests"]
asyncio_mode = "auto"
pythonpath = ["src", "tests/unit"]

[tool.nemo.openapi]
@@ -0,0 +1,23 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""FastAPI dependencies for the deployments plugin API."""

from __future__ import annotations

from fastapi import HTTPException, Request
from nemo_platform_plugin.entity_client import get_entity_client

__all__ = ["get_entity_client", "require_service_principal"]

_PRINCIPAL_ID_HEADER = "X-NMP-Principal-Id"


def require_service_principal(request: Request) -> None:
    """Restrict controller-only status writes to service principals."""
    principal_id = request.headers.get(_PRINCIPAL_ID_HEADER, "")
    if not principal_id.startswith("service:"):
        raise HTTPException(
            status_code=403,
            detail="Status updates require a service principal.",
        )
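
The decision made by `require_service_principal` can be exercised without FastAPI. The sketch below isolates the prefix check as a pure function; `is_service_principal` and the sample principal ids (`service:reconciler`, `user:alice`) are hypothetical and only illustrate the rule in the diff.

```python
from __future__ import annotations

from collections.abc import Mapping

PRINCIPAL_ID_HEADER = "X-NMP-Principal-Id"
SERVICE_PREFIX = "service:"


def is_service_principal(headers: Mapping[str, str]) -> bool:
    """Return True when the principal header names a service principal."""
    # HTTP header names are case-insensitive, but a plain dict is not,
    # so normalize the keys before looking the header up.
    normalized = {key.lower(): value for key, value in headers.items()}
    principal_id = normalized.get(PRINCIPAL_ID_HEADER.lower(), "")
    return principal_id.startswith(SERVICE_PREFIX)


print(is_service_principal({"X-NMP-Principal-Id": "service:reconciler"}))
print(is_service_principal({"X-NMP-Principal-Id": "user:alice"}))
```

A missing header falls back to the empty string, so it is rejected the same way as a non-service principal.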
@@ -0,0 +1,137 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""DeploymentConfig CRUD routes."""

from __future__ import annotations

import logging

from fastapi import APIRouter, Depends, HTTPException, Query
from nemo_deployments_plugin.api.v1.dependencies import get_entity_client
from nemo_deployments_plugin.entities import DeploymentConfig
from nemo_deployments_plugin.schema import (
    CreateDeploymentConfigRequest,
    DeploymentConfigFilter,
    DeploymentConfigPage,
)
from nemo_deployments_plugin.validation import (
    PrerequisiteCycleError,
    build_existing_prerequisite_map,
    detect_prerequisite_cycle,
    prerequisite_names,
)
from nemo_platform_plugin.api.filters import make_filter_obj_dep
from nemo_platform_plugin.entity_client import NemoEntitiesClient, NemoEntityConflictError, NemoEntityNotFoundError
from nemo_platform_plugin.schema import PaginationData

logger = logging.getLogger(__name__)

router = APIRouter()

_config_filter_dep = make_filter_obj_dep(DeploymentConfigFilter)


async def _list_all_deployment_configs(
    entity_client: NemoEntitiesClient,
    workspace: str,
) -> list[DeploymentConfig]:
    """Page through all deployment configs for prerequisite graph validation."""
    page = 1
    configs: list[DeploymentConfig] = []
    while True:
        result = await entity_client.list(
            DeploymentConfig,
            workspace=workspace,
            page=page,
            page_size=100,
        )
        configs.extend(result.data)
        if result.pagination is None or page >= result.pagination.total_pages:
            break
        page += 1
    return configs
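
The page-through loop above can be tried in isolation. This sketch reuses the same termination rule (stop when pagination is missing or the last page is reached) against a hypothetical in-memory client. `FakeEntityClient`, `Page`, and `Pagination` are stand-ins invented for illustration and do not mirror the real `NemoEntitiesClient` types.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass


@dataclass
class Pagination:
    total_pages: int


@dataclass
class Page:
    data: list[str]
    pagination: Pagination | None


class FakeEntityClient:
    """Hypothetical in-memory stand-in for a paginated entity client."""

    def __init__(self, items: list[str]) -> None:
        self._items = items

    async def list(self, *, page: int, page_size: int) -> Page:
        start = (page - 1) * page_size
        # Ceiling division; an empty store still reports one (empty) page.
        total_pages = max(1, -(-len(self._items) // page_size))
        return Page(self._items[start : start + page_size], Pagination(total_pages))


async def list_all(client: FakeEntityClient, page_size: int = 100) -> list[str]:
    """Collect every item by requesting pages until the last one is reached."""
    page = 1
    items: list[str] = []
    while True:
        result = await client.list(page=page, page_size=page_size)
        items.extend(result.data)
        if result.pagination is None or page >= result.pagination.total_pages:
            break
        page += 1
    return items


names = asyncio.run(list_all(FakeEntityClient([f"cfg-{i}" for i in range(250)])))
print(len(names))
```

One trade-off of this pattern is that every create request reads the whole workspace, so its cost grows with the number of configs.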


@router.post("/deployment-configs", response_model=DeploymentConfig, status_code=201, tags=["Deployment Configs"])
async def create_deployment_config(
    workspace: str,
    body: CreateDeploymentConfigRequest,
    entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> DeploymentConfig:
    prereq_names = prerequisite_names(body.prerequisites)
    try:
        existing_configs = await _list_all_deployment_configs(entity_client, workspace)
        existing_map = build_existing_prerequisite_map(existing_configs)
        detect_prerequisite_cycle(
            config_name=body.name,
            prerequisites=prereq_names,
            existing=existing_map,
        )
    except PrerequisiteCycleError as exc:
        raise HTTPException(status_code=400, detail=str(exc)) from exc

    config = DeploymentConfig(
        name=body.name,
        workspace=workspace,
        **body.model_dump(exclude={"name"}, exclude_none=True),
    )
    try:
        return await entity_client.create(config)
    except NemoEntityConflictError as exc:
        raise HTTPException(
            status_code=409,
            detail=f"DeploymentConfig '{body.name}' already exists in workspace '{workspace}'.",
        ) from exc
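
The `nemo_deployments_plugin.validation` module is not part of this diff, so how `detect_prerequisite_cycle` works is not shown here. As an illustration only, one common way to implement such a check is a depth-first search over the prerequisite graph with the new config added. The function `find_cycle` and its return shape are assumptions, not the plugin's API.

```python
from __future__ import annotations

from collections.abc import Iterable, Mapping


def find_cycle(
    config_name: str,
    prerequisites: Iterable[str],
    existing: Mapping[str, Iterable[str]],
) -> list[str] | None:
    """Return one cycle reachable from the new config as a path, or None."""
    graph = {name: list(deps) for name, deps in existing.items()}
    graph[config_name] = list(prerequisites)

    path: list[str] = []      # current DFS path, in visit order
    on_path: set[str] = set()  # same nodes as `path`, for O(1) membership
    done: set[str] = set()     # nodes fully explored without finding a cycle

    def visit(node: str) -> list[str] | None:
        if node in on_path:
            # Back edge: the cycle is the path from the first visit of `node`.
            return path[path.index(node):] + [node]
        if node in done:
            return None
        on_path.add(node)
        path.append(node)
        # Names missing from the graph are treated as leaves (dangling references).
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        on_path.discard(node)
        path.pop()
        done.add(node)
        return None

    return visit(config_name)


print(find_cycle("web", ["api"], {"api": ["db"], "db": []}))
print(find_cycle("web", ["api"], {"api": ["web"]}))
```

Starting the search at the new config is enough when the stored graph is already acyclic, because any new cycle has to pass through the node being added.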


@router.get("/deployment-configs", response_model=DeploymentConfigPage, tags=["Deployment Configs"])
async def list_deployment_configs(
    workspace: str,
    page: int = Query(default=1, ge=1),
    page_size: int = Query(default=20, ge=1, le=100),
    sort: str = Query(default="-created_at"),
    filter: DeploymentConfigFilter = Depends(_config_filter_dep),
    entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> DeploymentConfigPage:
    filter_dict = filter if isinstance(filter, dict) else filter.model_dump(exclude_none=True)
    result = await entity_client.list(
        DeploymentConfig,
        workspace=workspace,
        page=page,
        page_size=page_size,
        sort=sort,
        filter_obj=filter_dict or None,
    )
    pagination = PaginationData.model_validate(result.pagination.model_dump()) if result.pagination else None
    return DeploymentConfigPage(data=result.data, pagination=pagination, sort=sort, filter=filter)


@router.get("/deployment-configs/{name}", response_model=DeploymentConfig, tags=["Deployment Configs"])
async def get_deployment_config(
    workspace: str,
    name: str,
    entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> DeploymentConfig:
    try:
        return await entity_client.get(DeploymentConfig, name=name, workspace=workspace)
    except NemoEntityNotFoundError as exc:
        raise HTTPException(
            status_code=404,
            detail=f"DeploymentConfig '{name}' not found in workspace '{workspace}'.",
        ) from exc


@router.delete("/deployment-configs/{name}", status_code=204, tags=["Deployment Configs"])
async def delete_deployment_config(
    workspace: str,
    name: str,
    entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> None:
    try:
        await entity_client.delete(DeploymentConfig, name=name, workspace=workspace)

Review comment (Contributor):

Could we ensure that there are no existing Deployments that reference the config and 4xx with a helpful message if so? Same goes for Volumes (if they're referenced by deployments).

    except NemoEntityNotFoundError as exc:
        raise HTTPException(
            status_code=404,
            detail=f"DeploymentConfig '{name}' not found in workspace '{workspace}'.",
        ) from exc
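
The review comment on `delete_deployment_config` asks for a guard that rejects deletion while deployments still reference the config. A minimal sketch of that check as a pure function follows. `DeploymentRef`, `ConfigInUseError`, and `ensure_config_unreferenced` are hypothetical names, and how the referencing deployments would be fetched from the entity store is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentRef:
    """Hypothetical minimal view of a Deployment: only the fields the check needs."""

    name: str
    deployment_config_name: str


class ConfigInUseError(Exception):
    """Raised when a config cannot be deleted because deployments still use it."""

    def __init__(self, config_name: str, deployment_names: list[str]) -> None:
        self.config_name = config_name
        self.deployment_names = deployment_names
        super().__init__(
            f"DeploymentConfig '{config_name}' is referenced by deployments: "
            f"{', '.join(deployment_names)}. Delete them first."
        )


def ensure_config_unreferenced(config_name: str, deployments: list[DeploymentRef]) -> None:
    """Raise ConfigInUseError if any deployment points at the given config."""
    referencing = sorted(d.name for d in deployments if d.deployment_config_name == config_name)
    if referencing:
        raise ConfigInUseError(config_name, referencing)


deployments = [
    DeploymentRef("web", "cfg-a"),
    DeploymentRef("api", "cfg-a"),
    DeploymentRef("job", "cfg-b"),
]
ensure_config_unreferenced("cfg-c", deployments)  # no references, so no error
```

In a route handler the exception would presumably map to a 4xx response such as 409. The reviewer does not name a status code, so that choice is an assumption.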
@@ -0,0 +1,148 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

"""Deployment CRUD routes."""

from __future__ import annotations

import logging
from typing import cast

from fastapi import APIRouter, Depends, HTTPException, Query
from nemo_deployments_plugin.api.v1.dependencies import get_entity_client
from nemo_deployments_plugin.entities import Deployment, DeploymentConfig, DeploymentStatus
from nemo_deployments_plugin.schema import CreateDeploymentRequest, DeploymentFilter, DeploymentPage
from nemo_platform_plugin.api.filters import make_filter_obj_dep
from nemo_platform_plugin.entity_client import NemoEntitiesClient, NemoEntityConflictError, NemoEntityNotFoundError
from nemo_platform_plugin.filter_ops import ComparisonOperation, FilterOperator
from nemo_platform_plugin.schema import PaginationData

logger = logging.getLogger(__name__)

router = APIRouter()

_deployment_filter_dep = make_filter_obj_dep(DeploymentFilter)

_VALID_DEPLOYMENT_STATUSES: frozenset[str] = frozenset(
    {"PENDING", "STARTING", "READY", "SUCCEEDED", "FAILED", "LOST", "DELETING"}
)


def _parse_status_in(status_in: str | None) -> list[DeploymentStatus]:
    if not status_in:
        return []
    values = [part.strip().upper() for part in status_in.split(",") if part.strip()]
    invalid = [value for value in values if value not in _VALID_DEPLOYMENT_STATUSES]
    if invalid:
        raise HTTPException(
            status_code=400,
            detail=f"Invalid deployment status values: {', '.join(invalid)}",
        )
    return cast(list[DeploymentStatus], values)
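
To see how `_parse_status_in` treats different inputs, the sketch below repeats its logic as a standalone function. It raises `ValueError` instead of `HTTPException` so that it runs without FastAPI; the name `parse_status_in` is used only for this illustration.

```python
from __future__ import annotations

VALID_DEPLOYMENT_STATUSES = frozenset(
    {"PENDING", "STARTING", "READY", "SUCCEEDED", "FAILED", "LOST", "DELETING"}
)


def parse_status_in(status_in: str | None) -> list[str]:
    """Split a comma-separated status list, normalizing case and whitespace."""
    if not status_in:
        return []
    # Empty segments (for example from a trailing comma) are dropped.
    values = [part.strip().upper() for part in status_in.split(",") if part.strip()]
    invalid = [value for value in values if value not in VALID_DEPLOYMENT_STATUSES]
    if invalid:
        raise ValueError(f"Invalid deployment status values: {', '.join(invalid)}")
    return values


print(parse_status_in("pending, starting"))
```

Lowercase input such as `pending,starting` is accepted because each value is uppercased before validation, and an unknown value is rejected with a message that lists every offending entry.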


@router.post("/deployments", response_model=Deployment, status_code=201, tags=["Deployments"])
async def create_deployment(
    workspace: str,
    body: CreateDeploymentRequest,
    entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> Deployment:
    try:
        await entity_client.get(DeploymentConfig, name=body.deployment_config_name, workspace=workspace)
    except NemoEntityNotFoundError as exc:
        raise HTTPException(
            status_code=404,
            detail=(f"DeploymentConfig '{body.deployment_config_name}' not found in workspace '{workspace}'."),
        ) from exc

    deployment = Deployment(
        name=body.name,
        workspace=workspace,
        deployment_config_name=body.deployment_config_name,
        desired_state=body.desired_state,
        executor=body.executor,
        status="PENDING",
    )
    try:
        return await entity_client.create(deployment)
    except NemoEntityConflictError as exc:
        raise HTTPException(
            status_code=409,
            detail=f"Deployment '{body.name}' already exists in workspace '{workspace}'.",
        ) from exc


@router.get("/deployments", response_model=DeploymentPage, tags=["Deployments"])
async def list_deployments(
    workspace: str,
    page: int = Query(default=1, ge=1),
    page_size: int = Query(default=20, ge=1, le=100),
    sort: str = Query(default="-created_at"),
    status_in: str | None = Query(
        default=None,
        description="Comma-separated deployment statuses for bulk reconciler queries.",
    ),
    filter: DeploymentFilter = Depends(_deployment_filter_dep),
    entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> DeploymentPage:
    filter_dict = filter if isinstance(filter, dict) else filter.model_dump(exclude_none=True)
    statuses = _parse_status_in(status_in) if status_in else []
    filter_operation = None
    if statuses:
        filter_operation = ComparisonOperation(
            operator=FilterOperator.IN,
            field="status",
            value=statuses,
        )
    result = await entity_client.list(
        Deployment,
        workspace=workspace,
        page=page,
        page_size=page_size,
        sort=sort,
        filter_obj=filter_dict or None,
        filter_operation=filter_operation,
    )
    pagination = PaginationData.model_validate(result.pagination.model_dump()) if result.pagination else None
    return DeploymentPage(data=result.data, pagination=pagination, sort=sort, filter=filter)


@router.get("/deployments/{name}", response_model=Deployment, tags=["Deployments"])
async def get_deployment(
    workspace: str,
    name: str,
    entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> Deployment:
    try:
        return await entity_client.get(Deployment, name=name, workspace=workspace)
    except NemoEntityNotFoundError as exc:
        raise HTTPException(
            status_code=404,
            detail=f"Deployment '{name}' not found in workspace '{workspace}'.",
        ) from exc


@router.delete("/deployments/{name}", status_code=204, tags=["Deployments"])
async def delete_deployment(
    workspace: str,
    name: str,
    entity_client: NemoEntitiesClient = Depends(get_entity_client),
) -> None:
    try:
        deployment = await entity_client.get(Deployment, name=name, workspace=workspace)
    except NemoEntityNotFoundError as exc:
        raise HTTPException(
            status_code=404,
            detail=f"Deployment '{name}' not found in workspace '{workspace}'.",
        ) from exc

    deployment.status = "DELETING"
    try:
        await entity_client.update(deployment)
    except NemoEntityNotFoundError:
        logger.info("Deployment already deleted before status update")
    except NemoEntityConflictError as exc:
        raise HTTPException(
            status_code=409,
            detail=f"Deployment '{name}' is being modified concurrently.",
        ) from exc