43 changes: 34 additions & 9 deletions plugins/nemo-deployments/README.md
@@ -1,16 +1,39 @@
# NeMo Deployments Plugin

Substrate-agnostic deployment lifecycle for the NeMo Platform. This plugin provides
entity schemas, CRUD APIs, a `DeploymentBackend` ABC, and an executor registry.
entity schemas, CRUD APIs, a `DeploymentBackend` ABC, an executor registry, and a
background reconcile controller (`DeploymentsController`).

**Scope (this ticket):** scaffold only — entity types, v1 CRUD routes, backend contract,
and executor registry. Docker/K8s backends and the reconcile controller land in follow-on
tickets (756–758).
**Scope:** entity types, v1 CRUD routes, backend contract, executor registry, and the
reconcile controller (758). Docker/K8s backends land in follow-on tickets (756–757).

## Prerequisites

- NeMo Platform workspace bootstrapped (`make bootstrap`, `nemo setup`)
- Plugin enabled in root `pyproject.toml` (`enabled-plugins` includes `deployments`)
- At least one executor backend registered for live reconciliation (756+)

## Controller

Register `DeploymentsController` via the `nemo.controllers` entry point. The controller:

- Paginates non-terminal deployment/volume lists (no 100-item cap)
- Reconciles volumes before deployments (puller→server ordering)
- Gates deployment create on mounted volumes reaching `BOUND`
- Writes status via the service-principal entity client
- Tracks list health separately: `_deployments_list_ok` and `_volumes_list_ok`; `is_healthy` is true only when **both** succeed
- Runs orphan substrate cleanup on a configurable interval (skipped when deployment list fails)
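
The health and cleanup rules in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the controller's actual code: `CycleState` and `should_run_orphan_cleanup` are hypothetical names, and scheduling cleanup with a modulo on a cycle counter is an assumption. The "0 disables" rule comes from the `orphan_cleanup_every_n_cycles` field description in `config.py`.

```python
from dataclasses import dataclass


@dataclass
class CycleState:
    """Hypothetical stand-in for the controller's per-cycle bookkeeping."""

    deployments_list_ok: bool = False
    volumes_list_ok: bool = False
    cycle: int = 0

    @property
    def is_healthy(self) -> bool:
        # Healthy only when both list calls succeeded.
        return self.deployments_list_ok and self.volumes_list_ok


def should_run_orphan_cleanup(state: CycleState, every_n_cycles: int) -> bool:
    """Decide whether orphan cleanup runs on this cycle (assumed modulo schedule)."""
    if every_n_cycles == 0:
        # 0 disables cleanup entirely.
        return False
    if not state.deployments_list_ok:
        # Without a trustworthy deployment list, live resources could be
        # mistaken for orphans, so cleanup is skipped.
        return False
    return state.cycle % every_n_cycles == 0
```

Keeping the two list flags separate means a volume-list failure marks the controller unhealthy without blocking orphan cleanup, which depends only on the deployment list.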

Per-config drift backoff overrides live on `DeploymentConfig.driftRecovery` (`maxAttempts`, `baseDelaySeconds`, `maxDelaySeconds`); unset fields fall back to `DeploymentsConfig.controller`.
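
As a rough sketch of the fallback described above, each unset `driftRecovery` field can be resolved against the controller-level default before the delay is computed. The dataclasses and function names here are hypothetical. The exponential doubling in `backoff_delay` is also an assumption, since the README names a base delay and a cap but does not state the growth formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControllerDefaults:
    # Mirrors the ControllerConfig defaults shown in this PR.
    drift_recovery_max_attempts: int = 5
    drift_recovery_base_delay_seconds: int = 5
    drift_recovery_max_delay_seconds: int = 300


@dataclass
class DriftRecovery:
    # Per-config overrides; None means "unset".
    maxAttempts: Optional[int] = None
    baseDelaySeconds: Optional[int] = None
    maxDelaySeconds: Optional[int] = None


def resolve(override: DriftRecovery, defaults: ControllerDefaults) -> tuple[int, int, int]:
    """Return (max_attempts, base_delay, max_delay) with unset fields filled in."""

    def pick(value: Optional[int], fallback: int) -> int:
        # Compare against None so an explicit 0 override is respected.
        return fallback if value is None else value

    return (
        pick(override.maxAttempts, defaults.drift_recovery_max_attempts),
        pick(override.baseDelaySeconds, defaults.drift_recovery_base_delay_seconds),
        pick(override.maxDelaySeconds, defaults.drift_recovery_max_delay_seconds),
    )


def backoff_delay(attempt: int, base: int, cap: int) -> int:
    """Assumed exponential backoff: base * 2**attempt, capped at `cap`."""
    return min(cap, base * (2 ** attempt))
```

For example, `resolve(DriftRecovery(baseDelaySeconds=10), ControllerDefaults())` keeps the default attempt count and cap and overrides only the base delay.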

## Deferred (follow-on tickets)

| Item | Why deferred |
|------|----------------|
| Volume delete → `RELEASED` | Volume DELETE API removes the entity immediately; no `DELETING` state or `list_managed_volume_names` on the backend ABC yet |
| Volume orphan cleanup | Requires backend support to list substrate volumes without entities |
| Docker/K8s E2E | AIRCORE-756/757 — `BACKEND_CLASSES` empty until backends register |
| Per-volume executor routing | No `Volume.executor` field in 755; volumes use `default_executor` |

## API base path

@@ -20,14 +43,16 @@ Cross-workspace bulk queries use the entity-store sentinel workspace ``-``:

``GET /apis/deployments/v1/workspaces/-/deployments?status_in=pending,starting``
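
A client could build this cross-workspace query as in the sketch below. The helper name and the `http://localhost:8080` host in the usage note are placeholders, not part of the plugin. Only the path, the `-` sentinel, and the `status_in` parameter come from the README.

```python
from urllib.parse import urlencode

ALL_WORKSPACES = "-"  # entity-store sentinel workspace from the README


def deployments_list_url(
    base_url: str,
    statuses: list[str],
    workspace: str = ALL_WORKSPACES,
) -> str:
    """Build the deployments list URL for one workspace or for all of them."""
    path = f"/apis/deployments/v1/workspaces/{workspace}/deployments"
    # safe="," keeps the comma-separated status list readable in the URL.
    query = urlencode({"status_in": ",".join(statuses)}, safe=",")
    return f"{base_url.rstrip('/')}{path}?{query}"
```

Calling `deployments_list_url("http://localhost:8080", ["pending", "starting"])` produces the same path and query string as the request shown above.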

## Next steps

- **756 / 757:** Docker and Kubernetes `DeploymentBackend` implementations
- **758:** Reconcile controller wiring status writes and backend lifecycle

## Tests

```bash
uv sync
uv run pytest plugins/nemo-deployments/tests/unit -v
```

## Next steps

- **[AIRCORE-756](https://linear.app/nvidia/issue/AIRCORE-756):** Docker `DeploymentBackend` — unblocks reconciler E2E and volume orphan cleanup
- **[AIRCORE-757](https://linear.app/nvidia/issue/AIRCORE-757):** Kubernetes `DeploymentBackend`
- **[AIRCORE-759](https://linear.app/nvidia/issue/AIRCORE-759):** Models/agents adoption projecting from plugin `Deployment` status
- **755 scaffold:** entity CRUD and executor registry ([PR #280](https://github.com/NVIDIA-NeMo/nemo-platform/pull/280))
3 changes: 3 additions & 0 deletions plugins/nemo-deployments/pyproject.toml
@@ -14,6 +14,9 @@ dependencies = [
[project.entry-points."nemo.services"]
deployments = "nemo_deployments_plugin.service:DeploymentsService"

[project.entry-points."nemo.controllers"]
deployments = "nemo_deployments_plugin.controller:DeploymentsController"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
28 changes: 28 additions & 0 deletions plugins/nemo-deployments/src/nemo_deployments_plugin/config.py
@@ -17,6 +17,30 @@ class ExecutorConfigEntry(BaseModel):
    config: dict[str, Any] = Field(default_factory=dict)


class ControllerConfig(BaseModel):
    """Configuration for the deployments reconcile controller."""

    interval_seconds: int = Field(default=5, gt=0, description="Reconciliation loop interval in seconds.")
    drift_recovery_max_attempts: int = Field(default=5, ge=0, description="Max drift recovery attempts before FAILED.")
    drift_recovery_base_delay_seconds: int = Field(
        default=5, ge=0, description="Base delay for drift recovery backoff."
    )
    drift_recovery_max_delay_seconds: int = Field(
        default=300, ge=0, description="Max delay cap for drift recovery backoff."
    )
    orphan_cleanup_every_n_cycles: int = Field(
        default=6,
        ge=0,
        description="Run orphan substrate cleanup every N reconcile cycles (0 disables).",
    )

Comment on lines +20 to +36

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Add validation for ControllerConfig fields.

Missing constraints allow nonsensical values:

  • interval_seconds can be ≤ 0 (breaks reconcile loop)
  • drift_recovery_base_delay_seconds can exceed drift_recovery_max_delay_seconds (impossible backoff)
  • orphan_cleanup_every_n_cycles can be ≤ 0 (breaks cleanup scheduling)
🛡️ Proposed validators
+    @model_validator(mode="after")
+    def _validate_controller_config(self) -> ControllerConfig:
+        if self.interval_seconds <= 0:
+            raise ValueError("interval_seconds must be positive")
+        if self.drift_recovery_base_delay_seconds > self.drift_recovery_max_delay_seconds:
+            raise ValueError("drift_recovery_base_delay_seconds must not exceed drift_recovery_max_delay_seconds")
+        if self.orphan_cleanup_every_n_cycles <= 0:
+            raise ValueError("orphan_cleanup_every_n_cycles must be positive")
+        return self
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/src/nemo_deployments_plugin/config.py` around lines
20 - 31, ControllerConfig currently allows invalid values; add Pydantic field
constraints and a cross-field validator: use Field(..., gt=0) for
interval_seconds and orphan_cleanup_every_n_cycles, use Field(..., ge=0) for
drift_recovery_base_delay_seconds and Field(..., ge=0) for
drift_recovery_max_delay_seconds (and Field(..., ge=0) for
drift_recovery_max_attempts), then implement a `@root_validator` (e.g.,
validate_backoff) on ControllerConfig to assert
drift_recovery_base_delay_seconds <= drift_recovery_max_delay_seconds and raise
a ValueError with a clear message if violated; reference the ControllerConfig
class and the field names interval_seconds, drift_recovery_base_delay_seconds,
drift_recovery_max_delay_seconds, orphan_cleanup_every_n_cycles, and
drift_recovery_max_attempts when locating where to add these checks.

    @model_validator(mode="after")
    def _validate_backoff(self) -> ControllerConfig:
        if self.drift_recovery_base_delay_seconds > self.drift_recovery_max_delay_seconds:
            raise ValueError("drift_recovery_base_delay_seconds must not exceed drift_recovery_max_delay_seconds")
        return self


class DeploymentsConfig(NemoConfig):
    plugin_name: ClassVar[str] = "deployments"
    plugin_description: ClassVar[str] = "Configuration for the NeMo Platform deployments plugin."
@@ -29,6 +53,10 @@ class DeploymentsConfig(NemoConfig):
        default=None,
        description="Fallback executor when Deployment.executor is unset.",
    )
    controller: ControllerConfig = Field(
        default_factory=ControllerConfig,
        description="Deployment reconciler controller settings.",
    )
    port_range_start: int = Field(default=9000, description="Default Docker port range start.")
    port_range_end: int = Field(default=9100, description="Default Docker port range end.")
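
The constraints that `ControllerConfig` carries in this diff can be illustrated without Pydantic. The sketch below is a plain-dataclass stand-in, not the plugin's model: `ControllerSettings` is a hypothetical name, and the checks mirror the `gt=0` / `ge=0` field bounds and the base-versus-max backoff rule shown above. It keeps `0` valid for `orphan_cleanup_every_n_cycles`, matching the field description ("0 disables").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerSettings:
    """Plain-Python stand-in for ControllerConfig (not the Pydantic model)."""

    interval_seconds: int = 5
    drift_recovery_max_attempts: int = 5
    drift_recovery_base_delay_seconds: int = 5
    drift_recovery_max_delay_seconds: int = 300
    orphan_cleanup_every_n_cycles: int = 6

    def __post_init__(self) -> None:
        # Equivalent of gt=0 on the loop interval.
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        # Equivalent of ge=0 on the remaining fields.
        for name in (
            "drift_recovery_max_attempts",
            "drift_recovery_base_delay_seconds",
            "drift_recovery_max_delay_seconds",
            "orphan_cleanup_every_n_cycles",
        ):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
        # Cross-field rule: the base delay cannot exceed the cap.
        if self.drift_recovery_base_delay_seconds > self.drift_recovery_max_delay_seconds:
            raise ValueError(
                "drift_recovery_base_delay_seconds must not exceed "
                "drift_recovery_max_delay_seconds"
            )
```

Under these rules, `ControllerSettings(orphan_cleanup_every_n_cycles=0)` is accepted, while `ControllerSettings(drift_recovery_base_delay_seconds=600)` raises `ValueError` because the base delay exceeds the default cap of 300.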
