
feat: PSAPCluster controller with sentinel locking, auto-discovery, and GPU discovery#74

Open
MML-coder wants to merge 4 commits into main from feat/psapcluster-controller

Conversation


@MML-coder MML-coder commented Apr 29, 2026

Summary

Introduces a PSAPCluster custom resource and kopf controller that unifies cluster
management with GPU discovery, sentinel-based locking via Kueue, and automatic
cluster onboarding from kubeconfig secrets.

Closes #37, closes #51

What's included

  • PSAPCluster CRD (fournos.dev/v1) — single source of truth for cluster state with printer columns for oc get psapclusters
  • GPU discovery — connects to target clusters via kubeconfig secrets, lists nodes, aggregates GPU counts by vendor/model with exponential backoff on failures
  • Sentinel FournosJob locking — set spec.owner to create a sentinel FournosJob (lockOnly: true, exclusive: true) that holds all 100 cluster-slots via Kueue, blocking competing workloads; optional spec.ttl enables auto-expiry
  • Auto-discovery — daemon thread scans kubeconfig secrets every 60s, auto-creates ResourceFlavor + ClusterQueue flavor entry + PSAPCluster CR for each new cluster (labeled fournos.dev/auto-discovered: true)
  • lockOnly FournosJob support — new CRD field for sentinel jobs that skip Forge resolution and stay Admitted indefinitely
  • Self-healing — reconcile timer recreates missing sentinel jobs, validates kubeconfig, retries GPU discovery
  • Single global ClusterQueue — all clusters share fournos-queue with per-cluster ResourceFlavors

Architecture

Kubeconfig Secret (psap-secrets)
  └─ Auto-Discovery (60s scan)
       ├─ ResourceFlavor (nodeLabel: fournos.dev/cluster=<name>)
       ├─ ClusterQueue flavor entry (cluster-slot: 100)
       └─ PSAPCluster CR
            ├─ GPU Discovery (probes target cluster nodes)
            │    └─ Updates GPU quotas in ClusterQueue
            └─ Locking (spec.owner)
                 └─ Sentinel FournosJob (lockOnly + exclusive)
                      └─ Kueue Workload (holds 100 cluster-slots)
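
As a rough sketch of the sentinel manifest the controller would create (every field below other than spec.lockOnly, spec.exclusive, and the 100 cluster-slot count is an illustrative guess, not the actual CRD schema):

```python
def build_sentinel_job(cluster: str, owner: str, slots: int = 100) -> dict:
    """Build a hypothetical lock-only FournosJob manifest for a cluster.

    Field and label names other than spec.lockOnly / spec.exclusive are
    illustrative, not taken from the real CRD.
    """
    return {
        "apiVersion": "fournos.dev/v1",
        "kind": "FournosJob",
        "metadata": {
            "name": f"lock-{cluster}",
            "labels": {"fournos.dev/lock-owner": owner},
        },
        "spec": {
            "cluster": cluster,
            "lockOnly": True,   # skip Forge resolution, stay Admitted
            "exclusive": True,  # request exclusive use of the cluster
            "slots": slots,     # holding all 100 slots blocks competing workloads
        },
    }

job = build_sentinel_job("athena-fire", "mehul")
```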

New/modified files

| File | Change |
|---|---|
| manifests/crd.yaml | Added lockOnly field, removed forge from required |
| manifests/psapcluster-crd.yaml | Updated for sentinel approach (added lockJobName, removed old CQ fields) |
| manifests/rbac/role_fournos.yaml | Added create/delete for fournosjobs, create for psapclusters |
| fournos/core/discovery.py | New — ClusterDiscovery class for auto-onboarding clusters |
| fournos/core/clusters.py | Added extract_cluster_name, list_kubeconfig_secrets |
| fournos/core/kueue.py | Added create_flavor, add_flavor_to_cluster_queue, update_flavor_quotas |
| fournos/core/psapcluster.py | Deleted — PSAPClusterManager removed |
| fournos/handlers/discovery.py | New — scan_clusters handler |
| fournos/handlers/psapcluster.py | Rewritten for sentinel FournosJob create/delete |
| fournos/handlers/lifecycle.py | Added lockOnly handling in on_create |
| fournos/handlers/execution.py | Added lockOnly early return in reconcile_admitted |
| fournos/operator.py | Wired discovery daemon thread and ClusterDiscovery init |
| tests/unit/test_cluster_discovery.py | New — 28 unit tests for discovery |
| tests/unit/test_psapcluster_handlers.py | Rewritten for sentinel approach |
| tests/unit/test_psapcluster_manager.py | Deleted |

Usage

# View clusters
oc get psapclusters -n psap-automation

# Lock a cluster (indefinite)
oc patch psapcluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":"mehul"}}'

# Lock with 4-hour TTL
oc patch psapcluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":"mehul","ttl":"4h"}}'

# Unlock
oc patch psapcluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":""}}'

# Auto-discover: just drop a kubeconfig secret
oc create secret generic kubeconfig-new-cluster \
  --from-file=kubeconfig=/path/to/kubeconfig -n psap-secrets
# Within 60s: ResourceFlavor + CQ entry + PSAPCluster CR created automatically
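
The spec.ttl strings above ("4h") imply a duration parser; a minimal sketch, assuming simple s/m/h/d suffixes (the units the controller actually accepts are not specified here):

```python
import re
from datetime import timedelta

_TTL_RE = re.compile(r"^(?P<value>\d+)(?P<unit>[smhd])$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_ttl(ttl: str) -> timedelta:
    """Parse a duration string such as '4h' into a timedelta."""
    match = _TTL_RE.match(ttl)
    if not match:
        raise ValueError(f"Unsupported TTL format: {ttl!r}")
    return timedelta(seconds=int(match["value"]) * _UNIT_SECONDS[match["unit"]])
```

The reconciler would then compare lockExpiresAt (creation time plus this delta) against the current time on each timer tick.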

Test plan

  • 91 unit tests pass (28 discovery + 34 PSAPCluster handlers + 29 GPU discovery)
  • Deployed to psap-automation cluster
  • End-to-end locking test: sentinel created, competing workload blocked, unlock releases it
  • GPU discovery confirmed for athena-fire (8x H200) and psap-mgmt
  • Auto-discovery created PSAPClusters for kubeconfig secrets (cluster-1 through cluster-4)
  • Verified RBAC allows sentinel create/delete
  • Verified self-healing: reconciler recreates missing sentinel when owner is set

Summary by CodeRabbit

  • New Features

    • Support for managing remote Kubernetes clusters with automatic GPU inventory tracking
    • Automatic discovery of clusters from kubeconfig secrets
    • Exclusive cluster locking with ownership-based controls and optional TTL expiry
    • New lockOnly job mode to reserve cluster capacity without executing pipelines
  • Documentation

    • Added comprehensive operational guide for cluster management, including setup procedures and troubleshooting

…ocking

Introduces a PSAPCluster custom resource and kopf controller that unifies
cluster definition, discovers GPU hardware dynamically, manages per-cluster
Kueue ClusterQueues, and enables ownership locking via stopPolicy.

- PSAPCluster CRD (fournos.dev/v1) with spec fields: kubeconfigSecret,
  owner, ttl, evict, gpuDiscoveryInterval
- GPU discovery client that connects to target clusters via kubeconfig
  secrets, lists nodes, and aggregates GPU counts by vendor/model
- Per-cluster ClusterQueues in a shared cohort with self-healing
- Ownership locking with Hold/HoldAndDrain stopPolicy and TTL auto-expiry
- Timer-based reconciliation: kubeconfig validation, TTL expiry, GPU
  discovery with backoff, CQ/RF self-healing
- 82 unit tests covering all modules

Closes #37, closes #51

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ashishkamra for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci Bot commented Apr 29, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 29, 2026

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

This PR introduces PSAPCluster custom resource for cluster state visibility, implements GPU discovery from target clusters via kubeconfig secrets, adds cluster auto-discovery from kubeconfig secrets, enables cluster-level exclusive locking via sentinel jobs, and provides Kueue flavor/quota management for multi-cluster scheduling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **PSAPCluster CRD and RBAC**<br>manifests/psapcluster-crd.yaml, manifests/rbac/role_fournos.yaml | New PSAPCluster CRD with spec/status fields for kubeconfig, owner-based locking, GPU discovery, TTL, and hardware inventory. RBAC updated to grant PSAPCluster and LocalQueue management permissions. |
| **GPU Discovery**<br>fournos/core/gpu_discovery.py, tests/unit/test_gpu_discovery.py | New GPU discovery module connecting to target clusters via kubeconfig, aggregating GPU counts by vendor/model, handling failures with explicit exceptions, and returning structured discovery results. Comprehensive unit tests cover success/error paths and model normalization. |
| **Cluster Auto-Discovery**<br>fournos/core/clusters.py, fournos/core/discovery.py, fournos/handlers/discovery.py, tests/unit/test_cluster_discovery.py | Utilities to parse kubeconfig secret names by regex pattern, list matching secrets, and automatically discover clusters from secrets. ClusterDiscovery creates PSAPCluster CRs and Kueue flavors for newly discovered clusters. The handler exposes the scan_clusters() entry point. Tests validate regex parsing, filtering, flavor/PSAPCluster creation, and idempotency. |
| **PSAPCluster Handlers and Reconciliation**<br>fournos/handlers/psapcluster.py, tests/unit/test_psapcluster_handlers.py | New handler module for the PSAPCluster lifecycle: on-create validation and initialization, owner-change-based locking/unlocking via sentinel jobs, and timer-based reconciliation enforcing TTL expiry, GPU discovery with interval/backoff, lock self-healing, and Kueue quota updates. Tests cover all handler paths, including failure modes and state transitions. |
| **Kueue Integration**<br>fournos/core/kueue.py | Three new KueueClient methods: create_flavor for idempotent ResourceFlavor creation, add_flavor_to_cluster_queue for appending flavors with default zero quotas, and update_flavor_quotas for modifying existing flavor GPU/cluster-slot allocations. |
| **FournosJob Locking Support**<br>fournos/handlers/lifecycle.py, fournos/handlers/execution.py, manifests/crd.yaml | Support for spec.lockOnly mode: FournosJob create validates the cluster requirement, bypasses Forge for lockOnly jobs, and creates an exclusive Kueue Workload; the execution handler skips PipelineRun checks. The CRD removes the forge requirement and adds a lockOnly boolean field. |
| **Operator Integration and Configuration**<br>fournos/operator.py, fournos/settings.py, fournos/state.py, fournos/core/constants.py, fournos/handlers/__init__.py | Operator startup creates GPU discovery and cluster discovery clients, starts the background discovery loop, and registers PSAPCluster kopf handlers. Settings add the cluster-queue name, timer intervals, discovery timeout, and cluster-discovery interval. State holds the new client references. Constants define lock/discovery labels and condition types. Handler exports re-export the new functions. |
| **Documentation and Samples**<br>README.md, dev/mock-psapcluster.yaml, config/forge/samples/job-full.yaml | Extensive README section documenting PSAPCluster onboarding, the locking mechanism, GPU discovery, TTL/expiry, troubleshooting, and the e2e test. Mock cluster samples for testing. Sample job updated to reference the athena-fire cluster. |
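
The idempotent create_flavor described under Kueue Integration presumably creates the ResourceFlavor and tolerates an already-exists conflict; a sketch with a stand-in API client and exception type (not the real KueueClient or kubernetes ApiException):

```python
class ConflictError(Exception):
    """Stand-in for kubernetes.client.ApiException with status 409."""
    status = 409

def create_flavor(api, cluster: str) -> dict:
    """Create a per-cluster ResourceFlavor, treating 'already exists' as a no-op.

    'api' is a duck-typed stand-in exposing create(manifest).
    """
    flavor = {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ResourceFlavor",
        "metadata": {"name": cluster},
        "spec": {"nodeLabels": {"fournos.dev/cluster": cluster}},
    }
    try:
        api.create(flavor)
    except ConflictError:
        pass  # flavor already exists: idempotent create
    return flavor
```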

Sequence Diagram(s)

sequenceDiagram
    participant Operator as PSAPCluster Operator
    participant K8s as Kubernetes API
    participant TargetCluster as Target Cluster API
    participant Kueue as Kueue API

    Operator->>K8s: Watch PSAPCluster events
    
    rect rgba(0, 100, 150, 0.5)
    Note over Operator,K8s: On PSAPCluster Create
    K8s->>Operator: Trigger create event
    Operator->>K8s: Validate kubeconfig secret
    Operator->>K8s: Initialize status, conditions
    Operator->>K8s: Create sentinel lock job if spec.owner set
    K8s->>Operator: Acknowledge
    end

    rect rgba(100, 50, 150, 0.5)
    Note over Operator,TargetCluster: GPU Discovery Loop
    Operator->>Operator: Timer triggers reconciliation
    Operator->>K8s: Read kubeconfig secret
    K8s->>Operator: Return secret data
    Operator->>TargetCluster: Connect & list nodes
    TargetCluster->>Operator: Return GPU allocatable resources
    Operator->>Operator: Aggregate by vendor/model
    Operator->>K8s: Update PSAPCluster status.hardware
    end

    rect rgba(150, 100, 50, 0.5)
    Note over Operator,Kueue: Quota Update on Inventory Change
    Operator->>K8s: Detect GPU count change
    Operator->>Kueue: Ensure flavor exists
    Kueue->>Operator: Confirm flavor
    Operator->>Kueue: Update flavor quotas
    Kueue->>Operator: Acknowledge patch
    end

    rect rgba(100, 150, 50, 0.5)
    Note over Operator,K8s: Owner Change Triggers Lock
    K8s->>Operator: Trigger spec.owner update
    Operator->>K8s: Create/delete sentinel job
    Operator->>K8s: Update locked, lockExpiresAt
    K8s->>Operator: Acknowledge
    end
sequenceDiagram
    participant BackgroundLoop as Discovery Loop
    participant K8s as Kubernetes API
    participant ClusterDiscovery as ClusterDiscovery
    participant KueueClient as KueueClient

    BackgroundLoop->>ClusterDiscovery: Trigger scan_clusters()
    
    rect rgba(100, 150, 100, 0.5)
    Note over ClusterDiscovery,K8s: Auto-Discover Clusters
    ClusterDiscovery->>K8s: List kubeconfig secrets
    K8s->>ClusterDiscovery: Return secret names
    ClusterDiscovery->>ClusterDiscovery: Extract cluster names via regex
    ClusterDiscovery->>K8s: Check for existing PSAPCluster CRs
    K8s->>ClusterDiscovery: Return existing clusters
    ClusterDiscovery->>ClusterDiscovery: Filter new clusters
    
    loop For each new cluster
        ClusterDiscovery->>KueueClient: Ensure flavor exists
        KueueClient->>K8s: Create ResourceFlavor
        ClusterDiscovery->>KueueClient: Add flavor to cluster queue
        KueueClient->>K8s: Patch ClusterQueue resourceGroups
        ClusterDiscovery->>K8s: Create PSAPCluster CR
        K8s->>ClusterDiscovery: Confirm CR created
    end
    end
    
    BackgroundLoop->>BackgroundLoop: Log discovered clusters

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

lgtm

Poem

🐰 A cluster discovered, with GPUs laid bare,
Sentinels standing in locks debonair,
Quotas dance with the Kueue's grace,
Auto-finding homes for each cluster's space!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.39%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately summarizes the main changes: PSAPCluster controller implementation with sentinel locking, auto-discovery, and GPU discovery. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses both linked issues: #37 introduces the PSAPCluster CRD for cluster state tracking, and #51 implements GPU discovery via kubeconfig secrets with dynamic hardware enumeration. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the objectives: PSAPCluster controller, GPU discovery, cluster auto-discovery, Kueue integration, and lock-only job support. One minor update to the job-full.yaml sample cluster reference is incidental. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
fournos/handlers/psapcluster.py (1)

52-64: 💤 Low value

Consider exposing a public method on ctx.registry instead of accessing _k8s directly.

Accessing the private _k8s attribute couples this handler to the internal implementation of ClusterRegistry. Consider adding a public method like ctx.registry.secret_exists(name, namespace) or using a shared CoreV1Api instance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 52 - 64, Update
_check_kubeconfig to stop accessing the private ctx.registry._k8s and call a new
public API on the registry (e.g., ctx.registry.secret_exists(secret_name,
namespace)) that encapsulates the read_namespaced_secret logic; add the
corresponding method on ClusterRegistry (secret_exists(name, namespace)) which
returns a boolean or raises a controlled exception for non-404 errors, then have
_check_kubeconfig map True->"Valid", False->"Missing" and handle other errors as
"Invalid" while logging the exception via logger.warning as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@fournos/core/gpu_discovery.py`:
- Around line 125-139: The timeout on the Kubernetes call is ineffective because
setting api_client.configuration.request_timeout doesn't apply to individual
calls; update the code that builds the client and calls list_node so that you
remove the assignment to api_client.configuration.request_timeout and instead
call target_core.list_node(...) passing
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec); keep the existing try/except that wraps the
call and raise GPUDiscoveryError on exception, and ensure api_client.close()
remains in the finally block.

In `@fournos/core/psapcluster.py`:
- Around line 181-185: The stopPolicy patch handler currently catches
ApiException but ApiException is not imported; update the exception handling in
that handler to catch client.exceptions.ApiException instead (i.e., change the
except block that reads "except ApiException as exc:" to use
client.exceptions.ApiException) so it matches other handlers in this module and
avoids a NameError; ensure the logger.warning and raise behavior remain
unchanged and reference cq_name as before.

In `@README.md`:
- Around line 307-312: The fenced code block showing the table lacks a language
identifier (MD040); update the backticks surrounding the block (the triple-fence
that begins the block containing "NAME OWNER GPUS KUBECONFIG LOCKED AGE") to
include a language tag such as "text" (e.g., change ``` to ```text) so
markdownlint no longer flags the code fence.

---

Nitpick comments:
In `@fournos/handlers/psapcluster.py`:
- Around line 52-64: Update _check_kubeconfig to stop accessing the private
ctx.registry._k8s and call a new public API on the registry (e.g.,
ctx.registry.secret_exists(secret_name, namespace)) that encapsulates the
read_namespaced_secret logic; add the corresponding method on ClusterRegistry
(secret_exists(name, namespace)) which returns a boolean or raises a controlled
exception for non-404 errors, then have _check_kubeconfig map True->"Valid",
False->"Missing" and handle other errors as "Invalid" while logging the
exception via logger.warning as before.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e083bc62-0476-4762-9cde-25f919d35cc5

📥 Commits

Reviewing files that changed from the base of the PR and between 83811e2 and 2634aef.

📒 Files selected for processing (15)
  • README.md
  • dev/mock-psapcluster.yaml
  • fournos/core/constants.py
  • fournos/core/gpu_discovery.py
  • fournos/core/psapcluster.py
  • fournos/handlers/__init__.py
  • fournos/handlers/psapcluster.py
  • fournos/operator.py
  • fournos/settings.py
  • fournos/state.py
  • manifests/psapcluster-crd.yaml
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_gpu_discovery.py
  • tests/unit/test_psapcluster_handlers.py
  • tests/unit/test_psapcluster_manager.py

Comment on lines +125 to +139
api_client = k8s_config.new_client_from_config_dict(kubeconfig_dict)
api_client.configuration.request_timeout = (
    settings.gpu_discovery_timeout_sec,
    settings.gpu_discovery_timeout_sec,
)

try:
    target_core = client.CoreV1Api(api_client)
    nodes = target_core.list_node()
except Exception as exc:
    raise GPUDiscoveryError(
        f"Failed to connect to cluster {cluster_name!r}: {exc}"
    ) from exc
finally:
    api_client.close()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

kubernetes python client request_timeout configuration tuple format

💡 Result:

In the Kubernetes Python client, the _request_timeout parameter configures client-side request timeouts for API calls. It is passed as a keyword argument (e.g., _request_timeout=60) to API methods like CoreV1Api.list_namespaced_pod(...) or watch.Watch.stream(...). The parameter accepts: - A single number (int/float): total timeout for the request in seconds. - A tuple of length 2: (connect_timeout, read_timeout) in seconds, where the first value sets the connection timeout and the second sets the read timeout. This is handled by the underlying ApiClient.call_api and RESTClient.request methods, which pass it to urllib3's HTTP request machinery. Example: from kubernetes import client, watch v1 = client.CoreV1Api w = watch.Watch for event in w.stream(v1.list_namespaced_pod, namespace="default", _request_timeout=(5.0, 30.0)): ... Note: There is no global default timeout in Configuration; it must be set per-call. For exec streams, tuple support may have bugs in some versions (e.g., issue #2292). Official documentation: https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md explicitly states it accepts "int/long" or "a tuple (with a length of 2)". Client docs confirm: "If one number provided, it will be total request timeout. It can also be a pair (tuple) of (connection, read) timeouts."



🌐 Web query:

kubernetes python client Configuration request_timeout global default

💡 Result:

In the Kubernetes Python client library, there is no global default request_timeout setting in the Configuration class. The _request_timeout parameter must be specified explicitly on individual API calls (e.g., list_namespaced_pod(_request_timeout=60)) or watch streams. If not set, the default is None, meaning no timeout (infinite wait, matching urllib3/socket defaults). Official documentation confirms: "In case, if the _request_timeout value is not set, then the default value is None & socket will have no timeout." (examples/watch/timeout-settings.md). Requests for global configuration via Configuration (e.g., issues #1961, #2227) remain open or unresolved as of 2026, with no such field in configuration.py source code. Per-call _request_timeout accepts int (total timeout) or tuple (connect, read).



Timeout configuration is ineffective; pass _request_timeout to the API call instead.

Setting api_client.configuration.request_timeout does not apply timeouts to Kubernetes API calls. The Python client requires the _request_timeout parameter to be passed explicitly to individual API methods. Change target_core.list_node() to target_core.list_node(_request_timeout=(settings.gpu_discovery_timeout_sec, settings.gpu_discovery_timeout_sec)), and remove the Configuration assignment. Without this fix, the timeout has no effect and the call will hang indefinitely on an unreachable cluster.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/core/gpu_discovery.py` around lines 125 - 139, The timeout on the
Kubernetes call is ineffective because setting
api_client.configuration.request_timeout doesn't apply to individual calls;
update the code that builds the client and calls list_node so that you remove
the assignment to api_client.configuration.request_timeout and instead call
target_core.list_node(...) passing
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec); keep the existing try/except that wraps the
call and raise GPUDiscoveryError on exception, and ensure api_client.close()
remains in the finally block.
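
The corrected call shape can be checked without a live cluster; in this sketch core_api is duck-typed to stand in for kubernetes.client.CoreV1Api, and the per-call _request_timeout tuple is (connect, read) seconds:

```python
def list_nodes(core_api, timeout_sec: float):
    """List nodes with an explicit per-call (connect, read) timeout.

    core_api is expected to expose list_node(_request_timeout=...), as
    kubernetes.client.CoreV1Api does; the Configuration.request_timeout
    attribute is not consulted for individual calls, which is why the
    tuple must be passed here.
    """
    return core_api.list_node(_request_timeout=(timeout_sec, timeout_sec))

class _FakeCoreV1Api:
    """Records the timeout it was called with; stands in for CoreV1Api."""
    def list_node(self, _request_timeout=None):
        self.seen_timeout = _request_timeout
        return ["node-a", "node-b"]

api = _FakeCoreV1Api()
nodes = list_nodes(api, 10.0)
```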

Comment thread fournos/core/psapcluster.py Outdated
Comment thread README.md
Replace per-cluster ClusterQueue stopPolicy locking with sentinel
FournosJob approach. A lockOnly sentinel holds all 100 cluster-slots
via Kueue, blocking competing workloads without requiring per-cluster
CQ infrastructure.

Add auto-discovery daemon that scans kubeconfig secrets every 60s and
auto-creates ResourceFlavor, ClusterQueue flavor entry, and PSAPCluster
CR for each new cluster found.

Key changes:
- Add lockOnly field to FournosJob CRD, skip Forge for lockOnly jobs
- Rewrite PSAPCluster handlers for sentinel create/delete lifecycle
- Remove PSAPClusterManager (fournos/core/psapcluster.py)
- Add create_flavor, add_flavor_to_cluster_queue, update_flavor_quotas
  to KueueClient
- Add ClusterDiscovery with extract_cluster_name, list_kubeconfig_secrets
- Wire discovery daemon thread in operator startup
- Fix RBAC: add create/delete for fournosjobs, create for psapclusters
- Add end-to-end locking test walkthrough to README
- 28 new discovery unit tests, rewritten PSAPCluster handler tests
@MML-coder MML-coder changed the title feat: PSAPCluster CRD controller with GPU discovery and cluster locking feat: PSAPCluster controller with sentinel locking, auto-discovery, and GPU discovery May 1, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
fournos/operator.py (1)

169-177: ⚡ Quick win

Run discovery once before the first interval sleep.

Current loop delays initial discovery until the first timer elapses. Doing an immediate first pass reduces startup lag for newly added kubeconfig secrets.

Proposed fix
 def _discovery_loop():
     interval = settings.cluster_discovery_interval_sec
     while True:
-        time.sleep(interval)
         try:
             handlers.scan_clusters()
         except Exception:
             logger.exception("Cluster discovery scan failed")
+        time.sleep(interval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/operator.py` around lines 169 - 177, The discovery loop in
_discovery_loop currently waits for settings.cluster_discovery_interval_sec
before the first scan; change the loop to perform an immediate scan on startup
by calling handlers.scan_clusters() (inside the existing try/except that logs
via logger.exception) before sleeping, i.e. move the sleep to occur after the
scan so the sequence is: try: handlers.scan_clusters() except Exception:
logger.exception(...) then time.sleep(settings.cluster_discovery_interval_sec),
and repeat.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@fournos/core/discovery.py`:
- Around line 47-63: The loop that processes each secret should not abort the
entire discovery pass on a single failure; wrap the per-cluster work (after
cluster_name resolution) in a try/except so exceptions from
extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.

In `@fournos/core/kueue.py`:
- Around line 215-271: add_flavor_to_cluster_queue (and similarly
update_flavor_quotas) currently does a read-modify-write on the global
ClusterQueue and can lose concurrent changes; modify these functions to perform
optimistic retry-on-conflict by reading and including the object's current
metadata.resourceVersion when patching and retrying on ApiException conflict
(409) a few times, or alternatively switch to a narrower patch/Server-Side Apply
that targets only the specific flavor element instead of replacing
spec.resourceGroups; locate add_flavor_to_cluster_queue and update_flavor_quotas
in kueue.py and implement a loop that GETs the ClusterQueue, computes the
change, issues a PATCH/SSApply with resourceVersion or field-specific patch, and
retries on conflict until success or max attempts.
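
The retry-on-conflict pattern this comment asks for can be sketched generically; Conflict and the read/patch/mutate callables below are stand-ins for the real Kubernetes client plumbing, not the actual KueueClient API:

```python
class Conflict(Exception):
    """Stand-in for an ApiException with HTTP status 409."""

def update_cluster_queue(read, patch, mutate, max_attempts=5):
    """Optimistic read-modify-write with retry on 409 conflicts.

    read() returns the current object (carrying metadata.resourceVersion),
    mutate(obj) applies the desired change in place, and patch(obj) submits
    it, raising Conflict when the resourceVersion is stale.
    """
    for attempt in range(max_attempts):
        obj = read()        # re-read so each attempt sees the latest version
        mutate(obj)
        try:
            patch(obj)
            return obj
        except Conflict:
            if attempt == max_attempts - 1:
                raise       # give up after the final attempt
```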

In `@fournos/handlers/lifecycle.py`:
- Around line 52-63: Reject combinations where lockOnly is true but exclusive is
false instead of silently overriding: in the validation blocks that read
exclusive = spec["exclusive"] and lock_only = spec.get("lockOnly", False) (and
the duplicate block around lines with similar logic), add a check that if
lock_only is True and exclusive is False then set patch.status["phase"] =
Phase.FAILED and patch.status["message"] to a clear message like "lockOnly: true
requires 'exclusive' to be true" and return; reference the same symbols
(exclusive, lock_only, spec) and ensure the same check is applied in both places
(the initial validation and the later similar block that currently enforces
exclusivity in _create_lock_workload()) so user input is rejected rather than
silently overridden.

In `@fournos/handlers/psapcluster.py`:
- Around line 216-220: reconcile_psapcluster currently calls
_reconcile_ttl_expiry(...) then continues to call _reconcile_lock_job(...) which
can immediately recreate a lock because patch was mutated but spec/status still
reflect the old locked state; modify _reconcile_ttl_expiry to return a boolean
(e.g., released=True if it called _release_lock()) and update
reconcile_psapcluster to check that return and stop the reconcile pass (early
return) when TTL expiry released the lock so _reconcile_lock_job is not invoked
with stale spec/status; apply the same pattern to the other reconcile sequences
mentioned (the blocks around lines 238-260 and 354-383) so each
_reconcile_ttl_expiry caller aborts further reconcile steps when it reports a
release.
- Around line 303-310: The code currently always appends a new condition into
patch.status.setdefault("conditions", []) (see the append call with
_make_condition and COND_GPU_DISCOVERED), causing duplicate COND_GPU_DISCOVERED
entries over time; change this to "upsert" by finding and removing any existing
condition whose "type" equals COND_GPU_DISCOVERED from
patch.status["conditions"] before adding the new _make_condition(...) entry.
Apply the same replacement in both the failure path (the shown append using
str(exc)) and the corresponding success path where COND_GPU_DISCOVERED is
appended, using the symbols patch.status, COND_GPU_DISCOVERED, and
_make_condition to locate and modify the code.
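
The condition upsert this comment describes is a small list operation; the helper below is hypothetical (the real code works on patch.status["conditions"] entries produced by _make_condition):

```python
def upsert_condition(conditions: list, new_cond: dict) -> list:
    """Replace any existing condition of the same type, then append new_cond.

    Mutates the list in place so it can be used directly on a status patch.
    """
    conditions[:] = [c for c in conditions if c.get("type") != new_cond["type"]]
    conditions.append(new_cond)
    return conditions
```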
- Around line 342-347: The guard `and new_resources` prevents clearing Kueue
quotas when discovery reports zero GPUs; change the logic so that
`ctx.kueue.update_flavor_quotas(name, new_resources)` is called whenever
`new_resources != prev_resources` (even if `new_resources` is empty) so stale
GPU quotas are removed — update the block around the variables `prev_gpus`,
`new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.

In `@README.md`:
- Around line 323-325: The README's PSAPCluster onboarding is inconsistent about
manual quota setup: update the wording around the step that says "Add a
ResourceFlavor and quota for the cluster in `config/kueue-cluster-config.yaml`"
(and the repeated text at the later block) to clearly state whether this manual
step is required, optional fallback, or legacy; explicitly mention that the
controller (referenced as "controller updates ClusterQueue quotas automatically"
in the "Onboarding a new cluster" section) will normally create/update
ClusterQueue quotas for onboarded clusters and keep the manual
`config/kueue-cluster-config.yaml` step as an optional/manual fallback for edge
cases or legacy workflows, and add a short cross-reference to the "Onboarding a
new cluster" section and a one-line example/criteria when the manual step is
needed.

---

Nitpick comments:
In `@fournos/operator.py`:
- Around line 169-177: The discovery loop in _discovery_loop currently waits for
settings.cluster_discovery_interval_sec before the first scan; change the loop
to perform an immediate scan on startup by calling handlers.scan_clusters()
(inside the existing try/except that logs via logger.exception) before sleeping,
i.e. move the sleep to occur after the scan so the sequence is: try:
handlers.scan_clusters() except Exception: logger.exception(...) then
time.sleep(settings.cluster_discovery_interval_sec), and repeat.
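The restructured loop can be sketched as follows; `scan` and `stop` are injected here for testability and are assumptions, not the operator's real signature:

```python
import logging
import time

logger = logging.getLogger(__name__)

def discovery_loop(scan, interval_sec, stop=lambda: False):
    """Scan immediately on startup, then sleep between passes.

    A failing scan is logged and the loop continues; `stop` is a test hook
    standing in for the daemon thread's lifetime.
    """
    while True:
        try:
            scan()  # the first scan happens before any sleep
        except Exception:
            logger.exception("Cluster discovery scan failed")
        if stop():
            return
        time.sleep(interval_sec)
```

With this shape, newly added kubeconfig secrets are picked up as soon as the operator starts instead of after the first full interval.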

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c74b8b6-0fa5-4d35-a8eb-3736ecf1f197

📥 Commits

Reviewing files that changed from the base of the PR and between 2634aef and efa8139.

📒 Files selected for processing (19)
  • README.md
  • config/forge/samples/job-full.yaml
  • fournos/core/clusters.py
  • fournos/core/constants.py
  • fournos/core/discovery.py
  • fournos/core/kueue.py
  • fournos/handlers/__init__.py
  • fournos/handlers/discovery.py
  • fournos/handlers/execution.py
  • fournos/handlers/lifecycle.py
  • fournos/handlers/psapcluster.py
  • fournos/operator.py
  • fournos/settings.py
  • fournos/state.py
  • manifests/crd.yaml
  • manifests/psapcluster-crd.yaml
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_cluster_discovery.py
  • tests/unit/test_psapcluster_handlers.py
✅ Files skipped from review due to trivial changes (3)
  • config/forge/samples/job-full.yaml
  • fournos/handlers/__init__.py
  • manifests/psapcluster-crd.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • fournos/state.py
  • fournos/core/constants.py
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_psapcluster_handlers.py

Comment thread fournos/core/discovery.py
Comment on lines +47 to +63
for secret_name in secrets:
    cluster_name = extract_cluster_name(secret_name)
    if cluster_name is None:
        continue

    if cluster_name in existing_clusters:
        continue

    logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

    if cluster_name not in existing_flavors:
        self._kueue.create_flavor(cluster_name)

    self._kueue.add_flavor_to_cluster_queue(cluster_name)
    self._create_psapcluster(cluster_name, secret_name)
    discovered.append(cluster_name)

@coderabbitai coderabbitai Bot May 1, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A single onboarding failure aborts the whole discovery pass.

If one cluster fails during flavor/CQ/CR creation, the loop exits and later valid clusters are skipped for that cycle. Isolate errors per cluster and continue.

Proposed fix
         discovered: list[str] = []
         for secret_name in secrets:
             cluster_name = extract_cluster_name(secret_name)
             if cluster_name is None:
                 continue

             if cluster_name in existing_clusters:
                 continue

             logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

-            if cluster_name not in existing_flavors:
-                self._kueue.create_flavor(cluster_name)
-
-            self._kueue.add_flavor_to_cluster_queue(cluster_name)
-            self._create_psapcluster(cluster_name, secret_name)
-            discovered.append(cluster_name)
+            try:
+                if cluster_name not in existing_flavors:
+                    self._kueue.create_flavor(cluster_name)
+                    existing_flavors.add(cluster_name)
+
+                self._kueue.add_flavor_to_cluster_queue(cluster_name)
+                self._create_psapcluster(cluster_name, secret_name)
+                discovered.append(cluster_name)
+            except Exception:
+                logger.exception(
+                    "Failed onboarding discovered cluster %s from secret %s",
+                    cluster_name,
+                    secret_name,
+                )
+                continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/core/discovery.py` around lines 47 - 63, The loop that processes each
secret should not abort the entire discovery pass on a single failure; wrap the
per-cluster work (after cluster_name resolution) in a try/except so exceptions
from extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.

Collaborator

looks like a valid concern


@kpouget, glad we're aligned! The fix is straightforward — wrapping the per-cluster onboarding block in a try/except ensures that a transient API error (e.g., a flaky Kueue patch or a race on PSAPCluster creation) for one cluster doesn't silently drop all subsequent clusters from that scan cycle.

The proposed fix from the review comment is ready to apply as-is:

discovered: list[str] = []
for secret_name in secrets:
    cluster_name = extract_cluster_name(secret_name)
    if cluster_name is None:
        continue

    if cluster_name in existing_clusters:
        continue

    logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

    try:
        if cluster_name not in existing_flavors:
            self._kueue.create_flavor(cluster_name)
            existing_flavors.add(cluster_name)

        self._kueue.add_flavor_to_cluster_queue(cluster_name)
        self._create_psapcluster(cluster_name, secret_name)
        discovered.append(cluster_name)
    except Exception:
        logger.exception(
            "Failed onboarding discovered cluster %s from secret %s",
            cluster_name,
            secret_name,
        )
        continue

Note also the existing_flavors.add(cluster_name) after a successful create_flavor call — this keeps the in-memory set consistent so that if the same flavor name somehow appears again later in the same scan, it won't attempt a redundant creation.

Would you like me to plan these code changes for you?

Collaborator Author
@MML-coder MML-coder May 1, 2026

Good suggestion to incorporate, related to the transient API error case.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment thread fournos/core/kueue.py
Comment on lines +215 to +271
def add_flavor_to_cluster_queue(self, flavor_name: str) -> dict | None:
    """Add a flavor entry to the global ClusterQueue.

    Read-modify-write: GET the CQ, check if flavor already present,
    append a new flavor entry with cluster-slot quota and zero GPU
    quotas for all existing coveredResources, PATCH back.

    Returns the patched CQ, or ``None`` if already present or CQ not found.
    """
    cq_name = settings.kueue_cluster_queue_name
    try:
        cq = self._k8s.get_cluster_custom_object(
            group=KUEUE_GROUP,
            version=KUEUE_VERSION,
            plural=KUEUE_CLUSTER_QUEUE_PLURAL,
            name=cq_name,
        )
    except client.exceptions.ApiException as exc:
        if exc.status == 404:
            logger.warning("ClusterQueue %s not found", cq_name)
            return None
        raise

    resource_groups = cq.get("spec", {}).get("resourceGroups", [])
    if not resource_groups:
        logger.warning("ClusterQueue %s has no resourceGroups", cq_name)
        return None

    rg = resource_groups[0]
    flavors = rg.get("flavors", [])

    for f in flavors:
        if f["name"] == flavor_name:
            logger.debug("Flavor %s already in ClusterQueue %s", flavor_name, cq_name)
            return None

    covered = rg.get("coveredResources", [])
    resources: list[dict] = []
    for resource_name in covered:
        if resource_name == CLUSTER_SLOT_RESOURCE:
            resources.append({"name": CLUSTER_SLOT_RESOURCE, "nominalQuota": MAX_CLUSTER_SLOTS})
        else:
            resources.append({"name": resource_name, "nominalQuota": 0})

    if not any(r["name"] == CLUSTER_SLOT_RESOURCE for r in resources):
        resources.append({"name": CLUSTER_SLOT_RESOURCE, "nominalQuota": MAX_CLUSTER_SLOTS})

    flavors.append({"name": flavor_name, "resources": resources})
    rg["flavors"] = flavors

    result = self._k8s.patch_cluster_custom_object(
        group=KUEUE_GROUP,
        version=KUEUE_VERSION,
        plural=KUEUE_CLUSTER_QUEUE_PLURAL,
        name=cq_name,
        body={"spec": {"resourceGroups": resource_groups}},
    )
@coderabbitai coderabbitai Bot May 1, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n fournos/core/kueue.py | sed -n '215,347p'

Repository: openshift-psap/fournos

Length of output: 6129


🏁 Script executed:

# Search for retry logic, locks, or conflict handling in the file
rg -n "retry|lock|conflict|patch|atomic" fournos/core/kueue.py -A 2 -B 2

Repository: openshift-psap/fournos

Length of output: 668


🏁 Script executed:

# Check the broader architecture for concurrency patterns
rg -n "auto.discovery|PSAPCluster|reconcil" --type py | head -20

Repository: openshift-psap/fournos

Length of output: 2092


🏁 Script executed:

# Look for how patch_cluster_custom_object is used elsewhere
rg -n "patch_cluster_custom_object" fournos/ --type py -B 2 -A 2

Repository: openshift-psap/fournos

Length of output: 658


🏁 Script executed:

# Verify if there's any threading/async concurrency model
fd -e py | xargs rg -l "threading|asyncio|concurrent|Thread|async" | head -10

Repository: openshift-psap/fournos

Length of output: 86


🏁 Script executed:

# Check operator.py for concurrency model
head -100 fournos/operator.py

Repository: openshift-psap/fournos

Length of output: 3419


🏁 Script executed:

# Look for how these KueueClient methods are called
rg -n "add_flavor_to_cluster_queue|update_flavor_quotas" fournos/ --type py -B 3 -A 3

Repository: openshift-psap/fournos

Length of output: 2007


🏁 Script executed:

# Check for any Kubernetes API version or patch strategy documentation in comments
rg -n "KUEUE_VERSION|patch|Strategic" fournos/core/kueue.py | head -20

Repository: openshift-psap/fournos

Length of output: 870


🏁 Script executed:

# Look for imports and client initialization to understand which Kubernetes client is being used
head -50 fournos/core/kueue.py

Repository: openshift-psap/fournos

Length of output: 1417


🏁 Script executed:

# Verify the actual patch behavior - check if there's any resourceVersion or conflict detection
rg -n "resourceVersion|conflict|409" fournos/ --type py

Repository: openshift-psap/fournos

Length of output: 1760


Protect shared ClusterQueue updates from lost writes due to concurrent patching.

Both add_flavor_to_cluster_queue and update_flavor_quotas perform read-modify-write on the same global ClusterQueue without conflict detection. The discovery loop and PSAPCluster handler run concurrently in separate threads, and since both patch the entire spec.resourceGroups array, the last write silently overwrites any concurrent changes from the other. Add optimistic retry-on-conflict using resourceVersion or switch to a narrower patch/Server-Side Apply strategy scoped to the specific flavor being modified.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/core/kueue.py` around lines 215 - 271, add_flavor_to_cluster_queue
(and similarly update_flavor_quotas) currently does a read-modify-write on the
global ClusterQueue and can lose concurrent changes; modify these functions to
perform optimistic retry-on-conflict by reading and including the object's
current metadata.resourceVersion when patching and retrying on ApiException
conflict (409) a few times, or alternatively switch to a narrower
patch/Server-Side Apply that targets only the specific flavor element instead of
replacing spec.resourceGroups; locate add_flavor_to_cluster_queue and
update_flavor_quotas in kueue.py and implement a loop that GETs the
ClusterQueue, computes the change, issues a PATCH/SSApply with resourceVersion
or field-specific patch, and retries on conflict until success or max attempts.

Collaborator

looks like a valid concern


@kpouget, agreed — here's a concrete fix using an optimistic retry loop. The key idea is to include the metadata.resourceVersion from the GET response in the PATCH body so Kubernetes will reject stale writes with a 409, then retry from scratch.

  • Extract a shared _retry_patch_cluster_queue helper that encapsulates the GET → mutate → PATCH loop with up to N retries on conflict (409)
  • Refactor add_flavor_to_cluster_queue to use it
  • Refactor update_flavor_quotas to use it

🧠 Learnings used
Learnt from: avasilevskii
Repo: openshift-psap/fournos PR: 43
File: fournos/handlers/lifecycle.py:95-95
Timestamp: 2026-04-21T12:23:23.225Z
Learning: In `fournos/handlers/lifecycle.py`, the `cluster` value used as a Kubernetes label value for `LABEL_EXCLUSIVE_CLUSTER` (defined as `"fournos.dev/exclusive-cluster"` in `fournos/core/constants.py`) is always validated against existing Kueue ResourceFlavor names before use. ResourceFlavor names are controlled by platform operators (not end users) and in practice are short identifiers like "cluster-1" or "cluster-2", so they are inherently Kubernetes label-value safe. No additional sanitization is needed.

The requested changes have been implemented and a pull request has been created: View PR
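A minimal sketch of that `_retry_patch_cluster_queue` shape, with the Kubernetes client abstracted behind callables. The helper name comes from the comment above; `get_cq`/`patch_cq`/`mutate` and the retry budget are assumptions, not the merged implementation:

```python
import copy

MAX_CONFLICT_RETRIES = 5  # assumed retry budget

class ConflictError(Exception):
    """Stand-in for client.exceptions.ApiException with status == 409."""

def retry_patch_cluster_queue(get_cq, patch_cq, mutate):
    """GET -> mutate -> PATCH with optimistic concurrency.

    `get_cq()` returns the ClusterQueue dict (including
    metadata.resourceVersion), `mutate(cq)` edits the spec in place and
    returns True if a patch is needed, and `patch_cq(body)` sends a patch
    that carries the resourceVersion, raising ConflictError on a stale
    write so we re-read and retry from scratch.
    """
    for _ in range(MAX_CONFLICT_RETRIES):
        cq = copy.deepcopy(get_cq())
        if not mutate(cq):
            return None  # nothing to do (e.g. flavor already present)
        body = {
            "metadata": {"resourceVersion": cq["metadata"]["resourceVersion"]},
            "spec": {"resourceGroups": cq["spec"]["resourceGroups"]},
        }
        try:
            return patch_cq(body)
        except ConflictError:
            continue  # another writer won the race; re-read and retry
    raise RuntimeError("ClusterQueue patch kept conflicting; giving up")
```

Including the resourceVersion in the body is what turns a silent lost-write into a retryable 409.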

Comment on lines 52 to +63
exclusive = spec["exclusive"]
lock_only = spec.get("lockOnly", False)

if lock_only and not cluster:
    patch.status["phase"] = Phase.FAILED
    patch.status["message"] = "lockOnly: true requires 'cluster' to be set"
    return

if not lock_only and not spec.get("forge"):
    patch.status["phase"] = Phase.FAILED
    patch.status["message"] = "spec.forge is required for non-lockOnly jobs"
    return

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast on lockOnly: true + exclusive: false instead of silently overriding.

Current flow accepts that combo but _create_lock_workload() enforces exclusive anyway, so user intent is ignored. Rejecting this input makes behavior explicit and debuggable.

Proposed fix
 def on_create(spec, name, namespace, status, patch, body):
@@
     cluster = spec.get("cluster")
     exclusive = spec["exclusive"]
     lock_only = spec.get("lockOnly", False)

+    if lock_only and not exclusive:
+        patch.status["phase"] = Phase.FAILED
+        patch.status["message"] = "lockOnly jobs require 'exclusive: true'"
+        return
+
     if lock_only and not cluster:
         patch.status["phase"] = Phase.FAILED
         patch.status["message"] = "lockOnly: true requires 'cluster' to be set"
         return

Also applies to: 95-106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/lifecycle.py` around lines 52 - 63, Reject combinations
where lockOnly is true but exclusive is false instead of silently overriding: in
the validation blocks that read exclusive = spec["exclusive"] and lock_only =
spec.get("lockOnly", False) (and the duplicate block around lines with similar
logic), add a check that if lock_only is True and exclusive is False then set
patch.status["phase"] = Phase.FAILED and patch.status["message"] to a clear
message like "lockOnly: true requires 'exclusive' to be true" and return;
reference the same symbols (exclusive, lock_only, spec) and ensure the same
check is applied in both places (the initial validation and the later similar
block that currently enforces exclusivity in _create_lock_workload()) so user
input is rejected rather than silently overridden.

Comment on lines +216 to +220
def reconcile_psapcluster(spec, name, namespace, status, patch, body):
    _reconcile_kubeconfig(spec, name, status, patch)
    _reconcile_ttl_expiry(spec, name, status, patch)
    _reconcile_gpu_discovery(spec, name, status, patch)
    _reconcile_lock_job(spec, name, status, patch)

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Stop the reconcile pass after TTL expiry releases the lock.

_reconcile_ttl_expiry() only updates patch, but the same pass still calls _reconcile_lock_job() with the old spec/status. After _release_lock() deletes the sentinel, the self-healing branch still sees status.locked == True and the old spec.owner, so it can recreate the lock immediately and make TTL expiry ineffective.

Suggested fix
 def reconcile_psapcluster(spec, name, namespace, status, patch, body):
     _reconcile_kubeconfig(spec, name, status, patch)
-    _reconcile_ttl_expiry(spec, name, status, patch)
+    if _reconcile_ttl_expiry(spec, name, status, patch):
+        return
     _reconcile_gpu_discovery(spec, name, status, patch)
     _reconcile_lock_job(spec, name, status, patch)
 
-
 def _reconcile_ttl_expiry(spec, name, status, patch):
     if not status.get("locked"):
-        return
+        return False
 ...
     if datetime.now(timezone.utc) >= expires_at:
         prev_owner = spec.get("owner", "unknown")
         logger.info(
             "PSAPCluster %s: ownership expired (was owned by %s)", name, prev_owner
         )
         patch.spec["owner"] = ""
         _release_lock(name, patch)
+        return True
+    return False

Also applies to: 238-260, 354-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 216 - 220,
reconcile_psapcluster currently calls _reconcile_ttl_expiry(...) then continues
to call _reconcile_lock_job(...) which can immediately recreate a lock because
patch was mutated but spec/status still reflect the old locked state; modify
_reconcile_ttl_expiry to return a boolean (e.g., released=True if it called
_release_lock()) and update reconcile_psapcluster to check that return and stop
the reconcile pass (early return) when TTL expiry released the lock so
_reconcile_lock_job is not invoked with stale spec/status; apply the same
pattern to the other reconcile sequences mentioned (the blocks around lines
238-260 and 354-383) so each _reconcile_ttl_expiry caller aborts further
reconcile steps when it reports a release.

Comment on lines +303 to +310
patch.status.setdefault("conditions", []).append(
    _make_condition(
        COND_GPU_DISCOVERED,
        "False",
        "DiscoveryFailed",
        str(exc),
    )
)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Upsert conditions instead of appending duplicates on every timer run.

Both the success and failure paths append a fresh COND_GPU_DISCOVERED entry every reconciliation. On a long-lived cluster this grows status.conditions without bound and makes consumers pick between duplicate condition records. Replace the existing condition by type instead of always appending.

Also applies to: 338-340

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 303 - 310, The code currently
always appends a new condition into patch.status.setdefault("conditions", [])
(see the append call with _make_condition and COND_GPU_DISCOVERED), causing
duplicate COND_GPU_DISCOVERED entries over time; change this to "upsert" by
finding and removing any existing condition whose "type" equals
COND_GPU_DISCOVERED from patch.status["conditions"] before adding the new
_make_condition(...) entry. Apply the same replacement in both the failure path
(the shown append using str(exc)) and the corresponding success path where
COND_GPU_DISCOVERED is appended, using the symbols patch.status,
COND_GPU_DISCOVERED, and _make_condition to locate and modify the code.
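One way to sketch the upsert: drop any same-type entry before appending. The `make_condition` field layout here is an assumption based on the standard Kubernetes condition shape, not the PR's `_make_condition`:

```python
from datetime import datetime, timezone

COND_GPU_DISCOVERED = "GpuDiscovered"  # assumed constant value

def make_condition(cond_type, status, reason, message):
    # Simplified stand-in for the handler's _make_condition helper.
    return {
        "type": cond_type,
        "status": status,
        "reason": reason,
        "message": message,
        "lastTransitionTime": datetime.now(timezone.utc).isoformat(),
    }

def upsert_condition(conditions, condition):
    """Replace any existing condition of the same type, then append."""
    conditions[:] = [c for c in conditions if c.get("type") != condition["type"]]
    conditions.append(condition)
```

Each reconcile pass then leaves at most one `GpuDiscovered` entry in `status.conditions`, so the list stays bounded.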

Comment on lines +342 to +347
prev_gpus = hardware.get("gpus", [])
new_resources = [(g.short_name, g.count) for g in result.gpus]
prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
if new_resources != prev_resources and new_resources:
    try:
        ctx.kueue.update_flavor_quotas(name, new_resources)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear Kueue quotas when discovery drops to zero GPUs.

The and new_resources guard skips update_flavor_quotas() when a cluster previously had GPUs but discovery now returns none. That leaves stale GPU quota in Kueue and can schedule workloads onto capacity that no longer exists.

Suggested fix
-    if new_resources != prev_resources and new_resources:
+    if new_resources != prev_resources:
         try:
             ctx.kueue.update_flavor_quotas(name, new_resources)
         except Exception as exc:
             logger.warning(
                 "PSAPCluster %s: failed to update flavor quotas: %s", name, exc
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     prev_gpus = hardware.get("gpus", [])
     new_resources = [(g.short_name, g.count) for g in result.gpus]
     prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
-    if new_resources != prev_resources and new_resources:
+    if new_resources != prev_resources:
         try:
             ctx.kueue.update_flavor_quotas(name, new_resources)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 342 - 347, The guard `and
new_resources` prevents clearing Kueue quotas when discovery reports zero GPUs;
change the logic so that `ctx.kueue.update_flavor_quotas(name, new_resources)`
is called whenever `new_resources != prev_resources` (even if `new_resources` is
empty) so stale GPU quotas are removed — update the block around the variables
`prev_gpus`, `new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.

Comment thread README.md
Comment on lines +323 to +325
2. Add a ResourceFlavor and quota for the cluster in
`config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](#onboarding-a-new-cluster)).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

PSAPCluster onboarding instructions are internally inconsistent.

This section says users must manually add ResourceFlavor/quota, but the next section states the controller updates ClusterQueue quotas automatically. Please clarify whether step 2 is required, optional fallback, or legacy flow.

Proposed wording adjustment
-2. Add a ResourceFlavor and quota for the cluster in
-   `config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](`#onboarding-a-new-cluster`)).
+2. (Optional/manual fallback) If auto-discovery automation is disabled in your environment,
+   add a ResourceFlavor and quota for the cluster in
+   `config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](`#onboarding-a-new-cluster`)).

Also applies to: 341-345

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 323 - 325, The README's PSAPCluster onboarding is
inconsistent about manual quota setup: update the wording around the step that
says "Add a ResourceFlavor and quota for the cluster in
`config/kueue-cluster-config.yaml`" (and the repeated text at the later block)
to clearly state whether this manual step is required, optional fallback, or
legacy; explicitly mention that the controller (referenced as "controller
updates ClusterQueue quotas automatically" in the "Onboarding a new cluster"
section) will normally create/update ClusterQueue quotas for onboarded clusters
and keep the manual `config/kueue-cluster-config.yaml` step as an
optional/manual fallback for edge cases or legacy workflows, and add a short
cross-reference to the "Onboarding a new cluster" section and a one-line
example/criteria when the manual step is needed.

Comment thread fournos/core/discovery.py
Comment on lines +52 to +53
if cluster_name in existing_clusters:
    continue
Collaborator

we need to make sure that the objects (flavor, etc.) are there, even if `cluster_name in existing_clusters`.
they need to

  • exist
  • have the right content
    (I think they need to be created every time)
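That suggests an ensure-style pass instead of a skip: always converge each object to the desired content, whether it is missing or has drifted. A sketch under assumed client method names (`get_flavor`/`create_flavor`/`update_flavor` are hypothetical, not the PR's KueueClient API):

```python
def ensure_flavor(kueue, cluster_name, desired_spec):
    """Create-or-correct: idempotent reconcile of one ResourceFlavor.

    `kueue.get_flavor` returning None means the flavor is absent.
    """
    current = kueue.get_flavor(cluster_name)
    if current is None:
        kueue.create_flavor(cluster_name)
    elif current.get("spec") != desired_spec:
        kueue.update_flavor(cluster_name, desired_spec)
    # else: already correct, nothing to do
```

Running this on every scan is cheap (one GET per flavor) and heals both deleted and hand-edited objects.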

Comment on lines +16 to +25
GPU_MODEL_MAP: dict[str, str] = {
    "NVIDIA-A100-SXM4-80GB": "a100",
    "NVIDIA-A100-SXM4-40GB": "a100",
    "NVIDIA-A100-PCIE-80GB": "a100",
    "NVIDIA-A100-PCIE-40GB": "a100",
    "NVIDIA-H100-SXM5-80GB": "h100",
    "NVIDIA-H100-PCIE-80GB": "h100",
    "NVIDIA-H200-SXM-141GB": "h200",
    "AMD-Instinct-MI300X": "mi300x",
}
Collaborator
@kpouget kpouget May 1, 2026

these can be in a user-configurable ConfigMap
(can be done later. If we don't do it in this PR, we need to file a ticket so that we don't forget it)

Comment on lines +84 to +85
elif secret.string_data and "kubeconfig" in secret.string_data:
    raw = secret.string_data["kubeconfig"]
Collaborator

when would that happen ??

Collaborator Author

Good point. I think it would never happen when reading from the real k8s API, but since we have onboarded cluster-1, cluster-2 .... type fake clusters for convenience, Claude added some sort of defensive code. I was hitting the issue primarily for those fake clusters. :)

Collaborator

What is stringData?

Normally, the data field in a Kubernetes Secret requires values to be base64-encoded. While base64 isn't encryption, it’s a hassle to encode strings manually every time you write a YAML file.

The stringData field allows you to provide secret data as plain text.
How it works:

1. You write your Secret YAML using `stringData`.
2. You `kubectl apply` the file.
3. Kubernetes automatically encodes the values to base64, moves them to the `data` field, and removes the `stringData` field from the stored object.

So this path will never happen in real runs of the controller. The test `test_stringdata_kubeconfig` generates it, but I don't think that makes sense.
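For illustration, a Secret written with `stringData` (names here are examples) and what the API server actually stores after apply:

```yaml
# Authored manifest (example names; not from this repo):
apiVersion: v1
kind: Secret
metadata:
  name: cluster-1-kubeconfig
  namespace: psap-secrets
type: Opaque
stringData:
  kubeconfig: |
    apiVersion: v1
    kind: Config
    # ...

# After `kubectl apply`, the stored object instead carries:
#   data:
#     kubeconfig: <base64 of the text above>
# and stringData is gone, so a controller reading the live object
# only ever sees the `data` field.
```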

Collaborator
@kpouget kpouget left a comment

@MML-coder , could you turn this into a controller fully independent from FOURNOS?

you create a secondary top directory, and only import the relevant FOURNOS bits (constants, I assume)

we'll also need a name for the controller,
and maybe use another name than "PSAP".
Maybe FournosCluster, or something alike, if we find another nice name

Comment on lines +1 to +2
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
Collaborator

we'll rediscuss it, but I think we'll want to have

spec:
   resources: 
      h200: 8

with spec.resources either user provided or dynamically looked up if spec.resources: {}

Collaborator Author

Are you thinking the user could specify how many GPUs they want to use via spec.resources: 2 or 8, so the job can use the available GPUs?

Collaborator

there are two things I have in mind:

  1. we want the controller to do this only once (hence the spec field: like a dynamic default value, not a truly reconciled field)
  2. in the future, we want to allow the customization of the resources list with SW properties (TASK: refine the future behavior of the exclusive/cluster/hardware flags #70)

Comment thread fournos/core/discovery.py
Comment on lines +34 to +47
def scan(self) -> list[str]:
    """Scan for new clusters and onboard them.

    Returns the list of newly discovered cluster names.
    """
    secrets = list_kubeconfig_secrets(self._k8s_core)
    if not secrets:
        return []

    existing_flavors = self._kueue.list_flavors()
    existing_clusters = self._list_existing_psapclusters()

    discovered: list[str] = []
    for secret_name in secrets:
Collaborator

I think the pattern is wrong here for a controller

the way I think it can work in a safer way:

watch the creation / update of secrets with -lfournos.dev/cluster-kubeconfig=true

  • when such a secret is touched, make sure there's a PSAPCluster object matching it
  • put the owner field in the secret (TBC) to the PSAPCluster object

watch the creation / update of PSAPCluster objects

  • when such an object is touched, make sure that its resourceflavor exists
  • when the kubeconfig secret is available, if spec.hardware is empty, look it up in the cluster
  • when the spec.hardware is filled, update the ClusterQueue
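The two watch-driven reconciles above can be sketched as plain functions; in the real controller they would be registered via kopf decorators (e.g. `@kopf.on.create("", "v1", "secrets", labels={"fournos.dev/cluster-kubeconfig": "true"})`). All `api.*` helper names and the secret naming convention are hypothetical stand-ins:

```python
def cluster_name_from_secret(secret_name, prefix="kubeconfig-"):
    # Assumed naming convention, for illustration only.
    return secret_name.removeprefix(prefix)

def on_kubeconfig_secret(api, secret_name, namespace):
    """Secret touched -> make sure a matching PSAPCluster exists."""
    cluster = cluster_name_from_secret(secret_name)
    if api.get_psapcluster(cluster) is None:
        api.create_psapcluster(cluster, secret_name, namespace)

def on_psapcluster(api, name, spec):
    """PSAPCluster touched -> flavor exists, hardware filled, quotas synced."""
    api.ensure_resource_flavor(name)
    if not spec.get("hardware"):
        # One-time dynamic default: look up hardware in the target cluster.
        spec["hardware"] = api.discover_hardware(name)
    api.update_cluster_queue(name, spec["hardware"])
```

Compared with the periodic scan, every reconcile is triggered by an event on the object it owns, so there is no cross-object loop to abort and each cluster fails or succeeds on its own.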


Labels

needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • FEATURE: ability to discover the target cluster GPUs
  • FEATURE: define a FournosCluster resource to track the state of the clusters

2 participants