
feat: PSAPCluster controller with sentinel locking, auto-discovery, and GPU discovery#74

Open
MML-coder wants to merge 4 commits into main from feat/psapcluster-controller

Conversation


@MML-coder MML-coder commented Apr 29, 2026

Summary

Introduces a PSAPCluster custom resource and kopf controller that unifies cluster
management with GPU discovery, sentinel-based locking via Kueue, and automatic
cluster onboarding from kubeconfig secrets.

Closes #37, closes #51

What's included

  • PSAPCluster CRD (fournos.dev/v1) — single source of truth for cluster state with printer columns for oc get psapclusters
  • GPU discovery — connects to target clusters via kubeconfig secrets, lists nodes, aggregates GPU counts by vendor/model with exponential backoff on failures
  • Sentinel FournosJob locking — set spec.owner to create a sentinel FournosJob (lockOnly: true, exclusive: true) that holds all 100 cluster-slots via Kueue, blocking competing workloads; optional spec.ttl enables auto-expiry
  • Auto-discovery — daemon thread scans kubeconfig secrets every 60s, auto-creates ResourceFlavor + ClusterQueue flavor entry + PSAPCluster CR for each new cluster (labeled fournos.dev/auto-discovered: true)
  • lockOnly FournosJob support — new CRD field for sentinel jobs that skip Forge resolution and stay Admitted indefinitely
  • Self-healing — reconcile timer recreates missing sentinel jobs, validates kubeconfig, retries GPU discovery
  • Single global ClusterQueue — all clusters share fournos-queue with per-cluster ResourceFlavors

Architecture

Kubeconfig Secret (psap-secrets)
  └─ Auto-Discovery (60s scan)
       ├─ ResourceFlavor (nodeLabel: fournos.dev/cluster=<name>)
       ├─ ClusterQueue flavor entry (cluster-slot: 100)
       └─ PSAPCluster CR
            ├─ GPU Discovery (probes target cluster nodes)
            │    └─ Updates GPU quotas in ClusterQueue
            └─ Locking (spec.owner)
                 └─ Sentinel FournosJob (lockOnly + exclusive)
                      └─ Kueue Workload (holds 100 cluster-slots)
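
As a rough sketch of the sentinel manifest the controller would create (every field below other than spec.lockOnly, spec.exclusive, and the 100 cluster-slot count is an illustrative guess, not the actual CRD schema):

```python
def build_sentinel_job(cluster: str, owner: str, slots: int = 100) -> dict:
    """Build a hypothetical lock-only FournosJob manifest for a cluster.

    Field and label names other than spec.lockOnly / spec.exclusive are
    illustrative, not taken from the real CRD.
    """
    return {
        "apiVersion": "fournos.dev/v1",
        "kind": "FournosJob",
        "metadata": {
            "name": f"lock-{cluster}",
            "labels": {"fournos.dev/lock-owner": owner},
        },
        "spec": {
            "cluster": cluster,
            "lockOnly": True,   # skip Forge resolution, stay Admitted
            "exclusive": True,  # request exclusive use of the cluster
            "slots": slots,     # holding all 100 slots blocks competing workloads
        },
    }

job = build_sentinel_job("athena-fire", "mehul")
```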

New/modified files

| File | Change |
|---|---|
| manifests/crd.yaml | Added lockOnly field, removed forge from required |
| manifests/psapcluster-crd.yaml | Updated for sentinel approach (added lockJobName, removed old CQ fields) |
| manifests/rbac/role_fournos.yaml | Added create/delete for fournosjobs, create for psapclusters |
| fournos/core/discovery.py | New — ClusterDiscovery class for auto-onboarding clusters |
| fournos/core/clusters.py | Added extract_cluster_name, list_kubeconfig_secrets |
| fournos/core/kueue.py | Added create_flavor, add_flavor_to_cluster_queue, update_flavor_quotas |
| fournos/core/psapcluster.py | Deleted — PSAPClusterManager removed |
| fournos/handlers/discovery.py | New — scan_clusters handler |
| fournos/handlers/psapcluster.py | Rewritten for sentinel FournosJob create/delete |
| fournos/handlers/lifecycle.py | Added lockOnly handling in on_create |
| fournos/handlers/execution.py | Added lockOnly early return in reconcile_admitted |
| fournos/operator.py | Wired discovery daemon thread and ClusterDiscovery init |
| tests/unit/test_cluster_discovery.py | New — 28 unit tests for discovery |
| tests/unit/test_psapcluster_handlers.py | Rewritten for sentinel approach |
| tests/unit/test_psapcluster_manager.py | Deleted |

Usage

# View clusters
oc get psapclusters -n psap-automation

# Lock a cluster (indefinite)
oc patch psapcluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":"mehul"}}'

# Lock with 4-hour TTL
oc patch psapcluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":"mehul","ttl":"4h"}}'

# Unlock
oc patch psapcluster athena-fire -n psap-automation --type merge \
  -p '{"spec":{"owner":""}}'

# Auto-discover: just drop a kubeconfig secret
oc create secret generic kubeconfig-new-cluster \
  --from-file=kubeconfig=/path/to/kubeconfig -n psap-secrets
# Within 60s: ResourceFlavor + CQ entry + PSAPCluster CR created automatically
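
The spec.ttl strings above ("4h") imply a duration parser; a minimal sketch, assuming simple s/m/h/d suffixes (the units the controller actually accepts are not specified here):

```python
import re
from datetime import timedelta

_TTL_RE = re.compile(r"^(?P<value>\d+)(?P<unit>[smhd])$")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_ttl(ttl: str) -> timedelta:
    """Parse a duration string such as '4h' into a timedelta."""
    match = _TTL_RE.match(ttl)
    if not match:
        raise ValueError(f"Unsupported TTL format: {ttl!r}")
    return timedelta(seconds=int(match["value"]) * _UNIT_SECONDS[match["unit"]])
```

The reconciler would then compare lockExpiresAt (creation time plus this delta) against the current time on each timer tick.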

Test plan

  • 91 unit tests pass (28 discovery + 34 PSAPCluster handlers + 29 GPU discovery)
  • Deployed to psap-automation cluster
  • End-to-end locking test: sentinel created, competing workload blocked, unlock releases it
  • GPU discovery confirmed for athena-fire (8x H200) and psap-mgmt
  • Auto-discovery created PSAPClusters for kubeconfig secrets (cluster-1 through cluster-4)
  • Verified RBAC allows sentinel create/delete
  • Verified self-healing: reconciler recreates missing sentinel when owner is set

Summary by CodeRabbit

  • New Features

    • Support for managing remote Kubernetes clusters with automatic GPU inventory tracking
    • Automatic discovery of clusters from kubeconfig secrets
    • Exclusive cluster locking with ownership-based controls and optional TTL expiry
    • New lockOnly job mode to reserve cluster capacity without executing pipelines
  • Documentation

    • Added comprehensive operational guide for cluster management, including setup procedures and troubleshooting

…ocking

Introduces a PSAPCluster custom resource and kopf controller that unifies
cluster definition, discovers GPU hardware dynamically, manages per-cluster
Kueue ClusterQueues, and enables ownership locking via stopPolicy.

- PSAPCluster CRD (fournos.dev/v1) with spec fields: kubeconfigSecret,
  owner, ttl, evict, gpuDiscoveryInterval
- GPU discovery client that connects to target clusters via kubeconfig
  secrets, lists nodes, and aggregates GPU counts by vendor/model
- Per-cluster ClusterQueues in a shared cohort with self-healing
- Ownership locking with Hold/HoldAndDrain stopPolicy and TTL auto-expiry
- Timer-based reconciliation: kubeconfig validation, TTL expiry, GPU
  discovery with backoff, CQ/RF self-healing
- 82 unit tests covering all modules

Closes #37, closes #51

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ashishkamra for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci Bot commented Apr 29, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 29, 2026

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

This PR introduces PSAPCluster custom resource for cluster state visibility, implements GPU discovery from target clusters via kubeconfig secrets, adds cluster auto-discovery from kubeconfig secrets, enables cluster-level exclusive locking via sentinel jobs, and provides Kueue flavor/quota management for multi-cluster scheduling.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **PSAPCluster CRD and RBAC**<br>manifests/psapcluster-crd.yaml, manifests/rbac/role_fournos.yaml | New PSAPCluster CRD with spec/status fields for kubeconfig, owner-based locking, GPU discovery, TTL, and hardware inventory. RBAC updated to grant PSAPCluster and LocalQueue management permissions. |
| **GPU Discovery**<br>fournos/core/gpu_discovery.py, tests/unit/test_gpu_discovery.py | New GPU discovery module connecting to target clusters via kubeconfig, aggregating GPU counts by vendor/model, handling failures with explicit exceptions, and returning structured discovery results. Comprehensive unit tests cover success/error paths and model normalization. |
| **Cluster Auto-Discovery**<br>fournos/core/clusters.py, fournos/core/discovery.py, fournos/handlers/discovery.py, tests/unit/test_cluster_discovery.py | Utilities to parse kubeconfig secret names by regex pattern, list matching secrets, and automatically discover clusters from secrets. ClusterDiscovery creates PSAPCluster CRs and Kueue flavors for newly discovered clusters. The handler exposes the scan_clusters() entry point. Tests validate regex parsing, filtering, flavor/PSAPCluster creation, and idempotency. |
| **PSAPCluster Handlers and Reconciliation**<br>fournos/handlers/psapcluster.py, tests/unit/test_psapcluster_handlers.py | New handler module for the PSAPCluster lifecycle: on-create validation and initialization, owner-change-based locking/unlocking via sentinel jobs, and timer-based reconciliation enforcing TTL expiry, GPU discovery with interval/backoff, lock self-healing, and Kueue quota updates. Tests cover all handler paths, including failure modes and state transitions. |
| **Kueue Integration**<br>fournos/core/kueue.py | Three new KueueClient methods: create_flavor for idempotent ResourceFlavor creation, add_flavor_to_cluster_queue for appending flavors with default zero quotas, and update_flavor_quotas for modifying existing flavor GPU/cluster-slot allocations. |
| **FournosJob Locking Support**<br>fournos/handlers/lifecycle.py, fournos/handlers/execution.py, manifests/crd.yaml | Support for spec.lockOnly mode: FournosJob create validates the cluster requirement, bypasses Forge for lockOnly jobs, and creates an exclusive Kueue Workload; the execution handler skips PipelineRun checks. The CRD removes the forge requirement and adds a lockOnly boolean field. |
| **Operator Integration and Configuration**<br>fournos/operator.py, fournos/settings.py, fournos/state.py, fournos/core/constants.py, fournos/handlers/__init__.py | Operator startup creates GPU discovery and cluster discovery clients, starts the background discovery loop, and registers PSAPCluster kopf handlers. Settings add the cluster-queue name, timer intervals, discovery timeout, and cluster-discovery interval. State holds the new client references. Constants define lock/discovery labels and condition types. Handler exports re-export the new functions. |
| **Documentation and Samples**<br>README.md, dev/mock-psapcluster.yaml, config/forge/samples/job-full.yaml | Extensive README section documenting PSAPCluster onboarding, the locking mechanism, GPU discovery, TTL/expiry, troubleshooting, and the e2e test. Mock cluster samples for testing. Sample job updated to reference the athena-fire cluster. |
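
The idempotent create_flavor described under Kueue Integration presumably creates the ResourceFlavor and tolerates an already-exists conflict; a sketch with a stand-in API client and exception type (not the real KueueClient or kubernetes ApiException):

```python
class ConflictError(Exception):
    """Stand-in for kubernetes.client.ApiException with status 409."""
    status = 409

def create_flavor(api, cluster: str) -> dict:
    """Create a per-cluster ResourceFlavor, treating 'already exists' as a no-op.

    'api' is a duck-typed stand-in exposing create(manifest).
    """
    flavor = {
        "apiVersion": "kueue.x-k8s.io/v1beta1",
        "kind": "ResourceFlavor",
        "metadata": {"name": cluster},
        "spec": {"nodeLabels": {"fournos.dev/cluster": cluster}},
    }
    try:
        api.create(flavor)
    except ConflictError:
        pass  # flavor already exists: idempotent create
    return flavor
```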

Sequence Diagram(s)

sequenceDiagram
    participant Operator as PSAPCluster Operator
    participant K8s as Kubernetes API
    participant TargetCluster as Target Cluster API
    participant Kueue as Kueue API

    Operator->>K8s: Watch PSAPCluster events
    
    rect rgba(0, 100, 150, 0.5)
    Note over Operator,K8s: On PSAPCluster Create
    K8s->>Operator: Trigger create event
    Operator->>K8s: Validate kubeconfig secret
    Operator->>K8s: Initialize status, conditions
    Operator->>K8s: Create sentinel lock job if spec.owner set
    K8s->>Operator: Acknowledge
    end

    rect rgba(100, 50, 150, 0.5)
    Note over Operator,TargetCluster: GPU Discovery Loop
    Operator->>Operator: Timer triggers reconciliation
    Operator->>K8s: Read kubeconfig secret
    K8s->>Operator: Return secret data
    Operator->>TargetCluster: Connect & list nodes
    TargetCluster->>Operator: Return GPU allocatable resources
    Operator->>Operator: Aggregate by vendor/model
    Operator->>K8s: Update PSAPCluster status.hardware
    end

    rect rgba(150, 100, 50, 0.5)
    Note over Operator,Kueue: Quota Update on Inventory Change
    Operator->>K8s: Detect GPU count change
    Operator->>Kueue: Ensure flavor exists
    Kueue->>Operator: Confirm flavor
    Operator->>Kueue: Update flavor quotas
    Kueue->>Operator: Acknowledge patch
    end

    rect rgba(100, 150, 50, 0.5)
    Note over Operator,K8s: Owner Change Triggers Lock
    K8s->>Operator: Trigger spec.owner update
    Operator->>K8s: Create/delete sentinel job
    Operator->>K8s: Update locked, lockExpiresAt
    K8s->>Operator: Acknowledge
    end
sequenceDiagram
    participant BackgroundLoop as Discovery Loop
    participant K8s as Kubernetes API
    participant ClusterDiscovery as ClusterDiscovery
    participant KueueClient as KueueClient

    BackgroundLoop->>ClusterDiscovery: Trigger scan_clusters()
    
    rect rgba(100, 150, 100, 0.5)
    Note over ClusterDiscovery,K8s: Auto-Discover Clusters
    ClusterDiscovery->>K8s: List kubeconfig secrets
    K8s->>ClusterDiscovery: Return secret names
    ClusterDiscovery->>ClusterDiscovery: Extract cluster names via regex
    ClusterDiscovery->>K8s: Check for existing PSAPCluster CRs
    K8s->>ClusterDiscovery: Return existing clusters
    ClusterDiscovery->>ClusterDiscovery: Filter new clusters
    
    loop For each new cluster
        ClusterDiscovery->>KueueClient: Ensure flavor exists
        KueueClient->>K8s: Create ResourceFlavor
        ClusterDiscovery->>KueueClient: Add flavor to cluster queue
        KueueClient->>K8s: Patch ClusterQueue resourceGroups
        ClusterDiscovery->>K8s: Create PSAPCluster CR
        K8s->>ClusterDiscovery: Confirm CR created
    end
    end
    
    BackgroundLoop->>BackgroundLoop: Log discovered clusters

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

lgtm

Poem

🐰 A cluster discovered, with GPUs laid bare,
Sentinels standing in locks debonair,
Quotas dance with the Kueue's grace,
Auto-finding homes for each cluster's space!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 7.39%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The PR title accurately summarizes the main changes: PSAPCluster controller implementation with sentinel locking, auto-discovery, and GPU discovery. |
| Linked Issues check | ✅ Passed | The PR comprehensively addresses both linked issues: #37 introduces the PSAPCluster CRD for cluster state tracking, and #51 implements GPU discovery via kubeconfig secrets with dynamic hardware enumeration. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the objectives: PSAPCluster controller, GPU discovery, cluster auto-discovery, Kueue integration, and lock-only job support. One minor update to the job-full.yaml sample cluster reference is incidental. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
fournos/handlers/psapcluster.py (1)

52-64: 💤 Low value

Consider exposing a public method on ctx.registry instead of accessing _k8s directly.

Accessing the private _k8s attribute couples this handler to the internal implementation of ClusterRegistry. Consider adding a public method like ctx.registry.secret_exists(name, namespace) or using a shared CoreV1Api instance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 52 - 64, Update
_check_kubeconfig to stop accessing the private ctx.registry._k8s and call a new
public API on the registry (e.g., ctx.registry.secret_exists(secret_name,
namespace)) that encapsulates the read_namespaced_secret logic; add the
corresponding method on ClusterRegistry (secret_exists(name, namespace)) which
returns a boolean or raises a controlled exception for non-404 errors, then have
_check_kubeconfig map True->"Valid", False->"Missing" and handle other errors as
"Invalid" while logging the exception via logger.warning as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@fournos/core/gpu_discovery.py`:
- Around line 125-139: The timeout on the Kubernetes call is ineffective because
setting api_client.configuration.request_timeout doesn't apply to individual
calls; update the code that builds the client and calls list_node so that you
remove the assignment to api_client.configuration.request_timeout and instead
call target_core.list_node(...) passing
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec); keep the existing try/except that wraps the
call and raise GPUDiscoveryError on exception, and ensure api_client.close()
remains in the finally block.

In `@fournos/core/psapcluster.py`:
- Around line 181-185: The stopPolicy patch handler currently catches
ApiException but ApiException is not imported; update the exception handling in
that handler to catch client.exceptions.ApiException instead (i.e., change the
except block that reads "except ApiException as exc:" to use
client.exceptions.ApiException) so it matches other handlers in this module and
avoids a NameError; ensure the logger.warning and raise behavior remain
unchanged and reference cq_name as before.

In `@README.md`:
- Around line 307-312: The fenced code block showing the table lacks a language
identifier (MD040); update the backticks surrounding the block (the triple-fence
that begins the block containing "NAME OWNER GPUS KUBECONFIG LOCKED AGE") to
include a language tag such as "text" (e.g., change ``` to ```text) so
markdownlint no longer flags the code fence.

---

Nitpick comments:
In `@fournos/handlers/psapcluster.py`:
- Around line 52-64: Update _check_kubeconfig to stop accessing the private
ctx.registry._k8s and call a new public API on the registry (e.g.,
ctx.registry.secret_exists(secret_name, namespace)) that encapsulates the
read_namespaced_secret logic; add the corresponding method on ClusterRegistry
(secret_exists(name, namespace)) which returns a boolean or raises a controlled
exception for non-404 errors, then have _check_kubeconfig map True->"Valid",
False->"Missing" and handle other errors as "Invalid" while logging the
exception via logger.warning as before.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e083bc62-0476-4762-9cde-25f919d35cc5

📥 Commits

Reviewing files that changed from the base of the PR and between 83811e2 and 2634aef.

📒 Files selected for processing (15)
  • README.md
  • dev/mock-psapcluster.yaml
  • fournos/core/constants.py
  • fournos/core/gpu_discovery.py
  • fournos/core/psapcluster.py
  • fournos/handlers/__init__.py
  • fournos/handlers/psapcluster.py
  • fournos/operator.py
  • fournos/settings.py
  • fournos/state.py
  • manifests/psapcluster-crd.yaml
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_gpu_discovery.py
  • tests/unit/test_psapcluster_handlers.py
  • tests/unit/test_psapcluster_manager.py

Comment on lines +125 to +139
api_client = k8s_config.new_client_from_config_dict(kubeconfig_dict)
api_client.configuration.request_timeout = (
    settings.gpu_discovery_timeout_sec,
    settings.gpu_discovery_timeout_sec,
)

try:
    target_core = client.CoreV1Api(api_client)
    nodes = target_core.list_node()
except Exception as exc:
    raise GPUDiscoveryError(
        f"Failed to connect to cluster {cluster_name!r}: {exc}"
    ) from exc
finally:
    api_client.close()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

kubernetes python client request_timeout configuration tuple format

💡 Result:

In the Kubernetes Python client, the _request_timeout parameter configures client-side request timeouts for API calls. It is passed as a keyword argument (e.g., _request_timeout=60) to API methods like CoreV1Api.list_namespaced_pod(...) or watch.Watch.stream(...). The parameter accepts: - A single number (int/float): total timeout for the request in seconds. - A tuple of length 2: (connect_timeout, read_timeout) in seconds, where the first value sets the connection timeout and the second sets the read timeout. This is handled by the underlying ApiClient.call_api and RESTClient.request methods, which pass it to urllib3's HTTP request machinery. Example: from kubernetes import client, watch v1 = client.CoreV1Api w = watch.Watch for event in w.stream(v1.list_namespaced_pod, namespace="default", _request_timeout=(5.0, 30.0)): ... Note: There is no global default timeout in Configuration; it must be set per-call. For exec streams, tuple support may have bugs in some versions (e.g., issue #2292). Official documentation: https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md explicitly states it accepts "int/long" or "a tuple (with a length of 2)". Client docs confirm: "If one number provided, it will be total request timeout. It can also be a pair (tuple) of (connection, read) timeouts."



🌐 Web query:

kubernetes python client Configuration request_timeout global default

💡 Result:

In the Kubernetes Python client library, there is no global default request_timeout setting in the Configuration class. The _request_timeout parameter must be specified explicitly on individual API calls (e.g., list_namespaced_pod(_request_timeout=60)) or watch streams. If not set, the default is None, meaning no timeout (infinite wait, matching urllib3/socket defaults). Official documentation confirms: "In case, if the _request_timeout value is not set, then the default value is None & socket will have no timeout." (examples/watch/timeout-settings.md). Requests for global configuration via Configuration (e.g., issues #1961, #2227) remain open or unresolved as of 2026, with no such field in configuration.py source code. Per-call _request_timeout accepts int (total timeout) or tuple (connect, read).



Timeout configuration is ineffective; pass _request_timeout to the API call instead.

Setting api_client.configuration.request_timeout does not apply timeouts to Kubernetes API calls. The Python client requires the _request_timeout parameter to be passed explicitly to individual API methods. Change target_core.list_node() to target_core.list_node(_request_timeout=(settings.gpu_discovery_timeout_sec, settings.gpu_discovery_timeout_sec)), and remove the Configuration assignment. Without this fix, the timeout has no effect and the call will hang indefinitely on an unreachable cluster.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/core/gpu_discovery.py` around lines 125 - 139, The timeout on the
Kubernetes call is ineffective because setting
api_client.configuration.request_timeout doesn't apply to individual calls;
update the code that builds the client and calls list_node so that you remove
the assignment to api_client.configuration.request_timeout and instead call
target_core.list_node(...) passing
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec); keep the existing try/except that wraps the
call and raise GPUDiscoveryError on exception, and ensure api_client.close()
remains in the finally block.
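
The corrected call shape can be checked without a live cluster; in this sketch core_api is duck-typed to stand in for kubernetes.client.CoreV1Api, and the per-call _request_timeout tuple is (connect, read) seconds:

```python
def list_nodes(core_api, timeout_sec: float):
    """List nodes with an explicit per-call (connect, read) timeout.

    core_api is expected to expose list_node(_request_timeout=...), as
    kubernetes.client.CoreV1Api does; the Configuration.request_timeout
    attribute is not consulted for individual calls, which is why the
    tuple must be passed here.
    """
    return core_api.list_node(_request_timeout=(timeout_sec, timeout_sec))

class _FakeCoreV1Api:
    """Records the timeout it was called with; stands in for CoreV1Api."""
    def list_node(self, _request_timeout=None):
        self.seen_timeout = _request_timeout
        return ["node-a", "node-b"]

api = _FakeCoreV1Api()
nodes = list_nodes(api, 10.0)
```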

Comment thread fournos/core/psapcluster.py Outdated
Comment thread README.md
Replace per-cluster ClusterQueue stopPolicy locking with sentinel
FournosJob approach. A lockOnly sentinel holds all 100 cluster-slots
via Kueue, blocking competing workloads without requiring per-cluster
CQ infrastructure.

Add auto-discovery daemon that scans kubeconfig secrets every 60s and
auto-creates ResourceFlavor, ClusterQueue flavor entry, and PSAPCluster
CR for each new cluster found.

Key changes:
- Add lockOnly field to FournosJob CRD, skip Forge for lockOnly jobs
- Rewrite PSAPCluster handlers for sentinel create/delete lifecycle
- Remove PSAPClusterManager (fournos/core/psapcluster.py)
- Add create_flavor, add_flavor_to_cluster_queue, update_flavor_quotas
  to KueueClient
- Add ClusterDiscovery with extract_cluster_name, list_kubeconfig_secrets
- Wire discovery daemon thread in operator startup
- Fix RBAC: add create/delete for fournosjobs, create for psapclusters
- Add end-to-end locking test walkthrough to README
- 28 new discovery unit tests, rewritten PSAPCluster handler tests
@MML-coder MML-coder changed the title feat: PSAPCluster CRD controller with GPU discovery and cluster locking feat: PSAPCluster controller with sentinel locking, auto-discovery, and GPU discovery May 1, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (1)
fournos/operator.py (1)

169-177: ⚡ Quick win

Run discovery once before the first interval sleep.

Current loop delays initial discovery until the first timer elapses. Doing an immediate first pass reduces startup lag for newly added kubeconfig secrets.

Proposed fix
 def _discovery_loop():
     interval = settings.cluster_discovery_interval_sec
     while True:
-        time.sleep(interval)
         try:
             handlers.scan_clusters()
         except Exception:
             logger.exception("Cluster discovery scan failed")
+        time.sleep(interval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/operator.py` around lines 169 - 177, The discovery loop in
_discovery_loop currently waits for settings.cluster_discovery_interval_sec
before the first scan; change the loop to perform an immediate scan on startup
by calling handlers.scan_clusters() (inside the existing try/except that logs
via logger.exception) before sleeping, i.e. move the sleep to occur after the
scan so the sequence is: try: handlers.scan_clusters() except Exception:
logger.exception(...) then time.sleep(settings.cluster_discovery_interval_sec),
and repeat.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@fournos/core/discovery.py`:
- Around line 47-63: The loop that processes each secret should not abort the
entire discovery pass on a single failure; wrap the per-cluster work (after
cluster_name resolution) in a try/except so exceptions from
extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.

In `@fournos/core/kueue.py`:
- Around line 215-271: add_flavor_to_cluster_queue (and similarly
update_flavor_quotas) currently does a read-modify-write on the global
ClusterQueue and can lose concurrent changes; modify these functions to perform
optimistic retry-on-conflict by reading and including the object's current
metadata.resourceVersion when patching and retrying on ApiException conflict
(409) a few times, or alternatively switch to a narrower patch/Server-Side Apply
that targets only the specific flavor element instead of replacing
spec.resourceGroups; locate add_flavor_to_cluster_queue and update_flavor_quotas
in kueue.py and implement a loop that GETs the ClusterQueue, computes the
change, issues a PATCH/SSApply with resourceVersion or field-specific patch, and
retries on conflict until success or max attempts.
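
The retry-on-conflict pattern this comment asks for can be sketched generically; Conflict and the read/patch/mutate callables below are stand-ins for the real Kubernetes client plumbing, not the actual KueueClient API:

```python
class Conflict(Exception):
    """Stand-in for an ApiException with HTTP status 409."""

def update_cluster_queue(read, patch, mutate, max_attempts=5):
    """Optimistic read-modify-write with retry on 409 conflicts.

    read() returns the current object (carrying metadata.resourceVersion),
    mutate(obj) applies the desired change in place, and patch(obj) submits
    it, raising Conflict when the resourceVersion is stale.
    """
    for attempt in range(max_attempts):
        obj = read()        # re-read so each attempt sees the latest version
        mutate(obj)
        try:
            patch(obj)
            return obj
        except Conflict:
            if attempt == max_attempts - 1:
                raise       # give up after the final attempt
```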

In `@fournos/handlers/lifecycle.py`:
- Around line 52-63: Reject combinations where lockOnly is true but exclusive is
false instead of silently overriding: in the validation blocks that read
exclusive = spec["exclusive"] and lock_only = spec.get("lockOnly", False) (and
the duplicate block around lines with similar logic), add a check that if
lock_only is True and exclusive is False then set patch.status["phase"] =
Phase.FAILED and patch.status["message"] to a clear message like "lockOnly: true
requires 'exclusive' to be true" and return; reference the same symbols
(exclusive, lock_only, spec) and ensure the same check is applied in both places
(the initial validation and the later similar block that currently enforces
exclusivity in _create_lock_workload()) so user input is rejected rather than
silently overridden.

In `@fournos/handlers/psapcluster.py`:
- Around line 216-220: reconcile_psapcluster currently calls
_reconcile_ttl_expiry(...) then continues to call _reconcile_lock_job(...) which
can immediately recreate a lock because patch was mutated but spec/status still
reflect the old locked state; modify _reconcile_ttl_expiry to return a boolean
(e.g., released=True if it called _release_lock()) and update
reconcile_psapcluster to check that return and stop the reconcile pass (early
return) when TTL expiry released the lock so _reconcile_lock_job is not invoked
with stale spec/status; apply the same pattern to the other reconcile sequences
mentioned (the blocks around lines 238-260 and 354-383) so each
_reconcile_ttl_expiry caller aborts further reconcile steps when it reports a
release.
- Around line 303-310: The code currently always appends a new condition into
patch.status.setdefault("conditions", []) (see the append call with
_make_condition and COND_GPU_DISCOVERED), causing duplicate COND_GPU_DISCOVERED
entries over time; change this to "upsert" by finding and removing any existing
condition whose "type" equals COND_GPU_DISCOVERED from
patch.status["conditions"] before adding the new _make_condition(...) entry.
Apply the same replacement in both the failure path (the shown append using
str(exc)) and the corresponding success path where COND_GPU_DISCOVERED is
appended, using the symbols patch.status, COND_GPU_DISCOVERED, and
_make_condition to locate and modify the code.
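
The condition upsert this comment describes is a small list operation; the helper below is hypothetical (the real code works on patch.status["conditions"] entries produced by _make_condition):

```python
def upsert_condition(conditions: list, new_cond: dict) -> list:
    """Replace any existing condition of the same type, then append new_cond.

    Mutates the list in place so it can be used directly on a status patch.
    """
    conditions[:] = [c for c in conditions if c.get("type") != new_cond["type"]]
    conditions.append(new_cond)
    return conditions
```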
- Around line 342-347: The guard `and new_resources` prevents clearing Kueue
quotas when discovery reports zero GPUs; change the logic so that
`ctx.kueue.update_flavor_quotas(name, new_resources)` is called whenever
`new_resources != prev_resources` (even if `new_resources` is empty) so stale
GPU quotas are removed — update the block around the variables `prev_gpus`,
`new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.

In `@README.md`:
- Around line 323-325: The README's PSAPCluster onboarding is inconsistent about
manual quota setup: update the wording around the step that says "Add a
ResourceFlavor and quota for the cluster in `config/kueue-cluster-config.yaml`"
(and the repeated text at the later block) to clearly state whether this manual
step is required, optional fallback, or legacy; explicitly mention that the
controller (referenced as "controller updates ClusterQueue quotas automatically"
in the "Onboarding a new cluster" section) will normally create/update
ClusterQueue quotas for onboarded clusters and keep the manual
`config/kueue-cluster-config.yaml` step as an optional/manual fallback for edge
cases or legacy workflows, and add a short cross-reference to the "Onboarding a
new cluster" section and a one-line example/criteria when the manual step is
needed.

---

Nitpick comments:
In `@fournos/operator.py`:
- Around line 169-177: The discovery loop in _discovery_loop currently waits for
settings.cluster_discovery_interval_sec before the first scan; change the loop
to perform an immediate scan on startup by calling handlers.scan_clusters()
(inside the existing try/except that logs via logger.exception) before sleeping,
i.e. move the sleep to occur after the scan so the sequence is: try:
handlers.scan_clusters() except Exception: logger.exception(...) then
time.sleep(settings.cluster_discovery_interval_sec), and repeat.
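The restructured loop can be sketched as follows; `scan` and `stop` are injected here for testability and are assumptions, not the operator's real signature:

```python
import logging
import time

logger = logging.getLogger(__name__)

def discovery_loop(scan, interval_sec, stop=lambda: False):
    """Scan immediately on startup, then sleep between passes.

    A failing scan is logged and the loop continues; `stop` is a test hook
    standing in for the daemon thread's lifetime.
    """
    while True:
        try:
            scan()  # the first scan happens before any sleep
        except Exception:
            logger.exception("Cluster discovery scan failed")
        if stop():
            return
        time.sleep(interval_sec)
```

With this shape, newly added kubeconfig secrets are picked up as soon as the operator starts instead of after the first full interval.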

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c74b8b6-0fa5-4d35-a8eb-3736ecf1f197

📥 Commits

Reviewing files that changed from the base of the PR and between 2634aef and efa8139.

📒 Files selected for processing (19)
  • README.md
  • config/forge/samples/job-full.yaml
  • fournos/core/clusters.py
  • fournos/core/constants.py
  • fournos/core/discovery.py
  • fournos/core/kueue.py
  • fournos/handlers/__init__.py
  • fournos/handlers/discovery.py
  • fournos/handlers/execution.py
  • fournos/handlers/lifecycle.py
  • fournos/handlers/psapcluster.py
  • fournos/operator.py
  • fournos/settings.py
  • fournos/state.py
  • manifests/crd.yaml
  • manifests/psapcluster-crd.yaml
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_cluster_discovery.py
  • tests/unit/test_psapcluster_handlers.py
✅ Files skipped from review due to trivial changes (3)
  • config/forge/samples/job-full.yaml
  • fournos/handlers/__init__.py
  • manifests/psapcluster-crd.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • fournos/state.py
  • fournos/core/constants.py
  • manifests/rbac/role_fournos.yaml
  • tests/unit/test_psapcluster_handlers.py

Comment thread fournos/core/discovery.py
Comment on lines +47 to +63
for secret_name in secrets:
    cluster_name = extract_cluster_name(secret_name)
    if cluster_name is None:
        continue

    if cluster_name in existing_clusters:
        continue

    logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

    if cluster_name not in existing_flavors:
        self._kueue.create_flavor(cluster_name)

    self._kueue.add_flavor_to_cluster_queue(cluster_name)
    self._create_psapcluster(cluster_name, secret_name)
    discovered.append(cluster_name)

@coderabbitai coderabbitai Bot May 1, 2026

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

A single onboarding failure aborts the whole discovery pass.

If one cluster fails during flavor/CQ/CR creation, the loop exits and later valid clusters are skipped for that cycle. Isolate errors per cluster and continue.

Proposed fix
         discovered: list[str] = []
         for secret_name in secrets:
             cluster_name = extract_cluster_name(secret_name)
             if cluster_name is None:
                 continue

             if cluster_name in existing_clusters:
                 continue

             logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

-            if cluster_name not in existing_flavors:
-                self._kueue.create_flavor(cluster_name)
-
-            self._kueue.add_flavor_to_cluster_queue(cluster_name)
-            self._create_psapcluster(cluster_name, secret_name)
-            discovered.append(cluster_name)
+            try:
+                if cluster_name not in existing_flavors:
+                    self._kueue.create_flavor(cluster_name)
+                    existing_flavors.add(cluster_name)
+
+                self._kueue.add_flavor_to_cluster_queue(cluster_name)
+                self._create_psapcluster(cluster_name, secret_name)
+                discovered.append(cluster_name)
+            except Exception:
+                logger.exception(
+                    "Failed onboarding discovered cluster %s from secret %s",
+                    cluster_name,
+                    secret_name,
+                )
+                continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/core/discovery.py` around lines 47 - 63, The loop that processes each
secret should not abort the entire discovery pass on a single failure; wrap the
per-cluster work (after cluster_name resolution) in a try/except so exceptions
from extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.

Collaborator

looks like a valid concern


@kpouget, glad we're aligned! The fix is straightforward — wrapping the per-cluster onboarding block in a try/except ensures that a transient API error (e.g., a flaky Kueue patch or a race on PSAPCluster creation) for one cluster doesn't silently drop all subsequent clusters from that scan cycle.

The proposed fix from the review comment is ready to apply as-is:

discovered: list[str] = []
for secret_name in secrets:
    cluster_name = extract_cluster_name(secret_name)
    if cluster_name is None:
        continue

    if cluster_name in existing_clusters:
        continue

    logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

    try:
        if cluster_name not in existing_flavors:
            self._kueue.create_flavor(cluster_name)
            existing_flavors.add(cluster_name)

        self._kueue.add_flavor_to_cluster_queue(cluster_name)
        self._create_psapcluster(cluster_name, secret_name)
        discovered.append(cluster_name)
    except Exception:
        logger.exception(
            "Failed onboarding discovered cluster %s from secret %s",
            cluster_name,
            secret_name,
        )
        continue

Note also the existing_flavors.add(cluster_name) after a successful create_flavor call — this keeps the in-memory set consistent so that if the same flavor name somehow appears again later in the same scan, it won't attempt a redundant creation.

Would you like me to plan these code changes for you?

Collaborator Author
@MML-coder MML-coder May 1, 2026

Good suggestion to incorporate, related to the transient API error case.


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment thread fournos/core/kueue.py
Comment on lines +215 to +271
def add_flavor_to_cluster_queue(self, flavor_name: str) -> dict | None:
    """Add a flavor entry to the global ClusterQueue.

    Read-modify-write: GET the CQ, check if flavor already present,
    append a new flavor entry with cluster-slot quota and zero GPU
    quotas for all existing coveredResources, PATCH back.

    Returns the patched CQ, or ``None`` if already present or CQ not found.
    """
    cq_name = settings.kueue_cluster_queue_name
    try:
        cq = self._k8s.get_cluster_custom_object(
            group=KUEUE_GROUP,
            version=KUEUE_VERSION,
            plural=KUEUE_CLUSTER_QUEUE_PLURAL,
            name=cq_name,
        )
    except client.exceptions.ApiException as exc:
        if exc.status == 404:
            logger.warning("ClusterQueue %s not found", cq_name)
            return None
        raise

    resource_groups = cq.get("spec", {}).get("resourceGroups", [])
    if not resource_groups:
        logger.warning("ClusterQueue %s has no resourceGroups", cq_name)
        return None

    rg = resource_groups[0]
    flavors = rg.get("flavors", [])

    for f in flavors:
        if f["name"] == flavor_name:
            logger.debug("Flavor %s already in ClusterQueue %s", flavor_name, cq_name)
            return None

    covered = rg.get("coveredResources", [])
    resources: list[dict] = []
    for resource_name in covered:
        if resource_name == CLUSTER_SLOT_RESOURCE:
            resources.append({"name": CLUSTER_SLOT_RESOURCE, "nominalQuota": MAX_CLUSTER_SLOTS})
        else:
            resources.append({"name": resource_name, "nominalQuota": 0})

    if not any(r["name"] == CLUSTER_SLOT_RESOURCE for r in resources):
        resources.append({"name": CLUSTER_SLOT_RESOURCE, "nominalQuota": MAX_CLUSTER_SLOTS})

    flavors.append({"name": flavor_name, "resources": resources})
    rg["flavors"] = flavors

    result = self._k8s.patch_cluster_custom_object(
        group=KUEUE_GROUP,
        version=KUEUE_VERSION,
        plural=KUEUE_CLUSTER_QUEUE_PLURAL,
        name=cq_name,
        body={"spec": {"resourceGroups": resource_groups}},
    )
@coderabbitai coderabbitai Bot May 1, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n fournos/core/kueue.py | sed -n '215,347p'

Repository: openshift-psap/fournos

Length of output: 6129


🏁 Script executed:

# Search for retry logic, locks, or conflict handling in the file
rg -n "retry|lock|conflict|patch|atomic" fournos/core/kueue.py -A 2 -B 2

Repository: openshift-psap/fournos

Length of output: 668


🏁 Script executed:

# Check the broader architecture for concurrency patterns
rg -n "auto.discovery|PSAPCluster|reconcil" --type py | head -20

Repository: openshift-psap/fournos

Length of output: 2092


🏁 Script executed:

# Look for how patch_cluster_custom_object is used elsewhere
rg -n "patch_cluster_custom_object" fournos/ --type py -B 2 -A 2

Repository: openshift-psap/fournos

Length of output: 658


🏁 Script executed:

# Verify if there's any threading/async concurrency model
fd -e py | xargs rg -l "threading|asyncio|concurrent|Thread|async" | head -10

Repository: openshift-psap/fournos

Length of output: 86


🏁 Script executed:

# Check operator.py for concurrency model
head -100 fournos/operator.py

Repository: openshift-psap/fournos

Length of output: 3419


🏁 Script executed:

# Look for how these KueueClient methods are called
rg -n "add_flavor_to_cluster_queue|update_flavor_quotas" fournos/ --type py -B 3 -A 3

Repository: openshift-psap/fournos

Length of output: 2007


🏁 Script executed:

# Check for any Kubernetes API version or patch strategy documentation in comments
rg -n "KUEUE_VERSION|patch|Strategic" fournos/core/kueue.py | head -20

Repository: openshift-psap/fournos

Length of output: 870


🏁 Script executed:

# Look for imports and client initialization to understand which Kubernetes client is being used
head -50 fournos/core/kueue.py

Repository: openshift-psap/fournos

Length of output: 1417


🏁 Script executed:

# Verify the actual patch behavior - check if there's any resourceVersion or conflict detection
rg -n "resourceVersion|conflict|409" fournos/ --type py

Repository: openshift-psap/fournos

Length of output: 1760


Protect shared ClusterQueue updates from lost writes due to concurrent patching.

Both add_flavor_to_cluster_queue and update_flavor_quotas perform read-modify-write on the same global ClusterQueue without conflict detection. The discovery loop and PSAPCluster handler run concurrently in separate threads, and since both patch the entire spec.resourceGroups array, the last write silently overwrites any concurrent changes from the other. Add optimistic retry-on-conflict using resourceVersion or switch to a narrower patch/Server-Side Apply strategy scoped to the specific flavor being modified.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/core/kueue.py` around lines 215 - 271, add_flavor_to_cluster_queue
(and similarly update_flavor_quotas) currently does a read-modify-write on the
global ClusterQueue and can lose concurrent changes; modify these functions to
perform optimistic retry-on-conflict by reading and including the object's
current metadata.resourceVersion when patching and retrying on ApiException
conflict (409) a few times, or alternatively switch to a narrower
patch/Server-Side Apply that targets only the specific flavor element instead of
replacing spec.resourceGroups; locate add_flavor_to_cluster_queue and
update_flavor_quotas in kueue.py and implement a loop that GETs the
ClusterQueue, computes the change, issues a PATCH/SSApply with resourceVersion
or field-specific patch, and retries on conflict until success or max attempts.

Collaborator

looks like a valid concern


@kpouget, agreed — here's a concrete fix using an optimistic retry loop. The key idea is to include the metadata.resourceVersion from the GET response in the PATCH body so Kubernetes will reject stale writes with a 409, then retry from scratch.

  • Extract a shared _retry_patch_cluster_queue helper that encapsulates the GET → mutate → PATCH loop with up to N retries on conflict (409)
  • Refactor add_flavor_to_cluster_queue to use it
  • Refactor update_flavor_quotas to use it

🧠 Learnings used
Learnt from: avasilevskii
Repo: openshift-psap/fournos PR: 43
File: fournos/handlers/lifecycle.py:95-95
Timestamp: 2026-04-21T12:23:23.225Z
Learning: In `fournos/handlers/lifecycle.py`, the `cluster` value used as a Kubernetes label value for `LABEL_EXCLUSIVE_CLUSTER` (defined as `"fournos.dev/exclusive-cluster"` in `fournos/core/constants.py`) is always validated against existing Kueue ResourceFlavor names before use. ResourceFlavor names are controlled by platform operators (not end users) and in practice are short identifiers like "cluster-1" or "cluster-2", so they are inherently Kubernetes label-value safe. No additional sanitization is needed.

The requested changes have been implemented and a pull request has been created: View PR
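A minimal sketch of that `_retry_patch_cluster_queue` shape, with the Kubernetes client abstracted behind callables. The helper name comes from the comment above; `get_cq`/`patch_cq`/`mutate` and the retry budget are assumptions, not the merged implementation:

```python
import copy

MAX_CONFLICT_RETRIES = 5  # assumed retry budget

class ConflictError(Exception):
    """Stand-in for client.exceptions.ApiException with status == 409."""

def retry_patch_cluster_queue(get_cq, patch_cq, mutate):
    """GET -> mutate -> PATCH with optimistic concurrency.

    `get_cq()` returns the ClusterQueue dict (including
    metadata.resourceVersion), `mutate(cq)` edits the spec in place and
    returns True if a patch is needed, and `patch_cq(body)` sends a patch
    that carries the resourceVersion, raising ConflictError on a stale
    write so we re-read and retry from scratch.
    """
    for _ in range(MAX_CONFLICT_RETRIES):
        cq = copy.deepcopy(get_cq())
        if not mutate(cq):
            return None  # nothing to do (e.g. flavor already present)
        body = {
            "metadata": {"resourceVersion": cq["metadata"]["resourceVersion"]},
            "spec": {"resourceGroups": cq["spec"]["resourceGroups"]},
        }
        try:
            return patch_cq(body)
        except ConflictError:
            continue  # another writer won the race; re-read and retry
    raise RuntimeError("ClusterQueue patch kept conflicting; giving up")
```

Including the resourceVersion in the body is what turns a silent lost-write into a retryable 409.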

Comment on lines 52 to +63
exclusive = spec["exclusive"]
lock_only = spec.get("lockOnly", False)

if lock_only and not cluster:
    patch.status["phase"] = Phase.FAILED
    patch.status["message"] = "lockOnly: true requires 'cluster' to be set"
    return

if not lock_only and not spec.get("forge"):
    patch.status["phase"] = Phase.FAILED
    patch.status["message"] = "spec.forge is required for non-lockOnly jobs"
    return

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail fast on lockOnly: true + exclusive: false instead of silently overriding.

Current flow accepts that combo but _create_lock_workload() enforces exclusive anyway, so user intent is ignored. Rejecting this input makes behavior explicit and debuggable.

Proposed fix
 def on_create(spec, name, namespace, status, patch, body):
@@
     cluster = spec.get("cluster")
     exclusive = spec["exclusive"]
     lock_only = spec.get("lockOnly", False)

+    if lock_only and not exclusive:
+        patch.status["phase"] = Phase.FAILED
+        patch.status["message"] = "lockOnly jobs require 'exclusive: true'"
+        return
+
     if lock_only and not cluster:
         patch.status["phase"] = Phase.FAILED
         patch.status["message"] = "lockOnly: true requires 'cluster' to be set"
         return

Also applies to: 95-106

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/lifecycle.py` around lines 52 - 63, Reject combinations
where lockOnly is true but exclusive is false instead of silently overriding: in
the validation blocks that read exclusive = spec["exclusive"] and lock_only =
spec.get("lockOnly", False) (and the duplicate block around lines with similar
logic), add a check that if lock_only is True and exclusive is False then set
patch.status["phase"] = Phase.FAILED and patch.status["message"] to a clear
message like "lockOnly: true requires 'exclusive' to be true" and return;
reference the same symbols (exclusive, lock_only, spec) and ensure the same
check is applied in both places (the initial validation and the later similar
block that currently enforces exclusivity in _create_lock_workload()) so user
input is rejected rather than silently overridden.

Comment on lines +216 to +220
def reconcile_psapcluster(spec, name, namespace, status, patch, body):
    _reconcile_kubeconfig(spec, name, status, patch)
    _reconcile_ttl_expiry(spec, name, status, patch)
    _reconcile_gpu_discovery(spec, name, status, patch)
    _reconcile_lock_job(spec, name, status, patch)

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Stop the reconcile pass after TTL expiry releases the lock.

_reconcile_ttl_expiry() only updates patch, but the same pass still calls _reconcile_lock_job() with the old spec/status. After _release_lock() deletes the sentinel, the self-healing branch still sees status.locked == True and the old spec.owner, so it can recreate the lock immediately and make TTL expiry ineffective.

Suggested fix
 def reconcile_psapcluster(spec, name, namespace, status, patch, body):
     _reconcile_kubeconfig(spec, name, status, patch)
-    _reconcile_ttl_expiry(spec, name, status, patch)
+    if _reconcile_ttl_expiry(spec, name, status, patch):
+        return
     _reconcile_gpu_discovery(spec, name, status, patch)
     _reconcile_lock_job(spec, name, status, patch)
 
-
 def _reconcile_ttl_expiry(spec, name, status, patch):
     if not status.get("locked"):
-        return
+        return False
 ...
     if datetime.now(timezone.utc) >= expires_at:
         prev_owner = spec.get("owner", "unknown")
         logger.info(
             "PSAPCluster %s: ownership expired (was owned by %s)", name, prev_owner
         )
         patch.spec["owner"] = ""
         _release_lock(name, patch)
+        return True
+    return False

Also applies to: 238-260, 354-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 216 - 220,
reconcile_psapcluster currently calls _reconcile_ttl_expiry(...) then continues
to call _reconcile_lock_job(...) which can immediately recreate a lock because
patch was mutated but spec/status still reflect the old locked state; modify
_reconcile_ttl_expiry to return a boolean (e.g., released=True if it called
_release_lock()) and update reconcile_psapcluster to check that return and stop
the reconcile pass (early return) when TTL expiry released the lock so
_reconcile_lock_job is not invoked with stale spec/status; apply the same
pattern to the other reconcile sequences mentioned (the blocks around lines
238-260 and 354-383) so each _reconcile_ttl_expiry caller aborts further
reconcile steps when it reports a release.

Comment on lines +303 to +310
patch.status.setdefault("conditions", []).append(
    _make_condition(
        COND_GPU_DISCOVERED,
        "False",
        "DiscoveryFailed",
        str(exc),
    )
)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Upsert conditions instead of appending duplicates on every timer run.

Both the success and failure paths append a fresh COND_GPU_DISCOVERED entry every reconciliation. On a long-lived cluster this grows status.conditions without bound and makes consumers pick between duplicate condition records. Replace the existing condition by type instead of always appending.

Also applies to: 338-340

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 303 - 310, The code currently
always appends a new condition into patch.status.setdefault("conditions", [])
(see the append call with _make_condition and COND_GPU_DISCOVERED), causing
duplicate COND_GPU_DISCOVERED entries over time; change this to "upsert" by
finding and removing any existing condition whose "type" equals
COND_GPU_DISCOVERED from patch.status["conditions"] before adding the new
_make_condition(...) entry. Apply the same replacement in both the failure path
(the shown append using str(exc)) and the corresponding success path where
COND_GPU_DISCOVERED is appended, using the symbols patch.status,
COND_GPU_DISCOVERED, and _make_condition to locate and modify the code.
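One way to sketch the upsert: drop any same-type entry before appending. The `make_condition` field layout here is an assumption based on the standard Kubernetes condition shape, not the PR's `_make_condition`:

```python
from datetime import datetime, timezone

COND_GPU_DISCOVERED = "GpuDiscovered"  # assumed constant value

def make_condition(cond_type, status, reason, message):
    # Simplified stand-in for the handler's _make_condition helper.
    return {
        "type": cond_type,
        "status": status,
        "reason": reason,
        "message": message,
        "lastTransitionTime": datetime.now(timezone.utc).isoformat(),
    }

def upsert_condition(conditions, condition):
    """Replace any existing condition of the same type, then append."""
    conditions[:] = [c for c in conditions if c.get("type") != condition["type"]]
    conditions.append(condition)
```

Each reconcile pass then leaves at most one `GpuDiscovered` entry in `status.conditions`, so the list stays bounded.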

Comment on lines +342 to +347
prev_gpus = hardware.get("gpus", [])
new_resources = [(g.short_name, g.count) for g in result.gpus]
prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
if new_resources != prev_resources and new_resources:
    try:
        ctx.kueue.update_flavor_quotas(name, new_resources)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clear Kueue quotas when discovery drops to zero GPUs.

The and new_resources guard skips update_flavor_quotas() when a cluster previously had GPUs but discovery now returns none. That leaves stale GPU quota in Kueue and can schedule workloads onto capacity that no longer exists.

Suggested fix
-    if new_resources != prev_resources and new_resources:
+    if new_resources != prev_resources:
         try:
             ctx.kueue.update_flavor_quotas(name, new_resources)
         except Exception as exc:
             logger.warning(
                 "PSAPCluster %s: failed to update flavor quotas: %s", name, exc
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     prev_gpus = hardware.get("gpus", [])
     new_resources = [(g.short_name, g.count) for g in result.gpus]
     prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
-    if new_resources != prev_resources and new_resources:
+    if new_resources != prev_resources:
         try:
             ctx.kueue.update_flavor_quotas(name, new_resources)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@fournos/handlers/psapcluster.py` around lines 342 - 347, The guard `and
new_resources` prevents clearing Kueue quotas when discovery reports zero GPUs;
change the logic so that `ctx.kueue.update_flavor_quotas(name, new_resources)`
is called whenever `new_resources != prev_resources` (even if `new_resources` is
empty) so stale GPU quotas are removed — update the block around the variables
`prev_gpus`, `new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.

Comment thread README.md
Comment on lines +323 to +325
2. Add a ResourceFlavor and quota for the cluster in
`config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](#onboarding-a-new-cluster)).


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

PSAPCluster onboarding instructions are internally inconsistent.

This section says users must manually add ResourceFlavor/quota, but the next section states the controller updates ClusterQueue quotas automatically. Please clarify whether step 2 is required, optional fallback, or legacy flow.

Proposed wording adjustment
-2. Add a ResourceFlavor and quota for the cluster in
-   `config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](`#onboarding-a-new-cluster`)).
+2. (Optional/manual fallback) If auto-discovery automation is disabled in your environment,
+   add a ResourceFlavor and quota for the cluster in
+   `config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](`#onboarding-a-new-cluster`)).

Also applies to: 341-345

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 323 - 325, The README's PSAPCluster onboarding is
inconsistent about manual quota setup: update the wording around the step that
says "Add a ResourceFlavor and quota for the cluster in
`config/kueue-cluster-config.yaml`" (and the repeated text at the later block)
to clearly state whether this manual step is required, optional fallback, or
legacy; explicitly mention that the controller (referenced as "controller
updates ClusterQueue quotas automatically" in the "Onboarding a new cluster"
section) will normally create/update ClusterQueue quotas for onboarded clusters
and keep the manual `config/kueue-cluster-config.yaml` step as an
optional/manual fallback for edge cases or legacy workflows, and add a short
cross-reference to the "Onboarding a new cluster" section and a one-line
example/criteria when the manual step is needed.

Comment thread fournos/core/discovery.py
Comment on lines +52 to +53
if cluster_name in existing_clusters:
    continue
Collaborator

we need to make sure that the objects (flavor, etc.) are there, even if `cluster_name in existing_clusters`.
they need to

  • exist
  • have the right content
    (I think they need to be created every time)
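That suggests an ensure-style pass instead of a skip: always converge each object to the desired content, whether it is missing or has drifted. A sketch under assumed client method names (`get_flavor`/`create_flavor`/`update_flavor` are hypothetical, not the PR's KueueClient API):

```python
def ensure_flavor(kueue, cluster_name, desired_spec):
    """Create-or-correct: idempotent reconcile of one ResourceFlavor.

    `kueue.get_flavor` returning None means the flavor is absent.
    """
    current = kueue.get_flavor(cluster_name)
    if current is None:
        kueue.create_flavor(cluster_name)
    elif current.get("spec") != desired_spec:
        kueue.update_flavor(cluster_name, desired_spec)
    # else: already correct, nothing to do
```

Running this on every scan is cheap (one GET per flavor) and heals both deleted and hand-edited objects.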

Comment on lines +16 to +25
GPU_MODEL_MAP: dict[str, str] = {
    "NVIDIA-A100-SXM4-80GB": "a100",
    "NVIDIA-A100-SXM4-40GB": "a100",
    "NVIDIA-A100-PCIE-80GB": "a100",
    "NVIDIA-A100-PCIE-40GB": "a100",
    "NVIDIA-H100-SXM5-80GB": "h100",
    "NVIDIA-H100-PCIE-80GB": "h100",
    "NVIDIA-H200-SXM-141GB": "h200",
    "AMD-Instinct-MI300X": "mi300x",
}
Collaborator
@kpouget kpouget May 1, 2026

these can be in a user-configurable ConfigMap
(can be done later. If we don't do it in this PR, we need to file a ticket so that we don't forget it)

Comment on lines +84 to +85
elif secret.string_data and "kubeconfig" in secret.string_data:
    raw = secret.string_data["kubeconfig"]
Collaborator

when would that happen ??

Collaborator Author

Good point. I think it would never happen when reading from the real k8s API, but since we have onboarded cluster-1, cluster-2 .... type fake clusters for convenience, Claude added some sort of defensive code. I was hitting the issue primarily for those fake clusters. :)

Collaborator

What is stringData?

Normally, the data field in a Kubernetes Secret requires values to be base64-encoded. While base64 isn't encryption, it’s a hassle to encode strings manually every time you write a YAML file.

The stringData field allows you to provide secret data as plain text.
How it works:

1. You write your Secret YAML using `stringData`.
2. You `kubectl apply` the file.
3. Kubernetes automatically encodes the values to base64, moves them to the `data` field, and removes the `stringData` field from the stored object.

So this path will never happen in real runs of the controller. The test `test_stringdata_kubeconfig` generates it, but I don't think that makes sense.
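For illustration, a Secret written with `stringData` (names here are examples) and what the API server actually stores after apply:

```yaml
# Authored manifest (example names; not from this repo):
apiVersion: v1
kind: Secret
metadata:
  name: cluster-1-kubeconfig
  namespace: psap-secrets
type: Opaque
stringData:
  kubeconfig: |
    apiVersion: v1
    kind: Config
    # ...

# After `kubectl apply`, the stored object instead carries:
#   data:
#     kubeconfig: <base64 of the text above>
# and stringData is gone, so a controller reading the live object
# only ever sees the `data` field.
```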

Collaborator
@kpouget kpouget left a comment

@MML-coder , could you turn this into a controller fully independent from FOURNOS?

you create a secondary top directory, and only import the relevant FOURNOS bits (constants, I assume)

we'll also need a name for the controller,
and maybe use another name than "PSAP".
Maybe FournosCluster, or something alike, if we find another nice name

Comment on lines +1 to +2
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
Collaborator

we'll rediscuss it, but I think we'll want to have

spec:
   resources: 
      h200: 8

with spec.resources either user provided or dynamically looked up if spec.resources: {}

Collaborator Author

Are you thinking the user could specify how many GPUs they want to use via spec.resources: 2 or 8, so the job can use the available GPUs?

Collaborator

there are two things I have in mind:

  1. we want the controller to do this only once (hence the spec field: like a dynamic default value, not a truly reconciled field)
  2. in the future, we want to allow the customization of the resources list with SW properties (TASK: refine the future behavior of the exclusive/cluster/hardware flags #70)

Comment thread fournos/core/discovery.py
Comment on lines +34 to +47
def scan(self) -> list[str]:
    """Scan for new clusters and onboard them.

    Returns the list of newly discovered cluster names.
    """
    secrets = list_kubeconfig_secrets(self._k8s_core)
    if not secrets:
        return []

    existing_flavors = self._kueue.list_flavors()
    existing_clusters = self._list_existing_psapclusters()

    discovered: list[str] = []
    for secret_name in secrets:
Collaborator

I think the pattern is wrong here for a controller

the way I think it can work in a safer way:

watch the creation / update of secrets with -lfournos.dev/cluster-kubeconfig=true

  • when such a secret is touched, make sure there's a PSAPCluster object matching it
  • put the owner field in the secret (TBC) to the PSAPCluster object

watch the creation / update of PSAPCluster objects

  • when such an object is touched, make sure that its resourceflavor exists
  • when the kubeconfig secret is available, if spec.hardware is empty, look it up in the cluster
  • when the spec.hardware is filled, update the ClusterQueue
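The two watch-driven reconciles above can be sketched as plain functions; in the real controller they would be registered via kopf decorators (e.g. `@kopf.on.create("", "v1", "secrets", labels={"fournos.dev/cluster-kubeconfig": "true"})`). All `api.*` helper names and the secret naming convention are hypothetical stand-ins:

```python
def cluster_name_from_secret(secret_name, prefix="kubeconfig-"):
    # Assumed naming convention, for illustration only.
    return secret_name.removeprefix(prefix)

def on_kubeconfig_secret(api, secret_name, namespace):
    """Secret touched -> make sure a matching PSAPCluster exists."""
    cluster = cluster_name_from_secret(secret_name)
    if api.get_psapcluster(cluster) is None:
        api.create_psapcluster(cluster, secret_name, namespace)

def on_psapcluster(api, name, spec):
    """PSAPCluster touched -> flavor exists, hardware filled, quotas synced."""
    api.ensure_resource_flavor(name)
    if not spec.get("hardware"):
        # One-time dynamic default: look up hardware in the target cluster.
        spec["hardware"] = api.discover_hardware(name)
    api.update_cluster_queue(name, spec["hardware"])
```

Compared with the periodic scan, every reconcile is triggered by an event on the object it owns, so there is no cross-object loop to abort and each cluster fails or succeeds on its own.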


Labels

needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • FEATURE: ability to discover the target cluster GPUs
  • FEATURE: define a FournosCluster resource to track the state of the clusters

2 participants