feat: PSAPCluster controller with sentinel locking, auto-discovery, and GPU discovery #74
Conversation
…ocking

Introduces a PSAPCluster custom resource and kopf controller that unifies cluster definition, discovers GPU hardware dynamically, manages per-cluster Kueue ClusterQueues, and enables ownership locking via stopPolicy.

- PSAPCluster CRD (fournos.dev/v1) with spec fields: kubeconfigSecret, owner, ttl, evict, gpuDiscoveryInterval
- GPU discovery client that connects to target clusters via kubeconfig secrets, lists nodes, and aggregates GPU counts by vendor/model
- Per-cluster ClusterQueues in a shared cohort with self-healing
- Ownership locking with Hold/HoldAndDrain stopPolicy and TTL auto-expiry
- Timer-based reconciliation: kubeconfig validation, TTL expiry, GPU discovery with backoff, CQ/RF self-healing
- 82 unit tests covering all modules

Closes #37, closes #51
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (none). The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
📝 Walkthrough

This PR introduces a PSAPCluster custom resource for cluster state visibility, implements GPU discovery from target clusters via kubeconfig secrets, adds cluster auto-discovery from kubeconfig secrets, enables cluster-level exclusive locking via sentinel jobs, and provides Kueue flavor/quota management for multi-cluster scheduling.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Operator as PSAPCluster Operator
participant K8s as Kubernetes API
participant TargetCluster as Target Cluster API
participant Kueue as Kueue API
Operator->>K8s: Watch PSAPCluster events
rect rgba(0, 100, 150, 0.5)
Note over Operator,K8s: On PSAPCluster Create
K8s->>Operator: Trigger create event
Operator->>K8s: Validate kubeconfig secret
Operator->>K8s: Initialize status, conditions
Operator->>K8s: Create sentinel lock job if spec.owner set
K8s->>Operator: Acknowledge
end
rect rgba(100, 50, 150, 0.5)
Note over Operator,TargetCluster: GPU Discovery Loop
Operator->>Operator: Timer triggers reconciliation
Operator->>K8s: Read kubeconfig secret
K8s->>Operator: Return secret data
Operator->>TargetCluster: Connect & list nodes
TargetCluster->>Operator: Return GPU allocatable resources
Operator->>Operator: Aggregate by vendor/model
Operator->>K8s: Update PSAPCluster status.hardware
end
rect rgba(150, 100, 50, 0.5)
Note over Operator,Kueue: Quota Update on Inventory Change
Operator->>K8s: Detect GPU count change
Operator->>Kueue: Ensure flavor exists
Kueue->>Operator: Confirm flavor
Operator->>Kueue: Update flavor quotas
Kueue->>Operator: Acknowledge patch
end
rect rgba(100, 150, 50, 0.5)
Note over Operator,K8s: Owner Change Triggers Lock
K8s->>Operator: Trigger spec.owner update
Operator->>K8s: Create/delete sentinel job
Operator->>K8s: Update locked, lockExpiresAt
K8s->>Operator: Acknowledge
end
```

```mermaid
sequenceDiagram
participant BackgroundLoop as Discovery Loop
participant K8s as Kubernetes API
participant ClusterDiscovery as ClusterDiscovery
participant KueueClient as KueueClient
BackgroundLoop->>ClusterDiscovery: Trigger scan_clusters()
rect rgba(100, 150, 100, 0.5)
Note over ClusterDiscovery,K8s: Auto-Discover Clusters
ClusterDiscovery->>K8s: List kubeconfig secrets
K8s->>ClusterDiscovery: Return secret names
ClusterDiscovery->>ClusterDiscovery: Extract cluster names via regex
ClusterDiscovery->>K8s: Check for existing PSAPCluster CRs
K8s->>ClusterDiscovery: Return existing clusters
ClusterDiscovery->>ClusterDiscovery: Filter new clusters
loop For each new cluster
ClusterDiscovery->>KueueClient: Ensure flavor exists
KueueClient->>K8s: Create ResourceFlavor
ClusterDiscovery->>KueueClient: Add flavor to cluster queue
KueueClient->>K8s: Patch ClusterQueue resourceGroups
ClusterDiscovery->>K8s: Create PSAPCluster CR
K8s->>ClusterDiscovery: Confirm CR created
end
end
    BackgroundLoop->>BackgroundLoop: Log discovered clusters
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🧹 Nitpick comments (1)
fournos/handlers/psapcluster.py (1)
52-64: 💤 Low value. Consider exposing a public method on `ctx.registry` instead of accessing `_k8s` directly.

Accessing the private `_k8s` attribute couples this handler to the internal implementation of `ClusterRegistry`. Consider adding a public method like `ctx.registry.secret_exists(name, namespace)` or using a shared CoreV1Api instance.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fournos/handlers/psapcluster.py` around lines 52 - 64, Update _check_kubeconfig to stop accessing the private ctx.registry._k8s and call a new public API on the registry (e.g., ctx.registry.secret_exists(secret_name, namespace)) that encapsulates the read_namespaced_secret logic; add the corresponding method on ClusterRegistry (secret_exists(name, namespace)) which returns a boolean or raises a controlled exception for non-404 errors, then have _check_kubeconfig map True->"Valid", False->"Missing" and handle other errors as "Invalid" while logging the exception via logger.warning as before.
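A minimal sketch of what such a wrapper could look like, assuming `ClusterRegistry` keeps a `CoreV1Api` client in `_k8s`; the method name `secret_exists` follows the reviewer's suggestion and is not existing API:

```python
from kubernetes import client


class ClusterRegistry:
    def __init__(self, core_v1: client.CoreV1Api) -> None:
        self._k8s = core_v1

    def secret_exists(self, name: str, namespace: str) -> bool:
        """Return True if the Secret exists, False on 404; re-raise other API errors."""
        try:
            self._k8s.read_namespaced_secret(name=name, namespace=namespace)
            return True
        except client.exceptions.ApiException as exc:
            if exc.status == 404:
                return False
            raise
```

The handler's `_check_kubeconfig` could then map `True` to `"Valid"`, `False` to `"Missing"`, and any raised error to `"Invalid"`, as the prompt above describes.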
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@fournos/core/gpu_discovery.py`:
- Around line 125-139: The timeout on the Kubernetes call is ineffective because
setting api_client.configuration.request_timeout doesn't apply to individual
calls; update the code that builds the client and calls list_node so that you
remove the assignment to api_client.configuration.request_timeout and instead
call target_core.list_node(...) passing
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec); keep the existing try/except that wraps the
call and raise GPUDiscoveryError on exception, and ensure api_client.close()
remains in the finally block.
In `@fournos/core/psapcluster.py`:
- Around line 181-185: The stopPolicy patch handler currently catches
ApiException but ApiException is not imported; update the exception handling in
that handler to catch client.exceptions.ApiException instead (i.e., change the
except block that reads "except ApiException as exc:" to use
client.exceptions.ApiException) so it matches other handlers in this module and
avoids a NameError; ensure the logger.warning and raise behavior remain
unchanged and reference cq_name as before.
In `@README.md`:
- Around line 307-312: The fenced code block showing the table lacks a language
identifier (MD040); update the backticks surrounding the block (the triple-fence
that begins the block containing "NAME OWNER GPUS KUBECONFIG LOCKED AGE") to
include a language tag such as "text" (e.g., change ``` to ```text) so
markdownlint no longer flags the code fence.
---
Nitpick comments:
In `@fournos/handlers/psapcluster.py`:
- Around line 52-64: Update _check_kubeconfig to stop accessing the private
ctx.registry._k8s and call a new public API on the registry (e.g.,
ctx.registry.secret_exists(secret_name, namespace)) that encapsulates the
read_namespaced_secret logic; add the corresponding method on ClusterRegistry
(secret_exists(name, namespace)) which returns a boolean or raises a controlled
exception for non-404 errors, then have _check_kubeconfig map True->"Valid",
False->"Missing" and handle other errors as "Invalid" while logging the
exception via logger.warning as before.
📒 Files selected for processing (15)
- README.md
- dev/mock-psapcluster.yaml
- fournos/core/constants.py
- fournos/core/gpu_discovery.py
- fournos/core/psapcluster.py
- fournos/handlers/__init__.py
- fournos/handlers/psapcluster.py
- fournos/operator.py
- fournos/settings.py
- fournos/state.py
- manifests/psapcluster-crd.yaml
- manifests/rbac/role_fournos.yaml
- tests/unit/test_gpu_discovery.py
- tests/unit/test_psapcluster_handlers.py
- tests/unit/test_psapcluster_manager.py
```python
api_client = k8s_config.new_client_from_config_dict(kubeconfig_dict)
api_client.configuration.request_timeout = (
    settings.gpu_discovery_timeout_sec,
    settings.gpu_discovery_timeout_sec,
)

try:
    target_core = client.CoreV1Api(api_client)
    nodes = target_core.list_node()
except Exception as exc:
    raise GPUDiscoveryError(
        f"Failed to connect to cluster {cluster_name!r}: {exc}"
    ) from exc
finally:
    api_client.close()
```
🧩 Analysis chain
🌐 Web query:
kubernetes python client request_timeout configuration tuple format
💡 Result:
In the Kubernetes Python client, the `_request_timeout` parameter configures client-side request timeouts for API calls. It is passed as a keyword argument (e.g., `_request_timeout=60`) to API methods like `CoreV1Api.list_namespaced_pod(...)` or `watch.Watch.stream(...)`. The parameter accepts:

- A single number (int/float): total timeout for the request in seconds.
- A tuple of length 2: (connect_timeout, read_timeout) in seconds, where the first value sets the connection timeout and the second sets the read timeout.

This is handled by the underlying `ApiClient.call_api` and `RESTClient.request` methods, which pass it to urllib3's HTTP request machinery. Example:

```python
from kubernetes import client, watch

v1 = client.CoreV1Api()
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace="default", _request_timeout=(5.0, 30.0)):
    ...
```

Note: There is no global default timeout in Configuration; it must be set per-call. For exec streams, tuple support may have bugs in some versions (e.g., issue #2292). Official documentation: https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md explicitly states it accepts "int/long" or "a tuple (with a length of 2)". Client docs confirm: "If one number provided, it will be total request timeout. It can also be a pair (tuple) of (connection, read) timeouts."
Citations:
- 1: https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md
- 2: http://jashandeep-sohik8s-python.readthedocs.io/en/latest/kubernetes.client.html
- 3: https://github.com/kubernetes-client/python/blob/master/kubernetes/client/api_client.py
- 4: https://github.com/kubernetes-client/python/blob/master/kubernetes/client/rest.py
🌐 Web query:
kubernetes python client Configuration request_timeout global default
💡 Result:
In the Kubernetes Python client library, there is no global default request_timeout setting in the Configuration class. The _request_timeout parameter must be specified explicitly on individual API calls (e.g., list_namespaced_pod(_request_timeout=60)) or watch streams. If not set, the default is None, meaning no timeout (infinite wait, matching urllib3/socket defaults). Official documentation confirms: "In case, if the _request_timeout value is not set, then the default value is None & socket will have no timeout." (examples/watch/timeout-settings.md). Requests for global configuration via Configuration (e.g., issues #1961, #2227) remain open or unresolved as of 2026, with no such field in configuration.py source code. Per-call _request_timeout accepts int (total timeout) or tuple (connect, read).
Citations:
- 1: https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md
- 2: Configure default timeout for all API calls kubernetes-client/python#1961
- 3: https://github.com/kubernetes-client/python/blob/master/kubernetes/client/configuration.py
- 4: feat: global request timeout setting via config kubernetes-client/python#1713
- 5: Request timeout in APIs kubernetes-client/python#2227
- 6: https://github.com/kubernetes-client/python/blob/master/kubernetes/client/api_client.py
- 7: https://github.com/kubernetes-client/python/blob/master/CHANGELOG.md
- 8: https://github.com/kubernetes-client/python/blob/v27.2.0/CHANGELOG.md
Timeout configuration is ineffective; pass _request_timeout to the API call instead.
Setting api_client.configuration.request_timeout does not apply timeouts to Kubernetes API calls. The Python client requires the _request_timeout parameter to be passed explicitly to individual API methods. Change target_core.list_node() to target_core.list_node(_request_timeout=(settings.gpu_discovery_timeout_sec, settings.gpu_discovery_timeout_sec)), and remove the Configuration assignment. Without this fix, the timeout has no effect and the call will hang indefinitely on an unreachable cluster.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/core/gpu_discovery.py` around lines 125 - 139, The timeout on the
Kubernetes call is ineffective because setting
api_client.configuration.request_timeout doesn't apply to individual calls;
update the code that builds the client and calls list_node so that you remove
the assignment to api_client.configuration.request_timeout and instead call
target_core.list_node(...) passing
_request_timeout=(settings.gpu_discovery_timeout_sec,
settings.gpu_discovery_timeout_sec); keep the existing try/except that wraps the
call and raise GPUDiscoveryError on exception, and ensure api_client.close()
remains in the finally block.
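A sketch of the corrected call, following the review finding above; it reuses the names from the snippet under review (`kubeconfig_dict`, `settings`, `cluster_name`, `GPUDiscoveryError`) and is not a verbatim patch from the PR:

```python
from kubernetes import client
from kubernetes import config as k8s_config

api_client = k8s_config.new_client_from_config_dict(kubeconfig_dict)
try:
    target_core = client.CoreV1Api(api_client)
    # _request_timeout must be passed per call; (connect, read) timeouts in seconds.
    nodes = target_core.list_node(
        _request_timeout=(
            settings.gpu_discovery_timeout_sec,
            settings.gpu_discovery_timeout_sec,
        )
    )
except Exception as exc:
    raise GPUDiscoveryError(
        f"Failed to connect to cluster {cluster_name!r}: {exc}"
    ) from exc
finally:
    api_client.close()
```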
Replace per-cluster ClusterQueue stopPolicy locking with a sentinel FournosJob approach. A lockOnly sentinel holds all 100 cluster-slots via Kueue, blocking competing workloads without requiring per-cluster CQ infrastructure. Add an auto-discovery daemon that scans kubeconfig secrets every 60s and auto-creates a ResourceFlavor, ClusterQueue flavor entry, and PSAPCluster CR for each new cluster found.

Key changes:
- Add lockOnly field to FournosJob CRD, skip Forge for lockOnly jobs
- Rewrite PSAPCluster handlers for sentinel create/delete lifecycle
- Remove PSAPClusterManager (fournos/core/psapcluster.py)
- Add create_flavor, add_flavor_to_cluster_queue, update_flavor_quotas to KueueClient
- Add ClusterDiscovery with extract_cluster_name, list_kubeconfig_secrets
- Wire discovery daemon thread in operator startup
- Fix RBAC: add create/delete for fournosjobs, create for psapclusters
- Add end-to-end locking test walkthrough to README
- 28 new discovery unit tests, rewritten PSAPCluster handler tests
Actionable comments posted: 7
🧹 Nitpick comments (1)
fournos/operator.py (1)
169-177: ⚡ Quick win. Run discovery once before the first interval sleep.
Current loop delays initial discovery until the first timer elapses. Doing an immediate first pass reduces startup lag for newly added kubeconfig secrets.
Proposed fix
```diff
 def _discovery_loop():
     interval = settings.cluster_discovery_interval_sec
     while True:
-        time.sleep(interval)
         try:
             handlers.scan_clusters()
         except Exception:
             logger.exception("Cluster discovery scan failed")
+        time.sleep(interval)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@fournos/operator.py` around lines 169 - 177, The discovery loop in _discovery_loop currently waits for settings.cluster_discovery_interval_sec before the first scan; change the loop to perform an immediate scan on startup by calling handlers.scan_clusters() (inside the existing try/except that logs via logger.exception) before sleeping, i.e. move the sleep to occur after the scan so the sequence is: try: handlers.scan_clusters() except Exception: logger.exception(...) then time.sleep(settings.cluster_discovery_interval_sec), and repeat.
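For context, a minimal sketch of the reordered loop together with the daemon-thread wiring the PR description mentions; `settings`, `handlers`, and `logger` follow the snippets above, while `start_discovery_daemon` is an illustrative name:

```python
import threading
import time


def _discovery_loop() -> None:
    interval = settings.cluster_discovery_interval_sec
    while True:
        try:
            handlers.scan_clusters()  # scan immediately on startup, then on every interval
        except Exception:
            logger.exception("Cluster discovery scan failed")
        time.sleep(interval)


def start_discovery_daemon() -> threading.Thread:
    thread = threading.Thread(target=_discovery_loop, name="cluster-discovery", daemon=True)
    thread.start()
    return thread
```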
📒 Files selected for processing (19)
- README.md
- config/forge/samples/job-full.yaml
- fournos/core/clusters.py
- fournos/core/constants.py
- fournos/core/discovery.py
- fournos/core/kueue.py
- fournos/handlers/__init__.py
- fournos/handlers/discovery.py
- fournos/handlers/execution.py
- fournos/handlers/lifecycle.py
- fournos/handlers/psapcluster.py
- fournos/operator.py
- fournos/settings.py
- fournos/state.py
- manifests/crd.yaml
- manifests/psapcluster-crd.yaml
- manifests/rbac/role_fournos.yaml
- tests/unit/test_cluster_discovery.py
- tests/unit/test_psapcluster_handlers.py
✅ Files skipped from review due to trivial changes (3)
- config/forge/samples/job-full.yaml
- fournos/handlers/__init__.py
- manifests/psapcluster-crd.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- fournos/state.py
- fournos/core/constants.py
- manifests/rbac/role_fournos.yaml
- tests/unit/test_psapcluster_handlers.py
```python
for secret_name in secrets:
    cluster_name = extract_cluster_name(secret_name)
    if cluster_name is None:
        continue

    if cluster_name in existing_clusters:
        continue

    logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)

    if cluster_name not in existing_flavors:
        self._kueue.create_flavor(cluster_name)

    self._kueue.add_flavor_to_cluster_queue(cluster_name)
    self._create_psapcluster(cluster_name, secret_name)
    discovered.append(cluster_name)
```
A single onboarding failure aborts the whole discovery pass.
If one cluster fails during flavor/CQ/CR creation, the loop exits and later valid clusters are skipped for that cycle. Isolate errors per cluster and continue.
Proposed fix
```diff
 discovered: list[str] = []
 for secret_name in secrets:
     cluster_name = extract_cluster_name(secret_name)
     if cluster_name is None:
         continue
     if cluster_name in existing_clusters:
         continue
     logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)
-    if cluster_name not in existing_flavors:
-        self._kueue.create_flavor(cluster_name)
-
-    self._kueue.add_flavor_to_cluster_queue(cluster_name)
-    self._create_psapcluster(cluster_name, secret_name)
-    discovered.append(cluster_name)
+    try:
+        if cluster_name not in existing_flavors:
+            self._kueue.create_flavor(cluster_name)
+            existing_flavors.add(cluster_name)
+
+        self._kueue.add_flavor_to_cluster_queue(cluster_name)
+        self._create_psapcluster(cluster_name, secret_name)
+        discovered.append(cluster_name)
+    except Exception:
+        logger.exception(
+            "Failed onboarding discovered cluster %s from secret %s",
+            cluster_name,
+            secret_name,
+        )
+        continue
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/core/discovery.py` around lines 47 - 63, The loop that processes each
secret should not abort the entire discovery pass on a single failure; wrap the
per-cluster work (after cluster_name resolution) in a try/except so exceptions
from extract_cluster_name, self._kueue.create_flavor,
self._kueue.add_flavor_to_cluster_queue, or self._create_psapcluster are caught,
logged with contextual info (cluster_name and secret_name) and then continue to
the next secret; ensure you only append to discovered on success so failed
clusters are not marked discovered and rethrow nothing so the outer discovery
loop proceeds.
looks like a valid concern
@kpouget, glad we're aligned! The fix is straightforward — wrapping the per-cluster onboarding block in a try/except ensures that a transient API error (e.g., a flaky Kueue patch or a race on PSAPCluster creation) for one cluster doesn't silently drop all subsequent clusters from that scan cycle.
The proposed fix from the review comment is ready to apply as-is:
```python
discovered: list[str] = []
for secret_name in secrets:
    cluster_name = extract_cluster_name(secret_name)
    if cluster_name is None:
        continue
    if cluster_name in existing_clusters:
        continue
    logger.info("Discovered new cluster %s from secret %s", cluster_name, secret_name)
    try:
        if cluster_name not in existing_flavors:
            self._kueue.create_flavor(cluster_name)
            existing_flavors.add(cluster_name)

        self._kueue.add_flavor_to_cluster_queue(cluster_name)
        self._create_psapcluster(cluster_name, secret_name)
        discovered.append(cluster_name)
    except Exception:
        logger.exception(
            "Failed onboarding discovered cluster %s from secret %s",
            cluster_name,
            secret_name,
        )
        continue
```

Note also the `existing_flavors.add(cluster_name)` after a successful `create_flavor` call — this keeps the in-memory set consistent so that if the same flavor name somehow appears again later in the same scan, it won't attempt a redundant creation.
Would you like me to plan these code changes for you?
Good suggestion to incorporate, since it covers the transient API error case.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```python
def add_flavor_to_cluster_queue(self, flavor_name: str) -> dict | None:
    """Add a flavor entry to the global ClusterQueue.

    Read-modify-write: GET the CQ, check if flavor already present,
    append a new flavor entry with cluster-slot quota and zero GPU
    quotas for all existing coveredResources, PATCH back.

    Returns the patched CQ, or ``None`` if already present or CQ not found.
    """
    cq_name = settings.kueue_cluster_queue_name
    try:
        cq = self._k8s.get_cluster_custom_object(
            group=KUEUE_GROUP,
            version=KUEUE_VERSION,
            plural=KUEUE_CLUSTER_QUEUE_PLURAL,
            name=cq_name,
        )
    except client.exceptions.ApiException as exc:
        if exc.status == 404:
            logger.warning("ClusterQueue %s not found", cq_name)
            return None
        raise

    resource_groups = cq.get("spec", {}).get("resourceGroups", [])
    if not resource_groups:
        logger.warning("ClusterQueue %s has no resourceGroups", cq_name)
        return None

    rg = resource_groups[0]
    flavors = rg.get("flavors", [])

    for f in flavors:
        if f["name"] == flavor_name:
            logger.debug("Flavor %s already in ClusterQueue %s", flavor_name, cq_name)
            return None

    covered = rg.get("coveredResources", [])
    resources: list[dict] = []
    for resource_name in covered:
        if resource_name == CLUSTER_SLOT_RESOURCE:
            resources.append({"name": CLUSTER_SLOT_RESOURCE, "nominalQuota": MAX_CLUSTER_SLOTS})
        else:
            resources.append({"name": resource_name, "nominalQuota": 0})

    if not any(r["name"] == CLUSTER_SLOT_RESOURCE for r in resources):
        resources.append({"name": CLUSTER_SLOT_RESOURCE, "nominalQuota": MAX_CLUSTER_SLOTS})

    flavors.append({"name": flavor_name, "resources": resources})
    rg["flavors"] = flavors

    result = self._k8s.patch_cluster_custom_object(
        group=KUEUE_GROUP,
        version=KUEUE_VERSION,
        plural=KUEUE_CLUSTER_QUEUE_PLURAL,
        name=cq_name,
        body={"spec": {"resourceGroups": resource_groups}},
    )
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n fournos/core/kueue.py | sed -n '215,347p'
```

Repository: openshift-psap/fournos
Length of output: 6129

🏁 Script executed:

```shell
# Search for retry logic, locks, or conflict handling in the file
rg -n "retry|lock|conflict|patch|atomic" fournos/core/kueue.py -A 2 -B 2
```

Repository: openshift-psap/fournos
Length of output: 668

🏁 Script executed:

```shell
# Check the broader architecture for concurrency patterns
rg -n "auto.discovery|PSAPCluster|reconcil" --type py | head -20
```

Repository: openshift-psap/fournos
Length of output: 2092

🏁 Script executed:

```shell
# Look for how patch_cluster_custom_object is used elsewhere
rg -n "patch_cluster_custom_object" fournos/ --type py -B 2 -A 2
```

Repository: openshift-psap/fournos
Length of output: 658

🏁 Script executed:

```shell
# Verify if there's any threading/async concurrency model
fd -e py | xargs rg -l "threading|asyncio|concurrent|Thread|async" | head -10
```

Repository: openshift-psap/fournos
Length of output: 86

🏁 Script executed:

```shell
# Check operator.py for concurrency model
head -100 fournos/operator.py
```

Repository: openshift-psap/fournos
Length of output: 3419

🏁 Script executed:

```shell
# Look for how these KueueClient methods are called
rg -n "add_flavor_to_cluster_queue|update_flavor_quotas" fournos/ --type py -B 3 -A 3
```

Repository: openshift-psap/fournos
Length of output: 2007

🏁 Script executed:

```shell
# Check for any Kubernetes API version or patch strategy documentation in comments
rg -n "KUEUE_VERSION|patch|Strategic" fournos/core/kueue.py | head -20
```

Repository: openshift-psap/fournos
Length of output: 870

🏁 Script executed:

```shell
# Look for imports and client initialization to understand which Kubernetes client is being used
head -50 fournos/core/kueue.py
```

Repository: openshift-psap/fournos
Length of output: 1417

🏁 Script executed:

```shell
# Verify the actual patch behavior - check if there's any resourceVersion or conflict detection
rg -n "resourceVersion|conflict|409" fournos/ --type py
```

Repository: openshift-psap/fournos
Length of output: 1760
Protect shared ClusterQueue updates from lost writes due to concurrent patching.
Both add_flavor_to_cluster_queue and update_flavor_quotas perform read-modify-write on the same global ClusterQueue without conflict detection. The discovery loop and PSAPCluster handler run concurrently in separate threads, and since both patch the entire spec.resourceGroups array, the last write silently overwrites any concurrent changes from the other. Add optimistic retry-on-conflict using resourceVersion or switch to a narrower patch/Server-Side Apply strategy scoped to the specific flavor being modified.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/core/kueue.py` around lines 215 - 271, add_flavor_to_cluster_queue
(and similarly update_flavor_quotas) currently does a read-modify-write on the
global ClusterQueue and can lose concurrent changes; modify these functions to
perform optimistic retry-on-conflict by reading and including the object's
current metadata.resourceVersion when patching and retrying on ApiException
conflict (409) a few times, or alternatively switch to a narrower
patch/Server-Side Apply that targets only the specific flavor element instead of
replacing spec.resourceGroups; locate add_flavor_to_cluster_queue and
update_flavor_quotas in kueue.py and implement a loop that GETs the
ClusterQueue, computes the change, issues a PATCH/SSApply with resourceVersion
or field-specific patch, and retries on conflict until success or max attempts.
looks like a valid concern
@kpouget, agreed — here's a concrete fix using an optimistic retry loop. The key idea is to include the metadata.resourceVersion from the GET response in the PATCH body so Kubernetes will reject stale writes with a 409, then retry from scratch.
- Extract a shared `_retry_patch_cluster_queue` helper that encapsulates the GET → mutate → PATCH loop with up to N retries on conflict (409)
- Refactor `add_flavor_to_cluster_queue` to use it
- Refactor `update_flavor_quotas` to use it
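A minimal sketch of such a helper, assuming it lives on `KueueClient` alongside the methods above; `_retry_patch_cluster_queue`, the `mutate` callback contract, and the retry count are illustrative, not the PR's actual implementation:

```python
from kubernetes import client

MAX_CONFLICT_RETRIES = 3  # illustrative constant


def _retry_patch_cluster_queue(self, cq_name: str, mutate) -> dict | None:
    """GET the ClusterQueue, apply mutate(cq) in memory, PATCH it back.

    The PATCH body carries metadata.resourceVersion from the GET, so a
    concurrent writer makes the apiserver answer 409 and we retry from a
    fresh read instead of silently overwriting the other change.
    """
    for attempt in range(MAX_CONFLICT_RETRIES):
        cq = self._k8s.get_cluster_custom_object(
            group=KUEUE_GROUP,
            version=KUEUE_VERSION,
            plural=KUEUE_CLUSTER_QUEUE_PLURAL,
            name=cq_name,
        )
        if not mutate(cq):  # mutate returns False when there is nothing to change
            return None
        body = {
            "metadata": {"resourceVersion": cq["metadata"]["resourceVersion"]},
            "spec": {"resourceGroups": cq["spec"]["resourceGroups"]},
        }
        try:
            return self._k8s.patch_cluster_custom_object(
                group=KUEUE_GROUP,
                version=KUEUE_VERSION,
                plural=KUEUE_CLUSTER_QUEUE_PLURAL,
                name=cq_name,
                body=body,
            )
        except client.exceptions.ApiException as exc:
            if exc.status != 409 or attempt == MAX_CONFLICT_RETRIES - 1:
                raise  # non-conflict error, or retries exhausted
    return None
```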
🧠 Learnings used
Learnt from: avasilevskii
Repo: openshift-psap/fournos PR: 43
File: fournos/handlers/lifecycle.py:95-95
Timestamp: 2026-04-21T12:23:23.225Z
Learning: In `fournos/handlers/lifecycle.py`, the `cluster` value used as a Kubernetes label value for `LABEL_EXCLUSIVE_CLUSTER` (defined as `"fournos.dev/exclusive-cluster"` in `fournos/core/constants.py`) is always validated against existing Kueue ResourceFlavor names before use. ResourceFlavor names are controlled by platform operators (not end users) and in practice are short identifiers like "cluster-1" or "cluster-2", so they are inherently Kubernetes label-value safe. No additional sanitization is needed.
The requested changes have been implemented and a pull request has been created: View PR
| exclusive = spec["exclusive"] | ||
| lock_only = spec.get("lockOnly", False) | ||
|
|
||
| if lock_only and not cluster: | ||
| patch.status["phase"] = Phase.FAILED | ||
| patch.status["message"] = "lockOnly: true requires 'cluster' to be set" | ||
| return | ||
|
|
||
| if not lock_only and not spec.get("forge"): | ||
| patch.status["phase"] = Phase.FAILED | ||
| patch.status["message"] = "spec.forge is required for non-lockOnly jobs" | ||
| return |
Fail fast on lockOnly: true + exclusive: false instead of silently overriding.
Current flow accepts that combo but _create_lock_workload() enforces exclusive anyway, so user intent is ignored. Rejecting this input makes behavior explicit and debuggable.
Proposed fix
```diff
 def on_create(spec, name, namespace, status, patch, body):
 @@
     cluster = spec.get("cluster")
     exclusive = spec["exclusive"]
     lock_only = spec.get("lockOnly", False)
+    if lock_only and not exclusive:
+        patch.status["phase"] = Phase.FAILED
+        patch.status["message"] = "lockOnly jobs require 'exclusive: true'"
+        return
+
     if lock_only and not cluster:
         patch.status["phase"] = Phase.FAILED
         patch.status["message"] = "lockOnly: true requires 'cluster' to be set"
         return
```

Also applies to: 95-106
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/handlers/lifecycle.py` around lines 52 - 63, Reject combinations
where lockOnly is true but exclusive is false instead of silently overriding: in
the validation blocks that read exclusive = spec["exclusive"] and lock_only =
spec.get("lockOnly", False) (and the duplicate block around lines with similar
logic), add a check that if lock_only is True and exclusive is False then set
patch.status["phase"] = Phase.FAILED and patch.status["message"] to a clear
message like "lockOnly: true requires 'exclusive' to be true" and return;
reference the same symbols (exclusive, lock_only, spec) and ensure the same
check is applied in both places (the initial validation and the later similar
block that currently enforces exclusivity in _create_lock_workload()) so user
input is rejected rather than silently overridden.
```python
def reconcile_psapcluster(spec, name, namespace, status, patch, body):
    _reconcile_kubeconfig(spec, name, status, patch)
    _reconcile_ttl_expiry(spec, name, status, patch)
    _reconcile_gpu_discovery(spec, name, status, patch)
    _reconcile_lock_job(spec, name, status, patch)
```
Stop the reconcile pass after TTL expiry releases the lock.
_reconcile_ttl_expiry() only updates patch, but the same pass still calls _reconcile_lock_job() with the old spec/status. After _release_lock() deletes the sentinel, the self-healing branch still sees status.locked == True and the old spec.owner, so it can recreate the lock immediately and make TTL expiry ineffective.
Suggested fix
```diff
 def reconcile_psapcluster(spec, name, namespace, status, patch, body):
     _reconcile_kubeconfig(spec, name, status, patch)
-    _reconcile_ttl_expiry(spec, name, status, patch)
+    if _reconcile_ttl_expiry(spec, name, status, patch):
+        return
     _reconcile_gpu_discovery(spec, name, status, patch)
     _reconcile_lock_job(spec, name, status, patch)

 def _reconcile_ttl_expiry(spec, name, status, patch):
     if not status.get("locked"):
-        return
+        return False
     ...
     if datetime.now(timezone.utc) >= expires_at:
         prev_owner = spec.get("owner", "unknown")
         logger.info(
             "PSAPCluster %s: ownership expired (was owned by %s)", name, prev_owner
         )
         patch.spec["owner"] = ""
         _release_lock(name, patch)
+        return True
+    return False
```

Also applies to: 238-260, 354-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/handlers/psapcluster.py` around lines 216 - 220,
reconcile_psapcluster currently calls _reconcile_ttl_expiry(...) then continues
to call _reconcile_lock_job(...) which can immediately recreate a lock because
patch was mutated but spec/status still reflect the old locked state; modify
_reconcile_ttl_expiry to return a boolean (e.g., released=True if it called
_release_lock()) and update reconcile_psapcluster to check that return and stop
the reconcile pass (early return) when TTL expiry released the lock so
_reconcile_lock_job is not invoked with stale spec/status; apply the same
pattern to the other reconcile sequences mentioned (the blocks around lines
238-260 and 354-383) so each _reconcile_ttl_expiry caller aborts further
reconcile steps when it reports a release.
```python
patch.status.setdefault("conditions", []).append(
    _make_condition(
        COND_GPU_DISCOVERED,
        "False",
        "DiscoveryFailed",
        str(exc),
    )
)
```
Upsert conditions instead of appending duplicates on every timer run.
Both the success and failure paths append a fresh COND_GPU_DISCOVERED entry every reconciliation. On a long-lived cluster this grows status.conditions without bound and makes consumers pick between duplicate condition records. Replace the existing condition by type instead of always appending.
Also applies to: 338-340
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/handlers/psapcluster.py` around lines 303 - 310, The code currently
always appends a new condition into patch.status.setdefault("conditions", [])
(see the append call with _make_condition and COND_GPU_DISCOVERED), causing
duplicate COND_GPU_DISCOVERED entries over time; change this to "upsert" by
finding and removing any existing condition whose "type" equals
COND_GPU_DISCOVERED from patch.status["conditions"] before adding the new
_make_condition(...) entry. Apply the same replacement in both the failure path
(the shown append using str(exc)) and the corresponding success path where
COND_GPU_DISCOVERED is appended, using the symbols patch.status,
COND_GPU_DISCOVERED, and _make_condition to locate and modify the code.
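A small sketch of the upsert approach the prompt describes; `_upsert_condition` is an illustrative helper name, not existing code:

```python
def _upsert_condition(patch, new_condition: dict) -> None:
    """Replace any existing condition of the same type instead of appending a duplicate."""
    conditions = patch.status.setdefault("conditions", [])
    conditions[:] = [c for c in conditions if c.get("type") != new_condition["type"]]
    conditions.append(new_condition)


# The failure path from the snippet above would become:
# _upsert_condition(
#     patch,
#     _make_condition(COND_GPU_DISCOVERED, "False", "DiscoveryFailed", str(exc)),
# )
```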
```python
prev_gpus = hardware.get("gpus", [])
new_resources = [(g.short_name, g.count) for g in result.gpus]
prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
if new_resources != prev_resources and new_resources:
    try:
        ctx.kueue.update_flavor_quotas(name, new_resources)
```
Clear Kueue quotas when discovery drops to zero GPUs.
The and new_resources guard skips update_flavor_quotas() when a cluster previously had GPUs but discovery now returns none. That leaves stale GPU quota in Kueue and can schedule workloads onto capacity that no longer exists.
Suggested fix
```diff
-    if new_resources != prev_resources and new_resources:
+    if new_resources != prev_resources:
         try:
             ctx.kueue.update_flavor_quotas(name, new_resources)
         except Exception as exc:
             logger.warning(
                 "PSAPCluster %s: failed to update flavor quotas: %s", name, exc
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
prev_gpus = hardware.get("gpus", [])
new_resources = [(g.short_name, g.count) for g in result.gpus]
prev_resources = [(g.get("shortName"), g.get("count")) for g in prev_gpus]
if new_resources != prev_resources:
    try:
        ctx.kueue.update_flavor_quotas(name, new_resources)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@fournos/handlers/psapcluster.py` around lines 342 - 347, The guard `and
new_resources` prevents clearing Kueue quotas when discovery reports zero GPUs;
change the logic so that `ctx.kueue.update_flavor_quotas(name, new_resources)`
is called whenever `new_resources != prev_resources` (even if `new_resources` is
empty) so stale GPU quotas are removed — update the block around the variables
`prev_gpus`, `new_resources`, `prev_resources` and the call to
`ctx.kueue.update_flavor_quotas(name, new_resources)` to remove the `and
new_resources` condition and ensure empty `new_resources` clears quotas.
```markdown
2. Add a ResourceFlavor and quota for the cluster in
   `config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](#onboarding-a-new-cluster)).
```
PSAPCluster onboarding instructions are internally inconsistent.
This section says users must manually add ResourceFlavor/quota, but the next section states the controller updates ClusterQueue quotas automatically. Please clarify whether step 2 is required, optional fallback, or legacy flow.
Proposed wording adjustment
```diff
-2. Add a ResourceFlavor and quota for the cluster in
-   `config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](#onboarding-a-new-cluster)).
+2. (Optional/manual fallback) If auto-discovery automation is disabled in your environment,
+   add a ResourceFlavor and quota for the cluster in
+   `config/kueue-cluster-config.yaml` (see [Onboarding a new cluster](#onboarding-a-new-cluster)).
```

Also applies to: 341-345
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 323 - 325, The README's PSAPCluster onboarding is
inconsistent about manual quota setup: update the wording around the step that
says "Add a ResourceFlavor and quota for the cluster in
`config/kueue-cluster-config.yaml`" (and the repeated text at the later block)
to clearly state whether this manual step is required, optional fallback, or
legacy; explicitly mention that the controller (referenced as "controller
updates ClusterQueue quotas automatically" in the "Onboarding a new cluster"
section) will normally create/update ClusterQueue quotas for onboarded clusters
and keep the manual `config/kueue-cluster-config.yaml` step as an
optional/manual fallback for edge cases or legacy workflows, and add a short
cross-reference to the "Onboarding a new cluster" section and a one-line
example/criteria when the manual step is needed.
```python
if cluster_name in existing_clusters:
    continue
```
we need to make sure that the objects (flavor, etc) are there, even if cluster_name in existing_clusters
they need to
- exist
- have the right content
(I think they need to be created every time)
```python
GPU_MODEL_MAP: dict[str, str] = {
    "NVIDIA-A100-SXM4-80GB": "a100",
    "NVIDIA-A100-SXM4-40GB": "a100",
    "NVIDIA-A100-PCIE-80GB": "a100",
    "NVIDIA-A100-PCIE-40GB": "a100",
    "NVIDIA-H100-SXM5-80GB": "h100",
    "NVIDIA-H100-PCIE-80GB": "h100",
    "NVIDIA-H200-SXM-141GB": "h200",
    "AMD-Instinct-MI300X": "mi300x",
}
```
These can live in a user-configurable ConfigMap.
(can be done later. If we don't do it in the PR, need to fill a ticket so that we don't forget it)
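A hedged sketch of how such an override could be loaded, assuming a ConfigMap read via the existing CoreV1Api client; the ConfigMap name `fournos-gpu-model-map`, its `models` key, and the helper name are illustrative:

```python
import json

from kubernetes import client


def load_gpu_model_map(core: client.CoreV1Api, namespace: str) -> dict[str, str]:
    """Merge user-provided overrides from a ConfigMap on top of the built-in defaults."""
    merged = dict(GPU_MODEL_MAP)
    try:
        cm = core.read_namespaced_config_map("fournos-gpu-model-map", namespace)
    except client.exceptions.ApiException as exc:
        if exc.status == 404:
            return merged  # no overrides configured
        raise
    merged.update(json.loads((cm.data or {}).get("models", "{}")))
    return merged
```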
```python
elif secret.string_data and "kubeconfig" in secret.string_data:
    raw = secret.string_data["kubeconfig"]
```
Good point. I think it would never happen when reading the real k8s API, but since we have onboarded cluster-1, cluster-2, ... type fake clusters for convenience, there is some defensive code added by Claude. I was hitting the issue primarily for those fake clusters. :)
What is stringData?
Normally, the data field in a Kubernetes Secret requires values to be base64-encoded. While base64 isn't encryption, it’s a hassle to encode strings manually every time you write a YAML file.
The stringData field allows you to provide secret data as plain text.
How it Works
You write your Secret YAML using stringData.
You kubectl apply the file.
Kubernetes automatically encodes the values to base64, moves them to the data field, and removes the stringData field from the stored object.
this path will never happen in the real runs of the controller.
The test `test_stringdata_kubeconfig` generates it, but I don't think that makes sense.
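For illustration only (not code from the PR): at read time the API server has already folded `stringData` into the base64-encoded `data` field, so a controller reading a live Secret normally only needs `data`:

```python
import base64

from kubernetes import client


def read_kubeconfig(core: client.CoreV1Api, name: str, namespace: str) -> str:
    # The stored Secret only carries base64-encoded data; stringData never survives a round trip.
    secret = core.read_namespaced_secret(name, namespace)
    return base64.b64decode(secret.data["kubeconfig"]).decode()
```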
kpouget left a comment
@MML-coder , could you turn this into a controller fully independent from FOURNOS ?
you create a secondary top directory, and only import the relevant FOURNOS bits (constants, I assume)
we'll also need a name for the controller,
and maybe use another name than "PSAP"
Maybe FournosCluster, or something alike if we find another nice name
```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
```
we'll rediscuss it, but I think we'll want to have

```yaml
spec:
  resources:
    h200: 8
```

with `spec.resources` either user-provided or dynamically looked up if `spec.resources: {}`
Are you thinking the user could provide how many GPUs they want to use by specifying `spec.resources: 2` or `8`, and the job can use the available GPUs for the job?
there are two things I have in mind:

- we want the controller to do this only once (hence the `spec` field: like a dynamic default value, not a truly reconciled field)
- in the future, we want to allow the customization of the resources list with SW properties (TASK: refine the future behavior of the exclusive/cluster/hardware flags #70)
```python
def scan(self) -> list[str]:
    """Scan for new clusters and onboard them.

    Returns the list of newly discovered cluster names.
    """
    secrets = list_kubeconfig_secrets(self._k8s_core)
    if not secrets:
        return []

    existing_flavors = self._kueue.list_flavors()
    existing_clusters = self._list_existing_psapclusters()

    discovered: list[str] = []
    for secret_name in secrets:
```
I think the pattern is wrong here for a controller.

The way I think it can work in a safer way (a minimal kopf sketch of this pattern follows below):

Watch the creation / update of secrets with `-l fournos.dev/cluster-kubeconfig=true`:
- when such a secret is touched, make sure there's a PSAPCluster object matching it
- put the `owner` field in the secret (TBC) to the PSAPCluster object

Watch the creation / update of PSAPCluster objects:
- when such an object is touched, make sure that its ResourceFlavor exists
- when the kubeconfig secret is available, if `spec.hardware` is empty, look it up in the cluster
- when `spec.hardware` is filled, update the ClusterQueue
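A minimal sketch of that watch-based pattern, assuming kopf (which the operator already uses); the label key, CRD coordinates, and the helper functions (`ensure_psapcluster_for_secret`, `ensure_resource_flavor`, `discover_hardware`, `update_cluster_queue_quotas`) are illustrative, not existing code:

```python
import kopf


# React to kubeconfig secrets labelled fournos.dev/cluster-kubeconfig=true.
@kopf.on.create("", "v1", "secrets", labels={"fournos.dev/cluster-kubeconfig": "true"})
@kopf.on.update("", "v1", "secrets", labels={"fournos.dev/cluster-kubeconfig": "true"})
def on_kubeconfig_secret(name, namespace, **_):
    # Make sure a PSAPCluster object matching this secret exists.
    ensure_psapcluster_for_secret(name, namespace)


# React to PSAPCluster objects themselves.
@kopf.on.create("fournos.dev", "v1", "psapclusters")
@kopf.on.update("fournos.dev", "v1", "psapclusters")
def on_psapcluster(spec, name, patch, **_):
    ensure_resource_flavor(name)  # its ResourceFlavor must exist
    if not spec.get("hardware"):
        patch.spec["hardware"] = discover_hardware(name)  # look it up in the target cluster
    else:
        update_cluster_queue_quotas(name, spec["hardware"])  # quota follows spec.hardware
```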
Summary
Introduces a
`PSAPCluster` custom resource and kopf controller that unifies cluster
cluster onboarding from kubeconfig secrets.
Closes #37, closes #51
What's included
fournos.dev/v1) — single source of truth for cluster state with printer columns foroc get psapclustersspec.ownerto create alockOnly: true, exclusive: truesentinel FournosJob that holds all 100 cluster-slots via Kueue, blocking competing workloads. Optionalspec.ttlfor auto-expiryfournos.dev/auto-discovered: true)lockOnlyFournosJob support — new CRD field for sentinel jobs that skip Forge resolution and stay Admitted indefinitelyfournos-queuewith per-cluster ResourceFlavorsArchitecture
New/modified files
- `manifests/crd.yaml` — added `lockOnly` field, removed `forge` from required
- `manifests/psapcluster-crd.yaml` — added `lockJobName`, removed old CQ fields
- `manifests/rbac/role_fournos.yaml` — `create`/`delete` for fournosjobs, `create` for psapclusters
- `fournos/core/discovery.py` — `ClusterDiscovery` class for auto-onboarding clusters
- `fournos/core/clusters.py` — `extract_cluster_name`, `list_kubeconfig_secrets`
- `fournos/core/kueue.py` — `create_flavor`, `add_flavor_to_cluster_queue`, `update_flavor_quotas`
- `fournos/core/psapcluster.py` — removed (PSAPClusterManager)
- `fournos/handlers/discovery.py` — `scan_clusters` handler
- `fournos/handlers/psapcluster.py` — rewritten for the sentinel create/delete lifecycle
- `fournos/handlers/lifecycle.py` — `lockOnly` handling in `on_create`
- `fournos/handlers/execution.py` — `lockOnly` early return in `reconcile_admitted`
- `fournos/operator.py` — discovery daemon thread, `ClusterDiscovery` init
- `tests/unit/test_cluster_discovery.py` — new discovery unit tests
- `tests/unit/test_psapcluster_handlers.py` — rewritten
- `tests/unit/test_psapcluster_manager.py`

Usage
Test plan
Summary by CodeRabbit
New Features
- `lockOnly` job mode to reserve cluster capacity without executing pipelines

Documentation