
Add pre/post lease hooks#140

Open
kirkbrauer wants to merge 47 commits into main from add-hooks

Conversation

@kirkbrauer
Member

@kirkbrauer kirkbrauer commented Jan 24, 2026

Summary

Implements hook scripts, configurable in the exporter config, that run j commands within a temporary lease environment. This PR combines changes across the Python, Controller, and Protocol components.

Environment Variables

The following environment variables are exposed to hook scripts:

  • JUMPSTARTER_HOST - Path to the Unix socket for j CLI access
  • JMP_DRIVERS_ALLOW=UNSAFE - Allows all drivers for local access
  • LEASE_NAME - The name of the current lease
  • CLIENT_NAME - Name of the client that acquired the lease
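A hook executor would typically merge these variables over the inherited process environment before spawning the script. A minimal sketch (the helper name and signature are assumptions for illustration, not the actual implementation):

```python
import os

def build_hook_env(socket_path: str, lease_name: str, client_name: str) -> dict:
    """Merge the hook-specific variables over the inherited environment."""
    env = dict(os.environ)
    env.update({
        "JUMPSTARTER_HOST": socket_path,  # Unix socket path for j CLI access
        "JMP_DRIVERS_ALLOW": "UNSAFE",    # allow all drivers for local access
        "LEASE_NAME": lease_name,
        "CLIENT_NAME": client_name,
    })
    return env
```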

Example Configuration

apiVersion: jumpstarter.dev/v1alpha1
kind: ExporterConfig
metadata:
  namespace: default
  name: test-exporter
endpoint: grpc.jumpstarter.100.123.60.47.nip.io:8082
hooks:
  beforeLease:
    script: |
      echo "Starting lease setup for ${CLIENT_NAME}"
      j power on
      echo "Pre-lease setup completed!"
    timeout: 120
    onFailure: exit
  afterLease:
    script: |
      echo "Starting lease cleanup for ${CLIENT_NAME}"
      j power off
      echo "Post-lease cleanup completed!"
    timeout: 60
    onFailure: warn
tls:
  ca: ''
  insecure: true
token: abc123
export:
  power:
    type: "jumpstarter_driver_power.driver.MockPower"
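The hooks section above maps naturally onto a pair of small config models. A dataclass sketch (field names and the 300-second default are assumptions; the real models are the pydantic classes HookConfigV1Alpha1 and HookInstanceConfigV1Alpha1):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HookInstance:
    script: str
    timeout: int = 300          # seconds; assumed default
    on_failure: str = "exit"    # "exit", "warn", or "endLease"

@dataclass
class Hooks:
    before_lease: Optional[HookInstance] = None
    after_lease: Optional[HookInstance] = None

def parse_hooks(raw: dict) -> Hooks:
    """Parse the hooks: mapping of an ExporterConfig document."""
    def one(d):
        if d is None:
            return None
        return HookInstance(
            script=d["script"],
            timeout=d.get("timeout", 300),
            on_failure=d.get("onFailure", "exit"),
        )
    return Hooks(before_lease=one(raw.get("beforeLease")),
                 after_lease=one(raw.get("afterLease")))
```

Parsing the hooks block from the example config above would yield a before-lease hook with a 120-second timeout and an after-lease hook that only warns on failure.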

Changes

Python (python/)

  • Hook executor with before/after lease lifecycle support
  • Configurable timeouts and failure modes (exit, warn)
  • Exporter status reporting and streaming
  • EndSession support for graceful session termination
  • Typed Protobuf/gRPC stubs generation
  • Status field exposed in CLI (jmp admin get exporter)
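The timeout and failure-mode behavior listed above can be sketched as a single synchronous helper (illustrative only; the real HookExecutor is async, builds the lease environment, and streams script output through the session logger):

```python
import os
import subprocess

def run_hook(script: str, timeout: int, on_failure: str = "exit") -> bool:
    """Run a hook script with a timeout and a configurable failure mode.

    Returns True on success. With on_failure="warn" the caller is expected
    to log a warning and continue; with "exit" a failure is raised.
    """
    try:
        result = subprocess.run(["sh", "-c", script],
                                env=os.environ.copy(),
                                timeout=timeout)
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False  # a timed-out hook counts as a failure
    if not ok and on_failure == "exit":
        raise RuntimeError("hook failed and onFailure is 'exit'")
    return ok
```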

Controller (controller/)

  • Full ExporterStatus field support with state transitions
  • Status updates reflected from exporters in real-time
  • EndSession API implementation
  • Exporter-initiated lease release support
  • CRDs managed via Helm chart templates

Protocol (protocol/)

  • EndSession API for keeping connections open after lease ends
  • release_lease field for exporter-initiated lease release
  • status_version and previous_status fields for transition tracking

Related PRs

Summary by CodeRabbit

Release Notes

  • New Features

    • Added exporter status tracking and display with multiple operational states
    • Introduced lifecycle hooks (before/after lease) for custom script execution during lease operations
    • Added status monitoring for non-blocking lease readiness checks
    • Enhanced logging with source-based routing and filtering
  • Bug Fixes

    • Improved exporter connection handling with retry logic and status validation
  • Tests

    • Added comprehensive end-to-end tests for hook functionality and status management

@coderabbitai

Contributor

coderabbitai bot commented Jan 24, 2026

📝 Walkthrough

Walkthrough

This PR introduces comprehensive exporter lifecycle management through beforeLease/afterLease hooks, exporter status reporting and monitoring, log source tagging, and extended gRPC capabilities. It adds StatusMonitor for non-blocking status polling, HookExecutor for hook execution, LeaseContext for lease-scoped resources, ReportStatus endpoint, EndSession RPC, and corresponding Python/Go client updates across multiple packages.

Changes

Cohort / File(s) Summary
Protocol Definitions
protocol/proto/jumpstarter/v1/jumpstarter.proto
Adds ReportStatus RPC, EndSession RPC/messages, release_lease field to ReportStatusRequest, status_version and previous_status fields to GetStatusResponse, and optional source field to LogStreamResponse.
CRD Definitions
controller/deploy/helm/jumpstarter/crds/jumpstarter.dev_*.yaml, controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/*
Adds new CRDs (Client, ExporterAccessPolicy, Exporter, Lease) and updates Exporter CRD with exporterStatus/statusMessage fields, printer columns, and status schema.
Controller API
controller/api/v1alpha1/exporter_*.go
Adds ExporterStatusValue and StatusMessage fields to ExporterStatus struct, introduces status enum constants, adds kubebuilder markers for printer columns, and implements ToProtobuf mapping with stringToProtoStatus conversion.
Controller Service
controller/internal/controller/exporter_controller.go, controller/internal/service/controller_service.go
Implements ReportStatus endpoint handler, status-to-string conversion, Online condition synchronization, lease release handling, exporter status validation for driver calls, and integrates LastSeen/StatusMessage updates.
Exporter Runtime
python/packages/jumpstarter/jumpstarter/exporter/exporter.py, python/packages/jumpstarter/jumpstarter/exporter/session.py
Major restructuring: adds hook_executor field, LeaseContext support, _handle_lease method, _register/_unregister/_report_status helpers, session_for_lease context manager yielding dual sockets, serve coordination with status streaming, and lease release request handling.
Hook Infrastructure
python/packages/jumpstarter/jumpstarter/exporter/hooks.py, python/packages/jumpstarter/jumpstarter/exporter/lease_context.py, python/packages/jumpstarter/jumpstarter/exporter/logging_protocol.py
New modules implementing HookExecutor (hook configuration, environment setup, process execution, timeout/error handling), LeaseContext (lease-scoped state, status propagation, event signaling), and LoggerRegistration protocol for logger source management.
Logging Extensions
python/packages/jumpstarter/jumpstarter/exporter/logging.py
Extends LogHandler with source field and per-logger source mappings, adds context_log_source context manager, introduces get_logger helper integrating source tracking with session registration.
Status Monitoring
python/packages/jumpstarter/jumpstarter/client/status_monitor.py
New module implementing StatusMonitor with background polling, event-driven callbacks, wait-for patterns, adaptive polling (fast/slow/critical), connection-loss detection, and missed-transition tracking.
Client Core
python/packages/jumpstarter/jumpstarter/client/core.py
Adds ExporterNotReady exception, async status methods (get_status_async, wait_for_lease_ready, end_session_async, wait_for_hook_status), StatusMonitor integration, enhanced log_stream_async with source-based routing and filtering, and RichHandler with show_path=False.
Client gRPC
python/packages/jumpstarter/jumpstarter/client/grpc.py, python/packages/jumpstarter/jumpstarter/client/lease.py
Adds Exporter.status field and show_status option, dial_timeout field to Lease, exponential-backoff retry loop in handle_async with failure detection, and _wait_for_ready_connection helper for connection establishment.
Configuration
python/packages/jumpstarter/jumpstarter/config/exporter.py, python/packages/jumpstarter/jumpstarter/config/client.py
Adds HookInstanceConfigV1Alpha1 and HookConfigV1Alpha1 models, hooks field to ExporterConfig with HookExecutor wiring, dial_timeout field to ClientConfigV1Alpha1Lease, and include_status parameter to list_exporters.
Common Exports
python/packages/jumpstarter/jumpstarter/common/enums.py, python/packages/jumpstarter/jumpstarter/common/types.py, python/packages/jumpstarter/jumpstarter/common/__init__.py
New enum classes ExporterStatus and LogSource with from_proto/to_proto converters, TypeAlias definitions for AsyncChannel and service stubs, and expanded all exports.
Generated Protocol Stubs
python/packages/jumpstarter-protocol/jumpstarter_protocol/jumpstarter/*/pb2*.py{,i}
Generated stubs for client v1 protocol, gRPC servicer interfaces, common enums, and router services with full type annotations and message definitions.
Driver & Test Updates
python/packages/jumpstarter/jumpstarter/driver/base.py, python/packages/jumpstarter/jumpstarter/exporter/session.py
Updates Driver logger initialization to use get_logger with LogSource.DRIVER, modifies Session initialization to map driver logs, adjusts LogStream polling interval, and implements _check_status_for_driver_call precondition.
Test Suites
controller/internal/service/controller_service_test.go, controller/internal/service/controller_service_integration_test.go, python/packages/jumpstarter/jumpstarter/exporter/hooks_test.py, python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py, python/packages/jumpstarter/jumpstarter/exporter/lease_context_test.py
Comprehensive tests for hook execution, status handling, lease release, online condition syncing, status monitoring, and hook failure modes.
E2E Tests & Configs
e2e/tests-hooks.bats, e2e/exporters/exporter-hooks-*.yaml
New hook-specific e2e test suite covering beforeLease/afterLease execution, failure modes (warn/endLease/exit), timeouts, environment variables, and test configurations for various hook scenarios.
Build & Deployment
controller/Makefile, controller/hack/deploy_with_helm.sh, controller/.gitignore, python/buf.gen.yaml
Adds awk-based sync-wave annotation injection for role, CRD template copying, --skip-crds flag for helm, mypy/mypy-grpc plugins for Python protobuf generation.
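The StatusMonitor wait-for pattern described above reduces to polling until a target status (or a timeout) is reached. A synchronous sketch under stated assumptions (the real monitor is async, event-driven, and adapts its polling interval between fast/slow/critical modes):

```python
import time

def wait_for_status(get_status, targets, timeout=300.0, interval=0.3):
    """Poll get_status() until it returns one of `targets` or the timeout
    elapses. Returns the matched status, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in targets:
            return status
        if time.monotonic() >= deadline:
            return None  # caller treats None as a timeout
        time.sleep(interval)
```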

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Exporter as Exporter<br/>(Main)
    participant HookExec as HookExecutor
    participant Driver
    participant ControllerSvc as ControllerService
    participant StatusMon as StatusMonitor

    Client->>Exporter: request_lease()
    Exporter->>ControllerSvc: ReportStatus(BEFORE_LEASE_HOOK)
    ControllerSvc->>StatusMon: (status updated)
    
    Exporter->>HookExec: run_before_lease_hook()
    HookExec->>Driver: instantiate device
    HookExec->>HookExec: execute_hook_process()
    HookExec-->>Exporter: (hook complete/failed)
    
    alt Hook Success
        Exporter->>Exporter: drivers ready event set
        Exporter->>ControllerSvc: ReportStatus(LEASE_READY)
        Exporter->>Client: (allow client driver calls)
        Client->>Driver: driver operations
    else Hook Failure (on_failure=endLease)
        Exporter->>Exporter: _request_lease_release()
        Exporter->>ControllerSvc: (lease release signaled)
    end
    
    Exporter->>HookExec: run_after_lease_hook()
    HookExec->>HookExec: execute_hook_process()
    HookExec-->>Exporter: (hook complete)
    
    Exporter->>ControllerSvc: ReportStatus(OFFLINE)
    ControllerSvc->>StatusMon: (status updated)
sequenceDiagram
    participant Client
    participant AsyncClient as AsyncDriverClient
    participant StatusMon as StatusMonitor
    participant ExporterSvc as ExporterService<br/>(gRPC)

    Client->>StatusMon: start()
    activate StatusMon
    StatusMon->>ExporterSvc: GetStatus() [background poll]
    ExporterSvc-->>StatusMon: (status response)
    StatusMon->>StatusMon: notify on_status_change callbacks
    
    Client->>AsyncClient: wait_for_lease_ready(timeout=300s)
    AsyncClient->>StatusMon: wait_for_status(LEASE_READY)
    
    loop Poll Loop (0.3s intervals)
        StatusMon->>ExporterSvc: GetStatus()
        ExporterSvc-->>StatusMon: BEFORE_LEASE_HOOK
        StatusMon->>StatusMon: signal any_change_event
        AsyncClient->>StatusMon: check current_status
        alt Status != LEASE_READY
            AsyncClient->>AsyncClient: continue waiting
        else Status == LEASE_READY
            AsyncClient-->>Client: ready
        end
    end
    
    Client->>AsyncClient: call_async(method, args)
    AsyncClient->>AsyncClient: check status (LEASE_READY)
    AsyncClient->>ExporterSvc: DriverCall()
    ExporterSvc-->>AsyncClient: result
    
    StatusMon->>StatusMon: stop polling
    deactivate StatusMon

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

  • Support for Exporter lifecycle hooks #65: Directly implements exporter lifecycle hooks feature across exporter, session, hook executor, lease context, CLI, and e2e tests.
  • jumpstarter-dev/jumpstarter#522: Implements exporter lifecycle hooks (beforeLease/afterLease) with configuration support and runtime execution across config, exporter, CRDs, and tests.
  • [Proposal] Enhanced Lease Statuses for Hooks #104: Implements exporter status enums, status messaging, ReportStatus endpoint, LogStream source tagging, and hook-driven status transitions matching proposal specifications.
  • jumpstarter-dev/jumpstarter-protocol#25: Directly addresses protocol-level proposals for status enums, log-source tagging, status reporting, and hook execution monitoring.

Possibly related PRs

  • jumpstarter-dev/jumpstarter#357: Both modify exporter/exporter.py restructuring the controller listen/status loops and adding retry/resilience around interrupted connections.
  • jumpstarter-dev/jumpstarter#704: Both modify client/lease.py with connection establishment, retry logic, and lease-related configuration (dial_timeout).
  • jumpstarter-dev/jumpstarter#297: Both extend exporter-to-client logging surface with LogHandler source mappings, Session log queue integration, and client-side log_stream_async with source-based routing.

Suggested labels

operator, feature, gRPC, status-management, hooks, lifecycle-management

Poem

🐰 A Leap of Status and Hooks

Exporters now report their state,
With hooks that run before and late,
StatusMonitor watches with care,
While leases dance through the air—
From offline to ready, we celebrate! 🎉

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 41.83%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'Add pre/post lease hooks' directly and clearly summarizes the main change: implementing configurable beforeLease and afterLease lifecycle hooks for exporters.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
python/packages/jumpstarter/jumpstarter/config/client.py (1)

184-196: Add status field to exporter reconstruction to preserve it when include_leases=True.

The exporter objects lose their status field when leases are included. The original exporters have a status field that should be preserved during reconstruction, especially when include_status=True is requested. Update line 187-193 to include status=exporter.status.

Code snippet
exporter_with_lease = Exporter(
    namespace=exporter.namespace,
    name=exporter.name,
    labels=exporter.labels,
    online=exporter.online,
    status=exporter.status,  # Add this line
    lease=lease,
)
python/packages/jumpstarter/jumpstarter/config/exporter.py (1)

224-246: Remove unused device_factory parameter from HookExecutor.

The device_factory parameter passed to HookExecutor at line 231 is never used. HookExecutor only executes shell commands via subprocess and accesses the session for logging; it has no need for driver instantiation. Remove the unused parameter from both the HookExecutor dataclass definition and the instantiation call.

🤖 Fix all issues with AI agents
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_clients.yaml`:
- Around line 20-47: Update the CRD description strings so they reference
"Client" consistently instead of "identities" or "Identity": change the top
description "Client is the Schema for the identities API" to a concise "Client
is the Schema for the Client API" (or simply "Client"), and update the
spec.description "ClientSpec defines the desired state of Identity" and
status.description "ClientStatus defines the observed state of Identity" to use
"Client" (e.g., "ClientSpec defines the desired state of Client" and
"ClientStatus defines the observed state of Client") so all descriptive fields
in the CRD (the description at the top and the
spec.description/status.description entries) no longer reference
Identity/identities.

In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_leases.yaml`:
- Around line 30-31: Update the CRD description for the Lease resource: the description property currently reads "Lease is the Schema for the exporters API", a copy-paste from the Exporter CRD; change it to "Lease is the Schema for the leases API" so the description references the correct resource. Locate the description property under the Lease CRD (kind: Lease) in the CRD template and change that string accordingly.

In `@python/packages/jumpstarter-cli-common/jumpstarter_cli_common/opt.py`:
- Line 19: basicConfig assignment line exceeds max line length; split the call
across multiple lines to satisfy Ruff E501. Locate the basicConfig variable
initialization (basicConfig = partial(logging.basicConfig, format="%(message)s
[%(name)s]", handlers=[RichHandler(show_path=False)]) ) and reformat the
arguments onto separate lines (e.g., place format=... and handlers=[...] each on
their own lines and close parens on its own line) so the call to
partial(logging.basicConfig, ...) fits within line-length limits while
preserving the same arguments and use of RichHandler(show_path=False).

In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 80-89: Remove the unused synchronous helper _run_shell_with_lease:
delete the entire function definition (including its docstring and with-blocks)
since the async variant _run_shell_with_lease_async is now the only one used;
then search the repo for any remaining references to _run_shell_with_lease and
remove or update them, run the test suite/lint to confirm no import or name
errors, and ensure behavior relies solely on _run_shell_with_lease_async.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 395-401: The async context manager on StatusMonitor is incomplete:
__aenter__ currently just returns self and never starts the background polling;
update __aenter__ to initialize the task group and call self.start() (mirroring
client.status_monitor_async() behavior) so entering the context begins
monitoring, and keep __aexit__ awaiting self.stop(); alternatively remove both
methods if context-manager usage is unsupported—modify the methods on the
StatusMonitor class (__aenter__, __aexit__) to either perform full start/stop
lifecycle (create task group + await self.start() in __aenter__, await
self.stop() in __aexit__) or delete them to avoid silent failures.

In `@python/packages/jumpstarter/jumpstarter/common/enums.py`:
- Around line 26-27: The docstring for the enum value AFTER_LEASE_HOOK (mapped
to common_pb2.EXPORTER_STATUS_AFTER_LEASE_HOOK) contains a typo; update the
string from "Lease was releaseed, but exporter is executing after lease hook" to
"Lease was released, but exporter is executing after lease hook" so the
documentation reads correctly.

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 541-613: The Listen retry task started with
tg.start_soon(self._retry_stream, "Listen",
self._listen_stream_factory(lease_name), listen_tx) is attached to the outer
task group so it keeps running after the lease ends; move that start call into
the lease-scoped task group (use conn_tg.start_soon(...) before starting
wait_for_lease_end/process_connections) so the Listen task is cancelled with the
lease, and in the enclosing finally block ensure the send stream is closed
(await listen_tx.aclose() or the appropriate close method) to signal listen_rx
termination; reference symbols: _retry_stream,
_listen_stream_factory(lease_name), listen_tx/listen_rx, tg -> conn_tg, and
conn_tg.start_soon.
- Around line 202-321: Change _get_controller_stub to be an async context
manager (e.g., async def _controller_stub(self):) that creates the channel via
self.channel_factory(), yields a
jumpstarter_pb2_grpc.ControllerServiceStub(channel) and always closes the
underlying channel with await channel.close() in a finally block; then update
all callers to use "async with self._controller_stub() as controller:" instead
of awaiting _get_controller_stub(), specifically in _retry_stream (where
controller is acquired for stream_factory), _register_with_controller (Register
call), and _report_status (ReportStatus call) so channels are closed after use
and no resources leak.

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 299-307: wait_for_process currently uses
anyio.to_thread.run_sync(process.wait, abandon_on_cancel=True) which can leave
the child as a zombie if the wait is abandoned; modify wait_for_process to
ensure the subprocess is reaped on cancellation by adding a try/finally: in the
try await the run_sync as before, but in the finally check process.poll() and if
it is None try to terminate/kill (process.terminate()/process.kill()) then call
a non-abandoning run_sync(process.wait) or loop-poll with process.poll() until
exit to fully reap the child; reference the wait_for_process coroutine and the
process variable when making the changes.
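The reap-on-interruption behavior being requested can be sketched synchronously, with a timeout standing in for cancellation (the real code wraps process.wait in anyio.to_thread.run_sync with abandon_on_cancel=True):

```python
import subprocess
import sys  # used in the usage example to spawn a demo child

def wait_and_reap(process, timeout=None):
    """Wait for the child; if the wait is interrupted (modeled here as a
    timeout), kill the child and do a blocking wait so it is fully reaped
    and never lingers as a zombie. Returns the exit code."""
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()  # non-abandoning reap
```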
- Around line 257-260: The debug call inside read_pty_output exceeds the
120-char line limit; shorten the line by splitting the log message across
multiple strings or moving the format string to a short variable and then
calling logger.debug. For example, break the message into two adjacent string
literals or assign msg = "read_pty_output: heartbeat at %.1fs, iterations=%d"
and call logger.debug(msg, elapsed, read_count) so you keep the same
logger.debug call and the same format arguments (logger, elapsed, read_count)
but comply with the E501 line-length rule.
- Around line 186-189: The test mock for logging_session must support the
context manager used by logging_session.context_log_source in hooks.py; update
the test fixture (e.g., mock_session) so mock_session.context_log_source returns
an object that implements __enter__ and __exit__ (for async usage you can set
mock_session.context_log_source.return_value.__enter__ and __exit__ to AsyncMock
or wrap with contextlib.nullcontext), ensuring the context manager protocol is
honored when hooks.py calls logging_session.context_log_source.
🧹 Nitpick comments (10)
controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_exporteraccesspolicies.yaml (2)

149-152: Add schema validation for duration and priority.

Without constraints, malformed durations or negative priorities can be accepted and only fail at runtime. Consider adding format/bounds in the CRD.

♻️ Suggested schema tightening
                   maximumDuration:
                     type: string
+                    format: duration
                   priority:
                     type: integer
+                    minimum: 0

40-157: Require spec/policies to avoid empty CRs.

Right now an empty resource (missing spec or policies) is valid. If that’s not intended, enforce it in the schema to reduce no-op objects.

♻️ Suggested required fields
           spec:
             description: ExporterAccessPolicySpec defines the desired state of ExporterAccessPolicy.
             properties:
               exporterSelector:
                 description: |-
                   A label selector is a label query over a set of resources. The result of matchLabels and
                   matchExpressions are ANDed. An empty label selector matches all objects. A null
                   label selector matches no objects.
                 properties:
                   matchExpressions:
                     description: matchExpressions is a list of label selector requirements.
                       The requirements are ANDed.
                     items:
                       description: |-
                         A label selector requirement is a selector that contains values, a key, and an operator that
                         relates the key and values.
                       properties:
                         key:
                           description: key is the label key that the selector applies
                             to.
                           type: string
                         operator:
                           description: |-
                             operator represents a key's relationship to a set of values.
                             Valid operators are In, NotIn, Exists and DoesNotExist.
                           type: string
                         values:
                           description: |-
                             values is an array of string values. If the operator is In or NotIn,
                             the values array must be non-empty. If the operator is Exists or DoesNotExist,
                             the values array must be empty. This array is replaced during a strategic
                             merge patch.
                           items:
                             type: string
                           type: array
                           x-kubernetes-list-type: atomic
                       required:
                       - key
                       - operator
                       type: object
                     type: array
                     x-kubernetes-list-type: atomic
                   matchLabels:
                     additionalProperties:
                       type: string
                     description: |-
                       matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
                       map is equivalent to an element of matchExpressions, whose key field is "key", the
                       operator is "In", and the values array contains only "value". The requirements are ANDed.
                     type: object
                 type: object
                 x-kubernetes-map-type: atomic
               policies:
                 items:
                   properties:
                     from:
                       items:
                         properties:
                           clientSelector:
                             description: |-
                               A label selector is a label query over a set of resources. The result of matchLabels and
                               matchExpressions are ANDed. An empty label selector matches all objects. A null
                               label selector matches no objects.
                             properties:
                               matchExpressions:
                                 description: matchExpressions is a list of label selector
                                   requirements. The requirements are ANDed.
                                 items:
                                   description: |-
                                     A label selector requirement is a selector that contains values, a key, and an operator that
                                     relates the key and values.
                                   properties:
                                     key:
                                       description: key is the label key that the selector
                                         applies to.
                                       type: string
                                     operator:
                                       description: |-
                                         operator represents a key's relationship to a set of values.
                                         Valid operators are In, NotIn, Exists and DoesNotExist.
                                       type: string
                                     values:
                                       description: |-
                                         values is an array of string values. If the operator is In or NotIn,
                                         the values array must be non-empty. If the operator is Exists or DoesNotExist,
                                         the values array must be empty. This array is replaced during a strategic
                                         merge patch.
                                       items:
                                         type: string
                                       type: array
                                       x-kubernetes-list-type: atomic
                                   required:
                                   - key
                                   - operator
                                   type: object
                                 type: array
                                 x-kubernetes-list-type: atomic
                               matchLabels:
                                 additionalProperties:
                                   type: string
                                 description: |-
                                   matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels
                                   map is equivalent to an element of matchExpressions, whose key field is "key", the
                                   operator is "In", and the values array contains only "value". The requirements are ANDed.
                                 type: object
                             type: object
                             x-kubernetes-map-type: atomic
                         type: object
                       type: array
                     maximumDuration:
                       type: string
                     priority:
                       type: integer
                     spotAccess:
                       type: boolean
                   type: object
                 type: array
+                minItems: 1
             type: object
+            required:
+            - policies
           status:
             description: ExporterAccessPolicyStatus defines the observed state of
               ExporterAccessPolicy.
             type: object
         type: object
+        required:
+        - spec

Also applies to: 162-162

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

157-161: Redundant timeout configuration.

Both move_on_after(300) and wait_for_any_of(..., timeout=300.0) specify 300 seconds. Since wait_for_any_of has its own timeout, the outer move_on_after is redundant. Consider either removing the outer timeout or differentiating them if different semantics are intended (e.g., outer as a hard cap slightly longer than inner).

♻️ Suggested simplification
-                                        with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=300.0
-                                            )
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=300.0
+                                        )
+                                        if result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")

Note: This would require adjusting the subsequent conditional logic (lines 162-172) to handle the timeout case inline rather than via timeout_scope.cancelled_caught.

controller/Makefile (1)

66-68: Make the CRD copy step more robust with mkdir.

If the templates directory is missing in a fresh checkout, cp will fail.

♻️ Suggested tweak
 	cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
+	mkdir -p deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
 	cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
controller/internal/service/controller_service.go (1)

318-338: Consider logging unexpected status values in the default case.

The protoStatusToString function silently maps unknown proto status values to ExporterStatusUnspecified. While this provides a safe fallback, logging at debug/warning level would help diagnose protocol mismatches.

♻️ Optional: Add logging for unexpected status values
 func protoStatusToString(status pb.ExporterStatus) string {
 	switch status {
 	case pb.ExporterStatus_EXPORTER_STATUS_OFFLINE:
 		return jumpstarterdevv1alpha1.ExporterStatusOffline
 	// ... other cases ...
 	default:
+		// Log unexpected status values for debugging protocol mismatches
+		// Consider using a package-level logger if available
 		return jumpstarterdevv1alpha1.ExporterStatusUnspecified
 	}
 }
python/packages/jumpstarter/jumpstarter/common/enums.py (1)

38-45: Consider handling invalid protobuf values gracefully in from_proto.

The current implementation will raise ValueError if an unrecognized integer is passed to from_proto. For robustness against protocol version mismatches, consider returning UNSPECIFIED for unknown values.

♻️ Optional: Handle unknown values gracefully
    @classmethod
     def from_proto(cls, value: int) -> "ExporterStatus":
         """Convert from protobuf integer to enum."""
-        return cls(value)
+        try:
+            return cls(value)
+        except ValueError:
+            return cls.UNSPECIFIED
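A self-contained sketch of the pattern (illustrative enum values only; the real ExporterStatus mirrors the protobuf definition):

```python
from enum import IntEnum

class ExporterStatus(IntEnum):
    # Illustrative values only; the real enum mirrors the protobuf definition.
    UNSPECIFIED = 0
    AVAILABLE = 1
    LEASE_READY = 2

    @classmethod
    def from_proto(cls, value: int) -> "ExporterStatus":
        """Map a protobuf integer to the enum, tolerating unknown values."""
        try:
            return cls(value)
        except ValueError:
            # A newer protocol peer may send values this build does not know.
            return cls.UNSPECIFIED

print(ExporterStatus.from_proto(1).name)   # AVAILABLE
print(ExporterStatus.from_proto(99).name)  # UNSPECIFIED
```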
python/packages/jumpstarter/jumpstarter/exporter/logging.py (1)

31-39: Consider longest-prefix matching for hierarchical loggers.

The current prefix matching returns the first match found during iteration. With hierarchical logger names (e.g., driver.power, driver.power.usb), if both driver and driver.power are mapped, the match depends on iteration order. Consider sorting by key length descending to ensure the most specific (longest) prefix wins.

♻️ Suggested improvement for deterministic matching
     def get_source_for_record(self, record):
         """Determine the appropriate log source for a record."""
         with self._lock:
             # Check if this record comes from a logger with a specific source mapping
             logger_name = record.name
-            for mapped_logger, source in self._child_handlers.items():
+            # Sort by key length descending to match most specific prefix first
+            for mapped_logger, source in sorted(
+                self._child_handlers.items(), key=lambda x: len(x[0]), reverse=True
+            ):
                 if logger_name.startswith(mapped_logger):
                     return source
             return self.source
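The effect of the sort is easiest to see in isolation; this minimal sketch (hypothetical `get_source` helper and mapping) shows the most specific prefix winning regardless of dict iteration order:

```python
def get_source(logger_name: str, mapping: dict[str, str], default: str) -> str:
    """Return the source mapped to the most specific (longest) matching prefix."""
    for prefix, source in sorted(mapping.items(), key=lambda kv: len(kv[0]), reverse=True):
        if logger_name.startswith(prefix):
            return source
    return default

mapping = {"driver": "DRIVER", "driver.power": "POWER"}
print(get_source("driver.power.usb", mapping, "SYSTEM"))  # POWER, not DRIVER
print(get_source("driver.serial", mapping, "SYSTEM"))     # DRIVER
print(get_source("cli", mapping, "SYSTEM"))               # SYSTEM
```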
python/packages/jumpstarter/jumpstarter/client/core.py (1)

440-510: Consider extracting helper functions to reduce complexity.

The static analysis correctly flags this method as too complex (C901: 14 > 10). The method handles multiple concerns: reconnection with backoff, log source routing, and filtering. While the current structure is readable, extracting helpers would improve maintainability.

♻️ Suggested structure for reduced complexity
def _route_log_to_logger(self, response, show_all_logs: bool) -> None:
    """Route a log response to the appropriate logger."""
    from jumpstarter.common import LogSource

    if response.HasField("source"):
        source = LogSource(response.source)
        is_hook = source in (LogSource.BEFORE_LEASE_HOOK, LogSource.AFTER_LEASE_HOOK)
    else:
        source = LogSource.SYSTEM
        is_hook = False

    if not (is_hook or show_all_logs):
        return

    severity = response.severity if response.severity else "INFO"
    log_level = getattr(logging, severity, logging.INFO)

    logger_names = {
        LogSource.BEFORE_LEASE_HOOK: "exporter:beforeLease",
        LogSource.AFTER_LEASE_HOOK: "exporter:afterLease",
        LogSource.DRIVER: "exporter:driver",
    }
    logger_name = logger_names.get(source, "exporter:system")
    logging.getLogger(logger_name).log(log_level, response.message)

Then use this helper in the main loop to reduce nesting.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

496-498: TODO: Controller-side lease ending not yet implemented.

The comment indicates that when on_failure='endLease', there's no mechanism to actually end the lease from the exporter side. This should be tracked.

Would you like me to open an issue to track the controller-side lease ending mechanism for beforeLease hook failures with on_failure='endLease'?

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

197-279: Consider extracting common verification logic.

Both wait_for_status and wait_for_any_of share substantial code for connection recovery verification. This duplication contributes to the complexity flagged by static analysis.

♻️ Suggested refactor to reduce duplication
async def _verify_connection_recovery(self) -> ExporterStatus | None:
    """Verify connection and return current status if recovered, None if still lost."""
    if not self._connection_lost:
        return self._current_status
    
    logger.debug("Connection was marked as lost, verifying...")
    try:
        response = await self._stub.GetStatus(jumpstarter_pb2.GetStatusRequest())
        logger.info("Connection recovered during verification poll")
        self._connection_lost = False
        new_status = ExporterStatus.from_proto(response.status)
        self._current_status = new_status
        self._status_version = response.status_version
        return new_status
    except AioRpcError as e:
        if e.code() == StatusCode.UNAVAILABLE:
            logger.debug("Connection still lost (UNAVAILABLE)")
        else:
            logger.debug("GetStatus error during verification: %s", e.code())
        return None
    except Exception as e:
        logger.debug("GetStatus error during verification: %s", e)
        return None

Then use this helper in both wait_for_status and wait_for_any_of.

Comment on lines +20 to +47
        description: Client is the Schema for the identities API
        properties:
          apiVersion:
            description: |-
              APIVersion defines the versioned schema of this representation of an object.
              Servers should convert recognized schemas to the latest internal value, and
              may reject unrecognized values.
              More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
            type: string
          kind:
            description: |-
              Kind is a string value representing the REST resource this object represents.
              Servers may infer this from the endpoint the client submits requests to.
              Cannot be updated.
              In CamelCase.
              More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
            type: string
          metadata:
            type: object
          spec:
            description: ClientSpec defines the desired state of Identity
            properties:
              username:
                type: string
            type: object
          status:
            description: ClientStatus defines the observed state of Identity
            properties:
⚠️ Potential issue | 🟡 Minor

Fix CRD description wording to “Client” (minor).

The descriptions currently reference “identities/Identity,” which looks like a copy‑paste typo.

✏️ Suggested wording fix
-        description: Client is the Schema for the identities API
+        description: Client is the Schema for the clients API
@@
-            description: ClientSpec defines the desired state of Identity
+            description: ClientSpec defines the desired state of Client
@@
-            description: ClientStatus defines the observed state of Identity
+            description: ClientStatus defines the observed state of Client
🤖 Prompt for AI Agents
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_clients.yaml`
around lines 20 - 47, update the CRD description strings so they consistently
reference "Client" instead of "identities"/"Identity": change the top-level
description to "Client is the Schema for the clients API", the spec description
to "ClientSpec defines the desired state of Client", and the status description
to "ClientStatus defines the observed state of Client", so no descriptive field
in the CRD still references Identity/identities.

Comment on lines +30 to +31
        description: Lease is the Schema for the exporters API
        properties:
⚠️ Potential issue | 🟡 Minor

Fix CRD description wording to “leases” (minor).

The description says “exporters API” for the Lease CRD.

✏️ Suggested wording fix
-        description: Lease is the Schema for the exporters API
+        description: Lease is the Schema for the leases API
🤖 Prompt for AI Agents
In
`@controller/deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/jumpstarter.dev_leases.yaml`
around lines 30 - 31, locate the description property under the Lease CRD (kind:
Lease) in the CRD template and replace "Lease is the Schema for the exporters
API" with "Lease is the Schema for the leases API" so the description references
the correct API group.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@python/packages/jumpstarter-driver-power/jumpstarter_driver_power/client_test.py`:
- Around line 8-20: The test_log_stream uses fixed time.sleep(1) which is flaky;
replace these sleeps with a short-polling loop that checks caplog.text until the
expected substring appears or a timeout expires. In test_log_stream (around
client.log_stream(), client.on(), client.off()) implement a small helper or
inline poll using time.monotonic() to loop with a tiny sleep (e.g., 0.01–0.1s)
and fail if timeout reached, then assert "power on" and "power off" only after
the poll succeeds; keep using caplog and the same logger scope so the
client.log_stream(), client.on(), and client.off() calls remain unchanged.

In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Line 3: Remove the unused import OrderedDict from the imports in
client_test.py; locate the line "from collections import OrderedDict" and delete
it (or remove OrderedDict from the import list) so the lint warning is resolved
and no functionality is changed.

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py`:
- Line 3: The import list in status_monitor_test.py includes an unused symbol
AsyncMock; remove AsyncMock from the import statement (keep MagicMock) so the
module only imports symbols actually used in the tests, ensuring static analysis
no longer flags the unused AsyncMock.

In `@python/packages/jumpstarter/jumpstarter/common/enums_test.py`:
- Line 3: Remove the unused top-level import "import pytest" from enums_test.py;
locate the import statement at the top of the test module and delete it so the
file no longer contains the unused pytest symbol.
🧹 Nitpick comments (4)
python/packages/jumpstarter/jumpstarter/exporter/lease_context_test.py (1)

186-200: Avoid time-based sleep to reduce test flakiness.

Line 195 uses a fixed sleep, which can be flaky on slow CI. Consider a handshake event so the test deterministically waits for wait_for_drivers() to start before setting the hook.

♻️ Proposed deterministic rewrite
     async def test_wait_for_drivers_blocks_until_set(self) -> None:
         """Test that wait_for_drivers() blocks until hook event is set."""
         import anyio
 
         before_hook = Event()
         ctx = LeaseContext(lease_name="test-lease", before_lease_hook=before_hook)
 
-        # Set the event after a short delay
-        async def set_after_delay():
-            await anyio.sleep(0.05)
-            before_hook.set()
+        waiter_started = Event()
+
+        async def wait_and_signal():
+            waiter_started.set()
+            await ctx.wait_for_drivers()
 
         async with anyio.create_task_group() as tg:
-            tg.start_soon(set_after_delay)
-            await ctx.wait_for_drivers()
+            tg.start_soon(wait_and_signal)
+            await waiter_started.wait()
+            before_hook.set()
 
         assert ctx.drivers_ready()
python/packages/jumpstarter/jumpstarter/client/core_test.py (2)

16-32: Consider extracting shared test utilities.

MockAioRpcError and create_mock_rpc_error are duplicated from status_monitor_test.py. Consider extracting these to a shared test utilities module (e.g., jumpstarter/client/test_utils.py or a conftest.py).


90-116: Move import to module level.

The anyio import inside wait_for_hook_status should be at the module level for consistency and clarity.

Proposed fix

Add to imports at top of file:

import anyio

Then update the method:

     async def wait_for_hook_status(self, target_status: ExporterStatus, timeout: float = 60.0) -> bool:
         """Wait for exporter to reach a target status using polling."""
-        import anyio
-
         poll_interval = 0.1  # Fast for testing
python/packages/jumpstarter/jumpstarter/client/client_test.py (1)

121-124: Consider adding assertion to fully verify topological order.

The comment states children should be built before parents, but the assertion only verifies grandparent is built after parent. Adding an assertion that child is built before parent would fully validate the topological ordering.

Suggested improvement
         # Children should be built before parents in topological order
-        # The order should be child, parent, grandparent (or parent, child, grandparent)
+        # The order should be child, parent, grandparent
         assert "grandparent" in build_order
         assert build_order.index("grandparent") > build_order.index("parent")
+        assert build_order.index("parent") > build_order.index("child")

Comment on lines +8 to +20
 def test_log_stream(caplog):
     """Test that driver logs are properly streamed to the client."""
     with serve(MockPower()) as client:
-        log = MagicMock()
-        monkeypatch.setattr(client, "_AsyncDriverClient__log", log)
-        with client.log_stream():
-            client.on()
-            time.sleep(1)  # to ensure log is flushed
-            log.assert_called_with(logging.INFO, "power on")
+        # Set log level to capture INFO messages from exporter:driver logger
+        with caplog.at_level(logging.INFO, logger="exporter:driver"):
+            with client.log_stream():
+                client.on()
+                time.sleep(1)  # to ensure log is flushed
+                assert "power on" in caplog.text

-            client.off()
-            time.sleep(1)
-            log.assert_called_with(logging.INFO, "power off")
+                client.off()
+                time.sleep(1)
+                assert "power off" in caplog.text

⚠️ Potential issue | 🟡 Minor

Avoid fixed sleeps; wait for logs with a timeout.

The 1s sleeps can be flaky (slow/fast CI) and unnecessarily slow locally. Consider polling caplog.text with a short timeout to make the test reliable and quicker.

🔧 Suggested fix
 def test_log_stream(caplog):
     """Test that driver logs are properly streamed to the client."""
     with serve(MockPower()) as client:
         # Set log level to capture INFO messages from exporter:driver logger
         with caplog.at_level(logging.INFO, logger="exporter:driver"):
             with client.log_stream():
                 client.on()
-                time.sleep(1)  # to ensure log is flushed
-                assert "power on" in caplog.text
+                for _ in range(50):
+                    if "power on" in caplog.text:
+                        break
+                    time.sleep(0.1)
+                assert "power on" in caplog.text
 
                 client.off()
-                time.sleep(1)
-                assert "power off" in caplog.text
+                for _ in range(50):
+                    if "power off" in caplog.text:
+                        break
+                    time.sleep(0.1)
+                assert "power off" in caplog.text

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@controller/internal/service/suite_test.go`:
- Around line 56-58: The test harness sets up testEnv as an envtest.Environment
with CRDDirectoryPaths and ErrorIfCRDPathMissing: true, which will abort the
suite if the CRD path is wrong; verify and update the CRDDirectoryPaths value to
the actual CRD directory (or add multiple fallback paths) so envtest can find
the CRDs, e.g., check and correct the path used in the testEnv initialization
(the envtest.Environment struct, the CRDDirectoryPaths slice, and the
ErrorIfCRDPathMissing flag) or set ErrorIfCRDPathMissing to false while you
confirm the correct location.
- Around line 29-31: The test currently ignores errors when creating a
Namespace; change the Namespace creation logic (the call that creates a
corev1.Namespace object, e.g., client.CoreV1().Namespaces().Create or similar in
the test setup) to explicitly handle errors: import
k8s.io/apimachinery/pkg/api/errors as apierrors, then after the Create call
check if err != nil and if apierrors.IsAlreadyExists(err) continue/ignore,
otherwise fail the test (t.Fatalf or return the error). Apply the same pattern
for the other namespace-creation site referenced in the comment (the second
block around lines 81-87) so only AlreadyExists is tolerated and all other
errors cause the test to fail.
- Around line 19-24: Update the test setup in suite_test.go to avoid hard-coding
BinaryAssetsDirectory: first respect the KUBEBUILDER_ASSETS environment variable
if set (use it to set BinaryAssetsDirectory), then fall back to existing helper
GetFirstFoundEnvTestBinaryDir() or a directory-existence check for the current
hard-coded path (bin/k8s/1.30.0-<os>-<arch>), and only assign the hard-coded
path as a last resort; modify the code that sets BinaryAssetsDirectory to
perform conditional checks (os.Getenv("KUBEBUILDER_ASSETS"), os.Stat on
candidate dirs, and GetFirstFoundEnvTestBinaryDir()) so tests run correctly
outside the Makefile and when the version/path differs.

In `@python/packages/jumpstarter/jumpstarter/client/client_test.py`:
- Around line 120-124: The test's topological-order check only ensures
"grandparent" comes after "parent" but doesn't verify that "child" precedes
"parent"; update the assertions to explicitly assert that
build_order.index("child") < build_order.index("parent") (and keep/optionally
assert both "child" and "parent" are before "grandparent") so the test enforces
children-before-parents ordering; reference the existing build_order list and
the node names "child", "parent", and "grandparent" when adding the
assertion(s).

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 31-38: The docstring usage example in status_monitor.py
incorrectly calls client.start_status_monitor(); update the example to call the
actual method client.status_monitor_async() (or add a thin alias
start_status_monitor -> status_monitor_async in the client if you prefer) so the
docstring matches the real API; ensure the rest of the example still references
monitor.current_status and monitor.wait_for_status(ExporterStatus.LEASE_READY).

In `@python/packages/jumpstarter/jumpstarter/exporter/hooks.py`:
- Around line 426-500: The before-lease failure branch doesn't actually end the
lease for on_failure='endLease'; modify run_before_lease_hook to accept an async
lease-release callback (e.g., end_lease: Callable[[LeaseContext],
Awaitable[None]]) or reuse an existing controller-release method, and in the
HookExecutionError else-branch (the on_failure='endLease' path) invoke await
end_lease(lease_scope) (and still set lease_scope.before_lease_hook
event/unblock connections and report status) so the lease is cleanly terminated;
update all call sites of run_before_lease_hook to pass the new end_lease
callback (mirroring how run_after_lease_hook is wired).
♻️ Duplicate comments (2)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

80-90: Unused synchronous function.

This function is no longer called anywhere in the codebase since line 213 now uses _run_shell_with_lease_async. Consider removing it to reduce maintenance burden.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

397-403: Async context manager won't start monitoring.

The __aenter__ method returns self without calling start(). Direct usage as an async context manager (async with StatusMonitor(...) as monitor:) will silently fail to start background polling. Users must use client.status_monitor_async() instead. Either fix __aenter__ to properly start monitoring or remove the context manager methods to prevent misuse.

🧹 Nitpick comments (9)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (2)

157-161: Redundant double timeout.

Both anyio.move_on_after(300) and timeout=300.0 in wait_for_any_of set 5-minute timeouts. The inner timeout parameter makes the outer cancel scope redundant. Consider removing one:

♻️ Proposed fix
-                                        with anyio.move_on_after(300) as timeout_scope:  # 5 minute timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=300.0
-                                            )
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=300.0
+                                        )
+                                        if result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")

Then adjust the subsequent conditional logic accordingly, removing the timeout_scope.cancelled_caught check at line 171.


133-136: Consider checking connection_lost for beforeLease hook.

The afterLease handling (line 166) checks monitor.connection_lost to distinguish between timeout and connection loss. The beforeLease handling treats both cases as "Timeout" which could be misleading in logs.

♻️ Proposed enhancement
                             if result == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
                                 logger.warning("beforeLease hook failed")
                             elif result is None:
-                                logger.warning("Timeout waiting for beforeLease hook")
+                                if monitor.connection_lost:
+                                    logger.warning("Connection lost during beforeLease hook")
+                                else:
+                                    logger.warning("Timeout waiting for beforeLease hook")
python/packages/jumpstarter/jumpstarter/client/client_test.py (1)

58-58: Ensure ExitStack is closed after each test.

ExitStack() is never closed, which can leak cleanup callbacks if client_from_channel registers any. Prefer with ExitStack() as mock_stack: to guarantee teardown.

♻️ Example pattern (apply to each test)
-        mock_stack = ExitStack()
+        with ExitStack() as mock_stack:
             with patch(
                 "jumpstarter.client.client.MultipathExporterStub",
                 return_value=mock_stub
             ), patch(
                 "jumpstarter.client.client.import_class",
                 return_value=MockDriverClient
             ):
                 from jumpstarter.client.client import client_from_channel

                 client = await client_from_channel(
                     mock_channel, mock_portal, mock_stack, allow=[], unsafe=True
                 )

Also applies to: 98-98, 143-143, 177-177, 211-211, 245-245
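For background on why the with-statement matters here: ExitStack runs its registered callbacks only when it is closed, and entering it via with guarantees that. A minimal stdlib demonstration:

```python
from contextlib import ExitStack

closed = []
with ExitStack() as stack:
    # callback() registers teardown; the with-statement guarantees it runs
    # when the block exits, even if the body raises.
    stack.callback(closed.append, "cleanup ran")

print(closed)  # ['cleanup ran']
```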

python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (1)

16-32: Consider extracting shared mock utilities to a common test module.

MockAioRpcError and create_mock_rpc_error are duplicated here and in core_test.py (per the relevant snippets). Extracting these to a shared test utilities module (e.g., conftest.py or a dedicated testing.py) would reduce duplication.

python/packages/jumpstarter/jumpstarter/client/core.py (1)

332-351: Asymmetric behavior when monitor is not started.

wait_for_lease_ready_monitored raises RuntimeError if the monitor isn't started, while wait_for_hook_complete_monitored silently returns True. This asymmetry could be surprising. Consider documenting this behavior difference more explicitly or using consistent error handling.

python/packages/jumpstarter/jumpstarter/client/status_monitor.py (1)

309-309: Use %-style formatting in logger calls instead of f-strings.

F-strings in logging calls evaluate the string even when the log level is disabled, causing unnecessary overhead. Use %-style formatting for deferred evaluation.

Proposed fix
-                    logger.warning(f"Missed {missed} status transition(s)")
+                    logger.warning("Missed %d status transition(s)", missed)
-                    logger.info(f"Status changed: {old_status} -> {new_status} (version={new_version})")
+                    logger.info("Status changed: %s -> %s (version=%d)", old_status, new_status, new_version)
-                            logger.error(f"Status change callback error: {e}")
+                            logger.error("Status change callback error: %s", e)
-                logger.debug(f"GetStatus poll error: {e.code()}")
+                logger.debug("GetStatus poll error: %s", e.code())
-                logger.debug(f"GetStatus poll error: {e}")
+                logger.debug("GetStatus poll error: %s", e)

Also applies to: 320-320, 337-337, 352-354
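The overhead difference is observable with an argument whose string conversion has a side effect; this small stdlib sketch shows %-style deferring the conversion while the f-string pays for it even on a suppressed record:

```python
import logging

class Expensive:
    """Counts how many times it is rendered to a string."""
    def __init__(self) -> None:
        self.rendered = 0
    def __str__(self) -> str:
        self.rendered += 1
        return "payload"

logging.basicConfig(level=logging.WARNING)  # DEBUG records are suppressed
log = logging.getLogger("demo")
arg = Expensive()

log.debug("value: %s", arg)   # %-style: suppressed before formatting, __str__ never runs
print(arg.rendered)           # 0

log.debug(f"value: {arg}")    # f-string: built before the level check, __str__ already ran
print(arg.rendered)           # 1
```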

controller/internal/service/suite_test.go (1)

90-93: Guard teardown if setup failed.

If testEnv never initialized, Stop() can panic in AfterSuite. A simple nil check avoids cascading failures.

🛡️ Suggested guard
 	By("tearing down the test environment")
-	err := testEnv.Stop()
-	Expect(err).NotTo(HaveOccurred())
+	if testEnv != nil {
+		err := testEnv.Stop()
+		Expect(err).NotTo(HaveOccurred())
+	}
controller/Makefile (1)

66-75: Guard against duplicate metadata.annotations blocks.

Line 66–71 unconditionally inserts a new annotations: map. If the generated role ever includes annotations (now or later), this can create duplicate keys and potentially drop the sync-wave or other annotations depending on YAML parsing behavior. Consider merging into an existing annotations map or using a YAML-aware tool for the insert.

python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

341-356: Use anyio.to_thread.run_sync() with timeout for blocking process.wait() after process.kill().

The direct process.wait() call at line 354 blocks the event loop if the process doesn't exit promptly. The same block already demonstrates the correct pattern at line 346–347 with anyio.to_thread.run_sync() and anyio.move_on_after(). Apply the same pattern here for consistency and robustness:

Safer non‑blocking wait
                 if process.poll() is None:
                     process.kill()
                     try:
-                        process.wait()
+                        with anyio.move_on_after(5):
+                            await anyio.to_thread.run_sync(process.wait)
                     except Exception:
                         pass

Comment on lines +19 to +24
import (
"context"
"fmt"
"path/filepath"
"runtime"
"testing"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the suite_test.go file to see the current implementation
find . -name "suite_test.go" -path "*/controller/internal/service/*"

Repository: jumpstarter-dev/jumpstarter

Length of output: 114


🏁 Script executed:

# Search for the file more broadly
find . -name "suite_test.go" | grep -E "(controller|service)"

Repository: jumpstarter-dev/jumpstarter

Length of output: 321


🏁 Script executed:

# List files in the controller directory
git ls-files | grep -E "controller.*service.*test" | head -20

Repository: jumpstarter-dev/jumpstarter

Length of output: 234


🏁 Script executed:

cat -n ./controller/internal/service/suite_test.go

Repository: jumpstarter-dev/jumpstarter

Length of output: 3574


🌐 Web query:

controller-runtime envtest BinaryAssetsDirectory KUBEBUILDER_ASSETS best practices

💡 Result:

Best practices (controller-runtime envtest) for BinaryAssetsDirectory / KUBEBUILDER_ASSETS

  1. Prefer driving asset selection from your build/test harness (Makefile/CI) via KUBEBUILDER_ASSETS, not hardcoding paths in Go.
    Kubebuilder’s recommended pattern is to have setup-envtest pick the right binaries and export the directory for the test run, e.g. KUBEBUILDER_ASSETS="$(setup-envtest use $(ENVTEST_K8S_VERSION) ... -p path)" go test ./.... This keeps tests reproducible across dev machines and CI. [2]

  2. Use BinaryAssetsDirectory only when you explicitly want to control/share the on-disk cache location from Go code.
    Controller-runtime provides SetupEnvtestDefaultBinaryAssetsDirectory() specifically so you can point Environment.BinaryAssetsDirectory at the same OS-specific directory that setup-envtest uses—so runs can share binaries—without hardcoding OS paths. [1]

  3. Avoid relying on the historical default /usr/local/kubebuilder/bin.
    Envtest loads binaries from /usr/local/kubebuilder/bin by default, but best practice is to override via KUBEBUILDER_ASSETS (or explicit TEST_ASSET_* vars) for portability and hermetic CI. [1]

  4. For offline/air-gapped or “no network during tests”, use installed-only mode.
Use setup-envtest ... -i (or ENVTEST_INSTALLED_ONLY=true) so it won't try to download binaries during go test. If you already manage binaries elsewhere, --use-env -i (or ENVTEST_USE_ENV=true) will respect an externally provided KUBEBUILDER_ASSETS. [2][4]

  5. Ensure you’re using the current envtest artifact index/location (don’t depend on the old GCS bucket).
The old storage.googleapis.com/kubebuilder-tools artifacts are deprecated; newer versions are promoted via controller-tools' envtest-releases.yaml, and you need setup-envtest from controller-runtime v0.19.0+ to transparently use the new location. [3]


🏁 Script executed:

# Check if other test suites in the repo use conditional BinaryAssetsDirectory
rg "BinaryAssetsDirectory" -A 3 -B 1 controller/

Repository: jumpstarter-dev/jumpstarter

Length of output: 3875


🏁 Script executed:

# Check if KUBEBUILDER_ASSETS is used anywhere in the codebase
rg "KUBEBUILDER_ASSETS" .

Repository: jumpstarter-dev/jumpstarter

Length of output: 613


🏁 Script executed:

# Examine the conditional approach used in other test files
cat controller/deploy/operator/internal/controller/testutils/envtest.go | head -50

Repository: jumpstarter-dev/jumpstarter

Length of output: 1885


Use a conditional check for BinaryAssetsDirectory instead of hard-coding a fixed path.

The current code unconditionally sets BinaryAssetsDirectory to bin/k8s/1.30.0-<os>-<arch>, which will fail when:

  • Running tests via IDE or direct go test invocation (outside the Makefile that sets KUBEBUILDER_ASSETS)
  • Binaries haven't been downloaded to that specific path
  • The Kubernetes version is upgraded

Other test suites in the codebase already use conditional logic (e.g., controller/deploy/operator/ uses GetFirstFoundEnvTestBinaryDir()). Consider making this check conditional on directory existence, or respecting the KUBEBUILDER_ASSETS environment variable as the primary source.

🔧 Suggested change (conditional BinaryAssetsDirectory)
 import (
 	"context"
 	"fmt"
+	"os"
 	"path/filepath"
 	"runtime"
 	"testing"
@@
 	By("bootstrapping test environment")
 	testEnv = &envtest.Environment{
 		CRDDirectoryPaths:     []string{filepath.Join("..", "..", "deploy", "helm", "jumpstarter", "crds")},
 		ErrorIfCRDPathMissing: true,
-
-		// The BinaryAssetsDirectory is only required if you want to run the tests directly
-		// without calling the makefile target test. If not informed it will look for the
-		// default path defined in controller-runtime which is /usr/local/kubebuilder/.
-		// Note that you must have the required binaries setup under the bin directory to perform
-		// the tests directly. When we run make test it will be setup and used automatically.
-		BinaryAssetsDirectory: filepath.Join("..", "..", "bin", "k8s",
-			fmt.Sprintf("1.30.0-%s-%s", runtime.GOOS, runtime.GOARCH)),
 	}
+
+	// Only pin envtest binaries if the expected directory exists
+	binPath := filepath.Join("..", "..", "bin", "k8s",
+		fmt.Sprintf("1.30.0-%s-%s", runtime.GOOS, runtime.GOARCH))
+	if _, err := os.Stat(binPath); err == nil {
+		testEnv.BinaryAssetsDirectory = binPath
+	}
🤖 Prompt for AI Agents
In `@controller/internal/service/suite_test.go` around lines 19 - 24, Update the
test setup in suite_test.go to avoid hard-coding BinaryAssetsDirectory: first
respect the KUBEBUILDER_ASSETS environment variable if set (use it to set
BinaryAssetsDirectory), then fall back to existing helper
GetFirstFoundEnvTestBinaryDir() or a directory-existence check for the current
hard-coded path (bin/k8s/1.30.0-<os>-<arch>), and only assign the hard-coded
path as a last resort; modify the code that sets BinaryAssetsDirectory to
perform conditional checks (os.Getenv("KUBEBUILDER_ASSETS"), os.Stat on
candidate dirs, and GetFirstFoundEnvTestBinaryDir()) so tests run correctly
outside the Makefile and when the version/path differs.

Comment on lines +29 to +31
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/kubernetes/scheme"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the suite_test.go file and read it
find . -name "suite_test.go" -path "*/controller/internal/service/*" | head -5

Repository: jumpstarter-dev/jumpstarter

Length of output: 114


🏁 Script executed:

# Read the suite_test.go file to see the code at the mentioned lines
cat -n ./controller/internal/service/suite_test.go | head -100

Repository: jumpstarter-dev/jumpstarter

Length of output: 3574


🏁 Script executed:

# Verify apierrors package exists and IsAlreadyExists is the correct function
web_search "k8s.io/apimachinery/pkg/api/errors IsAlreadyExists golang"

Repository: jumpstarter-dev/jumpstarter

Length of output: 121


🌐 Web query:

k8s.io/apimachinery pkg api errors IsAlreadyExists usage golang kubernetes

💡 Result:

In Go, k8s.io/apimachinery/pkg/api/errors.IsAlreadyExists(err) is the standard way to check whether a Kubernetes API call failed because the object you tried to create already exists. It returns true for “already exists” API StatusErrors (and also supports wrapped errors). [1]

Typical usage (client-go)

import (
  apierrors "k8s.io/apimachinery/pkg/api/errors"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  corev1 "k8s.io/api/core/v1"
)

cm := &corev1.ConfigMap{
  ObjectMeta: metav1.ObjectMeta{
    Name: "demo",
    Namespace: "default",
  },
  Data: map[string]string{"k": "v"},
}

_, err := clientset.CoreV1().ConfigMaps("default").Create(ctx, cm, metav1.CreateOptions{})
if err != nil {
  if apierrors.IsAlreadyExists(err) {
    // idempotent create: ignore or fetch/update instead
    return nil
  }
  return err
}
return nil

Notes

  • Use it after Create(...) (or any call that can return an “already exists” response). [1]
  • The function signature is func IsAlreadyExists(err error) bool. [1]

Sources:
[1] pkg.go.dev: k8s.io/apimachinery/pkg/api/errors (IsAlreadyExists) (pkg.go.dev)


Don't ignore namespace creation errors.

Ignoring the error can mask real failures (RBAC, invalid config, etc.). Handle AlreadyExists explicitly and fail on other errors.

✅ Suggested fix
 import (
 	corev1 "k8s.io/api/core/v1"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 	"k8s.io/client-go/kubernetes/scheme"
 	ns := &corev1.Namespace{
 		ObjectMeta: metav1.ObjectMeta{
 			Name: "default",
 		},
 	}
-	_ = k8sClient.Create(ctx, ns)
+	if err := k8sClient.Create(ctx, ns); err != nil && !apierrors.IsAlreadyExists(err) {
+		Expect(err).NotTo(HaveOccurred())
+	}

Also applies to: 81-87

🤖 Prompt for AI Agents
In `@controller/internal/service/suite_test.go` around lines 29 - 31, The test
currently ignores errors when creating a Namespace; change the Namespace
creation logic (the call that creates a corev1.Namespace object, e.g.,
client.CoreV1().Namespaces().Create or similar in the test setup) to explicitly
handle errors: import k8s.io/apimachinery/pkg/api/errors as apierrors, then
after the Create call check if err != nil and if apierrors.IsAlreadyExists(err)
continue/ignore, otherwise fail the test (t.Fatalf or return the error). Apply
the same pattern for the other namespace-creation site referenced in the comment
(the second block around lines 81-87) so only AlreadyExists is tolerated and all
other errors cause the test to fail.

Comment on lines +120 to +124
# Children should be built before parents in topological order
# The order should be child, parent, grandparent (or parent, child, grandparent)
assert "grandparent" in build_order
assert build_order.index("grandparent") > build_order.index("parent")


⚠️ Potential issue | 🟡 Minor

Strengthen the topological-order assertion.

The test claims “children before parents” but only checks that grandparent is after parent. It never asserts child is before parent, and the comment even allows the opposite order.

✅ Suggested assertion update
-        # The order should be child, parent, grandparent (or parent, child, grandparent)
-        assert "grandparent" in build_order
-        assert build_order.index("grandparent") > build_order.index("parent")
+        # The order should be child -> parent -> grandparent
+        assert "grandparent" in build_order
+        assert build_order.index("parent") > build_order.index("child")
+        assert build_order.index("grandparent") > build_order.index("parent")
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/client_test.py` around lines
120 - 124, The test's topological-order check only ensures "grandparent" comes
after "parent" but doesn't verify that "child" precedes "parent"; update the
assertions to explicitly assert that build_order.index("child") <
build_order.index("parent") (and keep/optionally assert both "child" and
"parent" are before "grandparent") so the test enforces children-before-parents
ordering; reference the existing build_order list and the node names "child",
"parent", and "grandparent" when adding the assertion(s).

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@controller/Makefile`:
- Line 74: The Makefile currently runs the cp command copying CRDs to
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/, which
will fail if that target directory doesn't exist; update the Makefile to create
the directory first by running a mkdir -p for
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
immediately before the cp command so the cp in the Makefile succeeds even when
the templates/crds directory is missing.
🧹 Nitpick comments (1)
controller/Makefile (1)

66-71: Idempotency risk: duplicate annotations: keys if file already contains annotations.

The AWK script unconditionally inserts an annotations: block after the name: line. If controller-gen ever outputs annotations in the generated role.yaml, or if the manifests target runs without a fresh regeneration, this will produce invalid YAML with duplicate annotations: keys.

Consider using a more robust approach that checks for existing annotations before inserting:

♻️ Suggested safer AWK approach
 	# Add ArgoCD sync-wave annotation to RBAC role for proper deployment ordering (PR `#207`)
-	`@awk` '/^  name: jumpstarter-manager-role$$/{print; print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; next}1' \
+	`@awk` ' \
+		/^  name: jumpstarter-manager-role$$/ { print; found_name=1; next } \
+		found_name && /^  annotations:/ { print; found_annotations=1; next } \
+		found_name && found_annotations && /^  [a-zA-Z]/ && !/^    / { \
+			print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0; found_annotations=0 \
+		} \
+		found_name && !found_annotations && /^  [a-zA-Z]/ { \
+			print "  annotations:"; print "    argocd.argoproj.io/sync-wave: \"-1\""; found_name=0 \
+		} \
+		{ print }' \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml > \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp && \
 		mv deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml.tmp \
 		deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml

Alternatively, consider using a tool like yq for safer YAML manipulation.

Also, the comment references "PR #207" but this appears to be PR #140.

deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/rbac/role.yaml

cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if the target directory exists in the repository
find deploy/helm/jumpstarter/charts/jumpstarter-controller/templates -type d -name "crds" 2>/dev/null

# Check if it's tracked in git at all
git ls-tree -r HEAD deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds 2>/dev/null || echo "No files found in git"

# List what's in templates directory
echo "=== Contents of templates directory ==="
ls -la deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/ 2>/dev/null || echo "Directory does not exist"

# Check the Makefile around line 74 to see context
echo "=== Makefile context around line 74 ==="
sed -n '70,80p' controller/Makefile

Repository: jumpstarter-dev/jumpstarter

Length of output: 764


🏁 Script executed:

#!/bin/bash
# Check if deploy/helm/jumpstarter/charts directory exists
echo "=== Checking helm chart directory structure ==="
ls -la deploy/helm/jumpstarter/charts/ 2>/dev/null || echo "charts directory does not exist"

# Search Makefile for any mkdir commands
echo "=== Searching for mkdir in Makefile ==="
grep -n "mkdir" controller/Makefile || echo "No mkdir found"

# Check if there's a target that creates the templates directory
echo "=== Search for directory creation patterns ==="
grep -n "templates" controller/Makefile

# Check the generate-crds or similar target
echo "=== Full Makefile targets related to CRD or helm ==="
grep -n "^[a-zA-Z].*:" controller/Makefile | grep -E "(crd|helm|generate)"

Repository: jumpstarter-dev/jumpstarter

Length of output: 1423


🏁 Script executed:

#!/bin/bash
# Get the full target definition that contains line 74
echo "=== Full target containing the cp command ==="
sed -n '60,80p' controller/Makefile

# Check if there's an earlier step that might create the directory
echo "=== Check for any manifests or controller-gen target ==="
sed -n '1,100p' controller/Makefile | grep -A 20 "^manifests:"

# Look for the controller-gen command
echo "=== Look for controller-gen configuration ==="
grep -B 5 -A 10 "controller-gen" controller/Makefile | head -40

Repository: jumpstarter-dev/jumpstarter

Length of output: 4959


🏁 Script executed:

#!/bin/bash
# Check if there are any .gitkeep or placeholder files in the charts/templates directories
echo "=== Check for placeholder files ==="
find deploy/helm/jumpstarter/charts -type f 2>/dev/null | head -20

# Check the full directory structure of deploy/helm
echo "=== Full helm structure ==="
find deploy/helm -type d 2>/dev/null | sort

# Check if crds directory exists at all
echo "=== Search for crds directories ==="
find deploy -type d -name "crds" 2>/dev/null

Repository: jumpstarter-dev/jumpstarter

Length of output: 170


🏁 Script executed:

#!/bin/bash
# Check what exists in deploy directory
echo "=== Full deploy directory structure ==="
find deploy -type d 2>/dev/null | head -30

# Check if operator directory structure exists
echo "=== Check operator crd/bases directory ==="
ls -la deploy/operator/config/crd/bases/ 2>/dev/null || echo "Directory does not exist"

# Check if there are any crds in the source
echo "=== Check source crds ==="
ls -la deploy/helm/jumpstarter/crds/ 2>/dev/null || echo "CRDs source directory does not exist"

Repository: jumpstarter-dev/jumpstarter

Length of output: 241


🏁 Script executed:

#!/bin/bash
# List everything in deploy directory
echo "=== All files in deploy/ recursively ==="
find deploy -type f 2>/dev/null | head -50

# Check if deploy directory itself exists
echo "=== Deploy directory contents ==="
ls -la deploy/ 2>/dev/null

# Count total files in entire repo
echo "=== Repository file count and structure ==="
find . -type f -not -path './.git/*' | wc -l
echo "Total directories:"
find . -type d -not -path './.git/*' | wc -l

Repository: jumpstarter-dev/jumpstarter

Length of output: 216


🏁 Script executed:

#!/bin/bash
# Check git status of deploy directory
echo "=== Check if deploy is tracked in git ==="
git ls-tree -d HEAD deploy 2>/dev/null || echo "deploy directory not in git"

# List top-level directories
echo "=== Top-level directories ==="
ls -la | grep '^d'

# Check if there's a .gitkeep or similar in deploy
echo "=== Hidden files in deploy ==="
ls -la deploy/ 2>/dev/null

# Check recent commits to understand if deploy is supposed to exist
echo "=== Check if deploy paths were ever added ==="
git log --all --full-history --oneline -- "deploy/helm/*" "deploy/operator/*" 2>/dev/null | head -5

Repository: jumpstarter-dev/jumpstarter

Length of output: 676


🏁 Script executed:

#!/bin/bash
# Check if deploy directory exists at all
echo "=== Does deploy directory exist? ==="
test -d deploy && echo "Yes" || echo "No"

# List actual path with stat
echo "=== Using stat on deploy ==="
stat deploy 2>&1 || echo "deploy does not exist"

# Check the controller Makefile more carefully for context
echo "=== Check lines around the change (60-80) ==="
sed -n '60,80p' controller/Makefile

# Check if there's documentation about running manifests
echo "=== Check README or setup docs ==="
ls -la *.md controller/*.md 2>/dev/null | head -10

Repository: jumpstarter-dev/jumpstarter

Length of output: 1817


🏁 Script executed:

#!/bin/bash
# Check what controller-gen does - look at how it's invoked and if it creates directories
echo "=== Check controller version and documentation ==="
grep -r "controller-gen" controller/ | grep -v ".o\|Binary" | head -5

# Check if there's documentation about directory creation
echo "=== Check if there are any docs about the build process ==="
cat README.md | head -50

# Check controller/README.md
echo "=== Check controller README ==="
cat controller/README.md

Repository: jumpstarter-dev/jumpstarter

Length of output: 5972


Add directory creation before copying CRDs to the Helm chart templates.

The cp command on line 74 will fail because deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/ doesn't exist. While controller-gen creates the output directories it writes to, it doesn't create the templates/crds/ subdirectory. Add mkdir -p before the copy.

♻️ Suggested fix
 	cp deploy/helm/jumpstarter/crds/* deploy/operator/config/crd/bases/
+	mkdir -p deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
 	cp deploy/helm/jumpstarter/crds/* deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
🤖 Prompt for AI Agents
In `@controller/Makefile` at line 74, The Makefile currently runs the cp command
copying CRDs to
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/, which
will fail if that target directory doesn't exist; update the Makefile to create
the directory first by running a mkdir -p for
deploy/helm/jumpstarter/charts/jumpstarter-controller/templates/crds/
immediately before the cp command so the cp in the Makefile succeeds even when
the templates/crds directory is missing.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter-cli/jumpstarter_cli/shell.py`:
- Around line 125-143: The code currently launches the shell unconditionally
after monitor.wait_for_any_of; change it to treat
ExporterStatus.BEFORE_LEASE_HOOK_FAILED and a None (timeout) as hard stops by
not calling _run_shell_only. After the wait_for_any_of call
(monitor.wait_for_any_of), check result and if it equals
ExporterStatus.BEFORE_LEASE_HOOK_FAILED or is None, log the warning and exit the
function (or raise an appropriate exception) instead of proceeding to call
anyio.to_thread.run_sync(_run_shell_only, lease, config, command, path); only
call _run_shell_only when result == ExporterStatus.LEASE_READY.
♻️ Duplicate comments (5)
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (4)

202-204: Controller channels are not closed after use.

This creates a new gRPC channel per call but never closes it. Multiple calls during status updates and stream retries will leak channels. The pattern in _unregister_with_controller() (lines 357-369) demonstrates proper cleanup with explicit await channel.close().


414-458: Task may hang when lease ends without EndSession.

This task only waits on end_session_requested (line 426). If the lease ends without the client calling EndSession, the task continues waiting indefinitely on the old lease context. Consider racing against lease_ended or using a task group scoped to the lease lifetime.


546-553: Listen task should be scoped to the lease lifetime.

The _retry_stream task for Listen is started on the outer tg (line 548), causing it to continue running after handle_lease exits. This results in the task persisting across lease transitions. Move this to conn_tg so it's automatically cancelled when the lease ends, and close listen_tx in the finally block to signal proper stream termination.


578-579: Consider scoping _handle_end_session to the lease task group.

Starting this task on the outer tg means it survives the lease ending. If combined with the fact that it only waits on end_session_requested, the task can be orphaned. This is related to the hang issue noted in the _handle_end_session method.

python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

80-89: Consider removing _run_shell_with_lease if it’s now unused.

The async path appears to be the only entry point, so this sync helper likely adds maintenance overhead.

#!/bin/bash
# Verify remaining usages of the sync helper
rg -nP '\b_run_shell_with_lease\s*\(' --type py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 8-18: The import block is unsorted per Ruff I001; reorder imports
into standard-library first (logging, collections.abc.Awaitable,
collections.abc.Callable, contextlib.AsyncExitStack), then third-party packages
(anyio, anyio.Event, anyio.create_task_group, grpc.StatusCode,
grpc.aio.AioRpcError, jumpstarter_protocol.jumpstarter_pb2), and finally
local/project imports (jumpstarter.common.ExporterStatus) so the import order
satisfies the linter and keeps the existing imported symbols intact.
- Around line 196-201: The timeout checks in wait_for_status() and
wait_for_any_of() incorrectly use "if timeout:" which treats 0 as falsy; change
both checks to "if timeout is not None:" so a timeout of 0 is handled as an
immediate timeout; update the branches around anyio.move_on_after in
wait_for_status() and wait_for_any_of() (the blocks that call wait_loop() and
return False) to use the new condition.
- Around line 396-400: stop() currently sets _running = False and signals
_stop_event but does not wake waiters blocked on _any_change_event; update
stop() to also set _any_change_event (and optionally _any_change_event.set()
followed by _any_change_event.clear() if you want a one-shot wake) so waiters in
wait_for_status and wait_for_any_of that are awaiting _any_change_event.wait()
will wake and observe _running == False and exit their loops; reference the stop
method and the _any_change_event, wait_for_status, and wait_for_any_of functions
when applying this change.

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 168-191: The stop() method currently ignores the caller's
should_unregister when wait_for_lease_exit is True, causing later calls (e.g.,
in serve()) to override intent; update stop() so that whenever a deferred stop
is requested (the branch setting self._stop_requested = True) you also set
self._unregister = should_unregister (and likewise when a stop has already been
requested you should update/preserve self._unregister accordingly) so the
original caller preference is retained; refer to the stop() method, the
attributes self._unregister, self._stop_requested, and self._tg.cancel_scope to
implement this change.
🧹 Nitpick comments (1)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

147-162: Remove redundant timeout wrapper and improve timeout detection logic.

The StatusMonitor.connection_lost property is valid and well-tested. However, the outer anyio.move_on_after(30) is redundant—wait_for_any_of() already has an identical timeout=30.0 parameter. When both are equal, the inner timeout fires first, making the outer scope's cancelled_caught flag never True. This means timeout cases currently fall through to "Hook completion not confirmed" rather than the timeout warning.

The suggested refactor properly addresses this by removing the outer timeout wrapper and using result is None to detect timeouts, which disambiguates between connection loss and timeout.

♻️ Suggested simplification
-                                        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=30.0
-                                            )
-                                            if result == ExporterStatus.AVAILABLE:
-                                                logger.info("afterLease hook completed")
-                                            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                                                logger.warning("afterLease hook failed")
-                                            elif monitor.connection_lost:
-                                                # Connection lost - exporter closed, hook likely completed
-                                                logger.info("Exporter connection closed (hook completed)")
-                                            else:
-                                                logger.debug("Hook completion not confirmed")
-                                        if timeout_scope.cancelled_caught:
-                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=30.0
+                                        )
+                                        if result == ExporterStatus.AVAILABLE:
+                                            logger.info("afterLease hook completed")
+                                        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+                                            logger.warning("afterLease hook failed")
+                                        elif result is None and monitor.connection_lost:
+                                            # Connection lost - exporter closed, hook likely completed
+                                            logger.info("Exporter connection closed (hook completed)")
+                                        elif result is None:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        else:
+                                            logger.debug("Hook completion not confirmed")

Comment on lines +196 to +204
if timeout:
with anyio.move_on_after(timeout):
return await wait_loop()
return False
else:
return await wait_loop()

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -name "status_monitor.py" -type f 2>/dev/null | head -5

Repository: jumpstarter-dev/jumpstarter

Length of output: 53


🏁 Script executed:

# First, let's find the file and examine the relevant lines
find . -name "status_monitor.py" -type f 2>/dev/null

Repository: jumpstarter-dev/jumpstarter

Length of output: 137


🏁 Script executed:

# If found, read the file to see context around the mentioned lines
cat -n python/packages/jumpstarter/jumpstarter/client/status_monitor.py 2>/dev/null | sed -n '190,210p'



🏁 Script executed:

# Also check the second location (lines 280-285)
cat -n python/packages/jumpstarter/jumpstarter/client/status_monitor.py 2>/dev/null | sed -n '275,295p'



🏁 Script executed:

# Search for function calls to wait_for_status and wait_for_any_of
rg "wait_for_status\(|wait_for_any_of\(" --type=py -A 3 -B 1



🏁 Script executed:

# Search for timeout=0 patterns
rg "timeout\s*=\s*0" --type=py -B 2 -A 2



🏁 Script executed:

# Check if there's any documentation or usage of anyio.move_on_after in the codebase
rg "anyio.move_on_after|move_on_after" --type=py -B 1 -A 1



🏁 Script executed:

# Check the function definitions to understand the full context
cat -n python/packages/jumpstarter/jumpstarter/client/status_monitor.py | sed -n '155,210p'



🏁 Script executed:

# Look at the second function to confirm it's identical pattern
cat -n python/packages/jumpstarter/jumpstarter/client/status_monitor.py | sed -n '245,290p'



Use if timeout is not None: to correctly handle timeout=0.

The condition if timeout: treats 0 as falsy, so a request for an immediate timeout (timeout=0) instead waits indefinitely. This affects both the wait_for_status() and wait_for_any_of() methods, because the timeout parameter accepts float | None, where 0 is a valid value distinct from None.

Affected locations
  • Lines 196-201 in wait_for_status()
  • Lines 280-285 in wait_for_any_of()
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py` around
lines 196 - 201, The timeout checks in wait_for_status() and wait_for_any_of()
incorrectly use "if timeout:" which treats 0 as falsy; change both checks to "if
timeout is not None:" so a timeout of 0 is handled as an immediate timeout;
update the branches around anyio.move_on_after in wait_for_status() and
wait_for_any_of() (the blocks that call wait_loop() and return False) to use the
new condition.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/lease.py`:
- Around line 237-257: The fixed-count retry loop around self.controller.Dial
(using max_retries/base_delay/max_delay) should be replaced with a time‑based
retry bounded by a configurable timeout (use the existing acquisition_timeout if
available or add a new dial_timeout attribute/parameter) so Dial keeps retrying
until time.monotonic() exceeds the deadline; compute deadline = now + timeout,
then loop attempting await self.controller.Dial(...), on AioRpcError with
FAILED_PRECONDITION and "not ready" apply the same exponential/backoff delay
(min(base_delay*2**attempt, max_delay)) but stop retrying and re-raise once
monotonic() > deadline; update the retry log to show remaining time or attempts
left and ensure the new timeout is plumbed through the Lease acquisition
call/site that calls this Dial logic.
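A deadline-based retry of the kind this prompt describes might look like the following sketch. The names (operation, retry_with_deadline) and the use of ConnectionError are stand-ins for the actual Dial call and gRPC error type, which are assumptions here:

```python
import time

def retry_with_deadline(operation, timeout=30.0, base_delay=0.5, max_delay=5.0):
    """Time-bounded retry with exponential backoff.

    Illustrative sketch only: `operation` stands in for the Dial call.
    """
    deadline = time.monotonic() + timeout
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Stop retrying once the next backoff sleep would overshoot
            # the deadline, and re-raise the last error to the caller.
            if time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
            attempt += 1
```

Unlike a fixed max_retries count, the total wall-clock time spent retrying is bounded regardless of how the individual delays accumulate.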

In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 343-347: In _poll_loop (status_monitor.py) the AioRpcError
UNIMPLEMENTED branch currently breaks without clearing _running or notifying
waiters; update that branch to mirror the UNAVAILABLE handling by setting
self._running = False and calling self._any_change_event.set() (and any other
cleanup done in the UNAVAILABLE branch) before breaking so wait_for_status() and
wait_for_any_of() don't hang; also add a unit test that triggers UNIMPLEMENTED
while a waiter (using wait_for_status() or wait_for_any_of()) is active and
assert the waiter returns promptly.
♻️ Duplicate comments (3)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (3)

8-18: Reorder imports to satisfy Ruff (I001).

Current ordering may still fail Ruff I001.

🔧 Proposed import order
-import logging
-from collections.abc import Awaitable, Callable
-from contextlib import AsyncExitStack
+from collections.abc import Awaitable, Callable
+from contextlib import AsyncExitStack
+import logging

Please verify with:

#!/bin/bash
ruff check python/packages/jumpstarter/jumpstarter/client/status_monitor.py

Based on learnings, Ruff is the formatter/linter for Python files.


196-201: Handle timeout=0 explicitly.

if timeout: treats 0 as falsy, so an immediate timeout becomes an infinite wait (Line 196 and Line 280).

🐛 Proposed fix
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_loop()
             return False
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_for_first()
             return None

Optional verification:

#!/bin/bash
rg -n "wait_for_status|wait_for_any_of|timeout\s*=\s*0" --type=py

Also applies to: 280-285


396-400: Wake waiting coroutines on stop().

stop() doesn’t signal _any_change_event, so wait_for_* can hang after shutdown.

🛠️ Proposed fix
     async def stop(self):
         """Stop the background polling task."""
         self._running = False
         self._stop_event.set()
+        # Wake any waiters so they can observe _running=False
+        self._any_change_event.set()
+        self._any_change_event = Event()

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@e2e/dex.values.yaml`:
- Around line 37-40: The userID value for the test user "test-exporter-hooks" is
not a valid UUID because it contains the non-hex character 'g' (value:
c6ed6f40-6689-6g7e-c64c-55d0a9a5871f); replace this with a valid UUID (only 0-9
and a-f), e.g., generate a new v4 UUID and set userID to that new value so the
entry for username "test-exporter-hooks" and its userID field are valid.
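The invalid value is easy to confirm: Python's uuid module rejects any non-hex digit, which is exactly why the 'g' in the third group fails, and a freshly generated v4 UUID is a safe replacement:

```python
import uuid

# The reported userID contains 'g', which is not a hex digit.
bad = "c6ed6f40-6689-6g7e-c64c-55d0a9a5871f"
try:
    uuid.UUID(bad)
    valid = True
except ValueError:
    valid = False
assert not valid  # confirmed invalid

# Any freshly generated v4 UUID parses cleanly and is a valid replacement.
replacement = str(uuid.uuid4())
assert uuid.UUID(replacement)
```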
🧹 Nitpick comments (5)
e2e/tests-hooks.bats (5)

9-10: Unused variable declaration.

CURRENT_HOOKS_CONFIG is declared but never referenced anywhere in the file. Consider removing it to avoid confusion.

Suggested fix
-# Track which config is currently active
-CURRENT_HOOKS_CONFIG=""

51-54: Consider graceful termination before SIGKILL.

Using kill -9 immediately doesn't give the process a chance to clean up. Consider sending SIGTERM first with a brief grace period before escalating to SIGKILL.

Suggested improvement
         if ps -p "$pid" > /dev/null 2>&1; then
           echo "  Killing PID $pid" >&2
-          kill -9 "$pid" 2>/dev/null || true
+          kill -TERM "$pid" 2>/dev/null || true
+          sleep 0.5
+          ps -p "$pid" > /dev/null 2>&1 && kill -9 "$pid" 2>/dev/null || true
         fi

104-110: Fixed sleep may introduce flakiness.

The 2-second sleep on line 107 is intended to avoid catching pre-connect state, but this is timing-dependent. If the exporter connects faster or slower than expected, tests could become flaky. Consider polling for a specific pre-condition instead, or documenting why 2 seconds is sufficient.


205-208: Add safeguards for empty grep results.

If the grep commands fail to find matches (e.g., due to unexpected output format), before_pos or after_pos could be empty, causing the comparison on line 208 to fail with a shell error. Consider adding validation or using || true with a fallback.

Suggested improvement
   # Verify order: BEFORE should appear before AFTER in output
   local before_pos=$(echo "$output" | grep -n "BEFORE_HOOK:" | head -1 | cut -d: -f1)
   local after_pos=$(echo "$output" | grep -n "AFTER_HOOK:" | head -1 | cut -d: -f1)

+  # Ensure both markers were found
+  [ -n "$before_pos" ] && [ -n "$after_pos" ]
   [ "$before_pos" -lt "$after_pos" ]

315-333: Tests lack verification of actual j command effects.

Tests E1 and E2 only verify that the hook scripts complete (via output markers), but don't verify that j power on/off commands actually executed or had the expected effect. Consider adding assertions that verify the power state changed, or at minimum checking for command output in the hook logs.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py`:
- Around line 358-365: The UNAVAILABLE branch breaks the poll loop but doesn't
update the monitor's running state, so set self._running = False when handling
StatusCode.UNAVAILABLE; specifically, inside the elif e.code() ==
StatusCode.UNAVAILABLE block update self._running to False alongside setting
self._connection_lost, trigger and recreate self._any_change_event as shown, and
ensure start() can restart the monitor after this exit.
♻️ Duplicate comments (2)
python/packages/jumpstarter/jumpstarter/client/status_monitor.py (2)

199-204: Handle timeout=0 explicitly.

if timeout: treats 0 as falsy, so timeout=0 waits indefinitely instead of timing out immediately.

🛠️ Proposed fix
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_loop()
             return False
-        if timeout:
+        if timeout is not None:
             with anyio.move_on_after(timeout):
                 return await wait_for_first()
             return None

Also applies to: 283-286


416-420: Wake waiters on stop to avoid hangs.

stop() only signals _stop_event, but waiters block on _any_change_event, so they can hang indefinitely.

🛠️ Proposed fix
     async def stop(self):
         """Stop the background polling task."""
         self._running = False
         self._stop_event.set()
+        # Wake waiters so they can observe _running=False
+        self._any_change_event.set()
+        self._any_change_event = Event()

Comment on lines +358 to +365
elif e.code() == StatusCode.UNAVAILABLE:
# Connection lost - exporter closed or restarted
logger.info("Connection lost (UNAVAILABLE), signaling waiters")
self._connection_lost = True
# Fire the change event to wake up any waiters
self._any_change_event.set()
self._any_change_event = Event()
break

⚠️ Potential issue | 🟡 Minor

Mark monitor as not running on UNAVAILABLE exit.

The poll loop breaks on UNAVAILABLE but leaves _running=True, preventing restart via start() and misrepresenting state.

🛠️ Proposed fix
                 elif e.code() == StatusCode.UNAVAILABLE:
                     # Connection lost - exporter closed or restarted
                     logger.info("Connection lost (UNAVAILABLE), signaling waiters")
                     self._connection_lost = True
+                    self._running = False
                     # Fire the change event to wake up any waiters
                     self._any_change_event.set()
                     self._any_change_event = Event()
                     break
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/status_monitor.py` around
lines 358 - 365, The UNAVAILABLE branch breaks the poll loop but doesn't update
the monitor's running state, so set self._running = False when handling
StatusCode.UNAVAILABLE; specifically, inside the elif e.code() ==
StatusCode.UNAVAILABLE block update self._running to False alongside setting
self._connection_lost, trigger and recreate self._any_change_event as shown, and
ensure start() can restart the monitor after this exit.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/packages/jumpstarter/jumpstarter/config/client.py (1)

162-177: Thread include_status through to the ListExporters request.

The include_status parameter is only used to set a display flag on the response (line 176), but it's never passed to ClientService.ListExporters. If the server requires this flag in the request to include status data, status information won't be fetched. Update ClientService.ListExporters to accept and propagate include_status to the protobuf request.

Suggested update
-exporters_response = await svc.ListExporters(page_size=page_size, page_token=page_token, filter=filter)
+exporters_response = await svc.ListExporters(
+    page_size=page_size,
+    page_token=page_token,
+    filter=filter,
+    include_status=include_status,
+)

- Add release_lease field to ReportStatusRequest to allow exporters to
  signal the controller to release the active lease after hooks complete
- Add status_version and previous_status fields to GetStatusResponse for
  tracking status transitions

The make manifests target regenerates RBAC role.yaml using controller-gen,
which strips manually-added annotations. This adds a post-processing step
using awk to re-add the ArgoCD sync-wave annotation that was added in PR #207
to ensure proper deployment ordering.

Updates the operator install.yaml to include the new ExporterStatus and
StatusMessage fields added to the Exporter CRD, along with additional
printer columns for better kubectl output.

This ensures the operator's install.yaml stays in sync with CRD changes
by automatically running the operator's build-installer target as part
of the manifests target. This prevents drift between the controller CRDs
and the operator bundle.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@controller/internal/service/controller_service_integration_test.go`:
- Around line 76-86: The Lease objects in the tests (the Lease struct created
and passed to k8sClient.Create in controller_service_integration_test.go) are
missing the required Selector field in LeaseSpec which causes API validation to
fail; update the Lease initializations (the variable lease used with
k8sClient.Create and its LeaseSpec) to include Selector: metav1.LabelSelector{}
or an appropriate label selector (e.g., Selector:
&metav1.LabelSelector{MatchLabels: map[string]string{"app":"test"}}) so the
CRD-required selector is present when Create(ctx, lease) is called.

In `@e2e/tests.bats`:
- Around line 190-196: Two occurrences of the yq load expression still reference
the old path load("e2e/exporter.yaml"); update those commands that match the
pattern go run ... -i ". * load(\"e2e/exporter.yaml\")" to use
load("e2e/exporters/exporter.yaml") instead so all yq invocations point to the
moved exporter.yaml file (ensure the same replacement is applied for both
remaining occurrences).

In `@python/packages/jumpstarter-cli/jumpstarter_cli/run.py`:
- Around line 79-94: The parent currently interprets any non-zero child exit as
a signal and adds 128, which double-encodes real exit codes; change the protocol
so the child (exporter) encodes signals as 128 + signal number and the parent in
run.py should stop adding 128 — i.e., use exporter_exit_code directly when set
(replace any logic that converts non-zero exporter_exit_code into 128+value),
and keep received_signal only for locally-caught signals; update handling around
exporter_exit_code and received_signal (and cancellation via
signal_tg.cancel_scope.cancel()) so the parent returns exporter_exit_code as-is
when present.
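The convention the prompt asks for — child encodes signal deaths as 128 + signum, parent passes the code through unchanged — can be sketched as follows. The function names are illustrative, not the project's actual API:

```python
import signal

def encode_child_exit(returncode):
    """Map a subprocess returncode to a shell-style exit code.

    By POSIX convention, a negative returncode from the subprocess module
    means the child died from signal -returncode; encode it as 128 + signum.
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode

def parent_exit_code(exporter_exit_code, received_signal=None):
    # Parent side: trust the child's already-encoded code as-is; only
    # encode signals the parent itself caught. Adding 128 again here is
    # exactly the double-encoding bug the review describes.
    if exporter_exit_code is not None:
        return exporter_exit_code
    if received_signal is not None:
        return 128 + received_signal
    return 0

assert encode_child_exit(-signal.SIGTERM) == 128 + signal.SIGTERM
assert parent_exit_code(128 + signal.SIGTERM) == 128 + signal.SIGTERM  # no double-encoding
assert parent_exit_code(None, signal.SIGINT) == 128 + signal.SIGINT
```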

In `@python/packages/jumpstarter/jumpstarter/client/core.py`:
- Around line 458-485: Guard creation of LogSource from response.source with a
try/except to handle unknown enum values: when evaluating source =
LogSource(response.source) (and computing is_hook), catch ValueError and set
source to LogSource.SYSTEM (and is_hook = False) so the stream continues instead
of raising; keep the existing routing logic that maps source to logger_name and
logging through source_logger.log(log_level, response.message). Also ensure
severity fallback (severity = response.severity if response.severity else
"INFO") remains unchanged.
- Around line 127-187: The wait_for_lease_ready method currently logs a warning
on timeout but returns silently so callers cannot detect failure; modify
wait_for_lease_ready to raise DriverError (or return False if you prefer an API
change) when the timeout elapses instead of only logging: replace the final
self.logger.warning(...) with raising DriverError(f"Timeout waiting for
beforeLease hook to complete after {timeout:.1f}s and {poll_count} polls")
(ensure DriverError is imported/available) so callers of wait_for_lease_ready
can programmatically handle timeouts.

In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py`:
- Around line 168-191: The stop method currently ignores the caller's
should_unregister value when deferring stop; update stop(self, ...) to persist
should_unregister into an instance field (e.g. set self._unregister =
should_unregister) whenever wait_for_lease_exit is true and you set
self._stop_requested = True so the original preference is preserved, and ensure
the lease-exit path that later invokes stop uses that stored self._unregister
rather than unconditionally passing True; update references to _unregister and
_stop_requested so deferred stops respect the saved preference.
🧹 Nitpick comments (11)
python/packages/jumpstarter-cli/jumpstarter_cli/shell.py (1)

147-162: Redundant timeout pattern may cause confusion.

The code uses both anyio.move_on_after(30) and timeout=30.0 in wait_for_any_of. Since both timeouts are identical, the inner wait_for_any_of will return None on timeout before move_on_after can trigger, meaning timeout_scope.cancelled_caught at line 161 will likely never be True.

Consider removing one timeout mechanism:

♻️ Suggested simplification
-                                        with anyio.move_on_after(30) as timeout_scope:  # 30 second timeout
-                                            result = await monitor.wait_for_any_of(
-                                                [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
-                                                timeout=30.0
-                                            )
-                                            if result == ExporterStatus.AVAILABLE:
-                                                logger.info("afterLease hook completed")
-                                            elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
-                                                logger.warning("afterLease hook failed")
-                                            elif monitor.connection_lost:
-                                                # Connection lost - exporter closed, hook likely completed
-                                                logger.info("Exporter connection closed (hook completed)")
-                                            else:
-                                                logger.debug("Hook completion not confirmed")
-                                        if timeout_scope.cancelled_caught:
-                                            logger.warning("Timeout waiting for afterLease hook to complete")
+                                        result = await monitor.wait_for_any_of(
+                                            [ExporterStatus.AVAILABLE, ExporterStatus.AFTER_LEASE_HOOK_FAILED],
+                                            timeout=30.0
+                                        )
+                                        if result == ExporterStatus.AVAILABLE:
+                                            logger.info("afterLease hook completed")
+                                        elif result == ExporterStatus.AFTER_LEASE_HOOK_FAILED:
+                                            logger.warning("afterLease hook failed")
+                                        elif monitor.connection_lost:
+                                            logger.info("Exporter connection closed (hook completed)")
+                                        else:
+                                            logger.warning("Timeout waiting for afterLease hook to complete")
e2e/tests-hooks.bats (2)

75-76: Consider caching yq installation for faster test execution.

Using go run github.com/mikefarah/yq/v4@latest fetches from the network on each invocation. For repeated test runs, this adds latency. Consider installing yq once in the test environment setup or using a pre-installed binary.


78-82: Clarify the 3>&- redirection purpose.

The 3>&- closes file descriptor 3 before running the background loop. While this works, a brief comment explaining why this is needed (likely to prevent fd leaks from bats internals) would help maintainability.

python/packages/jumpstarter/jumpstarter/client/core_test.py (1)

16-53: Optional: consolidate gRPC error/status helpers.

These helpers look similar to those in status_monitor_test.py; a shared test util could reduce duplication.

python/packages/jumpstarter/jumpstarter/client/core.py (1)

105-126: Consider allowing ExporterStatus.UNSPECIFIED for backward compatibility.

If GetStatus returns the default/unspecified value, Line 124 currently raises ExporterNotReady, which can block driver calls for exporters that haven’t adopted explicit status reporting yet. Adding UNSPECIFIED keeps behavior more permissive without impacting the ready path.

♻️ Suggested tweak
-        ALLOWED_STATUSES = {
-            ExporterStatus.LEASE_READY,
-            ExporterStatus.BEFORE_LEASE_HOOK,
-            ExporterStatus.AFTER_LEASE_HOOK,
-        }
+        ALLOWED_STATUSES = {
+            ExporterStatus.LEASE_READY,
+            ExporterStatus.BEFORE_LEASE_HOOK,
+            ExporterStatus.AFTER_LEASE_HOOK,
+            ExporterStatus.UNSPECIFIED,
+        }
python/packages/jumpstarter/jumpstarter/common/types.py (1)

16-18: Consider removing the empty TYPE_CHECKING block.

The TYPE_CHECKING guard and its comment suggest async stub type aliases were planned but the block is empty. This is dead code that could be removed for clarity, or the intended type aliases should be added.

🧹 Proposed cleanup
-# Async stub type aliases are only available for type checking (defined in .pyi files)
-if TYPE_CHECKING:
-    pass
-

Also update the import:

-from typing import TYPE_CHECKING, TypeAlias
+from typing import TypeAlias
python/packages/jumpstarter/jumpstarter/exporter/lease_context.py (1)

54-57: Consider using explicit validation instead of assert statements.

assert statements can be disabled when Python runs with optimization flags (-O), which would skip these critical validations in production. Consider raising ValueError explicitly.

♻️ Proposed fix
     def __post_init__(self):
         """Validate that required resources are present."""
-        assert self.before_lease_hook is not None, "LeaseScope requires a before_lease_hook event"
-        assert self.lease_name, "LeaseScope requires a non-empty lease_name"
+        if self.before_lease_hook is None:
+            raise ValueError("LeaseContext requires a before_lease_hook event")
+        if not self.lease_name:
+            raise ValueError("LeaseContext requires a non-empty lease_name")
python/packages/jumpstarter/jumpstarter/exporter/hooks.py (1)

621-629: Consider making the 1-second delay configurable or documenting its purpose more prominently.

The delay at Line 624 allows clients to poll and see the AVAILABLE status before the lease is released. While the comment explains this, users might wonder why there's a hardcoded delay. Consider extracting this to a constant or making it configurable.

python/packages/jumpstarter/jumpstarter/client/status_monitor_test.py (1)

426-522: Good coverage of wait_for_any_of behavior.

The tests cover the key scenarios for multi-target waiting. The parallel structure to wait_for_status tests ensures consistent behavior.

Consider adding a test for DEADLINE_EXCEEDED handling in wait_for_any_of for completeness (similar to test_wait_for_status_deadline_exceeded_returns_promptly), though the UNAVAILABLE test provides reasonable coverage of the connection-loss path.

python/packages/jumpstarter/jumpstarter/exporter/session.py (1)

270-280: Consider reducing GetStatus log level.

The GetStatus RPC may be polled frequently by clients (e.g., StatusMonitor polls every 0.1-0.3s). Using logger.info could result in excessive log output.

Consider using DEBUG level
     async def GetStatus(self, request, context):
         """Get the current exporter status with transition tracking."""
-        logger.info("GetStatus() -> %s (version=%d)", self._current_status, self._status_version)
+        logger.debug("GetStatus() -> %s (version=%d)", self._current_status, self._status_version)
         response = jumpstarter_pb2.GetStatusResponse(
python/packages/jumpstarter/jumpstarter/exporter/exporter.py (1)

360-382: Consider using _controller_stub in _unregister_with_controller.

This method manages the channel manually instead of using the _controller_stub context manager. While the explicit channel.close() with CancelScope(shield=True) is correct, using the context manager would be more consistent with the rest of the codebase.

Optional refactor for consistency
     async def _unregister_with_controller(self):
         """Safely unregister from controller with timeout and error handling."""
         if not (self._registered and self._unregister):
             return

         logger.info("Unregistering exporter with controller")
         try:
             with move_on_after(10):  # 10 second timeout
-                channel = await self.channel_factory()
-                try:
-                    controller = jumpstarter_pb2_grpc.ControllerServiceStub(channel)
+                async with self._controller_stub() as controller:
                     await self._report_status(ExporterStatus.OFFLINE, "Exporter shutting down")
                     await controller.Unregister(
                         jumpstarter_pb2.UnregisterRequest(
                             reason="Exporter shutdown",
                         )
                     )
                     logger.info("Controller unregistration completed successfully")
-                finally:
-                    with CancelScope(shield=True):
-                        await channel.close()
         except Exception as e:
             logger.error("Error during controller unregistration: %s", e, exc_info=True)

Comment on lines +76 to +86
lease = &jumpstarterdevv1alpha1.Lease{
ObjectMeta: metav1.ObjectMeta{
Name: "lease-to-release",
Namespace: testNamespace,
},
Spec: jumpstarterdevv1alpha1.LeaseSpec{
ClientRef: corev1.LocalObjectReference{Name: "test-client"},
Release: false,
},
}
Expect(k8sClient.Create(ctx, lease)).To(Succeed())

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the LeaseSpec definition and Selector field
fd -e go "lease_types.go" controller/

# Get the actual content of LeaseSpec
rg -n "type LeaseSpec struct" --type=go -A20 controller/



🏁 Script executed:

#!/bin/bash
# Check for validation markers on the Selector field
rg -n "Selector" controller/api/v1alpha1/lease_types.go -B5 -A1

# Check if there are validation rules defined elsewhere for Lease
rg -n "kubebuilder:" controller/api/v1alpha1/lease_types.go | head -20



🏁 Script executed:

#!/bin/bash
# Check if there are CRD definition files
fd -e yaml "lease.*crd" controller/

# Check the test file to see if it has any setup that might affect validation
head -40 controller/internal/service/controller_service_integration_test.go

# Look for any ValidatingWebhook or validation setup
rg -n "Validat" controller/ --type=go | head -20

# Check if there's a markers file or +kubebuilder validation rules
rg -n "+kubebuilder:validation" controller/api/v1alpha1/lease_types.go



🏁 Script executed:

#!/bin/bash
# Search for CRD files more broadly
find controller -name "*.yaml" -o -name "*.yml" | grep -i lease

# Check if there's a config or manifests directory
fd -e yaml controller/config/ | head -20

# Look for the validation webhook or rules setup
rg -n "LeaseSpec" controller/api/v1alpha1/lease_types.go -B10 | grep -E "kubebuilder|validation"

# Check the entire lease_types.go file for validation markers
cat controller/api/v1alpha1/lease_types.go | grep -E "kubebuilder|validation"



🏁 Script executed:

#!/bin/bash
# Check the CRD definition for validation rules on Selector
cat controller/deploy/operator/config/crd/bases/jumpstarter.dev_leases.yaml | grep -A30 "selector:"

# Also check for any validation rules in the properties section
cat controller/deploy/operator/config/crd/bases/jumpstarter.dev_leases.yaml | grep -B5 -A10 '"selector"'



🏁 Script executed:

#!/bin/bash
# Get more context from the CRD - check if selector is in the required fields
cat controller/deploy/operator/config/crd/bases/jumpstarter.dev_leases.yaml | grep -B20 "selector:" | grep -A30 "required:"

# Or let's look at the entire spec section to see required fields
cat controller/deploy/operator/config/crd/bases/jumpstarter.dev_leases.yaml | sed -n '/spec:/,/status:/p' | grep -A100 "properties:" | head -60



🏁 Script executed:

#!/bin/bash
# Look for the required fields in the spec section
cat controller/deploy/operator/config/crd/bases/jumpstarter.dev_leases.yaml | sed -n '/spec:/,/status:/p' | tail -50



Add required Selector field to all Lease objects in tests.

The Selector field is required in LeaseSpec according to the CRD definition (see required: [clientRef, selector] in controller/deploy/operator/config/crd/bases/jumpstarter.dev_leases.yaml). All Lease objects created in these tests omit this field, which will cause validation failures when k8sClient.Create(ctx, lease) is called. Set Selector: metav1.LabelSelector{} or a valid label selector for each Lease creation.

🤖 Prompt for AI Agents
In `@controller/internal/service/controller_service_integration_test.go` around
lines 76 - 86, The Lease objects in the tests (the Lease struct created and
passed to k8sClient.Create in controller_service_integration_test.go) are
missing the required Selector field in LeaseSpec which causes API validation to
fail; update the Lease initializations (the variable lease used with
k8sClient.Create and its LeaseSpec) to include Selector: metav1.LabelSelector{}
or an appropriate label selector (e.g., Selector:
&metav1.LabelSelector{MatchLabels: map[string]string{"app":"test"}}) so the
CRD-required selector is present when Create(ctx, lease) is called.

Comment on lines +190 to +196
go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/exporter.yaml\")" \
/etc/jumpstarter/exporters/test-exporter-oidc.yaml
go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/exporter.yaml\")" \
/etc/jumpstarter/exporters/test-exporter-sa.yaml
go run github.com/mikefarah/yq/v4@latest -i ". * load(\"e2e/exporters/exporter.yaml\")" \
/etc/jumpstarter/exporters/test-exporter-legacy.yaml


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check which exporter.yaml paths exist in the e2e directory
echo "=== Checking exporter.yaml locations ==="
fd -t f "exporter.yaml" e2e/

echo ""
echo "=== References in tests.bats ==="
rg -n "exporter\.yaml" e2e/tests.bats

Repository: jumpstarter-dev/jumpstarter

Length of output: 633


Update exporter.yaml path references on lines 126 and 174.

The file e2e/exporter.yaml no longer exists; it was moved to e2e/exporters/exporter.yaml. Lines 190–195 were correctly updated, but lines 126 and 174 still reference the old non-existent path, causing test failures:

  • Line 126: Change e2e/exporter.yaml to e2e/exporters/exporter.yaml
  • Line 174: Change e2e/exporter.yaml to e2e/exporters/exporter.yaml
🤖 Prompt for AI Agents
In `@e2e/tests.bats` around lines 190 - 196, Two occurrences of the yq load
expression still reference the old path load("e2e/exporter.yaml"); update those
commands that match the pattern go run ... -i ". * load(\"e2e/exporter.yaml\")"
to use load("e2e/exporters/exporter.yaml") instead so all yq invocations point
to the moved exporter.yaml file (ensure the same replacement is applied for both
remaining occurrences).

Comment on lines +79 to +94
        # Check if exporter set an exit code (e.g., from hook failure with on_failure='exit')
        exporter_exit_code = exporter.exit_code

        # Cancel the signal handler after exporter completes
        signal_tg.cancel_scope.cancel()

        # Return signal number if received, otherwise 0 for immediate restart
        return received_signal if received_signal else 0
        # Return exit code in priority order:
        # 1. Signal number if received (for signal-based termination)
        # 2. Exporter's exit code if set (for hook failure with on_failure='exit')
        # 3. 0 for immediate restart (normal exit without signal or explicit exit code)
        if received_signal:
            return received_signal
        elif exporter_exit_code is not None:
            return exporter_exit_code
        else:
            return 0

⚠️ Potential issue | 🟠 Major

Align child/parent exit-code encoding to avoid 128+ skew.

With the new exporter_exit_code, the parent still assumes any non‑zero child exit code is a signal number and adds 128 (lines 131–138). That will mangle real exit codes (e.g., 1 → 129). Encode signals as 128+signal in the child and have the parent return the child exit code directly.

Proposed fix
@@
-        # 1. Signal number if received (for signal-based termination)
+        # 1. Signal-based termination → return standard 128+signal
@@
-        if received_signal:
-            return received_signal
+        if received_signal:
+            return 128 + received_signal
@@
-        if child_exit_code == 0:
-            return None  # restart child (unexpected exit/exception)
-        else:
-            # Child indicates termination (signal number)
-            return 128 + child_exit_code  # Return standard Unix exit code
+        if child_exit_code == 0:
+            return None  # restart child (unexpected exit/exception)
+        else:
+            # Child already encodes final exit status (128+signal or exporter exit code)
+            return child_exit_code
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter-cli/jumpstarter_cli/run.py` around lines 79 - 94,
The parent currently interprets any non-zero child exit as a signal and adds
128, which double-encodes real exit codes; change the protocol so the child
(exporter) encodes signals as 128 + signal number and the parent in run.py
should stop adding 128 — i.e., use exporter_exit_code directly when set (replace
any logic that converts non-zero exporter_exit_code into 128+value), and keep
received_signal only for locally-caught signals; update handling around
exporter_exit_code and received_signal (and cancellation via
signal_tg.cancel_scope.cancel()) so the parent returns exporter_exit_code as-is
when present.

Comment on lines +127 to +187
    async def wait_for_lease_ready(self, timeout: float = 300.0) -> None:
        """Wait for exporter to report LEASE_READY status.

        This polls GetStatus until the beforeLease hook completes.
        Should be called after log streaming is started so hook output
        can be displayed in real-time.

        Prefer using StatusMonitor.wait_for_any_of() for non-blocking status tracking.

        Args:
            timeout: Maximum time to wait in seconds (default: 5 minutes)
        """
        import anyio

        poll_interval = 0.5  # seconds
        elapsed = 0.0
        poll_count = 0

        self.logger.debug("Waiting for exporter to be ready...")
        while elapsed < timeout:
            poll_count += 1
            self.logger.debug("[POLL %d] Calling GetStatus (elapsed: %.1fs)...", poll_count, elapsed)
            try:
                status = await self.get_status_async()
                self.logger.debug("[POLL %d] GetStatus returned: %s", poll_count, status)
            except Exception as e:
                # Connection error - keep trying
                self.logger.debug("[POLL %d] Error getting status, will retry: %s", poll_count, e)
                await anyio.sleep(poll_interval)
                elapsed += poll_interval
                continue

            if status is None:
                # GetStatus not implemented - assume ready for backward compatibility
                self.logger.debug("[POLL %d] GetStatus not implemented, assuming ready", poll_count)
                return

            if status == ExporterStatus.LEASE_READY:
                self.logger.info("Exporter ready, starting shell...")
                return
            elif status == ExporterStatus.BEFORE_LEASE_HOOK:
                # Hook is running - this is expected, keep waiting
                self.logger.debug("[POLL %d] beforeLease hook is running...", poll_count)
            elif status == ExporterStatus.BEFORE_LEASE_HOOK_FAILED:
                # Hook failed - log but continue (exporter may still be usable)
                self.logger.warning("beforeLease hook failed")
                return
            elif status == ExporterStatus.AVAILABLE:
                # Exporter is available but not yet leased - keep waiting
                # This can happen if client connects before exporter receives lease assignment
                self.logger.debug("[POLL %d] Exporter status: AVAILABLE (waiting for lease assignment)", poll_count)
            else:
                # Other status - continue waiting
                self.logger.debug("[POLL %d] Exporter status: %s (waiting...)", poll_count, status)

            self.logger.debug("[POLL %d] Sleeping for %.1fs before next poll...", poll_count, poll_interval)
            await anyio.sleep(poll_interval)
            elapsed += poll_interval

        self.logger.warning("Timeout waiting for beforeLease hook to complete (after %d polls)", poll_count)


⚠️ Potential issue | 🟡 Minor

Timeout isn’t surfaced to callers.

After the loop, Line 186 only logs a warning and returns, so callers can’t distinguish success from a timeout. Consider raising DriverError (or returning a boolean) when the timeout elapses.

🛠️ Possible fix (raise on timeout)
-        self.logger.warning("Timeout waiting for beforeLease hook to complete (after %d polls)", poll_count)
+        self.logger.warning("Timeout waiting for beforeLease hook to complete (after %d polls)", poll_count)
+        raise DriverError("Timeout waiting for beforeLease hook to complete")
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/core.py` around lines 127 -
187, The wait_for_lease_ready method currently logs a warning on timeout but
returns silently so callers cannot detect failure; modify wait_for_lease_ready
to raise DriverError (or return False if you prefer an API change) when the
timeout elapses instead of only logging: replace the final
self.logger.warning(...) with raising DriverError(f"Timeout waiting for
beforeLease hook to complete after {timeout:.1f}s and {poll_count} polls")
(ensure DriverError is imported/available) so callers of wait_for_lease_ready
can programmatically handle timeouts.
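The raise-on-timeout behavior can be sketched as a standalone polling helper; `DriverError` is assumed to exist per the review, and `wait_until` is an illustrative stand-in, not the actual client method:

```python
import time


class DriverError(Exception):
    """Stand-in for jumpstarter's DriverError (assumed from the review)."""


def wait_until(predicate, timeout: float = 5.0, poll_interval: float = 0.1) -> int:
    """Poll predicate() until true, raising on timeout instead of returning silently.

    Returns the number of polls on success so callers can log it; raises
    DriverError when the deadline passes, mirroring the suggested fix.
    """
    deadline = time.monotonic() + timeout
    poll_count = 0
    while time.monotonic() < deadline:
        poll_count += 1
        if predicate():
            return poll_count
        time.sleep(poll_interval)
    raise DriverError(
        f"Timeout waiting for condition after {timeout:.1f}s and {poll_count} polls"
    )
```

Raising (rather than returning a boolean) keeps the happy-path call site unchanged while making a timeout impossible to silently ignore.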

Comment on lines +458 to +485
                        # Determine log source
                        if response.HasField("source"):
                            source = LogSource(response.source)
                            is_hook = source in (LogSource.BEFORE_LEASE_HOOK, LogSource.AFTER_LEASE_HOOK)
                        else:
                            source = LogSource.SYSTEM
                            is_hook = False

                        # Filter: always show hooks, only show system logs if enabled
                        if is_hook or show_all_logs:
                            # Get severity level
                            severity = response.severity if response.severity else "INFO"
                            log_level = getattr(logging, severity, logging.INFO)

                            # Route to appropriate logger based on source
                            if source == LogSource.BEFORE_LEASE_HOOK:
                                logger_name = "exporter:beforeLease"
                            elif source == LogSource.AFTER_LEASE_HOOK:
                                logger_name = "exporter:afterLease"
                            elif source == LogSource.DRIVER:
                                logger_name = "exporter:driver"
                            else:  # SYSTEM
                                logger_name = "exporter:system"

                            # Log through logger for RichHandler formatting
                            source_logger = logging.getLogger(logger_name)
                            source_logger.log(log_level, response.message)


⚠️ Potential issue | 🟡 Minor

Guard against unknown LogSource values.

If the server sends a new/unknown enum value, LogSource(response.source) raises ValueError, which can trigger reconnect churn. A small fallback keeps the stream healthy.

🛠️ Suggested fallback
-                        if response.HasField("source"):
-                            source = LogSource(response.source)
-                            is_hook = source in (LogSource.BEFORE_LEASE_HOOK, LogSource.AFTER_LEASE_HOOK)
+                        if response.HasField("source"):
+                            try:
+                                source = LogSource(response.source)
+                            except ValueError:
+                                self.logger.debug("Unknown LogSource value: %s", response.source)
+                                source = LogSource.SYSTEM
+                            is_hook = source in (LogSource.BEFORE_LEASE_HOOK, LogSource.AFTER_LEASE_HOOK)
🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/client/core.py` around lines 458 -
485, Guard creation of LogSource from response.source with a try/except to
handle unknown enum values: when evaluating source = LogSource(response.source)
(and computing is_hook), catch ValueError and set source to LogSource.SYSTEM
(and is_hook = False) so the stream continues instead of raising; keep the
existing routing logic that maps source to logger_name and logging through
source_logger.log(log_level, response.message). Also ensure severity fallback
(severity = response.severity if response.severity else "INFO") remains
unchanged.
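Besides the try/except fallback, Python's enum machinery offers a `_missing_` hook that centralizes the same fallback at the enum definition; the values below are illustrative, not the actual protobuf numbering:

```python
from enum import IntEnum


class LogSource(IntEnum):
    # Illustrative values only -- not the actual protobuf enum numbering.
    SYSTEM = 0
    DRIVER = 1
    BEFORE_LEASE_HOOK = 2
    AFTER_LEASE_HOOK = 3

    @classmethod
    def _missing_(cls, value):
        # Unknown wire values fall back to SYSTEM instead of raising
        # ValueError, so a newer server cannot break an older client.
        return cls.SYSTEM


assert LogSource(2) is LogSource.BEFORE_LEASE_HOOK
assert LogSource(99) is LogSource.SYSTEM  # unknown value, no ValueError
```

This keeps the call site (`LogSource(response.source)`) unchanged, though the explicit try/except has the advantage of logging the unrecognized value at the point it arrives.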

Comment on lines +168 to 191
    def stop(self, wait_for_lease_exit=False, should_unregister=False, exit_code: int | None = None):
        """Signal the exporter to stop.

        Args:
            wait_for_lease_exit (bool): If True, wait for the current lease to exit before stopping.
            should_unregister (bool): If True, unregister from controller. Otherwise rely on heartbeat.
            exit_code (int | None): If set, the exporter will exit with this code (non-zero means no restart).
        """
        # Set exit code if provided
        if exit_code is not None:
            self._exit_code = exit_code

        # Stop immediately if not started yet or if immediate stop is requested
        if (not self._started or not wait_for_lease_exit) and self._tg is not None:
            logger.info("Stopping exporter immediately, unregister from controller=%s", should_unregister)
            if should_unregister:
                logger.info("Stopping exporter immediately, unregistering from controller")
            else:
                logger.info("Stopping exporter immediately, will not unregister from controller")
            self._unregister = should_unregister
            # Cancel any ongoing tasks
            self._tg.cancel_scope.cancel()
        elif not self._stop_requested:
            self._stop_requested = True
            logger.info("Exporter marked for stop upon lease exit")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's find the exporter.py file and check its size
find . -name "exporter.py" -path "*jumpstarter*" -type f

Repository: jumpstarter-dev/jumpstarter

Length of output: 194


🏁 Script executed:

#!/bin/bash
# Get the exact file location and check its size
EXPORTER_FILE="python/packages/jumpstarter/jumpstarter/exporter/exporter.py"
if [ -f "$EXPORTER_FILE" ]; then
  wc -l "$EXPORTER_FILE"
fi

Repository: jumpstarter-dev/jumpstarter

Length of output: 135


🏁 Script executed:

#!/bin/bash
# Read the relevant sections: lines 168-191 (stop method) and around line 788
EXPORTER_FILE="python/packages/jumpstarter/jumpstarter/exporter/exporter.py"

echo "=== Lines 168-191 (stop method) ==="
sed -n '168,191p' "$EXPORTER_FILE"

echo -e "\n=== Lines 780-800 (serve method call to stop) ==="
sed -n '780,800p' "$EXPORTER_FILE"

echo -e "\n=== Check for _stop_requested and should_unregister references ==="
grep -n "_stop_requested\|should_unregister\|_unregister" "$EXPORTER_FILE" | head -30

Repository: jumpstarter-dev/jumpstarter

Length of output: 2962


🏁 Script executed:

#!/bin/bash
# Search for all calls to stop() in the exporter module to understand usage patterns
EXPORTER_DIR="python/packages/jumpstarter/jumpstarter/exporter/"
rg -n '\.stop\(' "$EXPORTER_DIR" -A 1 -B 1

Repository: jumpstarter-dev/jumpstarter

Length of output: 1341


should_unregister preference is not preserved when deferring stop.

When stop(wait_for_lease_exit=True, should_unregister=False) is called, the should_unregister=False preference is discarded. The method only sets _stop_requested=True without storing the should_unregister value. Later, when the lease exits (line 788), stop(should_unregister=True) is called unconditionally, overriding the caller's intent to defer unregistration.

Consider storing should_unregister in a field (similar to how _unregister is used for immediate stops) when deferring the stop, so the original caller preference is honored when the lease eventually exits.

🤖 Prompt for AI Agents
In `@python/packages/jumpstarter/jumpstarter/exporter/exporter.py` around lines
168 - 191, The stop method currently ignores the caller's should_unregister
value when deferring stop; update stop(self, ...) to persist should_unregister
into an instance field (e.g. set self._unregister = should_unregister) whenever
wait_for_lease_exit is true and you set self._stop_requested = True so the
original preference is preserved, and ensure the lease-exit path that later
invokes stop uses that stored self._unregister rather than unconditionally
passing True; update references to _unregister and _stop_requested so deferred
stops respect the saved preference.
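The suggested fix can be sketched with a stripped-down class; field and method names mirror the review discussion, but this is not the real exporter implementation:

```python
class Exporter:
    """Sketch: preserve should_unregister across a deferred (lease-exit) stop."""

    def __init__(self):
        self._stop_requested = False
        self._unregister = False

    def stop(self, wait_for_lease_exit: bool = False, should_unregister: bool = False):
        if not wait_for_lease_exit:
            # Immediate stop: record the preference and cancel tasks (elided).
            self._unregister = should_unregister
        elif not self._stop_requested:
            self._stop_requested = True
            # Persist the caller's preference for the deferred path as well.
            self._unregister = should_unregister

    def on_lease_exit(self):
        if self._stop_requested:
            # Honor the stored preference instead of hard-coding True.
            self.stop(should_unregister=self._unregister)


exporter = Exporter()
exporter.stop(wait_for_lease_exit=True, should_unregister=False)
exporter.on_lease_exit()
assert exporter._unregister is False  # caller's preference survived
```

The key change is that the deferred branch writes `should_unregister` to the same field the immediate branch uses, and the lease-exit path reads it back rather than passing a literal `True`.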
