
feat(pool): add reusable pod reset policy with task-executor sidecar support #521

Open

fengcone wants to merge 3 commits into alibaba:main from fengcone:feature/public-podRecycle

Conversation

@fengcone
Collaborator

Summary

Support a pod disposal policy for pooled BatchSandbox deletion (#452).

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (the default recycling strategy for SandboxPool has been changed from Reuse to Delete)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

- Introduce PodRecyclePolicy with options Delete and Reuse for pooled BatchSandbox pods
- Add ResetSpec for reset configuration when using Reuse policy (container restart, clean dirs, timeout)
- Implement controller logic to handle pod disposal and resetting before BatchSandbox deletion
- Add finalizer management for pod disposal lifecycle in BatchSandbox reconciler
- Integrate calls to task-executor Reset API for pod reset status and control in controller
- Track pod recycle states (Resetting, ResetSucceeded, ResetFailed) via pod labels and update accordingly
- Implement task-executor sidecar Reset API to support asynchronous pod reset operations
- Block task creation and sync during reset in task-executor to avoid race conditions
- Support cleaning of task data and user-specified directories and main container restart during reset
- Add new CLI flags for configuring task-executor image and resources for pod reset support
- Update kustomize deploy to substitute TASK_EXECUTOR_IMAGE_PLACEHOLDER with configured image
- Add new API types and deepcopy methods for reset configuration and responses
- Add new status field to Pool to track number of pods resetting for observability
# Conflicts:
#	kubernetes/Makefile
#	kubernetes/cmd/controller/main.go
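The PodRecyclePolicy/ResetSpec surface described in the bullets above could look roughly like the following sketch. All names and fields here are illustrative, not the PR's actual definitions:

```go
package main

import (
	"fmt"
	"time"
)

// PodRecyclePolicy controls what happens to a pooled pod when its
// BatchSandbox is deleted (hypothetical names).
type PodRecyclePolicy string

const (
	PodRecycleDelete PodRecyclePolicy = "Delete" // tear the pod down
	PodRecycleReuse  PodRecyclePolicy = "Reuse"  // reset the pod and return it to the pool
)

// ResetSpec configures the reset performed under the Reuse policy.
type ResetSpec struct {
	RestartMainContainer bool          // restart the main container during reset
	CleanDirs            []string      // user-specified directories wiped on reset
	Timeout              time.Duration // abort the reset if it exceeds this
}

// PoolSpec is a fragment of the pool spec carrying the recycle settings.
type PoolSpec struct {
	RecyclePolicy PodRecyclePolicy
	Reset         *ResetSpec // only consulted when RecyclePolicy == Reuse
}

// effectivePolicy mirrors the breaking change noted above: an unset
// policy now defaults to Delete rather than Reuse.
func effectivePolicy(s PoolSpec) PodRecyclePolicy {
	if s.RecyclePolicy == "" {
		return PodRecycleDelete
	}
	return s.RecyclePolicy
}

func main() {
	fmt.Println(effectivePolicy(PoolSpec{})) // prints "Delete"
}
```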

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7eb332017


Comment on lines +263 to +272
// Case 2: Reset already completed -> return terminal status
if h.reset.status == api.ResetStatusSuccess ||
    h.reset.status == api.ResetStatusFailed ||
    h.reset.status == api.ResetStatusTimeout ||
    h.reset.status == api.ResetStatusNotSupported {
    klog.InfoS("Reset already completed, returning terminal status", "status", h.reset.status)
    resp := h.currentResetResponse()
    h.reset.mu.Unlock()
    writeJSON(w, http.StatusOK, resp)
    return


P1: Reinitialize reset state before accepting another recycle

After the first successful recycle, h.reset.status remains in a terminal state forever. The next POST /reset for the same sidecar will hit this branch and immediately return the previous Success/Failed response instead of starting executeReset again, so subsequent pod reuses skip task cleanup, directory cleanup, and main-container restart entirely. This affects any pool pod that is reused more than once, which is the normal steady-state for the new Reuse policy.
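One way to address this, sketched with illustrative names (resetState, Start, finish are not the PR's actual identifiers): treat a terminal status as "previous cycle finished" and reinitialize it when the next reset request arrives, rather than replaying the stale response.

```go
package main

import (
	"fmt"
	"sync"
)

type status string

const (
	statusRunning status = "Running"
	statusSuccess status = "Success"
	statusFailed  status = "Failed"
)

// resetState tracks one sidecar's reset lifecycle across pod reuses.
type resetState struct {
	mu     sync.Mutex
	status status
	gen    int // bumps each time a new reset cycle begins
}

// Start handles a new reset request. A terminal status from a previous
// cycle is cleared so the next recycle actually runs executeReset again;
// only an in-flight reset short-circuits.
func (r *resetState) Start() (started bool, gen int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.status == statusRunning {
		return false, r.gen // a reset is already in flight
	}
	// Idle or terminal -> begin a fresh cycle.
	r.status = statusRunning
	r.gen++
	return true, r.gen
}

// finish records the outcome of the current cycle.
func (r *resetState) finish(ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ok {
		r.status = statusSuccess
	} else {
		r.status = statusFailed
	}
}

func main() {
	var r resetState
	r.Start()
	r.finish(true)
	started, gen := r.Start() // second recycle starts again instead of replaying Success
	fmt.Println(started, gen) // prints "true 2"
}
```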


Comment on lines +479 to +485
// Step 2: Wait for process to exit gracefully (container runtime will restart it based on restartPolicy)
gracefulTimeout := 30 * time.Second
gracefulDeadline := time.Now().Add(gracefulTimeout)
for time.Now().Before(gracefulDeadline) {
    if !isProcessRunning(mainPID) {
        klog.InfoS("Main container process exited gracefully", "pid", mainPID)
        return nil


P1: Wait for the restarted main container before returning success

restartMainContainer reports success as soon as the old PID disappears, but at that point the replacement container may still be starting up. The pool allocator still treats any unallocated pod with Phase=Running as reusable (internal/controller/allocator.go:223-230), and a pod stays Running while the task-executor sidecar is alive, so a reused pod can be handed to a new BatchSandbox before its main container is back or ready. This is most visible with images that take a few seconds to boot after SIGTERM/SIGKILL.


Comment on lines +13 to +16
args:
  - --leader-elect
  - --health-probe-bind-address=:8081
  - --task-executor-image=TASK_EXECUTOR_IMAGE_PLACEHOLDER


P2: Merge the task-executor arg without replacing metrics args

config/default/kustomization.yaml still applies manager_metrics_patch.yaml, but this strategic-merge patch supplies a full args list for the manager container. That replaces the slice rather than appending to it, so the generated Kustomize deployment drops --metrics-bind-address=:8443 and leaves the controller running with metrics disabled while metrics_service.yaml still exposes a metrics Service. This affects the make deploy/config/default installation path introduced in this change.
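One way to avoid the list replacement, sketched under the assumption that the manager Deployment is named controller-manager: use a JSON6902 patch in kustomization.yaml, which appends a single entry to args instead of supplying a whole args slice the way a strategic-merge patch does.

```yaml
# Hypothetical kustomization.yaml fragment; the target name is an assumption.
patches:
  - target:
      kind: Deployment
      name: controller-manager
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --task-executor-image=TASK_EXECUTOR_IMAGE_PLACEHOLDER
```

With this, manager_metrics_patch.yaml keeps its --metrics-bind-address=:8443 flag and the metrics Service stays functional.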

