feat(pool): add reusable pod reset policy with task-executor sidecar support #521
fengcone wants to merge 3 commits into alibaba:main
Conversation
- Introduce PodRecyclePolicy with options Delete and Reuse for pooled BatchSandbox pods
- Add ResetSpec for reset configuration when using the Reuse policy (container restart, clean dirs, timeout)
- Implement controller logic to handle pod disposal and resetting before BatchSandbox deletion
- Add finalizer management for the pod disposal lifecycle in the BatchSandbox reconciler
- Integrate calls to the task-executor Reset API for pod reset status and control in the controller
- Track pod recycle states (Resetting, ResetSucceeded, ResetFailed) via pod labels and update them accordingly
- Implement a task-executor sidecar Reset API to support asynchronous pod reset operations
- Block task creation and sync during reset in the task-executor to avoid race conditions
- Support cleaning of task data and user-specified directories, and main-container restart, during reset
- Add new CLI flags for configuring the task-executor image and resources for pod reset support
- Update the kustomize deploy to substitute TASK_EXECUTOR_IMAGE_PLACEHOLDER with the configured image
- Add new API types and deepcopy methods for reset configuration and responses
- Add a new status field to Pool to track the number of pods resetting, for observability
# Conflicts:
#   kubernetes/Makefile
#   kubernetes/cmd/controller/main.go
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a7eb332017
```go
// Case 2: Reset already completed -> return terminal status
if h.reset.status == api.ResetStatusSuccess ||
	h.reset.status == api.ResetStatusFailed ||
	h.reset.status == api.ResetStatusTimeout ||
	h.reset.status == api.ResetStatusNotSupported {
	klog.InfoS("Reset already completed, returning terminal status", "status", h.reset.status)
	resp := h.currentResetResponse()
	h.reset.mu.Unlock()
	writeJSON(w, http.StatusOK, resp)
	return
```
Reinitialize reset state before accepting another recycle
After the first successful recycle, h.reset.status remains in a terminal state forever. The next POST /reset for the same sidecar will hit this branch and immediately return the previous Success/Failed response instead of starting executeReset again, so subsequent pod reuses skip task cleanup, directory cleanup, and main-container restart entirely. This affects any pool pod that is reused more than once, which is the normal steady-state for the new Reuse policy.
```go
// Step 2: Wait for process to exit gracefully (container runtime will restart it based on restartPolicy)
gracefulTimeout := 30 * time.Second
gracefulDeadline := time.Now().Add(gracefulTimeout)
for time.Now().Before(gracefulDeadline) {
	if !isProcessRunning(mainPID) {
		klog.InfoS("Main container process exited gracefully", "pid", mainPID)
		return nil
```
Wait for the restarted main container before returning success
restartMainContainer reports success as soon as the old PID disappears, but at that point the replacement container may still be starting up. The pool allocator still treats any unallocated pod with Phase=Running as reusable (internal/controller/allocator.go:223-230), and a pod stays Running while the task-executor sidecar is alive, so a reused pod can be handed to a new BatchSandbox before its main container is back or ready. This is most visible with images that take a few seconds to boot after SIGTERM/SIGKILL.
```yaml
args:
  - --leader-elect
  - --health-probe-bind-address=:8081
  - --task-executor-image=TASK_EXECUTOR_IMAGE_PLACEHOLDER
```
Merge the task-executor arg without replacing metrics args
config/default/kustomization.yaml still applies manager_metrics_patch.yaml, but this strategic-merge patch supplies a full args list for the manager container. That replaces the slice rather than appending to it, so the generated Kustomize deployment drops --metrics-bind-address=:8443 and leaves the controller running with metrics disabled while metrics_service.yaml still exposes a metrics Service. This affects the make deploy/config/default installation path introduced in this change.
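A JSON-6902 patch appends to the existing args slice instead of replacing it, which would preserve `--metrics-bind-address=:8443` from manager_metrics_patch.yaml. A minimal sketch, assuming the manager container is at index 0 and the Deployment is named `controller-manager` (both assumptions, not verified against this repo):

```yaml
# config/default/kustomization.yaml (sketch)
patches:
  - target:
      kind: Deployment
      name: controller-manager
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --task-executor-image=TASK_EXECUTOR_IMAGE_PLACEHOLDER
```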
Summary
#452 support pod disposal policy for pooled BatchSandbox deletion
Testing
Breaking Changes
(Reuse to Delete)
Checklist