feat: implement snapstart for codeinterpreter by lyuyun · Pull Request #380 · volcano-sh/agentcube

lyuyun · 2026-06-05T10:20:01Z

No description provided.

Signed-off-by: lyuyun <lyuyun068@gmail.com>

volcano-sh-bot · 2026-06-05T10:20:07Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevin-wangzefeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces SandboxSnapshot support (Fork mode) across the control-plane and node agent, including new CRDs, controllers, artifact-store persistence, and restore-intent injection for session sandboxes.

Changes:

Add SnapshotClass/SandboxSnapshot/SandboxSnapshotTask APIs + CRDs, plus controllers in workload-manager and agentd.
Implement Fork snapshot mode end-to-end, including per-node build sandboxes, task orchestration, hashing, and artifact manifest management.
Add Redis/Valkey-backed artifact store and wire snapshot restore-intent into sandbox creation.

Reviewed changes

Copilot reviewed 22 out of 40 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
pkg/workloadmanager/snapshot_mode.go	Defines the mode-handler interface for snapshot reconciliation.
pkg/workloadmanager/snapshot_fork.go	Implements Fork-mode snapshot orchestration, hashing, and snapshot-key lookup for restore intent.
pkg/workloadmanager/snapshot_controller.go	Adds the SandboxSnapshot controller and shared manifest/status utilities.
pkg/workloadmanager/server.go	Wires optional snapshot lookup dependencies into the API server.
pkg/workloadmanager/handlers.go	Injects snapshot restore intent into sandbox creation (CodeInterpreter path).
pkg/workloadmanager/codeinterpreter_controller.go	Always reconciles SandboxTemplate so it can serve as Fork snapshot source.
pkg/workloadmanager/artifact_store_init.go	Adds env-based artifact store init (Redis/Valkey) with noop fallback.
pkg/store/artifact_store.go	Introduces ArtifactStore interface and snapshot artifact manifest schema.
pkg/store/artifact_store_redis.go	Implements ArtifactStore with Redis CAS semantics.
pkg/apis/runtime/v1alpha1/snapshot_types.go	Adds SnapshotClass/SandboxSnapshot/SandboxSnapshotTask API types and constants.
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go	Adds generated DeepCopy methods for new API types.
pkg/agentd/snapshot_task_reconciler.go	Adds node-agent controller for executing SandboxSnapshotTasks.
pkg/agentd/snapshot_mode.go	Adds per-mode target validation (Fork waits for build sandbox readiness).
pkg/agentd/snapshot_driver.go	Adds snapshot driver interface and request/response structs.
pkg/agentd/kuasar_driver.go	Adds a Kuasar SnapStart driver scaffold (handshake + create/delete/inspect stubs).
manifests/charts/base/crds/runtime.agentcube.volcano.sh_snapshotclasses.yaml	Adds SnapshotClass CRD.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_sandboxsnapshots.yaml	Adds SandboxSnapshot CRD.
manifests/charts/base/crds/runtime.agentcube.volcano.sh_sandboxsnapshottasks.yaml	Adds SandboxSnapshotTask CRD.
hack/update-codegen.sh	Extends lister-gen workaround to new snapshot resources.
cmd/workload-manager/main.go	Registers snapshot controller, initializes artifact store, and passes deps to server.
cmd/agentd/main.go	Registers snapshot task controller and adds runtime API scheme.
go.mod	Minor ordering adjustment in require block.

Files not reviewed (18)

client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
client-go/informers/externalversions/generic.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported

Comments suppressed due to low confidence (5)

pkg/workloadmanager/snapshot_controller.go:1

Completed-task cleanup ignores SnapshotArtifactPhaseUnavailable. Tasks that end as Unavailable won't be cleaned up (and mode-specific cleanup like deleting build sandboxes won't run), causing resource leaks. Include Unavailable among the terminal phases for cleanup.
pkg/workloadmanager/snapshot_controller.go:1
This reads manifest.ArtifactSets[pendingKey] without checking presence. If the manifest is corrupted or the set was deleted (e.g., manual edits / store issues), pending becomes a zero-value set and could be incorrectly promoted or cause confusing logs/status. Guard with an ok check and clear PendingSetRef (or fail status) when the key is missing.
pkg/workloadmanager/snapshot_fork.go:1
Hash normalization sorts tolerations only by Key. If multiple tolerations share a key, the comparator doesn't define a strict total order, and sort.Slice is not stable—this can yield non-deterministic ordering and spurious hash changes/rebuilds. Sort by a full tuple (e.g., Key, Operator, Value, Effect, TolerationSeconds) to make the ordering deterministic.
pkg/workloadmanager/snapshot_fork.go:1
lookupActiveForkSnapshotKey lists all SandboxSnapshots in the namespace and then potentially calls the artifact store per candidate. This can become a hot-path bottleneck during session creation (list cost + N artifact-store reads). Consider adding a label (e.g., sourceRef name + mode) on SandboxSnapshot objects and using client.MatchingLabels to narrow the list, or maintaining an index/cache for ready snapshots keyed by (namespace, sourceTemplateName).
pkg/store/artifact_store_redis.go:1
The comment about the version token being a 'JSON encoding of the raw string value' doesn't match the implementation: PutManifest directly compares current (raw Redis string) with version (also a raw string). Either update the comment to reflect raw-string comparison, or adjust the code to actually use an encoded token format if that's the intended contract.

+                  description: SandboxSnapshotMode selects the snapshot usage mode.
+                  enum:
+                  - Fork
+                  - Resume
+                  type: string


+	// Skip tasks that have already reached a terminal phase.
+	if task.Status.Phase == runtimev1alpha1.SnapshotArtifactPhaseReady ||
+		task.Status.Phase == runtimev1alpha1.SnapshotArtifactPhaseFailed {
+		return ctrl.Result{}, nil
+	}


+	snapshotReconciler := &workloadmanager.SandboxSnapshotReconciler{
+		Client:        mgr.GetClient(),
+		ArtifactStore: workloadmanager.NewArtifactStoreFromEnv(),
+	}


gemini-code-assist

Code Review

This pull request introduces a sandbox snapshotting feature to Volcano AgentCube, adding CRDs for SandboxSnapshot, SandboxSnapshotTask, and SnapshotClass, along with a node-local Kuasar snapshot driver, a snapshot task reconciler, and a control-plane snapshot controller. It also integrates snapshot restore intent into the workload manager and implements a Redis/Valkey-backed artifact store. The code review highlights several critical issues and improvement opportunities: a potential resource leak of physical artifacts on nodes during promotion, non-deterministic behavior in snapshot lookup and pod spec normalization, and inefficient O(N) slice length calculation. Additionally, feedback suggests improving robustness by aligning connection deadlines with context deadlines in the Kuasar driver and returning the raw manifest string from the artifact store to avoid double-marshaling overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-05T10:22:33Z

+	// GetManifest retrieves the manifest for the given snapshot owner key.
+	// Returns nil, nil when no manifest exists.
+	GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, error)


GetManifest does not return the raw version string of the manifest from Redis. This forces loadManifest to use json.Marshal to generate a version string, which is non-deterministic (due to map key ordering/omitted fields) and introduces double-marshaling overhead. Changing GetManifest to return (*SnapshotArtifactManifest, string, error) where the second return value is the raw string retrieved from Redis avoids these issues.

Suggested change

// GetManifest retrieves the manifest for the given snapshot owner key.

// Returns nil, nil when no manifest exists.

GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, error)

// GetManifest retrieves the manifest and its raw string version for the given snapshot owner key.

// Returns nil, "", nil when no manifest exists.

GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, string, error)

gemini-code-assist · 2026-06-05T10:22:33Z

+			if manifest.ActiveSetRef.SnapshotKey != "" {
+				delete(manifest.ArtifactSets, manifest.ActiveSetRef.SnapshotKey)
+			}


When promoting a pending set to active, the old active set is deleted from manifest.ArtifactSets, but the physical snapshot artifacts on the nodes are never deleted, leading to a permanent resource leak on the nodes. Consider introducing a deletion/cleanup task or garbage collection mechanism to remove physical artifacts that are no longer referenced.

gemini-code-assist · 2026-06-05T10:22:33Z

+func lookupActiveForkSnapshotKey(
+	ctx context.Context,
+	k8sClient client.Client,
+	artifactStore store.ArtifactStore,
+	namespace, sandboxTemplateName string,
+) string {
+	snapshotList := &runtimev1alpha1.SandboxSnapshotList{}
+	if err := k8sClient.List(ctx, snapshotList, client.InNamespace(namespace)); err != nil {
+		klog.V(4).InfoS("snapshot lookup: failed to list snapshots, falling back to cold start",
+			"namespace", namespace, "error", err)
+		return ""
+	}
+
+	for i := range snapshotList.Items {
+		ss := &snapshotList.Items[i]
+		if ss.Spec.SnapshotMode != runtimev1alpha1.SandboxSnapshotModeFork {
+			continue
+		}
+		if ss.Spec.SourceRef.Name != sandboxTemplateName {
+			continue
+		}
+		if ss.Status.Phase != runtimev1alpha1.SandboxSnapshotPhaseReady {
+			continue
+		}
+
+		ownerKey := store.ArtifactOwnerKey("SandboxSnapshot", ss.Namespace, ss.Name, string(ss.UID))
+		manifest, err := artifactStore.GetManifest(ctx, ownerKey)
+		if err != nil {
+			klog.V(4).InfoS("snapshot lookup: artifact store error, falling back to cold start",
+				"snapshot", ss.Name, "error", err)
+			continue
+		}
+		if manifest == nil || manifest.ActiveSetRef.SnapshotKey == "" {
+			continue
+		}
+		activeSet, ok := manifest.ArtifactSets[manifest.ActiveSetRef.SnapshotKey]
+		if !ok {
+			continue
+		}
+		for _, art := range activeSet.Artifacts {
+			if art.Phase == store.SnapshotArtifactPhaseReady {
+				klog.V(4).InfoS("snapshot lookup: found active fork snapshot key",
+					"snapshot", ss.Name, "snapshotKey", manifest.ActiveSetRef.SnapshotKey)
+				return manifest.ActiveSetRef.SnapshotKey
+			}
+		}
+	}
+	return ""
+}


lookupActiveForkSnapshotKey returns the first active snapshot key it finds in an arbitrary order. If multiple snapshots exist for the same template, this leads to non-deterministic behavior. Sorting or selecting the snapshot with the latest CreationTimestamp or ReadyAt time ensures deterministic behavior.

func lookupActiveForkSnapshotKey( ctx context.Context, k8sClient client.Client, artifactStore store.ArtifactStore, namespace, sandboxTemplateName string, ) string { snapshotList := &runtimev1alpha1.SandboxSnapshotList{} if err := k8sClient.List(ctx, snapshotList, client.InNamespace(namespace)); err != nil { klog.V(4).InfoS("snapshot lookup: failed to list snapshots, falling back to cold start", "namespace", namespace, "error", err) return "" } var bestSnapshot *runtimev1alpha1.SandboxSnapshot var bestSnapshotKey string for i := range snapshotList.Items { ss := &snapshotList.Items[i] if ss.Spec.SnapshotMode != runtimev1alpha1.SandboxSnapshotModeFork { continue } if ss.Spec.SourceRef.Name != sandboxTemplateName { continue } if ss.Status.Phase != runtimev1alpha1.SandboxSnapshotPhaseReady { continue } ownerKey := store.ArtifactOwnerKey("SandboxSnapshot", ss.Namespace, ss.Name, string(ss.UID)) manifest, err := artifactStore.GetManifest(ctx, ownerKey) if err != nil { klog.V(4).InfoS("snapshot lookup: artifact store error, falling back to cold start", "snapshot", ss.Name, "error", err) continue } if manifest == nil || manifest.ActiveSetRef.SnapshotKey == "" { continue } activeSet, ok := manifest.ArtifactSets[manifest.ActiveSetRef.SnapshotKey] if !ok { continue } hasReadyArtifact := false for _, art := range activeSet.Artifacts { if art.Phase == store.SnapshotArtifactPhaseReady { hasReadyArtifact = true break } } if !hasReadyArtifact { continue } if bestSnapshot == nil || ss.CreationTimestamp.After(bestSnapshot.CreationTimestamp.Time) { bestSnapshot = ss bestSnapshotKey = manifest.ActiveSetRef.SnapshotKey } } if bestSnapshotKey != "" { klog.V(4).InfoS("snapshot lookup: found active fork snapshot key", "snapshot", bestSnapshot.Name, "snapshotKey", bestSnapshotKey) return bestSnapshotKey } return "" }

gemini-code-assist · 2026-06-05T10:22:33Z

+func (d *KuasarDriver) waitForReadiness(ctx context.Context, conn net.Conn, rd *bufio.Reader, sandboxID string) error {
+	ticker := time.NewTicker(kuasarReadinessPollInterval)
+	defer ticker.Stop()


In waitForReadiness, the connection's deadline is not aligned with the ctx deadline. If ctx times out, the loop might hang for up to an extra minute because readResponse is blocked on a socket read with a longer deadline. Setting the connection's deadline to ctx.Deadline() at the start of waitForReadiness ensures that any blocked read/write inside the loop will unblock immediately when ctx times out.

func (d *KuasarDriver) waitForReadiness(ctx context.Context, conn net.Conn, rd *bufio.Reader, sandboxID string) error { if deadline, ok := ctx.Deadline(); ok { _ = conn.SetDeadline(deadline) } ticker := time.NewTicker(kuasarReadinessPollInterval) defer ticker.Stop()

gemini-code-assist · 2026-06-05T10:22:33Z

+func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {
+	dialer := &net.Dialer{}
+	conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)
+	if err != nil {
+		return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)
+	}
+	_ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute))
+	return conn, nil


In dialSocket, the connection's deadline is set to a fixed 6 minutes in the future, ignoring any deadline/timeout on the passed ctx. It is much more robust and idiomatic to respect ctx.Deadline() if it is set.

Suggested change

func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {

dialer := &net.Dialer{}

conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)

if err != nil {

return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)

}

_ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute))

return conn, nil

func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) {

dialer := &net.Dialer{}

conn, err := dialer.DialContext(ctx, "unix", d.SocketPath)

if err != nil {

return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err)

}

if deadline, ok := ctx.Deadline(); ok {

_ = conn.SetDeadline(deadline)

} else {

_ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute))

}

return conn, nil

}

gemini-code-assist · 2026-06-05T10:22:33Z

+func int32Len(artifacts []store.SnapshotArtifact) int32 {
+	count := int32(0)
+	for range artifacts {
+		if count == maxInt32Value {
+			return maxInt32Value
+		}
+		count++
+	}
+	return count
+}


int32Len is implemented as an O(N) loop over the slice, which is highly inefficient and unnecessary. Using int32(len(artifacts)) directly is O(1) and much cleaner.

func int32Len(artifacts []store.SnapshotArtifact) int32 { return int32(len(artifacts)) }

gemini-code-assist · 2026-06-05T10:22:33Z

+func normalizePodSpec(spec corev1.PodSpec) corev1.PodSpec {
+	s := spec.DeepCopy()
+	s.NodeName = ""
+	sort.Slice(s.Tolerations, func(i, j int) bool {
+		return s.Tolerations[i].Key < s.Tolerations[j].Key
+	})
+	return *s
+}


normalizePodSpec sorts tolerations only by Key. If multiple tolerations have the same key (or empty keys), the sort order is non-deterministic, which can cause hash mismatches for semantically identical specs. Sorting by Key, Value, and Effect ensures deterministic sorting.

func normalizePodSpec(spec corev1.PodSpec) corev1.PodSpec { s := spec.DeepCopy() s.NodeName = "" sort.Slice(s.Tolerations, func(i, j int) bool { if s.Tolerations[i].Key != s.Tolerations[j].Key { return s.Tolerations[i].Key < s.Tolerations[j].Key } if s.Tolerations[i].Value != s.Tolerations[j].Value { return s.Tolerations[i].Value < s.Tolerations[j].Value } return string(s.Tolerations[i].Effect) < string(s.Tolerations[j].Effect) }) return *s }

codecov-commenter · 2026-06-05T10:24:18Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 1.67598% with 1056 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.93%. Comparing base (524e55e) to head (646c785).
⚠️ Report is 119 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/workloadmanager/snapshot_controller.go	0.00%	287 Missing ⚠️
pkg/workloadmanager/snapshot_fork.go	0.00%	265 Missing ⚠️
pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go	0.00%	187 Missing ⚠️
pkg/agentd/kuasar_driver.go	0.00%	123 Missing ⚠️
pkg/agentd/snapshot_task_reconciler.go	0.00%	86 Missing ⚠️
pkg/store/artifact_store_redis.go	0.00%	40 Missing ⚠️
pkg/workloadmanager/artifact_store_init.go	0.00%	30 Missing ⚠️
pkg/agentd/snapshot_mode.go	0.00%	17 Missing ⚠️
pkg/workloadmanager/handlers.go	60.86%	9 Missing ⚠️
pkg/workloadmanager/codeinterpreter_controller.go	0.00%	6 Missing ⚠️
... and 2 more
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #380      +/-   ##
==========================================
- Coverage   47.57%   43.93%   -3.64%     
==========================================
  Files          30       43      +13     
  Lines        2819     4224    +1405     
==========================================
+ Hits         1341     1856     +515     
- Misses       1338     2184     +846     
- Partials      140      184      +44

Flag	Coverage Δ
unittests	`43.93% <1.67%> (-3.64%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

feat: implement snapstart for codeinterpreter

646c785

Signed-off-by: lyuyun <lyuyun068@gmail.com>

Copilot AI review requested due to automatic review settings June 5, 2026 10:20

volcano-sh-bot requested review from LiZhenCheng9527, VanderChen, YaoZengzeng, acsoto and hzxuzhonghu June 5, 2026 10:20

volcano-sh-bot added the size/XXL label Jun 5, 2026

Copilot AI reviewed Jun 5, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 5, 2026

View reviewed changes

lyuyun closed this Jun 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: implement snapstart for codeinterpreter#380

feat: implement snapstart for codeinterpreter#380
lyuyun wants to merge 1 commit into
volcano-sh:mainfrom
lyuyun:snapstart-feat-dev

lyuyun commented Jun 5, 2026

Uh oh!

volcano-sh-bot commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Uh oh!

codecov-commenter commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lyuyun commented Jun 5, 2026

Uh oh!

volcano-sh-bot commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented Jun 5, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants