feat: implement snapstart for codeinterpreter#380
Conversation
Signed-off-by: lyuyun <lyuyun068@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces SandboxSnapshot support (Fork mode) across the control-plane and node agent, including new CRDs, controllers, artifact-store persistence, and restore-intent injection for session sandboxes.
Changes:
- Add SnapshotClass/SandboxSnapshot/SandboxSnapshotTask APIs + CRDs, plus controllers in workload-manager and agentd.
- Implement Fork snapshot mode end-to-end, including per-node build sandboxes, task orchestration, hashing, and artifact manifest management.
- Add Redis/Valkey-backed artifact store and wire snapshot restore-intent into sandbox creation.
Reviewed changes
Copilot reviewed 22 out of 40 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/workloadmanager/snapshot_mode.go | Defines the mode-handler interface for snapshot reconciliation. |
| pkg/workloadmanager/snapshot_fork.go | Implements Fork-mode snapshot orchestration, hashing, and snapshot-key lookup for restore intent. |
| pkg/workloadmanager/snapshot_controller.go | Adds the SandboxSnapshot controller and shared manifest/status utilities. |
| pkg/workloadmanager/server.go | Wires optional snapshot lookup dependencies into the API server. |
| pkg/workloadmanager/handlers.go | Injects snapshot restore intent into sandbox creation (CodeInterpreter path). |
| pkg/workloadmanager/codeinterpreter_controller.go | Always reconciles SandboxTemplate so it can serve as Fork snapshot source. |
| pkg/workloadmanager/artifact_store_init.go | Adds env-based artifact store init (Redis/Valkey) with noop fallback. |
| pkg/store/artifact_store.go | Introduces ArtifactStore interface and snapshot artifact manifest schema. |
| pkg/store/artifact_store_redis.go | Implements ArtifactStore with Redis CAS semantics. |
| pkg/apis/runtime/v1alpha1/snapshot_types.go | Adds SnapshotClass/SandboxSnapshot/SandboxSnapshotTask API types and constants. |
| pkg/apis/runtime/v1alpha1/zz_generated.deepcopy.go | Adds generated DeepCopy methods for new API types. |
| pkg/agentd/snapshot_task_reconciler.go | Adds node-agent controller for executing SandboxSnapshotTasks. |
| pkg/agentd/snapshot_mode.go | Adds per-mode target validation (Fork waits for build sandbox readiness). |
| pkg/agentd/snapshot_driver.go | Adds snapshot driver interface and request/response structs. |
| pkg/agentd/kuasar_driver.go | Adds a Kuasar SnapStart driver scaffold (handshake + create/delete/inspect stubs). |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_snapshotclasses.yaml | Adds SnapshotClass CRD. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_sandboxsnapshots.yaml | Adds SandboxSnapshot CRD. |
| manifests/charts/base/crds/runtime.agentcube.volcano.sh_sandboxsnapshottasks.yaml | Adds SandboxSnapshotTask CRD. |
| hack/update-codegen.sh | Extends lister-gen workaround to new snapshot resources. |
| cmd/workload-manager/main.go | Registers snapshot controller, initializes artifact store, and passes deps to server. |
| cmd/agentd/main.go | Registers snapshot task controller and adds runtime API scheme. |
| go.mod | Minor ordering adjustment in require block. |
Files not reviewed (18)
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_runtime_client.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshot.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_sandboxsnapshottask.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/fake/fake_snapshotclass.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/generated_expansion.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/runtime_client.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/clientset/versioned/typed/runtime/v1alpha1/snapshotclass.go: Language not supported
- client-go/informers/externalversions/generic.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/interface.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/informers/externalversions/runtime/v1alpha1/snapshotclass.go: Language not supported
- client-go/listers/runtime/v1alpha1/expansion_generated.go: Language not supported
- client-go/listers/runtime/v1alpha1/sandboxsnapshot.go: Language not supported
- client-go/listers/runtime/v1alpha1/sandboxsnapshottask.go: Language not supported
- client-go/listers/runtime/v1alpha1/snapshotclass.go: Language not supported
Comments suppressed due to low confidence (5)
pkg/workloadmanager/snapshot_controller.go:1
- Completed-task cleanup ignores
SnapshotArtifactPhaseUnavailable. Tasks that end as Unavailable won't be cleaned up (and mode-specific cleanup like deleting build sandboxes won't run), causing resource leaks. IncludeUnavailableamong the terminal phases for cleanup.
pkg/workloadmanager/snapshot_controller.go:1 - This reads
manifest.ArtifactSets[pendingKey]without checking presence. If the manifest is corrupted or the set was deleted (e.g., manual edits / store issues),pendingbecomes a zero-value set and could be incorrectly promoted or cause confusing logs/status. Guard with anokcheck and clearPendingSetRef(or fail status) when the key is missing.
pkg/workloadmanager/snapshot_fork.go:1 - Hash normalization sorts tolerations only by
Key. If multiple tolerations share a key, the comparator doesn't define a strict total order, andsort.Sliceis not stable—this can yield non-deterministic ordering and spurious hash changes/rebuilds. Sort by a full tuple (e.g., Key, Operator, Value, Effect, TolerationSeconds) to make the ordering deterministic.
pkg/workloadmanager/snapshot_fork.go:1 lookupActiveForkSnapshotKeylists all SandboxSnapshots in the namespace and then potentially calls the artifact store per candidate. This can become a hot-path bottleneck during session creation (list cost + N artifact-store reads). Consider adding a label (e.g., sourceRef name + mode) on SandboxSnapshot objects and usingclient.MatchingLabelsto narrow the list, or maintaining an index/cache for ready snapshots keyed by (namespace, sourceTemplateName).
pkg/store/artifact_store_redis.go:1- The comment about the version token being a 'JSON encoding of the raw string value' doesn't match the implementation:
PutManifestdirectly comparescurrent(raw Redis string) withversion(also a raw string). Either update the comment to reflect raw-string comparison, or adjust the code to actually use an encoded token format if that's the intended contract.
| description: SandboxSnapshotMode selects the snapshot usage mode. | ||
| enum: | ||
| - Fork | ||
| - Resume | ||
| type: string |
| // Skip tasks that have already reached a terminal phase. | ||
| if task.Status.Phase == runtimev1alpha1.SnapshotArtifactPhaseReady || | ||
| task.Status.Phase == runtimev1alpha1.SnapshotArtifactPhaseFailed { | ||
| return ctrl.Result{}, nil | ||
| } |
| snapshotReconciler := &workloadmanager.SandboxSnapshotReconciler{ | ||
| Client: mgr.GetClient(), | ||
| ArtifactStore: workloadmanager.NewArtifactStoreFromEnv(), | ||
| } |
There was a problem hiding this comment.
Code Review
This pull request introduces a sandbox snapshotting feature to Volcano AgentCube, adding CRDs for SandboxSnapshot, SandboxSnapshotTask, and SnapshotClass, along with a node-local Kuasar snapshot driver, a snapshot task reconciler, and a control-plane snapshot controller. It also integrates snapshot restore intent into the workload manager and implements a Redis/Valkey-backed artifact store. The code review highlights several critical issues and improvement opportunities: a potential resource leak of physical artifacts on nodes during promotion, non-deterministic behavior in snapshot lookup and pod spec normalization, and inefficient O(N) slice length calculation. Additionally, feedback suggests improving robustness by aligning connection deadlines with context deadlines in the Kuasar driver and returning the raw manifest string from the artifact store to avoid double-marshaling overhead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // GetManifest retrieves the manifest for the given snapshot owner key. | ||
| // Returns nil, nil when no manifest exists. | ||
| GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, error) |
There was a problem hiding this comment.
GetManifest does not return the raw version string of the manifest from Redis. This forces loadManifest to use json.Marshal to generate a version string, which is non-deterministic (due to map key ordering/omitted fields) and introduces double-marshaling overhead. Changing GetManifest to return (*SnapshotArtifactManifest, string, error) where the second return value is the raw string retrieved from Redis avoids these issues.
| // GetManifest retrieves the manifest for the given snapshot owner key. | |
| // Returns nil, nil when no manifest exists. | |
| GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, error) | |
| // GetManifest retrieves the manifest and its raw string version for the given snapshot owner key. | |
| // Returns nil, "", nil when no manifest exists. | |
| GetManifest(ctx context.Context, ownerKey string) (*SnapshotArtifactManifest, string, error) |
| if manifest.ActiveSetRef.SnapshotKey != "" { | ||
| delete(manifest.ArtifactSets, manifest.ActiveSetRef.SnapshotKey) | ||
| } |
There was a problem hiding this comment.
When promoting a pending set to active, the old active set is deleted from manifest.ArtifactSets, but the physical snapshot artifacts on the nodes are never deleted, leading to a permanent resource leak on the nodes. Consider introducing a deletion/cleanup task or garbage collection mechanism to remove physical artifacts that are no longer referenced.
| func lookupActiveForkSnapshotKey( | ||
| ctx context.Context, | ||
| k8sClient client.Client, | ||
| artifactStore store.ArtifactStore, | ||
| namespace, sandboxTemplateName string, | ||
| ) string { | ||
| snapshotList := &runtimev1alpha1.SandboxSnapshotList{} | ||
| if err := k8sClient.List(ctx, snapshotList, client.InNamespace(namespace)); err != nil { | ||
| klog.V(4).InfoS("snapshot lookup: failed to list snapshots, falling back to cold start", | ||
| "namespace", namespace, "error", err) | ||
| return "" | ||
| } | ||
|
|
||
| for i := range snapshotList.Items { | ||
| ss := &snapshotList.Items[i] | ||
| if ss.Spec.SnapshotMode != runtimev1alpha1.SandboxSnapshotModeFork { | ||
| continue | ||
| } | ||
| if ss.Spec.SourceRef.Name != sandboxTemplateName { | ||
| continue | ||
| } | ||
| if ss.Status.Phase != runtimev1alpha1.SandboxSnapshotPhaseReady { | ||
| continue | ||
| } | ||
|
|
||
| ownerKey := store.ArtifactOwnerKey("SandboxSnapshot", ss.Namespace, ss.Name, string(ss.UID)) | ||
| manifest, err := artifactStore.GetManifest(ctx, ownerKey) | ||
| if err != nil { | ||
| klog.V(4).InfoS("snapshot lookup: artifact store error, falling back to cold start", | ||
| "snapshot", ss.Name, "error", err) | ||
| continue | ||
| } | ||
| if manifest == nil || manifest.ActiveSetRef.SnapshotKey == "" { | ||
| continue | ||
| } | ||
| activeSet, ok := manifest.ArtifactSets[manifest.ActiveSetRef.SnapshotKey] | ||
| if !ok { | ||
| continue | ||
| } | ||
| for _, art := range activeSet.Artifacts { | ||
| if art.Phase == store.SnapshotArtifactPhaseReady { | ||
| klog.V(4).InfoS("snapshot lookup: found active fork snapshot key", | ||
| "snapshot", ss.Name, "snapshotKey", manifest.ActiveSetRef.SnapshotKey) | ||
| return manifest.ActiveSetRef.SnapshotKey | ||
| } | ||
| } | ||
| } | ||
| return "" | ||
| } |
There was a problem hiding this comment.
lookupActiveForkSnapshotKey returns the first active snapshot key it finds in an arbitrary order. If multiple snapshots exist for the same template, this leads to non-deterministic behavior. Sorting or selecting the snapshot with the latest CreationTimestamp or ReadyAt time ensures deterministic behavior.
func lookupActiveForkSnapshotKey(
ctx context.Context,
k8sClient client.Client,
artifactStore store.ArtifactStore,
namespace, sandboxTemplateName string,
) string {
snapshotList := &runtimev1alpha1.SandboxSnapshotList{}
if err := k8sClient.List(ctx, snapshotList, client.InNamespace(namespace)); err != nil {
klog.V(4).InfoS("snapshot lookup: failed to list snapshots, falling back to cold start",
"namespace", namespace, "error", err)
return ""
}
var bestSnapshot *runtimev1alpha1.SandboxSnapshot
var bestSnapshotKey string
for i := range snapshotList.Items {
ss := &snapshotList.Items[i]
if ss.Spec.SnapshotMode != runtimev1alpha1.SandboxSnapshotModeFork {
continue
}
if ss.Spec.SourceRef.Name != sandboxTemplateName {
continue
}
if ss.Status.Phase != runtimev1alpha1.SandboxSnapshotPhaseReady {
continue
}
ownerKey := store.ArtifactOwnerKey("SandboxSnapshot", ss.Namespace, ss.Name, string(ss.UID))
manifest, err := artifactStore.GetManifest(ctx, ownerKey)
if err != nil {
klog.V(4).InfoS("snapshot lookup: artifact store error, falling back to cold start",
"snapshot", ss.Name, "error", err)
continue
}
if manifest == nil || manifest.ActiveSetRef.SnapshotKey == "" {
continue
}
activeSet, ok := manifest.ArtifactSets[manifest.ActiveSetRef.SnapshotKey]
if !ok {
continue
}
hasReadyArtifact := false
for _, art := range activeSet.Artifacts {
if art.Phase == store.SnapshotArtifactPhaseReady {
hasReadyArtifact = true
break
}
}
if !hasReadyArtifact {
continue
}
if bestSnapshot == nil || ss.CreationTimestamp.After(bestSnapshot.CreationTimestamp.Time) {
bestSnapshot = ss
bestSnapshotKey = manifest.ActiveSetRef.SnapshotKey
}
}
if bestSnapshotKey != "" {
klog.V(4).InfoS("snapshot lookup: found active fork snapshot key",
"snapshot", bestSnapshot.Name, "snapshotKey", bestSnapshotKey)
return bestSnapshotKey
}
return ""
}| func (d *KuasarDriver) waitForReadiness(ctx context.Context, conn net.Conn, rd *bufio.Reader, sandboxID string) error { | ||
| ticker := time.NewTicker(kuasarReadinessPollInterval) | ||
| defer ticker.Stop() |
There was a problem hiding this comment.
In waitForReadiness, the connection's deadline is not aligned with the ctx deadline. If ctx times out, the loop might hang for up to an extra minute because readResponse is blocked on a socket read with a longer deadline. Setting the connection's deadline to ctx.Deadline() at the start of waitForReadiness ensures that any blocked read/write inside the loop will unblock immediately when ctx times out.
func (d *KuasarDriver) waitForReadiness(ctx context.Context, conn net.Conn, rd *bufio.Reader, sandboxID string) error {
if deadline, ok := ctx.Deadline(); ok {
_ = conn.SetDeadline(deadline)
}
ticker := time.NewTicker(kuasarReadinessPollInterval)
defer ticker.Stop()| func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) { | ||
| dialer := &net.Dialer{} | ||
| conn, err := dialer.DialContext(ctx, "unix", d.SocketPath) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err) | ||
| } | ||
| _ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute)) | ||
| return conn, nil |
There was a problem hiding this comment.
In dialSocket, the connection's deadline is set to a fixed 6 minutes in the future, ignoring any deadline/timeout on the passed ctx. It is much more robust and idiomatic to respect ctx.Deadline() if it is set.
| func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) { | |
| dialer := &net.Dialer{} | |
| conn, err := dialer.DialContext(ctx, "unix", d.SocketPath) | |
| if err != nil { | |
| return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err) | |
| } | |
| _ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute)) | |
| return conn, nil | |
| func (d *KuasarDriver) dialSocket(ctx context.Context) (net.Conn, error) { | |
| dialer := &net.Dialer{} | |
| conn, err := dialer.DialContext(ctx, "unix", d.SocketPath) | |
| if err != nil { | |
| return nil, fmt.Errorf("dial unix %s: %w", d.SocketPath, err) | |
| } | |
| if deadline, ok := ctx.Deadline(); ok { | |
| _ = conn.SetDeadline(deadline) | |
| } else { | |
| _ = conn.SetDeadline(time.Now().Add(kuasarReadinessTimeout + time.Minute)) | |
| } | |
| return conn, nil | |
| } |
| func int32Len(artifacts []store.SnapshotArtifact) int32 { | ||
| count := int32(0) | ||
| for range artifacts { | ||
| if count == maxInt32Value { | ||
| return maxInt32Value | ||
| } | ||
| count++ | ||
| } | ||
| return count | ||
| } |
| func normalizePodSpec(spec corev1.PodSpec) corev1.PodSpec { | ||
| s := spec.DeepCopy() | ||
| s.NodeName = "" | ||
| sort.Slice(s.Tolerations, func(i, j int) bool { | ||
| return s.Tolerations[i].Key < s.Tolerations[j].Key | ||
| }) | ||
| return *s | ||
| } |
There was a problem hiding this comment.
normalizePodSpec sorts tolerations only by Key. If multiple tolerations have the same key (or empty keys), the sort order is non-deterministic, which can cause hash mismatches for semantically identical specs. Sorting by Key, Value, and Effect ensures deterministic sorting.
func normalizePodSpec(spec corev1.PodSpec) corev1.PodSpec {
s := spec.DeepCopy()
s.NodeName = ""
sort.Slice(s.Tolerations, func(i, j int) bool {
if s.Tolerations[i].Key != s.Tolerations[j].Key {
return s.Tolerations[i].Key < s.Tolerations[j].Key
}
if s.Tolerations[i].Value != s.Tolerations[j].Value {
return s.Tolerations[i].Value < s.Tolerations[j].Value
}
return string(s.Tolerations[i].Effect) < string(s.Tolerations[j].Effect)
})
return *s
}|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #380 +/- ##
==========================================
- Coverage 47.57% 43.93% -3.64%
==========================================
Files 30 43 +13
Lines 2819 4224 +1405
==========================================
+ Hits 1341 1856 +515
- Misses 1338 2184 +846
- Partials 140 184 +44
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
No description provided.