Skip to content

feat(templates): E2B-style display names with CubeMaster name cache#607

Open
HuChundong wants to merge 1 commit into
TencentCloud:masterfrom
HuChundong:feat/template-name-e2b-align
Open

feat(templates): E2B-style display names with CubeMaster name cache#607
HuChundong wants to merge 1 commit into
TencentCloud:masterfrom
HuChundong:feat/template-name-e2b-align

Conversation

@HuChundong

@HuChundong HuChundong commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add optional human-readable template name (CubeMaster display_name) with cluster-wide case-insensitive uniqueness (20260623051112_template_display_name_index.sql).
  • Web: list/show display names; PATCH rename on template detail; optional name on create.
  • API: POST /sandboxes and template paths accept display name or tpl-* id; CubeAPI resolves name→id once at entry, then uses existing tpl-id paths.
  • CubeMaster: GET /cube/template/lookup, POST /cube/template/display-name; sync name reservation at submit (ReserveTemplateDisplayNameForCreate → 409); release name after failed builds.
  • Name cache (CubeMaster): display_nametemplate_id cached in templatecenter (300s TTL, 4096 entries, singleflight); invalidated on rename/delete/create. CubeAPI is a thin client with no local cache.

Architecture

Layer Approach
CubeAPI Thin resolver; delegates all name lookups to CubeMaster
CubeMaster Owns display_name, lookup cache, and invalidation
Uniqueness Global per cluster (no tenant/team namespace today)

Compared to E2B infra: E2B uses team-scoped aliases reserved in RegisterBuild; we use global display_name_key + sync submit check. We release names on build failure (E2B keeps aliases until template delete).

Test plan

  • go test ./pkg/templatecenter/...
  • go test ./pkg/service/httpservice/cube/...
  • cargo test -p cube-api
  • go test ./pkg/base/dao/migrate/ -run TestMigrationFilenames
  • Apply migration 20260623051112_template_display_name_index on staging (run audit queries in migration header first)
  • Create template with name → create sandbox by name when READY
  • Duplicate name at submit → 409; failed build → name reusable

Migration note

This migration may clear legacy tpl-/snap- prefixed names and dedupe display names before adding the unique index. Run preflight audit queries on staging before production.

Assisted-by: Cursor:Composer

@kinwin-ustc

kinwin-ustc commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

You can check if PR #612 meets your requirements? The scale of the changes is much smaller

@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from ff87c16 to 2ed302c Compare June 22, 2026 08:28
Comment thread CubeAPI/src/services/templates.rs Outdated

match self
.fetch_resolved_template(&template_ref, &template_id)
.await

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant second fetch_resolved_template call

update_template_name already calls fetch_resolved_template on line 144 to get the existing template metadata (including the old display_name for cache invalidation). After the successful update, it calls fetch_resolved_template again on line 164 solely to build the response. This adds an unnecessary cross-service HTTP roundtrip + DB query.

The response could be constructed from the already-fetched existing record (which was returned just 20 lines earlier), substituting the new name for display_name. Consider eliminating this second fetch to reduce latency — especially since the CubeMaster PATCH update already returned a success envelope, so the template definition is expected to still exist.

.await?;
let existing = self
.fetch_resolved_template(&template_ref, &template_id)
.await?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra fetch_resolved_template purely for display_name cache invalidation

delete_template calls fetch_resolved_template solely to retrieve the old display_name for cache invalidation (line 228: for display_name in template_names(&existing.display_name)). This triggers an HTTP roundtrip to CubeMaster + a DB query just to get a single string field that's already available in CubeMaster's response to the lookup endpoint.

Consider having GET /cube/template/lookup also return display_name alongside template_id, so the single lookup call serves both the ID resolution and the cache invalidation data.

return normalized, nil
}

func mapDefinitionCreateDuplicateError(err error, displayName string) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raw database errors propagated to API responses

mapDefinitionCreateDuplicateError returns err unmodified when the error message does not contain "1062" or "Duplicate entry". This means transient MySQL errors (e.g. deadlocks, connection timeouts, schema mismatches) flow through to the API response's RetMsg field verbatim (via handleTemplateLookupAction line 118 and handleTemplateDisplayNameAction line 168).

While these are authenticated endpoints, exposing raw DB error texts can reveal internal schema details, engine versions, and connection strings. Consider wrapping non-duplicate errors into a generic internal-error message before returning to clients, while logging the raw error server-side.

Comment thread CubeAPI/src/services/template_names.rs Outdated
error::{AppError, AppResult},
};

const CACHE_TTL: Duration = Duration::from_secs(60);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache TTL of 60 seconds is unnecessarily short

Template display names are effectively immutable after creation — they only change through the explicit update_template_name endpoint, which triggers targeted cache invalidation. Every 60s cache miss forces a cross-service HTTP call to CubeMaster's /cube/template/lookup endpoint (which queries the DB).

Since inline invalidation already handles renames on the originating replica, a TTL of 300-600s would dramatically reduce CubeMaster load without increasing the staleness window beyond what the existing invalidation already provides. Multi-replica deployments already tolerate up to one TTL window of staleness; 60s is unnecessarily tight for a mapping that changes at human timescales.

Comment thread CubeAPI/src/services/template_names.rs Outdated
//

//! Template name → templateID resolution with an in-memory cache.
//!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Module doc comment omits several routes using name resolution

The doc comment lists POST /sandboxes, GET/PATCH/DELETE /templates/{templateID}, and compat-adopt as the only routes with name resolution. However, the TemplateService also resolves names for POST /templates/:templateID (rebuild), POST /templates/:templateID/builds/:buildID (start build), and GET /templates/:templateID/builds/:buildID/status (get build status). A developer reading this comment would miss roughly half the affected endpoints. Recommend updating the comment to reflect the full set of routes.

@HuChundong

Copy link
Copy Markdown
Contributor Author

You can check if PR #612 meets your requirements? The scale of the changes is much smaller

These two PRs overlap slightly, but they have completely different priorities. This PR focuses on syncing the E2B API, making template-based sandbox creation the core feature. Displaying names on the WebUI serves little purpose if those names can't actually be used to spin up sandboxes.

@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from 2ed302c to 214dc12 Compare June 22, 2026 08:52
@HuChundong HuChundong marked this pull request as draft June 22, 2026 08:55
@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from 214dc12 to f801c11 Compare June 22, 2026 09:06
@HuChundong HuChundong marked this pull request as ready for review June 22, 2026 09:16
@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from 19ceabf to b297013 Compare June 22, 2026 09:25

@HuChundong HuChundong left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated review of b297013 — see inline notes for actionable items.

Comment thread CubeAPI/src/services/template_names.rs Outdated
});
}
let key = reference.to_ascii_lowercase();
if let Some(template_id) = self.cache_get(&key).await {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-replica stale cache on destructive paths

get_template_build_status re-validates cache hits via fetch_resolved_template, but create_sandbox, delete_template, and update_template_name trust cached name→ID mappings for up to 300s on replicas that did not handle the rename/delete.

After a name is freed or reassigned, other CubeAPI replicas can still route POST /sandboxes or DELETE/PATCH to the old template. Consider reusing the build-status revalidation pattern (or a lightweight lookup) on cache hits for mutating endpoints.

Comment thread CubeAPI/src/services/templates.rs Outdated
.template_names
.resolve_template_ref(&template_ref)
.await?;
let names_for_invalidation = if TemplateNameCache::needs_resolution(&template_ref) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service-layer tests missing for name-resolution orchestration

template_names.rs has solid unit tests, but update_template_name, delete_template (name vs tpl-id branches), and sandbox create-by-name paths in this module are untested.

A few focused tests with a mock CubeMasterClient would guard cache invalidation and the delete-by-name skip-fetch behavior added in this PR.


CALL cubemaster_assert_table_exists('t_cube_template_definition');

-- Drop invalid reserved-prefix names on templates (snapshots keep their names).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migration preflight clears legacy names irreversibly

Lines 14–37 blank out reserved-prefix names and duplicate display names before adding the unique index. This is reasonable for greenfield but can surprise production upgrades.

Please confirm staging/release notes call this out explicitly (PR body mentions it — worth a pre-migration audit query for affected rows).

@HuChundong

Copy link
Copy Markdown
Contributor Author

Code review (b297013)

Overall the PR is in good shape after the review fixes: name validation is centralized in CubeMaster, internal DB errors are masked, redundant fetches were trimmed, and test coverage on the Master side is solid.

Worth addressing / confirming

  1. Multi-replica cache staleness (medium) — Process-local 300s cache is only invalidated on the writing replica. Mutating paths (POST /sandboxes, DELETE/PATCH /templates/{name}) do not re-validate cache hits, unlike get_template_build_status. Document as a known limitation or add revalidation on cache hits for destructive routes.

  2. CubeAPI service tests (medium) — Add mock-based tests for update_template_name / delete_template orchestration and cache invalidation; unit tests in template_names.rs do not cover these flows.

  3. Migration 0010 preflight (low) — Preflight UPDATEs silently clear invalid/duplicate legacy display names. Ensure staging migration checklist includes an audit query before apply.

Look good

  • DB uniqueness via display_name_key + placeholder reservation at submit
  • Generic internal errors in Master handlers
  • OpenAPI / module docs aligned with name-resolution routes
  • CubeMaster unit + HTTP handler tests for lookup/rename paths

No blocking issues found for merge after acknowledging multi-replica cache behavior in release notes or follow-up.

@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from 9438441 to 4483cba Compare June 22, 2026 10:32
Comment thread CubeAPI/src/services/template_names.rs Outdated
Comment on lines +149 to +156
async fn lookup_and_cache(&self, reference: &str, key: &str) -> AppResult<String> {
let lock = {
let mut locks = self.lookup_locks.write().await;
locks
.entry(key.to_string())
.or_insert_with(|| Arc::new(Mutex::new(())))
.clone()
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential memory leak: lookup_locks entries are never removed. Every unique name ever resolved adds an Arc<Mutex<()>> to the lookup_locks map, but no code cleans them up — not when cache entries are evicted (TTL or capacity), not on invalidation. Over a long-running process with many distinct template names, this grows unbounded, violating the MAX_CACHE_ENTRIES contract on self.entries.

Suggestion: Either clean up lookup_locks entries alongside cache eviction (cache_set / invalidate_name), or switch to a fixed-size striped lock pool (e.g., 64 or 256 slots via usize::from_str_bytes hash) to bound memory and reduce write-lock contention on the lookup_locks RwLock.

Comment thread CubeAPI/src/services/template_names.rs Outdated
Comment on lines +82 to +95
/// Resolve a display name for mutating API paths, always refreshing via lookup.
pub async fn resolve_template_ref_for_mutation(&self, reference: &str) -> AppResult<String> {
let reference = reference.trim();
if reference.is_empty() {
return Err(AppError::BadRequest(
"template reference is empty".to_string(),
));
}
if !Self::needs_resolution(reference) {
return Ok(reference.to_string());
}
let key = reference.to_ascii_lowercase();
self.lookup_and_cache(reference, &key).await
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc/behavior mismatch: The doc comment says "always refreshing via lookup," but the implementation calls lookup_and_cache, which checks the cache first and returns a cached value if still valid (300s TTL). Mutating paths (delete, rebuild, adopt-baseline) can therefore operate on stale template_id mappings for up to 300s after a name is reassigned on another replica.

Suggestion: Either bypass the cache for mutation paths (clear entry before lookup), or update the doc comment to accurately describe the caching behavior (e.g., "uses cached resolution when available; cache TTL bounds cross-replica staleness").

}},
}
}
normalized, _ := templatecenter.NormalizeTemplateDisplayName(name)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error silently discarded: NormalizeTemplateDisplayName can return an error (e.g., name exceeds 256 chars after lookup succeeded, or contains unexpected characters), but the error is discarded via _. The response will have a valid template_id but an empty/incorrect display_name, which is misleading for API consumers.

Suggestion: Log the error or fall back to the original name:

normalized := name
if n, err := templatecenter.NormalizeTemplateDisplayName(name); err == nil {
    normalized = n
} else {
    log.G(ctx).Warnf("lookup returned valid template but name normalization failed: %v", err)
}

@cubesandboxbot

cubesandboxbot Bot commented Jun 22, 2026

Copy link
Copy Markdown

PR Review: E2B-style display names with CubeMaster name cache

Overview

This is a well-structured feature addition implementing human-readable template display names. The architecture follows a clean layered pattern: CubeAPI is a thin resolver that delegates all name lookups to CubeMaster via HTTP, which owns the name cache (300s TTL, 4096 entries, singleflight dedup) and all invalidation logic. The display_name_key generated column with a UNIQUE index provides case-insensitive uniqueness at the database level. Test coverage is solid overall with a healthy test-to-code ratio.

Key Findings

Medium: Missing input validation at CubeAPI boundary (CubeAPI/src/models/mod.rs:671)
UpdateTemplateRequest.name is a bare String with no validation. Other request structs use derive[Validate] but this one does not. All documented constraints are enforced only by CubeMaster -- a defense-in-depth gap adding unnecessary network round-trips.

Medium: Internal error details forwarded to API callers (CubeAPI/src/services/templates.rs:102-106)
Transport errors from CubeMaster format messages like "CubeMaster unavailable: {e}" that include internal cluster URLs, returned as 502/503 to API callers. These should be sanitized server-side before forwarding.

Medium: Tracing context dropped in displayNameFromTemplateImageJob (CubeMaster/pkg/templatecenter/template_display_name.go:382)
Uses context.Background() for logging instead of accepting a context.Context parameter. Breaks distributed trace correlation on build failure log lines.

Medium: ReleaseTemplateDisplayNameAfterBuildFailure is untested (template_display_name.go:235)
Called on every build failure to release the display name. Has no tests despite non-trivial branching logic. If broken, names become permanently locked on failed builds.

Low: No rate limiting on /templates/lookup (CubeAPI/src/routes.rs:189)
The template routes use with_auth (auth only), not with_auth_and_rate_limit. The lookup endpoint is an unthrottled name-to-ID oracle for authenticated clients.

Low: Cache invalidation maintainability gap (CubeMaster/pkg/templatecenter/cache.go:204)
invalidateTemplateCaches does not invalidate the display name cache, but there is no comment explaining this. Future developers must remember to invalidate both caches separately.

Minor Items

  • CopyableText in TemplateDetail.tsx duplicates copy-to-clipboard logic from CopyButton
  • Five template endpoints absent from OpenAPI spec
  • evictOneDisplayNameCacheEntry() uses O(n) Items() copy to delete one entry
  • fresh=true handler path untested in template_display_name_test.go

Positive Notes

  • Singleflight correctly deduplicates concurrent lookups with separate group keys for fresh vs cached
  • Migration SQL is well-documented with pre-audit queries and clear rollback
  • Display_name max length (256) is consistent across Rust, Go, SQL, and OpenAPI spec
  • TemplateNameCache correctly bypasses resolution for tpl-/snap- prefixed IDs
  • Inline comments posted for specific issues above

Comment thread CubeAPI/src/services/template_names.rs Outdated
Comment on lines +149 to +156
async fn lookup_and_cache(&self, reference: &str, key: &str) -> AppResult<String> {
let lock = {
let mut locks = self.lookup_locks.write().await;
locks
.entry(key.to_string())
.or_insert_with(|| Arc::new(Mutex::new(())))
.clone()
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential memory leak: lookup_locks entries are never removed. Every unique name ever resolved adds an Arc<Mutex<()>> to the lookup_locks map, but no code cleans them up — not when cache entries are evicted (TTL or capacity), not on invalidation. Over a long-running process with many distinct template names, this grows unbounded, violating the MAX_CACHE_ENTRIES contract on self.entries.

Suggestion: Either clean up lookup_locks entries alongside cache eviction (cache_set / invalidate_name), or switch to a fixed-size striped lock pool (e.g., 64 or 256 slots via usize::from_str_bytes hash) to bound memory and reduce write-lock contention on the lookup_locks RwLock.

Comment thread CubeAPI/src/services/template_names.rs Outdated
Comment on lines +82 to +95
/// Resolve a display name for mutating API paths, always refreshing via lookup.
pub async fn resolve_template_ref_for_mutation(&self, reference: &str) -> AppResult<String> {
let reference = reference.trim();
if reference.is_empty() {
return Err(AppError::BadRequest(
"template reference is empty".to_string(),
));
}
if !Self::needs_resolution(reference) {
return Ok(reference.to_string());
}
let key = reference.to_ascii_lowercase();
self.lookup_and_cache(reference, &key).await
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc/behavior mismatch: The doc comment says "always refreshing via lookup," but the implementation calls lookup_and_cache, which checks the cache first and returns a cached value if still valid (300s TTL). Mutating paths (delete, rebuild, adopt-baseline) can therefore operate on stale template_id mappings for up to 300s after a name is reassigned on another replica.

Suggestion: Either bypass the cache for mutation paths (clear entry before lookup), or update the doc comment to accurately describe the caching behavior (e.g., "uses cached resolution when available; cache TTL bounds cross-replica staleness").

}},
}
}
normalized, _ := templatecenter.NormalizeTemplateDisplayName(name)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error silently discarded: NormalizeTemplateDisplayName can return an error (e.g., name exceeds 256 chars after lookup succeeded, or contains unexpected characters), but the error is discarded via _. The response will have a valid template_id but an empty/incorrect display_name, which is misleading for API consumers.

Suggestion: Log the error or fall back to the original name:

normalized := name
if n, err := templatecenter.NormalizeTemplateDisplayName(name); err == nil {
    normalized = n
} else {
    log.G(ctx).Warnf("lookup returned valid template but name normalization failed: %v", err)
}

@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from 4483cba to 52079e1 Compare June 22, 2026 10:55
return def.DisplayName
}

func assertDisplayNameAvailableForCreate(ctx context.Context, normalized, templateID string) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance: Two DB queries under write lock

This function does two separate DB round-trips while holding the per-name write lock:

  1. findTemplateIDByDisplayNameFromDB — SELECT template_id
  2. GetDefinition — full row fetch (only status is needed)

Combine into a single query that fetches both template_id and status:

@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from 5e7a6b5 to 4f5faba Compare June 23, 2026 10:44
if err != nil {
return err
}
if err := ValidateTemplateDisplayNameAvailable(ctx, normalized, templateID); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU: availability check and DB write are not atomic

ValidateTemplateDisplayNameAvailable (line 308) queries the DB to check name availability, then the Update on line 312-314 writes the new name. Between these two operations, a concurrent caller can claim the same name. The DB unique index on display_name_key catches the duplicate at write time, but the resulting error is ambiguous — mapDefinitionCreateDuplicateError maps it to a generic "internal error" for non-key errors (template_display_name.go:356), and the user sees an opaque 500 instead of a clear 409 Conflict.

The existing withDisplayNameCreateLock primitive (line 203) provides exactly the serialization this path needs (check + write under the per-name write lock) but is not used here. Consider wrapping the check+write in withTemplateWriteLock to match the pattern used by ensureTemplateDefinition.

Comment on lines +221 to +231

// ReserveTemplateDisplayNameForCreate checks that displayName is free before persisting
// a template definition. Names on FAILED templates are cleared so they can be reused.
func ReserveTemplateDisplayNameForCreate(ctx context.Context, displayName, templateID string) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading export: lock is released before the caller INSERTs

ReserveTemplateDisplayNameForCreate acquires the per-name write lock, verifies availability against the DB, then releases the lock immediately (the fn is a no-op func() error { return nil }). The caller must separately INSERT the template definition afterward. Between lock release and the INSERT, any concurrent caller can claim the same name.

By contrast, withDisplayNameCreateLock (line 203) supports the correct pattern: the caller passes the INSERT as fn and it runs under the lock. The exported ReserveTemplateDisplayNameForCreate is a footgun because it looks like a complete "check and reserve" API but only does the check. Consider either removing the export (unexport and use withDisplayNameCreateLock directly in the caller) or documenting prominently that callers must still INSERT under a separate lock.

Comment thread CubeAPI/src/error/mod.rs
AppError::BadRequest(msg) => (StatusCode::BAD_REQUEST, 400, msg.clone()),
AppError::BadGateway(msg) => (StatusCode::BAD_GATEWAY, 502, msg.clone()),
AppError::ServiceUnavailable(msg) => {
(StatusCode::SERVICE_UNAVAILABLE, 503, msg.clone())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error messages leak internal details to API clients

These new variants follow the same pattern as AppError::Internal (line 54): the full error message is serialized into the JSON response body. For BadGateway and ServiceUnavailable, this leaks HTTP error details from cross-service CubeMaster calls (timeouts, connection refusals) directly to external callers. Consider logging the full error server-side and returning a generic message to the client, matching the Go-side pattern (template_display_name.go:356 — log detail, return "internal error").

-- WHERE t.kind = 'template' AND TRIM(t.display_name) <> '' AND t.id <> keep.keep_id;

-- +goose NO TRANSACTION
-- +goose Up

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up is destructive, Down is not reversible

The two UPDATEs (clearing tpl-/snap- prefixed names and deduplicating display_name) permanently destroy original metadata. The Down migration only drops the column and index — it cannot restore the cleared names. If the migration is applied and then rolled back, the original display names are lost. Consider either making Up non-destructive (log warnings instead of clearing) or adding a backup table/column that Down can restore from.

@@ -20,6 +20,11 @@ import (
const (
templateDefinitionCacheTTL = 360 * time.Minute

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per-replica cache: inconsistent state across replicas for up to 300s

The cache comment (lines 21-25) acknowledges this is per-process only, but during rename/create on replica A, other replicas serve stale positive or not-found cache entries until TTL expiry. A caller hitting replica B after a rename still gets the old templateID for up to 300s; a newly created template is invisible on replicas with a not-found cache entry for 30s. The fresh=true parameter mitigates this but requires callers to know when to use it.

Consider: shorter TTL for not-found cache (already 30s is reasonable), or a shared backing store (Redis) for multi-replica deployments.

if fresh {
groupKey = "fresh:" + cacheKey
}
result, err := templateDisplayNameFetchGroup.Do(groupKey, func() (interface{}, error) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double cache check is correct but fragile — should be documented

The cache is checked before entering the singleflight group (lines 98-104), and again inside the callback (lines 110-117). The inner check handles the race where a concurrent goroutine populated the cache between the outer check and entering Do. Without a comment explaining this, a future reader could remove the inner check as "redundant dead code," breaking the coalescing guarantee. Consider adding a brief inline comment.

let display_name = display_name.trim();
if display_name.is_empty() {
Vec::new()
} else {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

names: Vec<String> is always 0 or 1 — consider Option<String>

template_names (line 85) returns either vec![] or vec!["name"]. Every downstream consumer (Go sends display_name as scalar, Rust wraps in Vec, TypeScript reads names[0]) treats this as a single optional value. If forward-compatibility is intended (e.g., aliases), document it explicitly; otherwise this indirection adds unnecessary cognitive load across the full stack.

templateDisplayNameCache.Set(key, templateID, templateDisplayNameCacheTTL)
}

func evictOneDisplayNameNotFoundCacheEntry() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not-found cache eviction is O(n)

When the cache reaches templateDisplayNameNotFoundCacheMaxLen (1024), every subsequent insert calls Items() which allocates and returns a shallow copy of the entire map, then iterates to find one key to delete. Under a burst of not-found probes (e.g., enumeration of nonexistent names), this creates unnecessary GC pressure. A FIFO ring buffer or counter-based eviction would be O(1).

if err != nil {
return err
}
if err := ValidateTemplateDisplayNameAvailable(ctx, normalized, templateID); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOCTOU: availability check and DB write are not atomic

ValidateTemplateDisplayNameAvailable (line 308) queries the DB to check name availability, then the Update on line 312-314 writes the new name. Between these two operations, a concurrent caller can claim the same name. The DB unique index on display_name_key catches the duplicate at write time, but the resulting error is ambiguous — mapDefinitionCreateDuplicateError maps it to a generic "internal error" for non-key errors (template_display_name.go:356), and the user sees an opaque 500 instead of a clear 409 Conflict.

The existing withDisplayNameCreateLock primitive (line 203) provides exactly the serialization this path needs (check + write under the per-name write lock) but is not used here. Consider wrapping the check+write in withTemplateWriteLock to match the pattern used by ensureTemplateDefinition.

Comment on lines +221 to +231

// ReserveTemplateDisplayNameForCreate checks that displayName is free before persisting
// a template definition. Names on FAILED templates are cleared so they can be reused.
func ReserveTemplateDisplayNameForCreate(ctx context.Context, displayName, templateID string) error {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading export: lock is released before the caller INSERTs

ReserveTemplateDisplayNameForCreate acquires the per-name write lock, verifies availability against the DB, then releases the lock immediately (the fn is a no-op func() error { return nil }). The caller must separately INSERT the template definition afterward. Between lock release and the INSERT, any concurrent caller can claim the same name.

By contrast, withDisplayNameCreateLock (line 203) supports the correct pattern: the caller passes the INSERT as fn and it runs under the lock. The exported ReserveTemplateDisplayNameForCreate is a footgun because it looks like a complete "check and reserve" API but only does the check. Consider either removing the export (unexport Reserve and use withDisplayNameCreateLock directly in the caller) or documenting prominently that callers must still INSERT under a separate lock.

@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from 4f5faba to 6118ff2 Compare June 23, 2026 11:03
Comment thread CubeAPI/src/models/mod.rs
/// use `tpl-` or `snap-` prefix; cluster-wide case-insensitive unique among templates
/// that currently hold the name (assigned when the definition is created;
/// released after failed builds). Sandboxes may reference this name only after
/// the template build succeeds (template ready on a node).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defense-in-depth: Name validation is missing on the Rust/CubeAPI side

The name field is a plain Option<String> with no #[validate(...)] annotation. Actual syntax enforcement (character set, length, tpl-/snap- prefix rejection) happens only on the Go/CubeMaster side inside NormalizeTemplateDisplayName. Invalid names are forwarded across the internal HTTP link before being rejected.

Consider adding a #[validate(custom = ...)] validator that mirrors the Go-side rules so CubeAPI rejects bad input at the API boundary. This also keeps the utoipa-annotated constraints schema-accurate at the edge.

@@ -100,12 +108,14 @@ func (g *templateFetchGroup) Do(key string, fn func() (interface{}, error)) (int
g.calls[key] = call

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latent correctness bug: custom singleflight silently returns (nil, nil) on panic

This deferred cleanup runs regardless of whether fn() panics. If the callback panics, call.val and call.err remain at their Go zero values (nil, nil). All goroutines waiting on <-call.done (line 104) wake up and return (nil, nil), receiving a wrong result.

In findTemplateIDByDisplayName (template_display_name.go:131), the result is type-asserted: templateID, _ := result.(string) — a nil interface value produces "", which is then returned as a valid (but wrong) template ID.

Recommend adding recover() in this deferred function that sets call.err and re-panics, or replacing with golang.org/x/sync/singleflight.

return clearDefinitionDisplayNameLocked(ctx, templateID, def)
})
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cache inconsistency on unexpected GetDefinition error

When GetDefinition returns an unexpected error (transient DB error, not ErrTemplateNotFound), def is nil and the code falls through to clearDefinitionDisplayNameLocked(ctx, templateID, def) at line 290 (now line 290 in the new file). Inside that function (line 265: if def != nil { invalidateTemplateDisplayNameCache(def.DisplayName) }), the cache invalidation is skipped because def is nil. The DB update still runs (setting display_name to ""), but the old display name remains in the in-memory cache until TTL expiry, causing subsequent lookups to return a stale template ID.

Recommend returning the error instead of falling through, or fetching the display name separately for cache invalidation.

if err != nil {
code := int(errorcode.ErrorCode_MasterInternalError)
retMsg := "internal error"
switch {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumeration oracle: distinct HTTP codes reveal name state

This error handling block returns distinct status codes for different failure modes:

  • 404 (ErrorCode_NotFound) → name not found
  • 400 (ErrorCode_MasterParamsError) → name exists but is ambiguous or invalid
  • 200 → name exists and resolves to a template ID

An authenticated user can probe whether any specific display name exists, enumerate template names, and distinguish "not found" from "invalid syntax." This feeds into the cluster-wide uniqueness concern — an attacker can build a catalog of template names.

Recommend returning a generic failure code (e.g., 404) for all three lookup failures, and relying on client-side regex validation (mirroring NormalizeTemplateDisplayName) for the create-form UX check instead of API-level discrimination.

@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch 2 times, most recently from 2954300 to 6983aa7 Compare June 23, 2026 11:25
}
normalized, err := NormalizeTemplateDisplayName(req.DisplayName)
if err != nil {
log.G(context.Background()).Warnf("template image job display_name normalize failed: %v", err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: displayNameFromTemplateImageJob (line 382) takes no context.Context parameter and uses context.Background() for logging (lines 388, 393). This severs distributed trace correlation — log lines emitted here have no link to the originating request.

Every other public function in this file accepts a context.Context. Adding one here and propagating it from callers (job_repo.go line 157, image_job_runner.go lines 211-214) would be a small change with significant observability benefit.

Comment thread CubeAPI/src/models/mod.rs

/// Body for PATCH /templates/:templateID (update display name).
#[derive(Debug, Deserialize, ToSchema)]
pub struct UpdateTemplateRequest {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: UpdateTemplateRequest.name is a bare String with no validation annotations. Other request structs in this file (e.g. NewSandbox) use #[derive(Validate)] with constraints. The documented constraints (max 256 chars, ASCII letters/digits/hyphen/underscore only, no tpl-/snap- prefix) are enforced only by CubeMaster, not at the CubeAPI boundary.

This is a defense-in-depth gap: invalid names pass through CubeAPI to CubeMaster before being rejected, costing a network round-trip. A CubeMaster regression in error code mapping could produce unexpected HTTP status codes instead of the documented 400.

Comment thread CubeAPI/src/services/templates.rs Outdated
CubeMasterError::Http(e) if e.is_timeout() || e.is_connect() => {
AppError::ServiceUnavailable(format!("CubeMaster unavailable: {e}"))
}
CubeMasterError::Http(e) => AppError::BadGateway(format!("CubeMaster request failed: {e}")),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: When CubeMaster returns an unexpected transport error, map_resolve_err formats messages like "CubeMaster unavailable: {e}"/"CubeMaster request failed: {e}" where {e} is the full reqwest::Error Display string. This can include the internal CubeMaster URL (e.g., http://cubemaster.internal:8080/...) and connection details, which are forwarded to the API caller as 502/503 responses.

Suggestion: Log the full error server-side and return a sanitized message like "upstream service unavailable" to avoid leaking internal network topology.


// ReleaseTemplateDisplayNameAfterBuildFailure clears display_name on a template
// whose image build failed so the name can be reused on a later create.
func ReleaseTemplateDisplayNameAfterBuildFailure(ctx context.Context, templateID string) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: ReleaseTemplateDisplayNameAfterBuildFailure is called on every build failure to release the display name so it can be reused (from failTemplateImageJob in job_repo.go line 150). If this function breaks, failed templates permanently lock their display names, preventing users from retrying under the same name.

This function has no unit tests despite non-trivial logic: it skips on empty templateID, tolerates ErrTemplateNotFound, and logs on other errors. None of these paths are exercised.

Suggestion: Add a unit test covering the success path, the empty-ID skip, and the ErrTemplateNotFound tolerance.

}
normalized, err := NormalizeTemplateDisplayName(req.DisplayName)
if err != nil {
log.G(context.Background()).Warnf("template image job display_name normalize failed: %v", err)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: displayNameFromTemplateImageJob (line 382) takes no context.Context parameter and uses context.Background() for logging (lines 388, 393). This severs distributed trace correlation — any log line emitted here has no link to the originating request.

Every other public function in this file accepts a context.Context. Adding one here and plumbing it from callers (job_repo.go line 157, image_job_runner.go lines 211-214) would be a small change with significant observability benefit.

Suggestion: Add ctx context.Context parameter and propagate from callers.

Comment thread CubeAPI/src/models/mod.rs

/// Body for PATCH /templates/:templateID (update display name).
#[derive(Debug, Deserialize, ToSchema)]
pub struct UpdateTemplateRequest {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: UpdateTemplateRequest.name is a bare String with no validation. Other request structs in this file (e.g. NewSandbox) use #[derive(Validate)] with constraints. The documented constraints (max 256 chars, ASCII letters/digits/hyphen/underscore only, no tpl-/snap- prefix) are enforced only by CubeMaster, not at the CubeAPI boundary.

This is a defense-in-depth gap: invalid names pass through CubeAPI to CubeMaster before being rejected, costing a network round-trip. A CubeMaster regression in error code mapping could produce unpredictable HTTP status codes instead of the documented 400.

Suggestion: Add #[derive(Validate)] with length and regex constraints matching the documented rules, or add a manual validator. The #[validate] derive on CreateTemplateRequest (line 601) is also never called — consider fixing both.

Comment thread CubeAPI/src/services/templates.rs Outdated
CubeMasterError::Http(e) if e.is_timeout() || e.is_connect() => {
AppError::ServiceUnavailable(format!("CubeMaster unavailable: {e}"))
}
CubeMasterError::Http(e) => AppError::BadGateway(format!("CubeMaster request failed: {e}")),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: When CubeMaster returns an unexpected transport error, map_resolve_err formats messages like "CubeMaster unavailable: {e}"/"CubeMaster request failed: {e}" where {e} is the full reqwest::Error Display string. This can include the internal CubeMaster URL (e.g., http://cubemaster.internal:8080/...) and connection details, which are forwarded to the API caller as 502/503 responses.

Suggestion: Log the full error server-side and return a sanitized message like "upstream service unavailable" to avoid leaking internal network topology.


// ReleaseTemplateDisplayNameAfterBuildFailure clears display_name on a template
// whose image build failed so the name can be reused on a later create.
func ReleaseTemplateDisplayNameAfterBuildFailure(ctx context.Context, templateID string) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: ReleaseTemplateDisplayNameAfterBuildFailure is called on build failure to release the display name so it can be reused (called from failTemplateImageJob in job_repo.go line 150). If this function breaks, failed templates permanently lock their display names, and users cannot retry under the same name.

This function has no unit tests despite non-trivial logic: it skips on empty templateID, tolerates ErrTemplateNotFound, and logs on other errors. None of these paths are exercised.

Suggestion: Add a unit test covering the success path, the empty-ID skip, and the ErrTemplateNotFound tolerance.

Add cluster-wide template display names (CubeMaster display_name) so
sandboxes and APIs can reference templates by human-readable name.
CubeMaster owns lookup, reservation at definition creation, bounded
positive/negative caches with invalidation, and a DB migration with
dedupe; CubeAPI exposes thin name resolution and GET /templates/lookup;
Web adds create hints, list/detail rename, and i18n.

Assisted-by: Cursor:Composer
Signed-off-by: carmake <gycm520@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@HuChundong HuChundong force-pushed the feat/template-name-e2b-align branch from 6983aa7 to 25a5804 Compare June 23, 2026 12:16
@HuChundong HuChundong marked this pull request as ready for review June 23, 2026 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants