feat(templates): E2B-style display names with CubeMaster name cache#607
feat(templates): E2B-style display names with CubeMaster name cache#607HuChundong wants to merge 1 commit into
Conversation
d9adc05 to
ff87c16
Compare
|
You can check if PR #612 meets your requirements? The scale of the changes is much smaller |
ff87c16 to
2ed302c
Compare
|
|
||
| match self | ||
| .fetch_resolved_template(&template_ref, &template_id) | ||
| .await |
There was a problem hiding this comment.
Redundant second fetch_resolved_template call
update_template_name already calls fetch_resolved_template on line 144 to get the existing template metadata (including the old display_name for cache invalidation). After the successful update, it calls fetch_resolved_template again on line 164 solely to build the response. This adds an unnecessary cross-service HTTP roundtrip + DB query.
The response could be constructed from the already-fetched existing record (which was returned just 20 lines earlier), substituting the new name for display_name. Consider eliminating this second fetch to reduce latency — especially since the CubeMaster PATCH update already returned a success envelope, so the template definition is expected to still exist.
| .await?; | ||
| let existing = self | ||
| .fetch_resolved_template(&template_ref, &template_id) | ||
| .await?; |
There was a problem hiding this comment.
Extra fetch_resolved_template purely for display_name cache invalidation
delete_template calls fetch_resolved_template solely to retrieve the old display_name for cache invalidation (line 228: for display_name in template_names(&existing.display_name)). This triggers an HTTP roundtrip to CubeMaster + a DB query just to get a single string field that's already available in CubeMaster's response to the lookup endpoint.
Consider having GET /cube/template/lookup also return display_name alongside template_id, so the single lookup call serves both the ID resolution and the cache invalidation data.
| return normalized, nil | ||
| } | ||
|
|
||
| func mapDefinitionCreateDuplicateError(err error, displayName string) error { |
There was a problem hiding this comment.
Raw database errors propagated to API responses
mapDefinitionCreateDuplicateError returns err unmodified when the error message does not contain "1062" or "Duplicate entry". This means transient MySQL errors (e.g. deadlocks, connection timeouts, schema mismatches) flow through to the API response's RetMsg field verbatim (via handleTemplateLookupAction line 118 and handleTemplateDisplayNameAction line 168).
While these are authenticated endpoints, exposing raw DB error texts can reveal internal schema details, engine versions, and connection strings. Consider wrapping non-duplicate errors into a generic internal-error message before returning to clients, while logging the raw error server-side.
| error::{AppError, AppResult}, | ||
| }; | ||
|
|
||
| const CACHE_TTL: Duration = Duration::from_secs(60); |
There was a problem hiding this comment.
Cache TTL of 60 seconds is unnecessarily short
Template display names are effectively immutable after creation — they only change through the explicit update_template_name endpoint, which triggers targeted cache invalidation. Every 60s cache miss forces a cross-service HTTP call to CubeMaster's /cube/template/lookup endpoint (which queries the DB).
Since inline invalidation already handles renames on the originating replica, a TTL of 300-600s would dramatically reduce CubeMaster load without increasing the staleness window beyond what the existing invalidation already provides. Multi-replica deployments already tolerate up to one TTL window of staleness; 60s is unnecessarily tight for a mapping that changes at human timescales.
| // | ||
|
|
||
| //! Template name → templateID resolution with an in-memory cache. | ||
| //! |
There was a problem hiding this comment.
Module doc comment omits several routes using name resolution
The doc comment lists POST /sandboxes, GET/PATCH/DELETE /templates/{templateID}, and compat-adopt as the only routes with name resolution. However, the TemplateService also resolves names for POST /templates/:templateID (rebuild), POST /templates/:templateID/builds/:buildID (start build), and GET /templates/:templateID/builds/:buildID/status (get build status). A developer reading this comment would miss roughly half the affected endpoints. Recommend updating the comment to reflect the full set of routes.
These two PRs overlap slightly, but they have completely different priorities. This PR focuses on syncing the E2B API, making template-based sandbox creation the core feature. Displaying names on the WebUI serves little purpose if those names can't actually be used to spin up sandboxes. |
2ed302c to
214dc12
Compare
214dc12 to
f801c11
Compare
19ceabf to
b297013
Compare
HuChundong
left a comment
There was a problem hiding this comment.
Automated review of b297013 — see inline notes for actionable items.
| }); | ||
| } | ||
| let key = reference.to_ascii_lowercase(); | ||
| if let Some(template_id) = self.cache_get(&key).await { |
There was a problem hiding this comment.
Multi-replica stale cache on destructive paths
get_template_build_status re-validates cache hits via fetch_resolved_template, but create_sandbox, delete_template, and update_template_name trust cached name→ID mappings for up to 300s on replicas that did not handle the rename/delete.
After a name is freed or reassigned, other CubeAPI replicas can still route POST /sandboxes or DELETE/PATCH to the old template. Consider reusing the build-status revalidation pattern (or a lightweight lookup) on cache hits for mutating endpoints.
| .template_names | ||
| .resolve_template_ref(&template_ref) | ||
| .await?; | ||
| let names_for_invalidation = if TemplateNameCache::needs_resolution(&template_ref) { |
There was a problem hiding this comment.
Service-layer tests missing for name-resolution orchestration
template_names.rs has solid unit tests, but update_template_name, delete_template (name vs tpl-id branches), and sandbox create-by-name paths in this module are untested.
A few focused tests with a mock CubeMasterClient would guard cache invalidation and the delete-by-name skip-fetch behavior added in this PR.
|
|
||
| CALL cubemaster_assert_table_exists('t_cube_template_definition'); | ||
|
|
||
| -- Drop invalid reserved-prefix names on templates (snapshots keep their names). |
There was a problem hiding this comment.
Migration preflight clears legacy names irreversibly
Lines 14–37 blank out reserved-prefix names and duplicate display names before adding the unique index. This is reasonable for greenfield but can surprise production upgrades.
Please confirm staging/release notes call this out explicitly (PR body mentions it — worth a pre-migration audit query for affected rows).
Code review (b297013)Overall the PR is in good shape after the review fixes: name validation is centralized in CubeMaster, internal DB errors are masked, redundant fetches were trimmed, and test coverage on the Master side is solid. Worth addressing / confirming
Look good
No blocking issues found for merge after acknowledging multi-replica cache behavior in release notes or follow-up. |
9438441 to
4483cba
Compare
| async fn lookup_and_cache(&self, reference: &str, key: &str) -> AppResult<String> { | ||
| let lock = { | ||
| let mut locks = self.lookup_locks.write().await; | ||
| locks | ||
| .entry(key.to_string()) | ||
| .or_insert_with(|| Arc::new(Mutex::new(()))) | ||
| .clone() | ||
| }; |
There was a problem hiding this comment.
Potential memory leak: lookup_locks entries are never removed. Every unique name ever resolved adds an Arc<Mutex<()>> to the lookup_locks map, but no code cleans them up — not when cache entries are evicted (TTL or capacity), not on invalidation. Over a long-running process with many distinct template names, this grows unbounded, violating the MAX_CACHE_ENTRIES contract on self.entries.
Suggestion: Either clean up lookup_locks entries alongside cache eviction (cache_set / invalidate_name), or switch to a fixed-size striped lock pool (e.g., 64 or 256 slots via usize::from_str_bytes hash) to bound memory and reduce write-lock contention on the lookup_locks RwLock.
| /// Resolve a display name for mutating API paths, always refreshing via lookup. | ||
| pub async fn resolve_template_ref_for_mutation(&self, reference: &str) -> AppResult<String> { | ||
| let reference = reference.trim(); | ||
| if reference.is_empty() { | ||
| return Err(AppError::BadRequest( | ||
| "template reference is empty".to_string(), | ||
| )); | ||
| } | ||
| if !Self::needs_resolution(reference) { | ||
| return Ok(reference.to_string()); | ||
| } | ||
| let key = reference.to_ascii_lowercase(); | ||
| self.lookup_and_cache(reference, &key).await | ||
| } |
There was a problem hiding this comment.
Doc/behavior mismatch: The doc comment says "always refreshing via lookup," but the implementation calls lookup_and_cache, which checks the cache first and returns a cached value if still valid (300s TTL). Mutating paths (delete, rebuild, adopt-baseline) can therefore operate on stale template_id mappings for up to 300s after a name is reassigned on another replica.
Suggestion: Either bypass the cache for mutation paths (clear entry before lookup), or update the doc comment to accurately describe the caching behavior (e.g., "uses cached resolution when available; cache TTL bounds cross-replica staleness").
| }}, | ||
| } | ||
| } | ||
| normalized, _ := templatecenter.NormalizeTemplateDisplayName(name) |
There was a problem hiding this comment.
Error silently discarded: NormalizeTemplateDisplayName can return an error (e.g., name exceeds 256 chars after lookup succeeded, or contains unexpected characters), but the error is discarded via _. The response will have a valid template_id but an empty/incorrect display_name, which is misleading for API consumers.
Suggestion: Log the error or fall back to the original name:
normalized := name
if n, err := templatecenter.NormalizeTemplateDisplayName(name); err == nil {
normalized = n
} else {
log.G(ctx).Warnf("lookup returned valid template but name normalization failed: %v", err)
}
PR Review: E2B-style display names with CubeMaster name cacheOverviewThis is a well-structured feature addition implementing human-readable template display names. The architecture follows a clean layered pattern: CubeAPI is a thin resolver that delegates all name lookups to CubeMaster via HTTP, which owns the name cache (300s TTL, 4096 entries, singleflight dedup) and all invalidation logic. The display_name_key generated column with a UNIQUE index provides case-insensitive uniqueness at the database level. Test coverage is solid overall with a healthy test-to-code ratio. Key FindingsMedium: Missing input validation at CubeAPI boundary (CubeAPI/src/models/mod.rs:671) Medium: Internal error details forwarded to API callers (CubeAPI/src/services/templates.rs:102-106) Medium: Tracing context dropped in displayNameFromTemplateImageJob (CubeMaster/pkg/templatecenter/template_display_name.go:382) Medium: ReleaseTemplateDisplayNameAfterBuildFailure is untested (template_display_name.go:235) Low: No rate limiting on /templates/lookup (CubeAPI/src/routes.rs:189) Low: Cache invalidation maintainability gap (CubeMaster/pkg/templatecenter/cache.go:204) Minor Items
Positive Notes
|
| async fn lookup_and_cache(&self, reference: &str, key: &str) -> AppResult<String> { | ||
| let lock = { | ||
| let mut locks = self.lookup_locks.write().await; | ||
| locks | ||
| .entry(key.to_string()) | ||
| .or_insert_with(|| Arc::new(Mutex::new(()))) | ||
| .clone() | ||
| }; |
There was a problem hiding this comment.
Potential memory leak: lookup_locks entries are never removed. Every unique name ever resolved adds an Arc<Mutex<()>> to the lookup_locks map, but no code cleans them up — not when cache entries are evicted (TTL or capacity), not on invalidation. Over a long-running process with many distinct template names, this grows unbounded, violating the MAX_CACHE_ENTRIES contract on self.entries.
Suggestion: Either clean up lookup_locks entries alongside cache eviction (cache_set / invalidate_name), or switch to a fixed-size striped lock pool (e.g., 64 or 256 slots via usize::from_str_bytes hash) to bound memory and reduce write-lock contention on the lookup_locks RwLock.
| /// Resolve a display name for mutating API paths, always refreshing via lookup. | ||
| pub async fn resolve_template_ref_for_mutation(&self, reference: &str) -> AppResult<String> { | ||
| let reference = reference.trim(); | ||
| if reference.is_empty() { | ||
| return Err(AppError::BadRequest( | ||
| "template reference is empty".to_string(), | ||
| )); | ||
| } | ||
| if !Self::needs_resolution(reference) { | ||
| return Ok(reference.to_string()); | ||
| } | ||
| let key = reference.to_ascii_lowercase(); | ||
| self.lookup_and_cache(reference, &key).await | ||
| } |
There was a problem hiding this comment.
Doc/behavior mismatch: The doc comment says "always refreshing via lookup," but the implementation calls lookup_and_cache, which checks the cache first and returns a cached value if still valid (300s TTL). Mutating paths (delete, rebuild, adopt-baseline) can therefore operate on stale template_id mappings for up to 300s after a name is reassigned on another replica.
Suggestion: Either bypass the cache for mutation paths (clear entry before lookup), or update the doc comment to accurately describe the caching behavior (e.g., "uses cached resolution when available; cache TTL bounds cross-replica staleness").
| }}, | ||
| } | ||
| } | ||
| normalized, _ := templatecenter.NormalizeTemplateDisplayName(name) |
There was a problem hiding this comment.
Error silently discarded: NormalizeTemplateDisplayName can return an error (e.g., name exceeds 256 chars after lookup succeeded, or contains unexpected characters), but the error is discarded via _. The response will have a valid template_id but an empty/incorrect display_name, which is misleading for API consumers.
Suggestion: Log the error or fall back to the original name:
normalized := name
if n, err := templatecenter.NormalizeTemplateDisplayName(name); err == nil {
normalized = n
} else {
log.G(ctx).Warnf("lookup returned valid template but name normalization failed: %v", err)
}4483cba to
52079e1
Compare
| return def.DisplayName | ||
| } | ||
|
|
||
| func assertDisplayNameAvailableForCreate(ctx context.Context, normalized, templateID string) error { |
There was a problem hiding this comment.
Performance: Two DB queries under write lock
This function does two separate DB round-trips while holding the per-name write lock:
findTemplateIDByDisplayNameFromDB— SELECT template_idGetDefinition— full row fetch (only status is needed)
Combine into a single query that fetches both template_id and status:
5e7a6b5 to
4f5faba
Compare
| if err != nil { | ||
| return err | ||
| } | ||
| if err := ValidateTemplateDisplayNameAvailable(ctx, normalized, templateID); err != nil { |
There was a problem hiding this comment.
TOCTOU: availability check and DB write are not atomic
ValidateTemplateDisplayNameAvailable (line 308) queries the DB to check name availability, then the Update on line 312-314 writes the new name. Between these two operations, a concurrent caller can claim the same name. The DB unique index on display_name_key catches the duplicate at write time, but the resulting error is ambiguous — mapDefinitionCreateDuplicateError maps it to a generic "internal error" for non-key errors (template_display_name.go:356), and the user sees an opaque 500 instead of a clear 409 Conflict.
The existing withDisplayNameCreateLock primitive (line 203) provides exactly the serialization this path needs (check + write under the per-name write lock) but is not used here. Consider wrapping the check+write in withTemplateWriteLock to match the pattern used by ensureTemplateDefinition.
|
|
||
| // ReserveTemplateDisplayNameForCreate checks that displayName is free before persisting | ||
| // a template definition. Names on FAILED templates are cleared so they can be reused. | ||
| func ReserveTemplateDisplayNameForCreate(ctx context.Context, displayName, templateID string) error { |
There was a problem hiding this comment.
Misleading export: lock is released before the caller INSERTs
ReserveTemplateDisplayNameForCreate acquires the per-name write lock, verifies availability against the DB, then releases the lock immediately (the fn is a no-op func() error { return nil }). The caller must separately INSERT the template definition afterward. Between lock release and the INSERT, any concurrent caller can claim the same name.
By contrast, withDisplayNameCreateLock (line 203) supports the correct pattern: the caller passes the INSERT as fn and it runs under the lock. The exported ReserveTemplateDisplayNameForCreate is a footgun because it looks like a complete "check and reserve" API but only does the check. Consider either removing the export (unexport and use withDisplayNameCreateLock directly in the caller) or documenting prominently that callers must still INSERT under a separate lock.
| AppError::BadRequest(msg) => (StatusCode::BAD_REQUEST, 400, msg.clone()), | ||
| AppError::BadGateway(msg) => (StatusCode::BAD_GATEWAY, 502, msg.clone()), | ||
| AppError::ServiceUnavailable(msg) => { | ||
| (StatusCode::SERVICE_UNAVAILABLE, 503, msg.clone()) |
There was a problem hiding this comment.
Error messages leak internal details to API clients
These new variants follow the same pattern as AppError::Internal (line 54): the full error message is serialized into the JSON response body. For BadGateway and ServiceUnavailable, this leaks HTTP error details from cross-service CubeMaster calls (timeouts, connection refusals) directly to external callers. Consider logging the full error server-side and returning a generic message to the client, matching the Go-side pattern (template_display_name.go:356 — log detail, return "internal error").
| -- WHERE t.kind = 'template' AND TRIM(t.display_name) <> '' AND t.id <> keep.keep_id; | ||
|
|
||
| -- +goose NO TRANSACTION | ||
| -- +goose Up |
There was a problem hiding this comment.
Up is destructive, Down is not reversible
The two UPDATEs (clearing tpl-/snap- prefixed names and deduplicating display_name) permanently destroy original metadata. The Down migration only drops the column and index — it cannot restore the cleared names. If the migration is applied and then rolled back, the original display names are lost. Consider either making Up non-destructive (log warnings instead of clearing) or adding a backup table/column that Down can restore from.
| @@ -20,6 +20,11 @@ import ( | |||
| const ( | |||
| templateDefinitionCacheTTL = 360 * time.Minute | |||
There was a problem hiding this comment.
Per-replica cache: inconsistent state across replicas for up to 300s
The cache comment (lines 21-25) acknowledges this is per-process only, but during rename/create on replica A, other replicas serve stale positive or not-found cache entries until TTL expiry. A caller hitting replica B after a rename still gets the old templateID for up to 300s; a newly created template is invisible on replicas with a not-found cache entry for 30s. The fresh=true parameter mitigates this but requires callers to know when to use it.
Consider: shorter TTL for not-found cache (already 30s is reasonable), or a shared backing store (Redis) for multi-replica deployments.
| if fresh { | ||
| groupKey = "fresh:" + cacheKey | ||
| } | ||
| result, err := templateDisplayNameFetchGroup.Do(groupKey, func() (interface{}, error) { |
There was a problem hiding this comment.
Double cache check is correct but fragile — should be documented
The cache is checked before entering the singleflight group (lines 98-104), and again inside the callback (lines 110-117). The inner check handles the race where a concurrent goroutine populated the cache between the outer check and entering Do. Without a comment explaining this, a future reader could remove the inner check as "redundant dead code," breaking the coalescing guarantee. Consider adding a brief inline comment.
| let display_name = display_name.trim(); | ||
| if display_name.is_empty() { | ||
| Vec::new() | ||
| } else { |
There was a problem hiding this comment.
names: Vec<String> is always 0 or 1 — consider Option<String>
template_names (line 85) returns either vec![] or vec!["name"]. Every downstream consumer (Go sends display_name as scalar, Rust wraps in Vec, TypeScript reads names[0]) treats this as a single optional value. If forward-compatibility is intended (e.g., aliases), document it explicitly; otherwise this indirection adds unnecessary cognitive load across the full stack.
| templateDisplayNameCache.Set(key, templateID, templateDisplayNameCacheTTL) | ||
| } | ||
|
|
||
| func evictOneDisplayNameNotFoundCacheEntry() { |
There was a problem hiding this comment.
Not-found cache eviction is O(n)
When the cache reaches templateDisplayNameNotFoundCacheMaxLen (1024), every subsequent insert calls Items() which allocates and returns a shallow copy of the entire map, then iterates to find one key to delete. Under a burst of not-found probes (e.g., enumeration of nonexistent names), this creates unnecessary GC pressure. A FIFO ring buffer or counter-based eviction would be O(1).
| if err != nil { | ||
| return err | ||
| } | ||
| if err := ValidateTemplateDisplayNameAvailable(ctx, normalized, templateID); err != nil { |
There was a problem hiding this comment.
TOCTOU: availability check and DB write are not atomic
ValidateTemplateDisplayNameAvailable (line 308) queries the DB to check name availability, then the Update on line 312-314 writes the new name. Between these two operations, a concurrent caller can claim the same name. The DB unique index on display_name_key catches the duplicate at write time, but the resulting error is ambiguous — mapDefinitionCreateDuplicateError maps it to a generic "internal error" for non-key errors (template_display_name.go:356), and the user sees an opaque 500 instead of a clear 409 Conflict.
The existing withDisplayNameCreateLock primitive (line 203) provides exactly the serialization this path needs (check + write under the per-name write lock) but is not used here. Consider wrapping the check+write in withTemplateWriteLock to match the pattern used by ensureTemplateDefinition.
|
|
||
| // ReserveTemplateDisplayNameForCreate checks that displayName is free before persisting | ||
| // a template definition. Names on FAILED templates are cleared so they can be reused. | ||
| func ReserveTemplateDisplayNameForCreate(ctx context.Context, displayName, templateID string) error { |
There was a problem hiding this comment.
Misleading export: lock is released before the caller INSERTs
ReserveTemplateDisplayNameForCreate acquires the per-name write lock, verifies availability against the DB, then releases the lock immediately (the fn is a no-op func() error { return nil }). The caller must separately INSERT the template definition afterward. Between lock release and the INSERT, any concurrent caller can claim the same name.
By contrast, withDisplayNameCreateLock (line 203) supports the correct pattern: the caller passes the INSERT as fn and it runs under the lock. The exported ReserveTemplateDisplayNameForCreate is a footgun because it looks like a complete "check and reserve" API but only does the check. Consider either removing the export (unexport Reserve and use withDisplayNameCreateLock directly in the caller) or documenting prominently that callers must still INSERT under a separate lock.
4f5faba to
6118ff2
Compare
| /// use `tpl-` or `snap-` prefix; cluster-wide case-insensitive unique among templates | ||
| /// that currently hold the name (assigned when the definition is created; | ||
| /// released after failed builds). Sandboxes may reference this name only after | ||
| /// the template build succeeds (template ready on a node). |
There was a problem hiding this comment.
Defense-in-depth: Name validation is missing on the Rust/CubeAPI side
The name field is a plain Option<String> with no #[validate(...)] annotation. Actual syntax enforcement (character set, length, tpl-/snap- prefix rejection) happens only on the Go/CubeMaster side inside NormalizeTemplateDisplayName. Invalid names are forwarded across the internal HTTP link before being rejected.
Consider adding a #[validate(custom = ...)] validator that mirrors the Go-side rules so CubeAPI rejects bad input at the API boundary. This also keeps the utoipa-annotated constraints schema-accurate at the edge.
| @@ -100,12 +108,14 @@ func (g *templateFetchGroup) Do(key string, fn func() (interface{}, error)) (int | |||
| g.calls[key] = call | |||
There was a problem hiding this comment.
Latent correctness bug: custom singleflight silently returns (nil, nil) on panic
This deferred cleanup runs regardless of whether fn() panics. If the callback panics, call.val and call.err remain at their Go zero values (nil, nil). All goroutines waiting on <-call.done (line 104) wake up and return (nil, nil), receiving a wrong result.
In findTemplateIDByDisplayName (template_display_name.go:131), the result is type-asserted: templateID, _ := result.(string) — a nil interface value produces "", which is then returned as a valid (but wrong) template ID.
Recommend adding recover() in this deferred function that sets call.err and re-panics, or replacing with golang.org/x/sync/singleflight.
| return clearDefinitionDisplayNameLocked(ctx, templateID, def) | ||
| }) | ||
| } | ||
| } |
There was a problem hiding this comment.
Cache inconsistency on unexpected GetDefinition error
When GetDefinition returns an unexpected error (transient DB error, not ErrTemplateNotFound), def is nil and the code falls through to clearDefinitionDisplayNameLocked(ctx, templateID, def) at line 290 (now line 290 in the new file). Inside that function (line 265: if def != nil { invalidateTemplateDisplayNameCache(def.DisplayName) }), the cache invalidation is skipped because def is nil. The DB update still runs (setting display_name to ""), but the old display name remains in the in-memory cache until TTL expiry, causing subsequent lookups to return a stale template ID.
Recommend returning the error instead of falling through, or fetching the display name separately for cache invalidation.
| if err != nil { | ||
| code := int(errorcode.ErrorCode_MasterInternalError) | ||
| retMsg := "internal error" | ||
| switch { |
There was a problem hiding this comment.
Enumeration oracle: distinct HTTP codes reveal name state
This error handling block returns distinct status codes for different failure modes:
- 404 (
ErrorCode_NotFound) → name not found - 400 (
ErrorCode_MasterParamsError) → name exists but is ambiguous or invalid - 200 → name exists and resolves to a template ID
An authenticated user can probe whether any specific display name exists, enumerate template names, and distinguish "not found" from "invalid syntax." This feeds into the cluster-wide uniqueness concern — an attacker can build a catalog of template names.
Recommend returning a generic failure code (e.g., 404) for all three lookup failures, and relying on client-side regex validation (mirroring NormalizeTemplateDisplayName) for the create-form UX check instead of API-level discrimination.
2954300 to
6983aa7
Compare
| } | ||
| normalized, err := NormalizeTemplateDisplayName(req.DisplayName) | ||
| if err != nil { | ||
| log.G(context.Background()).Warnf("template image job display_name normalize failed: %v", err) |
There was a problem hiding this comment.
Observation: displayNameFromTemplateImageJob (line 382) takes no context.Context parameter and uses context.Background() for logging (lines 388, 393). This severs distributed trace correlation — log lines emitted here have no link to the originating request.
Every other public function in this file accepts a context.Context. Adding one here and propagating it from callers (job_repo.go line 157, image_job_runner.go lines 211-214) would be a small change with significant observability benefit.
|
|
||
| /// Body for PATCH /templates/:templateID (update display name). | ||
| #[derive(Debug, Deserialize, ToSchema)] | ||
| pub struct UpdateTemplateRequest { |
There was a problem hiding this comment.
Observation: UpdateTemplateRequest.name is a bare String with no validation annotations. Other request structs in this file (e.g. NewSandbox) use #[derive(Validate)] with constraints. The documented constraints (max 256 chars, ASCII letters/digits/hyphen/underscore only, no tpl-/snap- prefix) are enforced only by CubeMaster, not at the CubeAPI boundary.
This is a defense-in-depth gap: invalid names pass through CubeAPI to CubeMaster before being rejected, costing a network round-trip. A CubeMaster regression in error code mapping could produce unexpected HTTP status codes instead of the documented 400.
| CubeMasterError::Http(e) if e.is_timeout() || e.is_connect() => { | ||
| AppError::ServiceUnavailable(format!("CubeMaster unavailable: {e}")) | ||
| } | ||
| CubeMasterError::Http(e) => AppError::BadGateway(format!("CubeMaster request failed: {e}")), |
There was a problem hiding this comment.
Observation: When CubeMaster returns an unexpected transport error, map_resolve_err formats messages like "CubeMaster unavailable: {e}"/"CubeMaster request failed: {e}" where {e} is the full reqwest::Error Display string. This can include the internal CubeMaster URL (e.g., http://cubemaster.internal:8080/...) and connection details, which are forwarded to the API caller as 502/503 responses.
Suggestion: Log the full error server-side and return a sanitized message like "upstream service unavailable" to avoid leaking internal network topology.
|
|
||
| // ReleaseTemplateDisplayNameAfterBuildFailure clears display_name on a template | ||
| // whose image build failed so the name can be reused on a later create. | ||
| func ReleaseTemplateDisplayNameAfterBuildFailure(ctx context.Context, templateID string) { |
There was a problem hiding this comment.
Observation: ReleaseTemplateDisplayNameAfterBuildFailure is called on every build failure to release the display name so it can be reused (from failTemplateImageJob in job_repo.go line 150). If this function breaks, failed templates permanently lock their display names, preventing users from retrying under the same name.
This function has no unit tests despite non-trivial logic: it skips on empty templateID, tolerates ErrTemplateNotFound, and logs on other errors. None of these paths are exercised.
Suggestion: Add a unit test covering the success path, the empty-ID skip, and the ErrTemplateNotFound tolerance.
| } | ||
| normalized, err := NormalizeTemplateDisplayName(req.DisplayName) | ||
| if err != nil { | ||
| log.G(context.Background()).Warnf("template image job display_name normalize failed: %v", err) |
There was a problem hiding this comment.
Observation: displayNameFromTemplateImageJob (line 382) takes no context.Context parameter and uses context.Background() for logging (lines 388, 393). This severs distributed trace correlation — any log line emitted here has no link to the originating request.
Every other public function in this file accepts a context.Context. Adding one here and plumbing it from callers (job_repo.go line 157, image_job_runner.go lines 211-214) would be a small change with significant observability benefit.
Suggestion: Add ctx context.Context parameter and propagate from callers.
|
|
||
| /// Body for PATCH /templates/:templateID (update display name). | ||
| #[derive(Debug, Deserialize, ToSchema)] | ||
| pub struct UpdateTemplateRequest { |
There was a problem hiding this comment.
Observation: UpdateTemplateRequest.name is a bare String with no validation. Other request structs in this file (e.g. NewSandbox) use #[derive(Validate)] with constraints. The documented constraints (max 256 chars, ASCII letters/digits/hyphen/underscore only, no tpl-/snap- prefix) are enforced only by CubeMaster, not at the CubeAPI boundary.
This is a defense-in-depth gap: invalid names pass through CubeAPI to CubeMaster before being rejected, costing a network round-trip. A CubeMaster regression in error code mapping could produce unpredictable HTTP status codes instead of the documented 400.
Suggestion: Add #[derive(Validate)] with length and regex constraints matching the documented rules, or add a manual validator. The #[validate] derive on CreateTemplateRequest (line 601) is also never called — consider fixing both.
| CubeMasterError::Http(e) if e.is_timeout() || e.is_connect() => { | ||
| AppError::ServiceUnavailable(format!("CubeMaster unavailable: {e}")) | ||
| } | ||
| CubeMasterError::Http(e) => AppError::BadGateway(format!("CubeMaster request failed: {e}")), |
There was a problem hiding this comment.
Observation: When CubeMaster returns an unexpected transport error, map_resolve_err formats messages like "CubeMaster unavailable: {e}"/"CubeMaster request failed: {e}" where {e} is the full reqwest::Error Display string. This can include the internal CubeMaster URL (e.g., http://cubemaster.internal:8080/...) and connection details, which are forwarded to the API caller as 502/503 responses.
Suggestion: Log the full error server-side and return a sanitized message like "upstream service unavailable" to avoid leaking internal network topology.
|
|
||
| // ReleaseTemplateDisplayNameAfterBuildFailure clears display_name on a template | ||
| // whose image build failed so the name can be reused on a later create. | ||
| func ReleaseTemplateDisplayNameAfterBuildFailure(ctx context.Context, templateID string) { |
There was a problem hiding this comment.
Observation: ReleaseTemplateDisplayNameAfterBuildFailure is called on build failure to release the display name so it can be reused (called from failTemplateImageJob in job_repo.go line 150). If this function breaks, failed templates permanently lock their display names, and users cannot retry under the same name.
This function has no unit tests despite non-trivial logic: it skips on empty templateID, tolerates ErrTemplateNotFound, and logs on other errors. None of these paths are exercised.
Suggestion: Add a unit test covering the success path, the empty-ID skip, and the ErrTemplateNotFound tolerance.
Add cluster-wide template display names (CubeMaster display_name) so sandboxes and APIs can reference templates by human-readable name. CubeMaster owns lookup, reservation at definition creation, bounded positive/negative caches with invalidation, and a DB migration with dedupe; CubeAPI exposes thin name resolution and GET /templates/lookup; Web adds create hints, list/detail rename, and i18n. Assisted-by: Cursor:Composer Signed-off-by: carmake <gycm520@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
6983aa7 to
25a5804
Compare
Summary
name(CubeMasterdisplay_name) with cluster-wide case-insensitive uniqueness (20260623051112_template_display_name_index.sql).nameon create.POST /sandboxesand template paths accept display name ortpl-*id; CubeAPI resolves name→id once at entry, then uses existing tpl-id paths.GET /cube/template/lookup,POST /cube/template/display-name; sync name reservation at submit (ReserveTemplateDisplayNameForCreate→ 409); release name after failed builds.display_name→template_idcached intemplatecenter(300s TTL, 4096 entries, singleflight); invalidated on rename/delete/create. CubeAPI is a thin client with no local cache.Architecture
display_name, lookup cache, and invalidationCompared to E2B infra: E2B uses team-scoped aliases reserved in
RegisterBuild; we use globaldisplay_name_key+ sync submit check. We release names on build failure (E2B keeps aliases until template delete).Test plan
go test ./pkg/templatecenter/...go test ./pkg/service/httpservice/cube/...cargo test -p cube-apigo test ./pkg/base/dao/migrate/ -run TestMigrationFilenames20260623051112_template_display_name_indexon staging (run audit queries in migration header first)name→ create sandbox by name when READYMigration note
This migration may clear legacy
tpl-/snap-prefixed names and dedupe display names before adding the unique index. Run preflight audit queries on staging before production.Assisted-by: Cursor:Composer