Skip to content

feat(CubeAPI): support e2b V3 template-build protocol end-to-end#422

Open
chengjoey wants to merge 2 commits into
TencentCloud:masterfrom
chengjoey:feat/e2b-template
Open

feat(CubeAPI): support e2b V3 template-build protocol end-to-end#422
chengjoey wants to merge 2 commits into
TencentCloud:masterfrom
chengjoey:feat/e2b-template

Conversation

@chengjoey

Copy link
Copy Markdown
Contributor

Add V3 routes: POST /v3/templates, GET .../files/{hash}, POST /v2/.../builds/{bid}, GET .../status (handlers/templates_v3.rs)

feat: #74

@cubesandboxbot

cubesandboxbot Bot commented Jun 2, 2026

Copy link
Copy Markdown

Claude finished @chengjoey's task in 1m 46s —— View job


Code Review: PR #422 — feat(CubeAPI): support e2b V3 template-build protocol

Overview

This PR adds comprehensive E2B V3 protocol support to CubeAPI, including a new OCI registry reverse proxy, in-memory build registry, V3 endpoint handlers (POST /v3/templates, GET files/{hash}, POST /v2/builds/{bid}, GET /status), and the readyCmd → probe translation layer. The implementation is well-structured with clear separation of handlers, services, and models. 21 files changed (~3,150 additions).

Issues Found

1. [Bug] V3 build status log entries all get Utc::now() as timestamptemplates.rs:822

In v3_get_build_status, when converting internal build log lines to V3BuildLogEntry, every entry gets chrono::Utc::now() as its timestamp instead of the original timestamp:

// templates_v3.rs / v3_get_build_status calls get_template_build_status
// which returns Vec<String> (formatted as "TIMESTAMP LINE"), then:
.logs.iter()
.map(|line| V3BuildLogEntry {
    timestamp: chrono::Utc::now(),  // always "now", not the original timestamp
    message: line.clone(),
    level: "info".to_string(),
})

The BuildContext already stores proper BuildLogLine { timestamp, line } structs. The fix is either (a) return structured data from get_template_build_status instead of stringified logs, or (b) have v3_get_build_status read from the BuildRegistry directly instead of going through the string path.

2. [Minor] GET /templates/{tid}/files/{hash} returns 201 CREATEDtemplates_v3.rs:41

Standard HTTP semantics dictate 200 OK for a successful GET. The 201 CREATED value here is non-standard. If intentional (e.g., required by the E2B SDK), this should be documented in a code comment.

3. [Concern] No rate limiting on V3 template routesroutes.rs:168-212

The build_template_routes function uses with_auth() (auth-only), while sandbox routes use with_auth_and_rate_limit(). The new V3 template build endpoints (POST /v3/templates, POST /v2/.../builds/{bid}) aren't rate-limited. This could be a DoS vector — an attacker could repeatedly trigger expensive CubeMaster pipeline dispatches without any per-key throttling.

4. [Concern] In-memory build registry grows unboundedbuilds.rs:104

BuildRegistry wraps a DashMap with no eviction policy for completed/failed builds. The comment acknowledges this ("acceptable for a build flow that always reaches a terminal state within minutes"), but abandoned builds or builds whose clients never poll will leak memory indefinitely. Log lines are truncated at 10k, but the BuildContext structs (including full CreateTemplateRequest) are never removed. Consider a TTL-based eviction or limiting the total entry count.

5. [Testing gap] No integration tests for V3 protocol flowtemplates.rs:1199+

Unit tests exist for the helpers (build_resources, merge_envs, parse_ready_url, build_probe, etc.) and for create_template_e2b_mode. However, the core V3 protocol methods have no test coverage:

  • v3_create_template
  • v3_trigger_build
  • v3_get_build_status

The route test (v3_template_build_routes_are_reachable in routes.rs:559) verifies routing + the create response envelope, but doesn't test the dispatch or status polling flow.

6. [Minor] Duplicate registry credential generation logictemplates.rs:192-208 and templates.rs:859-883

issue_registry_credential() (line 859) duplicates logic already present in create_template_e2b_mode() (lines 192-208). Both compute the credential URL, repository prefix, and password. create_template_e2b_mode could call issue_registry_credential() and then add its additional fields.

7. [Minor] cube:// URI in from_template resolution is unresolvedtemplates.rs:685

format!("cube://{}", parent)

The comment says "letting downstream callers resolve it," but nothing in this PR handles the cube:// scheme. If this is a forward-looking placeholder, consider noting it more explicitly (e.g., with a TODO or feature flag).

8. [Nit] build_registry_router ignores its _state parameterroutes.rs:316

The function signature uses _state: &AppState but the Router is built purely from path definitions. The registry handler (registry::proxy) later accesses state via State<AppState>, so the state IS needed at the merged router level. However, the underscore prefix is misleading — this function does indirectly use the state through the handlers.

Correct Design Decisions

  • Separate V3 handlers file (handlers/templates_v3.rs): Cleanly distinguishes the new E2B protocol surface from the existing CubeSandbox-native templates handlers.
  • parse_ready_url safety: The function explicitly rejects non-loopback hosts and URLs without explicit ports, preventing misdirected probes.
  • with_auth on template routes: New V3 routes inherit auth checking via the existing infrastructure.
  • In-memory build registry with DashMap: Appropriate for the expected short-lived build lifecycle, with dual-indexing (by build_id and {template_id}::{build_id}) for flexible lookups.
  • Hop-by-hop header stripping in registry proxy: The HOP_BY_HOP constant at registry.rs:20 correctly prevents proxy-specific headers from being forwarded.

Summary

Category Count
Bugs 1 (timestamp)
API correctness 1 (GET→201)
Security/Scale 2 (rate limiting, memory eviction)
Testing 1 (missing V3 coverage)
Code quality 3 (duplication, unresolved URI, unused param)

The PR is fundamentally sound and well-structured. The timestamp bug (issue #1) is the only item that would produce incorrect runtime behavior for consumers of the status API. The missing rate limiting (issue #3) is worth addressing before shipping if this API is exposed to unconstrained clients.

@fslongjin fslongjin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the V3 template-build implementation with a focus on correctness, architecture, and workaround risk. The main blockers are the current test compilation failure, the fake file-cache hit for unsupported context builds, and the registry proxy/auth model.

Comment thread CubeAPI/src/routes.rs Outdated
/// route exists and returns the V3 envelope.
#[tokio::test]
async fn v3_template_build_routes_are_reachable() {
let server = test_server();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test currently does not compile because test_server() is async and returns a Future. Please await it before using the TestServer methods:

let server = test_server().await;

Right now cargo test fails with no method named post/get found for opaque type impl Future<Output = TestServer>, so this PR cannot pass the existing Rust test suite in its current state.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

/// currently supports) don't need the build context.
pub fn v3_get_file_upload(&self, _template_id: &str, _files_hash: &str) -> AppResult<crate::models::V3TemplateFileUpload> {
Ok(crate::models::V3TemplateFileUpload {
present: true,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always returning present: true makes the SDK believe the build context already exists server-side, so it will skip uploading the tarball. That is only safe for pure fromImage builds.

For Dockerfile/steps-based V3 builds this silently drops the build context, and the later trigger step may dispatch a registry image reference that was never produced. I think unsupported context builds should fail explicitly here, or this endpoint should return a real upload URL and persist/consume the uploaded context.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, fromImage will succeed, Dockerfile/steps-based will point to 501

Comment thread CubeAPI/src/services/templates.rs Outdated
// letting downstream callers resolve it. Adjust to your local
// convention if needed.
format!("cube://{}", parent)
} else if !ctx.image_ref.is_empty() {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fallback uses ctx.image_ref as long as it is non-empty, but that value is populated during POST /v3/templates; it does not prove that the client has actually pushed the image.

If neither fromImage nor a supported fromTemplate source is provided, we should verify that the registry push has completed before dispatching CubeMaster. Otherwise the API accepts the build and moves the failure to a later image-pull stage, which makes the error much harder to diagnose.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread CubeAPI/src/services/templates.rs Outdated
// synthesise a CubeMaster reference of the form `cube://<tid>`,
// letting downstream callers resolve it. Adjust to your local
// convention if needed.
format!("cube://{}", parent)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces a cube://<template> source reference, but I do not see corresponding support in this PR for resolving that scheme downstream in CubeMaster/cubelet.

If the downstream stack does not already understand cube://, this path will look supported at the API layer but fail later during image resolution. Please either implement the resolution in this PR or reject fromTemplate with a clear 501 Not Implemented until the path is actually wired end-to-end.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, Now the displayed return is 501

Comment thread CubeAPI/src/services/templates.rs Outdated
let log_entries: Vec<V3BuildLogEntry> = logs
.iter()
.map(|line| V3BuildLogEntry {
timestamp: chrono::Utc::now(),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These timestamps are generated at status-poll time, not at log-write time. That means the same historical log line will receive a new timestamp every time the client polls.

BuildContext already stores BuildLogLine { timestamp, line }, so the V3 conversion should preserve that structured timestamp instead of assigning Utc::now() here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread CubeAPI/src/routes.rs
Router::new()
.route("/v2/", get(registry::ping))
.route("/v2", get(registry::ping))
.route("/v2/*path", any(registry::proxy))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The registry proxy is mounted without CubeAPI auth or rate limiting. I understand Docker registry auth is a different credential domain, but this implementation also does not validate the short-lived RegistryCredential issued by CubeAPI; it simply forwards requests to the upstream registry.

Given the one-click deployment points this at a default local registry, exposing CubeAPI could accidentally expose a public writable registry proxy. Please add credential validation/rate limiting here, or make the deployment contract require an authenticated upstream registry and document that clearly.

Comment thread CubeAPI/src/routes.rs
// → POST /v2/.../builds/{bid} → GET .../status. Routes are mounted
// at the same level as /templates so both `/v3/...` and `/v2/...`
// segments are reachable.
.route("/v3/templates", post(templates_v3::v3_create_template))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Template build routes are currently protected by auth only, while sandbox routes use with_auth_and_rate_limit.

The new V3 endpoints can allocate in-memory build contexts and trigger expensive CubeMaster template-build pipelines, so they should also be rate-limited or governed by a dedicated build quota.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

V3 build-pipeline routes moved to with_auth_and_rate_limit.


/// Thread-safe, in-process build registry.
#[derive(Clone, Default)]
pub struct BuildRegistry {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in-memory build registry has no eviction policy. Completed, failed, and abandoned builds remain in the DashMap indefinitely, including the original create request, credentials, status fields, logs, and the duplicate pair index.

For a long-running CubeAPI process this can become an unbounded memory leak. Please add a TTL cleanup path for terminal builds and/or a maximum registry size.

@chengjoey chengjoey Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BuildRegistry gains TTL + size-cap + background GC

Comment thread CubeAPI/src/handlers/registry.rs Outdated
tag = %parsed.tag,
"manifest pushed; marking build as image-pushed"
);
state.services.templates.mark_image_pushed(&parsed.tag);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This marks a build as image-pushed based only on the manifest tag. It does not verify that the repository path matches the expected <repo_prefix>/<templateID> for the build context.

The build IDs are UUID-like and hard to guess, but the proxy should still validate both repo and tag against the stored build context before advancing the build state.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, mark_image_pushed add repo arg, cross-checks the manifest repo

Comment thread CubeAPI/src/services/templates.rs Outdated
/// 3. `body.from_template` — copy from another known CubeSandbox
/// template (resolved via CubeMaster `get_template`).
///
/// `start_cmd` becomes container `args`; `ready_cmd` becomes a Probe.Exec.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several comments still say readyCmd is translated into Probe.Exec, but the actual implementation parses a loopback URL and creates an HTTP GET probe because CubeMaster/cubelet do not support exec probes here.

Please update the comments/API model docs to say Probe.HttpGet; otherwise future maintainers may assume exec-style ready commands are supported end-to-end.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Add V3 routes: POST /v3/templates, GET .../files/{hash}, POST /v2/.../builds/{bid}, GET .../status (handlers/templates_v3.rs)

Signed-off-by: joeyczheng <joeyczheng@tencent.com>
Security:
  * Per-build short-lived push credentials (`bld_<random>` + 256-bit
    password) replace the global `_token` shared secret.
  * Registry reverse-proxy validates Basic auth against an in-memory
    credential index and enforces repo scoping; 401 with WWW-Authenticate
    + 403 on cross-build access. Per-credential rate limit added.
  * `mark_image_pushed` cross-checks the manifest repo, not just the tag.
  * Startup WARN when public bind meets unauthenticated loopback upstream.

Resource bounds:
  * `BuildRegistry` gains TTL + size-cap + background GC; in-flight builds
    are never evicted. New `build_registry_*` config knobs.
  * Registry proxy streams request/response bodies end-to-end instead of
    buffering, so concurrent multi-GiB pushes no longer pin the heap.
  * V3 build-pipeline routes moved to `with_auth_and_rate_limit`.

Correctness:
  * `image_pushed` flag is the single source of truth for "client really
    pushed"; OCI fallback no longer dispatches against an unpushed ref.
  * `fromTemplate` and `dockerfile`/`steps`-only builds are rejected
    with 501 until a real resolver/builder ships, instead of failing
    obscurely deeper in the pipeline.
  * `BuildLogLine.timestamp` is preserved across status polls.
  * Probe synthesis comments fixed: `readyCmd` -> `Probe.HttpGet`, not
    `Probe.Exec`.

Signed-off-by: joeyczheng <joeyczheng@tencent.com>
@chengjoey chengjoey force-pushed the feat/e2b-template branch from 6e93954 to 8f8f6fa Compare June 10, 2026 09:49
CredentialOutcome::Authenticated(ctx)
}

fn constant_time_eq_strings(a: &str, b: &str) -> bool {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constant_time_eq_strings — compiler may optimize away the timing-safe loop

When a.len() != b.len() (line 382), the function walks the longer slice XORing each byte with 0 (byte ^ 0) and discards the result (let _ = diff;). A sufficiently smart optimizer can identify diff as dead code and eliminate the loop entirely, defeating the constant-time purpose.

Consider either:

  1. Using a proven constant-time comparison crate (e.g. subtle's ConstantTimeEq);
  2. Writing the result through std::sync::atomic::compiler_fence or core::ptr::read_volatile to prevent elision.

The equal-length path (line 392-396) does use diff as the return value, so that path is fine — only the unequal-length path is affected.

timestamp: Utc::now(),
line,
});
const MAX_LOGS: usize = 10_000;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec::drain(0..drop) is O(n) on every append past cap

Once the log exceeds MAX_LOGS (10,000), every new append_log call drains drop elements from the front and shifts all remaining entries left. With a tail capacity of 10K, this is ~10K element moves per append.

Consider using VecDeque instead — pop_front() is O(1), and the push/pop pattern is exactly what it's designed for. The log is append-only and indexed by offset, which VecDeque supports efficiently via make_contiguous when needed.

self.get(&bid)
}

pub fn get(&self, build_id: &str) -> Option<BuildContext> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get() clones the full BuildContext including the log buffer

Every status poll calls get() or get_by_pair(), which clones the entire BuildContext (line 294). With a log buffer of up to 10K entries, each clone allocates and copies all of them — even when the caller only reads non-log fields.

Consider one of:

  1. Wrapping the DashMap value in Arc<BuildContext>clone() becomes an atomic refcount increment instead of a deep copy;
  2. Separating the log buffer into an Arc<Vec<BuildLogLine>> field so only the log pointer is cloned rather than the full vector.

CredentialOutcome::Authenticated(ctx)
}

fn constant_time_eq_strings(a: &str, b: &str) -> bool {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constant_time_eq_strings — compiler may optimize away the timing-safe loop

When a.len() != b.len() (line 382), the function walks the longer slice XORing each byte with 0 (byte ^ 0) and discards the result (let _ = diff;). A sufficiently smart optimizer can identify diff as dead code and eliminate the loop entirely, defeating the constant-time purpose.

Consider either:

  1. Using a proven constant-time comparison crate (e.g. subtle's ConstantTimeEq);
  2. Writing the result through std::sync::atomic::compiler_fence or core::ptr::read_volatile to prevent elision.

The equal-length path (line 392-396) does use diff as the return value, so that path is fine — only the unequal-length path is affected.

timestamp: Utc::now(),
line,
});
const MAX_LOGS: usize = 10_000;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec::drain(0..drop) is O(n) on every append past cap

Once the log exceeds MAX_LOGS (10,000), every new append_log call drains drop elements from the front and shifts all remaining entries left. With a tail capacity of 10K, this is ~10K element moves per append.

Consider using VecDeque instead — pop_front() is O(1), and the push/pop pattern is exactly what it's designed for. The log is append-only and indexed by offset, which VecDeque supports efficiently via make_contiguous when needed.

self.get(&bid)
}

pub fn get(&self, build_id: &str) -> Option<BuildContext> {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get() clones the full BuildContext including the log buffer

Every status poll calls get() or get_by_pair(), which clones the entire BuildContext (line 294). With a log buffer of up to 10K entries, each clone allocates and copies all of them — even when the caller only reads non-log fields.

Consider one of:

  1. Wrapping the DashMap value in Arc<BuildContext>clone() becomes an atomic refcount increment instead of a deep copy;
  2. Separating the log buffer into an Arc<Vec<BuildLogLine>> field so only the log pointer is cloned rather than the full vector.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants