feat(CubeAPI): support e2b V3 template-build protocol end-to-end#422
feat(CubeAPI): support e2b V3 template-build protocol end-to-end#422chengjoey wants to merge 2 commits into
Conversation
|
Claude finished @chengjoey's task in 1m 46s —— View job Code Review: PR #422 — feat(CubeAPI): support e2b V3 template-build protocolOverviewThis PR adds comprehensive E2B V3 protocol support to CubeAPI, including a new OCI registry reverse proxy, in-memory build registry, V3 endpoint handlers ( Issues Found1. [Bug] V3 build status log entries all get In // templates_v3.rs / v3_get_build_status calls get_template_build_status
// which returns Vec<String> (formatted as "TIMESTAMP LINE"), then:
.logs.iter()
.map(|line| V3BuildLogEntry {
timestamp: chrono::Utc::now(), // always "now", not the original timestamp
message: line.clone(),
level: "info".to_string(),
})The 2. [Minor] Standard HTTP semantics dictate 3. [Concern] No rate limiting on V3 template routes — The 4. [Concern] In-memory build registry grows unbounded —
5. [Testing gap] No integration tests for V3 protocol flow — Unit tests exist for the helpers (
The route test ( 6. [Minor] Duplicate registry credential generation logic —
7. [Minor] format!("cube://{}", parent)The comment says "letting downstream callers resolve it," but nothing in this PR handles the 8. [Nit] The function signature uses Correct Design Decisions
Summary
The PR is fundamentally sound and well-structured. The timestamp bug (issue #1) is the only item that would produce incorrect runtime behavior for consumers of the status API. The missing rate limiting (issue #3) is worth addressing before shipping if this API is exposed to unconstrained clients. |
fslongjin
left a comment
There was a problem hiding this comment.
I reviewed the V3 template-build implementation with a focus on correctness, architecture, and workaround risk. The main blockers are the current test compilation failure, the fake file-cache hit for unsupported context builds, and the registry proxy/auth model.
| /// route exists and returns the V3 envelope. | ||
| #[tokio::test] | ||
| async fn v3_template_build_routes_are_reachable() { | ||
| let server = test_server(); |
There was a problem hiding this comment.
This test currently does not compile because test_server() is async and returns a Future. Please await it before using the TestServer methods:
let server = test_server().await;
Right now cargo test fails with no method named post/get found for opaque type impl Future<Output = TestServer>, so this PR cannot pass the existing Rust test suite in its current state.
| /// currently supports) don't need the build context. | ||
| pub fn v3_get_file_upload(&self, _template_id: &str, _files_hash: &str) -> AppResult<crate::models::V3TemplateFileUpload> { | ||
| Ok(crate::models::V3TemplateFileUpload { | ||
| present: true, |
There was a problem hiding this comment.
Always returning present: true makes the SDK believe the build context already exists server-side, so it will skip uploading the tarball. That is only safe for pure fromImage builds.
For Dockerfile/steps-based V3 builds this silently drops the build context, and the later trigger step may dispatch a registry image reference that was never produced. I think unsupported context builds should fail explicitly here, or this endpoint should return a real upload URL and persist/consume the uploaded context.
There was a problem hiding this comment.
done, fromImage will succeed, Dockerfile/steps-based will point to 501
| // letting downstream callers resolve it. Adjust to your local | ||
| // convention if needed. | ||
| format!("cube://{}", parent) | ||
| } else if !ctx.image_ref.is_empty() { |
There was a problem hiding this comment.
This fallback uses ctx.image_ref as long as it is non-empty, but that value is populated during POST /v3/templates; it does not prove that the client has actually pushed the image.
If neither fromImage nor a supported fromTemplate source is provided, we should verify that the registry push has completed before dispatching CubeMaster. Otherwise the API accepts the build and moves the failure to a later image-pull stage, which makes the error much harder to diagnose.
| // synthesise a CubeMaster reference of the form `cube://<tid>`, | ||
| // letting downstream callers resolve it. Adjust to your local | ||
| // convention if needed. | ||
| format!("cube://{}", parent) |
There was a problem hiding this comment.
This introduces a cube://<template> source reference, but I do not see corresponding support in this PR for resolving that scheme downstream in CubeMaster/cubelet.
If the downstream stack does not already understand cube://, this path will look supported at the API layer but fail later during image resolution. Please either implement the resolution in this PR or reject fromTemplate with a clear 501 Not Implemented until the path is actually wired end-to-end.
There was a problem hiding this comment.
done, Now the displayed return is 501
| let log_entries: Vec<V3BuildLogEntry> = logs | ||
| .iter() | ||
| .map(|line| V3BuildLogEntry { | ||
| timestamp: chrono::Utc::now(), |
There was a problem hiding this comment.
These timestamps are generated at status-poll time, not at log-write time. That means the same historical log line will receive a new timestamp every time the client polls.
BuildContext already stores BuildLogLine { timestamp, line }, so the V3 conversion should preserve that structured timestamp instead of assigning Utc::now() here.
| Router::new() | ||
| .route("/v2/", get(registry::ping)) | ||
| .route("/v2", get(registry::ping)) | ||
| .route("/v2/*path", any(registry::proxy)) |
There was a problem hiding this comment.
The registry proxy is mounted without CubeAPI auth or rate limiting. I understand Docker registry auth is a different credential domain, but this implementation also does not validate the short-lived RegistryCredential issued by CubeAPI; it simply forwards requests to the upstream registry.
Given the one-click deployment points this at a default local registry, exposing CubeAPI could accidentally expose a public writable registry proxy. Please add credential validation/rate limiting here, or make the deployment contract require an authenticated upstream registry and document that clearly.
| // → POST /v2/.../builds/{bid} → GET .../status. Routes are mounted | ||
| // at the same level as /templates so both `/v3/...` and `/v2/...` | ||
| // segments are reachable. | ||
| .route("/v3/templates", post(templates_v3::v3_create_template)) |
There was a problem hiding this comment.
Template build routes are currently protected by auth only, while sandbox routes use with_auth_and_rate_limit.
The new V3 endpoints can allocate in-memory build contexts and trigger expensive CubeMaster template-build pipelines, so they should also be rate-limited or governed by a dedicated build quota.
There was a problem hiding this comment.
V3 build-pipeline routes moved to with_auth_and_rate_limit.
|
|
||
| /// Thread-safe, in-process build registry. | ||
| #[derive(Clone, Default)] | ||
| pub struct BuildRegistry { |
There was a problem hiding this comment.
The in-memory build registry has no eviction policy. Completed, failed, and abandoned builds remain in the DashMap indefinitely, including the original create request, credentials, status fields, logs, and the duplicate pair index.
For a long-running CubeAPI process this can become an unbounded memory leak. Please add a TTL cleanup path for terminal builds and/or a maximum registry size.
There was a problem hiding this comment.
BuildRegistry gains TTL + size-cap + background GC
| tag = %parsed.tag, | ||
| "manifest pushed; marking build as image-pushed" | ||
| ); | ||
| state.services.templates.mark_image_pushed(&parsed.tag); |
There was a problem hiding this comment.
This marks a build as image-pushed based only on the manifest tag. It does not verify that the repository path matches the expected <repo_prefix>/<templateID> for the build context.
The build IDs are UUID-like and hard to guess, but the proxy should still validate both repo and tag against the stored build context before advancing the build state.
There was a problem hiding this comment.
done, mark_image_pushed add repo arg, cross-checks the manifest repo
| /// 3. `body.from_template` — copy from another known CubeSandbox | ||
| /// template (resolved via CubeMaster `get_template`). | ||
| /// | ||
| /// `start_cmd` becomes container `args`; `ready_cmd` becomes a Probe.Exec. |
There was a problem hiding this comment.
Several comments still say readyCmd is translated into Probe.Exec, but the actual implementation parses a loopback URL and creates an HTTP GET probe because CubeMaster/cubelet do not support exec probes here.
Please update the comments/API model docs to say Probe.HttpGet; otherwise future maintainers may assume exec-style ready commands are supported end-to-end.
Add V3 routes: POST /v3/templates, GET .../files/{hash}, POST /v2/.../builds/{bid}, GET .../status (handlers/templates_v3.rs)
Signed-off-by: joeyczheng <joeyczheng@tencent.com>
Security:
* Per-build short-lived push credentials (`bld_<random>` + 256-bit
password) replace the global `_token` shared secret.
* Registry reverse-proxy validates Basic auth against an in-memory
credential index and enforces repo scoping; 401 with WWW-Authenticate
+ 403 on cross-build access. Per-credential rate limit added.
* `mark_image_pushed` cross-checks the manifest repo, not just the tag.
* Startup WARN when public bind meets unauthenticated loopback upstream.
Resource bounds:
* `BuildRegistry` gains TTL + size-cap + background GC; in-flight builds
are never evicted. New `build_registry_*` config knobs.
* Registry proxy streams request/response bodies end-to-end instead of
buffering, so concurrent multi-GiB pushes no longer pin the heap.
* V3 build-pipeline routes moved to `with_auth_and_rate_limit`.
Correctness:
* `image_pushed` flag is the single source of truth for "client really
pushed"; OCI fallback no longer dispatches against an unpushed ref.
* `fromTemplate` and `dockerfile`/`steps`-only builds are rejected
with 501 until a real resolver/builder ships, instead of failing
obscurely deeper in the pipeline.
* `BuildLogLine.timestamp` is preserved across status polls.
* Probe synthesis comments fixed: `readyCmd` -> `Probe.HttpGet`, not
`Probe.Exec`.
Signed-off-by: joeyczheng <joeyczheng@tencent.com>
6e93954 to
8f8f6fa
Compare
| CredentialOutcome::Authenticated(ctx) | ||
| } | ||
|
|
||
| fn constant_time_eq_strings(a: &str, b: &str) -> bool { |
There was a problem hiding this comment.
constant_time_eq_strings — compiler may optimize away the timing-safe loop
When a.len() != b.len() (line 382), the function walks the longer slice XORing each byte with 0 (byte ^ 0) and discards the result (let _ = diff;). A sufficiently smart optimizer can identify diff as dead code and eliminate the loop entirely, defeating the constant-time purpose.
Consider either:
- Using a proven constant-time comparison crate (e.g.
subtle'sConstantTimeEq); - Writing the result through
std::sync::atomic::compiler_fenceorcore::ptr::read_volatileto prevent elision.
The equal-length path (line 392-396) does use diff as the return value, so that path is fine — only the unequal-length path is affected.
| timestamp: Utc::now(), | ||
| line, | ||
| }); | ||
| const MAX_LOGS: usize = 10_000; |
There was a problem hiding this comment.
Vec::drain(0..drop) is O(n) on every append past cap
Once the log exceeds MAX_LOGS (10,000), every new append_log call drains drop elements from the front and shifts all remaining entries left. With a tail capacity of 10K, this is ~10K element moves per append.
Consider using VecDeque instead — pop_front() is O(1), and the push/pop pattern is exactly what it's designed for. The log is append-only and indexed by offset, which VecDeque supports efficiently via make_contiguous when needed.
| self.get(&bid) | ||
| } | ||
|
|
||
| pub fn get(&self, build_id: &str) -> Option<BuildContext> { |
There was a problem hiding this comment.
get() clones the full BuildContext including the log buffer
Every status poll calls get() or get_by_pair(), which clones the entire BuildContext (line 294). With a log buffer of up to 10K entries, each clone allocates and copies all of them — even when the caller only reads non-log fields.
Consider one of:
- Wrapping the DashMap value in
Arc<BuildContext>—clone()becomes an atomic refcount increment instead of a deep copy; - Separating the log buffer into an
Arc<Vec<BuildLogLine>>field so only the log pointer is cloned rather than the full vector.
| CredentialOutcome::Authenticated(ctx) | ||
| } | ||
|
|
||
| fn constant_time_eq_strings(a: &str, b: &str) -> bool { |
There was a problem hiding this comment.
constant_time_eq_strings — compiler may optimize away the timing-safe loop
When a.len() != b.len() (line 382), the function walks the longer slice XORing each byte with 0 (byte ^ 0) and discards the result (let _ = diff;). A sufficiently smart optimizer can identify diff as dead code and eliminate the loop entirely, defeating the constant-time purpose.
Consider either:
- Using a proven constant-time comparison crate (e.g.
subtle'sConstantTimeEq); - Writing the result through
std::sync::atomic::compiler_fenceorcore::ptr::read_volatileto prevent elision.
The equal-length path (line 392-396) does use diff as the return value, so that path is fine — only the unequal-length path is affected.
| timestamp: Utc::now(), | ||
| line, | ||
| }); | ||
| const MAX_LOGS: usize = 10_000; |
There was a problem hiding this comment.
Vec::drain(0..drop) is O(n) on every append past cap
Once the log exceeds MAX_LOGS (10,000), every new append_log call drains drop elements from the front and shifts all remaining entries left. With a tail capacity of 10K, this is ~10K element moves per append.
Consider using VecDeque instead — pop_front() is O(1), and the push/pop pattern is exactly what it's designed for. The log is append-only and indexed by offset, which VecDeque supports efficiently via make_contiguous when needed.
| self.get(&bid) | ||
| } | ||
|
|
||
| pub fn get(&self, build_id: &str) -> Option<BuildContext> { |
There was a problem hiding this comment.
get() clones the full BuildContext including the log buffer
Every status poll calls get() or get_by_pair(), which clones the entire BuildContext (line 294). With a log buffer of up to 10K entries, each clone allocates and copies all of them — even when the caller only reads non-log fields.
Consider one of:
- Wrapping the DashMap value in
Arc<BuildContext>—clone()becomes an atomic refcount increment instead of a deep copy; - Separating the log buffer into an
Arc<Vec<BuildLogLine>>field so only the log pointer is cloned rather than the full vector.
Add V3 routes: POST /v3/templates, GET .../files/{hash}, POST /v2/.../builds/{bid}, GET .../status (handlers/templates_v3.rs)
feat: #74