Feat/webCaseCenter#615
Conversation
b9cb05d to
321e017
Compare
| .env("CUBE_API_ENVD_AUTH", &self.envd_auth) | ||
| .current_dir(&work_dir); | ||
|
|
||
| if std::env::var("CUBE_API_KEY").is_err() { |
There was a problem hiding this comment.
Security: Hardcoded default CUBE_API_KEY
When CUBE_API_KEY is not set in the parent process environment, the example subprocess receives a well-known default key (cube_0000000000...). Any user who triggers an example run can extract this key from the process environment and use it to authenticate arbitrary API calls.
Either remove the fallback entirely (scripts should handle a missing key gracefully) or use an obviously non-functional sentinel (e.g., empty string) that scripts can detect and refuse to use.
| } | ||
|
|
||
| // 3. Any healthy/ready template | ||
| if let Ok(tpls) = templates.list_templates().await { |
There was a problem hiding this comment.
Performance: Template validation iterates healthy candidates twice + makes 2N gRPC calls
Line 448 chains tpls.iter() (the full set) after the filtered subset, so every template is visited at least once and healthy/ready templates are validated twice. The inner templates.get_template() call is a gRPC round-trip to CubeMaster — in the worst case this makes 2N sequential gRPC calls.
The fix: replace .chain(tpls.iter()) with .chain(tpls.iter().filter(|t| t.status != "healthy" && t.status != "ready")), or better, inline the status check from list_templates and skip get_template entirely for templates already confirmed as healthy/ready.
| } | ||
| }; | ||
|
|
||
| let fingerprint = { |
There was a problem hiding this comment.
DefaultHasher uses random seeds per-process — cache misses after every restart
DefaultHasher::new() uses RandomState with per-process seeds, so the same requirements.txt content produces a different hash after every server restart. The fingerprint stamp file written by a previous process lifetime will never match, causing pip3 install to rerun unconditionally on the first example run after each restart.
Replace with a deterministic hash (e.g., SipHasher::new_with_keys(0, 0)) or simply compare the file content directly for these trivially small requirements files.
| .map_err(anyhow::Error::from) | ||
| } | ||
|
|
||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Dead code with explicit #[allow(dead_code)] — will drift from schema
This method is never called (explicitly acknowledged by the attribute). DB access methods that go unused will silently drift as columns are renamed/added/removed, then fail when someone finally calls them. Either wire it into a real caller or remove it.
| fn default_sandbox_proxy_url() -> String { | ||
| std::env::var("AGENTHUB_SANDBOX_PROXY_URL").unwrap_or_else(|_| "http://127.0.0.1".to_string()) | ||
| } | ||
| fn default_envd_auth() -> String { |
There was a problem hiding this comment.
Security: Default envd_auth is a well-known base64 value
The default Basic cm9vdDo= decodes to root: with an empty password — the envd built-in default. This shared credential is sent to every sandbox's Jupyter/envd endpoint. Anyone who learns it (including by running code in their own sandbox and observing request headers) can access the in-sandbox services of other sandboxes. Use per-sandbox credentials (the sandbox detail already includes an envd_access_token field) instead of a cluster-wide shared secret.
| // Pull all images in parallel, then inspect each. | ||
| let handles: Vec<_> = STORE_IMAGES | ||
| /// Pulls all store images from the registry then returns the updated metadata. | ||
| pub async fn refresh_store_meta(State(state): State<AppState>) -> impl IntoResponse { |
There was a problem hiding this comment.
Security & performance: Concurrent Docker pulls with no rate limiting
This spawns one docker pull --quiet per catalog image concurrently via spawn_blocking. With 6+ images (some multi-GB), this can saturate Docker daemon capacity and network bandwidth. Also, POST /store/refresh should use with_auth_and_rate_limit in the router (like the node isolation endpoint) rather than plain with_auth. Consider adding a semaphore to cap concurrent pulls (e.g., 2-3 at a time).
| .env("CUBE_API_ENVD_AUTH", &self.envd_auth) | ||
| .current_dir(&work_dir); | ||
|
|
||
| if std::env::var("CUBE_API_KEY").is_err() { |
There was a problem hiding this comment.
Security: Hardcoded default CUBE_API_KEY
When CUBE_API_KEY is not set in the parent process environment, the example subprocess receives a well-known default key (cube_0000000000...). Any user who triggers an example run can extract this key from the process environment and use it to authenticate arbitrary API calls.
Either remove the fallback entirely (scripts should handle a missing key gracefully) or use an obviously non-functional sentinel (e.g., empty string) that scripts can detect and refuse to use.
| } | ||
|
|
||
| // 3. Any healthy/ready template | ||
| if let Ok(tpls) = templates.list_templates().await { |
There was a problem hiding this comment.
Performance: Template validation iterates healthy candidates twice + makes 2N gRPC calls
Line 448 chains tpls.iter() (the full set) after the filtered subset, so every template is visited at least once and healthy/ready templates are validated twice. The inner templates.get_template() call is a gRPC round-trip to CubeMaster — in the worst case this makes 2N sequential gRPC calls.
The fix: replace .chain(tpls.iter()) with .chain(tpls.iter().filter(|t| t.status != "healthy" && t.status != "ready")), or better, inline the status check from list_templates and skip get_template entirely for templates already confirmed as healthy/ready.
| } | ||
| }; | ||
|
|
||
| let fingerprint = { |
There was a problem hiding this comment.
DefaultHasher uses random seeds per-process → cache misses after every restart
DefaultHasher::new() uses RandomState with per-process seeds, so the same requirements.txt content produces a different hash after every server restart. The fingerprint stamp file written by a previous process lifetime will never match, causing pip3 install to rerun unconditionally on the first example run after each restart.
Replace with a deterministic hash (e.g., SipHasher::new_with_keys(0, 0)) or simply compare the file content directly for these trivially small requirements files.
| /// calls return the same `&'static` slice, so the front-end gets a stable | ||
| /// ordering without the borrow checker complaining about temporaries. | ||
| pub fn scenario_registry() -> &'static [ScenarioSpec] { | ||
| Box::leak(Box::new(vec![ |
There was a problem hiding this comment.
scenario_registry() currently allocates and leaks a new Vec<ScenarioSpec> on every call, even though the comment says the registry is materialized once. Endpoints such as list, source, and run can call this repeatedly, so a long-lived API process will grow memory over time. Please store the vector behind OnceLock/LazyLock and return the initialized slice, ideally with a regression test that repeated calls reuse the same registry.
| let start = Instant::now(); | ||
|
|
||
| let output = match req.language.as_str() { | ||
| "python" => self.exec_python(sandbox_id, domain, &req.code).await?, |
There was a problem hiding this comment.
timeout_secs is only used after the Jupyter/envd request has already returned; the service still awaits exec_python/exec_bash without a real deadline. A hung or slow sandbox request can keep the handler occupied, and larger user timeouts will still be cut off by the outer 30s route timeout. Please wrap the actual execution future in tokio::time::timeout and align the route timeout with the allowed request budget.
|
|
||
| let start = Instant::now(); | ||
| let max_secs = sc.timeout_secs.unwrap_or(120); | ||
| let run_result = timeout(Duration::from_secs(max_secs), cmd.output()).await; |
There was a problem hiding this comment.
Timing out cmd.output() drops the future but does not explicitly kill the spawned process or its process tree, so long-running Python/Go/Chromium children can continue after the API has returned. This path also installs requirements.txt during a request, mutating the server's Python environment and using a source-tree stamp file without concurrency control. Please spawn and kill/wait the child on timeout, and move dependency installation to build/deploy time or an isolated, locked cache that fails closed.
| fn build_examples_run_routes(state: &AppState, auth_configured: bool) -> Router<AppState> { | ||
| let routes = Router::new().route("/examples/run", post(examples::run_example)); | ||
| if auth_configured { | ||
| routes.layer(middleware::from_fn_with_state(state.clone(), unified_auth)) |
There was a problem hiding this comment.
POST /examples/run is moved to a long-timeout router with auth only, but it no longer gets the rate limiting used by other resource-creating or execution endpoints. Because this route starts local processes and can create sandboxes or browser sessions for up to several minutes, missing rate limits increases abuse and resource-exhaustion risk. Please apply the existing auth+rate-limit layer here, or add a dedicated concurrency/rate budget for example runs.
| // `hidden: true` filter so even an attacker who knows the IDs cannot reach | ||
| // these scenarios through the HTTP API. | ||
|
|
||
| export const EXAMPLE_SCENARIOS: ExampleScenario[] = [ |
There was a problem hiding this comment.
This adds a second hand-maintained example registry in the frontend while the backend registry also defines scenario IDs, files, categories, store items, visibility, and run metadata. The two sources can drift, producing UI entries that cannot run or backend examples with stale labels/topology in the UI. Please make the backend /examples response include the UI metadata, or generate both Rust and TS registries from a shared data source.
| pub async fn exec_code( | ||
| State(state): State<AppState>, | ||
| Path(sandbox_id): Path<String>, | ||
| Json(body): Json<ExecCodeRequest>, |
There was a problem hiding this comment.
ExecCodeRequest derives Validate, but this handler only extracts Json(body) and never calls body.validate() before executing the request. As a result, the documented code length constraint is not enforced and the API can accept payloads larger than the model contract suggests. Please validate here and map failures to a 400 response, or use a shared validated-JSON extractor.
There was a problem hiding this comment.
This PR adds the compiled examples/cube-bench/cube-bench binary alongside the Go source. Committing the build artifact substantially increases PR size, is platform/architecture-specific, and makes supply-chain review noisier. Please remove the binary and add an ignore rule for this executable or for example build outputs.
| exit_code, | ||
| success, | ||
| elapsed_ms, | ||
| steps: Vec::new(), |
There was a problem hiding this comment.
The run response now exposes steps, and the frontend adds a StepTimeline, but the service always returns steps: Vec::new() and the page does not render the timeline from backend data. This leaves a partially implemented API/UI surface that users cannot observe and future maintainers may assume is complete. Please either generate and render real step logs, or remove the unused field/component/comment until the feature is wired through.
| """ | ||
|
|
||
| with Sandbox.create(template=template_id) as sandbox: | ||
|
|
There was a problem hiding this comment.
This added blank line contains trailing whitespace, so git diff --check still fails for the PR. Please remove the stray spaces (and any other diff-check formatting issues) to avoid CI or review noise.
| </div> | ||
| ); | ||
| } | ||
|
|
There was a problem hiding this comment.
git diff --check also reports this as a new blank line at EOF. Please remove the extra trailing blank line so the PR is clean for whitespace checks.
6cfb21c to
488bea9
Compare
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
…SandboxCases Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
… prefix Signed-off-by: liciazhu <liciazhu@tencent.com>
…put, centralise credentials Signed-off-by: liciazhu <liciazhu@tencent.com>
…stry Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
Signed-off-by: liciazhu <liciazhu@tencent.com>
…licy Signed-off-by: liciazhu <liciazhu@tencent.com>
POST /store/refresh previously spawned an unbounded number of concurrent blocking tasks — one docker-pull per catalog entry — which could saturate Docker daemon and network bandwidth when the catalog is large. Introduce a tokio Semaphore (MAX_CONCURRENT_PULLS = 3) so that at most 3 docker pulls run simultaneously. Each catalog image acquires a permit before entering the spawn_blocking call and releases it automatically on drop. Assisted-by: CodeBuddy:claude-sonnet-4-6 Signed-off-by: liciazhu <liciazhu@tencent.com>
…ement Improvement in scenario_registry();Consistent Auth + Rate Limiting on /examples/run;Remove Committed Binary from Repository;Whitespace Formatting Fixes Signed-off-by: liciazhu <liciazhu@tencent.com>
…xec-code Signed-off-by: liciazhu <liciazhu@tencent.com>
488bea9 to
2f8f47f
Compare
Signed-off-by: liciazhu <liciazhu@tencent.com>
| .env("CUBE_TEMPLATE_ID", &template_id) | ||
| .env("SSL_CERT_FILE", ssl_cert) | ||
| .env("AGENTHUB_SANDBOX_PROXY_URL", &self.sandbox_proxy_url) | ||
| .env("CUBE_API_ENVD_AUTH", &self.envd_auth) |
There was a problem hiding this comment.
Security: envd_auth credential leaked into example subprocess environment
The full envd_auth header value (credential for the sandbox proxy) is injected as the CUBE_API_ENVD_AUTH environment variable into every example subprocess. Example scripts (including user-edited code in req.code) can read os.environ["CUBE_API_ENVD_AUTH"] and authenticate as root to the sandbox proxy, bypassing API-level authorization.
Impact: Escalation from "can run an example" to "can execute arbitrary code inside any sandbox."
Suggestion: Do not pass envd_auth to subprocesses. If example scripts need proxy access, create a scoped time-limited token instead.
| if let Ok(tpls) = templates.list_templates().await { | ||
| let matched = tpls.iter().find(|t| { | ||
| (t.status == "healthy" || t.status == "ready") | ||
| && t.image_info.as_deref() == Some(image_ref.as_str()) | ||
| }); | ||
| if let Some(t) = matched { | ||
| tracing::info!( | ||
| store_item_id = %sid, | ||
| image = %image_ref, | ||
| template_id = %t.template_id, | ||
| "matched template via store_item_id" | ||
| ); | ||
| return Ok(t.template_id.clone()); | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // 3. Any healthy/ready template | ||
| if let Ok(tpls) = templates.list_templates().await { |
There was a problem hiding this comment.
Performance: N+1 HTTP cascade in template resolution
resolve_template_id performs up to ~15 sequential HTTP round-trips to CubeMaster per example run: up to 3 individual get_template calls for candidates (line ~405), then a list_templates call (line ~429), then another list_templates call (line ~448, duplicating the result from line ~429), then iterates all returned templates calling get_template individually for each.
Impact: Latency floor of O(N) sequential network hops to CubeMaster on every example run.
Suggestion:
- Call
list_templatesonce and reuse the result - Check the list's own
statusfield rather than re-verifying each candidate withget_template(the list already returns status) - Use
tokio::try_join_allfor candidate probing in the first step
| std::env::var("CUBE_API_DEFAULT_KEY") | ||
| .ok() | ||
| .filter(|s| !s.is_empty()) | ||
| .or_else(|| Some("cube_0000000000000000000000000000000000000000".to_string())) |
There was a problem hiding this comment.
Security: Hardcoded fallback API key
When CUBE_API_DEFAULT_KEY is not set, this fallback value cube_0000000000000000000000000000000000000000 is used as an API key and injected into every example subprocess. This is a known static key shared across all deployments that don't explicitly configure it — effectively a backdoor credential.
Suggestion: Remove the hardcoded fallback and require explicit configuration. Log a loud warning at startup if the env var is unset.
| className="jupyter-svg-output max-h-[300px] overflow-auto rounded-md bg-white p-2 ring-1 ring-border/60" | ||
| dangerouslySetInnerHTML={{ | ||
| __html: DOMPurify.sanitize(result.svg, { | ||
| USE_PROFILES: { svg: true, svgFilters: true }, |
There was a problem hiding this comment.
Security: SVG XSS via DOMPurify's SVG profile
Jupyter result.svg payloads are sanitized with DOMPurify.sanitize(result.svg, { USE_PROFILES: { svg: true, svgFilters: true } }). The SVG profile allows <script>, event handlers (onload, etc.), and <use> with external references. An SVG like <svg><script>fetch(document.cookie)</script></svg> would execute in the WebUI origin.
Impact: Stored XSS — any user who can produce a Jupyter SVG output can run JS in administrators' browsers.
Suggestion: Either strip SVG output entirely (render as text), or use a stricter sanitizer that removes scripts, event handlers, and external references from SVGs.
| const out: Node<TopologyNodeData>[] = []; | ||
| for (let col = 0; col < layers.length; col++) { | ||
| const layer = layers[col]; | ||
| const controlInLayer = layer.filter((id) => nodes.find((n) => n.id === id)?.plane === 'control'); |
There was a problem hiding this comment.
Performance: O(N²) scan in topology layout
nodes.find((n) => n.id === id) is called in the innermost loop of layoutNodes, scanning the entire nodes array for each ID in each layer. For a topology with ~20 nodes this is fine, but the pattern repeats 3× per layer (control filter, data filter, node lookup).
Suggestion: Build a Map<string, LocalizedNode> outside the loop and use .get(id) — reduces from O(N²) to O(N) and is a one-line change.
| if program == "go" { | ||
| let dir_name = format!(".tmp_run_{}", Uuid::new_v4()); | ||
| let dir = base_dir.join(&dir_name); | ||
| std::fs::create_dir_all(&dir).map_err(|e| { |
There was a problem hiding this comment.
Performance: Blocking std::fs calls on async runtime
Filesystem operations (create_dir_all, write, read_to_string, remove_dir_all, remove_file, copy) run directly on the tokio async thread. Under high concurrency, each blocks a thread that could otherwise process requests.
Suggestion: Use tokio::fs::* equivalents or wrap in tokio::task::spawn_blocking.
| /// - `stderr` → `{ "type": "stderr", "text": "..." }` | ||
| /// - `result` → `{ "type": "result", "text": "...", ... }` | ||
| /// - `error` → `{ "type": "error", "name": "...", "value": "...", "traceback": [...] }` | ||
| fn parse_jupyter_ndjson(bytes: &[u8]) -> AppResult<JupyterOutput> { |
There was a problem hiding this comment.
Test Coverage: Untested pure parsing functions
This file contains several pure, deterministic parsing functions — parse_jupyter_ndjson (line 295), parse_connect_stream (line 374), build_response (line 264), connect_envelope (line 366), decode_b64_lossy (line 427), and parse_exit_status (line 434) — with zero test coverage. These parse raw binary streams from envd/Jupyter; malformed input would crash the handler or produce wrong results.
The rest of the crate has an established #[cfg(test)] pattern (12+ files use it). These parsing functions are ideal candidates for unit tests as they require no network, subprocess, or database.
add cube web case center and mod web sandbox detail