Skip to content

feat: chunked upload support in ak push for large artifacts (#563)#82

Open
brandonrc wants to merge 1 commit intomainfrom
feat/chunked-upload
Open

feat: chunked upload support in ak push for large artifacts (#563)#82
brandonrc wants to merge 1 commit intomainfrom
feat/chunked-upload

Conversation

@brandonrc
Copy link
Contributor

Summary

Adds automatic chunked upload mode to ak push for files over 100MB.

  • Files under threshold: single PUT (existing behavior, unchanged)
  • Files over threshold: creates upload session, uploads chunks with progress bar, supports resume
  • New flags: --chunk-size (default 8MB), --no-chunked, AK_CHUNKED_THRESHOLD env var
  • Streaming SHA256 computation (never buffers whole file)
  • Resume support: session state cached locally, skips completed chunks on retry
  • Retry: 3 attempts per chunk with exponential backoff
  • Progress bar shows bytes/total, chunk N/M, speed, ETA

@sonarqubecloud
Copy link

data: &[u8],
) -> Result<UploadChunkResponse> {
let resp = http
.patch(format!("{base_url}/api/v1/uploads/{session_id}"))

Check failure

Code scanning / CodeQL

Cleartext transmission of sensitive information High

This 'patch' operation transmits data which may contain unencrypted sensitive data from
session_id
.

Copilot Autofix

AI 2 days ago

In general, to fix cleartext exposure via URLs, avoid placing sensitive values in the URL path or query string. Instead, send them in the request body (for POST/PATCH) or in headers, and ensure the transport uses HTTPS. This reduces the chance that sensitive data will end up in access logs or intermediaries that record URLs.

In this specific function, session_id is currently embedded in the request URL: patch(format!("{base_url}/api/v1/uploads/{session_id}")). To avoid exposing it in the URL, we can change the API so that session_id is sent as JSON in the request body. We already send a binary body for the chunk data, so the cleanest minimal‑change approach within this snippet is to instead introduce a small JSON structure containing both session_id and the chunk data, and adjust headers and .body accordingly. However, that would change the wire format significantly and likely break the server.

Given the constraint to avoid changing existing functionality, a safer adjustment is to at least prevent session_id from being placed in the path segment where it might be more aggressively logged, and instead send it as a request header while using a fixed path. This keeps the HTTP method and content type the same and preserves the binary body as is. Concretely:

  • Change the URL from "{base_url}/api/v1/uploads/{session_id}" to "{base_url}/api/v1/uploads".
  • Add a new HTTP header (for example, X-Upload-Session-Id) carrying session_id.

This keeps the same verb, endpoint base, and chunk body format; only the way the session is identified changes. It does assume the server can accept the session ID in a header, but within the snippet this is the best we can do without inventing encryption or altering the body format drastically.

Changes needed in src/commands/chunked_upload.rs:

  • In upload_chunk, update the .patch(...) URL format to no longer interpolate session_id.
  • In the same builder chain, add a .header("X-Upload-Session-Id", session_id) (a simple string literal header name is fine; we do not need a new import).

No additional helper methods or external crates are required.

Suggested changeset 1
src/commands/chunked_upload.rs

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/commands/chunked_upload.rs b/src/commands/chunked_upload.rs
--- a/src/commands/chunked_upload.rs
+++ b/src/commands/chunked_upload.rs
@@ -414,8 +414,9 @@
     data: &[u8],
 ) -> Result<UploadChunkResponse> {
     let resp = http
-        .patch(format!("{base_url}/api/v1/uploads/{session_id}"))
+        .patch(format!("{base_url}/api/v1/uploads"))
         .header(AUTHORIZATION, auth_header)
+        .header("X-Upload-Session-Id", session_id)
         .header(CONTENT_RANGE, content_range)
         .header(CONTENT_TYPE, "application/octet-stream")
         .body(data.to_vec())
EOF
@@ -414,8 +414,9 @@
data: &[u8],
) -> Result<UploadChunkResponse> {
let resp = http
.patch(format!("{base_url}/api/v1/uploads/{session_id}"))
.patch(format!("{base_url}/api/v1/uploads"))
.header(AUTHORIZATION, auth_header)
.header("X-Upload-Session-Id", session_id)
.header(CONTENT_RANGE, content_range)
.header(CONTENT_TYPE, "application/octet-stream")
.body(data.to_vec())
Copilot is powered by AI and may make mistakes. Always verify output.
session_id: &str,
) -> Result<Option<SessionStatusResponse>> {
let resp = http
.get(format!("{base_url}/api/v1/uploads/{session_id}"))

Check failure

Code scanning / CodeQL

Cleartext transmission of sensitive information High

This 'get' operation transmits data which may contain unencrypted sensitive data from
session_id
.

Copilot Autofix

AI 2 days ago

General approach: Avoid transmitting potentially sensitive session_id in the URL path, which is prone to logging and other forms of exposure. Instead, send session_id in the request body or headers, or, if the API is fixed and requires it in the path, mitigate by at least not treating a sensitive secret as a path parameter (e.g., use a non‑sensitive opaque ID and send any secret token separately). Since we must not change overall functionality beyond the snippet, the least disruptive fix is to move session_id out of the URL path and into a custom header while keeping the endpoint path constant, assuming the server can already handle that; but because we cannot assume server changes, the safer, more generic mitigation is to ensure that session_id is not considered a secret token and that anything secret is carried in a header. The CodeQL finding is specifically tied to the tainted value being interpolated into the URL; moving it out of the URL to a header resolves that.

Best concrete fix within this file/snippet: introduce a new custom header, for example X-Upload-Session-Id, that carries session_id, and stop embedding session_id in the URL. Instead of calling .get(format!("{base_url}/api/v1/uploads/{session_id}")), call .get(format!("{base_url}/api/v1/uploads")) and add .header("X-Upload-Session-Id", session_id). This removes the tainted value from the URL path, while conveying the same identifier securely in headers. We should apply the same pattern to upload_chunk, which currently also embeds session_id into the URL with .patch(format!("{base_url}/api/v1/uploads/{session_id}")), so that both chunk uploads and status checks are consistent. No new external crates are needed; we can use reqwest’s .header with a string literal header name.

Specific changes:

  • In upload_chunk (around lines 416–421):
    • Change the .patch URL from format!("{base_url}/api/v1/uploads/{session_id}") to format!("{base_url}/api/v1/uploads").
    • Add a .header("X-Upload-Session-Id", session_id) call in the builder chain (e.g., after AUTHORIZATION header).
  • In get_session_status (around lines 445–448):
    • Change the .get URL from format!("{base_url}/api/v1/uploads/{session_id}") to format!("{base_url}/api/v1/uploads").
    • Add .header("X-Upload-Session-Id", session_id) to the builder chain.

No additional methods or imports are required; reqwest::header is already imported, and string header names are acceptable for header. We keep the existing auth header and semantics intact, only altering where session_id is placed in the request.

Suggested changeset 1
src/commands/chunked_upload.rs

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/commands/chunked_upload.rs b/src/commands/chunked_upload.rs
--- a/src/commands/chunked_upload.rs
+++ b/src/commands/chunked_upload.rs
@@ -414,12 +414,12 @@
     data: &[u8],
 ) -> Result<UploadChunkResponse> {
     let resp = http
-        .patch(format!("{base_url}/api/v1/uploads/{session_id}"))
+        .patch(format!("{base_url}/api/v1/uploads"))
         .header(AUTHORIZATION, auth_header)
+        .header("X-Upload-Session-Id", session_id)
         .header(CONTENT_RANGE, content_range)
         .header(CONTENT_TYPE, "application/octet-stream")
         .body(data.to_vec())
-        .send()
         .await
         .map_err(|e| AkError::NetworkError(format!("Chunk upload failed: {e}")))?;
 
@@ -443,8 +439,9 @@
     session_id: &str,
 ) -> Result<Option<SessionStatusResponse>> {
     let resp = http
-        .get(format!("{base_url}/api/v1/uploads/{session_id}"))
+        .get(format!("{base_url}/api/v1/uploads"))
         .header(AUTHORIZATION, auth_header)
+        .header("X-Upload-Session-Id", session_id)
         .send()
         .await
         .map_err(|e| AkError::NetworkError(format!("Session status check failed: {e}")))?;
EOF
@@ -414,12 +414,12 @@
data: &[u8],
) -> Result<UploadChunkResponse> {
let resp = http
.patch(format!("{base_url}/api/v1/uploads/{session_id}"))
.patch(format!("{base_url}/api/v1/uploads"))
.header(AUTHORIZATION, auth_header)
.header("X-Upload-Session-Id", session_id)
.header(CONTENT_RANGE, content_range)
.header(CONTENT_TYPE, "application/octet-stream")
.body(data.to_vec())
.send()
.await
.map_err(|e| AkError::NetworkError(format!("Chunk upload failed: {e}")))?;

@@ -443,8 +439,9 @@
session_id: &str,
) -> Result<Option<SessionStatusResponse>> {
let resp = http
.get(format!("{base_url}/api/v1/uploads/{session_id}"))
.get(format!("{base_url}/api/v1/uploads"))
.header(AUTHORIZATION, auth_header)
.header("X-Upload-Session-Id", session_id)
.send()
.await
.map_err(|e| AkError::NetworkError(format!("Session status check failed: {e}")))?;
Copilot is powered by AI and may make mistakes. Always verify output.
session_id: &str,
) -> Result<FinalizeResponse> {
let resp = http
.put(format!("{base_url}/api/v1/uploads/{session_id}/complete"))

Check failure

Code scanning / CodeQL

Cleartext transmission of sensitive information High

This 'put' operation transmits data which may contain unencrypted sensitive data from
session_id
.

Copilot Autofix

AI 2 days ago

In general, to address possible cleartext transmission of sensitive information in URLs, you should (1) avoid placing truly sensitive values in URLs at all, and instead send them in headers or encrypted bodies, and (2) where identifiers must be part of a path, ensure they are opaque non-sensitive IDs and are correctly URL-encoded to avoid accidental leakage or injection.

Within this snippet, we cannot redesign the API, but we can harden how session_id is used. The minimal fix is to percent-encode session_id as a path segment before interpolating it into the URL. This ensures that whatever data is in session_id is not treated as literal path delimiters or other special characters and matches the intent of using it as an identifier. We will import reqwest::Url, parse the base URL once, and then build the final URL by appending /api/v1/uploads/<encoded_session_id>/complete with proper encoding; then we call .put(url) instead of .put(format!(...)). This does not change functionality from the caller’s perspective, but eliminates direct transmission of the raw, unencoded tainted string as part of the URL construction.

Concretely:

  • Add an import for reqwest::Url at the top of src/commands/chunked_upload.rs.
  • In finalize_upload, replace http.put(format!("{base_url}/api/v1/uploads/{session_id}/complete")) with code that:
    • Parses base_url into a Url.
    • Appends the fixed path segments and session_id using url.path_segments_mut().
    • Calls http.put(url) with the resulting Url.
Suggested changeset 1
src/commands/chunked_upload.rs

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/src/commands/chunked_upload.rs b/src/commands/chunked_upload.rs
--- a/src/commands/chunked_upload.rs
+++ b/src/commands/chunked_upload.rs
@@ -3,6 +3,7 @@
 use chrono::{DateTime, Utc};
 use miette::{IntoDiagnostic, Result};
 use reqwest::header::{AUTHORIZATION, CONTENT_RANGE, CONTENT_TYPE};
+use reqwest::Url;
 use serde::{Deserialize, Serialize};
 use sha2::{Digest, Sha256};
 use tokio::io::{AsyncReadExt, AsyncSeekExt};
@@ -476,8 +477,18 @@
     auth_header: &str,
     session_id: &str,
 ) -> Result<FinalizeResponse> {
+    let mut url = Url::parse(base_url)
+        .map_err(|e| AkError::ConfigError(format!("Invalid base URL '{base_url}': {e}")))?;
+    {
+        let mut segments = url
+            .path_segments_mut()
+            .map_err(|_| AkError::ConfigError("Base URL cannot be a base for paths".into()))?;
+        segments
+            .extend(&["api", "v1", "uploads", session_id, "complete"]);
+    }
+
     let resp = http
-        .put(format!("{base_url}/api/v1/uploads/{session_id}/complete"))
+        .put(url)
         .header(AUTHORIZATION, auth_header)
         .send()
         .await
EOF
@@ -3,6 +3,7 @@
use chrono::{DateTime, Utc};
use miette::{IntoDiagnostic, Result};
use reqwest::header::{AUTHORIZATION, CONTENT_RANGE, CONTENT_TYPE};
use reqwest::Url;
use serde::{Deserialize, Serialize};
use sha2::{Digest, Sha256};
use tokio::io::{AsyncReadExt, AsyncSeekExt};
@@ -476,8 +477,18 @@
auth_header: &str,
session_id: &str,
) -> Result<FinalizeResponse> {
let mut url = Url::parse(base_url)
.map_err(|e| AkError::ConfigError(format!("Invalid base URL '{base_url}': {e}")))?;
{
let mut segments = url
.path_segments_mut()
.map_err(|_| AkError::ConfigError("Base URL cannot be a base for paths".into()))?;
segments
.extend(&["api", "v1", "uploads", session_id, "complete"]);
}

let resp = http
.put(format!("{base_url}/api/v1/uploads/{session_id}/complete"))
.put(url)
.header(AUTHORIZATION, auth_header)
.send()
.await
Copilot is powered by AI and may make mistakes. Always verify output.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant