feat(sandbox): add GCE metadata emulator for Google Cloud #1763

Open
p5 wants to merge 5 commits into NVIDIA:main from p5:dev/robertsturla/gce-metadata-emulator
Conversation

@p5

@p5 p5 commented Jun 4, 2026

Summary

Right now you can't use Google Cloud APIs (Vertex AI, Cloud Storage, BigQuery, Drive, Maps, etc.) from inside a sandbox. GCP SDKs expect a metadata server to be running and query it to get tokens - but there's no metadata server in the sandbox, so they fail before any API call is even attempted.

Go's metadata client makes this worse. It dials the metadata IP directly over TCP, bypassing HTTP_PROXY entirely, so the sandbox proxy never even sees the request. Additionally, there's no way to override this from within the SDK config.

This PR adds a google-cloud provider type and a GCE metadata emulator running on loopback (127.0.0.1:8174) inside the sandbox network namespace. GCP SDKs find it via GCE_METADATA_HOST, get credential placeholders back, and include those in their API calls. The proxy resolves placeholders to real tokens at egress. The sandbox process never holds a real credential.
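
The placeholder substitution described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the placeholder prefix, the `resolve_authorization` helper, and the token map are all hypothetical names chosen for the example.

```python
from typing import Dict

# Hypothetical placeholder format; the format the PR actually uses is not shown here.
PLACEHOLDER_PREFIX = "openshell-placeholder:"


def resolve_authorization(headers: Dict[str, str], real_tokens: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the headers with a placeholder bearer token swapped for the real one.

    Headers that do not carry a placeholder are returned unchanged, so requests
    that already hold some other bearer token pass through untouched.
    """
    resolved = dict(headers)
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme == "Bearer" and value.startswith(PLACEHOLDER_PREFIX):
        key = value[len(PLACEHOLDER_PREFIX):]
        if key not in real_tokens:
            raise KeyError(f"no credential registered for placeholder {key!r}")
        resolved["Authorization"] = f"Bearer {real_tokens[key]}"
    return resolved


if __name__ == "__main__":
    outgoing = {"Authorization": "Bearer openshell-placeholder:gcp"}
    # The real token lives only on the proxy side of the boundary.
    print(resolve_authorization(outgoing, {"gcp": "example-real-token"}))
```

The point of the design is that the substitution happens outside the sandbox, so the process inside only ever sees the placeholder string.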

Related Issue

Closes #1706

Changes

  • Add google_cloud module with shared constants and loopback address
  • Add google_cloud_metadata module implementing GCE metadata API
  • Add metadata_server module with MetadataHandler trait for provider-agnostic loopback server lifecycle
  • Add child_env_resolved() and gcp_token_response() for GCP-aware credential state
  • Bind via std::thread::spawn + setns (not spawn_blocking) to avoid tokio thread pool namespace contamination
  • Start metadata server before SSH handler to ensure consistent env on bind failure
  • Add google-cloud.yaml provider profile and credentials documentation
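
The dedicated-thread bind mentioned in the list above can be illustrated with a small sketch. The PR does this in Rust with std::thread::spawn and setns; the Python below only mirrors the shape of the idea, and the `bind_in_dedicated_thread` helper is a hypothetical name. The namespace switch itself is left as a comment because it needs privileges. The reasoning, as far as the description suggests, is that a namespace change applies to the calling thread, so doing it on a throwaway thread keeps pooled worker threads out of the sandbox namespace, while the socket stays in the namespace where it was created.

```python
import queue
import socket
import threading


def bind_in_dedicated_thread(host: str, port: int) -> socket.socket:
    """Bind a listening socket on a short-lived thread and hand it back to the caller."""
    result: queue.Queue = queue.Queue(maxsize=1)

    def worker() -> None:
        # A real implementation would enter the sandbox network namespace here
        # (for example with setns); that step is omitted in this sketch.
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            listener.bind((host, port))
            listener.listen()
        except OSError as exc:
            listener.close()
            result.put(exc)
            return
        result.put(listener)

    thread = threading.Thread(target=worker, name="metadata-bind")
    thread.start()
    thread.join()  # the thread ends here, so any per-thread namespace change ends with it

    outcome = result.get()
    if isinstance(outcome, OSError):
        raise outcome
    return outcome


if __name__ == "__main__":
    listener = bind_in_dedicated_thread("127.0.0.1", 0)  # port 0 picks a free port
    print("listening on", listener.getsockname())
    listener.close()
```

Returning the bind error to the caller also makes it possible to decide what to expose to the sandbox environment before anything else starts.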

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@copy-pr-bot

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Jun 4, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@p5
Author

p5 commented Jun 4, 2026

I have read the DCO document and I hereby sign the DCO.

@p5 p5 force-pushed the dev/robertsturla/gce-metadata-emulator branch 5 times, most recently from cb2e254 to 28d626a Compare June 4, 2026 21:47
@maxamillion maxamillion added the test:e2e Requires end-to-end coverage label Jun 4, 2026
@github-actions

github-actions Bot commented Jun 4, 2026

Label test:e2e applied, but pull-request/1763 is at {"messa while the PR head is 28d626a. A maintainer needs to comment /ok to test 28d626a9e068293469bbeb1661bdb77e973112df to refresh the mirror. Once the mirror catches up, re-run Branch E2E Checks from the Actions tab.

@maxamillion
Collaborator

/ok to test 28d626a

p5 added 2 commits June 4, 2026 23:15
Single source of truth for GCP naming: env var aliases, provider config
keys, token search order, and Vertex-specific env vars. Consumed by
openshell-server, openshell-providers, and openshell-sandbox.

- Add google_cloud.rs with metadata emulator host and loopback address
- Define PROJECT_ID, REGION, and SERVICE_ACCOUNT_EMAIL env var aliases
- Add provider config key constants for gcp provider implementations
- Define TOKEN_ENV_KEYS search order (SA token takes priority over ADC)
- Add Vertex-specific env vars for Goose and Claude Code SDK integration
- Add STATIC_CONFIG_KEYS as union of all alias arrays for env resolution
- Export module via openshell-core lib.rs

Signed-off-by: Robert Sturla <rsturla@redhat.com>
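
A search order like the TOKEN_ENV_KEYS one in the commit message above amounts to "first non-empty key wins". A minimal sketch, assuming hypothetical key names (the actual constants are not shown in this thread):

```python
from typing import Mapping, Optional, Sequence, Tuple

# Hypothetical key names, listed in priority order: service account token first, then ADC.
TOKEN_ENV_KEYS: Sequence[str] = ("EXAMPLE_GCP_SA_TOKEN", "EXAMPLE_GCP_ADC_TOKEN")


def first_token(
    env: Mapping[str, str], keys: Sequence[str] = TOKEN_ENV_KEYS
) -> Optional[Tuple[str, str]]:
    """Return the (key, value) pair of the first key in search order that has a non-empty value."""
    for key in keys:
        value = env.get(key)
        if value:
            return key, value
    return None


if __name__ == "__main__":
    env = {"EXAMPLE_GCP_ADC_TOKEN": "adc-token", "EXAMPLE_GCP_SA_TOKEN": "sa-token"}
    print(first_token(env))
```

Keeping the order in one shared constant means every consumer resolves the same credential when both are present.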
Add GoogleCloudProvider and VertexProvider implementing inject_env to
project GCP config (project ID, region, SA email, metadata host) into
sandbox environment variables. Replace the inline Vertex AI env
injection in the server with the registry-based inject_env dispatch.

Also adds the google-cloud.yaml provider profile with SA JWT and ADC
OAuth2 credential refresh flows.

Signed-off-by: Robert Sturla <rsturla@redhat.com>
@p5 p5 force-pushed the dev/robertsturla/gce-metadata-emulator branch from 28d626a to abde14b Compare June 4, 2026 22:16
@p5
Author

p5 commented Jun 4, 2026

Force pushed to fix the lint issues. "mise run pre-commit" now succeeds locally again.

Previous E2E tests ongoing here - https://github.com/NVIDIA/OpenShell/actions/runs/26982657920


Edit:
rust-docker E2E tests failed. It looks like a flake, especially given the same tests for podman are passing.

@maxamillion
Collaborator

This is great! One nitpick, though: I think GCE_METADATA_IP should include the emulator port. The google-auth Python library builds its metadata ping URL directly from GCE_METADATA_IP, so 127.0.0.1 becomes http://127.0.0.1 on port 80, but the shim listens on 127.0.0.1:8174. From an initial check, it looks like the other language libraries for Google auth use GCE_METADATA_HOST.

Tested with:

$ env UV_CACHE_DIR=/tmp/uv-cache uv run --with google-auth --with requests python /tmp/verify_gce_metadata_ip.py
ping_127_no_port False
ping_127_with_port True

here's the contents of /tmp/verify_gce_metadata_ip.py:

import os
import threading
import http.server
import socketserver

import google.auth.compute_engine._metadata as metadata
from google.auth.transport.requests import Request


class MetadataHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Metadata-Flavor", "Google")
        self.end_headers()

    def log_message(self, *args):
        pass


def main():
    server = socketserver.TCPServer(("127.0.0.1", 8174), MetadataHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()

    try:
        request = Request()

        os.environ["GCE_METADATA_IP"] = "127.0.0.1"
        print("ping_127_no_port", metadata.ping(request, timeout=1, retry_count=1))

        os.environ["GCE_METADATA_IP"] = "127.0.0.1:8174"
        print("ping_127_with_port", metadata.ping(request, timeout=1, retry_count=1))
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    main()
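
The port behavior behind this result can be shown without google-auth at all. The sketch below uses only the standard library; `effective_port` is a hypothetical helper, and the "http://" prefix follows the URL construction described in the comment above.

```python
from urllib.parse import urlsplit


def effective_port(metadata_ip: str) -> int:
    """Port an HTTP client would dial for http://<metadata_ip>."""
    parts = urlsplit(f"http://{metadata_ip}")
    # With no explicit port in the value, plain HTTP falls back to port 80.
    return parts.port if parts.port is not None else 80


if __name__ == "__main__":
    print(effective_port("127.0.0.1"))       # 80
    print(effective_port("127.0.0.1:8174"))  # 8174
```

A bare IP therefore never reaches a listener on 8174, which matches the False/True pair in the output above.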

Add a loopback HTTP server on 127.0.0.1:8174 inside the sandbox
network namespace that emulates the GCE instance metadata API.
GCP client SDKs discover it via GCE_METADATA_HOST and obtain
credential placeholders that the proxy resolves to real tokens
at egress.

- Add metadata_server module with MetadataHandler trait and
  netns-aware TCP binding via std::thread (not spawn_blocking)
  to avoid tokio pool namespace contamination
- Add google_cloud_metadata module implementing the GCE metadata
  API subset (token, project-id, email, scopes, service-accounts)
- Add child_env_resolved() and gcp_token_response() to
  ProviderCredentialState for GCP-aware credential projection
- Wire metadata server into sandbox lifecycle before SSH handler
- Collapse multi-line HTTP response format string into single line

Signed-off-by: Robert Sturla <rsturla@redhat.com>
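
For readers unfamiliar with the metadata API, the subset named in the commit message above could look roughly like this. It is a standard-library sketch, not the Rust module from the PR: the paths follow the public GCE metadata layout, while the returned values, the placeholder token, the 403 on a missing request header, and the `start_emulator` and `fetch` helpers are assumptions made for illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PREFIX = "/computeMetadata/v1"
SA = f"{PREFIX}/instance/service-accounts/default"

# Hypothetical values; a real emulator would project these from provider config.
ROUTES = {
    f"{PREFIX}/project/project-id": ("text/plain", "example-project"),
    f"{SA}/email": ("text/plain", "sandbox@example-project.iam.gserviceaccount.com"),
    f"{SA}/scopes": ("text/plain", "https://www.googleapis.com/auth/cloud-platform\n"),
    f"{SA}/token": (
        "application/json",
        json.dumps({"access_token": "placeholder-token", "expires_in": 3600, "token_type": "Bearer"}),
    ),
}


class EmulatorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        path = self.path.split("?", 1)[0]  # ignore query strings such as ?scopes=...
        if self.headers.get("Metadata-Flavor") != "Google":
            status, ctype, body = 403, "text/plain", "missing Metadata-Flavor header"
        elif path not in ROUTES:
            status, ctype, body = 404, "text/plain", "not found"
        else:
            status = 200
            ctype, body = ROUTES[path]
        payload = body.encode()
        self.send_response(status)
        self.send_header("Metadata-Flavor", "Google")  # clients check this on responses
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_emulator(port: int = 0) -> ThreadingHTTPServer:
    """Start the sketch server on loopback in a background thread (port 0 picks a free port)."""
    server = ThreadingHTTPServer(("127.0.0.1", port), EmulatorHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(base_url: str, path: str) -> str:
    """GET a metadata path directly, ignoring any proxy settings in the environment."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(base_url + path, headers={"Metadata-Flavor": "Google"})
    with opener.open(request, timeout=2) as response:
        return response.read().decode()


if __name__ == "__main__":
    server = start_emulator()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    print(fetch(base, f"{SA}/token"))
    server.shutdown()
    server.server_close()
```

The token response shape (access_token, expires_in, token_type) is what lets an SDK treat the placeholder like any other bearer token.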
@p5 p5 force-pushed the dev/robertsturla/gce-metadata-emulator branch from abde14b to 009f28a Compare June 4, 2026 22:51
@p5
Author

p5 commented Jun 4, 2026

Awesome spot!
You're correct - the variable should include the port.

Ran through your reproducer and confirmed it now works as expected.

@p5 p5 marked this pull request as ready for review June 5, 2026 11:20
@p5 p5 requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners June 5, 2026 11:20
@p5 p5 force-pushed the dev/robertsturla/gce-metadata-emulator branch from 009f28a to 2d776a7 Compare June 5, 2026 17:45
p5 added 2 commits June 5, 2026 18:46
Document the google-cloud provider setup for ADC and service account
flows, injected environment variables, metadata emulator behavior, and
network policy configuration for GCP APIs.

Signed-off-by: Robert Sturla <rsturla@redhat.com>
Widen --from-gcloud-adc to accept google-cloud providers. The ADC
credential key is derived from the provider profile rather than
hardcoded per type, so future GCP provider types get ADC support by
declaring the right refresh metadata in their profile YAML.

Add ProviderTypeProfile::adc_credential() to find the ADC-compatible
credential from a profile's refresh metadata. Remove unused
VERTEX_AI_ADC_TOKEN_KEY and GCP_ADC_TOKEN_KEY constants.

Signed-off-by: Robert Sturla <rsturla@redhat.com>
@p5 p5 force-pushed the dev/robertsturla/gce-metadata-emulator branch from 2d776a7 to aada3da Compare June 5, 2026 17:48
@cgwalters
Contributor

One thing I'd like to consider here is OpenShell including something like https://github.com/LobsterTrap/llmproxy by default - i.e. an inference endpoint that always appears to be OpenResponses compatible to inner tooling. Wouldn't work with Claude Code (AFAIK without hacks) but it'd be nice to just entirely remove needing to handle inference provider auth at all for all the tools that can speak OpenResponses.

It's a heavier hammer here though.

Of course it's worth noting that many non-local deployments will probably end up wanting some kind of proxy anyways to handle observability etc. There's various existing more heavyweight things in that space.

@stbenjam

stbenjam commented Jun 6, 2026

Thanks, these changes work for me standalone following Adam's instructions.

Will vertex provider stay or get stripped out? It might be confusing if it exists but doesn't work with CC.

Labels

test:e2e Requires end-to-end coverage

Development

Successfully merging this pull request may close these issues.

feat: emulate GCE metadata server for Google SDK access in sandboxes

4 participants