Skip to content

feat(compile): default executor to System.AccessToken and add always-on Azure CLI#873

Open
jamesadevine wants to merge 3 commits into
mainfrom
devinejames/wholesale-access-token
Open

feat(compile): default executor to System.AccessToken and add always-on Azure CLI#873
jamesadevine wants to merge 3 commits into
mainfrom
devinejames/wholesale-access-token

Conversation

@jamesadevine
Copy link
Copy Markdown
Collaborator

@jamesadevine jamesadevine commented Jun 5, 2026

Summary

Two coupled Stage 3 / sandbox-infra changes:

Part A — Stage 3 executor defaults to $(System.AccessToken)

Previously the compiler hard-failed when a write-bearing safe output
(e.g. create-pull-request, create-work-item, add-pr-comment) was
configured without permissions.write. That stance was questionable —
the pipeline's built-in OAuth token ($(System.AccessToken), running
as Project Collection Build Service) is sufficient for the vast
majority of write operations, and requiring an ARM service connection
added significant onboarding friction.

  • generate_executor_ado_env now always emits a non-empty env: block
    for the Stage 3 executor step:
    • SYSTEM_ACCESSTOKEN: $(System.AccessToken) by default
    • SYSTEM_ACCESSTOKEN: $(SC_WRITE_TOKEN) when permissions.write is
      set (override for cross-org writes / named-identity attribution)
  • validate_write_permissions deleted along with its callsite — the
    "requires write SC" gate no longer exists.
  • WRITE_REQUIRING_SAFE_OUTPUTS removed entirely (its only consumers
    were validate_write_permissions and the tests that exercised it;
    the per-tool REQUIRES_WRITE const on each *Result type already
    provides the same information for any downstream tooling that needs
    it). The remaining test_all_known_completeness was rewritten to
    use a HashSet-based duplicate check.

Trust boundary preserved: the agent (Stage 1) still never maps
SYSTEM_ACCESSTOKEN. Only the Setup-job filter gate and the Stage 3
executor map it, both of which run outside the agent's container.

Part B — Always-on Azure CLI (az) inside the AWF sandbox

Mirrors gh-aw's "assume gh is on the runner" model. az is
pre-installed on Microsoft-hosted Ubuntu images and on most 1ES Ubuntu
images, so we expose it to the agent without an install step.

Architecture

  • New AzureCliExtension (Tool phase, always-on):
    • Allow-lists 5 Azure auth/management hosts:
      login.microsoftonline.com, login.windows.net,
      management.azure.com, graph.microsoft.com, aka.ms.
    • Adds az to every agent's bash allow-list.
    • Emits a "Detect Azure CLI on host (for AWF mount)" prepare step
      in the Agent job (NOT a separate Setup job).
    • No install step. No static AWF mounts. No new trait method.

Graceful runtime detection (addresses 1ES-breakage concern)

Static bind-mounts for /opt/az + /usr/bin/az would crash
docker run with "bind source path does not exist" on runners
without azure-cli pre-installed (notably some 1ES self-hosted
pools). The detection prepare step instead:

  • Tests for both /usr/bin/az (launcher) and /opt/az (venv) on the
    host.
  • Present → emits
    ##vso[task.setvariable variable=AW_AZ_MOUNTS]--mount /opt/az:/opt/az:ro --mount /usr/bin/az:/usr/bin/az:ro.
  • Missing → emits an empty-value ##vso[task.setvariable variable=AW_AZ_MOUNTS]
    followed by ##vso[task.logissue type=warning] explaining az
    won't be available inside the sandbox.

Setting the variable to empty string in the missing-az branch is
load-bearing: ADO leaves an undefined $(VAR) as the literal string
$(VAR) in later bash steps, where bash would interpret it as a
$(...) command substitution, try to execute a program named
AW_AZ_MOUNTS, get exit 127, and the AWF invocation step would die
under set -e — the opposite of graceful degradation. Defining the
variable as empty makes ADO expand it to nothing, leaving a harmless
\-continuation. Locked in by a 2× count regression test in both
azure_cli.rs unit tests and compiler_tests.rs.

generate_awf_mounts appends one $(AW_AZ_MOUNTS) \ line to the
AWF --mount chain when the AzureCli extension is present.

Operator outcomes

  • Microsoft-hosted ubuntu-latest: detected → mounted → az
    works inside the sandbox.
  • 1ES self-hosted with azure-cli baked into the image: same.
  • 1ES self-hosted without azure-cli: pipeline runs
    successfully; warning surfaces in the ADO log; agents that invoke
    az get command not found inside the sandbox (the standard
    failure mode for a missing CLI). No more 1ES breakage.

Auth caveat (unchanged): AZURE_DEVOPS_EXT_PAT (set when
permissions.read is configured) authenticates only az devops
subcommands. General az / ARM / Graph commands need separate auth.

Live smoke pipeline

tests/safe-outputs/azure-cli.md — daily-scheduled Copilot pipeline
that runs az --version + az devops project list and emits one
noop. Pattern matches existing tests/safe-outputs/*.md smoke
pipelines.

Docs

  • docs/network.md: Permissions section rewritten for new default;
    new "Always-on Azure CLI" section with the runtime-detection design
    and operator implications; aka.ms added to default-hosts table.
  • docs/tools.md: new "Built-in CLIs" section covering both az
    (with runtime detection + host-posture table) and gh.
  • docs/template-markers.md: {{ acquire_write_token }} and
    {{ executor_ado_env }} entries refreshed.
  • docs/front-matter.md: permissions.write reworded as optional.
  • docs/safe-outputs.md: new "Executor authentication" subsection;
    removed "requires write service connection" caveat from
    upload-build-attachment.
  • docs/ado-aw-debug.md: example env: block updated.
  • prompts/create-ado-agentic-workflow.md,
    prompts/update-ado-agentic-workflow.md: softened
    "requires permissions.write" wording.
  • site/.../troubleshooting/common-issues.mdx: marked the old error
    message as historical.
  • site/.../setup/service-connections.mdx: replaced obsolete
    compilation-error row with a cross-project 403 troubleshooting row.

Test plan

  • cargo build — clean.
  • cargo test — 1748 unit + 119 compiler + all integration tests pass.
  • cargo clippy --all-targets --all-features — clean.
  • New tests added (Part A):
    • test_executor_step_uses_system_access_token_by_default
      verifies the new default env block.
    • test_compile_succeeds_without_write_sc_for_write_requiring_safe_outputs
      — regression guard that the old hard-fail path is gone.
    • test_agent_job_steps_do_not_map_system_access_token — YAML-aware
      agent-job-scoped leak guard.
  • New tests added (Part B):
    • test_default_pipeline_mounts_az_and_allows_azure_hosts — asserts
      the detection prepare step, the $(AW_AZ_MOUNTS) \ injection,
      Azure hosts in --allow-domains, that NO static az mounts are
      emitted, that no install step is emitted, and that the
      setvariable for AW_AZ_MOUNTS appears exactly twice (one per
      branch of the if/else).
    • azure_cli.rs unit tests covering the extension shape, detection
      conditional, pipeline-variable setvariable, empty-value
      setvariable in the missing-az branch, warning on missing host,
      and set -eo pipefail discipline.
  • Updated tests:
    • test_engine_args_bash_disabled — now permits always-on
      shell(az) while still rejecting general shell(:*)/shell(*)/shell(bash).
    • test_generate_awf_mounts_no_extensions — now expects the
      $(AW_AZ_MOUNTS) \ injection line in the always-on baseline.
    • All collect_extensions count assertions bumped +1.
    • test_all_known_completeness rewritten around HashSet duplicate
      check (now that WRITE_REQUIRING_SAFE_OUTPUTS is gone).
  • Deleted tests:
    test_permissions_validation_fails_without_write_sc,
    test_comment_on_work_item_requires_write_sc,
    test_write_requiring_subset_of_all_known.

Lock-file regeneration

Only tests/safe-outputs/azure-cli.lock.yml is added by hand (it's a
new fixture). The other 26 tests/safe-outputs/*.lock.yml files will
be regenerated by the existing release-task workflow after this lands.

…on Azure CLI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

🔍 Rust PR Review

Summary: Looks good overall — clean implementation with solid test coverage. One medium concern about the always-on AzureCliExtension breaking non-Microsoft-hosted runners, and a few minor nits.

Findings

⚠️ Suggestions

  • src/compile/extensions/azure_cli.rs — no opt-out for always-on mount
    The AzureCliExtension unconditionally mounts /opt/az and /usr/bin/az into every AWF container. If the host doesn't have azure-cli installed, AWF will fail to start the Agent container (bind-mount error), making the entire pipeline hard-fail rather than gracefully skip az availability. This is a breaking change for 1ES self-hosted pools that don't bake in azure-cli, and there's no front-matter opt-out (e.g. tools: az: false). The docs call this out but don't provide an escape hatch. Given gh-aw's gh also has no opt-out, this may be an accepted design choice — but worth a deliberate call.

  • src/safeoutputs/mod.rs:30WRITE_REQUIRING_SAFE_OUTPUTS is now unreachable in-crate
    #[allow(dead_code)] was added because validate_write_permissions (its only in-crate consumer) was deleted. If nothing in the crate (including audit/) reads this constant, the annotation is just noise and the constant could be removed. If the audit analyzers do read it, remove the attribute. A quick grep WRITE_REQUIRING_SAFE_OUTPUTS src/ would clarify.

  • tests/safe-outputs/azure-cli.md:11gpt-5-mini is not a real model name
    The smoke test specifies model: gpt-5-mini. This will error at runtime when the pipeline actually queues a run — the Copilot or OpenAI API will reject an unknown model. Likely should be gpt-4o-mini or o4-mini.

  • PR description vs. code: login.microsoft.com is listed in the summary but absent from the allow-list
    The PR description enumerates 5 hosts including login.microsoft.com, but azure_cli.rs correctly has aka.ms as the 5th instead. login.microsoft.com is the consumer MSA sign-in surface and not used by enterprise Azure CLI auth — so the code is right — but the description will cause confusion for anyone cross-checking. Minor; description-only fix.

✅ What Looks Good

  • test_agent_job_steps_do_not_map_system_access_token — excellent defense-in-depth. Parses the compiled YAML, scopes the check to the Agent job only, and precisely avoids false-positives from the Setup-job gate and Stage 3 executor. Exactly the right way to guard the cross-stage trust boundary.
  • AWF mounts are AwfMountMode::ReadOnly — the agent has no business writing to az's install; this is enforced at mount level and tested.
  • validate_write_permissions removal is clean: the regression guard (test_compile_succeeds_without_write_sc_for_write_requiring_safe_outputs) checks the positive path (compile succeeds + $(System.AccessToken) is present) and the absence of $(SC_WRITE_TOKEN) in the same pass.
  • test_engine_args_bash_disabled update is correct: the old !params.contains("shell(") would have fired on the always-on shell(az) injection. The new form narrows to the actually-forbidden forms (shell(:*), shell(*), shell(bash)) while asserting the expected shell(az) is present — a faithful expression of the policy.
  • No install step is the right call. Matches gh-aw's model and avoids minutes of container-time overhead per run.

Generated by Rust PR Reviewer for issue #873 · sonnet46 1.5M ·

… longer crashes 1ES

Reviewer-requested fix: static AWF bind-mounts for /opt/az and /usr/bin/az
would break `docker run` on runners without azure-cli pre-installed (notably
some 1ES self-hosted pools), failing the pipeline before the agent ever
started.

Replace the static mounts with a runtime detection prepare step that sets
the ADO pipeline variable AW_AZ_MOUNTS via `##vso[task.setvariable]` when
both /usr/bin/az and /opt/az exist on the host, or emits a `task.logissue`
warning and leaves the variable unset otherwise.

The AWF invocation in the compiled YAML now includes a single
`$(AW_AZ_MOUNTS) \` line in the --mount chain. ADO interpolates the variable
at step start: present -> the two --mount args appear; absent -> the line
collapses to whitespace. No new trait method is added; only the existing
`prepare_steps` hook is used.

- AzureCliExtension: required_awf_mounts() now returns []; prepare_steps()
  emits the detection bash step
- generate_awf_mounts: appends `$(AW_AZ_MOUNTS) \` when AzureCli is present
- Tests: rewrite static-mount assertions to assert the detection step + the
  pipeline variable injection, plus a regression guard that no static az
  mount is emitted
- Docs: docs/network.md and docs/tools.md updated with the runtime-detection
  design and operator implications

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

🔍 Rust PR Review

Summary: Needs one fix — a silent breakage on runners without az installed that defeats the PR's primary graceful-degradation goal.


Findings

🐛 Bugs / Logic Issues

  • src/compile/extensions/azure_cli.rs:119-130AW_AZ_MOUNTS never set in the else branch

    The detection step only calls ##vso[task.setvariable variable=AW_AZ_MOUNTS] in the if branch (az present). When az is not detected, the variable is simply never defined in ADO's pipeline variable store.

    In ADO, an undefined $(VARNAME) in a subsequent bash step is not expanded to empty string — it is left as the literal string $(VARNAME). The AWF invocation step already has set -eo pipefail active ("pipefail (set above)" comment confirms this). Bash sees $(AW_AZ_MOUNTS) as a command substitution, tries to execute a program named AW_AZ_MOUNTS, gets exit 127, and set -e kills the step.

    The PR description says "absent → the line collapses to whitespace + a \ continuation, which is a no-op" — but this only holds if the variable is explicitly set to empty string. An undefined variable does not collapse; it leaves the literal $(AW_AZ_MOUNTS) for bash to evaluate.

    This means every compiled pipeline running on a 1ES self-hosted pool without azure-cli pre-installed would fail at the AWF invocation step — the opposite of the "graceful degradation" the PR promises.

    Fix — add one line to the else branch in prepare_steps:

    let step = r###"- bash: |
        set -eo pipefail
        if [ -f /usr/bin/az ] && [ -d /opt/az ]; then
          echo "##vso[task.setvariable variable=AW_AZ_MOUNTS]--mount /opt/az:/opt/az:ro --mount /usr/bin/az:/usr/bin/az:ro"
          echo "Azure CLI detected on host; mounting /opt/az and /usr/bin/az into AWF sandbox."
        else
          echo "##vso[task.setvariable variable=AW_AZ_MOUNTS]"
          echo "##vso[task.logissue type=warning]Azure CLI not detected ..."
        fi
      displayName: "Detect Azure CLI on host (for AWF mount)"
    "###;

    With AW_AZ_MOUNTS always defined (even as empty string), ADO expands $(AW_AZ_MOUNTS) to nothing and the \ continuation line becomes a harmless no-op exactly as described.

    The existing test test_azure_cli_prepare_steps_warns_when_az_missing should also assert setvariable is emitted in the else branch to lock this in.


⚠️ Suggestions

  • src/compile/extensions/azure_cli.rs:69login.microsoft.com absent from required_hosts()

    The PR description lists it ("Allow-lists 5 Azure auth hosts: ... login.microsoft.com ...") but the code has only these 5: login.microsoftonline.com, login.windows.net, management.azure.com, graph.microsoft.com, aka.ms. If you intentionally excluded login.microsoft.com (personal MSA accounts, rarely needed for enterprise az usage), the description should drop it.


✅ What Looks Good

  • Trust boundary test is excellent: test_agent_job_steps_do_not_map_system_access_token parses the compiled YAML, scopes to the Agent job only, and asserts SYSTEM_ACCESSTOKEN is absent from every step's env: — much stronger than a text search that would false-positive on the legitimate executor and filter-gate mappings.
  • Removing validate_write_permissions is correct: the built-in token is standard ADO practice; forcing an ARM SC was unnecessary friction. The regression test that flips the old hard-fail to a success assertion is clean.
  • Graceful detection approach: avoiding static bind-mounts in required_awf_mounts() is the right call — the static-mount crash on 1ES pools is a real failure mode.
  • Test coverage is thorough across both parts; the smoke pipeline azure-cli.md provides end-to-end validation on real infra.

Generated by Rust PR Reviewer for issue #873 · sonnet46 2.4M ·

Address two findings from the Rust PR Reviewer bot on PR #873.

1. CRITICAL — AW_AZ_MOUNTS undefined when az is missing:

The runtime-detection step in AzureCliExtension only set the
AW_AZ_MOUNTS pipeline variable in the detected branch. In the
missing-az branch the variable was left undefined. ADO leaves an
undefined $(VAR) as the LITERAL STRING "$(VAR)" in subsequent bash
steps (it does NOT expand to empty). Bash sees $(AW_AZ_MOUNTS),
interprets it as a $(...) command substitution, tries to execute
a program named AW_AZ_MOUNTS, gets exit 127, and the AWF
invocation step dies under `set -e` — the exact 1ES failure mode
the refactor set out to prevent.

Fix: always emit `##vso[task.setvariable variable=AW_AZ_MOUNTS]`,
with an empty value in the missing branch. ADO then expands
$(AW_AZ_MOUNTS) to nothing and the trailing `\` line becomes a
harmless continuation no-op.

Regression guards (both lock this in):
  - azure_cli.rs::test_azure_cli_prepare_steps_defines_aw_az_mounts_in_else_branch
    counts `setvariable` occurrences (must be 2) and asserts the
    else block contains an empty-value setvariable line.
  - compiler_tests.rs::test_default_pipeline_mounts_az_and_allows_azure_hosts
    asserts the same 2× count on the compiled lock.yml.

2. Cleanup — delete WRITE_REQUIRING_SAFE_OUTPUTS:

The const was retained with #[allow(dead_code)] after the
removal of `validate_write_permissions`, but its only consumers
left were the two tests that exercised the const itself. Each
`*Result` type already carries `REQUIRES_WRITE: bool` for any
caller (compiler, audit, runtime) that needs the same info.
Deleting the const removes a dead-code annotation and one
otherwise-purposeless list to maintain when adding new tools.

Test cleanup: removed `test_write_requiring_subset_of_all_known`
(purely exercised the deleted const) and rewrote
`test_all_known_completeness` to use a HashSet-based duplicate
check on ALL_KNOWN_SAFE_OUTPUTS plus the ALWAYS_ON/NON_MCP
disjointness check (preserves the meaningful invariants).

Validation:
  - cargo build: clean
  - cargo test: 1748 unit + 119 compiler + all integration pass
  - cargo clippy --all-targets --all-features -- -D warnings: clean
  - tests/safe-outputs/azure-cli.lock.yml regenerated (+1 line)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 6, 2026

🔍 Rust PR Review

Summary: Looks good — both parts are well-reasoned and well-tested, with a couple of design-level suggestions.


Findings

⚠️ Suggestions

generate_awf_mounts tight coupling to AzureCliExtension variant
[src/compile/common.rs, generate_awf_mounts]

The inject_az_var check names a concrete enum variant inside an otherwise fully trait-generic function:

let inject_az_var = extensions
    .iter()
    .any(|ext| matches!(ext, super::extensions::Extension::AzureCli(_)));

This is the only place in common.rs that names a specific extension variant. If AzureCli is renamed, or a second extension ever needs the same pipeline-variable-mount pattern, it'll need updating in two places.

A fn awf_pipeline_var_mount(&self) -> Option<&str> method on CompilerExtension (returning Some("$(AW_AZ_MOUNTS) \\") for AzureCliExtension, None for all others) would keep generate_awf_mounts fully generic. Structural preference — the current code is correct.


test_all_known_completeness loses exhaustiveness guarantee
[src/safeoutputs/mod.rs]

The old test enforced ALL_KNOWN = WRITE_REQUIRING + ALWAYS_ON + NON_MCP via a count assertion. The new test only checks for duplicates and ALWAYS_ON/NON_MCP disjointness. With WRITE_REQUIRING_SAFE_OUTPUTS gone, there's no test that catches a new write tool (added via tool_result! { write = true, ... }) that a developer forgets to include in ALL_KNOWN_SAFE_OUTPUTS. The duplicate check catches over-inclusion; nothing catches under-inclusion.

A missing tool produces silent routing failures at runtime. A minimal guard would help:

// Bump this constant when adding new tools.
const MIN_KNOWN_COUNT: usize = 26; // count as of this PR
assert!(
    ALL_KNOWN_SAFE_OUTPUTS.len() >= MIN_KNOWN_COUNT,
    "ALL_KNOWN_SAFE_OUTPUTS shrank — did you forget to register a new tool? (got {})",
    ALL_KNOWN_SAFE_OUTPUTS.len()
);

🔒 Security Concerns

Unconditional Azure endpoint allowlist expansion
[src/compile/extensions/azure_cli.rs, required_hosts]

login.microsoftonline.com, login.windows.net, management.azure.com, graph.microsoft.com, and aka.ms are injected into every compiled pipeline's --allow-domains list — no opt-out. The PR rationale ("inert when az isn't present") is sound for availability, but from a security posture standpoint this expands the prompt-injection blast radius: a compromised agent in any pipeline can now attempt to reach Azure AD / ARM / Graph regardless of whether the operator intended to use az.

Not a blocker — the L7 proxy is the enforcement point and the docs/network.md update acknowledges this — but worth making explicit that this is an unconditional policy change for all pipelines, not just those that use az.


✅ What Looks Good

  • Both-branch setvariable regression test: locking in the 2× occurrence count for ##vso[task.setvariable variable=AW_AZ_MOUNTS] (in both azure_cli.rs unit tests and compiler_tests.rs) is exactly right. That's the critical invariant for graceful degradation and it's well-guarded.
  • YAML-aware, agent-job-scoped trust boundary test (test_agent_job_steps_do_not_map_system_access_token): parsing the compiled YAML and walking only the Agent job's env blocks avoids false-positives from the Setup-gate and Stage 3 legitimate mappings. Clean approach.
  • Empty static mount list with explanatory doc comment: the required_awf_mounts() no-op body and its detailed comment on why it must stay empty is the right call — it will prevent well-intentioned future regressions.
  • test_compile_succeeds_without_write_sc_for_write_requiring_safe_outputs: validates the actual compiled YAML content (not just exit code), so the default $(System.AccessToken) mapping is confirmed end-to-end.

Generated by Rust PR Reviewer for issue #873 · sonnet46 3.1M ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant