Skip to content

Add dynamic workflow tool#3426

Open
neubig wants to merge 29 commits into
mainfrom
redo-dynamic-workflow-mvp
Open

Add dynamic workflow tool#3426
neubig wants to merge 29 commits into
mainfrom
redo-dynamic-workflow-mvp

Conversation

@neubig
Copy link
Copy Markdown
Member

@neubig neubig commented May 28, 2026

Summary

  • add a WorkflowToolSet with WorkflowAction/WorkflowObservation for Python dynamic workflow scripts
  • add a WorkflowContext that supports run_agent, map_agents, reduce_agent, and flatten using the existing TaskManager
  • aggregate map_agents sub-agent failures so one failing item does not cancel sibling tasks before their results/errors are collected
  • add focused workflow tests and a runnable standalone SDK example where the parent agent generates the workflow code for a repo-wide test coverage audit

Closes #3425

Related docs PR: OpenHands/docs#535

Tests

  • uv run pytest tests/tools/workflow/test_workflow_tool.py -q
  • uv run pytest tests/tools/workflow/test_workflow_tool.py tests/tools/task/test_task_manager.py tests/tools/task/test_task_manager_thread_safety.py -q
  • uv run pre-commit run --files openhands-tools/openhands/tools/task/manager.py openhands-tools/openhands/tools/workflow/definition.py openhands-tools/openhands/tools/workflow/impl.py openhands-tools/openhands/tools/workflow/__init__.py openhands-tools/openhands/tools/__init__.py tests/tools/workflow/test_workflow_tool.py examples/01_standalone_sdk/52_dynamic_workflow.py
  • uv run python examples/01_standalone_sdk/52_dynamic_workflow.py
  • DOCS_PATH=/workspace/project/docs GITHUB_EVENT_NAME=pull_request GITHUB_EVENT_PATH=/tmp/pr_event.json python .github/scripts/check_documented_examples.py

This PR was created by an AI agent (OpenHands) on behalf of the user.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:8a484b7-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-8a484b7-python \
  ghcr.io/openhands/agent-server:8a484b7-python

All tags pushed for this build

ghcr.io/openhands/agent-server:8a484b7-golang-amd64
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-golang-amd64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-golang-amd64
ghcr.io/openhands/agent-server:8a484b7-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:8a484b7-golang-arm64
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-golang-arm64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-golang-arm64
ghcr.io/openhands/agent-server:8a484b7-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:8a484b7-java-amd64
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-java-amd64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-java-amd64
ghcr.io/openhands/agent-server:8a484b7-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:8a484b7-java-arm64
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-java-arm64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-java-arm64
ghcr.io/openhands/agent-server:8a484b7-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:8a484b7-python-amd64
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-python-amd64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-python-amd64
ghcr.io/openhands/agent-server:8a484b7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:8a484b7-python-arm64
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-python-arm64
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-python-arm64
ghcr.io/openhands/agent-server:8a484b7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:8a484b7-golang
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-golang
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-golang
ghcr.io/openhands/agent-server:8a484b7-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:8a484b7-java
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-java
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-java
ghcr.io/openhands/agent-server:8a484b7-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:8a484b7-python
ghcr.io/openhands/agent-server:8a484b75764fbc47af7577b36c68cd77e8c92498-python
ghcr.io/openhands/agent-server:redo-dynamic-workflow-mvp-python
ghcr.io/openhands/agent-server:8a484b7-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., 8a484b7-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 8a484b7-python-amd64) are also available if needed

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ QA Report: PASS WITH ISSUES

The dynamic workflow tool worked end-to-end in manual SDK usage, but PR CI currently has a failed check-examples check.

Does this PR achieve its stated goal?

Yes, functionally. On the base branch, WorkflowToolSet is absent; on the PR branch, I could import/create the tool, execute workflow scripts, block an unsafe open() script, run a single real subagent via wf.run_agent, and run the new example that fans out subagents then reduces their answers into a final result with EXAMPLE_COST.

Phase Result
Environment Setup make build completed successfully
CI Status ⚠️ check-examples failed; several checks were still in progress when checked
Functional Verification ✅ Direct tool usage, guardrail behavior, run_agent, and example map_agents/reduce_agent flow worked
Functional Verification

Test 1: Baseline absence vs PR availability

Step 1 — Establish baseline without the feature:
Ran git checkout --quiet origin/main; uv run python -c "import importlib.util; print('workflow_module_found=', importlib.util.find_spec('openhands.tools.workflow') is not None)"; uv run python -c "from openhands.tools import WorkflowToolSet; print(WorkflowToolSet)" || echo 'WorkflowToolSet import failed on base branch as expected':

BASE_COMMIT=5e614afe
workflow_module_found= False
ImportError: cannot import name 'WorkflowToolSet' from 'openhands.tools'
WorkflowToolSet import failed on base branch as expected

This confirms the workflow tool is new user-facing functionality, not pre-existing behavior.

Step 2 — Apply the PR's changes:
Checked out redo-dynamic-workflow-mvp at d91a5e8.

Step 3 — Re-run with the feature in place:
Used WorkflowToolSet.create(...).as_executable() from a real Conversation, then executed a flatten workflow and an unsafe workflow:

tool_available= WorkflowToolSet
flatten_status= completed is_error= False text= ['alpha', 'beta', 'gamma', 'delta']
unsafe_status= error is_error= True text= Workflow scripts may not call `open`

This shows the new tool is importable/creatable, executes a valid workflow script, and enforces the advertised best-effort unsafe-call guardrail.

Test 2: Single subagent workflow primitive

Step 1 — Establish baseline:
The base branch import check above shows there was no WorkflowToolSet/wf.run_agent path to exercise.

Step 2 — Apply the PR's changes:
On redo-dynamic-workflow-mvp, registered a small QA subagent and invoked the workflow tool with a script calling await wf.run_agent(...).

Step 3 — Run with the feature in place:
Ran the one-off SDK script through uv run python:

run_agent_status= completed is_error= False
run_agent_text= PASS-WORKFLOW-RUN-AGENT
cost= 0.019145000000000002

This confirms the single-subagent primitive works through the actual TaskManager/subagent path with a real LLM-backed subagent.

Test 3: Runnable standalone example with fan-out and reduce

Step 1 — Establish baseline:
The example file does not exist on the base branch as part of the new feature; the base import check also shows the workflow tool itself is absent.

Step 2 — Apply the PR's changes:
On redo-dynamic-workflow-mvp, ran the new standalone example exactly as a user would.

Step 3 — Run with the feature in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python examples/01_standalone_sdk/52_dynamic_workflow.py:

Task 'task_00000002' completed.
Task 'task_00000001' completed.
Task 'task_00000003' completed.
Task 'task_00000004' completed.
Dynamic workflow result:
An octopus runs on three hearts, with the main one pausing whenever it swims; a honeybee can recognize human faces through clever pattern-based vision despite its tiny brain; and snow leopards skip roaring in favor of chuffs, hisses, growls, and a loud, non-aggressive “prusten.”
EXAMPLE_COST: 0.053985000000000005

This confirms the advertised fan-out (map_agents) and reducer (reduce_agent) workflow runs end-to-end with real subagents and prints the expected example cost line.

Issues Found

  • 🟠 Issue: GitHub CI currently reports [Optional] Docs example / check-examples as failed, with additional checks still in progress. Functional QA did not reproduce a workflow-tool failure—the new example ran successfully locally—but the PR should not be considered fully healthy until CI is green.

This review was created by an AI agent (OpenHands) on behalf of the user.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-tools/openhands/tools
   __init__.py13284%37–38
openhands-tools/openhands/tools/task
   manager.py16811233%81–83, 87–89, 99–100, 102–103, 107, 117, 121, 124, 127–132, 134, 140–141, 145, 149–152, 155–159, 181–182, 184–185, 190, 195, 202–204, 209–212, 221, 226, 233, 246–247, 249, 255–256, 258, 267, 272, 278, 289–290, 292–295, 297, 314–315, 319–320, 322–323, 326, 328, 331, 334, 338–339, 341–344, 346–354, 356–357, 359, 365–366, 370–372, 374, 377, 379–380, 391–392, 396, 403–404, 412, 416, 421, 423–424
openhands-tools/openhands/tools/workflow
   definition.py23482%142, 144, 169, 171
   impl.py17413224%83–90, 92–94, 98–100, 109–110, 125–127, 134–136, 151–153, 159–164, 169–170, 172, 176–178, 184, 188, 203, 211–214, 216–217, 220–223, 227–228, 231–232, 238–239, 245–247, 251–252, 254–256, 259, 273–274, 278–281, 283, 288–289, 293–294, 303, 305–311, 314, 319, 323, 327–330, 336–341, 346–349, 351, 355–360, 362–364, 366–369, 375–376, 380–381, 385, 423, 434–435, 442, 446–448, 453–456, 463
TOTAL291441495148% 

@neubig neubig marked this pull request as draft May 28, 2026 22:14
Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review May 28, 2026 22:47
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

The dynamic workflow tool works end-to-end: it imports only on the PR branch, executes generated workflow scripts, rejects unsafe scripts, runs real map/reduce sub-agent orchestration, and the new standalone example completes.

Does this PR achieve its stated goal?

Yes. The PR set out to add a WorkflowToolSet/workflow actions, provide WorkflowContext helpers (run_agent, map_agents, reduce_agent, flatten), and include a runnable SDK example. I exercised those as a user: direct workflow tool execution returned a completed observation, unsafe script execution returned an error observation, real LLM-backed sub-agents produced a reducer result through map_agents/reduce_agent, and examples/01_standalone_sdk/52_dynamic_workflow.py ran successfully and printed EXAMPLE_COST.

Phase Result
Environment Setup make build completed and installed the uv environment.
CI Status ✅ All checked non-QA jobs are passing; this qa-changes job was still in progress when checked.
Functional Verification ✅ Verified import baseline, direct tool execution, safety rejection, real sub-agent map/reduce, and the standalone example.
Functional Verification

Test 1: Baseline import vs PR import/execution

Step 1 — Establish baseline without the feature:
Ran git switch --detach origin/main && uv run python -c 'import importlib; importlib.import_module("openhands.tools.workflow")':

ModuleNotFoundError: No module named 'openhands.tools.workflow'

This shows the workflow package is new user-visible functionality added by this PR.

Step 2 — Apply the PR's changes:
Switched back to redo-dynamic-workflow-mvp at 181521b3a694e847453c1eaa3e2820f792811df8.

Step 3 — Re-run with the PR behavior in place:
Ran a user-style SDK script that creates an Agent with WorkflowToolSet, creates a Conversation, and calls WorkflowExecutor with a generated async def main(wf) script using wf.flatten:

status= completed
is_error= False
text= {'areas': ['sdk', 'tools', 'workspace', 'agent-server'], 'count': 4}

This confirms the new tool package imports and the workflow tool executes generated Python through the conversation context.

Test 2: Unsafe generated workflow code is rejected

Ran the workflow tool with a generated script that attempted open('/tmp/qa-workflow-should-not-open'):

status= error
is_error= True
text= Workflow scripts may not call `open`

This confirms the tool returns an error observation instead of executing a direct file read from generated workflow code.

Test 3: Real map_agents + reduce_agent orchestration

Ran a workflow action that registered a real qa_echo sub-agent backed by the configured LLM, fanned out two sub-agent tasks with wf.map_agents(max_concurrency=2), then reduced their outputs with wf.reduce_agent:

Task 'task_00000002' completed.
Task 'task_00000001' completed.
Task 'task_00000003' completed.
status= completed
is_error= False
text= QA-RED-BLUE

This confirms the workflow context can fan out and reduce real sub-agent conversations, not just run local helper logic.

Test 4: Standalone example runs as documented

Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python examples/01_standalone_sdk/52_dynamic_workflow.py:

Message from Coverage Workflow Agent to User
## Repo-wide test coverage audit
...
Recommended next step
Start with a focused test push covering the P0 security and data-integrity gaps...

Tokens: ↑ input 177.15K • cache hit 59.25% • reasoning 3.9K • ↓ output 18.44K • $ 0.9667

EXAMPLE_COST: 0.9666999999999999

This confirms the parent agent can generate and run a dynamic workflow using the new tool, producing the intended repo-wide coverage audit and example cost line.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

Copy link
Copy Markdown
Member Author

neubig commented May 28, 2026

CI is green on 181521b, the example now has the parent agent generate the workflow code, and related docs are in OpenHands/docs#535. Ready for another automated review.\n\n_This comment was created by an AI agent (OpenHands) on behalf of the user._

@neubig neubig requested a review from all-hands-bot May 28, 2026 22:57
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 28, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good overall shape for an MVP — the three-layer design (WorkflowAction/WorkflowObservation/WorkflowToolSet in definition.py, runtime orchestration in impl.py, and the honest security disclaimer in the tool description) is clean and readable. The AST-level validation paired with _safe_globals is the right defence-in-depth approach for in-process script execution.

A few issues worth addressing before landing:

  • asyncio.gather in map_agents is missing return_exceptions=True, so one failing sub-agent silently aborts all parallel work and loses every partial result — the most impactful bug here.
  • One unreachable assert inside run_one should be removed or converted to a proper guard.
  • A private method (_ensure_parent) is called across module boundaries — fragile contract worth flagging.
  • Minor placement nit on the bottom-of-file TYPE_CHECKING import in definition.py.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands-tools/openhands/tools/workflow/impl.py Outdated
Comment thread openhands-tools/openhands/tools/workflow/impl.py Outdated
Comment thread openhands-tools/openhands/tools/workflow/impl.py Outdated
Comment thread openhands-tools/openhands/tools/workflow/definition.py Outdated
Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the dynamic workflow tool end-to-end as an SDK user; it imports on the PR branch, orchestrates real sub-agents via map/reduce, rejects unsafe scripts, and the new standalone example completed with a coverage report.

Does this PR achieve its stated goal?

Yes. The PR set out to add a WorkflowToolSet/WorkflowContext for Python dynamic workflow scripts plus a runnable standalone example. On the PR commit, I ran a real WorkflowExecutor with an SDK Conversation and a registered sub-agent; wf.map_agents, wf.flatten, and wf.reduce_agent completed and returned the expected workflow tokens. I also ran the new example path and its output ended with a repo-wide coverage report plus EXAMPLE_COST: 0.9111179999999999.

Phase Result
Environment Setup make build completed successfully and installed the uv workspace environment.
CI Status 🟡 At check time: 33 checks passing, 3 skipped, 1 pending (qa-changes); no CI jobs were rerun.
Functional Verification ✅ New import surface, real workflow execution, unsafe-script handling, and bundled example were exercised.
Functional Verification

Test 1: New workflow tool import surface

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout origin/main && uv run python -c "from openhands.tools import WorkflowToolSet; print(WorkflowToolSet)":

BRANCH=HEAD COMMIT=e65a8abd
ImportError: cannot import name 'WorkflowToolSet' from 'openhands.tools' (.../openhands-tools/openhands/tools/__init__.py)
EXIT_CODE=1

This shows the advertised workflow tool was not available from the public tools package on the base branch.

Step 2 — Apply the PR's changes:
Checked out redo-dynamic-workflow-mvp at 181521b3.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python -c "from openhands.tools import WorkflowToolSet; from openhands.tools.workflow import WorkflowAction, WorkflowExecutor; print(WorkflowToolSet.name, WorkflowAction.__name__, WorkflowExecutor.__name__)":

BRANCH=redo-dynamic-workflow-mvp COMMIT=181521b3
workflow_tool_set WorkflowAction WorkflowExecutor
EXIT_CODE=0

This confirms the new tool set and workflow action/executor are importable as SDK users would expect.

Test 2: Real dynamic workflow with sub-agent map/reduce

Step 1 — Reproduce / establish baseline (without the fix):
The base-branch import failure above establishes that this SDK user path did not exist before the PR.

Step 2 — Apply the PR's changes:
On 181521b3, I registered a minimal qa_workflow_worker sub-agent, created a parent SDK Conversation with WorkflowToolSet, and invoked WorkflowExecutor with generated-style workflow code using wf.map_agents, wf.flatten, and wf.reduce_agent.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_workflow_real.py:

Task 'task_00000002' completed.
Task 'task_00000001' completed.
Task 'task_00000003' completed.
STATUS completed
IS_ERROR False
TEXT_START
WORKFLOW_QA_alpha WORKFLOW_QA_beta WORKFLOW_QA_tail
TEXT_END

This confirms the workflow tool can orchestrate multiple real SDK sub-agent tasks, preserve map output, flatten intermediate values, run a reducer, and return a completed observation.

Test 3: Shipped standalone example

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout origin/main && test -f examples/01_standalone_sdk/52_dynamic_workflow.py:

BRANCH=HEAD COMMIT=e65a8abd
example_missing

This shows the advertised standalone example did not exist on the base branch.

Step 2 — Apply the PR's changes:
Checked out redo-dynamic-workflow-mvp at 181521b3.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python examples/01_standalone_sdk/52_dynamic_workflow.py > /tmp/qa_dynamic_workflow_example.out 2>&1 and inspected the final output:

## Concrete next tests to add first

1. Agent Server secrets exposure tests
2. Agent Server persistence store tests
3. SDK LLM serialization/tool-call tests
4. Workspace lifecycle/error tests
5. Tools task tracker and timeout tests
6. SDK plugin/condenser orchestration tests
7. Websocket and patch/parser tests

EXAMPLE_COST: 0.9111179999999999

This confirms the new example ran through the parent-agent-generated workflow and produced the intended repo-wide coverage audit report.

Test 4: Unsafe generated script handling

Step 1 — Reproduce / establish baseline (without the fix):
The base branch had no workflow tool to run this error path.

Step 2 — Apply the PR's changes:
On 181521b3, I invoked the workflow executor with a script whose main(wf) attempted open('/etc/passwd').

Step 3 — Re-run with the fix in place:
Ran a real WorkflowExecutor invocation with an SDK Conversation:

STATUS error
IS_ERROR True
TEXT Workflow scripts may not call `open`

This confirms the tool returns an error observation instead of executing a blocked file-read operation directly inside workflow code.

Issues Found

None.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

Copy link
Copy Markdown
Member Author

neubig commented May 28, 2026

Review feedback addressed in 70d18ca; all checks are green and review threads are resolved. Requesting another automated review.\n\n_This comment was created by an AI agent (OpenHands) on behalf of the user._

@neubig neubig requested a review from all-hands-bot May 28, 2026 23:22
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 28, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four issues from the previous review are cleanly resolved in this iteration — attach_parent(), the return_exceptions=True gather, the dead assert removal, and the TYPE_CHECKING block placement. The three-layer design (definition.py / impl.py / __init__.py) is easy to navigate, the AST validator + _safe_globals restriction is the right defence-in-depth approach for in-process code execution, and the test coverage is solid.

A few new observations below — one worth fixing before landing, two suggestions for your consideration.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands-tools/openhands/tools/workflow/impl.py Outdated
Comment thread openhands-tools/openhands/tools/workflow/impl.py Outdated
Comment thread openhands-tools/openhands/tools/workflow/impl.py Outdated
Comment thread tests/tools/workflow/test_workflow_tool.py Outdated
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the dynamic workflow feature end-to-end through the SDK, including parent workflow tool execution, sub-agent map/reduce orchestration, failure aggregation, and the new standalone example.

Does this PR achieve its stated goal?

Yes. On origin/main, the workflow tool was not importable; on the PR branch, it imported as workflow_tool_set, a parent SDK Conversation requested the workflow tool, the workflow spawned sub-agent map jobs plus a reducer through the task path, and the workflow observation reached the parent. I also verified map_agents aggregates multiple sub-agent failures instead of stopping after the first failure, and the new example script ran to completion with an EXAMPLE_COST line.

Phase Result
Environment Setup make build completed successfully
CI Status ✅ 33 checks passing; 1 QA Changes check pending for this run; 3 PR artifact jobs skipped
Functional Verification ✅ Dynamic workflow import, conversation execution, failure aggregation, and example run all succeeded
Functional Verification

Test 1: Feature availability before and after the PR

Step 1 — Establish baseline without the PR:
Ran git checkout --detach origin/main && uv run python /tmp/qa_dynamic_workflow.py import:

HEAD is now at e65a8abd Register claude-opus-4-8 in SDK model lists (#3424)
Traceback (most recent call last):
  File "/tmp/qa_dynamic_workflow.py", line 16, in <module>
    from openhands.tools.workflow import WorkflowToolSet
ImportError: cannot import name 'WorkflowToolSet' from 'openhands.tools.workflow' (unknown location)

This confirms the dynamic workflow tool is new behavior introduced by this PR.

Step 2 — Apply the PR changes:
Checked out redo-dynamic-workflow-mvp at 70d18ca12ef08f52c8de2457acf337f27eb9a59c.

Step 3 — Re-run with the PR:
Ran uv run python /tmp/qa_dynamic_workflow.py import:

workflow tool set import ok: workflow_tool_set

This confirms the new workflow tool set is available from the package entry point.

Test 2: Parent agent executes a dynamic workflow through the SDK tool path

Step 1 — Baseline:
The base branch import failure above shows there was no workflow entry point to execute through a parent agent before the PR.

Step 2 — Apply the PR changes:
Used the PR branch with a local OpenAI-compatible fake LLM server so the SDK could run a real Conversation deterministically without external model dependencies.

Step 3 — Run the workflow as a user-facing SDK conversation:
Ran uv run python /tmp/qa_dynamic_workflow.py conversation:

fake llm requests: 6
parent requested workflow tool: True
subagent/reducer prompts observed: [{"type": "text", "text": "audit sdk"}] | [{"type": "text", "text": "audit tools"}] | [{"type": "text", "text": "audit workspace"}] | [{"type": "text", "text": "combine findings

Input:
{
  "findings": [...]
workflow observation reached parent: True
workflow observation snippet: [{'type': 'text', 'text': 'SUBAGENT_RESULT for [{"type": "text", "text": "combine findings\n\nInput:...'}]

This shows the parent agent requested the workflow tool, the generated script fanned out three sub-agent prompts, ran a reducer prompt with serialized intermediate results, and returned the workflow observation back to the parent conversation.

Test 3: map_agents aggregates sibling failures

Step 1 — Baseline:
The base branch did not have the workflow API, so there was no map_agents failure behavior to exercise.

Step 2 — Apply the PR changes:
Ran the PR implementation with a recording task manager that returns errors for two items and success for one item.

Step 3 — Execute the failing workflow:
Ran uv run python /tmp/qa_dynamic_workflow.py failure-aggregation:

started prompts: audit alpha, audit beta, audit gamma
aggregated errors: simulated failure for audit beta | simulated failure for audit gamma

This confirms all sibling tasks were attempted and both failures were surfaced together, matching the PR goal that one failing item should not cancel sibling collection.

Test 4: New standalone example runs to completion

Step 1 — Baseline:
examples/01_standalone_sdk/52_dynamic_workflow.py does not exist on origin/main, so this is a new runnable example from the PR.

Step 2 — Apply the PR changes:
Ran the example on the PR branch with LLM_MODEL=openai/gpt-4o-mini, LLM_API_KEY=fake-key, and LLM_BASE_URL pointing to a local OpenAI-compatible fake server.

Step 3 — Run the actual example script:
Ran python /tmp/qa_run_dynamic_workflow_example.py:

example exit code: 0
fake llm requests: 5
Tool: workflow
Result:
Coverage note for [{"type": "text", "text": "write final coverage report

Input:..."}]
Message from Coverage Workflow Agent to User
Example completed after workflow observation.
EXAMPLE_COST: 3.75e-06

This confirms the added example can execute the workflow tool path and emit the expected example cost marker.

Issues Found

None.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig added the review-this This label triggers a PR review by OpenHands label May 28, 2026 — with OpenHands AI
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 28, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three-layer design (WorkflowAction/WorkflowObservationWorkflowContextWorkflowExecutor) is clean, the _TaskStarter Protocol enables genuine dependency injection without mocking, and the ExceptionGroup aggregation + attach_parent() fixes from earlier iterations are solid. The standalone example demonstrates the intended usage clearly. A few new findings below — the wf private-attribute exposure is the most substantive; the others are usability and robustness notes.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands-tools/openhands/tools/workflow/impl.py
Comment thread openhands-tools/openhands/tools/workflow/impl.py
Comment thread openhands-tools/openhands/tools/workflow/impl.py
openhands-agent and others added 3 commits May 28, 2026 23:39
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
@neubig
Copy link
Copy Markdown
Member Author

neubig commented May 29, 2026

@OpenHands /iterate

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented May 29, 2026

I'm on it! neubig can track my progress at all-hands.dev

Copy link
Copy Markdown
Member Author

neubig commented May 29, 2026

CI is all green (33/33 checks passed). All review findings from the previous review have been addressed:

  • return_exceptions=True added to asyncio.gather in map_agents
  • Dead assert removed, replaced with _render_required_template helper
  • ExceptionGroup unpacking added via _format_exception for richer error text
  • wf private attribute access blocked in AST validator (wf._ prefix now rejected)
  • TYPE_CHECKING block confirmed at top of definition.py
  • Standard exceptions (ValueError, TypeError, RuntimeError) added to _safe_globals
  • 0 unresolved review threads

Ready for another automated review.

This comment was created by an AI agent (OpenHands) on behalf of the user.

- add 'format' to _safe_globals so f-strings with format specifiers
  (e.g. f'{value:.2f}') work without NameError in workflow scripts
- use json.dumps for non-str items in _render_required_template so
  dict/list items are serialised consistently with reduce_agent's path

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Member Author

neubig commented May 29, 2026

Round 20 in 4949144:

Fixed:

  • Added 'format' to _safe_globals (f-strings with format specifiers like f'{value:.2f}' now work)
  • _render_required_template uses json.dumps for non-str items for consistency with reduce_agent

All threads resolved. CI green (35/35).

This comment was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig requested a review from all-hands-bot May 29, 2026 08:57
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Dynamic workflow behavior works end-to-end in the SDK: the PR adds an importable workflow tool, executes generated workflow scripts, runs real LLM-backed map/reduce sub-agents, and aggregates map failures.

Does this PR achieve its stated goal?

Yes. The stated goal was to add a dynamic Python workflow tool with WorkflowAction/WorkflowObservation, WorkflowContext helpers (run_agent, map_agents, reduce_agent, flatten), and failure aggregation for map_agents. I verified the base branch cannot import the workflow API, while the PR can execute a workflow through WorkflowExecutor; I also ran a real LLM-backed workflow that fanned out two sub-agents and reduced their results successfully.

Phase Result
Environment Setup make build completed successfully; dependencies installed via uv sync --dev.
CI Status ⏳ 20 checks passing, 9 pending, 3 skipped at review time.
Functional Verification ✅ Workflow smoke, real sub-agent map/reduce, and failure aggregation all behaved as claimed.
Functional Verification

Test 1: New workflow tool API is absent on base and usable on PR

Step 1 — Reproduce / establish baseline (without the fix):
Ran git checkout --detach origin/main && OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_workflow_smoke.py:

HEAD is now at 58cfdcef ci(agent-server): add stress-test job; fix sub-second bash event ordering (#3203)
Traceback (most recent call last):
  File "/tmp/qa_workflow_smoke.py", line 4, in <module>
    from openhands.tools.workflow import WorkflowAction, WorkflowExecutor, WorkflowToolSet
ImportError: cannot import name 'WorkflowAction' from 'openhands.tools.workflow' (unknown location)

This establishes the baseline: users could not import or run the dynamic workflow API from main.

Step 2 — Apply the PR's changes:
Checked out 494914489ff17469ccac5b89f7857c1bc4314cf8.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_workflow_smoke.py:

49491448
{'status': 'completed', 'is_error': False, 'text': "['sdk', 'tools', 'workspace', 'agent-server']"}

This shows the new workflow tool API imports successfully and a generated async def main(wf) script executes through the workflow executor, returning the expected wf.flatten(...) result.

Test 2: Real dynamic workflow fans out and reduces sub-agents

Step 1 — Reproduce / establish baseline (without the fix):
The same workflow API import failed on origin/main in Test 1, so a real sub-agent workflow cannot be started there.

Step 2 — Apply the PR's changes:
Used the PR commit and registered a minimal qa_dynamic_echo_3426 sub-agent. The workflow script called wf.map_agents(..., max_concurrency=2) for alpha/beta, then wf.reduce_agent(...).

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_workflow_real_subagent.py:

Task 'task_00000001' completed.
Task 'task_00000002' completed.
Task 'task_00000003' completed.
{'status': 'completed', 'is_error': False}
{'map': ['WORKFLOW-ALPHA', 'WORKFLOW-BETA'], 'reduce': 'WORKFLOW-REDUCED'}

This confirms the PR's primary user-facing workflow path works with real SDK conversations and LLM-backed sub-agents: fan-out tasks complete, ordered map results are returned, and the reducer sub-agent receives intermediate results.

Test 3: map_agents aggregates multiple sub-agent failures

Step 1 — Reproduce / establish baseline (without the fix):
On origin/main, the workflow API import failed as shown in Test 1, so this behavior did not exist for users.

Step 2 — Apply the PR's changes:
Used a controlled task starter through WorkflowContext so two mapped items returned errors and one returned success.

Step 3 — Re-run with the fix in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/qa_workflow_failure_aggregation.py:

exception_group: map_agents: one or more sub-agents failed
failure: [item 2] failed inspect bad
failure: [item 3] failed inspect worse
started: ['inspect good', 'inspect bad', 'inspect worse']
closed: True

This confirms one failed mapped item does not prevent sibling tasks from being started/collected, and the final exception includes both failures with item indexes.

Issues Found

None.

This review was generated by an AI agent (OpenHands) on behalf of the user.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is in excellent shape after multiple productive rounds. The three-layer design (WorkflowAction/WorkflowObservationWorkflowContextWorkflowExecutor), _TaskStarter Protocol for real dependency injection, ExceptionGroup failure aggregation in map_agents, and the honest MVP security disclaimer in _WORKFLOW_DESCRIPTION are all solid. Tests are comprehensive and now include real assertions (concurrency cap, post-close guard, executor success path).

The two latest changes are correct:

  • format added to _safe_globals: necessary, since f-strings with format specifiers (f"{n:.2f}") desugar to format() calls at the bytecode level.
  • json.dumps for non-str items in _render_required_template: aligns serialization with reduce_agent's _format_value path; one minor clarification suggested inline.

[RISK ASSESSMENT]

  • ⚠️ Risk: 🟡 MEDIUM — executes LLM-generated Python in-process with best-effort AST validation. Appropriate for an MVP with a clear security disclaimer.
  • All key guards are in place: imports blocked, dunder access blocked, restricted builtins, wf.close() and private wf._* attributes blocked, _UNSAFE_CALLS deny-list.

VERDICT: ✅ Worth merging.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands-tools/openhands/tools/workflow/impl.py Outdated
Comment thread openhands-tools/openhands/tools/workflow/impl.py
- extend json.dumps comment to mention scalars (booleans → true/false,
  None → null) explicitly, matching the reviewer suggestion

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Member Author

neubig commented May 29, 2026

Round 21 in a60cdbe:

Fixed:

  • Extended json.dumps comment to mention scalars (booleans → true/false, None → null) per suggestion

Acknowledged (no change):

  • format() safety note: reasoning is sound, no action needed

All threads resolved. CI green (QA PASS ✅ at 09:03:15Z, code review ✅ VERDICT: Worth merging at 09:04:01Z).

This comment was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig requested a review from all-hands-bot May 29, 2026 09:06
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the new dynamic workflow tool through real SDK usage: agent tool invocation, real sub-agent map/reduce orchestration, flattening, and aggregated map failure reporting all worked.

Does this PR achieve its stated goal?

Yes. The PR set out to add WorkflowToolSet/workflow actions plus a WorkflowContext supporting run_agent, map_agents, reduce_agent, and flatten, while preserving sibling failure collection in map_agents. I exercised those paths as an SDK user: base main cannot import the workflow tool, the PR branch lets a parent agent invoke WorkflowToolSet, a workflow successfully fanned out to real registered sub-agents and reduced their outputs, and a failing map returned all three item-indexed failures instead of only the first.

Phase Result
Environment Setup make build completed and installed the uv-managed workspace dependencies.
CI Status ⏳ GitHub checks at review time: 21 passing, 8 pending, 3 skipped; no failures observed.
Functional Verification ✅ Real SDK executions completed for tool invocation, map/reduce/flatten, and aggregated failure reporting.
Functional Verification

Test 1: Before/after availability and parent-agent tool invocation

Step 1 — Establish baseline without the PR:
Created an isolated origin/main worktree and synced dependencies, then ran:

cd /tmp/openhands-sdk-base-main && uv sync --dev >/tmp/base_uv_sync.log 2>&1
OPENHANDS_SUPPRESS_BANNER=1 .venv/bin/python - <<'PY'
try:
    from openhands.tools.workflow import WorkflowToolSet
    print('BASE_IMPORT_OK', WorkflowToolSet)
except Exception as exc:
    print(type(exc).__name__ + ': ' + str(exc))
PY

Output excerpt:

ModuleNotFoundError: No module named 'openhands.tools.workflow'

This establishes the before state: the workflow tool is not available on main.

Step 2 — Apply the PR's changes:
Used the checked-out PR head a60cdbe8d589a22212da2d01bc5832a771bb1d90 on branch redo-dynamic-workflow-mvp after make build.

Step 3 — Re-run with the PR behavior:
Ran a real Conversation whose parent agent only had Tool(name=WorkflowToolSet.name), then asked it to call the workflow tool with a simple flattening script.

Output excerpt:

Tools Available: 3
  - workflow: Run a dynamic workflow written as Python orchestration code....

Action: WorkflowAction
Arguments:
  name: "qa_agent_flatten"
  script:
    async def main(wf):
        return wf.flatten([["AGENT_TOOL_ALPHA"], "AGENT_TOOL_BETA", ["AGENT_TOOL_GAMMA"]])

Observation
Tool: workflow
Result:
['AGENT_TOOL_ALPHA', 'AGENT_TOOL_BETA', 'AGENT_TOOL_GAMMA']

Message from Agent
['AGENT_TOOL_ALPHA', 'AGENT_TOOL_BETA', 'AGENT_TOOL_GAMMA']

QA_AGENT_TOOL_COMPLETED
EVENT_COUNT: 5

This shows the workflow tool is registered, visible to the agent, invoked as a real tool call, and returns the workflow result to the conversation.

Test 2: Dynamic workflow map/reduce with real registered sub-agents

Step 1 — Baseline:
The same isolated origin/main import check above fails with ModuleNotFoundError, so this SDK flow cannot be run before the PR.

Step 2 — Apply the PR's changes:
On the PR branch, registered a real qa_echo sub-agent via register_agent_if_absent, created a parent Conversation, and invoked WorkflowExecutor with a script that used wf.map_agents, wf.flatten, and wf.reduce_agent.

Step 3 — Run with the PR behavior:
Command excerpt:

OPENHANDS_SUPPRESS_BANNER=1 uv run python - <<'PY'
# creates LLM, registers qa_echo, creates Conversation(... WorkflowToolSet ...), then:
obs = WorkflowExecutor()(WorkflowAction(name='qa-real-map-reduce', script=script, max_concurrency=2), conversation=conversation)
print(json.dumps({'status': obs.status, 'is_error': obs.is_error, 'text': obs.text}, indent=2))
PY

Output excerpt:

{
  "status": "completed",
  "is_error": false,
  "text": "QA_REDUCED QA_MAP:alpha QA_MAP:beta QA_TAIL"
}

This confirms the new workflow context can fan out to sibling sub-agent tasks, collect their outputs in order, flatten intermediate values, and pass the serialized results to a reducer sub-agent.

Test 3: Aggregated map_agents failure reporting

Step 1 — Baseline:
Again, origin/main lacks openhands.tools.workflow, so there is no workflow failure aggregation behavior to exercise before the PR.

Step 2 — Apply the PR's changes:
On the PR branch, invoked wf.map_agents with three items and an intentionally unregistered subagent_type so each item would fail through the real TaskManager path.

Step 3 — Run with the PR behavior:
Output excerpt:

{
  "status": "error",
  "is_error": true,
  "text": "map_agents: one or more sub-agents failed:\n  [1] [item 1] Unknown agent 'qa_missing_subagent'. Available types: none registered. Use register_agent() to add custom agent types.\n  [2] [item 2] Unknown agent 'qa_missing_subagent'. Available types: none registered. Use register_agent() to add custom agent types.\n  [3] [item 3] Unknown agent 'qa_missing_subagent'. Available types: none registered. Use register_agent() to add custom agent types."
}

This confirms one failing item does not hide sibling failures: all three item-indexed errors were returned in the workflow observation.

Issues Found

None.

This QA review was created by an AI agent (OpenHands) on behalf of the user.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable — solid MVP through multiple productive iterations; all 30 prior review threads are resolved.

The three-layer design (WorkflowAction/WorkflowObservationWorkflowContextWorkflowExecutor) is clean, ExceptionGroup failure aggregation is correct, _TaskStarter Protocol enables genuine dependency injection in tests, the str.replace template rendering closes the format-string attribute-traversal bypass, and the tool description is commendably honest about the in-process execution risk. The test suite is focused and exercises real code paths.

Two minor observations remain that were not flagged in previous cycles:


[IMPROVEMENT OPPORTUNITIES]

type builtin absent from _safe_globals

_safe_globals() includes isinstance but not type. A workflow script that checks type(result).__name__ or branches on type(x) will fail at runtime with NameError: name "type" is not defined — a confusing error for an otherwise valid script. The resolved threads for this noted both the addition of isinstance and the 3-argument type(name, bases, dict) class-creation risk. If the decision is to keep type out, the tool description and/or a dedicated error at validation time should note it so workflow authors are not surprised.

max_concurrency default of 8

The schema default of 8 simultaneous sub-agents is generous — particularly at the context level, where it acts as the hard ceiling for the entire workflow run. The canonical example (52_dynamic_workflow.py) uses max_concurrency=2, which is more API-friendly. Either lower the default or add a note to the field description (e.g. "consider 2-4 for LLM-heavy workflows to avoid rate limits") so users do not inadvertently fire off 8 parallel LLM sub-agents on every run.


[RISK ASSESSMENT]

  • ⚠️ Risk: 🟡 MEDIUM — executes LLM-generated Python in-process after best-effort AST validation. The tool description is explicit about this; the risk is well-understood and appropriate for an opt-in MVP.
  • No breaking changes; all new public symbols are additive.
  • All 30 prior review threads resolved; the two observations above are minor and non-blocking.

VERDICT: ✅ Worth merging. The two observations are improvements for a follow-up, not blockers.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better by adding a .agents/skills/custom-codereview-guide.md file with the /codereview trigger and context the reviewer is missing.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands-tools/openhands/tools/workflow/impl.py
Comment thread openhands-tools/openhands/tools/workflow/definition.py
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable — solid MVP through multiple productive iterations; all 30 prior review threads are resolved.

The three-layer design (WorkflowAction/WorkflowObservationWorkflowContextWorkflowExecutor) is clean, ExceptionGroup failure aggregation is correct, _TaskStarter Protocol enables genuine dependency injection in tests, the str.replace template rendering closes the format-string attribute-traversal bypass, and the tool description is commendably honest about the in-process execution risk. The test suite is focused and exercises real code paths.

Two minor observations remain that were not flagged in previous cycles:


[IMPROVEMENT OPPORTUNITIES]

type builtin absent from _safe_globals

_safe_globals() includes isinstance but not type. A workflow script that checks type(result).__name__ or branches on type(x) will fail at runtime with NameError: name "type" is not defined — a confusing error for an otherwise valid script. The resolved threads for this noted both the addition of isinstance and the 3-argument type(name, bases, dict) class-creation risk. If the decision is to keep type out, the tool description and/or the validator should note it so workflow authors are not surprised by the NameError.

max_concurrency default of 8

The schema default of 8 simultaneous sub-agents is generous — particularly at the context level, where it acts as the hard ceiling for the entire workflow run. The canonical example (52_dynamic_workflow.py) uses max_concurrency=2, which is more API-friendly. Either lower the default or add a note to the field description (e.g. "consider 2–4 for LLM-heavy workflows to avoid rate limits") so users do not inadvertently fire off 8 parallel LLM sub-agents on every run.


[RISK ASSESSMENT]

  • ⚠️ Risk: 🟡 MEDIUM — executes LLM-generated Python in-process after best-effort AST validation. The tool description is explicit about this; the risk is well-understood and appropriate for an opt-in MVP.
  • No breaking changes; all new public symbols are additive.
  • All 30 prior review threads resolved; the two observations above are minor and non-blocking.

VERDICT: ✅ Worth merging. The two observations are improvements for a follow-up, not blockers.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better by adding a .agents/skills/custom-codereview-guide.md file with the /codereview trigger and context the reviewer is missing.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands-tools/openhands/tools/workflow/impl.py
Comment thread openhands-tools/openhands/tools/workflow/definition.py
- re-add 'type' to _safe_globals with comment explaining 3-arg safety
  (method bodies execute in restricted globals; 1-arg introspection is
  a valid use case that previously caused confusing NameErrors)
- add API rate-limit note to max_concurrency field description
  (consider 2-4 for LLM-heavy workflows)

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Member Author

neubig commented May 29, 2026

Round 22 in eb267fa:

Fixed:

  • Re-added 'type' to _safe_globals (1-arg introspection was causing confusing NameErrors; 3-arg form is safe given existing guards)
  • Added API rate-limit note to max_concurrency field description

All 4 threads resolved.

This comment was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig requested a review from all-hands-bot May 29, 2026 09:20
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the new dynamic workflow tool through SDK imports, direct workflow execution, map/reduce orchestration, failure aggregation, unsafe-script rejection, and a real agent conversation that invoked the workflow tool.

Does this PR achieve its stated goal?

Yes. On main, openhands.tools.workflow is not importable; on this PR, WorkflowToolSet imports and creates a WorkflowAction tool. Running the software showed a workflow script executing successfully, map_agents preserving all item calls while respecting a concurrency cap, reduce_agent receiving serialized intermediate results, multiple map failures being aggregated, unsafe script access being rejected, and an actual Conversation using WorkflowToolSet to produce ['QA_DYNAMIC_WORKFLOW_OK', 'flattened'].

Phase Result
Environment Setup make build completed successfully and installed the uv environment.
CI Status 🟡 At refresh time: 21 success, 8 in progress, 3 skipped, 0 failures.
Functional Verification ✅ Dynamic workflow behavior worked in direct SDK calls and in an agent-driven conversation.
Functional Verification

Test 1: New workflow package is available only after the PR

Step 1 — Establish baseline without the PR:
Ran git checkout --detach origin/main && uv run python /tmp/check_workflow_import.py:

WORKFLOW_IMPORT_RESULT=ModuleNotFoundError: No module named 'openhands.tools.workflow'

This shows the base branch did not expose the dynamic workflow tool package.

Step 2 — Apply the PR's changes:
Checked out eb267fa83e0194e7592e20729d263ace4761ac2a.

Step 3 — Re-run with the PR in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/check_workflow_import.py:

WORKFLOW_IMPORT_RESULT=success: WorkflowToolSet

This confirms the PR adds the user-facing openhands.tools.workflow.WorkflowToolSet import path.

Test 2: Direct SDK workflow execution, map/reduce, failure aggregation, and validation

Step 1 — Baseline:
The direct workflow APIs are absent on main as shown in Test 1, so these operations cannot be performed there.

Step 2 — Apply the PR's changes:
Used the PR checkout and ran a short SDK script that creates WorkflowToolSet, invokes WorkflowExecutor, runs map_agents + reduce_agent through WorkflowContext, and then intentionally triggers two map item failures.

Step 3 — Run with the PR in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/workflow_qa_exercise.py:

TOOLSET_CREATE 1 WorkflowAction
EXECUTOR_SUCCESS completed False ['sdk', 'tools', 'workspace', 'agent-server']
EXECUTOR_UNSAFE error True Workflow scripts may not call `open`
MAP_REDUCE_RESULT {"mapped_count": 4, "reduced_prefix": "result<coverage_auditor>:summarize coverage findings

I"}
MAP_REDUCE_PEAK_ACTIVE 2
MAP_REDUCE_CLOSED True
FAILURE_GROUP map_agents: one or more sub-agents failed (2 sub-exceptions)
FAILURE_ITEMS ["[item 2] boom for process fail:one", "[item 3] boom for process fail:two"]
FAILURE_CALL_COUNT 3

This confirms the tool can execute a Python workflow script, flatten returns the expected one-level flattening, unsafe direct file reads are rejected, map_agents completed all four mapped calls while capping peak concurrency at 2, reduce_agent ran afterward with the mapped results, and two failing map items were reported together rather than one failure cancelling sibling collection.

Test 3: Real agent conversation uses WorkflowToolSet

Step 1 — Baseline:
On main, the workflow tool cannot be imported, so an agent cannot be configured with Tool(name=WorkflowToolSet.name).

Step 2 — Apply the PR's changes:
Configured an Agent with tools=[Tool(name=WorkflowToolSet.name)], sent a real message through Conversation, and asked it to call the workflow tool with a simple async def main(wf) script.

Step 3 — Run with the PR in place:
Ran OPENHANDS_SUPPRESS_BANNER=1 uv run python /tmp/workflow_agent_demo.py:

Action: WorkflowAction
Observation:
Tool: workflow
Result:
['QA_DYNAMIC_WORKFLOW_OK', 'flattened']

DEMO_COMPLETED True
EVENT_COUNT 5
EVENT ObservationEvent ObservationEvent (environment)   Tool: workflow   Result: ['QA_DYNAMIC_WORKFLOW_OK', 'flattened']
EVENT MessageEvent MessageEvent (agent)   assistant: ['QA_DYNAMIC_WORKFLOW_OK', 'flattened']

This confirms the new tool is usable from a real SDK agent loop: the model selected WorkflowAction, the workflow executor returned the expected observation, and the agent replied with the workflow result.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a well-designed, well-iterated MVP — the two observations from the previous review cycle (type re-added to _safe_globals with rationale, and the max_concurrency description updated to recommend 2–4 for LLM-heavy workflows) are cleanly addressed in the latest commits.

The three-layer design (WorkflowAction/WorkflowObservationWorkflowContextWorkflowExecutor), _TaskStarter Protocol for real dependency injection, ExceptionGroup failure aggregation with sibling-collection, str.replace template rendering (correctly avoiding Python's format-string attribute traversal bypass), and the honest "best-effort" sandbox disclaimer in _WORKFLOW_DESCRIPTION are all solid. Tests cover the key paths.

Two minor observations below; neither is blocking.


[IMPROVEMENT OPPORTUNITIES]

Test ordering assumption in test_execute_workflow_script_runs_map_and_reduce — see inline comment

type safety comment could be more precise — see inline comment


[RISK ASSESSMENT]

  • Risk: MEDIUM — executes LLM-generated Python in-process after best-effort AST validation. Explicitly documented; appropriate for an opt-in MVP.
  • No breaking changes; all new symbols are additive and exported cleanly.
  • All prior review threads resolved.

VERDICT: Ready to merge. The two observations above are minor improvements for a follow-up.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread tests/tools/workflow/test_workflow_tool.py Outdated
Comment thread openhands-tools/openhands/tools/workflow/impl.py
- use set assertion for map phase prompts in test_run_map_and_reduce
  (asyncio.to_thread dispatch order is non-deterministic)
- expand 'type' safety comment to mention both safeguards: restricted
  globals AND AST validator blocking __dunder__ attribute access

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Member Author

neubig commented May 29, 2026

Round 23 in 5bc40d2:

Fixed:

  • Set-based assertion for map phase in test_run_map_and_reduce (thread dispatch order is non-deterministic)
  • Expanded type() comment to mention both safeguards (restricted globals + AST validator blocks dunder)

All threads resolved. CI green (35/35). Review VERDICT: Ready to merge.

This comment was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig requested a review from all-hands-bot May 29, 2026 09:36
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a well-structured, thoughtfully documented MVP for dynamic workflow orchestration. The code is clean, the tests are comprehensive, and the security model is honest about its limitations. A few issues worth addressing before merge — the most actionable ones are the wf-alias sandbox bypass and the silent no-op in attach_parent.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread openhands-tools/openhands/tools/workflow/impl.py
Comment thread openhands-tools/openhands/tools/workflow/impl.py
Comment thread openhands-tools/openhands/tools/task/manager.py
Comment thread openhands-tools/openhands/tools/workflow/impl.py Outdated
Comment thread openhands-tools/openhands/tools/workflow/impl.py
- add docstring note to validate_workflow_script acknowledging the
  wf-alias bypass limitation (documentation gap, not security gap)
- update type() comment to accurately note methods injected via wf
  are not re-sandboxed, though they only expose public wf API
- add warning log in attach_parent when called with a different
  conversation (surfaces potential programming errors)
- thread 2 (run_one re-wrap): kept as-is; the [item N] prefix is
  valuable context for debugging ExceptionGroup failures

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Member Author

neubig commented May 29, 2026

Round 24 in 8a484b7:

Fixed:

  • Added docstring note to validate_workflow_script (wf-alias bypass is documented gap, not security gap)
  • Added warning log in attach_parent when called with a different conversation
  • Updated type() comment to accurately note injected-object limitation

Kept as-is:

  • run_one re-wrap: [item N] prefix provides valuable context; from-exc preserves original type
  • asyncio.timeout: target-version = py313, Python 3.13+, asyncio.timeout is available

All 5 threads resolved.

This comment was created by an AI agent (OpenHands) on behalf of the user.

@neubig neubig requested a review from all-hands-bot May 29, 2026 09:48
Copy link
Copy Markdown
Collaborator

all-hands-bot commented May 29, 2026

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has gone through extensive iterative refinement and is in very good shape. The three-layer design (WorkflowAction/WorkflowObservationWorkflowContextWorkflowExecutor), the _TaskStarter Protocol for real dependency injection (no mocks needed in most tests), the ExceptionGroup failure aggregation with 1-based item indexing, asyncio.timeout for hung-workflow protection, idempotent close(), the str.replace prompt-injection fix, and the per-call max_concurrency cap are all cleanly implemented and tested.

Status of open threads:

  • Thread 3322692914 (add test for per-call max_concurrency cap): test_map_agents_respects_context_concurrency_cap at line 321 of the test file addresses this — it verifies that a per-call max_concurrency=1000 is silently capped at the context's limit of 3 and confirms peak concurrency never exceeds 3. ✅

  • Thread 3322692922 (document the two register_tool calls): WorkflowTool's docstring already says "Prefer WorkflowToolSet for standard SDK auto-create usage. Use WorkflowTool when you need to inject a custom executor (e.g., in tests or extensions)" which documents the intent. A short comment above the two register_tool lines would close the suggestion explicitly (see inline comment below).

Note on the _call_name false-positive (review 4386764961): The latest bot review flagged this as still unresolved, but the fix (isinstance(node.func, ast.Name) guard) has actually been in place since commit cd0aeb1. Attribute method calls like result.open() are already correctly allowed. No action needed.

[RISK ASSESSMENT]

  • ⚠️ Risk: 🟡 MEDIUM — executes LLM-generated Python in-process with best-effort AST validation. This is explicitly and clearly documented.
  • No breaking changes; all new public symbols are additive.

VERDICT: ✅ Ready to merge. Two minor items below — one is a missing test for a documented behaviour guarantee, the other is a one-line comment.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation


assert (
execute_workflow_script(script, _context(_FakeTaskManager())) == "recoverable"
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: add a test for except Exception catching ExceptionGroup from within a workflow script

The tool description makes a concrete promise: "A plain except Exception will still catch the entire group." This is an important usability guarantee that isn't tested yet. test_workflow_script_can_catch_common_exceptions only covers ValueError. Consider adding:

def test_workflow_script_can_catch_exception_group_with_plain_except() -> None:
    class _ErrorManager(_FakeTaskManager):
        def start_task(self, prompt, subagent_type="default", resume=None, description=None, conversation=None):
            self.prompts.append(prompt)
            return _FakeTask(error="sub-agent failed")

    script = """
async def main(wf):
    try:
        return await wf.map_agents(
            items=["a", "b"],
            prompt="inspect {item}",
        )
    except Exception as exc:
        return f"caught: {exc}"
"""
    result = execute_workflow_script(script, _context(_ErrorManager()))
    assert result.startswith("caught:")
    assert "map_agents" in result

This confirms the documented except Exception escape hatch works end-to-end from within the sandbox.

return WorkflowTool.create(executor=WorkflowExecutor())


register_tool(WorkflowToolSet.name, WorkflowToolSet)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: add a brief comment above the two register_tool lines (addresses thread 3322692922)

Both names are registered with distinct use cases, but a reader landing at the bottom of the file without reading the class docstrings won't see why. A one-line comment would close the suggestion from that thread:

Suggested change
register_tool(WorkflowToolSet.name, WorkflowToolSet)
# WorkflowToolSet: standard SDK auto-create path; WorkflowTool: for explicit executor injection.
register_tool(WorkflowToolSet.name, WorkflowToolSet)

Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ QA Report: PASS

Verified the dynamic workflow feature end-to-end: the new module is absent on main, imports on the PR, executes workflow scripts, delegates to real registered sub-agents through TaskManager, aggregates map failures, and the new standalone example completed successfully.

Does this PR achieve its stated goal?

Yes. The PR set out to add a WorkflowToolSet/workflow action-observation path, provide WorkflowContext helpers (run_agent, map_agents, reduce_agent, flatten), aggregate map_agents failures, and ship a runnable SDK example. I exercised those surfaces as a user would: direct SDK imports/calls, real LLM-backed sub-agent delegation through the workflow executor, and examples/01_standalone_sdk/52_dynamic_workflow.py, which generated and ran a coverage-audit workflow and printed EXAMPLE_COST: 1.118247.

Phase Result
Environment Setup make build completed and installed the uv-managed dev environment.
CI Status 🟡 Many checks passing; tools-tests, qa-changes, and several build/push jobs were still in progress when checked; no failures observed.
Functional Verification ✅ Core workflow tool behavior and the standalone example both ran successfully.
Functional Verification

Test 1: Baseline vs PR import behavior

Step 1 — Establish baseline without the feature:
Checked out origin/main and ran uv run python /tmp/qa_dynamic_workflow_import.py:

IMPORT_ERROR:ModuleNotFoundError:No module named 'openhands.tools.workflow'

This confirms the base branch does not expose the dynamic workflow module.

Step 2 — Apply the PR's changes:
Switched back to redo-dynamic-workflow-mvp at 8a484b75764fbc47af7577b36c68cd77e8c92498.

Step 3 — Re-run with the PR:
Ran uv run python /tmp/qa_dynamic_workflow_import.py:

IMPORT_OK:WorkflowToolSet:WorkflowAction:WorkflowExecutor

This confirms the new public import surface is available on the PR branch.

Test 2: Workflow executor and context helpers

Step 1 — Baseline:
On main, the import above fails, so this workflow tool entry point cannot be exercised there.

Step 2 — Apply the PR's changes:
Used the PR branch and called WorkflowExecutor with a generated script containing async def main(wf).

Step 3 — Run with the PR:
Ran uv run python /tmp/qa_dynamic_workflow_basic.py:

STATUS:completed
IS_ERROR:False
TEXT:alpha|beta|gamma|delta

Then ran uv run python /tmp/qa_dynamic_workflow_manager.py:

SUCCESS_RESULT:done:reduce

Input:
{
  "one": "done:single",
  "mapped": [
    "done:map a",
    "done:map b"
  ]
}
SUCCESS_PROMPTS:['worker:single', 'mapper:map a', 'mapper:map b', 'reducer:reduce

Input:
{
  "one": "done:single",
  "mapped": [
    "done:map a",
    "done:map b"
  ]
}']
CLOSED:True
FAILURE_TYPE:ExceptionGroup
FAILURE_TEXT:map_agents: one or more sub-agents failed (2 sub-exceptions)
FAILURE_ITEMS:[item 2] boom:item FAIL-1|[item 3] boom:item FAIL-2

This confirms flatten, run_agent, map_agents, reduce_agent, manager cleanup, and aggregated map failures all behave as described.

Test 3: Real registered sub-agents via TaskManager

Step 1 — Baseline:
The workflow module is absent on main, so real workflow delegation is unavailable there.

Step 2 — Apply the PR's changes:
Registered a minimal qa_echo sub-agent and invoked WorkflowExecutor without injecting a custom manager, so the workflow used the real TaskManager path.

Step 3 — Run with the PR:
Ran uv run python /tmp/qa_dynamic_workflow_real_subagents.py:

Task 'task_00000001' completed.
Task 'task_00000002' completed.
Task 'task_00000003' completed.
STATUS:completed
IS_ERROR:False
TEXT:RED, BLUE

This demonstrates the workflow can fan out to registered sub-agents and reduce their outputs through real SDK conversation/task machinery.

Test 4: Validation error path and standalone example

Validation path:
Ran uv run python /tmp/qa_dynamic_workflow_error.py:

STATUS:error
IS_ERROR:True
TEXT:Workflow scripts may not access private wf attributes or call wf.close()

This confirms unsafe/unsupported workflow calls surface as an error observation rather than silently running.

Standalone example:
Ran uv run python examples/01_standalone_sdk/52_dynamic_workflow.py:

Summary: Return synthesized repository coverage audit report
...
Finish with message:
## Repo-wide test coverage report
...
Tokens: ↑ input 185.62K • cache hit 58.48% •  reasoning 4.85K • ↓ output 22.62K • $ 1.1182

EXAMPLE_COST: 1.118247

This confirms the new example runs end-to-end: the parent agent generated workflow code, the workflow executed sub-agent coverage audits, the reducer returned a repo-wide report, and the required EXAMPLE_COST line was printed.

Issues Found

None.

This review was created by an AI agent (OpenHands) on behalf of the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-this This label triggers a PR review by OpenHands

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add minimal Python dynamic workflow tool for subagent orchestration

3 participants