Add example automation regression tests (E2E feature coverage) #94

@xingyaoww

Summary

Add "example automation" regression tests — a suite of end-to-end tests that exercise specific automation features (MCP servers, secrets, skills, plugins, repo cloning, event triggers) by running real automations against a live deployment. This mirrors the example tests pattern in the SDK repo and extends the existing integration tests in the deploy repo.

Motivation

Issue #93 (MCP servers not working in automation conversations) revealed a gap in test coverage: the existing integration tests verify the automation API surface (CRUD, dispatch lifecycle, timeouts) but don't verify that specific SDK features actually work inside automation sandboxes. The MCP bug went undetected because no test exercised workspace.get_mcp_config() in an actual dispatched automation.

Currently tested (in deploy/automation/integration/):

  • ✅ CRUD API lifecycle (test_automation_api.py)
  • ✅ Basic dispatch lifecycle — sandbox starts, runs, completes (test_e2e_dispatch.py)
  • ✅ Timeout behavior (test_e2e_timeout.py)
  • ✅ Preset prompt creation + dispatch (test_preset_prompt_api.py)
  • ✅ Upload API (test_upload_api.py)

Not tested (features that could silently break):

  • ❌ MCP server configuration is available and functional in automation conversations
  • ❌ Secrets are actually injectable and resolvable (current test calls get_secrets() but doesn't verify the agent can use them)
  • ❌ Skills loading works correctly (public, user, project, org skills)
  • ❌ Plugin preset E2E lifecycle
  • ❌ Event-triggered automation E2E (webhook → dispatch → completion)
  • ❌ Repository cloning works inside automations
  • ❌ Repo-level skills (AGENTS.md) are loaded from cloned repos

Proposed Approach

Inspiration: SDK Example Tests

The SDK repo has a well-established pattern (tests/examples/test_examples.py):

  1. Example scripts in examples/ each exercise one SDK feature
  2. Test runner discovers and runs each script as a subprocess
  3. Success criteria: exit code 0 + EXAMPLE_COST: marker in stdout
  4. Results: per-example JSON files → markdown report
  5. CI: nightly schedule + test-examples label on PRs + manual dispatch
  6. Reporting: results posted as PR/issue comments
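For reference, the two success criteria in step 3 can be sketched as below. This is an illustrative reduction, not the SDK repo's actual runner: `run_example`, `demo`, and the throwaway scripts are hypothetical names, and only the `EXAMPLE_COST:` marker comes from the description above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

COST_MARKER = "EXAMPLE_COST:"  # marker named in the issue text


def run_example(script: Path, timeout_s: float = 600.0) -> dict:
    """Run one example script as a subprocess and apply both success criteria."""
    proc = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on overrun
    )
    has_marker = COST_MARKER in proc.stdout
    return {
        "name": script.stem,
        "exit_code": proc.returncode,
        "has_marker": has_marker,
        # Passing needs a clean exit AND the marker in stdout.
        "passed": proc.returncode == 0 and has_marker,
    }


def demo() -> list[dict]:
    """Exercise the check on two throwaway scripts (hypothetical examples)."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "good.py"
        good.write_text("print('EXAMPLE_COST: 0.01')\n")
        silent = Path(tmp) / "silent.py"
        silent.write_text("print('done')\n")
        return [run_example(good), run_example(silent)]
```

The second script exits 0 but never prints the marker, so it is reported as failed. The same distinction is what the automation scenarios would need: a run that completes without doing the work should not count as a pass.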

Proposed: Example Automations

Apply the same philosophy to automations, building on the existing test infrastructure in deploy/automation/integration/:

1. Define Example Automation Scenarios

Each scenario is a prompt (or tarball) that exercises a specific feature and produces a verifiable outcome:

| Scenario | What It Tests | Verification |
| --- | --- | --- |
| mcp_config_available | MCP servers from user settings are accessible | Automation prints list of configured MCP server names |
| secrets_injectable | Secrets are injected and resolvable | Automation reads a known secret and prints a hash/partial value |
| skills_loaded | Skills are loaded from agent server | Automation prints loaded skill count > 0 |
| repo_clone_and_skills | Repo cloning + project skills from cloned repo | Clone a test repo, verify AGENTS.md skills loaded |
| plugin_preset_e2e | Plugin preset lifecycle | Create via /v1/preset/plugin, dispatch, verify COMPLETED |
| prompt_with_repos | Prompt preset with repo cloning | Create prompt automation with repos field, dispatch, verify repo is cloned |
| event_trigger_e2e | Custom webhook → dispatch | Register webhook, send test event, verify run created and completes |
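Several of these verifications come down to finding a `MARKER: <count>` line in the run output. A minimal sketch of that check follows, assuming the run output is available to the test as a plain string (how it is retrieved from the automation service is not specified here); `extract_marker_count` is a hypothetical helper name.

```python
import re
from typing import Optional


def extract_marker_count(output: str, marker: str) -> Optional[int]:
    """Return the integer following `marker` on its own line, or None.

    Requiring digits on a line that starts with the marker avoids a false
    positive when the output merely echoes the prompt template
    (e.g. "Print 'MCP_SERVERS: <count>'").
    """
    pattern = re.compile(
        rf"^\s*{re.escape(marker)}\s*(\d+)\s*$",
        re.MULTILINE,
    )
    match = pattern.search(output)
    return int(match.group(1)) if match else None
```

A plain substring check would pass whenever the prompt itself appears in the transcript, so matching an actual number is the safer criterion if the captured output includes the conversation.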

2. Test Harness

Extend the existing parametrized test pattern in deploy/automation/integration/:

# test_e2e_example_automations.py
import pytest

EXAMPLE_AUTOMATIONS = [
    {
        "name": "mcp-config-available",
        "prompt": "List all configured MCP servers by calling workspace.get_mcp_config(). "
                  "Print 'MCP_SERVERS: <count>' where count is the number of servers found. "
                  "If no servers are configured, print 'MCP_SERVERS: 0'.",
        "success_marker": "MCP_SERVERS:",
    },
    {
        "name": "secrets-injectable", 
        "prompt": "List the names of all available secrets. "
                  "Print 'SECRETS_AVAILABLE: <count>' with the count.",
        "success_marker": "SECRETS_AVAILABLE:",
    },
    # ... more scenarios
]

@pytest.mark.parametrize("scenario", EXAMPLE_AUTOMATIONS, ids=lambda s: s["name"])
class TestExampleAutomations:
    """Parametrized E2E tests that create, dispatch, and verify example automations."""
    
    def test_example_automation_completes(self, scenario, client, base_url, auth_headers):
        # 1. Create automation from prompt
        # 2. Dispatch
        # 3. Poll until COMPLETED (or FAILED)
        # 4. Verify success (run status == COMPLETED)
        # 5. Cleanup
        ...
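Step 3 (poll until COMPLETED or FAILED) could look roughly like the helper below. It is a sketch only: `poll_until_terminal` is a hypothetical name, and the status lookup is passed in as a callable because the exact run-status endpoint and response shape are not assumed here. The existing integration tests may already have an equivalent helper that should be reused instead.

```python
import time
from typing import Callable

# Terminal statuses named in the issue; any other value is treated as in progress.
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def poll_until_terminal(
    fetch_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call `fetch_status` until it returns a terminal status or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"run still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Inside the test, `fetch_status` would wrap whatever call returns the run's current status, and the assertion would be `poll_until_terminal(...) == "COMPLETED"`. Injecting `sleep` and `clock` keeps the helper unit-testable without waiting in real time.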

3. CI Integration

Add to the existing automation-integration-tests.yaml workflow in the deploy repo, or create a separate workflow:

  • Trigger: after deploy to staging succeeds, nightly schedule, manual dispatch, test-examples label on automation repo PRs
  • Gate releases: block new automation service versions from deploying to production until all example automations pass on staging
  • Reporting: post results to a tracking issue (as the SDK repo does with issue #976) or as PR comments
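The gating and reporting bullets could share one small script that reads per-scenario result files, renders a markdown summary, and returns a non-zero exit code when anything failed. The sketch below assumes a hypothetical result schema (`{"name": ..., "passed": ...}`, one JSON file per scenario); the real schema and the workflow wiring are left open.

```python
import json
from pathlib import Path


def load_results(results_dir: Path) -> list[dict]:
    """Read every per-scenario JSON file, sorted for a stable report order."""
    return [json.loads(p.read_text()) for p in sorted(results_dir.glob("*.json"))]


def render_report(results: list[dict]) -> str:
    """Render a markdown table suitable for a PR or tracking-issue comment."""
    lines = ["| Scenario | Status |", "| --- | --- |"]
    for r in results:
        lines.append(f"| {r['name']} | {'PASS' if r['passed'] else 'FAIL'} |")
    return "\n".join(lines)


def gate_exit_code(results: list[dict]) -> int:
    """Return 0 only when at least one result exists and all of them passed."""
    return 0 if results and all(r["passed"] for r in results) else 1
```

Treating an empty result set as a failure is a deliberate choice in this sketch: if the test job silently produced no results, the release gate should stay closed rather than open.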

4. Where to Put It

The tests should live in deploy/automation/integration/ alongside the existing integration tests, since:

  • They require a live deployment (same as existing tests)
  • The CI infrastructure is already in the deploy repo
  • They share the same conftest.py fixtures (base_url, auth_headers, client, create_automation)

The example automation definitions (prompts/tarballs) could be:

  • Inline in the test file (for prompt-based scenarios)
  • In a test_example_tarballs/ subdirectory (for tarball-based scenarios that need custom scripts)

Implementation Plan

  1. Phase 1 — Prompt-based scenarios (low effort, high value)

    • Add test_e2e_example_automations.py with 3–4 prompt-based scenarios
    • Each scenario: create from prompt → dispatch → poll → verify COMPLETED
    • Start with: basic prompt, secrets availability, MCP config check
  2. Phase 2 — Feature-specific tarballs (medium effort)

    • Add custom test tarballs that exercise specific features with more precise verification
    • Each tarball's main.py prints specific markers (like SDK's EXAMPLE_COST:)
    • Scenarios: MCP tool invocation, secret resolution, skill loading verification
  3. Phase 3 — Advanced scenarios (higher effort)

    • Plugin preset E2E
    • Event-triggered automation E2E (requires webhook registration + event delivery)
    • Repo cloning + project skills
    • Cross-version compatibility (test against specific SDK versions)
  4. Phase 4 — Release gating

    • Integrate into the release pipeline so new versions can't ship unless all example automations pass
    • Nightly runs to catch regressions from SDK or platform changes
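As an illustration of the Phase 2 idea (a tarball's main.py printing precise markers), here is a sketch of a secret-resolution check that prints a short hash rather than the secret itself. It rests on a loud assumption: that the secret is readable from an environment variable inside the sandbox, which may not match how the platform actually injects secrets. The variable name `EXAMPLE_TEST_SECRET` and the `SECRET_FINGERPRINT:` / `SECRET_MISSING:` markers are hypothetical.

```python
import hashlib
import os


def secret_fingerprint(value: str, length: int = 8) -> str:
    """Return a short SHA-256 prefix so logs never contain the raw secret."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


def main(secret_name: str = "EXAMPLE_TEST_SECRET") -> int:
    """Print a verifiable marker; return a process-style exit code."""
    # ASSUMPTION: the secret is exposed as an environment variable.
    value = os.environ.get(secret_name)
    if value is None:
        print(f"SECRET_MISSING: {secret_name}")
        return 1
    print(f"SECRET_FINGERPRINT: {secret_fingerprint(value)}")
    return 0
```

The test side would compute the same fingerprint from the known test value and compare it with the printed marker, which confirms the secret resolved to the right value without exposing it in run output. As a script entry point, `main()` would be called via `raise SystemExit(main())`.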

Related


This issue was created by an AI agent (OpenHands) on behalf of the user.
