Implement otela-fleet project with CLI, templates, and deployment#54
Conversation
- Fix double #SBATCH prefix in extra_sbatch template rendering
- Use unique job script filenames to avoid race condition with replicas
- Add job_identity unit test
- Update template test to match extra_sbatch convention

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add README.md and docs/ (getting-started, cluster-config, fleet-apply)
- Add `otela-fleet clusters` and `otela-fleet presets <cluster>` commands
- Default config dir: ./clusters or ~/.config/opentela/fleet/clusters/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clariden now has a persistent relay at 148.187.108.172, so relay startup is skipped and workers bootstrap directly from it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
This PR introduces a new otela-fleet “fleet manager” subproject under contrib/fleet_manager to deploy and manage OpenTela worker + LLM backend jobs on SLURM clusters (apptainer/enroot), including relay management and a declarative “apply” workflow.
Changes:
- Adds a Click-based `otela-fleet` CLI for starting/stopping jobs, status/logs, and applying a fleet YAML.
- Implements cluster config loading + SSH connection helpers, relay lifecycle management, job submission/cancel/list, and Jinja2 templates for SLURM + OpenTela configs.
- Adds initial pytest coverage for cluster parsing, deploy/apply logic, templates, relay, and connection behaviors.
Reviewed changes
Copilot reviewed 31 out of 35 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| contrib/fleet_manager/tests/__init__.py | Test package marker. |
| contrib/fleet_manager/tests/test_apply.py | Tests parsing fleet file and diff computation. |
| contrib/fleet_manager/tests/test_cluster.py | Tests cluster config loading, presets, and job identity. |
| contrib/fleet_manager/tests/test_connection.py | Tests SSH connection behavior/host selection. |
| contrib/fleet_manager/tests/test_deploy.py | Tests deploy orchestration and binary sync behavior. |
| contrib/fleet_manager/tests/test_relay.py | Tests relay status/ensure logic. |
| contrib/fleet_manager/tests/test_templates.py | Tests template rendering for job scripts and configs. |
| contrib/fleet_manager/tests/test_worker.py | Tests worker job listing/cancel and exec prefix/template selection. |
| contrib/fleet_manager/src/fleet_manager/__init__.py | Package marker. |
| contrib/fleet_manager/src/fleet_manager/apply.py | Fleet reconciliation (desired vs live) and apply execution. |
| contrib/fleet_manager/src/fleet_manager/cli.py | Click CLI entrypoint and commands. |
| contrib/fleet_manager/src/fleet_manager/cluster.py | Cluster config schema/loading + SSH connection abstraction. |
| contrib/fleet_manager/src/fleet_manager/deploy.py | Deployment flow (sync binary, render configs, submit jobs). |
| contrib/fleet_manager/src/fleet_manager/relay.py | Relay process status/start/stop/ensure. |
| contrib/fleet_manager/src/fleet_manager/worker.py | Worker job list/submit/cancel/logs + runtime template selection. |
| contrib/fleet_manager/src/fleet_manager/templates/__init__.py | Jinja2 rendering helper. |
| contrib/fleet_manager/src/fleet_manager/templates/apptainer_single.sh.j2 | Single-node apptainer SLURM job script template. |
| contrib/fleet_manager/src/fleet_manager/templates/apptainer_multi.sh.j2 | Multi-node apptainer SLURM job script template. |
| contrib/fleet_manager/src/fleet_manager/templates/enroot_single.sh.j2 | Single-node enroot SLURM job script template. |
| contrib/fleet_manager/src/fleet_manager/templates/clariden_sglang.toml.j2 | Enroot EDF template for Clariden. |
| contrib/fleet_manager/src/fleet_manager/templates/relay.cfg.yaml.j2 | Relay OpenTela config template. |
| contrib/fleet_manager/src/fleet_manager/templates/worker.cfg.yaml.j2 | Worker OpenTela config template. |
| contrib/fleet_manager/src/otela_fleet.egg-info/PKG-INFO | Generated packaging metadata artifact. |
| contrib/fleet_manager/src/otela_fleet.egg-info/SOURCES.txt | Generated packaging sources list artifact. |
| contrib/fleet_manager/src/otela_fleet.egg-info/dependency_links.txt | Generated packaging artifact. |
| contrib/fleet_manager/src/otela_fleet.egg-info/entry_points.txt | Generated packaging artifact (console_script). |
| contrib/fleet_manager/src/otela_fleet.egg-info/requires.txt | Generated packaging artifact (deps). |
| contrib/fleet_manager/src/otela_fleet.egg-info/top_level.txt | Generated packaging artifact. |
| contrib/fleet_manager/pyproject.toml | Packaging configuration, deps, and console script entry point. |
| contrib/fleet_manager/README.md | Project-level usage and links to docs. |
| contrib/fleet_manager/docs/getting-started.md | Installation + quickstart documentation. |
| contrib/fleet_manager/docs/cluster-config.md | Cluster YAML schema documentation and examples. |
| contrib/fleet_manager/docs/fleet-apply.md | Declarative apply documentation and behavior description. |
| contrib/fleet_manager/clusters/.gitkeep | Keeps empty clusters dir in repo. |
```python
for cluster, backend, cmd, preset, replicas in desired:
    job_name = job_identity(backend, cmd, preset)
    cluster_jobs = live_jobs.get(cluster, [])
```
compute_diff() only iterates over the desired deployments, so jobs that exist in live_jobs but are removed from the fleet file will never be cancelled. This contradicts the documented behavior (“remove the entry and re-apply”) and will leave orphaned jobs running. Consider also generating cancel actions for any live OpenTela jobs whose (backend, cmd, preset) identity is not present in the desired set for that cluster (or update the docs to require replicas: 0).
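The suggested orphan-cancellation pass can be sketched as a second loop over `live_jobs`. The `Job` dataclass, the `job_identity()` stand-in, and the plain action tuples below are illustrative assumptions, not the project's actual types:

```python
# Hypothetical sketch of cancelling orphaned jobs in compute_diff().
# Job, job_identity(), and the action tuples are illustrative stand-ins.
from dataclasses import dataclass

@dataclass
class Job:
    id: str
    name: str

def job_identity(backend, cmd, preset):
    # stand-in for the real hash-based identity
    return f"otela-{backend}-{hash((cmd, preset)) & 0xFFFF:04x}"

def compute_diff(desired, live_jobs):
    actions = []
    desired_names = {}
    # First pass: record every identity the fleet file still wants.
    for cluster, backend, cmd, preset, replicas in desired:
        desired_names.setdefault(cluster, set()).add(job_identity(backend, cmd, preset))
        # ... existing scale-up / scale-down logic goes here ...
    # Second pass: cancel any live job whose identity is no longer desired.
    for cluster, jobs in live_jobs.items():
        wanted = desired_names.get(cluster, set())
        for job in jobs:
            if job.name not in wanted:
                actions.append(("cancel", cluster, job.id))
    return actions
```

With this shape, removing an entry from the fleet file produces cancel actions for its jobs on the next apply, matching the documented "remove the entry and re-apply" behavior.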
```python
elif current > replicas:
    excess = sorted(matching, key=lambda j: j.id, reverse=True)[: current - replicas]
    for job in excess:
        actions.append(Action(action="cancel", cluster=cluster, backend=backend, cmd=cmd, preset=preset, job_id=job.id))
```
When scaling down, excess is chosen by sorting job.id as a string. SLURM job IDs are numeric, so lexicographic ordering can cancel the wrong “newest” job (e.g., "99" > "100"). Sort by int(job.id) (with a safe fallback) to reliably cancel the newest jobs first.
@copilot apply changes based on this feedback
Fixed in ef00066. Changed the sort key from `lambda j: j.id` to `lambda j: int(j.id) if j.id and str(j.id).isdigit() else 0` so SLURM job IDs are compared numerically (e.g., 100 > 99) rather than lexicographically (where "99" > "100"). Also added a test `test_compute_diff_cancel_numeric_order` to cover this exact edge case.
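Extracted as a standalone key function, the numeric-sort fix reads roughly like this (the fallback-to-zero behavior mirrors the commit description; the function name is an assumption):

```python
# Sort SLURM job IDs numerically so "100" ranks above "99";
# non-numeric or missing IDs fall back to 0.
def job_sort_key(job_id):
    s = str(job_id) if job_id is not None else ""
    return int(s) if s.isdigit() else 0

job_ids = ["99", "100", "7"]
newest_first = sorted(job_ids, key=job_sort_key, reverse=True)
# a plain descending string sort would instead yield ["99", "7", "100"]
```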
```python
actions = compute_diff(desired, live_jobs)
if not actions:
    click.echo("\nNo changes needed. Fleet is at desired state.")
    return
```
apply() returns early when actions is empty, but any SSH connections created in the earlier loop are never closed. This can leak SSH sessions and file descriptors in long-running uses. Ensure connections are closed on all return paths (e.g., via try/finally around the whole function or closing before returning).
@copilot apply changes based on this feedback
```python
    click.echo(f" - cancel job {a.job_id} on {a.cluster}")
if dry_run:
    click.echo("\n(dry run - no changes made)")
    return
```
In the --dry-run path, apply() returns without closing any of the per-cluster ClusterConnections opened earlier. Please close connections before returning (or use a finally block) so dry-runs don’t leak SSH sessions.
@copilot apply changes based on this feedback
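Both leaks (the no-op return and the dry-run return) have the same fix: open connections into a collection and close them in a `finally` block so every return path cleans up. A minimal sketch, where `FakeConn` stands in for the project's `ClusterConnection` and the return values are placeholders:

```python
# Sketch of wrapping apply() in try/finally so SSH connections are
# closed on every return path. FakeConn stands in for ClusterConnection.
class FakeConn:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

def apply(conns, actions, dry_run=False):
    opened = []
    try:
        for conn in conns:            # stands in for opening per-cluster SSH
            opened.append(conn)
        if not actions:
            return "no-op"            # early return: finally still runs
        if dry_run:
            return "dry-run"          # early return: finally still runs
        return "applied"
    finally:
        for conn in opened:
            try:
                conn.close()
            except Exception:
                pass                  # best-effort cleanup
```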
```python
pattern = f"{cfg.binary_remote_path}.*relay.cfg.yaml"
out, _, code = conn.run(f'pgrep -f "{pattern}"', target="relay")
if code != 0:
    return RelayStatus.STOPPED
```
pgrep/pkill -f patterns are built from cfg.binary_remote_path and then wrapped in double quotes. If binary_remote_path contains ~ (as shown throughout the docs), tilde expansion will not occur inside quotes, so the pattern won’t match the actual running process (which uses an expanded absolute path). This can cause relay_status() to incorrectly report STOPPED and relay_start() to spawn duplicate relays. Consider expanding ~ to $HOME on the remote side before constructing the pattern, or match on a more stable substring (e.g., start --config .*relay.cfg.yaml).
@copilot apply changes based on this feedback
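The second workaround from the review (matching on a stable command-line substring rather than the configured binary path) sidesteps tilde expansion entirely. A hypothetical helper, not the project's code:

```python
# Build a pgrep invocation that matches the relay by its stable
# 'start --config ...relay.cfg.yaml' suffix instead of the configured
# binary path, so it matches whether or not '~' was expanded at launch.
def relay_pgrep_command(relay_cfg="relay.cfg.yaml"):
    pattern = f"start --config .*{relay_cfg}"
    return f'pgrep -f "{pattern}"'
```

Since the suffix never contains `~`, the same pattern works for both `pgrep` and the corresponding `pkill` in `relay_stop()`.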
```markdown
### Job Identity

Each deployment is identified by a hash of `backend + cmd + preset`. This means:

- Changing the command triggers a redeploy (new hash)
- Changing the preset triggers a redeploy (new hash)
- Changing only the replica count scales up/down without redeploying
```
The docs say the job identity hash is computed from backend + cmd + preset, but job_identity() currently hashes only cmd + preset_name (the backend is only in the prefix). Either include backend in the hash computation or adjust this documentation so users don’t assume hash changes when switching backends.
```markdown
To remove a deployment, set `replicas: 0` or remove the entry and re-apply.
```
This doc claims that removing a deployment entry and re-applying will remove the deployment, but the current reconciliation logic only considers deployments present in the fleet file (and won’t cancel jobs for removed entries). Either update the reconcile logic to cancel undesired jobs, or revise this statement to require replicas: 0 to remove a deployment.
```toml
[project.optional-dependencies]
dev = ["pytest>=7.0"]

[project.scripts]
otela-fleet = "fleet_manager.cli:cli"

[tool.setuptools.packages.find]
where = ["src"]

[tool.setuptools.package-data]
fleet_manager = ["templates/*.j2"]

[dependency-groups]
dev = [
    "pytest>=9.0.2",
]
```
The dev dependency for pytest is specified twice with different minimum versions ([project.optional-dependencies].dev uses pytest>=7.0, while [dependency-groups].dev uses pytest>=9.0.2). This can lead to inconsistent environments depending on installer/tooling. Consider using a single mechanism (and a single version constraint) for dev dependencies.
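One possible consolidation, assuming the project standardizes on PEP 735 dependency groups (the version pin shown is the one already in the file, kept only for illustration):

```toml
# Single source of truth for dev dependencies; the duplicate
# [project.optional-dependencies].dev table would be removed.
[dependency-groups]
dev = ["pytest>=9.0.2"]
```

Installers that only understand extras (`pip install -e ".[dev]"`) would instead need the `[project.optional-dependencies]` form, so the choice depends on the tooling the project targets; either way, only one table should carry the constraint.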
```
pyproject.toml
src/fleet_manager/__init__.py
src/fleet_manager/templates/__init__.py
src/otela_fleet.egg-info/PKG-INFO
src/otela_fleet.egg-info/SOURCES.txt
src/otela_fleet.egg-info/dependency_links.txt
src/otela_fleet.egg-info/entry_points.txt
src/otela_fleet.egg-info/requires.txt
src/otela_fleet.egg-info/top_level.txt
tests/test_cluster.py
```
The src/otela_fleet.egg-info/* files (e.g., SOURCES.txt, PKG-INFO, etc.) are build artifacts typically generated by setuptools during packaging. Committing them tends to create noisy diffs and can go stale relative to pyproject.toml. Consider removing the .egg-info directory from version control and adding it to .gitignore, letting the build backend regenerate it as needed.
```python
try:
    cfg = load_cluster(name, cluster_dir)
    conn = ClusterConnection(cfg)
    click.echo(f"\n=== {name} ===")
    rs = relay_status(conn, cfg)
    click.echo(f"  Relay: {rs.value}")
    jobs = worker_list(conn, target="slurm")
    if jobs:
        click.echo(f"  Jobs ({len(jobs)}):")
        for j in jobs:
            click.echo(f"    {j.id}  {j.name}  {j.state}  {j.elapsed}/{j.time_limit}  {j.node}")
    else:
        click.echo("  Jobs: none")
    conn.close()
except Exception as e:
    click.echo(f"\n=== {name} ===")
    click.echo(f"  ERROR: {e}")
```
In status(), conn.close() is only called at the end of the try block. If an exception occurs after establishing the SSH connection (e.g., during relay_status() / worker_list()), the except path doesn’t close the connection, potentially leaking SSH sessions when iterating multiple clusters. Wrap the per-cluster body in try/finally (or use a context manager) to always close the connection.
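`contextlib.closing` is one compact way to get the per-cluster guarantee. The sketch below uses a `FakeConn` stand-in and a generic `probe` callable in place of `relay_status()`/`worker_list()`; these names are assumptions for illustration:

```python
# contextlib.closing guarantees conn.close() runs per cluster even
# when a probe (e.g. relay_status()/worker_list()) raises.
import contextlib

class FakeConn:
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

def check_cluster(conn, probe):
    with contextlib.closing(conn):
        return probe(conn)

def failing_probe(conn):
    raise RuntimeError("ssh error")

conn = FakeConn()
try:
    check_cluster(conn, failing_probe)
except RuntimeError:
    pass
# conn.closed is now True despite the exception
```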
Agent-Logs-Url: https://github.com/eth-easl/OpenTela/sessions/12e7506b-7a54-4197-80a3-da0be219e533
Co-authored-by: xzyaoi <10499297+xzyaoi@users.noreply.github.com>
…and enhance logging

- Updated GitHub Actions workflow to use the new Dockerfile for amd64 architecture and set build metadata.
- Added new Dockerfile for building signed OpenTela binaries.
- Removed obsolete deployment scripts and configuration files for Clariden and Euler.
- Updated Dockerfile to streamline the build process and ensure proper entry points.
- Enhanced logging in the protocol to provide more detailed debug information.
- Adjusted public address handling to support DNS hostnames in the network configuration.
Add ProxyChains dataclass to cluster.py with enabled/ssh_key/proxy_target/socks_port/skip_partitions fields. When enabled and the preset partition is not in skip_partitions, the apptainer_single/multi templates open an SSH SOCKS tunnel from inside the SLURM allocation and export HTTPS_PROXY/HTTP_PROXY/ALL_PROXY so downstream commands (apptainer pull, HF hub, user code) have internet access via the tunnel.

Populate contrib/fleet_manager/clusters/ with jsc, jsc-dev, euler, and clariden migrated to the preset-based schema. JSC enables proxychains by default, with develbooster and dc-gpu-devel listed in skip_partitions.

Update the README to document presets, user-command env vars ($SERVICE_PORT, $HF_HOME), multi-node conventions, and proxychains. Delete stale docs/ files that described the old model-centric layout. 52/52 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No description provided.