fix: don't apply a 300s default task timeout to every job#41
Conversation
The previous default of 300 s silently killed any Batch job that ran longer than 5 minutes — a surprising failure mode for bioinformatics workloads that commonly run for hours. Change task_timeout default to None. When unset, the timeout key is omitted from register_job_definition entirely, so AWS Batch applies no timeout. When set, the value is validated locally (>= 60, the AWS minimum) before the API call so the error message is actionable rather than an opaque Batch rejection.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughExecutorSettings.task_timeout is made optional (default None instead of 300) and redocumented with AWS minimum. BatchJobBuilder.build_job_definition() now validates task_timeout >= 60 seconds when provided, conditionally includes it in job registration, and raises WorkflowError for invalid values. New test suite verifies timeout validation and conditional field inclusion; CI pipeline updated to run new tests. ChangesTask Timeout Optional with Validation
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| ) | ||
| task_timeout: Optional[int] = field( | ||
| default=300, | ||
| default=None, |
There was a problem hiding this comment.
we sometimes have long-running tasks like alignment.
|
@cademirch I don't have permission to assign reviewers, so tagging you here |
Squashed from the original 10-commit PR snakemake#40 stack for a clean rebase onto main (which advanced via snakemake#41's conditional task-timeout). Covers: per-rule batch_queue override, per-job tags propagated to submit_job + job definition with dynamic env-var tags, EC2/Fargate platform detection from the queue with fail-fast Fargate rejection, threads-based vCPU allocation, shared_memory_size_mb linuxParameters, and resource validation fixes (snakemake#34).
Squashed from the original 10-commit PR snakemake#40 stack for a clean rebase onto main (which advanced via snakemake#41's conditional task-timeout). Covers: per-rule batch_queue override, per-job tags propagated to submit_job + job definition with dynamic env-var tags, EC2/Fargate platform detection from the queue with fail-fast Fargate rejection, threads-based vCPU allocation, shared_memory_size_mb linuxParameters, and resource validation fixes (snakemake#34).
Adds a per-rule `aws_batch_task_timeout` resource that overrides the workflow-level `--aws-batch-task-timeout` setting, mirroring the existing per-rule `batch_queue` pattern. - **Precedence:** the per-rule resource wins over the global flag; falls back to the flag when the resource is absent; omits the timeout entirely when neither is set (preserving the no-default-timeout behavior from #41). - **Validation:** the value is coerced to an integer and must be ≥ 60 s (the AWS minimum). Errors name the source (resource vs flag) so the cause is unambiguous. ```python rule align: resources: aws_batch_task_timeout=14400 # 4 h ... ``` Depends on #41 (conditional task-timeout), now merged. Documented under a new "Task Timeout" section in `docs/further.md`. ### Testing New `TestPerRuleTaskTimeout` covers precedence, resource-only, fallback, both-absent, `< 60`, non-numeric, and the exact-60 boundary. Full suite: 61 unit tests pass; black + flake8 clean. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Support for per-rule AWS Batch task timeout configuration, which takes precedence over workflow-level settings. * Workflow-level timeout control with minimum 60-second validation (AWS requirement); omitted if not configured. * **Documentation** * Added "Task Timeout" section with configuration examples and behavior clarification. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
task_timeoutdefaults to 300 and is applied unconditionally inbuild_job_definition, so by default every Batch job is killed at 5 minutes — andOptional[int]is a lie, since "no timeout" is inexpressible. Typical bioinformatics jobs run hours; the default behavior is a workflow that mysteriously fails partway with terminated jobs.This changes the default to
None, omits thetimeoutkwarg fromregister_job_definitionwhen unset (AWS then applies no timeout), and validates the AWS minimum of 60s locally with a clearWorkflowError(naming--aws-batch-task-timeout) instead of a Batch API rejection. Help text updated.Tests (new
tests/test_batch_job_builder.py, added to the mocked-API CI job): default omits the key; set includes it; exactly 60 passes; 59, 1, 0, and -1 raise pre-API withregister_job_definitionnever called.Coordination note: this overlaps with my open #40, which also creates
tests/test_batch_job_builder.pyand edits the same CI line. Suggested order: land this small fix first; #40 then rebases on top (I'll handle the rebase, folding the test files together).Follow-up planned: a per-rule
aws_batch_task_timeoutresource, kept separate for review granularity.This PR was prepared with the assistance of coding agents; all changes were human-reviewed before submission.
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests
Chores