Skip to content

fix: don't apply a 300s default task timeout to every job#41

Merged
cademirch merged 1 commit into
snakemake:mainfrom
nh13:fix/task-timeout-default
Jun 7, 2026
Merged

fix: don't apply a 300s default task timeout to every job#41
cademirch merged 1 commit into
snakemake:mainfrom
nh13:fix/task-timeout-default

Conversation

@nh13

@nh13 nh13 commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator

task_timeout defaults to 300 and is applied unconditionally in build_job_definition, so by default every Batch job is killed at 5 minutes — and Optional[int] is a lie, since "no timeout" is inexpressible. Typical bioinformatics jobs run hours; the default behavior is a workflow that mysteriously fails partway with terminated jobs.

This changes the default to None, omits the timeout kwarg from register_job_definition when unset (AWS then applies no timeout), and validates the AWS minimum of 60s locally with a clear WorkflowError (naming --aws-batch-task-timeout) instead of a Batch API rejection. Help text updated.

Tests (new tests/test_batch_job_builder.py, added to the mocked-API CI job): default omits the key; set includes it; exactly 60 passes; 59, 1, 0, and -1 raise pre-API with register_job_definition never called.

Coordination note: this overlaps with my open #40, which also creates tests/test_batch_job_builder.py and edits the same CI line. Suggested order: land this small fix first; #40 then rebases on top (I'll handle the rebase, folding the test files together).

Follow-up planned: a per-rule aws_batch_task_timeout resource, kept separate for review granularity.


This PR was prepared with the assistance of coding agents; all changes were human-reviewed before submission.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added validation for task timeout with AWS minimum of 60 seconds
  • Bug Fixes

    • Task timeout now defaults to None and is conditionally applied
  • Tests

    • Added comprehensive batch job builder test suite
  • Chores

    • Updated CI pipeline to run additional batch job builder tests

The previous default of 300 s silently killed any Batch job that ran
longer than 5 minutes — a surprising failure mode for bioinformatics
workloads that commonly run for hours.

Change task_timeout default to None. When unset, the timeout key is
omitted from register_job_definition entirely, so AWS Batch applies no
timeout. When set, the value is validated locally (>= 60, the AWS
minimum) before the API call so the error message is actionable rather
than an opaque Batch rejection.
@coderabbitai

coderabbitai Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f9904462-ecb6-4a9b-8f65-740d8ede5f41

📥 Commits

Reviewing files that changed from the base of the PR and between 2525ca0 and 02d8ff6.

📒 Files selected for processing (4)
  • .github/workflows/ci_mocked_api.yml
  • snakemake_executor_plugin_aws_batch/__init__.py
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py
  • tests/test_batch_job_builder.py

📝 Walkthrough

Walkthrough

ExecutorSettings.task_timeout is made optional (default None instead of 300) and redocumented with AWS minimum. BatchJobBuilder.build_job_definition() now validates task_timeout >= 60 seconds when provided, conditionally includes it in job registration, and raises WorkflowError for invalid values. New test suite verifies timeout validation and conditional field inclusion; CI pipeline updated to run new tests.

Changes

Task Timeout Optional with Validation

Layer / File(s) Summary
Settings contract: task_timeout default and help text
snakemake_executor_plugin_aws_batch/__init__.py
ExecutorSettings.task_timeout default changed from 300 to None; help text updated to describe timeout in seconds and AWS minimum of 60 seconds.
BatchJobBuilder timeout validation and conditional registration
snakemake_executor_plugin_aws_batch/batch_job_builder.py
typing imports expanded to include Any; build_job_definition() now validates task_timeout >= 60 when provided (raising WorkflowError for subminimum values), conditionally omits timeout field when task_timeout is None, and refactors registration arguments into register_kwargs dict passed via unpacking.
Test coverage and CI integration
tests/test_batch_job_builder.py, .github/workflows/ci_mocked_api.yml
New TestTaskTimeout test class validates build_job_definition() omits timeout when task_timeout is None, includes timeout when set, accepts minimum of 60, and raises WorkflowError for invalid values. CI workflow updated to run new test module in coverage step.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: removing the 300s default task timeout that was being applied unconditionally to every job.
Docstring Coverage ✅ Passed Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

)
task_timeout: Optional[int] = field(
default=300,
default=None,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we sometimes have long-running tasks like alignment.

@nh13 nh13 marked this pull request as ready for review June 7, 2026 17:01
@nh13

nh13 commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator Author

@cademirch I don't have permission to assign reviewers, so tagging you here

@cademirch cademirch self-requested a review June 7, 2026 22:09
@cademirch cademirch merged commit 96f8161 into snakemake:main Jun 7, 2026
8 of 9 checks passed
nh13 added a commit to nh13/snakemake-executor-plugin-aws-batch that referenced this pull request Jun 9, 2026
Squashed from the original 10-commit PR snakemake#40 stack for a clean rebase onto
main (which advanced via snakemake#41's conditional task-timeout). Covers: per-rule
batch_queue override, per-job tags propagated to submit_job + job definition
with dynamic env-var tags, EC2/Fargate platform detection from the queue with
fail-fast Fargate rejection, threads-based vCPU allocation, shared_memory_size_mb
linuxParameters, and resource validation fixes (snakemake#34).
nh13 added a commit to nh13/snakemake-executor-plugin-aws-batch that referenced this pull request Jun 9, 2026
Squashed from the original 10-commit PR snakemake#40 stack for a clean rebase onto
main (which advanced via snakemake#41's conditional task-timeout). Covers: per-rule
batch_queue override, per-job tags propagated to submit_job + job definition
with dynamic env-var tags, EC2/Fargate platform detection from the queue with
fail-fast Fargate rejection, threads-based vCPU allocation, shared_memory_size_mb
linuxParameters, and resource validation fixes (snakemake#34).
nh13 added a commit that referenced this pull request Jun 9, 2026
Adds a per-rule `aws_batch_task_timeout` resource that overrides the
workflow-level `--aws-batch-task-timeout` setting, mirroring the
existing per-rule `batch_queue` pattern.

- **Precedence:** the per-rule resource wins over the global flag; falls
back to the flag when the resource is absent; omits the timeout entirely
when neither is set (preserving the no-default-timeout behavior from
#41).
- **Validation:** the value is coerced to an integer and must be ≥ 60 s
(the AWS minimum). Errors name the source (resource vs flag) so the
cause is unambiguous.

```python
rule align:
    resources:
        aws_batch_task_timeout=14400  # 4 h
    ...
```

Depends on #41 (conditional task-timeout), now merged. Documented under
a new "Task Timeout" section in `docs/further.md`.

### Testing
New `TestPerRuleTaskTimeout` covers precedence, resource-only, fallback,
both-absent, `< 60`, non-numeric, and the exact-60 boundary. Full suite:
61 unit tests pass; black + flake8 clean.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Support for per-rule AWS Batch task timeout configuration, which takes
precedence over workflow-level settings.
* Workflow-level timeout control with minimum 60-second validation (AWS
requirement); omitted if not configured.

* **Documentation**
* Added "Task Timeout" section with configuration examples and behavior
clarification.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants