feat: per-rule queues, per-job tags, EC2/Fargate-aware validation#40
Conversation
|
Warning Review limit reached
More reviews will be available in 48 minutes and 13 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (6)
📝 WalkthroughWalkthroughAdds BatchClient passthroughs; extends BatchJobBuilder to pick per-job job_queue, detect EC2 vs FARGATE via AWS, perform platform-aware resource validation, set platformCapabilities, merge tags from settings and an env var, adjust shared-memory handling and defaults, update docs, and add comprehensive unit tests. ChangesBatcher changes (single cohort)
Sequence Diagram(s)sequenceDiagram
participant Env as Env/Settings
participant Builder as BatchJobBuilder
participant Client as BatchClient
participant AWS as AWS Batch
Env->>Builder: provide settings, job, env-var tags
Builder->>Client: describe_job_queues(...)
Client->>AWS: DescribeJobQueues API
AWS-->>Client: queue description
Client-->>Builder: queue info
alt computeEnvironment referenced
Builder->>Client: describe_compute_environments(...)
Client->>AWS: DescribeComputeEnvironments API
AWS-->>Client: compute env description
Client-->>Builder: compute env info
end
Builder->>Builder: determine platform (EC2 / FARGATE) and validate resources
Builder->>AWS: register_job_definition(platformCapabilities=[Builder.platform], ...)
AWS-->>Builder: job definition ARN
Builder->>Builder: _build_job_tags() merge settings + env tags
Builder->>Client: submit_job(..., jobQueue=Builder.job_queue, tags=merged?)
Client->>AWS: SubmitJob API (with tags if present)
AWS-->>Client: submit response
Client-->>Builder: submit response
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (1)
179-205:⚠️ Potential issue | 🔴 CriticalFargate job definitions cannot use EC2-only container settings.
Line 215 sets
platformCapabilities=[self.platform], which can be FARGATE when detected from the job queue. However, the container_properties dict (lines 179–205) violates AWS Fargate requirements:
executionRoleArnis required but never suppliedprivileged=Trueis not supported; Fargate requires it omitted orfalse- GPU resourceRequirements (lines 198–204) are unsupported on Fargate
The AWS Batch API will reject Fargate job registrations with these violations. Platform-specific container-property assembly is needed before Fargate can work.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 179 - 205, The container_properties assembly (container_properties dict in batch_job_builder.py used when creating job definitions) is not Fargate-safe: when platformCapabilities is set to self.platform (FARGATE) you must include executionRoleArn, remove or set privileged to False, and omit GPU resourceRequirements; update the logic that builds container_properties and the call site that sets platformCapabilities to make platform-specific adjustments (detect self.platform == "FARGATE" and then add executionRoleArn from settings, delete or set privileged to False, and skip adding BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU entries) so Fargate job registrations meet AWS requirements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 123-133: The current logic picks min(valid_mems) which can
under-provision memory; instead, in the block that computes valid_mems from
VALID_RESOURCES_MAPPING for the given vcpu, filter for candidates >= mem and
pick the smallest of those (e.g., candidate_mem = min(m for m in valid_mems if m
>= mem)); if no candidate exists raise a WorkflowError explaining the requested
mem is too large for that vCPU; update the logger.warning to report the chosen
candidate_mem (and original mem/vcpu) rather than min_mem. Ensure this change
touches the same symbols: VALID_RESOURCES_MAPPING, valid_mems, vcpu, mem,
WorkflowError, and the logger.warning call.
- Around line 105-109: The broad except block that logs and returns
BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value must not silently downgrade on all
errors; instead, remove the unconditional return in the except (or change it to
re-raise) so errors propagate (fail fast) while retaining existing explicit
empty-response fallbacks elsewhere; update the except handling around the
queue/compute-environment lookup in the method (the block that uses
self.logger.warning and BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value) to either
re-raise the caught exception after logging or catch and rethrow only
non-empty-response-safe errors (e.g., inspect/handle specific AWS client
errors), ensuring only the explicit empty-response branches keep the EC2
fallback.
---
Outside diff comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 179-205: The container_properties assembly (container_properties
dict in batch_job_builder.py used when creating job definitions) is not
Fargate-safe: when platformCapabilities is set to self.platform (FARGATE) you
must include executionRoleArn, remove or set privileged to False, and omit GPU
resourceRequirements; update the logic that builds container_properties and the
call site that sets platformCapabilities to make platform-specific adjustments
(detect self.platform == "FARGATE" and then add executionRoleArn from settings,
delete or set privileged to False, and skip adding
BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU entries) so Fargate job registrations
meet AWS requirements.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: adee3872-6ca3-4367-b682-4abbb205004a
📒 Files selected for processing (3)
snakemake_executor_plugin_aws_batch/batch_client.pysnakemake_executor_plugin_aws_batch/batch_job_builder.pytests/test_batch_job_builder.py
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (1)
185-221:⚠️ Potential issue | 🔴 CriticalAdd platform-aware container properties for Fargate compatibility.
The job definition construction at line 185 unconditionally sets
privileged: True, omitsexecutionRoleArn, and allows GPU resources, but line 221 can sendplatformCapabilities=["FARGATE"]. This causes Fargate job definition registration to fail: AWS Batch Fargate requires an execution role, prohibits privileged mode, and does not support GPU. The code already validates Fargate vs. EC2 resource requirements but does not handle these container property differences.Conditionally set
executionRoleArnfor Fargate, omit or setprivilegedto false for Fargate (true only for EC2), and reject GPU requests on Fargate.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 185 - 221, The container_properties currently set privileged=True, omit executionRoleArn, and may include GPU requirements even when self.platform == "FARGATE"; update the job-definition assembly in batch_job_builder.py so that inside the block building container_properties you (1) only set privileged=True for non-FARGATE platforms and set privileged=False (or omit) when self.platform == self.platform value for FARGATE, (2) add executionRoleArn=self.settings.execution_role (or settings.executionRoleArn) to container_properties when self.platform == FARGATE, and (3) guard appending a GPU resource (BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU) — raise/return an error if gpu > 0 and self.platform == FARGATE to reject unsupported GPU on Fargate; ensure these changes are honored before calling self.batch_client.register_job_definition so register_job_definition receives the correct platform-aware containerProperties and tags.
♻️ Duplicate comments (2)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (2)
129-140:⚠️ Potential issue | 🟠 MajorDon’t reduce Fargate memory below the requested amount.
Line 135 can turn
vcpu=1, mem=5000into2048, under-provisioning the job and causing avoidable OOM failures. Pick the smallest valid memory>= mem, or fail when the request exceeds the maximum for that vCPU.Proposed fix
valid_mems = [m for m, v in VALID_RESOURCES_MAPPING.items() if vcpu in v] if not valid_mems: raise WorkflowError( f"Invalid vCPU value {vcpu} for Fargate. " f"Check valid Fargate resource configurations." ) - min_mem = min(valid_mems) + candidate_mems = [m for m in valid_mems if m >= mem] + if not candidate_mems: + raise WorkflowError( + f"Memory value {mem} MB exceeds the maximum allowed for " + f"vCPU {vcpu} on Fargate." + ) + min_mem = min(candidate_mems) self.logger.warning( f"Memory value {mem} MB is invalid for vCPU {vcpu} on Fargate. " - f"Setting memory to minimum allowed value {min_mem} MB." + f"Setting memory to next allowed value {min_mem} MB." )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 129 - 140, The current logic in batch_job_builder.py (inside the function that maps Fargate vcpu/mem) picks min(valid_mems) which can under-provision requested mem; change it to select the smallest entry in valid_mems that is >= mem (use vcpu, mem, VALID_RESOURCES_MAPPING, valid_mems) and return that; if no valid_mems >= mem exist, raise WorkflowError indicating the requested memory exceeds the maximum allowed for that vCPU rather than silently lowering it; keep the existing logging (self.logger.warning) but update the message to reflect that memory was bumped up to the selected valid value, and ensure returned values remain str(vcpu), str(selected_mem).
111-115:⚠️ Potential issue | 🟠 MajorDon’t silently downgrade platform lookup failures to EC2.
Line 111 still catches every exception from the AWS queue/environment lookup and turns it into an EC2 fallback. That can hide IAM/configuration/transient AWS failures and register the wrong job definition; keep the explicit empty-response fallbacks, but re-raise real lookup errors.
Proposed fix
- except Exception as e: - self.logger.warning( - f"Failed to determine platform from queue: {e}. Defaulting to EC2." - ) - return BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value + except Exception as e: + raise WorkflowError( + f"Failed to determine platform from queue {self.job_queue}: {e}" + ) from e🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 111 - 115, The current broad except in the platform lookup swallows all errors and returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value; instead, change the error handling in the platform-lookup routine so you only fall back to EC2 when the lookup explicitly returns an empty/no-result value, and re-raise real AWS/client errors. Concretely: remove the generic "except Exception" around the AWS queue/environment lookup, or narrow it to only catch the expected "no result" condition (e.g., check for None/empty response) and call self.logger.warning(...) then return EC2; for any other exceptions from the AWS client (botocore.exceptions.BotoCoreError/ClientError or unexpected errors), re-raise them so they surface to callers rather than being silenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 237-247: The merged tags builder currently accepts empty keys
(e.g., "=value") and doesn't enforce AWS Batch limits; update the tag assembly
in the method that builds/returns tags (the block using
SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR and variable tags) to validate after
merging: trim key and value, skip or reject any entries where the key is empty
(key.strip() == ""), and enforce a maximum of 50 tags (len(tags) <= 50); if
validation fails, raise a clear ValueError describing the problem so the caller
(before submit_job) fails fast with a helpful message.
---
Outside diff comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 185-221: The container_properties currently set privileged=True,
omit executionRoleArn, and may include GPU requirements even when self.platform
== "FARGATE"; update the job-definition assembly in batch_job_builder.py so that
inside the block building container_properties you (1) only set privileged=True
for non-FARGATE platforms and set privileged=False (or omit) when self.platform
== self.platform value for FARGATE, (2) add
executionRoleArn=self.settings.execution_role (or settings.executionRoleArn) to
container_properties when self.platform == FARGATE, and (3) guard appending a
GPU resource (BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU) — raise/return an error
if gpu > 0 and self.platform == FARGATE to reject unsupported GPU on Fargate;
ensure these changes are honored before calling
self.batch_client.register_job_definition so register_job_definition receives
the correct platform-aware containerProperties and tags.
---
Duplicate comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 129-140: The current logic in batch_job_builder.py (inside the
function that maps Fargate vcpu/mem) picks min(valid_mems) which can
under-provision requested mem; change it to select the smallest entry in
valid_mems that is >= mem (use vcpu, mem, VALID_RESOURCES_MAPPING, valid_mems)
and return that; if no valid_mems >= mem exist, raise WorkflowError indicating
the requested memory exceeds the maximum allowed for that vCPU rather than
silently lowering it; keep the existing logging (self.logger.warning) but update
the message to reflect that memory was bumped up to the selected valid value,
and ensure returned values remain str(vcpu), str(selected_mem).
- Around line 111-115: The current broad except in the platform lookup swallows
all errors and returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value; instead,
change the error handling in the platform-lookup routine so you only fall back
to EC2 when the lookup explicitly returns an empty/no-result value, and re-raise
real AWS/client errors. Concretely: remove the generic "except Exception" around
the AWS queue/environment lookup, or narrow it to only catch the expected "no
result" condition (e.g., check for None/empty response) and call
self.logger.warning(...) then return EC2; for any other exceptions from the AWS
client (botocore.exceptions.BotoCoreError/ClientError or unexpected errors),
re-raise them so they surface to callers rather than being silenced.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2631aa04-9e9f-490a-8474-163c63b38e71
📒 Files selected for processing (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py
There was a problem hiding this comment.
♻️ Duplicate comments (3)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (3)
248-258:⚠️ Potential issue | 🟡 MinorValidate tag keys and count before returning.
The current implementation accepts empty keys from malformed input like
=valueor,=value,. Additionally, AWS Batch enforces a 50-tag limit per job. Consider validating that keys are non-empty and the total count doesn't exceed 50 to fail fast with a clear error.Proposed fix
env_tags_str = os.environ.get(SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR, "") if env_tags_str: for pair in env_tags_str.split(","): pair = pair.strip() if "=" in pair: key, _, value = pair.partition("=") - tags[key.strip()] = value.strip() + key = key.strip() + if not key: + raise WorkflowError( + f"Invalid {SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR}: " + "tag key cannot be empty." + ) + tags[key] = value.strip() + + if len(tags) > 50: + raise WorkflowError( + f"AWS Batch jobs support at most 50 tags, got {len(tags)}." + ) return tags🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 248 - 258, Validate and reject empty tag keys and enforce AWS Batch's 50-tag limit before returning tags: after copying self.settings.tags into tags and while parsing env_tags_str (using SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR), strip the key and if key is empty raise a clear ValueError (or similar) indicating malformed tag like "=value"; also validate existing keys in self.settings.tags (strip and error on any empty key). After merging all tags, if len(tags) > 50 raise a clear ValueError indicating the tag count exceeds AWS Batch's 50-tag limit; return tags only if all validations pass.
111-115:⚠️ Potential issue | 🟠 MajorBroad exception handling masks real configuration errors.
The catch-all
Exceptionhandler defaults to EC2 for any failure, including permission errors (AccessDeniedException), validation errors, and transient AWS failures. This can turn real configuration issues into silent fallbacks, pushing the actual error to a later register/submit call where it's harder to diagnose.Consider catching specific boto3/botocore exceptions (e.g.,
ClientError) and re-raising unexpected errors to fail fast, while keeping the EC2 fallback only for the explicit empty-response branches.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 111 - 115, Replace the broad except Exception handler around the logic that returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value with targeted exception handling: import and catch botocore.exceptions.ClientError (and any specific AWS exceptions you expect, e.g., AccessDeniedException) and handle only the cases where the AWS response is empty or indicates "no platform" by returning the EC2 fallback; for other ClientError codes (and for unexpected exceptions) re-raise the exception so the caller fails fast instead of silently defaulting, and keep the EC2 fallback limited to the explicit empty-response branch in the method that determines platform from the queue.
129-140:⚠️ Potential issue | 🟠 MajorMemory fallback may under-provision the container.
When the requested memory isn't in
VALID_RESOURCES_MAPPING, line 135 picksmin(valid_mems), which could be significantly lower than requested. For example,vcpu=1, mem=5000gets rewritten to2048, potentially causing OOM failures.Consider picking the smallest valid memory that is
>= mem, or raising an error when no valid memory can satisfy the request.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 129 - 140, The current fallback chooses min(valid_mems) which may under-provision (e.g., vcpu=1, mem=5000 -> 2048); update the logic in the block that computes valid_mems (using VALID_RESOURCES_MAPPING) to select the smallest valid memory >= requested mem (e.g., candidate = min(m for m in valid_mems if m >= mem)) and return that, and if no such candidate exists raise a WorkflowError explaining that no valid Fargate memory satisfies the requested mem for the given vcpu; keep the return format str(vcpu), str(selected_mem).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 248-258: Validate and reject empty tag keys and enforce AWS
Batch's 50-tag limit before returning tags: after copying self.settings.tags
into tags and while parsing env_tags_str (using
SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR), strip the key and if key is empty raise a
clear ValueError (or similar) indicating malformed tag like "=value"; also
validate existing keys in self.settings.tags (strip and error on any empty key).
After merging all tags, if len(tags) > 50 raise a clear ValueError indicating
the tag count exceeds AWS Batch's 50-tag limit; return tags only if all
validations pass.
- Around line 111-115: Replace the broad except Exception handler around the
logic that returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value with targeted
exception handling: import and catch botocore.exceptions.ClientError (and any
specific AWS exceptions you expect, e.g., AccessDeniedException) and handle only
the cases where the AWS response is empty or indicates "no platform" by
returning the EC2 fallback; for other ClientError codes (and for unexpected
exceptions) re-raise the exception so the caller fails fast instead of silently
defaulting, and keep the EC2 fallback limited to the explicit empty-response
branch in the method that determines platform from the queue.
- Around line 129-140: The current fallback chooses min(valid_mems) which may
under-provision (e.g., vcpu=1, mem=5000 -> 2048); update the logic in the block
that computes valid_mems (using VALID_RESOURCES_MAPPING) to select the smallest
valid memory >= requested mem (e.g., candidate = min(m for m in valid_mems if m
>= mem)) and return that, and if no such candidate exists raise a WorkflowError
explaining that no valid Fargate memory satisfies the requested mem for the
given vcpu; keep the return format str(vcpu), str(selected_mem).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 528b17b0-82e7-4cf6-816a-3a455ee9968b
⛔ Files ignored due to path filters (1)
pyproject.tomlis excluded by!pyproject.toml
📒 Files selected for processing (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py
c736516 to
31b0253
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 118-120: The error messages in batch_job_builder.py reference
self.settings.job_queue but the actual queue used for a job can be overridden
and stored in self.job_queue; update the two places (around the
platform-determination and the second error block) to use self.job_queue instead
of self.settings.job_queue so diagnostics show the resolved per-job queue and
not the default settings value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 822c07d0-b3f7-4862-864a-9c0d2861a654
⛔ Files ignored due to path filters (1)
pyproject.tomlis excluded by!pyproject.toml
📒 Files selected for processing (4)
snakemake_executor_plugin_aws_batch/__init__.pysnakemake_executor_plugin_aws_batch/batch_client.pysnakemake_executor_plugin_aws_batch/batch_job_builder.pytests/test_batch_job_builder.py
🚧 Files skipped from review as they are similar to previous changes (2)
- snakemake_executor_plugin_aws_batch/batch_client.py
- tests/test_batch_job_builder.py
31b0253 to
bafe244
Compare
|
@johanneskoester or others, this PR is ready, and I'd be glad to added as a maintainer if that would be helpful, as I do have some further improvements |
|
CC: @radusuciu |
|
@nh13 I like many of these changes and think they're necessary for a truly functional executor plugin. I implemented many of the same in my fork which I used in production and in my separate project that runs the executor itself on AWS Batch (see: https://github.com/radusuciu/snakemake-executor-plugin-aws-basic-batch). However, If I were the maintainer, I'd probably prefer to see separate PRs. Point is moot though as maintainer is MIA.. |
|
@nh13 and @radusuciu I'm happy to step in here and help get this merged. I'll take a look at this soon. Thanks for the PR. |
|
Thanks @cademirch. I do have other PRs that I'd like to contribute, would you be open to that? I can make one at a time, with an issue before hand for your review? I want to be aware of the maintenance burden this places on the maintainers, so let me know what works for you. |
Separate PRs are good. Just put me ad reviewer so it ends up in my inbox! Thanks @nh13. Also, if you're interested in helping maintain this plugin let's chat. |
|
@cademirch I'd welcome helping out here! |
`task_timeout` defaults to 300 and is applied unconditionally in `build_job_definition`, so by default every Batch job is killed at 5 minutes — and `Optional[int]` is a lie, since "no timeout" is inexpressible. Typical bioinformatics jobs run hours; the default behavior is a workflow that mysteriously fails partway with terminated jobs. This changes the default to `None`, omits the `timeout` kwarg from `register_job_definition` when unset (AWS then applies no timeout), and validates the AWS minimum of 60s locally with a clear `WorkflowError` (naming `--aws-batch-task-timeout`) instead of a Batch API rejection. Help text updated. Tests (new `tests/test_batch_job_builder.py`, added to the mocked-API CI job): default omits the key; set includes it; exactly 60 passes; 59, 1, 0, and -1 raise pre-API with `register_job_definition` never called. **Coordination note:** this overlaps with my open #40, which also creates `tests/test_batch_job_builder.py` and edits the same CI line. Suggested order: land this small fix first; #40 then rebases on top (I'll handle the rebase, folding the test files together). Follow-up planned: a per-rule `aws_batch_task_timeout` resource, kept separate for review granularity. --- *This PR was prepared with the assistance of coding agents; all changes were human-reviewed before submission.* <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added validation for task timeout with AWS minimum of 60 seconds * **Bug Fixes** * Task timeout now defaults to None and is conditionally applied * **Tests** * Added comprehensive batch job builder test suite * **Chores** * Updated CI pipeline to run additional batch job builder tests <!-- end of auto-generated comment: release notes by coderabbit.ai -->
|
Thanks a lot Nils! I have now invited you as a co-maintainer. |
Squashed from the original 10-commit PR snakemake#40 stack for a clean rebase onto main (which advanced via snakemake#41's conditional task-timeout). Covers: per-rule batch_queue override, per-job tags propagated to submit_job + job definition with dynamic env-var tags, EC2/Fargate platform detection from the queue with fail-fast Fargate rejection, threads-based vCPU allocation, shared_memory_size_mb linuxParameters, and resource validation fixes (snakemake#34).
f66fc8a to
afe9402
Compare
|
@coderabbitai resume |
|
@coderabbitai review |
✅ Action performedReviews resumed. |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/further.md`:
- Around line 120-122: The fenced code block containing the environment variable
example (the line starting with export
SNAKEMAKE_AWS_BATCH_JOB_TAGS="run_id=...") lacks a language specifier; update
the fence to include a shell language (e.g., use ```bash or ```sh) so the code
block is properly annotated and MD040 is resolved.
- Around line 131-145: Remove the duplicated, incomplete "Shared Memory
(/dev/shm) Sizing" fragment that begins with the example "rule align:" and ends
with the truncated "This s"; either delete that entire repeated block (the code
fence plus the trailing text) or replace it with a short comment like <!--
Removed duplicate/incomplete section; covered above in "Shared Memory
(`/dev/shm`)" --> so the documentation no longer contains the duplicated example
and the mid-sentence fragment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8378d5f0-188e-446e-aa14-c795d1ee1f1c
⛔ Files ignored due to path filters (1)
pyproject.tomlis excluded by!pyproject.toml
📒 Files selected for processing (6)
docs/further.mdsnakemake_executor_plugin_aws_batch/__init__.pysnakemake_executor_plugin_aws_batch/batch_client.pysnakemake_executor_plugin_aws_batch/batch_job_builder.pytests/test_batch_job_builder.pytests/tests_mocked_api.py
🚧 Files skipped from review as they are similar to previous changes (3)
- snakemake_executor_plugin_aws_batch/batch_client.py
- snakemake_executor_plugin_aws_batch/init.py
- snakemake_executor_plugin_aws_batch/batch_job_builder.py
Squashed from the original 10-commit PR snakemake#40 stack for a clean rebase onto main (which advanced via snakemake#41's conditional task-timeout). Covers: per-rule batch_queue override, per-job tags propagated to submit_job + job definition with dynamic env-var tags, EC2/Fargate platform detection from the queue with fail-fast Fargate rejection, threads-based vCPU allocation, shared_memory_size_mb linuxParameters, and resource validation fixes (snakemake#34).
afe9402 to
dcd7f15
Compare
Summary
Hardens the AWS Batch executor for multi-arch, multi-tenant production use. Three user-facing capabilities plus underlying validation/compat fixes.
What's new
resources.batch_queue— route each rule to a Batch queue wired to the matching compute environment. Enables a single workflow to run ARM rules on Graviton queues and x86 rules on Intel/AMD queues. Falls back to the profile default queue.settings.tagsare now sent tosubmit_job(not onlyregister_job_definition), so the AWS Batch console and Cost Explorer can group jobs by tag. ASNAKEMAKE_AWS_BATCH_JOB_TAGSenv var carries dynamicKEY=VALUEpairs for coordinator → worker tag inheritance (e.g. per-run commit hashes for cost attribution). Env tags override settings on key conflict.resources.shared_memory_size_mb— rules opt into a larger/dev/shm(EC2/ECS default 64 MB is too small for tools that stage large in-memory indexes). EC2 only; Fargate does not honorlinuxParameters.sharedMemorySize.Validation and correctness
VALID_RESOURCES_MAPPINGis restrictive, and only applicable to Fargate compute environments. #34): resolves the queue's compute environment and runs platform-specific vCPU/memory checks. Previously all jobs were validated against Fargate's strict resource matrix, which blocked legitimate EC2 requests. Real AWS lookup errors (botocore.ClientError) now propagate asWorkflowErrorinstead of silently falling back to EC2.threads— derived fromself.job.threads(the documented Snakemake directive) instead of the undocumented_coresresource, with_coresretained as fallback.>= meminstead of silently shrinking tomin(valid_mems)and OOMing the job. Raise if no valid value exists.submit_job, matching AWS Batch limits.Known limitation
Fargate job-definition registration is not yet supported. Fargate requires
executionRoleArn, prohibitsprivileged: True, and does not allow GPU resources — none of which this plugin currently emits conditionally.build_job_definitionraises a clearWorkflowErrorwhen a Fargate queue is detected, replacing an opaque AWS-side rejection with a local message. EC2 queues are unaffected.Compatibility
auto_deploy_default_storage_provider=Falseso workers don'tpip installan unpinnedsnakemake-storage-plugin-s3at startup — s3-plugin 0.3.x metadata requires snakemake 9.x and breaks 8.x workers. Image maintainers pre-install a compatible plugin version.snakemake-storage-plugin-s3pin widened to>=0.2,<0.4so 8.x consumers can fall back to 0.2.x.Dependencies
Stacked on #38 (
fix-vcpu-threads-and-validation) — the first three commits here are the same as #38's, so #38 should land first (or be closed in favor of this PR).Test plan
tests/test_batch_job_builder.py:submit_job(settings, env var, merge, override, empty cases)ClientErrorre-raises, empty-response fallback, Fargate detection)build_job_definitionSummary by CodeRabbit
New Features
Behavior Changes
Documentation
Tests