Skip to content

feat: per-rule queues, per-job tags, EC2/Fargate-aware validation#40

Merged
nh13 merged 1 commit into
snakemake:mainfrom
nh13:feat/submit-job-tags
Jun 9, 2026
Merged

feat: per-rule queues, per-job tags, EC2/Fargate-aware validation#40
nh13 merged 1 commit into
snakemake:mainfrom
nh13:feat/submit-job-tags

Conversation

@nh13

@nh13 nh13 commented Apr 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

Hardens the AWS Batch executor for multi-arch, multi-tenant production use. Three user-facing capabilities plus underlying validation/compat fixes.

What's new

  • Per-rule queue routing via resources.batch_queue — route each rule to a Batch queue wired to the matching compute environment. Enables a single workflow to run ARM rules on Graviton queues and x86 rules on Intel/AMD queues. Falls back to the profile default queue.
  • Per-job tagssettings.tags are now sent to submit_job (not only register_job_definition), so the AWS Batch console and Cost Explorer can group jobs by tag. A SNAKEMAKE_AWS_BATCH_JOB_TAGS env var carries dynamic KEY=VALUE pairs for coordinator → worker tag inheritance (e.g. per-run commit hashes for cost attribution). Env tags override settings on key conflict.
  • resources.shared_memory_size_mb — rules opt into a larger /dev/shm (EC2/ECS default 64 MB is too small for tools that stage large in-memory indexes). EC2 only; Fargate does not honor linuxParameters.sharedMemorySize.

Validation and correctness

  • EC2 vs FARGATE detection (closes VALID_RESOURCES_MAPPING is restrictive, and only applicable to Fargate compute environments. #34): resolves the queue's compute environment and runs platform-specific vCPU/memory checks. Previously all jobs were validated against Fargate's strict resource matrix, which blocked legitimate EC2 requests. Real AWS lookup errors (botocore.ClientError) now propagate as WorkflowError instead of silently falling back to EC2.
  • vCPU from threads — derived from self.job.threads (the documented Snakemake directive) instead of the undocumented _cores resource, with _cores retained as fallback.
  • Fargate memory floor — when the requested memory is not a valid Fargate value, pick the smallest valid memory >= mem instead of silently shrinking to min(valid_mems) and OOMing the job. Raise if no valid value exists.
  • Tag validation — reject empty tag keys and >50 tags before calling submit_job, matching AWS Batch limits.

Known limitation

Fargate job-definition registration is not yet supported. Fargate requires executionRoleArn, prohibits privileged: True, and does not allow GPU resources — none of which this plugin currently emits conditionally. build_job_definition raises a clear WorkflowError when a Fargate queue is detected, replacing an opaque AWS-side rejection with a local message. EC2 queues are unaffected.

Compatibility

  • auto_deploy_default_storage_provider=False so workers don't pip install an unpinned snakemake-storage-plugin-s3 at startup — s3-plugin 0.3.x metadata requires snakemake 9.x and breaks 8.x workers. Image maintainers pre-install a compatible plugin version.
  • snakemake-storage-plugin-s3 pin widened to >=0.2,<0.4 so 8.x consumers can fall back to 0.2.x.

Dependencies

Stacked on #38 (fix-vcpu-threads-and-validation) — the first three commits here are the same as #38's, so #38 should land first (or be closed in favor of this PR).

Test plan

  • 31 unit tests in tests/test_batch_job_builder.py:
    • Tag assembly + propagation to submit_job (settings, env var, merge, override, empty cases)
    • Tag validation (empty keys from env or settings, 50-tag boundary)
    • Platform detection (ClientError re-raises, empty-response fallback, Fargate detection)
    • Fargate resource validation (smallest valid ≥ requested, exact match, request exceeds maximum, invalid vCPU)
    • Fargate rejection in build_job_definition
  • All existing tests still pass

Summary by CodeRabbit

  • New Features

    • Queue-aware job submission with per-job queue override, Fargate support, and job tagging with env-var overrides.
  • Behavior Changes

    • Platform-aware resource validation, vCPU/memory default adjustments (mem default 1024 MiB), EC2-only shared-memory support, and submission uses resolved job queue.
    • Disabled automatic default storage provider deployment.
  • Documentation

    • Expanded AWS Batch executor guidance, tagging, queue overrides, and shared-memory notes.
  • Tests

    • Comprehensive unit tests for builder, tagging, platform detection, resources, and timeouts.

@coderabbitai

coderabbitai Bot commented Apr 12, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@nh13, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 48 minutes and 13 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a4ae4ede-50d8-43dd-8386-81c58cb34102

📥 Commits

Reviewing files that changed from the base of the PR and between afe9402 and dcd7f15.

⛔ Files ignored due to path filters (1)
  • pyproject.toml is excluded by !pyproject.toml
📒 Files selected for processing (6)
  • docs/further.md
  • snakemake_executor_plugin_aws_batch/__init__.py
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py
  • tests/test_batch_job_builder.py
  • tests/tests_mocked_api.py
📝 Walkthrough

Walkthrough

Adds BatchClient passthroughs; extends BatchJobBuilder to pick per-job job_queue, detect EC2 vs FARGATE via AWS, perform platform-aware resource validation, set platformCapabilities, merge tags from settings and an env var, adjust shared-memory handling and defaults, update docs, and add comprehensive unit tests.

Changes

Batcher changes (single cohort)

Layer / File(s) Summary
BatchClient passthroughs
snakemake_executor_plugin_aws_batch/batch_client.py
Added describe_job_queues(**kwargs) and describe_compute_environments(**kwargs) forwarding to the boto3 client.
Builder init & platform detection
snakemake_executor_plugin_aws_batch/batch_job_builder.py
Constructor now sets self.job_queue from per-job resources["batch_queue"] (fallback to settings) and determines self.platform via AWS describe calls; empty responses default to EC2 and ClientError becomes WorkflowError.
Platform-aware validation & job-definition build
snakemake_executor_plugin_aws_batch/batch_job_builder.py
Adds Fargate/EC2 validators; vCPU prefers job.threads, default memory 1024 MiB; supports EC2-only shared_memory_size_mb mapping to linuxParameters.sharedMemorySize; registers job definitions with platformCapabilities=[self.platform].
Tag assembly & submit behavior
snakemake_executor_plugin_aws_batch/batch_job_builder.py
Introduces _build_job_tags() to merge settings.tags with SNAKEMAKE_AWS_BATCH_JOB_TAGS (env wins, supports values with =, tolerates extra commas, rejects empty keys, max 50 tags); submit() uses resolved job_queue and includes tags only when non-empty.
Docs & init change
docs/further.md, snakemake_executor_plugin_aws_batch/__init__.py
Docs expanded with image requirements, per-rule queues, shared-memory, and tags; common_settings.auto_deploy_default_storage_provider set to False; minor logging key fix.
Tests
tests/test_batch_job_builder.py, tests/tests_mocked_api.py
Adds comprehensive unit tests for tag parsing/merging/limits, platform detection, resource validators, shared-memory behavior, job-definition registration/submit differences, and task timeout; tests mocked API short-circuits platform detection to EC2 in CI.

Sequence Diagram(s)

sequenceDiagram
    participant Env as Env/Settings
    participant Builder as BatchJobBuilder
    participant Client as BatchClient
    participant AWS as AWS Batch

    Env->>Builder: provide settings, job, env-var tags
    Builder->>Client: describe_job_queues(...)
    Client->>AWS: DescribeJobQueues API
    AWS-->>Client: queue description
    Client-->>Builder: queue info
    alt computeEnvironment referenced
        Builder->>Client: describe_compute_environments(...)
        Client->>AWS: DescribeComputeEnvironments API
        AWS-->>Client: compute env description
        Client-->>Builder: compute env info
    end
    Builder->>Builder: determine platform (EC2 / FARGATE) and validate resources
    Builder->>AWS: register_job_definition(platformCapabilities=[Builder.platform], ...)
    AWS-->>Builder: job definition ARN
    Builder->>Builder: _build_job_tags() merge settings + env tags
    Builder->>Client: submit_job(..., jobQueue=Builder.job_queue, tags=merged?)
    Client->>AWS: SubmitJob API (with tags if present)
    AWS-->>Client: submit response
    Client-->>Builder: submit response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the three main features added: per-rule queues, per-job tags, and platform-aware validation for EC2/Fargate.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (1)

179-205: ⚠️ Potential issue | 🔴 Critical

Fargate job definitions cannot use EC2-only container settings.

Line 215 sets platformCapabilities=[self.platform], which can be FARGATE when detected from the job queue. However, the container_properties dict (lines 179–205) violates AWS Fargate requirements:

  • executionRoleArn is required but never supplied
  • privileged=True is not supported; Fargate requires it omitted or false
  • GPU resourceRequirements (lines 198–204) are unsupported on Fargate

The AWS Batch API will reject Fargate job registrations with these violations. Platform-specific container-property assembly is needed before Fargate can work.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 179 -
205, The container_properties assembly (container_properties dict in
batch_job_builder.py used when creating job definitions) is not Fargate-safe:
when platformCapabilities is set to self.platform (FARGATE) you must include
executionRoleArn, remove or set privileged to False, and omit GPU
resourceRequirements; update the logic that builds container_properties and the
call site that sets platformCapabilities to make platform-specific adjustments
(detect self.platform == "FARGATE" and then add executionRoleArn from settings,
delete or set privileged to False, and skip adding
BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU entries) so Fargate job registrations
meet AWS requirements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 123-133: The current logic picks min(valid_mems) which can
under-provision memory; instead, in the block that computes valid_mems from
VALID_RESOURCES_MAPPING for the given vcpu, filter for candidates >= mem and
pick the smallest of those (e.g., candidate_mem = min(m for m in valid_mems if m
>= mem)); if no candidate exists raise a WorkflowError explaining the requested
mem is too large for that vCPU; update the logger.warning to report the chosen
candidate_mem (and original mem/vcpu) rather than min_mem. Ensure this change
touches the same symbols: VALID_RESOURCES_MAPPING, valid_mems, vcpu, mem,
WorkflowError, and the logger.warning call.
- Around line 105-109: The broad except block that logs and returns
BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value must not silently downgrade on all
errors; instead, remove the unconditional return in the except (or change it to
re-raise) so errors propagate (fail fast) while retaining existing explicit
empty-response fallbacks elsewhere; update the except handling around the
queue/compute-environment lookup in the method (the block that uses
self.logger.warning and BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value) to either
re-raise the caught exception after logging or catch and rethrow only
non-empty-response-safe errors (e.g., inspect/handle specific AWS client
errors), ensuring only the explicit empty-response branches keep the EC2
fallback.

---

Outside diff comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 179-205: The container_properties assembly (container_properties
dict in batch_job_builder.py used when creating job definitions) is not
Fargate-safe: when platformCapabilities is set to self.platform (FARGATE) you
must include executionRoleArn, remove or set privileged to False, and omit GPU
resourceRequirements; update the logic that builds container_properties and the
call site that sets platformCapabilities to make platform-specific adjustments
(detect self.platform == "FARGATE" and then add executionRoleArn from settings,
delete or set privileged to False, and skip adding
BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU entries) so Fargate job registrations
meet AWS requirements.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: adee3872-6ca3-4367-b682-4abbb205004a

📥 Commits

Reviewing files that changed from the base of the PR and between 2525ca0 and 71cbeea.

📒 Files selected for processing (3)
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py
  • tests/test_batch_job_builder.py

Comment thread snakemake_executor_plugin_aws_batch/batch_job_builder.py Outdated
Comment thread snakemake_executor_plugin_aws_batch/batch_job_builder.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (1)

185-221: ⚠️ Potential issue | 🔴 Critical

Add platform-aware container properties for Fargate compatibility.

The job definition construction at line 185 unconditionally sets privileged: True, omits executionRoleArn, and allows GPU resources, but line 221 can send platformCapabilities=["FARGATE"]. This causes Fargate job definition registration to fail: AWS Batch Fargate requires an execution role, prohibits privileged mode, and does not support GPU. The code already validates Fargate vs. EC2 resource requirements but does not handle these container property differences.

Conditionally set executionRoleArn for Fargate, omit or set privileged to false for Fargate (true only for EC2), and reject GPU requests on Fargate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 185 -
221, The container_properties currently set privileged=True, omit
executionRoleArn, and may include GPU requirements even when self.platform ==
"FARGATE"; update the job-definition assembly in batch_job_builder.py so that
inside the block building container_properties you (1) only set privileged=True
for non-FARGATE platforms and set privileged=False (or omit) when self.platform
== self.platform value for FARGATE, (2) add
executionRoleArn=self.settings.execution_role (or settings.executionRoleArn) to
container_properties when self.platform == FARGATE, and (3) guard appending a
GPU resource (BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU) — raise/return an error
if gpu > 0 and self.platform == FARGATE to reject unsupported GPU on Fargate;
ensure these changes are honored before calling
self.batch_client.register_job_definition so register_job_definition receives
the correct platform-aware containerProperties and tags.
♻️ Duplicate comments (2)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (2)

129-140: ⚠️ Potential issue | 🟠 Major

Don’t reduce Fargate memory below the requested amount.

Line 135 can turn vcpu=1, mem=5000 into 2048, under-provisioning the job and causing avoidable OOM failures. Pick the smallest valid memory >= mem, or fail when the request exceeds the maximum for that vCPU.

Proposed fix
             valid_mems = [m for m, v in VALID_RESOURCES_MAPPING.items() if vcpu in v]
             if not valid_mems:
                 raise WorkflowError(
                     f"Invalid vCPU value {vcpu} for Fargate. "
                     f"Check valid Fargate resource configurations."
                 )
-            min_mem = min(valid_mems)
+            candidate_mems = [m for m in valid_mems if m >= mem]
+            if not candidate_mems:
+                raise WorkflowError(
+                    f"Memory value {mem} MB exceeds the maximum allowed for "
+                    f"vCPU {vcpu} on Fargate."
+                )
+            min_mem = min(candidate_mems)
             self.logger.warning(
                 f"Memory value {mem} MB is invalid for vCPU {vcpu} on Fargate. "
-                f"Setting memory to minimum allowed value {min_mem} MB."
+                f"Setting memory to next allowed value {min_mem} MB."
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 129 -
140, The current logic in batch_job_builder.py (inside the function that maps
Fargate vcpu/mem) picks min(valid_mems) which can under-provision requested mem;
change it to select the smallest entry in valid_mems that is >= mem (use vcpu,
mem, VALID_RESOURCES_MAPPING, valid_mems) and return that; if no valid_mems >=
mem exist, raise WorkflowError indicating the requested memory exceeds the
maximum allowed for that vCPU rather than silently lowering it; keep the
existing logging (self.logger.warning) but update the message to reflect that
memory was bumped up to the selected valid value, and ensure returned values
remain str(vcpu), str(selected_mem).

111-115: ⚠️ Potential issue | 🟠 Major

Don’t silently downgrade platform lookup failures to EC2.

Line 111 still catches every exception from the AWS queue/environment lookup and turns it into an EC2 fallback. That can hide IAM/configuration/transient AWS failures and register the wrong job definition; keep the explicit empty-response fallbacks, but re-raise real lookup errors.

Proposed fix
-        except Exception as e:
-            self.logger.warning(
-                f"Failed to determine platform from queue: {e}. Defaulting to EC2."
-            )
-            return BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value
+        except Exception as e:
+            raise WorkflowError(
+                f"Failed to determine platform from queue {self.job_queue}: {e}"
+            ) from e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 111 -
115, The current broad except in the platform lookup swallows all errors and
returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value; instead, change the error
handling in the platform-lookup routine so you only fall back to EC2 when the
lookup explicitly returns an empty/no-result value, and re-raise real AWS/client
errors. Concretely: remove the generic "except Exception" around the AWS
queue/environment lookup, or narrow it to only catch the expected "no result"
condition (e.g., check for None/empty response) and call
self.logger.warning(...) then return EC2; for any other exceptions from the AWS
client (botocore.exceptions.BotoCoreError/ClientError or unexpected errors),
re-raise them so they surface to callers rather than being silenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 237-247: The merged tags builder currently accepts empty keys
(e.g., "=value") and doesn't enforce AWS Batch limits; update the tag assembly
in the method that builds/returns tags (the block using
SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR and variable tags) to validate after
merging: trim key and value, skip or reject any entries where the key is empty
(key.strip() == ""), and enforce a maximum of 50 tags (len(tags) <= 50); if
validation fails, raise a clear ValueError describing the problem so the caller
(before submit_job) fails fast with a helpful message.

---

Outside diff comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 185-221: The container_properties currently set privileged=True,
omit executionRoleArn, and may include GPU requirements even when self.platform
== "FARGATE"; update the job-definition assembly in batch_job_builder.py so that
inside the block building container_properties you (1) only set privileged=True
for non-FARGATE platforms and set privileged=False (or omit) when self.platform
== self.platform value for FARGATE, (2) add
executionRoleArn=self.settings.execution_role (or settings.executionRoleArn) to
container_properties when self.platform == FARGATE, and (3) guard appending a
GPU resource (BATCH_JOB_RESOURCE_REQUIREMENT_TYPE.GPU) — raise/return an error
if gpu > 0 and self.platform == FARGATE to reject unsupported GPU on Fargate;
ensure these changes are honored before calling
self.batch_client.register_job_definition so register_job_definition receives
the correct platform-aware containerProperties and tags.

---

Duplicate comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 129-140: The current logic in batch_job_builder.py (inside the
function that maps Fargate vcpu/mem) picks min(valid_mems) which can
under-provision requested mem; change it to select the smallest entry in
valid_mems that is >= mem (use vcpu, mem, VALID_RESOURCES_MAPPING, valid_mems)
and return that; if no valid_mems >= mem exist, raise WorkflowError indicating
the requested memory exceeds the maximum allowed for that vCPU rather than
silently lowering it; keep the existing logging (self.logger.warning) but update
the message to reflect that memory was bumped up to the selected valid value,
and ensure returned values remain str(vcpu), str(selected_mem).
- Around line 111-115: The current broad except in the platform lookup swallows
all errors and returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value; instead,
change the error handling in the platform-lookup routine so you only fall back
to EC2 when the lookup explicitly returns an empty/no-result value, and re-raise
real AWS/client errors. Concretely: remove the generic "except Exception" around
the AWS queue/environment lookup, or narrow it to only catch the expected "no
result" condition (e.g., check for None/empty response) and call
self.logger.warning(...) then return EC2; for any other exceptions from the AWS
client (botocore.exceptions.BotoCoreError/ClientError or unexpected errors),
re-raise them so they surface to callers rather than being silenced.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2631aa04-9e9f-490a-8474-163c63b38e71

📥 Commits

Reviewing files that changed from the base of the PR and between 71cbeea and f1943b7.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py

Comment thread snakemake_executor_plugin_aws_batch/batch_job_builder.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (3)

248-258: ⚠️ Potential issue | 🟡 Minor

Validate tag keys and count before returning.

The current implementation accepts empty keys from malformed input like =value or ,=value,. Additionally, AWS Batch enforces a 50-tag limit per job. Consider validating that keys are non-empty and the total count doesn't exceed 50 to fail fast with a clear error.

Proposed fix
         env_tags_str = os.environ.get(SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR, "")
         if env_tags_str:
             for pair in env_tags_str.split(","):
                 pair = pair.strip()
                 if "=" in pair:
                     key, _, value = pair.partition("=")
-                    tags[key.strip()] = value.strip()
+                    key = key.strip()
+                    if not key:
+                        raise WorkflowError(
+                            f"Invalid {SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR}: "
+                            "tag key cannot be empty."
+                        )
+                    tags[key] = value.strip()
+
+        if len(tags) > 50:
+            raise WorkflowError(
+                f"AWS Batch jobs support at most 50 tags, got {len(tags)}."
+            )

         return tags
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 248 -
258, Validate and reject empty tag keys and enforce AWS Batch's 50-tag limit
before returning tags: after copying self.settings.tags into tags and while
parsing env_tags_str (using SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR), strip the key
and if key is empty raise a clear ValueError (or similar) indicating malformed
tag like "=value"; also validate existing keys in self.settings.tags (strip and
error on any empty key). After merging all tags, if len(tags) > 50 raise a clear
ValueError indicating the tag count exceeds AWS Batch's 50-tag limit; return
tags only if all validations pass.

111-115: ⚠️ Potential issue | 🟠 Major

Broad exception handling masks real configuration errors.

The catch-all Exception handler defaults to EC2 for any failure, including permission errors (AccessDeniedException), validation errors, and transient AWS failures. This can turn real configuration issues into silent fallbacks, pushing the actual error to a later register/submit call where it's harder to diagnose.

Consider catching specific boto3/botocore exceptions (e.g., ClientError) and re-raising unexpected errors to fail fast, while keeping the EC2 fallback only for the explicit empty-response branches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 111 -
115, Replace the broad except Exception handler around the logic that returns
BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value with targeted exception handling:
import and catch botocore.exceptions.ClientError (and any specific AWS
exceptions you expect, e.g., AccessDeniedException) and handle only the cases
where the AWS response is empty or indicates "no platform" by returning the EC2
fallback; for other ClientError codes (and for unexpected exceptions) re-raise
the exception so the caller fails fast instead of silently defaulting, and keep
the EC2 fallback limited to the explicit empty-response branch in the method
that determines platform from the queue.

129-140: ⚠️ Potential issue | 🟠 Major

Memory fallback may under-provision the container.

When the requested memory isn't in VALID_RESOURCES_MAPPING, line 135 picks min(valid_mems), which could be significantly lower than requested. For example, vcpu=1, mem=5000 gets rewritten to 2048, potentially causing OOM failures.

Consider picking the smallest valid memory that is >= mem, or raising an error when no valid memory can satisfy the request.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py` around lines 129 -
140, The current fallback chooses min(valid_mems) which may under-provision
(e.g., vcpu=1, mem=5000 -> 2048); update the logic in the block that computes
valid_mems (using VALID_RESOURCES_MAPPING) to select the smallest valid memory
>= requested mem (e.g., candidate = min(m for m in valid_mems if m >= mem)) and
return that, and if no such candidate exists raise a WorkflowError explaining
that no valid Fargate memory satisfies the requested mem for the given vcpu;
keep the return format str(vcpu), str(selected_mem).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 248-258: Validate and reject empty tag keys and enforce AWS
Batch's 50-tag limit before returning tags: after copying self.settings.tags
into tags and while parsing env_tags_str (using
SNAKEMAKE_AWS_BATCH_JOB_TAGS_ENV_VAR), strip the key and if key is empty raise a
clear ValueError (or similar) indicating malformed tag like "=value"; also
validate existing keys in self.settings.tags (strip and error on any empty key).
After merging all tags, if len(tags) > 50 raise a clear ValueError indicating
the tag count exceeds AWS Batch's 50-tag limit; return tags only if all
validations pass.
- Around line 111-115: Replace the broad except Exception handler around the
logic that returns BATCH_JOB_PLATFORM_CAPABILITIES.EC2.value with targeted
exception handling: import and catch botocore.exceptions.ClientError (and any
specific AWS exceptions you expect, e.g., AccessDeniedException) and handle only
the cases where the AWS response is empty or indicates "no platform" by
returning the EC2 fallback; for other ClientError codes (and for unexpected
exceptions) re-raise the exception so the caller fails fast instead of silently
defaulting, and keep the EC2 fallback limited to the explicit empty-response
branch in the method that determines platform from the queue.
- Around line 129-140: The current fallback chooses min(valid_mems) which may
under-provision (e.g., vcpu=1, mem=5000 -> 2048); update the logic in the block
that computes valid_mems (using VALID_RESOURCES_MAPPING) to select the smallest
valid memory >= requested mem (e.g., candidate = min(m for m in valid_mems if m
>= mem)) and return that, and if no such candidate exists raise a WorkflowError
explaining that no valid Fargate memory satisfies the requested mem for the
given vcpu; keep the return format str(vcpu), str(selected_mem).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 528b17b0-82e7-4cf6-816a-3a455ee9968b

📥 Commits

Reviewing files that changed from the base of the PR and between 0627c08 and c736516.

⛔ Files ignored due to path filters (1)
  • pyproject.toml is excluded by !pyproject.toml
📒 Files selected for processing (1)
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py

@nh13 nh13 changed the title feat: pass tags to submit_job and support dynamic env var tags feat: per-rule queues, per-job tags, EC2/Fargate-aware validation Apr 29, 2026
@nh13 nh13 force-pushed the feat/submit-job-tags branch from c736516 to 31b0253 Compare April 29, 2026 06:46

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@snakemake_executor_plugin_aws_batch/batch_job_builder.py`:
- Around line 118-120: The error messages in batch_job_builder.py reference
self.settings.job_queue but the actual queue used for a job can be overridden
and stored in self.job_queue; update the two places (around the
platform-determination and the second error block) to use self.job_queue instead
of self.settings.job_queue so diagnostics show the resolved per-job queue and
not the default settings value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 822c07d0-b3f7-4862-864a-9c0d2861a654

📥 Commits

Reviewing files that changed from the base of the PR and between c736516 and 31b0253.

⛔ Files ignored due to path filters (1)
  • pyproject.toml is excluded by !pyproject.toml
📒 Files selected for processing (4)
  • snakemake_executor_plugin_aws_batch/__init__.py
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py
  • tests/test_batch_job_builder.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • tests/test_batch_job_builder.py

Comment thread snakemake_executor_plugin_aws_batch/batch_job_builder.py Outdated
@nh13

nh13 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

@johanneskoester or others, this PR is ready, and I'd be glad to added as a maintainer if that would be helpful, as I do have some further improvements

@nh13

nh13 commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

CC: @radusuciu

@radusuciu

Copy link
Copy Markdown

@nh13 I like many of these changes and think they're necessary for a truly functional executor plugin. I implemented many of the same in my fork which I used in production and in my separate project that runs the executor itself on AWS Batch (see: https://github.com/radusuciu/snakemake-executor-plugin-aws-basic-batch). However, If I were the maintainer, I'd probably prefer to see separate PRs. Point is moot though as maintainer is MIA..

@cademirch

Copy link
Copy Markdown

@nh13 and @radusuciu I'm happy to step in here and help get this merged. I'll take a look at this soon. Thanks for the PR.

@cademirch cademirch self-requested a review June 7, 2026 00:33
@cademirch cademirch self-assigned this Jun 7, 2026
@nh13

nh13 commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks @cademirch. I do have other PRs that I'd like to contribute, would you be open to that? I can make one at a time, with an issue before hand for your review? I want to be aware of the maintenance burden this places on the maintainers, so let me know what works for you.

@cademirch

Copy link
Copy Markdown

Thanks @cademirch. I do have other PRs that I'd like to contribute, would you be open to that? I can make one at a time, with an issue before hand for your review? I want to be aware of the maintenance burden this places on the maintainers, so let me know what works for you.

Separate PRs are good. Just put me ad reviewer so it ends up in my inbox!

Thanks @nh13. Also, if you're interested in helping maintain this plugin let's chat.

@nh13

nh13 commented Jun 7, 2026

Copy link
Copy Markdown
Collaborator Author

@cademirch I'd welcome helping out here!

cademirch pushed a commit that referenced this pull request Jun 7, 2026
`task_timeout` defaults to 300 and is applied unconditionally in
`build_job_definition`, so by default every Batch job is killed at 5
minutes — and `Optional[int]` is a lie, since "no timeout" is
inexpressible. Typical bioinformatics jobs run hours; the default
behavior is a workflow that mysteriously fails partway with terminated
jobs.

This changes the default to `None`, omits the `timeout` kwarg from
`register_job_definition` when unset (AWS then applies no timeout), and
validates the AWS minimum of 60s locally with a clear `WorkflowError`
(naming `--aws-batch-task-timeout`) instead of a Batch API rejection.
Help text updated.

Tests (new `tests/test_batch_job_builder.py`, added to the mocked-API CI
job): default omits the key; set includes it; exactly 60 passes; 59, 1,
0, and -1 raise pre-API with `register_job_definition` never called.

**Coordination note:** this overlaps with my open #40, which also
creates `tests/test_batch_job_builder.py` and edits the same CI line.
Suggested order: land this small fix first; #40 then rebases on top
(I'll handle the rebase, folding the test files together).

Follow-up planned: a per-rule `aws_batch_task_timeout` resource, kept
separate for review granularity.

---
*This PR was prepared with the assistance of coding agents; all changes
were human-reviewed before submission.*

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added validation for task timeout with AWS minimum of 60 seconds

* **Bug Fixes**
  * Task timeout now defaults to None and is conditionally applied

* **Tests**
  * Added comprehensive batch job builder test suite

* **Chores**
  * Updated CI pipeline to run additional batch job builder tests

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@johanneskoester

Copy link
Copy Markdown
Collaborator

Thanks a lot Nils! I have now invited you as a co-maintainer.

nh13 added a commit to nh13/snakemake-executor-plugin-aws-batch that referenced this pull request Jun 9, 2026
Squashed from the original 10-commit PR snakemake#40 stack for a clean rebase onto
main (which advanced via snakemake#41's conditional task-timeout). Covers: per-rule
batch_queue override, per-job tags propagated to submit_job + job definition
with dynamic env-var tags, EC2/Fargate platform detection from the queue with
fail-fast Fargate rejection, threads-based vCPU allocation, shared_memory_size_mb
linuxParameters, and resource validation fixes (snakemake#34).
@nh13 nh13 force-pushed the feat/submit-job-tags branch from f66fc8a to afe9402 Compare June 9, 2026 06:55
@nh13

nh13 commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai resume

@nh13

nh13 commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Reviews resumed.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/further.md`:
- Around line 120-122: The fenced code block containing the environment variable
example (the line starting with export
SNAKEMAKE_AWS_BATCH_JOB_TAGS="run_id=...") lacks a language specifier; update
the fence to include a shell language (e.g., use ```bash or ```sh) so the code
block is properly annotated and MD040 is resolved.
- Around line 131-145: Remove the duplicated, incomplete "Shared Memory
(/dev/shm) Sizing" fragment that begins with the example "rule align:" and ends
with the truncated "This s"; either delete that entire repeated block (the code
fence plus the trailing text) or replace it with a short comment like <!--
Removed duplicate/incomplete section; covered above in "Shared Memory
(`/dev/shm`)" --> so the documentation no longer contains the duplicated example
and the mid-sentence fragment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8378d5f0-188e-446e-aa14-c795d1ee1f1c

📥 Commits

Reviewing files that changed from the base of the PR and between c736516 and afe9402.

⛔ Files ignored due to path filters (1)
  • pyproject.toml is excluded by !pyproject.toml
📒 Files selected for processing (6)
  • docs/further.md
  • snakemake_executor_plugin_aws_batch/__init__.py
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py
  • tests/test_batch_job_builder.py
  • tests/tests_mocked_api.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • snakemake_executor_plugin_aws_batch/batch_client.py
  • snakemake_executor_plugin_aws_batch/init.py
  • snakemake_executor_plugin_aws_batch/batch_job_builder.py

Comment thread docs/further.md Outdated
Comment thread docs/further.md Outdated
Squashed from the original 10-commit PR snakemake#40 stack for a clean rebase onto
main (which advanced via snakemake#41's conditional task-timeout). Covers: per-rule
batch_queue override, per-job tags propagated to submit_job + job definition
with dynamic env-var tags, EC2/Fargate platform detection from the queue with
fail-fast Fargate rejection, threads-based vCPU allocation, shared_memory_size_mb
linuxParameters, and resource validation fixes (snakemake#34).
@nh13 nh13 force-pushed the feat/submit-job-tags branch from afe9402 to dcd7f15 Compare June 9, 2026 07:12
@nh13 nh13 merged commit 1d7d6e6 into snakemake:main Jun 9, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

VALID_RESOURCES_MAPPING is restrictive, and only applicable to Fargate compute environments.

4 participants