fix: issue with resource validation assuming Fargate (#34)#37
Conversation
As described in issue snakemake#34, validation assumed a Fargate compute environment. We can detect the compute environment by looking up the queue, and only run the Fargate specific validation when necessary.
📝 WalkthroughWalkthroughThe changes introduce platform detection logic to identify whether AWS Batch job queues use EC2 or Fargate compute environments, with corresponding platform-specific resource validation. Two new pass-through methods are added to BatchClient for AWS Batch describe operations. Changes
Sequence Diagram(s)sequenceDiagram
participant BatchJobBuilder
participant AWS as AWS Batch API
participant Validator as Resource Validator
BatchJobBuilder->>+AWS: _get_platform_from_queue()
AWS-->>-BatchJobBuilder: describe_job_queues response
BatchJobBuilder->>+AWS: describe_compute_environments
AWS-->>-BatchJobBuilder: compute environment config
alt computeResources.type is FARGATE/FARGATE_SPOT
BatchJobBuilder->>BatchJobBuilder: platform = FARGATE
else
BatchJobBuilder->>BatchJobBuilder: platform = EC2
end
Note over BatchJobBuilder: Platform determined and stored
BatchJobBuilder->>+Validator: _validate_resources(vcpu, mem)
alt platform == FARGATE
Validator->>Validator: _validate_fargate_resources()
Note over Validator: Check vCPU/memory<br/>combos against<br/>Fargate constraints
else
Validator->>Validator: _validate_ec2_resources()
Note over Validator: Enforce vcpu >= 1<br/>and mem >= 1024 MiB
end
Validator-->>-BatchJobBuilder: validated vcpu, mem
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches✅ Passed checks (5 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
snakemake_executor_plugin_aws_batch/batch_client.py(1 hunks)snakemake_executor_plugin_aws_batch/batch_job_builder.py(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
Files:
snakemake_executor_plugin_aws_batch/batch_job_builder.pysnakemake_executor_plugin_aws_batch/batch_client.py
🧬 Code graph analysis (1)
snakemake_executor_plugin_aws_batch/batch_job_builder.py (2)
snakemake_executor_plugin_aws_batch/batch_client.py (2)
describe_job_queues(76-83)describe_compute_environments(85-92)snakemake_executor_plugin_aws_batch/constant.py (1)
BATCH_JOB_PLATFORM_CAPABILITIES(57-59)
🪛 Ruff (0.14.4)
snakemake_executor_plugin_aws_batch/batch_job_builder.py
102-102: Do not catch blind exception: Exception
(BLE001)
118-118: Avoid specifying long messages outside the exception class
(TRY003)
134-134: Avoid specifying long messages outside the exception class
(TRY003)
136-136: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (8)
snakemake_executor_plugin_aws_batch/batch_client.py (1)
76-92: LGTM! Clean pass-through wrappers.Both methods follow the established pattern in the class and enable the downstream platform detection logic.
snakemake_executor_plugin_aws_batch/batch_job_builder.py (7)
33-34: Good approach to detect platform early.Determining the platform during initialization enables fail-fast behavior and consistent platform-specific validation throughout the job lifecycle.
102-106: Broad exception catch is acceptable here.While the static analysis tool flags the broad
Exceptioncatch, the fallback to EC2 with logging is appropriate for robustness during initialization. EC2 is the safer default as it has more flexible resource constraints than Fargate.
127-137: EC2 validation appropriately minimal.The basic sanity checks (vcpu >= 1, memory >= 1024 MiB) are appropriate for EC2, which supports flexible resource allocation across various instance types. The 1024 MiB minimum aligns with the default memory value change.
139-150: Clean platform-aware routing.The validation routing correctly delegates to platform-specific validators based on the detected platform. The integer conversion ensures consistent handling across both validators.
160-160: Verify the default memory reduction is intentional.The default memory was reduced from 2048 MiB to 1024 MiB. While 1024 MiB is the minimum for EC2, this behavioral change could affect existing users who rely on the previous default.
Please confirm:
- Is this default change intentional and documented in release notes?
- Are there any implications for existing jobs that don't explicitly specify memory?
205-205: Core fix: dynamic platform capability.Excellent! This change from a hardcoded EC2 value to the dynamically determined
self.platformis the key fix that enables proper platform-specific behavior for both EC2 and Fargate environments.
70-71: ****The assumption in the review comment is invalid. AWS Batch does not permit mixing EC2 and FARGATE compute environments in the same job queue — all compute environments attached to a single queue must use the same provisioning model (EC2/Spot or FARGATE/FARGATE_SPOT) and architecture. Since the queue's compute environments are guaranteed to have the same platform, checking only the first one is correct and sufficient.
Likely an incorrect or invalid review comment.
| def _validate_fargate_resources(self, vcpu: int, mem: int) -> tuple[str, str]: | ||
| """Validates vcpu and memory conform to Fargate requirements. | ||
|
|
||
| Fargate requires strict memory/vCPU combinations. | ||
| https://docs.aws.amazon.com/batch/latest/userguide/fargate.html | ||
| """ | ||
| if mem in VALID_RESOURCES_MAPPING: | ||
| if vcpu in VALID_RESOURCES_MAPPING[mem]: | ||
| return str(vcpu), str(mem) | ||
| else: | ||
| raise WorkflowError(f"Invalid vCPU value {vcpu} for memory {mem} MB") | ||
| raise WorkflowError(f"Invalid vCPU value {vcpu} for memory {mem} MB on Fargate") | ||
| else: | ||
| min_mem = min([m for m, v in VALID_RESOURCES_MAPPING.items() if vcpu in v]) | ||
| self.logger.warning( | ||
| f"Memory value {mem} MB is invalid for vCPU {vcpu}." | ||
| f"Memory value {mem} MB is invalid for vCPU {vcpu} on Fargate. " | ||
| f"Setting memory to minimum allowed value {min_mem} MB." | ||
| ) | ||
| return str(vcpu), str(min_mem) |
There was a problem hiding this comment.
Handle edge case: invalid vCPU with no valid memory mapping.
Line 120 could raise ValueError if the vcpu value doesn't exist in any memory mapping. This happens when both vcpu and mem are invalid.
Consider adding a check before line 120:
else:
+ # Check if vcpu exists in any memory mapping
+ valid_memories = [m for m, v in VALID_RESOURCES_MAPPING.items() if vcpu in v]
+ if not valid_memories:
+ raise WorkflowError(
+ f"Invalid vCPU value {vcpu} for Fargate. "
+ f"Valid vCPU values are: {sorted(set(v for values in VALID_RESOURCES_MAPPING.values() for v in values))}"
+ )
- min_mem = min([m for m, v in VALID_RESOURCES_MAPPING.items() if vcpu in v])
+ min_mem = min(valid_memories)
self.logger.warning(🧰 Tools
🪛 Ruff (0.14.4)
118-118: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In snakemake_executor_plugin_aws_batch/batch_job_builder.py around lines 108 to
125, when mem is not a key in VALID_RESOURCES_MAPPING the current code computes
min_mem = min([m for m, v in VALID_RESOURCES_MAPPING.items() if vcpu in v])
which will raise a ValueError if the provided vcpu does not appear in any
mapping; add a guard: build the list of candidate memory values first, if it's
empty raise a WorkflowError with a clear message about invalid vCPU and memory
for Fargate, otherwise take min(candidate_memories), log the adjustment as
before and return str(vcpu), str(min_mem).
|
I am hitting this issue. It was a big surprise to use aws batch and memory scaled down when requesting one cpu. |
|
@radusuciu — your fix landed on #40 as commit |
nh13
left a comment
There was a problem hiding this comment.
Looking good so far! A few things before merge:
- Cache the detection.
_get_platform_from_queue()runs in__init__, andBatchJobBuilderis constructed once per job inrun_job, so every job makes two Batch describe calls, which is redundant and a throttling risk on large workflows. Memoize per queue. - Add tests. The new validators and platform detection are completely uncovered —
tests_mocked_api.pypatchesBatchJobBuilder.submitout entirely. Worth adding cases (via moto or a mockedbatch_client) for: EC2 accepts a combo Fargate rejects, Fargate still enforces the strict matrix, and the fallback-to-EC2 path. - Default memory 2048 → 1024 looks unrelated to the validation fix. Can you explain it in the description or split it out?
- The bare
except Exceptionsilently defaults to EC2, which could let a Fargate job through that AWS then rejects, Consider narrowing tobotocore.exceptions.ClientError(needs importing). - Minor: in
_validate_fargate_resources,min([... if vcpu in v])raisesValueErrorrather than aWorkflowErrorwhenvcpuisn't one of{1, 2, 4, 8, 16}. - Can you fix merge conflicts?
Let me know if you'd like me to push any of the above fixes!
|
@nh13 this is good to close or you still want the above review comments addressed? |
This closes #34, the issue being that validation assumed a Fargate compute environment. We can detect the compute environment by looking up the queue, and only run the Fargate specific validation when necessary.
Tested and working with an EC2 queue only.
Summary by CodeRabbit
New Features
Changes