CodeBundle Design Spec — aws-sqs-dlq-health
Parent: codecollection-registry intake — SQS dead-letter investigation (issue #75)
Consumed by Creator SRE mode for rw-cli-codecollection implementation.
--- Identity ---
codebundle_name: "aws-sqs-dlq-health"
target_collection: "rw-cli-codecollection"
display_name: "AWS SQS Dead Letter Queue Health and Log Correlation"
author: "rw-codebundle-agent"
--- Purpose ---
purpose: |
Monitors Amazon SQS dead-letter queues, raises issues when messages accumulate
in a DLQ, samples recent dead-lettered messages for diagnostic attributes, and
correlates failures to CloudWatch Logs for common consumer patterns (especially
AWS Lambda event source mappings) so operators can determine why messages
were not processed successfully. Mirrors the operational pattern of
azure-servicebus-health (dead-letter counts plus investigation guidance) for AWS.
--- Tasks ---
Each task becomes one Robot Framework task + one bash script.
tasks:
-
name: "Check Dead Letter Queue Depth and Redrive Configuration"
description: |
Lists target queues (user-provided URLs or discovery by prefix), reads
queue attributes including RedrivePolicy, resolves the dead-letter queue
ARN, and compares the DLQ's ApproximateNumberOfMessages against
DEAD_LETTER_MESSAGE_THRESHOLD. Emits a structured JSON issue when the DLQ
depth exceeds the threshold.
script_name: "sqs_dlq_depth_and_redrive.sh"
expected_issue_severity: [2, 3]
access_level: "read-only"
data_type: "metrics"
-
name: "Sample Recent Dead Letter Messages for Diagnostics"
description: |
Receives a bounded sample of messages from each DLQ using a short visibility
timeout; whether sampled messages are returned to visibility or deleted
afterwards follows the policy documented in the README. Extracts message
attributes, send timestamps, and body snippets relevant to failure analysis
(including Lambda async failure payloads when present).
script_name: "sqs_dlq_sample_messages.sh"
expected_issue_severity: [3, 4]
access_level: "read-only"
data_type: "logs-bulk"
-
name: "Correlate DLQ to Lambda Consumer CloudWatch Logs"
description: |
For queues with Lambda event source mappings, resolves function ARNs,
identifies their CloudWatch log groups, and filters recent log events for "ERROR",
"Task timed out", and other processing-failure patterns that overlap the DLQ
incident window.
script_name: "sqs_dlq_lambda_consumer_logs.sh"
expected_issue_severity: [3, 4]
access_level: "read-only"
data_type: "logs-regexp"
-
name: "Collect Source Queue CloudWatch Metrics for Context"
description: |
Pulls CloudWatch metrics for the source queue (e.g. ApproximateAgeOfOldestMessage,
NumberOfMessagesSent, NumberOfMessagesDeleted) to distinguish backlog vs. poison
message patterns and enrich issue details.
script_name: "sqs_source_queue_metrics.sh"
expected_issue_severity: [4]
access_level: "read-only"
data_type: "metrics"
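The depth check in the first task can be sketched as follows. This is a minimal illustration in Python of the decision logic only, not the bash implementation the spec calls for. It assumes the attribute maps have the shape returned by the SQS GetQueueAttributes API (string values, with RedrivePolicy as a JSON string containing deadLetterTargetArn). The function names and the issue fields (title, severity, details) are hypothetical.

```python
import json


def resolve_dlq_arn(source_attributes):
    """Return the DLQ ARN named in a queue's RedrivePolicy, or None if absent."""
    raw_policy = source_attributes.get("RedrivePolicy")
    if not raw_policy:
        return None
    try:
        policy = json.loads(raw_policy)
    except json.JSONDecodeError:
        return None
    return policy.get("deadLetterTargetArn")


def dlq_depth_issue(dlq_arn, dlq_attributes, threshold=0, severity=2):
    """Return an issue dict when the DLQ depth exceeds the threshold, else None.

    The issue layout is a placeholder; the real bundle defines its own schema.
    """
    depth = int(dlq_attributes.get("ApproximateNumberOfMessages", "0"))
    if depth <= threshold:
        return None
    queue_name = dlq_arn.split(":")[-1]
    return {
        "title": f"Dead-letter queue {queue_name} holds {depth} messages",
        "severity": severity,
        "details": {"dlq_arn": dlq_arn, "depth": depth, "threshold": threshold},
    }


# Example with made-up queue attributes.
source = {
    "RedrivePolicy": json.dumps(
        {
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
            "maxReceiveCount": 5,
        }
    )
}
arn = resolve_dlq_arn(source)
issue = dlq_depth_issue(arn, {"ApproximateNumberOfMessages": "3"}, threshold=0)
```

With the default threshold of 0, any message in the DLQ produces an issue, which matches the default for DEAD_LETTER_MESSAGE_THRESHOLD.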
--- Scope ---
scope:
level: "Resource"
qualifiers:
- AWS_ACCOUNT_NAME
- AWS_REGION
- SQS_QUEUE_URL
iteration_pattern: |
One SLX per discovered SQS queue (or per explicitly configured queue URL list).
Qualifier: queue name or URL segment. When multiple queues share a DLQ, dedupe
DLQ checks so the same DLQ is not double-counted across SLXs (implementation
may group by DLQ ARN).
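One way to avoid double-counting a shared DLQ is to group source queues by DLQ ARN before running the depth check. The sketch below is illustrative Python, assuming the caller has already built a mapping from source queue URL to resolved DLQ ARN; the function name and input shape are hypothetical.

```python
from collections import defaultdict


def group_sources_by_dlq(redrive_map):
    """Group source queue URLs by the DLQ ARN they redrive to.

    redrive_map maps each source queue URL to its DLQ ARN, or to None when
    the queue has no RedrivePolicy. Queues without a DLQ are skipped.
    """
    grouped = defaultdict(list)
    for source_url, dlq_arn in redrive_map.items():
        if dlq_arn:
            grouped[dlq_arn].append(source_url)
    return {arn: sorted(urls) for arn, urls in grouped.items()}


# Example with made-up queues: two sources share one DLQ.
shared_dlq = "arn:aws:sqs:us-east-1:123456789012:shared-dlq"
groups = group_sources_by_dlq(
    {
        "https://sqs.us-east-1.amazonaws.com/123456789012/orders": shared_dlq,
        "https://sqs.us-east-1.amazonaws.com/123456789012/billing": shared_dlq,
        "https://sqs.us-east-1.amazonaws.com/123456789012/audit": None,
    }
)
```

Each key in the result is checked once, and the list of source queues can be attached to the issue details so operators see every producer path that feeds the DLQ.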
--- Resource Discovery ---
resource_types:
- "aws_sqs_queues"
generation_strategy: |
RunWhen Local discovers SQS queues in the linked AWS account/region. Match rules
align with queue name patterns or tags where supported. One SLX per matched queue;
config provides queue URL and region. If discovery is unavailable, user-supplied
SQS_QUEUE_URLS satisfies the same template.
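The fallback between explicit URLs and discovery could look like the sketch below. It is illustrative Python and assumes that discovered_urls stands in for whatever RunWhen Local discovery or a queue listing returns, and that the queue name is the last path segment of the queue URL. Both function names are hypothetical.

```python
def queue_name_from_url(queue_url):
    """Return the last path segment of a queue URL, which is the queue name."""
    return queue_url.rstrip("/").rsplit("/", 1)[-1]


def select_target_queues(configured_urls, discovered_urls, name_prefix=""):
    """Prefer explicitly configured URLs; otherwise filter discovered queues.

    configured_urls is the raw comma-separated SQS_QUEUE_URLS value. The
    prefix filter applies only when falling back to discovery.
    """
    explicit = [url.strip() for url in configured_urls.split(",") if url.strip()]
    if explicit:
        return explicit
    return [
        url
        for url in discovered_urls
        if queue_name_from_url(url).startswith(name_prefix)
    ]


# Example with made-up discovery results.
discovered = [
    "https://sqs.us-east-1.amazonaws.com/123456789012/orders",
    "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq",
    "https://sqs.us-east-1.amazonaws.com/123456789012/audit",
]
by_prefix = select_target_queues("", discovered, name_prefix="orders")
by_config = select_target_queues(" https://example.invalid/1/q1 ,", discovered)
```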
--- Configuration ---
env_vars:
-
name: AWS_REGION
description: "AWS region containing the queues"
required: true
-
name: AWS_ACCOUNT_NAME
description: "Account display name for reports"
required: false
default: "Unknown"
-
name: SQS_QUEUE_URLS
description: "Comma-separated queue URLs to analyze; leave empty to discover queues, optionally filtered by SQS_QUEUE_NAME_PREFIX"
required: false
default: ""
-
name: SQS_QUEUE_NAME_PREFIX
description: "Optional queue name prefix used to filter queues during discovery"
required: false
default: ""
-
name: DEAD_LETTER_MESSAGE_THRESHOLD
description: "Open an issue when DLQ approximate message count exceeds this value"
required: false
default: "0"
-
name: CLOUDWATCH_LOG_LOOKBACK_MINUTES
description: "Lookback window, in minutes, for Lambda log filtering and CloudWatch metric queries"
required: false
default: "30"
-
name: MAX_DLQ_MESSAGES_TO_SAMPLE
description: "Cap on DLQ messages to receive per run for diagnostics"
required: false
default: "5"
secrets:
- name: aws_credentials
description: "Standard RunWhen AWS credentials (aws-auth block)"
format: |
Same as other rw-cli-codecollection AWS bundles: access key, IRSA, or assume-role
via workspace aws_credentials secret.
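The configuration above can be validated up front so that each task fails fast on bad input. The following is a minimal Python sketch, assuming the variable names and defaults listed in env_vars; the load_config helper and its error messages are hypothetical and not part of the bundle.

```python
import os

# Defaults copied from the env_vars list in this spec.
DEFAULTS = {
    "AWS_ACCOUNT_NAME": "Unknown",
    "SQS_QUEUE_URLS": "",
    "SQS_QUEUE_NAME_PREFIX": "",
    "DEAD_LETTER_MESSAGE_THRESHOLD": "0",
    "CLOUDWATCH_LOG_LOOKBACK_MINUTES": "30",
    "MAX_DLQ_MESSAGES_TO_SAMPLE": "5",
}
INTEGER_SETTINGS = (
    "DEAD_LETTER_MESSAGE_THRESHOLD",
    "CLOUDWATCH_LOG_LOOKBACK_MINUTES",
    "MAX_DLQ_MESSAGES_TO_SAMPLE",
)


def load_config(env=None):
    """Read settings from a mapping (os.environ by default) and validate them."""
    env = os.environ if env is None else env
    region = env.get("AWS_REGION", "").strip()
    if not region:
        raise ValueError("AWS_REGION is required")
    config = {"AWS_REGION": region}
    for name, default in DEFAULTS.items():
        config[name] = env.get(name, default)
    for name in INTEGER_SETTINGS:
        try:
            value = int(config[name])
        except ValueError:
            raise ValueError(
                f"{name} must be an integer, got {config[name]!r}"
            ) from None
        if value < 0:
            raise ValueError(f"{name} must not be negative")
        config[name] = value
    return config


# Example: only the required variable is set, so defaults fill the rest.
config = load_config({"AWS_REGION": "us-east-1"})
```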
--- Platform Context ---
platform:
name: "aws"
cli_tools:
- "aws sqs"
- "aws lambda"
- "aws logs"
- "aws cloudwatch"
- "jq"
auth_methods:
- "aws_credentials (RunWhen standard)"
api_docs:
- "https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/Welcome.html"
- "https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html"
--- Relationships ---
related_bundles:
-
name: "azure-servicebus-health"
relationship: "complements"
notes: "Same operational story: dead-letter counts, thresholds, and investigation steps for messaging; this bundle is the AWS SQS equivalent."
-
name: "aws-lambda-health"
relationship: "complements"
notes: "Lambda invocation errors and performance; this bundle links DLQs to Lambda logs when the consumer is Lambda."
--- Test Strategy ---
test_scenarios:
-
name: "healthy_no_dlq_messages"
description: "Source queue with redrive policy; DLQ has zero approximate messages"
expected_issues: 0
-
name: "dlq_has_messages_lambda_failure"
description: "DLQ non-empty; Lambda event source mapping exists; logs contain processing errors"
expected_issues: 2
expected_severities: [2, 3]
-
name: "dlq_without_lambda_consumer"
description: "DLQ has messages but consumer is not Lambda (ECS/EC2); scripts degrade gracefully"
expected_issues: 1
expected_severities: [3]
--- Notes ---
notes: |
- SQS does not expose Azure-style DeadLetterReason on the queue API; root cause
comes from message bodies (Lambda failure payloads), consumer logs, or
application-specific attributes. Document limitations for non-Lambda consumers.
- Respect partial batch failure and FIFO semantics in receive/delete behavior.
- IAM permissions: sqs:GetQueueAttributes, sqs:ListQueues, sqs:ReceiveMessage (DLQ only),
sqs:GetQueueUrl, lambda:ListEventSourceMappings, logs:FilterLogEvents, and
cloudwatch:GetMetricStatistics or cloudwatch:GetMetricData.
- A reference scaffold that passes the CodeBundle quality scorer (96/96) is available
in the architect workspace for this intake; SRE implementation belongs under
rw-cli-codecollection as usual.
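Because root cause has to be inferred from the message itself, the sampling task needs to reduce each received message to a few diagnostic fields. The sketch below is illustrative Python, assuming the message dict has the shape of one entry in the Messages array of an SQS ReceiveMessage response, with system attributes requested. The function name, output fields, and the example attribute are hypothetical.

```python
from datetime import datetime, timezone


def summarize_dlq_message(message, snippet_length=200):
    """Reduce one received SQS message to the fields useful for diagnosis."""
    system_attrs = message.get("Attributes", {})
    sent_ms = system_attrs.get("SentTimestamp")  # epoch milliseconds, as a string
    sent_at = None
    if sent_ms:
        sent_at = datetime.fromtimestamp(
            int(sent_ms) / 1000, tz=timezone.utc
        ).isoformat()
    # Application-defined attributes; only string values are kept here.
    custom_attrs = {
        name: value.get("StringValue")
        for name, value in message.get("MessageAttributes", {}).items()
    }
    return {
        "message_id": message.get("MessageId"),
        "sent_at": sent_at,
        "receive_count": int(system_attrs.get("ApproximateReceiveCount", "0")),
        "attributes": custom_attrs,
        "body_snippet": message.get("Body", "")[:snippet_length],
    }


# Example with a made-up message.
summary = summarize_dlq_message(
    {
        "MessageId": "11111111-2222-3333-4444-555555555555",
        "Body": '{"orderId": 42, "action": "charge"}',
        "Attributes": {
            "SentTimestamp": "1700000000000",
            "ApproximateReceiveCount": "6",
        },
        "MessageAttributes": {
            "ErrorCode": {"DataType": "String", "StringValue": "Timeout"}
        },
    },
    snippet_length=20,
)
```

Truncating the body keeps reports small and limits how much payload data is copied into issue details.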