Skip to content

fix: drop head_object precheck in uploads migration#3

Merged
dosaki merged 2 commits into
mainfrom
fix/s3-migration-permissions
May 4, 2026
Merged

fix: drop head_object precheck in uploads migration#3
dosaki merged 2 commits into
mainfrom
fix/s3-migration-permissions

Conversation

@dosaki

@dosaki dosaki commented May 4, 2026

Copy link
Copy Markdown
Member

What broke

PR #1 deployed successfully through Terraform but the new ECS tasks crashed on startup with:

botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

This came from the head_object idempotency precheck in scripts/migrate_uploads_to_s3.py, which I added to skip already-uploaded files. The precheck needs s3:GetObject to behave correctly — without it, S3 returns 403 Forbidden (not 404) for missing objects, because S3 refuses to disclose existence to unauthorized callers. Our except ClientError branch only mapped 404/NoSuchKey/NotFound to "doesn't exist", so the 403 was re-raised and crashed the script. ECS retried, hit the same crash, and never reached steady state — aws ecs wait services-stable timed out at 10 minutes.

What this fixes

Drops the precheck entirely. PutObject is idempotent for the same body, so re-running on every container boot just overwrites with the same bytes — no functional difference vs. the precheck version, and no extra IAM permission needed. About 24 redundant PUTs per cold boot until cleanup PR #2 removes the migration script altogether; in-region S3 PUTs are fast and free.

The alternative would have been to grant s3:GetObject on the bucket prefix to the ECS task role — but that broadens permissions for a check whose only purpose is to avoid those redundant PUTs, and it would have required a Terraform round-trip to apply. This change is one Python file, ships in the same deploy.yml run, and gets us out of the failed-deploy state immediately.

Current prod state

  • Old tasks still serving (ECS keeps them running because deployment_minimum_healthy_percent = 50 blocks drain when new tasks fail health checks). App is degraded-but-functional.
  • New task definition exists with the new env vars; new image is in ECR; bucket and IAM are applied.
  • DB rows: still bare filenames (migration crashed before reaching the DB phase).
  • S3 bucket: empty (migration crashed before uploading anything).

After this merges and deploy.yml re-runs:

  • ECS rolls new tasks → first new task boots → flask db upgrade (no-op) → python scripts/migrate_uploads_to_s3.py PUTs all 24 files and prefixes the 24 DB rows → gunicorn starts → health check passes → ALB shifts traffic.
  • Old tasks drain. Brief rollout window where old tasks may 404 covers (DB rows now prefixed, old image doesn't have the legacy shim because the legacy shim is in the new code only).

Test plan

  • After merge, watch the deploy.yml log for == Migrating uploads to S3 [LIVE] == followed by Done. Uploaded 24 object(s); updated 24 row(s).
  • aws ecs wait services-stable succeeds.
  • Spot-check 2-3 book covers render in prod UI (now serving from S3).
  • Open PR chore: remove disk-upload remnants after S3 cutover #2 for review and merge it once verified.

🤖 Generated with Claude Code

The ECS task role is scoped to s3:PutObject only. HeadObject without
GetObject returns 403 (S3 hides existence from unauthorized callers),
which the existing handler doesn't translate to "doesn't exist", so the
script crashed on container start and ECS never reached steady state.
Drop the precheck — PutObject is idempotent for the same body, so
re-runs just overwrite with the same bytes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 08:11

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Removes the S3 HeadObject idempotency precheck from the one-shot uploads migration script to prevent ECS startup crashes when the task role lacks s3:GetObject (where HeadObject yields 403 rather than a distinguishable “missing object” signal).

Changes:

  • Drops the _exists() helper and associated head_object logic so uploads always proceed via PutObject.
  • Removes the unused ClientError import.
  • Updates inline/documentation text to reflect the new overwrite-based idempotency approach.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/migrate_uploads_to_s3.py Outdated
The "Runs once per environment" wording was true when the script was
intended to run manually, but with entrypoint.sh invoking it on every
container boot it just misleads. Note that PutObject re-runs are cheap
in-region overwrites until the file is removed in the cleanup PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dosaki dosaki merged commit dd320f4 into main May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants