Storage deduplication #374

Merged
daveaugustus merged 10 commits into main from feature/storage-deduplication
May 15, 2026
@GorillaGigabytes GorillaGigabytes commented May 10, 2026

Collaborator


feat: Implement Smart File Storage with Content-Addressable Deduplication

Summary
Introduced global Content-Addressable Storage (CAS) using SHA-256 hashing to eliminate duplicate file storage across the platform. The same image uploaded by multiple users/blogs will now be stored only once.

Key Changes:

  • Added storage_assets and storage_asset_refs tables to separate physical assets from logical references.
  • The API Gateway now computes the hash before upload and reuses existing assets via a gRPC call to the Storage Service.
  • Implemented safe reference-based delete/update flows.
  • Background garbage collection for unreferenced assets.
  • "Lazy migration" strategy — only new uploads are deduplicated.
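
The split between physical assets and logical references can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `Store`, `Put`, `DeleteRef`, and `CollectGarbage` are hypothetical names, the two maps stand in for the `storage_assets` and `storage_asset_refs` tables, and references are hard-deleted here for brevity even though the PR describes soft-delete.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sync"
)

// Store is a hypothetical in-memory stand-in for the two tables:
// assets plays the role of storage_assets, refs of storage_asset_refs.
type Store struct {
	mu     sync.Mutex
	assets map[string][]byte // content hash -> physical bytes
	refs   map[string]string // logical reference -> content hash
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{assets: map[string][]byte{}, refs: map[string]string{}}
}

// Put records ref as pointing at data. The bytes are stored only if no
// asset with the same SHA-256 exists yet; reused reports a dedup hit.
func (s *Store) Put(ref string, data []byte) (hash string, reused bool) {
	sum := sha256.Sum256(data)
	hash = hex.EncodeToString(sum[:])

	s.mu.Lock()
	defer s.mu.Unlock()
	_, reused = s.assets[hash]
	if !reused {
		s.assets[hash] = append([]byte(nil), data...)
	}
	s.refs[ref] = hash
	return hash, reused
}

// DeleteRef drops only the logical reference; the physical asset stays
// until garbage collection finds it unreferenced.
func (s *Store) DeleteRef(ref string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.refs, ref)
}

// CollectGarbage removes assets that no reference points at and returns
// how many were removed.
func (s *Store) CollectGarbage() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	live := make(map[string]bool, len(s.refs))
	for _, h := range s.refs {
		live[h] = true
	}
	removed := 0
	for h := range s.assets {
		if !live[h] {
			delete(s.assets, h)
			removed++
		}
	}
	return removed
}

// AssetCount reports how many physical assets are stored.
func (s *Store) AssetCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.assets)
}
```

In this model, two blogs uploading identical bytes produce two references and one asset, and the asset is removed only after both references are gone and a collection pass runs.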

Benefits:

  • Significant reduction in storage usage and costs.
  • Faster duplicate uploads, since an existing asset is reused instead of being stored again.
  • Maintains clean microservices boundaries.

Important Notes:

  • This is fully backward compatible.
  • Legacy files remain untouched (deduplication applies to new uploads only).
  • No breaking changes to existing APIs or URLs.

Related Issues: [https://github.com//issues/371]

- Add generation counter to ConnManager to prevent N goroutines from each
  reconnecting when they all fail on the same dead connection. Only the
  first caller reconnects; the rest piggyback on the fresh connection.
- Serialize PublishMessage through pubMu when confirms are enabled to
  prevent orphan confirmations from causing false timeouts in
  publishAndWaitConfirm.
- Fix Debugf format verb in GetDueScheduledBlogs log statement.
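
The generation-counter idea from the first bullet can be sketched as follows. This is an illustrative outline, not the project's code: the `Conn` type, the `dial` callback, and the method names `Get`, `Reconnect`, and `SimulateFailures` are assumptions made for the example.

```go
package main

import "sync"

// Conn is a hypothetical stand-in for a broker connection.
type Conn struct{ ID int }

// ConnManager is an illustrative sketch; field and method names are
// assumptions, not taken from the PR.
type ConnManager struct {
	mu   sync.Mutex
	conn *Conn
	gen  uint64 // incremented every time conn is replaced
	dial func() (*Conn, error)
}

// NewConnManager dials once and starts at generation 1.
func NewConnManager(dial func() (*Conn, error)) (*ConnManager, error) {
	c, err := dial()
	if err != nil {
		return nil, err
	}
	return &ConnManager{conn: c, gen: 1, dial: dial}, nil
}

// Get returns the current connection and the generation it belongs to.
// Callers remember the generation and pass it back if the connection fails.
func (m *ConnManager) Get() (*Conn, uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.conn, m.gen
}

// Reconnect redials only when failedGen is still the current generation.
// If another caller already replaced the connection, the generations
// differ and this caller reuses the fresh connection without dialing.
func (m *ConnManager) Reconnect(failedGen uint64) (*Conn, uint64, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if failedGen != m.gen {
		return m.conn, m.gen, nil
	}
	c, err := m.dial()
	if err != nil {
		return nil, m.gen, err
	}
	m.conn = c
	m.gen++
	return m.conn, m.gen, nil
}

// SimulateFailures starts n goroutines that all report a failure on the
// same generation, then waits for them to finish.
func SimulateFailures(m *ConnManager, failedGen uint64, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _, _ = m.Reconnect(failedGen)
		}()
	}
	wg.Wait()
}
```

Because the generation check and the redial happen under one mutex, calling `SimulateFailures` with any number of goroutines triggers a single extra dial in this sketch.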

Adds SHA-256 dedup per design doc.

- Schema 000005: storage_assets + storage_asset_refs (soft-delete, partial unique index).
- Storage svc: new database pkg (CheckAsset, RegisterAsset race-safe, Create/Delete/ReplaceAssetRef, UpdateNSFW); gRPC handlers wired.
- Consumer: BLOG_DELETE / USER_ACCOUNT_DELETE soft-delete CAS refs; legacy prefix delete kept as fallback for pre-CAS objects.
- Gateway storage_v2: hash-first upload (memory <=32MB, spool above), CheckAsset reuse, RegisterAsset + Create/Replace/DeleteAssetRef on mutations; new GET /assets/sha256/:p1/:p2/:fileName.
- Proto: new dedup messages + service methods.
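
The hash-first upload in the gateway bullet (hold small bodies in memory, spool larger ones) might look roughly like the sketch below. The names `HashedUpload`, `HashFirst`, and `HashFirstBytes` are hypothetical, the memory limit is a parameter rather than the 32MB mentioned above, and the follow-up calls to `CheckAsset` and `RegisterAsset` are left out because their signatures are not shown in this PR.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"os"
)

// HashedUpload is a hypothetical result type: the digest is known before
// any bytes are sent to the object store.
type HashedUpload struct {
	SHA256  string
	Size    int64
	Spooled bool // true when the body went to a temp file
	body    io.ReadSeeker
	cleanup func() error
}

// Body returns the buffered upload, positioned at the start.
func (h *HashedUpload) Body() io.ReadSeeker { return h.body }

// Close releases the temp file, if one was created.
func (h *HashedUpload) Close() error { return h.cleanup() }

// HashFirst consumes src while hashing it. Bodies of at most memLimit
// bytes stay in memory; larger ones are spooled to a temporary file.
func HashFirst(src io.Reader, memLimit int64) (*HashedUpload, error) {
	hasher := sha256.New()
	tee := io.TeeReader(src, hasher) // every byte read is also hashed

	// Read one byte past the limit to detect oversized bodies.
	var buf bytes.Buffer
	n, err := io.CopyN(&buf, tee, memLimit+1)
	if err != nil && err != io.EOF {
		return nil, err
	}
	if n <= memLimit {
		return &HashedUpload{
			SHA256:  hex.EncodeToString(hasher.Sum(nil)),
			Size:    n,
			body:    bytes.NewReader(buf.Bytes()),
			cleanup: func() error { return nil },
		}, nil
	}

	f, err := os.CreateTemp("", "upload-*")
	if err != nil {
		return nil, err
	}
	cleanup := func() error {
		f.Close()
		return os.Remove(f.Name())
	}
	if _, err := f.Write(buf.Bytes()); err != nil {
		cleanup()
		return nil, err
	}
	rest, err := io.Copy(f, tee)
	if err != nil {
		cleanup()
		return nil, err
	}
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		cleanup()
		return nil, err
	}
	return &HashedUpload{
		SHA256:  hex.EncodeToString(hasher.Sum(nil)),
		Size:    n + rest,
		Spooled: true,
		body:    f,
		cleanup: cleanup,
	}, nil
}

// HashFirstBytes is a convenience wrapper for callers holding a byte slice.
func HashFirstBytes(data []byte, memLimit int64) (*HashedUpload, error) {
	return HashFirst(bytes.NewReader(data), memLimit)
}
```

A caller would compute the digest this way, ask the storage service whether that digest already exists, and upload `Body()` only on a miss, calling `Close` in either case.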

Follow-ups: reference-counted GC, profile-pic CAS, NSFW scanner.
@github-actions

🏷️ [bumpr]
Next version: v1.9.0
Changes: v1.8.7...the-monkeys:feature/storage-deduplication

Signed-off-by: Dave Augustus <devpandey19924u@gmail.com>
@github-actions

🏷️ [bumpr]
Next version: v1.8.8
Changes: v1.8.7...the-monkeys:feature/storage-deduplication

@daveaugustus daveaugustus merged commit 28222c8 into main May 15, 2026
11 checks passed
@github-actions

🚀 [bumpr] Bumped!
New version: v1.9.0
Changes: v1.8.7...v1.9.0
