rfc: MST bucket metadata on forge#4

Draft
frrist wants to merge 1 commit into `main` from `frrist/forge/mst-bucket`
@frrist frrist commented Apr 30, 2026

📖 Preview

@frrist frrist self-assigned this Apr 30, 2026
@frrist frrist added the documentation and enhancement labels Apr 30, 2026

@Peeja Peeja left a comment

Minor thoughts, but nothing that would block moving ahead and adjusting on the fly. This looks fantastic.


## DB ↔ MST relationship

- **DB authoritative for runtime.** All S3 queries — `GetObject`, `HeadObject`, `ListObjectsV2`, multipart, IAM checks, lifecycle evaluation — read from Postgres. The MST is never on the read path.

question: Is "authoritative" the right word here? As I would understand that word, the MST is authoritative, but the DB is the cache that's consulted. If the DB and the MST were out of sync, the MST would be by definition correct (and the response would be incorrect, having consulted a bad cache).

Ah, I understand what you're getting at now. The DB is a write-back cache, so immediately after a write, the DB is in fact more up-to-date than the MST. I think that satisfies my question, but I'm concerned there may be a footgun in here somewhere. I'm sure we'll find it if it's there, though.
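
To make the write-back reading concrete, here is a minimal Go sketch. Every name in it (`Store`, `Put`, `Get`, `Flush`, `InSync`) is hypothetical and not taken from the RFC or `pkg/ms3t`; two in-memory maps stand in for Postgres and the MST. It only illustrates the ordering, under the assumption that reads hit the DB and the MST is committed later.

```go
package main

import "fmt"

// Store is a hypothetical stand-in: db models the Postgres rows,
// mst models the state already committed to the MST.
type Store struct {
	db      map[string]string // key -> object CID, served to readers
	mst     map[string]string // key -> object CID, as of the last flush
	pending []string          // keys written since the last flush
}

func NewStore() *Store {
	return &Store{db: map[string]string{}, mst: map[string]string{}}
}

// Put updates the DB immediately and queues the key for the MST.
func (s *Store) Put(key, cid string) {
	s.db[key] = cid
	s.pending = append(s.pending, key)
}

// Get reads only from the DB; the MST is never consulted.
func (s *Store) Get(key string) (string, bool) {
	cid, ok := s.db[key]
	return cid, ok
}

// Flush writes queued changes back to the MST and returns
// the number of queued writes it applied.
func (s *Store) Flush() int {
	n := len(s.pending)
	for _, k := range s.pending {
		s.mst[k] = s.db[k]
	}
	s.pending = nil
	return n
}

// InSync reports whether the MST has caught up with the DB.
func (s *Store) InSync() bool {
	if len(s.db) != len(s.mst) {
		return false
	}
	for k, v := range s.db {
		if s.mst[k] != v {
			return false
		}
	}
	return true
}

func main() {
	s := NewStore()
	s.Put("photos/a.jpg", "cid-1")
	fmt.Println("in sync before flush:", s.InSync())
	s.Flush()
	fmt.Println("in sync after flush:", s.InSync())
}
```

The window between `Put` and `Flush` is where the DB is ahead of the MST, which is the place a footgun would most plausibly hide.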


Postgres is the authoritative runtime store. **It holds metadata only** — CIDs, sizes, hashes, timestamps, user metadata, tags, policies, multipart upload state. Body bytes are content-addressed and live in piri; they never enter the relational store. The schema is heavily inspired by [supabase/storage](https://github.com/supabase/storage) (`migrations/tenant/`), which has solved most of these problems already and at scale. The core tables (column lists indicative, not literal SQL):

- **`buckets`**: `(id, name, owner_id, root_cid, forge_root_cid, public, file_size_limit, allowed_mime_types, created_at, ...)`. Superset of today's registry (`pkg/ms3t/registry/sqlite.go:14-20`). `root_cid` is the current MST root; `forge_root_cid` is the last root snapshotted to Forge.

thought: Worth flagging, I think, if I'm correct: this table is different from the others. This is not a read-friendly cache of the MST values. This is the sole mutable ref for each bucket, pointing to the current root, and this information is not (currently) stored anywhere else. Is that right? (s3_multipart_uploads is also not in the MST, but it's transient.)

Along those lines, if I'm reading correctly, only some of these columns are actually that ref. Several things are also cached from the MST, but are on this table because they're about the entire bucket. Is that correct?

It makes practical sense to combine the ref with the cached info for operational reasons, but I think we should be super clear about what this is the source of truth for, because it's awkward having them mixed together. Alternatively, it might be a sign that the mutable ref should actually live somewhere else for a better credible exit, and this should in fact be a cache of that. But of course, we're much better at storing immutable values in a credibly exitable form than mutable ones, so I'm not even sure what that would look like.
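
One way to picture the "sole mutable ref" is as a compare-and-swap cell per bucket. The Go sketch below is an illustration under assumptions, not the RFC's design: `BucketRef`, `Advance`, and `ErrStaleRoot` are hypothetical names, and an in-memory mutex stands in for whatever Postgres would do in the real system.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStaleRoot is a hypothetical error for a lost race on the ref.
var ErrStaleRoot = errors.New("root_cid changed since it was read")

// BucketRef models only the mutable-ref part of a buckets row.
// Cached, MST-derived columns are deliberately left out.
type BucketRef struct {
	mu      sync.Mutex
	rootCID string
}

// Root returns the current root CID.
func (b *BucketRef) Root() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.rootCID
}

// Advance moves the ref from expected to next, and fails if another
// writer has already moved it.
func (b *BucketRef) Advance(expected, next string) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.rootCID != expected {
		return ErrStaleRoot
	}
	b.rootCID = next
	return nil
}

func main() {
	b := &BucketRef{}
	fmt.Println(b.Advance("", "root-1")) // first writer wins
	fmt.Println(b.Advance("", "root-2")) // stale expectation is rejected
	fmt.Println(b.Root())
}
```

Keeping the ref in its own type makes it explicit which state cannot be rebuilt from the MST, even if the columns end up sharing a table with cached values.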

- DR boundary expands: the service's disk is now durable state, co-equal with Postgres for in-flight multipart.
- `CompleteMultipartUpload` is where the bytes flow to piri — long-running for large uploads.

### Option 2 — service streams parts to piri; defers Accept until Complete

praise: This is fantastic. Just falls together so elegantly.

Comment on lines +201 to +202
- **Add**: ~`O(depth)` orphan MST nodes (the prior path).
- **Update**: ~`O(depth)` orphan MST nodes. With versioning disabled, the prior manifest and its body chunks are also orphaned. With versioning enabled, the prior manifest stays reachable via `Previous` and the chunks remain live.

question: Why do these still orphan MST nodes with versioning enabled? Don't we point to the prior root, and thus transitively to everything from the prior tree?
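
For anyone following the `O(depth)` figure: it is what path copying produces in any persistent tree. The Go sketch below uses a plain binary search tree, not the MST, and every name in it (`node`, `put`, `get`) is hypothetical. It shows that an update copies only the nodes on the path to the key, and that the prior versions of those nodes stay reachable only from the prior root. Whether they count as orphans when that root is retained is the question above.

```go
package main

import "fmt"

type node struct {
	key, val    string
	left, right *node
}

// put returns a new root. It copies every existing node on the path
// to key and counts those copies in *copied; nodes off the path are shared.
func put(n *node, key, val string, copied *int) *node {
	if n == nil {
		return &node{key: key, val: val}
	}
	*copied++
	c := *n // shallow copy: children stay shared until replaced
	switch {
	case key < n.key:
		c.left = put(n.left, key, val, copied)
	case key > n.key:
		c.right = put(n.right, key, val, copied)
	default:
		c.val = val
	}
	return &c
}

// get looks up key starting from the given root.
func get(n *node, key string) (string, bool) {
	for n != nil {
		switch {
		case key < n.key:
			n = n.left
		case key > n.key:
			n = n.right
		default:
			return n.val, true
		}
	}
	return "", false
}

func main() {
	var scratch int
	var root *node
	for _, k := range []string{"b", "a", "c"} {
		root = put(root, k, "v1", &scratch)
	}

	copied := 0
	next := put(root, "c", "v2", &copied)
	fmt.Println("nodes copied on update:", copied)
	fmt.Println("untouched subtree shared:", next.left == root.left)

	old, _ := get(root, "c")
	cur, _ := get(next, "c")
	fmt.Println("prior root sees:", old, "new root sees:", cur)
}
```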

@frrist frrist marked this pull request as draft May 4, 2026 15:22

@alanshaw alanshaw left a comment

This LGTM.


A Storacha S3 bucket is "Web3" only if a customer can walk away with their bucket. The falsifiable test: hand them a CAR file and they reconstruct an interoperable bucket on any IPFS-aware platform. That requires a content-addressed, self-describing representation of bucket state — which the MST provides.

The current prototype (`pkg/ms3t/architectural.md`) uses the MST for both the write path and the read path. The MST is bad at the read path. For S3 prefix listing, range reads, multipart, IAM checks, and lifecycle, a relational store with proper indexes is the right shape. The supabase/storage project has demonstrated this design at scale; we are adopting its schema as prior art and re-using sprue's existing Postgres machinery.

> The MST is bad at the read path.

Why is that? Multiple small block reads from the network?


> The supabase/storage project has demonstrated this design at scale

I don't understand the relevance?


This RFC defines the split.

## What is an MST

Suggested change: `## What is an MST` → `## What is an MST?`


The implementation in `pkg/ms3t/mst/` is forked from atproto, where the structure represents social-graph repos. The implications of that origin for S3 are discussed in §"Why MST" and §"Fanout and sizing".

## Why MST

Suggested change: `## Why MST` → `## Why MST?`

| `bucket → root_cid` pointer | | ✓ | operational cursor |
| Owner mapping, audit logs, metrics | | ✓ | platform state |

Bucket tags are a borderline call. They describe the operational unit (`cost-center=engineering`) rather than user content; on exit, a recipient can re-tag the destination bucket. This RFC defaults them to DB-only and revisits if a portability use case appears.

I think we might need a "bucket manifest". This can hold the bucket tags, but also the encryption specifics. It doesn't seem worthwhile on its own, but considering we will be adding encryption, we do need this information persisted.

ContentType string
Created int64 // unix seconds
Modified int64 // unix seconds (S3 Last-Modified)
Size uint64

There's no uint in IPLD, so just use int64.

Suggested change: `Size uint64` → `Size int64`
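
A small Go sketch of what the suggestion implies in practice. The four field names come from the quoted diff; the type name `ObjectMeta`, the `Validate` method, and `ErrNegativeSize` are hypothetical. The point is only that a signed `Size` admits negative values, so the non-negativity that `uint64` gave for free would have to be checked in code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNegativeSize is a hypothetical validation error.
var ErrNegativeSize = errors.New("size must be non-negative")

// ObjectMeta mirrors the fields visible in the quoted diff.
// The type name is an assumption; the real struct may carry more fields.
type ObjectMeta struct {
	ContentType string
	Created     int64 // unix seconds
	Modified    int64 // unix seconds (S3 Last-Modified)
	Size        int64 // bytes; signed, so negatives must be rejected explicitly
}

// Validate rejects values that the signed type now allows.
func (m ObjectMeta) Validate() error {
	if m.Size < 0 {
		return ErrNegativeSize
	}
	return nil
}

func main() {
	ok := ObjectMeta{ContentType: "image/jpeg", Size: 1024}
	bad := ObjectMeta{ContentType: "image/jpeg", Size: -1}
	fmt.Println(ok.Validate())
	fmt.Println(bad.Validate())
}
```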


> The current prototype (pkg/ms3t/architectural.md)

It might be helpful to link to the file in the repository.

4. **Free incremental snapshots.** Every commit is a snapshot. Structural sharing makes N retained snapshots ≈ 1 in storage cost.
5. **Tamper-evident history.** Retained roots form a verifiable commitment chain.

The current MST in `pkg/ms3t/mst/` is forked from atproto's repo MST: 4-bit fanout, hash-keyed via `sha256(key)` leading-zero count (`pkg/ms3t/mst/mst_util.go:20-49`). atproto's design point is bounded social-graph repos with no prefix-listing requirement. S3 buckets violate all of those: keys can be 1KB, deeply hierarchical, prefix-listing is a first-class operation, and buckets can grow unbounded.

It would probably be good to link all the referenced repositories or files, or add them to a list of references that can later be referenced in the document.
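
As context for the quoted paragraph, the hash-keyed layer assignment it describes can be sketched in Go as follows. This is not the code in `pkg/ms3t/mst/mst_util.go`: the function names are hypothetical, and the number of hash bits grouped per layer is left as a parameter (`bitsPerLayer`) rather than guessing the fork's value.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the zero bits at the start of sha256(key).
func leadingZeroBits(key string) int {
	sum := sha256.Sum256([]byte(key))
	total := 0
	for _, b := range sum {
		if b == 0 {
			total += 8
			continue
		}
		total += bits.LeadingZeros8(b)
		break
	}
	return total
}

// Layer maps a key to a tree layer: one layer per bitsPerLayer
// leading zero bits. bitsPerLayer is an assumption of this sketch.
func Layer(key string, bitsPerLayer int) int {
	return leadingZeroBits(key) / bitsPerLayer
}

func main() {
	for _, k := range []string{"photos/2026/a.jpg", "photos/2026/b.jpg", "logs/x"} {
		fmt.Printf("%-20s zero bits=%d layer=%d\n", k, leadingZeroBits(k), Layer(k, 2))
	}
}
```

The layer depends only on the hash of the whole key, so keys that share an S3 prefix get no special placement in the tree's layers.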

- [storacha/RFC #65](https://github.com/storacha/RFC/pull/65) — Filepack archive format
- [storacha/RFC #66](https://github.com/storacha/RFC/pull/66) — Virtual DAG in Sharded DAG Index
- [supabase/storage](https://github.com/supabase/storage) — schema prior art (`migrations/tenant/0002`, `0021`, `0026`–`0050`)
- [atproto MST](https://github.com/bluesky-social/indigo/tree/main/mst) — origin of the MST fork

OK, I see atproto is referenced here.
