rfc: MST bucket metadata on forge by frrist · Pull Request #4 · fil-one/RFC

frrist · 2026-04-30T22:02:00Z

Peeja

Minor thoughts, but nothing that would block moving ahead and adjusting on the fly. This looks fantastic.

Peeja · 2026-05-04T14:40:53Z

+
+## DB ↔ MST relationship
+
+- **DB authoritative for runtime.** All S3 queries — `GetObject`, `HeadObject`, `ListObjectsV2`, multipart, IAM checks, lifecycle evaluation — read from Postgres. The MST is never on the read path.


question: Is "authoritative" the right word here? As I would understand that word, the MST is authoritative, but the DB is the cache that's consulted. If the DB and the MST were out of sync, the MST would be by definition correct (and the response would be incorrect, having consulted a bad cache).

Ah, I understand what you're getting at now. The DB is a write-back cache, so immediately after a write, the DB is in fact more up-to-date than the MST. I think that satisfies my question, but I'm concerned there may be a footgun in here somewhere. I'm sure we'll find it if it's there, though.
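
For readers following the thread, here is a minimal sketch of the write-back ordering described in this comment, assuming in-memory maps as stand-ins for Postgres and the committed MST. The names `store`, `Put`, `Flush`, and `Behind` are hypothetical and do not come from the RFC. It only illustrates the window in which the DB is ahead of the MST.

```go
package main

import "fmt"

// store is a hypothetical stand-in: db plays the role of Postgres,
// mst the role of the committed tree, pending the unflushed writes.
type store struct {
	db      map[string]string // key -> manifest CID, serves all reads
	mst     map[string]string // key -> manifest CID, committed state
	pending []string          // keys written to db but not yet to mst
}

func newStore() *store {
	return &store{db: map[string]string{}, mst: map[string]string{}}
}

// Put writes to the DB only and records the key as dirty.
func (s *store) Put(key, cid string) {
	s.db[key] = cid
	s.pending = append(s.pending, key)
}

// Get serves reads from the DB; the MST is never consulted.
func (s *store) Get(key string) (string, bool) {
	cid, ok := s.db[key]
	return cid, ok
}

// Flush copies dirty entries into the MST, closing the window.
func (s *store) Flush() {
	for _, k := range s.pending {
		s.mst[k] = s.db[k]
	}
	s.pending = nil
}

// Behind reports how many writes the MST has not seen yet.
func (s *store) Behind() int { return len(s.pending) }

func main() {
	s := newStore()
	s.Put("a.txt", "cid-1")
	fmt.Println("unflushed writes:", s.Behind()) // DB is ahead of the MST here
	s.Flush()
	fmt.Println("unflushed writes:", s.Behind())
}
```

Any failure between `Put` and `Flush` is where the two stores can disagree, which may be the footgun this comment is worried about.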

Peeja · 2026-05-04T14:44:01Z

+
+Postgres is the authoritative runtime store. **It holds metadata only** — CIDs, sizes, hashes, timestamps, user metadata, tags, policies, multipart upload state. Body bytes are content-addressed and live in piri; they never enter the relational store. The schema is heavily inspired by [supabase/storage](https://github.com/supabase/storage) (`migrations/tenant/`), which has solved most of these problems already and at scale. The core tables (column lists indicative, not literal SQL):
+
+- **`buckets`**: `(id, name, owner_id, root_cid, forge_root_cid, public, file_size_limit, allowed_mime_types, created_at, ...)`. Superset of today's registry (`pkg/ms3t/registry/sqlite.go:14-20`). `root_cid` is the current MST root; `forge_root_cid` is the last root snapshotted to Forge.


thought: Worth flagging, I think, if I'm correct: this table is different from the others. This is not a read-friendly cache of the MST values. This is the sole mutable ref for each bucket, pointing to the current root, and this information is not (currently) stored anywhere else. Is that right? (s3_multipart_uploads is also not in the MST, but it's transient.)

Along those lines, if I'm reading correctly, only some of these columns are actually that ref. Several things are also cached from the MST, but are on this table because they're about the entire bucket. Is that correct?

It makes practical sense to combine the ref with the cached info for operational reasons, but I think we should be super clear about what this is the source of truth for, because it's awkward having them mixed together. Alternatively, it might be a sign that the mutable ref should actually live somewhere else for a better credible exit, and this should in fact be a cache of that. But of course, we're much better at storing immutable values in a credibly exitable form than mutable ones, so I'm not even sure what that would look like.
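
One way to make the "sole mutable ref" explicit is to isolate it behind its own type and update it with a compare-and-swap. The sketch below is an illustration only: `BucketRef`, `refStore`, `SwapRoot`, and `ErrStale` are hypothetical names, and the compare-and-swap guard is an assumption, not something the quoted RFC text specifies. Only the `root_cid` and `forge_root_cid` fields come from the diff.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// BucketRef holds only the mutable pointers from the buckets row.
// Everything else on the row would live in a separate cache type.
type BucketRef struct {
	RootCID      string // current MST root (root_cid)
	ForgeRootCID string // last root snapshotted to Forge (forge_root_cid)
}

// ErrStale is returned when the root moved since the caller read it.
var ErrStale = errors.New("root moved since read")

// refStore is an in-memory stand-in for the table that owns the refs.
type refStore struct {
	mu   sync.Mutex
	refs map[string]BucketRef
}

func newRefStore() *refStore {
	return &refStore{refs: map[string]BucketRef{}}
}

// SwapRoot advances a bucket's root only if it still equals expect.
func (r *refStore) SwapRoot(bucket, expect, next string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	ref := r.refs[bucket]
	if ref.RootCID != expect {
		return ErrStale
	}
	ref.RootCID = next
	r.refs[bucket] = ref
	return nil
}

func main() {
	r := newRefStore()
	r.refs["photos"] = BucketRef{RootCID: "root-0"}
	fmt.Println(r.SwapRoot("photos", "root-0", "root-1")) // succeeds
	fmt.Println(r.SwapRoot("photos", "root-0", "root-2")) // stale expectation
}
```

Keeping the ref in its own type makes it easier to state what the row is the source of truth for, even if both parts stay in one table.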

Peeja · 2026-05-04T15:04:13Z

+- DR boundary expands: services disk is now durable state, co-equal with Postgres for in-flight multipart.
+- `CompleteMultipartUpload` is where the bytes flow to piri — long-running for large uploads.
+
+### Option 2 — service streams parts to piri; defers Accept until Complete


praise: This is fantastic. Just falls together so elegantly.

Peeja · 2026-05-04T15:05:55Z

+- **Add**: ~`O(depth)` orphan MST nodes (the prior path).
+- **Update**: ~`O(depth)` orphan MST nodes. With versioning disabled, the prior manifest and its body chunks are also orphaned. With versioning enabled, the prior manifest stays reachable via `Previous` and the chunks remain live.


question: Why do these still orphan MST nodes with versioning enabled? Don't we point to the prior root, and thus transitively to everything from the prior tree?
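
A small sketch may help frame this question. It uses a plain persistent binary search tree as a stand-in for the MST (an assumption made for brevity; the real structure differs) and shows that an update copies the nodes on the search path, so the old path is unreachable from the new root unless something still retains the old root. Note that the quoted text describes `Previous` as linking manifests, not tree roots.

```go
package main

import "fmt"

// node is an immutable tree node; updates never modify it in place.
type node struct {
	key, val    string
	left, right *node
}

// put returns a new root. Nodes on the search path are copied;
// every other subtree is shared with the old tree.
func put(n *node, key, val string) *node {
	if n == nil {
		return &node{key: key, val: val}
	}
	c := *n // copy the node on the path
	switch {
	case key < n.key:
		c.left = put(n.left, key, val)
	case key > n.key:
		c.right = put(n.right, key, val)
	default:
		c.val = val
	}
	return &c
}

// reach records every node reachable from n.
func reach(n *node, seen map[*node]bool) {
	if n == nil {
		return
	}
	seen[n] = true
	reach(n.left, seen)
	reach(n.right, seen)
}

// orphans counts nodes reachable from old but not from next.
func orphans(old, next *node) int {
	a, b := map[*node]bool{}, map[*node]bool{}
	reach(old, a)
	reach(next, b)
	count := 0
	for p := range a {
		if !b[p] {
			count++
		}
	}
	return count
}

func main() {
	var root *node
	for _, k := range []string{"m", "f", "t", "b", "h"} {
		root = put(root, k, "v1")
	}
	next := put(root, "h", "v2") // update one key at depth 3
	fmt.Println("orphaned path nodes:", orphans(root, next))
}
```

If the prior root were retained (for example as a snapshot), those path nodes would stay live; whether versioning retains the root or only the manifest is exactly what the question asks.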

alanshaw

This LGTM.

alanshaw · 2026-05-05T16:03:39Z

+
+A Storacha S3 bucket is "Web3" only if a customer can walk away with their bucket. The falsifiable test: hand them a CAR file and they reconstruct an interoperable bucket on any IPFS-aware platform. That requires a content-addressed, self-describing representation of bucket state — which the MST provides.
+
+The current prototype (`pkg/ms3t/architectural.md`) uses the MST for both the write path and the read path. The MST is bad at the read path. For S3 prefix listing, range reads, multipart, IAM checks, and lifecycle, a relational store with proper indexes is the right shape. The supabase/storage project has demonstrated this design at scale; we are adopting its schema as prior art and re-using sprue's existing Postgres machinery.


> The MST is bad at the read path.

Why is that? Multiple small block reads from the network?

alanshaw · 2026-05-05T16:04:02Z

+
+A Storacha S3 bucket is "Web3" only if a customer can walk away with their bucket. The falsifiable test: hand them a CAR file and they reconstruct an interoperable bucket on any IPFS-aware platform. That requires a content-addressed, self-describing representation of bucket state — which the MST provides.
+
+The current prototype (`pkg/ms3t/architectural.md`) uses the MST for both the write path and the read path. The MST is bad at the read path. For S3 prefix listing, range reads, multipart, IAM checks, and lifecycle, a relational store with proper indexes is the right shape. The supabase/storage project has demonstrated this design at scale; we are adopting its schema as prior art and re-using sprue's existing Postgres machinery.


> The supabase/storage project has demonstrated this design at scale

I don't understand the relevance?

alanshaw · 2026-05-05T16:06:03Z

+
+This RFC defines the split.
+
+## What is an MST


Suggested change

-## What is an MST
+## What is an MST?

alanshaw · 2026-05-05T16:06:08Z

+
+The implementation in `pkg/ms3t/mst/` is forked from atproto, where the structure represents social-graph repos. The implications of that origin for S3 are discussed in §"Why MST" and §"Fanout and sizing".
+
+## Why MST


Suggested change

-## Why MST
+## Why MST?

alanshaw · 2026-05-05T16:17:03Z

+| `bucket → root_cid` pointer | | ✓ | operational cursor |
+| Owner mapping, audit logs, metrics | | ✓ | platform state |
+
+Bucket tags are a borderline call. They describe the operational unit (`cost-center=engineering`) rather than user content; on exit, a recipient can re-tag the destination bucket. This RFC defaults them to DB-only and revisits if a portability use case appears.


I think we might need a "bucket manifest". This can hold the bucket tags, but also the encryption specifics. It doesn't seem worthwhile on its own, but considering we will be adding encryption, we do need this information persisted.

alanshaw · 2026-05-05T16:19:52Z

+  ContentType  string
+  Created      int64             // unix seconds
+  Modified     int64             // unix seconds (S3 Last-Modified)
+  Size         uint64


There's no uint in IPLD so just int64.

Suggested change

-  Size         uint64
+  Size         int64
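
A minimal sketch of what the suggested change implies for callers, assuming the fields quoted in the diff. The struct name `ObjectMeta` and the helpers `sizeFromUint` and `Validate` are hypothetical. Moving to `int64` means a byte count that arrives as `uint64` needs a range check, and negative values need to be rejected explicitly.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// ObjectMeta mirrors the fields quoted in the diff, with Size as int64.
type ObjectMeta struct {
	ContentType string
	Created     int64 // unix seconds
	Modified    int64 // unix seconds (S3 Last-Modified)
	Size        int64 // bytes; must not be negative
}

var (
	ErrSizeRange    = errors.New("size does not fit in int64")
	ErrNegativeSize = errors.New("size is negative")
)

// sizeFromUint converts an unsigned byte count, refusing values
// that would overflow int64 instead of silently wrapping.
func sizeFromUint(n uint64) (int64, error) {
	if n > math.MaxInt64 {
		return 0, ErrSizeRange
	}
	return int64(n), nil
}

// Validate rejects the negative sizes that int64 now permits.
func (m ObjectMeta) Validate() error {
	if m.Size < 0 {
		return ErrNegativeSize
	}
	return nil
}

func main() {
	size, err := sizeFromUint(1 << 20)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	m := ObjectMeta{ContentType: "text/plain", Size: size}
	fmt.Println(m.Size, m.Validate())
}
```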

pyropy · 2026-05-06T10:24:24Z

+
+A Storacha S3 bucket is "Web3" only if a customer can walk away with their bucket. The falsifiable test: hand them a CAR file and they reconstruct an interoperable bucket on any IPFS-aware platform. That requires a content-addressed, self-describing representation of bucket state — which the MST provides.
+
+The current prototype (`pkg/ms3t/architectural.md`) uses the MST for both the write path and the read path. The MST is bad at the read path. For S3 prefix listing, range reads, multipart, IAM checks, and lifecycle, a relational store with proper indexes is the right shape. The supabase/storage project has demonstrated this design at scale; we are adopting its schema as prior art and re-using sprue's existing Postgres machinery.


> The current prototype (`pkg/ms3t/architectural.md`)

It might be helpful to link to the file in the repository.

pyropy · 2026-05-06T10:26:49Z

+4. **Free incremental snapshots.** Every commit is a snapshot. Structural sharing makes N retained snapshots ≈ 1 in storage cost.
+5. **Tamper-evident history.** Retained roots form a verifiable commitment chain.
+
+The current MST in `pkg/ms3t/mst/` is forked from atproto's repo MST: 4-bit fanout, hash-keyed via `sha256(key)` leading-zero count (`pkg/ms3t/mst/mst_util.go:20-49`). atproto's design point is bounded social-graph repos with no prefix-listing requirement. S3 buckets violate all of those: keys can be 1KB, deeply hierarchical, prefix-listing is a first-class operation, and buckets can grow unbounded.


It would probably be good to link all the referenced repositories or files, or add them to a list of references that can later be referenced in the document.
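
For context on the hash-keyed layering quoted above, here is a rough sketch of assigning a key to a layer from the leading zero bits of `sha256(key)`. It is not the code in `mst_util.go`: the function names are hypothetical, and the number of bits consumed per layer is left as a parameter because the exact chunk width used by the fork is not confirmed in this thread.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the zero bits at the front of b.
func leadingZeroBits(b []byte) int {
	n := 0
	for _, x := range b {
		if x == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(x)
		break
	}
	return n
}

// layerFor maps a key to a tree layer: more leading zero bits in
// sha256(key) means a higher layer. bitsPerLayer is an assumption
// of this sketch, not a value taken from the fork.
func layerFor(key string, bitsPerLayer int) int {
	if bitsPerLayer < 1 {
		bitsPerLayer = 1
	}
	sum := sha256.Sum256([]byte(key))
	return leadingZeroBits(sum[:]) / bitsPerLayer
}

func main() {
	// Keys sharing an S3 prefix hash independently, so their layers
	// carry no information about the prefix they share.
	for _, k := range []string{"photos/2026/a.jpg", "photos/2026/b.jpg"} {
		fmt.Println(k, layerFor(k, 4))
	}
}
```

Because the layer depends only on the hash, key length and hierarchy do not influence tree shape, which is part of why the atproto design point and S3 workloads differ.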

pyropy · 2026-05-06T10:30:07Z

+- [storacha/RFC #65](https://github.com/storacha/RFC/pull/65) — Filepack archive format
+- [storacha/RFC #66](https://github.com/storacha/RFC/pull/66) — Virtual DAG in Sharded DAG Index
+- [supabase/storage](https://github.com/supabase/storage) — schema prior art (`migrations/tenant/0002`, `0021`, `0026`–`0050`)
+- [atproto MST](https://github.com/bluesky-social/indigo/tree/main/mst) — origin of the MST fork 


OK, I see that atproto is referenced here.

rfc: MST bucket metadata on forge

07f3e50

frrist requested review from Peeja, alanshaw, bajtos, hannahhoward and pyropy April 30, 2026 22:02

frrist self-assigned this Apr 30, 2026

frrist added the documentation and enhancement labels Apr 30, 2026

Peeja approved these changes May 4, 2026


frrist marked this pull request as draft May 4, 2026 15:22

alanshaw approved these changes May 5, 2026


pyropy reviewed May 6, 2026



