Version storage restructure: problem and plan #252

## Problem

- **~81% of bucket objects** are version-related; **~67% are empty** (one R2 object per edit, metadata only, no body).
- **`.da-versions` at org root** (`{Org}/.da-versions/{FileID}/`) is a single huge prefix: slow to list and doesn't scale.
- **Two concepts mixed**: (1) real version snapshots (contentLength > 0, explicit "Save version" or Restore Point), (2) audit-only entries (empty objects created on every PUT for "Collab Parse" and similar). The latter explode object count without adding real versions.

## Plan (condensed)

### 1. Labelled versions only as R2 objects

- **Remove "Collab Parse" version**: stop creating the automatic first-save snapshot and the empty version objects written on every PUT. Create version objects only for an **explicit labelled version** (Save version, future preview/publish) or a **Restore Point**.
- **New path**: `{Org}/{Repo}/.da-versions/{FileID}/{VersionUUID}.{ext}` — move under repo so listing is per-repo, not org-wide.
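The two layouts can be sketched as key builders. This is a minimal illustration, not the actual implementation: the helper names and the `VersionRef` shape are hypothetical, and only the path templates come from the plan above.

```typescript
// Hypothetical helpers; only the path templates come from the plan.
interface VersionRef {
  org: string;
  repo: string;
  fileId: string;
  versionId: string; // snapshot UUID
  ext: string; // extension without the leading dot, e.g. "html"
}

// New layout: {Org}/{Repo}/.da-versions/{FileID}/{VersionUUID}.{ext}
function newVersionKey(v: VersionRef): string {
  return `${v.org}/${v.repo}/.da-versions/${v.fileId}/${v.versionId}.${v.ext}`;
}

// New per-repo prefix, so a list call only scans one repo's versions.
function newVersionPrefix(org: string, repo: string, fileId: string): string {
  return `${org}/${repo}/.da-versions/${fileId}/`;
}

// Legacy layout: everything for the org sits under one shared prefix.
function legacyVersionPrefix(org: string, fileId: string): string {
  return `${org}/.da-versions/${fileId}/`;
}
```

With this shape, a list for one file under the new layout never touches keys belonging to other repos in the same org.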

### 2. Single audit file per file (read-before-write dedupe)

- **Path**: `{Org}/{Repo}/.da-versions/{FileID}/audit.txt`
- **Format**: One line per entry (tab-separated): `timestamp \t users \t path \t versionLabel \t versionId`
  - **path**: stored without repo prefix (e.g. `/surf-copy.html`) so the file is readable.
  - **versionLabel**: human-readable name when entry is a labelled version (e.g. "v1", "Restore Point"); empty for edits.
  - **versionId**: snapshot id without extension when entry is a version (e.g. UUID); empty for edits.
  - Backward compat: 3-column (path only) and 4-column (path + versionId) lines are still parsed.
- **Write**: On every versionable PUT, append to or update `audit.txt`, using **read-before-write** with a **30 min** dedup window:
  - If the last line is from the same user, is within 30 min of the new entry, **and both the last and the new entries are edits** (no version), overwrite that line with the new timestamp; otherwise append.
  - **Labelled version entries always append and are never replaced**; they "interrupt" the dedup window (e.g. edit at 12:23, version at 12:25, edit at 12:40 → three entries).
  - No empty version objects are created.
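The line format and dedup rule above could look roughly like this. It is a sketch under stated assumptions: the timestamp is taken to be epoch milliseconds (the plan does not specify the encoding), a 4-column line is read as `timestamp, users, path, versionId`, and all function and type names are hypothetical.

```typescript
const DEDUP_WINDOW_MS = 30 * 60 * 1000; // 30 min window from the plan

interface AuditEntry {
  timestamp: number; // assumption: epoch milliseconds
  users: string;
  path: string; // stored without the repo prefix
  versionLabel: string; // empty for edits
  versionId: string; // empty for edits
}

// Parses 5-column lines plus the legacy 3-column and 4-column variants.
function parseAuditLine(line: string): AuditEntry {
  const cols = line.split('\t');
  const hasLabel = cols.length >= 5;
  return {
    timestamp: Number(cols[0] ?? '0'),
    users: cols[1] ?? '',
    path: cols[2] ?? '',
    versionLabel: hasLabel ? (cols[3] ?? '') : '',
    versionId: hasLabel ? (cols[4] ?? '') : (cols[3] ?? ''),
  };
}

function formatAuditLine(e: AuditEntry): string {
  return [String(e.timestamp), e.users, e.path, e.versionLabel, e.versionId].join('\t');
}

function isVersion(e: AuditEntry): boolean {
  return e.versionLabel !== '' || e.versionId !== '';
}

// Read-before-write dedupe: collapse only edit-after-edit by the same user
// inside the window; version entries always append and are never replaced.
function appendOrCollapse(entries: AuditEntry[], next: AuditEntry): AuditEntry[] {
  const last = entries[entries.length - 1];
  const collapse =
    last !== undefined &&
    !isVersion(last) &&
    !isVersion(next) &&
    last.users === next.users &&
    next.timestamp - last.timestamp <= DEDUP_WINDOW_MS;
  if (last !== undefined && collapse) {
    // Overwrite the last line with the new timestamp.
    return [...entries.slice(0, -1), { ...last, timestamp: next.timestamp }];
  }
  return [...entries, next];
}

// The example from the plan: edit 12:23, version 12:25, edit 12:40.
const at = (h: number, m: number): number => Date.UTC(2024, 0, 1, h, m);
const edit = (ts: number): AuditEntry =>
  ({ timestamp: ts, users: 'a@example.com', path: '/surf-copy.html', versionLabel: '', versionId: '' });
let log: AuditEntry[] = [];
log = appendOrCollapse(log, edit(at(12, 23)));
log = appendOrCollapse(log, { ...edit(at(12, 25)), versionLabel: 'v1', versionId: 'uuid-1' });
log = appendOrCollapse(log, edit(at(12, 40)));
console.log(log.length); // 3
```

Because the version entry sits between the two edits, the second edit sees a version as the last line and appends rather than collapsing, which is how the window gets "interrupted".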

### 3. API behaviour during migration (progressive rollout)

- **Env**: `VERSIONS_AUDIT_FILE_ORGS` is a comma-separated list of org slugs. For those orgs the version list is read from **audit.txt**; by default, legacy entries from `org/.da-versions/{fileId}/` are still **merged** in until skip-legacy is enabled.
- **Env**: `VERSIONS_AUDIT_SKIP_LEGACY_ORGS` applies to orgs that are **also** in `VERSIONS_AUDIT_FILE_ORGS`: for them, stop reading `org/.da-versions`. **Orgs not in `VERSIONS_AUDIT_FILE_ORGS`** list **only** `org/.da-versions/{fileId}/` (no `audit.txt`, no `repo/.da-versions/{fileId}/`).
- **GET**: Try new key first, then legacy key.
- **PUT/POST**: New writes only to new structure (snapshots + `audit.txt`). No new writes under `org/.da-versions`.
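The rollout rules above reduce to a small decision function plus a two-step GET. The sketch below is illustrative only: the env variable names come from the plan, while `listSourcesFor`, `getWithFallback`, and the `Getter` type are hypothetical and stand in for whatever storage client is actually used.

```typescript
interface VersionEnv {
  VERSIONS_AUDIT_FILE_ORGS?: string;
  VERSIONS_AUDIT_SKIP_LEGACY_ORGS?: string;
}

interface ListSources {
  audit: boolean; // read audit.txt and repo-level snapshots
  legacy: boolean; // read org/.da-versions/{fileId}/
}

function parseOrgList(value: string | undefined): Set<string> {
  return new Set(
    (value ?? '')
      .split(',')
      .map((s) => s.trim())
      .filter((s) => s.length > 0),
  );
}

// Decide which sources feed the version list for an org.
function listSourcesFor(org: string, env: VersionEnv): ListSources {
  const auditOrgs = parseOrgList(env.VERSIONS_AUDIT_FILE_ORGS);
  if (!auditOrgs.has(org)) {
    // Not opted in: legacy listing only.
    return { audit: false, legacy: true };
  }
  // Skip-legacy only has an effect for orgs that are also in the audit list.
  const skipLegacy = parseOrgList(env.VERSIONS_AUDIT_SKIP_LEGACY_ORGS).has(org);
  return { audit: true, legacy: !skipLegacy };
}

// GET: try the new key first, then the legacy key.
type Getter<T> = (key: string) => Promise<T | null>;

async function getWithFallback<T>(
  get: Getter<T>,
  newKey: string,
  legacyKey: string,
): Promise<T | null> {
  return (await get(newKey)) ?? (await get(legacyKey));
}
```

Keeping the decision in one pure function makes the three rollout states (legacy only, merged, audit only) easy to unit test without touching storage.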

### 4. Migration

- **Scripts** (in `scripts/`):
  1. **Analyse**: list version folders and count empty vs non-empty objects.
  2. **Migrate**: copy snapshots to `org/repo/.da-versions/fileId/` and build `audit.txt` from empty-object metadata, using the **same 5-column format** (path without repo, versionId without extension) and the **same dedup rule** (30 min window; version entries do not collapse), and **merging with any existing `audit.txt` in the new path** (hybrid case).
  3. **Validate**: compare list/GET old vs new for a sample path.
- **Dual-read**: Keep supporting both old and new paths until migration is complete; then remove legacy fallback.
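As an illustration of the Analyse step, the counting could be done over a flat object listing like this. The `ListedObject` shape and the `analyse` function are hypothetical; the only thing taken from the plan is the legacy key layout `{Org}/.da-versions/{FileID}/...` and the empty vs non-empty split.

```typescript
interface ListedObject {
  key: string;
  size: number; // content length in bytes
}

interface FolderStats {
  empty: number; // audit-only entries (no body)
  nonEmpty: number; // real snapshots
}

// Groups legacy version objects by {Org}/.da-versions/{FileID} and counts
// empty vs non-empty objects in each folder.
function analyse(objects: ListedObject[]): Map<string, FolderStats> {
  const stats = new Map<string, FolderStats>();
  for (const obj of objects) {
    const parts = obj.key.split('/');
    // Expect at least org, ".da-versions", fileId, object name.
    if (parts.length < 4 || parts[1] !== '.da-versions') continue;
    const folder = parts.slice(0, 3).join('/');
    const s = stats.get(folder) ?? { empty: 0, nonEmpty: 0 };
    if (obj.size === 0) {
      s.empty += 1;
    } else {
      s.nonEmpty += 1;
    }
    stats.set(folder, s);
  }
  return stats;
}
```

Folders with only empty objects need no snapshot copy at all during Migrate; they contribute audit lines only.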

### 5. Benefits

- Far fewer objects: no per-edit empty version files; one `audit.txt` per file with collapsed entries.
- Faster listing: `.da-versions` scoped per repo, not one giant org prefix.
- Clear separation: real versions (snapshots) vs audit log (single file, deduped, human-readable labels in file).

