Skip to content

fix(migration): pass compression to OpenDBIObject (root cause of LMDB→RocksDB hang)#748

Open
kriszyp wants to merge 4 commits into
mainfrom
fix/rocksdb-migration-structures-decode
Open

fix(migration): pass compression to OpenDBIObject (root cause of LMDB→RocksDB hang)#748
kriszyp wants to merge 4 commits into
mainfrom
fix/rocksdb-migration-structures-decode

Conversation

@kriszyp
Copy link
Copy Markdown
Member

@kriszyp kriszyp commented May 22, 2026

Summary

Fixes the blocking bug in migrateOnStart: true (Harper v5.0.21, LMDB→RocksDB migration) where migration silently pinned ~94% CPU and never completed. Reported by Jake — see v5-rocksdb-migration-bug-report.md.

Root cause (one-line fix)

bin/copyDb.ts opens each source LMDB primary store via new OpenDBIObject(!isPrimary, isPrimary) but never threads the per-attribute compression setting onto the init object before calling sourceRootStore.openDB(...). Without compression configured, lmdb-js does not install its decompression layer; every read on that DBI returns raw compressed bytes.

msgpackr then tries to decode those raw compressed bytes, interprets any byte in 0x40-0x7F as a shared-structure reference (see msgpackr/unpack.js read()), and calls loadStructures(). loadStructures re-enters RecordEncoder.decode on the (also compressed) structures buffer, which again contains plenty of bytes in 0x40-0x7F, which triggers another loadStructures, and so on until V8's stack frame limit is hit. That manifests as the 410-error burst from msgpackr's checkedRead, followed by the silent 94% CPU spin.

Harper's normal databases.ts path already does this at line 963 (dbiInit.compression = primaryKeyAttribute.compression); this PR ports the same one-liner into the migration path.

Verified end-to-end against the bug-report tarball — all 6 databases now migrate, Harper starts cleanly, and SELECT * FROM linode.MonthlyReport returns all 8 records with the full nested hostDetails[].orgBreakdown[] payload intact:

migrated database digest to RocksDB
migrated database linode to RocksDB
migrated database oauth to RocksDB
migrated database opmodel to RocksDB
migrated database scheduler to RocksDB
migrated database system to RocksDB
Harper 5.1.0 successfully started

Additional fixes folded in (independent improvements)

While diagnosing I also fixed three separate problems that were obscuring the root cause and would have caused trouble in their own right:

  1. migrateOnStart flag timing. Previously cleared STORAGE_MIGRATEONSTART before the try block, so a crash mid-migration would leave a half-migrated DB with the flag already off (next start wouldn't retry). Moved to after all databases migrate successfully.

  2. copyDbiToRocks retry loop — observability. The original code had retries = 1000000 with an empty catch {}, which is why the bug presented as a silent CPU-pinned hang. Reduced to 1000, log per-retry context, and throw on exhaustion so migrateOnStart's catch preserves the flag and skips moving LMDB files to backup.

  3. DESIGN.md note documenting the compression-on-source-dbi gotcha for the next person who touches this path.

Commit history note

The branch contains a couple of dead-end commits and a revert. Earlier in the investigation I added a defensive zero-byte-prefix branch to RecordEncoder.decode and an 8-byte-strip in copyStructures on the theory that lmdb-js's saveStructures emits a version=0 prefix. Empirical check across all 23 primary stores in the tarball showed zero structures buffers actually begin with [00...]; every one starts with 0x82 (fixmap) directly. v4 RecordEncoder ships with the same nextByte === 2 check and has worked in production on this exact data shape for years. The defensive branch was reverted (last commit) so the final diff matches v4's contract.

Test plan

  • End-to-end repro against /tmp/harper-v5-rocksdb-migration-bug-2026-05-22.tar.gz — all 6 databases migrate, MonthlyReport queryable post-migration
  • npm run test:integration -- 4.x-upgrade.test.ts — existing migration test still passes (additive change)
  • Verify forced mid-migration failure preserves storage.migrateOnStart: true and the original .mdb files (restart-safe; the flag-timing + throw-on-exhaustion changes are what make this work)

Strong candidate for v5.0.x patch release — see /patch-pr. The compression fix is a one-liner that fixes a confirmed blocking bug for any customer whose tables hit the compression threshold (compression defaults to enabled in storage.compression: true).

🤖 Generated by Claude (Opus 4.7) — Co-Authored-By: Claude Opus 4.7

migrateOnStart silently pinned ~94% CPU when migrating a table whose
shared msgpack structures hadn't been loaded into memory before the
migration ran. lmdb-js's internal saveStructures stores the structures
buffer via put() with no version argument; on a useVersions=true store
that emits an 8-byte float64 version=0 prefix (8 zero bytes). The LMDB
branch of RecordEncoder.decode only recognized the ordered-binary
timestamp prefix (first byte 0x02) so it mis-read the 8 zeros as a
2-byte legacy metadata header, leaving 6 junk bytes ahead of the
msgpack payload. msgpackr then threw "Data read, but end of buffer not
reached", the iterator surfaced the throw, and copyDbiToRocks's outer
1,000,000-retry loop swallowed it without logging.

- RecordEncoder.decode: also strip an 8-byte all-zero prefix on the
  LMDB path. The full 8-byte zero check disambiguates from legacy
  2-byte metadata records (e.g. HAS_RESIDENCY_ID alone encodes as
  [00 01 ...]).
- copyStructures: strip the 8-byte version prefix before writing the
  structures buffer to RocksDB so the RocksDB decoder doesn't see it
  as a metadata record.
- migrateOnStart: clear STORAGE_MIGRATEONSTART only after every
  database migrates successfully, so a crash mid-migration is retried
  on the next start rather than leaving a half-migrated state with
  the flag already off.
- copyDbiToRocks: cap retries at 1000 (was 1,000,000), log per-retry
  context, and throw at exhaustion so migrateOnStart's catch keeps
  the flag set and skips moving the LMDB files to backup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@kriszyp kriszyp requested review from cb1kenobi and jcohen-hdb May 22, 2026 21:45
Note the gotcha that caused the v5 RocksDB migration hang — the 8-byte
version=0 prefix that lmdb-js prepends to the structures buffer and
the implications for decode and cross-engine copy paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Reviewed; no blockers found.

This is the actual root cause of the silent-hang bug reported against
v5.0.21 migrateOnStart. The migration was opening each source primary
store via `new OpenDBIObject(!isPrimary, isPrimary)` but never copying
the per-attribute compression setting onto it. Without compression
configured, lmdb-js doesn't install its decompression layer; record
buffers come back as raw compressed bytes. When msgpackr then tries
to decode those bytes, it interprets bytes in the 0x40-0x7F range as
shared-structure references → calls loadStructures → decodes the
structures buffer → finds more bytes in that range → recurses → stack
overflow.

Verified end-to-end against the bug-report tarball (linode.mdb with
MonthlyReport records containing 32-element nested orgBreakdown
arrays):

  migrated database digest to RocksDB
  migrated database linode to RocksDB
  migrated database oauth to RocksDB
  migrated database opmodel to RocksDB
  migrated database scheduler to RocksDB
  migrated database system to RocksDB
  Harper 5.1.0 successfully started

Post-migration `SELECT * FROM linode.MonthlyReport` returns all 8
records correctly. Matches Harper's normal `databases.ts` path which
sets `dbiInit.compression = primaryKeyAttribute.compression` at
line 963.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@kriszyp kriszyp changed the title fix: lmdb→rocksdb migration hang on shared-structures decode fix(migration): pass compression to OpenDBIObject (root cause of LMDB→RocksDB hang) May 23, 2026
Earlier commits in this PR added a defensive zero-byte-prefix branch
to RecordEncoder.decode and an 8-byte-prefix strip to copyStructures,
on the theory that lmdb-js's internal saveStructures emits an 8-byte
float64 version=0 prefix on useVersions=true stores.

Empirical check across all 23 primary stores in the bug-report
tarball shows zero structures buffers begin with [00...]; every one
starts with 0x82 (fixmap) directly. v4 RecordEncoder ships with the
same `nextByte === 2` check and has worked in production for years
on this same data shape, so there is no concrete pathway to a
zero-prefix case here. Revert to match v4.

Also drops the now-orphaned recordEncoder.test.js (the legacy-2-byte
regression guard was only relevant against the reverted branch).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@kriszyp kriszyp marked this pull request as ready for review May 23, 2026 01:12
@kriszyp kriszyp added the patch label May 23, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 23, 2026

Patch cherry-pick: tests ✅ passed

Cherry-picked PR #748 onto v5.0 at branch cherry-pick/v5.0/pr-748.
Integration tests: ✅ passedhttps://github.com/HarperFast/harper/actions/runs/26319493322

On merge of this PR the cherry-pick branch will be fast-forwarded into v5.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant