fix(migration): pass compression to OpenDBIObject (root cause of LMDB→RocksDB hang)#748
Open
kriszyp wants to merge 4 commits into
Open
fix(migration): pass compression to OpenDBIObject (root cause of LMDB→RocksDB hang)#748kriszyp wants to merge 4 commits into
kriszyp wants to merge 4 commits into
Conversation
migrateOnStart silently pinned ~94% CPU when migrating a table whose shared msgpack structures hadn't been loaded into memory before the migration ran. lmdb-js's internal saveStructures stores the structures buffer via put() with no version argument; on a useVersions=true store that emits an 8-byte float64 version=0 prefix (8 zero bytes). The LMDB branch of RecordEncoder.decode only recognized the ordered-binary timestamp prefix (first byte 0x02) so it mis-read the 8 zeros as a 2-byte legacy metadata header, leaving 6 junk bytes ahead of the msgpack payload. msgpackr then threw "Data read, but end of buffer not reached", the iterator surfaced the throw, and copyDbiToRocks's outer 1,000,000-retry loop swallowed it without logging. - RecordEncoder.decode: also strip an 8-byte all-zero prefix on the LMDB path. The full 8-byte zero check disambiguates from legacy 2-byte metadata records (e.g. HAS_RESIDENCY_ID alone encodes as [00 01 ...]). - copyStructures: strip the 8-byte version prefix before writing the structures buffer to RocksDB so the RocksDB decoder doesn't see it as a metadata record. - migrateOnStart: clear STORAGE_MIGRATEONSTART only after every database migrates successfully, so a crash mid-migration is retried on the next start rather than leaving a half-migrated state with the flag already off. - copyDbiToRocks: cap retries at 1000 (was 1,000,000), log per-retry context, and throw at exhaustion so migrateOnStart's catch keeps the flag set and skips moving the LMDB files to backup. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Note the gotcha that caused the v5 RocksDB migration hang — the 8-byte version=0 prefix that lmdb-js prepends to the structures buffer and the implications for decode and cross-engine copy paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
|
Reviewed; no blockers found. |
This is the actual root cause of the silent-hang bug reported against v5.0.21 migrateOnStart. The migration was opening each source primary store via `new OpenDBIObject(!isPrimary, isPrimary)` but never copying the per-attribute compression setting onto it. Without compression configured, lmdb-js doesn't install its decompression layer; record buffers come back as raw compressed bytes. When msgpackr then tries to decode those bytes, it interprets bytes in the 0x40-0x7F range as shared-structure references → calls loadStructures → decodes the structures buffer → finds more bytes in that range → recurses → stack overflow. Verified end-to-end against the bug-report tarball (linode.mdb with MonthlyReport records containing 32-element nested orgBreakdown arrays): migrated database digest to RocksDB migrated database linode to RocksDB migrated database oauth to RocksDB migrated database opmodel to RocksDB migrated database scheduler to RocksDB migrated database system to RocksDB Harper 5.1.0 successfully started Post-migration `SELECT * FROM linode.MonthlyReport` returns all 8 records correctly. Matches Harper's normal `databases.ts` path which sets `dbiInit.compression = primaryKeyAttribute.compression` at line 963. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier commits in this PR added a defensive zero-byte-prefix branch to RecordEncoder.decode and an 8-byte-prefix strip to copyStructures, on the theory that lmdb-js's internal saveStructures emits an 8-byte float64 version=0 prefix on useVersions=true stores. Empirical check across all 23 primary stores in the bug-report tarball shows zero structures buffers begin with [00...]; every one starts with 0x82 (fixmap) directly. v4 RecordEncoder ships with the same `nextByte === 2` check and has worked in production for years on this same data shape, so there is no concrete pathway to a zero-prefix case here. Revert to match v4. Also drops the now-orphaned recordEncoder.test.js (the legacy-2-byte regression guard was only relevant against the reverted branch). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Patch cherry-pick: tests ✅ passedCherry-picked PR #748 onto On merge of this PR the cherry-pick branch will be fast-forwarded into |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the blocking bug in
migrateOnStart: true(Harper v5.0.21, LMDB→RocksDB migration) where migration silently pinned ~94% CPU and never completed. Reported by Jake — seev5-rocksdb-migration-bug-report.md.Root cause (one-line fix)
bin/copyDb.tsopens each source LMDB primary store vianew OpenDBIObject(!isPrimary, isPrimary)but never threads the per-attributecompressionsetting onto the init object before callingsourceRootStore.openDB(...). Withoutcompressionconfigured, lmdb-js does not install its decompression layer; every read on that DBI returns raw compressed bytes.msgpackr then tries to decode those raw compressed bytes, interprets any byte in
0x40-0x7Fas a shared-structure reference (seemsgpackr/unpack.jsread()), and callsloadStructures().loadStructuresre-entersRecordEncoder.decodeon the (also compressed) structures buffer, which again contains plenty of bytes in0x40-0x7F, which triggers anotherloadStructures, and so on until V8's stack frame limit is hit. That manifests as the 410-error burst from msgpackr'scheckedRead, followed by the silent 94% CPU spin.Harper's normal
databases.tspath already does this at line 963 (dbiInit.compression = primaryKeyAttribute.compression); this PR ports the same one-liner into the migration path.Verified end-to-end against the bug-report tarball — all 6 databases now migrate, Harper starts cleanly, and
SELECT * FROM linode.MonthlyReportreturns all 8 records with the full nestedhostDetails[].orgBreakdown[]payload intact:Additional fixes folded in (independent improvements)
While diagnosing I also fixed three separate problems that were obscuring the root cause and would have caused trouble in their own right:
migrateOnStartflag timing. Previously clearedSTORAGE_MIGRATEONSTARTbefore thetryblock, so a crash mid-migration would leave a half-migrated DB with the flag already off (next start wouldn't retry). Moved to after all databases migrate successfully.copyDbiToRocksretry loop — observability. The original code hadretries = 1000000with an emptycatch {}, which is why the bug presented as a silent CPU-pinned hang. Reduced to 1000, log per-retry context, and throw on exhaustion somigrateOnStart's catch preserves the flag and skips moving LMDB files to backup.DESIGN.mdnote documenting the compression-on-source-dbi gotcha for the next person who touches this path.Commit history note
The branch contains a couple of dead-end commits and a revert. Earlier in the investigation I added a defensive zero-byte-prefix branch to
RecordEncoder.decodeand an 8-byte-strip incopyStructureson the theory that lmdb-js'ssaveStructuresemits a version=0 prefix. Empirical check across all 23 primary stores in the tarball showed zero structures buffers actually begin with[00...]; every one starts with0x82(fixmap) directly. v4 RecordEncoder ships with the samenextByte === 2check and has worked in production on this exact data shape for years. The defensive branch was reverted (last commit) so the final diff matches v4's contract.Test plan
/tmp/harper-v5-rocksdb-migration-bug-2026-05-22.tar.gz— all 6 databases migrate, MonthlyReport queryable post-migrationnpm run test:integration -- 4.x-upgrade.test.ts— existing migration test still passes (additive change)storage.migrateOnStart: trueand the original.mdbfiles (restart-safe; the flag-timing + throw-on-exhaustion changes are what make this work)Strong candidate for v5.0.x patch release — see
/patch-pr. The compression fix is a one-liner that fixes a confirmed blocking bug for any customer whose tables hit the compression threshold (compressiondefaults to enabled instorage.compression: true).🤖 Generated by Claude (Opus 4.7) — Co-Authored-By: Claude Opus 4.7