fix(hadoop): use codec-aware metadata filenames#1326
Conversation
tanmayrauth
left a comment
There was a problem hiding this comment.
Looks good — write suffix now matches the read-path codec inference in table.go, and the create→load / commit→load round-trip tests cover both gzip and zstd. One tiny nit.
| @@ -188,6 +190,22 @@ func (c *Catalog) metadataFilePath(ident table.Identifier, version int) string { | |||
There was a problem hiding this comment.
After this change metadataFilePath no longer has a production caller, CreateTable/CommitTable/LoadTable all go through metadataFilePathForCompression / findMetadataLocation now. It's still referenced from the tests so it's not dead, but might be worth either keeping it explicitly as a test-only helper or pointing those tests at the new helper to avoid drift.
There was a problem hiding this comment.
This seems to be a valid comment, I will amend the PR. Have a nice weekend.
|
This LGTM on review, but it currently has merge conflicts with |
There was a problem hiding this comment.
Nice catch on the underlying bug — CreateTable/CommitTable were writing to vN.metadata.json regardless of the codec, so a gzip file behind a plain filename would fail to parse on the next LoadTable, and routing the discovered path from findMetadataLocation straight into NewFromLocation is the right fix. The gzip side of this is solid.
I'd hold it before merging though. The zstd half goes beyond what the ecosystem supports: Java's Codec enum is only NONE/GZIP and never probes for .zstd, PyIceberg falls back to NOOP_COMPRESSOR and reads the bytes as garbage, and iceberg-rust rejects zstd outright. So a Hadoop-catalog table we write with zstd is unreadable by every other client — and the Hadoop catalog is exactly the shared-filesystem case where that matters. I'd gate zstd at this layer (same error you already raise for brotli) unless the intent is an explicit go-only extension, in which case it needs a loud warning in the PR description and CHANGES.md.
The other blocker is independent of zstd: on commit we only stat the new codec's own filename, and on load we re-probe with a fixed plain→gz→zstd order even though WalkDir already saw the real filename. Both let a stale plain vN shadow a newer compressed vN, which silently defeats a compression migration and can produce two different files at the same version under a concurrent cross-codec write.
A few things I'd want before merge:
- gate (or loudly document) zstd for the Hadoop catalog
- make the commit conflict check probe all codec variants for the new version
- thread the filename
WalkDiralready found intoscanForwardinstead of re-probing, and add a test pinning which file wins whenvN.metadata.jsonandvN.gz.metadata.jsonboth exist - assert the returned path (not just the version) in the gzip/zstd
findVersiontests
Once those are addressed, happy to take another pass and approve.
| case table.MetadataCompressionCodecGzip: | ||
| suffix = ".gz.metadata.json" | ||
| case table.MetadataCompressionCodecZstd: | ||
| suffix = ".zstd.metadata.json" |
There was a problem hiding this comment.
I think the zstd branch here is a cross-client trap. zstd isn't a value Java's TableMetadataParser.Codec understands — its enum is only NONE("") and GZIP(".gz"), and HadoopTableOperations.getMetadataFile only ever probes Codec.values(), so a vN.zstd.metadata.json is never even looked for. PyIceberg falls back to NOOP_COMPRESSOR for any non-.gz suffix and reads the compressed bytes as garbage, and iceberg-rust rejects zstd outright in parse_metadata_file_compression.
So a Hadoop-catalog table we write with zstd is invisible-or-corrupt to every other Iceberg client, and the Hadoop catalog is precisely the shared-filesystem case where that interop matters most.
I'd gate zstd at this layer for the Hadoop catalog — return the same "unsupported codec" error you already produce for brotli, so the failure is loud at write time rather than a silent unreadable file. If the intent is genuinely a go-to-go-only extension, that needs a prominent warning in the PR description and CHANGES.md and a clear note in the code that this is non-portable. wdyt?
| tempPath := filepath.Join(metaDir, uuid.New().String()+".metadata.json") | ||
|
|
||
| compression := updated.Properties().Get(table.MetadataCompressionKey, table.MetadataCompressionDefault) | ||
| newMetaPath, err := c.metadataFilePathForCompression(ident, newVersion, compression) |
There was a problem hiding this comment.
The conflict guard below only stats the codec-specific path for newVersion, so cross-codec races slip through. If writer A plans v2.gz.metadata.json and writer B plans v2.metadata.json concurrently, neither Stat sees the other and both commit — two different files both claiming version 2, breaking the monotonic chain.
Java's getMetadataFile iterates every codec variant, so its conflict detection is codec-agnostic by construction.
I'd make the guard probe all variants: call metadataVersionLocation(ident, newVersion) and fail the commit if it finds any file for that version, regardless of suffix. That also covers the migration case where someone switches codecs between commits.
| strings.Join(ident, "."), catalog.ErrNoSuchTable) | ||
| } | ||
|
|
||
| metaPath, ok := c.metadataVersionLocation(ident, maxVer) |
There was a problem hiding this comment.
The WalkDir above already saw the exact filename for maxVer (it matched versionPattern to compute the version), but we throw the name away and only keep the integer — then re-probe here with the fixed plain→gz→zstd order.
That re-probe is where stale data wins: if both v5.metadata.json (old) and v5.gz.metadata.json (new) exist, plain is probed first and silently shadows the compressed current version, so the catalog serves stale uncompressed metadata even though the authoritative version is the gzip one. A compression migration would be quietly defeated.
I'd capture the matched filename in the WalkDir callback (it's the path param) when v > maxVer, and thread that straight into scanForward so we never re-probe a file we just saw. That removes the shadowing and drops the extra Stats per lookup on the slow path.
|
|
||
| ver, err := s.cat.findVersion(ident) | ||
| s.Require().NoError(err) | ||
| s.Equal(2, ver) |
There was a problem hiding this comment.
TestFindVersionZstdOnlyWithHint and TestFindVersionMixedZstdAndPlain only assert the version number, never the returned path — so the part of the fix that actually matters (that findMetadataLocation returns the codec-suffixed path, not a reconstructed plain one) isn't directly exercised here. The empty nil files are fine since these call only Stat, but the path is the thing the LoadTable fix hinges on.
I'd add a direct findMetadataLocation assertion that the returned path ends in .zstd.metadata.json for the zstd-only case and .gz.metadata.json for a gzip-only one. That's also the natural place to add the shadowing case from the comment on findMetadataLocation — write both v3.metadata.json and v3.gz.metadata.json and assert which path wins, so the resolution is pinned by a test.
|
|
||
| path, err = s.cat.metadataFilePathForCompression(ident, 1, table.MetadataCompressionCodecZstd) | ||
| s.Require().NoError(err) | ||
| s.Equal(filepath.Join(s.warehouse, "ns", "tbl", "metadata", "v1.zstd.metadata.json"), path) |
There was a problem hiding this comment.
The old TestMetadataFilePath covered version 1 and version 42; the rewrite only keeps version 1 across the three codecs. The multi-digit case is the normal path for any table past v9, and fmt.Sprintf("v%d%s", ...) is exactly where a width/format slip would hide.
I'd carry a version=42 assertion over for at least one codec while we're here.
| // number and the UUID is in canonical 8-4-4-4-12 hex format. | ||
| var uuidMetadataPattern = regexp.MustCompile( | ||
| `^[0-9]{5}-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?:\.gz)?\.metadata\.json$`, | ||
| `^[0-9]{5}-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}(?:\.(?:gz|zstd))?\.metadata\.json$`, |
There was a problem hiding this comment.
Two small things here. The // Java/PyIceberg catalogs comment above is now inaccurate for the zstd alternative — neither Java nor PyIceberg ever emits 00000-<uuid>.zstd.metadata.json, so I'd note the zstd part is an iceberg-go-only extension rather than implying parity.
And recognizing UUID-style zstd here (used by isTableDir/ListTables) while findMetadataLocation's WalkDir still keys off versionPattern only creates a confusing split: a user could create a table in Python, see it in ListTables, then get ErrNoSuchTable on LoadTable. That load gap is pre-existing, but I'd at least not widen the recognition surface without a // recognized but not loadable note. The new zstd UUID branch also has no test backing it.
| filepath.Join(dir, fmt.Sprintf("v%d.metadata.json", version)), | ||
| filepath.Join(dir, fmt.Sprintf("v%d.gz.metadata.json", version)), | ||
| filepath.Join(dir, fmt.Sprintf("v%d.zstd.metadata.json", version)), | ||
| } { |
There was a problem hiding this comment.
Worth a one-line comment that any non-ErrNotExist Stat error (permission denied, transient FS error) is treated as "not found" here. It's the same behavior as the old metadataVersionExists, so not a regression, but on a slow filesystem a transient error could make a real version look absent — calling it out keeps the next reader from assuming it's only an existence check.
| s.Equal(len(created.Schema().Fields()), len(loaded.Schema().Fields())) | ||
| } | ||
|
|
||
| func (s *HadoopCatalogTestSuite) TestCreateTableGzipMetadata() { |
There was a problem hiding this comment.
These four new integration tests (TestCreateTableGzipMetadata, TestCreateTableZstdMetadata, TestCommitTableGzipMetadata, TestCommitTableZstdMetadata) repeat the same os.Mkdir + props map + CreateTable + FileExists boilerplate, where the existing suite leans on a createTestTable helper.
I'd fold the four into a createTestTableWithCompression helper or table-driven subtests to match the rest of the suite, and while you're adding zstd coverage, a TestIsTableDirTrueUUIDMetadataZstd to mirror the existing UUID+gzip case would catch a ztsd-style typo in uuidMetadataPattern. Low priority next to the blockers above.
03d947b to
3512a3f
Compare
laskoviymishka
left a comment
There was a problem hiding this comment.
The fix itself is right — the gzip round-trip tests exercise the previously-broken write-then-reload path with real compressed bytes and assert UUID equality, so I'm confident the reported bug (compressed bytes behind a plain filename) is closed. The zstd hard-reject on write is a nice improvement over silently writing something that fails on read.
One thing I'd fix before this lands: findMetadataLocation returns maxPath straight from the WalkDir callback, so when both v{n}.metadata.json and v{n}.gz.metadata.json exist for a version, the file we load depends on listing order. On local FS the .gz variant happens to sort first so the test passes, but on a remote/NFS listing it's arbitrary — and it disagrees with the hint path, which resolves through metadataVersionLocation and deterministically prefers plain. Same table, two discovery branches, potentially two different files. Canonicalizing on metadataVersionLocation(ident, maxVer) after the walk closes that.
Left the detail inline. Fix that and I think this is good to land.
| v, _ := strconv.Atoi(matches[1]) | ||
| if v > maxVer { | ||
| maxVer = v | ||
| maxPath = path |
There was a problem hiding this comment.
maxPath is set to whichever variant WalkDir visits first, so when both v3.metadata.json and v3.gz.metadata.json exist for a version the loaded file depends on listing order. On local FS .gz sorts ahead of .metadata so it's stable and the test passes, but on a remote/NFS listing it's arbitrary — and it disagrees with the hint path, which goes through metadataVersionLocation and always prefers plain. Same table can then resolve to two different metadata files depending on which discovery branch runs.
I'd drop maxPath and call metadataVersionLocation(ident, maxVer) after the walk so both discovery paths land on the same file. wdyt?
zeroshade
left a comment
There was a problem hiding this comment.
Remaining blockers are concurrency races in the new cross-codec Hadoop metadata paths. laskoviymishka's zstd-rejection and path-canonicalization threads appear addressed; items below are the remaining blockers (concurrency).
- [MAJOR] CommitTable cross-codec version conflict check is non-atomic; add a codec-independent per-version claim and mixed-codec concurrent commit test.
- [MAJOR] CreateTable has the same default/gzip duplicate-v1 race; reuse the claim and add a concurrent create test.
| } | ||
|
|
||
| if existingPath, exists := c.metadataVersionLocation(ident, newVersion); exists { | ||
| _ = c.filesystem.Remove(tempPath) |
There was a problem hiding this comment.
[MAJOR] The cross-codec commit conflict check is non-atomic. Two concurrent writers can both compute the same newVersion, both pass metadataVersionLocation(ident, newVersion) before either publishes, then one RenameNoReplace can publish vN.metadata.json while the other publishes vN.gz.metadata.json; both succeed, leaving two files claiming the same version. Fix by taking an atomic per-version claim independent of codec suffix before publishing (for example a RenameNoReplace sentinel for vN), and add a concurrent mixed-codec commit test where both writers pass preflight before either final rename.
| } | ||
| } | ||
|
|
||
| metaPath, err := c.metadataFilePathForCompression(ident, version, compression) |
There was a problem hiding this comment.
[MAJOR] CreateTable has the same duplicate-version race with different compression. Two concurrent creators can both pass isTableDir, then publish v1.metadata.json and v1.gz.metadata.json successfully. Reuse a codec-independent per-version/table-creation claim before publishing metadata, and add a concurrent create test with default and gzip writers.
zeroshade
left a comment
There was a problem hiding this comment.
Thanks for the update — the two earlier concurrency concerns are addressed now: the codec-independent vN.metadata.json.claim path closes the mixed-codec version race, and the new concurrent mixed-codec create/commit tests cover the real scenarios. Nice work.
I found one new follow-up around crash recovery for retained claim files that should be addressed before merge.
| if err := c.filesystem.RenameNoReplace(tempPath, metaPath); err != nil { | ||
| func (c *Catalog) commitMetadataFile(ident table.Identifier, version int, tempPath, metaPath string, conflictErr error) error { | ||
| claimPath := c.metadataVersionClaimPath(ident, version) | ||
| if err := c.filesystem.RenameNoReplace(tempPath, claimPath); err != nil { |
There was a problem hiding this comment.
[MAJOR] A stale .claim file can permanently block this metadata version. The claim is created here with RenameNoReplace(tempPath, claimPath), then is only removed by the deferred cleanup when removeClaim remains true or intentionally retained after the final rename. If the process dies after taking the claim but before publishing the final metadata file, vN.metadata.json.claim remains with no final vN.*.metadata.json, so every future attempt for that version hits fs.ErrExist and returns a conflict even though metadataVersionLocation finds no actual metadata file. Please make stale claims recoverable: when claimPath already exists, check whether any final metadata variant exists via metadataVersionLocation; if none exists, treat the claim as abandoned by reclaiming/removing it, or encode enough information to distinguish in-progress vs abandoned. Add a regression test that leaves a vN.metadata.json.claim with no vN.*.metadata.json and verifies recovery.
1fabc33 to
a61039c
Compare
|
Following up after the latest commit: it only addresses the nlreturn lint, so my earlier request-changes still stands. The stale-claim recovery gap is unaddressed — if the process dies between taking the vN.metadata.json.claim and the final publish, that claim permanently blocks the version because the fs.ErrExist branch never checks whether a final metadata file actually exists. Please add the recovery check (on fs.ErrExist, call metadataVersionLocation and reclaim or remove an abandoned claim when no final metadata exists) plus a regression test that leaves a claim without final metadata. |
Summary
Why
Hadoop currently always writes metadata to vN.metadata.json even when write.metadata.compression-codec requests gzip or zstd. The table metadata reader infers decompression from the filename suffix, so compressed bytes behind a plain .metadata.json path can fail to reload.
Using codec-aware filenames on write and carrying the discovered metadata path through LoadTable keeps create, commit, and reload consistent for plain, gzip, and zstd metadata.
Testing