-
Notifications
You must be signed in to change notification settings - Fork 424
fix: preserve HMS table properties during commits #2927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
96b6185 to
46ee7f7
Compare
This change fixes HMS-specific table properties being lost during commits by merging parameters instead of replacing them. Fixes: apache#2926
46ee7f7 to
be4e2a7
Compare
|
Thanks for the fix. A few things to check for HMS properties that I can remember:
|
a9fbb1c to
ae8a024
Compare
Makes sense. Added more tests to check these cases, also the code gives precedence to iceberg properties, PTAL and let me know if you find any issues. Thanks! |
pyiceberg/catalog/hive.py
Outdated
| # Merge HMS parameters: preserve HMS-only properties (e.g., table_category) while | ||
| # ensuring Iceberg controls its metadata. Start with HMS params, remove deleted | ||
| # Iceberg properties, then update with new Iceberg values (which take precedence). | ||
| updated_parameters = {k: v for k, v in hive_table.parameters.items() if k not in removed_keys} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this looks fine, I think it would be cleaner if we started with the existing, removed and then updated to have the final params. Something like this:
merged_params = dict(hive_table.parameters)
for key in removed_keys:
merged_params.pop(key, None)
merged_params.update(new_parameters)
hive_table.parameters = merged_params
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| assert hive_table.parameters.get("abc") is None | ||
|
|
||
|
|
||
| @pytest.mark.integration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth adding a test for property deletion to ensure it's not picked up from the old state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
tests/integration/test_reads.py
Outdated
|
|
||
| @pytest.mark.integration | ||
| @pytest.mark.parametrize("catalog", [pytest.lazy_fixture("session_catalog_hive")]) | ||
| def test_hive_iceberg_property_takes_precedence(catalog: Catalog) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be named better. It's more like Iceberg's metadata is the source of truth. CC: @kevinjqliu to see what the right expected behavior is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good call, this should be phrased as iceberg metadata is the source of truth, and HMS parameters are a projection of Iceberg state, plus any HMS-only properties that external tools may have set.
Looking at the Java implementation, they follow this order to sync properties:
- We start our with the existing HMS params
- Push all Iceberg properties into HMS
- Remove old props
- Then set all the iceberg properties
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pyiceberg/catalog/hive.py
Outdated
| # Merge HMS parameters: preserve HMS-only properties (e.g., table_category) while | ||
| # ensuring Iceberg controls its metadata. Start with HMS params, remove deleted | ||
| # Iceberg properties, then update with new Iceberg values (which take precedence). | ||
| updated_parameters = {k: v for k, v in hive_table.parameters.items() if k not in removed_keys} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking as this solves the issue, but one suggestion would be to extract the merge logic into a small helper function like _merge_hms_parameters(). That way we could unit test without mocking the full commit flow, and document the semantics in one spot.
pyiceberg/catalog/hive.py
Outdated
| ) | ||
|
|
||
| # Detect properties that were removed from Iceberg metadata | ||
| old_iceberg_keys = set(current_table.properties.keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can simplify to current_table.properties.keys() - updated_staged_table.properties.keys()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks
5b2939b to
291311f
Compare
Rationale for this change
HMS-specific table properties (e.g.,
table_categoryfrom data contracts) were lost during PyIceberg commits becausecommit_table()replaced all HMS parameters instead of merging them.Fixes: #2926
Are these changes tested?
Yes. Added
test_hive_preserves_hms_specific_properties()that:Are there any user-facing changes?
Bug fix only - HMS properties set outside PyIceberg are now preserved during commits. No API changes.