
feat(table): add support for merge-on-read delete #721

Open
alexandre-normand wants to merge 1 commit into apache:main from alexandre-normand:alex.normand/merge-on-read-delete

Conversation


@alexandre-normand alexandre-normand commented Feb 12, 2026


This adds support for merge-on-read deletes. It offers an alternative to copy-on-write by generating position delete files instead of rewriting existing data files.

I'm not very confident in the elegance of my solution since I'm still new to the internals of iceberg-go, but the high-level approach is:

  • Reuse the classification code from the existing delete implementation to split affected files into fully dropped files vs. files with partial deletes.
  • Reuse the Arrow scanning facilities to filter records from the data files with partial deletes and emit position delete records with file path and position.
    • This is done by reusing the pipeline code and making the first pipeline stage one that enriches the RecordBatch with the file path and position, before the original position is lost to filtering.
    • After filtering, the RecordBatch is projected to the position delete schema (i.e. the original schema fields are dropped).
  • Once we have the filtered PositionDelete records that need to be emitted, we reuse the record-to-file writing code to generate the position delete files.
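The enrich-filter-project flow above can be sketched with plain slices standing in for Arrow RecordBatches. The names `row`, `posDelete`, and `emitPositionDeletes` below are hypothetical stand-ins; the real implementation works on `arrow.Record` values through the pipeline's stages:

```go
package main

import "fmt"

// row stands in for a single record read from a data file.
type row struct{ value int }

// posDelete mirrors the position-delete schema: (file_path, pos).
type posDelete struct {
	FilePath string
	Pos      int64
}

// emitPositionDeletes tags each record with its original position
// *before* filtering, so the position survives when non-matching rows
// are shed, then projects the survivors down to (file_path, pos).
// matches reports whether the delete predicate hits a row.
func emitPositionDeletes(filePath string, rows []row, matches func(row) bool) []posDelete {
	var deletes []posDelete
	for i, r := range rows {
		// Stage 1: enrich with file path and original position i.
		// Stage 2: filter — keep only rows the delete predicate hits.
		if matches(r) {
			// Stage 3: project to the position-delete schema.
			deletes = append(deletes, posDelete{FilePath: filePath, Pos: int64(i)})
		}
	}
	return deletes
}

func main() {
	rows := []row{{1}, {2}, {3}, {4}}
	dels := emitPositionDeletes("s3://bucket/data/f1.parquet", rows,
		func(r row) bool { return r.value%2 == 0 })
	fmt.Println(dels)
}
```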

Testing

Integration tests were added to exercise the partitioned and unpartitioned paths. The data is chosen so that the tests actually produce a position delete file, rather than taking the quick path that drops an entire file because all of its records are deleted.

Indirect fixes

While working on this change and adding tests for partitioned table deletions, I realized that the manifest evaluation was not built correctly when the filter touched a field that is part of a partition spec. It needs to use code similar to the scan path: build projections and a manifest evaluator per partition spec id. This is fixed in this PR, but it technically also applies to the copy-on-write and overwrite paths, so the fix goes beyond the scope of merge-on-read.
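The per-spec-id evaluator construction described here can be sketched as a small cache. `evaluator` and `buildEvaluator` below are hypothetical stand-ins for the real filter projection and ManifestEvaluator construction:

```go
package main

import "fmt"

// evaluator decides whether a manifest's partition summaries may
// contain rows matching the row filter.
type evaluator func(partitionValue int) bool

// buildEvaluator stands in for projecting the row filter onto a
// partition spec and building a manifest evaluator for it.
func buildEvaluator(specID int) evaluator {
	return func(v int) bool { return v%(specID+1) == 0 }
}

// evaluatorFor builds one evaluator per partition spec id and caches
// it, mirroring what the scan path does when evaluating manifests.
func evaluatorFor(cache map[int]evaluator, specID int) evaluator {
	if e, ok := cache[specID]; ok {
		return e
	}
	e := buildEvaluator(specID)
	cache[specID] = e
	return e
}

func main() {
	cache := map[int]evaluator{}
	fmt.Println(evaluatorFor(cache, 1)(4)) // prints "true": 4 % 2 == 0
}
```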

@alexandre-normand alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch 5 times, most recently from 5079248 to 114fc57 Compare February 12, 2026 23:42
```diff
 var PositionalDeleteSchema = NewSchema(0,
 	NestedField{ID: 2147483546, Type: PrimitiveTypes.String, Name: "file_path", Required: true},
-	NestedField{ID: 2147483545, Type: PrimitiveTypes.Int32, Name: "pos", Required: true},
+	NestedField{ID: 2147483545, Type: PrimitiveTypes.Int64, Name: "pos", Required: true},
```

The spec says that pos is a long so I updated this to be Int64 rather than Int32.

@alexandre-normand alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch 5 times, most recently from c9e30c4 to a7a7ce6 Compare February 15, 2026 06:45
@alexandre-normand alexandre-normand marked this pull request as ready for review February 15, 2026 06:46
@alexandre-normand alexandre-normand force-pushed the alex.normand/merge-on-read-delete branch from a7a7ce6 to 6d41651 Compare February 16, 2026 19:35

@laskoviymishka laskoviymishka left a comment


Overall, this looks correct (I mostly compared it with Iceberg Java), but I think a couple of things need attention:

  • panic/recover is used in normal write-path error handling;
  • position-delete fanout needs additional tests;
  • missing focused local unit tests for invariant-heavy code (enrichRecordsWithPosDeleteFields, position-delete fanout).

Regarding the API change: the ToDataFile change is internal (package scope), but the ManifestWriter.ToManifestFile signature change is public and should be explicitly called out in the compatibility/release notes.

I would advocate for not changing this API and instead adding a new one with the extra argument, but this decision, in my opinion, is not critical.


```go
currentSpec, err := meta.CurrentSpec()
if err != nil || currentSpec == nil {
	panic(fmt.Errorf("%w: cannot write files without a current spec", err))
```

Can we avoid panic/recover for expected failures here?
CurrentSpec() and schema conversion errors are regular error cases, and returning an error iterator directly would be clearer and safer than panicking. Also, the recover block only handles string and error; any other panic type could produce yield(nil, nil), which masks the failure.

```go
// enrichRecordsWithPosDeleteFields enriches a RecordBatch with the columns declared in
// PositionalDeleteArrowSchema so that, during the pipeline filtering stages that shed
// filtered-out records, we still have a way to preserve the original position of those records.
func enrichRecordsWithPosDeleteFields(ctx context.Context, filePath iceberg.DataFile) recProcessFn {
```

Please add focused unit tests for enrichRecordsWithPosDeleteFields:

  • monotonically increasing pos across multiple input batches
  • exact appended schema/column ordering
  • empty batch behavior
  • failure mode if required fields are missing from PositionalDeleteArrowSchema.
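The first invariant on that list (monotonically increasing pos across batches) can be isolated in a tiny helper; `nextPositions` is a hypothetical function, since the real code computes the pos column on Arrow batches:

```go
package main

import "fmt"

// nextPositions computes the pos column for one batch, carrying a
// running offset across batches so positions stay monotonically
// increasing over the whole data file.
func nextPositions(startPos int64, batchLen int) ([]int64, int64) {
	positions := make([]int64, batchLen)
	for i := range positions {
		positions[i] = startPos + int64(i)
	}
	return positions, startPos + int64(batchLen)
}

func main() {
	var pos int64
	p1, pos := nextPositions(pos, 3)
	p2, pos := nextPositions(pos, 2)
	fmt.Println(p1, p2, pos) // prints "[0 1 2] [3 4] 5"
}
```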


```diff
-func (w *ManifestWriter) ToManifestFile(location string, length int64) (ManifestFile, error) {
+func (w *ManifestWriter) ToManifestFile(location string, length int64, content ManifestContent) (ManifestFile, error) {
```

I think the table/internal.DataFileStatistics.ToDataFile signature change is acceptable (it's internal, so who cares?), but ManifestWriter.ToManifestFile is externally visible and now requires a new content argument.

This is a backward-incompatible change, so two questions:

  1. Is it really needed?
  2. If it is needed, we need to state it in the changelog.
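One way to preserve compatibility, as the review suggests: keep the old signature as a wrapper that defaults to data content, and put the extra argument on a new method. A toy sketch with simplified types (`ToManifestFileWithContent` is a hypothetical name, not the PR's actual API):

```go
package main

import "fmt"

// ManifestContent distinguishes data manifests from delete manifests.
type ManifestContent int

const (
	ManifestContentData ManifestContent = iota
	ManifestContentDeletes
)

// ManifestFile and ManifestWriter are drastically simplified stand-ins.
type ManifestFile struct {
	Location string
	Length   int64
	Content  ManifestContent
}

type ManifestWriter struct{}

// New method carries the extra content argument.
func (w *ManifestWriter) ToManifestFileWithContent(location string, length int64, content ManifestContent) (ManifestFile, error) {
	return ManifestFile{Location: location, Length: length, Content: content}, nil
}

// Old signature preserved for existing callers; defaults to data content.
func (w *ManifestWriter) ToManifestFile(location string, length int64) (ManifestFile, error) {
	return w.ToManifestFileWithContent(location, length, ManifestContentData)
}

func main() {
	var w ManifestWriter
	mf, _ := w.ToManifestFile("m1.avro", 123)
	fmt.Println(mf.Content == ManifestContentData) // prints "true"
}
```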

```go
// positionDeletePartitionedFanoutWriter distributes Arrow position delete records across
// multiple partitions based on a partition specification, writing data to separate delete
// files for each partition using a fanout pattern with configurable parallelism.
type positionDeletePartitionedFanoutWriter struct {
```

Could we add unit tests for positionDeletePartitionedFanoutWriter covering:

  • mixed file_path values in one batch (assert/handle explicitly)
  • missing partitionDataByFilePath entry (error instead of silent nil)
  • empty batch behavior and cancellation path.

Also, consider avoiding defer record.Release() inside the loop to prevent delayed releases in long runs.
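The defer-in-loop concern can be illustrated with a toy reference-counted record; `record` and `processAll` below are hypothetical stand-ins for Arrow's Retain/Release semantics:

```go
package main

import "fmt"

// record mimics an Arrow record with manual reference counting.
type record struct{ released *int }

func (r record) Release() { *r.released++ }

// processAll releases each record as soon as it is consumed. A defer
// inside the loop would instead pin every record until the function
// returns, delaying releases over a long run.
func processAll(records []record) {
	for _, rec := range records {
		// ... partition and write rec ...
		rec.Release()
	}
}

func main() {
	released := 0
	processAll([]record{{&released}, {&released}, {&released}})
	fmt.Println(released) // prints "3"
}
```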
