Skip to content

[SPARK-57537][SS] Add metrics support for deleteRange operation in RocksDB#56598

Open
trisjain-db wants to merge 1 commit into
apache:masterfrom
trisjain-db:deleterange-metrics
Open

[SPARK-57537][SS] Add metrics support for deleteRange operation in RocksDB#56598
trisjain-db wants to merge 1 commit into
apache:masterfrom
trisjain-db:deleterange-metrics

Conversation

@trisjain-db

Copy link
Copy Markdown

What changes were proposed in this pull request?

Added metrics update for deleteRange operation (numKeysOnWritingVersion and numInternalKeysOnWritingVersion). The metric update is gated on conf.trackTotalNumberOfRows so callers that don't expose the metric pay no extra cost.

Why are the changes needed?

deleteRange currently does not update the row-count metrics, which can leave numKeysOnWritingVersion and numInternalKeysOnWritingVersion inaccurate after range deletes. This change keeps the metrics consistent with other write paths while avoiding extra cost when trackTotalNumberOfRows is disabled.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Four new unit tests in RocksDBStateStoreSuite:

  • numKeys metric is decremented by the number of keys removed
  • empty range does not change numKeys metric
  • internal column family keys are tracked separately
  • metric is not updated when trackTotalNumberOfRows is disabled

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude

@trisjain-db trisjain-db marked this pull request as ready for review June 18, 2026 18:29
@trisjain-db trisjain-db force-pushed the deleterange-metrics branch from fb66c64 to 2dbe50f Compare June 18, 2026 18:33
cfName,
includesPrefix
)
numKeysOnWritingVersion -= deletedKeys

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness concern: The counters are decremented before db.deleteRange() executes. If deleteRange throws (RocksDB write stall, disk full, or a closed-DB exception), the keys still exist but the counters have already been reduced - permanently under-counting for the rest of this version. Move the decrement after the successful db.deleteRange() call.

}

// Count the keys that are about to be deleted (gated on trackTotalNumberOfRows).
if (conf.trackTotalNumberOfRows) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleteRange in RocksDB is O(1) - it writes a single range tombstone without touching individual keys. Scanning the full range with an iterator to count keys makes this O(n). For TTL-based state expiration (the primary deleteRange use case in structured streaming), ranges can contain millions of expired keys. When trackTotalNumberOfRows is enabled (the default), every deleteRange now iterates the full range before deleting it. Worth documenting this cost tradeoff, or considering RocksDB's approximate-count APIs (GetApproximateSizes) as an alternative for large ranges.

val rangeReadOptions = new ReadOptions()
rangeReadOptions.setIterateUpperBound(upperBoundSlice)
val iter = db.newIterator(rangeReadOptions)
try {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: upperBoundSlice and rangeReadOptions are allocated before the try block. If db.newIterator(rangeReadOptions) throws, neither native resource is closed - the finally only runs after the try starts. Consider wrapping each in its own try-finally, or moving the allocations inside the try with a null-check in finally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants