Introduction:
Currently, Pinot has no straightforward way to get in-depth, near real-time metadata for segments as they are committed to the deep store. (I say committed rather than ingested because we do not want to bottleneck ingestion.)
While ZooKeeper's external view provides basic segment-level metadata (timestamps, total docs, CRCs), more granular physical metadata, such as Bloom filter states, dictionary sizes, and specific index configurations, exists only on the servers.
To access this today, we have to rely on the `/segments/{tableName}/metadata?columns=<list of columns>` API.
This presents a few roadblocks:
- It is not performant for tables with even a moderate number of columns.
- It requires heavy, synchronous polling, which puts unnecessary load on the servers.
- It prevents near real-time availability of segment metadata for downstream systems, because this approach introduces multiple bottlenecks (disk, network).
I would like to propose a configurable, optional event-driven mechanism that pushes complete segment metadata to a sink (for example, a Kafka topic) once a segment is committed to the deep store.
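To make the idea concrete, here is a minimal sketch in Python of what such a mechanism could look like. Everything in it is an assumption for illustration only: the names `SegmentCommittedEvent`, `ColumnIndexMetadata`, `MetadataSink`, `InMemorySink`, and `SegmentMetadataPublisher` are hypothetical and are not existing Pinot classes, and the event fields are just the examples mentioned in this issue. A real implementation would live in Pinot itself and would need its own schema design.

```python
import json
from dataclasses import dataclass, asdict
from typing import Protocol


@dataclass(frozen=True)
class ColumnIndexMetadata:
    # Hypothetical per-column fields, based on the examples in this issue.
    column: str
    has_bloom_filter: bool
    dictionary_size_bytes: int
    indexes: tuple = ()


@dataclass(frozen=True)
class SegmentCommittedEvent:
    # Hypothetical event emitted once a segment is committed to the deep store.
    table_name: str
    segment_name: str
    crc: int
    total_docs: int
    columns: tuple = ()

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses; tuples serialize as JSON arrays.
        return json.dumps(asdict(self), sort_keys=True)


class MetadataSink(Protocol):
    # Pluggable destination; a Kafka-backed sink would implement the same method.
    def publish(self, key: str, payload: str) -> None: ...


class InMemorySink:
    """Stand-in sink that only collects records, used here instead of Kafka."""

    def __init__(self) -> None:
        self.records = []

    def publish(self, key: str, payload: str) -> None:
        self.records.append((key, payload))


class SegmentMetadataPublisher:
    """Publishes commit events when enabled; disabled by default (opt-in)."""

    def __init__(self, sink: MetadataSink, enabled: bool = False) -> None:
        self._sink = sink
        self._enabled = enabled

    def on_segment_committed(self, event: SegmentCommittedEvent) -> bool:
        if not self._enabled:
            return False
        try:
            key = f"{event.table_name}/{event.segment_name}"
            self._sink.publish(key, event.to_json())
            return True
        except Exception:
            # A sink failure is swallowed so it cannot fail the commit itself.
            return False


if __name__ == "__main__":
    sink = InMemorySink()
    publisher = SegmentMetadataPublisher(sink, enabled=True)
    publisher.on_segment_committed(
        SegmentCommittedEvent(
            table_name="myTable_OFFLINE",
            segment_name="myTable_0",
            crc=12345,
            total_docs=1000,
            columns=(ColumnIndexMetadata("userId", True, 2048, ("inverted",)),),
        )
    )
    print(sink.records[0][1])
```

The two design points this sketch is meant to highlight are that publishing is opt-in, and that the hook runs after the commit and never propagates sink errors, so the commit and ingestion paths are not slowed down or failed by the sink.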
Capturing metadata at this granular level via an event stream would enable:
- Better observability into Pinot operations: visibility into index storage footprints, anomaly detection, and pipeline health at a very granular level without hammering the Controller/Server/ZooKeeper APIs.
- Managing TTL'd / cold-tier segments: pushing this data to a separate meta table would allow a user to maintain a permanent catalog of segments even after their TTL has expired and they are dropped from the active cluster. While this data is present in the deep store, it is not as easily queryable there as it would be if it resided in Pinot.
This issue is meant to gather community interest. If there is sufficient interest, I will come up with a PEP.