Environment: Pinot controller, realtime table, large partition count (e.g. 100+), multiple replica groups, Kafka over SASL/SSL.
Symptom: POST /tables takes several minutes to respond when the Kafka topic has a large number of partitions. The client times out, but the controller eventually completes the work correctly and writes the ideal state.
Root cause (traced from source):
addTable is fully synchronous on the HTTP thread. After InstanceAssignmentDriver finishes (ZK-only, fast), PinotLLCRealtimeSegmentManager.setUpNewTable() calls getNewPartitionGroupMetadataList(), which ends up in StreamMetadataProvider.computePartitionGroupMetadata(). That method loops over all partitions sequentially - for each partition it constructs a new KafkaConsumer (full SASL/SSL handshake) and calls fetchStreamPartitionOffset. With SASL/SSL, each handshake takes several seconds, so the total time scales linearly with partition count and blocks the HTTP thread throughout.
Call chain:
POST /tables (PinotTableRestletResource.java:262)
  → PinotHelixResourceManager.addTable() (line 1866)
    → PinotLLCRealtimeSegmentManager.setUpNewTable() (line 379)
      → getNewPartitionGroupMetadataList()
        → PinotTableIdealStateBuilder.getPartitionGroupMetadataList()
          → PartitionGroupMetadataFetcher.call()
            → StreamMetadataProvider.computePartitionGroupMetadata()
              → for i in 0..N: ← serial, no parallelism
                  new KafkaPartitionLevelConnectionHandler(...)
                    → new KafkaConsumer<>() ← SASL/SSL handshake per partition
                    → fetchStreamPartitionOffset()
Questions:
- Is this expected? Is there a known workaround for large partition counts with SASL/SSL?
- Is there a path to parallelize the per-partition offset fetch in computePartitionGroupMetadata?
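To make the second question concrete, here is a minimal sketch of the kind of parallel fan-out I have in mind. It is not Pinot code: `ParallelOffsetFetchSketch`, `fetchAll`, `fetchOffset`, and the `parallelism` parameter are hypothetical names, and `fetchOffset` only stands in for the per-partition work (building the consumer and calling fetchStreamPartitionOffset). It uses only the JDK, and I have not checked whether the real code path is safe to run concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntToLongFunction;

public class ParallelOffsetFetchSketch {

  /**
   * Fetches one offset per partition using a bounded thread pool.
   * fetchOffset is a hypothetical stand-in for the per-partition
   * consumer setup plus offset lookup; it is not a Pinot API.
   */
  static List<Long> fetchAll(int numPartitions, int parallelism, IntToLongFunction fetchOffset)
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      // Submit one task per partition so the slow handshakes overlap.
      List<Future<Long>> futures = new ArrayList<>(numPartitions);
      for (int i = 0; i < numPartitions; i++) {
        final int partitionId = i;
        Callable<Long> task = () -> fetchOffset.applyAsLong(partitionId);
        futures.add(pool.submit(task));
      }
      // Collect in submission order so index i maps to partition i.
      // get() rethrows a task failure as ExecutionException.
      List<Long> offsets = new ArrayList<>(numPartitions);
      for (Future<Long> future : futures) {
        offsets.add(future.get());
      }
      return offsets;
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated fetch: sleep briefly in place of a handshake, then return a fake offset.
    List<Long> offsets = fetchAll(16, 8, partitionId -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return partitionId * 100L;
    });
    System.out.println(offsets.size() + " offsets fetched");
  }
}
```

With a pool of size P, the wall-clock time would be roughly ceil(N / P) handshakes instead of N, at the cost of P concurrent connections to the brokers. Bounding the pool matters here because an unbounded fan-out over 100+ partitions would open that many SASL/SSL sessions at once.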