feat(arrow): add read_replica_type resource attribute column#107
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new read_replica resource attribute column to the exported Arrow IPC schema so downstream consumers (e.g., ClickHouse) can distinguish read-replica traffic based on pg_stat_ch.extra_attributes.
Changes:
- Export a new dictionary-encoded Arrow column
read_replica, sourced fromextra_attributes(mirrors the existingserver_roleplumbing). - Update the
pg_stat_ch.extra_attributesGUC documentation example to includeread_replica:false. - Extend the Arrow dump test to assert presence (and dict-encoding) of the new
read_replicafield.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/export/arrow_batch.cc | Adds read_replica dictionary builder + schema field + append/finish/reset/size-estimate wiring. |
| src/config/guc.c | Updates the extra_attributes documentation example string to include read_replica. |
| t/026_arrow_dump.pl | Updates schema expectations and adds a type assertion for read_replica being dictionary-encoded UTF-8. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| 'shared_blks_hit', 'shared_blks_read', | ||
| 'wal_records', 'wal_bytes', | ||
| 'service_version', 'region', | ||
| 'service_version', 'region', 'read_replica', |
| arrow::field("instance_ubid", arrow::utf8()), | ||
| arrow::field("server_ubid", arrow::utf8()), | ||
| arrow::field("server_role", DictionaryUtf8Type()), | ||
| arrow::field("read_replica_type", DictionaryUtf8Type()), | ||
| arrow::field("region", DictionaryUtf8Type()), |
| expected = [ | ||
| 'ts', 'db_name', 'db_user', 'db_operation', 'query_text', | ||
| 'duration_us', 'rows', 'pid', 'query_id', | ||
| 'shared_blks_hit', 'shared_blks_read', | ||
| 'wal_records', 'wal_bytes', | ||
| 'service_version', 'region', | ||
| 'service_version', 'region', 'read_replica_type', | ||
| ] | ||
| missing = [c for c in expected if schema.get_field_index(c) == -1] | ||
| if missing: | ||
| print(f"MISSING_COLUMNS:{','.join(missing)}") | ||
| sys.exit(1) | ||
|
|
||
| # Check types for a few key columns | ||
| ts_field = schema.field('ts') | ||
| assert ts_field.type == pa.timestamp('ns', tz='UTC'), f"ts type wrong: {ts_field.type}" | ||
|
|
||
| dur_field = schema.field('duration_us') | ||
| assert dur_field.type == pa.uint64(), f"duration_us type wrong: {dur_field.type}" | ||
|
|
||
| rows_field = schema.field('rows') | ||
| assert rows_field.type == pa.uint64(), f"rows type wrong: {rows_field.type}" | ||
|
|
||
| # read_replica_type is a dictionary-encoded column (like server_role/region), not plain utf8. | ||
| rr_field = schema.field('read_replica_type') | ||
| assert pa.types.is_dictionary(rr_field.type), f"read_replica_type not dict-encoded: {rr_field.type}" | ||
| assert rr_field.type.value_type == pa.utf8(), f"read_replica_type dict value type wrong: {rr_field.type.value_type}" | ||
|
|
| DefineCustomStringVariable( | ||
| "pg_stat_ch.extra_attributes", | ||
| "Key-value pairs appended to exported Arrow batches.", | ||
| "Semicolon-separated k:v pairs for resource columns: " | ||
| "'instance_ubid:abc;server_role:primary;region:us-east-1'.", | ||
| "'instance_ubid:abc;server_role:primary;read_replica_type:regional;region:us-east-1'.", | ||
| &psch_extra_attributes, |
serprex
left a comment
There was a problem hiding this comment.
should update clickhouse schema too
@serprex, There is already deviation between the arrow path and the ClickHouse schema. This PR is trying to unify it: #99. I think it is better to add the read replica there as well, because the PR 99 is defining the migration standard for the ClickHouse schema. @JoshDreamland let me know if you are okay with this idea or have any thoughts. |
|
@amogiska Chatted with Serp separately; if I merge #99 nowish, do you mind babysitting it through to production along with this change? You can tweak the "unified schema" contained within to be current with your change here, even though it is not, in fact, unified as of yet. We can fake it 'til we make it, in that regard. To clarify, #99 should be a no-op on production; a breakage would be extremely interesting, and I am happy to debug if you observe one. I would just like to piggy-back its evaluation onto this change if possible. This should move us into a better position to do a migration later. |
Pre-add the column being introduced on both sides of the Arrow pipeline to keep the unified events_raw schema wire-compatible once the cutover happens: * pg_stat_ch PR #107 — adds read_replica_type to the Arrow IPC output in arrow_batch.cc (dictionary-encoded, populated from the pg_stat_ch.extra_attributes GUC). * clickgres-platform PR #448 — adds the matching ALTER TABLE on query_logs_arrow (LowCardinality(String) DEFAULT 'none' AFTER server_role) and promotes read-replica traffic into query_logs via a widened MV filter. Mirroring the same type, default, and position here means the eventual cutover from query_logs_arrow to events_raw needs zero further schema changes — and lets PR #107 rebase onto unified_schema without having to amend the schema migration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JoshDreamland
left a comment
There was a problem hiding this comment.
Discussed more offline—we'll merge this, I'll merge #99, we'll proceed to release.
Summary
read_replica_typecolumn to the exported Arrow IPC schema, populated from thepg_stat_ch.extra_attributesGUC. Dictionary-encoded likeserver_role(low-cardinality enum:regional/none).t/026_arrow_dump.plschema assertion (presence + dictionary type).Why
Ubicloud emits
read_replica_typeinextra_attributes; surfacing it as an Arrow column lets the ClickHouse side distinguish and promote read-replica query traffic, which otherwise reportsserver_role=standby. Unknown extra-attribute keys were parsed-and-ignored before this; this makesread_replica_typea first-class column.Test plan
server_role; schema-field order matches finish-array ordert/026_arrow_dump.plassertsread_replica_typeis present and dictionary-encoded