fix: handle NULL pg_stat_wal.stats_reset (minimal) #2

Draft
honi-at-simspace wants to merge 1 commit into main from dev/9106-minimal

Description

pg_stat_wal.stats_reset is per-instance and can legitimately be NULL on:

  • replicas (which never write WAL stats locally), and
  • primaries promoted from a replica that was bootstrapped via pg_basebackup:
    the WAL stats subsystem on a standby never initialises stats_reset, and
    promotion does not reset it. After ≥1 failover (rolling minor upgrades,
    node maintenance, etc.) every pod in the cluster reaches this state.

The exporter scanned the column into a string, so on such instances every
collection failed with:

  sql: Scan error on column index 4, name "stats_reset": converting NULL to string is unsupported

All pg_stat_wal counters were dropped for the affected pod, and the error
was logged at every scrape.
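
The failure mode can be reproduced without a PostgreSQL instance. The following is a minimal sketch, not the exporter's code: `nullDriver` and the `nullwal` driver name are hypothetical, and the fake result set has only two columns, so the column index in the resulting error differs from the one reported above. It assumes only the standard `database/sql` and `database/sql/driver` packages.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
	"io"
)

// nullDriver is a hypothetical in-memory driver that returns one row whose
// stats_reset column is NULL. It stands in for a real PostgreSQL connection.
type nullDriver struct{}

func (nullDriver) Open(string) (driver.Conn, error) { return nullConn{}, nil }

type nullConn struct{}

func (nullConn) Prepare(string) (driver.Stmt, error) { return nullStmt{}, nil }
func (nullConn) Close() error                        { return nil }
func (nullConn) Begin() (driver.Tx, error) {
	return nil, errors.New("transactions not supported")
}

type nullStmt struct{}

func (nullStmt) Close() error  { return nil }
func (nullStmt) NumInput() int { return -1 } // -1 disables argument-count checks
func (nullStmt) Exec([]driver.Value) (driver.Result, error) {
	return nil, errors.New("exec not supported")
}
func (nullStmt) Query([]driver.Value) (driver.Rows, error) { return &nullRows{}, nil }

type nullRows struct{ done bool }

func (*nullRows) Columns() []string { return []string{"wal_records", "stats_reset"} }
func (*nullRows) Close() error      { return nil }
func (r *nullRows) Next(dest []driver.Value) error {
	if r.done {
		return io.EOF
	}
	r.done = true
	dest[0] = int64(42) // arbitrary counter value
	dest[1] = nil       // NULL stats_reset
	return nil
}

func init() { sql.Register("nullwal", nullDriver{}) }

// scanIntoString mimics the pre-fix behaviour: stats_reset is scanned into
// a plain string, which database/sql cannot populate from NULL.
func scanIntoString() error {
	db, err := sql.Open("nullwal", "")
	if err != nil {
		return err
	}
	defer db.Close()

	var records int64
	var statsReset string
	return db.QueryRow("SELECT wal_records, stats_reset FROM pg_stat_wal").
		Scan(&records, &statsReset)
}

func main() {
	// Expected to print a "converting NULL to string is unsupported" scan error.
	fmt.Println(scanIntoString())
}
```

Because the whole row fails to scan, the valid counter in the first column is lost along with the NULL one, which is consistent with all pg_stat_wal counters disappearing for the affected pod.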

This change scans stats_reset into a sql.NullString and uses the empty
string as the metric label when NULL. The actual WAL counters
(wal_records, wal_fpi, wal_bytes, wal_buffers_full, plus the PG<18
write/sync ones) are then collected normally.

Relation to cloudnative-pg#9788

cloudnative-pg#9788 takes the orthogonal approach of skipping pg_stat_wal collection on
replicas entirely. That is correct on its own merits, but does not cover the
primary-with-NULL-stats_reset case (clusters that have failed over at least
once), which is what we observe in our environments. The two changes are
complementary: cloudnative-pg#9788 stops collecting data that isn't meaningful on replicas;
this change makes the scan robust to NULL on the primary.

Diff scope — minimal version

This is the minimal version of the fix: just the string → sql.NullString
type change in PgStatWal and the corresponding .String accessor at the
metric-label call sites. No refactor, no new tests; relies on the existing
E2E suite for integration coverage. Net diff: 2 files, 9 lines.

A larger version that also extracts a getPgStatWAL(db, version) helper and
adds three sqlmock-backed unit tests (populated / NULL on PG<18 / NULL on
PG≥18) is available on branch dev/9106 (see fork PR #1). Reviewer
preference decides which scope ships.

Testing

  • Build clean: go build ./...
  • Vet clean on touched packages
  • go test ./pkg/management/postgres/... ./pkg/management/postgres/webserver/metricserver/... passes.

Closes cloudnative-pg#9106


Drafted with AI assistance (Claude). All code and design choices were
reviewed by the author before submission.

The stats_reset column of pg_stat_wal can be NULL on instances that
have never had WAL stats initialized -- most commonly replicas (which
do not write WAL stats locally) and primaries promoted from a replica
that was bootstrapped via pg_basebackup. The exporter scanned this
column into a plain string, so on such instances every collection
failed with "sql: Scan error on column index 4, name stats_reset:
converting NULL to string is unsupported", dropping all pg_stat_wal
metrics for the pod and logging an error at every scrape interval.

Scan stats_reset into sql.NullString and use the empty string as the
metric label when the column is NULL. The actual WAL counters are
then emitted normally.

Closes cloudnative-pg#9106

Signed-off-by: Honi Sanders <honi.sanders@simspace.com>

Development

Successfully merging this pull request may close these issues.

[Bug]: 1.27.1 stats query bug