Skip to content

fix: handle NULL pg_stat_wal.stats_reset#1

Draft
honi-at-simspace wants to merge 1 commit into
mainfrom
dev/9106
Draft

fix: handle NULL pg_stat_wal.stats_reset#1
honi-at-simspace wants to merge 1 commit into
mainfrom
dev/9106

Conversation

@honi-at-simspace
Copy link
Copy Markdown
Owner

Description

pg_stat_wal.stats_reset is per-instance and can legitimately be NULL on:

  • replicas (which never write WAL stats locally), and
  • primaries promoted from a replica that was bootstrapped via pg_basebackup
    the WAL stats subsystem on a standby never initialises stats_reset, and
    promotion does not reset it. After ≥1 failover (rolling minor upgrades,
    node maintenance, etc.) every pod in the cluster reaches this state.

The exporter scanned the column into a string, so on such instances every
collection failed with sql: Scan error on column index 4, name "stats_reset": converting NULL to string is unsupported. All pg_stat_wal counters were
dropped for the affected pod, and the error was logged at every scrape.

This change scans stats_reset into a sql.NullString and uses the empty
string as the metric label when NULL. The actual WAL counters
(wal_records, wal_fpi, wal_bytes, wal_buffers_full, plus the PG<18
write/sync ones) are then collected normally.

Relation to cloudnative-pg#9788

cloudnative-pg#9788 takes the orthogonal approach of skipping pg_stat_wal collection on
replicas entirely. That is correct on its own merits, but does not cover the
primary-with-NULL-stats_reset case (clusters that have failed over at least
once), which is what we observe in our environments. The two changes are
complementary — cloudnative-pg#9788 stops collecting data that isn't meaningful on replicas;
this change makes the scan robust to NULL on the primary.

Diff scope — two options for reviewers

Happy to ship either of the following; reviewer preference decides.

Option A (current PR, ~110 LOC): type-safe scan + a small refactor that
extracts the scan logic into a package-private getPgStatWAL(db, version)
helper, plus three sqlmock-backed unit tests covering populated / NULL on
PG<18 / NULL on PG≥18. Pro: regression coverage. Con: introduces a small
refactor that is not strictly required by the bug fix.

Option B (~9 LOC, no test): just change StatsReset string
sql.NullString in PgStatWal and use .String at the metric-label call
sites. No helper, no new tests; relies on the existing E2E suite for
integration coverage. Pro: minimal review surface. Con: no unit test of the
NULL path.

If you prefer Option B I'll force-push the branch with the helper and the
new tests removed.

Testing

  • Build clean: go build ./...
  • Vet clean on touched packages
  • New unit tests in probes_test.go cover three cases: populated stats_reset
    on PG<18, NULL stats_reset on PG<18, NULL stats_reset on PG≥18.
  • go test ./pkg/management/postgres/... ./pkg/management/postgres/webserver/metricserver/... passes.

Closes cloudnative-pg#9106


Drafted with AI assistance (Claude). All code, tests, and design choices were
reviewed by the author before submission.

The stats_reset column of pg_stat_wal can be NULL on instances that
have never had WAL stats initialized -- most commonly replicas (which
do not write WAL stats locally) and primaries promoted from a replica
that was bootstrapped via pg_basebackup. The exporter scanned this
column into a plain string, so on such instances every collection
failed with "sql: Scan error on column index 4, name stats_reset:
converting NULL to string is unsupported", dropping all pg_stat_wal
metrics for the pod and logging an error at every scrape interval.

Scan stats_reset into sql.NullString and use the empty string as the
metric label when the column is NULL. The actual WAL counters are
then emitted normally.

The pg_stat_wal scan logic is extracted into a package-private
getPgStatWAL helper so the NULL behaviour can be exercised directly
with sqlmock.

Closes cloudnative-pg#9106

Signed-off-by: Honi Sanders <honi.sanders@simspace.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: 1.27.1 stats query bug

1 participant