SIGSEGV in pgsm_ExecutorEnd at pg_stat_monitor.c:781 during PortalCleanup (PG 18.3, correlates with autovacuum) #651

@jameslupolthrd

Description

PostgreSQL 18.3 backends terminate with SIGSEGV in pgsm_ExecutorEnd at pg_stat_monitor.c:781 during commit-time PortalCleanup. Crashes correlate tightly with autovacuum completion on the relation the failing query targets: the postmaster logs an autovacuum-done line for a table, and within 1–2 seconds a backend running a query against that same table segfaults. The postmaster then terminates all other backends, replays WAL, and the database returns in ~1 second.

This is reproducible across four independent RDS PostgreSQL 18.3 instances sharing the same shared_preload_libraries setting. A control instance in the same account with pg_stat_monitor absent from shared_preload_libraries has zero crashes over the same window. Removing pg_stat_monitor from shared_preload_libraries is a confirmed workaround — all four instances have been crash-free since the extension was removed on 2026-04-30.

Note: This report was prepared with AI assistance. Some information is inferred rather than directly observable because the affected instances run on AWS RDS (a managed service), which does not expose the unsanitized stack trace, the exact binary build, or /rdsdbdata/log/error/ core dumps to customers. AWS Support extracted the sanitized stack trace below on our behalf. We have provided everything we were able to obtain.

Expected Results

pgsm_ExecutorEnd should safely no-op or skip its stats update when the executor state it relies on has been invalidated during portal cleanup, rather than dereferencing a NULL/freed pointer and crashing the backend.

Actual Results

Sanitized stack trace (from AWS Support, crash on 2026-04-30 16:56:58 UTC, backend PID 12539)

#0  pgsm_ExecutorEnd (queryDesc=...) at pg_stat_monitor.c:781
#1  PortalCleanup (portal=...) at portalcmds.c:309
#2  PortalDrop (portal=..., isTopCommit=true) at portalmem.c:502
#3  PreCommit_Portals (isPrepare=false) at portalmem.c:756
#4  CommitTransaction () at xact.c:2269
#5  CommitTransactionCommandInternal () at xact.c:3202
#6  CommitTransactionCommand () at xact.c:3165
#7  finish_xact_command () at postgres.c:2832
#8  PostgresMain (...) at postgres.c:4969
#9  BackendMain (...) at backend_startup.c:129
#10 postmaster_child_launch (...) at launch_backend.c:290
#11 BackendStartup (...) at postmaster.c:3601
#12 ServerLoop () at postmaster.c:1716
#13 PostmasterMain (...) at postmaster.c:1414
#14 main (...) at main.c:227

Crash site in upstream 2.3.1

pg_stat_monitor.c:781 at tag 2.3.1:

742:    if (queryId != INT64CONST(0) && queryDesc->totaltime && pgsm_enabled(nesting_level))
743:    {
...
758:            InstrEndLoop(queryDesc->totaltime);
...
773:            pgsm_update_entry(entry,
774:                              NULL,
775:                              NULL,
776:                              0,
777:                              plan_ptr,
778:                              &sys_info,
779:                              NULL,
780:                              0,
781:                              queryDesc->totaltime->total * 1000.0,

There is a NULL-check on queryDesc->totaltime at line 742, but line 781 later dereferences queryDesc->totaltime->total without re-checking. If the structure pointed to by queryDesc->totaltime becomes invalid (freed / replaced / its contents invalidated) between 742 and 781, the dereference at 781 faults.
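
The read pattern described above can be sketched in isolation. The snippet below is a minimal illustration and not the pg_stat_monitor code or a proposed patch: MockInstrumentation, MockQueryDesc, and mock_total_time_ms are hypothetical stand-ins invented for this example, not the PostgreSQL Instrumentation or QueryDesc definitions. It loads the pointer once, checks it, and only then reads total, so the check and the dereference always refer to the same pointer value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; NOT the PostgreSQL structs. */
typedef struct MockInstrumentation
{
    double total;               /* accumulated run time, in seconds */
} MockInstrumentation;

typedef struct MockQueryDesc
{
    MockInstrumentation *totaltime; /* may be NULL when timing is disabled */
} MockQueryDesc;

/*
 * Store the elapsed time in milliseconds in *out_ms and return true,
 * or return false (leaving *out_ms untouched) when no timing data exists.
 */
static bool
mock_total_time_ms(const MockQueryDesc *queryDesc, double *out_ms)
{
    const MockInstrumentation *instr;

    if (queryDesc == NULL || out_ms == NULL)
        return false;

    /* Load the pointer once; every later use goes through this local copy. */
    instr = queryDesc->totaltime;
    if (instr == NULL)
        return false;

    *out_ms = instr->total * 1000.0;
    return true;
}
```

A caller would skip the stats update when the function returns false. Note the limit of this pattern: a NULL check only catches a pointer that has been cleared. If the memory behind a non-NULL pointer has already been freed, the check passes and the read is still invalid, so this sketch would not by itself explain or fix the crash reported here.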

The crash is reached specifically via PortalCleanup → PortalDrop in the commit path and is tightly correlated with autovacuum completing on the queried relation. Together these suggest that executor state is invalidated at or before the point where pgsm_ExecutorEnd reads it.

Representative crash log (one of ~18 across 4 instances in 3 days)

2026-04-30 16:56:56 UTC LOG:  automatic vacuum of table "postgres.public.tool_execution": index scans: 1
    pages: 0 removed, 6 remain, 6 scanned (100.00% of total), 0 eagerly scanned
    tuples: 4 removed, 52 remain, 0 are dead but not yet removable
    ...
2026-04-30 16:56:58 UTC LOG:  client backend (PID 12539) was terminated by signal 11: Segmentation fault
2026-04-30 16:56:58 UTC DETAIL:  Failed process was running:
    SELECT * FROM tool_execution WHERE task_id = $1 ORDER BY started_at
2026-04-30 16:56:58 UTC LOG:  terminating any other active server processes
2026-04-30 16:56:58 UTC LOG:  [pg_stat_monitor] pgsm_shmem_shutdown: Shutdown initiated.
2026-04-30 16:56:59 UTC LOG:  database system was interrupted; last known up at 2026-04-30 16:53:15 UTC
2026-04-30 16:56:59 UTC LOG:  database system was not properly shut down; automatic recovery in progress
2026-04-30 16:56:59 UTC LOG:  redo starts at 48/64000750
2026-04-30 16:56:59 UTC LOG:  redo done at 48/6800BA70
2026-04-30 16:56:59 UTC LOG:  database system is ready to accept connections

Observed autovacuum → crash correlation on one instance

| Timestamp (UTC) | Failed query | Preceding autovacuum (table) |
| --- | --- | --- |
| 2026-04-28 08:27:28 | SELECT value FROM rds_heartbeat2 | rdsadmin.pg_catalog.rds_heartbeat2 (completed 08:27:17) |
| 2026-04-29 01:31:42 | SELECT value FROM rds_heartbeat2 | rdsadmin.pg_catalog.rds_heartbeat2 (completed 01:31:37) |
| 2026-04-29 22:56:42 | SELECT value FROM rds_heartbeat2 | rdsadmin.pg_catalog.rds_heartbeat2 (completed 22:56:28) |
| 2026-04-30 16:03:13 | SELECT value FROM rds_heartbeat2 | rdsadmin.pg_catalog.rds_heartbeat2 (completed 16:03:07) |
| 2026-04-30 16:56:58 | SELECT * FROM tool_execution WHERE task_id = $1 ORDER BY started_at | postgres.public.tool_execution (completed 16:56:56) |
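
The delay between each autovacuum completion and the following crash can be computed directly from the two timestamps in each row. The helper below is a small sketch for doing that; hms_to_seconds and gap_seconds are hypothetical names introduced here, and the code assumes both timestamps fall on the same UTC day, which holds for every row above.

```c
#include <stdio.h>

/* Parse "HH:MM:SS" into seconds since midnight; return -1 on malformed input. */
static int
hms_to_seconds(const char *hms)
{
    int  h, m, s;
    char extra;

    /* Exactly three fields must match; a trailing character makes it four. */
    if (hms == NULL || sscanf(hms, "%2d:%2d:%2d%c", &h, &m, &s, &extra) != 3)
        return -1;
    if (h < 0 || h > 23 || m < 0 || m > 59 || s < 0 || s > 59)
        return -1;
    return h * 3600 + m * 60 + s;
}

/*
 * Seconds elapsed from autovacuum completion to the crash.
 * Assumes both times are on the same UTC day; returns -1 on bad input
 * or when the crash time precedes the autovacuum time.
 */
static int
gap_seconds(const char *vacuum_done, const char *crash)
{
    int v = hms_to_seconds(vacuum_done);
    int c = hms_to_seconds(crash);

    if (v < 0 || c < 0 || c < v)
        return -1;
    return c - v;
}
```

For example, gap_seconds("16:56:56", "16:56:58") returns 2 for the tool_execution row. Applied to the five rows in order, the gaps are 11, 5, 14, 6, and 2 seconds.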

The rds_heartbeat2 bias reflects the RDS-internal monitor polling that table every few seconds, which makes it statistically the most likely query to fall in the post-autovacuum window. It is not an rdsadmin-specific pattern: the tool_execution crash is an application query on an application table.

Version

  • PostgreSQL 18.3 (AWS RDS managed, default build)
  • pg_stat_monitor extversion = 2.3 (per pg_available_extensions). The .control file ships version 2.3; PG18 support was introduced in release 2.3.1 (2025-11-28; see issue #566, "support for postgres 18"), so the loaded binary is 2.3.1 or newer. We have asked AWS Support whether they apply downstream patches and will update this issue if we receive details.
  • Platform: AWS RDS, us-east-1
  • shared_preload_libraries = pg_stat_statements,pg_hint_plan,auto_explain,pg_stat_monitor,pg_cron

Affected instances: 4/4 with pg_stat_monitor in shared_preload_libraries. Control instance in same account/region without pg_stat_monitor: 0 crashes in the same window.

Steps to reproduce

We have not produced a minimal reproducer. The environment below reproduces the crash reliably in our setup:

  1. RDS PostgreSQL 18.3 with shared_preload_libraries = pg_stat_statements,pg_hint_plan,auto_explain,pg_stat_monitor,pg_cron
  2. Any table receiving regular-but-small write activity so autovacuum runs frequently (we observed it on both a 52-row application table and on rdsadmin.pg_catalog.rds_heartbeat2 which is polled constantly by RDS's heartbeat monitor)
  3. Continuous SELECT workload against the same relation
  4. Over a 3-day window we observed ~5 crashes per instance (18 total across 4 instances). Every crash occurred within 1–2s of an autovacuum-done log entry for the relation being queried.

Relevant logs

See the representative crash log above. Every recovery sequence explicitly logs [pg_stat_monitor] pgsm_shmem_shutdown: Shutdown initiated.

Workaround

Removing pg_stat_monitor from shared_preload_libraries (requires reboot) has eliminated crashes on all four affected instances since 2026-04-30. AWS Support independently recommended the same workaround.

Related issues reviewed

Confirmed distinct from existing reports:

Code of Conduct

  • I agree to follow Percona Community Code of Conduct