SIGSEGV in pgsm_ExecutorEnd at pg_stat_monitor.c:781 during PortalCleanup (PG 18.3, correlates with autovacuum) #651

### Description

PostgreSQL 18.3 backends terminate with `SIGSEGV` in `pgsm_ExecutorEnd` at `pg_stat_monitor.c:781` during commit-time `PortalCleanup`. Crashes correlate tightly with autovacuum completion on the relation the failing query targets: the postmaster logs an autovacuum completion line for a table, and within seconds (2 to 14 s in the five crashes tabulated under Actual Results) a backend running a query against that same table segfaults. The postmaster then terminates all other backends, replays WAL, and the database accepts connections again in about 1 second.

The crash recurs on four independent RDS PostgreSQL 18.3 instances that share the same `shared_preload_libraries` setting. A control instance in the same account, with `pg_stat_monitor` absent from `shared_preload_libraries`, had zero crashes over the same window. Removing `pg_stat_monitor` from `shared_preload_libraries` is a confirmed workaround: all four instances have been crash-free since the extension was removed on 2026-04-30.

> **Note:** This report was prepared with AI assistance. Some information is inferred rather than directly observable because the affected instances run on AWS RDS (a managed service), which does not expose the unsanitized stack trace, the exact binary build, or `/rdsdbdata/log/error/` core dumps to customers. AWS Support extracted the sanitized stack trace below on our behalf. We have provided everything we were able to obtain.

### Expected Results

`pgsm_ExecutorEnd` should safely no-op or skip its stats update when the executor state it relies on has been invalidated during portal cleanup, rather than dereferencing a NULL/freed pointer and crashing the backend.

### Actual Results

#### Sanitized stack trace (from AWS Support, crash on 2026-04-30 16:56:58 UTC, backend PID 12539)

```
#0  pgsm_ExecutorEnd (queryDesc=...) at pg_stat_monitor.c:781
#1  PortalCleanup (portal=...) at portalcmds.c:309
#2  PortalDrop (portal=..., isTopCommit=true) at portalmem.c:502
#3  PreCommit_Portals (isPrepare=false) at portalmem.c:756
#4  CommitTransaction () at xact.c:2269
#5  CommitTransactionCommandInternal () at xact.c:3202
#6  CommitTransactionCommand () at xact.c:3165
#7  finish_xact_command () at postgres.c:2832
#8  PostgresMain (...) at postgres.c:4969
#9  BackendMain (...) at backend_startup.c:129
#10 postmaster_child_launch (...) at launch_backend.c:290
#11 BackendStartup (...) at postmaster.c:3601
#12 ServerLoop () at postmaster.c:1716
#13 PostmasterMain (...) at postmaster.c:1414
#14 main (...) at main.c:227
```

#### Crash site in upstream 2.3.1

`pg_stat_monitor.c:781` at tag `2.3.1`:

```c
742:    if (queryId != INT64CONST(0) && queryDesc->totaltime && pgsm_enabled(nesting_level))
743:    {
...
758:            InstrEndLoop(queryDesc->totaltime);
...
773:            pgsm_update_entry(entry,
774:                              NULL,
775:                              NULL,
776:                              0,
777:                              plan_ptr,
778:                              &sys_info,
779:                              NULL,
780:                              0,
781:                              queryDesc->totaltime->total * 1000.0,
```

There is a NULL-check on `queryDesc->totaltime` at line 742, but line 781 later dereferences `queryDesc->totaltime->total` without re-checking. If the structure pointed to by `queryDesc->totaltime` becomes invalid (freed / replaced / its contents invalidated) between 742 and 781, the dereference at 781 faults.

The crash is reached through `PortalDrop → PortalCleanup → pgsm_ExecutorEnd` in the commit path (frames #2 to #0 in the stack trace above) and is tightly correlated with autovacuum completing on the queried relation. Together these suggest that the executor state is invalidated before or during `pgsm_ExecutorEnd`'s read; we have not confirmed the mechanism.
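
To make the expected behavior concrete, here is a minimal, self-contained sketch of the "skip instead of dereference" pattern. It is not pg_stat_monitor code: `MockInstrumentation`, `MockQueryDesc`, and `sketch_total_time_ms` are hypothetical stand-ins that model only the fields discussed above, and `total` is assumed to be in seconds because the upstream call multiplies it by `1000.0`. The sketch reads the pointer once into a local and returns `false` when there is nothing to read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for PostgreSQL's Instrumentation and QueryDesc.
 * Only the fields this sketch needs are modeled. */
typedef struct MockInstrumentation
{
    double total;                   /* accumulated run time, assumed seconds */
} MockInstrumentation;

typedef struct MockQueryDesc
{
    MockInstrumentation *totaltime; /* NULL when no timing was collected */
} MockQueryDesc;

/*
 * Fetch the total execution time in milliseconds.
 * Returns false and leaves *out_ms untouched when there is nothing to
 * read, so the caller can skip its stats update.
 */
bool
sketch_total_time_ms(const MockQueryDesc *queryDesc, double *out_ms)
{
    assert(out_ms != NULL);

    if (queryDesc == NULL)
        return false;

    /* Load the pointer once; every later use goes through this local. */
    const MockInstrumentation *tt = queryDesc->totaltime;

    if (tt == NULL)
        return false;

    *out_ms = tt->total * 1000.0;
    return true;
}
```

This only covers a pointer that is NULL or has been reset to NULL by the time it is read. A pointer that is non-NULL but refers to freed memory cannot be detected this way, so if that is the actual failure, the fix would have to address the lifetime of the executor state rather than add a check.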

#### Representative crash log (one of ~18 across 4 instances in 3 days)

```
2026-04-30 16:56:56 UTC LOG:  automatic vacuum of table "postgres.public.tool_execution": index scans: 1
    pages: 0 removed, 6 remain, 6 scanned (100.00% of total), 0 eagerly scanned
    tuples: 4 removed, 52 remain, 0 are dead but not yet removable
    ...
2026-04-30 16:56:58 UTC LOG:  client backend (PID 12539) was terminated by signal 11: Segmentation fault
2026-04-30 16:56:58 UTC DETAIL:  Failed process was running:
    SELECT * FROM tool_execution WHERE task_id = $1 ORDER BY started_at
2026-04-30 16:56:58 UTC LOG:  terminating any other active server processes
2026-04-30 16:56:58 UTC LOG:  [pg_stat_monitor] pgsm_shmem_shutdown: Shutdown initiated.
2026-04-30 16:56:59 UTC LOG:  database system was interrupted; last known up at 2026-04-30 16:53:15 UTC
2026-04-30 16:56:59 UTC LOG:  database system was not properly shut down; automatic recovery in progress
2026-04-30 16:56:59 UTC LOG:  redo starts at 48/64000750
2026-04-30 16:56:59 UTC LOG:  redo done at 48/6800BA70
2026-04-30 16:56:59 UTC LOG:  database system is ready to accept connections
```
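
For anyone triaging the same pattern across several instances, a small helper along these lines could pull the PID and signal number out of the postmaster's termination line. This is only a sketch: `parse_backend_crash` is a hypothetical name, and the format string is taken from the single log line shown above, so other message variants are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the PID and signal number from a postmaster log line such as
 *   "... LOG:  client backend (PID 12539) was terminated by signal 11: ..."
 * Returns 1 on a match and 0 otherwise; outputs are written only on a match.
 */
int
parse_backend_crash(const char *line, int *pid, int *signo)
{
    static const char marker[] = "client backend (PID ";
    const char *p;
    int parsed_pid;
    int parsed_sig;

    assert(line != NULL && pid != NULL && signo != NULL);

    /* Skip the timestamp and log-level prefix by locating the marker. */
    p = strstr(line, marker);
    if (p == NULL)
        return 0;

    if (sscanf(p, "client backend (PID %d) was terminated by signal %d",
               &parsed_pid, &parsed_sig) != 2)
        return 0;

    *pid = parsed_pid;
    *signo = parsed_sig;
    return 1;
}
```

Lines that do not describe a client backend termination, such as the `terminating any other active server processes` line, return 0 and leave the outputs unchanged.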

#### Observed autovacuum → crash correlation on one instance

| Timestamp (UTC) | Failed query | Preceding autovacuum (table) |
|---|---|---|
| 2026-04-28 08:27:28 | `SELECT value FROM rds_heartbeat2` | `rdsadmin.pg_catalog.rds_heartbeat2` (completed 08:27:17) |
| 2026-04-29 01:31:42 | `SELECT value FROM rds_heartbeat2` | `rdsadmin.pg_catalog.rds_heartbeat2` (completed 01:31:37) |
| 2026-04-29 22:56:42 | `SELECT value FROM rds_heartbeat2` | `rdsadmin.pg_catalog.rds_heartbeat2` (completed 22:56:28) |
| 2026-04-30 16:03:13 | `SELECT value FROM rds_heartbeat2` | `rdsadmin.pg_catalog.rds_heartbeat2` (completed 16:03:07) |
| 2026-04-30 16:56:58 | `SELECT * FROM tool_execution WHERE task_id = $1 ORDER BY started_at` | `postgres.public.tool_execution` (completed 16:56:56) |

The `rds_heartbeat2` bias reflects the RDS-internal monitor polling that table every few seconds, which makes it statistically the most likely query to fall in the post-autovacuum window; it is not an `rdsadmin`-specific pattern. The `tool_execution` crash is an application query on an application table.
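
The gap between each autovacuum completion and the following crash can be computed directly from the log timestamps. The sketch below does this without relying on the local time zone; the function names (`days_from_civil`, `parse_utc_seconds`, `autovacuum_to_crash_gap`) are hypothetical helpers written for this report, and the input format is assumed to be the `YYYY-MM-DD HH:MM:SS` prefix used in the log excerpt.

```c
#include <assert.h>
#include <stdio.h>

/* Days since 1970-01-01 for a proleptic Gregorian calendar date. */
long
days_from_civil(int y, int m, int d)
{
    long era, yoe, doy, doe;

    y -= (m <= 2);                      /* treat Jan/Feb as months 11/12 of the prior year */
    era = (y >= 0 ? y : y - 399) / 400;
    yoe = y - era * 400;                                   /* [0, 399] */
    doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  /* [0, 365] */
    doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           /* [0, 146096] */
    return era * 146097 + doe - 719468;
}

/* Parse "YYYY-MM-DD HH:MM:SS" (UTC) into seconds since the Unix epoch.
 * Returns 0 on success, -1 if the string does not match. */
int
parse_utc_seconds(const char *ts, long long *out)
{
    int y, mo, d, h, mi, s;

    assert(ts != NULL && out != NULL);

    if (sscanf(ts, "%d-%d-%d %d:%d:%d", &y, &mo, &d, &h, &mi, &s) != 6)
        return -1;

    *out = (long long) days_from_civil(y, mo, d) * 86400LL
        + h * 3600LL + mi * 60LL + s;
    return 0;
}

/* Seconds from autovacuum completion to the crash (negative if the crash
 * came first). Returns 0 on success, -1 on a parse failure. */
int
autovacuum_to_crash_gap(const char *autovac_done, const char *crash,
                        long long *gap)
{
    long long a, c;

    assert(gap != NULL);

    if (parse_utc_seconds(autovac_done, &a) != 0 ||
        parse_utc_seconds(crash, &c) != 0)
        return -1;

    *gap = c - a;
    return 0;
}
```

Applied to the five rows above, and assuming each autovacuum completion falls on the same date as its crash (the table lists only the time of day for completions), the gaps are 11, 5, 14, 6, and 2 seconds.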

### Version

- PostgreSQL 18.3 (AWS RDS managed, default build)
- pg_stat_monitor `extversion = 2.3` (per `pg_available_extensions`). The `.control` file ships version `2.3`; PG18 support was introduced in release **2.3.1** (2025-11-28, issue #566), so the loaded binary is 2.3.1 or newer. We have asked AWS Support whether they apply downstream patches and will update this issue if we receive details.
- Platform: AWS RDS, `us-east-1`
- `shared_preload_libraries = pg_stat_statements,pg_hint_plan,auto_explain,pg_stat_monitor,pg_cron`

Affected instances: 4/4 with pg_stat_monitor in `shared_preload_libraries`. Control instance in same account/region without pg_stat_monitor: 0 crashes in the same window.

### Steps to reproduce

We have not produced a minimal reproducer. The crash recurs reliably in the following environment:

1. RDS PostgreSQL 18.3 with `shared_preload_libraries = pg_stat_statements,pg_hint_plan,auto_explain,pg_stat_monitor,pg_cron`
2. Any table receiving regular but small write activity, so that autovacuum runs frequently (we observed it on both a 52-row application table and `rdsadmin.pg_catalog.rds_heartbeat2`, which RDS's heartbeat monitor polls constantly)
3. Continuous `SELECT` workload against the same relation

Over a 3-day window we observed ~5 crashes per instance (18 total across 4 instances). Every crash followed an autovacuum completion log entry for the queried relation within seconds; in the five crashes tabulated above, the gap was 2 to 14 s.

### Relevant logs

See the representative crash log above. Every recovery sequence explicitly logs `[pg_stat_monitor] pgsm_shmem_shutdown: Shutdown initiated.`

### Workaround

Removing `pg_stat_monitor` from `shared_preload_libraries` (requires reboot) has eliminated crashes on all four affected instances since 2026-04-30. AWS Support independently recommended the same workaround.

### Related issues reviewed

Confirmed distinct from existing reports:

- **#628** (OPEN, memory leak on PG18) — different failure mode (leak vs. SIGSEGV), different code path (shmem shutdown vs. executor hook).
- **#591** (CLOSED, PG16 segfault) — also SIGSEGV on 2.3.1, but in `pg_stat_monitor_internal` at line 2569 (the stats *reporting* path, via `tuplestore_putvalues`). Our crash is in `pgsm_ExecutorEnd` at line 781 (the executor *hook* path). Different code path and different engine major.
- **#500** (CLOSED, PG17 hang) — same `pgsm_ExecutorEnd → PortalCleanup → PortalDrop` prefix, but an LWLock hang rather than a SIGSEGV, on pgsm 2.1.0. Related enough that the hook-during-portal-cleanup pattern has history.

### Code of Conduct

- [x] I agree to follow Percona Community Code of Conduct

