A heartbeat monitor for PostgreSQL — polls pg_stat_activity, pg_stat_statements, pg_locks, and more, then exposes Prometheus metrics on /metrics.
- A lightweight sidecar that connects to PostgreSQL and exposes 30+ Prometheus-compatible metrics
- A poll-based exporter for activity, connections, slow queries, vacuum health, table/index bloat, lock graphs, replication lag, WAL rate, checkpoint pressure, and query regression detection
- Compatible with PostgreSQL 12+ (auto-detects version for correct column names)
- Ships with a 64-panel Grafana dashboard and a Helm chart with ServiceMonitor
- Zero config beyond a DSN — sensible defaults for everything
- Not a replacement for `pg_stat_monitor` or `pganalyze` — pgpulse exposes raw counters and deltas, not analyzed insights
- Not a query profiler — it captures top-N statements by total time, calls, and mean time, not full query plans
- Not a connection pooler — it uses 2 connections max and does not proxy traffic
- Not an alerting engine — pair it with Alertmanager or Grafana alerts
Observe, don't interfere. pgpulse opens a read-only window into PostgreSQL's own statistics views. It adds no extensions, modifies no data, and uses minimal resources. The metrics tell you what's happening; you decide what to do about it.
pgpulse queries PostgreSQL's built-in statistics views. Most metrics work out of the box, but some features require configuration.
Create a dedicated role with read access to statistics views:
```sql
CREATE ROLE pgpulse WITH LOGIN PASSWORD 'your-secure-password';
GRANT pg_monitor TO pgpulse;
```

The `pg_monitor` role (PostgreSQL 10+) grants read access to `pg_stat_activity`, `pg_stat_replication`, `pg_locks`, and all other statistics views pgpulse needs.
Statement-level metrics (top queries by time/calls, query regression detection) require the pg_stat_statements extension. pgpulse auto-detects its presence and skips statement metrics if unavailable.
To enable:
```sql
-- 1. Add to postgresql.conf (requires restart):
-- shared_preload_libraries = 'pg_stat_statements'

-- 2. Create the extension in your database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```

pgpulse auto-detects the PostgreSQL version and uses the correct column names (`total_exec_time` on PG13+, `total_time` on PG12).
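As a minimal sketch of this kind of version gating (illustrative only — `stat_statements_columns` is not pgpulse's actual code), the column rename introduced in PG13 can be handled by branching on `server_version_num`:

```python
# pg_stat_statements renamed total_time -> total_exec_time (and
# mean_time -> mean_exec_time) in PostgreSQL 13. This helper is a
# sketch of that branching, keyed on server_version_num.

def stat_statements_columns(server_version_num: int) -> dict:
    """Map logical metric names to pg_stat_statements columns."""
    if server_version_num >= 130000:  # PG13 and newer
        return {"total_time": "total_exec_time", "mean_time": "mean_exec_time"}
    return {"total_time": "total_time", "mean_time": "mean_time"}

cols = stat_statements_columns(150004)  # e.g. PostgreSQL 15.4
query = f"SELECT queryid, calls, {cols['total_time']} FROM pg_stat_statements"
```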
| Feature | Minimum PG version | Notes |
|---|---|---|
| Core metrics (activity, connections, sizes) | 12 | Always available |
| `pg_stat_statements` | 12 | Requires extension (see above) |
| WAL generation rate (`pg_stat_wal`) | 14 | Auto-skipped on older versions |
| Checkpoint stats (`pg_stat_checkpointer`) | 17 | Falls back to `pg_stat_bgwriter` on older versions |
| Replication lag (`replay_lag`) | 10 | Graceful skip if not a primary |
| Node role detection (`pg_is_in_recovery`) | 12 | Auto-detects primary vs replica each poll |
Allow the pgpulse role to connect from your monitoring network:
```
# Same host
host    all    pgpulse    127.0.0.1/32     scram-sha-256

# Kubernetes pod CIDR
host    all    pgpulse    10.244.0.0/16    scram-sha-256
```

Reload after editing: `SELECT pg_reload_conf();`
pgpulse must connect to the application database, not `postgres`. Per-table metrics (vacuum health, bloat, table sizes, index sizes, sequential scan ratios) come from `pg_stat_user_tables`, which only shows tables in the connected database.
Cluster-wide metrics (activity, connections, replication, WAL, checkpoints) work from any database.
If you have multiple databases to monitor, create one target per database (see multi-database monitoring below).
pgpulse connects using a standard PostgreSQL DSN:
```
postgres://pgpulse@hostname:5432/mydb?sslmode=require
```

For production, always use `sslmode=require` or `sslmode=verify-full`.
pgpulse must connect directly to PostgreSQL, not through pgbouncer or other connection poolers. pgbouncer does not proxy system views (pg_stat_activity, pg_stat_statements, pg_locks, pg_stat_replication), so most metrics will fail or return empty results.
| | Direct (port 5432) | pgbouncer (port 6432) |
|---|---|---|
| `pg_stat_activity` | works | empty/error |
| `pg_stat_statements` | works | error |
| `pg_locks` | works | empty |
| `pg_stat_replication` | works | error |
| Database sizes | works | error |
Always point PG_DSN at PostgreSQL's native port (default 5432), not the pgbouncer port.
pgpulse auto-detects whether each target is a primary or replica by querying pg_is_in_recovery() every poll cycle. This means:
- Specify all nodes as separate targets — pgpulse figures out the roles
- Failover is automatic — if Patroni promotes a replica, the `pg_node_role` metric flips on the next poll
- The `pg_node_role{role="primary|replica"}` metric lets you filter and compare nodes in Grafana
- Replication metrics are only collected on the primary (auto-skipped on replicas)
- All other metrics (activity, vacuum, bloat, locks, etc.) are collected on every node
In Grafana, use the instance and role template variables to filter by node or compare primary vs replicas.
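A small sketch of the per-poll role detection described above (illustrative — `role_labels` is not pgpulse's internal name): `pg_is_in_recovery()` returns true on a replica, and exactly one role label carries the value 1.

```python
# Sketch: turn a pg_is_in_recovery() result into pg_node_role{role}-style
# label values. After a Patroni promotion, the next poll sees
# in_recovery=False on the promoted node and the metric flips.

def role_labels(in_recovery: bool) -> dict:
    """Return the gauge value for each role label; exactly one is 1."""
    role = "replica" if in_recovery else "primary"
    return {"primary": 0 if in_recovery else 1,
            "replica": 1 if in_recovery else 0}
```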
Per-table metrics (pg_stat_user_tables, pg_stat_user_indexes) are scoped to the connected database. To monitor multiple databases, create one target per database per node.
Example: 3 nodes × 3 databases = 9 targets. Use a values file (pgpulse-values.yaml):
```yaml
targets:
  # Node 1
  - name: node1-appdb1
    dsn: "postgres://pgpulse@node1:5432/appdb1?sslmode=require"
  - name: node1-appdb2
    dsn: "postgres://pgpulse@node1:5432/appdb2?sslmode=require"
  - name: node1-appdb3
    dsn: "postgres://pgpulse@node1:5432/appdb3?sslmode=require"
  # Node 2
  - name: node2-appdb1
    dsn: "postgres://pgpulse@node2:5432/appdb1?sslmode=require"
  - name: node2-appdb2
    dsn: "postgres://pgpulse@node2:5432/appdb2?sslmode=require"
  - name: node2-appdb3
    dsn: "postgres://pgpulse@node2:5432/appdb3?sslmode=require"
  # Node 3
  - name: node3-appdb1
    dsn: "postgres://pgpulse@node3:5432/appdb1?sslmode=require"
  - name: node3-appdb2
    dsn: "postgres://pgpulse@node3:5432/appdb2?sslmode=require"
  - name: node3-appdb3
    dsn: "postgres://pgpulse@node3:5432/appdb3?sslmode=require"

serviceMonitor:
  enabled: true
  scrapeTimeout: "10s"
  labels:
    release: prometheus-operator

prometheusRule:
  enabled: true
  labels:
    release: prometheus-operator

dashboard:
  enabled: true
```

```bash
helm upgrade --install pgpulse charts/pgpulse/ -f pgpulse-values.yaml -n pgpulse-system
```

Each target creates its own Deployment (2 PostgreSQL connections each). Cluster-wide metrics (activity, connections, replication) will be identical across databases on the same node — some duplication, but harmless. If `pg_stat_statements` is needed, create the extension in each database.
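Since the target list is just a node × database cross product, it can be generated instead of written by hand. A hypothetical helper (not part of pgpulse) that emits the matrix above:

```python
# Sketch: generate the node x database target matrix for a values file.
# Node and database names mirror the example above.

nodes = ["node1", "node2", "node3"]
databases = ["appdb1", "appdb2", "appdb3"]

targets = [
    {"name": f"{node}-{db}",
     "dsn": f"postgres://pgpulse@{node}:5432/{db}?sslmode=require"}
    for node in nodes
    for db in databases
]

print(len(targets))  # 9 targets: 3 nodes x 3 databases
```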
```bash
# Build
make build

# Run
export PG_DSN="postgres://pgpulse@localhost:5432/mydb?sslmode=disable"
./bin/pgpulse serve

# Docker
docker build -t pgpulse:dev .
docker run -e PG_DSN="postgres://pgpulse@localhost/mydb" -p 9187:9187 pgpulse:dev
```

Metrics at `http://localhost:9187/metrics`, health check at `/healthz`.
```bash
helm upgrade --install pgpulse charts/pgpulse/ -f pgpulse-values.yaml -n pgpulse-system
```

Minimal `pgpulse-values.yaml`:

```yaml
targets:
  - name: prod
    dsn: "postgres://pgpulse@pghost:5432/mydb?sslmode=require"

serviceMonitor:
  enabled: true
  labels:
    release: prometheus-operator

prometheusRule:
  enabled: true
  labels:
    release: prometheus-operator
```

Each target gets its own Deployment. The ServiceMonitor auto-discovers all targets. The PrometheusRule ships with 5 opinionated alerts. The Grafana dashboard auto-loads via a sidecar ConfigMap. For multi-node and multi-database setups, see HA clusters and multi-database monitoring.
```bash
sudo cp bin/pgpulse /usr/local/bin/
sudo cp deploy/pgpulse.service /etc/systemd/system/
sudo mkdir -p /etc/pgpulse
sudo cp deploy/pgpulse.env.example /etc/pgpulse/pgpulse.env
sudo chmod 600 /etc/pgpulse/pgpulse.env

# Edit PG_DSN in /etc/pgpulse/pgpulse.env, then:
sudo systemctl daemon-reload
sudo systemctl enable --now pgpulse
```

All configuration is via environment variables:
| Variable | Default | Description |
|---|---|---|
| `PG_DSN` or `DATABASE_URL` | (required) | PostgreSQL connection string |
| `METRICS_PORT` | `9187` | Port for the HTTP metrics server |
| `POLL_INTERVAL` | `5s` | How often to collect metrics |
| `SLOW_QUERY_THRESHOLD` | `5s` | Duration after which a query is counted as slow |
| `REGRESSION_THRESHOLD` | `2.0` | Mean time ratio above which a query is flagged as regressed |
| `STMT_LIMIT` | `50` | Number of top statements to track per dimension |
| `TELEGRAM_BOT_TOKEN` | (disabled) | Telegram bot token for alerts |
| `TELEGRAM_CHAT_ID` | (disabled) | Telegram chat ID for alerts |
| `ALERT_WEBHOOK_URL` | (disabled) | Slack or generic webhook URL for alerts |
| `ALERT_COOLDOWN` | `5m` | Minimum interval between repeated alerts of the same type |
| `GRAFANA_URL` | (disabled) | Grafana base URL for anomaly annotations |
| `GRAFANA_TOKEN` | (disabled) | Grafana service account token |
| `GRAFANA_DASHBOARD_UID` | (optional) | Scope annotations to a specific dashboard |
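The duration-style values (`5s`, `5m`) follow the usual `<number><unit>` shape. A minimal sketch of reading them with defaults (illustrative — pgpulse is a Go service and presumably parses these natively, e.g. via `time.ParseDuration`):

```python
# Sketch: read duration-style environment variables with the documented
# defaults. Supports only the simple "<number><unit>" shape shown above.
import os

UNITS = {"s": 1, "m": 60, "h": 3600}

def duration_seconds(value: str) -> float:
    """Parse '5s' -> 5.0, '5m' -> 300.0, '2h' -> 7200.0."""
    return float(value[:-1]) * UNITS[value[-1]]

poll = duration_seconds(os.environ.get("POLL_INTERVAL", "5s"))
cooldown = duration_seconds(os.environ.get("ALERT_COOLDOWN", "5m"))
```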
pgpulse has built-in alerting that fires on critical conditions without requiring Alertmanager. Alerts are optional — disabled unless a notification target is configured.
| Condition | Trigger | Severity |
|---|---|---|
| Connection saturation | `connections / max_connections` > 90% | Critical |
| Lock chain depth | Chain depth > 3 | Warning |
| Query regression | Any query mean time exceeds regression threshold | Warning |
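The saturation trigger in the table above reduces to a single ratio check. A minimal sketch (the 90% threshold is from the table; the function name is illustrative):

```python
# Sketch of the connection-saturation trigger: fire when used
# connections exceed 90% of max_connections.

def connection_saturation_alert(connections: int, max_connections: int,
                                threshold: float = 0.90) -> bool:
    """True when the used/max ratio crosses the alert threshold."""
    return connections / max_connections > threshold

# 95 of 100 connections used -> 0.95 > 0.90 -> alert fires
```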
Create a bot via @BotFather, add it to your group, then configure:
```yaml
targets:
  - name: prod
    dsn: "postgres://pgpulse@host:5432/mydb"
    telegramBotToken: "123456:ABC-DEF"
    telegramChatId: "-1001234567890"
```

For Slack or a generic webhook:

```yaml
targets:
  - name: prod
    dsn: "postgres://pgpulse@host:5432/mydb"
    alertWebhookUrl: "https://hooks.slack.com/services/T.../B.../..."
```

Both Telegram and webhook can be enabled simultaneously. Each alert type has an independent cooldown (default 5 minutes) to avoid spam.
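A sketch of a per-alert-type cooldown like the one described above (illustrative names, not pgpulse internals): each alert type remembers when it last fired and stays silent for the cooldown window.

```python
# Sketch: suppress repeated alerts of the same type within a cooldown
# window (default 300s, matching ALERT_COOLDOWN=5m). Each alert type
# tracks its own last-fired timestamp.
import time

class Cooldown:
    def __init__(self, seconds=300.0):
        self.seconds = seconds
        self.last_fired = {}  # alert type -> monotonic timestamp

    def should_fire(self, alert_type, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_fired.get(alert_type)
        if last is not None and now - last < self.seconds:
            return False  # still inside the cooldown window
        self.last_fired[alert_type] = now
        return True
```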
pgpulse can push annotations to Grafana when it detects spikes — visible as vertical markers on the dashboard timeline. This helps the team see "what happened at 14:30?" without scrolling through logs.
Anomalies detected:
- Connection spike — +20 connections in one poll cycle
- Lock chain spike — lock depth doubled
- Regression spike — any new query regression
```yaml
targets:
  - name: prod
    dsn: "postgres://pgpulse@host:5432/mydb"
    grafanaUrl: "http://grafana.monitoring:3000"
    grafanaToken: "glsa_xxx"
    grafanaDashboardUid: "pgpulse-main"
```

Requires a Grafana service account token with `annotations:write` permission. The dashboard UID is optional — omit it to annotate globally.
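For reference, annotations are created via Grafana's HTTP API (`POST /api/annotations` with a bearer token). A sketch of the payload shape — field names follow Grafana's annotation API; the spike text is illustrative and this builds the body without sending it:

```python
# Sketch: build a Grafana annotation payload. Grafana expects time in
# epoch milliseconds; omitting the dashboard UID annotates globally.
import json
import time

def annotation_payload(text, dashboard_uid=None):
    payload = {
        "time": int(time.time() * 1000),  # epoch milliseconds
        "tags": ["pgpulse", "anomaly"],
        "text": text,
    }
    if dashboard_uid:
        payload["dashboardUID"] = dashboard_uid
    return payload

body = json.dumps(annotation_payload("Connection spike: +23 in one poll",
                                     dashboard_uid="pgpulse-main"))
```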
- `pg_active_queries` — currently active queries
- `pg_queries_by_state{state}` — queries grouped by state
- `pg_connections_by_user{usename}` — connections per user
- `pg_connections_by_database{datname}` — connections per database
- `pg_slow_queries` — active queries exceeding the slow threshold
- `pg_longest_query_seconds` — duration of the longest running query
- `pg_waiting_queries` — active queries waiting on locks
- `pg_idle_connections` — number of idle connections
- `pg_idle_connection_seconds_total` — sum of idle time across all idle connections
- `pg_connection_age_seconds` — histogram of connection ages
- `pg_connections_exhaustion_hours` — predicted hours until `max_connections` is exhausted (-1 if stable)
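The exhaustion prediction can be pictured as a linear extrapolation of the connection-count trend (a sketch of the idea, not pgpulse's actual algorithm — which is why it needs at least two poll cycles):

```python
# Sketch: project when the connection count would hit max_connections,
# assuming a linear trend across recent polls. -1 means the trend is
# flat or falling (stable), matching the metric's documented behavior.

def exhaustion_hours(samples, max_connections):
    """samples: [(unix_seconds, connection_count), ...], oldest first."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    rate = (c1 - c0) / (t1 - t0)  # connections gained per second
    if rate <= 0:
        return -1.0
    return (max_connections - c1) / rate / 3600.0

# 10 -> 20 connections over one hour, max 100: 8.0 hours to exhaustion
hours = exhaustion_hours([(0.0, 10), (3600.0, 20)], 100)
```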
- `pg_database_size_bytes{datname}` — size of each database in bytes
- `pg_connections_max` — PostgreSQL `max_connections` setting
- `pg_connections_used_ratio` — ratio of current connections to max
- `pg_stat_statements_calls{query_fingerprint,usename}` — execution count (ordered by total time)
- `pg_stat_statements_mean_time_seconds{query_fingerprint,usename}` — mean execution time
- `pg_stat_statements_total_time_seconds{query_fingerprint,usename}` — total execution time
- `pg_stat_statements_top_by_calls{query_fingerprint,usename}` — top queries by call count
- `pg_stat_statements_top_by_mean_time_seconds{query_fingerprint,usename}` — top queries by mean time
- `pg_stmt_regressions` — count of queries whose mean time regressed beyond the threshold
- `pg_stmt_mean_time_change_ratio{query_fingerprint,usename}` — current/previous mean time ratio
- `pg_stmt_calls_delta{query_fingerprint,usename}` — change in call count since last poll
- `pg_stmt_reset_detected` — 1 when a `pg_stat_statements` reset is detected (prevents false regressions)
- `pg_stmt_plan_changes{query_fingerprint,usename}` — plan count delta per query (PG14+, detects plan changes)
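The regression check implied by `pg_stmt_mean_time_change_ratio` and `REGRESSION_THRESHOLD` (default 2.0) reduces to comparing mean times across polls. A sketch (illustrative, not pgpulse source):

```python
# Sketch: flag a query as regressed when its mean execution time grew
# by more than the threshold ratio since the previous poll. A missing
# baseline (first poll, or a pg_stat_statements reset) never flags.

def regressed(prev_mean_ms, curr_mean_ms, threshold=2.0):
    """True when curr/prev exceeds the regression threshold ratio."""
    if prev_mean_ms <= 0:
        return False  # no baseline yet; avoid false regressions
    return curr_mean_ms / prev_mean_ms > threshold

# 12ms -> 30ms is a 2.5x slowdown: flagged at the default 2.0 threshold
```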
- `pg_dead_tuples{table}` — dead tuple count per table
- `pg_dead_tuple_ratio{table}` — ratio of dead to total tuples
- `pg_last_vacuum_seconds{table}` — seconds since last manual vacuum (-1 if never)
- `pg_last_autovacuum_seconds{table}` — seconds since last autovacuum (-1 if never)
- `pg_autovacuum_workers_active` — current autovacuum worker count
- `pg_autovacuum_workers_max` — max autovacuum workers setting
- `pg_table_total_bytes{table}` — total size including indexes and TOAST
- `pg_table_bytes{table}` — heap size only
- `pg_table_bloat_bytes{table}` — estimated reclaimable bloat per table
- `pg_table_bloat_ratio{table}` — bloat ratio (dead tuples / total tuples)
- `pg_index_bytes{index,table}` — individual index size
- `pg_index_scans_total{index,table}` — cumulative index scans (0 = unused index)
- `pg_table_seq_scan_ratio{table}` — ratio of sequential scans to total scans
- `pg_table_seq_scans{table}` — cumulative sequential scans
- `pg_table_index_scans{table}` — cumulative index scans
- `pg_lock_blocked_queries` — total queries blocked by locks
- `pg_lock_chain_max_depth` — deepest lock wait chain
- `pg_lock_by_type{lock_type}` — blocked queries by lock type
- `pg_wal_bytes_total` — total WAL bytes generated
- `pg_wal_bytes_per_second` — WAL generation rate
- `pg_replication_lag_bytes{slot,client_addr}` — replication lag per replica in bytes
- `pg_replication_lag_seconds{slot,client_addr}` — replication lag in seconds
- `pg_replication_connected_replicas` — number of connected replicas
- `pg_checkpoints_timed_total` — scheduled checkpoints
- `pg_checkpoints_requested_total` — requested checkpoints
- `pg_buffers_checkpoint` — buffers written during checkpoints
- `pg_node_role{role}` — 1 for the current role (`primary` or `replica`), auto-detected each poll
- `pg_up` — 1 if PostgreSQL is reachable, 0 otherwise
- `pg_scrape_duration_seconds` — time taken to collect metrics
- `pg_scrape_errors_total` — cumulative scrape error count
```
cmd/pgpulse/main.go        CLI entry point (delegates to internal/cli)
internal/
  cli/                     Cobra commands: serve, version
  config/                  Environment-based configuration
  collector/               Poll loop + 12 collectors
    activity.go            pg_stat_activity (connections, queries, states)
    database.go            Database sizes, max_connections
    statements.go          pg_stat_statements (top-N by 3 dimensions)
    regression.go          Statement delta analysis (stateful)
    vacuum.go              Dead tuples, autovacuum workers
    bloat.go               Table/index sizes
    tablestats.go          Sequential vs index scan ratios
    locks.go               Lock graph with chain depth
    wal.go                 WAL generation rate (PG14+, stateful)
    replication.go         Replication lag per replica
    connlifecycle.go       Idle connections, connection age histogram
    prediction.go          Connection exhaustion prediction (stateful)
    checkpoint.go          Checkpoint pressure (PG17-aware)
    querier.go             Interface for testability
  metrics/                 Prometheus metric definitions
charts/pgpulse/            Helm chart with ServiceMonitor + PrometheusRule
grafana/
  pgpulse-dashboard.json   64-panel importable Grafana dashboard
deploy/
  pgpulse.service          systemd unit file
  pgpulse.env.example      Environment file template
```
Import `grafana/pgpulse-dashboard.json` into Grafana, or use the Helm chart with `dashboard.enabled=true` for automatic provisioning. The dashboard includes 64 panels across 9 rows: overview stats, query activity, connections, top queries, database sizes, vacuum health, table/index sizes, lock graph, query regression detection, WAL/replication, connection lifecycle, and checkpoint pressure.
- Statement fingerprints are truncated to 80 characters
- No support for multiple PostgreSQL instances in a single process
- No TLS client certificate auth for the metrics endpoint
- Connection exhaustion prediction requires at least 2 poll cycles of data
- WAL metrics require PostgreSQL 14+ (`pg_stat_wal`)