pgpulse

A heartbeat monitor for PostgreSQL — polls pg_stat_activity, pg_stat_statements, pg_locks, and more, then exposes Prometheus metrics on /metrics.

What pgpulse is

  • A lightweight sidecar that connects to PostgreSQL and exposes 30+ Prometheus-compatible metrics
  • A poll-based exporter for activity, connections, slow queries, vacuum health, table/index bloat, lock graphs, replication lag, WAL rate, checkpoint pressure, and query regression detection
  • Compatible with PostgreSQL 12+ (auto-detects version for correct column names)
  • Ships with a 64-panel Grafana dashboard and a Helm chart with ServiceMonitor
  • Zero config beyond a DSN — sensible defaults for everything

What pgpulse is NOT

  • Not a replacement for pg_stat_monitor or pganalyze — pgpulse exposes raw counters and deltas, not analyzed insights
  • Not a query profiler — it captures top-N statements by total time, calls, and mean time, not full query plans
  • Not a connection pooler — it uses 2 connections max and does not proxy traffic
  • Not an alerting engine — pair it with Alertmanager or Grafana alerts

Philosophy

Observe, don't interfere. pgpulse opens a read-only window into PostgreSQL's own statistics views. It adds no extensions, modifies no data, and uses minimal resources. The metrics tell you what's happening; you decide what to do about it.

PostgreSQL prerequisites

pgpulse queries PostgreSQL's built-in statistics views. Most metrics work out of the box, but some features require configuration.

Required: a monitoring role

Create a dedicated role with read access to statistics views:

CREATE ROLE pgpulse WITH LOGIN PASSWORD 'your-secure-password';
GRANT pg_monitor TO pgpulse;

The pg_monitor role (PostgreSQL 10+) grants read access to pg_stat_activity, pg_stat_replication, pg_locks, and all other statistics views pgpulse needs.
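To verify the grant took effect, you can check role membership from psql (a quick sanity check, not part of pgpulse itself):

```sql
-- Returns true if the pgpulse role inherits pg_monitor's read access
SELECT pg_has_role('pgpulse', 'pg_monitor', 'member');
```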

Optional: pg_stat_statements

Statement-level metrics (top queries by time/calls, query regression detection) require the pg_stat_statements extension. pgpulse auto-detects its presence and skips statement metrics if unavailable.

To enable:

-- 1. Add to postgresql.conf (requires restart):
--    shared_preload_libraries = 'pg_stat_statements'

-- 2. Create the extension in your database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

pgpulse auto-detects PostgreSQL version and uses the correct column names (total_exec_time on PG13+, total_time on PG12).
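The detection can be reproduced by hand; the query below is a sketch of the idea (not pgpulse's actual code) showing which column name applies to your server:

```sql
-- server_version_num is e.g. 120015 on PG12, 130010 on PG13
SELECT CASE WHEN current_setting('server_version_num')::int >= 130000
            THEN 'total_exec_time'
            ELSE 'total_time'
       END AS time_column;
```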

Version-specific features

Feature                                       Minimum PG version   Notes
Core metrics (activity, connections, sizes)   12                   Always available
pg_stat_statements                            12                   Requires extension (see above)
WAL generation rate (pg_stat_wal)             14                   Auto-skipped on older versions
Checkpoint stats (pg_stat_checkpointer)       17                   Falls back to pg_stat_bgwriter on older versions
Replication lag (replay_lag)                  10                   Graceful skip if not a primary
Node role detection (pg_is_in_recovery)       12                   Auto-detects primary vs replica each poll

Required: pg_hba.conf

Allow the pgpulse role to connect from your monitoring network:

# Same host
host    all    pgpulse    127.0.0.1/32    scram-sha-256

# Kubernetes pod CIDR
host    all    pgpulse    10.244.0.0/16    scram-sha-256

Reload after editing: SELECT pg_reload_conf();

Which database to connect to

pgpulse must connect to the application database, not postgres. Per-table metrics (vacuum health, bloat, table sizes, index sizes, sequential scan ratios) use pg_stat_user_tables which only shows tables in the connected database.

Cluster-wide metrics (activity, connections, replication, WAL, checkpoints) work from any database.

If you have multiple databases to monitor, create one target per database (see multi-database monitoring below).

Connection string

pgpulse connects using a standard PostgreSQL DSN:

postgres://pgpulse@hostname:5432/mydb?sslmode=require

For production, always use sslmode=require or sslmode=verify-full.

pgbouncer: connect directly to PostgreSQL

pgpulse must connect directly to PostgreSQL, not through pgbouncer or other connection poolers. pgbouncer does not proxy system views (pg_stat_activity, pg_stat_statements, pg_locks, pg_stat_replication), so most metrics will fail or return empty results.

                        Direct (port 5432)   pgbouncer (port 6432)
pg_stat_activity        works                empty/error
pg_stat_statements      works                error
pg_locks                works                empty
pg_stat_replication     works                error
Database sizes          works                error

Always point PG_DSN at PostgreSQL's native port (default 5432), not the pgbouncer port.

HA clusters (Patroni, repmgr, Stolon)

pgpulse auto-detects whether each target is a primary or replica by querying pg_is_in_recovery() every poll cycle. This means:

  • Specify all nodes as separate targets — pgpulse figures out the roles
  • Failover is automatic — if Patroni promotes a replica, the pg_node_role metric flips on the next poll
  • The pg_node_role{role="primary|replica"} metric lets you filter and compare nodes in Grafana
  • Replication metrics are only collected on the primary (auto-skipped on replicas)
  • All other metrics (activity, vacuum, bloat, locks, etc.) are collected on every node

In Grafana, use the instance and role template variables to filter by node or compare primary vs replicas.
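For ad-hoc queries outside the dashboard, a PromQL expression along these lines (assuming the standard instance label Prometheus attaches at scrape time) restricts a metric to whichever node is currently primary:

```promql
pg_active_queries and on(instance) pg_node_role{role="primary"} == 1
```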

Multi-database monitoring

Per-table metrics (pg_stat_user_tables, pg_stat_user_indexes) are scoped to the connected database. To monitor multiple databases, create one target per database per node.

Example: 3 nodes × 3 databases = 9 targets. Use a values file (pgpulse-values.yaml):

targets:
  # Node 1
  - name: node1-appdb1
    dsn: "postgres://pgpulse@node1:5432/appdb1?sslmode=require"
  - name: node1-appdb2
    dsn: "postgres://pgpulse@node1:5432/appdb2?sslmode=require"
  - name: node1-appdb3
    dsn: "postgres://pgpulse@node1:5432/appdb3?sslmode=require"
  # Node 2
  - name: node2-appdb1
    dsn: "postgres://pgpulse@node2:5432/appdb1?sslmode=require"
  - name: node2-appdb2
    dsn: "postgres://pgpulse@node2:5432/appdb2?sslmode=require"
  - name: node2-appdb3
    dsn: "postgres://pgpulse@node2:5432/appdb3?sslmode=require"
  # Node 3
  - name: node3-appdb1
    dsn: "postgres://pgpulse@node3:5432/appdb1?sslmode=require"
  - name: node3-appdb2
    dsn: "postgres://pgpulse@node3:5432/appdb2?sslmode=require"
  - name: node3-appdb3
    dsn: "postgres://pgpulse@node3:5432/appdb3?sslmode=require"

serviceMonitor:
  enabled: true
  scrapeTimeout: "10s"
  labels:
    release: prometheus-operator

prometheusRule:
  enabled: true
  labels:
    release: prometheus-operator

dashboard:
  enabled: true

Then install:

helm upgrade --install pgpulse charts/pgpulse/ -f pgpulse-values.yaml -n pgpulse-system

Each target creates its own Deployment (2 PostgreSQL connections each). Cluster-wide metrics (activity, connections, replication) will be identical across databases on the same node; the duplication is harmless. If you need pg_stat_statements metrics, create the extension in each database.
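With many nodes and databases the targets list gets repetitive; a short script can generate it (a convenience sketch, with node and database names mirroring the example above):

```python
# Generate the targets section of pgpulse-values.yaml for N nodes x M databases.
nodes = ["node1", "node2", "node3"]
databases = ["appdb1", "appdb2", "appdb3"]

lines = ["targets:"]
for node in nodes:
    for db in databases:
        lines.append(f"  - name: {node}-{db}")
        lines.append(f'    dsn: "postgres://pgpulse@{node}:5432/{db}?sslmode=require"')

print("\n".join(lines))
```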

Quick start

# Build
make build

# Run
export PG_DSN="postgres://pgpulse@localhost:5432/mydb?sslmode=disable"
./bin/pgpulse serve

# Docker
docker build -t pgpulse:dev .
docker run -e PG_DSN="postgres://pgpulse@localhost/mydb" -p 9187:9187 pgpulse:dev

Metrics at http://localhost:9187/metrics, health check at /healthz.
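The /metrics endpoint speaks the standard Prometheus text exposition format; a minimal parser sketch for quick scripting (the sample lines below are illustrative, not captured from a real scrape):

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}.

    Ignores labels and # HELP / # TYPE comment lines; later samples of the
    same metric name overwrite earlier ones."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        out[name] = float(value)
    return out

sample = """\
# HELP pg_up Whether PostgreSQL is reachable
# TYPE pg_up gauge
pg_up 1
pg_active_queries 4
pg_scrape_duration_seconds 0.031
"""
metrics = parse_metrics(sample)
print(metrics["pg_up"])  # 1.0
```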

Helm (Kubernetes)

helm upgrade --install pgpulse charts/pgpulse/ -f pgpulse-values.yaml -n pgpulse-system

Minimal pgpulse-values.yaml:

targets:
  - name: prod
    dsn: "postgres://pgpulse@pghost:5432/mydb?sslmode=require"

serviceMonitor:
  enabled: true
  labels:
    release: prometheus-operator

prometheusRule:
  enabled: true
  labels:
    release: prometheus-operator

Each target gets its own Deployment. ServiceMonitor auto-discovers all targets. PrometheusRule ships with 5 opinionated alerts. Grafana dashboard auto-loads via sidecar ConfigMap. For multi-node and multi-database setups, see HA clusters and multi-database monitoring.

systemd

sudo cp bin/pgpulse /usr/local/bin/
sudo cp deploy/pgpulse.service /etc/systemd/system/
sudo mkdir -p /etc/pgpulse
sudo cp deploy/pgpulse.env.example /etc/pgpulse/pgpulse.env
sudo chmod 600 /etc/pgpulse/pgpulse.env
# Edit PG_DSN in /etc/pgpulse/pgpulse.env, then:
sudo systemctl daemon-reload
sudo systemctl enable --now pgpulse

Configuration

All configuration is via environment variables:

Variable               Default      Description
PG_DSN or DATABASE_URL (required)   PostgreSQL connection string
METRICS_PORT           9187         Port for the HTTP metrics server
POLL_INTERVAL          5s           How often to collect metrics
SLOW_QUERY_THRESHOLD   5s           Duration after which a query is counted as slow
REGRESSION_THRESHOLD   2.0          Mean time ratio above which a query is flagged as regressed
STMT_LIMIT             50           Number of top statements to track per dimension
TELEGRAM_BOT_TOKEN     (disabled)   Telegram bot token for alerts
TELEGRAM_CHAT_ID       (disabled)   Telegram chat ID for alerts
ALERT_WEBHOOK_URL      (disabled)   Slack or generic webhook URL for alerts
ALERT_COOLDOWN         5m           Minimum interval between repeated alerts of the same type
GRAFANA_URL            (disabled)   Grafana base URL for anomaly annotations
GRAFANA_TOKEN          (disabled)   Grafana service account token
GRAFANA_DASHBOARD_UID  (optional)   Scope annotations to a specific dashboard

Alerting

pgpulse has built-in alerting that fires on critical conditions without requiring Alertmanager. Alerts are optional — disabled unless a notification target is configured.

Alert conditions

Condition               Trigger                                             Severity
Connection saturation   connections / max_connections > 90%                 Critical
Lock chain depth        Chain depth > 3                                     Warning
Query regression        Any query mean time exceeds regression threshold   Warning
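The regression condition amounts to a ratio test against REGRESSION_THRESHOLD; a simplified sketch (pgpulse's actual collector also handles pg_stat_statements resets):

```python
REGRESSION_THRESHOLD = 2.0  # default: mean-time ratio

def regressed(prev_mean_ms, curr_mean_ms, threshold=REGRESSION_THRESHOLD):
    """Flag a query whose mean execution time grew beyond the threshold ratio."""
    if prev_mean_ms <= 0:
        return False  # no baseline yet
    return curr_mean_ms / prev_mean_ms > threshold

print(regressed(10.0, 25.0))  # True: 2.5x slower than baseline
print(regressed(10.0, 15.0))  # False: only 1.5x
```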

Telegram

Create a bot via @BotFather, add it to your group, then configure:

targets:
  - name: prod
    dsn: "postgres://pgpulse@host:5432/mydb"
    telegramBotToken: "123456:ABC-DEF"
    telegramChatId: "-1001234567890"

Slack / generic webhook

targets:
  - name: prod
    dsn: "postgres://pgpulse@host:5432/mydb"
    alertWebhookUrl: "https://hooks.slack.com/services/T.../B.../..."

Both Telegram and webhook can be enabled simultaneously. Each alert type has an independent cooldown (default 5 minutes) to avoid spam.
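The cooldown semantics can be sketched as per-alert-type suppression (an illustration of the behavior, not the exporter's actual implementation):

```python
import time

class Cooldown:
    """Suppress repeated alerts of the same type within a cooldown window."""

    def __init__(self, seconds=300):  # ALERT_COOLDOWN default: 5m
        self.seconds = seconds
        self.last_fired = {}  # alert type -> timestamp of last fired alert

    def should_fire(self, alert_type, now=None):
        now = time.time() if now is None else now
        last = self.last_fired.get(alert_type)
        if last is not None and now - last < self.seconds:
            return False  # still cooling down for this alert type
        self.last_fired[alert_type] = now
        return True

cd = Cooldown(seconds=300)
print(cd.should_fire("connection_saturation", now=0))    # True: first fire
print(cd.should_fire("connection_saturation", now=60))   # False: within cooldown
print(cd.should_fire("lock_chain_depth", now=60))        # True: independent type
print(cd.should_fire("connection_saturation", now=301))  # True: cooldown elapsed
```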

Grafana anomaly annotations

pgpulse can push annotations to Grafana when it detects spikes — visible as vertical markers on the dashboard timeline. This helps the team see "what happened at 14:30?" without scrolling through logs.

Anomalies detected:

  • Connection spike — +20 connections in one poll cycle
  • Lock chain spike — lock depth doubled
  • Regression spike — any new query regression

Configure per target:

targets:
  - name: prod
    dsn: "postgres://pgpulse@host:5432/mydb"
    grafanaUrl: "http://grafana.monitoring:3000"
    grafanaToken: "glsa_xxx"
    grafanaDashboardUid: "pgpulse-main"

Requires a Grafana service account token with annotations:write permission. Dashboard UID is optional — omit to annotate globally.
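Under the hood this uses Grafana's HTTP annotations API; a rough sketch of an equivalent request (field names follow Grafana's /api/annotations endpoint; the URL, token, and annotation text below are placeholders):

```python
import json
import urllib.request

def build_annotation_request(base_url, token, text, tags, dashboard_uid=None, ts_ms=0):
    """Build (but do not send) a POST to Grafana's /api/annotations endpoint."""
    payload = {"time": ts_ms, "text": text, "tags": tags}
    if dashboard_uid:
        payload["dashboardUID"] = dashboard_uid  # omit to annotate globally
    return urllib.request.Request(
        f"{base_url}/api/annotations",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_annotation_request(
    "http://grafana.monitoring:3000", "glsa_xxx",
    "Connection spike: +23 in one poll", ["pgpulse", "anomaly"],
    dashboard_uid="pgpulse-main", ts_ms=1700000000000,
)
print(req.full_url)  # http://grafana.monitoring:3000/api/annotations
```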

Metrics

Activity (pg_stat_activity)

  • pg_active_queries — currently active queries
  • pg_queries_by_state{state} — queries grouped by state
  • pg_connections_by_user{usename} — connections per user
  • pg_connections_by_database{datname} — connections per database
  • pg_slow_queries — active queries exceeding the slow threshold
  • pg_longest_query_seconds — duration of the longest running query
  • pg_waiting_queries — active queries waiting on locks

Connection lifecycle

  • pg_idle_connections — number of idle connections
  • pg_idle_connection_seconds_total — sum of idle time across all idle connections
  • pg_connection_age_seconds — histogram of connection ages
  • pg_connections_exhaustion_hours — predicted hours until max_connections exhausted (-1 if stable)
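The exhaustion prediction is a linear extrapolation of connection growth between polls; a sketch under that assumption (the real collector keeps more history than two samples):

```python
def hours_until_exhaustion(prev_conns, curr_conns, max_conns, poll_interval_s=5):
    """Extrapolate the current growth rate to predict hours until max_connections.

    Returns -1 when connections are stable or shrinking, matching the
    metric's -1-if-stable convention."""
    growth_per_s = (curr_conns - prev_conns) / poll_interval_s
    if growth_per_s <= 0:
        return -1
    return (max_conns - curr_conns) / growth_per_s / 3600

print(hours_until_exhaustion(90, 90, 100))        # -1: stable
print(hours_until_exhaustion(80, 90, 100))        # growing 2/s: exhaustion imminent
```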

Database

  • pg_database_size_bytes{datname} — size of each database in bytes
  • pg_connections_max — PostgreSQL max_connections setting
  • pg_connections_used_ratio — ratio of current connections to max

Statements (pg_stat_statements)

  • pg_stat_statements_calls{query_fingerprint,usename} — execution count (ordered by total time)
  • pg_stat_statements_mean_time_seconds{query_fingerprint,usename} — mean execution time
  • pg_stat_statements_total_time_seconds{query_fingerprint,usename} — total execution time
  • pg_stat_statements_top_by_calls{query_fingerprint,usename} — top queries by call count
  • pg_stat_statements_top_by_mean_time_seconds{query_fingerprint,usename} — top queries by mean time

Query regression detection

  • pg_stmt_regressions — count of queries whose mean time regressed beyond threshold
  • pg_stmt_mean_time_change_ratio{query_fingerprint,usename} — current/previous mean time ratio
  • pg_stmt_calls_delta{query_fingerprint,usename} — change in call count since last poll
  • pg_stmt_reset_detected — 1 when pg_stat_statements reset detected (prevents false regressions)
  • pg_stmt_plan_changes{query_fingerprint,usename} — plan count delta per query (PG14+, detects plan changes)

Vacuum health

  • pg_dead_tuples{table} — dead tuple count per table
  • pg_dead_tuple_ratio{table} — ratio of dead to total tuples
  • pg_last_vacuum_seconds{table} — seconds since last manual vacuum (-1 if never)
  • pg_last_autovacuum_seconds{table} — seconds since last autovacuum (-1 if never)
  • pg_autovacuum_workers_active — current autovacuum worker count
  • pg_autovacuum_workers_max — max autovacuum workers setting

Table and index sizes

  • pg_table_total_bytes{table} — total size including indexes and toast
  • pg_table_bytes{table} — heap size only
  • pg_table_bloat_bytes{table} — estimated reclaimable bloat per table
  • pg_table_bloat_ratio{table} — bloat ratio (dead tuples / total tuples)
  • pg_index_bytes{index,table} — individual index size
  • pg_index_scans_total{index,table} — cumulative index scans (0 = unused index)

Table statistics

  • pg_table_seq_scan_ratio{table} — ratio of sequential scans to total scans
  • pg_table_seq_scans{table} — cumulative sequential scans
  • pg_table_index_scans{table} — cumulative index scans

Lock graph (pg_locks)

  • pg_lock_blocked_queries — total queries blocked by locks
  • pg_lock_chain_max_depth — deepest lock wait chain
  • pg_lock_by_type{lock_type} — blocked queries by lock type
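The chain-depth metric can be illustrated from blocker→blocked edges (a simplified sketch assuming one blocker per blocked process; pgpulse derives the real edges from pg_locks):

```python
def max_chain_depth(edges):
    """Longest wait chain given (blocker_pid, blocked_pid) pairs.

    Depth 1 = one process blocked by another; depth > 3 triggers the
    lock chain warning alert."""
    blocked_by = {blocked: blocker for blocker, blocked in edges}

    def depth(pid, seen=()):
        blocker = blocked_by.get(pid)
        if blocker is None or blocker in seen:  # chain root, or a deadlock cycle
            return 0
        return 1 + depth(blocker, seen + (pid,))

    return max((depth(pid) for pid in blocked_by), default=0)

# 101 blocks 102, 102 blocks 103, 103 blocks 104 -> deepest chain is 3
print(max_chain_depth([(101, 102), (102, 103), (103, 104)]))  # 3
```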

WAL (PG14+)

  • pg_wal_bytes_total — total WAL bytes generated
  • pg_wal_bytes_per_second — WAL generation rate

Replication

  • pg_replication_lag_bytes{slot,client_addr} — replication lag per replica
  • pg_replication_lag_seconds{slot,client_addr} — replication lag in seconds
  • pg_replication_connected_replicas — number of connected replicas

Checkpoint

  • pg_checkpoints_timed_total — scheduled checkpoints
  • pg_checkpoints_requested_total — requested checkpoints
  • pg_buffers_checkpoint — buffers written during checkpoints

Node role

  • pg_node_role{role} — 1 for current role (primary or replica), auto-detected each poll

Scrape health

  • pg_up — 1 if PostgreSQL is reachable, 0 otherwise
  • pg_scrape_duration_seconds — time taken to collect metrics
  • pg_scrape_errors_total — cumulative scrape error count

Architecture

cmd/pgpulse/main.go              CLI entry point (delegates to internal/cli)
internal/
  cli/                            Cobra commands: serve, version
  config/                         Environment-based configuration
  collector/                      Poll loop + 12 collectors
    activity.go                   pg_stat_activity (connections, queries, states)
    database.go                   Database sizes, max_connections
    statements.go                 pg_stat_statements (top-N by 3 dimensions)
    regression.go                 Statement delta analysis (stateful)
    vacuum.go                     Dead tuples, autovacuum workers
    bloat.go                      Table/index sizes
    tablestats.go                 Sequential vs index scan ratios
    locks.go                      Lock graph with chain depth
    wal.go                        WAL generation rate (PG14+, stateful)
    replication.go                Replication lag per replica
    connlifecycle.go              Idle connections, connection age histogram
    prediction.go                 Connection exhaustion prediction (stateful)
    checkpoint.go                 Checkpoint pressure (PG17-aware)
    querier.go                    Interface for testability
  metrics/                        Prometheus metric definitions
charts/pgpulse/                   Helm chart with ServiceMonitor + PrometheusRule
grafana/
  pgpulse-dashboard.json          64-panel importable Grafana dashboard
deploy/
  pgpulse.service                 systemd unit file
  pgpulse.env.example             Environment file template

Grafana dashboard

Import grafana/pgpulse-dashboard.json into Grafana or use the Helm chart with dashboard.enabled=true for automatic provisioning. The dashboard's 64 panels are organized into rows covering: overview stats, query activity, connections, top queries, database sizes, vacuum health, table/index sizes, lock graph, query regression detection, WAL/replication, connection lifecycle, and checkpoint pressure.

Known limitations

  • Statement fingerprints are truncated to 80 characters
  • No support for multiple PostgreSQL instances in a single process
  • No TLS client certificate auth for the metrics endpoint
  • Connection exhaustion prediction requires at least 2 poll cycles of data
  • WAL metrics require PostgreSQL 14+ (pg_stat_wal)

License

MIT

About

PostgreSQL Prometheus metrics exporter — pulse check for Patroni clusters
