
fix: memory leak in libp2p bootstrap node #13

Open
SwaroopH wants to merge 8 commits into main from fix/memory-leak

@SwaroopH
Member

Summary

  • Bound libp2p resource manager connections to ConnManagerHighWater + 100 so gossipsub peer-scoring state cannot grow unbounded as peers churn.
  • Add a peerstore GC goroutine that clears addresses and removes disconnected peers on a 2-minute tick.
  • Introduce BootstrapNode.Close() with cancelable context, NotifyBundle deregistration, and DHT close — wired into cmd/main.go shutdown.
  • Add a pprof server (gated on PPROF_PORT) and a 60-second status logger reporting connected/peerstore/dht_rt/goroutines/heap_alloc/heap_inuse/sys.
  • Tooling: docker compose v1/v2 detection in start.sh / stop.sh / build-docker.sh, --build flag for start.sh, signal trap for clean shutdown, Go base image bumped to 1.25, README debugging guide.

Why

The bootstrap node was leaking memory under sustained peer churn. Three root causes were identified:

  1. Unbounded resource limits. Conns, ConnsInbound, ConnsOutbound, Memory, and FD were all rcmgr.Unlimited in both the System and Transient scopes. Gossipsub peer-score state is tracked per peer/connection, so unbounded conns meant unbounded scoring memory.
  2. Peerstore never cleaned. Disconnected peers stayed in the peerstore (addrs, keybook, protobook, metadata) indefinitely. Per the libp2p contract, RemovePeer does not clear addresses, so we now call ClearAddrs first and then RemovePeer.
  3. Incomplete shutdown. The previous shutdown path called only host.Close(). The DHT, the network notification bundle, and the parent context for DHT/gossipsub were never released, so background goroutines and routing-table state lingered across restarts in long-running test cycles.
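
The peerstore cleanup from root cause 2 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: peerID, peerStore, and memStore are hypothetical stand-ins so the example runs without libp2p, and only the ClearAddrs-then-RemovePeer ordering is taken from the description above.

```go
package main

import "fmt"

// peerID stands in for libp2p's peer.ID (assumption: a plain string here).
type peerID string

// peerStore is a hypothetical, minimal view of the peerstore methods named in
// the PR description. It is not the real libp2p interface.
type peerStore interface {
	Peers() []peerID
	ClearAddrs(p peerID)
	RemovePeer(p peerID)
}

// gcPeerstore removes every peer that is neither the local node nor currently
// connected. Addresses are cleared first because, per the PR description,
// RemovePeer alone leaves them behind. It returns the number of peers removed.
func gcPeerstore(ps peerStore, self peerID, connected func(peerID) bool) int {
	removed := 0
	for _, p := range ps.Peers() {
		if p == self || connected(p) {
			continue
		}
		ps.ClearAddrs(p)
		ps.RemovePeer(p)
		removed++
	}
	return removed
}

// memStore is an in-memory fake used only to exercise gcPeerstore.
type memStore struct {
	addrs map[peerID][]string
	meta  map[peerID]bool
}

// Peers returns a snapshot, so the store can be mutated while iterating it.
func (m *memStore) Peers() []peerID {
	out := make([]peerID, 0, len(m.meta))
	for p := range m.meta {
		out = append(out, p)
	}
	return out
}

func (m *memStore) ClearAddrs(p peerID) { delete(m.addrs, p) }
func (m *memStore) RemovePeer(p peerID) { delete(m.meta, p) }

func main() {
	ps := &memStore{
		addrs: map[peerID][]string{"self": {"a"}, "live": {"b"}, "gone": {"c"}},
		meta:  map[peerID]bool{"self": true, "live": true, "gone": true},
	}
	connected := func(p peerID) bool { return p == "live" }
	removed := gcPeerstore(ps, "self", connected)
	fmt.Printf("Peerstore GC: removed %d stale peers\n", removed)
}
```

In the PR this pass runs from a goroutine on a 2-minute ticker; the sketch shows a single pass only.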

Changes

pkg/service/bootstrap.go

  • Resource limits in both System and Transient scopes: Conns / ConnsInbound / ConnsOutbound now rcmgr.LimitVal(cfg.ConnManagerHighWater + 100) instead of Unlimited.
  • BootstrapNode now holds Pubsub, notificationBundle, ctx, and cancel for proper lifecycle.
  • A cancelable child context (hostCtx) is used for DHT and gossipsub construction; it is cancelled on every constructor failure path.
  • New Close() method orders cleanup as: cancel context → Network().StopNotify(bundle) → DHT.Close() → Host.Close(), aggregating errors into a single returned error.
  • New startPeerstoreGC() goroutine ticks every 2 minutes; for each non-self peer not currently connected, it calls ClearAddrs(p) then RemovePeer(p) and logs a per-cycle summary at info level when anything was removed (debug otherwise).
  • The gossipsub instance is now retained on the struct (previously _, err = pubsub.NewGossipSub(...)), so the GC root is explicit and we can extend pubsub usage later without reconstructing the instance.

cmd/main.go

  • Optional pprof server gated on PPROF_PORT env var; logs the listen address on startup and an error if ListenAndServe fails.
  • 60-second status ticker (replacing the previous Connected peers: N log) emits:
    Status: connected=N peerstore=N dht_rt=N goroutines=N heap_alloc=NMB heap_inuse=NMB sys=NMB
    
  • Shutdown calls node.Close() (instead of node.Host.Close()) and logs aggregated shutdown errors.
  • signal.Stop(sigs) is called on the shutdown path.

Docker / scripts

  • Dockerfile: golang:1.24.5-alpine → golang:1.25-alpine.
  • docker-compose.yaml: expose ${PPROF_PORT:-6060}:${PPROF_PORT:-6060} and pass PPROF_PORT through as an environment variable.
  • start.sh, stop.sh, build-docker.sh: detect docker compose plugin vs the standalone docker-compose binary and fail clearly if neither is present.
  • start.sh: parses a --build flag, runs <DOCKER_COMPOSE> build first when set, and installs a signal trap that runs <DOCKER_COMPOSE> down on SIGINT/SIGTERM.

Docs / misc

  • README.md: new Debugging Memory / Performance section explaining the status line metrics and pprof recipes (heap, -alloc_space, two-snapshot -base diff, goroutine dump).
  • .gitignore: add .DS_Store.

File map

File                      Lines        Notes
pkg/service/bootstrap.go  +125 / −21   Core fix
cmd/main.go               +29 / −4     pprof + status metrics + graceful shutdown
start.sh                  +45 / −6     docker compose detection, --build, signal trap
README.md                 +40 / −0     Debugging guide
stop.sh                   +11 / −1     docker compose detection
build-docker.sh           +11 / −1     docker compose detection
docker-compose.yaml       +2 / −0      pprof port exposure
Dockerfile                +1 / −1      Go 1.25
.gitignore                +1 / −1      .DS_Store

Commits

SHA      Message
5188296  fix: memory leak issues
4d24826  chore: clean up docker compose command handling and docker image build
0eb61ea  refactor: BootstrapNode structure with notification bundle and improve resource management during shutdown
a9f7554  fix: improve shutdown error handling and add peerstore garbage collection in BootstrapNode
f762208  refactor: optimize peerstore garbage collection in BootstrapNode by reducing ticker interval and improving logging
fb6ddb7  feat: add pprof support for performance monitoring and enhance logging metrics in BootstrapNode

Test plan

  • go build ./... succeeds on Go 1.25.
  • ./start.sh --build builds the image and runs the container; ./stop.sh shuts it down cleanly on both docker compose (plugin) and docker-compose (standalone) hosts.
  • With PPROF_PORT=6060 set, curl http://localhost:6060/debug/pprof/heap returns a profile and the port is reachable from the host.
  • Status line appears every 60s in container logs and includes all 7 fields.
  • Take two heap snapshots ~30 minutes apart under realistic peer churn; go tool pprof -base heap1.pb.gz heap2.pb.gz does not show unbounded growth in gossipsub peer-score allocations.
  • After repeated connect/disconnect cycles, the peerstore= value in the status line tracks connected= rather than monotonically increasing, and Peerstore GC: removed N stale peers log lines appear.
  • On SIGINT, logs show Shutting down bootstrap node... followed by clean DHT and host close (no Error during shutdown lines).
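
The pprof check in the test plan can also be approximated in-process, without a container. This is a sketch under assumptions: pprofMux, pprofAddr, and fetchHeap are hypothetical helpers, and binding to ":" + PPROF_PORT is a guess at the listen address, which the PR does not state.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"os"
)

// pprofMux returns a mux that serves the standard pprof index and the named
// profiles beneath it, such as /debug/pprof/heap.
func pprofMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	return mux
}

// pprofAddr returns the listen address for the pprof server, or "" when
// PPROF_PORT is unset, mirroring the gating described in the PR.
// Assumption: the server binds on all interfaces at that port.
func pprofAddr(getenv func(string) string) string {
	port := getenv("PPROF_PORT")
	if port == "" {
		return ""
	}
	return ":" + port
}

// fetchHeap requests the heap profile from the handler without opening a
// socket and reports the HTTP status and the body size in bytes.
func fetchHeap(h http.Handler) (status int, size int) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/debug/pprof/heap", nil)
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.Len()
}

func main() {
	if addr := pprofAddr(os.Getenv); addr != "" {
		fmt.Println("pprof enabled, address:", addr)
	} else {
		fmt.Println("PPROF_PORT not set, pprof server disabled")
	}
	code, n := fetchHeap(pprofMux())
	fmt.Printf("heap profile: status=%d bytes=%d\n", code, n)
}
```

This only confirms that the handler serves a profile; reachability of the port from the host still has to be checked against the running container as described in the test plan.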

Risk / rollout

  • Resource limits are now bounded by ConnManagerHighWater + 100. Confirm the configured high-water mark is sized for expected peak inbound peer count — too low will reject legitimate connections during traffic spikes.
  • The peerstore GC removes addresses and peer entries for any peer not currently connected. Calling ClearAddrs before RemovePeer follows the documented libp2p contract, but watch for any external code that assumed addresses persisted across short disconnects.
  • Go 1.25 base image is a minor toolchain bump; no language features in this PR require it, so reverting the Dockerfile alone is a safe rollback if needed.

@SwaroopH requested a review from anomit on April 25, 2026 00:27