Skip to content

fix(agentless): improve session lock reliability#352

Merged
santoshpulluri merged 6 commits intomainfrom
kruti/session-lock-reliability
Mar 8, 2026
Merged

fix(agentless): improve session lock reliability#352
santoshpulluri merged 6 commits intomainfrom
kruti/session-lock-reliability

Conversation

@krutibaraiya
Copy link
Copy Markdown
Member

@krutibaraiya krutibaraiya commented Feb 26, 2026

Background

We were getting the below errors when we set Consul Http address to a direct Server IP address or an ALB address in the Consul-ESM config file.


[INFO]  consul-esm: Agent: Trying to obtain health check session...
[INFO]  consul-esm: Trying to obtain leadership..
[ERROR] consul-esm: Agent: Unable to use session lock that was held previously and presumed lost, giving up the lock (will retry): error="Lock already held"
[INFO]  consul-esm: Agent: Trying to obtain health check session...
[ERROR] consul-esm: Unable to use leader lock that was held previously and presumed lost, giving up the lock (will retry): error="Lock already held"
[INFO]  consul-esm: Trying to obtain leadership...
[WARN]  consul-esm: Agent: Lost the lock
[WARN]  consul-esm: [WARN] Error querying for health check info: error="Get \"http://10.0.3.76:8500/v1/health/service/consul-esm?dc=dc1&index=1537773&partition=default&passing=1&tag=consul-esm\": context canceled"
[WARN]  consul-esm: Error getting external node list: error="Get \"http://10.0.3.76:8500/v1/catalog/nodes?dc=dc1&index=1444910&node-meta=external-node%3Atrue&partition=default\": context canceled"
[WARN]  consul-esm: Lost leadership

Fix

  • Set SessionTTL to 30s and MonitorRetries to 6 for agentless session locks to handle transient Consul/network issues.
  • Ensure lock is properly unlocked and reset to nil on loss, preventing "Lock already held" errors and enabling clean lock reacquisition.

Testing

Tested the binary on 1 ESM instance 3 Consul server setup, got only 3 errors in a span of 3 days

Screenshot 2026-03-02 at 10 30 09 AM

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

@krutibaraiya krutibaraiya force-pushed the kruti/session-lock-reliability branch from cfcf0c4 to 287991f Compare February 26, 2026 13:22
@krutibaraiya krutibaraiya force-pushed the kruti/session-lock-reliability branch from b0bda80 to 9415bbb Compare February 26, 2026 13:50
@krutibaraiya krutibaraiya marked this pull request as ready for review February 26, 2026 13:51
@krutibaraiya krutibaraiya requested a review from a team as a code owner February 26, 2026 13:51
Comment thread agent.go Outdated
Comment thread agent.go
Comment thread leader.go
Comment thread agent.go Outdated
Comment thread agent.go Outdated
Comment thread agent.go Outdated
Comment thread agent.go
Comment thread leader.go
Comment thread agent.go
Comment thread agent.go
const LeaderKey = "leader"

const (
// TTL used for Consul sessions created by ESM instances
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add elaborate comments to describe the rationale behind setting them

    // sessionTTL is the TTL for Consul sessions created by ESM instances.
	// The session is automatically invalidated by Consul if it is not renewed
	// within this period. Must be longer than the total monitor window
	// (sessionMonitorRetries × DefaultMonitorRetryTime) to avoid split-brain,
	// where a new leader acquires the lock while the old session is still valid.
	sessionTTL = "30s"

	// sessionMonitorRetries is the number of consecutive failed attempts to
	// contact the Consul servers before the lock is considered lost and the
	// session is released. Combined with the default MonitorRetryTime of 2s,
	// this gives a monitor window of 6 × 2s = 12s — safely under the 30s
	// session TTL, ensuring the session expires before a new leader can
	// acquire the lock if Consul is truly unreachable.
	sessionMonitorRetries = 6

@santoshpulluri
Copy link
Copy Markdown
Contributor

Changes looks good for me except for a minor comment about comment enhancements

Copy link
Copy Markdown
Contributor

@santoshpulluri santoshpulluri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@santoshpulluri santoshpulluri merged commit 68277e9 into main Mar 8, 2026
53 checks passed
@santoshpulluri santoshpulluri deleted the kruti/session-lock-reliability branch March 8, 2026 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants