Skip to content

Use shared SshClient to prevent sshd-SshClient thread leak, to prevent excessive number of threads created#2008

Open
olamy wants to merge 1 commit intojenkinsci:masterfrom
olamy:ssh-client-single-instance
Open

Use shared SshClient to prevent sshd-SshClient thread leak, to prevent excessive number of threads created#2008
olamy wants to merge 1 commit intojenkinsci:masterfrom
olamy:ssh-client-single-instance

Conversation

@olamy
Copy link
Copy Markdown
Member

@olamy olamy commented Apr 17, 2026

  • Replace per-call SshClient creation with a single shared instance managed by SSHClientManager, started once at plugin init and stopped at Jenkins shutdown. This
    mirrors the pattern used in the cloudbees-ssh-slaves-plugin.
  • Move signature algorithm preferences from per-client setSignatureFactories() to per-session KexProposalOption.SERVERKEYS reordering via a SessionListener, and
    move host key verification from per-client setServerKeyVerifier() to per-session DelegatingServerKeyVerifier with session metadata.
  • This eliminates the thread leak in launchRemotingAgent() where each failed SSH attempt leaked ~100 sshd-SshClient threads because the SshClient cleanup was only
    registered via a Channel.Listener callback that was never reached on exception paths.
  • Expose performance tuning via system properties: nioWorkers, idleTimeoutMs, heartbeatIntervalMs, authTimeoutMs.

Signed-off-by: Olivier Lamy olamy@apache.org

Testing done

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

Signed-off-by: Olivier Lamy <olamy@apache.org>
@olamy olamy marked this pull request as ready for review April 17, 2026 09:15
@olamy
Copy link
Copy Markdown
Member Author

olamy commented Apr 17, 2026

this PR is a bit more aggressive especially applying some performance changes (so there is a conservative option to at least fix the thread leak bug #2007)

@res0nance
Copy link
Copy Markdown
Contributor

Are you running this in your own environment? Has it solve your problems

@jglick
Copy link
Copy Markdown
Member

jglick commented Apr 28, 2026

@nevingeorgesunny you did some changes in this plugin not long ago; perhaps you want to review this and/or #2007?

@olamy there is no information on testing here. Was it tested at scale somehow, and confirmed to fix the reported problem?

CloudBees-internal link

@olamy
Copy link
Copy Markdown
Member Author

olamy commented Apr 28, 2026

Well the other #2007 is just fixing the symptom but not the root (performance bad pattern) cause. While this PR will really fix the performance root cause and obviously the associated leak fixed by #2007 because not re creating client for each connection tentative.

Some figures

A default NIO2 client creates three pools on start():

Pool Size Source
<client>-timer 1 scheduled AbstractFactoryManager.java#L539-L542
<client>-nio2 availableProcessors() + 1 Nio2ServiceFactory.java#L48
<client>-nio2-resume availableProcessors() + 1 Nio2ServiceFactory.java#L51

So on 8 cores that's 19 threads. For a single client and so per slave created by this plugin.

One client for many servers

The SshClient javadoc (lines 157-162) says it clearly one instance per application, reused across sessions, safe from multiple threads as long as you don't reconfigure it while it's running.

The pools above are paid once per client, not once per connection. Sessions are cheap: each one registers its socket with the existing AsynchronousChannelGroup.
Idle sessions don't hold threads. So one client handling 500 sessions still sits at roughly 19 threads.
50 clients handling the same 500 sessions is ~950 threads, each with a 512KB-1MB ish stack. You've spent ~500MB on thread stacks before any real work happens, and you've duplicated the AsynchronousChannelGroup 50 times, plus 50 extra passes over known_hosts, identity files, and ssh_config.

The CPU side follows the same bad pattern. Each client runs his own AsynchronousChannelGroup which runs its own selector loop and its own systems calls: 50 of them and 50 loops wake up, context-switch, and compete for the same cores to do work one single loop could've handled.

Maybe the reason to spin up a client per server was per-server behavior: different host key, different identity, different cipher preferences. You don't need a new client for any of that. See the code in the PR which is customizing this via session context.

Re (nothing urgent though) AsynchronousChannelGroup (the standard jdk nio), this could be improved as well by using sshd-mina or sshd-netty, both are improving performance issues with this but this is probably not really significant here (compared to the one client per server current pattern)

Why using a nio library because the per-connection SshClient is a worst-of-both-worlds pattern (nio vs bio).
You get the resource cost of NIO without any of the multiplexing benefit, and you pay more than a blocking library would for the same work. Using something such JSch would be less resources consuming than the current pattern.

@jglick
Copy link
Copy Markdown
Member

jglick commented May 5, 2026

@olamy all sensible in principle, it is just not apparent from the PR description whether the purported problem was reproduced in a test controller, and this PR confirmed to solve it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants