Option to delete idle instances during controller shutdown #530

Merged
raul-arabaolaza merged 6 commits into jenkinsci:develop from gbhat618:feature-delete-idle-agents-on-jenkins-stop
Mar 27, 2026

Contributor

@gbhat618 gbhat618 commented Mar 19, 2026

Adds an opt-in feature to delete idle agents during controller shutdown. This is useful when we want to stop a controller for an extended period of time and, to save cloud cost, don't want to delete each idle instance one by one.

The use case originated from the CloudBees High Availability feature: when a replica with idle instances goes down, those GCE agents are not transferred to another replica (this is a complex problem because of the Jenkins cloud API design). Hence we are proposing an opt-in feature to delete idle agents.

A similar feature was proposed for ec2-plugin in jenkinsci/ec2-plugin#1125.

Screenshot (image): Delete idle instance during controller shutdown

Testing done

Test 1
Wrote an integration test and verified that it passes.

mvn clean test -Dtest=com.google.jenkins.plugins.computeengine.integration.DiscardIdleInstancesOnShutdownIT
...
...
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 312.1 s -- in com.google.jenkins.plugins.computeengine.integration.DiscardIdleInstancesOnShutdownIT
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

Test 2
Tested with a CasC-based controller using the following configuration:

oneShot: false
terminateIdleDuringShutdown: true

Stopping the controller then removed the agents, including the GCE VMs.
Logs:

2026-03-19 12:50:40.781+0000 [id=35]        FINE    c.g.j.p.c.DiscardIdleInstancesTerminator#discardIdleInstances: Looking for idle GCE instances to discard during shutdown
2026-03-19 12:50:40.784+0000 [id=254]       INFO    c.g.j.p.c.DiscardIdleInstancesTerminator#lambda$terminateInstanceAsync$1: Discarding idle GCE instance gbhat-21518-gce-rn069d during shutdown
2026-03-19 12:50:40.784+0000 [id=254]       INFO    c.g.j.p.c.DiscardIdleInstancesTerminator#lambda$terminateInstanceAsync$1: Discarding idle GCE instance gbhat-21518-gce-rn069d during shutdown

when the flag was not enabled,

terminateIdleDuringShutdown: false

no deletion happened.

Test 3
Interactive UI testing with the checkbox checked and unchecked; agent deletion behaves accordingly.

With termination enabled, the instance is gone after restart (both in Jenkins and GCE):

2026-03-19 12:16:39.882+0000 [id=1162]     INFO    hudson.lifecycle.Lifecycle#onStop: Stopping Jenkins as requested by admin
2026-03-19 12:16:39.884+0000 [id=1162]     INFO    hudson.lifecycle.Lifecycle#onStatusUpdate: Stopping Jenkins
2026-03-19 12:16:39.912+0000 [id=1162]     INFO    jenkins.model.Jenkins$13#onAttained: Started termination
2026-03-19 12:16:39.924+0000 [id=1162]     FINE    c.g.j.p.c.DiscardIdleInstancesTerminator#discardIdleInstances: Looking for idle GCE instances to discard during shutdown
2026-03-19 12:16:39.927+0000 [id=90]       INFO    c.g.j.p.c.DiscardIdleInstancesTerminator#lambda$terminateInstanceAsync$1: Discarding idle GCE instance gbhat-nyphvj during shutdown
2026-03-19 12:16:39.927+0000 [id=90]       INFO    c.g.j.p.c.DiscardIdleInstancesTerminator#lambda$terminateInstanceAsync$1: Discarding idle GCE instance gbhat-nyphvj during shutdown
2026-03-19 12:16:40.703+0000 [id=1162]     FINE    c.g.j.p.c.DiscardIdleInstancesTerminator#discardIdleInstances: Done discarding idle instances, there were 1 instances to discard
2026-03-19 12:16:40.733+0000 [id=1162]     INFO    c.c.o.c.MapDBMessagingStore#close: Messaging Stopped
2026-03-19 12:16:40.735+0000 [id=1162]     INFO    jenkins.model.Jenkins$13#onAttained: Completed termination
2026-03-19 12:16:40.735+0000 [id=1162]     INFO    jenkins.model.Jenkins#_cleanUpDisconnectComputers: Starting node disconnection
2026-03-19 12:16:40.751+0000 [id=1162]     INFO    jenkins.model.Jenkins#_cleanUpShutdownPluginManager: Stopping plugin manager
2026-03-19 12:16:40.776+0000 [id=1162]     INFO    jenkins.model.Jenkins#_cleanUpPersistQueue: Persisting build queue
2026-03-19 12:16:40.835+0000 [id=1162]     INFO    jenkins.model.Jenkins#_cleanUpAwaitDisconnects: Waiting for node disconnection completion
2026-03-19 12:16:40.835+0000 [id=1162]     INFO    hudson.lifecycle.Lifecycle#onStatusUpdate: Jenkins stopped

With the checkbox unchecked, the idle instance is not terminated, even though one exists.
Note: Toggling the flag only takes effect for agents provisioned after the change, not for already existing agents.

2026-03-19 12:25:51.977+0000 [id=353]      FINE    c.g.j.p.c.DiscardIdleInstancesTerminator#discardIdleInstances: Looking for idle GCE instances to discard during shutdown
2026-03-19 12:25:51.980+0000 [id=353]      FINE    c.g.j.p.c.DiscardIdleInstancesTerminator#discardIdleInstances: Done discarding idle instances, there were 0 instances to discard
2026-03-19 12:25:52.034+0000 [id=353]      INFO    c.c.o.c.MapDBMessagingStore#close: Messaging Stopped
2026-03-19 12:25:52.035+0000 [id=353]      INFO    jenkins.model.Jenkins$13#onAttained: Completed termination
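
The note about toggling the flag can be pictured with a small model. This is only a sketch, assuming the flag value is copied onto each agent at provisioning time; `Template`, `Agent`, `provision`, and `toDiscard` are hypothetical names for illustration and are not the plugin's classes.

```java
import java.util.List;

class ProvisionTimeFlag {
    /** Hypothetical template whose checkbox can be toggled at any time. */
    static class Template {
        boolean terminateIdleDuringShutdown;
    }

    /** Hypothetical agent that snapshots the flag when it is provisioned. */
    record Agent(String name, boolean idle, boolean terminateIdleDuringShutdown) {}

    /** Creates an idle agent carrying the template's current flag value. */
    static Agent provision(Template template, String name) {
        return new Agent(name, true, template.terminateIdleDuringShutdown);
    }

    /** Selects the names of agents that are idle and were provisioned with the flag on. */
    static List<String> toDiscard(List<Agent> agents) {
        return agents.stream()
                .filter(Agent::idle)
                .filter(Agent::terminateIdleDuringShutdown)
                .map(Agent::name)
                .toList();
    }

    public static void main(String[] args) {
        Template template = new Template();
        Agent before = provision(template, "before-toggle");
        template.terminateIdleDuringShutdown = true; // toggle after the first agent exists
        Agent after = provision(template, "after-toggle");
        System.out.println(toDiscard(List.of(before, after)));
    }
}
```

In this model the agent created before the toggle keeps its old value, which matches the behavior described in the note.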

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@gbhat618 gbhat618 marked this pull request as ready for review March 19, 2026 12:53
.map(DiscardIdleInstancesTerminator::terminateInstanceAsync)
.toList();
/* Wait for all terminations to avoid classloader unload while tasks run (would cause NoClassDefFoundError and leave VMs running).
`ComputeEngineInstance._terminate` calls GCP async APIs, so should return quickly; 10s timeout is sufficient. */
Contributor Author

The test logs show that for 1 instance it completes in < 10 ms, and the entire discardIdleInstances method completes in ~1 s.
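
The waiting pattern discussed in this thread (submit one asynchronous termination per instance, then block for a bounded time) could look roughly like the following. It is a plain-JDK sketch, not the plugin's code: `IdleAgent` and `terminateAll` are hypothetical names, and the task body only counts completions where a real implementation would call the cloud API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedShutdownWait {
    /** Hypothetical stand-in for an idle agent. */
    record IdleAgent(String name) {}

    /**
     * Submits one termination task per agent, then waits until all tasks finish
     * or the timeout elapses. Returns how many tasks had completed by then.
     */
    static int terminateAll(List<IdleAgent> agents, ExecutorService pool, long timeoutSeconds) {
        AtomicInteger completed = new AtomicInteger();
        List<CompletableFuture<Void>> futures = agents.stream()
                .map(agent -> CompletableFuture.runAsync(completed::incrementAndGet, pool))
                .toList();
        try {
            // Block here so the tasks are not still running when the caller tears things down.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]))
                    .get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Stop waiting: shutdown should not hang on a slow or failed termination.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<IdleAgent> agents = List.of(new IdleAgent("agent-a"), new IdleAgent("agent-b"));
            System.out.println("completed: " + terminateAll(agents, pool, 10));
        } finally {
            pool.shutdown();
        }
    }
}
```

The timeout is a trade-off: too short risks abandoning terminations that are still in flight, too long delays shutdown when the cloud API is unreachable.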

private GoogleKeyCredential sshKeyCredential;
private Map<String, String> googleLabels;
private Integer numExecutors;
private boolean terminateIdleDuringShutdown;
Member

bound via

Comment on lines +36 to +37
* Uses {@link RealJenkinsRule} so that stopping Jenkins triggers the {@link DiscardIdleInstancesTerminator#discardIdleInstances()}
* via the {@code @Terminator} lifecycle hook.
Member

JenkinsSessionRule would probably suffice BTW, but the overhead of RealJenkinsRule is I guess minimal in this context.

() -> {
LOGGER.info(() -> "Discarding idle GCE instance " + node.getNodeName() + " during shutdown");
try {
node.terminate();
Member

This also removes the Slave, right?

Contributor Author

@gbhat618 gbhat618 Mar 26, 2026

It calls https://github.com/jenkinsci/jenkins/blob/b513733f8be01a0a0cb6dd9472214764ef667b32/core/src/main/java/hudson/slaves/AbstractCloudSlave.java#L79-L96, where _terminate is overridden by this plugin. The Jenkins.get().removeNode(this); call in the finally block then follows, which I think removes the Slave correctly. In the tests there was no leftover entry after restarting Jenkins (checked in both the automated and the manual test).

That should be correct, right?
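
The try/finally shape described in this reply can be modeled in isolation. The sketch below is a simplified stand-in, not the real `AbstractCloudSlave`: `CloudAgent`, `cleanUp`, and the `NODES` list are hypothetical, and the only point it makes is that deregistration in `finally` runs even when the cloud-specific cleanup throws.

```java
import java.util.ArrayList;
import java.util.List;

class TerminateSketch {
    /** Hypothetical registry standing in for the controller's list of nodes. */
    static final List<String> NODES = new ArrayList<>();

    /** Simplified model of a cloud agent. */
    abstract static class CloudAgent {
        final String name;

        CloudAgent(String name) {
            this.name = name;
            NODES.add(name);
        }

        /** Cloud-specific cleanup, e.g. deleting the VM; may fail with an unchecked exception. */
        abstract void cleanUp();

        /** Runs the cleanup and always deregisters the node afterwards. */
        final void terminate() {
            try {
                cleanUp();
            } finally {
                NODES.remove(name);
            }
        }
    }

    public static void main(String[] args) {
        CloudAgent agent = new CloudAgent("vm-1") {
            @Override
            void cleanUp() {
                System.out.println("deleting " + name);
            }
        };
        agent.terminate();
        System.out.println("registered nodes: " + NODES);
    }
}
```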

*/
public class DiscardIdleInstancesOnShutdownIT {
private static final Logger LOGGER = Logger.getLogger(DiscardIdleInstancesOnShutdownIT.class.getName());
private static final Map<String, String> GOOGLE_LABELS = getLabel(DiscardIdleInstancesOnShutdownIT.class);
Member

Maybe create a distinct label per test run? I can imagine an aborted test causing failures in a subsequent run. Consider doing something like the kubernetes plugin does where it starts every IT by cleaning out all pods matching a test label selector.
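
One way to get a distinct label per test run, sketched with the JDK only. The label keys `jenkins-it-class` and `jenkins-it-run` and the helper `forRun` are hypothetical and not part of this plugin's test utilities; the values are kept to lowercase letters, digits, and dashes so they should fit GCE label constraints.

```java
import java.util.Locale;
import java.util.Map;
import java.util.UUID;

class TestRunLabels {
    /** Builds labels that identify both the test class and this particular run. */
    static Map<String, String> forRun(Class<?> testClass) {
        // A random UUID renders as 36 characters of lowercase hex digits and dashes.
        String runId = UUID.randomUUID().toString();
        return Map.of(
                "jenkins-it-class", testClass.getSimpleName().toLowerCase(Locale.ROOT),
                "jenkins-it-run", runId);
    }

    public static void main(String[] args) {
        System.out.println(forRun(TestRunLabels.class));
    }
}
```

The class label stays stable across runs, so a cleanup step at the start of a test could delete leftovers matching it, while the run label keeps concurrent or aborted runs from being confused with the current one.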

Contributor Author

Noted in #532; will check on it in a separate PR.

assertEquals("Jenkins should have no GCE agent nodes after restart", 0, gceNodeCount);
// The VM should be gone from GCP
var cloud = (ComputeEngineCloud) j.jenkins.clouds.getByName("gce-integration");
await("VM should be deleted from GCP after shutdown")
Member

This could be asserted between stopJenkins and startJenkins right?

Contributor Author

The required assertions are:

  1. assert VM is created in GCP
  2. do shutdown
  3. assert the VM is deleted in GCP
  4. do start

But the GCP client object is hard to create outside of the Jenkins runtime (credentials and setup), so the test just asserts that the VM is gone after Jenkins has been restarted.
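
Because the deletion is requested asynchronously, a check made after restart generally has to poll rather than assert once. The test uses `await(...)` for this; the following is only a plain-JDK sketch of the same idea, with `pollUntil` as a hypothetical helper and a counter standing in for the "VM no longer exists" lookup.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

class PollUntil {
    /**
     * Re-evaluates the condition at the given interval until it holds or the timeout elapses.
     * Returns true if the condition was observed to hold, false otherwise.
     */
    static boolean pollUntil(BooleanSupplier condition, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a cloud lookup: the "VM" is reported gone on the third check.
        AtomicInteger checks = new AtomicInteger();
        boolean gone = pollUntil(() -> checks.incrementAndGet() >= 3,
                Duration.ofSeconds(5), Duration.ofMillis(10));
        System.out.println("VM gone: " + gone);
    }
}
```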

Member

@jglick jglick Mar 26, 2026

the GCP client object is hard to create outside of the Jenkins runtime

Ah OK.

Co-authored-by: Jesse Glick <jglick@cloudbees.com>
@gbhat618 gbhat618 marked this pull request as draft March 25, 2026 12:41
Contributor Author

(marking draft while addressing comments)

Contributor Author

gbhat618 commented Mar 25, 2026

Need to bump the plugin parent version to get RealJenkinsRule#run; currently using https://github.com/jenkinsci/plugin-pom/releases/tag/plugin-5.7. It is a good time to upgrade as well.

Contributor Author

wait for #531

@gbhat618 gbhat618 marked this pull request as ready for review March 26, 2026 07:28
Contributor Author

gbhat618 commented Mar 26, 2026

/label enhancement

Member

jglick commented Mar 26, 2026

@gbhat618 not sure what #530 (comment) was. Maybe you are thinking of a workflow defined in jenkinsci/jenkins?

Contributor Author

no, it was supposed to be /label enhancement (my bad)

@raul-arabaolaza raul-arabaolaza added the enhancement label Mar 27, 2026
@raul-arabaolaza raul-arabaolaza merged commit 43c3abd into jenkinsci:develop Mar 27, 2026
17 checks passed
Labels

enhancement New feature or request

3 participants