Option to delete idle instances during controller shutdown#530
Conversation
| .map(DiscardIdleInstancesTerminator::terminateInstanceAsync) | ||
| .toList(); | ||
| /* Wait for all terminations to avoid classloader unload while tasks run (would cause NoClassDefFoundError and leave VMs running). | ||
| `ComputeEngineInstance._terminate` calls GCP async APIs, so should return quickly; 10s timeout is sufficient. */ |
There was a problem hiding this comment.
the testing logs are showing for 1 instance it is completing in < 10ms. and entire method discardIdleInstances is completing in ~1s.
| private GoogleKeyCredential sshKeyCredential; | ||
| private Map<String, String> googleLabels; | ||
| private Integer numExecutors; | ||
| private boolean terminateIdleDuringShutdown; |
There was a problem hiding this comment.
| * Uses {@link RealJenkinsRule} so that stopping Jenkins triggers the {@link DiscardIdleInstancesTerminator#discardIdleInstances()} | ||
| * via the {@code @Terminator} lifecycle hook. |
There was a problem hiding this comment.
JenkinsSessionRule would probably suffice BTW, but the overhead of RealJenkinsRule is I guess minimal in this context.
There was a problem hiding this comment.
noted, noticed RealJenkinsRule recommendation from https://javadoc.jenkins.io/component/jenkins-test-harness/org/jvnet/hudson/test/JenkinsRule.html#restart()
| () -> { | ||
| LOGGER.info(() -> "Discarding idle GCE instance " + node.getNodeName() + " during shutdown"); | ||
| try { | ||
| node.terminate(); |
There was a problem hiding this comment.
This also removes the Slave, right?
There was a problem hiding this comment.
it is calling https://github.com/jenkinsci/jenkins/blob/b513733f8be01a0a0cb6dd9472214764ef667b32/core/src/main/java/hudson/slaves/AbstractCloudSlave.java#L79-L96 where the _terminate is overridden by this plugin. Then the Jenkins.get().removeNode(this); in the finally follows - which is I think removes the Slave correctly. In the tests there was no entry leftover after the restart of jenkins. (checked in both automated and manual test)
that should be correct right ?
| */ | ||
| public class DiscardIdleInstancesOnShutdownIT { | ||
| private static final Logger LOGGER = Logger.getLogger(DiscardIdleInstancesOnShutdownIT.class.getName()); | ||
| private static final Map<String, String> GOOGLE_LABELS = getLabel(DiscardIdleInstancesOnShutdownIT.class); |
There was a problem hiding this comment.
Maybe create a distinct label per test run? I can imagine an aborted test causing failures in a subsequent run. Consider doing something like the kubernetes plugin does where it starts every IT by cleaning out all pods matching a test label selector.
There was a problem hiding this comment.
noted in #532 will check on it in separate PR.
| assertEquals("Jenkins should have no GCE agent nodes after restart", 0, gceNodeCount); | ||
| // The VM should be gone from GCP | ||
| var cloud = (ComputeEngineCloud) j.jenkins.clouds.getByName("gce-integration"); | ||
| await("VM should be deleted from GCP after shutdown") |
There was a problem hiding this comment.
This could be asserted between stopJenkins and startJenkins right?
There was a problem hiding this comment.
Required assertion are,
- assert VM is created in GCP
- do shutdown
- assert the VM is deleted in GCP
- do start
But the GCP client object is hard to create outside of the Jenkins runtime (the credentials and setup), so just asserting the VM is gone after Jenkins has been restarted.
There was a problem hiding this comment.
the GCP client object is hard to create outside of the Jenkins runtime
Ah OK.
Co-authored-by: Jesse Glick <jglick@cloudbees.com>
|
(marking draft while addressing comments) |
|
need to bump the plugin parent version to get |
|
wait for #531 |
|
/label enhancement |
|
@gbhat618 not sure what #530 (comment) was. Maybe you are thinking of a workflow defined in |
|
no, it was supposed to be |
Adds feature to delete idle agents during controller shutdown; this is useful when we want to stop a controller for an extended period of time, but don't want to delete each idle instance one by one to save cloud cost.
The use case originated from CloudBees High Availability feature, where when a replica with idle instances goes down, those GCE agents are not transferred to another replica (due to Jenkins cloud api design this is a complex problem), hence we are proposing a feature to delete idle agents based on opt-in feature.
Similar feature was proposed to ec2-plugin in jenkinsci/ec2-plugin#1125.
Testing done
Test 1
Integration test written and verified it is working correctly.
Test 2
Tested with CasC based controller with configuration,
then stopping the controller removed the agents, including GCE VMs.
logs,
when the flag was not enabled,
no deletion happened.
Test 3
Interactive UI testing with checkbox marked and unmarked; accordingly the agent deletion is working.
With termination enabled - instance is gone after restart (both in Jenkins and GCE)
Without the checkbox marked, not terminating idle instance, though there is one.
Note: Flag value toggle will only take effect for agents provisioned after the change. Not for the already existing agents.
Submitter checklist