Discard idle instances during shutdown#1125
Conversation
(cherry picked from commit 9269c76)
|
@jglick, am I correct in understanding that, with this PR, if I update my plugins and restart Jenkins, all my EC2 Agents will be terminated? |
Not exactly. Only idle agents; and only up to the number of configured spare instances, if that option is even used. So at most as many would have been launched when the controller starts. If you think even that is going too far, that some people may be configuring spare instances but prefer for the unused spares to be left running when the controller terminates, then yes of course it could be made into (sigh) yet another checkbox in the agent template. |
|
(BTW I have not yet looked into the options for covering this and my other PRs via automated tests. I do not trust the mocks much, but I guess this plugin is set up neither for LocalStack container testing nor for live testing against AWS with a bring-your-own credentials model.) |
|
@jglick Thanks for the response. I agree that we seem to have too many checkboxes for configuration in this area, and their meaning, and the relationship between them, is somewhat opaque. I guess that's the result of the rules for starting / stopping / terminating instances having evolved over time.
Typically when we restart the controller, most if not all Agent EC2 instances will be in a Stopped state (i.e. not running, but not terminated). What we would like is for those instances to remain as they are, and be associated with and available for use by the (restarted) Controller when required for a job.
I can provide some assistance with basic testing in our AWS environment, if that helps. I can drop a plugin test release into a Jenkins that's idling, run a few jobs, restart controller, watch instances start / stop / terminate, etc. |
I am definitely not planning to implement that, but if that is something that currently works, then it sounds like the behavior in this PR must not be enabled for users with this requirement. So, I will make it an option.
Sorry, I was unclear—I have already tested this in an EKS cluster with IAM-based access to EC2. Of course additional sanity checks by others are welcome. I was just uncertain what automated test coverage is expected in general for changes in this plugin. |
(cherry picked from commit e8c5a47)
|
I am starting to think it would be more effective for the option to just terminate all idle agents, rather than being tied to the spare count, due to a longstanding bug in the provisioning logic under load (a topic for another PR). I will experiment with this alternative approach. |
(cherry picked from commit a8eb352)
|
Indeed it does seem more helpful to just (at an option) terminate & remove any idle agents during shutdown, regardless of why they were launched to begin with. This behaves better when the cloud accidentally overprovisions agents in response to a rapid enqueueing of builds: if some of the agents are left unused then they are simply removed the next time you restart. |
|
@mwebber I think the revised logic should address your concern? |
|
Thanks. @res0nance / @fcojfernandez? |
| * Terminates the instance in EC2. | ||
| */ | ||
| public abstract void terminate(); | ||
| public abstract Future<?> terminate(); |
There was a problem hiding this comment.
I've checked potential external impacts and it seems that only https://github.com/jenkinsci/ec2-cloud-axis-plugin/blob/4d1a8263cfe04320b311fe26ae613d1805073aa2/src/main/java/hudson/plugins/ec2/Utils.java#L14 would be using this method. Seems safe
Extracted from #1121.
While testing the spare instances feature I noticed that while the controller launches a pool of spare instances when it starts, it does not terminate them when it shuts down. If the controller is immediately restarted, this is not a big deal since they will be connected again, but it can leave a mess in a CloudBees CI HA controller where each replica has its own set of spare instances. #1115 will clean them up but it could take hours. With this patch, the spares are neatly cleaned up.
I needed to introduce a
Futureinto the internal API to make sure that the termination request is processed before the web app is unloaded and class loading from plugins becomes impossible; otherwise the termination is unreliable and mostly does not work at all. This does clash a bit with #1015.