Generic resource manager#1728
Conversation
| export Resources: List String | ||
| # Resource requirements that must be acquired before the job runs. | ||
| # These are consumable units managed by the Wake runtime (lisenses, etc). | ||
| export ResourceReqs: List ResourceRequirement |
There was a problem hiding this comment.
Will need a better name for this to differentiate between Resources above
There was a problem hiding this comment.
I know I've said this a couple times, but are you finding a benefit to putting it here rather than in Usage? If so, I'll stop beating that drum. If not, that one's already used for limiting job launching, and would allow runners to report arbitrary usage metrics (for example, I could see a heuristic-based runner reporting "number of iterations to find an allowed parameter set").
Also, you should probably have Map String a rather than List ThingContainingStringKey to avoid having to deal with some resource getting listed multiple times with different values.
| !jobtable->imp->resource_manager.can_acquire(task.resources)) { | ||
|
|
||
| // TODO move this check potentially in prim_job_launch, have it fail the job | ||
| // instead of exiting the process. |
There was a problem hiding this comment.
Reminder for this todo
There was a problem hiding this comment.
The way we do this for memory/CPU predictions that exceed the system limits is throttle everything down so the overlarge job is the only one that's running, but we do still allow it to run -- if the prediction was accurate and was a hard limit, we let the failure happen naturally in the course of the job, rather than trying to be preemptive about it and risk killing something that could have dynamically scaled down. These custom limits are admittedly a bit more likely to be hard limits than the memory/CPU, but for ease of mental modelling I think we should keep the same behaviour.
| }; | ||
|
|
||
| // Policy for resource limits configuration | ||
| // Configured in .wakeroot as: |
There was a problem hiding this comment.
Maybe it might be better to have this in its own file, outside of .wakeroot?
There was a problem hiding this comment.
For better or worse, .wakeroot has become the workspace configuration file. I really don't think we need to introduce yet another one.
| export Resources: List String | ||
| # Resource requirements that must be acquired before the job runs. | ||
| # These are consumable units managed by the Wake runtime (lisenses, etc). | ||
| export ResourceReqs: List ResourceRequirement |
There was a problem hiding this comment.
I know I've said this a couple times, but are you finding a benefit to putting it here rather than in Usage? If so, I'll stop beating that drum. If not, that one's already used for limiting job launching, and would allow runners to report arbitrary usage metrics (for example, I could see a heuristic-based runner reporting "number of iterations to find an allowed parameter set").
Also, you should probably have Map String a rather than List ThingContainingStringKey to avoid having to deal with some resource getting listed multiple times with different values.
| def serializeResourcesToJson (resources: List ResourceRequirement): String = | ||
| def serializeOne (ResourceRequirement name count) = | ||
| "\{\"name\": \"{name}\", \"count\": {str count}\}" |
There was a problem hiding this comment.
Probably better returning a JValue that gets piped into formatJSON at point of use; we'd be able to use this same serializer with prettyJSON for outputting to human-readable files.
EDIT: Ah, you already have resourceToJson -- just use that for both cases.
| }; | ||
|
|
||
| // Policy for resource limits configuration | ||
| // Configured in .wakeroot as: |
There was a problem hiding this comment.
For better or worse, .wakeroot has become the workspace configuration file. I really don't think we need to introduce yet another one.
| !jobtable->imp->resource_manager.can_acquire(task.resources)) { | ||
|
|
||
| // TODO move this check potentially in prim_job_launch, have it fail the job | ||
| // instead of exiting the process. |
There was a problem hiding this comment.
The way we do this for memory/CPU predictions that exceed the system limits is throttle everything down so the overlarge job is the only one that's running, but we do still allow it to run -- if the prediction was accurate and was a hard limit, we let the failure happen naturally in the course of the job, rather than trying to be preemptive about it and risk killing something that could have dynamically scaled down. These custom limits are admittedly a bit more likely to be hard limits than the memory/CPU, but for ease of mental modelling I think we should keep the same behaviour.
| for (const auto& req : task.resources) { | ||
| int64_t limit = jobtable->imp->resource_manager.limit(req.name); | ||
| if (limit >= 0 && req.count > limit) { |
There was a problem hiding this comment.
I might be misreading that function, but why aren't we using can_acquire here?
| return true; | ||
| } | ||
|
|
||
| void ResourceManager::acquire(const std::vector<ResourceRequirement>& requirements) { |
There was a problem hiding this comment.
I know we're still single-threaded, but if we do wind up allowing multiple threads over the same pool (more likely in the form of multi-wake communication rather than as multithreading a single instance's job manager) then having can_acquire and acquire be separate opens us up to racing more than something atomic would. Not sure how folding it into a single function would wind up looking for the rest of the code, and it's likely nothing we'd have to worry about in the near future, but might be something to look into.
Resource management system to control how many resources jobs are allowed to utilize. Add resource limits in
.wakerootlike so:And configure how many resources a job will use in the plan:
requireResources "test_resource" 1