Skip to content

Generic resource manager#1728

Open
Abrar Quazi (AbrarQuazi) wants to merge 1 commit into
masterfrom
generic-resource-manager
Open

Generic resource manager#1728
Abrar Quazi (AbrarQuazi) wants to merge 1 commit into
masterfrom
generic-resource-manager

Conversation

@AbrarQuazi
Copy link
Copy Markdown
Contributor

Resource management system to control how many resources jobs are allowed to utilize. Add resource limits in .wakeroot like so:

{
  "resource_limits": {
    "test_resource": 4
  }
}

And configure how many resources a job will use in the plan: requireResources "test_resource" 1

export Resources: List String
# Resource requirements that must be acquired before the job runs.
# These are consumable units managed by the Wake runtime (lisenses, etc).
export ResourceReqs: List ResourceRequirement
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need a better name for this to differentiate between Resources above

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I've said this a couple times, but are you finding a benefit to putting it here rather than in Usage? If so, I'll stop beating that drum. If not, that one's already used for limiting job launching, and would allow runners to report arbitrary usage metrics (for example, I could see a heuristic-based runner reporting "number of iterations to find an allowed parameter set").

Also, you should probably have Map String a rather than List ThingContainingStringKey to avoid having to deal with some resource getting listed multiple times with different values.

Comment thread src/runtime/job.cpp
!jobtable->imp->resource_manager.can_acquire(task.resources)) {

// TODO move this check potentially in prim_job_launch, have it fail the job
// instead of exiting the process.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder for this todo

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way we do this for memory/CPU predictions that exceed the system limits is throttle everything down so the overlarge job is the only one that's running, but we do still allow it to run -- if the prediction was accurate and was a hard limit, we let the failure happen naturally in the course of the job, rather than trying to be preemptive about it and risk killing something that could have dynamically scaled down. These custom limits are admittedly a bit more likely to be hard limits than the memory/CPU, but for ease of mental modelling I think we should keep the same behaviour.

Comment thread src/runtime/config.h
};

// Policy for resource limits configuration
// Configured in .wakeroot as:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it might be better to have this in its own file, outside of .wakeroot?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or worse, .wakeroot has become the workspace configuration file. I really don't think we need to introduce yet another one.

export Resources: List String
# Resource requirements that must be acquired before the job runs.
# These are consumable units managed by the Wake runtime (lisenses, etc).
export ResourceReqs: List ResourceRequirement
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I've said this a couple times, but are you finding a benefit to putting it here rather than in Usage? If so, I'll stop beating that drum. If not, that one's already used for limiting job launching, and would allow runners to report arbitrary usage metrics (for example, I could see a heuristic-based runner reporting "number of iterations to find an allowed parameter set").

Also, you should probably have Map String a rather than List ThingContainingStringKey to avoid having to deal with some resource getting listed multiple times with different values.

Comment on lines +80 to +82
def serializeResourcesToJson (resources: List ResourceRequirement): String =
def serializeOne (ResourceRequirement name count) =
"\{\"name\": \"{name}\", \"count\": {str count}\}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better returning a JValue that gets piped into formatJSON at point of use; we'd be able to use this same serializer with prettyJSON for outputting to human-readable files.

EDIT: Ah, you already have resourceToJson -- just use that for both cases.

Comment thread src/runtime/config.h
};

// Policy for resource limits configuration
// Configured in .wakeroot as:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better or worse, .wakeroot has become the workspace configuration file. I really don't think we need to introduce yet another one.

Comment thread src/runtime/job.cpp
!jobtable->imp->resource_manager.can_acquire(task.resources)) {

// TODO move this check potentially in prim_job_launch, have it fail the job
// instead of exiting the process.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way we do this for memory/CPU predictions that exceed the system limits is throttle everything down so the overlarge job is the only one that's running, but we do still allow it to run -- if the prediction was accurate and was a hard limit, we let the failure happen naturally in the course of the job, rather than trying to be preemptive about it and risk killing something that could have dynamically scaled down. These custom limits are admittedly a bit more likely to be hard limits than the memory/CPU, but for ease of mental modelling I think we should keep the same behaviour.

Comment thread src/runtime/job.cpp
Comment on lines +869 to +871
for (const auto& req : task.resources) {
int64_t limit = jobtable->imp->resource_manager.limit(req.name);
if (limit >= 0 && req.count > limit) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misreading that function, but why aren't we using can_acquire here?

return true;
}

void ResourceManager::acquire(const std::vector<ResourceRequirement>& requirements) {
Copy link
Copy Markdown
Collaborator

@ag-eitilt Sam May (ag-eitilt) Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we're still single-threaded, but if we do wind up allowing multiple threads over the same pool (more likely in the form of multi-wake communication rather than as multithreading a single instance's job manager) then having can_acquire and acquire be separate opens us up to racing more than something atomic would. Not sure how folding it into a single function would wind up looking for the rest of the code, and it's likely nothing we'd have to worry about in the near future, but might be something to look into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants