
max_inflight_requests functionality for whole ensemble (rather than per-request) #8597

@darin-matroid

Description


Is your feature request related to a problem? Please describe.
Consider an ensemble: input bytes -> (model A = decode JPEG) -> (model B = infer on raw RGB) -> outputs

A is much faster than B. A's output (raw RGB) is 10-100x larger than the input bytes.

Currently, if this ensemble is heavily loaded, requests queue up at model B, where each queued request holds the large decoded RGB tensor and uses far more memory than it would if queued before the ensemble in raw byte form.

Describe the solution you'd like
I want to be able to set a maximum number of in-flight requests for an ensemble and have additional requests queue up before they reach model A (i.e. the ensemble blocks new requests until there is space).

I saw the new ensemble max_inflight_requests functionality and was excited, since it describes a very similar use case, but was then disappointed to realize it only applies to new requests generated by a single request to a decoupled/streaming model. As far as I can tell, it is not applicable here.

Describe alternatives you've considered
I have a queue limit on model B, which at least prevents unbounded memory growth, but as described that is a very inefficient place to hold a large queue. I could support a ~10x larger queue if I could queue before the ensemble.
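For context, the queue limit described above can be set in model B's config.pbtxt via the dynamic batcher's queue policy; a sketch (the specific numbers are illustrative, not a recommendation):

```
dynamic_batching {
  default_queue_policy {
    max_queue_size: 128      # requests beyond this are rejected
    timeout_action: REJECT
    default_timeout_microseconds: 5000000
  }
}
```

Each entry in this queue is a decoded RGB tensor, which is why a bound here is so much more expensive than an equivalent bound in front of the ensemble.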

For a single model, I think I could manipulate the rate limiter to manage this, but I have a dynamic set of models that come and go, and I don't believe there's a way to dynamically adjust the set of rate-limiter resources in a live tritonserver instance.
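The rate-limiter approach mentioned above would look roughly like this in a model's config.pbtxt, with execution gated on a shared resource budget (the resource name here is hypothetical):

```
instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "decode_budget"
          count: 4
        }
      ]
      priority: 1
    }
  }
]
```

The limitation is that the pool of rate-limiter resources is fixed when the server starts, so this does not adapt as models are loaded and unloaded.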
