
max_inflight_requests functionality for whole ensemble (rather than per-request) #8597

@darin-matroid

Description


Is your feature request related to a problem? Please describe.
Consider an ensemble: input bytes -> (model A = decode JPEG) -> (model B = infer on raw RGB) -> outputs

A is much faster than B. A's output (raw RGB) is 10-100x larger than the input bytes.

Currently, if this ensemble is heavily loaded, requests queue up at model B, where each queued request holds the large decoded RGB tensor and uses far more memory than it would if queued before the ensemble in raw byte form.

Describe the solution you'd like
I want to be able to set a maximum number of in-flight requests for an ensemble and have additional requests queue up before they reach model A (i.e. the ensemble blocks new requests until there is space).

I saw the new ensemble max_inflight_requests functionality and was excited, since it describes a very similar use case, but was then disappointed to realize it only applies to new requests generated by a single request to a decoupled/streaming model. As far as I can tell, it is not applicable here.

Describe alternatives you've considered
I have a queue limit on model B, which at least prevents unbounded memory growth, but as described that is a very inefficient place to hold a large queue. I could support a ~10x larger queue if I could queue before the ensemble.
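For context, the queue limit described above can be set in model B's config.pbtxt via the dynamic batcher's queue policy; a sketch (the specific numbers are illustrative, not a recommendation):

```
dynamic_batching {
  default_queue_policy {
    max_queue_size: 128      # requests beyond this are rejected
    timeout_action: REJECT
    default_timeout_microseconds: 5000000
  }
}
```

Each entry in this queue is a decoded RGB tensor, which is why a bound here is so much more expensive than an equivalent bound in front of the ensemble.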

For a single model, I think I could manipulate the rate limiter to manage this, but I have a dynamic set of models that come and go, and I don't believe there's a way to dynamically adjust the set of rate-limiter resources in a live tritonserver instance.
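The rate-limiter approach mentioned above would look roughly like this in a model's config.pbtxt, with execution gated on a shared resource budget (the resource name here is hypothetical):

```
instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "decode_budget"
          count: 4
        }
      ]
      priority: 1
    }
  }
]
```

The limitation is that the pool of rate-limiter resources is fixed when the server starts, so this does not adapt as models are loaded and unloaded.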
