
feat: OpenAI-compatible API Endpoints for Embedding Models (vLLM)#8483

Merged
yinggeh merged 23 commits into main from yinggeh/tri-49-request-for-openai-compatible-api-endpoints-for-triton
Nov 6, 2025

Conversation

@yinggeh (Contributor) commented Oct 30, 2025

What does the PR do?

  • Enable /v1/embeddings inference request for OpenAI API frontend in vLLM container
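
For context, requests to the new endpoint follow the OpenAI embeddings API schema. A minimal sketch of a request body, assuming a placeholder model name and server address (neither is specified in this PR):

```python
import json

# Placeholder endpoint; the actual host/port depend on how the OpenAI
# frontend is launched.
url = "http://localhost:9000/v1/embeddings"

# Request body following the OpenAI embeddings API schema.
payload = {
    "model": "my_embedding_model",  # placeholder model name
    "input": ["Hello world", "Triton serves embeddings"],
}

body = json.dumps(payload)
print(body)
```

With a server running, this body could be POSTed (e.g. `requests.post(url, json=payload)`); the response follows OpenAI's list-of-embeddings format.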

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the label to the GitHub PR.

  • feat

Related PRs:

triton-inference-server/vllm_backend#104

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@yinggeh yinggeh self-assigned this Oct 30, 2025
@yinggeh yinggeh added the Enhancement New feature or request label Oct 30, 2025
whoisj
whoisj previously approved these changes Oct 30, 2025
…to yinggeh/tri-49-request-for-openai-compatible-api-endpoints-for-triton
@yinggeh yinggeh changed the base branch from main to r25.10 October 30, 2025 22:59
@yinggeh (Contributor, Author) commented Oct 30, 2025

Rebase to r25.10 to run pipeline.

… yinggeh/tri-49-request-for-openai-compatible-api-endpoints-for-triton
@yinggeh yinggeh changed the base branch from r25.10 to main November 3, 2025 18:12
@yinggeh yinggeh changed the base branch from main to r25.10 November 3, 2025 18:15
@yinggeh yinggeh changed the base branch from r25.10 to main November 3, 2025 18:15
@yinggeh yinggeh requested review from pskiran1 and whoisj November 4, 2025 17:09
@pskiran1 (Member) previously approved these changes Nov 5, 2025 and commented:

LGTM! Thank you for adding this feature.

@whoisj (Contributor) previously approved these changes Nov 5, 2025 and commented:

overall, this LGTM.

left one suggestion, but I approve of these changes as they are.

backend = self.backend

# Request conversion from OpenAI format to backend-specific format
if backend == "vllm":
@whoisj (Contributor) commented on the diff:

wouldn't this be safer as below?

if backend == 'trtllm':
  # do something
elif backend == 'vllm':
  # do something else
else:
  raise ValueError(f'Unknown backend "{backend}" provided.')

@yinggeh (Contributor, Author) replied:

    # Explicitly handle ensembles to avoid any runtime validation errors
    if not backend and model.config()["platform"] == "ensemble":
        backend = "ensemble"
    print(f"Found model: {name=}, {backend=}")
    lora_names = None
    if self.backend == "vllm" or backend == "vllm":
        lora_names = _get_vllm_lora_names(
            self.server.options.model_repository, name, model.version
        )
    metadata = TritonModelMetadata(
        name=name,
        backend=backend,
        model=model,
        tokenizer=self.tokenizer,
        lora_names=lora_names,
        create_time=self.create_time,
        inference_request_converter=self._determine_request_converter(
            backend, RequestKind.GENERATION
        ),
        embedding_request_converter=self._determine_request_converter(
            backend, RequestKind.EMBEDDING
        ),
    )

backend can be ensemble.

@whoisj (Contributor) replied:

makes sense.

when backend == "ensemble" then we hit this code:

        if request_type == RequestKind.GENERATION:
            return _create_trtllm_generate_request
        else:
            return _create_trtllm_embedding_request

is that desirable?

also, adding the switch-like statement future-proofs the function.

@yinggeh (Contributor, Author) replied:

    # Use TRT-LLM format as default for everything else. This could be
    # an ensemble, a python or BLS model, a TRT-LLM backend model, etc.

@yinggeh yinggeh dismissed stale reviews from whoisj and pskiran1 via c022c3e November 5, 2025 22:43
@yinggeh yinggeh requested review from pskiran1 and whoisj November 5, 2025 23:07
@yinggeh yinggeh changed the title from "feat: OpenAI-compatible API Endpoints for Embedding Models" to "feat: OpenAI-compatible API Endpoints for Embedding Models (vLLM)" Nov 6, 2025
@yinggeh yinggeh merged commit f4ae90c into main Nov 6, 2025
3 checks passed
@yinggeh yinggeh deleted the yinggeh/tri-49-request-for-openai-compatible-api-endpoints-for-triton branch November 6, 2025 00:25

Labels: Enhancement (New feature or request)

4 participants