Llama Stack

Quick Start | Documentation | OpenAI API Compatibility | Discord

Open-source agentic API server for building AI applications. OpenAI-compatible. Any model, any infrastructure.

Llama Stack is a drop-in replacement for the OpenAI API that you can run anywhere — your laptop, your datacenter, or the cloud. Use any OpenAI-compatible client or agentic framework. Swap between Llama, GPT, Gemini, Mistral, or any model without changing your application code.

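from openai import OpenAI

# Point the standard OpenAI client at a local Llama Stack server.
# The api_key value "fake" is a placeholder.
client = OpenAI(base_url="http://localhost:8321/v1", api_key="fake")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
# Print the assistant's reply text.
print(response.choices[0].message.content)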
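
To illustrate the claim that models can be swapped without touching application code, here is a minimal sketch. The helper name `ask_each` is hypothetical and not part of Llama Stack. It accepts any client object exposing the OpenAI-style `chat.completions.create` method, and the only thing that varies between calls is the model string.

```python
from typing import Any, Iterable


def ask_each(client: Any, models: Iterable[str], prompt: str) -> dict[str, str]:
    """Send the same prompt to several models and collect each reply.

    `client` is assumed to be an OpenAI-compatible client, for example
    OpenAI(base_url="http://localhost:8321/v1", api_key="fake").
    """
    replies: dict[str, str] = {}
    for model in models:
        # Only the model identifier changes; the call itself stays identical.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        replies[model] = response.choices[0].message.content
    return replies
```

With a server running, a call might look like `ask_each(client, ["llama-3.3-70b", "<another-model-id>"], "Hello")`, where the second identifier is whatever model your server actually exposes.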

What you get

Chat Completions & Embeddings — standard /v1/chat/completions, /v1/completions, and /v1/embeddings endpoints, compatible with any OpenAI client
Responses API — server-side agentic orchestration with tool calling, MCP server integration, and built-in file search (RAG) in a single API call (learn more)
Vector Stores & Files — /v1/vector_stores and /v1/files for managed document storage and search
Batches — /v1/batches for offline batch processing
Open Responses conformant — the Responses API implementation passes the Open Responses conformance test suite
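
As a rough sketch of what a Responses API call with built-in file search could look like on the wire, the snippet below builds (but does not send) a `POST /v1/responses` request using only the standard library. The payload shape follows the OpenAI Responses API; that Llama Stack accepts exactly these fields is an assumption here, and the vector store ID `vs_example`, the question text, and the helper name are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8321/v1"  # address used in the example at the top


def build_file_search_request(
    model: str, question: str, vector_store_id: str
) -> urllib.request.Request:
    """Build (without sending) a POST /v1/responses call that enables file search."""
    payload = {
        "model": model,
        "input": question,
        # Assumption: the tool entry follows the OpenAI Responses API shape.
        "tools": [{"type": "file_search", "vector_store_ids": [vector_store_id]}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer fake"},
        method="POST",
    )


# "vs_example" is a placeholder; a real ID would come from /v1/vector_stores.
request = build_file_search_request(
    "llama-3.3-70b", "What does the handbook say about leave?", "vs_example"
)
print(request.get_method(), request.full_url)
print(json.loads(request.data)["tools"])
```

Once a server is running and the vector store exists, passing `request` to `urllib.request.urlopen` would send it. An OpenAI client's `responses.create` call should produce an equivalent request.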

Use any model, use any infrastructure

Llama Stack has a pluggable provider architecture. Develop locally with Ollama, deploy to production with vLLM, or connect to a managed service — the API stays the same.

┌─────────────────────────────────────────────────────────────────────────┐
│                          Llama Stack Server                             │
│               (same API, same code, any environment)                    │
│                                                                         │
│  /v1/chat/completions  /v1/responses  /v1/vector_stores  /v1/files      │
│  /v1/embeddings        /v1/batches    /v1/models         /v1/connectors │
├───────────────────┬──────────────────┬──────────────────────────────────┤
│  Inference        │  Vector stores   │  Tools & connectors              │
│    Ollama         │    FAISS         │    MCP servers                   │
│    vLLM, TGI      │    Milvus        │    Brave, Tavily (web search)    │
│    AWS Bedrock    │    Qdrant        │    File search (built-in RAG)    │
│    Azure OpenAI   │    PGVector      │                                  │
│    Fireworks      │    ChromaDB      │  File storage & processing       │
│    Together       │    Weaviate      │    Local filesystem, S3          │
│    ...15+ more    │    Elasticsearch │    PDF, HTML (file processors)   │
│                   │    SQLite-vec    │                                  │
└───────────────────┴──────────────────┴──────────────────────────────────┘

See the provider documentation for the full list.
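
Because the API is the same in every environment, the client side only needs to know where the server lives. One way to keep that out of application code is to read it from the environment, as in this minimal sketch. The variable names `LLAMA_STACK_BASE_URL`, `LLAMA_STACK_API_KEY`, and `LLAMA_STACK_MODEL`, the helper name, and the example production URL are all hypothetical; they are not defined by Llama Stack.

```python
import os
from typing import Mapping

DEFAULT_BASE_URL = "http://localhost:8321/v1"  # local server from the first example


def client_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Resolve connection settings from environment variables.

    The variable names are illustrative assumptions. Which provider
    (Ollama, vLLM, a managed service) serves the model is decided by the
    server's configuration, not by this client code.
    """
    return {
        "base_url": env.get("LLAMA_STACK_BASE_URL", DEFAULT_BASE_URL).rstrip("/"),
        "api_key": env.get("LLAMA_STACK_API_KEY", "fake"),
        "model": env.get("LLAMA_STACK_MODEL", "llama-3.3-70b"),
    }


# Same code path, two environments: only the settings differ.
laptop = client_settings({})
production = client_settings(
    {"LLAMA_STACK_BASE_URL": "https://llama-stack.example.internal/v1/"}
)
print(laptop["base_url"])
print(production["base_url"])
```

The resulting values can be passed straight to an OpenAI client, for example `OpenAI(base_url=settings["base_url"], api_key=settings["api_key"])`.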

Get started

Install and run a Llama Stack server:

# One-line install
curl -LsSf https://github.com/llamastack/llama-stack/raw/main/scripts/install.sh | bash

# Or install via uv
uv pip install llama-stack

# Start the server (uses the starter distribution with Ollama)
llama stack run

Then connect with any OpenAI client — Python, TypeScript, curl, or any framework that speaks the OpenAI API.

See the Quick Start guide for detailed setup.
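
A quick way to confirm the server is reachable is to list its models. The sketch below uses only the Python standard library and assumes the server listens on `http://localhost:8321/v1` (the address from the example at the top) and that `/v1/models` returns an OpenAI-style list object with a `data` array. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Any

BASE_URL = "http://localhost:8321/v1"


def models_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a GET request for the /v1/models endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/models",
        headers={"Authorization": "Bearer fake"},  # placeholder key
        method="GET",
    )


def parse_model_ids(body: dict[str, Any]) -> list[str]:
    """Extract model IDs, assuming an OpenAI-style {"data": [{"id": ...}]} body."""
    return [entry["id"] for entry in body.get("data", [])]


def list_model_ids(base_url: str = BASE_URL) -> list[str]:
    """Call the server and return the IDs of the models it exposes."""
    with urllib.request.urlopen(models_request(base_url), timeout=10) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_model_ids()` after `llama stack run` should return the identifiers you can pass as `model` in later requests. A connection error usually means the server is not running or is listening on a different address.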

Resources

Documentation — full reference
OpenAI API Compatibility — endpoint coverage and provider matrix
Getting Started Notebook — text and vision inference walkthrough
Contributing — how to contribute

Client SDKs:

Language	SDK
Python	llama-stack-client-python
TypeScript	llama-stack-client-typescript

Community

We hold community calls every Thursday at 9:00 AM PST; see the Community Event on Discord for details.

Thanks to all our amazing contributors!
