70 changes: 67 additions & 3 deletions docs/en/llama_stack/install.mdx
@@ -34,7 +34,9 @@ After the operator is installed, deploy Llama Stack Server by creating a `LlamaS
> - **Inference URL**: `VLLM_URL` must point at a **vLLM OpenAI-compatible** HTTP base URL (for example an in-cluster vLLM or KServe InferenceService) that serves the target model.
> - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below).
> - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready.
> - **PostgreSQL storage**: The `starter` distribution in this release uses PostgreSQL for Llama Stack persistence. Configure `POSTGRES_*` environment variables for the server pod before deploying.
> - **PGVector (optional)**: To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.
> - **Milvus (optional)**: To use `vector_stores` with `provider_id="milvus-remote"`, provide `MILVUS_ENDPOINT` and, when authentication is enabled, `MILVUS_TOKEN`. Set `MILVUS_CONSISTENCY_LEVEL` to a valid Milvus consistency level such as `Strong`.
> - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If a mirror or proxy is needed, configure `HF_ENDPOINT`. For fully offline environments, pre-download the model files into the server PVC before running the first vector-store request.

```yaml
@@ -68,8 +70,27 @@ spec:
# key: token
# name: vllm-api-token

# Required: PostgreSQL-backed Llama Stack persistence for this starter
# distribution image.
- name: POSTGRES_HOST
value: "<postgresql-service>"
- name: POSTGRES_PORT
value: "5432"
- name: POSTGRES_DB
value: "<database-name>"
- name: POSTGRES_USER
value: "<database-username>"
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: <postgresql-credentials-secret>
key: password

# Optional: enable PGVector-backed vector stores.
# Omit the entire block below if you do not need vector store APIs.
# Omit the entire block below if you do not need PGVector vector stores.
# These settings configure the vector DB provider and are separate from
# the POSTGRES_* persistence settings above, although they may point to
# the same PostgreSQL instance when it has the pgvector extension.
# ACP-provided PostgreSQL already includes the pgvector extension.
# - name: ENABLE_PGVECTOR
# value: "true"
@@ -87,6 +108,23 @@ spec:
# name: <pgvector-credentials-secret>
# key: password

# Optional: enable remote Milvus-backed vector stores.
# Use provider_id="milvus-remote" from the client API.
# - name: MILVUS_ENDPOINT
# value: "http://<milvus-endpoint-host-and-port>"
# - name: MILVUS_TOKEN
# valueFrom:
# secretKeyRef:
# name: <milvus-credentials-secret>
# key: token
# - name: MILVUS_CONSISTENCY_LEVEL
# value: "Strong"

# Required for PGVector or Milvus vector stores that use local
# sentence-transformers embeddings.
# - name: ENABLE_SENTENCE_TRANSFORMERS
# value: "true"
#
# Optional: configure a Hugging Face mirror or proxy for the default
# embedding model download path.
# - name: HF_ENDPOINT
@@ -118,6 +156,18 @@ status:
serviceURL: http://demo-service.default.svc.cluster.local:8321
```

## Configure PostgreSQL Storage

The `starter` distribution image used by this release requires PostgreSQL for Llama Stack persistence. Configure these server environment variables in the `LlamaStackDistribution`:

- `POSTGRES_HOST`
- `POSTGRES_PORT`
- `POSTGRES_DB`
- `POSTGRES_USER`
- `POSTGRES_PASSWORD`

These settings are for Llama Stack server state. They are not the same as `PGVECTOR_*`, which only configures the optional PGVector vector-store provider. You may use the same PostgreSQL instance for both roles when it has the required database, credentials, and `pgvector` extension.
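
As a rough illustration of the distinction above, the following sketch checks a set of environment values for the five required `POSTGRES_*` names before a manifest is applied. The helper name `missing_postgres_settings` and the example values are hypothetical and are not part of Llama Stack or the operator; the only facts taken from this page are the variable names.

```python
from typing import Mapping

# The five persistence variables listed in this section.
REQUIRED_POSTGRES_VARS = (
    "POSTGRES_HOST",
    "POSTGRES_PORT",
    "POSTGRES_DB",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
)


def missing_postgres_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required POSTGRES_* names that are absent or blank in env."""
    return [name for name in REQUIRED_POSTGRES_VARS if not env.get(name, "").strip()]


# Hypothetical values for illustration only.
example_env = {
    "POSTGRES_HOST": "postgres.example.svc",
    "POSTGRES_PORT": "5432",
    "POSTGRES_DB": "llamastack",
    "POSTGRES_USER": "llamastack",
    # POSTGRES_PASSWORD is intentionally left out.
    "PGVECTOR_PASSWORD": "unused",  # PGVECTOR_* does not satisfy POSTGRES_*.
}

print(missing_postgres_settings(example_env))
```

In this example the `PGVECTOR_PASSWORD` entry does not count toward the persistence settings, which mirrors the separation between the two variable families.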

## Tool calling with vLLM on KServe

The following applies to the **vLLM predictor** on KServe, not to the `LlamaStackDistribution` manifest. For agent flows that use **tools** (client-side tools or MCP), the vLLM process must expose tool-call support. Add predictor container `args` as required by upstream vLLM, for example:
@@ -139,12 +189,26 @@ Recommended preparation:

1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password.
2. Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD`.
3. Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use.
3. Set `ENABLE_SENTENCE_TRANSFORMERS=true` and make sure the default embedding model files can be fetched on first use.
4. If the cluster uses a Hugging Face mirror or proxy, set `HF_ENDPOINT` accordingly.
5. If the cluster is fully offline, pre-download the embedding model files into the server PVC and enable offline cache-related environment variables.

After the distribution is ready, you can validate the setup with the PGVector section in the [Quickstart](./quickstart) notebook.

## Enable Milvus Vector Store

When `MILVUS_ENDPOINT` is set on the server, Llama Stack can create vector stores by using `provider_id="milvus-remote"` from the client API.

Recommended preparation:

1. Prepare a Milvus endpoint reachable from the Llama Stack Server pod. `MILVUS_ENDPOINT` must include the scheme, either `http://` or `https://`, and the port required by your Milvus service.
2. Expose the Milvus connection to the `LlamaStackDistribution` with `MILVUS_ENDPOINT`.
3. If Milvus authentication is enabled, set `MILVUS_TOKEN` from a Secret.
4. Set `MILVUS_CONSISTENCY_LEVEL` to a string value such as `Strong`; the Milvus provider requires this field.
5. Set `ENABLE_SENTENCE_TRANSFORMERS=true` and make sure the embedding model files can be fetched or are already present in the server PVC.

After the distribution is ready, validate the setup with the Milvus section in the [Quickstart](./quickstart) notebook. The client creates the vector store with `provider_id="milvus-remote"` and passes the selected embedding model id plus embedding dimension in `extra_body`.
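
The endpoint format described in step 1 can be checked before deployment. The sketch below is one possible check that uses only the Python standard library; the function name `check_milvus_endpoint` and the host names are hypothetical. It reads the requirement strictly and flags any endpoint without an explicit port, which may be stricter than your Milvus service needs.

```python
from urllib.parse import urlsplit


def check_milvus_endpoint(endpoint: str) -> list[str]:
    """Return a list of problems found in a MILVUS_ENDPOINT value (empty if none)."""
    problems: list[str] = []
    parts = urlsplit(endpoint)

    # The scheme must be spelled out as http:// or https://.
    if parts.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// scheme")

    if not parts.hostname:
        problems.append("missing host")

    # Accessing .port raises ValueError when the port is not a valid number.
    try:
        port = parts.port
    except ValueError:
        problems.append("port is not a valid number")
    else:
        if port is None:
            problems.append("missing explicit port")

    return problems


# Hypothetical endpoints for illustration only.
print(check_milvus_endpoint("http://milvus.example.svc:19530"))
print(check_milvus_endpoint("milvus.example.svc:19530"))
```

The check only inspects the string; it does not confirm that the endpoint is reachable from the Llama Stack Server pod.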

## Hugging Face Access For Embedding Models

Llama Stack uses a default embedding model for vector-store operations. On first use, the server downloads the model files from Hugging Face into its local cache.
@@ -179,4 +243,4 @@ Common deployment modes:
value: "1"
```

If the cache path is pre-populated correctly, the server can create PGVector-backed vector stores without downloading model artifacts at runtime.
If the cache path is pre-populated correctly, the server can create PGVector-backed or Milvus-backed vector stores without downloading model artifacts at runtime.
2 changes: 1 addition & 1 deletion docs/en/llama_stack/overview/features.mdx
@@ -26,5 +26,5 @@ weight: 20
## Integration

- **Python Client**: `llama-stack-client` for Python 3.12+ with full agent and model APIs
- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores when the server is configured with `ENABLE_PGVECTOR=true`
- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores with `provider_id="pgvector"` and Milvus-backed stores with `provider_id="milvus-remote"`
- **REST-Friendly**: Server exposes APIs for inference, agents, and tool runtime; can be wrapped in FastAPI or other web frameworks for production use
21 changes: 14 additions & 7 deletions docs/en/llama_stack/quickstart.mdx
@@ -9,9 +9,9 @@ This section provides a quickstart example for creating an AI Agent with Llama S
## Prerequisites

- Python 3.12 or higher (if not satisfied, refer to [FAQ: How to prepare Python 3.12 in Notebook](#how-to-prepare-python-312-in-notebook))
- Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** (see install notes)
- Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** and `POSTGRES_*` configured for server persistence (see install notes)
- Access to a Notebook environment (e.g., Jupyter Notebook, JupyterLab)
- Python environment with `llama-stack-client==0.6.0`, `fastmcp` (for the MCP section), and other notebook dependencies installed
- Python environment with `llama-stack-client==0.7.1`, `fastmcp` (for the MCP section), and other notebook dependencies installed

## Quickstart Example

@@ -25,19 +25,26 @@ The notebook demonstrates:

- **Two tool options:** client-side tools (`@client_tool`) and MCP tools (FastMCP + `toolgroups.register`)
- **Shared agent flow:** connect to Llama Stack Server, select a model, create an `Agent` with `tools=AGENT_TOOLS`, then run sessions and streaming turns
- **Optional PGVector flow:** upload a file, create a `pgvector`-backed vector store, and run a hybrid search query
- **Optional vector store flows:** upload a file, create a `pgvector`-backed or `milvus-remote`-backed vector store, and run a search query
- Streaming responses and event logging
- Optional FastAPI deployment of the `agent`

## PGVector Usage
## Vector Store Usage

The downloadable notebook includes an optional PGVector section. To run it, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.
The downloadable notebook includes optional PGVector and Milvus sections.

For PGVector, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.

For Milvus, start the server with `MILVUS_ENDPOINT`, optional `MILVUS_TOKEN`, and `MILVUS_CONSISTENCY_LEVEL`, then execute the Milvus cells in the notebook. Use `provider_id="milvus-remote"` in the client request.

For both vector-store examples, `client.models.list()` must include an embedding model, for example `sentence-transformers/nomic-ai/nomic-embed-text-v1.5`. If it only returns LLM models, restart the `LlamaStackDistribution` with `ENABLE_SENTENCE_TRANSFORMERS=true` and configure Hugging Face cache/download access as described in [Install Llama Stack](./install).
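
The check described above can be sketched as a small filter over model identifiers. This is a minimal sketch under two assumptions: the identifiers have already been extracted from `client.models.list()` into plain strings (the attribute that holds the identifier depends on the client version and is not shown here), and embedding models registered through sentence-transformers carry a `sentence-transformers/` prefix, as in the example identifier above. The helper name `embedding_model_ids` and the LLM identifier are hypothetical.

```python
# Assumed prefix, based on the example identifier in this section.
EMBEDDING_PREFIX = "sentence-transformers/"


def embedding_model_ids(model_ids: list[str]) -> list[str]:
    """Return the identifiers that look like sentence-transformers embedding models."""
    return [model_id for model_id in model_ids if model_id.startswith(EMBEDDING_PREFIX)]


# Hypothetical identifiers, standing in for values read from client.models.list().
listed = [
    "vllm-inference/my-llm",
    "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
]

found = embedding_model_ids(listed)
if not found:
    print("No embedding model listed; check ENABLE_SENTENCE_TRANSFORMERS on the server.")
else:
    print(found)
```

An empty result corresponds to the case where only LLM models are returned and the server needs to be restarted with `ENABLE_SENTENCE_TRANSFORMERS=true`.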

The notebook example covers:

- Uploading a file through `client.files.create(...)`
- Creating a vector store with `provider_id="pgvector"`
- Running a hybrid search with `client.vector_stores.search(...)` and `search_mode="hybrid"`
- Creating a vector store with `provider_id="pgvector"` or `provider_id="milvus-remote"`
- Passing `embedding_model` and `embedding_dimension` through `client.vector_stores.create(..., extra_body=...)`
- Running a search with `client.vector_stores.search(...)`; PGVector uses `search_mode="hybrid"` in `extra_body`
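
The create call in the list above can be prepared with a small helper that assembles its keyword arguments. This is a sketch, not the notebook's code: the helper name `vector_store_create_kwargs` and the store name are hypothetical, placing `provider_id` inside `extra_body` next to the embedding fields is an assumption, and the dimension value `768` is assumed for the example embedding model and should be replaced with the dimension of the model you actually select.

```python
# Provider identifiers named in this document.
SUPPORTED_PROVIDERS = ("pgvector", "milvus-remote")


def vector_store_create_kwargs(
    name: str,
    provider_id: str,
    embedding_model: str,
    embedding_dimension: int,
) -> dict:
    """Build keyword arguments for a vector store create call."""
    if provider_id not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider_id: {provider_id!r}")
    if embedding_dimension <= 0:
        raise ValueError("embedding_dimension must be a positive integer")
    return {
        "name": name,
        # Assumption: provider_id is passed in extra_body with the embedding fields.
        "extra_body": {
            "provider_id": provider_id,
            "embedding_model": embedding_model,
            "embedding_dimension": embedding_dimension,
        },
    }


kwargs = vector_store_create_kwargs(
    name="demo-store",  # hypothetical store name
    provider_id="milvus-remote",
    embedding_model="sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
    embedding_dimension=768,  # assumed dimension for this example model
)
print(kwargs["extra_body"])
```

With a connected client, the result could then be passed as `client.vector_stores.create(**kwargs)`; switching between the two backends only changes the `provider_id` argument.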

## FAQ
