alauda · davidwtf · May 14, 2026
diff --git a/docs/en/llama_stack/install.mdx b/docs/en/llama_stack/install.mdx
@@ -34,7 +34,9 @@ After the operator is installed, deploy Llama Stack Server by creating a `LlamaS
 > - **Inference URL**: `VLLM_URL` must point at a **vLLM OpenAI-compatible** HTTP base URL (for example an in-cluster vLLM or KServe InferenceService) that serves the target model.
 > - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below).
 > - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready.
+> - **PostgreSQL storage**: The `starter` distribution in this release requires PostgreSQL for Llama Stack persistence. Configure the `POSTGRES_*` environment variables for the server pod before deploying.
 > - **PGVector (optional)**: To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.
+> - **Milvus (optional)**: To use `vector_stores` with `provider_id="milvus-remote"`, provide `MILVUS_ENDPOINT` and, when authentication is enabled, `MILVUS_TOKEN`. Set `MILVUS_CONSISTENCY_LEVEL` to a valid Milvus consistency level such as `Strong`.
 > - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If a mirror or proxy is needed, configure `HF_ENDPOINT`. For fully offline environments, pre-download the model files into the server PVC before running the first vector-store request.
 
 ```yaml
@@ -68,8 +70,27 @@ spec:
         #       key: token
         #       name: vllm-api-token
 
+        # Required: PostgreSQL-backed Llama Stack persistence for this starter
+        # distribution image.
+        - name: POSTGRES_HOST
+          value: "<postgresql-service>"
+        - name: POSTGRES_PORT
+          value: "5432"
+        - name: POSTGRES_DB
+          value: "<database-name>"
+        - name: POSTGRES_USER
+          value: "<database-username>"
+        - name: POSTGRES_PASSWORD
+          valueFrom:
+            secretKeyRef:
+              name: <postgresql-credentials-secret>
+              key: password
+
         # Optional: enable PGVector-backed vector stores.
-        # Omit the entire block below if you do not need vector store APIs.
+        # Omit the entire block below if you do not need PGVector vector stores.
+        # These settings configure the vector DB provider and are separate from
+        # the POSTGRES_* persistence settings above, although they may point to
+        # the same PostgreSQL instance when it has the pgvector extension.
         # ACP-provided PostgreSQL already includes the pgvector extension.
         # - name: ENABLE_PGVECTOR
         #   value: "true"
@@ -87,6 +108,23 @@ spec:
         #       name: <pgvector-credentials-secret>
         #       key: password
 
+        # Optional: enable remote Milvus-backed vector stores.
+        # Use provider_id="milvus-remote" from the client API.
+        # - name: MILVUS_ENDPOINT
+        #   value: "http://<milvus-endpoint-host-and-port>"
+        # - name: MILVUS_TOKEN
+        #   valueFrom:
+        #     secretKeyRef:
+        #       name: <milvus-credentials-secret>
+        #       key: token
+        # - name: MILVUS_CONSISTENCY_LEVEL
+        #   value: "Strong"
+
+        # Required for PGVector or Milvus vector stores that use local
+        # sentence-transformers embeddings.
+        # - name: ENABLE_SENTENCE_TRANSFORMERS
+        #   value: "true"
+
         # Optional: configure a Hugging Face mirror or proxy for the default
         # embedding model download path.
         # - name: HF_ENDPOINT
@@ -118,6 +156,18 @@ status:
   serviceURL: http://demo-service.default.svc.cluster.local:8321
 ```
 
+## Configure PostgreSQL Storage
+
+The `starter` distribution image used by this release requires PostgreSQL for Llama Stack persistence. Configure these server environment variables in the `LlamaStackDistribution`:
+
+- `POSTGRES_HOST`
+- `POSTGRES_PORT`
+- `POSTGRES_DB`
+- `POSTGRES_USER`
+- `POSTGRES_PASSWORD`
+
+These settings hold Llama Stack server state. They are separate from the `PGVECTOR_*` variables, which configure only the optional PGVector vector-store provider. You may use the same PostgreSQL instance for both roles when it has the required database, credentials, and `pgvector` extension.
+
 ## Tool calling with vLLM on KServe
 
 The following applies to the **vLLM predictor** on KServe, not to the `LlamaStackDistribution` manifest. For agent flows that use **tools** (client-side tools or MCP), the vLLM process must expose tool-call support. Add predictor container `args` as required by upstream vLLM, for example:
@@ -139,12 +189,26 @@ Recommended preparation:
 
 1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password.
 2. Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD`.
-3. Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use.
+3. Set `ENABLE_SENTENCE_TRANSFORMERS=true` and make sure the default embedding model files can be fetched on first use.
 4. If the cluster uses a Hugging Face mirror or proxy, set `HF_ENDPOINT` accordingly.
 5. If the cluster is fully offline, pre-download the embedding model files into the server PVC and enable offline cache-related environment variables.
 
 After the distribution is ready, you can validate the setup with the PGVector section in the [Quickstart](./quickstart) notebook.
 
+## Enable Milvus Vector Store
+
+When `MILVUS_ENDPOINT` is set on the server, clients can create Milvus-backed vector stores by passing `provider_id="milvus-remote"` in the client API request.
+
+Recommended preparation:
+
+1. Prepare a Milvus endpoint reachable from the Llama Stack Server pod. `MILVUS_ENDPOINT` must include the scheme, either `http://` or `https://`, and the port required by your Milvus service.
+2. Expose the Milvus connection to the `LlamaStackDistribution` with `MILVUS_ENDPOINT`.
+3. If Milvus authentication is enabled, set `MILVUS_TOKEN` from a Secret.
+4. Set `MILVUS_CONSISTENCY_LEVEL` to a string value such as `Strong`; the Milvus provider requires this field.
+5. Set `ENABLE_SENTENCE_TRANSFORMERS=true` and make sure the embedding model files can be fetched or are already present in the server PVC.
+
+After the distribution is ready, validate the setup with the Milvus section in the [Quickstart](./quickstart) notebook. The client creates the vector store with `provider_id="milvus-remote"` and passes the selected embedding model ID and embedding dimension in `extra_body`.
+
 ## Hugging Face Access For Embedding Models
 
 Llama Stack uses a default embedding model for vector-store operations. On first use, the server downloads the model files from Hugging Face into its local cache.
@@ -179,4 +243,4 @@ Common deployment modes:
      value: "1"
    ```
 
-If the cache path is pre-populated correctly, the server can create PGVector-backed vector stores without downloading model artifacts at runtime.
+If the cache path is pre-populated correctly, the server can create PGVector-backed or Milvus-backed vector stores without downloading model artifacts at runtime.
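The environment settings introduced in `install.mdx` can be sanity-checked before the manifest is applied. The following is a minimal sketch, not part of the operator or of Llama Stack. The helper name `check_server_env` and its plain `dict` input are hypothetical. The rules encode only what the install notes state: all five `POSTGRES_*` variables are set, and when `MILVUS_ENDPOINT` is present it carries an `http://` or `https://` scheme and a port, and is accompanied by `MILVUS_CONSISTENCY_LEVEL`.

```python
from urllib.parse import urlsplit

# Variables the install notes list as required for server persistence.
REQUIRED_POSTGRES_VARS = (
    "POSTGRES_HOST",
    "POSTGRES_PORT",
    "POSTGRES_DB",
    "POSTGRES_USER",
    "POSTGRES_PASSWORD",
)


def check_server_env(env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means no problem found.

    Hypothetical helper: it mirrors the rules written in the install notes
    and never contacts PostgreSQL or Milvus.
    """
    problems = []

    for name in REQUIRED_POSTGRES_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")

    endpoint = env.get("MILVUS_ENDPOINT")
    if endpoint:  # Milvus is optional, so validate only when it is configured.
        parts = urlsplit(endpoint)
        if parts.scheme not in ("http", "https"):
            problems.append("MILVUS_ENDPOINT must start with http:// or https://")
        else:
            try:
                port = parts.port  # raises ValueError for a malformed port
            except ValueError:
                port = None
            if port is None:
                problems.append("MILVUS_ENDPOINT must include a port")
        if not env.get("MILVUS_CONSISTENCY_LEVEL"):
            problems.append("MILVUS_CONSISTENCY_LEVEL is not set")

    return problems
```

A check like this catches only missing or malformed values. Whether the database or the Milvus service is actually reachable from the server pod still has to be verified in the cluster.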
diff --git a/docs/en/llama_stack/overview/features.mdx b/docs/en/llama_stack/overview/features.mdx
@@ -26,5 +26,5 @@ weight: 20
 ## Integration
 
 - **Python Client**: `llama-stack-client` for Python 3.12+ with full agent and model APIs
-- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores when the server is configured with `ENABLE_PGVECTOR=true`
+- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores with `provider_id="pgvector"` and Milvus-backed stores with `provider_id="milvus-remote"`
 - **REST-Friendly**: Server exposes APIs for inference, agents, and tool runtime; can be wrapped in FastAPI or other web frameworks for production use
diff --git a/docs/en/llama_stack/quickstart.mdx b/docs/en/llama_stack/quickstart.mdx
@@ -9,9 +9,9 @@ This section provides a quickstart example for creating an AI Agent with Llama S
 ## Prerequisites
 
 - Python 3.12 or higher (if not satisfied, refer to [FAQ: How to prepare Python 3.12 in Notebook](#how-to-prepare-python-312-in-notebook))
-- Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** (see install notes)
+- Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** and `POSTGRES_*` configured for server persistence (see install notes)
 - Access to a Notebook environment (e.g., Jupyter Notebook, JupyterLab)
-- Python environment with `llama-stack-client==0.6.0`, `fastmcp` (for the MCP section), and other notebook dependencies installed
+- Python environment with `llama-stack-client==0.7.1`, `fastmcp` (for the MCP section), and other notebook dependencies installed
 
 ## Quickstart Example
 
@@ -25,19 +25,26 @@ The notebook demonstrates:
 
 - **Two tool options:** client-side tools (`@client_tool`) and MCP tools (FastMCP + `toolgroups.register`)
 - **Shared agent flow:** connect to Llama Stack Server, select a model, create an `Agent` with `tools=AGENT_TOOLS`, then run sessions and streaming turns
-- **Optional PGVector flow:** upload a file, create a `pgvector`-backed vector store, and run a hybrid search query
+- **Optional vector store flows:** upload a file, create a `pgvector`-backed or `milvus-remote`-backed vector store, and run a search query
 - Streaming responses and event logging
 - Optional FastAPI deployment of the `agent`
 
-## PGVector Usage
+## Vector Store Usage
 
-The downloadable notebook includes an optional PGVector section. To run it, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.
+The downloadable notebook includes optional PGVector and Milvus sections.
+
+For PGVector, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension.
+
+For Milvus, start the server with `MILVUS_ENDPOINT` and `MILVUS_CONSISTENCY_LEVEL` set, plus `MILVUS_TOKEN` if Milvus authentication is enabled, then execute the Milvus cells in the notebook. Use `provider_id="milvus-remote"` in the client request.
+
+For both vector-store examples, `client.models.list()` must include an embedding model, for example `sentence-transformers/nomic-ai/nomic-embed-text-v1.5`. If it returns only LLM models, restart the `LlamaStackDistribution` with `ENABLE_SENTENCE_TRANSFORMERS=true` and configure Hugging Face cache or download access as described in [Install Llama Stack](./install).
 
 The notebook example covers:
 
 - Uploading a file through `client.files.create(...)`
-- Creating a vector store with `provider_id="pgvector"`
-- Running a hybrid search with `client.vector_stores.search(...)` and `search_mode="hybrid"`
+- Creating a vector store with `provider_id="pgvector"` or `provider_id="milvus-remote"`
+- Passing `embedding_model` and `embedding_dimension` through `client.vector_stores.create(..., extra_body=...)`
+- Running a search with `client.vector_stores.search(...)`; PGVector uses `search_mode="hybrid"` in `extra_body`
 
 ## FAQ
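The vector-store request described in the quickstart changes can be sketched as a small helper that assembles the arguments for `client.vector_stores.create(...)`. This is a minimal sketch under stated assumptions, not the notebook's code. The helper name `vector_store_create_kwargs` is hypothetical. Placing `provider_id` inside `extra_body` next to `embedding_model` and `embedding_dimension` is an assumption, since the docs only say that the model and dimension travel in `extra_body`. The dimension `768` in the test and usage note is an assumed value for the example model and should be confirmed against its model card.

```python
# The two provider ids named in the updated docs.
SUPPORTED_PROVIDERS = ("pgvector", "milvus-remote")


def vector_store_create_kwargs(
    name: str,
    provider_id: str,
    embedding_model: str,
    embedding_dimension: int,
) -> dict:
    """Build keyword arguments for client.vector_stores.create(...).

    Hypothetical helper. Assumption: provider_id is sent in extra_body
    together with the embedding settings.
    """
    if provider_id not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider_id: {provider_id!r}")
    if embedding_dimension <= 0:
        raise ValueError("embedding_dimension must be a positive integer")

    return {
        "name": name,
        "extra_body": {
            "provider_id": provider_id,
            "embedding_model": embedding_model,
            "embedding_dimension": embedding_dimension,
        },
    }
```

With a connected client, the result would be passed as `client.vector_stores.create(**kwargs)`, for instance with `provider_id="milvus-remote"`, `embedding_model="sentence-transformers/nomic-ai/nomic-embed-text-v1.5"`, and the assumed dimension `768`. Switching between the two backends then changes only the `provider_id` argument.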