diff --git a/docs/en/llama_stack/install.mdx b/docs/en/llama_stack/install.mdx index 3ca1ab1..26ca655 100644 --- a/docs/en/llama_stack/install.mdx +++ b/docs/en/llama_stack/install.mdx @@ -34,7 +34,9 @@ After the operator is installed, deploy Llama Stack Server by creating a `LlamaS > - **Inference URL**: `VLLM_URL` must point at a **vLLM OpenAI-compatible** HTTP base URL (for example an in-cluster vLLM or KServe InferenceService) that serves the target model. > - **Secret (optional)**: `VLLM_API_TOKEN` is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from `containerSpec.env` (see the commented example in the manifest below). > - **Storage Class**: Ensure the `default` Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready. +> - **PostgreSQL storage**: The `starter` distribution in this release uses PostgreSQL for Llama Stack persistence. Configure `POSTGRES_*` environment variables for the server pod before deploying. > - **PGVector (optional)**: To use `vector_stores` with `provider_id="pgvector"`, provide `PGVECTOR_*` environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. +> - **Milvus (optional)**: To use `vector_stores` with `provider_id="milvus-remote"`, provide `MILVUS_ENDPOINT` and, when authentication is enabled, `MILVUS_TOKEN`. Set `MILVUS_CONSISTENCY_LEVEL` to a valid Milvus consistency level such as `Strong`. > - **Embedding model download**: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If a mirror or proxy is needed, configure `HF_ENDPOINT`. For fully offline environments, pre-download the model files into the server PVC before running the first vector-store request. 
```yaml @@ -68,8 +70,27 @@ spec: # key: token # name: vllm-api-token + # Required: PostgreSQL-backed Llama Stack persistence for this starter + # distribution image. + - name: POSTGRES_HOST + value: "" + - name: POSTGRES_PORT + value: "5432" + - name: POSTGRES_DB + value: "" + - name: POSTGRES_USER + value: "" + - name: POSTGRES_PASSWORD + valueFrom: + secretKeyRef: + name: + key: password + # Optional: enable PGVector-backed vector stores. - # Omit the entire block below if you do not need vector store APIs. + # Omit the entire block below if you do not need PGVector vector stores. + # These settings configure the vector DB provider and are separate from + # the POSTGRES_* persistence settings above, although they may point to + # the same PostgreSQL instance when it has the pgvector extension. # ACP-provided PostgreSQL already includes the pgvector extension. # - name: ENABLE_PGVECTOR # value: "true" @@ -87,6 +108,23 @@ spec: # name: # key: password + # Optional: enable remote Milvus-backed vector stores. + # Use provider_id="milvus-remote" from the client API. + # - name: MILVUS_ENDPOINT + # value: "http://" + # - name: MILVUS_TOKEN + # valueFrom: + # secretKeyRef: + # name: + # key: token + # - name: MILVUS_CONSISTENCY_LEVEL + # value: "Strong" + + # Required for PGVector or Milvus vector stores that use local + # sentence-transformers embeddings. + # - name: ENABLE_SENTENCE_TRANSFORMERS + # value: "true" + # # Optional: configure a Hugging Face mirror or proxy for the default # embedding model download path. # - name: HF_ENDPOINT @@ -118,6 +156,18 @@ status: serviceURL: http://demo-service.default.svc.cluster.local:8321 ``` +## Configure PostgreSQL Storage + +The `starter` distribution image used by this release requires PostgreSQL for Llama Stack persistence. 
Configure these server environment variables in the `LlamaStackDistribution`: + +- `POSTGRES_HOST` +- `POSTGRES_PORT` +- `POSTGRES_DB` +- `POSTGRES_USER` +- `POSTGRES_PASSWORD` + +These settings are for Llama Stack server state. They are not the same as `PGVECTOR_*`, which only configures the optional PGVector vector-store provider. You may use the same PostgreSQL instance for both roles when it has the required database, credentials, and `pgvector` extension. + ## Tool calling with vLLM on KServe The following applies to the **vLLM predictor** on KServe, not to the `LlamaStackDistribution` manifest. For agent flows that use **tools** (client-side tools or MCP), the vLLM process must expose tool-call support. Add predictor container `args` as required by upstream vLLM, for example: @@ -139,12 +189,26 @@ Recommended preparation: 1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password. 2. Expose the database connection to the `LlamaStackDistribution` with `PGVECTOR_HOST`, `PGVECTOR_PORT`, `PGVECTOR_DB`, `PGVECTOR_USER`, and `PGVECTOR_PASSWORD`. -3. Use the default embedding model provided by Llama Stack, and make sure its model files can be fetched on first use. +3. Set `ENABLE_SENTENCE_TRANSFORMERS=true` and make sure the default embedding model files can be fetched on first use. 4. If the cluster uses a Hugging Face mirror or proxy, set `HF_ENDPOINT` accordingly. 5. If the cluster is fully offline, pre-download the embedding model files into the server PVC and enable offline cache-related environment variables. After the distribution is ready, you can validate the setup with the PGVector section in the [Quickstart](./quickstart) notebook. +## Enable Milvus Vector Store + +When `MILVUS_ENDPOINT` is set on the server, Llama Stack can create vector stores by using `provider_id="milvus-remote"` from the client API. + +Recommended preparation: + +1. Prepare a Milvus endpoint reachable from the Llama Stack Server pod. 
`MILVUS_ENDPOINT` must include the scheme, either `http://` or `https://`, and the port required by your Milvus service. +2. Expose the Milvus connection to the `LlamaStackDistribution` with `MILVUS_ENDPOINT`. +3. If Milvus authentication is enabled, set `MILVUS_TOKEN` from a Secret. +4. Set `MILVUS_CONSISTENCY_LEVEL` to a string value such as `Strong`; the Milvus provider requires this field. +5. Set `ENABLE_SENTENCE_TRANSFORMERS=true` and make sure the embedding model files can be fetched or are already present in the server PVC. + +After the distribution is ready, validate the setup with the Milvus section in the [Quickstart](./quickstart) notebook. The client creates the vector store with `provider_id="milvus-remote"` and passes the selected embedding model id plus embedding dimension in `extra_body`. + ## Hugging Face Access For Embedding Models Llama Stack uses a default embedding model for vector-store operations. On first use, the server downloads the model files from Hugging Face into its local cache. @@ -179,4 +243,4 @@ Common deployment modes: value: "1" ``` -If the cache path is pre-populated correctly, the server can create PGVector-backed vector stores without downloading model artifacts at runtime. +If the cache path is pre-populated correctly, the server can create PGVector-backed or Milvus-backed vector stores without downloading model artifacts at runtime. 
diff --git a/docs/en/llama_stack/overview/features.mdx b/docs/en/llama_stack/overview/features.mdx index 59de484..1ae0b12 100644 --- a/docs/en/llama_stack/overview/features.mdx +++ b/docs/en/llama_stack/overview/features.mdx @@ -26,5 +26,5 @@ weight: 20 ## Integration - **Python Client**: `llama-stack-client` for Python 3.12+ with full agent and model APIs -- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores when the server is configured with `ENABLE_PGVECTOR=true` +- **Vector Store APIs**: Create and query vector stores from the client, including PGVector-backed stores with `provider_id="pgvector"` and Milvus-backed stores with `provider_id="milvus-remote"` - **REST-Friendly**: Server exposes APIs for inference, agents, and tool runtime; can be wrapped in FastAPI or other web frameworks for production use diff --git a/docs/en/llama_stack/quickstart.mdx b/docs/en/llama_stack/quickstart.mdx index 7c42417..68e9370 100644 --- a/docs/en/llama_stack/quickstart.mdx +++ b/docs/en/llama_stack/quickstart.mdx @@ -9,9 +9,9 @@ This section provides a quickstart example for creating an AI Agent with Llama S ## Prerequisites - Python 3.12 or higher (if not satisfied, refer to [FAQ: How to prepare Python 3.12 in Notebook](#how-to-prepare-python-312-in-notebook)) -- Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** (see install notes) +- Llama Stack Server installed and running via Operator (see [Install Llama Stack](./install)), with **`VLLM_URL` pointing at a vLLM-served model endpoint** and `POSTGRES_*` configured for server persistence (see install notes) - Access to a Notebook environment (e.g., Jupyter Notebook, JupyterLab) -- Python environment with `llama-stack-client==0.6.0`, `fastmcp` (for the MCP section), and other notebook dependencies installed +- Python environment with `llama-stack-client==0.7.1`, `fastmcp` 
(for the MCP section), and other notebook dependencies installed ## Quickstart Example @@ -25,19 +25,26 @@ The notebook demonstrates: - **Two tool options:** client-side tools (`@client_tool`) and MCP tools (FastMCP + `toolgroups.register`) - **Shared agent flow:** connect to Llama Stack Server, select a model, create an `Agent` with `tools=AGENT_TOOLS`, then run sessions and streaming turns -- **Optional PGVector flow:** upload a file, create a `pgvector`-backed vector store, and run a hybrid search query +- **Optional vector store flows:** upload a file, create a `pgvector`-backed or `milvus-remote`-backed vector store, and run a search query - Streaming responses and event logging - Optional FastAPI deployment of the `agent` -## PGVector Usage +## Vector Store Usage -The downloadable notebook includes an optional PGVector section. To run it, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. +The downloadable notebook includes optional PGVector and Milvus sections. + +For PGVector, start the server with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings, then execute the PGVector cells in the notebook. ACP-provided PostgreSQL can be used directly because it already includes the `pgvector` extension. + +For Milvus, start the server with `MILVUS_ENDPOINT`, optional `MILVUS_TOKEN`, and `MILVUS_CONSISTENCY_LEVEL`, then execute the Milvus cells in the notebook. Use `provider_id="milvus-remote"` in the client request. + +For both vector-store examples, `client.models.list()` must include an embedding model, for example `sentence-transformers/nomic-ai/nomic-embed-text-v1.5`. If it only returns LLM models, restart the `LlamaStackDistribution` with `ENABLE_SENTENCE_TRANSFORMERS=true` and configure Hugging Face cache/download access as described in [Install Llama Stack](./install). 
The notebook example covers: - Uploading a file through `client.files.create(...)` -- Creating a vector store with `provider_id="pgvector"` -- Running a hybrid search with `client.vector_stores.search(...)` and `search_mode="hybrid"` +- Creating a vector store with `provider_id="pgvector"` or `provider_id="milvus-remote"` +- Passing `embedding_model` and `embedding_dimension` through `client.vector_stores.create(..., extra_body=...)` +- Running a search with `client.vector_stores.search(...)`; PGVector uses `search_mode="hybrid"` in `extra_body` ## FAQ diff --git a/docs/public/llama-stack/llama-stack_quickstart.ipynb b/docs/public/llama-stack/llama-stack_quickstart.ipynb index 71ff102..64a0b81 100644 --- a/docs/public/llama-stack/llama-stack_quickstart.ipynb +++ b/docs/public/llama-stack/llama-stack_quickstart.ipynb @@ -7,12 +7,12 @@ "source": [ "# Llama Stack Quick Start Demo\n", "\n", - "This notebook demonstrates how to use Llama Stack for agent workflows and PGVector-backed vector store access:\n", + "This notebook demonstrates how to use Llama Stack for agent workflows and vector store access:\n", "\n", "- **Option A (section 2):** define a **client-side** weather tool with `@client_tool`; the cell sets **`AGENT_TOOLS`**.\n", "- **Option B (section 2):** run an **MCP** weather tool with **FastMCP** and register it with the server; the register cell sets **`AGENT_TOOLS`**.\n", "- **Section 3** uses the **same** connect / model selection / `Agent` construction / run flow for both options. 
The only difference is the value of **`AGENT_TOOLS`** passed into `Agent`.\n", - "- **Section 4** shows how to upload a file and query a **PGVector**-backed vector store.\n", + "- **Section 4** shows how to upload a file and query **PGVector**-backed and **Milvus**-backed vector stores.\n", "\n", "### Inference backend (`LlamaStackDistribution`)\n", "\n", @@ -49,7 +49,7 @@ "# Use current kernel's Python so PATH does not point to another env\n", "# If download is slow, add: -i https://pypi.tuna.tsinghua.edu.cn/simple\n", "import sys\n", - "!{sys.executable} -m pip install \"llama-stack-client==0.6.0\" \"requests\" \"fastapi\" \"uvicorn\" \"fastmcp\"" + "!{sys.executable} -m pip install \"llama-stack-client==0.7.1\" \"requests\" \"fastapi\" \"uvicorn\" \"fastmcp\"" ] }, { @@ -462,26 +462,26 @@ "id": "pgvector-title-md", "metadata": {}, "source": [ - "## 4. PGVector Vector Store Example\n", + "## 4. Vector Store Examples\n", "\n", - "This section shows how to upload a file and query a PGVector-backed vector store.\n", + "This section shows how to upload a file and query PGVector-backed and Milvus-backed vector stores.\n", "\n", - "Prerequisites:\n", - "- The server distribution is configured with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings.\n", - "- ACP-provided PostgreSQL can be used directly; it already includes the `pgvector` extension.\n", - "- Llama Stack includes a default embedding model configuration, but the model files are downloaded from Hugging Face on first use.\n", - "- If the cluster uses a Hugging Face mirror or proxy, configure `HF_ENDPOINT`.\n", - "- If the cluster is fully offline, pre-download the model files into `/home/lls/.lls/huggingface/hub` and set offline cache-related environment variables.\n" + "### Shared Embedding Model Selection\n", + "\n", + "Run this cell once before the PGVector or Milvus example. 
Both vector stores use the selected embedding model id and dimension in `client.vector_stores.create(..., extra_body=...)`.\n", + "\n", + "Before continuing, `client.models.list()` must include an embedding model, for example `sentence-transformers/nomic-ai/nomic-embed-text-v1.5`. If it only shows LLM models, restart the server distribution with `ENABLE_SENTENCE_TRANSFORMERS=true` and the embedding model cache/download settings described in the install guide.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "pgvector-demo-code", + "id": "vector-store-shared-embedding-code", "metadata": {}, "outputs": [], "source": [ "import json\n", + "import os\n", "import time\n", "\n", "\n", @@ -498,16 +498,58 @@ "\n", "\n", "models = client.models.list()\n", + "print(\"models(list) response:\")\n", + "if hasattr(models, \"model_dump\"):\n", + " print(json.dumps(models.model_dump(mode=\"json\"), ensure_ascii=False, indent=2))\n", + "else:\n", + " print(\n", + " json.dumps(\n", + " [getattr(model, \"model_dump\", lambda: str(model))() for model in models],\n", + " ensure_ascii=False,\n", + " default=str,\n", + " indent=2,\n", + " )\n", + " )\n", + "\n", + "preferred_embedding_model_id = os.getenv(\n", + " \"EMBEDDING_MODEL\",\n", + " os.getenv(\"TEST_EMBEDDING_MODEL\", \"sentence-transformers/nomic-ai/nomic-embed-text-v1.5\"),\n", + ")\n", + "preferred_embedding_dimension = int(\n", + " os.getenv(\"EMBEDDING_DIMENSION\", os.getenv(\"TEST_EMBEDDING_DIMENSION\", \"768\"))\n", + ")\n", + "\n", "embedding_model = next(\n", " (\n", " model\n", " for model in models\n", - " if get_model_metadata(model).get(\"model_type\") == \"embedding\"\n", + " if getattr(model, \"id\", \"\") == preferred_embedding_model_id\n", " ),\n", " None,\n", ")\n", + "\n", "if embedding_model is None:\n", - " raise RuntimeError(\"No embedding model found from client.models.list()\")\n", + " print(\n", + " f\"Preferred embedding model {preferred_embedding_model_id!r} was not found; \"\n", + " 
\"falling back to the first model tagged as embedding.\"\n", + " )\n", + " embedding_model = next(\n", + " (\n", + " model\n", + " for model in models\n", + " if get_model_metadata(model).get(\"model_type\") == \"embedding\"\n", + " ),\n", + " None,\n", + " )\n", + "if embedding_model is None:\n", + " raise RuntimeError(\n", + " \"No embedding model found from client.models.list(). The server currently \"\n", + " \"exposes only LLM models, so vector store examples cannot run yet. \"\n", + " \"Restart the LlamaStackDistribution with ENABLE_SENTENCE_TRANSFORMERS=true \"\n", + " \"and make sure the embedding model is registered and its files are available. \"\n", + " \"If you use a different registered embedding model, set EMBEDDING_MODEL and \"\n", + " \"EMBEDDING_DIMENSION in the notebook environment before running this cell.\"\n", + " )\n", "\n", "embedding_metadata = get_model_metadata(embedding_model)\n", "resolved_dimension = (\n", @@ -515,14 +557,45 @@ " or embedding_metadata.get(\"dimensions\")\n", " or getattr(embedding_model, \"embedding_dimension\", None)\n", " or getattr(embedding_model, \"dimensions\", None)\n", + " or preferred_embedding_dimension\n", ")\n", - "if resolved_dimension is None:\n", - " raise RuntimeError(\n", - " f\"Could not determine embedding dimension for model {embedding_model.id!r}. 
\"\n", - " \"Set it explicitly to match the embedding model used by the server.\"\n", - " )\n", "embedding_dimension = int(resolved_dimension)\n", "\n", + "print(\n", + " json.dumps(\n", + " {\n", + " \"embedding_model\": embedding_model.id,\n", + " \"embedding_dimension\": embedding_dimension,\n", + " },\n", + " ensure_ascii=False,\n", + " indent=2,\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "id": "pgvector-prerequisites-md", + "metadata": {}, + "source": [ + "### PGVector Vector Store Example\n", + "\n", + "Prerequisites:\n", + "- The server distribution is configured with `ENABLE_PGVECTOR=true` and valid `PGVECTOR_*` connection settings.\n", + "- ACP-provided PostgreSQL can be used directly; it already includes the `pgvector` extension.\n", + "- The server distribution is configured with `ENABLE_SENTENCE_TRANSFORMERS=true` so an embedding model is registered.\n", + "- Llama Stack includes a default embedding model configuration, but the model files are downloaded from Hugging Face on first use.\n", + "- If the cluster uses a Hugging Face mirror or proxy, configure `HF_ENDPOINT`.\n", + "- If the cluster is fully offline, pre-download the model files into `/home/lls/.lls/huggingface/hub` and set offline cache-related environment variables.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "pgvector-demo-code", + "metadata": {}, + "outputs": [], + "source": [ "document = \"\"\"ACP PostgreSQL with pgvector can be used as the vector backend.\n", "Unique token: pgvector-demo-token\n", "This document is used to verify vector store indexing and retrieval.\n", @@ -561,6 +634,67 @@ "print(json.dumps(search_result, ensure_ascii=False, indent=2))\n" ] }, + { + "cell_type": "markdown", + "id": "milvus-title-md", + "metadata": {}, + "source": [ + "### Milvus Vector Store Example\n", + "\n", + "This section shows how to upload a file and query a remote Milvus-backed vector store.\n", + "\n", + "Prerequisites:\n", + "- The server 
distribution is configured with `MILVUS_ENDPOINT` pointing at a Milvus endpoint reachable from the Llama Stack Server pod.\n", + "- If Milvus authentication is enabled, the server distribution is configured with `MILVUS_TOKEN`.\n", + "- `MILVUS_CONSISTENCY_LEVEL` is set to a valid Milvus consistency level, for example `Strong`.\n", + "- `ENABLE_SENTENCE_TRANSFORMERS=true` is set when using the default local embedding model.\n", + "- Llama Stack can download the embedding model files from Hugging Face, or the files are preloaded into `/home/lls/.lls/huggingface/hub` for offline clusters.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "milvus-demo-code", + "metadata": {}, + "outputs": [], + "source": [ + "document = \"\"\"Remote Milvus can be used as the vector backend for Llama Stack.\n", + "Unique token: milvus-demo-token\n", + "This document is about Shanghai and verifies Milvus vector store retrieval.\n", + "\"\"\"\n", + "\n", + "file_object = client.files.create(\n", + " file=(\"milvus-demo.txt\", document.encode(\"utf-8\"), \"text/plain\"),\n", + " purpose=\"assistants\",\n", + ")\n", + "\n", + "vector_store = client.vector_stores.create(\n", + " name=f\"milvus-demo-{int(time.time())}\",\n", + " file_ids=[file_object.id],\n", + " extra_body={\n", + " \"provider_id\": \"milvus-remote\",\n", + " \"embedding_model\": embedding_model.id,\n", + " \"embedding_dimension\": embedding_dimension,\n", + " },\n", + ")\n", + "\n", + "search_result = client.vector_stores.search(\n", + " vector_store_id=vector_store.id,\n", + " query=\"milvus-demo-token\",\n", + " max_num_results=3,\n", + ")\n", + "\n", + "if hasattr(vector_store, \"model_dump\"):\n", + " vector_store = vector_store.model_dump(mode=\"json\")\n", + "if hasattr(search_result, \"model_dump\"):\n", + " search_result = search_result.model_dump(mode=\"json\")\n", + "\n", + "print(\"Milvus vector store:\")\n", + "print(json.dumps(vector_store, ensure_ascii=False, indent=2))\n", + 
"print(\"\\nMilvus search result:\")\n", + "print(json.dumps(search_result, ensure_ascii=False, indent=2))\n" + ] + }, { "cell_type": "markdown", "id": "6f8d31d0",