From 2c94aa1b5565fbb21d47f75ba0a8d266594546d9 Mon Sep 17 00:00:00 2001 From: yeyitech Date: Wed, 17 Jun 2026 15:15:52 +0800 Subject: [PATCH] docs: add upgrade & migration guide covering Docker bootstrap removal and embedding rebuild flow (#2273) Multiple users hit the same blockers upgrading Docker images: 1. openviking.console.bootstrap removed in v0.3.19+ (Studio is bundled into the server now). 2. EmbeddingRebuildRequiredError when an upgrade ships with embedding- config drift. Maintainer @ZaynJarvis explicitly asked for this migration doc in #2273. PR #2618 covers the actionable error message; this PR adds the how-do-I-recover guide they referenced. Adds docs/en/guides/14-upgrades-and-migrations.md and the Chinese mirror, plus pointers from README/FAQ. Refs #2273 Co-Authored-By: Claude Opus 4.7 (1M context) --- README.md | 2 + README_CN.md | 2 + docs/en/faq/faq.md | 10 + docs/en/guides/14-upgrades-and-migrations.md | 187 +++++++++++++++++++ docs/zh/faq/faq.md | 10 + docs/zh/guides/14-upgrades-and-migrations.md | 164 ++++++++++++++++ 6 files changed, 375 insertions(+) create mode 100644 docs/en/guides/14-upgrades-and-migrations.md create mode 100644 docs/zh/guides/14-upgrades-and-migrations.md diff --git a/README.md b/README.md index 85ce7873fd..406dd99959 100644 --- a/README.md +++ b/README.md @@ -784,6 +784,8 @@ This allows the Agent to get "smarter with use" through interactions with the wo For more details, please visit our [Full Documentation](./docs/en/). +If a Docker upgrade leaves your container failing to start (for example with `ModuleNotFoundError: No module named 'openviking.console.bootstrap'` or `EmbeddingRebuildRequiredError`), see the [Upgrades and Migrations guide](./docs/en/guides/14-upgrades-and-migrations.md). 
+ ### Community & Team For more details, please see: **[About Us](./docs/en/about/01-about-us.md)** diff --git a/README_CN.md b/README_CN.md index 4a95d4bf1d..a9dffad5d2 100644 --- a/README_CN.md +++ b/README_CN.md @@ -827,6 +827,8 @@ OpenViking 内置了记忆自迭代循环。在每个会话结束时,开发者 更多详情,请访问我们的[完整文档](./docs/zh/)。 +如果 Docker 升级后容器启动失败(例如 `ModuleNotFoundError: No module named 'openviking.console.bootstrap'` 或 `EmbeddingRebuildRequiredError`),请参阅[升级与迁移指南](./docs/zh/guides/14-upgrades-and-migrations.md)。 + ### 社区与团队 更多详情,请参见:**[关于我们](./docs/zh/about/01-about-us.md)** diff --git a/docs/en/faq/faq.md b/docs/en/faq/faq.md index d9df0cd889..c7107f5ad5 100644 --- a/docs/en/faq/faq.md +++ b/docs/en/faq/faq.md @@ -372,6 +372,15 @@ This strategy finds semantically matching fragments while understanding the comp 3. **Use local storage**: Use `local` backend during development to reduce network latency 4. **Async operations**: Fully utilize `AsyncOpenViking` / `AsyncHTTPClient`'s async capabilities +### I can't upgrade my Docker container — what do I do? + +Two specific failures cause most upgrade reports: + +- `ModuleNotFoundError: No module named 'openviking.console.bootstrap'` — Web Studio is bundled into `openviking-server` starting in v0.3.19, so a `command:` line that still launches the standalone bootstrap module will exit immediately. Drop that line. +- `EmbeddingRebuildRequiredError` — the embedding model, provider, or dimension in `ov.conf` no longer matches the existing `vectordb/context` collection. You can either roll the embedding config back, or back up and rebuild only `vectordb/context/` and re-run `ov reindex` per namespace. + +The full step-by-step recovery, including a `docker-compose.yml` before/after example and the exact `ov reindex` invocations, is in [Upgrades and Migrations](../guides/14-upgrades-and-migrations.md). + ## Deployment ### What's the difference between embedded mode and service mode? 
@@ -400,3 +409,4 @@ Yes, OpenViking main project is open source under the AGPL-3.0 license, and exam - [Architecture Overview](../concepts/01-architecture.md) - Deep dive into system design - [Retrieval Mechanism](../concepts/07-retrieval.md) - Detailed retrieval process - [Configuration Guide](../guides/01-configuration.md) - Complete configuration reference +- [Upgrades and Migrations](../guides/14-upgrades-and-migrations.md) - Recover from upgrade-time startup failures diff --git a/docs/en/guides/14-upgrades-and-migrations.md b/docs/en/guides/14-upgrades-and-migrations.md new file mode 100644 index 0000000000..79ce0208e1 --- /dev/null +++ b/docs/en/guides/14-upgrades-and-migrations.md @@ -0,0 +1,187 @@ +# Upgrades and Migrations + +This guide collects the recovery steps for the upgrade-time blockers that +have surfaced most often in real deployments. If your container exits at +boot after pulling a newer image, start here before filing an issue. + +## When to read this guide + +- You are upgrading an existing OpenViking deployment between minor + versions. +- The server fails to start after the upgrade (the container exits or + the health check never goes green). +- You see `ModuleNotFoundError: No module named 'openviking.console.bootstrap'` + in the container logs. +- You see `EmbeddingRebuildRequiredError` in the server logs. + +## Before you upgrade + +A few minutes of preparation makes every other step in this guide +recoverable. Do all of these before pulling a new image. + +- **Snapshot your data directory.** This is the directory mounted into + the container at `/app/.openviking` (typically `~/.openviking` on the + host). The two paths that matter for retrieval are the AGFS root and + `vectordb/`. A simple `cp -a` or `tar` of the whole directory while + the server is stopped is enough; you do not need a live backup tool. 
+- **Note your current `ov.conf`.** Embedding model, provider, and + dimension are the fields most likely to drift between versions and + to break startup. Keep a copy of the file you were running with so + you can roll back if the upgrade fails. +- **Stop the server gracefully.** Use `docker stop ` (or + `docker compose down`). Avoid `docker kill` (which sends `SIGKILL` + by default): the vector index relies on a clean shutdown to release + locks under `vectordb//store/LOCK`, and a hard kill can leave a + stale lock that blocks the next start. + +## Common breaking transitions + +The two failures below account for the majority of upgrade reports +when moving from v0.3.15 to later v0.3.x releases. They can happen +together: the server may exit on the first one, and only after you +fix it do you see the second one, so read both before changing +anything. + +### v0.3.15 → v0.3.19+ : `openviking.console.bootstrap` removed + +- **Symptom.** The container exits immediately after start. The log + shows `ModuleNotFoundError: No module named 'openviking.console.bootstrap'`, + often coming from a `python -m openviking.console.bootstrap ...` + line in your `command:` override. +- **Cause.** Web Studio used to ship as a separate process started by + `python -m openviking.console.bootstrap`. Starting in v0.3.19 the + Studio assets are bundled into `openviking-server`, and the + standalone `openviking.console.bootstrap` module no longer exists + (see PR #2320). Any custom `command:` that still launches it will + fail with `ModuleNotFoundError`. +- **Fix.** In your `docker-compose.yml` (or whatever you use to run + the container), drop the `python -m openviking.console.bootstrap` + invocation. The default entrypoint already runs `openviking-server`, + which now serves both the API on port `1933` and the Studio UI. 
+- **Worked example.** + + Before — two processes, one of them now-removed: + + ```yaml + services: + openviking: + image: ghcr.io/volcengine/openviking:latest + command: | + openviking-server & + python -m openviking.console.bootstrap --host 0.0.0.0 --port 8020 + ``` + + After — single process, default entrypoint: + + ```yaml + services: + openviking: + image: ghcr.io/volcengine/openviking:latest + # no `command:` override needed — the image entrypoint runs + # openviking-server, which now also serves Web Studio. + ``` + + If you still want to keep an explicit `command:`, set it to + `command: openviking-server` and remove the bootstrap line. + +### Any version with `EmbeddingRebuildRequiredError` + +- **Symptom.** The server logs `EmbeddingRebuildRequiredError: + Existing collection embedding dimension (...) does not match current + configuration (...)` or + `EmbeddingRebuildRequiredError: Existing collection embedding metadata + does not match current configuration`. Startup aborts before the + HTTP server is ready. +- **Cause.** The vector collection on disk records which embedding + provider, model, and dimension were used to build it. When the + embedding section of `ov.conf` changes (different provider, different + model, or — most importantly — a different vector dimension) the + existing vectors are no longer comparable to new ones. The server + refuses to start rather than mix incompatible vectors. +- **Choose one path.** Both paths preserve your business data; they + differ only in whether you keep the old vectors or rebuild them. + + **Path A — keep your data, restore the old embedding config.** Roll + the embedding section of `ov.conf` back to the values the existing + collection was built with (the values you noted in *Before you + upgrade*). The server will start. Schedule the embedding-model + change as a deliberate migration via Path B during a maintenance + window. 
If the only change between old and new config is provider + or model name and the dimension is identical, you can also set + `embedding.allow_metadata_override = true` in `ov.conf` to keep the + existing vectors and just rewrite the recorded metadata. + + **Path B — rebuild embeddings under the new config.** This + re-embeds every resource, memory, and skill. The cost is one full + embed pass over your indexed content, billed against whatever + embedding provider you have configured. + + 1. **Back up `vectordb/context/`.** Inside your data directory + (host: `~/.openviking`, container: `/app/.openviking`), rename + `data/vectordb/context/` to something like + `data/vectordb/context.bak-/`, or copy it elsewhere. Do + **not** delete it yet — you want a fallback if the rebuild fails + halfway. + 2. **Delete only `data/vectordb/context/`.** Do not delete other + directories under `data/`. The AGFS tree (resources, memories, + skills, sessions) lives outside `vectordb/` and is what we are + trying to preserve. Removing anything else risks losing the very + data you are rebuilding embeddings for. + 3. **Start the server with the new `ov.conf`.** It will create a + fresh `vectordb/context/` collection that matches the new + embedding configuration. The server should now come up and pass + `/health`. + 4. **Reindex your namespaces.** Use the CLI to re-embed the content + that previously had vectors: + + ```bash + ov reindex viking://resources --mode vectors_only --wait true + ov reindex viking://user/memories --mode vectors_only --wait true + ov reindex viking://agent/memories --mode vectors_only --wait true + ov reindex viking://agent/skills --mode vectors_only --wait true + ``` + + Run only the namespaces you actually use. `--mode vectors_only` + re-embeds against the existing semantic summaries (L0/L1) and is + the right choice when only the embedding configuration changed. 
+ If your semantic-summary configuration also changed, use + `--mode semantic_and_vectors` instead — that re-runs L0/L1 + summarization as well and costs additional VLM calls. + 5. **Verify search works.** Run a query you know the answer to + against a representative URI: + + ```bash + ov find "" --target-uri viking://resources/ + ``` + + Once you are satisfied, delete the `context.bak-/` backup. + +## Sanity checks after a successful upgrade + +Run these against the upgraded container before pointing production +traffic at it. + +- `curl http://localhost:1933/health` returns a healthy response. +- `ov tree viking://resources -L 1` lists the resources you expect to + see — confirms the AGFS tree survived the upgrade. +- `ov find ` returns the hits you expect — confirms the + vector index is populated and queryable. +- The Studio UI loads at the same port you used before (default + `1933` for direct access, or `1934` if you go through Caddy). + +## What to do if you are stuck + +If none of the above resolves the failure, file an issue with: + +- The full server logs from the start of the failing run (everything + from container start through the first stack trace). +- Your `ov.conf`, with API keys and other secrets redacted. +- The exact version you upgraded **from** and **to** (image tag is + fine). +- The output of `ls data/vectordb/` from the data directory you are + pointing at. + +Tag the issue with `upgrade` so the maintainers can route it. See +also the related migration note for the User / Peer model in +[migration/01-user-peer-model.md](../migration/01-user-peer-model.md) +if you are crossing the 0.3.x → 0.4.0 boundary. diff --git a/docs/zh/faq/faq.md b/docs/zh/faq/faq.md index 37805dd40f..778d397f20 100644 --- a/docs/zh/faq/faq.md +++ b/docs/zh/faq/faq.md @@ -364,6 +364,15 @@ OpenViking 使用分数传播机制: 3. **使用本地存储**:开发阶段使用 `local` 后端减少网络延迟 4. **异步操作**:充分利用 `AsyncOpenViking` / `AsyncHTTPClient` 的异步特性 +### Docker 容器升不上去,怎么办? 
+ +升级失败的报告主要集中在两个错误: + +- `ModuleNotFoundError: No module named 'openviking.console.bootstrap'` —— 自 v0.3.19 起 Web Studio 已打包进 `openviking-server`,仍然启动独立 bootstrap 模块的 `command:` 会立即退出。删除该行即可。 +- `EmbeddingRebuildRequiredError` —— `ov.conf` 中的 embedding 模型、provider 或维度与已有的 `vectordb/context` 集合不再匹配。你可以选择把 embedding 配置回滚,或者备份并仅重建 `vectordb/context/`,然后逐个 namespace 运行 `ov reindex`。 + +完整的分步恢复流程,包括 `docker-compose.yml` 改动前后对照以及具体的 `ov reindex` 命令,请参阅[升级与迁移](../guides/14-upgrades-and-migrations.md)。 + ## 部署相关 ### 嵌入式模式和服务模式有什么区别? @@ -392,3 +401,4 @@ client = ov.AsyncHTTPClient(url="http://localhost:1933", api_key="your-key") - [架构概述](../concepts/01-architecture.md) - 深入理解系统设计 - [检索机制](../concepts/07-retrieval.md) - 检索流程详解 - [配置指南](../guides/01-configuration.md) - 完整配置参考 +- [升级与迁移](../guides/14-upgrades-and-migrations.md) - 处理升级时的启动失败 diff --git a/docs/zh/guides/14-upgrades-and-migrations.md b/docs/zh/guides/14-upgrades-and-migrations.md new file mode 100644 index 0000000000..9ce5c8e5d5 --- /dev/null +++ b/docs/zh/guides/14-upgrades-and-migrations.md @@ -0,0 +1,164 @@ +# 升级与迁移 + +本指南汇总了在实际部署中最常见的升级阻塞问题及其恢复步骤。如果你在拉取 +新镜像后容器启动即退出,请先阅读本指南再提 issue。 + +## 何时阅读本指南 + +- 你正在升级一套已有的 OpenViking 部署,跨越次要版本。 +- 升级后服务启动失败(容器退出,或健康检查长期不通过)。 +- 容器日志中出现 + `ModuleNotFoundError: No module named 'openviking.console.bootstrap'`。 +- 服务日志中出现 `EmbeddingRebuildRequiredError`。 + +## 升级前准备 + +升级前花几分钟做好准备,可以让本指南中后续每一步都是可恢复的。请在拉取 +新镜像之前完成以下事项。 + +- **快照数据目录。** 即挂载到容器 `/app/.openviking` 的目录(在宿主机上 + 通常是 `~/.openviking`)。检索相关的两个关键路径是 AGFS 根目录和 + `vectordb/`。停掉服务后用 `cp -a` 或 `tar` 整体打包即可,不需要专门的 + 在线备份工具。 +- **保存当前的 `ov.conf`。** Embedding 模型、Provider 与维度是版本之间 + 最容易漂移、最容易导致启动失败的字段。把当前正常运行的配置文件留一份 + 副本,万一升级失败可以快速回滚。 +- **优雅停止服务。** 使用 `docker stop `(或 + `docker compose down`)。避免 `docker kill -9` / `SIGKILL`:向量索引 + 依赖正常关闭来释放 `vectordb//store/LOCK` 下的锁,强制终止 + 会留下陈旧的锁文件,阻塞下一次启动。 + +## 常见的破坏性升级场景 + +下面两类故障覆盖了 v0.3.15 之后 v0.3.x 系列升级报告里的大多数情况。它们 +**可能同时存在** —— 服务可能先因第一个错误退出,修好之后才暴露出第二个 +错误 —— 所以请先把两节都读完再动手。 + 
+### v0.3.15 → v0.3.19+ :`openviking.console.bootstrap` 已移除 + +- **现象。** 容器启动后立即退出。日志中显示 + `ModuleNotFoundError: No module named 'openviking.console.bootstrap'`, + 通常出现在你 `command:` 覆盖里的 + `python -m openviking.console.bootstrap ...` 这一行。 +- **原因。** Web Studio 之前是一个独立进程,由 + `python -m openviking.console.bootstrap` 启动。从 v0.3.19 起 Studio + 的资源已被打包进 `openviking-server`,独立的 + `openviking.console.bootstrap` 模块已不复存在(参见 PR #2320)。任何 + 仍然启动它的自定义 `command:` 都会报 `ModuleNotFoundError`。 +- **修复。** 在 `docker-compose.yml`(或你用来启动容器的任何方式)中, + 删除 `python -m openviking.console.bootstrap` 这一行。镜像默认的入口 + 脚本已经会运行 `openviking-server`,它在 `1933` 端口同时提供 API 和 + Studio UI。 +- **示例。** + + 修改前 —— 两个进程,其中一个已被移除: + + ```yaml + services: + openviking: + image: ghcr.io/volcengine/openviking:latest + command: | + openviking-server & + python -m openviking.console.bootstrap --host 0.0.0.0 --port 8020 + ``` + + 修改后 —— 单进程,使用默认入口: + + ```yaml + services: + openviking: + image: ghcr.io/volcengine/openviking:latest + # 不再需要 `command:` 覆盖 —— 镜像入口会运行 openviking-server, + # 它现在也负责提供 Web Studio。 + ``` + + 如果你仍然希望显式声明 `command:`,写成 + `command: openviking-server` 并删掉 bootstrap 那一行即可。 + +### 任何版本出现 `EmbeddingRebuildRequiredError` + +- **现象。** 服务日志中出现 + `EmbeddingRebuildRequiredError: Existing collection embedding dimension (...) 
+ does not match current configuration (...)` 或者 + `EmbeddingRebuildRequiredError: Existing collection embedding metadata does + not match current configuration`。HTTP 服务还没起来就中止了。 +- **原因。** 磁盘上的向量集合记录了构建它时所使用的 embedding provider、 + 模型名和维度。当 `ov.conf` 中 embedding 段发生变化(更换 provider、 + 更换模型,特别是更换向量维度)时,已有向量就无法再与新向量进行比较。 + 服务为了避免新旧向量混用,宁可拒绝启动。 +- **选一条路径。** 两条路径都会保留你的业务数据,区别只在于是否保留旧 + 向量。 + + **路径 A —— 保留数据,回滚 embedding 配置。** 把 `ov.conf` 的 + embedding 段改回与已有集合一致的取值(即"升级前准备"里你保存的那一 + 份)。服务即可恢复启动。后续在维护窗口内通过路径 B 计划性地完成 + embedding 模型变更。如果新旧配置之间只是 provider 或模型名不同、 + **维度完全一致**,也可以在 `ov.conf` 中设置 + `embedding.allow_metadata_override = true`,这样会保留已有向量,仅 + 改写记录的 metadata。 + + **路径 B —— 在新配置下重建向量。** 这条路径会对所有 resource、 + memory 和 skill 重新计算 embedding。代价是一次完整的 embed 计算, + 对应的费用按你所配置的 embedding provider 计费。 + + 1. **备份 `vectordb/context/`。** 在数据目录(宿主机 + `~/.openviking`,容器内 `/app/.openviking`)下,把 + `data/vectordb/context/` 改名为 `data/vectordb/context.bak-<日期>/` + 或拷贝到别处。**先不要删除** —— 万一重建中途失败,这是你回退的 + 依据。 + 2. **只删除 `data/vectordb/context/`。** 不要删除 `data/` 下的其他任 + 何目录。AGFS 树(resources、memories、skills、sessions)位于 + `vectordb/` 之外,正是我们要保住的部分。删除其他目录有可能毁掉你 + 正打算重建向量的数据本身。 + 3. **使用新的 `ov.conf` 启动服务。** 服务会创建一个全新的 + `vectordb/context/` 集合,与新的 embedding 配置匹配。此时服务应 + 该能正常启动并通过 `/health`。 + 4. **对各 namespace 重建索引。** 使用 CLI 对原本有向量的内容重新 + embed: + + ```bash + ov reindex viking://resources --mode vectors_only --wait true + ov reindex viking://user/memories --mode vectors_only --wait true + ov reindex viking://agent/memories --mode vectors_only --wait true + ov reindex viking://agent/skills --mode vectors_only --wait true + ``` + + 只对你实际使用的 namespace 执行即可。`--mode vectors_only` 会复用 + 已有的语义摘要(L0/L1),仅重新计算向量 —— 当变化只发生在 + embedding 配置时,这是正确选择。如果你的语义摘要配置也变了,请 + 改用 `--mode semantic_and_vectors`,它会同时重做 L0/L1 摘要,会 + 额外产生 VLM 调用费用。 + 5. 
**验证检索可用。** 用你已知答案的查询,在代表性的 URI 下跑一次 + 检索: + + ```bash + ov find "<已知关键词>" --target-uri viking://resources/ + ``` + + 确认结果符合预期后,再删除 `context.bak-<日期>/` 备份。 + +## 升级成功后的健全性检查 + +切换生产流量之前,对升级后的容器跑一遍: + +- `curl http://localhost:1933/health` 返回健康响应。 +- `ov tree viking://resources -L 1` 列出预期中的资源 —— 验证 AGFS 树 + 在升级中未受影响。 +- `ov find <已知关键词>` 返回预期命中 —— 验证向量索引已经填充且可 + 查询。 +- Studio UI 在原来的端口可访问(直连默认 `1933`,经 Caddy 默认 + `1934`)。 + +## 如果以上都没解决 + +如果按上述步骤仍无法恢复,请提交 issue 时附上: + +- 失败那一次启动的完整服务日志(从容器启动到第一段 stack trace 的全 + 部内容)。 +- 你的 `ov.conf`,去掉 API key 等敏感字段。 +- 升级**之前**和**之后**的具体版本号(镜像 tag 即可)。 +- 数据目录下 `ls data/vectordb/` 的输出。 + +请给 issue 打上 `upgrade` 标签,便于维护者分流。如果你正在跨越 +0.3.x → 0.4.0 边界,也请同时阅读相关的迁移说明 +[migration/01-user-peer-model.md](../migration/01-user-peer-model.md)。