From 9de2b3789a11c8cf9a1d37e145e93203f5475300 Mon Sep 17 00:00:00 2001 From: ActivePeter <1020401660@qq.com> Date: Sat, 20 Jun 2026 13:45:44 +0800 Subject: [PATCH 01/13] fix: doc page url in readme --- README.md | 25 +++++++++++++------------ README_CN.md | 25 +++++++++++++------------ 2 files changed, 26 insertions(+), 24 deletions(-) diff --git a/README.md b/README.md index 53a2110..3f4cd2f 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ [![Latest](https://img.shields.io/badge/Latest-v0.2.1-f28500)](./fluxon_release) [![Interfaces](https://img.shields.io/badge/Interfaces-KV%2FRPC%20%7C%20MQ%20%7C%20FS-1f6feb)](#interface-capabilities) -[English](./README.md) | [中文](./README_CN.md) | [Docs](https://tele-ai.github.io/fluxon/) | [中文文档](https://tele-ai.github.io/fluxon/cn/) | GitHub repository +[English](./README.md) | [中文](./README_CN.md) | [Docs](https://tele-ai.github.io/Fluxon/) | [中文文档](https://tele-ai.github.io/Fluxon/cn/) | GitHub repository @@ -145,7 +145,7 @@ The benchmark results show that small-file reads and large-file writes are alrea ## 🚀 Quick Start -Quick Start is the shortest path to try Fluxon. For formal installation, deployment, and operations, see [User Docs](https://tele-ai.github.io/fluxon/user_doc/). +Quick Start is the shortest path to try Fluxon. For formal installation, deployment, and operations, see [User Docs](https://tele-ai.github.io/Fluxon/user_doc/). 
### KV Quick Start @@ -178,7 +178,7 @@ Open the printed link to view the KV Web UI: Related interface docs: -- [KV and RPC Interface](https://tele-ai.github.io/fluxon/user_doc/User---3---KV-and-RPC-Interface/) +- [KV and RPC Interface](https://tele-ai.github.io/Fluxon/user_doc/User---3---KV-and-RPC-Interface/) ### MQ Quick Start @@ -209,7 +209,7 @@ Runtime view: Related interface docs: -- [MQ Interface](https://tele-ai.github.io/fluxon/user_doc/User---4---MQ-Interface/) +- [MQ Interface](https://tele-ai.github.io/Fluxon/user_doc/User---4---MQ-Interface/) ### FS Quick Start @@ -247,7 +247,7 @@ Open the printed link to view the FS Web UI: Related interface docs: -- [FS Interface](https://tele-ai.github.io/fluxon/user_doc/User---5---FS-Interface/) +- [FS Interface](https://tele-ai.github.io/Fluxon/user_doc/User---5---FS-Interface/) @@ -267,17 +267,18 @@ Related interface docs: Contributions are welcome. Before you start, please read the developer docs on GitHub Pages: -- [Developer Docs](https://tele-ai.github.io/fluxon/dev_doc/) -- [Developer - 1 - Package core install artifacts](https://tele-ai.github.io/fluxon/dev_doc/Developer---1---Package-Core-Install-Artifacts/) -- [Developer - 2 - Package middleware and images](https://tele-ai.github.io/fluxon/dev_doc/Developer---2---Package-Middleware-and-Images/) -- [Developer - 4 - Publish a release](https://tele-ai.github.io/fluxon/dev_doc/Developer---4---Publish-a-Release/) +- [Developer Docs](https://tele-ai.github.io/Fluxon/dev_doc/) +- [Developer - 1 - Package core install artifacts](https://tele-ai.github.io/Fluxon/dev_doc/Developer---1---Package-Core-Install-Artifacts/) +- [Developer - 2 - Package middleware and images](https://tele-ai.github.io/Fluxon/dev_doc/Developer---2---Package-Middleware-and-Images/) +- [Developer - 3 - Documentation Writing Rules](https://tele-ai.github.io/Fluxon/dev_doc/Developer---3---Documentation-Writing-Rules/) +- [Developer - 4 - Publish a 
release](https://tele-ai.github.io/Fluxon/dev_doc/Developer---4---Publish-a-Release/) ## 👥 Contributors - - + + Some earlier contribution records are no longer fully reflected in the current commit history. Historical highlights: @@ -312,4 +313,4 @@ Fluxon is open-sourced under Apache License 2.0, see [LICENSE](./LICENSE). ## ⭐ Stargazers over time -[![Star History Chart](https://api.star-history.com/chart?repos=Tele-AI/fluxon&type=date&legend=top-left)](https://www.star-history.com/?repos=Tele-AI%2Ffluxon&type=date&legend=top-left) +[![Star History Chart](https://api.star-history.com/chart?repos=Tele-AI/Fluxon&type=date&legend=top-left)](https://www.star-history.com/?repos=Tele-AI%2FFluxon&type=date&legend=top-left) diff --git a/README_CN.md b/README_CN.md index 715511b..da2cdb4 100644 --- a/README_CN.md +++ b/README_CN.md @@ -20,7 +20,7 @@ [![Latest](https://img.shields.io/badge/Latest-v0.2.1-f28500)](./fluxon_release) [![Interfaces](https://img.shields.io/badge/Interfaces-KV%2FRPC%20%7C%20MQ%20%7C%20FS-1f6feb)](#接口能力) -[中文](./README_CN.md) | [English](./README.md) | [用户文档](https://tele-ai.github.io/fluxon/cn/) | [English Docs](https://tele-ai.github.io/fluxon/) | GitHub repository +[中文](./README_CN.md) | [English](./README.md) | [用户文档](https://tele-ai.github.io/Fluxon/cn/) | [English Docs](https://tele-ai.github.io/Fluxon/) | GitHub repository @@ -147,7 +147,7 @@ benchmark 显示,小文件读和大文件写已显著领先 `Alluxio`,大文 ## 🚀 快速开始 -Quick Start 用于最短路径体验;正式安装、部署和运维入口见 [用户文档](https://tele-ai.github.io/fluxon/cn/user_doc/)。 +Quick Start 用于最短路径体验;正式安装、部署和运维入口见 [用户文档](https://tele-ai.github.io/Fluxon/cn/user_doc/)。 ### KV 快速开始 @@ -180,7 +180,7 @@ del demo:hello 对应接口文档: -- [KV 和 RPC 接口](https://tele-ai.github.io/fluxon/cn/user_doc/%E7%94%A8%E6%88%B7---3---KV-RPC%E6%8E%A5%E5%8F%A3/) +- [KV 和 RPC 接口](https://tele-ai.github.io/Fluxon/cn/user_doc/%E7%94%A8%E6%88%B7---3---KV-RPC%E6%8E%A5%E5%8F%A3/) ### MQ 快速开始 @@ -211,7 +211,7 @@ exit 对应接口文档: -- [MQ 
接口](https://tele-ai.github.io/fluxon/cn/user_doc/%E7%94%A8%E6%88%B7---4---MQ%E6%8E%A5%E5%8F%A3/) +- [MQ 接口](https://tele-ai.github.io/Fluxon/cn/user_doc/%E7%94%A8%E6%88%B7---4---MQ%E6%8E%A5%E5%8F%A3/) ### FS 快速开始 @@ -249,7 +249,7 @@ FS Quick Start 会额外打印: 对应接口文档: -- [FS 接口](https://tele-ai.github.io/fluxon/cn/user_doc/%E7%94%A8%E6%88%B7---5---FS%E6%8E%A5%E5%8F%A3/) +- [FS 接口](https://tele-ai.github.io/Fluxon/cn/user_doc/%E7%94%A8%E6%88%B7---5---FS%E6%8E%A5%E5%8F%A3/) @@ -269,17 +269,18 @@ FS Quick Start 会额外打印: 欢迎参与贡献。开始之前,建议先阅读 GitHub Pages 上的开发者文档: -- [开发者文档总入口](https://tele-ai.github.io/fluxon/cn/dev_doc/) -- [开发者 - 1 - 打包核心安装包](https://tele-ai.github.io/fluxon/cn/dev_doc/%E5%BC%80%E5%8F%91%E8%80%85---1---%E6%89%93%E5%8C%85%E6%A0%B8%E5%BF%83%E5%AE%89%E8%A3%85%E5%8C%85/) -- [开发者 - 2 - 打包中间件和镜像](https://tele-ai.github.io/fluxon/cn/dev_doc/%E5%BC%80%E5%8F%91%E8%80%85---2---%E6%89%93%E5%8C%85%E4%B8%AD%E9%97%B4%E4%BB%B6%E5%92%8C%E9%95%9C%E5%83%8F/) -- [开发者 - 4 - 发布 Release](https://tele-ai.github.io/fluxon/cn/dev_doc/%E5%BC%80%E5%8F%91%E8%80%85---4---%E5%8F%91%E5%B8%83-Release/) +- [开发者文档总入口](https://tele-ai.github.io/Fluxon/cn/dev_doc/) +- [开发者 - 1 - 打包核心安装包](https://tele-ai.github.io/Fluxon/cn/dev_doc/%E5%BC%80%E5%8F%91%E8%80%85---1---%E6%89%93%E5%8C%85%E6%A0%B8%E5%BF%83%E5%AE%89%E8%A3%85%E5%8C%85/) +- [开发者 - 2 - 打包中间件和镜像](https://tele-ai.github.io/Fluxon/cn/dev_doc/%E5%BC%80%E5%8F%91%E8%80%85---2---%E6%89%93%E5%8C%85%E4%B8%AD%E9%97%B4%E4%BB%B6%E5%92%8C%E9%95%9C%E5%83%8F/) +- [开发者 - 3 - 文档写作规约](https://tele-ai.github.io/Fluxon/cn/dev_doc/%E5%BC%80%E5%8F%91%E8%80%85---3---%E6%96%87%E6%A1%A3%E5%86%99%E4%BD%9C%E8%A7%84%E7%BA%A6/) +- [开发者 - 4 - 发布 Release](https://tele-ai.github.io/Fluxon/cn/dev_doc/%E5%BC%80%E5%8F%91%E8%80%85---4---%E5%8F%91%E5%B8%83-Release/) ## 👥 Contributors - - + + 部分更早期的贡献记录已经无法从当前 commit 历史里完整反映,这里补充说明: @@ -314,4 +315,4 @@ Fluxon 基于 Apache License 2.0 开源,见 [LICENSE](./LICENSE)。 ## ⭐ Star 增长趋势 -[![Star History 
Chart](https://api.star-history.com/chart?repos=Tele-AI/fluxon&type=date&legend=top-left)](https://www.star-history.com/?repos=Tele-AI%2Ffluxon&type=date&legend=top-left) +[![Star History Chart](https://api.star-history.com/chart?repos=Tele-AI/Fluxon&type=date&legend=top-left)](https://www.star-history.com/?repos=Tele-AI%2FFluxon&type=date&legend=top-left) From cf6c5c09a306bc354f5deeb85b302ebc2afd34c8 Mon Sep 17 00:00:00 2001 From: ActivePeter <1020401660@qq.com> Date: Mon, 22 Jun 2026 18:02:48 +0800 Subject: [PATCH 02/13] test --- .github/workflows/all_test.yml | 9 +- deployment/gen_bare_deploy_bash.py | 588 ++++------- .../atomic_group_node_resolution_tail.sh.tmpl | 14 + .../atomic_group_service_block.sh.tmpl | 24 + .../atomic_group_start.sh.tmpl | 7 + .../atomic_group_stop.sh.tmpl | 6 + .../bare_entrypoint.sh.tmpl | 5 + .../common_node_resolution_tail.sh.tmpl | 15 + .../etcd_health_wait_block.sh.tmpl | 4 + .../gen_bare_deploy_bash/host_prelude.sh.tmpl | 57 + .../selection_present_probe_fn.sh.tmpl | 19 + ...ction_supervisor_launch_wait_block.sh.tmpl | 9 + ...on_supervisor_path_from_script_dir.sh.tmpl | 7 + .../standalone_start.sh.tmpl | 6 + .../standalone_start_body.sh.tmpl | 27 + .../standalone_stop.sh.tmpl | 15 + .../start_lock_block.sh.tmpl | 14 + .../tcp_ready_helpers.sh.tmpl | 120 +++ .../tcp_ready_wait_block.sh.tmpl | 6 + deployment/tests/test_gen_bare_deploy_bash.py | 153 ++- deployment/tests/test_gen_k8s_daemonset.py | 2 +- deployment/tests/test_log_shard.py | 117 ++ .../test_selection_supervisor_codegen.py | 191 +++- .../test_start_test_bed_bootstrap_log.py | 33 +- deployment/utils/log_shard.py | 196 ++++ deployment/utils/proc_lifecycle_codegen.py | 41 +- .../utils/selection_supervisor_codegen.py | 90 +- ...15\347\275\256\346\200\273\350\247\210.md" | 217 ++++ ...74\345\207\272\351\223\276\350\267\257.md" | 414 ++++++++ fluxon_py/config.py | 15 + fluxon_py/tests/test_config.py | 49 + fluxon_rs/Cargo.lock | 2 + fluxon_rs/fluxon_fs/src/agent.rs | 8 +- 
.../fluxon_kv/src/client_seg_pool/mod.rs | 23 + fluxon_rs/fluxon_kv/src/config.rs | 150 ++- .../external_client_test.rs | 12 +- .../fluxon_kv/src/external_client_api/mod.rs | 7 + fluxon_rs/fluxon_kv/src/kvcore_test_lib.rs | 4 + fluxon_rs/fluxon_kv/src/lib.rs | 653 +++++++----- .../fluxon_kv/src/memholder/memholder_test.rs | 8 + fluxon_rs/fluxon_ops/Cargo.toml | 4 + fluxon_rs/fluxon_ops/build.rs | 13 + fluxon_rs/fluxon_ops/src/lib.rs | 250 ++++- fluxon_rs/fluxon_util/build.rs | 19 +- fluxon_rs/fluxon_util/src/lib.rs | 7 +- fluxon_rs/fluxon_util/src/log.rs | 380 +++++-- fluxon_rs/fluxon_util/tests/log_mgmt.rs | 120 +++ fluxon_test_stack/ci_2_virt_node.py | 2 + fluxon_test_stack/ci_test_list.yaml | 16 + fluxon_test_stack/deployconf_testbed.yml | 5 +- fluxon_test_stack/pack_test_stack_rsc.py | 261 +---- fluxon_test_stack/start_test_bed.py | 32 +- fluxon_test_stack/test_runner.py | 284 +++-- ...fluxon_fs_s3_download_and_exec.sh.template | 108 ++ fluxon_test_stack/test_runner_ui.py | 4 + .../tests/test_ci_2_virt_node_contract.py | 97 +- .../tests/test_pack_test_stack_rsc_cli.py | 125 ++- .../tests/test_runner_contract.py | 50 + .../test_test_runner_testbed_contract.py | 99 ++ .../tests/test_test_runner_ui_contract.py | 37 +- .../test_top_attention_log_mgmt_contract.py | 112 ++ .../top_attention_test_index/README.md | 1 + .../top_attention_test_index/_log_mgmt.py | 54 + scripts/git_source_selection.py | 163 +++ scripts/source_selection_profiles.py | 134 +++ setup_and_pack/nix/lib_layout.py | 35 +- setup_and_pack/nix/pack_fluxonkv_pylib.py | 232 +--- setup_and_pack/public_workspace_contract.py | 56 +- .../tests/test_git_source_selection_utils.py | 182 ++++ setup_and_pack/tests/test_lib_layout.py | 7 + ...est_pack_fluxonkv_pylib_bridge_prebuilt.py | 33 + setup_and_pack/utils/__init__.py | 2 + .../utils/artifact_cache_digest_utils.py | 29 +- skills/browser-helm/SKILL.md | 232 ++++ skills/browser-helm/agents/openai.yaml | 6 + skills/browser-helm/references/commands.md | 131 
+++ skills/canvas-dag_organizer-v1/SKILL.md | 10 + .../agents/openai.yaml | 6 + skills/canvas-ops-v1/SKILL.md | 10 + skills/canvas-ops-v1/agents/openai.yaml | 6 + skills/canvas-tidy_selection-v1/SKILL.md | 10 + .../agents/openai.yaml | 6 + skills/find-skills/SKILL.md | 133 +++ skills/imagegen/LICENSE.txt | 201 ++++ skills/imagegen/SKILL.md | 356 +++++++ skills/imagegen/agents/openai.yaml | 6 + skills/imagegen/assets/imagegen-small.svg | 5 + skills/imagegen/assets/imagegen.png | Bin 0 -> 1711 bytes skills/imagegen/references/cli.md | 242 +++++ skills/imagegen/references/codex-network.md | 33 + skills/imagegen/references/image-api.md | 90 ++ skills/imagegen/references/prompting.md | 118 +++ skills/imagegen/references/sample-prompts.md | 433 ++++++++ skills/imagegen/scripts/image_gen.py | 995 ++++++++++++++++++ skills/imagegen/scripts/remove_chroma_key.py | 440 ++++++++ skills/openai-docs/LICENSE.txt | 201 ++++ skills/openai-docs/SKILL.md | 167 +++ skills/openai-docs/agents/openai.yaml | 14 + skills/openai-docs/assets/openai-small.svg | 3 + skills/openai-docs/assets/openai.png | Bin 0 -> 1429 bytes skills/openai-docs/references/latest-model.md | 37 + .../openai-docs/references/prompting-guide.md | 244 +++++ .../openai-docs/references/upgrade-guide.md | 181 ++++ .../scripts/fetch-codex-manual.mjs | 598 +++++++++++ .../scripts/resolve-latest-model-info.js | 147 +++ skills/plugin-creator/SKILL.md | 243 +++++ skills/plugin-creator/agents/openai.yaml | 6 + .../assets/plugin-creator-small.svg | 3 + .../plugin-creator/assets/plugin-creator.png | Bin 0 -> 1563 bytes .../references/installing-and-updating.md | 143 +++ .../references/plugin-json-spec.md | 194 ++++ .../scripts/create_basic_plugin.py | 324 ++++++ .../scripts/read_marketplace_name.py | 48 + .../scripts/update_plugin_cachebuster.py | 78 ++ .../plugin-creator/scripts/validate_plugin.py | 593 +++++++++++ .../SKILL.md | 11 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 
10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 16 + .../agents/openai.yaml | 6 + .../SKILL.md | 27 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 15 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + .../SKILL.md | 10 + .../agents/openai.yaml | 6 + skills/rs-skill-smoke-09e1daf7/SKILL.md | 8 + skills/rs-skill-smoke-529efbc9/SKILL.md | 8 + skills/rs-skill-smoke-cde1029f/SKILL.md | 8 + skills/skill-creator/SKILL.md | 416 ++++++++ skills/skill-creator/agents/openai.yaml | 5 + .../assets/skill-creator-small.svg | 3 + skills/skill-creator/assets/skill-creator.png | Bin 0 -> 1563 bytes skills/skill-creator/license.txt | 202 ++++ .../skill-creator/references/openai_yaml.md | 49 + .../scripts/generate_openai_yaml.py | 226 ++++ skills/skill-creator/scripts/init_skill.py | 400 +++++++ .../skill-creator/scripts/quick_validate.py | 101 ++ skills/skill-installer/LICENSE.txt | 202 ++++ skills/skill-installer/SKILL.md | 58 + skills/skill-installer/agents/openai.yaml | 5 + .../assets/skill-installer-small.svg | 3 + .../assets/skill-installer.png | Bin 0 -> 1086 bytes .../skill-installer/scripts/github_utils.py | 21 + .../scripts/install-skill-from-github.py | 308 ++++++ skills/skill-installer/scripts/list-skills.py | 107 ++ 173 files changed, 13984 insertions(+), 1462 deletions(-) create mode 100644 deployment/templates/gen_bare_deploy_bash/atomic_group_node_resolution_tail.sh.tmpl create mode 100644 
deployment/templates/gen_bare_deploy_bash/atomic_group_service_block.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/atomic_group_start.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/atomic_group_stop.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/bare_entrypoint.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/common_node_resolution_tail.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/etcd_health_wait_block.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/host_prelude.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/selection_present_probe_fn.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/selection_supervisor_launch_wait_block.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/selection_supervisor_path_from_script_dir.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/standalone_start.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/standalone_start_body.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/standalone_stop.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/start_lock_block.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/tcp_ready_helpers.sh.tmpl create mode 100644 deployment/templates/gen_bare_deploy_bash/tcp_ready_wait_block.sh.tmpl create mode 100644 deployment/tests/test_log_shard.py create mode 100644 deployment/utils/log_shard.py create mode 100644 "fluxon_doc_cn/design/fluxon_0_\351\205\215\347\275\256\346\200\273\350\247\210.md" create mode 100644 "fluxon_doc_cn/design/log_1_\346\234\254\345\234\260\346\226\207\344\273\266\346\227\245\345\277\227\344\270\216Greptime_OTLP\345\257\274\345\207\272\351\223\276\350\267\257.md" create mode 100644 fluxon_rs/fluxon_util/tests/log_mgmt.rs create mode 100644 
fluxon_test_stack/test_runner_templates/payload_fluxon_fs_s3_download_and_exec.sh.template create mode 100644 fluxon_test_stack/tests/test_top_attention_log_mgmt_contract.py create mode 100644 fluxon_test_stack/top_attention_test_index/_log_mgmt.py create mode 100644 scripts/git_source_selection.py create mode 100644 scripts/source_selection_profiles.py create mode 100644 setup_and_pack/tests/test_git_source_selection_utils.py create mode 100644 skills/browser-helm/SKILL.md create mode 100644 skills/browser-helm/agents/openai.yaml create mode 100644 skills/browser-helm/references/commands.md create mode 100644 skills/canvas-dag_organizer-v1/SKILL.md create mode 100644 skills/canvas-dag_organizer-v1/agents/openai.yaml create mode 100644 skills/canvas-ops-v1/SKILL.md create mode 100644 skills/canvas-ops-v1/agents/openai.yaml create mode 100644 skills/canvas-tidy_selection-v1/SKILL.md create mode 100644 skills/canvas-tidy_selection-v1/agents/openai.yaml create mode 100644 skills/find-skills/SKILL.md create mode 100644 skills/imagegen/LICENSE.txt create mode 100644 skills/imagegen/SKILL.md create mode 100644 skills/imagegen/agents/openai.yaml create mode 100644 skills/imagegen/assets/imagegen-small.svg create mode 100644 skills/imagegen/assets/imagegen.png create mode 100644 skills/imagegen/references/cli.md create mode 100644 skills/imagegen/references/codex-network.md create mode 100644 skills/imagegen/references/image-api.md create mode 100644 skills/imagegen/references/prompting.md create mode 100644 skills/imagegen/references/sample-prompts.md create mode 100644 skills/imagegen/scripts/image_gen.py create mode 100644 skills/imagegen/scripts/remove_chroma_key.py create mode 100644 skills/openai-docs/LICENSE.txt create mode 100644 skills/openai-docs/SKILL.md create mode 100644 skills/openai-docs/agents/openai.yaml create mode 100644 skills/openai-docs/assets/openai-small.svg create mode 100644 skills/openai-docs/assets/openai.png create mode 100644 
skills/openai-docs/references/latest-model.md create mode 100644 skills/openai-docs/references/prompting-guide.md create mode 100644 skills/openai-docs/references/upgrade-guide.md create mode 100644 skills/openai-docs/scripts/fetch-codex-manual.mjs create mode 100644 skills/openai-docs/scripts/resolve-latest-model-info.js create mode 100644 skills/plugin-creator/SKILL.md create mode 100644 skills/plugin-creator/agents/openai.yaml create mode 100644 skills/plugin-creator/assets/plugin-creator-small.svg create mode 100644 skills/plugin-creator/assets/plugin-creator.png create mode 100644 skills/plugin-creator/references/installing-and-updating.md create mode 100644 skills/plugin-creator/references/plugin-json-spec.md create mode 100644 skills/plugin-creator/scripts/create_basic_plugin.py create mode 100644 skills/plugin-creator/scripts/read_marketplace_name.py create mode 100644 skills/plugin-creator/scripts/update_plugin_cachebuster.py create mode 100644 skills/plugin-creator/scripts/validate_plugin.py create mode 100644 skills/prompt-0ca565e9-3d44-45f1-832d-caa438aceddb/SKILL.md create mode 100644 skills/prompt-0ca565e9-3d44-45f1-832d-caa438aceddb/agents/openai.yaml create mode 100644 skills/prompt-1309ed22-5b5e-4774-9b85-41bb1b7cc971/SKILL.md create mode 100644 skills/prompt-1309ed22-5b5e-4774-9b85-41bb1b7cc971/agents/openai.yaml create mode 100644 skills/prompt-1323c8c8-88a0-40d2-89df-14fc9533a122/SKILL.md create mode 100644 skills/prompt-1323c8c8-88a0-40d2-89df-14fc9533a122/agents/openai.yaml create mode 100644 skills/prompt-144929a0-ae69-404b-9f58-a8696378e4e3/SKILL.md create mode 100644 skills/prompt-144929a0-ae69-404b-9f58-a8696378e4e3/agents/openai.yaml create mode 100644 skills/prompt-15d9a907-a363-4ec7-81ad-806f9418ad72/SKILL.md create mode 100644 skills/prompt-15d9a907-a363-4ec7-81ad-806f9418ad72/agents/openai.yaml create mode 100644 skills/prompt-193dd3cd-2722-413b-b88c-12c2af645f80/SKILL.md create mode 100644 
skills/prompt-193dd3cd-2722-413b-b88c-12c2af645f80/agents/openai.yaml create mode 100644 skills/prompt-2793a3a4-310f-40c8-ba5d-bc7f5c1cafd7/SKILL.md create mode 100644 skills/prompt-2793a3a4-310f-40c8-ba5d-bc7f5c1cafd7/agents/openai.yaml create mode 100644 skills/prompt-2d53cebd-afd4-4d35-94e9-74436da3148a/SKILL.md create mode 100644 skills/prompt-2d53cebd-afd4-4d35-94e9-74436da3148a/agents/openai.yaml create mode 100644 skills/prompt-2eaed145-d789-4b27-93b9-8ea990830b3a/SKILL.md create mode 100644 skills/prompt-2eaed145-d789-4b27-93b9-8ea990830b3a/agents/openai.yaml create mode 100644 skills/prompt-345530e6-2736-42c3-9d4e-da5f14b8b8cb/SKILL.md create mode 100644 skills/prompt-345530e6-2736-42c3-9d4e-da5f14b8b8cb/agents/openai.yaml create mode 100644 skills/prompt-566905c8-0ad8-4d7e-857a-1c38ac7e54ca/SKILL.md create mode 100644 skills/prompt-566905c8-0ad8-4d7e-857a-1c38ac7e54ca/agents/openai.yaml create mode 100644 skills/prompt-5e80deb4-c278-4424-a0f4-a3df4f3443d8/SKILL.md create mode 100644 skills/prompt-5e80deb4-c278-4424-a0f4-a3df4f3443d8/agents/openai.yaml create mode 100644 skills/prompt-615e1231-fe33-47f8-bf35-29fdf3766d98/SKILL.md create mode 100644 skills/prompt-615e1231-fe33-47f8-bf35-29fdf3766d98/agents/openai.yaml create mode 100644 skills/prompt-7ae16163-92c9-4fde-a74f-7c61eddd62f2/SKILL.md create mode 100644 skills/prompt-7ae16163-92c9-4fde-a74f-7c61eddd62f2/agents/openai.yaml create mode 100644 skills/prompt-8c5cc431-635c-4c94-9deb-a502e77160eb/SKILL.md create mode 100644 skills/prompt-8c5cc431-635c-4c94-9deb-a502e77160eb/agents/openai.yaml create mode 100644 skills/prompt-a7fb4e43-d1eb-4739-93b3-646d7a1c072c/SKILL.md create mode 100644 skills/prompt-a7fb4e43-d1eb-4739-93b3-646d7a1c072c/agents/openai.yaml create mode 100644 skills/prompt-ac42abf9-6df8-4539-99c7-e402e905a03b/SKILL.md create mode 100644 skills/prompt-ac42abf9-6df8-4539-99c7-e402e905a03b/agents/openai.yaml create mode 100644 skills/prompt-ae9ff67b-09d8-4848-bbde-aac1fb6e1315/SKILL.md 
create mode 100644 skills/prompt-ae9ff67b-09d8-4848-bbde-aac1fb6e1315/agents/openai.yaml create mode 100644 skills/prompt-f118ab91-390b-48e2-a962-3abe4d54211e/SKILL.md create mode 100644 skills/prompt-f118ab91-390b-48e2-a962-3abe4d54211e/agents/openai.yaml create mode 100644 skills/rs-skill-smoke-09e1daf7/SKILL.md create mode 100644 skills/rs-skill-smoke-529efbc9/SKILL.md create mode 100644 skills/rs-skill-smoke-cde1029f/SKILL.md create mode 100644 skills/skill-creator/SKILL.md create mode 100644 skills/skill-creator/agents/openai.yaml create mode 100644 skills/skill-creator/assets/skill-creator-small.svg create mode 100644 skills/skill-creator/assets/skill-creator.png create mode 100644 skills/skill-creator/license.txt create mode 100644 skills/skill-creator/references/openai_yaml.md create mode 100644 skills/skill-creator/scripts/generate_openai_yaml.py create mode 100644 skills/skill-creator/scripts/init_skill.py create mode 100644 skills/skill-creator/scripts/quick_validate.py create mode 100644 skills/skill-installer/LICENSE.txt create mode 100644 skills/skill-installer/SKILL.md create mode 100644 skills/skill-installer/agents/openai.yaml create mode 100644 skills/skill-installer/assets/skill-installer-small.svg create mode 100644 skills/skill-installer/assets/skill-installer.png create mode 100644 skills/skill-installer/scripts/github_utils.py create mode 100644 skills/skill-installer/scripts/install-skill-from-github.py create mode 100644 skills/skill-installer/scripts/list-skills.py diff --git a/.github/workflows/all_test.yml b/.github/workflows/all_test.yml index 4300c60..33cdd5b 100644 --- a/.github/workflows/all_test.yml +++ b/.github/workflows/all_test.yml @@ -86,10 +86,15 @@ jobs: # Scene selection: # - ci_top_attention_doc_page_build keeps the doc-site build as a CI scene workload. # - ci_top_attention_bin_kvtest keeps the Rust kv_test entry under the same CI scene contract. 
+ # - ci_top_attention_log_mgmt keeps log rolling/sharding coverage under the same CI scene contract. suite["scenes"] = { key: value for key, value in suite["scenes"].items() - if key in ("ci_top_attention_doc_page_build", "ci_top_attention_bin_kvtest") + if key in ( + "ci_top_attention_doc_page_build", + "ci_top_attention_bin_kvtest", + "ci_top_attention_log_mgmt", + ) } # Profile selection: @@ -107,11 +112,13 @@ jobs: suite["profiles"]["fluxon_tcp"]["runtime"]["ci"]["scene_configs"]["ci_top_attention_doc_page_build"]["doc_site_base_url"] = ( "${{ github.repository_owner }}.github.io/${{ github.event.repository.name }}" ) + suite["profiles"]["fluxon_tcp"]["runtime"]["ci"]["scene_configs"]["ci_top_attention_log_mgmt"]["enabled"] = True # Scale selection: # - Keep the original per-scene scales from ci_test_list.yaml. # - ci_top_attention_doc_page_build stays on n1_kvowner_dram_3gib. # - ci_top_attention_bin_kvtest stays on n1_kvowner_dram_20gib. + # - ci_top_attention_log_mgmt stays on n1_kvowner_dram_20gib. 
out_path.write_text( yaml.safe_dump(suite, sort_keys=False, allow_unicode=False), diff --git a/deployment/gen_bare_deploy_bash.py b/deployment/gen_bare_deploy_bash.py index ce51025..5503658 100644 --- a/deployment/gen_bare_deploy_bash.py +++ b/deployment/gen_bare_deploy_bash.py @@ -4,8 +4,10 @@ import argparse import json import os +import re import shlex import sys +from functools import lru_cache from pathlib import Path from typing import Any, Dict, List @@ -25,7 +27,9 @@ StopTimeouts, render_bash_proc_lifecycle_funcs_pid_tree, ) +from log_shard import render_module_source as render_log_shard_module_source # type: ignore from selection_supervisor_codegen import ( # type: ignore + LOG_SHARD_HELPER_FILENAME, PYTHON_SELECTION_SUPERVISOR_FILENAME, render_python_selection_supervisor_module, ) @@ -44,13 +48,36 @@ ATOMIC_GROUP_CRASHLOOP_CONSECUTIVE_RESTARTS = 10 ATOMIC_GROUP_CRASHLOOP_INTERVAL_LT_SECONDS = 30 ATOMIC_GROUP_PROBABLE_READY_SECONDS = 10 -STANDALONE_PROBABLE_READY_SECONDS = 3 -STANDALONE_STARTUP_DEADLINE_SECONDS = 60 -ATOMIC_GROUP_STARTUP_DEADLINE_SECONDS = 10 * 60 +STANDALONE_PROBABLE_READY_SECONDS = 10 +STANDALONE_STARTUP_DEADLINE_SECONDS = 10 +ATOMIC_GROUP_STARTUP_DEADLINE_SECONDS = 10 HOSTWORKDIR_RUNTIME_TOKEN = "${HOSTWORKDIR}" REPO_ROOT = SCRIPT_DIR.parent -TCP_READY_STABLE_SECONDS = 2 -TCP_READY_POLL_INTERVAL_SECONDS = 0.2 +BARE_TEMPLATE_DIR = SCRIPT_DIR / "templates" / "gen_bare_deploy_bash" +_TEMPLATE_TOKEN_RE = re.compile(r"\{\{([A-Z0-9_]+)\}\}") + + +@lru_cache(maxsize=None) +def _load_bare_template(*, template_name: str) -> str: + template_path = BARE_TEMPLATE_DIR / template_name + if not template_path.is_file(): + raise RuntimeError(f"missing bare deploy template: {template_path}") + return template_path.read_text(encoding="utf-8") + + +def _render_bare_template(*, template_name: str, values: Dict[str, str]) -> str: + template = _load_bare_template(template_name=template_name) + + def _replace(match: re.Match[str]) -> str: + key = 
match.group(1) + if key not in values: + raise RuntimeError(f"missing bare deploy template value: template={template_name} key={key}") + value = values[key] + if not isinstance(value, str): + raise ValueError(f"bare deploy template value must be a string: template={template_name} key={key}") + return value + + return _TEMPLATE_TOKEN_RE.sub(_replace, template) def _resolve_repo_root_cli_path(*, raw_path: Path, field_name: str) -> Path: @@ -89,6 +116,10 @@ def main() -> None: outdir / PYTHON_SELECTION_SUPERVISOR_FILENAME, render_python_selection_supervisor_module(timeouts=STOP_TIMEOUTS), ) + (outdir / LOG_SHARD_HELPER_FILENAME).write_text( + render_log_shard_module_source(), + encoding="utf-8", + ) name_prefix = _require_str(cfg.get("name_prefix"), "name_prefix") cluster_nodes_raw = _require_list(cfg.get("cluster_nodes"), "cluster_nodes") @@ -306,12 +337,12 @@ def _bare_entrypoint_script_name(*, workload_name: str) -> str: def _render_bare_entrypoint_script(*, service_name: str, entrypoint: str) -> str: - return ( - "#!/usr/bin/env bash\n" - "set -euo pipefail\n\n" - f"export SERVICE={_sh_quote(service_name)}\n" - + entrypoint.strip() - + "\n" + return _render_bare_template( + template_name="bare_entrypoint.sh.tmpl", + values={ + "SERVICE_EXPORT": _sh_quote(service_name), + "ENTRYPOINT": entrypoint.strip(), + }, ) @@ -353,29 +384,25 @@ def _render_standalone_start_script( service_cfg: Dict[str, Any], ) -> str: allowed_nodes = _extract_nodes(service_cfg) - service_port = _extract_port(service_cfg) - port_export = "" - if service_port is not None: - port_export = f"export {service_name.upper()}__PORT={_sh_quote(str(service_port))}\n" - return ( - "#!/usr/bin/env bash\n" - "set -euo pipefail\n\n" - f"SERVICE={_sh_quote(service_name)}\n" - f"NAME_PREFIX={_sh_quote(name_prefix)}\n" - + _render_nodes_bash(name="ALLOWED_NODES", nodes=allowed_nodes) - + _render_host_prelude(cluster_nodes=cluster_nodes) - + _render_common_node_resolution_tail(service_name=service_name) - + 
_render_selection_supervisor_path_from_script_dir() - + _render_proc_lifecycle_pid_tree_helpers() - + _render_tcp_ready_helpers() - + _render_selection_present_probe_fn() - + _render_start_lock_block() - + _render_global_env_exports(global_envs) - + port_export - + _render_standalone_start_body( - name_prefix=name_prefix, - service_name=service_name, - ) + return _render_bare_template( + template_name="standalone_start.sh.tmpl", + values={ + "SERVICE_ASSIGN": _sh_quote(service_name), + "NAME_PREFIX_ASSIGN": _sh_quote(name_prefix), + "ALLOWED_NODES_BLOCK": _render_nodes_bash(name="ALLOWED_NODES", nodes=allowed_nodes), + "HOST_PRELUDE": _render_host_prelude(cluster_nodes=cluster_nodes), + "COMMON_NODE_RESOLUTION_TAIL": _render_common_node_resolution_tail(service_name=service_name), + "SELECTION_SUPERVISOR_PATH_BLOCK": _render_selection_supervisor_path_from_script_dir(), + "PROC_LIFECYCLE_HELPERS": _render_proc_lifecycle_pid_tree_helpers(), + "SELECTION_PRESENT_PROBE_FN": _render_selection_present_probe_fn(), + "START_LOCK_BLOCK": _render_start_lock_block(), + "GLOBAL_ENV_EXPORTS": _render_global_env_exports(global_envs), + "PORT_EXPORT": _render_service_port_export(service_name=service_name, service_cfg=service_cfg), + "START_BODY": _render_standalone_start_body( + name_prefix=name_prefix, + service_name=service_name, + ), + }, ) @@ -387,25 +414,19 @@ def _render_standalone_stop_script( service_cfg: Dict[str, Any], ) -> str: allowed_nodes = _extract_nodes(service_cfg) - return ( - "#!/usr/bin/env bash\n" - "set -euo pipefail\n\n" - f"SERVICE={_sh_quote(service_name)}\n" - f"NAME_PREFIX={_sh_quote(name_prefix)}\n" - + _render_nodes_bash(name="ALLOWED_NODES", nodes=allowed_nodes) - + _render_host_prelude(cluster_nodes=cluster_nodes) - + _render_common_node_resolution_tail(service_name=service_name) - + _render_selection_supervisor_path_from_script_dir() - + f'SUPERVISOR_LABEL={_sh_quote(_bare_plain_selection_supervisor_label(name_prefix=name_prefix, 
service_name=service_name))}\n' - + "# English note:\n" - + "# - Generated bare stop is retained as a manual operator tool.\n" - + "# - Automation must not depend on this path for handover or rollout convergence.\n" - + "# - The command only asks the shared selection supervisor to retire the concrete selection\n" - + "# identity identified by label on this node.\n" - + 'if ! python3 "$SELECTION_SUPERVISOR" stop --label "$SUPERVISOR_LABEL" --scope-key "$HOSTWORKDIR" --missing-ok >/dev/null; then\n' - + ' echo "[bare] stop failed svc=$SERVICE label=$SUPERVISOR_LABEL hostworkdir=$HOSTWORKDIR"\n' - + " exit 1\n" - + "fi\n" + return _render_bare_template( + template_name="standalone_stop.sh.tmpl", + values={ + "SERVICE_ASSIGN": _sh_quote(service_name), + "NAME_PREFIX_ASSIGN": _sh_quote(name_prefix), + "ALLOWED_NODES_BLOCK": _render_nodes_bash(name="ALLOWED_NODES", nodes=allowed_nodes), + "HOST_PRELUDE": _render_host_prelude(cluster_nodes=cluster_nodes), + "COMMON_NODE_RESOLUTION_TAIL": _render_common_node_resolution_tail(service_name=service_name), + "SELECTION_SUPERVISOR_PATH_BLOCK": _render_selection_supervisor_path_from_script_dir(), + "SUPERVISOR_LABEL_ASSIGN": _sh_quote( + _bare_plain_selection_supervisor_label(name_prefix=name_prefix, service_name=service_name) + ), + }, ) @@ -429,20 +450,19 @@ def _render_atomic_group_start_script( service_cfg=service_cfg, ) ) - return ( - "#!/usr/bin/env bash\n" - "set -euo pipefail\n\n" - f"GROUP={_sh_quote(group_name)}\n" - f"NAME_PREFIX={_sh_quote(name_prefix)}\n" - + _render_host_prelude(cluster_nodes=cluster_nodes) - + _render_atomic_group_node_resolution_tail(group_cfg["nodes"]) - + _render_selection_supervisor_path_from_script_dir() - + _render_proc_lifecycle_pid_tree_helpers() - + _render_tcp_ready_helpers() - + _render_global_env_exports(global_envs) - + f"GROUP_STARTUP_DEADLINE_TS=$(( $(date +%s) + {ATOMIC_GROUP_STARTUP_DEADLINE_SECONDS} ))\n" - + "".join(service_blocks) - + 'echo "[atomic-group] ready group=$GROUP 
node=$NODE_ID"\n' + return _render_bare_template( + template_name="atomic_group_start.sh.tmpl", + values={ + "GROUP_ASSIGN": _sh_quote(group_name), + "NAME_PREFIX_ASSIGN": _sh_quote(name_prefix), + "HOST_PRELUDE": _render_host_prelude(cluster_nodes=cluster_nodes), + "ATOMIC_GROUP_NODE_RESOLUTION_TAIL": _render_atomic_group_node_resolution_tail(group_cfg["nodes"]), + "SELECTION_SUPERVISOR_PATH_BLOCK": _render_selection_supervisor_path_from_script_dir(), + "PROC_LIFECYCLE_HELPERS": _render_proc_lifecycle_pid_tree_helpers(), + "GLOBAL_ENV_EXPORTS": _render_global_env_exports(global_envs), + "GROUP_STARTUP_DEADLINE_ASSIGN": str(ATOMIC_GROUP_STARTUP_DEADLINE_SECONDS), + "SERVICE_BLOCKS": "".join(service_blocks), + }, ) @@ -454,276 +474,105 @@ def _render_atomic_group_stop_script( group_cfg: Dict[str, Any], ) -> str: stop_services = list(reversed(group_cfg["services"])) - return ( - "#!/usr/bin/env bash\n" - "set -u -o pipefail\n\n" - f"GROUP={_sh_quote(group_name)}\n" - f"NAME_PREFIX={_sh_quote(name_prefix)}\n" - + _render_host_prelude(cluster_nodes=cluster_nodes) - + _render_atomic_group_node_resolution_tail(group_cfg["nodes"]) - + _render_selection_supervisor_path_from_script_dir() - + _render_atomic_group_stop_fn( - runtime_specs=[ - { - "service_name": service_name, - "supervisor_label": _bare_atomic_group_member_selection_supervisor_label( - name_prefix=name_prefix, - group_name=group_name, - service_name=service_name, - ), - } - for service_name in stop_services - ], - ) - + "stop_group\n" + return _render_bare_template( + template_name="atomic_group_stop.sh.tmpl", + values={ + "GROUP_ASSIGN": _sh_quote(group_name), + "NAME_PREFIX_ASSIGN": _sh_quote(name_prefix), + "HOST_PRELUDE": _render_host_prelude(cluster_nodes=cluster_nodes), + "ATOMIC_GROUP_NODE_RESOLUTION_TAIL": _render_atomic_group_node_resolution_tail(group_cfg["nodes"]), + "SELECTION_SUPERVISOR_PATH_BLOCK": _render_selection_supervisor_path_from_script_dir(), + "ATOMIC_GROUP_STOP_FN": 
_render_atomic_group_stop_fn( + runtime_specs=[ + { + "service_name": service_name, + "supervisor_label": _bare_atomic_group_member_selection_supervisor_label( + name_prefix=name_prefix, + group_name=group_name, + service_name=service_name, + ), + } + for service_name in stop_services + ], + ), + }, ) def _render_host_prelude(*, cluster_nodes: List[Dict[str, Any]]) -> str: all_nodes = [_require_str(node.get("hostname"), "cluster_nodes[].hostname") for node in cluster_nodes] - out = _render_nodes_bash(name="ALL_NODES", nodes=all_nodes) - out += "\nLOCAL_HOSTNAME=$(hostname -s 2>/dev/null || hostname 2>/dev/null || echo unknown)\n" - out += 'LOCAL_FQDN=$(hostname -f 2>/dev/null || echo "$LOCAL_HOSTNAME")\n' - out += 'NODE_ID="${NODE_ID:-}"\n' - out += 'if [ -n "$NODE_ID" ]; then\n' - out += ' _node_id_known=false\n' - out += ' for n in "${ALL_NODES[@]}"; do\n' - out += ' if [ "$n" = "$NODE_ID" ]; then\n' - out += ' _node_id_known=true\n' - out += " break\n" - out += " fi\n" - out += " done\n" - out += ' if [ "$_node_id_known" != true ]; then\n' - out += ' echo "Unknown preset NODE_ID: $NODE_ID"\n' - out += f' echo "Known nodes: {" ".join(all_nodes)}"\n' - out += " exit 1\n" - out += " fi\n" - out += "fi\n" - out += 'if [ -z "$NODE_ID" ]; then\n' - out += 'for n in "${ALL_NODES[@]}"; do\n' - out += ' if [ "$n" = "$LOCAL_HOSTNAME" ] || [ "$n" = "$LOCAL_FQDN" ]; then\n' - out += ' NODE_ID="$n"\n' - out += " break\n" - out += " fi\n" - out += "done\n" - out += "fi\n" - out += 'if [ -z "$NODE_ID" ] && [ ${#ALL_NODES[@]} -eq 1 ]; then\n' - out += ' NODE_ID="${ALL_NODES[0]}"\n' - out += "fi\n" - out += 'if [ -z "$NODE_ID" ]; then\n' - out += ' for ip in $(hostname -I 2>/dev/null); do\n' - out += ' for n in "${ALL_NODES[@]}"; do\n' - out += ' _ip_n=""\n' - out += ' case "$n" in\n' - for node in cluster_nodes: - node_name = _require_str(node.get("hostname"), "cluster_nodes[].hostname") - node_ip = _require_str(node.get("ip"), f"cluster_nodes[{node_name}].ip") - out += f" 
{_sh_quote(node_name)}) _ip_n={_sh_quote(node_ip)};;\n" - out += ' *) _ip_n="";;\n' - out += " esac\n" - out += ' if [ "$_ip_n" = "$ip" ]; then\n' - out += ' NODE_ID="$n"\n' - out += " break\n" - out += " fi\n" - out += " done\n" - out += ' [ -n "$NODE_ID" ] && break\n' - out += " done\n" - out += "fi\n" - out += 'if [ -z "$NODE_ID" ]; then\n' - out += ' echo "Cannot map host to a configured node. Hostname=$LOCAL_HOSTNAME FQDN=$LOCAL_FQDN IPs=$(hostname -I 2>/dev/null)"\n' - out += f' echo "Known nodes: {" ".join(all_nodes)}"\n' - out += " exit 1\n" - out += "fi\n\n" - out += 'HOST_IP=""\nHOSTWORKDIR=""\ncase "$NODE_ID" in\n' + ip_case_lines: list[str] = [] + host_case_lines: list[str] = [] for node in cluster_nodes: node_name = _require_str(node.get("hostname"), "cluster_nodes[].hostname") node_ip = _require_str(node.get("ip"), f"cluster_nodes[{node_name}].ip") hostworkdir = _require_str(node.get("hostworkdir"), f"cluster_nodes[{node_name}].hostworkdir") - out += f" {_sh_quote(node_name)}) HOST_IP={_sh_quote(node_ip)}; HOSTWORKDIR={_sh_quote(hostworkdir)};;\n" - out += ' *) echo "Unknown NODE_ID: $NODE_ID"; exit 1;;\n' - out += "esac\n" - return out + ip_case_lines.append(f" {_sh_quote(node_name)}) _ip_n={_sh_quote(node_ip)};;") + host_case_lines.append( + f" {_sh_quote(node_name)}) HOST_IP={_sh_quote(node_ip)}; HOSTWORKDIR={_sh_quote(hostworkdir)};;" + ) + return _render_bare_template( + template_name="host_prelude.sh.tmpl", + values={ + "ALL_NODES_BLOCK": _render_nodes_bash(name="ALL_NODES", nodes=all_nodes), + "KNOWN_NODES": " ".join(all_nodes), + "IP_CASE_LINES": "\n".join(ip_case_lines), + "HOST_CASE_LINES": "\n".join(host_case_lines), + }, + ) def _render_common_node_resolution_tail(*, service_name: str) -> str: - return ( - 'if [ ${#ALLOWED_NODES[@]} -gt 0 ]; then\n' - + ' _ok=false\n' - + ' for n in "${ALLOWED_NODES[@]}"; do\n' - + ' if [ "$n" = "$NODE_ID" ]; then _ok=true; fi\n' - + " done\n" - + ' if [ "$_ok" != true ]; then\n' - + f' echo "Service 
{service_name} not scheduled on this node ($NODE_ID). Allowed: ${{ALLOWED_NODES[*]}}"\n' - + " exit 0\n" - + " fi\n" - + "fi\n\n" - + 'export NODE_ID="$NODE_ID"\n' - + 'export HOST_IP="$HOST_IP"\n' - + 'export HOSTWORKDIR="$HOSTWORKDIR"\n\n' + return _render_bare_template( + template_name="common_node_resolution_tail.sh.tmpl", + values={"SERVICE_NAME": service_name}, ) def _render_atomic_group_node_resolution_tail(allowed_nodes: List[str]) -> str: - return ( - _render_nodes_bash(name="GROUP_NODES", nodes=allowed_nodes) - + 'scheduled=false\n' - + 'for n in "${GROUP_NODES[@]}"; do\n' - + ' if [ "$n" = "$NODE_ID" ]; then scheduled=true; fi\n' - + "done\n" - + 'if [ "$scheduled" != true ]; then\n' - + ' echo "[atomic-group] skip group=$GROUP node=$NODE_ID allowed=${GROUP_NODES[*]}"\n' - + " exit 0\n" - + "fi\n\n" - + 'export NODE_ID="$NODE_ID"\n' - + 'export HOST_IP="$HOST_IP"\n' - + 'export HOSTWORKDIR="$HOSTWORKDIR"\n' - + 'echo "[atomic-group] group=$GROUP node=$NODE_ID hostworkdir=$HOSTWORKDIR"\n\n' + return _render_bare_template( + template_name="atomic_group_node_resolution_tail.sh.tmpl", + values={"GROUP_NODES_BLOCK": _render_nodes_bash(name="GROUP_NODES", nodes=allowed_nodes)}, ) def _render_start_lock_block() -> str: - return ( - 'PID_DIR="$HOSTWORKDIR/run"\n' - + 'mkdir -p "$PID_DIR"\n' - + 'START_LOCKFILE="$PID_DIR/${SERVICE}.start.lock"\n' - + 'if ! command -v flock >/dev/null 2>&1; then\n' - + ' echo "Missing required command: flock"\n' - + " exit 1\n" - + "fi\n" - + 'exec 9>"$START_LOCKFILE"\n' - + 'if ! 
flock -xn 9; then\n' - + ' echo "[bare] start skipped svc=$SERVICE reason=another start is already running lockfile=$START_LOCKFILE"\n' - + " exit 0\n" - + "fi\n" - + 'exec 9>&-\n\n' - ) + return _load_bare_template(template_name="start_lock_block.sh.tmpl") def _render_proc_lifecycle_pid_tree_helpers() -> str: return render_bash_proc_lifecycle_funcs_pid_tree(timeouts=STOP_TIMEOUTS) + "\n\n" -def _render_tcp_ready_helpers() -> str: - return ( - "wait_service_tcp_ready() {\n" - + ' svc="$1"\n' - + ' host="$2"\n' - + ' port="$3"\n' - + ' stable_seconds="$4"\n' - + ' deadline_ts="$5"\n' - + ' context="$6"\n' - + ' if [[ ! "$port" =~ ^[0-9]+$ ]]; then\n' - + ' echo "$context tcp-ready: invalid port svc=$svc port=$port"\n' - + " return 1\n" - + " fi\n" - + ' if [[ ! "$stable_seconds" =~ ^[0-9]+$ ]] || [ "$stable_seconds" -le 0 ]; then\n' - + ' echo "$context tcp-ready: invalid stable_seconds svc=$svc stable_seconds=$stable_seconds"\n' - + " return 1\n" - + " fi\n" - + f" poll_interval_seconds={TCP_READY_POLL_INTERVAL_SECONDS}\n" - + ' stable_checks=$(python3 - "$stable_seconds" "$poll_interval_seconds" <<\'__FLUXON_TCP_READY_CHECKS__\'\n' - + "import math\n" - + "import sys\n" - + "stable_seconds = float(sys.argv[1])\n" - + "poll_interval_seconds = float(sys.argv[2])\n" - + "print(max(1, int(math.ceil(stable_seconds / poll_interval_seconds))))\n" - + "__FLUXON_TCP_READY_CHECKS__\n" - + ")\n" - + ' if [[ ! 
"$stable_checks" =~ ^[0-9]+$ ]] || [ "$stable_checks" -le 0 ]; then\n' - + ' echo "$context tcp-ready: failed to compute stable_checks svc=$svc"\n' - + " return 1\n" - + " fi\n" - + " ok_checks=0\n" - + " while true; do\n" - + ' now=$(date +%s)\n' - + ' if [ "$now" -ge "$deadline_ts" ]; then\n' - + ' echo "$context tcp-ready: deadline exceeded svc=$svc host=$host port=$port"\n' - + " return 1\n" - + " fi\n" - + ' if python3 - "$host" "$port" <<\'__FLUXON_TCP_READY_PROBE__\'\n' - + "import socket\n" - + "import sys\n" - + "host = sys.argv[1]\n" - + "port = int(sys.argv[2])\n" - + "with socket.create_connection((host, port), timeout=1.0):\n" - + " pass\n" - + "__FLUXON_TCP_READY_PROBE__\n" - + " then\n" - + " ok_checks=$((ok_checks+1))\n" - + ' if [ "$ok_checks" -ge "$stable_checks" ]; then\n' - + ' echo "$context tcp-ready: ok svc=$svc host=$host port=$port stable_checks=$stable_checks"\n' - + " return 0\n" - + " fi\n" - + " else\n" - + ' if [ "$ok_checks" -ne 0 ]; then\n' - + ' echo "$context tcp-ready: reset svc=$svc ok_checks=$ok_checks host=$host port=$port"\n' - + " fi\n" - + " ok_checks=0\n" - + " fi\n" - + ' sleep "$poll_interval_seconds"\n' - + " done\n" - + "}\n\n" - ) - - def _render_selection_present_probe_fn() -> str: - return ( - "selection_present() {\n" - + " python3 - \"$SELECTION_SUPERVISOR\" \"$SUPERVISOR_LABEL\" \"$HOSTWORKDIR\" <<'__FLUXON_SELECTION_PRESENT__'\n" - + "import importlib.util\n" - + "import sys\n" - + "from pathlib import Path\n" - + "\n" - + "supervisor_path = Path(sys.argv[1])\n" - + "label = sys.argv[2]\n" - + "scope_key = sys.argv[3]\n" - + 'spec = importlib.util.spec_from_file_location("fluxon_selection_supervisor_probe", supervisor_path)\n' - + "if spec is None or spec.loader is None:\n" - + ' raise RuntimeError(f"failed to load selection supervisor module: {supervisor_path}")\n' - + "module = importlib.util.module_from_spec(spec)\n" - + "sys.modules[spec.name] = module\n" - + "spec.loader.exec_module(module)\n" - + "raise 
SystemExit(0 if module._selection_present(label, scope_key=scope_key) else 1)\n" - + "__FLUXON_SELECTION_PRESENT__\n" - + "}\n\n" - ) + return _load_bare_template(template_name="selection_present_probe_fn.sh.tmpl") def _render_selection_supervisor_launch_wait_block( *, run_cmd: str, - logfile_expr: str, stable_seconds_expr: str, deadline_ts_expr: str, context: str, ) -> str: - return ( - 'SUPERVISOR_PID=$( ' - + run_cmd - + f' >>{logfile_expr} 2>&1 < /dev/null & echo "$!" )\n' - + 'if [[ ! "$SUPERVISOR_PID" =~ ^[0-9]+$ ]]; then\n' - + f' echo "{context} launch failed svc=$SERVICE label=$SUPERVISOR_LABEL supervisor_pid=$SUPERVISOR_PID"\n' - + " exit 1\n" - + "fi\n" - + 'if ! wait_service_probably_ready_pid_tree "$SERVICE" "$SUPERVISOR_PID" ' - + stable_seconds_expr - + " " - + deadline_ts_expr - + f' "{context}"; then\n' - + f' echo "{context} probable-ready failed svc=$SERVICE label=$SUPERVISOR_LABEL supervisor_pid=$SUPERVISOR_PID"\n' - + " exit 1\n" - + "fi\n" + return _render_bare_template( + template_name="selection_supervisor_launch_wait_block.sh.tmpl", + values={ + "RUN_CMD": run_cmd, + "STABLE_SECONDS_EXPR": stable_seconds_expr, + "DEADLINE_TS_EXPR": deadline_ts_expr, + "CONTEXT": context, + }, ) -def _render_tcp_ready_wait_block(*, context: str) -> str: +def _render_service_port_export(*, service_name: str, service_cfg: Dict[str, Any], indent: str = "") -> str: + service_port = _extract_port(service_cfg) + if service_port is None: + return indent + "unset SERVICE_PORT\n" return ( - 'if [[ "${SERVICE_PORT:-}" =~ ^[0-9]+$ ]]; then\n' - + f' if ! 
wait_service_tcp_ready "$SERVICE" "$HOST_IP" "$SERVICE_PORT" {TCP_READY_STABLE_SECONDS} "$STARTUP_DEADLINE_TS" "{context}"; then\n' - + f' echo "{context} tcp-ready failed svc=$SERVICE host=$HOST_IP port=$SERVICE_PORT"\n' - + " exit 1\n" - + " fi\n" - + "fi\n" + indent + f"export {service_name.upper()}__PORT={_sh_quote(str(service_port))}\n" + + indent + f"export SERVICE_PORT={_sh_quote(str(service_port))}\n" ) @@ -759,54 +608,28 @@ def _render_standalone_start_body(*, name_prefix: str, service_name: str) -> str crashloop_interval_lt_seconds=0, child_command=child_command, ) - return ( - f'SUPERVISOR_LABEL={_sh_quote(_bare_plain_selection_supervisor_label(name_prefix=name_prefix, service_name=service_name))}\n' - + f'RUNTIME_STATE_JSON={_sh_quote(runtime_state_json)}\n' - + 'OWNER_TS_MS=$(python3 -c \'import time; print(int(time.time() * 1000))\')\n' - + f"STARTUP_DEADLINE_TS=$(( $(date +%s) + {STANDALONE_STARTUP_DEADLINE_SECONDS} ))\n" - + 'LOG_DIR="$HOSTWORKDIR/log"\n' - + 'LOGFILE="$LOG_DIR/${SERVICE}.log"\n' - + 'mkdir -p "$LOG_DIR"\n' - + 'touch "$LOGFILE"\n' - + 'echo "Starting $SERVICE on $NODE_ID (IP: $HOST_IP, workdir: $HOSTWORKDIR)"\n' - + "# English note:\n" - + "# - bootstrap bare start must be idempotent when the shared selection supervisor already owns\n" - + "# a live child for the same label.\n" - + "# - start_test_bed enables this path only for deployconf.bootstrap_bare_services.\n" - + 'if [ "${FLUXON_BARE_ALLOW_ALREADY_PRESENT:-false}" = "true" ]; then\n' - + " if selection_present; then\n" - + ' echo "[bare] already present svc=$SERVICE label=$SUPERVISOR_LABEL"\n' - + ' echo "Started $SERVICE (label: $SUPERVISOR_LABEL)"\n' - + ' echo "Logs: $LOGFILE"\n' - + " exit 0\n" - + " fi\n" - + "fi\n" - + "# English note:\n" - + "# - Bare start must not depend on extra supervisor observation subcommands because the shared\n" - + "# runtime surface is intentionally reduced to run/stop.\n" - + "# - We therefore launch the detached supervisor and wait until 
its pid subtree keeps a live child\n" - + "# process for a short stable window.\n" - + _render_selection_supervisor_launch_wait_block( - run_cmd=run_cmd, - logfile_expr='"$LOGFILE"', - stable_seconds_expr=str(STANDALONE_PROBABLE_READY_SECONDS), - deadline_ts_expr='"$STARTUP_DEADLINE_TS"', - context="[bare]", - ) - + _render_tcp_ready_wait_block(context="[bare]") - + 'echo "Started $SERVICE (label: $SUPERVISOR_LABEL)"\n' - + 'echo "Logs: $LOGFILE"\n' + return _render_bare_template( + template_name="standalone_start_body.sh.tmpl", + values={ + "SUPERVISOR_LABEL_ASSIGN": _sh_quote( + _bare_plain_selection_supervisor_label(name_prefix=name_prefix, service_name=service_name) + ), + "RUNTIME_STATE_JSON_ASSIGN": _sh_quote(runtime_state_json), + "STARTUP_DEADLINE_SECONDS": str(STANDALONE_STARTUP_DEADLINE_SECONDS), + "SELECTION_SUPERVISOR_LAUNCH_WAIT_BLOCK": _render_selection_supervisor_launch_wait_block( + run_cmd=run_cmd, + stable_seconds_expr=str(STANDALONE_PROBABLE_READY_SECONDS), + deadline_ts_expr='"$STARTUP_DEADLINE_TS"', + context="[bare]", + ), + }, ) def _render_selection_supervisor_path_from_script_dir() -> str: - return ( - 'DIR=$(cd "$(dirname "$0")" && pwd)\n' - + f'SELECTION_SUPERVISOR="$DIR/{PYTHON_SELECTION_SUPERVISOR_FILENAME}"\n' - + 'if [ ! 
-f "$SELECTION_SUPERVISOR" ]; then\n' - + ' echo "Missing selection supervisor: $SELECTION_SUPERVISOR"\n' - + " exit 1\n" - + "fi\n\n" + return _render_bare_template( + template_name="selection_supervisor_path_from_script_dir.sh.tmpl", + values={"SELECTION_SUPERVISOR_FILENAME": PYTHON_SELECTION_SUPERVISOR_FILENAME}, ) @@ -833,10 +656,6 @@ def _render_atomic_group_service_block( log_path=f"${{HOSTWORKDIR}}/log/{service_name}.log", ) allowed_nodes = _extract_nodes(service_cfg) - service_port = _extract_port(service_cfg) - port_export = "" - if service_port is not None: - port_export = f" export {service_name.upper()}__PORT={_sh_quote(str(service_port))}\n" run_cmd = _render_selection_supervisor_run_shell( subcommand="run", supervisor_expr='"$SELECTION_SUPERVISOR"', @@ -850,54 +669,37 @@ def _render_atomic_group_service_block( crashloop_interval_lt_seconds=ATOMIC_GROUP_CRASHLOOP_INTERVAL_LT_SECONDS, child_command=child_command, ) - return ( - f"\n# rollout: {service_name}\n" - + _render_nodes_bash(name="ALLOWED_NODES", nodes=allowed_nodes) - + "scheduled=false\n" - + 'for n in "${ALLOWED_NODES[@]}"; do\n' - + ' if [ "$n" = "$NODE_ID" ]; then scheduled=true; fi\n' - + "done\n" - + 'if [ "$scheduled" != true ]; then\n' - + f' echo "[rollout] skip {service_name}: not scheduled on node $NODE_ID"\n' - + "else\n" - + f" export SERVICE={_sh_quote(service_name)}\n" - + port_export - + ' LOG_DIR="$HOSTWORKDIR/log"\n' - + ' mkdir -p "$LOG_DIR"\n' - + f' SUPERVISOR_LABEL={_sh_quote(_bare_atomic_group_member_selection_supervisor_label(name_prefix=name_prefix, group_name=group_name, service_name=service_name))}\n' - + f' RUNTIME_STATE_JSON={_sh_quote(runtime_state_json)}\n' - + ' OWNER_TS_MS=$(python3 -c \'import time; print(int(time.time() * 1000))\')\n' - + f' LOGFILE="$HOSTWORKDIR/log/{service_name}.log"\n' - + ' touch "$LOGFILE"\n' - + f' echo "[rollout] start {service_name} node=$NODE_ID hostworkdir=$HOSTWORKDIR"\n' - + " # English note:\n" - + " # - Atomic-group order still 
depends on a readiness gate, but that gate now observes only the\n" - + " # detached supervisor process subtree on this host.\n" - + " # - Ownership stays inside the shared selection supervisor big loop; the group runner only waits\n" - + " # until that loop has a stable live child before advancing to the next service.\n" - # English note: - # - The embedded `run_cmd` contains a nested `bash -lc` payload, and that payload may contain - # heredocs used by real service entrypoints. - # - A blind newline replacement would shift heredoc terminators away from column 0 inside the - # child shell and silently turn valid entrypoints into immediate no-op exits. - # - Indent only the outer block lines while preserving each inner line start exactly. - + _indent_script_block( - script=_render_selection_supervisor_launch_wait_block( - run_cmd=run_cmd, - logfile_expr='"$LOGFILE"', - stable_seconds_expr=str(ATOMIC_GROUP_PROBABLE_READY_SECONDS), - deadline_ts_expr='"$GROUP_STARTUP_DEADLINE_TS"', - context="[rollout]", - ).rstrip() + "\n", - prefix=" ", - ).rstrip() - + "\n" - + _indent_script_block( - script=_render_tcp_ready_wait_block(context="[rollout]"), - prefix=" ", - ).rstrip() - + "\n" - + "fi\n" + return _render_bare_template( + template_name="atomic_group_service_block.sh.tmpl", + values={ + "SERVICE_NAME": service_name, + "ALLOWED_NODES_BLOCK": _render_nodes_bash(name="ALLOWED_NODES", nodes=allowed_nodes), + "SERVICE_EXPORT": _sh_quote(service_name), + "PORT_EXPORT": _render_service_port_export( + service_name=service_name, + service_cfg=service_cfg, + indent=" ", + ), + "SUPERVISOR_LABEL_ASSIGN": _sh_quote( + _bare_atomic_group_member_selection_supervisor_label( + name_prefix=name_prefix, + group_name=group_name, + service_name=service_name, + ) + ), + "RUNTIME_STATE_JSON_ASSIGN": _sh_quote(runtime_state_json), + "LOGFILE_PATH": f"$HOSTWORKDIR/log/{service_name}.log", + "INDENTED_SELECTION_SUPERVISOR_LAUNCH_WAIT_BLOCK": _indent_script_block( + 
script=_render_selection_supervisor_launch_wait_block( + run_cmd=run_cmd, + stable_seconds_expr=str(ATOMIC_GROUP_PROBABLE_READY_SECONDS), + deadline_ts_expr='"$GROUP_STARTUP_DEADLINE_TS"', + context="[rollout]", + ).rstrip() + + "\n", + prefix=" ", + ).rstrip(), + }, ) diff --git a/deployment/templates/gen_bare_deploy_bash/atomic_group_node_resolution_tail.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/atomic_group_node_resolution_tail.sh.tmpl new file mode 100644 index 0000000..d385995 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/atomic_group_node_resolution_tail.sh.tmpl @@ -0,0 +1,14 @@ +{{GROUP_NODES_BLOCK}}scheduled=false +for n in "${GROUP_NODES[@]}"; do + if [ "$n" = "$NODE_ID" ]; then scheduled=true; fi +done +if [ "$scheduled" != true ]; then + echo "[atomic-group] skip group=$GROUP node=$NODE_ID allowed=${GROUP_NODES[*]}" + exit 0 +fi + +export NODE_ID="$NODE_ID" +export HOST_IP="$HOST_IP" +export HOSTWORKDIR="$HOSTWORKDIR" +echo "[atomic-group] group=$GROUP node=$NODE_ID hostworkdir=$HOSTWORKDIR" + diff --git a/deployment/templates/gen_bare_deploy_bash/atomic_group_service_block.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/atomic_group_service_block.sh.tmpl new file mode 100644 index 0000000..6ad9a1a --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/atomic_group_service_block.sh.tmpl @@ -0,0 +1,24 @@ + +# rollout: {{SERVICE_NAME}} +{{ALLOWED_NODES_BLOCK}}scheduled=false +for n in "${ALLOWED_NODES[@]}"; do + if [ "$n" = "$NODE_ID" ]; then scheduled=true; fi +done +if [ "$scheduled" != true ]; then + echo "[rollout] skip {{SERVICE_NAME}}: not scheduled on node $NODE_ID" +else + export SERVICE={{SERVICE_EXPORT}} +{{PORT_EXPORT}} LOG_DIR="$HOSTWORKDIR/log" + mkdir -p "$LOG_DIR" + SUPERVISOR_LABEL={{SUPERVISOR_LABEL_ASSIGN}} + RUNTIME_STATE_JSON={{RUNTIME_STATE_JSON_ASSIGN}} + OWNER_TS_MS=$(python3 -c 'import time; print(int(time.time() * 1000))') + LOGFILE="{{LOGFILE_PATH}}" + echo "[rollout] start {{SERVICE_NAME}} 
node=$NODE_ID hostworkdir=$HOSTWORKDIR" + # English note: + # - Atomic-group order still depends on a readiness gate, but that gate now observes only the + # detached supervisor process subtree on this host. + # - Ownership stays inside the shared selection supervisor big loop; the group runner only waits + # through the fixed startup observation window before advancing to the next service. +{{INDENTED_SELECTION_SUPERVISOR_LAUNCH_WAIT_BLOCK}} +fi diff --git a/deployment/templates/gen_bare_deploy_bash/atomic_group_start.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/atomic_group_start.sh.tmpl new file mode 100644 index 0000000..d0c82ad --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/atomic_group_start.sh.tmpl @@ -0,0 +1,7 @@ +#!/usr/bin/env bash +set -euo pipefail + +GROUP={{GROUP_ASSIGN}} +NAME_PREFIX={{NAME_PREFIX_ASSIGN}} +{{HOST_PRELUDE}}{{ATOMIC_GROUP_NODE_RESOLUTION_TAIL}}{{SELECTION_SUPERVISOR_PATH_BLOCK}}{{PROC_LIFECYCLE_HELPERS}}{{GLOBAL_ENV_EXPORTS}}GROUP_STARTUP_DEADLINE_TS=$(( $(date +%s) + {{GROUP_STARTUP_DEADLINE_ASSIGN}} )) +{{SERVICE_BLOCKS}}echo "[atomic-group] ready group=$GROUP node=$NODE_ID" diff --git a/deployment/templates/gen_bare_deploy_bash/atomic_group_stop.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/atomic_group_stop.sh.tmpl new file mode 100644 index 0000000..5501b8f --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/atomic_group_stop.sh.tmpl @@ -0,0 +1,6 @@ +#!/usr/bin/env bash +set -u -o pipefail + +GROUP={{GROUP_ASSIGN}} +NAME_PREFIX={{NAME_PREFIX_ASSIGN}} +{{HOST_PRELUDE}}{{ATOMIC_GROUP_NODE_RESOLUTION_TAIL}}{{SELECTION_SUPERVISOR_PATH_BLOCK}}{{ATOMIC_GROUP_STOP_FN}}stop_group diff --git a/deployment/templates/gen_bare_deploy_bash/bare_entrypoint.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/bare_entrypoint.sh.tmpl new file mode 100644 index 0000000..39db682 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/bare_entrypoint.sh.tmpl @@ -0,0 +1,5 @@ +#!/usr/bin/env bash +set -euo 
pipefail + +export SERVICE={{SERVICE_EXPORT}} +{{ENTRYPOINT}} diff --git a/deployment/templates/gen_bare_deploy_bash/common_node_resolution_tail.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/common_node_resolution_tail.sh.tmpl new file mode 100644 index 0000000..e0cb433 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/common_node_resolution_tail.sh.tmpl @@ -0,0 +1,15 @@ +if [ ${#ALLOWED_NODES[@]} -gt 0 ]; then + _ok=false + for n in "${ALLOWED_NODES[@]}"; do + if [ "$n" = "$NODE_ID" ]; then _ok=true; fi + done + if [ "$_ok" != true ]; then + echo "Service {{SERVICE_NAME}} not scheduled on this node ($NODE_ID). Allowed: ${ALLOWED_NODES[*]}" + exit 0 + fi +fi + +export NODE_ID="$NODE_ID" +export HOST_IP="$HOST_IP" +export HOSTWORKDIR="$HOSTWORKDIR" + diff --git a/deployment/templates/gen_bare_deploy_bash/etcd_health_wait_block.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/etcd_health_wait_block.sh.tmpl new file mode 100644 index 0000000..b424bc3 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/etcd_health_wait_block.sh.tmpl @@ -0,0 +1,4 @@ +if ! 
wait_service_etcd_endpoint_healthy "$SERVICE" "$HOSTWORKDIR/fluxon_release/ext_images/etcd/etcdctl" "http://$HOST_IP:$SERVICE_PORT" {{ETCD_HEALTH_STABLE_SECONDS}} {{ETCD_HEALTH_DEADLINE_TS}} "{{CONTEXT}}"; then + echo "{{CONTEXT}} etcd-health failed svc=$SERVICE endpoint=http://$HOST_IP:$SERVICE_PORT" + exit 1 +fi diff --git a/deployment/templates/gen_bare_deploy_bash/host_prelude.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/host_prelude.sh.tmpl new file mode 100644 index 0000000..6075106 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/host_prelude.sh.tmpl @@ -0,0 +1,57 @@ +{{ALL_NODES_BLOCK}} +LOCAL_HOSTNAME=$(hostname -s 2>/dev/null || hostname 2>/dev/null || echo unknown) +LOCAL_FQDN=$(hostname -f 2>/dev/null || echo "$LOCAL_HOSTNAME") +NODE_ID="${NODE_ID:-}" +if [ -n "$NODE_ID" ]; then + _node_id_known=false + for n in "${ALL_NODES[@]}"; do + if [ "$n" = "$NODE_ID" ]; then + _node_id_known=true + break + fi + done + if [ "$_node_id_known" != true ]; then + echo "Unknown preset NODE_ID: $NODE_ID" + echo "Known nodes: {{KNOWN_NODES}}" + exit 1 + fi +fi +if [ -z "$NODE_ID" ]; then +for n in "${ALL_NODES[@]}"; do + if [ "$n" = "$LOCAL_HOSTNAME" ] || [ "$n" = "$LOCAL_FQDN" ]; then + NODE_ID="$n" + break + fi +done +fi +if [ -z "$NODE_ID" ] && [ ${#ALL_NODES[@]} -eq 1 ]; then + NODE_ID="${ALL_NODES[0]}" +fi +if [ -z "$NODE_ID" ]; then + for ip in $(hostname -I 2>/dev/null); do + for n in "${ALL_NODES[@]}"; do + _ip_n="" + case "$n" in +{{IP_CASE_LINES}} + *) _ip_n="";; + esac + if [ "$_ip_n" = "$ip" ]; then + NODE_ID="$n" + break + fi + done + [ -n "$NODE_ID" ] && break + done +fi +if [ -z "$NODE_ID" ]; then + echo "Cannot map host to a configured node. 
Hostname=$LOCAL_HOSTNAME FQDN=$LOCAL_FQDN IPs=$(hostname -I 2>/dev/null)" + echo "Known nodes: {{KNOWN_NODES}}" + exit 1 +fi + +HOST_IP="" +HOSTWORKDIR="" +case "$NODE_ID" in +{{HOST_CASE_LINES}} + *) echo "Unknown NODE_ID: $NODE_ID"; exit 1;; +esac diff --git a/deployment/templates/gen_bare_deploy_bash/selection_present_probe_fn.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/selection_present_probe_fn.sh.tmpl new file mode 100644 index 0000000..0a7282b --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/selection_present_probe_fn.sh.tmpl @@ -0,0 +1,19 @@ +selection_present() { + python3 - "$SELECTION_SUPERVISOR" "$SUPERVISOR_LABEL" "$HOSTWORKDIR" <<'__FLUXON_SELECTION_PRESENT__' +import importlib.util +import sys +from pathlib import Path + +supervisor_path = Path(sys.argv[1]) +label = sys.argv[2] +scope_key = sys.argv[3] +spec = importlib.util.spec_from_file_location("fluxon_selection_supervisor_probe", supervisor_path) +if spec is None or spec.loader is None: + raise RuntimeError(f"failed to load selection supervisor module: {supervisor_path}") +module = importlib.util.module_from_spec(spec) +sys.modules[spec.name] = module +spec.loader.exec_module(module) +raise SystemExit(0 if module._selection_present(label, scope_key=scope_key) else 1) +__FLUXON_SELECTION_PRESENT__ +} + diff --git a/deployment/templates/gen_bare_deploy_bash/selection_supervisor_launch_wait_block.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/selection_supervisor_launch_wait_block.sh.tmpl new file mode 100644 index 0000000..f466cbc --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/selection_supervisor_launch_wait_block.sh.tmpl @@ -0,0 +1,9 @@ +SUPERVISOR_PID=$( {{RUN_CMD}} < /dev/null & echo "$!" ) +if [[ ! "$SUPERVISOR_PID" =~ ^[0-9]+$ ]]; then + echo "{{CONTEXT}} launch failed svc=$SERVICE label=$SUPERVISOR_LABEL supervisor_pid=$SUPERVISOR_PID" + exit 1 +fi +if ! 
wait_service_probably_ready_pid_tree "$SERVICE" "$SUPERVISOR_PID" {{STABLE_SECONDS_EXPR}} {{DEADLINE_TS_EXPR}} "{{CONTEXT}}"; then + echo "{{CONTEXT}} probable-ready failed svc=$SERVICE label=$SUPERVISOR_LABEL supervisor_pid=$SUPERVISOR_PID" + exit 1 +fi diff --git a/deployment/templates/gen_bare_deploy_bash/selection_supervisor_path_from_script_dir.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/selection_supervisor_path_from_script_dir.sh.tmpl new file mode 100644 index 0000000..dac7dff --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/selection_supervisor_path_from_script_dir.sh.tmpl @@ -0,0 +1,7 @@ +DIR=$(cd "$(dirname "$0")" && pwd) +SELECTION_SUPERVISOR="$DIR/{{SELECTION_SUPERVISOR_FILENAME}}" +if [ ! -f "$SELECTION_SUPERVISOR" ]; then + echo "Missing selection supervisor: $SELECTION_SUPERVISOR" + exit 1 +fi + diff --git a/deployment/templates/gen_bare_deploy_bash/standalone_start.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/standalone_start.sh.tmpl new file mode 100644 index 0000000..5a565f1 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/standalone_start.sh.tmpl @@ -0,0 +1,6 @@ +#!/usr/bin/env bash +set -euo pipefail + +SERVICE={{SERVICE_ASSIGN}} +NAME_PREFIX={{NAME_PREFIX_ASSIGN}} +{{ALLOWED_NODES_BLOCK}}{{HOST_PRELUDE}}{{COMMON_NODE_RESOLUTION_TAIL}}{{SELECTION_SUPERVISOR_PATH_BLOCK}}{{PROC_LIFECYCLE_HELPERS}}{{SELECTION_PRESENT_PROBE_FN}}{{START_LOCK_BLOCK}}{{GLOBAL_ENV_EXPORTS}}{{PORT_EXPORT}}{{START_BODY}} diff --git a/deployment/templates/gen_bare_deploy_bash/standalone_start_body.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/standalone_start_body.sh.tmpl new file mode 100644 index 0000000..bc2fc40 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/standalone_start_body.sh.tmpl @@ -0,0 +1,27 @@ +SUPERVISOR_LABEL={{SUPERVISOR_LABEL_ASSIGN}} +RUNTIME_STATE_JSON={{RUNTIME_STATE_JSON_ASSIGN}} +OWNER_TS_MS=$(python3 -c 'import time; print(int(time.time() * 1000))') +STARTUP_DEADLINE_TS=$(( $(date +%s) + 
{{STARTUP_DEADLINE_SECONDS}} )) +LOG_DIR="$HOSTWORKDIR/log" +LOGFILE="$LOG_DIR/${SERVICE}.log" +mkdir -p "$LOG_DIR" +echo "Starting $SERVICE on $NODE_ID (IP: $HOST_IP, workdir: $HOSTWORKDIR)" +# English note: +# - bootstrap bare start must be idempotent when the shared selection supervisor already owns +# a live child for the same label. +# - start_test_bed enables this path only for deployconf.bootstrap_bare_services. +if [ "${FLUXON_BARE_ALLOW_ALREADY_PRESENT:-false}" = "true" ]; then + if selection_present; then + echo "[bare] already present svc=$SERVICE label=$SUPERVISOR_LABEL" + echo "Started $SERVICE (label: $SUPERVISOR_LABEL)" + echo "Logs: $LOGFILE" + exit 0 + fi +fi +# English note: +# - Bare start must not depend on extra supervisor observation subcommands because the shared +# runtime surface is intentionally reduced to run/stop. +# - We therefore launch the detached supervisor and wait until its pid subtree keeps a live child +# process alive across the fixed startup observation window. +{{SELECTION_SUPERVISOR_LAUNCH_WAIT_BLOCK}}echo "Started $SERVICE (label: $SUPERVISOR_LABEL)" +echo "Logs: $LOGFILE" diff --git a/deployment/templates/gen_bare_deploy_bash/standalone_stop.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/standalone_stop.sh.tmpl new file mode 100644 index 0000000..4f7dc37 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/standalone_stop.sh.tmpl @@ -0,0 +1,15 @@ +#!/usr/bin/env bash +set -euo pipefail + +SERVICE={{SERVICE_ASSIGN}} +NAME_PREFIX={{NAME_PREFIX_ASSIGN}} +{{ALLOWED_NODES_BLOCK}}{{HOST_PRELUDE}}{{COMMON_NODE_RESOLUTION_TAIL}}{{SELECTION_SUPERVISOR_PATH_BLOCK}}SUPERVISOR_LABEL={{SUPERVISOR_LABEL_ASSIGN}} +# English note: +# - Generated bare stop is retained as a manual operator tool. +# - Automation must not depend on this path for handover or rollout convergence. +# - The command only asks the shared selection supervisor to retire the concrete selection +# identity identified by label on this node. +if ! 
python3 "$SELECTION_SUPERVISOR" stop --label "$SUPERVISOR_LABEL" --scope-key "$HOSTWORKDIR" --missing-ok >/dev/null; then + echo "[bare] stop failed svc=$SERVICE label=$SUPERVISOR_LABEL hostworkdir=$HOSTWORKDIR" + exit 1 +fi diff --git a/deployment/templates/gen_bare_deploy_bash/start_lock_block.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/start_lock_block.sh.tmpl new file mode 100644 index 0000000..47ec770 --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/start_lock_block.sh.tmpl @@ -0,0 +1,14 @@ +PID_DIR="$HOSTWORKDIR/run" +mkdir -p "$PID_DIR" +START_LOCKFILE="$PID_DIR/${SERVICE}.start.lock" +if ! command -v flock >/dev/null 2>&1; then + echo "Missing required command: flock" + exit 1 +fi +exec 9>"$START_LOCKFILE" +if ! flock -xn 9; then + echo "[bare] start skipped svc=$SERVICE reason=another start is already running lockfile=$START_LOCKFILE" + exit 0 +fi +exec 9>&- + diff --git a/deployment/templates/gen_bare_deploy_bash/tcp_ready_helpers.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/tcp_ready_helpers.sh.tmpl new file mode 100644 index 0000000..0c0cc3b --- /dev/null +++ b/deployment/templates/gen_bare_deploy_bash/tcp_ready_helpers.sh.tmpl @@ -0,0 +1,120 @@ +wait_service_tcp_ready() { + svc="$1" + host="$2" + port="$3" + stable_seconds="$4" + deadline_ts="$5" + context="$6" + if [[ ! "$port" =~ ^[0-9]+$ ]]; then + echo "$context tcp-ready: invalid port svc=$svc port=$port" + return 1 + fi + if [[ ! 
"$stable_seconds" =~ ^[0-9]+$ ]] || [ "$stable_seconds" -le 0 ]; then + echo "$context tcp-ready: invalid stable_seconds svc=$svc stable_seconds=$stable_seconds" + return 1 + fi + poll_interval_seconds={{TCP_READY_POLL_INTERVAL_SECONDS}} + stable_checks=$(python3 - "$stable_seconds" "$poll_interval_seconds" <<'__FLUXON_TCP_READY_CHECKS__' +import math +import sys +stable_seconds = float(sys.argv[1]) +poll_interval_seconds = float(sys.argv[2]) +print(max(1, int(math.ceil(stable_seconds / poll_interval_seconds)))) +__FLUXON_TCP_READY_CHECKS__ +) + if [[ ! "$stable_checks" =~ ^[0-9]+$ ]] || [ "$stable_checks" -le 0 ]; then + echo "$context tcp-ready: failed to compute stable_checks svc=$svc" + return 1 + fi + ok_checks=0 + while true; do + now=$(date +%s) + if [ "$now" -ge "$deadline_ts" ]; then + echo "$context tcp-ready: deadline exceeded svc=$svc host=$host port=$port" + return 1 + fi + if python3 - "$host" "$port" <<'__FLUXON_TCP_READY_PROBE__' +import socket +import sys +host = sys.argv[1] +port = int(sys.argv[2]) +with socket.create_connection((host, port), timeout=1.0): + pass +__FLUXON_TCP_READY_PROBE__ + then + ok_checks=$((ok_checks+1)) + if [ "$ok_checks" -ge "$stable_checks" ]; then + echo "$context tcp-ready: ok svc=$svc host=$host port=$port stable_checks=$stable_checks" + return 0 + fi + else + if [ "$ok_checks" -ne 0 ]; then + echo "$context tcp-ready: reset svc=$svc ok_checks=$ok_checks host=$host port=$port" + fi + ok_checks=0 + fi + sleep "$poll_interval_seconds" + done +} + +wait_service_etcd_endpoint_healthy() { + svc="$1" + etcdctl_bin="$2" + endpoint="$3" + stable_seconds="$4" + deadline_ts="$5" + context="$6" + if [ ! -x "$etcdctl_bin" ]; then + echo "$context etcd-health: missing etcdctl svc=$svc path=$etcdctl_bin" + return 1 + fi + if [ -z "$endpoint" ]; then + echo "$context etcd-health: missing endpoint svc=$svc" + return 1 + fi + if [[ ! 
"$stable_seconds" =~ ^[0-9]+$ ]] || [ "$stable_seconds" -le 0 ]; then + echo "$context etcd-health: invalid stable_seconds svc=$svc stable_seconds=$stable_seconds" + return 1 + fi + poll_interval_seconds={{ETCD_HEALTH_POLL_INTERVAL_SECONDS}} + stable_checks=$(python3 - "$stable_seconds" "$poll_interval_seconds" <<'__FLUXON_ETCD_HEALTH_CHECKS__' +import math +import sys +stable_seconds = float(sys.argv[1]) +poll_interval_seconds = float(sys.argv[2]) +print(max(1, int(math.ceil(stable_seconds / poll_interval_seconds)))) +__FLUXON_ETCD_HEALTH_CHECKS__ +) + if [[ ! "$stable_checks" =~ ^[0-9]+$ ]] || [ "$stable_checks" -le 0 ]; then + echo "$context etcd-health: failed to compute stable_checks svc=$svc" + return 1 + fi + ok_checks=0 + last_output="" + while true; do + now=$(date +%s) + if [ "$now" -ge "$deadline_ts" ]; then + if [ -n "$last_output" ]; then + last_output="${last_output//$'\n'/ }" + echo "$context etcd-health: deadline exceeded svc=$svc endpoint=$endpoint last_output=$last_output" + else + echo "$context etcd-health: deadline exceeded svc=$svc endpoint=$endpoint" + fi + return 1 + fi + if probe_output=$(ETCDCTL_API=3 "$etcdctl_bin" --endpoints "$endpoint" --dial-timeout "{{ETCD_HEALTH_PROBE_TIMEOUT_MS}}ms" --command-timeout "{{ETCD_HEALTH_PROBE_TIMEOUT_MS}}ms" endpoint health 2>&1); then + ok_checks=$((ok_checks+1)) + if [ "$ok_checks" -ge "$stable_checks" ]; then + echo "$context etcd-health: ok svc=$svc endpoint=$endpoint stable_checks=$stable_checks" + return 0 + fi + else + last_output="$probe_output" + if [ "$ok_checks" -ne 0 ]; then + echo "$context etcd-health: reset svc=$svc ok_checks=$ok_checks endpoint=$endpoint" + fi + ok_checks=0 + fi + sleep "$poll_interval_seconds" + done +} diff --git a/deployment/templates/gen_bare_deploy_bash/tcp_ready_wait_block.sh.tmpl b/deployment/templates/gen_bare_deploy_bash/tcp_ready_wait_block.sh.tmpl new file mode 100644 index 0000000..bbf021b --- /dev/null +++ 
b/deployment/templates/gen_bare_deploy_bash/tcp_ready_wait_block.sh.tmpl @@ -0,0 +1,6 @@ +if [[ "${SERVICE_PORT:-}" =~ ^[0-9]+$ ]]; then + if ! wait_service_tcp_ready "$SERVICE" "$HOST_IP" "$SERVICE_PORT" {{TCP_READY_STABLE_SECONDS}} {{TCP_READY_DEADLINE_TS}} "{{CONTEXT}}"; then + echo "{{CONTEXT}} tcp-ready failed svc=$SERVICE host=$HOST_IP port=$SERVICE_PORT" + exit 1 + fi +fi diff --git a/deployment/tests/test_gen_bare_deploy_bash.py b/deployment/tests/test_gen_bare_deploy_bash.py index f51a923..f1645a3 100644 --- a/deployment/tests/test_gen_bare_deploy_bash.py +++ b/deployment/tests/test_gen_bare_deploy_bash.py @@ -13,6 +13,8 @@ from pathlib import Path from typing import Callable, List, Optional, Tuple +import yaml + SCRIPT_DIR = Path(__file__).resolve().parent DEPLOYMENT_DIR = SCRIPT_DIR.parent @@ -50,6 +52,12 @@ def _build_checks(selected_test_id: Optional[str]) -> List[Tuple[str, Callable[[ ("preserves_hostworkdir_runtime_token", test_preserves_hostworkdir_runtime_token), ("generated_scripts_do_not_embed_pidfile_authority", test_generated_scripts_do_not_embed_pidfile_authority), ("ops_entrypoints_use_direct_scripts", test_ops_entrypoints_use_direct_scripts), + ("bare_start_uses_no_exit_startup_gate", test_bare_start_uses_no_exit_startup_gate), + ( + "normalized_testbed_master_exports_service_port_for_atomic_group", + test_normalized_testbed_master_exports_service_port_for_atomic_group, + ), + ("normalized_testbed_owner_emits_large_file_paths", test_normalized_testbed_owner_emits_large_file_paths), ("bare_child_command_preserves_runtime_hostworkdir_expansion", test_bare_child_command_preserves_runtime_hostworkdir_expansion), ("supervisor_label_uses_stable_selection_suffix", test_supervisor_label_uses_stable_selection_suffix), ("bootstrap_start_reuses_already_present_selection", test_bootstrap_start_reuses_already_present_selection), @@ -93,6 +101,7 @@ def test_preserves_hostworkdir_runtime_token() -> None: FLUXON_SHARED_MEM: "${HOSTWORKDIR}/shm1" service: 
svc_plain: + port: 12345 entrypoint: | WORKDIR="${HOSTWORKDIR}/svc_${NODE_ID}" EXPORT_TABLE=$(cat < None: assert "wait-present" not in script, script assert "launch_only_start_gate" not in script, script assert 'wait_service_probably_ready_pid_tree "$SERVICE" "$SUPERVISOR_PID"' in script, script - assert 'wait_service_tcp_ready "$SERVICE" "$HOST_IP" "$SERVICE_PORT"' in script, script + assert 'wait_service_probably_ready_pid_tree "$SERVICE" "$SUPERVISOR_PID" 10 "$STARTUP_DEADLINE_TS" "[bare]"' in script, script + assert "export SERVICE_PORT=12345" in script, script + assert 'STARTUP_DEADLINE_TS=$(( $(date +%s) + 10 ))' in script, script + assert "wait_service_tcp_ready" not in script, script + assert "wait_service_etcd_endpoint_healthy" not in script, script assert 'SUPERVISOR_PID=$( setsid ' not in script, script + assert '>>"$LOGFILE" 2>&1' not in script, script + assert 'touch "$LOGFILE"' not in script, script assert 'python3 "$SELECTION_SUPERVISOR" stop --label "$SUPERVISOR_LABEL" --scope-key "$HOSTWORKDIR" --missing-ok' in stop_script, stop_script assert "retire-runtime" not in stop_script, stop_script print("PASS: test_preserves_hostworkdir_runtime_token") @@ -149,6 +164,7 @@ def test_atomic_group_start_does_not_auto_stop_on_failure() -> None: hostworkdir: /tmp/hostworkdir service: svc_a: + port: 23456 entrypoint: | echo svc_a node_bind: @@ -179,7 +195,12 @@ def test_atomic_group_start_does_not_auto_stop_on_failure() -> None: assert 'SUPERVISOR_PID=$( setsid ' not in script, script assert 'echo "[rollout] probable-ready failed svc=$SERVICE label=$SUPERVISOR_LABEL supervisor_pid=$SUPERVISOR_PID"' in script, script assert 'wait_service_probably_ready_pid_tree "$SERVICE" "$SUPERVISOR_PID"' in script, script - assert 'wait_service_tcp_ready "$SERVICE" "$HOST_IP" "$SERVICE_PORT"' in script, script + assert 'wait_service_probably_ready_pid_tree "$SERVICE" "$SUPERVISOR_PID" 10 "$GROUP_STARTUP_DEADLINE_TS" "[rollout]"' in script, script + assert 
'GROUP_STARTUP_DEADLINE_TS=$(( $(date +%s) + 10 ))' in script, script + assert "export SERVICE_PORT=23456" in script, script + assert "unset SERVICE_PORT" in script, script + assert "wait_service_tcp_ready" not in script, script + assert "wait_service_etcd_endpoint_healthy" not in script, script print("PASS: test_atomic_group_start_does_not_auto_stop_on_failure") @@ -251,11 +272,129 @@ def test_ops_entrypoints_use_direct_scripts() -> None: assert "-m fluxon_py.runtime.start_ops_controller" in controller_entrypoint, controller_entrypoint assert "examples/fluxon_ops/start_controller.py" not in controller_entrypoint, controller_entrypoint + assert 'http_listen_addr: "0.0.0.0:19080"' in controller_entrypoint, controller_entrypoint + assert 'http_listen_addr: "0.0.0.0:${MASTER__PORT}"' not in controller_entrypoint, controller_entrypoint assert "-m fluxon_py.runtime.start_ops_agent" in agent_entrypoint, agent_entrypoint assert "examples/fluxon_ops/start_agent.py" not in agent_entrypoint, agent_entrypoint print("PASS: test_ops_entrypoints_use_direct_scripts") +def test_bare_start_uses_no_exit_startup_gate() -> None: + with tempfile.TemporaryDirectory(prefix="test_gen_bare_deploy_bash_no_exit_gate_") as td: + tmpdir = Path(td) + config_path = tmpdir / "deployconf.yaml" + outdir = tmpdir / "out" + config_path.write_text( + textwrap.dedent( + """ + name_prefix: fluxon-testbed + cluster_nodes: + - hostname: node-a + ip: 127.0.0.1 + hostworkdir: /tmp/hostworkdir + service: + etcd: + port: 2379 + entrypoint: | + echo etcd + node_bind: + node: ["node-a"] + tikv: + port: 20160 + entrypoint: | + echo tikv + node_bind: + node: ["node-a"] + svc_plain: + port: 12345 + entrypoint: | + echo plain + node_bind: + node: ["node-a"] + """ + ).strip() + + "\n", + encoding="utf-8", + ) + + result = _run_generator(config_path=config_path, outdir=outdir) + assert result.returncode == 0, f"generator failed: stdout={result.stdout} stderr={result.stderr}" + + etcd_script = (outdir / 
"start_etcd.sh").read_text(encoding="utf-8") + tikv_script = (outdir / "start_tikv.sh").read_text(encoding="utf-8") + plain_script = (outdir / "start_svc_plain.sh").read_text(encoding="utf-8") + + for script in (etcd_script, tikv_script, plain_script): + assert 'STARTUP_DEADLINE_TS=$(( $(date +%s) + 10 ))' in script, script + assert 'wait_service_probably_ready_pid_tree "$SERVICE" "$SUPERVISOR_PID" 10 "$STARTUP_DEADLINE_TS" "[bare]"' in script, script + assert "wait_service_tcp_ready" not in script, script + assert "wait_service_etcd_endpoint_healthy" not in script, script + print("PASS: test_bare_start_uses_no_exit_startup_gate") + + +def test_normalized_testbed_master_exports_service_port_for_atomic_group() -> None: + with tempfile.TemporaryDirectory(prefix="test_gen_bare_deploy_bash_normalized_testbed_") as td: + tmpdir = Path(td) + config_path = tmpdir / "deployconf.normalized.yaml" + outdir = tmpdir / "out" + + start_test_bed = _load_python_module( + module_name="start_test_bed_for_gen_bare_tests", + path=DEPLOYMENT_DIR.parent / "fluxon_test_stack" / "start_test_bed.py", + ) + base_cfg = yaml.safe_load( + (DEPLOYMENT_DIR.parent / "fluxon_test_stack" / "deployconf_testbed.yml").read_text(encoding="utf-8") + ) + normalized, _ = start_test_bed._normalize_bootstrap_deployconf(deployconf=base_cfg) + config_path.write_text( + yaml.safe_dump(normalized, sort_keys=False, allow_unicode=False), + encoding="utf-8", + ) + + result = _run_generator(config_path=config_path, outdir=outdir) + assert result.returncode == 0, f"generator failed: stdout={result.stdout} stderr={result.stderr}" + + script = (outdir / "start_fluxon_core_controller.sh").read_text(encoding="utf-8") + master_block_start = script.index("export SERVICE=master") + owner_block_start = script.index("export SERVICE=owner") + master_block = script[master_block_start:owner_block_start] + assert "export MASTER__PORT=51051" in master_block, master_block + assert "export SERVICE_PORT=51051" in master_block, 
master_block + assert "unset SERVICE_PORT" not in master_block, master_block + assert 'wait_service_probably_ready_pid_tree "$SERVICE" "$SUPERVISOR_PID" 10 "$GROUP_STARTUP_DEADLINE_TS" "[rollout]"' in master_block, master_block + assert "wait_service_tcp_ready" not in master_block, master_block + print("PASS: test_normalized_testbed_master_exports_service_port_for_atomic_group") + + +def test_normalized_testbed_owner_emits_large_file_paths() -> None: + with tempfile.TemporaryDirectory(prefix="test_gen_bare_deploy_bash_testbed_owner_large_paths_") as td: + tmpdir = Path(td) + config_path = tmpdir / "deployconf.normalized.yaml" + outdir = tmpdir / "out" + + start_test_bed = _load_python_module( + module_name="start_test_bed_for_owner_large_paths_tests", + path=DEPLOYMENT_DIR.parent / "fluxon_test_stack" / "start_test_bed.py", + ) + base_cfg = yaml.safe_load( + (DEPLOYMENT_DIR.parent / "fluxon_test_stack" / "deployconf_testbed.yml").read_text(encoding="utf-8") + ) + normalized, _ = start_test_bed._normalize_bootstrap_deployconf(deployconf=base_cfg) + config_path.write_text( + yaml.safe_dump(normalized, sort_keys=False, allow_unicode=False), + encoding="utf-8", + ) + + result = _run_generator(config_path=config_path, outdir=outdir) + assert result.returncode == 0, f"generator failed: stdout={result.stdout} stderr={result.stderr}" + + script = (outdir / "entrypoint__fluxon-self-host2-fluxon_core_controller__owner.sh").read_text(encoding="utf-8") + assert 'large_file_paths:' in script, script + assert 'log_root_path: "${HOSTWORKDIR}/large/log/owner_${NODE_ID}"' in script, script + assert 'cache_root_path: "${HOSTWORKDIR}/large/cache/owner_${NODE_ID}"' in script, script + print("PASS: test_normalized_testbed_owner_emits_large_file_paths") + + def test_bare_child_command_preserves_runtime_hostworkdir_expansion() -> None: with tempfile.TemporaryDirectory(prefix="test_gen_bare_deploy_bash_runtime_expand_") as td: tmpdir = Path(td) @@ -600,6 +739,16 @@ def 
_load_generated_supervisor_module(supervisor_path: Path): return module +def _load_python_module(*, module_name: str, path: Path): + spec = importlib.util.spec_from_file_location(module_name, path) + if spec is None or spec.loader is None: + raise RuntimeError(f"failed to load module: {path}") + module = importlib.util.module_from_spec(spec) + sys.modules[module_name] = module + spec.loader.exec_module(module) + return module + + def _wait_until_selection_present(module, *, label: str, timeout_seconds: int = 15) -> None: deadline = time.time() + timeout_seconds while time.time() < deadline: diff --git a/deployment/tests/test_gen_k8s_daemonset.py b/deployment/tests/test_gen_k8s_daemonset.py index eff0aad..2cd769e 100644 --- a/deployment/tests/test_gen_k8s_daemonset.py +++ b/deployment/tests/test_gen_k8s_daemonset.py @@ -248,7 +248,7 @@ def test_ops_entrypoints_use_direct_scripts() -> None: cluster_name: "${FLUXON_CLUSTER_NAME}" member_kind: kv output: web - http_listen_addr: "0.0.0.0:${MASTER__PORT}" + http_listen_addr: "0.0.0.0:${OPS_CONTROLLER__PORT}" YAML ${HOSTWORKDIR}/venv/bin/python -m fluxon_py.runtime.start_ops_controller -c "${WORKDIR}/ops_controller.yaml" -w "${WORKDIR}" node_bind: diff --git a/deployment/tests/test_log_shard.py b/deployment/tests/test_log_shard.py new file mode 100644 index 0000000..642e718 --- /dev/null +++ b/deployment/tests/test_log_shard.py @@ -0,0 +1,117 @@ +#!/usr/bin/env python3 + +from __future__ import annotations + +import argparse +import datetime +import os +import sys +import tempfile +import time +from pathlib import Path +from typing import Callable, List, Optional, Tuple + +SCRIPT_DIR = Path(__file__).resolve().parent +DEPLOYMENT_DIR = SCRIPT_DIR.parent +sys.path.insert(0, str(DEPLOYMENT_DIR)) + +from utils import log_shard + + +def main() -> int: + parser = argparse.ArgumentParser(description="log_shard util test runner") + parser.add_argument("--test-id", help="Run only the named test id") + args = parser.parse_args() + 
+ checks = _build_checks(args.test_id) + failures = 0 + for _, check in checks: + try: + check() + print(f"PASS: {check.__name__}") + except Exception as exc: + print(f"FAIL: {check.__name__}: {exc}") + failures += 1 + return 0 if failures == 0 else 1 + + +def _build_checks(selected_test_id: Optional[str]) -> List[Tuple[str, Callable[[], None]]]: + checks: List[Tuple[str, Callable[[], None]]] = [ + ("daily_path_uses_utc_date_suffix", test_daily_path_uses_utc_date_suffix), + ("daily_path_uses_test_window_suffix_when_configured", test_daily_path_uses_test_window_suffix_when_configured), + ("resolve_readable_prefers_latest_existing_shard", test_resolve_readable_prefers_latest_existing_shard), + ("cleanup_keeps_only_retention_window", test_cleanup_keeps_only_retention_window), + ] + if selected_test_id is None: + return checks + for check_id, check in checks: + if check_id == selected_test_id: + return [(check_id, check)] + available = ", ".join(check_id for check_id, _ in checks) + raise ValueError(f"unknown --test-id: {selected_test_id}. 
Available: {available}") + + +def test_daily_path_uses_utc_date_suffix() -> None: + base = Path("/tmp/test_runner.log") + now = datetime.datetime(2026, 6, 21, 4, 0, 0, tzinfo=datetime.timezone.utc) + resolved = log_shard.daily_sharded_log_path(base, now=now) + assert resolved.name == "test_runner.2026-06-21.log", resolved + + +def test_resolve_readable_prefers_latest_existing_shard() -> None: + with tempfile.TemporaryDirectory(prefix="test_log_shard_resolve_") as td: + root = Path(td) + base = root / "service.log" + (root / "service.2026-06-19.log").write_text("old\n", encoding="utf-8") + (root / "service.2026-06-20.log").write_text("new\n", encoding="utf-8") + resolved = log_shard.resolve_readable_log_path(base) + assert resolved == (root / "service.2026-06-20.log").resolve(), resolved + + +def test_daily_path_uses_test_window_suffix_when_configured() -> None: + base = Path("/tmp/test_runner.log") + saved_window = os.environ.get(log_shard.TEST_LOG_SHARD_WINDOW_SECONDS_ENV) + saved_anchor = os.environ.get(log_shard.TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV) + try: + os.environ[log_shard.TEST_LOG_SHARD_WINDOW_SECONDS_ENV] = "10" + os.environ[log_shard.TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV] = str( + int(datetime.datetime(2026, 6, 21, 0, 0, 0, tzinfo=datetime.timezone.utc).timestamp()) + ) + now_0 = datetime.datetime(2026, 6, 21, 0, 0, 5, tzinfo=datetime.timezone.utc) + now_1 = datetime.datetime(2026, 6, 21, 0, 0, 15, tzinfo=datetime.timezone.utc) + resolved_0 = log_shard.daily_sharded_log_path(base, now=now_0) + resolved_1 = log_shard.daily_sharded_log_path(base, now=now_1) + assert resolved_0.name == "test_runner.2026-01-01.log", resolved_0 + assert resolved_1.name == "test_runner.2026-01-02.log", resolved_1 + finally: + if saved_window is None: + os.environ.pop(log_shard.TEST_LOG_SHARD_WINDOW_SECONDS_ENV, None) + else: + os.environ[log_shard.TEST_LOG_SHARD_WINDOW_SECONDS_ENV] = saved_window + if saved_anchor is None: + 
os.environ.pop(log_shard.TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV, None) + else: + os.environ[log_shard.TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV] = saved_anchor + + +def test_cleanup_keeps_only_retention_window() -> None: + with tempfile.TemporaryDirectory(prefix="test_log_shard_cleanup_") as td: + root = Path(td) + base = root / "service.log" + keep_date = datetime.datetime.now(datetime.timezone.utc).date() + old_date = keep_date - datetime.timedelta(days=31) + recent_date = keep_date - datetime.timedelta(days=30) + stale_path = root / f"service.{old_date.isoformat()}.log" + recent_path = root / f"service.{recent_date.isoformat()}.log" + today_path = root / f"service.{keep_date.isoformat()}.log" + stale_path.write_text("stale\n", encoding="utf-8") + recent_path.write_text("recent\n", encoding="utf-8") + today_path.write_text("today\n", encoding="utf-8") + log_shard.cleanup_old_daily_sharded_logs(base, retention_days=31) + assert not stale_path.exists(), stale_path + assert recent_path.exists(), recent_path + assert today_path.exists(), today_path + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/deployment/tests/test_selection_supervisor_codegen.py b/deployment/tests/test_selection_supervisor_codegen.py index 02ffa3b..a00caa9 100644 --- a/deployment/tests/test_selection_supervisor_codegen.py +++ b/deployment/tests/test_selection_supervisor_codegen.py @@ -19,6 +19,7 @@ UTILS_DIR = SCRIPT_DIR.parent / "utils" sys.path.insert(0, str(UTILS_DIR)) +from log_shard import render_module_source as render_log_shard_module_source # type: ignore from selection_supervisor_codegen import render_python_selection_supervisor_module # type: ignore @@ -41,6 +42,9 @@ def _build_checks(selected_test_id: Optional[str]) -> List[Tuple[str, Callable[[ ("install_subreaper_uses_prctl", test_install_subreaper_uses_prctl), ("spawn_child_sanitizes_rdma_driver_env", test_spawn_child_sanitizes_rdma_driver_env), ("selection_present_requires_live_child_process", 
test_selection_present_requires_live_child_process), + ("runtime_log_path_uses_daily_shard_files", test_runtime_log_path_uses_daily_shard_files), + ("runtime_log_path_expands_hostworkdir_env", test_runtime_log_path_expands_hostworkdir_env), + ("runtime_log_shards_roll_and_preserve_content_boundaries", test_runtime_log_shards_roll_and_preserve_content_boundaries), ("selection_present_checks_all_live_supervisors", test_selection_present_checks_all_live_supervisors), ("zombie_supervisor_is_treated_as_stopped", test_zombie_supervisor_is_treated_as_stopped), ("legacy_replace_process_is_observed_as_live_owner", test_legacy_replace_process_is_observed_as_live_owner), @@ -99,6 +103,10 @@ def _write_runtime_script(root: Path, *, term_seconds: int = 5, kill_seconds: in ), encoding="utf-8", ) + (root / "log_shard.py").write_text( + render_log_shard_module_source(), + encoding="utf-8", + ) return supervisor_path @@ -561,6 +569,181 @@ def test_selection_present_requires_live_child_process() -> None: _terminate_process(supervisor) +def test_runtime_log_path_uses_daily_shard_files() -> None: + module = _load_runtime_module() + with tempfile.TemporaryDirectory(prefix="test_selection_supervisor_log_shard_") as td: + root = Path(td) + supervisor_path = _write_runtime_script(root) + child_path = root / "child.py" + child_path.write_text( + "import sys, time\n" + "print('hello-log-shard', flush=True)\n" + "time.sleep(30)\n", + encoding="utf-8", + ) + label = "DaemonSet/test-log-shard" + child_argv = [sys.executable, str(child_path)] + base_log_path = root / "test-log-shard.log" + supervisor = _run_supervisor_command( + supervisor_path=supervisor_path, + label=label, + owner_ts_ms=1, + state_json=json.dumps( + { + "kind": "DaemonSet", + "name": "test-log-shard", + "service_name": "test-log-shard", + "argv": child_argv, + "cwd": str(root), + "log_path": str(base_log_path), + }, + sort_keys=True, + ), + child_argv=child_argv, + cwd=root, + ) + try: + _wait_until_present(module, label) + 
deadline = time.time() + 5.0 + shard_path = root / f"test-log-shard.{time.strftime('%Y-%m-%d', time.gmtime())}.log" + while time.time() < deadline and not shard_path.exists(): + time.sleep(0.1) + assert shard_path.exists(), shard_path + assert not base_log_path.exists(), base_log_path + assert "hello-log-shard" in shard_path.read_text(encoding="utf-8", errors="replace") + finally: + _terminate_process(supervisor) + + +def test_runtime_log_path_expands_hostworkdir_env() -> None: + module = _load_runtime_module() + with tempfile.TemporaryDirectory(prefix="test_selection_supervisor_expand_hostworkdir_") as td: + root = Path(td) + hostworkdir = root / "hostworkdir" + hostworkdir.mkdir(parents=True, exist_ok=True) + supervisor_path = _write_runtime_script(root) + child_path = root / "child.py" + child_path.write_text( + "import time\n" + "print('expanded-hostworkdir-log', flush=True)\n" + "time.sleep(30)\n", + encoding="utf-8", + ) + label = "DaemonSet/test-expand-hostworkdir" + child_argv = [sys.executable, str(child_path)] + saved_hostworkdir = os.environ.get("HOSTWORKDIR") + os.environ["HOSTWORKDIR"] = str(hostworkdir) + supervisor = _run_supervisor_command( + supervisor_path=supervisor_path, + label=label, + owner_ts_ms=1, + state_json=json.dumps( + { + "kind": "DaemonSet", + "name": "test-expand-hostworkdir", + "service_name": "test-expand-hostworkdir", + "argv": child_argv, + "cwd": str(root), + "log_path": "${HOSTWORKDIR}/log/test-expand-hostworkdir.log", + }, + sort_keys=True, + ), + child_argv=child_argv, + cwd=root, + ) + try: + _wait_until_present(module, label) + deadline = time.time() + 5.0 + shard_path = hostworkdir / "log" / f"test-expand-hostworkdir.{time.strftime('%Y-%m-%d', time.gmtime())}.log" + while time.time() < deadline and not shard_path.exists(): + time.sleep(0.1) + assert shard_path.exists(), shard_path + assert "expanded-hostworkdir-log" in shard_path.read_text(encoding="utf-8", errors="replace") + finally: + _terminate_process(supervisor) + 
if saved_hostworkdir is None: + os.environ.pop("HOSTWORKDIR", None) + else: + os.environ["HOSTWORKDIR"] = saved_hostworkdir + + +def test_runtime_log_shards_roll_and_preserve_content_boundaries() -> None: + module = _load_runtime_module() + saved_window = os.environ.get("FLUXON_TEST_LOG_SHARD_WINDOW_SECONDS") + saved_anchor = os.environ.get("FLUXON_TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS") + with tempfile.TemporaryDirectory(prefix="test_selection_supervisor_log_roll_") as td: + root = Path(td) + supervisor_path = _write_runtime_script(root) + child_path = root / "child.py" + child_path.write_text( + "import sys, time\n" + "print('[ops-log-mgmt][phase=before] ts=' + str(int(time.time())), flush=True)\n" + "time.sleep(11)\n" + "print('[ops-log-mgmt][phase=after] ts=' + str(int(time.time())), flush=True)\n" + "time.sleep(30)\n", + encoding="utf-8", + ) + anchor = str(int(time.time()) - 2) + os.environ["FLUXON_TEST_LOG_SHARD_WINDOW_SECONDS"] = "10" + os.environ["FLUXON_TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS"] = anchor + label = "DaemonSet/test-log-roll" + child_argv = [sys.executable, str(child_path)] + base_log_path = root / "test-log-roll.log" + stale_shard = root / "test-log-roll.2025-12-01.log" + stale_shard.write_text("stale\n", encoding="utf-8") + supervisor = _run_supervisor_command( + supervisor_path=supervisor_path, + label=label, + owner_ts_ms=1, + state_json=json.dumps( + { + "kind": "DaemonSet", + "name": "test-log-roll", + "service_name": "test-log-roll", + "argv": child_argv, + "cwd": str(root), + "log_path": str(base_log_path), + }, + sort_keys=True, + ), + child_argv=child_argv, + cwd=root, + ) + try: + _wait_until_present(module, label) + first_shard = root / "test-log-roll.2026-01-01.log" + second_shard = root / "test-log-roll.2026-01-02.log" + deadline = time.time() + 20.0 + while time.time() < deadline: + if first_shard.exists() and second_shard.exists(): + first_text = first_shard.read_text(encoding="utf-8", errors="replace") + second_text = 
second_shard.read_text(encoding="utf-8", errors="replace") + if "[ops-log-mgmt][phase=before]" in first_text and "[ops-log-mgmt][phase=after]" in second_text: + break + time.sleep(0.2) + assert first_shard.exists(), first_shard + assert second_shard.exists(), second_shard + assert not stale_shard.exists(), stale_shard + shard_names = sorted(path.name for path in root.glob("test-log-roll.*.log")) + assert shard_names == ["test-log-roll.2026-01-01.log", "test-log-roll.2026-01-02.log"], shard_names + first_text = first_shard.read_text(encoding="utf-8", errors="replace") + second_text = second_shard.read_text(encoding="utf-8", errors="replace") + assert "[ops-log-mgmt][phase=before]" in first_text, first_text + assert "[ops-log-mgmt][phase=after]" not in first_text, first_text + assert "[ops-log-mgmt][phase=after]" in second_text, second_text + assert "[ops-log-mgmt][phase=before]" not in second_text, second_text + finally: + _terminate_process(supervisor) + if saved_window is None: + os.environ.pop("FLUXON_TEST_LOG_SHARD_WINDOW_SECONDS", None) + else: + os.environ["FLUXON_TEST_LOG_SHARD_WINDOW_SECONDS"] = saved_window + if saved_anchor is None: + os.environ.pop("FLUXON_TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS", None) + else: + os.environ["FLUXON_TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS"] = saved_anchor + + def test_selection_present_checks_all_live_supervisors() -> None: module = _load_runtime_module() label = "DaemonSet/test-present-any-live-child" @@ -569,7 +752,9 @@ def test_selection_present_checks_all_live_supervisors() -> None: original_iter_live_supervisors = module._iter_live_supervisors original_count_pid_tree_members = module._count_pid_tree_members try: - module._iter_live_supervisors = lambda current_label=None: [stale_new, old_live] if current_label == label else [] + module._iter_live_supervisors = ( + lambda current_label=None, scope_key=None: [stale_new, old_live] if current_label == label else [] + ) module._count_pid_tree_members = lambda pid: {11: 1, 22: 
2}[pid] assert module._selection_present(label) is True finally: @@ -1135,7 +1320,7 @@ def test_retire_adopted_children_stops_live_roots() -> None: calls: List[tuple[str, object]] = [] try: module._direct_live_child_pids = lambda pid: [41, 42] if pid == module.os.getpid() else [] - module._iter_live_supervisors = lambda label=None: [] + module._iter_live_supervisors = lambda label=None, scope_key=None: [] module._stop_pid_tree_batch = lambda roots, label: calls.append(("stop", (list(roots), label))) module._reap_terminated_children = lambda: [(41, 0), (42, 0)] module._log_reaped_children = lambda **kwargs: calls.append(("reap", kwargs)) @@ -1160,7 +1345,7 @@ def test_retire_adopted_children_preserves_live_supervisor_roots() -> None: calls: List[tuple[str, object]] = [] try: module._direct_live_child_pids = lambda pid: [41, 42] if pid == module.os.getpid() else [] - module._iter_live_supervisors = lambda label=None: [ + module._iter_live_supervisors = lambda label=None, scope_key=None: [ module.LiveSupervisor( process_info=module.ProcessInfo(pid=42, ppid=module.os.getpid(), pgid=42, state="S", start_time_ticks=1), owner_ts_ms=7, diff --git a/deployment/tests/test_start_test_bed_bootstrap_log.py b/deployment/tests/test_start_test_bed_bootstrap_log.py index 312deea..9f5ef49 100644 --- a/deployment/tests/test_start_test_bed_bootstrap_log.py +++ b/deployment/tests/test_start_test_bed_bootstrap_log.py @@ -3,6 +3,7 @@ from __future__ import annotations import argparse +import copy import importlib.util import io import sys @@ -604,6 +605,7 @@ def test_normalize_bootstrap_deployconf_strips_legacy_master_p2p_listen_port() - ops_agent_entrypoint = normalized["service"]["ops_agent"]["entrypoint"] assert "p2p_listen_port: 31100" not in master_entrypoint, master_entrypoint assert "p2p_listen_port: 12102" in ops_agent_entrypoint, ops_agent_entrypoint + assert normalized["service"]["master"]["port"] == 51051, normalized["service"]["master"] assert notes == 
["service.master.entrypoint: removed legacy master field p2p_listen_port"], notes assert "p2p_listen_port: 31100" in deployconf["service"]["master"]["entrypoint"], deployconf print("PASS: test_normalize_bootstrap_deployconf_strips_legacy_master_p2p_listen_port") @@ -789,6 +791,7 @@ def test_normalize_bootstrap_deployconf_rewrites_same_host_local_multi_node_fixe assert "--http-addr 0.0.0.0:19390" in normalized["service"]["greptime"]["entrypoint"], normalized["service"]["greptime"]["entrypoint"] assert normalized["service"]["tikv_pd"]["port"] == 19400, normalized["service"]["tikv_pd"] assert normalized["service"]["tikv"]["port"] == 19410, normalized["service"]["tikv"] + assert normalized["service"]["master"]["port"] == 19290, normalized["service"]["master"] assert "port: 19290" in normalized["service"]["master"]["entrypoint"], normalized["service"]["master"]["entrypoint"] assert "OPS_AGENT_P2P_LISTEN_PORT=19320" in normalized["service"]["ops_agent"]["entrypoint"], normalized["service"]["ops_agent"]["entrypoint"] assert "OPS_AGENT_P2P_LISTEN_PORT=19321" in normalized["service"]["ops_agent"]["entrypoint"], normalized["service"]["ops_agent"]["entrypoint"] @@ -845,11 +848,35 @@ def test_normalize_bootstrap_deployconf_keeps_non_local_or_single_node_ports_unc }, } normalized, notes = module._normalize_bootstrap_deployconf(deployconf=deployconf) - assert normalized == deployconf, normalized + assert normalized["service"]["master"]["port"] == 51051, normalized["service"]["master"] + expected = copy.deepcopy(deployconf) + expected["service"]["master"]["port"] = 51051 + assert normalized == expected, normalized assert notes == [], notes print("PASS: test_normalize_bootstrap_deployconf_keeps_non_local_or_single_node_ports_unchanged") +def test_normalize_bootstrap_deployconf_promotes_master_port_from_entrypoint() -> None: + module = _load_start_test_bed_module() + deployconf = { + "service": { + "master": { + "entrypoint": ( + 'cat > "${CONFIG_PATH}" < None: module = 
_load_start_test_bed_module() with tempfile.TemporaryDirectory(prefix="test_start_test_bed_refresh_bare_") as td: @@ -1476,6 +1503,10 @@ def main() -> int: "normalize_bootstrap_deployconf_keeps_non_local_or_single_node_ports_unchanged", test_normalize_bootstrap_deployconf_keeps_non_local_or_single_node_ports_unchanged, ), + ( + "normalize_bootstrap_deployconf_promotes_master_port_from_entrypoint", + test_normalize_bootstrap_deployconf_promotes_master_port_from_entrypoint, + ), ( "refresh_cluster_bare_deploy_scripts_copies_local_and_remote_nodes", test_refresh_cluster_bare_deploy_scripts_copies_local_and_remote_nodes, diff --git a/deployment/utils/log_shard.py b/deployment/utils/log_shard.py new file mode 100644 index 0000000..415d4ff --- /dev/null +++ b/deployment/utils/log_shard.py @@ -0,0 +1,196 @@ +#!/usr/bin/env python3 + +from __future__ import annotations + +import datetime +import os +from pathlib import Path +from typing import Optional + + +DEFAULT_DAILY_LOG_RETENTION_DAYS = 31 +TEST_LOG_SHARD_WINDOW_SECONDS_ENV = "FLUXON_TEST_LOG_SHARD_WINDOW_SECONDS" +TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV = "FLUXON_TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS" +TEST_LOG_SHARD_BASE_DATE = datetime.date(2026, 1, 1) + + +def _read_test_log_shard_window_seconds() -> Optional[int]: + raw_value = os.environ.get(TEST_LOG_SHARD_WINDOW_SECONDS_ENV) + if raw_value is None: + return None + text = raw_value.strip() + if not text: + return None + window_seconds = int(text) + if window_seconds <= 0: + raise ValueError( + f"{TEST_LOG_SHARD_WINDOW_SECONDS_ENV} must be a positive integer, got: {raw_value!r}" + ) + return window_seconds + + +def _read_test_log_shard_anchor_unix_seconds() -> int: + raw_value = os.environ.get(TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV) + if raw_value is None or not raw_value.strip(): + raise ValueError( + f"{TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV} is required when " + f"{TEST_LOG_SHARD_WINDOW_SECONDS_ENV} is set" + ) + return int(raw_value.strip()) + + +def 
_resolve_shard_date(ts: datetime.datetime) -> datetime.date: + window_seconds = _read_test_log_shard_window_seconds() + if window_seconds is None: + return ts.date() + anchor_unix_seconds = _read_test_log_shard_anchor_unix_seconds() + unix_seconds = int(ts.timestamp()) + bucket_index = (unix_seconds - anchor_unix_seconds) // window_seconds + if bucket_index < 0: + raise ValueError( + "test log shard anchor must not be in the future: " + f"anchor={anchor_unix_seconds}, ts={unix_seconds}" + ) + return TEST_LOG_SHARD_BASE_DATE + datetime.timedelta(days=bucket_index) + + +def daily_sharded_log_path( + base_path: Path, + *, + now: Optional[datetime.datetime] = None, +) -> Path: + ts = datetime.datetime.now(datetime.timezone.utc) if now is None else now.astimezone(datetime.timezone.utc) + name = base_path.name + if not name.endswith(".log"): + raise ValueError(f"log base path must end with .log: {base_path}") + stem = name[:-4] + shard_date = _resolve_shard_date(ts) + return (base_path.parent / f"{stem}.{shard_date.isoformat()}.log").resolve() + + +def latest_existing_daily_sharded_log_path(base_path: Path) -> Optional[Path]: + name = base_path.name + if not name.endswith(".log"): + return base_path.resolve() if base_path.exists() else None + stem = name[:-4] + prefix = stem + "." 
+ suffix = ".log" + latest: Optional[tuple[datetime.date, Path]] = None + parent = base_path.parent + if not parent.exists(): + return base_path.resolve() if base_path.exists() else None + for path in parent.iterdir(): + if not path.is_file(): + continue + entry_name = path.name + if not entry_name.startswith(prefix) or not entry_name.endswith(suffix): + continue + date_text = entry_name[len(prefix):-len(suffix)] + try: + shard_date = datetime.date.fromisoformat(date_text) + except ValueError: + continue + if latest is None or shard_date > latest[0]: + latest = (shard_date, path.resolve()) + if latest is not None: + return latest[1] + if base_path.exists(): + return base_path.resolve() + return None + + +def resolve_readable_log_path(base_path: Path) -> Optional[Path]: + current = daily_sharded_log_path(base_path) + if current.exists(): + return current + return latest_existing_daily_sharded_log_path(base_path) + + +def cleanup_old_daily_sharded_logs( + base_path: Path, + *, + retention_days: int = DEFAULT_DAILY_LOG_RETENTION_DAYS, +) -> None: + name = base_path.name + if not name.endswith(".log"): + return + current_shard_date = _resolve_shard_date(datetime.datetime.now(datetime.timezone.utc)) + keep_since = current_shard_date - datetime.timedelta(days=max(int(retention_days) - 1, 0)) + stem = name[:-4] + prefix = stem + "." 
+ suffix = ".log" + parent = base_path.parent + parent.mkdir(parents=True, exist_ok=True) + for path in parent.iterdir(): + if not path.is_file(): + continue + entry_name = path.name + if not entry_name.startswith(prefix) or not entry_name.endswith(suffix): + continue + date_text = entry_name[len(prefix):-len(suffix)] + try: + shard_date = datetime.date.fromisoformat(date_text) + except ValueError: + continue + if shard_date < keep_since: + try: + path.unlink() + except FileNotFoundError: + pass + + +def render_module_source() -> str: + module_path = Path(__file__).resolve() + return module_path.read_text(encoding="utf-8") + + +def import_sibling_log_shard(): + import importlib.util + import sys + + helper_path = Path(__file__).resolve().with_name("log_shard.py") + module_name = "_fluxon_log_shard_runtime" + loaded = sys.modules.get(module_name) + if loaded is not None: + return loaded + spec = importlib.util.spec_from_file_location(module_name, helper_path) + if spec is None or spec.loader is None: + raise RuntimeError(f"failed to load log shard helper: {helper_path}") + module = importlib.util.module_from_spec(spec) + sys.modules[module_name] = module + spec.loader.exec_module(module) + return module + + +def relay_fd_to_daily_sharded_logs( + *, + base_log_path: str, + read_fd: int, + retention_days: int = DEFAULT_DAILY_LOG_RETENTION_DAYS, +) -> None: + base_path = Path(os.path.abspath(base_log_path)) + current_path: Optional[Path] = None + current_fp = None + try: + while True: + try: + chunk = os.read(read_fd, 65536) + except OSError: + break + if not chunk: + break + next_path = daily_sharded_log_path(base_path) + if current_path != next_path: + if current_fp is not None: + current_fp.flush() + current_fp.close() + cleanup_old_daily_sharded_logs(base_path, retention_days=retention_days) + next_path.parent.mkdir(parents=True, exist_ok=True) + current_fp = next_path.open("ab", buffering=0) + current_path = next_path + current_fp.write(chunk) + finally: + if 
current_fp is not None: + current_fp.flush() + current_fp.close() + os.close(read_fd) diff --git a/deployment/utils/proc_lifecycle_codegen.py b/deployment/utils/proc_lifecycle_codegen.py index 31ef2b0..116b0c4 100644 --- a/deployment/utils/proc_lifecycle_codegen.py +++ b/deployment/utils/proc_lifecycle_codegen.py @@ -150,22 +150,19 @@ def render_bash_proc_lifecycle_funcs_pid_tree(*, timeouts: StopTimeouts) -> str: }} wait_service_probably_ready_pid_tree() {{ - # "Probably ready" contract: - # - A service is considered probably-ready iff for N consecutive seconds: - # - the supervisor PID exists, and - # - the supervisor PID subtree has at least one other PID besides the supervisor. - # - If the child process restarts during the window, we reset the counter and keep waiting, - # until the provided deadline is reached. - # - # This is used by atomic-group runners to enforce strict start ordering. + # Startup gate contract: + # - Success means the supervisor PID stays alive across the fixed startup window. + # - During this startup window we do not probe service ports or readiness endpoints. + # - We intentionally do not require the child to expose ports, endpoints, or even finish + # spawning before the window ends. svc="$1" root_pid="$2" - stable_seconds="$3" + startup_window_seconds="$3" deadline_ts="$4" context="$5" - if [[ ! "$stable_seconds" =~ ^[0-9]+$ ]] || [ "$stable_seconds" -le 0 ]; then - echo "$context probable-ready: invalid stable_seconds=$stable_seconds svc=$svc" + if [[ ! "$startup_window_seconds" =~ ^[0-9]+$ ]] || [ "$startup_window_seconds" -le 0 ]; then + echo "$context probable-ready: invalid startup_window_seconds=$startup_window_seconds svc=$svc" return 1 fi if [[ ! 
"$deadline_ts" =~ ^[0-9]+$ ]] || [ "$deadline_ts" -le 0 ]; then @@ -173,30 +170,16 @@ def render_bash_proc_lifecycle_funcs_pid_tree(*, timeouts: StopTimeouts) -> str: return 1 fi - ok_s=0 while true; do - now=$(date +%s) - if [ "$now" -ge "$deadline_ts" ]; then - echo "$context probable-ready: deadline exceeded svc=$svc stable_seconds=$stable_seconds pid=$root_pid" - return 1 - fi - if ! _pid_exists "$root_pid"; then echo "$context probable-ready: supervisor pid exited svc=$svc pid=$root_pid" return 1 fi - if _pid_tree_has_child_process "$root_pid"; then - ok_s=$((ok_s+1)) - if [ "$ok_s" -ge "$stable_seconds" ]; then - echo "$context probable-ready: ok svc=$svc stable_seconds=$stable_seconds pid=$root_pid" - return 0 - fi - else - if [ "$ok_s" -ne 0 ]; then - echo "$context probable-ready: reset svc=$svc ok_s=$ok_s missing_child=true" - fi - ok_s=0 + now=$(date +%s) + if [ "$now" -ge "$deadline_ts" ]; then + echo "$context probable-ready: ok svc=$svc startup_window_seconds=$startup_window_seconds pid=$root_pid" + return 0 fi sleep 1 diff --git a/deployment/utils/selection_supervisor_codegen.py b/deployment/utils/selection_supervisor_codegen.py index 2945ff5..ab76dfc 100644 --- a/deployment/utils/selection_supervisor_codegen.py +++ b/deployment/utils/selection_supervisor_codegen.py @@ -13,6 +13,7 @@ PYTHON_SELECTION_SUPERVISOR_FILENAME = "selection_supervisor.py" +LOG_SHARD_HELPER_FILENAME = "log_shard.py" def render_python_selection_supervisor_module(*, timeouts) -> str: @@ -42,11 +43,13 @@ def render_python_selection_supervisor_module(*, timeouts) -> str: import enum import fcntl import hashlib +import importlib.util import json import os import signal import subprocess import sys +import threading import time from dataclasses import dataclass from pathlib import Path @@ -62,6 +65,37 @@ def render_python_selection_supervisor_module(*, timeouts) -> str: SANITIZED_CHILD_ENV_KEYS = ("RDMAV_DRIVERS", "IBV_DRIVERS") _shutdown_requested = False +_STDIO_ROUTER_THREAD = 
None +_STDIO_ROUTER_KEEPALIVE_FP = None + + +def _load_log_shard_helper(): + candidates = [] + raw_file = globals().get("__file__") + if isinstance(raw_file, str) and raw_file: + candidates.append(Path(raw_file).resolve().with_name("__LOG_SHARD_HELPER_FILENAME__")) + cwd = Path.cwd().resolve() + candidates.append(cwd / "__LOG_SHARD_HELPER_FILENAME__") + candidates.append(cwd / "deployment" / "utils" / "__LOG_SHARD_HELPER_FILENAME__") + for entry in sys.path: + if not isinstance(entry, str) or not entry: + continue + candidates.append(Path(entry).resolve() / "__LOG_SHARD_HELPER_FILENAME__") + helper_path = candidates[0] + for candidate in candidates: + if candidate.is_file(): + helper_path = candidate + break + spec = importlib.util.spec_from_file_location("_fluxon_selection_log_shard", helper_path) + if spec is None or spec.loader is None: + raise RuntimeError(f"failed to load log shard helper: {helper_path}") + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +_LOG_SHARD = _load_log_shard_helper() def main() -> int: @@ -96,6 +130,8 @@ def main() -> int: stop_parser.add_argument("--missing-ok", action="store_true") args = parser.parse_args() + runtime_state_for_stdio = _runtime_state_for_startup_stdio(args) + _redirect_process_stdio_to_runtime_log(runtime_state_for_stdio) # English note: # - The supervisor module is invoked both as a long-running `run` daemon and as a short-lived # `stop` helper from ops-managed reconcile loops. 
@@ -356,6 +392,16 @@ def _parse_run_command_spec(args: argparse.Namespace) -> RunCommandSpec: ) +def _runtime_state_for_startup_stdio(args: argparse.Namespace) -> Optional[SelectionRuntimeState]: + if str(args.command) != "run": + return None + label = _require_non_empty_str(args.label, "label") + state_json = args.state_json + if state_json is None: + return None + return _build_runtime_state(label=label, state_json=state_json) + + def _requested_phase1_overlap_with_applyless_owner( current_owner: Optional[LiveSupervisor], requested_runtime_state: Optional[SelectionRuntimeState], @@ -438,6 +484,7 @@ def _run_supervisor(spec: RunCommandSpec, selection_lock_fp=None) -> int: restart_timestamps: List[float] = [] backoff_seconds = spec.restart_delay_seconds + _redirect_process_stdio_to_runtime_log(runtime_state) while True: _log_reaped_children( @@ -661,6 +708,10 @@ def _sanitize_child_ld_library_path(raw_value: Optional[str]) -> Optional[str]: return ":".join(sanitized_entries) +def _expand_runtime_state_path(value: str) -> str: + return os.path.expandvars(value) + + def _spawn_child(command: List[str], workdir: Optional[Path]) -> subprocess.Popen[bytes]: def _set_pdeathsig_sigterm() -> None: libc = ctypes.CDLL("libc.so.6", use_errno=True) @@ -687,6 +738,40 @@ def _set_pdeathsig_sigterm() -> None: ) +def _redirect_process_stdio_to_runtime_log(runtime_state: Optional[SelectionRuntimeState]) -> None: + global _STDIO_ROUTER_THREAD + global _STDIO_ROUTER_KEEPALIVE_FP + if runtime_state is None: + return + if _STDIO_ROUTER_THREAD is not None: + return + base_log_path = _require_non_empty_str(runtime_state.log_path, "state.log_path") + read_fd, write_fd = os.pipe() + router_keepalive = os.dup(write_fd) + + def _router_loop() -> None: + _LOG_SHARD.relay_fd_to_daily_sharded_logs( + base_log_path=base_log_path, + read_fd=read_fd, + retention_days=_LOG_SHARD.DEFAULT_DAILY_LOG_RETENTION_DAYS, + ) + + router = threading.Thread( + target=_router_loop, + 
name="selection-supervisor-stdio-log-router", + daemon=True, + ) + router.start() + os.dup2(write_fd, sys.stdout.fileno()) + os.dup2(write_fd, sys.stderr.fileno()) + sys.stdout = os.fdopen(sys.stdout.fileno(), "w", encoding="utf-8", buffering=1, closefd=False) + sys.stderr = os.fdopen(sys.stderr.fileno(), "w", encoding="utf-8", buffering=1, closefd=False) + try: + os.close(write_fd) + except OSError: + pass + _STDIO_ROUTER_KEEPALIVE_FP = os.fdopen(router_keepalive, "w", encoding="utf-8", buffering=1) + _STDIO_ROUTER_THREAD = router def _retired_and_preserved_adopted_roots(root_pid: int) -> Tuple[List[int], List[int]]: adopted_roots = _direct_live_child_pids(root_pid) if not adopted_roots: @@ -788,7 +873,9 @@ def _selection_runtime_state_from_raw( apply_id=_require_optional_non_empty_str(raw.get("apply_id"), "state.apply_id"), argv=_require_non_empty_str_list(raw.get("argv"), "state.argv"), cwd=_require_optional_non_empty_str(raw.get("cwd"), "state.cwd"), - log_path=_require_non_empty_str(raw.get("log_path"), "state.log_path"), + log_path=_expand_runtime_state_path( + _require_non_empty_str(raw.get("log_path"), "state.log_path") + ), owner_ts_ms=owner_ts_ms, started_ts_ms=started_ts_ms, ) @@ -1337,6 +1424,7 @@ def _signal_pid_tree(root_pid: int, sig: signal.Signals, label: str) -> None: """ return ( textwrap.dedent(template) + .replace("__LOG_SHARD_HELPER_FILENAME__", LOG_SHARD_HELPER_FILENAME) .replace("__TERM_S__", str(term_s)) .replace("__KILL_S__", str(kill_s)) .replace("__SUPERSEDE_S__", str(supersede_s)) diff --git "a/fluxon_doc_cn/design/fluxon_0_\351\205\215\347\275\256\346\200\273\350\247\210.md" "b/fluxon_doc_cn/design/fluxon_0_\351\205\215\347\275\256\346\200\273\350\247\210.md" new file mode 100644 index 0000000..852b73f --- /dev/null +++ "b/fluxon_doc_cn/design/fluxon_0_\351\205\215\347\275\256\346\200\273\350\247\210.md" @@ -0,0 +1,217 @@ +# Fluxon 配置总览 + +## 1. 
Conclusion
+
+This document answers one question only: which stable configuration entry points exist in the Fluxon repository, what each one is responsible for, and which runtime structure each becomes after validation.
+
+**Stable conclusions:**
+
+- Configuration input and runtime structure are kept separate. YAML only declares intent; `verify()` / `parse_*()` converge it into a single executable result.
+- Shared contracts live preferentially in common modules such as `fluxon_commu_contract` and `fluxon_cli::config`; business packages mostly reuse or re-export them.
+- The `host:port`, `http(s)://...`, and `cluster-scoped path` formats are strictly distinguished, with no probing or fuzzy fallback.
+- Checked-in YAML in the repository falls into two classes: runtime contracts and environment/test contracts. The former are strictly validated; the latter mainly connect the development, deployment, and test pipelines.
+
+```mermaid
+flowchart TD
+    A[build_config_ext.yml<br/>build_config_ext_static.yml] --> B[setup_and_pack / repo_config_utils]
+    C[deployment/deployconf.yaml] --> D[deployment utils / fluxon_py tests]
+    E[fluxon_py/tests/test_config.yaml] --> D
+    F[fluxon_test_stack/*.yaml] --> G[teststack runner / start_test_bed]
+    H[fluxon_cli/src/config.rs] --> I[monitor / UI]
+    J[fluxon_kv/src/config.rs] --> K[KV runtime]
+    L[fluxon_fs_core/src/config.rs] --> M[FS runtime]
+    N[fluxon_commu_contract/src/config.rs] --> K
+    N --> M
+```
+
+## 2. Configuration Map
+
+| Config family | Entry file / module | Main consumers | Purpose |
+| --- | --- | --- | --- |
+| Repository environment config | `build_config_ext.yml` | Rust KV test family, `fluxon_py/tests/test_lib.py`, `setup_and_pack` packaging/validation scripts, TestStack `bin_kvtest` case staging | Provides the development/test baseline for etcd, Prometheus, remote write, and so on |
+| Static build config | `build_config_ext_static.yml` | `setup_and_pack/pack_release.py`, `build_pack_fluxonkv_pylib_img.py`, Nix packaging chain | Pins the wheel / manylinux version |
+| Deployment config | `deployment/deployconf.yaml` | Deployment scripts, `fluxon_py` test entry, TestStack generate/consume chain | Provides cluster nodes, service addresses, and global environment variables |
+| Python test config | `fluxon_py/tests/test_config.yaml` | `fluxon_py` test entry, test helper library, deployconf parsing chain | Links to deployconf and selects the KV backend type |
+| Dev/packaging environment config | `setup_and_pack/setup_dev_env/*.yaml`, `setup_and_pack/build_pack_fluxonkv_pylib_img/*.yaml`, `setup_and_pack/nix/*.yaml`, `pub_prepare_build.yaml` | `setup_and_pack` scripts | Provides environment input for dev machines and the packaging pipeline |
+| TestStack config | `fluxon_test_stack/ci_test_list.yaml`, `start_test_bed.yaml`, `gitops.yaml` | `test_runner.py`, `start_test_bed.py` | Defines the suite, testbed, GitOps, and UI entry points |
+| CLI monitoring config | `fluxon_cli/src/config.rs` | `master_ui_monitor`, `test_runner_ui` | Provides configuration for the monitoring and query pages |
+| KV config | `fluxon_kv/src/config.rs` | KV master / owner / external | Defines KV runtime roles and validation rules |
+| FS config | `fluxon_fs_core/src/config.rs` | FS master / agent / panel | Defines FS cache, master, panel, permissions, and transfer state |
+| Shared transport config | `fluxon_commu_contract/src/config.rs`, `transfer_engine/surface.rs` | KV / FS / commu | Provides `NetworkConfig`, `ProtocolType`, `TransferEngineType` |
+
+## 3. 
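General Rules
+
+| Rule | Meaning |
+| --- | --- |
+| `serde(deny_unknown_fields)` | Runtime YAML rejects unknown fields by default |
+| `from_file` / `from_str` + `verify` | Parse first, then converge into a strongly typed runtime config |
+| `YamlNullable` | Used only when "missing / null / value" must be distinguished |
+| `host:port` kept separate from `http(s)://...` | etcd / deployconf usually use the former; monitoring / Prometheus usually use the latter |
+| Derived values must be written back explicitly | For example cluster-scoped paths, default table names, default transport_mode |
+
+## 4. Environment and Deployment Configuration
+
+### 4.1 `build_config_ext.yml`
+
+This is the repository-level development environment config, not a business runtime config.
+
+| Field | Rule | Main use |
+| --- | --- | --- |
+| `etcd` | Required, `host:port` | Lets Rust / Python / test tools read the etcd address |
+| `prom` | Required, `http(s)://.../v1` or `.../api/v1` | Used for Grafana / TSDB query URLs |
+| `prom_remote_write_url` | Required, `http(s)://...` | Used for remote write |
+
+`setup_and_pack/utils/repo_config_utils.py` still reads the old name `prometheus_remote_write_url` for compatibility, but that is a transitional path for build tooling, not the recommended new contract.
+
+### 4.2 `build_config_ext_static.yml`
+
+Currently it pins a single value:
+
+| Field | Rule |
+| --- | --- |
+| `manylinux_version` | Required; only `2_28` is currently allowed |
+
+### 4.3 `deployment/deployconf.yaml`
+
+This is the core configuration for the deployment and packaging pipelines. The stable consumption surface currently has three main parts:
+
+| Block | Key fields | Purpose |
+| --- | --- | --- |
+| `cluster_nodes` | Node list | Basis for placeholder resolution |
+| `service` | Service-to-node mapping | Lets deployment and test scripts look up service ip:port |
+| `global_envs` | `ETCD_FULL_ADDRESS`, `FLUXON_PROMETHEUS_BASE_URL`, `MONITOR_GREPTIMEDB_WRITE_URL`, `FLUXON_CLUSTER_NAME`, `FLUXON_SHARED_MEM`, `FLUXON_SHARED_FILE` | Lets deployment/test code read cluster-level authority |
+
+`global_envs` supports placeholder resolution: a mapping is first built from `cluster_nodes` + `service`, then the variables are resolved to their final values.
+
+### 4.4 `fluxon_py/tests/test_config.yaml`
+
+This is a test entry config layer, not a runtime deployment config.
+
+| Field | Rule |
+| --- | --- |
+| `deployconf_path` | Required; points to the shared deployconf |
+| `kv_svc_type` | Required; the test helper currently accepts only known backend types |
+
+The test code still keeps mooncake-related read functions, but the checked-in minimal sample uses only the two fields above.
+
+### 4.5 `fluxon_test_stack/*`
+
+TestStack configuration already has its own design document; here it is summarized in brief:
+
+- `ci_test_list.yaml` defines the suite space.
+- `start_test_bed.yaml` defines the shared testbed and UI.
+- `gitops.yaml` defines GitOps polling and recording.
+- The generated `deployconf_testbed.yml` is a derived artifact, not a hand-written primary config.
+
+## 5. 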
Runtime Configuration
+
+### 5.1 KV
+
+The KV entry point is `fluxon_kv/src/config.rs`. Externally it splits into two stable YAML forms, master and client:
+
+| Type | Purpose |
+| --- | --- |
+| `MasterConfigYaml` | master node input |
+| `ClientConfigYaml` | owner / external input |
+| `TestSpecConfig` | Switches for test and experimental branches |
+| `MonitoringConfigYaml` | master monitoring block |
+| `NetworkConfig` | Network whitelist and IP mapping, shared from `fluxon_commu_contract` |
+
+Core branching rules:
+
+- When `contribute_to_cluster_pool_size` is missing or all zero, the process becomes external.
+- When `contribute_to_cluster_pool_size.dram > 0`, the process becomes owner.
+- When `test_spec_config.side_transfer_role = worker`, the process takes the side-transfer worker branch and forces `TransferEngineType::P2p`.
+
+Main constraints:
+
+- `monitoring` is required on master.
+- `master_ui` depends on `monitoring` and starts as an embedded monitor HTTP service.
+- `shared_memory_path` / `shared_file_path` are joined into paths scoped by `cluster_name`.
+- On the client side, `etcd_addresses` keeps two views: raw `host:port` and normalized `http://host:port`.
+- For a zero-contribution `external` / side worker, `etcd_addresses`, `sub_cluster`, and `large_file_paths` are inherited from the `shared.json` published by the owner; the local configuration surface keeps only the shared-bundle anchor needed to attach to the owner and this process's own parameters.
+
+Finer details on call sequencing, holder lifecycle, and concurrency rules are covered in `kv_1_概览与分层.md`, `kv_2_调用时序.md`, `kv_3_参数与并发.md`, and `kv_4_allocation_segment_holder生命周期.md`.
+
+### 5.2 FS
+
+FS configuration is concentrated in `fluxon_fs_core/src/config.rs`; the upper-level `fluxon_fs/src/config.rs` only re-exports it.
+
+| Config block | Entry | Result |
+| --- | --- | --- |
+| cache | `fluxon_fs.cache` | `FluxonFsGlobalConfig` |
+| master | `fluxon_fs.master` | `FluxonFsMasterConfig` |
+| master_panel | `fluxon_fs.master_panel` | `FluxonFsMasterPanelConfig` |
+
+Core fields of `fluxon_fs.cache`:
+
+- `stale_window_ms` must be `> 0`.
+- `write_session_target_inflight_bytes` may be omitted; it defaults to 128 MiB.
+- `rules[*]` requires an absolute path, a valid cache/write mode, a valid prefix, and a non-zero cache limit.
+- `exports[*]` requires an absolute path; a missing `nodes` means `AgentRegistry`, and a given `nodes` means `StaticNodes`.
+
+Core fields of `fluxon_fs.master`:
+
+- `instance_key` is required.
+- `pull_interval_ms` is optional, but must be `> 0` if given.
+- The legacy `fluxon_fs.rpc` and `rpc_timeout_ms` have been removed.
+
+Core fields of `fluxon_fs.master_panel`:
+
+- `listen_addr`, `public_base_url`, `prometheus_base_url`, and `access_db_path` are all part of the required baseline.
+- `bootstrap_access_model` is the panel's bootstrap authorization model.
+- 
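`transfer_state_store` currently has `tikv` as its stable implementation.
+- `s3_gateway` handles object requests and the KV miss policy.
+
+FS also splits the access model into two layers:
+
+- `access_model` is the input model for users/permissions.
+- `runtime_access_model` is the derived model used at runtime; passwords are hashed and no longer kept in plain form.
+
+### 5.3 CLI monitoring
+
+`fluxon_cli/src/config.rs` defines the unified monitoring-page configuration; both the KV `master_ui` and the TestStack UI reuse it.
+
+| Type | Key fields |
+| --- | --- |
+| `MonitorConfigYaml` | `etcd_endpoints`, `prometheus_base_url`, `cluster_name`, `member_kind`, `output` |
+| Optional | `mq_unique_key_prefixes`, `http_listen_addr`, `greptime_sql` |
+
+Main constraints:
+
+- `etcd_endpoints` must be non-empty and include a scheme.
+- `prometheus_base_url` must include a scheme.
+- `mq_unique_key_prefixes`, when given, must not be empty and must not have leading or trailing whitespace.
+- `greptime_sql` may be provided explicitly; if `prometheus_base_url` points at Greptime's `/v1/prometheus`, default SQL connection info is derived automatically.
+
+### 5.4 Shared transport contract
+
+`fluxon_commu_contract` provides several base types reused by both KV and FS:
+
+| Type | Values | Purpose |
+| --- | --- | --- |
+| `ProtocolType` | `Tcp` / `Rdma` | Input protocol selection |
+| `TransferEngineType` | `Closed` / `P2p` | Transfer engine branch |
+| `TransferBackendActivationMode` | Three explicit branches | Controls how the backend is activated |
+| `NetworkConfig` | `subnet_whitelist`, `primary_ip_to_extended_ips` | Network whitelist and extended-IP mapping |
+
+These types are shared contracts, not the private configuration of any single subsystem.
+
+## 6. Relationships Between Configurations
+
+| Relationship | Description |
+| --- | --- |
+| build_config_ext -> deployment/test | Fixes the environment baseline first, then supplies host, URL, and path to runtime configuration |
+| deployconf -> test_config | The Python test config points to the shared deployment config through `deployconf_path` |
+| deployconf -> teststack | `start_test_bed` and `test_runner` read the derived testbed deployconf |
+| commu_contract -> KV / FS | `ProtocolType`, `TransferEngineType`, and `NetworkConfig` are the shared foundation |
+| CLI config -> KV / TestStack UI | The master UI and runner UI reuse the same monitor config contract |
+
+## 7. Reading Guide
+
+If you only want the details of one area, read in this order:
+
+1. For environment/deployment, start with `deployment/utils/deployconf_config_utils.py` and `fluxon_util/src/dev_config.rs`.
+2. For KV, start with `fluxon_kv/src/config.rs`, then continue with `kv_1` through `kv_4`.
+3. For FS, start with `fluxon_fs_core/src/config.rs`, then read `用户 - 5 - FS接口.md`.
+4. 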
For TestStack, go straight to `teststack_1_当前架构与CI测试流程.md`.
diff --git "a/fluxon_doc_cn/design/log_1_\346\234\254\345\234\260\346\226\207\344\273\266\346\227\245\345\277\227\344\270\216Greptime_OTLP\345\257\274\345\207\272\351\223\276\350\267\257.md" "b/fluxon_doc_cn/design/log_1_\346\234\254\345\234\260\346\226\207\344\273\266\346\227\245\345\277\227\344\270\216Greptime_OTLP\345\257\274\345\207\272\351\223\276\350\267\257.md"
new file mode 100644
index 0000000..fd81c45
--- /dev/null
+++ "b/fluxon_doc_cn/design/log_1_\346\234\254\345\234\260\346\226\207\344\273\266\346\227\245\345\277\227\344\270\216Greptime_OTLP\345\257\274\345\207\272\351\223\276\350\267\257.md"
@@ -0,0 +1,414 @@
+# Fluxon Log Design 1 - Unified Log Standard and Greptime OTLP Export Path
+
+## 0. Overview
+This document defines the unified logging standard for the Fluxon service plane. The main code lives in `fluxon_rs/fluxon_kv/src/config.rs`, `fluxon_rs/fluxon_kv/src/lib.rs`, `fluxon_rs/fluxon_util/src/log.rs`, `fluxon_rs/fluxon_observability/src/greptime_otlp_tracing.rs`, `fluxon_rs/fluxon_observability/src/greptime_otlp_log_orchestrator.rs`, and `fluxon_rs/fluxon_observability/src/greptime_otlp_log.rs`.
+
+The stable conclusions, stated up front:
+
+- Local file logging is always enabled and serves as a replayable safety net.
+- Greptime OTLP export is controlled by `master.monitoring.otlp_log_api`; `master` owns the configuration source, and `owner` / `external` only consume the broadcast.
+- `testbed` is its own `log_service_kind`; the launcher, runner, UI, and workload all write to disk under the same log semantics.
+- The export path currently uses a best-effort policy and does not block the main business path.
+
+This document focuses on four questions:
+
+1. Which directory boundaries each log path currently lands in.
+2. What the current canonical file names, daily sharding, and 31-day cleanup semantics are.
+3. Which contracts between Rust and Python are already aligned and which are not.
+4. Which parts of the current implementation have converged and which have not fully converged.
+
+In KV, `external` and the side worker both consume only the owner-side discovery result. The current stable contract is: they explicitly configure `shared_memory_path` / `shared_file_path` as the shared-bundle anchor for attaching to the owner, while `large_file_paths` is inherited from the `shared.json` published by the owner. From startup, logs and cache land directly in the large-file directories derived by the owner, and the zero-contribution side is no longer required to configure a separate local large root.
+
+## 1. 
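Directory Boundaries
+Directory boundaries are only about physical isolation, not about a unified root. What is unified is naming, metadata, the archive window, and cleanup semantics.
+
+### 1.1 KV
+- `master` uses `log_dir` as its local primary log root and derives the cluster-scoped runtime log directory beneath it.
+- `owner`, `external`, and the side worker share a single `share_path` as the share root, which holds `mmap.file`, `shared.json`, peer metadata, and side-transfer-related files.
+- The `large_file_paths` of `owner` defines the physical root directories for large-file assets such as runtime logs and cache.
+- `external` and the side worker no longer declare their own `large_file_paths`. During the zero-contribution bootstrap phase they inherit the same set of large-file roots from the owner's `shared.json`, then directly reuse the runtime log / cache boundaries derived by the owner.
+
+### 1.2 ops / bare shared supervisor control plane
+Do not read `ops` and `bare` as two mutually independent planes. The two do share the same `selection_supervisor.py + log_shard.py` implementation source, but the directories they actually write to today do not form one fully unified tree.
+
+First distinguish two layers:
+
+| Layer | Stable root | Main contents |
+| --- | --- | --- |
+| `deployconf -> gen_bare -> bare bootstrap` | `hostworkdir` | generated control scripts, bare service logs |
+| `ops` runtime | `workdir` | runtime config, embedded supervisor runtime, ops-managed workload logs |
+
+Specifically:
+
+- `hostworkdir` is the node-level host root. It holds the artifacts pushed by the deployer, the bare control scripts, and other directories that need stable reuse across processes.
+- `workdir` is the run subdirectory of one specific process instance. It holds that instance's runtime config, embedded supervisor runtime, and the logs of the workloads it manages.
+- In terms of location, `workdir` is usually a subdirectory of `hostworkdir` in the current self-host deployconf. In terms of semantics, `workdir` is still only "the run subtree of one instance" and cannot stand in for the whole `hostworkdir`.
+
+The bare stable root can currently be pictured as:
+
+```text
+${HOSTWORKDIR}/
+  log/
+    ops_controller..log
+    ops_agent..log
+    ..log
+  gen_bare_deploy_bash/
+    start_ops_controller.sh
+    start_ops_agent.sh
+    start_.sh
+    stop_ops_controller.sh
+    stop_ops_agent.sh
+    stop_.sh
+    start_.sh
+    stop_.sh
+    selection_supervisor.py
+    log_shard.py
+    entrypoint__.sh
+```
+
+Under the current self-host deployconf, the actual layout of `hostworkdir` and the `ops workdir` can be pictured as:
+
+```text
+${HOSTWORKDIR}/
+  gen_bare_deploy_bash/
+    ...
+  log/
+    ops_controller..log
+    ops_agent..log
+    ..log
+  ops_controller/
+    ops_controller.yaml
+    selection_supervisor/
+      selection_supervisor.py
+      log_shard.py
+    log/
+      workload____..log
+  ops_agent/
+    /
+      ops_agent.yaml
+      selection_supervisor/
+        selection_supervisor.py
+        log_shard.py
+      log/
+        workload____..log
+```
+
+To spell out the contract once more:
+
+- The `start_*.sh` / `stop_*.sh` files under `${HOSTWORKDIR}/gen_bare_deploy_bash/` are generated control scripts. They are the entry scripts of this shared supervisor control plane, not a separate independent authority.
+- At the bare layer, the stable logical base name is still `${HOSTWORKDIR}/log/.log`; the shared supervisor runtime then resolves it to `${HOSTWORKDIR}/log/..log`.
+- At the ops-managed workload layer, the stable logical base name is `${WORKDIR}/log/workload____.log`; the shared supervisor runtime then resolves it to `${WORKDIR}/log/workload____..log`.
+- What the two layers truly share is the `selection_supervisor.py + log_shard.py` control and rolling implementation, not "identical paths and file names everywhere".
+
+In the current self-host deployconf example:
+
+- The workdir of `ops_controller` is `${HOSTWORKDIR}/ops_controller`
+- The workdir of `ops_agent` is `${HOSTWORKDIR}/ops_agent/${NODE_ID}`
+
+### 1.3 testbed
+- `workdir` and `run_dir` serve as the run-scoped on-disk boundaries for the launcher, runner, UI, and workload.
+- `testbed` must appear explicitly as a `log_service_kind`; a generic name is no longer used in its place.
+- The directory semantics of the launcher and workload must align with ops.
+- The current priority is not to make testbed support perfect first, but to clarify and converge the contract for long-running ops service logs. testbed continues to be discussed separately as "service-level logs" and "case artifacts".
+
+### 1.4 FS
+- `shared_file_path` and `export.remote_root_dir_abs` are used separately.
+- The former covers the shared attachment boundary.
+- The latter covers the business data boundary.
+
+The goal here is clear: directories may differ, but semantics must be consistent. `log`, `cache`, `shared attachment`, and `workload data` must not be mixed within the same boundary.
+
+## 2. 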
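File naming
+File naming in the current implementation is not yet fully unified, but it already falls into the following distinct categories.
+
+| Category | Current logical base name | Current file on disk |
+| --- | --- | --- |
+| KV runtime | `fluxon-kv-.log` | `fluxon-kv-..log` |
+| bare service logs | `.log` | `..log` |
+| ops-managed workload | `workload____.log` | `workload____..log` |
+| testbed service logs | `test_runner.log` / `test_runner_ui.log` | `test_runner..log` / `test_runner_ui..log` |
+| KV side worker stdio | `side_worker_.stdout.log` / `side_worker_.stderr.log` | no date shard yet |
+
+Additional notes:
+
+- KV runtime logs are still created by `fluxon_util::init_log(...)`. Both `run_master_impl(...)` and `run_client_impl(...)` initialize this local file logging, so the KV runtime processes `master`, `owner`, and `external` all currently produce these files.
+- `ops` still keeps a few special-case names, such as `smoke.log`, `smoke_bare.log`, and `smoke_workloads_bare.log`. These are legacy names that the current implementation has not yet converged.
+- `testbed` still has no single canonical log filename. Service-level logs now have date shards, but case-level logs such as those from `ci_runner` still mostly land in `results//run_/logs/**` and in run artifacts such as `summary.yaml`, `exception.txt`, and `ci.log`.
+
+Cleanup relies only on the agreed date-shard field in the file name; it does not consider directory count, file size, or historical batches. This is what lets local cleanup and Greptime retention share the same time window.
+
+## 3. Metadata fields
+This section describes the metadata fields that the current KV OTLP export path actually writes to Greptime.
+
+| Field | Meaning |
+| --- | --- |
+| `service.name` | currently fixed to `fluxon` |
+| `fluxon_cluster_name` | cluster name |
+| `fluxon_member_kind` | service-type label, for example `kv` |
+| `fluxon_role` | process-role label, for example `master`, `owner_client`, `external_client` |
+| `fluxon_member_id` | instance identifier |
+
+Log metadata in the current implementation is still organized around `cluster_name`, `member_kind`, `role`, and `member_id`. The finer-grained unified fields `log_service_kind`, `log_kind`, `process_role`, `instance_key`, `workload_kind`, and `workload_name` have not yet fully entered the export path.
+
+## 4. 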
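Archiving, timeouts, and cleanup
+Local file logs rotate daily and are kept for 31 days by default. Cleanup scans only canonical log file names, extracts the date shard according to the naming convention, and deletes expired files; it is not triggered by file count or total directory size.
+
+Streaming backup and OTLP export follow the same window:
+
+| Item | Rule |
+| --- | --- |
+| Export policy | best-effort; never blocks the main business path |
+| Queue full | dropping is allowed, with an observable signal retained |
+| Send failure | the current batch may be skipped; the local file remains |
+| Shutdown behavior | best-effort flush on shutdown |
+| Timeout semantics | each export must have a hard upper bound and must not hang indefinitely |
+
+Retention / TTL on the Greptime side converges on the same date window so that local and remote retention semantics match. The remote cleanup semantics are fixed: log records written to `fluxon_logs` are kept for only 1 month by default, and data beyond the window must be deleted by a Greptime table-level TTL or a scheduled cleanup job. Relying on query-layer time filters to merely "hide old data" is not enough.
+
+If the local window stays at 31 days, the Greptime side should keep the same 31-day window; if the local window changes to a new canonical value, the remote TTL must change with it. `disable_observability=true` disables only the OTLP layer, not local file logging.
+
+If a stream is only a "backup copy", it must not outlive the local log archive window on its own. After the timeout it should stop tailing, release its resources, and leave historical files to the local file archive policy.
+
+## 5. What the current implementation has already converged
+This section lists only what can already be treated as current fact.
+
+### 5.1 Daily sharding of local files and the 31-day window
+- KV runtime already has stable daily rotation and a retention window.
+- Bare service logs are wired to the shared supervisor's daily sharding and the same cleanup rule.
+- Ops-managed workload logs are wired to the shared supervisor's daily sharding and the same cleanup rule.
+- Testbed service-level logs such as `test_runner` / `test_runner_ui` now have daily sharding and the local 31-day retention window.
+
+### 5.2 The shared supervisor is unified on one implementation source
+- Bare bootstrap and ops-managed workloads now both reuse the `selection_supervisor.py + log_shard.py` implementation.
+- `gen_bare_deploy_bash.py` writes the same `log_shard.py` helper into the generated directory.
+- The bare start scripts keep only the stable logical base name; the actual stdio redirection and sharded writes happen in the shared `selection_supervisor.py` runtime.
+
+### 5.3 Rust and Python are explicitly aligned in three areas
+- Daily sharding and 31-day cleanup
+- Log directory derivation rules
+- Basic OTLP fields and Greptime headers
+
+## 6. 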
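What has not fully converged yet
+This section lists only the points that have not fully converged, so that "current fact" and "target state" are not mixed.
+
+### 6.1 KV directory boundaries have not fully converged on a single `share_path`
+- KV is expected to converge eventually on a single `share_path` that holds `mmap.file`, `shared.json`, and side-transfer metadata.
+- The current Rust implementation still keeps two settings, `shared_memory_path` and `shared_file_path`, used respectively for readiness probing and publishing of `mmap.file` and of `shared.json` / `peer metadata`.
+
+### 6.2 Side worker stdio has not converged on unified daily sharding
+- Zero-contribution bootstrap now inherits the owner's `large_file_paths` before startup, so the KV runtime logger no longer depends on hot-switching file paths after attach.
+- Side worker stdio, however, still writes directly to `side_worker_.stdout.log` / `side_worker_.stderr.log` and has not adopted the unified daily-shard naming.
+
+### 6.3 Side worker stdio and legacy `smoke` files are not part of this round of convergence
+- Side worker stdio is still `side_worker_.stdout.log` / `side_worker_.stderr.log`.
+- Legacy names such as `smoke.log`, `smoke_bare.log`, and `smoke_workloads_bare.log` still exist.
+
+### 6.4 In testbed, only service-level logs have converged on the same semantics
+- `test_runner` and `test_runner_ui` now use "stable logical base name + daily-sharded files on disk".
+- Case-level `run_dir/logs/**`, `summary.yaml`, `resolved_case.yaml`, `benchmark_result.json`, and similar files still follow the run-artifact lifecycle.
+- `history_lookback_days` still only controls which workdirs the UI looks back at; `gitops retention.max_age_days` still cleans gitops run directories and is not a unified TTL for testbed service log files.
+
+### 6.5 Unified OTLP fields and the unified state machine have not fully converged
+- The current export path is still built mainly around `cluster_name`, `member_kind`, `role`, and `member_id`.
+- The finer-grained canonical fields `log_service_kind`, `log_kind`, `process_role`, `instance_key`, `workload_kind`, and `workload_name` have not yet fully entered the export path.
+- The generic Rust path already enumerates the `disabled`, `direct`, `proxy`, and failure branches explicitly; the Python benchmark exporter is still a specialized direct-connection path and has not joined the same generic send state machine.
+
+## 7. rs / py module alignment and drift prevention
+The settled conclusions first:
+
+- The shared log contract follows the canonical Rust modules; Python should prefer reusing results that Rust already exports.
+- Three kinds of alignment are already visible directly in the code: daily sharding and 31-day cleanup, log directory derivation, and basic OTLP fields and headers.
+- What has not fully converged is the generic OTLP send state machine. Rust already enumerates the send branches explicitly, while the Python benchmark exporter is still a specialized direct-connection path.
+
+### 7.1 Daily sharding and the local retention window
+Rust `fluxon_rs/fluxon_util/src/log.rs`:
+
+```rust
+const LOG_RETENTION_DAYS: usize = 31;
+
+pub fn current_daily_sharded_log_path(base_path: &Path) -> anyhow::Result {
+    daily_sharded_log_path(base_path, current_shard_date()?)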
+}
+
+fn cleanup_old_daily_sharded_logs(base_path: &Path, retention_days: usize) -> anyhow::Result<()> {
+    let keep_since = current_shard_date()? - chrono::Days::new(retention_days.saturating_sub(1) as u64);
+    ...
+    if shard_date < keep_since {
+        fs::remove_file(&path)?;
+    }
+}
+
+impl DailyShardedFileWriter {
+    fn rotate_if_needed(&self, state: &mut DailyShardedFileWriterState) -> io::Result<()> {
+        let next_path = self.current_path()?;
+        cleanup_old_daily_sharded_logs(&self.base_path, self.retention_days)?;
+        let file = fs::OpenOptions::new().create(true).append(true).open(&next_path)?;
+        state.current_path = Some(next_path);
+        state.current_file = Some(file);
+        Ok(())
+    }
+}
+```
+
+Python `deployment/utils/log_shard.py`:
+
+```python
+DEFAULT_DAILY_LOG_RETENTION_DAYS = 31
+
+def daily_sharded_log_path(base_path: Path, *, now: Optional[datetime.datetime] = None) -> Path:
+    shard_date = _resolve_shard_date(ts)
+    return (base_path.parent / f"{stem}.{shard_date.isoformat()}.log").resolve()
+
+def cleanup_old_daily_sharded_logs(base_path: Path, *, retention_days: int = DEFAULT_DAILY_LOG_RETENTION_DAYS) -> None:
+    current_shard_date = _resolve_shard_date(datetime.datetime.now(datetime.timezone.utc))
+    keep_since = current_shard_date - datetime.timedelta(
+        days=max(int(retention_days) - 1, 0)
+    )
+```
+
+These two excerpts are now aligned on the same explicit contract: the logical base name stays unchanged, the date field always lands in `..log`, the default local window is 31 days on both sides, and expiry deletion is decided explicitly by date shard. Do not mechanically require the helper names to match on both sides; what is aligned is the contract "daily sharding + 31-day window + same cleanup rule".
+
+### 7.2 KV primary logs come from Rust; the Python side splits into bare service logs and ops-managed workload logs
+To fix the boundary first: KV runtime primary logs are currently produced almost entirely by Rust. The KV processes `master`, `owner`, and `external` go through `fluxon_util::init_log(...)`. What really needs separate checking on the Python side now falls into two layers:
+
+- The `deployconf -> gen_bare -> bare bootstrap` layer, which handles the stdout/stderr of `ops_controller`, `ops_agent`, and other bare services themselves.
+- Once `ops_agent` enters desired-runtime management, it manages workloads; the log contract at this layer no longer uses the bare `${service_name}.log` but `workload____.log`.
+
+First, the bare layer:
+
+Python `deployment/gen_bare_deploy_bash.py`:
+
+```python
+from log_shard import render_module_source as render_log_shard_module_source
+
+(outdir / LOG_SHARD_HELPER_FILENAME).write_text(
+    render_log_shard_module_source(),
+    encoding="utf-8",
+)
+```
+
+```python
+runtime_state_json = _bare_runtime_state_json(
+    workload_name=workload_name,
+    authority_name=...,
+    service_name=service_name,
+    log_path=f"${{HOSTWORKDIR}}/log/{service_name}.log",
+)
+
+LOG_DIR="$HOSTWORKDIR/log"
+LOGFILE="$LOG_DIR/${SERVICE}.log"
+...
+SUPERVISOR_PID=$( ... < /dev/null & echo "$!" )
+```
+
+Python `deployment/utils/selection_supervisor_codegen.py`:
+
+```python
+def _redirect_process_stdio_to_runtime_log(runtime_state: Optional[SelectionRuntimeState]) -> None:
+    base_log_path = _require_non_empty_str(runtime_state.log_path, "state.log_path")
+
+    def _router_loop() -> None:
+        _LOG_SHARD.relay_fd_to_daily_sharded_logs(
+            base_log_path=base_log_path,
+            read_fd=read_fd,
+            retention_days=_LOG_SHARD.DEFAULT_DAILY_LOG_RETENTION_DAYS,
+        )
+
+    os.dup2(write_fd, sys.stdout.fileno())
+    os.dup2(write_fd, sys.stderr.fileno())
+
+...
+
+_redirect_process_stdio_to_runtime_log(runtime_state)
+```
+
+Next, the ops-managed workload layer:
+
+Rust `fluxon_rs/fluxon_ops/src/lib.rs`:
+
+```rust
+fn workload_log_filename(kind: WorkloadKind, name: &str) -> anyhow::Result {
+    Ok(format!("workload__{}__{}.log", kind.as_str(), name))
+}
+
+let runtime_dir = workdir.join(OPS_SELECTION_SUPERVISOR_DIR_NAME);
+let log_dir = workdir.join(OPS_LOG_DIR_NAME);
+let log_path = self.log_dir.join(log_filename);
+```
+
+These excerpts show the current state:
+
+- Bare bootstrap and ops-managed workloads do reuse the same `selection_supervisor.py + log_shard.py` implementation source.
+- Bare service logs and ops-managed workload logs are both actually wired to this rotation helper.
+- The two layers do not, however, share the same path contract:
+  - bare service logs keep `${HOSTWORKDIR}/log/${service_name}.log`
+  - ops-managed workloads keep `${WORKDIR}/log/workload____.log`
+
+### 7.3 Basic OTLP fields and headers are aligned by name
+Rust `fluxon_rs/fluxon_observability/src/greptime_otlp_log.rs`:
+
+```rust
+let kvs = vec![
+    KeyValue { key: KEY_CLUSTER_NAME.to_string(), value: Some(...) },
+    KeyValue { key: KEY_MEMBER_KIND.to_string(), value: Some(...) },
+    KeyValue { key: KEY_ROLE.to_string(), value: Some(...) },
+    KeyValue { key: KEY_MEMBER_ID.to_string(), value: Some(...) 
},
+];
+
+let mut reqb = self
+    .http
+    .post(&self.endpoint)
+    .header("X-Greptime-DB-Name", &self.db_name)
+    .header("X-Greptime-Log-Extract-Keys", GREPTIME_LOG_EXTRACT_KEYS_HEADER_VALUE);
+```
+
+Python `fluxon_test_stack/distributed_benchmark_node.py`:
+
+```python
+log_attrs: Dict[str, Any] = {
+    "fluxon_cluster_name": self._cfg.cluster_name,
+    "fluxon_member_kind": self._cfg.member_kind,
+    "fluxon_role": self._cfg.role,
+    "fluxon_member_id": self._cfg.member_id,
+}
+
+headers = {
+    "Content-Type": "application/x-protobuf",
+    "X-Greptime-DB-Name": self._cfg.db_name,
+    "X-Greptime-Log-Extract-Keys": ",".join(extract_keys),
+}
+```
+
+Both sides are aligned on the same minimal common set: the basic attributes `fluxon_cluster_name`, `fluxon_member_kind`, `fluxon_role`, and `fluxon_member_id` have the same names and meanings, and the Greptime headers follow the same protocol surface. The Python benchmark exporter may add phase-summary fields but must not change the meaning of these basic fields.
+
+### 7.4 The send state machine has not fully converged
+Rust `fluxon_rs/fluxon_observability/src/greptime_otlp_log_orchestrator.rs`:
+
+```rust
+pub enum GreptimeOtlpLogAttemptResult {
+    Disabled,
+    Sent { path: GreptimeOtlpLogSendPath, proxy_node: Option },
+    SkippedNoProxy { detail: String },
+    ProxyFailed { proxy_node: N, detail: String },
+}
+```
+
+Python `fluxon_test_stack/distributed_benchmark_node.py`:
+
+```python
+with urllib.request.urlopen(req, timeout=GREPTIME_OTLP_LOG_TIMEOUT_SECONDS) as resp:
+    status = getattr(resp, "status", 200)
+    if int(status) < 200 or int(status) >= 300:
+        body_text = resp.read().decode("utf-8", errors="replace")
+        raise RuntimeError(f"greptime otlp http {status}: {body_text}")
+```
+
+This comparison reflects the current boundary: the generic Rust path already enumerates the `disabled`, `direct`, `proxy`, and failure branches explicitly, while the Python code here is only a specialized direct-connection path for benchmark phase summaries and has not joined the same generic send state machine. If Python later needs to handle generic service-plane export, it should reuse Rust's finite set of branches rather than invent a parallel state model.
+
+### 7.5 Preventing future drift
+Keep just four engineering rules:
+
+1. Keep a single source of truth for the shared contract. Semantics consumed across languages, such as directory derivation, canonical fields, send states, and TTL, are defined in Rust first; Python reuses the exported results or mirrors them item by item.
+2. Any change that affects canonical file names, OTLP fields, Greptime headers, send branches, or retention must update the Rust code, the Python code, the design document, and at least one layer of contract tests in the same PR.
+3. 
Python-specific paths must state their scope explicitly. `test_runner` service logs and the benchmark phase summary may keep their own implementations, but they must not become the defining source of the public contract.
+4. Across language boundaries, keep one name per concept. Do not introduce near-synonym fields, alias parameters, or parallel configuration surfaces on the rs and py sides; otherwise documentation, queries, cleanup, and alerting will all drift.
diff --git a/fluxon_py/config.py b/fluxon_py/config.py index 9b7b447..51e0d7d 100644 --- a/fluxon_py/config.py +++ b/fluxon_py/config.py @@ -110,6 +110,9 @@ def _yaml_template(): cluster_name: # Cluster name (str) shared_memory_path: # Shared memory path (str) shared_file_path: # Shared file path for shared.json/logs/profiles (str) + large_file_paths: # Owner-mode large file roots (dict(optional)) + log_root_path: # Log root path for owner/client large-file outputs (str) + cache_root_path: # Cache root path for owner/client large-file outputs (str) p2p_listen_port: # P2P QUIC listen port override (int(optional)) redis_compat: # Enable Redis protocol shim (dict(optional)) listen_addr: # TCP listen addr, e.g. "127.0.0.1:16379" (str) @@ -584,6 +587,18 @@ def to_fluxon_kv_client_config_yaml_str(self) -> str: return yaml.safe_dump(cfg, sort_keys=False) + if "large_file_paths" not in spec: + raise ValueError("fluxonkv_spec.large_file_paths is required for owner mode") + large_file_paths = spec.get("large_file_paths") + if not isinstance(large_file_paths, dict): + raise ValueError("fluxonkv_spec.large_file_paths must be a mapping in owner mode") + for field_name in ("log_root_path", "cache_root_path"): + field_value = large_file_paths.get(field_name) + if not isinstance(field_value, str) or not field_value.strip(): + raise ValueError( + f"fluxonkv_spec.large_file_paths.{field_name} must be a non-empty string in owner mode" + ) + return yaml.safe_dump(cfg, sort_keys=False) diff --git a/fluxon_py/tests/test_config.py b/fluxon_py/tests/test_config.py index 379e3e0..2979d8e 100644 --- a/fluxon_py/tests/test_config.py +++ b/fluxon_py/tests/test_config.py @@ -47,6 +47,7 @@ def _build_checks(selected_test_id: Optional[str]) -> List[Tuple[str, Callable[[ ("to_yaml_str_roundtrip", 
_run_test_to_yaml_str_roundtrip), ("fluxonkv_sub_cluster_config", test_fluxonkv_sub_cluster_config), ("fluxonkv_owner_requires_sub_cluster", test_fluxonkv_owner_requires_sub_cluster), + ("fluxonkv_owner_requires_large_file_paths", test_fluxonkv_owner_requires_large_file_paths), ("fluxonkv_p2p_relay_removed", test_fluxonkv_p2p_relay_removed), ("fluxon_client_config_yaml_shape", test_fluxon_client_config_yaml_shape), ("fluxonkv_protocol_field", test_fluxonkv_protocol_field), @@ -270,6 +271,54 @@ def test_fluxonkv_owner_requires_sub_cluster(): print(f"❌ FAIL: test_fluxonkv_owner_requires_sub_cluster - {e}") +def test_fluxonkv_owner_requires_large_file_paths(): + """Ensure owner mode requires explicit large_file_paths roots.""" + try: + base = { + "instance_key": "test_instance", + "contribute_to_cluster_pool_size": {"dram": 16777216, "vram": {}}, + "fluxonkv_spec": { + "etcd_addresses": ["localhost:2379"], + "cluster_name": "test_cluster", + "shared_memory_path": "/tmp/kvcache_shared_memory/test", + "shared_file_path": "/tmp/kvcache_shared_files/test", + "sub_cluster": "rack-a", + }, + } + + try: + FluxonKvClientConfig(copy.deepcopy(base)).to_fluxon_kv_client_config_yaml_str() + print("❌ FAIL: test_fluxonkv_owner_requires_large_file_paths - missing large_file_paths should be rejected") + return + except ValueError: + pass + + invalid_blank = copy.deepcopy(base) + invalid_blank["fluxonkv_spec"]["large_file_paths"] = { + "log_root_path": " ", + "cache_root_path": "/tmp/kvcache_large_cache/test", + } + try: + FluxonKvClientConfig(invalid_blank).to_fluxon_kv_client_config_yaml_str() + print("❌ FAIL: test_fluxonkv_owner_requires_large_file_paths - blank log_root_path should be rejected") + return + except ValueError: + pass + + valid = copy.deepcopy(base) + valid["fluxonkv_spec"]["large_file_paths"] = { + "log_root_path": "/tmp/kvcache_large_logs/test", + "cache_root_path": "/tmp/kvcache_large_cache/test", + } + rendered = 
FluxonKvClientConfig(valid).to_fluxon_kv_client_config_yaml_str() + assert "large_file_paths:" in rendered + assert "log_root_path: /tmp/kvcache_large_logs/test" in rendered + assert "cache_root_path: /tmp/kvcache_large_cache/test" in rendered + print("✅ PASS: test_fluxonkv_owner_requires_large_file_paths") + except Exception as e: + print(f"❌ FAIL: test_fluxonkv_owner_requires_large_file_paths - {e}") + + def test_fluxonkv_p2p_relay_removed(): """Ensure removed fluxonkv_spec.p2p_relay is rejected as an unknown key.""" try: diff --git a/fluxon_rs/Cargo.lock b/fluxon_rs/Cargo.lock index 4ddcf9b..a4b0ecd 100644 --- a/fluxon_rs/Cargo.lock +++ b/fluxon_rs/Cargo.lock @@ -1320,6 +1320,7 @@ dependencies = [ "anyhow", "askama", "base64 0.21.7", + "chrono", "clap", "etcd-client", "fluxon_cli", @@ -1336,6 +1337,7 @@ dependencies = [ "serde_json", "serde_yaml", "sha2", + "tempfile", "thiserror 1.0.69", "tokio", "tracing", diff --git a/fluxon_rs/fluxon_fs/src/agent.rs b/fluxon_rs/fluxon_fs/src/agent.rs index eca583e..03a3dd0 100644 --- a/fluxon_rs/fluxon_fs/src/agent.rs +++ b/fluxon_rs/fluxon_fs/src/agent.rs @@ -1407,20 +1407,20 @@ impl FluxonFsAgent { .get_self_info() .id .to_string(); - let shared_file_path = if self.kv_framework.is_external_mode() { + let cache_root_base = if self.kv_framework.is_external_mode() { self.kv_framework .external_client_api_view() .external_client_api() .inner() - .shared_file_path() + .cache_root_path() } else { self.kv_framework .client_seg_pool_view() .client_seg_pool() - .shared_file_path() + .cache_root_path() .to_string() }; - let cache_root = resolve_disk_cache_root(Path::new(&shared_file_path), &instance_key); + let cache_root = resolve_disk_cache_root(Path::new(&cache_root_base), &instance_key); let cache = RemoteDiskCacheManager::new(cache_root.clone(), disk_cache_max_bytes_from_env()) .map_err(|err| { diff --git a/fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs b/fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs index 7902beb..fb54c06 
100644 --- a/fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs +++ b/fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs @@ -46,6 +46,8 @@ pub struct ClientSegPoolNewArg { pub contribute_size: ContributeToClusterPoolSize, pub shared_memory_path: String, pub shared_file_path: String, + pub log_root_path: String, + pub cache_root_path: String, pub cluster_name: String, pub etcd_addresses: Vec, pub attach_existing_meta: Option, @@ -64,6 +66,7 @@ pub struct SharedJsonMeta { pub etcd_addresses: Vec, pub shared_memory_path: String, pub shared_file_path: String, + pub large_file_paths: crate::config::LargeFilePaths, pub protocol_version: String, pub write_ts: Option, } @@ -203,6 +206,10 @@ pub struct ClientSegPoolInner { shared_memory_path: String, /// Directory path for regular files (shared.json, side-transfer metadata). shared_file_path: String, + /// Base directory for runtime logs and profile outputs. + log_root_path: String, + /// Base directory for large cache files. + cache_root_path: String, side_transfer_worker: bool, attach_owner_ref: Option, @@ -262,6 +269,8 @@ impl ClientSegPool { let contribute_size = arg.contribute_size; let shared_memory_path = arg.shared_memory_path; let shared_file_path = arg.shared_file_path; + let log_root_path = arg.log_root_path; + let cache_root_path = arg.cache_root_path; let cluster_name = arg.cluster_name; let etcd_addresses = arg.etcd_addresses; let attach_existing_meta = arg.attach_existing_meta; @@ -356,6 +365,8 @@ impl ClientSegPool { view: std::sync::OnceLock::new(), shared_memory_path: shared_memory_path.clone(), shared_file_path: shared_file_path.clone(), + log_root_path: log_root_path.clone(), + cache_root_path: cache_root_path.clone(), side_transfer_worker, attach_owner_ref, cluster_name: cluster_name.clone(), @@ -372,6 +383,8 @@ impl ClientSegPool { view: std::sync::OnceLock::new(), shared_memory_path: shared_memory_path.clone(), shared_file_path: shared_file_path.clone(), + log_root_path: log_root_path.clone(), + 
cache_root_path: cache_root_path.clone(), side_transfer_worker, attach_owner_ref, cluster_name: cluster_name.clone(), @@ -535,6 +548,8 @@ impl ClientSegPool { view: std::sync::OnceLock::new(), shared_memory_path: base_path.to_string(), shared_file_path: shared_file_path.clone(), + log_root_path, + cache_root_path, side_transfer_worker, attach_owner_ref, cluster_name, @@ -553,6 +568,10 @@ impl ClientSegPool { &self.inner().shared_file_path } + pub fn cache_root_path(&self) -> &str { + &self.inner().cache_root_path + } + fn transfer_rpc_fast_path_eligible_members(&self) -> Vec { let inner = self.inner(); let self_info = inner.view().cluster_manager().get_self_info(); @@ -1161,6 +1180,10 @@ impl ClientSegPool { etcd_addresses: inner.etcd_addresses.clone(), shared_memory_path: shared_memory_canonical, shared_file_path: shared_file_canonical, + large_file_paths: crate::config::LargeFilePaths { + log_root_path: inner.log_root_path.clone(), + cache_root_path: inner.cache_root_path.clone(), + }, protocol_version, diff --git a/fluxon_rs/fluxon_kv/src/config.rs b/fluxon_rs/fluxon_kv/src/config.rs index 218ef69..2df094c 100644 --- a/fluxon_rs/fluxon_kv/src/config.rs +++ b/fluxon_rs/fluxon_kv/src/config.rs @@ -379,6 +379,17 @@ fn cluster_scoped_shared_path(root: &str, cluster_name: &str) -> KvResult KvResult { + let trimmed = root.trim(); + if trimmed.is_empty() { + return Err(ConfigError::InvalidClientConfig { + detail: format!("{field_name} cannot be empty"), + } + .into_kverror()); + } + Ok(trimmed.to_string()) +} + fn resolve_compiled_rdma_transfer_engine() -> KvResult { Ok(TransferEngineType::Closed) } @@ -552,6 +563,8 @@ pub struct FluxonKvSpecYaml { pub shared_memory_path: String, pub shared_file_path: String, #[serde(skip_serializing_if = "Option::is_none")] + pub large_file_paths: Option, + #[serde(skip_serializing_if = "Option::is_none")] pub p2p_listen_port: Option, #[serde(skip_serializing_if = "Option::is_none")] pub redis_compat: Option>, @@ -559,6 +572,13 @@ pub 
struct FluxonKvSpecYaml { pub sub_cluster: Option>, } +#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] +#[serde(deny_unknown_fields)] +pub struct LargeFilePathsYaml { + pub log_root_path: String, + pub cache_root_path: String, +} + #[derive(Debug, Clone, Serialize, Deserialize)] #[serde(deny_unknown_fields)] pub struct RedisCompatConfigYaml { @@ -608,6 +628,12 @@ pub struct FluxonKvSpec { pub sub_cluster: Option, } +#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq)] +pub struct LargeFilePaths { + pub log_root_path: String, + pub cache_root_path: String, +} + /// KV client backend types supported by the system #[derive(Debug, Clone, PartialEq)] pub enum KvClientType { @@ -627,6 +653,7 @@ pub struct ClientConfig { pub fluxonkv_spec: FluxonKvSpec, pub shared_memory_path: String, // Mandatory shared memory path pub shared_file_path: String, // Mandatory shared file path + pub large_file_paths: LargeFilePaths, // Mandatory large-file roots for logs and caches pub test_spec_config: TestSpecConfig, } @@ -893,7 +920,7 @@ impl ClientConfigYaml { .into_kverror()); } - // External (zero-contribution) mode forbids additional knobs to keep the schema minimal. + // External (zero-contribution) mode forbids additional owner-derived knobs to keep the schema minimal. 
if is_external { if self.fluxonkv_spec.redis_compat.is_some() { return Err(ConfigError::InvalidClientConfig { @@ -914,6 +941,12 @@ impl ClientConfigYaml { } .into_kverror()); } + if self.fluxonkv_spec.large_file_paths.is_some() { + return Err(ConfigError::InvalidClientConfig { + detail: "fluxonkv_spec.large_file_paths is forbidden in zero-contribution mode (it is inherited from owner shared.json)".to_string(), + } + .into_kverror()); + } } // Preserve historical behavior for configs that omit `protocol`, but allow @@ -1053,6 +1086,32 @@ impl ClientConfigYaml { } .into_kverror()); } + let large_file_paths = if is_external { + LargeFilePaths { + log_root_path: String::new(), + cache_root_path: String::new(), + } + } else { + let Some(large_file_paths_yaml) = self.fluxonkv_spec.large_file_paths.as_ref() else { + return Err(ConfigError::InvalidClientConfig { + detail: "fluxonkv_spec.large_file_paths is required for owner mode" + .to_string(), + } + .into_kverror()); + }; + let log_root_path = verify_non_empty_root_path( + &large_file_paths_yaml.log_root_path, + "large_file_paths.log_root_path", + )?; + let cache_root_path = verify_non_empty_root_path( + &large_file_paths_yaml.cache_root_path, + "large_file_paths.cache_root_path", + )?; + LargeFilePaths { + log_root_path, + cache_root_path, + } + }; let shared_memory_path = cluster_scoped_shared_path( &self.fluxonkv_spec.shared_memory_path, @@ -1062,7 +1121,6 @@ impl ClientConfigYaml { &self.fluxonkv_spec.shared_file_path, &fluxonkv_spec.cluster_name, )?; - let redis_compat_listen_addr = match self.fluxonkv_spec.redis_compat.as_ref() { None | Some(YamlNullable::Null) => None, Some(YamlNullable::Value(rc)) => { @@ -1094,6 +1152,7 @@ impl ClientConfigYaml { fluxonkv_spec, shared_memory_path, shared_file_path, + large_file_paths, test_spec_config, }) } @@ -1434,6 +1493,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + 
log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: disable_observability: true @@ -1480,6 +1542,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a "#, ) @@ -1492,6 +1557,45 @@ fluxonkv_spec: assert!(verified.fluxonkv_spec.enable_transfer_rpc_fast_path); } + #[test] + fn client_config_zero_contribution_allows_owner_bootstrapped_large_file_paths() { + let cfg = ClientConfigYaml::from_str( + r#" +instance_key: test_external +fluxonkv_spec: + cluster_name: test_cluster + shared_memory_path: /tmp/test_external + shared_file_path: /tmp/test_external_files +"#, + ) + .unwrap(); + let verified = cfg.verify().unwrap(); + assert_eq!(verified.large_file_paths.log_root_path, ""); + assert_eq!(verified.large_file_paths.cache_root_path, ""); + assert_eq!(verified.fluxonkv_spec.etcd_addresses, Vec::::new()); + assert_eq!(verified.fluxonkv_spec.sub_cluster, None); + } + + #[test] + fn client_config_zero_contribution_rejects_large_file_paths_in_yaml() { + let cfg = ClientConfigYaml::from_str( + r#" +instance_key: test_external +fluxonkv_spec: + cluster_name: test_cluster + shared_memory_path: /tmp/test_external + shared_file_path: /tmp/test_external_files + large_file_paths: + log_root_path: /tmp/test_external_logs + cache_root_path: /tmp/test_external_cache +"#, + ) + .unwrap(); + let err = cfg.verify().unwrap_err(); + let text = format!("{err}"); + assert!(text.contains("fluxonkv_spec.large_file_paths is forbidden in zero-contribution mode")); + } + #[test] fn client_test_spec_config_accepts_explicit_rdma_device_names() { let cfg = ClientConfigYaml::from_str( @@ -1505,6 +1609,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + 
log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: transport_mode: transfer_with_rpc @@ -1558,6 +1665,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: rdma_device_names: ["mlx5_0"] @@ -1593,6 +1703,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: transport_mode: transfer_with_rpc @@ -1624,6 +1737,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: require_transfer_rpc_fast_path_ready_timeout_seconds: 45 @@ -1649,6 +1765,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: tcp_thread_control_lane_count: 0 @@ -1675,6 +1794,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: transport_mode: transfer_with_rpc @@ -1706,6 +1828,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: 
transport_mode: transfer_with_rpc @@ -1730,6 +1855,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: rdma_device_names: ["mlx5_0"] @@ -1784,6 +1912,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_side_worker shared_file_path: /tmp/test_side_worker_files + large_file_paths: + log_root_path: /tmp/test_side_worker_logs + cache_root_path: /tmp/test_side_worker_cache p2p_listen_port: 18081 test_spec_config: enable_side_transfer: true @@ -1823,6 +1954,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_side_worker shared_file_path: /tmp/test_side_worker_files + large_file_paths: + log_root_path: /tmp/test_side_worker_logs + cache_root_path: /tmp/test_side_worker_cache test_spec_config: enable_side_transfer: true side_transfer_role: worker @@ -1854,6 +1988,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_side_worker shared_file_path: /tmp/test_side_worker_files + large_file_paths: + log_root_path: /tmp/test_side_worker_logs + cache_root_path: /tmp/test_side_worker_cache test_spec_config: enable_side_transfer: true side_transfer_role: worker @@ -1883,6 +2020,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache p2p_listen_port: 18081 sub_cluster: rack-a test_spec_config: @@ -1915,6 +2055,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: /tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a "#, ) @@ -1940,6 +2083,9 @@ fluxonkv_spec: cluster_name: test_cluster shared_memory_path: 
/tmp/test_owner shared_file_path: /tmp/test_owner_files + large_file_paths: + log_root_path: /tmp/test_owner_logs + cache_root_path: /tmp/test_owner_cache sub_cluster: rack-a test_spec_config: transport_mode: transfer_with_rpc diff --git a/fluxon_rs/fluxon_kv/src/external_client_api/external_client_test.rs b/fluxon_rs/fluxon_kv/src/external_client_api/external_client_test.rs index f811424..6a36ca7 100644 --- a/fluxon_rs/fluxon_kv/src/external_client_api/external_client_test.rs +++ b/fluxon_rs/fluxon_kv/src/external_client_api/external_client_test.rs @@ -2,8 +2,8 @@ use std::collections::HashMap; use crate::cluster_manager::NodeID; use crate::config::{ - ClientConfig, ContributeToClusterPoolSize, FluxonKvSpec, MasterConfig, MonitoringConfig, - ProtocolConfig, ProtocolType, TestSpecConfig, TransferEngineType, + ClientConfig, ContributeToClusterPoolSize, FluxonKvSpec, LargeFilePaths, MasterConfig, + MonitoringConfig, ProtocolConfig, ProtocolType, TestSpecConfig, TransferEngineType, }; use crate::master_kv_router::MasterKvRouterView; use crate::{ConfigArg, run_client, run_master}; @@ -82,6 +82,10 @@ fn new_client_config( }, shared_memory_path: shm_path.to_string(), shared_file_path: format!("{}_files", shm_path), + large_file_paths: LargeFilePaths { + log_root_path: format!("{}_logs", shm_path), + cache_root_path: format!("{}_cache", shm_path), + }, test_spec_config: TestSpecConfig::default(), } } @@ -124,6 +128,10 @@ fn new_zero_contribution_client_config( }, shared_memory_path: shm_path.to_string(), shared_file_path: format!("{}_files", shm_path), + large_file_paths: LargeFilePaths { + log_root_path: String::new(), + cache_root_path: String::new(), + }, test_spec_config: TestSpecConfig::default(), } } diff --git a/fluxon_rs/fluxon_kv/src/external_client_api/mod.rs b/fluxon_rs/fluxon_kv/src/external_client_api/mod.rs index f2634be..0758ab5 100644 --- a/fluxon_rs/fluxon_kv/src/external_client_api/mod.rs +++ b/fluxon_rs/fluxon_kv/src/external_client_api/mod.rs @@ -253,6 
+253,7 @@ define_module!( pub struct ExternalClientApiNewArg { pub shared_memory_path: String, pub shared_file_path: String, + pub cache_root_path: String, pub expected_cluster_name: String, pub expected_protocol_version: String, pub enable_side_transfer: bool, @@ -312,6 +313,7 @@ pub struct ExternalInner { expected_protocol_version: String, external_shared_memory_path: String, external_shared_file_path: String, + external_cache_root_path: String, _enable_side_transfer: bool, short_circuit_put_payload_path: bool, side_rr_next: AtomicUsize, @@ -363,6 +365,7 @@ impl ExternalClientApi { expected_protocol_version: arg.expected_protocol_version, external_shared_memory_path: arg.shared_memory_path, external_shared_file_path: arg.shared_file_path, + external_cache_root_path: arg.cache_root_path, _enable_side_transfer: arg.enable_side_transfer, short_circuit_put_payload_path: arg.short_circuit_put_payload_path, side_rr_next: AtomicUsize::new(0), @@ -1237,6 +1240,10 @@ impl ExternalInner { self.external_shared_file_path.clone() } + pub fn cache_root_path(&self) -> String { + self.external_cache_root_path.clone() + } + fn should_fallback_side_p2p_error(err: &crate::p2p::P2PError) -> bool { matches!( err, diff --git a/fluxon_rs/fluxon_kv/src/kvcore_test_lib.rs b/fluxon_rs/fluxon_kv/src/kvcore_test_lib.rs index 355ca6e..1b5754d 100644 --- a/fluxon_rs/fluxon_kv/src/kvcore_test_lib.rs +++ b/fluxon_rs/fluxon_kv/src/kvcore_test_lib.rs @@ -147,6 +147,10 @@ fn new_client_config_with_cluster_and_dram( }, shared_memory_path, shared_file_path, + large_file_paths: crate::config::LargeFilePaths { + log_root_path: format!("{}/large_logs/{}", base, instance_key), + cache_root_path: format!("{}/large_cache/{}", base, instance_key), + }, test_spec_config: TestSpecConfig::default(), }; println!("fluxonkv core created client config for test: {:?}", conf); diff --git a/fluxon_rs/fluxon_kv/src/lib.rs b/fluxon_rs/fluxon_kv/src/lib.rs index b46fd85..96e9b28 100644 --- 
a/fluxon_rs/fluxon_kv/src/lib.rs +++ b/fluxon_rs/fluxon_kv/src/lib.rs @@ -105,6 +105,13 @@ use std::sync::Arc; use std::time::{Duration, Instant}; use tracing::{info, warn}; +struct ExternalBootstrapBundle { + meta: SharedJsonMeta, + shared_memory_path: String, + shared_file_path: String, + etcd_endpoints: Vec<String>, +} + fn cluster_manager_rdma_control_init_from_transfer_config( _transfer_engine: TransferEngineType, _protocol: &ProtocolConfig, @@ -585,7 +592,7 @@ fn tcp_thread_transport_tuning_from_test_spec_config( } pub async fn load_client_config(config_arg: ConfigArg) -> KvResult<ClientConfig> { - match config_arg { + let config = match config_arg { ConfigArg::None => { // Try to find default config file match find_default_config_file() { @@ -594,13 +601,13 @@ pub async fn load_client_config(config_arg: ConfigArg) -> KvResult<ClientConfig> let config_yaml = ClientConfigYaml::from_file(&path)?; let config = config_yaml.verify()?; println!("Client configuration loaded and validated successfully"); - Ok(config) + config } None => Err(ConfigError::FileReadError { detail: "No config file found. 
Please provide a config file with -f option" .to_string(), } - .into_kverror()), + .into_kverror())?, } } ConfigArg::File(config_path) => { @@ -608,13 +615,15 @@ pub async fn load_client_config(config_arg: ConfigArg) -> KvResult<ClientConfig> let config_yaml = ClientConfigYaml::from_file(&config_path)?; let config = config_yaml.verify()?; println!("Client configuration loaded and validated successfully"); - Ok(config) + config } ConfigArg::Config(config) => { println!("Using provided client configuration"); - Ok(config) + config } - } + }; + + bootstrap_zero_contribution_client_config(config).await } pub async fn load_master_config(config_arg: ConfigArg) -> KvResult { @@ -785,6 +794,7 @@ fn build_side_transfer_worker_config( }, shared_memory_path: owner_config.shared_memory_path.clone(), shared_file_path: owner_config.shared_file_path.clone(), + large_file_paths: owner_config.large_file_paths.clone(), test_spec_config, }) } @@ -829,6 +839,7 @@ fn build_side_transfer_worker_config_yaml( cluster_name: side_config.cluster_name, shared_memory_path: side_config.shared_memory_path, shared_file_path: side_config.shared_file_path, + large_file_paths: None, p2p_listen_port: side_config.fluxonkv_spec.p2p_listen_port, redis_compat: None, sub_cluster: None, @@ -838,7 +849,7 @@ fn build_side_transfer_worker_config_yaml( } fn side_transfer_runtime_dir(owner_config: &ClientConfig) -> PathBuf { - Path::new(&owner_config.shared_file_path) + Path::new(&owner_config.large_file_paths.log_root_path) .join(format!("{}_cluster_kv_logs", owner_config.cluster_name)) .join("side_transfer_runtime") .join(&owner_config.instance_key) @@ -1569,6 +1580,265 @@ fn merge_startup_member_metadata( Ok(()) } +async fn bootstrap_zero_contribution_client_config(config: ClientConfig) -> KvResult<ClientConfig> { + let dram = config.contribute_to_cluster_pool_size.dram; + let vram_is_zero = config + .contribute_to_cluster_pool_size + .vram + .values() + .all(|&v| v == 0); + let is_zero_contribution = dram == 0 && vram_is_zero; + if 
!is_zero_contribution { + return Ok(config); + } + + let bundle = wait_for_external_bootstrap_bundle(&config).await?; + let mut final_config = config; + final_config.etcd_addresses_raw = bundle.meta.etcd_addresses.clone(); + final_config.fluxonkv_spec.etcd_addresses = bundle.etcd_endpoints; + final_config.fluxonkv_spec.sub_cluster = bundle.meta.sub_cluster.clone(); + final_config.shared_memory_path = bundle.shared_memory_path; + final_config.shared_file_path = bundle.shared_file_path; + final_config.large_file_paths = bundle.meta.large_file_paths; + Ok(final_config) +} + +async fn wait_for_external_bootstrap_bundle( + config: &ClientConfig, +) -> KvResult<ExternalBootstrapBundle> { + let build_version = fluxon_util::git_version_build_record::get_current_git_commitid().unwrap(); + let shared_memory_dir = Path::new(&config.shared_memory_path); + let shared_file_dir = Path::new(&config.shared_file_path); + let shared_json_path = shared_file_dir.join("shared.json"); + let mmap_file_path = shared_memory_dir.join("mmap.file"); + + let mut waited_ticks: u64 = 0; + loop { + if !shared_json_path.exists() || !mmap_file_path.exists() { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + info!( + "Waiting owner shared bundle to be ready... ({}s), shm_dir={} file_dir={} (shared.json={}, mmap.file={})", + waited_ticks / 5, + shared_memory_dir.to_string_lossy(), + shared_file_dir.to_string_lossy(), + shared_json_path.exists(), + mmap_file_path.exists() + ); + } + continue; + } + + let shared_json_buf = match std::fs::read_to_string(&shared_json_path) { + Ok(v) => v, + Err(e) => { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting owner shared.json readable... 
({}s), path={}, err={}", + waited_ticks / 5, + shared_json_path.to_string_lossy(), + e + ); + } + continue; + } + }; + + let meta: crate::SharedJsonMeta = match serde_json::from_str(&shared_json_buf) { + Ok(v) => v, + Err(e) => { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting owner shared.json schema ready... ({}s), path={}, err={}", + waited_ticks / 5, + shared_json_path.to_string_lossy(), + e + ); + } + continue; + } + }; + + if meta.protocol_version != build_version { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting protocol_version match... ({}s), shm_dir='{}' file_dir='{}', shared='{}', local='{}'", + waited_ticks / 5, + shared_memory_dir.to_string_lossy(), + shared_file_dir.to_string_lossy(), + meta.protocol_version, + build_version + ); + } + continue; + } + + if meta.cluster_name != config.cluster_name { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting cluster_name match... ({}s), shm_dir='{}' file_dir='{}', config='{}', shared.json='{}'", + waited_ticks / 5, + shared_memory_dir.to_string_lossy(), + shared_file_dir.to_string_lossy(), + config.cluster_name, + meta.cluster_name + ); + } + continue; + } + + let shared_memory_path_canonical = match std::fs::canonicalize(&config.shared_memory_path) { + Ok(v) => v.to_string_lossy().into_owned(), + Err(e) => { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting shared_memory_path canonicalizable... 
({}s), shm_dir='{}', path='{}', err={}", + waited_ticks / 5, + shared_memory_dir.to_string_lossy(), + config.shared_memory_path, + e + ); + } + continue; + } + }; + + let meta_shm_canonical = match std::fs::canonicalize(&meta.shared_memory_path) { + Ok(v) => v.to_string_lossy().into_owned(), + Err(e) => { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting shared.json shared_memory_path canonicalizable... ({}s), shm_dir='{}', path='{}', err={}", + waited_ticks / 5, + shared_memory_dir.to_string_lossy(), + meta.shared_memory_path, + e + ); + } + continue; + } + }; + + let shared_file_path_canonical = match std::fs::canonicalize(&config.shared_file_path) { + Ok(v) => v.to_string_lossy().into_owned(), + Err(e) => { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting shared_file_path canonicalizable... ({}s), file_dir='{}', path='{}', err={}", + waited_ticks / 5, + shared_file_dir.to_string_lossy(), + config.shared_file_path, + e + ); + } + continue; + } + }; + let meta_file_canonical = match std::fs::canonicalize(&meta.shared_file_path) { + Ok(v) => v.to_string_lossy().into_owned(), + Err(e) => { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting shared.json shared_file_path canonicalizable... ({}s), file_dir='{}', path='{}', err={}", + waited_ticks / 5, + shared_file_dir.to_string_lossy(), + meta.shared_file_path, + e + ); + } + continue; + } + }; + + if meta_shm_canonical != shared_memory_path_canonical { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting shared_memory_path match... 
({}s), shm_dir='{}', config='{}', shared.json='{}'", + waited_ticks / 5, + shared_memory_dir.to_string_lossy(), + shared_memory_path_canonical, + meta_shm_canonical + ); + } + continue; + } + if meta_file_canonical != shared_file_path_canonical { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting shared_file_path match... ({}s), file_dir='{}', config='{}', shared.json='{}'", + waited_ticks / 5, + shared_file_dir.to_string_lossy(), + shared_file_path_canonical, + meta_file_canonical + ); + } + continue; + } + + if meta.etcd_addresses.is_empty() { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting shared.json etcd_addresses non-empty... ({}s), shm_dir='{}' file_dir='{}', shared_memory_path='{}'", + waited_ticks / 5, + shared_memory_dir.to_string_lossy(), + shared_file_dir.to_string_lossy(), + meta_shm_canonical + ); + } + continue; + } + + let etcd_endpoints = match normalize_etcd_addresses(&meta.etcd_addresses) { + Ok(v) => v, + Err(e) => { + limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; + waited_ticks += 1; + if waited_ticks % 25 == 0 { + warn!( + "Waiting shared.json etcd_addresses valid... 
({}s), shm_dir='{}' file_dir='{}', raw={:?}, err={}", + waited_ticks / 5, + shared_memory_dir.to_string_lossy(), + shared_file_dir.to_string_lossy(), + meta.etcd_addresses, + e + ); + } + continue; + } + }; + + return Ok(ExternalBootstrapBundle { + meta, + shared_memory_path: meta_shm_canonical, + shared_file_path: meta_file_canonical, + etcd_endpoints, + }); + } +} + async fn run_client_impl( config_arg: ConfigArg, test_overrides: Option, @@ -1598,9 +1868,8 @@ async fn run_client_impl( let build_version = fluxon_util::git_version_build_record::get_current_git_commitid().unwrap(); let source_sha256 = fluxon_util::build_info::SOURCE_SHA256; - // 初始化日志系统:将日志放到共享文件根目录 - // 下的 {cluster_name}_cluster_kv_logs 子目录,避免在 shm 根目录下展开普通文件。 - let kv_logs_dir = Path::new(&config.shared_file_path) + // Logs and other large files are isolated from shared.json/peer metadata. + let kv_logs_dir = Path::new(&config.large_file_paths.log_root_path) .join(format!("{}_cluster_kv_logs", config.cluster_name)); let observability_disabled = config.test_spec_config.disable_observability; let greptime_tracing_rx = if observability_disabled { @@ -1651,263 +1920,10 @@ async fn run_client_impl( config.test_spec_config.side_transfer_role, Some(SideTransferRole::Worker) ); - let mut bootstrapped_shared_meta: Option = None; - - let config = if is_external { - let shared_memory_dir = Path::new(&config.shared_memory_path); - let shared_file_dir = Path::new(&config.shared_file_path); - let shared_json_path = shared_file_dir.join("shared.json"); - let mmap_file_path = shared_memory_dir.join("mmap.file"); - - let mut waited_ticks: u64 = 0; - let (meta, meta_shm_canonical, meta_file_canonical, etcd_endpoints) = loop { - if !shared_json_path.exists() || !mmap_file_path.exists() { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - info!( - "Waiting owner shared bundle to be ready... 
({}s), shm_dir={} file_dir={} (shared.json={}, mmap.file={})", - waited_ticks / 5, - shared_memory_dir.to_string_lossy(), - shared_file_dir.to_string_lossy(), - shared_json_path.exists(), - mmap_file_path.exists() - ); - } - continue; - } - - let shared_json_buf = match std::fs::read_to_string(&shared_json_path) { - Ok(v) => v, - Err(e) => { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)) - .await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting owner shared.json readable... ({}s), path={}, err={}", - waited_ticks / 5, - shared_json_path.to_string_lossy(), - e - ); - } - continue; - } - }; - - let meta: crate::SharedJsonMeta = match serde_json::from_str(&shared_json_buf) { - Ok(v) => v, - Err(e) => { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)) - .await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting owner shared.json schema ready... ({}s), path={}, err={}", - waited_ticks / 5, - shared_json_path.to_string_lossy(), - e - ); - } - continue; - } - }; - - if meta.protocol_version != build_version { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting protocol_version match... ({}s), shm_dir='{}' file_dir='{}', shared='{}', local='{}'", - waited_ticks / 5, - shared_memory_dir.to_string_lossy(), - shared_file_dir.to_string_lossy(), - meta.protocol_version, - build_version - ); - } - continue; - } - - if meta.cluster_name != config.cluster_name { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting cluster_name match... 
({}s), shm_dir='{}' file_dir='{}', config='{}', shared.json='{}'", - waited_ticks / 5, - shared_memory_dir.to_string_lossy(), - shared_file_dir.to_string_lossy(), - config.cluster_name, - meta.cluster_name - ); - } - continue; - } - - let shared_memory_path_canonical = match std::fs::canonicalize( - &config.shared_memory_path, - ) { - Ok(v) => v.to_string_lossy().into_owned(), - Err(e) => { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)) - .await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting shared_memory_path canonicalizable... ({}s), shm_dir='{}', path='{}', err={}", - waited_ticks / 5, - shared_memory_dir.to_string_lossy(), - config.shared_memory_path, - e - ); - } - continue; - } - }; - - let meta_shm_canonical = match std::fs::canonicalize(&meta.shared_memory_path) { - Ok(v) => v.to_string_lossy().into_owned(), - Err(e) => { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)) - .await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting shared.json shared_memory_path canonicalizable... ({}s), shm_dir='{}', path='{}', err={}", - waited_ticks / 5, - shared_memory_dir.to_string_lossy(), - meta.shared_memory_path, - e - ); - } - continue; - } - }; - let shared_file_path_canonical = match std::fs::canonicalize(&config.shared_file_path) { - Ok(v) => v.to_string_lossy().into_owned(), - Err(e) => { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)) - .await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting shared_file_path canonicalizable... 
({}s), file_dir='{}', path='{}', err={}", - waited_ticks / 5, - shared_file_dir.to_string_lossy(), - config.shared_file_path, - e - ); - } - continue; - } - }; - let meta_file_canonical = match std::fs::canonicalize(&meta.shared_file_path) { - Ok(v) => v.to_string_lossy().into_owned(), - Err(e) => { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)) - .await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting shared.json shared_file_path canonicalizable... ({}s), file_dir='{}', path='{}', err={}", - waited_ticks / 5, - shared_file_dir.to_string_lossy(), - meta.shared_file_path, - e - ); - } - continue; - } - }; - - if meta_shm_canonical != shared_memory_path_canonical { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting shared_memory_path match... ({}s), shm_dir='{}', config='{}', shared.json='{}'", - waited_ticks / 5, - shared_memory_dir.to_string_lossy(), - shared_memory_path_canonical, - meta_shm_canonical - ); - } - continue; - } - if meta_file_canonical != shared_file_path_canonical { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting shared_file_path match... ({}s), file_dir='{}', config='{}', shared.json='{}'", - waited_ticks / 5, - shared_file_dir.to_string_lossy(), - shared_file_path_canonical, - meta_file_canonical - ); - } - continue; - } - - if meta.etcd_addresses.is_empty() { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)).await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting shared.json etcd_addresses non-empty... 
({}s), shm_dir='{}' file_dir='{}', shared_memory_path='{}'", - waited_ticks / 5, - shared_memory_dir.to_string_lossy(), - shared_file_dir.to_string_lossy(), - meta_shm_canonical - ); - } - continue; - } - - let etcd_endpoints = match normalize_etcd_addresses(&meta.etcd_addresses) { - Ok(v) => v, - Err(e) => { - limit_thirdparty::tokio::time::sleep(std::time::Duration::from_millis(200)) - .await; - waited_ticks += 1; - if waited_ticks % 25 == 0 { - warn!( - "Waiting shared.json etcd_addresses valid... ({}s), shm_dir='{}' file_dir='{}', raw={:?}, err={}", - waited_ticks / 5, - shared_memory_dir.to_string_lossy(), - shared_file_dir.to_string_lossy(), - meta.etcd_addresses, - e - ); - } - continue; - } - }; - - break ( - meta, - meta_shm_canonical, - meta_file_canonical, - etcd_endpoints, - ); - }; - bootstrapped_shared_meta = Some(meta.clone()); - // External bootstrap contract: - // - Config provides: instance_key, fluxonkv_spec.cluster_name, fluxonkv_spec.shared_memory_path, - // fluxonkv_spec.shared_file_path, fluxonkv_spec.p2p_listen_port. - // - shared.json provides: cluster_name, etcd_addresses (raw), shared_memory_path (canonical), - // shared_file_path (canonical), protocol_version, sub_cluster. - // - pprof_duration_seconds is not inherited; it is controlled solely by config. 
- let mut final_config = config.clone(); - final_config.etcd_addresses_raw = meta.etcd_addresses.clone(); - final_config.fluxonkv_spec.etcd_addresses = etcd_endpoints; - final_config.fluxonkv_spec.sub_cluster = meta.sub_cluster; - final_config.shared_memory_path = meta_shm_canonical; - final_config.shared_file_path = meta_file_canonical; - final_config + let bootstrapped_shared_meta = if is_external { + Some(wait_for_external_bootstrap_bundle(&config).await?.meta) } else { - config + None }; if !is_external && config.test_spec_config.side_transfer_worker_count > 0 { @@ -2012,6 +2028,7 @@ async fn run_client_impl( external_client_api_arg: ExternalClientApiNewArg { shared_memory_path: config.shared_memory_path.clone(), shared_file_path: config.shared_file_path.clone(), + cache_root_path: config.large_file_paths.cache_root_path.clone(), expected_cluster_name: config.cluster_name.clone(), expected_protocol_version: build_version.clone(), enable_side_transfer: config.test_spec_config.enable_side_transfer, @@ -2063,6 +2080,8 @@ async fn run_client_impl( // Read shared memory path from config (must not be empty). 
shared_memory_path: config.shared_memory_path.clone(), shared_file_path: config.shared_file_path.clone(), + log_root_path: config.large_file_paths.log_root_path.clone(), + cache_root_path: config.large_file_paths.cache_root_path.clone(), cluster_name: config.cluster_name.clone(), etcd_addresses: config.etcd_addresses_raw.clone(), attach_existing_meta: if is_side_transfer_worker { @@ -2405,7 +2424,7 @@ async fn run_client_impl( } let shutdown_waiter = framework.cluster_manager_view().register_shutdown_waiter(); - let kv_profiles_dir = Path::new(&config.shared_file_path) + let kv_profiles_dir = Path::new(&config.large_file_paths.log_root_path) .join(format!("{}_cluster_kv_profiles", config.cluster_name)); profile::spawn_pprof_flamegraph_on_timeout_or_shutdown( config.pprof_duration_seconds, @@ -2485,6 +2504,10 @@ mod tests { }, shared_memory_path: "/tmp/fluxon_side_transfer_test".to_string(), shared_file_path: "/tmp/fluxon_side_transfer_test_files".to_string(), + large_file_paths: crate::config::LargeFilePaths { + log_root_path: "/tmp/fluxon_side_transfer_test_large/log".to_string(), + cache_root_path: "/tmp/fluxon_side_transfer_test_large/cache".to_string(), + }, test_spec_config: TestSpecConfig { enable_side_transfer: true, side_transfer_worker_count: 4, @@ -2720,6 +2743,7 @@ mod tests { ); assert!(side_cfg_yaml.contribute_to_cluster_pool_size.is_none()); assert!(side_cfg_yaml.fluxonkv_spec.etcd_addresses.is_none()); + assert!(side_cfg_yaml.fluxonkv_spec.large_file_paths.is_none()); assert!(side_cfg_yaml.fluxonkv_spec.sub_cluster.is_none()); assert_eq!(side_cfg_yaml.fluxonkv_spec.p2p_listen_port, Some(42001)); assert_eq!( @@ -2728,6 +2752,101 @@ mod tests { ); } + #[tokio::test] + async fn zero_contribution_bootstrap_inherits_large_file_paths_from_owner_shared_json() { + let tempdir = new_test_dir("fluxon_external_bootstrap_large_paths"); + let shared_memory_root = tempdir.join("shared_mem"); + let shared_file_root = tempdir.join("shared_file"); + let 
owner_log_root = tempdir.join("owner_logs"); + let owner_cache_root = tempdir.join("owner_cache"); + std::fs::create_dir_all(&shared_memory_root).unwrap(); + std::fs::create_dir_all(&shared_file_root).unwrap(); + std::fs::create_dir_all(&owner_log_root).unwrap(); + std::fs::create_dir_all(&owner_cache_root).unwrap(); + std::fs::write(shared_memory_root.join("mmap.file"), vec![0u8; 4096]).unwrap(); + + let shared_meta = SharedJsonMeta { + owner_id: "owner-a".to_string(), + node_start_time: 123, + segment_len: 4096, + segment_label: Some("cpu:0".to_string()), + sub_cluster: Some("owner-sub".to_string()), + cluster_name: "test_cluster".to_string(), + etcd_addresses: vec!["127.0.0.1:2379".to_string()], + shared_memory_path: std::fs::canonicalize(&shared_memory_root) + .unwrap() + .to_string_lossy() + .into_owned(), + shared_file_path: std::fs::canonicalize(&shared_file_root) + .unwrap() + .to_string_lossy() + .into_owned(), + large_file_paths: crate::config::LargeFilePaths { + log_root_path: owner_log_root.to_string_lossy().into_owned(), + cache_root_path: owner_cache_root.to_string_lossy().into_owned(), + }, + protocol_version: + fluxon_util::git_version_build_record::get_current_git_commitid().unwrap(), + write_ts: Some(chrono::Utc::now().timestamp_micros()), + }; + std::fs::write( + shared_file_root.join("shared.json"), + serde_json::to_vec(&shared_meta).unwrap(), + ) + .unwrap(); + + let config = ClientConfig { + cluster_name: "test_cluster".to_string(), + etcd_addresses_raw: Vec::new(), + instance_key: "external-a".to_string(), + contribute_to_cluster_pool_size: ContributeToClusterPoolSize { + dram: 0, + vram: HashMap::new(), + }, + protocol: ProtocolConfig { + protocol_type: ProtocolType::Tcp, + rdma_device_names: None, + }, + pprof_duration_seconds: None, + redis_compat_listen_addr: None, + fluxonkv_spec: FluxonKvSpec { + etcd_addresses: Vec::new(), + cluster_name: "test_cluster".to_string(), + p2p_listen_port: Some(41001), + transfer_engine: 
TransferEngineType::P2p, + enable_transfer_rpc_fast_path: false, + sub_cluster: None, + }, + shared_memory_path: shared_memory_root.to_string_lossy().into_owned(), + shared_file_path: shared_file_root.to_string_lossy().into_owned(), + large_file_paths: crate::config::LargeFilePaths { + log_root_path: String::new(), + cache_root_path: String::new(), + }, + test_spec_config: TestSpecConfig::default(), + }; + + let bootstrapped = bootstrap_zero_contribution_client_config(config) + .await + .expect("bootstrap zero-contribution config"); + assert_eq!( + bootstrapped.large_file_paths.log_root_path, + owner_log_root.to_string_lossy() + ); + assert_eq!( + bootstrapped.large_file_paths.cache_root_path, + owner_cache_root.to_string_lossy() + ); + assert_eq!( + bootstrapped.fluxonkv_spec.sub_cluster, + Some("owner-sub".to_string()) + ); + assert_eq!( + bootstrapped.fluxonkv_spec.etcd_addresses, + vec!["http://127.0.0.1:2379".to_string()] + ); + } + #[test] fn current_exe_name_helpers_detect_python_and_fluxon_kv() { assert!(current_exe_looks_like_python(Path::new( diff --git a/fluxon_rs/fluxon_kv/src/memholder/memholder_test.rs b/fluxon_rs/fluxon_kv/src/memholder/memholder_test.rs index 377a1c2..5b260c3 100644 --- a/fluxon_rs/fluxon_kv/src/memholder/memholder_test.rs +++ b/fluxon_rs/fluxon_kv/src/memholder/memholder_test.rs @@ -94,6 +94,10 @@ fn new_client_config_with_size( }, shared_memory_path: format!("/tmp/kvcache_shared_memory/{}", instance_key), shared_file_path: format!("/tmp/kvcache_shared_files/{}", instance_key), + large_file_paths: crate::config::LargeFilePaths { + log_root_path: format!("/tmp/kvcache_large_logs/{}", instance_key), + cache_root_path: format!("/tmp/kvcache_large_cache/{}", instance_key), + }, test_spec_config: TestSpecConfig::default(), } } @@ -127,6 +131,10 @@ fn new_zero_contribution_client_config( }, shared_memory_path: format!("/tmp/kvcache_shared_memory/{}", owner_instance_key), shared_file_path: format!("/tmp/kvcache_shared_files/{}", 
owner_instance_key), + large_file_paths: crate::config::LargeFilePaths { + log_root_path: String::new(), + cache_root_path: String::new(), + }, test_spec_config: TestSpecConfig::default(), } } diff --git a/fluxon_rs/fluxon_ops/Cargo.toml b/fluxon_rs/fluxon_ops/Cargo.toml index 0d54fc5..f4f772a 100644 --- a/fluxon_rs/fluxon_ops/Cargo.toml +++ b/fluxon_rs/fluxon_ops/Cargo.toml @@ -5,6 +5,7 @@ edition = "2024" [dependencies] anyhow = { workspace = true } +chrono = { workspace = true } serde = { workspace = true } serde_json = { workspace = true } serde_yaml = { workspace = true } @@ -28,3 +29,6 @@ fluxon_framework = { path = "../fluxon_framework" } fluxon_util = { path = "../fluxon_util" } fluxon_cli = { path = "../fluxon_cli" } fluxon_proxy = { path = "../fluxon_proxy" } + +[dev-dependencies] +tempfile = { workspace = true } diff --git a/fluxon_rs/fluxon_ops/build.rs b/fluxon_rs/fluxon_ops/build.rs index ae424ef..585fbfc 100644 --- a/fluxon_rs/fluxon_ops/build.rs +++ b/fluxon_rs/fluxon_ops/build.rs @@ -58,14 +58,23 @@ print( String::from_utf8(output.stdout).expect("selection supervisor output must be utf-8") } +fn render_log_shard_helper(repo_root: &Path) -> String { + let helper_path = repo_root.join("deployment").join("utils").join("log_shard.py"); + fs::read_to_string(&helper_path) + .unwrap_or_else(|e| panic!("read log shard helper failed: {} ({})", helper_path.display(), e)) +} + fn main() { let manifest_dir = PathBuf::from(env::var("CARGO_MANIFEST_DIR").expect("CARGO_MANIFEST_DIR")); let repo_root = repo_root(&manifest_dir); let source = render_selection_supervisor(&repo_root); + let log_shard_source = render_log_shard_helper(&repo_root); let out_dir = PathBuf::from(env::var("OUT_DIR").expect("OUT_DIR")); let out_path = out_dir.join("selection_supervisor.py"); fs::write(&out_path, source).expect("write embedded selection supervisor source"); + let helper_out_path = out_dir.join("log_shard.py"); + fs::write(&helper_out_path, log_shard_source).expect("write 
embedded log shard helper"); println!("cargo:rerun-if-changed=build.rs"); println!( @@ -76,4 +85,8 @@ fn main() { .join("selection_supervisor_codegen.py") .display() ); + println!( + "cargo:rerun-if-changed={}", + repo_root.join("deployment").join("utils").join("log_shard.py").display() + ); } diff --git a/fluxon_rs/fluxon_ops/src/lib.rs b/fluxon_rs/fluxon_ops/src/lib.rs index 40f646a..b27420b 100644 --- a/fluxon_rs/fluxon_ops/src/lib.rs +++ b/fluxon_rs/fluxon_ops/src/lib.rs @@ -28,7 +28,8 @@ use fluxon_kv::{ConfigArg, Framework, run_client}; use fluxon_proxy::{HeaderKv, PanelProxyMethod, PanelProxyResp}; use fluxon_util::{ - FluxonCliProxyDescriptorV2, FluxonCliProxyTransportV2, fluxon_cli_proxy_desc_etcd_key_v2, + FluxonCliProxyDescriptorV2, FluxonCliProxyTransportV2, display_runtime_log_path, + fluxon_cli_proxy_desc_etcd_key_v2, resolve_readable_log_path, }; pub const OPS_SERVICE_NAME: &str = "ops"; @@ -57,6 +58,7 @@ const OPS_ATOMIC_GROUP_ANNOTATION_KEY: &str = "fluxon.io/atomic_group"; const OPS_ATOMIC_GROUP_PHASE_ANNOTATION_KEY: &str = "fluxon.io/atomic_group_phase"; const OPS_ATOMIC_GROUP_ORDER_ANNOTATION_KEY: &str = "fluxon.io/atomic_group_order"; const OPS_SELECTION_SUPERVISOR_FILENAME: &str = "selection_supervisor.py"; +const OPS_LOG_SHARD_HELPER_FILENAME: &str = "log_shard.py"; const OPS_SELECTION_SUPERVISOR_DIR_NAME: &str = "selection_supervisor"; const OPS_SELECTION_SUPERVISOR_RUN_RESTART_DELAY_SECONDS: u64 = 5; const OPS_SELECTION_SUPERVISOR_RUN_MAX_BACKOFF_SECONDS: u64 = 30; @@ -78,6 +80,7 @@ const DELETE_APPLY_NO_WAIT_DELAY_SECONDS: u64 = 30; const EMBEDDED_SELECTION_SUPERVISOR_SOURCE: &str = include_str!(concat!(env!("OUT_DIR"), "/selection_supervisor.py")); +const EMBEDDED_LOG_SHARD_HELPER_SOURCE: &str = include_str!(concat!(env!("OUT_DIR"), "/log_shard.py")); // Ops controller uses Fluxon user-RPC to talk to ops agents. // Keep the timeout as a fixed constant to avoid config surface area. 
@@ -970,7 +973,7 @@ fn resolve_python_host_executable(python_exe: &Path) -> anyhow::Result Ok(resolved) } -fn ensure_embedded_selection_supervisor(workdir: &Path) -> anyhow::Result { +fn ensure_embedded_selection_supervisor_runtime(workdir: &Path) -> anyhow::Result<(PathBuf, PathBuf)> { let runtime_dir = workdir.join(OPS_SELECTION_SUPERVISOR_DIR_NAME); std::fs::create_dir_all(&runtime_dir).with_context(|| { format!( @@ -979,6 +982,7 @@ fn ensure_embedded_selection_supervisor(workdir: &Path) -> anyhow::Result existing != EMBEDDED_SELECTION_SUPERVISOR_SOURCE, Err(e) => { @@ -992,6 +996,19 @@ fn ensure_embedded_selection_supervisor(workdir: &Path) -> anyhow::Result existing != EMBEDDED_LOG_SHARD_HELPER_SOURCE, + Err(e) => { + if e.kind() == std::io::ErrorKind::NotFound { + true + } else { + return Err(anyhow::Error::new(e).context(format!( + "read embedded log shard helper failed: {}", + helper_path.display() + ))); + } + } + }; if should_write { std::fs::write(&script_path, EMBEDDED_SELECTION_SUPERVISOR_SOURCE).with_context(|| { format!( @@ -1019,13 +1036,21 @@ fn ensure_embedded_selection_supervisor(workdir: &Path) -> anyhow::Result anyhow::Result { let python_exe = resolve_python_host_executable(python_exe)?; - let script_path = ensure_embedded_selection_supervisor(workdir)?; + let (script_path, _helper_path) = ensure_embedded_selection_supervisor_runtime(workdir)?; if !hostworkdir.is_absolute() { anyhow::bail!( "hostworkdir must be absolute for shared selection supervisor runtime: {}", @@ -1647,7 +1672,9 @@ fn selection_status_from_live_supervisor( apply_id: runtime_state.as_ref().and_then(|v| v.apply_id.clone()), argv: runtime_state.as_ref().map(|v| v.argv.clone()), cwd: runtime_state.as_ref().and_then(|v| v.cwd.clone()), - log_path: runtime_state.as_ref().map(|v| v.log_path.clone()), + log_path: runtime_state + .as_ref() + .map(|v| display_runtime_log_path(v.log_path.as_str())), started_ts_ms: None, owner_ts_ms: Some(supervisor.owner_ts_ms), 
supervisor_start_time_ticks: Some(supervisor.start_time_ticks()), @@ -2970,7 +2997,8 @@ impl UserRpcHandler for ReadWorkloadLogChunkHandler { } }; - let path = self.log_dir.join(log_filename); + let logical_path = self.log_dir.join(log_filename); + let path = resolve_readable_log_path(&logical_path).unwrap_or(logical_path.clone()); let meta = match std::fs::metadata(&path) { Ok(v) => v, Err(e) => { @@ -3773,8 +3801,12 @@ fn desired_workload_matches_running( workloads: &SupervisorBackedWorkloads, desired: &AgentDesiredWorkload, ) -> bool { - let _ = workloads; - let Ok(status) = observe_selection_status(desired.kind, &desired.name, &desired.authority) + let Ok(status) = observe_selection_status_for_scope( + desired.kind, + &desired.name, + &desired.authority, + Some(workloads.scope_key.as_str()), + ) else { return false; }; @@ -3854,7 +3886,6 @@ fn desired_workload_recovery_superseded( workloads: &SupervisorBackedWorkloads, desired: &AgentDesiredWorkload, ) -> anyhow::Result { - let _ = workloads; // English note: // - A newer apply-owned generation overlapping an older applyless bare owner is the expected // phase-1 state of the self-host two-phase handover. @@ -3863,7 +3894,12 @@ fn desired_workload_recovery_superseded( // phase 2 has a chance to cut over. // - Only an owner_ts that is newer than the requested workload and is not this intentional // phase-1 overlap is treated as a hard superseding fact. 
- let status = observe_selection_status(desired.kind, &desired.name, &desired.authority)?; + let status = observe_selection_status_for_scope( + desired.kind, + &desired.name, + &desired.authority, + Some(workloads.scope_key.as_str()), + )?; if phase1_overlap_with_applyless_owner(&status, desired) { return Ok(false); } @@ -13938,6 +13974,90 @@ mod tests { assert!(err_text.contains("owner_ts_ms collision"), "{err_text}"); } + #[test] + fn live_selection_supervisors_isolate_same_label_collision_by_scope_key() { + let snapshot = SelectionSupervisorProcSnapshot { + infos_by_pid: std::collections::HashMap::from([ + ( + 11, + ProcessInfoObservation { + pid: 11, + ppid: 1, + pgid: 11, + state: 'S', + start_time_ticks: 100, + }, + ), + ( + 22, + ProcessInfoObservation { + pid: 22, + ppid: 1, + pgid: 22, + state: 'S', + start_time_ticks: 200, + }, + ), + ]), + children_by_ppid: std::collections::HashMap::new(), + cmdlines: vec![ + ( + 11, + vec![ + "/usr/bin/python3".to_string(), + "selection_supervisor.py".to_string(), + "run".to_string(), + "--label".to_string(), + "DaemonSet/target".to_string(), + "--scope-key".to_string(), + "/tmp/scope-a".to_string(), + "--owner-ts-ms".to_string(), + "2".to_string(), + ], + ), + ( + 22, + vec![ + "/usr/bin/python3".to_string(), + "selection_supervisor.py".to_string(), + "run".to_string(), + "--label".to_string(), + "DaemonSet/target".to_string(), + "--scope-key".to_string(), + "/tmp/scope-b".to_string(), + "--owner-ts-ms".to_string(), + "2".to_string(), + ], + ), + ], + zombie_infos: Vec::new(), + }; + + let scoped_a = + live_selection_supervisors(&snapshot, Some("DaemonSet/target"), Some("/tmp/scope-a")) + .unwrap(); + assert_eq!(scoped_a.len(), 1); + assert_eq!(scoped_a[0].pid(), 11); + + let scoped_b = + live_selection_supervisors(&snapshot, Some("DaemonSet/target"), Some("/tmp/scope-b")) + .unwrap(); + assert_eq!(scoped_b.len(), 1); + assert_eq!(scoped_b[0].pid(), 22); + + let listed_a = 
observe_all_selection_statuses_for_snapshot(&snapshot, Some("/tmp/scope-a")) + .unwrap(); + assert_eq!(listed_a.len(), 1); + assert_eq!(listed_a[0].label, "DaemonSet/target"); + assert_eq!(listed_a[0].pid, Some(11)); + + let listed_b = observe_all_selection_statuses_for_snapshot(&snapshot, Some("/tmp/scope-b")) + .unwrap(); + assert_eq!(listed_b.len(), 1); + assert_eq!(listed_b[0].label, "DaemonSet/target"); + assert_eq!(listed_b[0].pid, Some(22)); + } + #[test] fn live_selection_supervisors_reject_matching_legacy_entry_without_owner_ts_ms() { let snapshot = SelectionSupervisorProcSnapshot { @@ -14405,6 +14525,95 @@ mod tests { .unwrap(); } + #[test] + fn materialize_selection_supervisor_runtime_writes_log_shard_helper() { + let python_exe = PathBuf::from("/usr/bin/python3"); + assert!( + python_exe.is_file(), + "python executable does not exist: {}", + python_exe.display() + ); + let workdir = tempfile::tempdir().unwrap(); + let runtime = + SelectionSupervisorRuntime::materialize(workdir.path(), workdir.path(), python_exe.as_path()) + .unwrap(); + assert!(runtime.script_path.exists()); + assert!( + runtime + .script_path + .parent() + .unwrap() + .join(OPS_LOG_SHARD_HELPER_FILENAME) + .is_file() + ); + } + + #[test] + fn detached_selection_supervisor_preserves_early_startup_logs() { + let python_exe = PathBuf::from("/usr/bin/python3"); + assert!( + python_exe.is_file(), + "python executable does not exist: {}", + python_exe.display() + ); + let workdir = tempfile::tempdir().unwrap(); + let runtime = + SelectionSupervisorRuntime::materialize(workdir.path(), workdir.path(), python_exe.as_path()) + .unwrap(); + let log_path = workdir.path().join("startup.log"); + let command = vec![ + python_exe.display().to_string(), + runtime.script_path.display().to_string(), + "run".to_string(), + "--label".to_string(), + "Deployment/startup_demo".to_string(), + "--scope-key".to_string(), + workdir.path().display().to_string(), + "--owner-ts-ms".to_string(), + "0".to_string(), + 
"--restart-policy".to_string(), + "always".to_string(), + "--restart-delay-seconds".to_string(), + "5".to_string(), + "--max-backoff-seconds".to_string(), + "30".to_string(), + "--crashloop-consecutive-restarts".to_string(), + "0".to_string(), + "--crashloop-interval-lt-seconds".to_string(), + "0".to_string(), + "--".to_string(), + "/bin/true".to_string(), + ]; + let pid = runtime.spawn_detached_command(&log_path, command.as_slice()).unwrap(); + let deadline = Instant::now() + Duration::from_secs(10); + let expected = "owner-ts-ms must be positive"; + let mut saw_expected = false; + while Instant::now() < deadline { + if let Some(path) = resolve_readable_log_path(&log_path) { + let text = std::fs::read_to_string(path).unwrap_or_default(); + if text.contains(expected) { + saw_expected = true; + break; + } + } + std::thread::sleep(Duration::from_millis(100)); + } + if let Some(path) = resolve_readable_log_path(&log_path) { + let text = std::fs::read_to_string(path).unwrap_or_default(); + assert!( + text.contains(expected), + "expected detached supervisor startup logs to reach runtime log, got: {text:?}" + ); + } else { + panic!("runtime log path did not materialize"); + } + assert!(saw_expected, "startup log was not observed before timeout"); + let _ = std::process::Command::new("kill") + .arg("-TERM") + .arg(pid.to_string()) + .status(); + } + #[test] fn atomic_group_non_agent_requires_present_before_running_match() { let desired = AgentDesiredWorkload { @@ -14616,4 +14825,25 @@ mod tests { }; assert!(!phase1_overlap_with_applyless_owner(&status, &desired)); } + + #[test] + fn resolve_readable_log_path_prefers_latest_daily_shard() { + let td = tempfile::tempdir().unwrap(); + let base_path = td.path().join("workload__Deployment__demo.log"); + std::fs::write( + td.path().join("workload__Deployment__demo.2026-06-19.log"), + "old\n", + ) + .unwrap(); + std::fs::write( + td.path().join("workload__Deployment__demo.2026-06-20.log"), + "new\n", + ) + .unwrap(); + let 
resolved = resolve_readable_log_path(&base_path).unwrap(); + assert_eq!( + resolved.file_name().and_then(|v| v.to_str()), + Some("workload__Deployment__demo.2026-06-20.log") + ); + } } diff --git a/fluxon_rs/fluxon_util/build.rs b/fluxon_rs/fluxon_util/build.rs index 0f586d3..2bf7b87 100644 --- a/fluxon_rs/fluxon_util/build.rs +++ b/fluxon_rs/fluxon_util/build.rs @@ -88,12 +88,15 @@ fn collect_crates_for_runtime(ws: &CargoWorkspace) { println!("cargo:rerun-if-changed=Cargo.toml"); } -fn try_discover_git_dir(manifest_dir: &Path) -> Option { +fn try_discover_git_dir(manifest_dir: &Path, workspace_root: &Path) -> Option { + let workspace_search_ceiling = workspace_root.parent().unwrap_or(workspace_root); let mut cur = Some(manifest_dir); while let Some(dir) = cur { let candidate = dir.join(".git"); if candidate.is_dir() { - return Some(candidate); + if candidate.join("HEAD").is_file() { + return Some(candidate); + } } if candidate.is_file() { // Worktree/submodule style: .git is a file containing `gitdir: ` @@ -106,11 +109,17 @@ fn try_discover_git_dir(manifest_dir: &Path) -> Option { .unwrap_or_else(|| panic!("invalid .git file format: {}", candidate.display())) .trim(); let gitdir_path = Path::new(gitdir); - return Some(if gitdir_path.is_absolute() { + let resolved = if gitdir_path.is_absolute() { gitdir_path.to_path_buf() } else { dir.join(gitdir_path) - }); + }; + if resolved.join("HEAD").is_file() { + return Some(resolved); + } + } + if dir == workspace_search_ceiling { + break; } cur = dir.parent(); } @@ -309,7 +318,7 @@ fn main() { v } Err(_) => { - match try_discover_git_dir(&manifest_dir) { + match try_discover_git_dir(&manifest_dir, &ws.workspace_root) { Some(git_dir) => { emit_rerun_hints(&git_dir); resolve_head_commit_id(&git_dir) diff --git a/fluxon_rs/fluxon_util/src/lib.rs b/fluxon_rs/fluxon_util/src/lib.rs index 2f4f9fa..e575a75 100644 --- a/fluxon_rs/fluxon_util/src/lib.rs +++ b/fluxon_rs/fluxon_util/src/lib.rs @@ -36,7 +36,12 @@ pub mod limitrate; // 
PyO3 helpers: run long-time Python call without holding GIL in caller thread. pub mod pyo3; // Re-export for stable public API: existing call sites can keep using `fluxon_util::init_log`. -pub use log::{current_log_file_path, init_log, init_log_test, init_log_with_extra_layer}; +pub use log::{ + current_daily_sharded_log_path, current_log_file_path, daily_sharded_log_path, + display_runtime_log_path, init_log, init_log_test, init_log_with_extra_layer, + latest_existing_daily_sharded_log_path, resolve_readable_log_path, + DEFAULT_DAILY_LOG_RETENTION_DAYS, +}; #[cfg(test)] mod test_util_test; diff --git a/fluxon_rs/fluxon_util/src/log.rs b/fluxon_rs/fluxon_util/src/log.rs index db3d88f..648650f 100644 --- a/fluxon_rs/fluxon_util/src/log.rs +++ b/fluxon_rs/fluxon_util/src/log.rs @@ -3,6 +3,7 @@ use std::io; use std::path::{Path, PathBuf}; use std::sync::OnceLock; +use parking_lot::Mutex; use tracing_appender::non_blocking; use tracing_appender::non_blocking::WorkerGuard; use tracing_subscriber::EnvFilter; @@ -20,6 +21,9 @@ mod generated_crates { // RPC fast-path traffic actually entered the closed transfer / verbs backend. Keep the scope explicit: // only these dependency targets are promoted to DEBUG alongside workspace crates. const RDMA_DEBUG_TARGETS: &[&str] = &["fabric_lib", "libfabric_sys", "libibverbs_sys"]; +const LOG_RETENTION_DAYS: usize = 31; +const TEST_LOG_SHARD_WINDOW_SECONDS_ENV: &str = "FLUXON_TEST_LOG_SHARD_WINDOW_SECONDS"; +const TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV: &str = "FLUXON_TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS"; // Simple UTC timer in RFC3339 seconds (no subsecond precision) struct UtcSecondTimer; @@ -37,6 +41,191 @@ static GLOBAL_CONSOLE_LOG_GUARD: OnceLock = OnceLock::new(); // Expose the current process log file path for sidecar collectors (e.g. OTLP tailer). 
static GLOBAL_LOG_FILE_PATH: OnceLock = OnceLock::new(); +pub const DEFAULT_DAILY_LOG_RETENTION_DAYS: usize = LOG_RETENTION_DAYS; + +#[derive(Clone, Copy, Debug)] +struct LogShardWindowConfig { + window_seconds: i64, + anchor_unix_seconds: i64, +} + +fn read_test_log_shard_window_config() -> anyhow::Result> { + let Some(raw_window) = std::env::var_os(TEST_LOG_SHARD_WINDOW_SECONDS_ENV) else { + return Ok(None); + }; + let raw_window = raw_window + .into_string() + .map_err(|_| anyhow::anyhow!("{TEST_LOG_SHARD_WINDOW_SECONDS_ENV} must be valid utf-8"))?; + let window_text = raw_window.trim(); + if window_text.is_empty() { + return Ok(None); + } + let window_seconds: i64 = window_text.parse().map_err(|e| { + anyhow::anyhow!( + "{TEST_LOG_SHARD_WINDOW_SECONDS_ENV} must be a positive integer: {e}" + ) + })?; + if window_seconds <= 0 { + anyhow::bail!("{TEST_LOG_SHARD_WINDOW_SECONDS_ENV} must be > 0"); + } + + let raw_anchor = std::env::var(TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV).map_err(|_| { + anyhow::anyhow!( + "{TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV} is required when {TEST_LOG_SHARD_WINDOW_SECONDS_ENV} is set" + ) + })?; + let anchor_unix_seconds: i64 = raw_anchor.trim().parse().map_err(|e| { + anyhow::anyhow!( + "{TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV} must be an integer unix timestamp: {e}" + ) + })?; + Ok(Some(LogShardWindowConfig { + window_seconds, + anchor_unix_seconds, + })) +} + +fn resolve_shard_date_from_datetime(now: chrono::DateTime) -> anyhow::Result { + let Some(config) = read_test_log_shard_window_config()? 
else { + return Ok(now.date_naive()); + }; + let unix_seconds = now.timestamp(); + let delta_seconds = unix_seconds - config.anchor_unix_seconds; + if delta_seconds < 0 { + anyhow::bail!( + "test log shard anchor must not be in the future: anchor={}, ts={}", + config.anchor_unix_seconds, + unix_seconds + ); + } + let bucket_index = delta_seconds / config.window_seconds; + let base_date = chrono::NaiveDate::from_ymd_opt(2026, 1, 1) + .expect("valid hard-coded synthetic base date"); + Ok(base_date + chrono::Days::new(bucket_index as u64)) +} + +fn current_shard_date() -> anyhow::Result { + resolve_shard_date_from_datetime(chrono::Utc::now()) +} + +fn cleanup_old_daily_sharded_logs( + base_path: &Path, + retention_days: usize, +) -> anyhow::Result<()> { + let parent = match base_path.parent() { + Some(parent) => parent, + None => return Ok(()), + }; + let file_name = match base_path.file_name().and_then(|v| v.to_str()) { + Some(file_name) => file_name, + None => return Ok(()), + }; + let Some(stem) = file_name.strip_suffix(".log") else { + return Ok(()); + }; + fs::create_dir_all(parent)?; + let keep_since = current_shard_date()? - chrono::Days::new(retention_days.saturating_sub(1) as u64); + let prefix = format!("{stem}."); + for entry in std::fs::read_dir(parent)? 
{ + let entry = entry?; + let path = entry.path(); + if !path.is_file() { + continue; + } + let entry_name = entry.file_name(); + let Some(entry_name) = entry_name.to_str() else { + continue; + }; + if !entry_name.starts_with(prefix.as_str()) || !entry_name.ends_with(".log") { + continue; + } + let date_text = &entry_name[prefix.len()..entry_name.len() - ".log".len()]; + let Ok(shard_date) = chrono::NaiveDate::parse_from_str(date_text, "%Y-%m-%d") else { + continue; + }; + if shard_date < keep_since { + match fs::remove_file(&path) { + Ok(()) => {} + Err(err) if err.kind() == io::ErrorKind::NotFound => {} + Err(err) => return Err(err.into()), + } + } + } + Ok(()) +} + +#[derive(Debug)] +struct DailyShardedFileWriter { + base_path: PathBuf, + retention_days: usize, + state: Mutex, +} + +#[derive(Debug, Default)] +struct DailyShardedFileWriterState { + current_path: Option, + current_file: Option, +} + +impl DailyShardedFileWriter { + fn new(base_path: PathBuf, retention_days: usize) -> Self { + Self { + base_path, + retention_days, + state: Mutex::new(DailyShardedFileWriterState::default()), + } + } + + fn current_path(&self) -> anyhow::Result { + current_daily_sharded_log_path(&self.base_path) + } + + fn rotate_if_needed( + &self, + state: &mut DailyShardedFileWriterState, + ) -> io::Result<()> { + let next_path = self + .current_path() + .map_err(|err| io::Error::new(io::ErrorKind::Other, err.to_string()))?; + if state.current_path.as_ref() == Some(&next_path) && state.current_file.is_some() { + return Ok(()); + } + cleanup_old_daily_sharded_logs(&self.base_path, self.retention_days) + .map_err(|err| io::Error::new(io::ErrorKind::Other, err.to_string()))?; + if let Some(parent) = next_path.parent() { + fs::create_dir_all(parent)?; + } + let file = fs::OpenOptions::new() + .create(true) + .append(true) + .open(&next_path)?; + state.current_path = Some(next_path); + state.current_file = Some(file); + Ok(()) + } +} + +impl io::Write for DailyShardedFileWriter { + fn 
write(&mut self, buf: &[u8]) -> io::Result { + let mut state = self.state.lock(); + self.rotate_if_needed(&mut state)?; + state + .current_file + .as_mut() + .expect("log writer file must exist after rotation") + .write(buf) + } + + fn flush(&mut self) -> io::Result<()> { + let mut state = self.state.lock(); + if let Some(file) = state.current_file.as_mut() { + file.flush() + } else { + Ok(()) + } + } +} + fn setup_global_log_guards(file_guard: WorkerGuard, console_guard: WorkerGuard) { let _ = GLOBAL_FILE_LOG_GUARD.set(file_guard); let _ = GLOBAL_CONSOLE_LOG_GUARD.set(console_guard); @@ -90,9 +279,9 @@ fn third_party_log_target_overrides( targets } -/// Init log for production +/// Init log for production. /// - `log_path`: directory to write log files -/// - `instance_key`: used in file names to disambiguate instances +/// - `instance_key`: used in daily file names to disambiguate instances pub fn init_log(log_path: &Path, instance_key: &str) { init_log_impl(log_path, instance_key, NoopLayer); } @@ -113,6 +302,95 @@ struct NoopLayer; impl tracing_subscriber::Layer for NoopLayer where S: tracing::Subscriber {} +fn current_daily_log_file_path(log_path: &Path, instance_key: &str) -> PathBuf { + current_daily_sharded_log_path(&log_path.join(format!("fluxon-kv-{instance_key}.log"))) + .unwrap_or_else(|_| { + let date = chrono::Utc::now().format("%Y-%m-%d"); + log_path.join(format!("fluxon-kv-{instance_key}.{date}.log")) + }) +} + +pub fn daily_sharded_log_path( + base_path: &Path, + date: chrono::NaiveDate, +) -> anyhow::Result { + let file_name = base_path.file_name().and_then(|v| v.to_str()).ok_or_else(|| { + anyhow::anyhow!( + "log path must end with a valid utf-8 filename: {}", + base_path.display() + ) + })?; + let stem = file_name + .strip_suffix(".log") + .ok_or_else(|| anyhow::anyhow!("log path must end with .log: {}", base_path.display()))?; + Ok(base_path.with_file_name(format!( + "{}.{}.log", + stem, + date.format("%Y-%m-%d") + ))) +} + +pub fn 
current_daily_sharded_log_path(base_path: &Path) -> anyhow::Result { + daily_sharded_log_path(base_path, current_shard_date()?) +} + +pub fn latest_existing_daily_sharded_log_path(base_path: &Path) -> Option { + let parent = base_path.parent()?; + let file_name = base_path.file_name()?.to_str()?; + let stem = file_name.strip_suffix(".log")?; + let prefix = format!("{}.", stem); + let mut latest: Option<(chrono::NaiveDate, PathBuf)> = None; + let entries = std::fs::read_dir(parent).ok()?; + for entry in entries { + let entry = entry.ok()?; + let path = entry.path(); + if !path.is_file() { + continue; + } + let entry_name = entry.file_name(); + let entry_name = entry_name.to_str()?; + if !entry_name.starts_with(prefix.as_str()) || !entry_name.ends_with(".log") { + continue; + } + if entry_name.len() <= prefix.len() + ".log".len() { + continue; + } + let date_text = &entry_name[prefix.len()..entry_name.len() - ".log".len()]; + let date = chrono::NaiveDate::parse_from_str(date_text, "%Y-%m-%d").ok()?; + let replace = match latest.as_ref() { + Some((prev, _)) => date > *prev, + None => true, + }; + if replace { + latest = Some((date, path)); + } + } + latest.map(|(_, path)| path) +} + +pub fn resolve_readable_log_path(base_path: &Path) -> Option { + if let Ok(current) = current_daily_sharded_log_path(base_path) { + if current.exists() { + return Some(current); + } + } + if let Some(latest) = latest_existing_daily_sharded_log_path(base_path) { + return Some(latest); + } + if base_path.exists() { + return Some(base_path.to_path_buf()); + } + None +} + +pub fn display_runtime_log_path(base_path_text: &str) -> String { + let base_path = Path::new(base_path_text); + resolve_readable_log_path(base_path) + .unwrap_or_else(|| base_path.to_path_buf()) + .display() + .to_string() +} + fn init_log_impl(log_path: &Path, instance_key: &str, extra_layer: L) where L: tracing_subscriber::Layer + Send + Sync + 'static, @@ -238,83 +516,9 @@ where } } - // Archive existing logs for the 
same instance into a sibling history directory. - // Scope is strictly within the provided `log_path` (cluster is implied by the dir path), - // and only files of the current `instance_key` are moved. This avoids any cross-instance - // interference and keeps behavior explicit and bounded. - { - let history_dir = log_path.join("history"); - if let Err(e) = fs::create_dir_all(&history_dir) { - panic!( - "[fluxon] Create history directory failed: {:?}. Base log_path: {:?}. \ -This log_path is provided by the caller's configuration. \ -For Master mode it is derived from MasterConfigYaml.log_dir with a subdirectory '_cluster_kv_logs'; \ -for Client mode it is derived from ClientConfigYaml.fluxonkv_spec.shared_memory_path with subdirectory '_cluster_kv_logs'. \ -Please ensure the directory exists and is writable. Underlying OS error: {:?}", - history_dir, log_path, e - ); - } - - // Pattern: fluxon-kv-..log - // No fallback patterns: keep rule strict and explicit. - let prefix = format!("fluxon-kv-{}.", instance_key); - let mut moved = 0usize; - - let iter = fs::read_dir(log_path).unwrap_or_else(|e| { - panic!( - "[fluxon] Read log directory failed at {:?}. This directory is the configured log_path described above. OS error: {:?}", - log_path, e - ) - }); - - for entry in iter { - let entry = entry.unwrap_or_else(|e| { - panic!( - "[fluxon] Failed to read a directory entry under {:?}. OS error: {:?}", - log_path, e - ) - }); - let path = entry.path(); - if !path.is_file() { - continue; - } - let name_os = match path.file_name() { - Some(n) => n, - None => continue, - }; - let name = match name_os.to_str() { - Some(s) => s, - None => continue, - }; - let is_target = name.starts_with(&prefix) && name.ends_with(".log"); - if !is_target { - continue; - } - let dst = history_dir.join(name); - if let Err(err) = fs::rename(&path, &dst) { - panic!( - "[fluxon] Move old log failed: {:?} -> {:?}. Base log_path: {:?}. 
OS error: {:?}", - path, dst, log_path, err - ); - } - moved += 1; - } - - if moved > 0 { - println!( - "[fluxon] Archived {moved} existing logs for instance_key='{instance_key}' into {:?}", - history_dir - ); - } - } - - // Files named with UTC timestamp once per process run - let ts = chrono::Utc::now().format("%Y-%m-%d_%H-%M-%S"); - // File log keeps workspace crates at DEBUG; non-workspace crates default to WARN. // This avoids dumping verbose dependency debug logs (e.g. h2/tower) into file output. - let file_name = format!("fluxon-kv-{instance_key}.{ts}.log"); - let file_path = log_path.join(&file_name); + let file_path = current_daily_log_file_path(log_path, instance_key); // Keep a copy for the whole process lifetime; collectors can clone it. if let Some(prev) = GLOBAL_LOG_FILE_PATH.get() { if prev != &file_path { @@ -326,18 +530,11 @@ Please ensure the directory exists and is writable. Underlying OS error: {:?}", } else { let _ = GLOBAL_LOG_FILE_PATH.set(file_path.clone()); } - let file = match std::fs::OpenOptions::new() - .create(true) - .append(true) - .open(&file_path) - { - Ok(f) => f, - Err(e) => { - eprintln!("Failed to open log file {:?}, err: {:?}", file_path, e); - return; - } - }; - let (file_writer, file_guard) = non_blocking(file); + let file_appender = DailyShardedFileWriter::new( + log_path.join(format!("fluxon-kv-{instance_key}.log")), + LOG_RETENTION_DAYS, + ); + let (file_writer, file_guard) = non_blocking(file_appender); let enable_iceoryx_logs = matches!( std::env::var("FLUXON_ENABLE_ICEORYX_LOGS") .ok() @@ -380,10 +577,9 @@ Please ensure the directory exists and is writable. Underlying OS error: {:?}", setup_global_log_guards(file_guard, console_guard); // Success notice: tell users where logs are written. - let history_dir_for_print = log_path.join("history"); println!( - "[fluxon] Logging initialized. base_dir={:?}, history_dir={:?}, instance_key='{}'", - log_path, history_dir_for_print, instance_key + "[fluxon] Logging initialized. 
base_dir={:?}, retention_days={}, current_file={:?}, instance_key='{}'", + log_path, LOG_RETENTION_DAYS, file_path, instance_key ); } diff --git a/fluxon_rs/fluxon_util/tests/log_mgmt.rs b/fluxon_rs/fluxon_util/tests/log_mgmt.rs new file mode 100644 index 0000000..03de37c --- /dev/null +++ b/fluxon_rs/fluxon_util/tests/log_mgmt.rs @@ -0,0 +1,120 @@ +use std::fs; +use std::path::Path; +use std::time::{Duration, SystemTime, UNIX_EPOCH}; + +use fluxon_util::DEFAULT_DAILY_LOG_RETENTION_DAYS; +use tempfile::TempDir; + +const TEST_LOG_SHARD_WINDOW_SECONDS_ENV: &str = "FLUXON_TEST_LOG_SHARD_WINDOW_SECONDS"; +const TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV: &str = "FLUXON_TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS"; + +struct EnvVarGuard { + key: &'static str, + previous: Option, +} + +impl EnvVarGuard { + fn set(key: &'static str, value: impl Into) -> Self { + let previous = std::env::var(key).ok(); + unsafe { + std::env::set_var(key, value.into()); + } + Self { key, previous } + } +} + +impl Drop for EnvVarGuard { + fn drop(&mut self) { + match self.previous.as_deref() { + Some(value) => unsafe { + std::env::set_var(self.key, value); + }, + None => unsafe { + std::env::remove_var(self.key); + }, + } + } +} + +fn count_service_shards(root: &Path, prefix: &str) -> usize { + fs::read_dir(root) + .expect("read log directory") + .filter_map(|entry| entry.ok()) + .map(|entry| entry.file_name().to_string_lossy().to_string()) + .filter(|name| name.starts_with(prefix) && name.ends_with(".log")) + .count() +} + +#[test] +fn kv_log_shards_roll_and_cleanup_with_test_window() { + let temp_dir = TempDir::new().expect("create temp dir"); + let log_path = temp_dir.path(); + let instance_key = "log_mgmt_window"; + let base_prefix = format!("fluxon-kv-{instance_key}"); + let stale_path = log_path.join(format!("{base_prefix}.2025-12-01.log")); + fs::write(&stale_path, "stale\n").expect("write stale shard"); + + let now = SystemTime::now() + .duration_since(UNIX_EPOCH) + .expect("unix epoch") + 
.as_secs() as i64; + let _window_guard = EnvVarGuard::set(TEST_LOG_SHARD_WINDOW_SECONDS_ENV, "10"); + let _anchor_guard = EnvVarGuard::set(TEST_LOG_SHARD_ANCHOR_UNIX_SECONDS_ENV, (now - 2).to_string()); + + fluxon_util::init_log(log_path, instance_key); + tracing::info!(target: "fluxon_util", "[kv-log-mgmt][phase=before] ts={}", now); + std::thread::sleep(Duration::from_millis(300)); + std::thread::sleep(Duration::from_secs(11)); + let after_ts = SystemTime::now() + .duration_since(UNIX_EPOCH) + .expect("unix epoch") + .as_secs(); + tracing::info!(target: "fluxon_util", "[kv-log-mgmt][phase=after] ts={after_ts}"); + std::thread::sleep(Duration::from_millis(500)); + + let shard_1 = log_path.join(format!("{base_prefix}.2026-01-01.log")); + let shard_2 = log_path.join(format!("{base_prefix}.2026-01-02.log")); + assert!(shard_1.exists(), "missing shard: {}", shard_1.display()); + assert!(shard_2.exists(), "missing shard: {}", shard_2.display()); + assert!( + !stale_path.exists(), + "stale shard should be removed once retention cleanup runs" + ); + assert_eq!( + count_service_shards(log_path, base_prefix.as_str()), + 2, + "expected exactly two retained shard files within the synthetic test window" + ); + + let shard_1_text = fs::read_to_string(&shard_1).expect("read first shard"); + let shard_2_text = fs::read_to_string(&shard_2).expect("read second shard"); + assert!( + shard_1_text.contains("[kv-log-mgmt][phase=before]"), + "first shard should contain the before marker" + ); + assert!( + !shard_1_text.contains("[kv-log-mgmt][phase=after]"), + "first shard should not contain the after marker" + ); + assert!( + shard_2_text.contains("[kv-log-mgmt][phase=after]"), + "second shard should contain the after marker" + ); + assert!( + !shard_2_text.contains("[kv-log-mgmt][phase=before]"), + "second shard should not contain the before marker" + ); + assert_eq!(DEFAULT_DAILY_LOG_RETENTION_DAYS, 31); +} + +#[test] +fn 
resolve_readable_log_path_ignores_plain_base_log_when_daily_shards_exist() { + let temp_dir = TempDir::new().expect("create temp dir"); + let base_path = temp_dir.path().join("startup.log"); + fs::write(&base_path, "plain\n").expect("write base log"); + let shard_path = temp_dir.path().join("startup.2026-06-21.log"); + fs::write(&shard_path, "shard\n").expect("write shard log"); + + let resolved = fluxon_util::resolve_readable_log_path(&base_path).expect("resolve readable log path"); + assert_eq!(resolved, shard_path); +} diff --git a/fluxon_test_stack/ci_2_virt_node.py b/fluxon_test_stack/ci_2_virt_node.py index 28e9b82..7f91716 100644 --- a/fluxon_test_stack/ci_2_virt_node.py +++ b/fluxon_test_stack/ci_2_virt_node.py @@ -879,6 +879,8 @@ def main() -> int: sys.executable, str((REPO_ROOT / "fluxon_test_stack" / "pack_test_stack_rsc.py").resolve()), "--all-profiles", + "--release-dir", + str(release_dir), "-c", str(pack_metadata["suite_path"]), ] diff --git a/fluxon_test_stack/ci_test_list.yaml b/fluxon_test_stack/ci_test_list.yaml index e0e99c4..333ce3f 100644 --- a/fluxon_test_stack/ci_test_list.yaml +++ b/fluxon_test_stack/ci_test_list.yaml @@ -29,6 +29,14 @@ scenes: scales: [n1_kvowner_dram_20gib] profiles: [fluxon_tcp] + ci_top_attention_log_mgmt: + ci: + subject: rust + runtime_contract: rust_self_managed + select: + scales: [n1_kvowner_dram_20gib] + profiles: [fluxon_tcp] + kv_read_heavy_zipf: test_stack: mode: KVSTORE @@ -315,6 +323,8 @@ profiles: doc_site_base_url: example.com ci_top_attention_bin_kvtest: kv_test_rounds: all + ci_top_attention_log_mgmt: + enabled: true runtime_contracts: cluster_kv_owner: &cluster_kv_owner_runtime base_runtime: @@ -460,6 +470,8 @@ profiles: doc_site_base_url: example.com ci_top_attention_bin_kvtest: kv_test_rounds: all + ci_top_attention_log_mgmt: + enabled: true test_stack: <<: *common_test_stack_runtime fluxon_sockudo_ws: @@ -472,6 +484,8 @@ profiles: doc_site_base_url: example.com ci_top_attention_bin_kvtest: 
kv_test_rounds: all + ci_top_attention_log_mgmt: + enabled: true test_stack: <<: *common_test_stack_runtime fluxon_tcp: @@ -484,6 +498,8 @@ profiles: doc_site_base_url: example.com ci_top_attention_bin_kvtest: kv_test_rounds: all + ci_top_attention_log_mgmt: + enabled: true test_stack: <<: *common_test_stack_runtime redis_sharded: diff --git a/fluxon_test_stack/deployconf_testbed.yml b/fluxon_test_stack/deployconf_testbed.yml index fe431de..552ce13 100644 --- a/fluxon_test_stack/deployconf_testbed.yml +++ b/fluxon_test_stack/deployconf_testbed.yml @@ -349,6 +349,9 @@ service: cluster_name: "${FLUXON_CLUSTER_NAME}" shared_memory_path: "${FLUXON_SHARED_MEM}" shared_file_path: "${FLUXON_SHARED_FILE}" + large_file_paths: + log_root_path: "${HOSTWORKDIR}/large/log/owner_${NODE_ID}" + cache_root_path: "${HOSTWORKDIR}/large/cache/owner_${NODE_ID}" sub_cluster: "owner" YAML ${HOSTWORKDIR}/venv/bin/python -m fluxon_py.runtime.start_owner_kvclient -c "${CONFIG_PATH}" -w "${WORKDIR}" @@ -589,7 +592,7 @@ service: cluster_name: "${FLUXON_CLUSTER_NAME}" member_kind: kv output: web - http_listen_addr: "0.0.0.0:${MASTER__PORT}" + http_listen_addr: "0.0.0.0:${OPS_CONTROLLER__PORT}" YAML ${HOSTWORKDIR}/venv/bin/python -m fluxon_py.runtime.start_ops_controller -c "${WORKDIR}/ops_controller.yaml" -w "${WORKDIR}" diff --git a/fluxon_test_stack/pack_test_stack_rsc.py b/fluxon_test_stack/pack_test_stack_rsc.py index 9e80dea..e843a15 100644 --- a/fluxon_test_stack/pack_test_stack_rsc.py +++ b/fluxon_test_stack/pack_test_stack_rsc.py @@ -2,7 +2,6 @@ from __future__ import annotations import argparse -import fnmatch import hashlib import json import os @@ -19,30 +18,28 @@ import yaml REPO_ROOT = Path(__file__).resolve().parent.parent -SCRIPTS_DIR = REPO_ROOT / "setup_and_pack" -if str(SCRIPTS_DIR) not in sys.path: - sys.path.insert(0, str(SCRIPTS_DIR)) +SETUP_AND_PACK_DIR = REPO_ROOT / "setup_and_pack" +setup_and_pack_dir_str = str(SETUP_AND_PACK_DIR) +if setup_and_pack_dir_str in sys.path: 
+ sys.path.remove(setup_and_pack_dir_str) +sys.path.insert(0, setup_and_pack_dir_str) +SCRIPTS_DIR = REPO_ROOT / "scripts" +scripts_dir_str = str(SCRIPTS_DIR) +if scripts_dir_str in sys.path: + sys.path.remove(scripts_dir_str) +sys.path.insert(0, scripts_dir_str) import utils as script_utils +from source_selection_profiles import ( + SOURCE_SELECTION_PROFILE_SOURCE_PACK, + collect_source_profile_relpaths, + get_source_profile_source_roots, + source_profile_relpath_excluded, +) -CI_SOURCE_ROOT_NAMES: tuple[str, ...] = (".",) -CI_SOURCE_COMMON_EXCLUDE_REL_PATHS: tuple[str, ...] = ( - "__pycache__/", - ".pytest_cache/", - ".mypy_cache/", - ".ruff_cache/", - "*.swp", -) -CI_SOURCE_STAGE_EXCLUDE_PREFIXES: tuple[str, ...] = ( - ".dever/", - "fluxon_release/", - "skills/", -) -CI_SOURCE_STAGE_EXCLUDE_NAMES: frozenset[str] = frozenset( - { - ".DS_Store", - } +CI_SOURCE_ROOT_NAMES: tuple[str, ...] = get_source_profile_source_roots( + profile=SOURCE_SELECTION_PROFILE_SOURCE_PACK ) PACKED_RUNTIME_ROOT_NAMES: tuple[str, ...] 
= ( "bin", @@ -86,18 +83,6 @@ "mooncake", ) RELEASE_MANIFEST_FILENAME = "fluxon_release.sha256" -CI_SOURCE_DIGEST_IGNORED_DIR_NAMES = frozenset( - { - ".git", - "__pycache__", - ".pytest_cache", - ".mypy_cache", - ".ruff_cache", - "target", - } -) -CI_SOURCE_DIGEST_IGNORED_FILE_NAMES = frozenset() -CI_SOURCE_DIGEST_IGNORED_FILE_SUFFIXES = (".pyc", ".swp", ".gitignore") DEFAULT_REDIS_BUILD_IMAGE = "quay.io/pypa/manylinux_2_28_x86_64" DEFAULT_REDIS_DOWNLOAD_URL_TEMPLATE = "https://download.redis.io/releases/redis-{version}.tar.gz" DEFAULT_REDIS_VERSION = "7.2.5" @@ -121,11 +106,6 @@ profile_id: transport_backend for transport_backend, profile_id in script_utils.TRANSPORT_PROFILE_IDS.items() } -DEFAULT_RATHER_NO_GIT_SUBMODULE_CONFIG_RELPATH = Path( - "setup_and_pack/rather_no_git_submodule.yaml" -) - - def main() -> int: script_utils.reset_stage_summary() try: @@ -933,125 +913,13 @@ def _git_stage_ci_source_tree(*, repo_root: Path, stage_root: Path) -> list[str] return selected -def _collect_git_listed_source_relpaths( - *, - repo_root: Path, - git_root: Path, - rel_prefix: str = "", -) -> list[str]: - script_utils.require_cmd("git") - argv = [ - "git", - "ls-files", - "--cached", - "--others", - "--exclude-standard", - "-z", - ] - raw = subprocess.check_output(argv, cwd=str(git_root)) - selected: list[str] = [] - rel_prefix = rel_prefix.strip("/") - for entry in raw.split(b"\0"): - if not entry: - continue - rel = entry.decode("utf-8").strip() - if not rel: - continue - repo_rel = rel if not rel_prefix else f"{rel_prefix}/{rel}" - if _ci_source_relpath_excluded(repo_rel): - continue - src_path = (repo_root / repo_rel).resolve() - if not src_path.exists(): - continue - selected.append(repo_rel) - return selected - - -def _load_rather_no_git_submodule_source_roots( - *, - repo_root: Path, -) -> tuple[tuple[str, Path], ...]: - config_path = (repo_root / DEFAULT_RATHER_NO_GIT_SUBMODULE_CONFIG_RELPATH).resolve() - if not config_path.exists(): - return () - raw_cfg = 
_load_yaml_file(config_path) - if raw_cfg is None: - return () - if not isinstance(raw_cfg, dict): - raise RuntimeError( - "rather_no_git_submodule config must be a YAML mapping: " - f"{config_path}" - ) - raw_modules = raw_cfg.get("modules") - if raw_modules is None: - return () - if not isinstance(raw_modules, list): - raise RuntimeError( - "rather_no_git_submodule config `modules` must be a list: " - f"{config_path}" - ) - - repo_root = repo_root.resolve() - selected: list[tuple[str, Path]] = [] - seen_relpaths: set[str] = set() - for index, raw_item in enumerate(raw_modules): - if not isinstance(raw_item, dict): - raise RuntimeError( - "rather_no_git_submodule config entries must be mappings: " - f"{config_path} modules[{index}]" - ) - raw_path = raw_item.get("path") - if not isinstance(raw_path, str) or not raw_path.strip(): - raise RuntimeError( - "rather_no_git_submodule config path must be a non-empty string: " - f"{config_path} modules[{index}].path" - ) - rel_path = Path(raw_path.strip()) - if rel_path.is_absolute() or ".." 
in rel_path.parts: - raise RuntimeError( - "rather_no_git_submodule config path must stay within the repo root: " - f"{config_path} modules[{index}].path={raw_path!r}" - ) - relpath = rel_path.as_posix() - if relpath in seen_relpaths: - continue - seen_relpaths.add(relpath) - module_root = (repo_root / rel_path).resolve() - if module_root != repo_root and repo_root not in module_root.parents: - raise RuntimeError( - "rather_no_git_submodule config path escapes the repo root: " - f"{config_path} modules[{index}].path={raw_path!r}" - ) - if not module_root.is_dir(): - raise RuntimeError( - "CI source pack requires configured rather_no_git_submodule path to exist as a directory: " - f"path={relpath} resolved={module_root}" - ) - selected.append((relpath, module_root)) - return tuple(selected) - - def _collect_ci_source_relpaths(*, repo_root: Path) -> list[str]: - repo_root = repo_root.resolve() - selected = set( - _collect_git_listed_source_relpaths( - repo_root=repo_root, - git_root=repo_root, + return list( + collect_source_profile_relpaths( + repo_root=repo_root.resolve(), + profile=SOURCE_SELECTION_PROFILE_SOURCE_PACK, ) ) - for relpath, module_root in _load_rather_no_git_submodule_source_roots( - repo_root=repo_root - ): - selected.update( - _collect_git_listed_source_relpaths( - repo_root=repo_root, - git_root=module_root, - rel_prefix=relpath, - ) - ) - if not selected: - raise RuntimeError("git-based CI source selection produced no files") - return sorted(selected) def _compute_ci_source_digest(*, repo_root: Path) -> str: @@ -1060,16 +928,17 @@ def _compute_ci_source_digest(*, repo_root: Path) -> str: relative_to=repo_root, mode=script_utils.PathDigestMode.PACK_INPUTS, algorithm=script_utils.PathHashAlgorithm.SHA256, - ignored_dir_names=CI_SOURCE_DIGEST_IGNORED_DIR_NAMES, - ignored_file_names=CI_SOURCE_DIGEST_IGNORED_FILE_NAMES, - ignored_file_suffixes=CI_SOURCE_DIGEST_IGNORED_FILE_SUFFIXES, + ignored_dir_names=(), + ignored_file_names=(), + 
ignored_file_suffixes=(), ) def _ci_source_relpath_excluded(relpath: str) -> bool: - if relpath in CI_SOURCE_STAGE_EXCLUDE_NAMES: - return True - return any(relpath == prefix.rstrip("/") or relpath.startswith(prefix) for prefix in CI_SOURCE_STAGE_EXCLUDE_PREFIXES) + return source_profile_relpath_excluded( + profile=SOURCE_SELECTION_PROFILE_SOURCE_PACK, + relpath=relpath, + ) def _pack_ci_ext_rsc(*, repo_root: Path, out_path: Path) -> None: @@ -1110,8 +979,8 @@ def build_tarball() -> None: src=src, dst=packed_stage_root / rel_name, honor_gitignore=False, + exclude_rel_paths=PACKED_RUNTIME_EXCLUDE_REL_PATHS, ) - _prune_stage_paths(packed_stage_root, PACKED_RUNTIME_EXCLUDE_REL_PATHS) script_utils.tar_gz( cwd=stage_root, out_path=out_path, @@ -1148,7 +1017,7 @@ def _stage_repo_test_rsc_tree(*, repo_test_rsc_root: Path, out_dir: Path) -> Non else: dst.parent.mkdir(parents=True, exist_ok=True) shutil.copy2(src, dst) - _prune_stage_paths(out_dir, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) + script_utils.prune_stage_paths(out_dir, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) def _release_shared_baselines_root(*, release_dir: Path) -> Path: @@ -1179,7 +1048,7 @@ def _stage_release_shared_baselines_into_root(*, release_dir: Path, prepared_roo if baselines_dst.exists(): raise RuntimeError(f"prepared test_rsc baselines path already exists before release authority stage: {baselines_dst}") shutil.copytree(shared_baselines_root, baselines_dst, dirs_exist_ok=False) - _prune_stage_paths(baselines_dst, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) + script_utils.prune_stage_paths(baselines_dst, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) def _stage_canonical_profile_prepared_resources_into_root(*, profile_id: str, prepared_root: Path) -> None: @@ -1206,7 +1075,7 @@ def _stage_canonical_profile_prepared_resources_into_root(*, profile_id: str, pr else: dst.parent.mkdir(parents=True, exist_ok=True) shutil.copy2(src, dst) - _prune_stage_paths(dst, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) + 
script_utils.prune_stage_paths(dst, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) def _stage_prepared_test_rsc(*, prepared_root: Path, out_dir: Path) -> None: @@ -1220,7 +1089,7 @@ def _stage_prepared_test_rsc(*, prepared_root: Path, out_dir: Path) -> None: else: dst.parent.mkdir(parents=True, exist_ok=True) shutil.copy2(src, dst) - _prune_stage_paths(out_dir, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) + script_utils.prune_stage_paths(out_dir, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) def _prepare_baselines_into_root( @@ -1249,7 +1118,7 @@ def _prepare_baselines_into_root( dir_source=dir_source, archive_source=archive_source, ) - _prune_stage_paths(prepared_root, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) + script_utils.prune_stage_paths(prepared_root, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) def _prepare_configured_test_rsc_resources_into_root( @@ -1276,7 +1145,7 @@ def _prepare_configured_test_rsc_resources_into_root( scratch_root=scratch_root, mooncake_cfg=mooncake_cfg_raw, ) - _prune_stage_paths(prepared_root, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) + script_utils.prune_stage_paths(prepared_root, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) def _prepare_python_runtime_wheelhouse_into_root( @@ -1786,7 +1655,7 @@ def _sync_prepared_baselines_into_release_tree(*, prepared_root: Path, release_d release_shared_baselines_root.parent.mkdir(parents=True, exist_ok=True) _remove_path(release_shared_baselines_root) shutil.copytree(prepared_baselines_root, release_shared_baselines_root, dirs_exist_ok=False) - _prune_stage_paths(release_shared_baselines_root, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) + script_utils.prune_stage_paths(release_shared_baselines_root, TEST_RSC_REPO_TREE_EXCLUDE_REL_PATHS) def _extract_bundle_archive(*, archive_path: Path, out_dir: Path, expected_root_name: str) -> None: @@ -1814,62 +1683,6 @@ def _remove_path(path: Path) -> None: path.unlink() -def _rsync_stage_filtered( - *, - repo_root: Path, - src: Path, - dst: Path, - honor_gitignore: bool, - exclude_rel_paths: tuple[str, ...] 
= (), -) -> None: - if not exclude_rel_paths: - script_utils.rsync_stage( - repo_root=repo_root, - src=src, - dst=dst, - honor_gitignore=honor_gitignore, - ) - return - - if not src.exists(): - raise RuntimeError(f"missing required source path for staging: {src}") - if dst.exists(): - raise RuntimeError(f"staging destination already exists (no overwrite): {dst}") - if shutil.which("rsync") is None: - raise RuntimeError("rsync is required for filtered staging, but was not found in PATH") - - dst.parent.mkdir(parents=True, exist_ok=True) - argv = ["rsync", "-a"] - if honor_gitignore: - argv += [ - "--exclude=.git/", - "--exclude-from=.gitignore", - "--filter=:- .gitignore", - ] - for pattern in exclude_rel_paths: - argv.append(f"--exclude={pattern}") - if src.is_dir(): - argv += [str(src) + "/", str(dst) + "/"] - else: - argv += [str(src), str(dst)] - subprocess.check_call(argv, cwd=str(repo_root)) - - -def _prune_stage_paths(stage_root: Path, exclude_rel_paths: tuple[str, ...]) -> None: - if not stage_root.exists(): - return - for path in sorted(stage_root.rglob("*"), reverse=True): - rel_path = path.relative_to(stage_root).as_posix() - for pattern in exclude_rel_paths: - normalized_pattern = pattern.rstrip("/") - if fnmatch.fnmatch(rel_path, normalized_pattern) or fnmatch.fnmatch(path.name, normalized_pattern): - if path.is_dir(): - shutil.rmtree(path) - else: - path.unlink(missing_ok=True) - break - - def _test_rsc_manifest_file_list(*, out_dir: Path, prepared_root: Path) -> list[Path]: files: list[Path] = [] for fixed_name in ("src_ci.tar.gz", "fluxon_ci_ext_rsc.tar.gz"): diff --git a/fluxon_test_stack/start_test_bed.py b/fluxon_test_stack/start_test_bed.py index 79f7bcd..7c3f22a 100644 --- a/fluxon_test_stack/start_test_bed.py +++ b/fluxon_test_stack/start_test_bed.py @@ -7,6 +7,7 @@ import fcntl import json import os +import re import subprocess import sys import time @@ -24,6 +25,7 @@ DEPLOYMENT_DIR = REPO_ROOT / "deployment" sys.path.insert(0, 
str(DEPLOYMENT_DIR)) import manual_dispatch_release +from utils import log_shard from utils.selection_runtime import ( atomic_group_member_authority_name as _selection_atomic_group_member_authority_name, atomic_group_member_selection_workload_name as _selection_atomic_group_member_selection_workload_name, @@ -432,11 +434,12 @@ def main() -> None: waves=coverage_bootstrap_waves, bootstrap_bare_services=bootstrap_bare_services, ) - _wait_controller_ready_stable( - controller_url=controller_url, - timeout_seconds=controller_ready_timeout_seconds, - stability_window_seconds=bootstrap_stability_window_seconds, - ) + if bootstrap_mode in (BOOTSTRAP_MODE_BARE_THEN_APPLY, BOOTSTRAP_MODE_BARE_ONLY): + _wait_controller_ready_stable( + controller_url=controller_url, + timeout_seconds=controller_ready_timeout_seconds, + stability_window_seconds=bootstrap_stability_window_seconds, + ) test_runner_ui_summary = _ensure_test_runner_ui_started(ui_cfg=test_runner_ui_cfg) if bootstrap_mode == BOOTSTRAP_MODE_BARE_THEN_APPLY: post_bootstrap_agent_instance_keys = _selection_agent_instance_keys( @@ -764,6 +767,9 @@ def _normalize_bootstrap_deployconf( if isinstance(master_cfg, dict): entrypoint = master_cfg.get("entrypoint") if isinstance(entrypoint, str): + master_port = _extract_master_listen_port(entrypoint=entrypoint) + if master_port is not None: + _set_service_port(master_cfg, port=master_port) normalized_entrypoint, removed = _strip_legacy_master_p2p_listen_port(entrypoint=entrypoint) if removed: master_cfg["entrypoint"] = normalized_entrypoint @@ -843,6 +849,7 @@ def _rewrite_same_host_local_multi_node_fixed_ports( _set_service_port(greptime_cfg, port=plan["greptime_port"]) _set_service_port(tikv_pd_cfg, port=plan["tikv_pd_port"]) _set_service_port(tikv_cfg, port=plan["tikv_port"]) + _set_service_port(master_cfg, port=plan["master_port"]) etcd_entrypoint = _require_str(etcd_cfg.get("entrypoint"), "deployconf.service.etcd.entrypoint") etcd_entrypoint = _replace_expected_substring( 
@@ -972,6 +979,13 @@ def _set_service_port(service_cfg: dict[str, Any], *, port: int) -> None: service_cfg["in_container_port"] = int(port) +def _extract_master_listen_port(*, entrypoint: str) -> int | None: + match = re.search(r"(?m)^[ \t]*port:\s*(\d+)\s*$", entrypoint) + if match is None: + return None + return _require_port_number(match.group(1), "deployconf.service.master.entrypoint port") + + def _replace_expected_substring(*, value: str, old: str, new: str, ctx: str) -> str: if old in value: return value.replace(old, new) @@ -1398,7 +1412,7 @@ def _test_runner_ui_summary_from_cfg( "url": ui_cfg["url"], "probe_url": ui_cfg["probe_url"], "workdir": str(ui_cfg["workdir"]), - "log_path": str(ui_cfg["log_path"]), + "log_path": str(ui_cfg["active_log_path"]), "history_lookback_days": int(ui_cfg["history_lookback_days"]), "history_roots": [str(path) for path in ui_cfg["history_roots"]], "gitops_config_path": ( @@ -1459,7 +1473,8 @@ def _parse_test_runner_ui_config( _require_str(ui_cfg.get("gitops_config_path"), "test_runner_ui.gitops_config_path"), "test_runner_ui.gitops_config_path", ) - log_path = (workdir / TEST_RUNNER_UI_LOG_FILENAME).resolve() + log_path = (workdir.resolve() / TEST_RUNNER_UI_LOG_FILENAME).resolve() + active_log_path = log_shard.daily_sharded_log_path(log_path) return { "enabled": True, "host": host, @@ -1468,6 +1483,7 @@ def _parse_test_runner_ui_config( "probe_url": _test_runner_ui_probe_url(host=host, port=port), "workdir": workdir.resolve(), "log_path": log_path, + "active_log_path": active_log_path, "history_lookback_days": int(history_lookback_days), "history_roots": [path.resolve() for path in history_roots], "gitops_config_path": gitops_config_path.resolve() if gitops_config_path is not None else None, @@ -1588,7 +1604,7 @@ def _ensure_test_runner_ui_started(*, ui_cfg: dict[str, Any]) -> dict[str, Any]: if ui_cfg["gitops_config_path"] is not None: argv.extend(["--gitops-config", str(ui_cfg["gitops_config_path"])]) - log_path = 
Path(ui_cfg["log_path"]).resolve() + log_path = Path(ui_cfg["active_log_path"]).resolve() log_path.parent.mkdir(parents=True, exist_ok=True) log_handle = log_path.open("a", encoding="utf-8") try: diff --git a/fluxon_test_stack/test_runner.py b/fluxon_test_stack/test_runner.py index e31d500..f3344b8 100644 --- a/fluxon_test_stack/test_runner.py +++ b/fluxon_test_stack/test_runner.py @@ -37,6 +37,11 @@ import yaml +RUNNER_REPO_ROOT = Path(__file__).resolve().parent.parent +RUNNER_DEPLOYMENT_DIR = RUNNER_REPO_ROOT / "deployment" +RUNNER_TEMPLATE_DIR = (RUNNER_REPO_ROOT / "fluxon_test_stack" / "test_runner_templates").resolve() +sys.path.insert(0, str(RUNNER_DEPLOYMENT_DIR)) + from benchmark_role_names import ( KV_NODE_ROLE_SEED, KV_NODE_ROLE_WORKER, @@ -51,6 +56,7 @@ run_top_attention_entries, select_top_attention_entries, ) +from utils import log_shard # NOTE: This project uses multiple schemas: @@ -277,10 +283,10 @@ def _test_stack_mode_requires_kv_master(mode: str) -> bool: "workloads may still be stopping", ) _WAIT_DELETE_APPLY_REQUIRES_DELETE_ERR = "wait_delete_apply requires delete_apply first" -RUNNER_REPO_ROOT = Path(__file__).resolve().parent.parent RUNNER_SHARED_RUNTIME_DIR = (RUNNER_REPO_ROOT / "fluxon_test_stack" / "test_runner").resolve() RUNNER_SHARED_LOCK_DIR = (RUNNER_SHARED_RUNTIME_DIR / "locks").resolve() RUNNER_STDIO_LOG_FILENAME = "test_runner.log" +_SERVICE_LOG_RETENTION_DAYS = log_shard.DEFAULT_DAILY_LOG_RETENTION_DAYS _ACTIVE_TEST_BED_SELECTION_SUPERVISOR_CHECK_CACHE_KEY: Optional[str] = None # TEST_STACK coordinator uses a stable workload name across cases; if a previous run crashed @@ -349,6 +355,7 @@ def _runner_native_ci_scene_ids() -> Tuple[str, ...]: return ( "ci_top_attention_doc_page_build", "ci_top_attention_bin_kvtest", + "ci_top_attention_log_mgmt", ) @@ -402,6 +409,7 @@ def _scene_id_uses_runner_native_ci_commands(scene_id: str) -> bool: _RUNNER_STDIO_LOG_FP: Optional[Any] = None _RUNNER_STDIO_KEEPALIVE_FDS: Optional[Tuple[int, int]] 
= None _RUNNER_STDIO_MIRROR_THREAD: Optional[threading.Thread] = None +_RUNNER_STDIO_ROUTER_THREAD: Optional[threading.Thread] = None _CI_WAIT_HEARTBEAT_INTERVAL_SECONDS = 15.0 _CI_WAIT_TAIL_MAX_CHARS = 8000 _TEST_RUNNER_UI_MAX_LOG_CHUNK_BYTES = 1024 * 1024 @@ -434,17 +442,65 @@ def _ci_log_prefix_lines(text: str, *, now: Optional[float] = None) -> str: return "".join(f"{prefix} {line}" if line.strip() else line for line in lines) +def _service_log_base_path(workdir_root: Path, *, filename: str) -> Path: + return (workdir_root / filename).resolve() + + +def _service_log_daily_path(base_path: Path, *, now: Optional[datetime.datetime] = None) -> Path: + return log_shard.daily_sharded_log_path(base_path, now=now) + + +def _service_log_latest_path(base_path: Path) -> Optional[Path]: + return log_shard.latest_existing_daily_sharded_log_path(base_path) + + +def _service_log_resolve_read_path(workdir_root: Path, *, filename: str) -> Optional[Path]: + base_path = _service_log_base_path(workdir_root, filename=filename) + return _service_log_resolve_read_path_from_base(base_path) + + +def _service_log_resolve_read_path_from_base(base_path: Path) -> Optional[Path]: + return log_shard.resolve_readable_log_path(base_path) + + +def _cleanup_old_service_logs(base_path: Path, *, retention_days: int = _SERVICE_LOG_RETENTION_DAYS) -> None: + log_shard.cleanup_old_daily_sharded_logs(base_path, retention_days=retention_days) + + +def _start_runner_stdio_log_router(*, base_log_path: Path, read_fd: int) -> None: + def _router_loop() -> None: + log_shard.relay_fd_to_daily_sharded_logs( + base_log_path=str(base_log_path), + read_fd=read_fd, + retention_days=_SERVICE_LOG_RETENTION_DAYS, + ) + + router = threading.Thread( + target=_router_loop, + name="test-runner-stdio-log-router", + daemon=True, + ) + router.start() + global _RUNNER_STDIO_ROUTER_THREAD + _RUNNER_STDIO_ROUTER_THREAD = router + + def _start_runner_stdio_log_mirror(*, log_path: Path, stdout_fd: int) -> None: def 
_mirror_loop() -> None: offset = 0 + current_path: Optional[Path] = None while True: try: - if log_path.exists(): - size = log_path.stat().st_size + resolved_path = _service_log_resolve_read_path_from_base(log_path) + if isinstance(resolved_path, Path) and resolved_path.exists(): + if current_path != resolved_path: + current_path = resolved_path + offset = 0 + size = resolved_path.stat().st_size if size < offset: offset = 0 if size > offset: - with log_path.open("r", encoding="utf-8", errors="replace") as fp: + with resolved_path.open("r", encoding="utf-8", errors="replace") as fp: fp.seek(offset) chunk = fp.read() offset = fp.tell() @@ -469,7 +525,11 @@ def _mirror_loop() -> None: _RUNNER_STDIO_MIRROR_THREAD = mirror -def _redirect_process_stdio_to_log(workdir_root: Path) -> None: +def _redirect_process_stdio_to_log( + workdir_root: Path, + *, + filename: str = RUNNER_STDIO_LOG_FILENAME, +) -> None: """Route runner stdio to a stable workdir log so long suites survive PTY loss. English note: @@ -481,10 +541,13 @@ def _redirect_process_stdio_to_log(workdir_root: Path) -> None: """ global _RUNNER_STDIO_LOG_FP global _RUNNER_STDIO_KEEPALIVE_FDS + global _RUNNER_STDIO_ROUTER_THREAD if _RUNNER_STDIO_LOG_FP is not None: return - log_path = (workdir_root / RUNNER_STDIO_LOG_FILENAME).resolve() + base_log_path = _service_log_base_path(workdir_root, filename=filename) + _cleanup_old_service_logs(base_log_path) + log_path = _service_log_daily_path(base_log_path) log_fp = log_path.open("a", encoding="utf-8", buffering=1) banner = ( f"{_ci_log_timestamp_prefix()} [test_runner] redirecting process stdio to stable log: {log_path}\n" @@ -519,15 +582,26 @@ def _redirect_process_stdio_to_log(workdir_root: Path) -> None: except OSError: _RUNNER_STDIO_KEEPALIVE_FDS = (-1, -1) - os.dup2(log_fp.fileno(), sys.stdout.fileno()) - os.dup2(log_fp.fileno(), sys.stderr.fileno()) + read_fd, write_fd = os.pipe() + router_keepalive = os.dup(write_fd) + 
_start_runner_stdio_log_router(base_log_path=base_log_path, read_fd=read_fd) + os.dup2(write_fd, sys.stdout.fileno()) + os.dup2(write_fd, sys.stderr.fileno()) sys.stdout = os.fdopen(sys.stdout.fileno(), "w", encoding="utf-8", buffering=1, closefd=False) sys.stderr = os.fdopen(sys.stderr.fileno(), "w", encoding="utf-8", buffering=1, closefd=False) - _RUNNER_STDIO_LOG_FP = log_fp + try: + os.close(write_fd) + except OSError: + pass + try: + log_fp.close() + except OSError: + pass + _RUNNER_STDIO_LOG_FP = os.fdopen(router_keepalive, "w", encoding="utf-8", buffering=1) if _runner_stdio_mirror_enabled(): keepalive = _RUNNER_STDIO_KEEPALIVE_FDS or (-1, -1) _start_runner_stdio_log_mirror( - log_path=log_path, + log_path=base_log_path, stdout_fd=int(keepalive[0]), ) @@ -2919,105 +2993,17 @@ def _write_deployer_manifests(resolved_case: Dict[str, Any], run_dir: Path, *, a orig_argv = [cmd0] + args exec_cmd = " ".join(_shell_quote(x) for x in orig_argv) - # Generate a self-contained SigV4 GET downloader (Fluxon FS S3 gateway) and then exec the original argv. 
- bash_script = ( - "set -euo pipefail\n" - "python3 - <<'PY'\n" - "import datetime\n" - "import hashlib\n" - "import hmac\n" - "import os\n" - "import urllib.parse\n" - "import urllib.request\n" - "from pathlib import Path\n" - "\n" - f"BASE_URL = {s3_base_url!r}\n" - f"BUCKET = {s3_bucket!r}\n" - f"OBJECT_KEY = {object_key!r}\n" - f"DEST_PATH = {payload_dest_path_s!r}\n" - f"ACCESS_KEY = {s3_access_key!r}\n" - f"SECRET_KEY = {s3_secret_key!r}\n" - f"REGION = {s3_region!r}\n" - "\n" - "ALG = 'AWS4-HMAC-SHA256'\n" - "SERVICE = 's3'\n" - "TERM = 'aws4_request'\n" - "UNSIGNED = 'UNSIGNED-PAYLOAD'\n" - "\n" - "def _hmac_sha256(key: bytes, msg: bytes) -> bytes:\n" - " return hmac.new(key, msg, hashlib.sha256).digest()\n" - "\n" - "def _sha256_hex(msg: bytes) -> str:\n" - " return hashlib.sha256(msg).hexdigest()\n" - "\n" - "def _derive_signing_key(secret_key: str, scope_date: str, region: str) -> bytes:\n" - " k_date = _hmac_sha256(('AWS4' + secret_key).encode('utf-8'), scope_date.encode('utf-8'))\n" - " k_region = _hmac_sha256(k_date, region.encode('utf-8'))\n" - " k_service = _hmac_sha256(k_region, SERVICE.encode('utf-8'))\n" - " return _hmac_sha256(k_service, TERM.encode('utf-8'))\n" - "\n" - "def _sigv4_headers(*, method: str, signing_path: str, query: str, host: str, scope_date: str, amz_date: str, payload_hash: str) -> dict:\n" - " signed_headers = 'host;x-amz-content-sha256;x-amz-date'\n" - " canonical_headers = ''\n" - " canonical_headers += f'host:{host}\\n'\n" - " canonical_headers += f'x-amz-content-sha256:{payload_hash}\\n'\n" - " canonical_headers += f'x-amz-date:{amz_date}\\n'\n" - " canonical_request = '\\n'.join([method, signing_path, query, canonical_headers, signed_headers, payload_hash])\n" - " cr_hash = _sha256_hex(canonical_request.encode('utf-8'))\n" - " scope = f'{scope_date}/{REGION}/{SERVICE}/{TERM}'\n" - " string_to_sign = '\\n'.join([ALG, amz_date, scope, cr_hash])\n" - " signing_key = _derive_signing_key(SECRET_KEY, scope_date, REGION)\n" - 
" sig = hmac.new(signing_key, string_to_sign.encode('utf-8'), hashlib.sha256).hexdigest()\n" - " auth = f\"{ALG} Credential={ACCESS_KEY}/{scope}, SignedHeaders={signed_headers}, Signature={sig}\"\n" - " return {\n" - " 'Authorization': auth,\n" - " 'x-amz-date': amz_date,\n" - " 'x-amz-content-sha256': payload_hash,\n" - " 'Host': host,\n" - " }\n" - "\n" - "u = urllib.parse.urlparse(BASE_URL)\n" - "if u.scheme not in ('http', 'https'):\n" - " raise ValueError('BASE_URL must be http(s)')\n" - "if not u.netloc:\n" - " raise ValueError('BASE_URL missing host')\n" - "base_path = u.path.rstrip('/')\n" - "if base_path == '':\n" - " raise ValueError('BASE_URL must include a non-root path prefix (e.g. /fs_s3)')\n" - "\n" - "bucket_enc = urllib.parse.quote(BUCKET, safe='-_.~')\n" - "key_enc = urllib.parse.quote(OBJECT_KEY, safe='/-_.~')\n" - "full_path = base_path + '/' + bucket_enc + '/' + key_enc\n" - # Sign the *actual* client-visible request path (including s3_base_url path prefix, e.g. "/fs_s3"). 
- "signing_path = full_path\n" - "url = f'{u.scheme}://{u.netloc}{full_path}'\n" - "\n" - "now = datetime.datetime.utcnow()\n" - "amz_date = now.strftime('%Y%m%dT%H%M%SZ')\n" - "scope_date = now.strftime('%Y%m%d')\n" - "hdrs = _sigv4_headers(method='GET', signing_path=signing_path, query='', host=u.netloc, scope_date=scope_date, amz_date=amz_date, payload_hash=UNSIGNED)\n" - "\n" - "dest = Path(DEST_PATH)\n" - "dest.parent.mkdir(parents=True, exist_ok=True)\n" - "tmp = Path(str(dest) + '.tmp')\n" - "if tmp.exists():\n" - " tmp.unlink()\n" - "req = urllib.request.Request(url, method='GET')\n" - "for k, v in hdrs.items():\n" - " req.add_header(k, v)\n" - "with urllib.request.urlopen(req, timeout=60) as resp:\n" - " if getattr(resp, 'status', None) != 200:\n" - " body = resp.read(4096)\n" - " raise RuntimeError(f'download failed: status={getattr(resp, \"status\", None)} body={body!r}')\n" - " with tmp.open('wb') as f:\n" - " while True:\n" - " b = resp.read(1024 * 1024)\n" - " if not b:\n" - " break\n" - " f.write(b)\n" - "tmp.replace(dest)\n" - "PY\n" - f"exec {exec_cmd}\n" + # Keep the remote wrapper self-contained, but store it as a standalone template + # instead of hardcoding a long inline script in this Python source file. 
+ bash_script = _render_fluxon_fs_s3_payload_wrapper( + s3_base_url=s3_base_url, + s3_bucket=s3_bucket, + object_key=object_key, + payload_dest_path=payload_dest_path_s, + s3_access_key=s3_access_key, + s3_secret_key=s3_secret_key, + s3_region=s3_region, + exec_cmd=exec_cmd, ) # Deployer only consumes argv/cwd; container image is required by the YAML subset parser @@ -7670,6 +7656,18 @@ def _runner_native_ci_commands_for_case(case: _ResolvedCase, *, ctx: str) -> Lis "timeout_seconds": 21600, } ] + if scene_id == "ci_top_attention_log_mgmt": + return [ + { + "id": "top_attention_log_mgmt", + "command": ( + "__RUN_DIR__/venv/bin/python3 -u " + "__RUN_DIR__/src/fluxon_test_stack/top_attention_test_index/_log_mgmt.py " + "--case-config __RUN_DIR__/configs/ci_scene_config.yaml" + ), + "timeout_seconds": 21600, + } + ] raise ValueError(f"{ctx} unsupported runner-native CI scene: {scene_id!r}") @@ -12251,6 +12249,51 @@ def _shell_quote(s: str) -> str: return "'" + s.replace("'", "'\\''") + "'" +def _json_string_literal(value: str) -> str: + return json.dumps(value, ensure_ascii=True) + + +def _render_runner_template(*, template_name: str, replacements: Dict[str, str]) -> str: + template_path = (RUNNER_TEMPLATE_DIR / template_name).resolve() + if template_path.parent != RUNNER_TEMPLATE_DIR: + raise ValueError(f"template must stay under {RUNNER_TEMPLATE_DIR}: {template_path}") + if not template_path.is_file(): + raise ValueError(f"missing runner template: {template_path}") + rendered = template_path.read_text(encoding="utf-8") + for token, value in replacements.items(): + rendered = rendered.replace(token, value) + unresolved = sorted(set(re.findall(r"__FLUXON_TMPL_[A-Z0-9_]+__", rendered))) + if unresolved: + raise ValueError(f"unresolved runner template tokens: {unresolved} template={template_path}") + return rendered + + +def _render_fluxon_fs_s3_payload_wrapper( + *, + s3_base_url: str, + s3_bucket: str, + object_key: str, + payload_dest_path: str, + s3_access_key: str, 
+ s3_secret_key: str, + s3_region: str, + exec_cmd: str, +) -> str: + return _render_runner_template( + template_name="payload_fluxon_fs_s3_download_and_exec.sh.template", + replacements={ + "__FLUXON_TMPL_BASE_URL_JSON__": _json_string_literal(s3_base_url), + "__FLUXON_TMPL_BUCKET_JSON__": _json_string_literal(s3_bucket), + "__FLUXON_TMPL_OBJECT_KEY_JSON__": _json_string_literal(object_key), + "__FLUXON_TMPL_DEST_PATH_JSON__": _json_string_literal(payload_dest_path), + "__FLUXON_TMPL_ACCESS_KEY_JSON__": _json_string_literal(s3_access_key), + "__FLUXON_TMPL_SECRET_KEY_JSON__": _json_string_literal(s3_secret_key), + "__FLUXON_TMPL_REGION_JSON__": _json_string_literal(s3_region), + "__FLUXON_TMPL_EXEC_CMD__": exec_cmd, + }, + ) + + def _find_deploy_instance_opt(resolved_case: Dict[str, Any], *, instance_id: str) -> Optional[Dict[str, Any]]: deploy = _require_dict(resolved_case.get("deploy"), "resolved_case.deploy") @@ -16220,7 +16263,9 @@ def _consume_path(path: Path) -> None: return _consume_path((workdir_root / "case_runs.yaml").resolve()) - _consume_path((workdir_root / RUNNER_STDIO_LOG_FILENAME).resolve()) + runner_log_path = _service_log_resolve_read_path(workdir_root, filename=RUNNER_STDIO_LOG_FILENAME) + if isinstance(runner_log_path, Path): + _consume_path(runner_log_path) run_dir = (_ui_case_result_root(workdir_root, case_id) / _ui_run_dir_name(run_index)).resolve() _consume_path(run_dir) @@ -16327,7 +16372,7 @@ def _ui_case_overview(workdir_root: Path, *, case_id: str) -> Dict[str, Any]: def _ui_collect_suite_overview(workdir_root: Path) -> Dict[str, Any]: case_ids = _ui_collect_case_ids(workdir_root) cases = [_ui_case_overview(workdir_root, case_id=case_id) for case_id in case_ids] - runner_log_path = (workdir_root / RUNNER_STDIO_LOG_FILENAME).resolve() + runner_log_path = _service_log_resolve_read_path(workdir_root, filename=RUNNER_STDIO_LOG_FILENAME) running_cases = [case for case in cases if case.get("status") == "RUNNING"] incomplete_cases = [case for 
case in cases if case.get("status") in {"INCOMPLETE", "RESERVED"}] last_updated_unix_s = 0 @@ -16350,7 +16395,7 @@ def _ui_collect_suite_overview(workdir_root: Path) -> Dict[str, Any]: return { "workdir_root": workdir_root.resolve(), "case_runs_path": (workdir_root / "case_runs.yaml").resolve(), - "runner_log_path": runner_log_path if runner_log_path.exists() else None, + "runner_log_path": runner_log_path if isinstance(runner_log_path, Path) and runner_log_path.exists() else None, "running_case_count": len(running_cases), "status": "RUNNING" if running_cases else ("INCOMPLETE" if incomplete_cases else ("IDLE" if cases else "EMPTY")), "last_updated_unix_s": int(last_updated_unix_s), @@ -16451,7 +16496,7 @@ def _ui_workdir_id(workdir_root: Path) -> str: def _ui_workdir_touch_unix_s(workdir_root: Path) -> int: touched = 0 - for name in ("case_runs.yaml", RUNNER_STDIO_LOG_FILENAME): + for name in ("case_runs.yaml",): path = (workdir_root / name).resolve() if not path.exists(): continue @@ -16459,6 +16504,12 @@ def _ui_workdir_touch_unix_s(workdir_root: Path) -> int: touched = max(touched, int(path.stat().st_mtime)) except Exception: continue + runner_log_path = _service_log_resolve_read_path(workdir_root, filename=RUNNER_STDIO_LOG_FILENAME) + if isinstance(runner_log_path, Path) and runner_log_path.exists(): + try: + touched = max(touched, int(runner_log_path.stat().st_mtime)) + except Exception: + pass return int(touched) @@ -17897,8 +17948,11 @@ def _handle_api_log_chunk(self, parsed) -> None: self._send_json(400, {"error": "missing workdir_id"}) return suite_workdir = _ui_workdir_by_id(workdir_root, workdir_id, extra_history_roots) - path = (suite_workdir / RUNNER_STDIO_LOG_FILENAME).resolve() - if not path.exists(): + path = _service_log_resolve_read_path( + suite_workdir, + filename=RUNNER_STDIO_LOG_FILENAME, + ) + if not isinstance(path, Path) or not path.exists(): raise FileNotFoundError(f"runner log not found: {path}") elif kind == "run": workdir_id = 
(qs.get("workdir_id") or [""])[0] diff --git a/fluxon_test_stack/test_runner_templates/payload_fluxon_fs_s3_download_and_exec.sh.template b/fluxon_test_stack/test_runner_templates/payload_fluxon_fs_s3_download_and_exec.sh.template new file mode 100644 index 0000000..ca677bc --- /dev/null +++ b/fluxon_test_stack/test_runner_templates/payload_fluxon_fs_s3_download_and_exec.sh.template @@ -0,0 +1,108 @@ +set -euo pipefail +python3 - <<'PY' +import datetime +import hashlib +import hmac +import urllib.parse +import urllib.request +from pathlib import Path + +BASE_URL = __FLUXON_TMPL_BASE_URL_JSON__ +BUCKET = __FLUXON_TMPL_BUCKET_JSON__ +OBJECT_KEY = __FLUXON_TMPL_OBJECT_KEY_JSON__ +DEST_PATH = __FLUXON_TMPL_DEST_PATH_JSON__ +ACCESS_KEY = __FLUXON_TMPL_ACCESS_KEY_JSON__ +SECRET_KEY = __FLUXON_TMPL_SECRET_KEY_JSON__ +REGION = __FLUXON_TMPL_REGION_JSON__ + +ALG = "AWS4-HMAC-SHA256" +SERVICE = "s3" +TERM = "aws4_request" +UNSIGNED = "UNSIGNED-PAYLOAD" + + +def _hmac_sha256(key: bytes, msg: bytes) -> bytes: + return hmac.new(key, msg, hashlib.sha256).digest() + + +def _sha256_hex(msg: bytes) -> str: + return hashlib.sha256(msg).hexdigest() + + +def _derive_signing_key(secret_key: str, scope_date: str, region: str) -> bytes: + k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), scope_date.encode("utf-8")) + k_region = _hmac_sha256(k_date, region.encode("utf-8")) + k_service = _hmac_sha256(k_region, SERVICE.encode("utf-8")) + return _hmac_sha256(k_service, TERM.encode("utf-8")) + + +def _sigv4_headers(*, method: str, signing_path: str, query: str, host: str, scope_date: str, amz_date: str, payload_hash: str) -> dict: + signed_headers = "host;x-amz-content-sha256;x-amz-date" + canonical_headers = "" + canonical_headers += f"host:{host}\n" + canonical_headers += f"x-amz-content-sha256:{payload_hash}\n" + canonical_headers += f"x-amz-date:{amz_date}\n" + canonical_request = "\n".join([method, signing_path, query, canonical_headers, signed_headers, payload_hash]) + 
cr_hash = _sha256_hex(canonical_request.encode("utf-8")) + scope = f"{scope_date}/{REGION}/{SERVICE}/{TERM}" + string_to_sign = "\n".join([ALG, amz_date, scope, cr_hash]) + signing_key = _derive_signing_key(SECRET_KEY, scope_date, REGION) + sig = hmac.new(signing_key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest() + auth = f"{ALG} Credential={ACCESS_KEY}/{scope}, SignedHeaders={signed_headers}, Signature={sig}" + return { + "Authorization": auth, + "x-amz-date": amz_date, + "x-amz-content-sha256": payload_hash, + "Host": host, + } + + +u = urllib.parse.urlparse(BASE_URL) +if u.scheme not in ("http", "https"): + raise ValueError("BASE_URL must be http(s)") +if not u.netloc: + raise ValueError("BASE_URL missing host") +base_path = u.path.rstrip("/") +if base_path == "": + raise ValueError("BASE_URL must include a non-root path prefix (e.g. /fs_s3)") + +bucket_enc = urllib.parse.quote(BUCKET, safe="-_.~") +key_enc = urllib.parse.quote(OBJECT_KEY, safe="/-_.~") +full_path = base_path + "/" + bucket_enc + "/" + key_enc +signing_path = full_path +url = f"{u.scheme}://{u.netloc}{full_path}" + +now = datetime.datetime.utcnow() +amz_date = now.strftime("%Y%m%dT%H%M%SZ") +scope_date = now.strftime("%Y%m%d") +hdrs = _sigv4_headers( + method="GET", + signing_path=signing_path, + query="", + host=u.netloc, + scope_date=scope_date, + amz_date=amz_date, + payload_hash=UNSIGNED, +) + +dest = Path(DEST_PATH) +dest.parent.mkdir(parents=True, exist_ok=True) +tmp = Path(str(dest) + ".tmp") +if tmp.exists(): + tmp.unlink() +req = urllib.request.Request(url, method="GET") +for k, v in hdrs.items(): + req.add_header(k, v) +with urllib.request.urlopen(req, timeout=60) as resp: + if getattr(resp, "status", None) != 200: + body = resp.read(4096) + raise RuntimeError(f'download failed: status={getattr(resp, "status", None)} body={body!r}') + with tmp.open("wb") as f: + while True: + b = resp.read(1024 * 1024) + if not b: + break + f.write(b) +tmp.replace(dest) +PY +exec 
__FLUXON_TMPL_EXEC_CMD__ diff --git a/fluxon_test_stack/test_runner_ui.py b/fluxon_test_stack/test_runner_ui.py index 9702da4..d7c6ac2 100644 --- a/fluxon_test_stack/test_runner_ui.py +++ b/fluxon_test_stack/test_runner_ui.py @@ -53,6 +53,10 @@ def main() -> None: raw_path=Path(args.workdir), field_name="workdir", ) + test_runner._redirect_process_stdio_to_log( + workdir_root, + filename="test_runner_ui.log", + ) gitops_cfg_path = None if args.gitops_config: gitops_cfg_path = test_runner._resolve_repo_root_cli_path( diff --git a/fluxon_test_stack/tests/test_ci_2_virt_node_contract.py b/fluxon_test_stack/tests/test_ci_2_virt_node_contract.py index 96f0554..a9d0cf7 100644 --- a/fluxon_test_stack/tests/test_ci_2_virt_node_contract.py +++ b/fluxon_test_stack/tests/test_ci_2_virt_node_contract.py @@ -29,12 +29,13 @@ def _load_module(): class TestCi2VirtNodeContract(unittest.TestCase): _KVTEST_SCENE_ID = "ci_top_attention_bin_kvtest" _DOC_SCENE_ID = "ci_top_attention_doc_page_build" + _LOG_MGMT_SCENE_ID = "ci_top_attention_log_mgmt" def test_generated_suite_is_public_dual_local_nodes_ci_only(self) -> None: suite_cfg = _ENTRY._load_yaml_mapping(_ENTRY.DEFAULT_SUITE_PATH, ctx="suite") generated = _ENTRY._rewrite_suite_for_local_dual_nodes( suite_cfg=suite_cfg, - scene_ids=[self._DOC_SCENE_ID, self._KVTEST_SCENE_ID], + scene_ids=[self._DOC_SCENE_ID, self._KVTEST_SCENE_ID, self._LOG_MGMT_SCENE_ID], primary_node_name="local-node-a", secondary_node_name="local-node-b", host_ip="10.1.1.119", @@ -43,7 +44,7 @@ def test_generated_suite_is_public_dual_local_nodes_ci_only(self) -> None: ) self.assertEqual(generated["run"]["selectors"]["profile_ids"], ["fluxon_tcp_thread"]) - self.assertEqual(set(generated["scenes"].keys()), {self._DOC_SCENE_ID, self._KVTEST_SCENE_ID}) + self.assertEqual(set(generated["scenes"].keys()), {self._DOC_SCENE_ID, self._KVTEST_SCENE_ID, self._LOG_MGMT_SCENE_ID}) self.assertEqual(generated["profiles"]["fluxon_tcp_thread"]["artifact_set"], 
"fluxon_tcp_thread") self.assertEqual( generated["profiles"]["fluxon_tcp_thread"]["runtime"]["ci"]["scene_configs"][self._KVTEST_SCENE_ID][ @@ -51,6 +52,12 @@ def test_generated_suite_is_public_dual_local_nodes_ci_only(self) -> None: ], "tcp_thread_transport", ) + self.assertEqual( + generated["profiles"]["fluxon_tcp_thread"]["runtime"]["ci"]["scene_configs"][self._LOG_MGMT_SCENE_ID][ + "enabled" + ], + True, + ) self.assertEqual( generated["profiles"]["fluxon_tcp_thread"]["runtime"]["ci"]["deploy"]["target_ip_map"], {"local-node-a": "10.1.1.119", "local-node-b": "10.1.1.119"}, @@ -106,11 +113,16 @@ def test_generated_suite_is_public_dual_local_nodes_ci_only(self) -> None: generated["scenes"][self._KVTEST_SCENE_ID]["select"]["scales"], ["n1_kvowner_dram_20gib"], ) + self.assertEqual( + generated["scenes"][self._LOG_MGMT_SCENE_ID]["select"]["scales"], + ["n1_kvowner_dram_20gib"], + ) self.assertEqual( set(generated["scales"].keys()), {"n1_kvowner_dram_3gib", "n1_kvowner_dram_20gib"}, ) self.assertNotIn("commands", generated["scenes"][self._KVTEST_SCENE_ID]["ci"]) + self.assertNotIn("commands", generated["scenes"][self._LOG_MGMT_SCENE_ID]["ci"]) def test_generated_suite_preserves_source_scene_configs(self) -> None: suite_cfg = _ENTRY._load_yaml_mapping(_ENTRY.DEFAULT_SUITE_PATH, ctx="suite") @@ -211,7 +223,23 @@ def test_generated_deployconf_rewrites_to_dual_local_nodes(self) -> None: self.assertIn('--wheel "$FLUXON_RELEASE_WHEEL"', generated["global_envs"]["FLUXON_RELEASE_WHEEL_FETCH_CMD"]) self.assertEqual(generated["atomic_groups"]["fluxon_core_controller"]["nodes"], ["local-node-a", "local-node-b"]) self.assertEqual(generated["service"]["owner"]["node_bind"]["node"], ["local-node-a", "local-node-b"]) + self.assertIn( + 'log_root_path: "${HOSTWORKDIR}/large/log/owner_${NODE_ID}"', + generated["service"]["owner"]["entrypoint"], + ) + self.assertIn( + 'cache_root_path: "${HOSTWORKDIR}/large/cache/owner_${NODE_ID}"', + generated["service"]["owner"]["entrypoint"], + ) 
self.assertEqual(generated["service"]["ops_controller"]["port"], 19180) + self.assertIn( + 'http_listen_addr: "0.0.0.0:${OPS_CONTROLLER__PORT}"', + generated["service"]["ops_controller"]["entrypoint"], + ) + self.assertNotIn( + 'http_listen_addr: "0.0.0.0:${MASTER__PORT}"', + generated["service"]["ops_controller"]["entrypoint"], + ) self.assertIn("local-node-a", generated["service"]["ops_agent"]["entrypoint"]) self.assertIn("local-node-b", generated["service"]["ops_agent"]["entrypoint"]) self.assertIn(' - "10.1.1.119/32"', generated["service"]["master"]["entrypoint"]) @@ -401,7 +429,7 @@ def test_main_supports_explicit_suite_path(self) -> None: suite_cfg["scenes"] = { key: value for key, value in suite_cfg["scenes"].items() - if key in (self._DOC_SCENE_ID, self._KVTEST_SCENE_ID) + if key in (self._DOC_SCENE_ID, self._KVTEST_SCENE_ID, self._LOG_MGMT_SCENE_ID) } suite_cfg["profiles"] = {"fluxon_tcp": suite_cfg["profiles"]["fluxon_tcp"]} suite_cfg["run"]["selectors"]["profile_ids"] = ["fluxon_tcp"] @@ -409,6 +437,7 @@ def test_main_supports_explicit_suite_path(self) -> None: suite_cfg["profiles"]["fluxon_tcp"]["runtime"]["ci"]["scene_configs"][self._DOC_SCENE_ID]["doc_site_base_url"] = ( "tele-ai.github.io/Fluxon" ) + suite_cfg["profiles"]["fluxon_tcp"]["runtime"]["ci"]["scene_configs"][self._LOG_MGMT_SCENE_ID]["enabled"] = True _ENTRY._write_yaml(suite_path, suite_cfg) release_dir = REPO_ROOT / "fluxon_release" release_dir.mkdir(parents=True, exist_ok=True) @@ -445,7 +474,7 @@ def test_main_supports_explicit_suite_path(self) -> None: workdir / "generated" / "ci_test_list.local.yaml", ctx="generated suite", ) - self.assertEqual(set(generated_suite["scenes"].keys()), {self._DOC_SCENE_ID, self._KVTEST_SCENE_ID}) + self.assertEqual(set(generated_suite["scenes"].keys()), {self._DOC_SCENE_ID, self._KVTEST_SCENE_ID, self._LOG_MGMT_SCENE_ID}) self.assertEqual( generated_suite["profiles"]["fluxon_tcp_thread"]["runtime"]["ci"]["scene_configs"][self._KVTEST_SCENE_ID][ 
"kv_test_rounds" @@ -458,6 +487,12 @@ def test_main_supports_explicit_suite_path(self) -> None: ], "tele-ai.github.io/Fluxon", ) + self.assertEqual( + generated_suite["profiles"]["fluxon_tcp_thread"]["runtime"]["ci"]["scene_configs"][self._LOG_MGMT_SCENE_ID][ + "enabled" + ], + True, + ) def test_main_same_host_generated_configs_use_non_loopback_host_ip(self) -> None: with tempfile.TemporaryDirectory() as td: @@ -563,6 +598,60 @@ def fake_run(argv: list[str], *, env=None) -> None: str((REPO_ROOT / "fluxon_test_stack" / "pack_test_stack_rsc.py").resolve()), ) + def test_main_passes_explicit_release_dir_to_pack_stage(self) -> None: + with tempfile.TemporaryDirectory() as td: + root = Path(td) + workdir = root / "ci_2_virt_node_workdir" + hostworkdir = root / "hostworkdir" + release_dir = root / "custom_release" + release_dir.mkdir(parents=True, exist_ok=True) + wheel_path = release_dir / "fluxon-0.2.1-cp38-abi3-manylinux_2_28_x86_64.whl" + wheel_path.write_text("", encoding="utf-8") + calls: list[tuple[list[str], dict[str, str] | None]] = [] + + def fake_run(argv: list[str], *, env=None) -> None: + calls.append((list(argv), None if env is None else dict(env))) + + argv = [ + "ci_2_virt_node.py", + "--workdir", + str(workdir), + "--testbed-hostworkdir", + str(hostworkdir), + "--release-dir", + str(release_dir), + "--scene-id", + self._KVTEST_SCENE_ID, + "--skip-builder-image", + "--skip-dispatch", + "--skip-start-testbed", + "--skip-runner", + ] + original_argv = sys.argv[:] + try: + with mock.patch.object(_ENTRY, "_run", side_effect=fake_run): + with mock.patch.object(_ENTRY, "_detect_local_hostname", return_value="runner-host"): + with mock.patch.object(_ENTRY, "_detect_local_ipv4", return_value="10.1.1.119"): + with mock.patch.object(_ENTRY, "_ensure_ci_pack_release_env", return_value=Path("/tmp/env.yaml")): + with mock.patch.object(_ENTRY, "_render_ci_nix_pack_config", return_value=Path("/tmp/cfg.yaml")): + sys.argv = argv + rc = _ENTRY.main() + finally: + 
sys.argv = original_argv + + self.assertEqual(rc, 0) + self.assertGreaterEqual(len(calls), 2) + pack_cmd = calls[1][0] + self.assertEqual( + pack_cmd[1], + str((REPO_ROOT / "fluxon_test_stack" / "pack_test_stack_rsc.py").resolve()), + ) + self.assertIn("--release-dir", pack_cmd) + self.assertEqual( + pack_cmd[pack_cmd.index("--release-dir") + 1], + str(release_dir.resolve()), + ) + def test_main_uses_apply_check_config_for_explicit_apply_validation(self) -> None: with tempfile.TemporaryDirectory() as td: root = Path(td) diff --git a/fluxon_test_stack/tests/test_pack_test_stack_rsc_cli.py b/fluxon_test_stack/tests/test_pack_test_stack_rsc_cli.py index d4bfac2..09afc1b 100644 --- a/fluxon_test_stack/tests/test_pack_test_stack_rsc_cli.py +++ b/fluxon_test_stack/tests/test_pack_test_stack_rsc_cli.py @@ -261,26 +261,19 @@ def test_git_stage_ci_source_tree_excludes_runtime_outputs(self) -> None: "scripts/build_doc_site.py", "fluxon_doc_cn/roadmap.md", "README.md", - "fluxon_release/install.py", - ".dever/run.log", - "skills/demo/SKILL.md", ): path = repo_root / relpath path.parent.mkdir(parents=True, exist_ok=True) path.write_text("x\n", encoding="utf-8") - - raw = b"\0".join( - [ - b"scripts/build_doc_site.py", - b"fluxon_doc_cn/roadmap.md", - b"README.md", - b"fluxon_release/install.py", - b".dever/run.log", - b"skills/demo/SKILL.md", - ] - ) + b"\0" - - with mock.patch.object(_PACK.subprocess, "check_output", return_value=raw): + with mock.patch.object( + _PACK, + "_collect_ci_source_relpaths", + return_value=[ + "README.md", + "fluxon_doc_cn/roadmap.md", + "scripts/build_doc_site.py", + ], + ): relpaths = _PACK._git_stage_ci_source_tree(repo_root=repo_root, stage_root=stage_root) self.assertEqual( @@ -308,25 +301,40 @@ def test_collect_ci_source_relpaths_excludes_runtime_outputs(self) -> None: path = repo_root / relpath path.parent.mkdir(parents=True, exist_ok=True) path.write_text("x\n", encoding="utf-8") + (repo_root / ".gitignore").write_text( + "\n".join( + [ + 
"fluxon_release/*", + "!fluxon_release/install.py", + ".dever", + "skills/", + ] + ) + + "\n", + encoding="utf-8", + ) - raw = b"\0".join( - [ - b"scripts/build_doc_site.py", - b"fluxon_doc_cn/roadmap.md", - b"README.md", - b"fluxon_release/install.py", - b".dever/run.log", - b"skills/demo/SKILL.md", - ] - ) + b"\0" - - with mock.patch.object(_PACK.subprocess, "check_output", return_value=raw): + def fake_check_output(argv, cwd=None): + del argv + cwd_path = Path(cwd).resolve() + if cwd_path == repo_root.resolve(): + return b"scripts/build_doc_site.py\0fluxon_doc_cn/roadmap.md\0README.md\0" + raise AssertionError(f"unexpected git ls-files cwd: {cwd_path}") + + with mock.patch.object( + _PACK.collect_source_profile_relpaths.__globals__["git_source_selection_utils"].subprocess, + "check_output", + side_effect=fake_check_output, + ): relpaths = _PACK._collect_ci_source_relpaths(repo_root=repo_root) self.assertEqual( relpaths, ["README.md", "fluxon_doc_cn/roadmap.md", "scripts/build_doc_site.py"], ) + self.assertNotIn("fluxon_release/install.py", relpaths) + self.assertNotIn(".dever/run.log", relpaths) + self.assertNotIn("skills/demo/SKILL.md", relpaths) def test_collect_ci_source_relpaths_includes_rather_no_git_submodule_sources(self) -> None: with tempfile.TemporaryDirectory() as tmpdir: @@ -334,6 +342,7 @@ def test_collect_ci_source_relpaths_includes_rather_no_git_submodule_sources(sel tracked_root = repo_root / "scripts" tracked_root.mkdir(parents=True, exist_ok=True) (tracked_root / "build_doc_site.py").write_text("tracked\n", encoding="utf-8") + (repo_root / ".gitignore").write_text("", encoding="utf-8") module_root = repo_root / "fluxon_rs" / "moka" (module_root / "src").mkdir(parents=True, exist_ok=True) (module_root / "Cargo.toml").write_text("module\n", encoding="utf-8") @@ -357,7 +366,11 @@ def fake_check_output(argv, cwd=None): return b"Cargo.toml\0src/lib.rs\0" raise AssertionError(f"unexpected git ls-files cwd: {cwd_path}") - with 
mock.patch.object(_PACK.subprocess, "check_output", side_effect=fake_check_output): + with mock.patch.object( + _PACK.collect_source_profile_relpaths.__globals__["git_source_selection_utils"].subprocess, + "check_output", + side_effect=fake_check_output, + ): relpaths = _PACK._collect_ci_source_relpaths(repo_root=repo_root) self.assertEqual( @@ -384,9 +397,9 @@ def test_collect_ci_source_relpaths_requires_rather_no_git_submodule_root_to_exi with ( mock.patch.object( - _PACK, - "_collect_git_listed_source_relpaths", - return_value=["scripts/build_doc_site.py"], + _PACK.collect_source_profile_relpaths.__globals__["git_source_selection_utils"].subprocess, + "check_output", + return_value=b"scripts/build_doc_site.py\0", ), self.assertRaisesRegex( RuntimeError, @@ -422,6 +435,54 @@ def test_compute_ci_source_digest_uses_selected_git_paths_only(self) -> None: digest_roots = digest_mock.call_args.args[0] self.assertEqual(digest_roots, [tracked.resolve()]) + def test_prune_stage_paths_applies_glob_patterns(self) -> None: + with tempfile.TemporaryDirectory() as tmpdir: + stage_root = Path(tmpdir) + keep_path = stage_root / "keep.txt" + pyc_path = stage_root / "pkg" / "drop.pyc" + baseline_file = stage_root / "baselines" / "manifest.txt" + pyc_path.parent.mkdir(parents=True, exist_ok=True) + baseline_file.parent.mkdir(parents=True, exist_ok=True) + keep_path.write_text("keep\n", encoding="utf-8") + pyc_path.write_text("drop\n", encoding="utf-8") + baseline_file.write_text("drop\n", encoding="utf-8") + + _PACK.script_utils.prune_stage_paths( + stage_root, + ("*.pyc", "baselines/"), + ) + + self.assertTrue(keep_path.exists()) + self.assertFalse(pyc_path.exists()) + self.assertFalse(baseline_file.exists()) + + def test_shared_rsync_stage_accepts_exclude_patterns(self) -> None: + with tempfile.TemporaryDirectory() as tmpdir: + repo_root = Path(tmpdir) + src = repo_root / "src" + dst = repo_root / "dst" + (src / "keep").mkdir(parents=True, exist_ok=True) + (src / 
"drop").mkdir(parents=True, exist_ok=True) + (src / "keep" / "a.txt").write_text("keep\n", encoding="utf-8") + (src / "drop" / "b.txt").write_text("drop\n", encoding="utf-8") + + run_mock = mock.Mock() + with mock.patch.dict( + _PACK.script_utils.rsync_stage.__globals__, + {"run_cmd_argv": run_mock}, + ): + _PACK.script_utils.rsync_stage( + repo_root=repo_root, + src=src, + dst=dst, + honor_gitignore=False, + exclude_rel_paths=("drop/", "*.tmp"), + ) + + argv = run_mock.call_args.args[0] + self.assertIn("--exclude=drop/", argv) + self.assertIn("--exclude=*.tmp", argv) + if __name__ == "__main__": raise SystemExit(unittest.main()) diff --git a/fluxon_test_stack/tests/test_runner_contract.py b/fluxon_test_stack/tests/test_runner_contract.py index d017841..67d42e0 100644 --- a/fluxon_test_stack/tests/test_runner_contract.py +++ b/fluxon_test_stack/tests/test_runner_contract.py @@ -59,6 +59,10 @@ def _build_checks(selected_test_id: Optional[str]) -> List[Tuple[str, Callable[[ "ci_top_attention_doc_page_build_declares_setup_dev_env_prepare", test_ci_top_attention_doc_page_build_declares_setup_dev_env_prepare, ), + ( + "ci_top_attention_log_mgmt_scene_exists", + test_ci_top_attention_log_mgmt_scene_exists, + ), ] if selected_test_id is None: return checks @@ -247,5 +251,51 @@ def test_ci_top_attention_doc_page_build_declares_setup_dev_env_prepare() -> Non print("PASS: test_ci_top_attention_doc_page_build_declares_setup_dev_env_prepare") +def test_ci_top_attention_log_mgmt_scene_exists() -> None: + repo_root = Path(__file__).resolve().parents[2] + suite_cfg_path = repo_root / "fluxon_test_stack" / "ci_test_list.yaml" + suite_cfg = yaml.safe_load(suite_cfg_path.read_text(encoding="utf-8")) + if not isinstance(suite_cfg, dict): + print("FAIL: test_ci_top_attention_log_mgmt_scene_exists - suite config is not a mapping") + return + + suite_for_contract = copy.deepcopy(suite_cfg) + artifact_sets = suite_for_contract.get("artifact_sets") + if not isinstance(artifact_sets, 
dict): + print("FAIL: test_ci_top_attention_log_mgmt_scene_exists - artifact_sets is not a mapping") + return + for artifact_set in artifact_sets.values(): + if not isinstance(artifact_set, dict): + continue + release_artifacts = artifact_set.get("release_artifacts") + if isinstance(release_artifacts, dict): + python_wheel = release_artifacts.get("python_wheel") + if isinstance(python_wheel, str) and python_wheel.strip(): + artifact_set["release_artifacts"] = {"wheel": python_wheel} + + suite = _TEST_RUNNER._parse_suite_config(suite_for_contract) + scene = suite.scenes.get("ci_top_attention_log_mgmt") + if not isinstance(scene, dict): + print("FAIL: test_ci_top_attention_log_mgmt_scene_exists - missing scene") + return + ci = scene.get("ci") + if not isinstance(ci, dict): + print("FAIL: test_ci_top_attention_log_mgmt_scene_exists - scene.ci missing") + return + if ci.get("subject") != "rust": + print( + "FAIL: test_ci_top_attention_log_mgmt_scene_exists - " + f"expected subject 'rust', got {ci.get('subject')!r}" + ) + return + if ci.get("runtime_contract") != "rust_self_managed": + print( + "FAIL: test_ci_top_attention_log_mgmt_scene_exists - " + f"expected runtime_contract 'rust_self_managed', got {ci.get('runtime_contract')!r}" + ) + return + print("PASS: test_ci_top_attention_log_mgmt_scene_exists") + + if __name__ == "__main__": raise SystemExit(main()) diff --git a/fluxon_test_stack/tests/test_test_runner_testbed_contract.py b/fluxon_test_stack/tests/test_test_runner_testbed_contract.py index 617ffda..4272c10 100644 --- a/fluxon_test_stack/tests/test_test_runner_testbed_contract.py +++ b/fluxon_test_stack/tests/test_test_runner_testbed_contract.py @@ -103,6 +103,30 @@ def test_top_attention_ci_execution_plan_is_runner_native(self) -> None: self.assertEqual(planned[0].ci_commands[0]["id"], "top_attention_bin_kvtest") self.assertIn("--case-config __RUN_DIR__/configs/ci_scene_config.yaml", planned[0].ci_commands[0]["command"]) + def 
test_top_attention_log_mgmt_ci_execution_plan_is_runner_native(self) -> None: + suite_cfg = yaml.safe_load((_RUNNER.RUNNER_REPO_ROOT / "fluxon_test_stack" / "ci_test_list.yaml").read_text(encoding="utf-8")) + artifact_sets = suite_cfg.get("artifact_sets") + if isinstance(artifact_sets, dict): + for artifact_set in artifact_sets.values(): + if not isinstance(artifact_set, dict): + continue + release_artifacts = artifact_set.get("release_artifacts") + if isinstance(release_artifacts, dict): + python_wheel = release_artifacts.get("python_wheel") + if isinstance(python_wheel, str) and python_wheel.strip(): + artifact_set["release_artifacts"] = {"wheel": python_wheel} + suite = _RUNNER._parse_suite_config(suite_cfg) + cases = _RUNNER._expand_cases(suite) + case = next(item for item in cases if item.scene_id == "ci_top_attention_log_mgmt" and item.profile_id == "fluxon_tcp") + planned = _RUNNER._build_ci_execution_plan(case, suite) + self.assertEqual(len(planned), 1) + self.assertEqual(planned[0].ci_commands[0]["id"], "top_attention_log_mgmt") + self.assertIn( + "__RUN_DIR__/src/fluxon_test_stack/top_attention_test_index/_log_mgmt.py", + planned[0].ci_commands[0]["command"], + ) + self.assertIn("--case-config __RUN_DIR__/configs/ci_scene_config.yaml", planned[0].ci_commands[0]["command"]) + def test_ci_prepare_run_inputs_rebuilds_release_view_without_reusing_source_test_rsc(self) -> None: with tempfile.TemporaryDirectory() as td: root = Path(td) @@ -520,6 +544,81 @@ def test_ci_base_runtime_service_target_ip_uses_loopback_for_same_host_local_nod "127.0.0.1", ) + def test_write_deployer_manifests_renders_payload_wrapper_from_template(self) -> None: + with tempfile.TemporaryDirectory() as td: + run_dir = Path(td) + resolved_case = { + "case": { + "case_id": "bench_case", + "profile_id": "bench_profile", + }, + "scene": { + "bench": { + "subject": "kv", + } + }, + "deploy": { + "instances": [ + { + "id": "worker_0", + "k8s_ref": "deployment/test-worker", + "lifecycle": 
"service", + "deployer": { + "target": "logic-a", + "payload_file": "wheelhouse/pkg.whl", + "payload_dest_path": "/tmp/run/pkg.whl", + "command": ["/bin/sh", "-lc", "python3 /tmp/run/pkg.whl"], + }, + } + ], + "payload_delivery": { + "kind": _RUNNER.PAYLOAD_DELIVERY_KIND_FLUXON_FS_S3, + "s3_base_url": "http://127.0.0.1:19080/fs_s3", + "bucket": "bench-bucket", + "access_key": "bench-ak", + "secret_key": "bench-sk", + "region": "bench-region", + "key_prefix": "case-prefix", + }, + }, + "runtime": { + "workdir_root": str(run_dir.parent), + "run_dir": str(run_dir), + "stack_identity": { + "cluster_name": "fluxon_testbed", + "controller_url": "http://127.0.0.1:19080/r/ops/fluxon_testbed", + "shared_memory_path": "/tmp/shm", + "shared_file_path": "/tmp/share", + }, + }, + "artifact_set": { + "release_root": str(run_dir / "fluxon_release"), + "test_rsc_root": str(run_dir / "test_rsc"), + }, + } + + template_path = ( + _RUNNER.RUNNER_TEMPLATE_DIR / "payload_fluxon_fs_s3_download_and_exec.sh.template" + ).resolve() + self.assertTrue(template_path.is_file()) + + _RUNNER._write_deployer_manifests(resolved_case, run_dir, allow_overwrite=False) + + manifest_docs = list( + yaml.safe_load_all((run_dir / "deployer_deploy.yaml").read_text(encoding="utf-8")) + ) + self.assertEqual(len(manifest_docs), 1) + container = manifest_docs[0]["spec"]["template"]["spec"]["containers"][0] + self.assertEqual(container["command"], ["/bin/bash", "-lc"]) + self.assertEqual(len(container["args"]), 1) + script_text = container["args"][0] + self.assertIn("python3 - <<'PY'", script_text) + self.assertIn('BASE_URL = "http://127.0.0.1:19080/fs_s3"', script_text) + self.assertIn('OBJECT_KEY = "case-prefix/wheelhouse/pkg.whl"', script_text) + self.assertIn('DEST_PATH = "/tmp/run/pkg.whl"', script_text) + self.assertIn('exec /bin/sh -lc', script_text) + self.assertNotIn("__FLUXON_TMPL_", script_text) + if __name__ == "__main__": raise SystemExit(unittest.main()) diff --git 
a/fluxon_test_stack/tests/test_test_runner_ui_contract.py b/fluxon_test_stack/tests/test_test_runner_ui_contract.py index ff407e2..2abc4ec 100644 --- a/fluxon_test_stack/tests/test_test_runner_ui_contract.py +++ b/fluxon_test_stack/tests/test_test_runner_ui_contract.py @@ -119,6 +119,8 @@ def test_redirect_process_stdio_starts_mirror_on_github_actions(self) -> None: workdir = Path(td) original_log_fp = _RUNNER._RUNNER_STDIO_LOG_FP original_keepalive = _RUNNER._RUNNER_STDIO_KEEPALIVE_FDS + saved_stdout = sys.stdout + saved_stderr = sys.stderr with mock.patch.dict(os.environ, {"GITHUB_ACTIONS": "true"}, clear=False): _RUNNER._RUNNER_STDIO_LOG_FP = None _RUNNER._RUNNER_STDIO_KEEPALIVE_FDS = (11, 12) @@ -129,10 +131,18 @@ def test_redirect_process_stdio_starts_mirror_on_github_actions(self) -> None: self.assertEqual(dup2_mock.call_count, 2) start_mirror.assert_called_once() kwargs = start_mirror.call_args.kwargs - self.assertEqual(kwargs["log_path"], (workdir / _RUNNER.RUNNER_STDIO_LOG_FILENAME).resolve()) + expected_log_path = _RUNNER._service_log_base_path( + workdir, filename=_RUNNER.RUNNER_STDIO_LOG_FILENAME + ) + self.assertEqual(kwargs["log_path"], expected_log_path) self.assertEqual(kwargs["stdout_fd"], 11) self.assertNotIn("stderr_fd", kwargs) - if _RUNNER._RUNNER_STDIO_LOG_FP is not None: + sys.stdout = saved_stdout + sys.stderr = saved_stderr + if _RUNNER._RUNNER_STDIO_LOG_FP is not None and _RUNNER._RUNNER_STDIO_LOG_FP not in ( + sys.__stdout__, + sys.__stderr__, + ): _RUNNER._RUNNER_STDIO_LOG_FP.close() _RUNNER._RUNNER_STDIO_LOG_FP = original_log_fp _RUNNER._RUNNER_STDIO_KEEPALIVE_FDS = original_keepalive @@ -142,6 +152,8 @@ def test_redirect_process_stdio_skips_mirror_outside_github_actions(self) -> Non workdir = Path(td) original_log_fp = _RUNNER._RUNNER_STDIO_LOG_FP original_keepalive = _RUNNER._RUNNER_STDIO_KEEPALIVE_FDS + saved_stdout = sys.stdout + saved_stderr = sys.stderr with mock.patch.dict(os.environ, {}, clear=True): _RUNNER._RUNNER_STDIO_LOG_FP 
= None _RUNNER._RUNNER_STDIO_KEEPALIVE_FDS = (11, 12) @@ -151,7 +163,12 @@ def test_redirect_process_stdio_skips_mirror_outside_github_actions(self) -> Non _RUNNER._redirect_process_stdio_to_log(workdir) self.assertEqual(dup2_mock.call_count, 2) start_mirror.assert_not_called() - if _RUNNER._RUNNER_STDIO_LOG_FP is not None: + sys.stdout = saved_stdout + sys.stderr = saved_stderr + if _RUNNER._RUNNER_STDIO_LOG_FP is not None and _RUNNER._RUNNER_STDIO_LOG_FP not in ( + sys.__stdout__, + sys.__stderr__, + ): _RUNNER._RUNNER_STDIO_LOG_FP.close() _RUNNER._RUNNER_STDIO_LOG_FP = original_log_fp _RUNNER._RUNNER_STDIO_KEEPALIVE_FDS = original_keepalive @@ -225,6 +242,20 @@ def test_log_chunk_tail_and_before_window(self) -> None: self.assertEqual(older["text"], "2345") self.assertEqual(older["start"], 2) + def test_service_log_resolve_read_path_prefers_latest_daily_shard(self) -> None: + with tempfile.TemporaryDirectory() as td: + workdir = Path(td) + (workdir / "test_runner.2026-06-19.log").write_text("old\n", encoding="utf-8") + (workdir / "test_runner.2026-06-20.log").write_text("new\n", encoding="utf-8") + resolved = _RUNNER._service_log_resolve_read_path( + workdir, + filename=_RUNNER.RUNNER_STDIO_LOG_FILENAME, + ) + self.assertEqual( + resolved, + (workdir / "test_runner.2026-06-20.log").resolve(), + ) + def test_ops_logs_base_url_derives_from_controller_proxy(self) -> None: url = _RUNNER._ui_ops_logs_base_url("http://127.0.0.1:19080/r/ops/fluxon_testbed") self.assertEqual(url, "http://127.0.0.1:19080/logs") diff --git a/fluxon_test_stack/tests/test_top_attention_log_mgmt_contract.py b/fluxon_test_stack/tests/test_top_attention_log_mgmt_contract.py new file mode 100644 index 0000000..c06c033 --- /dev/null +++ b/fluxon_test_stack/tests/test_top_attention_log_mgmt_contract.py @@ -0,0 +1,112 @@ +#!/usr/bin/env python3 + +from __future__ import annotations + +import importlib.util +import sys +import tempfile +import unittest +from pathlib import Path +from unittest import 
mock + +import yaml + + +REPO_ROOT = Path(__file__).resolve().parents[2] +MODULE_PATH = REPO_ROOT / "fluxon_test_stack" / "top_attention_test_index" / "_log_mgmt.py" + + +def _load_module(): + module_dir = MODULE_PATH.parent + sys.path.insert(0, str(module_dir)) + try: + spec = importlib.util.spec_from_file_location("fluxon_test_stack_top_attention_log_mgmt_contract", MODULE_PATH) + assert spec is not None and spec.loader is not None + mod = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = mod + spec.loader.exec_module(mod) + return mod + finally: + if sys.path and sys.path[0] == str(module_dir): + sys.path.pop(0) + + +_ENTRY = _load_module() + + +class TestTopAttentionLogMgmtContract(unittest.TestCase): + def test_main_accepts_case_config_and_runs_canonical_tests(self) -> None: + with tempfile.TemporaryDirectory() as td: + run_dir = Path(td) + cfg_dir = run_dir / "configs" + cfg_dir.mkdir(parents=True) + case_cfg = cfg_dir / "ci_scene_config.yaml" + case_cfg.write_text( + yaml.safe_dump( + { + "case": { + "scene_id": "ci_top_attention_log_mgmt", + "scale_id": "n1_kvowner_dram_20gib", + "profile_id": "fluxon_tcp_thread", + "case_id": "ci_top_attention_log_mgmt__n1_kvowner_dram_20gib__fluxon_tcp_thread", + }, + "scene_config": { + "enabled": True, + }, + "scene_runtime": { + "etcd": {"ip": "127.0.0.1", "port": 19180}, + "greptime": {"ip": "127.0.0.1", "port": 19190}, + }, + }, + sort_keys=False, + ), + encoding="utf-8", + ) + + python_calls: list[tuple[str, tuple[str, ...]]] = [] + + def fake_run_python_file(description: str, path: str, extra_args=()): + del description + python_calls.append((path, tuple(extra_args))) + return 0 + + with mock.patch.object(_ENTRY, "run_python_file", side_effect=fake_run_python_file): + with mock.patch.object(_ENTRY, "run_cargo", return_value=0) as run_cargo: + with mock.patch.object( + sys, + "argv", + [str(MODULE_PATH), "--case-config", str(case_cfg), "--", "--nocapture"], + ): + rc = _ENTRY.main() + + 
self.assertEqual(rc, 0) + self.assertEqual( + python_calls, + [ + ("deployment/tests/test_log_shard.py", ("--", "--nocapture")), + ( + "deployment/tests/test_selection_supervisor_codegen.py", + ("--test-id", "runtime_log_path_uses_daily_shard_files", "--", "--nocapture"), + ), + ( + "deployment/tests/test_selection_supervisor_codegen.py", + ("--test-id", "runtime_log_shards_roll_and_preserve_content_boundaries", "--", "--nocapture"), + ), + ], + ) + self.assertEqual( + run_cargo.call_args.args[0], + [ + "test", + "--manifest-path", + str(REPO_ROOT / "fluxon_rs" / "fluxon_util" / "Cargo.toml"), + "--test", + "log_mgmt", + "--", + "--nocapture", + ], + ) + + +if __name__ == "__main__": + raise SystemExit(unittest.main()) diff --git a/fluxon_test_stack/top_attention_test_index/README.md b/fluxon_test_stack/top_attention_test_index/README.md index d81c346..2894ddf 100644 --- a/fluxon_test_stack/top_attention_test_index/README.md +++ b/fluxon_test_stack/top_attention_test_index/README.md @@ -47,6 +47,7 @@ Entries: - `_fs_remote_mount.py`: heavier Fluxon FS remote mount integration coverage - `_test_stack_contract.py`: test-stack runner contract coverage - `_deployment_codegen.py`: deployment code generation coverage +- `_log_mgmt.py`: shared-supervisor ops log rolling plus Rust KV log sharding coverage. `ci_test_list.yaml` now exposes this wrapper as the formal `ci_top_attention_log_mgmt` scene, and `test_runner.py` dispatches to it from the runner-native `top_attention` CI execution model. 
- `_script_tools.py`: script utility coverage - `_cargo_fs_core.py`: cargo tests for the Rust FS core crate - `_cargo_util.py`: cargo tests for the Rust util crate diff --git a/fluxon_test_stack/top_attention_test_index/_log_mgmt.py b/fluxon_test_stack/top_attention_test_index/_log_mgmt.py new file mode 100644 index 0000000..e3547ab --- /dev/null +++ b/fluxon_test_stack/top_attention_test_index/_log_mgmt.py @@ -0,0 +1,54 @@ +#!/usr/bin/env python3 +from __future__ import annotations + +import argparse + +from _common import REPO_ROOT, load_case_config, run_cargo, run_python_file + + +TEST_REQUIREMENTS = ["cargo", "etcd", "ops", "submodules"] +SCENE_ID = "ci_top_attention_log_mgmt" + + +def main() -> int: + parser = argparse.ArgumentParser( + description="Flat index entry for shared-supervisor ops log rolling and Rust KV log sharding coverage." + ) + parser.add_argument( + "--case-config", + help="Canonical CI case config YAML emitted by test_runner.", + ) + args, passthrough = parser.parse_known_args() + if args.case_config: + _ = load_case_config(args.case_config, expected_scene_id=SCENE_ID) + + rc = run_python_file( + "Flat index entry for ops/shared-supervisor log shard helper coverage.", + "deployment/tests/test_log_shard.py", + extra_args=tuple(passthrough), + ) + if rc != 0: + return rc + for test_id in ( + "runtime_log_path_uses_daily_shard_files", + "runtime_log_shards_roll_and_preserve_content_boundaries", + ): + rc = run_python_file( + "Flat index entry for ops/shared-supervisor log routing coverage.", + "deployment/tests/test_selection_supervisor_codegen.py", + extra_args=("--test-id", test_id, *passthrough), + ) + if rc != 0: + return rc + return run_cargo([ + "test", + "--manifest-path", + str(REPO_ROOT / "fluxon_rs" / "fluxon_util" / "Cargo.toml"), + "--test", + "log_mgmt", + *passthrough, + ]) + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/git_source_selection.py b/scripts/git_source_selection.py new file mode 100644 
index 0000000..491a0c1 --- /dev/null +++ b/scripts/git_source_selection.py @@ -0,0 +1,163 @@ +from __future__ import annotations + +import subprocess +from pathlib import Path +from typing import Callable + +import yaml + + +DEFAULT_RATHER_NO_GIT_SUBMODULE_CONFIG_RELPATH = Path( + "setup_and_pack/rather_no_git_submodule.yaml" +) + + +def collect_git_listed_source_relpaths( + *, + repo_root: Path, + git_root: Path, + rel_prefix: str = "", + is_excluded: Callable[[str], bool], +) -> list[str]: + argv = [ + "git", + "ls-files", + "--cached", + "--others", + "--exclude-standard", + "-z", + ] + raw = subprocess.check_output(argv, cwd=str(git_root)) + selected: list[str] = [] + rel_prefix = rel_prefix.strip("/") + for entry in raw.split(b"\0"): + if not entry: + continue + rel = entry.decode("utf-8").strip() + if not rel: + continue + repo_rel = rel if not rel_prefix else f"{rel_prefix}/{rel}" + if is_excluded(repo_rel): + continue + source_path = (repo_root / repo_rel).resolve() + if not source_path.exists(): + continue + selected.append(repo_rel) + return selected + + +def load_rather_no_git_submodule_source_roots( + *, + repo_root: Path, + context_name: str, +) -> tuple[tuple[str, Path], ...]: + config_path = (repo_root / DEFAULT_RATHER_NO_GIT_SUBMODULE_CONFIG_RELPATH).resolve() + if not config_path.exists(): + return () + raw_cfg = yaml.safe_load(config_path.read_text(encoding="utf-8")) + if raw_cfg is None: + return () + if not isinstance(raw_cfg, dict): + raise RuntimeError( + "rather_no_git_submodule config must be a YAML mapping: " + f"{config_path}" + ) + raw_modules = raw_cfg.get("modules") + if raw_modules is None: + return () + if not isinstance(raw_modules, list): + raise RuntimeError( + "rather_no_git_submodule config `modules` must be a list: " + f"{config_path}" + ) + + repo_root = repo_root.resolve() + selected: list[tuple[str, Path]] = [] + seen_relpaths: set[str] = set() + for index, raw_item in enumerate(raw_modules): + if not isinstance(raw_item, 
dict): + raise RuntimeError( + "rather_no_git_submodule config entries must be mappings: " + f"{config_path} modules[{index}]" + ) + raw_path = raw_item.get("path") + if not isinstance(raw_path, str) or not raw_path.strip(): + raise RuntimeError( + "rather_no_git_submodule config path must be a non-empty string: " + f"{config_path} modules[{index}].path" + ) + rel_path = Path(raw_path.strip()) + if rel_path.is_absolute() or ".." in rel_path.parts: + raise RuntimeError( + "rather_no_git_submodule config path must stay within the repo root: " + f"{config_path} modules[{index}].path={raw_path!r}" + ) + relpath = rel_path.as_posix() + if relpath in seen_relpaths: + continue + seen_relpaths.add(relpath) + module_root = (repo_root / rel_path).resolve() + if module_root != repo_root and repo_root not in module_root.parents: + raise RuntimeError( + "rather_no_git_submodule config path escapes the repo root: " + f"{config_path} modules[{index}].path={raw_path!r}" + ) + if not module_root.is_dir(): + raise RuntimeError( + f"{context_name} requires configured rather_no_git_submodule path " + f"to exist as a directory: path={relpath} resolved={module_root}" + ) + selected.append((relpath, module_root)) + return tuple(selected) + + +def collect_source_relpaths_with_rather_no_git_submodule( + *, + repo_root: Path, + source_roots: tuple[str, ...], + is_excluded: Callable[[str], bool], + empty_selection_error: str, + rather_no_git_submodule_context_name: str, +) -> list[str]: + repo_root = repo_root.resolve() + selected: set[str] = set() + for source_root in source_roots: + root_path = (repo_root / source_root).resolve() + if not root_path.exists(): + continue + if root_path.is_file(): + relpath = Path(source_root).as_posix() + if not is_excluded(relpath): + selected.add(relpath) + continue + selected.update( + collect_git_listed_source_relpaths( + repo_root=repo_root, + git_root=root_path, + rel_prefix="" if source_root == "." 
else source_root, + is_excluded=is_excluded, + ) + ) + for relpath, module_root in load_rather_no_git_submodule_source_roots( + repo_root=repo_root, + context_name=rather_no_git_submodule_context_name, + ): + selected.update( + collect_git_listed_source_relpaths( + repo_root=repo_root, + git_root=module_root, + rel_prefix=relpath, + is_excluded=is_excluded, + ) + ) + if not selected: + raise RuntimeError(empty_selection_error) + return sorted(selected) + + +__all__ = [ + "DEFAULT_RATHER_NO_GIT_SUBMODULE_CONFIG_RELPATH", + "collect_git_listed_source_relpaths", + "collect_source_relpaths_with_rather_no_git_submodule", + "load_rather_no_git_submodule_source_roots", +] diff --git a/scripts/source_selection_profiles.py b/scripts/source_selection_profiles.py new file mode 100644 index 0000000..6c7493c --- /dev/null +++ b/scripts/source_selection_profiles.py @@ -0,0 +1,134 @@ +from __future__ import annotations + +from dataclasses import dataclass, field +from pathlib import Path +import sys + +SCRIPT_DIR = Path(__file__).resolve().parent +script_dir_str = str(SCRIPT_DIR) +if script_dir_str in sys.path: + sys.path.remove(script_dir_str) +sys.path.insert(0, script_dir_str) + +import git_source_selection as git_source_selection_utils + + +SOURCE_SELECTION_PROFILE_BUILD_SEED = "build_seed" +SOURCE_SELECTION_PROFILE_SOURCE_PACK = "source_pack" +SOURCE_SELECTION_PROFILES = ( + SOURCE_SELECTION_PROFILE_BUILD_SEED, + SOURCE_SELECTION_PROFILE_SOURCE_PACK, +) + +BUILD_SEED_SOURCE_ROOTS: tuple[str, ...] = ( + "README.md", + "setup.py", + "deployment", + "fluxon_py", + "fluxon_release/closed_sdk", + "fluxon_rs", + "scripts/git_source_selection.py", + "scripts/source_selection_profiles.py", + "setup_and_pack", +) +SOURCE_PACK_SOURCE_ROOTS: tuple[str, ...] = (".",) + +BUILD_SEED_INCLUDED_RELPATHS: frozenset[str] = frozenset( + { + "fluxon_release/closed_sdk/manifest.json", + "setup_and_pack/pub_prepare_build.yaml", + } +) +SOURCE_PACK_EXCLUDED_RELPATH_PREFIXES: tuple[str, ...] 
= ( + ".dever/", + "fluxon_release/", + "skills/", +) +SOURCE_PACK_EXCLUDED_RELPATH_NAMES: frozenset[str] = frozenset( + { + ".DS_Store", + } +) + + +@dataclass(frozen=True) +class SourceSelectionProfileSpec: + source_roots: tuple[str, ...] + empty_selection_error: str + rather_no_git_submodule_context_name: str + include_relpaths: frozenset[str] = field(default_factory=frozenset) + + +BUILD_SEED_PROFILE_SPEC = SourceSelectionProfileSpec( + source_roots=BUILD_SEED_SOURCE_ROOTS, + empty_selection_error="public workspace source selection produced no files", + rather_no_git_submodule_context_name="public workspace source selection", + include_relpaths=BUILD_SEED_INCLUDED_RELPATHS, +) +SOURCE_PACK_PROFILE_SPEC = SourceSelectionProfileSpec( + source_roots=SOURCE_PACK_SOURCE_ROOTS, + empty_selection_error="git-based CI source selection produced no files", + rather_no_git_submodule_context_name="CI source pack", +) + + +def get_source_profile_spec(*, profile: str) -> SourceSelectionProfileSpec: + if profile == SOURCE_SELECTION_PROFILE_BUILD_SEED: + return BUILD_SEED_PROFILE_SPEC + if profile == SOURCE_SELECTION_PROFILE_SOURCE_PACK: + return SOURCE_PACK_PROFILE_SPEC + raise ValueError( + f"unsupported source selection profile: {profile!r}; expected one of {SOURCE_SELECTION_PROFILES}" + ) + + +def get_source_profile_source_roots(*, profile: str) -> tuple[str, ...]: + return get_source_profile_spec(profile=profile).source_roots + + +def source_profile_relpath_excluded(*, profile: str, relpath: str) -> bool: + spec = get_source_profile_spec(profile=profile) + normalized = relpath.strip("/") + if not normalized: + return True + if normalized in spec.include_relpaths: + return False + if profile == SOURCE_SELECTION_PROFILE_SOURCE_PACK: + if normalized in SOURCE_PACK_EXCLUDED_RELPATH_NAMES: + return True + return any( + normalized == prefix.rstrip("/") or normalized.startswith(prefix) + for prefix in SOURCE_PACK_EXCLUDED_RELPATH_PREFIXES + ) + return False + + +def 
collect_source_profile_relpaths(*, repo_root: Path, profile: str) -> tuple[str, ...]: + spec = get_source_profile_spec(profile=profile) + return tuple( + git_source_selection_utils.collect_source_relpaths_with_rather_no_git_submodule( + repo_root=repo_root, + source_roots=spec.source_roots, + is_excluded=lambda relpath: source_profile_relpath_excluded( + profile=profile, + relpath=relpath, + ), + empty_selection_error=spec.empty_selection_error, + rather_no_git_submodule_context_name=spec.rather_no_git_submodule_context_name, + ) + ) + + +__all__ = [ + "BUILD_SEED_SOURCE_ROOTS", + "SOURCE_PACK_SOURCE_ROOTS", + "SOURCE_PACK_EXCLUDED_RELPATH_NAMES", + "SOURCE_PACK_EXCLUDED_RELPATH_PREFIXES", + "SOURCE_SELECTION_PROFILE_BUILD_SEED", + "SOURCE_SELECTION_PROFILE_SOURCE_PACK", + "SOURCE_SELECTION_PROFILES", + "collect_source_profile_relpaths", + "get_source_profile_source_roots", + "get_source_profile_spec", + "source_profile_relpath_excluded", +] diff --git a/setup_and_pack/nix/lib_layout.py b/setup_and_pack/nix/lib_layout.py index 05ac4b0..8322a55 100644 --- a/setup_and_pack/nix/lib_layout.py +++ b/setup_and_pack/nix/lib_layout.py @@ -10,7 +10,7 @@ import yaml from setup_and_pack.public_workspace_contract import ( - PUBLIC_WORKSPACE_INPUT_RELATIVE_PATHS, + collect_public_workspace_input_relative_paths, _copy_public_workspace_input_path, _sanitize_public_workspace_input, ) @@ -48,35 +48,6 @@ ("manylinux", "cargo_registry_dir"): "manylinux-cache/cargo-registry", ("manylinux", "cargo_git_dir"): "manylinux-cache/cargo-git", } -BRIDGE_PREBUILT_WORKSPACE_SEED_EXTRA_RELATIVE_PATHS = ( - "setup_and_pack/nix", - "setup_and_pack/lib_tool.py", - "setup_and_pack/pyscript_util.py", - "setup_and_pack/closed_sdk_contract.py", - "setup_and_pack/public_workspace_contract.py", - "setup_and_pack/pub_prepare_build.py", - "setup_and_pack/pub_prepare_build.yaml", - "setup_and_pack/utils/wheel_runtime_helper.py", - "setup_and_pack/utils", - "deployment/utils/placeholder_utils.py", - 
"deployment/utils/proc_lifecycle_codegen.py", - "deployment/utils/selection_supervisor_codegen.py", - "fluxon_release/closed_sdk", - "fluxon_rs/fluxon_commu_contract", - "fluxon_rs/fluxon_commu", - "fluxon_rs/fluxon_commu_closed_sdk_consumer", - "fluxon_rs/Cargo.lock", -) -BRIDGE_PREBUILT_WORKSPACE_SEED_RELATIVE_PATHS = tuple( - dict.fromkeys( - ( - *PUBLIC_WORKSPACE_INPUT_RELATIVE_PATHS, - *BRIDGE_PREBUILT_WORKSPACE_SEED_EXTRA_RELATIVE_PATHS, - ) - ) -) - - @dataclass(frozen=True) class AssemblyRefs: baseline_path: str @@ -757,7 +728,9 @@ def _materialize_bridge_prebuilt_workspace_seed(*, source_root: Path, target_roo _remove_stale_derived_entry(path=target_root) target_root.mkdir(parents=True, exist_ok=True) target_root.chmod(0o777) - for relative_path in BRIDGE_PREBUILT_WORKSPACE_SEED_RELATIVE_PATHS: + for relative_path in collect_public_workspace_input_relative_paths( + repo_root=source_root + ): source_path = source_root / relative_path if not source_path.exists(): raise RuntimeError( diff --git a/setup_and_pack/nix/pack_fluxonkv_pylib.py b/setup_and_pack/nix/pack_fluxonkv_pylib.py index c44df13..e12f8fe 100644 --- a/setup_and_pack/nix/pack_fluxonkv_pylib.py +++ b/setup_and_pack/nix/pack_fluxonkv_pylib.py @@ -43,6 +43,9 @@ CLOSED_SDK_CONSUMER_BOUNDARY_MODE, rewrite_fluxon_native_export_bundle, ) +from setup_and_pack.public_workspace_contract import ( + collect_public_workspace_input_relative_paths, +) from utils.sudo_prefix_utils import host_sudo_prefix import utils as script_utils ABI3_SMOKE_TEST_INTERPRETERS = ( @@ -142,11 +145,6 @@ ) ) SUPPORTED_TARGET_CACHE_GENERATOR_KINDS = frozenset() -REQUIRED_DEPLOYMENT_UTIL_FILES_FOR_PYO3_BUILD = ( - "placeholder_utils.py", - "proc_lifecycle_codegen.py", - "selection_supervisor_codegen.py", -) TEMP_WORKSPACE_MOUNT_DIRS: list[Path] = [] @@ -160,54 +158,6 @@ def _cleanup_temp_workspace_mount_dirs() -> None: atexit.register(_cleanup_temp_workspace_mount_dirs) -REQUIRED_DEPLOYMENT_UTIL_RELATIVE_PATHS_FOR_PYO3_BUILD = 
tuple( - f"deployment/utils/{name}" for name in REQUIRED_DEPLOYMENT_UTIL_FILES_FOR_PYO3_BUILD -) -PYO3_WORKSPACE_HELPER_RELATIVE_PATHS = ( - "fluxon_rs/rust-toolchain.toml", - "setup_and_pack/lib_tool.py", - "setup_and_pack/pyscript_util.py", - "setup_and_pack/closed_sdk_contract.py", - "setup_and_pack/public_workspace_contract.py", - "setup_and_pack/pub_prepare_build.py", - "setup_and_pack/pub_prepare_build.yaml", - "setup_and_pack/nix/pack_release_in_container.py", - "setup_and_pack/utils/wheel_runtime_helper.py", - "setup_and_pack/nix/lib_layout.py", -) -PYO3_INPUT_RELATIVE_PATHS_COMMON = ( - "fluxon_rs/Cargo.toml", - "fluxon_rs/Cargo.lock", - "fluxon_rs/.cargo", - "fluxon_rs/rust-toolchain.toml", - "fluxon_rs/fluxon_commu_contract", - "fluxon_rs/fluxon_commu_closed_sdk_consumer", - "fluxon_rs/fluxon_pyo3", - "fluxon_rs/limit_thirdparty", - "fluxon_rs/fluxon_commu", - "fluxon_rs/fluxon_kv", - "fluxon_rs/fluxon_framework", - "fluxon_rs/fluxon_framework_compiled", - "fluxon_rs/fluxon_util", - "fluxon_rs/fluxon_mq", - "fluxon_rs/fluxon_cli", - "fluxon_rs/fluxon_ops", - "fluxon_rs/fluxon_proxy_proto", - "fluxon_rs/fluxon_proxy", - "fluxon_rs/fluxon_fs", - "fluxon_rs/fluxon_fs_core", - "fluxon_rs/fluxon_fs_s3_gateway", - "fluxon_rs/fluxon_observability", - "fluxon_rs/moka", - "fluxon_py", - "fluxon_release/closed_sdk", - "setup_and_pack/nix/lib_layout.py", - "setup_and_pack/closed_sdk_contract.py", - "setup_and_pack/public_workspace_contract.py", - "setup_and_pack/lib_tool.py", - "setup_and_pack/pyscript_util.py", - *REQUIRED_DEPLOYMENT_UTIL_RELATIVE_PATHS_FOR_PYO3_BUILD, -) PYO3_INPUT_RELATIVE_PATHS_BY_TRANSPORT_BACKEND = { "fastws": (), "tquic": (), @@ -218,15 +168,6 @@ def _cleanup_temp_workspace_mount_dirs() -> None: PYO3_INPUT_RELATIVE_PATHS_BY_RDMA_BACKEND = { "closed_sdk": ("fluxon_release/closed_sdk",), } -PYO3_WORKSPACE_COPY_RELATIVE_PATHS_PUBLIC_NATIVE = () -PYO3_WORKSPACE_COPY_RELATIVE_PATHS_COMMON = tuple( - relative_path - for relative_path in ( - 
*PYO3_INPUT_RELATIVE_PATHS_COMMON, - *PYO3_WORKSPACE_COPY_RELATIVE_PATHS_PUBLIC_NATIVE, - *PYO3_WORKSPACE_HELPER_RELATIVE_PATHS, - ) -) TRANSPORT_BACKEND_FEATURES = { "fastws": ["fastws_transport"], "tquic": ["tquic_transport"], @@ -257,90 +198,6 @@ def _cleanup_temp_workspace_mount_dirs() -> None: "libstdc++.so.6", "libgomp.so.1", ) -IGNORED_FILE_SUFFIXES = ( - ".gitignore", - ".pkl", - ".pyc", - ".md", - ".rst", - ".html", - ".htm", - ".xml", - ".css", - ".js", - ".map", - ".png", - ".jpg", - ".jpeg", - ".gif", - ".bmp", - ".svg", - ".ico", - ".pdf", - ".ppt", - ".pptx", - ".doc", - ".docx", - ".pem", - ".crt", - ".crl", - ".key", - ".csr", - ".p12", - ".der", - ".serial", - ".old", - ".orig", - ".rej", - ".tar", - ".tar.gz", - ".tgz", - ".tar.xz", - ".txz", - ".tar.bz2", - ".tbz2", - ".zip", - ".7z", - ".xz", - ".bz2", - ".gz", -) -IGNORED_DIR_NAMES = { - ".git", - "__pycache__", - "target", - "wheels", - "docs", - "doc", - "doxygen", - "examples", - "example", - "tests", - "test", - "testdata", - "bench", - "benches", - "benchmark", - "benchmarks", - "fuzz", - "fuzzers", - "packagecache", - "wycheproof_testvectors", - "tfprof", -} -IGNORED_FILE_NAMES = ( - PYO3_CHECKSUM_FILE_NAME, - "configure~", -) - - -def _pyo3_input_relative_paths(transport_backend: str, rdma_backend: str) -> tuple[str, ...]: - return ( - PYO3_INPUT_RELATIVE_PATHS_COMMON - + PYO3_INPUT_RELATIVE_PATHS_BY_TRANSPORT_BACKEND[transport_backend] - + PYO3_INPUT_RELATIVE_PATHS_BY_RDMA_BACKEND[rdma_backend] - ) - def _dedupe_relative_paths(relative_paths: tuple[str, ...]) -> tuple[str, ...]: ordered_relative_paths: list[str] = [] @@ -354,11 +211,13 @@ def _dedupe_relative_paths(relative_paths: tuple[str, ...]) -> tuple[str, ...]: def pyo3_workspace_copy_relative_paths(transport_backend: str, rdma_backend: str) -> tuple[str, ...]: - return _dedupe_relative_paths( - PYO3_WORKSPACE_COPY_RELATIVE_PATHS_COMMON - + PYO3_INPUT_RELATIVE_PATHS_BY_TRANSPORT_BACKEND[transport_backend] - + 
PYO3_INPUT_RELATIVE_PATHS_BY_RDMA_BACKEND[rdma_backend] - ) + del transport_backend + del rdma_backend + return collect_public_workspace_input_relative_paths(repo_root=REPO_ROOT) + + +def _pyo3_input_relative_paths(transport_backend: str, rdma_backend: str) -> tuple[str, ...]: + return pyo3_workspace_copy_relative_paths(transport_backend, rdma_backend) def _wheel_variant_key(transport_backend: str, rdma_backend: str) -> str: @@ -435,9 +294,9 @@ def _compute_inputs_digest(repo_root: Path, relative_paths: tuple[str, ...]) -> relative_to=repo_root, mode=script_utils.PathDigestMode.CONTENTS_ONLY, algorithm=script_utils.PathHashAlgorithm.MD5, - ignored_dir_names=IGNORED_DIR_NAMES, - ignored_file_names=IGNORED_FILE_NAMES, - ignored_file_suffixes=IGNORED_FILE_SUFFIXES, + ignored_dir_names=(), + ignored_file_names=(), + ignored_file_suffixes=(), ) @@ -593,33 +452,6 @@ def current_checksum(self) -> str: _pyo3_input_relative_paths(self.transport_backend, self.rdma_backend), ) + f"|transport_backend={self.transport_backend}|rdma_backend={self.rdma_backend}" - def _legacy_checksum_map(self) -> dict[str, str]: - file_hash: dict[str, str] = {} - for current_root, dirnames, filenames in os.walk(self.rs_root, topdown=True): - current_root_path = Path(current_root) - root_rel = current_root_path.relative_to(self.rs_root).as_posix() - root_text = current_root_path.as_posix() - if root_rel == "target" or root_rel.startswith("target/"): - dirnames[:] = [] - continue - if root_rel == "wheels" or root_rel.startswith("wheels/"): - dirnames[:] = [] - continue - if "/.git/" in root_text or root_text.endswith("/.git"): - dirnames[:] = [] - continue - dirnames[:] = sorted(dir_name for dir_name in dirnames if dir_name != ".git") - for file_name in sorted(filenames): - if file_name in IGNORED_FILE_NAMES or file_name.endswith(IGNORED_FILE_SUFFIXES): - continue - file_path = current_root_path / file_name - hash_md5 = hashlib.md5() - with open(file_path, "rb") as f: - for chunk in iter(lambda: 
f.read(4096), b""): - hash_md5.update(chunk) - file_hash[file_path.relative_to(self.rs_root).as_posix()] = hash_md5.hexdigest() - return file_hash - def find_cached_wheel(self) -> Path | None: if not self.target_wheels_dir.exists(): return None @@ -2976,12 +2808,15 @@ def _build_published_profile_manifest( selected_backend_plan: dict, native_build_authority: dict | None, ) -> dict: - workspace_seed_digest = _compute_inputs_digest( - workspace_seed_dir, - _public_workspace_seed_relative_paths( - transport_backend, - rdma_backend=selected_backend_plan["rdma_backend"], - ), + del transport_backend + workspace_seed_digest = script_utils.compute_paths_digest( + [workspace_seed_dir], + relative_to=workspace_seed_dir, + mode=script_utils.PathDigestMode.CONTENTS_ONLY, + algorithm=script_utils.PathHashAlgorithm.MD5, + ignored_dir_names=(), + ignored_file_names=(), + ignored_file_suffixes=(), ) manifest = { "object_kind": "FluxonManylinuxPublishedProfile", @@ -3137,18 +2972,19 @@ def _copy_workspace_seed_subset( transport_backend: str, rdma_backend: str, ) -> None: + del transport_backend + del rdma_backend target_workspace_seed_dir.mkdir(parents=True, exist_ok=True) target_workspace_seed_dir.chmod(0o777) - for relative_path in _public_workspace_seed_relative_paths( - transport_backend, - rdma_backend=rdma_backend, - ): - source_path = source_workspace_seed_dir / relative_path - if not source_path.exists(): - raise RuntimeError( - f"workspace seed path is missing required publish input: {source_path}" - ) + for source_path in sorted(source_workspace_seed_dir.rglob("*")): + if source_path == source_workspace_seed_dir: + continue + relative_path = source_path.relative_to(source_workspace_seed_dir) target_path = target_workspace_seed_dir / relative_path + if source_path.is_dir() and not source_path.is_symlink(): + target_path.mkdir(parents=True, exist_ok=True) + target_path.chmod(0o777) + continue target_path.parent.mkdir(parents=True, exist_ok=True) 
target_path.parent.chmod(0o777) _sudo_copy_path(source_path=source_path, target_path=target_path) @@ -3248,10 +3084,6 @@ def _run_with_tee_log(*, argv: list[str], log_path: Path) -> None: if return_code != 0: raise RuntimeError(f"docker run failed with exit code {return_code}, log={log_path}") -def _public_workspace_seed_relative_paths(transport_backend: str, *, rdma_backend: str) -> tuple[str, ...]: - return pyo3_workspace_copy_relative_paths(transport_backend, rdma_backend) - - def _require_workspace_seed_fluxon_commu_source_dir(*, workspace_seed_dir: Path, field_name: str) -> Path: source_dir = workspace_seed_dir / FLUXON_COMMU_AUTHORITY_RELATIVE_PATH cargo_toml_path = source_dir / "Cargo.toml" diff --git a/setup_and_pack/public_workspace_contract.py b/setup_and_pack/public_workspace_contract.py index 5cd6b50..cb1574e 100644 --- a/setup_and_pack/public_workspace_contract.py +++ b/setup_and_pack/public_workspace_contract.py @@ -1,47 +1,46 @@ from __future__ import annotations +import os import shutil +import sys from pathlib import Path +REPO_ROOT = Path(__file__).resolve().parent.parent +SCRIPTS_DIR = REPO_ROOT / "scripts" +scripts_dir_str = str(SCRIPTS_DIR) +if scripts_dir_str in sys.path: + sys.path.remove(scripts_dir_str) +sys.path.insert(0, scripts_dir_str) -PUBLIC_WORKSPACE_INPUT_RELATIVE_PATHS = ( - "setup.py", - "fluxon_py", - "fluxon_release/closed_sdk", - "fluxon_rs/Cargo.toml", - "fluxon_rs/Cargo.lock", - "fluxon_rs/.cargo", - "fluxon_rs/rust-toolchain.toml", - "fluxon_rs/fluxon_commu_contract", - "fluxon_rs/fluxon_commu_closed_sdk_consumer", - "fluxon_rs/fluxon_commu", - "fluxon_rs/fluxon_pyo3", - "fluxon_rs/limit_thirdparty", - "fluxon_rs/fluxon_kv", - "fluxon_rs/fluxon_framework", - "fluxon_rs/fluxon_framework_compiled", - "fluxon_rs/fluxon_util", - "fluxon_rs/fluxon_mq", - "fluxon_rs/fluxon_cli", - "fluxon_rs/fluxon_ops", - "fluxon_rs/fluxon_proxy_proto", - "fluxon_rs/fluxon_proxy", - "fluxon_rs/fluxon_fs", - "fluxon_rs/fluxon_fs_core", - 
"fluxon_rs/fluxon_fs_s3_gateway", - "fluxon_rs/fluxon_observability", - "fluxon_rs/moka", +from source_selection_profiles import ( + SOURCE_SELECTION_PROFILE_BUILD_SEED, + collect_source_profile_relpaths, ) def _copy_public_workspace_input_path(source_path: Path, target_path: Path) -> None: target_path.parent.mkdir(parents=True, exist_ok=True) + if source_path.is_symlink(): + if target_path.exists() or target_path.is_symlink(): + if target_path.is_dir() and not target_path.is_symlink(): + shutil.rmtree(target_path) + else: + target_path.unlink() + os.symlink(os.readlink(source_path), target_path) + return if source_path.is_dir(): shutil.copytree(source_path, target_path, symlinks=True, dirs_exist_ok=True) return shutil.copy2(source_path, target_path) +def collect_public_workspace_input_relative_paths(*, repo_root: Path) -> tuple[str, ...]: + return collect_source_profile_relpaths( + repo_root=repo_root, + profile=SOURCE_SELECTION_PROFILE_BUILD_SEED, + ) + + def _sanitize_public_workspace_input(*, workspace_root: Path) -> None: for pycache_dir in workspace_root.rglob("__pycache__"): shutil.rmtree(pycache_dir, ignore_errors=True) @@ -51,9 +50,8 @@ def _sanitize_public_workspace_input(*, workspace_root: Path) -> None: except FileNotFoundError: pass - __all__ = [ - "PUBLIC_WORKSPACE_INPUT_RELATIVE_PATHS", + "collect_public_workspace_input_relative_paths", "_copy_public_workspace_input_path", "_sanitize_public_workspace_input", ] diff --git a/setup_and_pack/tests/test_git_source_selection_utils.py b/setup_and_pack/tests/test_git_source_selection_utils.py new file mode 100644 index 0000000..b28d64d --- /dev/null +++ b/setup_and_pack/tests/test_git_source_selection_utils.py @@ -0,0 +1,182 @@ +from __future__ import annotations + +import importlib.util +import sys +import tempfile +import unittest +from pathlib import Path +from unittest import mock + + +REPO_ROOT = Path(__file__).resolve().parents[2] +MODULE_PATH = REPO_ROOT / "scripts" / "git_source_selection.py" 
+PROFILE_MODULE_PATH = REPO_ROOT / "scripts" / "source_selection_profiles.py" + + +def _load_module(): + spec = importlib.util.spec_from_file_location( + "scripts_git_source_selection_test", + MODULE_PATH, + ) + assert spec is not None and spec.loader is not None + mod = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = mod + spec.loader.exec_module(mod) + return mod + + +_MOD = _load_module() + + +def _load_profile_module(): + scripts_root_str = str(REPO_ROOT / "scripts") + if scripts_root_str in sys.path: + sys.path.remove(scripts_root_str) + sys.path.insert(0, scripts_root_str) + spec = importlib.util.spec_from_file_location( + "scripts_source_selection_profiles_test", + PROFILE_MODULE_PATH, + ) + assert spec is not None and spec.loader is not None + mod = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = mod + spec.loader.exec_module(mod) + return mod + + +_PROFILE_MOD = _load_profile_module() + + +class GitSourceSelectionUtilsTest(unittest.TestCase): + def test_collect_source_relpaths_with_rather_no_git_submodule_merges_module_sources(self) -> None: + with tempfile.TemporaryDirectory() as tmpdir: + repo_root = Path(tmpdir) + (repo_root / "README.md").write_text("repo\n", encoding="utf-8") + module_root = repo_root / "fluxon_rs" / "moka" + (module_root / "src").mkdir(parents=True, exist_ok=True) + (module_root / "Cargo.toml").write_text("module\n", encoding="utf-8") + (module_root / "src" / "lib.rs").write_text("pub fn x() {}\n", encoding="utf-8") + cfg_path = repo_root / "setup_and_pack" / "rather_no_git_submodule.yaml" + cfg_path.parent.mkdir(parents=True, exist_ok=True) + cfg_path.write_text( + "modules:\n" + " - path: fluxon_rs/moka\n" + " repo: https://example.com/moka.git\n" + " checkout: main\n", + encoding="utf-8", + ) + + def fake_check_output(argv, cwd=None): + del argv + cwd_path = Path(cwd).resolve() + if cwd_path == repo_root.resolve(): + return b"README.md\0" + if cwd_path == module_root.resolve(): + return 
b"Cargo.toml\0src/lib.rs\0" + raise AssertionError(f"unexpected git ls-files cwd: {cwd_path}") + + with mock.patch.object(_MOD.subprocess, "check_output", side_effect=fake_check_output): + relpaths = _MOD.collect_source_relpaths_with_rather_no_git_submodule( + repo_root=repo_root, + source_roots=("README.md",), + is_excluded=lambda _relpath: False, + empty_selection_error="no files", + rather_no_git_submodule_context_name="test source selection", + ) + + self.assertEqual( + relpaths, + [ + "README.md", + "fluxon_rs/moka/Cargo.toml", + "fluxon_rs/moka/src/lib.rs", + ], + ) + + def test_load_rather_no_git_submodule_source_roots_uses_context_name_in_missing_dir_error(self) -> None: + with tempfile.TemporaryDirectory() as tmpdir: + repo_root = Path(tmpdir) + cfg_path = repo_root / "setup_and_pack" / "rather_no_git_submodule.yaml" + cfg_path.parent.mkdir(parents=True, exist_ok=True) + cfg_path.write_text( + "modules:\n" + " - path: fluxon_rs/moka\n" + " repo: https://example.com/moka.git\n" + " checkout: main\n", + encoding="utf-8", + ) + + with self.assertRaisesRegex( + RuntimeError, + "test source selection requires configured rather_no_git_submodule path to exist", + ): + _MOD.load_rather_no_git_submodule_source_roots( + repo_root=repo_root, + context_name="test source selection", + ) + + def test_source_profiles_only_add_inclusions_beyond_gitignore(self) -> None: + self.assertTrue( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_SOURCE_PACK, + relpath=".dever/run.log", + ) + ) + self.assertTrue( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_SOURCE_PACK, + relpath="fluxon_release/install.py", + ) + ) + self.assertTrue( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_SOURCE_PACK, + relpath="skills/demo/SKILL.md", + ) + ) + self.assertFalse( + _PROFILE_MOD.source_profile_relpath_excluded( + 
profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_BUILD_SEED, + relpath="fluxon_release/closed_sdk/manifest.json", + ) + ) + self.assertFalse( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_BUILD_SEED, + relpath="fluxon_doc_cn/roadmap.md", + ) + ) + self.assertFalse( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_BUILD_SEED, + relpath="deployment/utils/log_shard.py", + ) + ) + self.assertFalse( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_BUILD_SEED, + relpath="scripts/source_selection_profiles.py", + ) + ) + self.assertFalse( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_BUILD_SEED, + relpath="fluxon_rs/moka/examples/append_value_async.rs", + ) + ) + self.assertFalse( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_BUILD_SEED, + relpath="fluxon_rs/moka/tests/entry_api_sync.rs", + ) + ) + self.assertFalse( + _PROFILE_MOD.source_profile_relpath_excluded( + profile=_PROFILE_MOD.SOURCE_SELECTION_PROFILE_BUILD_SEED, + relpath="fluxon_rs/fluxon_cli/templates/landing.html", + ) + ) + + +if __name__ == "__main__": + raise SystemExit(unittest.main()) diff --git a/setup_and_pack/tests/test_lib_layout.py b/setup_and_pack/tests/test_lib_layout.py index dd19442..6d05d54 100644 --- a/setup_and_pack/tests/test_lib_layout.py +++ b/setup_and_pack/tests/test_lib_layout.py @@ -14,6 +14,10 @@ def _load_lib_layout(): + repo_root_str = str(REPO_ROOT) + if repo_root_str in sys.path: + sys.path.remove(repo_root_str) + sys.path.insert(0, repo_root_str) spec = importlib.util.spec_from_file_location("setup_and_pack_nix_lib_layout_test", LIB_LAYOUT_PATH) assert spec is not None and spec.loader is not None mod = importlib.util.module_from_spec(spec) @@ -83,14 +87,17 @@ def test_bridge_prebuilt_materializes_workspace_seed(self) -> None: 
self.assertTrue(workspace_seed_dir.is_dir()) self.assertTrue((workspace_seed_dir / "setup_and_pack/closed_sdk_contract.py").is_file()) self.assertTrue((workspace_seed_dir / "setup_and_pack/public_workspace_contract.py").is_file()) + self.assertTrue((workspace_seed_dir / "README.md").is_file()) self.assertTrue((workspace_seed_dir / "fluxon_rs/fluxon_commu_contract/Cargo.toml").is_file()) self.assertTrue((workspace_seed_dir / "fluxon_rs/fluxon_commu/Cargo.toml").is_file()) + self.assertTrue((workspace_seed_dir / "fluxon_rs/fluxon_ops/build.rs").is_file()) self.assertTrue((workspace_seed_dir / "fluxon_release/closed_sdk/manifest.json").is_file()) self.assertTrue((workspace_seed_dir / "setup_and_pack/nix/pack_fluxonkv_pylib.py").is_file()) self.assertTrue((workspace_seed_dir / "setup_and_pack/nix/pack_release_in_container.py").is_file()) self.assertTrue((workspace_seed_dir / "setup_and_pack/utils/__init__.py").is_file()) self.assertTrue((workspace_seed_dir / "setup_and_pack/utils/sudo_prefix_utils.py").is_file()) self.assertTrue((workspace_seed_dir / "setup_and_pack/utils/wheel_runtime_helper.py").is_file()) + self.assertTrue((workspace_seed_dir / "deployment/utils/log_shard.py").is_file()) self.assertTrue((workspace_seed_dir / "fluxon_rs/fluxon_kv/Cargo.toml").is_file()) self.assertTrue((workspace_seed_dir / "fluxon_rs/Cargo.lock").is_file()) self.assertTrue((workspace_seed_dir / "fluxon_rs/moka/Cargo.toml").is_file()) diff --git a/setup_and_pack/tests/test_pack_fluxonkv_pylib_bridge_prebuilt.py b/setup_and_pack/tests/test_pack_fluxonkv_pylib_bridge_prebuilt.py index db1bcd7..bae0e86 100644 --- a/setup_and_pack/tests/test_pack_fluxonkv_pylib_bridge_prebuilt.py +++ b/setup_and_pack/tests/test_pack_fluxonkv_pylib_bridge_prebuilt.py @@ -38,6 +38,39 @@ def _load_module(): class BridgePrebuiltAuthorityMaterializationTest(unittest.TestCase): + def test_pyo3_workspace_inputs_follow_dynamic_public_workspace_selection(self) -> None: + relpaths = 
_PACKMOD.pyo3_workspace_copy_relative_paths( + transport_backend="tcp_thread", + rdma_backend="closed_sdk", + ) + + self.assertIn("README.md", relpaths) + self.assertIn("deployment/utils/log_shard.py", relpaths) + self.assertIn("fluxon_rs/fluxon_ops/build.rs", relpaths) + self.assertIn("fluxon_rs/moka/examples/append_value_async.rs", relpaths) + self.assertIn("fluxon_rs/fluxon_cli/templates/landing.html", relpaths) + self.assertNotIn("skills/browser-helm/SKILL.md", relpaths) + self.assertNotIn("fluxon_doc_cn/roadmap.md", relpaths) + + def test_pyo3_workspace_digest_tracks_selected_template_inputs(self) -> None: + with tempfile.TemporaryDirectory() as tmpdir: + repo_root = Path(tmpdir) + landing_path = repo_root / "fluxon_rs" / "fluxon_cli" / "templates" / "landing.html" + landing_path.parent.mkdir(parents=True, exist_ok=True) + landing_path.write_text("v1\n", encoding="utf-8") + + digest_before = _PACKMOD._compute_inputs_digest( + repo_root, + ("fluxon_rs/fluxon_cli/templates/landing.html",), + ) + landing_path.write_text("v2\n", encoding="utf-8") + digest_after = _PACKMOD._compute_inputs_digest( + repo_root, + ("fluxon_rs/fluxon_cli/templates/landing.html",), + ) + + self.assertNotEqual(digest_before, digest_after) + def test_host_side_materialization_only_creates_placeholders(self) -> None: with tempfile.TemporaryDirectory() as tmpdir: build_root = Path(tmpdir) diff --git a/setup_and_pack/utils/__init__.py b/setup_and_pack/utils/__init__.py index df414f6..3921245 100644 --- a/setup_and_pack/utils/__init__.py +++ b/setup_and_pack/utils/__init__.py @@ -10,6 +10,7 @@ _iter_digest_entries, build_cached_tarball, compute_paths_digest, + prune_stage_paths, rsync_stage, tar_gz, tarball_rule, @@ -66,6 +67,7 @@ "ArtifactRule", "tarball_rule", "build_cached_tarball", + "prune_stage_paths", "rsync_stage", "tar_gz", "_iter_digest_entries", diff --git a/setup_and_pack/utils/artifact_cache_digest_utils.py b/setup_and_pack/utils/artifact_cache_digest_utils.py index 
11739ef..d3780e3 100644 --- a/setup_and_pack/utils/artifact_cache_digest_utils.py +++ b/setup_and_pack/utils/artifact_cache_digest_utils.py @@ -1,8 +1,10 @@ from __future__ import annotations import enum +import fnmatch import hashlib import os +import shutil from dataclasses import dataclass from pathlib import Path from typing import Callable, Collection, Iterator, Sequence @@ -19,6 +21,7 @@ "ArtifactCheck", "ArtifactRule", "tarball_rule", + "prune_stage_paths", "build_cached_tarball", "rsync_stage", "tar_gz", @@ -114,7 +117,14 @@ def build_cached_tarball(*, rule: ArtifactRule, out_path: Path, build_tarball: C rule.write_stamp(check.digest) -def rsync_stage(*, repo_root: Path, src: Path, dst: Path, honor_gitignore: bool) -> None: +def rsync_stage( + *, + repo_root: Path, + src: Path, + dst: Path, + honor_gitignore: bool, + exclude_rel_paths: tuple[str, ...] = (), +) -> None: if not src.exists(): print(f"Missing required source path for staging: {src}") raise SystemExit(1) @@ -132,6 +142,8 @@ def rsync_stage(*, repo_root: Path, src: Path, dst: Path, honor_gitignore: bool) "--exclude-from=.gitignore", "--filter=:- .gitignore", ] + for pattern in exclude_rel_paths: + argv.append(f"--exclude={pattern}") if src.is_dir(): argv += [str(src) + "/", str(dst) + "/"] else: @@ -139,6 +151,21 @@ def rsync_stage(*, repo_root: Path, src: Path, dst: Path, honor_gitignore: bool) run_cmd_argv(argv, cwd=repo_root) +def prune_stage_paths(stage_root: Path, exclude_rel_paths: tuple[str, ...]) -> None: + if not stage_root.exists(): + return + for path in sorted(stage_root.rglob("*"), reverse=True): + rel_path = path.relative_to(stage_root).as_posix() + for pattern in exclude_rel_paths: + normalized_pattern = pattern.rstrip("/") + if fnmatch.fnmatch(rel_path, normalized_pattern) or fnmatch.fnmatch(path.name, normalized_pattern): + if path.is_dir() and not path.is_symlink(): + shutil.rmtree(path) + else: + path.unlink(missing_ok=True) + break + + def tar_gz( *, cwd: Path, diff --git 
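a/skills/browser-helm/SKILL.md b/skills/browser-helm/SKILL.md new file mode 100644 index 0000000..dbe1afd --- /dev/null +++ b/skills/browser-helm/SKILL.md @@ -0,0 +1,232 @@ +--- +name: browser-helm +description: Helm-only browser runtime workflow for operating Browser Helm managed tabs via `browser-helm`, with namespaced `browser` / `tab` / `page` / `picker` / `events` commands and namespaced `.tmp/browser-helm/` output conventions. +allowed-tools: Bash(*) +--- + +# Operating Browser Helm managed tabs with `browser-helm` + +Use this skill when the user wants to operate a browser through the **Helm-only runtime** rather than the general-purpose `agent-browser`. + +Use it when you need to: + +- List connected browsers / managed tabs +- Create a managed tab and attach the debugger +- Run `page navigate` / `page click` / `page eval` / `page wait` / `page type` / `page press` / `page summary` / `page snapshot` / `page screenshot` +- Fetch or clear the metadata of the most recently picked element through the picker (without asking the user to paste JSON) +- Follow the current `browser-helm` output and on-disk conventions + +Do not use it when: + +- The user explicitly wants the general-purpose `agent-browser` / noVNC workflow +- The user only wants code explained and does not need `browser-helm` to run + +## Default workflow (new main path) + +Default Base URL: `http://127.0.0.1:5181` (no environment variable required). + +To override it (optional), put `--base-url http://127.0.0.1:5181` before the command. + +If `browser-helm` is not installed globally on this machine, you can use `node browser-helm/dist/cli.js` in place of `browser-helm` in the commands below. + +## Multi-user / multi-AI session (mutual trust) conventions (important) + +As the product is currently positioned, the daemon / Web UI / WS **do no authentication by default**; the collaboration model assumes multiple mutually trusting users on the same LAN. + +To avoid cross-talk or accidental operations when **one browser is shared by several AI conversations**, always use a `session` to isolate operations: + +- Give each AI conversation one fixed `--session ` (or set the environment variable `BROWSER_HELM_SESSION=`) +- A `session` isolates: + - Where the CLI context is persisted: `.tmp/browser-helm/context.json` (default) or `.tmp/browser-helm/sessions//context.json` + - Where CLI output is written: `.tmp/browser-helm//...` (default) or `.tmp/browser-helm/sessions///...` + - `tab create` automatically adds the prefix `[session:] ...` (so humans and AIs can tell who owns the tab) +- `tab list --mine` is only available in a non-default session (it uses the note prefix to filter for tabs created by this conversation) + +Note: a `session` is only a working habit and isolation convention, **not a security boundary**. Anyone who knows a `managed-tab-id` can still operate across sessions; do not expose the port to untrusted networks. + +### Prerequisite (required): install and pair the extension + +Every `browser-helm` browser action depends on **the Chrome extension being connected to the daemon (WebSocket)**: + +- When creating a managed tab, pass `--note ` to describe the tab's intent or purpose.
+ - If `--note` is omitted and a URL is provided, the CLI generates one automatically: `打开页面:` ("Open page:") + +- If `browser-helm browser list` stays empty, first suspect that the extension is not installed or not connected, rather than a CLI error. + +One-time pairing steps: + +1) Start the daemon + +```bash +browser-helm daemon ensure +``` + +(Optional) To restart: + +```bash +browser-helm daemon restart +``` + +2) Open the Web UI in Chrome (use an address that Chrome can reach) + +- Web UI: `http://127.0.0.1:5181` +- The page shows a `Pairing Code` (recommended) as well as a `WS URL` / `Pairing Token` (Advanced) + +3) Install the extension (Unpacked) + +- In the Web UI, click "下载插件 zip" (download the extension zip), then unzip it +- Open `chrome://extensions` and enable Developer mode +- Click "Load unpacked" and select the unzipped directory + +4) Pair the extension (Connect) + +- Open the extension popup +- Paste the `Pairing Code` from the Web UI and click `Connect` +- (Optional) Click `Status` once to confirm the connection is OK +- Advanced: you can also enter the `WS URL` + `Pairing Token` by hand + +5) Verify from the CLI that the extension is connected + +```bash +browser-helm browser list +``` + +### Default action flow + +1. Make sure the `Browser Helm daemon` is running (the AI can start it directly from the CLI) + +```bash +browser-helm daemon ensure +``` + +Note: `daemon ensure` starts the bundled precompiled daemon (currently provided for `linux-x64`); the user does not need `cargo` installed. + +2. Confirm the extension is connected and list browsers + +```bash +browser-helm browser list +``` + +(Recommended) 3. Pin the default browser/tab (so long conversations do not lose track of them) + +```bash +browser-helm context use-browser +browser-helm context use-tab +browser-helm context show +``` + +4. List tabs; if there are none, create a new tab + +```bash +browser-helm browser list +browser-helm tab list +browser-helm tab create https://example.com --note "Describe what this tab is for" +``` + +5. (Optional) Explicitly `tab attach` the debugger + +`tab create` / `page navigate` already ensure the debugger is attached (so network/console events are captured earlier). If you plan to refresh or navigate manually in the browser, run `tab attach` first as well. + +```bash +browser-helm tab attach +``` + +6. For page analysis, prefer reading the returned output directly + +```bash +browser-helm page summary +browser-helm page snapshot +``` + +7. Save `page summary` / `page snapshot` explicitly only when you need a record + +```bash +browser-helm --save page summary +browser-helm --save page snapshot +``` + +8. 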
`page screenshot` writes to disk by default; `page click` performs a programmatic click underneath the managed page's overlay + +```bash +browser-helm page click '#selector' +browser-helm page click '#selector' --wait-text 'Finished working' --timeout-ms 15000 +browser-helm page eval '1+1' +browser-helm page wait --until-text 'Finished working' --timeout-ms 15000 +browser-helm page type 'div[aria-label="Composer"]' 'hello' +browser-helm page press 'Enter' +browser-helm page screenshot +``` + +9. Prefer running `page snapshot` first to generate `@iN` refs, then operate by ref (similar to `@eN` in agent-browser) + +```bash +browser-helm page snapshot +browser-helm page click @i1 +browser-helm page type @i2 'hello' +``` + +10. If the user picked an element in the SidePanel (Start Picking), the AI can pull the most recent selection straight from the daemon + +```bash +browser-helm picker last +browser-helm picker clear +``` + +### Interaction recording (user reproduces manually) + +When you need the AI to open a managed tab, have the user reproduce the problem by hand, and then let the AI review what happened, turn on interaction recording: + +```bash +# Record the start time (ms) +t0=$(date +%s%3N) + +# Start recording (injects a listener script and temporarily hides the overlay so the user can click/type) +browser-helm recorder start + +# ...the user reproduces the issue manually in this tab... + +# Pull interaction/console/network events from the reproduction window (filtered by since) +browser-helm events interaction --since $t0 --limit 2000 +browser-helm events console --since $t0 --limit 2000 +browser-helm events network --since $t0 --limit 2000 + +# Stop recording (restores the overlay) +browser-helm recorder stop +``` + +Note: interaction recordings include the raw value of inputs (not redacted). Use them only in trusted or local environments. + +## Output and on-disk conventions + +- `page summary`: prints only by default; with `output-path` or `--save`, writes to `.tmp/browser-helm/summaries/` +- `page snapshot`: prints only by default; with `output-path` or `--save`, writes to `.tmp/browser-helm/snapshots/` +- `page screenshot`: writes to `.tmp/browser-helm/screenshots/` by default +- With `--session ` / `BROWSER_HELM_SESSION=`, the directories above switch automatically to `.tmp/browser-helm/sessions//...` +- If the user gives an explicit path, it takes precedence + +## Command reference + +For detailed commands and examples, see: [`browser-helm/skills/browser-helm/references/commands.md`] + +Suggested order: + +1. `browser list` +2. `tab list` +3. `tab create` (passing `--note` is recommended; if it is omitted and a URL is provided, a note is generated automatically) + - Or: `tab adopt-active` (take over the currently active tab) +4. `tab attach` +5. `page navigate` +6. `page summary` / `page snapshot` +7. 
`page click` / `page screenshot` + + +## Directory conventions + +- In-project skill source directory: [`browser-helm/skills/browser-helm/`] +- Repository root entry point: [`skills/browser-helm/`] + + +## Command conventions + +- Only the namespaced command surface is supported: `browser list`, `tab create`, `page navigate`, `picker last`, and so on. +- The default documented path now uses the namespaced form: `browser list`, `tab create`, `page navigate`, `events console`, `picker last`. diff --git a/skills/browser-helm/agents/openai.yaml b/skills/browser-helm/agents/openai.yaml new file mode 100644 index 0000000..686f428 --- /dev/null +++ b/skills/browser-helm/agents/openai.yaml @@ -0,0 +1,6 @@ +interface: + display_name: Browser Helm + short_description: Workflow for operating Browser Helm managed tabs through Browser Helm + default_prompt: Use $browser-helm to inspect and operate Browser Helm managed tabs. +policy: + allow_implicit_invocation: false diff --git a/skills/browser-helm/references/commands.md b/skills/browser-helm/references/commands.md new file mode 100644 index 0000000..d22d465 --- /dev/null +++ b/skills/browser-helm/references/commands.md @@ -0,0 +1,131 @@ +# `browser-helm` command reference + +## Prerequisite (required): extension installation and pairing + +Whether the CLI can operate the browser depends on **whether the Chrome extension is connected to the daemon (WebSocket)**. + +Minimal end-to-end steps: + +```bash +# 1) Start / ensure the daemon +browser-helm daemon ensure +browser-helm daemon status +browser-helm daemon restart + +# 2) Open the Web UI in Chrome (use an address that Chrome can reach) +# http://127.0.0.1:5181 +# Copy the Pairing Code from the page (recommended; includes candidate addresses for multiple network interfaces) or the WS URL + Pairing Token (Advanced) +# +# 3) Install the extension (Unpacked) +# - Download the extension zip from the Web UI -> unzip +# - chrome://extensions: enable Developer mode -> Load unpacked +# +# 4) Enter the Pairing Code in the extension popup -> Connect + +# 5) Verify the browser is connected +browser-helm browser list +``` + +## Basic commands (new main path) + +```bash +browser-helm daemon status +browser-helm daemon ensure +browser-helm daemon stop +browser-helm daemon restart +browser-helm status +browser-helm browser list +browser-helm tab list [browser-id] [--mine] +browser-helm recorder start [browser-id] [managed-tab-id] +browser-helm recorder stop [browser-id] [managed-tab-id] +``` + +## Managed tab lifecycle + +```bash +browser-helm tab create [browser-id] [url] [--note ] 
+browser-helm tab adopt-active [browser-id] [--note ] +browser-helm tab attach [browser-id] [managed-tab-id] +browser-helm page navigate [browser-id] [managed-tab-id] +``` + +## Interaction and analysis + +```bash +browser-helm page click [browser-id] [managed-tab-id] [--wait-(selector|text|js) ] [--timeout-ms ] [--interval-ms ] +browser-helm page eval [browser-id] [managed-tab-id] +browser-helm page wait [browser-id] [managed-tab-id] --until-(selector|text|js) [--timeout-ms ] [--interval-ms ] +browser-helm page type [browser-id] [managed-tab-id] +browser-helm page press [browser-id] [managed-tab-id] +browser-helm page summary [browser-id] [managed-tab-id] [output-path] +browser-helm page snapshot [browser-id] [managed-tab-id] [output-path] +browser-helm page screenshot [browser-id] [managed-tab-id] [output-path] +browser-helm events console [browser-id] [managed-tab-id] [--limit ] [--since ] +browser-helm events network [browser-id] [managed-tab-id] [--limit ] [--since ] +browser-helm events interaction [browser-id] [managed-tab-id] [--limit ] [--since ] +browser-helm picker last [browser-id] [managed-tab-id] +browser-helm picker clear [browser-id] [managed-tab-id] +``` + +Notes: + +- `page snapshot` generates reusable interactive refs: `@i1/@i2/...` (in the order of the interactives list). +- `page click/@iN` and `page type/@iN` resolve the ref to the selector recorded in the snapshot (persisted at `.tmp/browser-helm/refs/.json`, isolated per `--session`). + +## Context (session-like, new main path) + +In long conversations or long tasks, you can write the default objects into the local context to avoid supplying `browser-id` / `managed-tab-id` repeatedly: + +```bash +browser-helm context use-browser +browser-helm context use-tab +browser-helm context show +browser-helm context clear +``` + +## Isolating multiple AI conversations (recommended) + +To avoid cross-talk when one browser is shared by several AI conversations, give each conversation a fixed `session`: + +```bash +browser-helm --session chat-a browser list +browser-helm --session chat-a tab list --mine +browser-helm --session chat-a tab create https://example.com --note "What this conversation is for" +``` + +Notes: + +- `tab create` automatically adds the prefix `[session:chat-a] ...` +- `tab list --mine` requires a non-default session (otherwise it reports an error) + +## Output conventions 
+ +- `page summary` + - Prints only by default + - With `--save`, writes to [`.tmp/browser-helm/summaries/`] by default +- `page snapshot` + - Prints only by default + - With `--save`, writes to [`.tmp/browser-helm/snapshots/`] by default +- `page screenshot` + - Writes to [`.tmp/browser-helm/screenshots/`] by default +- With `--session ` / `BROWSER_HELM_SESSION=`, the directories above switch automatically to [`.tmp/browser-helm/sessions//...`] + +## Recommended example + +```bash +browser-helm browser list +browser-helm tab create https://example.com --note "Describe what this tab is for" +browser-helm tab attach +browser-helm page snapshot +browser-helm --save page summary +browser-helm page screenshot +``` + +Notes: + +- If `tab create` is given a URL without `--note`, it generates one automatically: `打开页面:` ("Open page:") + +## Command conventions + +- Only the namespaced command surface is supported: `browser list`, `tab create`, `tab attach`, `page navigate`, `picker last`, and so on. +- Going forward, the docs and the skill both use namespaced commands as the main path by default. diff --git a/skills/canvas-dag_organizer-v1/SKILL.md b/skills/canvas-dag_organizer-v1/SKILL.md new file mode 100644 index 0000000..db3dc0d --- /dev/null +++ b/skills/canvas-dag_organizer-v1/SKILL.md @@ -0,0 +1,10 @@ +--- +name: "canvas-dag_organizer-v1" +description: "Canvas DAG Organizer v1" +metadata: + short-description: "Canvas DAG Organizer v1" +--- + +# Canvas DAG Organizer v1 + +You are the "Canvas DAG readability optimization expert" (canvas_dag_organizer).\nYour goal: based on the current canvas content and the DAG structure (causal/timeline edges), decide how to split, group, and adjust the spatial layout to maximize readability.\n\nHard constraints (must be followed):\n- Never ask the user to hand-edit `.canvas` / `.canvas.ext` JSON.\n- You cannot execute any commands; you may only output one strict JSON object (no markdown, no code fence, no extra text).\n- The changes you output must be reproducible and deterministic (the same input yields the same output).\n\nYou will receive:\n- path + expectedCanvasSha256 (concurrency guard)\n- scopeNodes / scopeEdges (the subgraph you are allowed to change)\n- each node's effective rect (taking ext.dx/dy/scale into account)\n\nYour output JSON schema (version=1):\n{\n "version": 1,\n "kind": "canvas_dag_organize_apply_v1",\n "path": "",\n "expectedCanvasSha256": "",\n "summary": "One-sentence summary of what you did (shown as a UI hint)",\n "ops": [\n // CanvasOpsRequestV1.ops: op=upsert_node|delete_node|upsert_edge|delete_edge\n ]\n}\n\nImportant rules:\n- Only existing session nodes within scope (position/size/text, etc.) and existing edges may be changed.\n- You may create group nodes for partitioning (id must start with "group-"; type="group").\n- Never delete any session 
node (dever_kind=session).\n- If you delete a node, you must also delete every edge that references it (otherwise the server rejects the apply).\n- Prioritize grouping + layering/swimlanes + alignment + whitespace; do not blindly lay everything out on a grid. diff --git a/skills/canvas-dag_organizer-v1/agents/openai.yaml b/skills/canvas-dag_organizer-v1/agents/openai.yaml new file mode 100644 index 0000000..f7ffc0e --- /dev/null +++ b/skills/canvas-dag_organizer-v1/agents/openai.yaml @@ -0,0 +1,6 @@ +interface: + display_name: "Canvas DAG Organizer v1" + short_description: "Canvas DAG Organizer v1" + default_prompt: "Use $canvas-dag_organizer-v1." +policy: + allow_implicit_invocation: false diff --git a/skills/canvas-ops-v1/SKILL.md b/skills/canvas-ops-v1/SKILL.md new file mode 100644 index 0000000..ffa8017 --- /dev/null +++ b/skills/canvas-ops-v1/SKILL.md @@ -0,0 +1,10 @@ +--- +name: "canvas-ops-v1" +description: "Canvas Ops v1" +metadata: + short-description: "Canvas Ops v1" +--- + +# Canvas Ops v1 + +You are the "Canvas file operations assistant" (canvas_ops).\nYour goal: every change to `*.canvas` / `*.canvas.ext` must go through the project's script; hand-editing the JSON is forbidden.\n\nThe only permitted entry point:\n- `.dever/tools/canvas_ops/canvas_ops.sh`\n- Config: `.dever/tools/canvas_ops/config.json`\n\nHard constraints:\n- You may only produce the request JSON that `apply` needs (version=1) and give one executable command that invokes the script.\n- Never output or paste the full `.canvas` content as the "modified file".\n- If a node must be deleted, you must also explicitly delete every edge that depends on it (otherwise the script refuses to run).\n\nYour output format (two parts, and only two):\n(1) request JSON (plain JSON, no markdown, no code fence)\n(2) a bash command (use a heredoc to feed the JSON to the script; the command must pass `-w` and `-c` explicitly)\n\nCommand template (replace  with the project root; usually `.`):\n.dever/tools/canvas_ops/canvas_ops.sh apply -w -c .dever/tools/canvas_ops/config.json --request-stdin <<'JSON'\n{...}\nJSON\n\nSuggestion (optional): run validate once more after the command to confirm the written result is readable and the ext sha matches. diff --git a/skills/canvas-ops-v1/agents/openai.yaml b/skills/canvas-ops-v1/agents/openai.yaml new file mode 100644 index 0000000..5566cff --- /dev/null +++ b/skills/canvas-ops-v1/agents/openai.yaml @@ -0,0 +1,6 @@ +interface: + display_name: Canvas Ops v1 + short_description: Canvas Ops v1 + default_prompt: Use $canvas-ops-v1. 
+policy: + allow_implicit_invocation: false diff --git a/skills/canvas-tidy_selection-v1/SKILL.md b/skills/canvas-tidy_selection-v1/SKILL.md new file mode 100644 index 0000000..0dbfdf1 --- /dev/null +++ b/skills/canvas-tidy_selection-v1/SKILL.md @@ -0,0 +1,10 @@ +--- +name: "canvas-tidy_selection-v1" +description: "Canvas Tidy Selection v1" +metadata: + short-description: "Canvas Tidy Selection v1" +--- + +# Canvas Tidy Selection v1 + +You are the "Canvas session block tidying expert" (canvas_tidy_selection).\nYour goal: provide one-click automatic tidying for the session blocks selected on the canvas (deterministic, reproducible layout).\n\nHard constraints:\n- Never suggest that the user hand-edit `.canvas` / `.canvas.ext` JSON.\n- Do not output the full modified canvas file content.\n- You may only output (two parts, and only two):\n (1) request JSON (plain JSON, no markdown, no code fence)\n (2) one curl command (sending the request to the manager's tidy_selection API).\n\nRequest/response (V1) conventions:\n- Endpoint: POST /api/projects/:projectId/canvas/tidy_selection\n- request JSON schema (version=1):\n - version: 1\n - path: string (path relative to the project root; must end with .canvas)\n - expectedCanvasSha256: string (concurrency guard; must come from the canvas_sha256 of the latest load response)\n - selectedSessionIds: string[] (node ids of the selected session blocks; deduplicated with stable order preserved)\n - layout: { kind: "grid_sqrt_v1"; gapX: number; gapY: number }\n - anchor: { kind: "keep_bounds_topleft_v1" }\n - resetConnectedEdgeRoutes: boolean (true clears the ext routes of the affected edges and returns them to default routing)\n\ncurl template (replace  with the actual id):\ncurl -sS -X POST 'http://localhost:8788/api/projects//canvas/tidy_selection' \\n -H 'Content-Type: application/json' \\n -d ''\n\nOutput policy:\n- Do not ask the user questions; produce the strongest executable request directly from the information given.\n- If key information is missing (for example projectId/path/sha/selected ids), use empty values as placeholders in the request JSON and keep the <...> placeholders in the curl command. diff --git a/skills/canvas-tidy_selection-v1/agents/openai.yaml b/skills/canvas-tidy_selection-v1/agents/openai.yaml new file mode 100644 index 0000000..120f1ac --- /dev/null +++ b/skills/canvas-tidy_selection-v1/agents/openai.yaml @@ -0,0 +1,6 @@ +interface: + display_name: "Canvas Tidy Selection v1" + short_description: "Canvas Tidy Selection v1" + default_prompt: "Use $canvas-tidy_selection-v1." 
+policy: + allow_implicit_invocation: false diff --git a/skills/find-skills/SKILL.md b/skills/find-skills/SKILL.md new file mode 100644 index 0000000..c797184 --- /dev/null +++ b/skills/find-skills/SKILL.md @@ -0,0 +1,133 @@ +--- +name: find-skills +description: Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill. +--- + +# Find Skills + +This skill helps you discover and install skills from the open agent skills ecosystem. + +## When to Use This Skill + +Use this skill when the user: + +- Asks "how do I do X" where X might be a common task with an existing skill +- Says "find a skill for X" or "is there a skill for X" +- Asks "can you do X" where X is a specialized capability +- Expresses interest in extending agent capabilities +- Wants to search for tools, templates, or workflows +- Mentions they wish they had help with a specific domain (design, testing, deployment, etc.) + +## What is the Skills CLI? + +The Skills CLI (`npx skills`) is the package manager for the open agent skills ecosystem. Skills are modular packages that extend agent capabilities with specialized knowledge, workflows, and tools. + +**Key commands:** + +- `npx skills find [query]` - Search for skills interactively or by keyword +- `npx skills add ` - Install a skill from GitHub or other sources +- `npx skills check` - Check for skill updates +- `npx skills update` - Update all installed skills + +**Browse skills at:** https://skills.sh/ + +## How to Help Users Find Skills + +### Step 1: Understand What They Need + +When a user asks for help with something, identify: + +1. The domain (e.g., React, testing, design, deployment) +2. The specific task (e.g., writing tests, creating animations, reviewing PRs) +3. 
Whether this is a common enough task that a skill likely exists + +### Step 2: Search for Skills + +Run the find command with a relevant query: + +```bash +npx skills find [query] +``` + +For example: + +- User asks "how do I make my React app faster?" → `npx skills find react performance` +- User asks "can you help me with PR reviews?" → `npx skills find pr review` +- User asks "I need to create a changelog" → `npx skills find changelog` + +The command will return results like: + +``` +Install with npx skills add + +vercel-labs/agent-skills@vercel-react-best-practices +└ https://skills.sh/vercel-labs/agent-skills/vercel-react-best-practices +``` + +### Step 3: Present Options to the User + +When you find relevant skills, present them to the user with: + +1. The skill name and what it does +2. The install command they can run +3. A link to learn more at skills.sh + +Example response: + +``` +I found a skill that might help! The "vercel-react-best-practices" skill provides +React and Next.js performance optimization guidelines from Vercel Engineering. + +To install it: +npx skills add vercel-labs/agent-skills@vercel-react-best-practices + +Learn more: https://skills.sh/vercel-labs/agent-skills/vercel-react-best-practices +``` + +### Step 4: Offer to Install + +If the user wants to proceed, you can install the skill for them: + +```bash +npx skills add -g -y +``` + +The `-g` flag installs globally (user-level) and `-y` skips confirmation prompts. 
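As a rough sketch, the skill identifier from the Step 2 example and the Step 4 flags can be combined as shown below. `install_cmd` is a hypothetical helper, not part of the Skills CLI; it only prints the command (a dry run) so it can be shown to the user before anything is installed.

```shell
# Dry-run sketch: print the install command instead of executing it.
# `install_cmd` is a hypothetical helper, not part of the Skills CLI.
install_cmd() {
  # $1 is the skill identifier, e.g. owner/repo@skill-name.
  # -g installs globally (user-level); -y skips confirmation prompts.
  printf 'npx skills add %s -g -y\n' "$1"
}

# Skill identifier taken from the Step 2 example output.
install_cmd "vercel-labs/agent-skills@vercel-react-best-practices"
```

Run the printed command only once the user has confirmed they want the skill installed.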
+ +## Common Skill Categories + +When searching, consider these common categories: + +| Category | Example Queries | +| --------------- | ---------------------------------------- | +| Web Development | react, nextjs, typescript, css, tailwind | +| Testing | testing, jest, playwright, e2e | +| DevOps | deploy, docker, kubernetes, ci-cd | +| Documentation | docs, readme, changelog, api-docs | +| Code Quality | review, lint, refactor, best-practices | +| Design | ui, ux, design-system, accessibility | +| Productivity | workflow, automation, git | + +## Tips for Effective Searches + +1. **Use specific keywords**: "react testing" is better than just "testing" +2. **Try alternative terms**: If "deploy" doesn't work, try "deployment" or "ci-cd" +3. **Check popular sources**: Many skills come from `vercel-labs/agent-skills` or `ComposioHQ/awesome-claude-skills` + +## When No Skills Are Found + +If no relevant skills exist: + +1. Acknowledge that no existing skill was found +2. Offer to help with the task directly using your general capabilities +3. Suggest the user could create their own skill with `npx skills init` + +Example: + +``` +I searched for skills related to "xyz" but didn't find any matches. +I can still help you with this task directly! Would you like me to proceed? + +If this is something you do often, you could create your own skill: +npx skills init my-xyz-skill +``` diff --git a/skills/imagegen/LICENSE.txt b/skills/imagegen/LICENSE.txt new file mode 100644 index 0000000..13e25df --- /dev/null +++ b/skills/imagegen/LICENSE.txt @@ -0,0 +1,201 @@ +Apache License +Version 2.0, January 2004 +http://www.apache.org/licenses/ + +TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + +1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. 
+ + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + +2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + +3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + +4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the 
following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + +5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + +6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + +7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + +8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + +9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf of + any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ +END OF TERMS AND CONDITIONS + +APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + +Copyright [yyyy] [name of copyright owner] + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. diff --git a/skills/imagegen/SKILL.md b/skills/imagegen/SKILL.md new file mode 100644 index 0000000..4285e5e --- /dev/null +++ b/skills/imagegen/SKILL.md @@ -0,0 +1,356 @@ +--- +name: "imagegen" +description: "Generate or edit raster images when the task benefits from AI-created bitmap visuals such as photos, illustrations, textures, sprites, mockups, or transparent-background cutouts. Use when Codex should create a brand-new image, transform an existing image, or derive visual variants from references, and the output should be a bitmap asset rather than repo-native code or vector. Do not use when the task is better handled by editing existing SVG/vector/code-native assets, extending an established icon or logo system, or building the visual directly in HTML/CSS/canvas."
+--- + +# Image Generation Skill + +Generates or edits images for the current project (for example website assets, game assets, UI mockups, product mockups, wireframes, logo design, photorealistic images, or infographics). + +## Top-level modes and rules + +This skill has exactly two top-level modes: + +- **Default built-in tool mode (preferred):** built-in `image_gen` tool for normal image generation, editing, and simple transparent-image requests. Does not require `OPENAI_API_KEY`. +- **Fallback CLI mode:** `scripts/image_gen.py` CLI. Use when the user explicitly asks for the CLI/API/model path, or after the user explicitly confirms a true model-native transparency fallback with `gpt-image-1.5`. Requires `OPENAI_API_KEY`. + +Within CLI fallback, the CLI exposes three subcommands: + +- `generate` +- `edit` +- `generate-batch` + +Rules: +- Use the built-in `image_gen` tool by default for normal image generation and editing requests. +- Do not switch to CLI fallback for ordinary quality, size, or file-path control. +- If the user explicitly asks for a transparent image/background, stay on built-in `image_gen` first: prompt for a flat removable chroma-key background, then remove it locally with the installed helper at `$CODEX_HOME/skills/.system/imagegen/scripts/remove_chroma_key.py`. +- Never silently switch from built-in `image_gen` or CLI `gpt-image-2` to CLI `gpt-image-1.5`. Treat this as a model/path downgrade and ask the user before doing it, unless the user has already explicitly requested `gpt-image-1.5`, `scripts/image_gen.py`, or CLI fallback. +- If a transparent request appears too complex for clean chroma-key removal, asks for true/native transparency, or local removal fails validation, explain that true transparency requires CLI `gpt-image-1.5 --background transparent --output-format png` because `gpt-image-2` does not support `background=transparent`, then ask whether to proceed. Run the CLI fallback only after the user confirms. 
+- The word `batch` by itself does not mean CLI fallback. If the user asks for many assets or says to batch-generate assets without explicitly asking for CLI/API/model controls, stay on the built-in path and issue one built-in call per requested asset or variant. +- If the built-in tool fails or is unavailable, tell the user the CLI fallback exists and that it requires `OPENAI_API_KEY`. Proceed only if the user explicitly asks for that fallback. +- If the user explicitly asks for CLI mode, use the bundled `scripts/image_gen.py` workflow. Do not create one-off SDK runners. +- Never modify `scripts/image_gen.py`. If something is missing, ask the user before doing anything else. + +Built-in save-path policy: +- In built-in tool mode, Codex saves generated images under `$CODEX_HOME/*` by default. +- Do not describe or rely on OS temp as the default built-in destination. +- Do not describe or rely on a destination-path argument (if any) on the built-in `image_gen` tool. If a specific location is needed, generate first and then move or copy the selected output from `$CODEX_HOME/generated_images/...`. +- Save-path precedence in built-in mode: + 1. If the user names a destination, move or copy the selected output there. + 2. If the image is meant for the current project, move or copy the final selected image into the workspace before finishing. + 3. If the image is only for preview or brainstorming, render it inline; the underlying file can remain at the default `$CODEX_HOME/*` path. +- Never leave a project-referenced asset only at the default `$CODEX_HOME/*` path. +- Do not overwrite an existing asset unless the user explicitly asked for replacement; otherwise create a sibling versioned filename such as `hero-v2.png` or `item-icon-edited.png`. + +Shared prompt guidance for both modes lives in `references/prompting.md` and `references/sample-prompts.md`. 
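The non-overwrite rule in the save-path policy above can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the skill: the helper names `next_versioned_path` and `copy_without_overwrite`, and the `-v2`, `-v3` numbering scheme, are hypothetical and modeled on the `hero-v2.png` example. The policy itself only requires that an existing asset is not replaced unless the user asked for replacement.

```python
import shutil
from pathlib import Path


def next_versioned_path(dest: Path) -> Path:
    """Return dest if it is free, else the first free sibling such as name-v2.ext.

    Hypothetical helper: the -vN suffix scheme is an assumption for illustration.
    """
    if not dest.exists():
        return dest
    version = 2
    while True:
        candidate = dest.with_name(f"{dest.stem}-v{version}{dest.suffix}")
        if not candidate.exists():
            return candidate
        version += 1


def copy_without_overwrite(source: Path, dest: Path, replace: bool = False) -> Path:
    """Copy a generated image into the workspace.

    The existing file at dest is overwritten only when replace is True,
    which stands in for "the user explicitly asked for replacement".
    """
    target = dest if replace else next_versioned_path(dest)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return target
```

Copying rather than moving keeps the original under the default output location, which is the more conservative choice when several variants are still being compared.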
+ +Fallback-only docs/resources for CLI mode: +- `references/cli.md` +- `references/image-api.md` +- `references/codex-network.md` +- `scripts/image_gen.py` + +Local post-processing helper: +- `$CODEX_HOME/skills/.system/imagegen/scripts/remove_chroma_key.py`: removes a flat chroma-key background from a generated image and writes a PNG/WebP with alpha. Prefer auto-key sampling, soft matte, and despill for antialiased edges. + +## When to use +- Generate a new image (concept art, product shot, cover, website hero) +- Generate a new image using one or more reference images for style, composition, or mood +- Edit an existing image (inpainting, lighting or weather transformations, background replacement, object removal, compositing, transparent background) +- Produce many assets or variants for one task + +## When not to use +- Extending or matching an existing SVG/vector icon set, logo system, or illustration library inside the repo +- Creating simple shapes, diagrams, wireframes, or icons that are better produced directly in SVG, HTML/CSS, or canvas +- Making a small project-local asset edit when the source file already exists in an editable native format +- Any task where the user clearly wants deterministic code-native output instead of a generated bitmap + +## Decision tree + +Think about two separate questions: + +1. **Intent:** is this a new image or an edit of an existing image? +2. **Execution strategy:** is this one asset or many assets/variants? + +Intent: +- If the user wants to modify an existing image while preserving parts of it, treat the request as **edit**. +- If the user provides images only as references for style, composition, mood, or subject guidance, treat the request as **generate**. +- If the user provides no images, treat the request as **generate**. + +Built-in edit semantics: +- Built-in edit mode is for images already visible in the conversation context, such as attached images or images generated earlier in the thread. 
+- If the user wants to edit a local image file with the built-in tool, first load it with the built-in `view_image` tool so the image is visible in the conversation context, then proceed with the built-in edit flow. +- Do not promise arbitrary filesystem-path editing through the built-in tool. +- If a local file still needs direct file-path control, masks, or other explicit CLI-only parameters, use the explicit CLI fallback only when the user asks for it. +- For edits, preserve invariants aggressively and save non-destructively by default. + +Execution strategy: +- In the built-in default path, produce many assets or variants by issuing one `image_gen` call per requested asset or variant. +- In the CLI fallback path, use the CLI `generate-batch` subcommand only when the user explicitly chose CLI mode and needs many prompts/assets. +- For many distinct assets, do not use `n` as a substitute for separate prompts. `n` is for variants of one prompt; distinct assets need distinct built-in calls or distinct CLI `generate-batch` jobs. + +Assume the user wants a new image unless they clearly ask to change an existing one. + +## Workflow +1. Decide the top-level mode: built-in by default, including simple transparent-output requests; fallback CLI only if explicitly requested or after the user explicitly confirms a transparent-output fallback. +2. Decide the intent: `generate` or `edit`. +3. Decide whether the output is preview-only or meant to be consumed by the current project. +4. Decide the execution strategy: single asset vs repeated built-in calls vs CLI `generate-batch`. +5. Collect inputs up front: prompt(s), exact text (verbatim), constraints/avoid list, and any input images. +6. For every input image, label its role explicitly: + - reference image + - edit target + - supporting insert/style/compositing input +7.
If the edit target is only on the local filesystem and you are staying on the built-in path, inspect it with `view_image` first so the image is available in conversation context. +8. If the user asked for a photo, illustration, sprite, product image, banner, or other explicitly raster-style asset, use `image_gen` rather than substituting SVG/HTML/CSS placeholders. If the request is for an icon, logo, or UI graphic that should match existing repo-native SVG/vector/code assets, prefer editing those directly instead. +9. Augment the prompt based on specificity: + - If the user's prompt is already specific and detailed, normalize it into a clear spec without adding creative requirements. + - If the user's prompt is generic, add tasteful augmentation only when it materially improves output quality. +10. Use the built-in `image_gen` tool by default. +11. For transparent-output requests, follow the transparent image guidance below: generate with built-in `image_gen` on a flat chroma-key background, copy the selected output into the workspace or `tmp/imagegen/`, run the installed `$CODEX_HOME/skills/.system/imagegen/scripts/remove_chroma_key.py` helper, and validate the alpha result before using it. If this path looks unsuitable or fails, ask before switching to CLI `gpt-image-1.5`. +12. Inspect outputs and validate: subject, style, composition, text accuracy, and invariants/avoid items. +13. Iterate with a single targeted change, then re-check. +14. For preview-only work, render the image inline; the underlying file may remain at the default `$CODEX_HOME/generated_images/...` path. +15. For project-bound work, move or copy the selected artifact into the workspace and update any consuming code or references. Never leave a project-referenced asset only at the default `$CODEX_HOME/generated_images/...` path. +16. For batches or multi-asset requests, persist every requested deliverable final in the workspace unless the user explicitly asked to keep outputs preview-only. 
Discarded variants do not need to be kept unless requested. +17. If the user explicitly chooses or confirms the CLI fallback, then use the fallback-only docs for model, quality, size, `input_fidelity`, masks, output format, output paths, and network setup. +18. Always report the final saved path(s) for any workspace-bound asset(s), plus the final prompt or prompt set and whether the built-in tool or fallback CLI mode was used. + +## Transparent image requests + +Transparent-image requests still use built-in `image_gen` first. Because the built-in tool does not expose a true transparent-background control, create a removable chroma-key source image and then convert the key color to alpha locally. + +Default sequence: +1. Use built-in `image_gen` to generate the requested subject on a perfectly flat solid chroma-key background. +2. Choose a key color that is unlikely to appear in the subject: default `#00ff00`, use `#ff00ff` for green subjects, and avoid `#0000ff` for blue subjects. +3. After generation, move or copy the selected source image from `$CODEX_HOME/generated_images/...` into the workspace or `tmp/imagegen/`. +4. Run the installed helper path, not a project-relative script path: + ```bash + python "${CODEX_HOME:-$HOME/.codex}/skills/.system/imagegen/scripts/remove_chroma_key.py" \ + --input \ + --out \ + --auto-key border \ + --soft-matte \ + --transparent-threshold 12 \ + --opaque-threshold 220 \ + --despill + ``` +5. Validate that the output has an alpha channel, transparent corners, plausible subject coverage, and no obvious key-color fringe. If a thin fringe remains, retry once with `--edge-contract 1`; use `--edge-feather 0.25` only when the edge is visibly stair-stepped and the subject is not shiny or reflective. +6. Save the final alpha PNG/WebP in the project if the asset is project-bound. Never leave a project-referenced transparent asset only under `$CODEX_HOME/*`. 
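Step 2 of the sequence above (choosing a key color that is unlikely to appear in the subject) can be sketched as a small pre-generation heuristic. This is a hedged illustration, not how `remove_chroma_key.py` or its `--auto-key border` option works: the function names and the idea of passing in the subject's expected dominant colors as hex strings are assumptions. It picks the candidate key with the largest RGB distance from the nearest subject color and keeps `#00ff00` as the default.

```python
from math import dist

# Candidate key colors in preference order; #00ff00 is the documented default.
CANDIDATE_KEYS = ("#00ff00", "#ff00ff", "#0000ff")


def hex_to_rgb(color: str) -> tuple:
    """Convert '#rrggbb' into an (r, g, b) tuple of ints."""
    value = color.lstrip("#")
    return (int(value[0:2], 16), int(value[2:4], 16), int(value[4:6], 16))


def pick_key_color(subject_colors: list) -> str:
    """Pick the key color farthest from every expected subject color.

    Hypothetical helper: subject_colors is an assumed list of dominant
    subject colors known before generation (for example from the prompt).
    """
    if not subject_colors:
        return CANDIDATE_KEYS[0]
    subject_rgb = [hex_to_rgb(c) for c in subject_colors]

    def clearance(key: str) -> float:
        # Distance from this key to the closest subject color.
        key_rgb = hex_to_rgb(key)
        return min(dist(key_rgb, rgb) for rgb in subject_rgb)

    # max() returns the first maximal item, so ties keep the preference order.
    return max(CANDIDATE_KEYS, key=clearance)
```

With a green subject such as `#2e8b57` this sketch selects `#ff00ff`, and with a blue subject such as `#1e40af` it stays on `#00ff00`, which matches the guidance in step 2. Plain RGB distance is a rough proxy; a perceptual color space would be a reasonable refinement.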
+ +Prompt transparent requests like this: + +```text +Create the requested subject on a perfectly flat solid #00ff00 chroma-key background for background removal. +The background must be one uniform color with no shadows, gradients, texture, reflections, floor plane, or lighting variation. +Keep the subject fully separated from the background with crisp edges and generous padding. +Do not use #00ff00 anywhere in the subject. +No cast shadow, no contact shadow, no reflection, no watermark, and no text unless explicitly requested. +``` + +Do not automatically use CLI `gpt-image-1.5 --background transparent --output-format png` instead of chroma keying. Ask the user first when the user asks for true/native transparency, when local removal fails validation, or when the requested image is complex: hair, fur, feathers, smoke, glass, liquids, translucent materials, reflective objects, soft shadows, realistic product grounding, or subject colors that conflict with all practical key colors. + +Use a concise confirmation like: + +```text +This likely needs true native transparency. The default built-in path uses a chroma-key background plus local removal, but true transparency requires the CLI fallback with gpt-image-1.5 because gpt-image-2 does not support background=transparent. It also requires OPENAI_API_KEY. Should I proceed with that CLI fallback? +``` + +## Prompt augmentation + +Reformat user prompts into a structured, production-oriented spec. Make the user's goal clearer and more actionable, but do not blindly add detail. + +Treat this as prompt-shaping guidance, not a closed schema. Use only the lines that help, and add a short extra labeled line when it materially improves clarity. + +### Specificity policy + +Use the user's prompt specificity to decide how much augmentation is appropriate: + +- If the prompt is already specific and detailed, preserve that specificity and only normalize/structure it. 
+- If the prompt is generic, you may add tasteful augmentation when it will materially improve the result. + +Allowed augmentations: +- composition or framing hints +- polish level or intended-use hints +- practical layout guidance +- reasonable scene concreteness that supports the stated request + +Not allowed augmentations: +- extra characters or objects that are not implied by the request +- brand names, slogans, palettes, or narrative beats that are not implied +- arbitrary side-specific placement unless the surrounding layout supports it + +## Use-case taxonomy (exact slugs) + +Classify each request into one of these buckets and keep the slug consistent across prompts and references. + +Generate: +- photorealistic-natural — candid/editorial lifestyle scenes with real texture and natural lighting. +- product-mockup — product/packaging shots, catalog imagery, merch concepts. +- ui-mockup — app/web interface mockups and wireframes; specify the desired fidelity. +- infographic-diagram — diagrams/infographics with structured layout and text. +- scientific-educational — classroom explainers, scientific diagrams, and learning visuals with required labels and accuracy constraints. +- ads-marketing — campaign concepts and ad creatives with audience, brand position, scene, and exact tagline/copy. +- productivity-visual — slide, chart, workflow, and data-heavy business visuals. +- logo-brand — logo/mark exploration, vector-friendly. +- illustration-story — comics, children’s book art, narrative scenes. +- stylized-concept — style-driven concept art, 3D/stylized renders. +- historical-scene — period-accurate/world-knowledge scenes. + +Edit: +- text-localization — translate/replace in-image text, preserve layout. +- identity-preserve — try-on, person-in-scene; lock face/body/pose. +- precise-object-edit — remove/replace a specific element (including interior swaps). +- lighting-weather — time-of-day/season/atmosphere changes only. 
+- background-extraction — transparent background / clean cutout. Use built-in `image_gen` with chroma-key removal first for simple opaque subjects; ask before using CLI true transparency for complex subjects. +- style-transfer — apply reference style while changing subject/scene. +- compositing — multi-image insert/merge with matched lighting/perspective. +- sketch-to-render — drawing/line art to photoreal render. + +## Shared prompt schema + +Use the following labeled spec as shared prompt scaffolding for both top-level modes: + +```text +Use case: +Asset type: +Primary request: +Input images: (optional) +Scene/backdrop: +Subject:
+Style/medium: +Composition/framing: +Lighting/mood: +Color palette: +Materials/textures: +Text (verbatim): "" +Constraints: +Avoid: +``` + +Notes: +- `Asset type` and `Input images` are prompt scaffolding, not dedicated CLI flags. +- `Scene/backdrop` refers to the visual setting. It is not the same as the fallback CLI `background` parameter, which controls output transparency behavior. +- Fallback-only execution notes such as `Quality:`, `Input fidelity:`, masks, output format, and output paths belong in the CLI path only. Do not treat them as built-in `image_gen` tool arguments. + +Augmentation rules: +- Keep it short. +- Add only the details needed to improve the prompt materially. +- For edits, explicitly list invariants (`change only X; keep Y unchanged`). +- If any critical detail is missing and blocks success, ask a question; otherwise proceed. + +## Examples + +### Generation example (hero image) +```text +Use case: product-mockup +Asset type: landing page hero +Primary request: a minimal hero image of a ceramic coffee mug +Style/medium: clean product photography +Composition/framing: wide composition with usable negative space for page copy if needed +Lighting/mood: soft studio lighting +Constraints: no logos, no text, no watermark +``` + +### Edit example (invariants) +```text +Use case: precise-object-edit +Asset type: product photo background replacement +Primary request: replace only the background with a warm sunset gradient +Constraints: change only the background; keep the product and its edges unchanged; no text; no watermark +``` + +## Prompting best practices +- Structure prompt as scene/backdrop -> subject -> details -> constraints. +- Include intended use (ad, UI mock, infographic) to set the mode and polish level. +- Use camera/composition language for photorealism. +- Only use SVG/vector stand-ins when the user explicitly asked for vector output or a non-image placeholder. +- Quote exact text and specify typography + placement. 
+- For tricky words, spell them letter-by-letter and require verbatim rendering. +- For multi-image inputs, reference images by index and describe how they should be used. +- For edits, repeat invariants every iteration to reduce drift. +- Iterate with single-change follow-ups. +- If the prompt is generic, add only the extra detail that will materially help. +- If the prompt is already detailed, normalize it instead of expanding it. +- For CLI fallback only, see `references/cli.md` and `references/image-api.md` for model, `quality`, `input_fidelity`, masks, output format, and output-path guidance. +- For transparent images, use the built-in-first chroma-key workflow unless the request is complex enough to need true CLI transparency; ask before switching to CLI `gpt-image-1.5`. + +More principles shared by both modes: `references/prompting.md`. +Copy/paste specs shared by both modes: `references/sample-prompts.md`. + +## Guidance by asset type +Asset-type templates (website assets, game assets, wireframes, logo) are consolidated in `references/sample-prompts.md`. + +## gpt-image-2 guidance for CLI fallback + +The fallback CLI defaults to `gpt-image-2`. + +- Use `gpt-image-2` for new CLI/API workflows unless the request needs true model-native transparent output. +- If a transparent request may need CLI fallback, ask before using `gpt-image-1.5` unless the user already explicitly requested `gpt-image-1.5`, `scripts/image_gen.py`, or CLI fallback. Explain that the built-in chroma-key path is the default, but true transparency requires `gpt-image-1.5` because `gpt-image-2` does not support `background=transparent`. +- `gpt-image-2` always uses high fidelity for image inputs; do not set `input_fidelity` with this model. +- `gpt-image-2` supports `quality` values `low`, `medium`, `high`, and `auto`. +- Use `quality low` for fast drafts, thumbnails, and quick iterations. 
Use `medium`, `high`, or `auto` for final assets, dense text, diagrams, identity-sensitive edits, or high-resolution outputs. +- Square images are typically fastest to generate. Use `1024x1024` for fast square drafts. +- If the user asks for 4K-style output, use `3840x2160` for landscape or `2160x3840` for portrait. +- `gpt-image-2` size may be `auto` or `WIDTHxHEIGHT` if all constraints hold: max edge `<= 3840px`, both edges multiples of `16px`, long-to-short ratio `<= 3:1`, total pixels between `655,360` and `8,294,400`. + +Popular `gpt-image-2` sizes: +- `1024x1024` square +- `1536x1024` landscape +- `1024x1536` portrait +- `2048x2048` 2K square +- `2048x1152` 2K landscape +- `3840x2160` 4K landscape +- `2160x3840` 4K portrait +- `auto` + +## Fallback CLI mode only + +### Temp and output conventions +These conventions apply only to the CLI fallback. They do not describe built-in `image_gen` output behavior. +- Use `tmp/imagegen/` for intermediate files (for example JSONL batches); delete them when done. +- Write final artifacts under `output/imagegen/`. +- Use `--out` or `--out-dir` to control output paths; keep filenames stable and descriptive. + +### Dependencies +Prefer `uv` for dependency management in this repo. + +Required Python package: +```bash +uv pip install openai +``` + +Required for local chroma-key removal and optional downscaling: +```bash +uv pip install pillow +``` + +Portability note: +- If you are using the installed skill outside this repo, install dependencies into that environment with its package manager. +- In uv-managed environments, `uv pip install ...` remains the preferred path. + +### Environment +- `OPENAI_API_KEY` must be set for live API calls. +- Do not ask the user for `OPENAI_API_KEY` when using the built-in `image_gen` tool. +- Never ask the user to paste the full key in chat. Ask them to set it locally and confirm when ready. + +If the key is missing, give the user these steps: +1. 
Create an API key in the OpenAI platform UI: https://platform.openai.com/api-keys +2. Set `OPENAI_API_KEY` as an environment variable in their system. +3. Offer to guide them through setting the environment variable for their OS/shell if needed. + +If installation is not possible in this environment, tell the user which dependency is missing and how to install it into their active environment. + +### Script-mode notes +- CLI commands + examples: `references/cli.md` +- API parameter quick reference: `references/image-api.md` +- Network approvals / sandbox settings for CLI mode: `references/codex-network.md` + +## Reference map +- `references/prompting.md`: shared prompting principles for both modes. +- `references/sample-prompts.md`: shared copy/paste prompt recipes for both modes. +- `references/cli.md`: fallback-only CLI usage via `scripts/image_gen.py`. +- `references/image-api.md`: fallback-only API/CLI parameter reference. +- `references/codex-network.md`: fallback-only network/sandbox troubleshooting for CLI mode. +- `scripts/image_gen.py`: fallback-only CLI implementation. Do not load or use it unless the user explicitly chooses CLI mode or explicitly confirms a transparent request's true CLI transparency fallback. +- `$CODEX_HOME/skills/.system/imagegen/scripts/remove_chroma_key.py`: local post-processing helper for built-in transparent-image requests. diff --git a/skills/imagegen/agents/openai.yaml b/skills/imagegen/agents/openai.yaml new file mode 100644 index 0000000..5e01d44 --- /dev/null +++ b/skills/imagegen/agents/openai.yaml @@ -0,0 +1,6 @@ +interface: + display_name: "Image Gen" + short_description: "Generate or edit images for websites, games, and more" + icon_small: "./assets/imagegen-small.svg" + icon_large: "./assets/imagegen.png" + default_prompt: "Use $imagegen to make or edit an image for this project." 
diff --git a/skills/imagegen/assets/imagegen-small.svg b/skills/imagegen/assets/imagegen-small.svg new file mode 100644 index 0000000..20128b2 --- /dev/null +++ b/skills/imagegen/assets/imagegen-small.svg @@ -0,0 +1,5 @@ + + + + + diff --git a/skills/imagegen/assets/imagegen.png b/skills/imagegen/assets/imagegen.png new file mode 100644 index 0000000000000000000000000000000000000000..94b54541a9affd39a7aa09d0efd5bc6b712b723b GIT binary patch literal 1711 zcmV;g22lBlP)f8*GWodgmJ0xDHBC~1!*z7`}xiD+*{^u{6Wtr9m>;F?6e6%akq_R>NQ97`^h zS_ISw!l_acdMv2|s@jkWFo}~maqQ{5Ar6jXJ8OGpcRf3wwDM+c*~)zT-u&HNh6Xr2 z$-B~6@6U&DoN@F6Tx_gMyw!zZkYd0r7n}J1r^VmNyNO5=4Zu(bgOk|-gzyRH_#DA0 ze5a&Dv5qs&Z`LEHCLtt(GYpH}0a%=n6p=Gpix@{jKAv3Z&a&_|v3nYpI$$>Jn72{}2rewsoH7VU@@*oo2>*cKtL zTT?Bf&S2fJxIt9+2RUqgo>5nJ#ub| zQq}-0z927LghkRij)2h@&){gs3Jyeb=#8#z9#4y@PugZO5lp1xm|ls&O&Ie1Y;Y<& zDJqnH7*0nhk28A~+ z7<=9c&7_PLLIN%we2x!9dQ!nmCW_&I_i2rm5DevDWF6nVJEu$r(Gy$4m3QXlck9AP zAru+93XZ;0rxYWFoZJ6W>sd_Seeu8x=)gE3r*<#NO*egQ8QEe}jd4PT_P&B!&~Me- za;t<zMu9~wY)fm}Rzy9AItra{)#n*Ft zp#x)wPKN*4hndv~RP#6!#f883EA5>ZW2_J=urHpxtu%d61x_+_HXOVgZ*U1pUyntk z&{EQR=i5oZ_4zJ5T+-jH*0PkegJp0u{3nitpQ5+rnfUw7Y~EbH6KNj9MD{&AT0ew2 zu6@1iCZ~3%@lofp<0)5aAWWoVxbvz*TZ$@0sgS;wrx=Pokgrae&K<&7=9qkS3pxKD zT|-~ns+{fX2)!(eSEue($$FZ zBMl2V75M|_V)ta4Qo-2jF}VemhGKH4^?}?}Lf=PnuV)4kMCz`oE{PPST1J60(ckfX z`wz;Zu9>J=2o=)R_FFPl>7=J+9#`9L1x`v$La6e-A_}RI%DWbVD8;dmf{b2KNNy4| zDUO8nhUTRVlHvtHlCq-^sw=LvdncnnP^459LaiNYP5Gq|f*_@$5Ngt%i`|9aFkMPn zAtXXqA43HuOW7i%(8r)u<#$Y#vMI#vK86ZRlp=+A)yGhQX;MO7_c2spk`%T$bD5iR zcFpAW%f^kirm}vO`, `image_2.`, and so on. +- Downscaled copies use the default suffix `-web` unless you override it. 
+ +## Common recipes + +Generate with augmentation fields: + +```bash +python "$IMAGE_GEN" generate \ + --prompt "A minimal hero image of a ceramic coffee mug" \ + --use-case "product-mockup" \ + --style "clean product photography" \ + --composition "wide product shot with usable negative space for page copy" \ + --constraints "no logos, no text" \ + --out output/imagegen/mug-hero.png +``` + +Generate + also write a downscaled copy for fast web loading: + +```bash +python "$IMAGE_GEN" generate \ + --prompt "A cozy alpine cabin at dawn" \ + --size 1024x1024 \ + --downscale-max-dim 1024 \ + --out output/imagegen/alpine-cabin.png +``` + +Generate multiple prompts concurrently (async batch): + +```bash +mkdir -p tmp/imagegen output/imagegen/batch +cat > tmp/imagegen/prompts.jsonl << 'EOF' +{"prompt":"Cavernous hangar interior with a compact shuttle parked near the center","use_case":"stylized-concept","composition":"wide-angle, low-angle","lighting":"volumetric light rays through drifting fog","constraints":"no logos or trademarks; no watermark","size":"1536x1024"} +{"prompt":"Gray wolf in profile in a snowy forest","use_case":"photorealistic-natural","composition":"eye-level","constraints":"no logos or trademarks; no watermark","size":"1024x1024"} +EOF + +python "$IMAGE_GEN" generate-batch \ + --input tmp/imagegen/prompts.jsonl \ + --out-dir output/imagegen/batch \ + --concurrency 5 + +rm -f tmp/imagegen/prompts.jsonl +``` + +Notes: +- `generate-batch` requires `--out-dir`. +- Use `--concurrency` to control parallelism (default `5`). +- Per-job overrides are supported in JSONL (for example `size`, `quality`, `background`, `output_format`, `output_compression`, `moderation`, `n`, `model`, `out`, and prompt-augmentation fields). +- `--n` generates multiple variants for a single prompt; `generate-batch` is for many different prompts. +- In batch mode, per-job `out` is treated as a filename under `--out-dir`.
+- For many requested deliverable assets, provide one prompt/job per distinct asset and use semantic filenames when possible. + +## CLI notes +- Supported sizes depend on the model. `gpt-image-2` supports flexible constrained sizes; older GPT Image models support `1024x1024`, `1536x1024`, `1024x1536`, or `auto`. +- True transparent CLI outputs require `output_format` to be `png` or `webp` and are not supported by `gpt-image-2`. +- `--prompt-file`, `--output-compression`, `--moderation`, `--max-attempts`, `--fail-fast`, `--force`, and `--no-augment` are supported. +- This CLI is intended for GPT Image models. Do not assume older non-GPT image-model behavior applies here. + +## See also +- API parameter quick reference for fallback CLI mode: `references/image-api.md` +- Prompt examples shared across both top-level modes: `references/sample-prompts.md` +- Network/sandbox notes for fallback CLI mode: `references/codex-network.md` +- Built-in-first transparent image workflow: `SKILL.md` and `$CODEX_HOME/skills/.system/imagegen/scripts/remove_chroma_key.py` diff --git a/skills/imagegen/references/codex-network.md b/skills/imagegen/references/codex-network.md new file mode 100644 index 0000000..5ce1fbc --- /dev/null +++ b/skills/imagegen/references/codex-network.md @@ -0,0 +1,33 @@ +# Codex network approvals / sandbox notes + +This file is for the fallback CLI mode only. Read it when the user explicitly asks to use `scripts/image_gen.py` / CLI / API / model controls, or after the user explicitly confirms that a transparent-output request should use the `gpt-image-1.5` true-transparency fallback path. + +This guidance is intentionally isolated from `SKILL.md` because it can vary by environment and may become stale. Prefer the defaults in your environment when in doubt. + +## Why am I asked to approve image generation calls? +The fallback CLI uses the OpenAI Image API, so it needs outbound network access. 
In many Codex setups, network access is disabled by default and/or the approval policy requires confirmation before networked commands run. + +## Important note about approvals vs network +- `--ask-for-approval never` suppresses approval prompts. +- It does **not** by itself enable network access. +- In `workspace-write`, network access still depends on your Codex configuration (for example `[sandbox_workspace_write] network_access = true`). + +## How do I reduce repeated approval prompts? +If you trust the repo and want fewer prompts, use a configuration or profile that both: +- enables network for the sandbox mode you plan to use +- sets an approval policy that matches your risk tolerance + +Example `~/.codex/config.toml` pattern: + +```toml +approval_policy = "on-request" +sandbox_mode = "workspace-write" + +[sandbox_workspace_write] +network_access = true +``` + +If you want quieter automation after network is enabled, you can choose a stricter approval policy, but do that intentionally and with care. + +## Safety note +Enabling network and reducing approvals lowers friction, but increases risk if you run untrusted code or work in an untrusted repository. diff --git a/skills/imagegen/references/image-api.md b/skills/imagegen/references/image-api.md new file mode 100644 index 0000000..db8567d --- /dev/null +++ b/skills/imagegen/references/image-api.md @@ -0,0 +1,90 @@ +# Image API quick reference + +This file is for the fallback CLI mode only. Use it when the user explicitly asks to use `scripts/image_gen.py` / CLI / API / model controls, or after the user explicitly confirms that a transparent-output request should use the `gpt-image-1.5` true-transparency fallback path. + +These parameters describe the Image API and bundled CLI fallback surface. Do not assume they are normal arguments on the built-in `image_gen` tool. + +## Scope +- This fallback CLI is intended for GPT Image models (`gpt-image-2`, `gpt-image-1.5`, `gpt-image-1`, and `gpt-image-1-mini`). 
+- The built-in `image_gen` tool and the fallback CLI do not expose the same controls.
+
+## Model summary
+
+| Model | Quality | Input fidelity | Resolutions | Recommended use |
+| --- | --- | --- | --- | --- |
+| `gpt-image-2` | `low`, `medium`, `high`, `auto` | Always high fidelity for image inputs; do not set `input_fidelity` | `auto` or flexible sizes that satisfy the constraints below | Default for new CLI/API workflows: high-quality generation and editing, text-heavy images, photorealism, compositing, identity-sensitive edits, and workflows where fewer retries matter |
+| `gpt-image-1.5` | `low`, `medium`, `high`, `auto` | `low`, `high` | `1024x1024`, `1024x1536`, `1536x1024`, `auto` | True transparent-background fallback and backward-compatible workflows |
+| `gpt-image-1` | `low`, `medium`, `high`, `auto` | `low`, `high` | `1024x1024`, `1024x1536`, `1536x1024`, `auto` | Legacy compatibility |
+| `gpt-image-1-mini` | `low`, `medium`, `high`, `auto` | `low`, `high` | `1024x1024`, `1024x1536`, `1536x1024`, `auto` | Cost-sensitive draft batches and lower-stakes previews |
+
+## gpt-image-2 sizes
+
+`gpt-image-2` accepts `auto` or any `WIDTHxHEIGHT` size that satisfies all constraints:
+
+- Maximum edge length must be less than or equal to `3840px`.
+- Both edges must be multiples of `16px`.
+- Long edge to short edge ratio must not exceed `3:1`.
+- Total pixels must be at least `655,360` and no more than `8,294,400`.
+
+Popular sizes:
+
+| Label | Size | Notes |
+| --- | --- | --- |
+| Square | `1024x1024` | Typical fast default |
+| Landscape | `1536x1024` | Standard landscape |
+| Portrait | `1024x1536` | Standard portrait |
+| 2K square | `2048x2048` | Larger square output |
+| 2K landscape | `2048x1152` | Widescreen output |
+| 4K landscape | `3840x2160` | Widescreen 4K output |
+| 4K portrait | `2160x3840` | Vertical 4K output |
+| Auto | `auto` | Default size |
+
+Square images are typically fastest to generate. For 4K-style output, use `3840x2160` or `2160x3840`.
+
+## Endpoints
+- Generate: `POST /v1/images/generations` (`client.images.generate(...)`)
+- Edit: `POST /v1/images/edits` (`client.images.edit(...)`)
+
+## Core parameters for GPT Image models
+- `prompt`: text prompt
+- `model`: image model
+- `n`: number of images (1-10)
+- `size`: `auto` by default for `gpt-image-2`; flexible `WIDTHxHEIGHT` sizes are allowed only for `gpt-image-2`; older GPT Image models use `1024x1024`, `1536x1024`, `1024x1536`, or `auto`
+- `quality`: `low`, `medium`, `high`, or `auto`
+- `background`: output transparency behavior (`transparent`, `opaque`, or `auto`) for generated output; this is not the same thing as the prompt's visual scene/backdrop
+- `output_format`: `png` (default), `jpeg`, `webp`
+- `output_compression`: 0-100 (jpeg/webp only)
+- `moderation`: `auto` (default) or `low`
+
+## Edit-specific parameters
+- `image`: one or more input images. For GPT Image models, you can provide up to 16 images.
+- `mask`: optional mask image
+- `input_fidelity`: `low` or `high` only for models that support it; do not set this for `gpt-image-2`
+
+Model-specific note for `input_fidelity`:
+- `gpt-image-2` always uses high fidelity for image inputs and does not support setting `input_fidelity`.
+- `gpt-image-1` and `gpt-image-1-mini` preserve all input images, but the first image gets richer textures and finer details.
+- `gpt-image-1.5` preserves the first 5 input images with higher fidelity.
+
+## Transparent backgrounds
+
+`gpt-image-2` does not currently support the Image API `background=transparent` parameter. The skill's default transparent-image path is built-in `image_gen` with a flat chroma-key background, followed by local alpha extraction with `python "${CODEX_HOME:-$HOME/.codex}/skills/.system/imagegen/scripts/remove_chroma_key.py"`.
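
To make the idea of local alpha extraction from a flat chroma-key background concrete, here is a minimal per-pixel sketch. It is not the bundled `remove_chroma_key.py` implementation: the function names are hypothetical, and treating the two thresholds as "fully transparent below, fully opaque above, linear ramp in between" is an assumption made only for illustration.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def key_distance(pixel: RGB, key: RGB) -> int:
    """Largest per-channel difference between a pixel and the key color."""
    return max(abs(p - k) for p, k in zip(pixel, key))


def chroma_key_alpha(
    pixel: RGB,
    key: RGB = (0, 255, 0),          # flat green key color, as in `#00ff00`
    transparent_threshold: int = 12,  # assumed meaning: at or below this distance -> alpha 0
    opaque_threshold: int = 220,      # assumed meaning: at or above this distance -> alpha 255
) -> int:
    """Map a pixel's distance from the key color to an alpha value in 0..255."""
    distance = key_distance(pixel, key)
    if distance <= transparent_threshold:
        return 0
    if distance >= opaque_threshold:
        return 255
    # Linear ramp between the thresholds gives soft, antialiased edges.
    span = opaque_threshold - transparent_threshold
    return round(255 * (distance - transparent_threshold) / span)
```

A pixel that exactly matches the key color becomes fully transparent, a pixel far from it stays fully opaque, and edge pixels that blend subject and key color receive partial alpha.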
+
+Use CLI `gpt-image-1.5` with `background=transparent` and a transparent-capable output format such as `png` or `webp` only after the user explicitly confirms that fallback, unless they already requested `gpt-image-1.5`, `scripts/image_gen.py`, or CLI fallback. If the user asks for true/native transparency, the subject is too complex for clean chroma-key removal, or local background removal fails validation, explain the tradeoff and ask before switching.
+
+## Output
+- `data[]` list with `b64_json` per image
+- The bundled `scripts/image_gen.py` CLI decodes `b64_json` and writes output files for you.
+
+## Limits and notes
+- Input images and masks must be under 50MB.
+- Use the edits endpoint when the user requests changes to an existing image.
+- Masking is prompt-guided; exact shapes are not guaranteed.
+- Large sizes and high quality increase latency and cost.
+- Use `quality=low` for fast drafts, thumbnails, and quick iterations. Use `medium` or `high` for final assets, dense text, diagrams, identity-sensitive edits, or high-resolution outputs.
+- High `input_fidelity` can materially increase input token usage on models that support it.
+- If a request fails because a specific option is unsupported by the selected GPT Image model, retry manually without that option only when the option is not required by the user. If true transparent CLI output is required, ask before switching to `gpt-image-1.5` instead of dropping `background=transparent`, unless the user already explicitly chose that fallback.
+
+## Important boundary
+- `quality`, `input_fidelity`, explicit masks, `background`, `output_format`, and related parameters are fallback-only execution controls.
+- Do not assume they are built-in `image_gen` tool arguments.
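
The `gpt-image-2` size constraints listed under "gpt-image-2 sizes" can be checked locally before a request is sent. The following is a minimal sketch under that reading of the constraints; `validate_gpt_image_2_size` is a hypothetical helper and is not part of the bundled CLI.

```python
from typing import List


def validate_gpt_image_2_size(size: str) -> List[str]:
    """Return the violated constraints for a size string (empty list if valid)."""
    if size == "auto":
        return []
    try:
        width_text, height_text = size.lower().split("x")
        width, height = int(width_text), int(height_text)
    except ValueError:
        return [f"expected 'auto' or WIDTHxHEIGHT, got {size!r}"]
    if width <= 0 or height <= 0:
        return ["both edges must be positive"]

    problems: List[str] = []
    long_edge, short_edge = max(width, height), min(width, height)
    if long_edge > 3840:
        problems.append("maximum edge length is 3840px")
    if width % 16 or height % 16:
        problems.append("both edges must be multiples of 16px")
    if long_edge > 3 * short_edge:
        problems.append("long-to-short edge ratio must not exceed 3:1")
    if not 655_360 <= width * height <= 8_294_400:
        problems.append("total pixels must be between 655,360 and 8,294,400")
    return problems
```

Returning a list of messages rather than a boolean lets a caller report every violated constraint at once, for example when a requested size is both too large and not a multiple of 16.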
diff --git a/skills/imagegen/references/prompting.md b/skills/imagegen/references/prompting.md
new file mode 100644
index 0000000..9d2da42
--- /dev/null
+++ b/skills/imagegen/references/prompting.md
@@ -0,0 +1,118 @@
+# Prompting best practices
+
+These prompting principles are shared by both top-level modes of the skill:
+- built-in `image_gen` tool (default)
+- explicit `scripts/image_gen.py` CLI fallback
+
+This file is about prompt structure, specificity, and iteration. Fallback-only execution controls such as `quality`, `input_fidelity`, masks, output format, and output paths live in the fallback docs.
+
+## Contents
+- [Structure](#structure)
+- [Specificity policy](#specificity-policy)
+- [Allowed and disallowed augmentation](#allowed-and-disallowed-augmentation)
+- [Composition and layout](#composition-and-layout)
+- [Constraints and invariants](#constraints-and-invariants)
+- [Text in images](#text-in-images)
+- [Input images and references](#input-images-and-references)
+- [Iterate deliberately](#iterate-deliberately)
+- [Transparent images](#transparent-images)
+- [Fallback-only execution controls](#fallback-only-execution-controls)
+- [Use-case tips](#use-case-tips)
+- [Where to find copy/paste recipes](#where-to-find-copypaste-recipes)
+
+## Structure
+- Use a consistent order: scene/backdrop -> subject -> key details -> constraints -> output intent.
+- Include intended use (ad, UI mock, infographic) to set the level of polish.
+- For complex requests, use short labeled lines instead of one long paragraph.
+
+## Specificity policy
+- If the user prompt is already specific and detailed, normalize it into a clean spec without adding creative requirements.
+- If the prompt is generic, you may add tasteful detail when it materially improves the output.
+- Treat examples in `sample-prompts.md` as fully-authored recipes, not as the default amount of augmentation to add to every request.
+- For photorealism, include `photorealistic` directly when that is the goal, plus concrete real-world texture such as pores, wrinkles, fabric wear, material grain, or imperfect everyday detail.
+
+## Allowed and disallowed augmentation
+
+Allowed augmentation for generic prompts:
+- composition and framing cues
+- intended-use or polish-level hints
+- practical layout guidance
+- reasonable scene concreteness that supports the request
+
+Do not add:
+- extra characters, props, or objects that are not implied
+- brand palettes, slogans, or story beats that are not implied
+- arbitrary side-specific placement unless the surrounding layout supports it
+
+## Composition and layout
+- Specify framing and viewpoint (close-up, wide, top-down) and placement only when it materially helps.
+- Call out negative space if the asset clearly needs room for UI or copy.
+- Avoid making left/right layout decisions unless the user or surrounding layout supports them.
+- For people, describe body framing, scale, gaze, and object interactions when they matter (`full body visible`, `looking down at the book`, `hands naturally gripping the handlebars`).
+
+## Constraints and invariants
+- State what must not change (`keep background unchanged`).
+- For edits, say `change only X; keep Y unchanged` and repeat invariants on every iteration to reduce drift.
+
+## Text in images
+- Put literal text in quotes or ALL CAPS and specify typography (font style, size, color, placement).
+- Spell uncommon words letter-by-letter if accuracy matters.
+- For in-image copy, require verbatim rendering and no extra characters.
+- In CLI fallback mode, use `medium` or `high` quality for small text, dense infographics, data-heavy slides, multi-font layouts, legends, axes, and footnotes.
+
+## Input images and references
+- Do not assume that every provided image is an edit target.
+- Label each image by index and role (`Image 1: edit target`, `Image 2: style reference`).
+- If the user provides images for style, composition, or mood guidance and does not ask to modify them, treat the request as generation with references.
+- If the user asks to preserve an existing image while changing specific parts, treat the request as an edit.
+- For compositing, describe how the images interact (`place the subject from Image 2 into Image 1`).
+
+## Iterate deliberately
+- Start with a clean base prompt, then make small single-change edits.
+- Re-specify critical constraints when you iterate.
+- Prefer one targeted follow-up at a time over rewriting the whole prompt.
+
+## Transparent images
+- Use built-in `image_gen` first for transparent-image requests. If the subject is clearly too complex for chroma-key removal, explain the fallback and ask before switching to CLI.
+- Prompt for a perfectly flat solid chroma-key background, usually `#00ff00`; use `#ff00ff` when the subject is green, and avoid key colors that appear in the subject.
+- Explicitly prohibit shadows, gradients, floor planes, reflections, texture, and lighting variation in the background.
+- Ask for crisp edges, generous padding, and no use of the key color inside the subject.
+- After generation, remove the background locally with `python "${CODEX_HOME:-$HOME/.codex}/skills/.system/imagegen/scripts/remove_chroma_key.py" --input --out --auto-key border --soft-matte --transparent-threshold 12 --opaque-threshold 220 --despill` and validate the alpha result before shipping it.
+- Use soft matte and despill for antialiased edges; hard tolerance-only removal is mainly for flat pixel-art or exact-color fixtures.
+- Use CLI `gpt-image-1.5 --background transparent --output-format png` only after the user explicitly confirms the fallback, or when the user already explicitly requested `gpt-image-1.5`, `scripts/image_gen.py`, or CLI fallback. Ask first for true/native transparency requests, failed chroma-key validation, or complex transparent subjects such as hair, fur, glass, smoke, liquids, translucent materials, reflective objects, or soft shadows.
+
+## Fallback-only execution controls
+- `quality`, `input_fidelity`, explicit masks, output format, and output paths are fallback-only execution controls.
+- Do not assume they are built-in `image_gen` tool arguments.
+- If the user explicitly chooses CLI fallback, see `references/cli.md` and `references/image-api.md` for those controls.
+- In CLI fallback mode, `gpt-image-2` is the default. It supports `quality=low|medium|high|auto`; use `low` for fast drafts and thumbnails, and move to `medium`, `high`, or `auto` for final assets.
+- `gpt-image-2` always uses high fidelity for image inputs, so do not set `input_fidelity` with that model.
+- If a transparent request needs true CLI transparency, ask before using `gpt-image-1.5` unless the user already explicitly chose it. Explain that built-in chroma-key removal is the default path, but `gpt-image-2` does not support `background=transparent`.
+- If the user asks for 4K-style output with `gpt-image-2`, use `3840x2160` for landscape or `2160x3840` for portrait.
+
+## Use-case tips
+Generate:
+- photorealistic-natural: Prompt as if a real photo is captured in the moment; use photography language (lens, lighting, framing); call for real texture; avoid over-stylized polish unless requested.
+- product-mockup: Describe the product/packaging and materials; ensure clean silhouette and label clarity; if in-image text is needed, require verbatim rendering and specify typography.
+- ui-mockup: Describe the target fidelity first (shippable mockup or low-fi wireframe), then focus on layout, hierarchy, and practical UI elements; avoid concept-art language.
+- infographic-diagram: Define the audience and layout flow; label parts explicitly; require verbatim text; prefer higher quality in CLI mode for dense labels.
+- logo-brand: Keep it simple and scalable; ask for a strong silhouette and balanced negative space; avoid decorative flourishes unless requested.
+- ads-marketing: Write like a creative brief; include brand positioning, audience, desired vibe, scene, and exact tagline if text must appear.
+- productivity-visual: Name the exact artifact (slide, chart, workflow diagram), define the canvas and hierarchy, provide real labels/data, and ask for readable typography and polished spacing.
+- scientific-educational: Define audience, lesson objective, required labels, scientific constraints, arrows, and scan-friendly whitespace.
+- illustration-story: Define panels or scene beats; keep each action concrete.
+- stylized-concept: Specify style cues, material finish, and rendering approach (3D, painterly, clay) without inventing new story elements.
+- historical-scene: State the location/date and required period accuracy; constrain clothing, props, and environment to match the era.
+
+Edit:
+- text-localization: Change only the text; preserve layout, typography, spacing, and hierarchy; no extra words or reflow unless needed.
+- identity-preserve: Lock identity (face, body, pose, hair, expression); change only the specified elements; match lighting and shadows.
+- precise-object-edit: Specify exactly what to remove/replace; preserve surrounding texture and lighting; keep everything else unchanged.
+- lighting-weather: Change only environmental conditions (light, shadows, atmosphere, precipitation); keep geometry, framing, and subject identity.
+- background-extraction: For simple opaque subjects, request a clean cutout on a perfectly flat chroma-key background; crisp silhouette; generous padding; no shadows; no halos; preserve label text exactly; no restyling. Ask before using true CLI transparency for complex subjects.
+- style-transfer: Specify style cues to preserve (palette, texture, brushwork) and what must change; add `no extra elements` to prevent drift.
+- compositing: Reference inputs by index; specify what moves where; match lighting, perspective, and scale; keep the base framing unchanged.
+- sketch-to-render: Preserve layout, proportions, and perspective; choose materials and lighting that support the supplied sketch without adding new elements.
+
+## Where to find copy/paste recipes
+For copy/paste prompt specs (examples only), see `references/sample-prompts.md`. This file focuses on principles, specificity, and iteration patterns.
diff --git a/skills/imagegen/references/sample-prompts.md b/skills/imagegen/references/sample-prompts.md
new file mode 100644
index 0000000..d949295
--- /dev/null
+++ b/skills/imagegen/references/sample-prompts.md
@@ -0,0 +1,433 @@
+# Sample prompts (copy/paste)
+
+These prompt recipes are shared across both top-level modes of the skill:
+- built-in `image_gen` tool (default)
+- `scripts/image_gen.py` CLI fallback for explicit CLI/API/model requests or user-confirmed true-transparent-output fallback requests
+
+Use these as starting points. They are intentionally complete prompt recipes, not the default amount of augmentation to add to every user request.
+
+When adapting a user's prompt:
+- keep user-provided requirements
+- only add detail according to the specificity policy in `SKILL.md`
+- do not treat every example below as permission to invent extra story elements
+
+The labeled lines are prompt scaffolding, not a closed schema. `Asset type` and `Input images` are prompt-only scaffolding; the CLI does not expose them as dedicated flags.
+
+Execution details such as explicit CLI flags, `quality`, `input_fidelity`, masks, output formats, and local output paths depend on mode. Use the built-in tool by default, including simple transparent-image requests. For transparent images, prompt for a flat chroma-key background and remove it locally with `python "${CODEX_HOME:-$HOME/.codex}/skills/.system/imagegen/scripts/remove_chroma_key.py"`; only apply CLI-specific controls when the user explicitly opts into fallback mode or explicitly confirms that the transparent request should use true CLI transparency.
+
+CLI model notes:
+- `gpt-image-2` is the fallback CLI default for new workflows.
+- `gpt-image-2` supports `quality` values `low`, `medium`, `high`, and `auto`.
+- For 4K-style `gpt-image-2` output, use `3840x2160` or `2160x3840`.
+- If transparent output needs true CLI fallback, ask before using `gpt-image-1.5` unless the user already explicitly requested `gpt-image-1.5`, `scripts/image_gen.py`, or CLI fallback. Explain that built-in chroma-key removal is the default path, but `gpt-image-2` does not support `background=transparent`.
+- Do not set `input_fidelity` with `gpt-image-2`; image inputs already use high fidelity.
+
+For prompting principles (structure, specificity, invariants, iteration), see `references/prompting.md`.
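
Because the labeled lines are scaffolding rather than a closed schema, a prompt spec in this style can be assembled from whatever label/value pairs a request actually has. The sketch below illustrates that idea only; `build_prompt_spec` is a hypothetical helper, not something the skill or its CLI provides.

```python
from typing import Mapping


def build_prompt_spec(fields: Mapping[str, str]) -> str:
    """Join non-empty label/value pairs into one `Label: value` line each."""
    lines = []
    for label, value in fields.items():
        cleaned = value.strip()
        if cleaned:  # skip labels the request did not fill in
            lines.append(f"{label}: {cleaned}")
    return "\n".join(lines)


# Usage: unused labels such as Lighting/mood are simply left out of the spec.
spec = build_prompt_spec(
    {
        "Use case": "ui-mockup",
        "Primary request": "mobile app home screen for a local farmers market",
        "Asset type": "mobile app screen",
        "Lighting/mood": "",
        "Constraints": "practical layout, clear typography, no watermark",
    }
)
```

Skipping empty labels keeps the spec free of invented detail, in line with the guidance above not to add requirements the user did not imply.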
+
+## Generate
+
+### photorealistic-natural
+```
+Use case: photorealistic-natural
+Primary request: candid photo of an elderly sailor on a small fishing boat adjusting a net
+Scene/backdrop: coastal water with soft haze
+Subject: weathered skin with wrinkles and sun texture
+Style/medium: photorealistic candid photo
+Composition/framing: medium close-up, eye-level
+Lighting/mood: soft coastal daylight, shallow depth of field, subtle film grain
+Materials/textures: real skin texture, worn fabric, salt-worn wood
+Constraints: natural color balance; no heavy retouching; no glamorization; no watermark
+Avoid: studio polish; staged look
+```
+
+### product-mockup
+```
+Use case: product-mockup
+Primary request: premium product photo of a matte black shampoo bottle with a minimal label
+Scene/backdrop: clean studio gradient from light gray to white
+Subject: single bottle centered with subtle reflection
+Style/medium: premium product photography
+Composition/framing: centered, slight three-quarter angle, generous padding
+Lighting/mood: softbox lighting, clean highlights, controlled shadows
+Materials/textures: matte plastic, crisp label printing
+Constraints: no logos or trademarks; no watermark
+```
+
+### ui-mockup
+```
+Use case: ui-mockup
+Primary request: mobile app home screen for a local farmers market with vendors and daily specials
+Asset type: mobile app screen
+Style/medium: realistic product UI, not concept art
+Composition/framing: clean vertical mobile layout with clear hierarchy
+Constraints: practical layout, clear typography, no logos or trademarks, no watermark
+```
+
+### infographic-diagram
+```
+Use case: infographic-diagram
+Primary request: detailed infographic of an automatic coffee machine flow
+Scene/backdrop: clean, light neutral background
+Subject: bean hopper -> grinder -> brew group -> boiler -> water tank -> drip tray
+Style/medium: clean vector-like infographic with clear callouts and arrows
+Composition/framing: vertical poster layout, top-to-bottom flow
+Text (verbatim): "Bean Hopper", "Grinder", "Brew Group", "Boiler", "Water Tank", "Drip Tray"
+Constraints: clear labels, strong contrast, no logos or trademarks, no watermark
+```
+
+### scientific-educational
+```
+Use case: scientific-educational
+Primary request: biology diagram titled "Cellular Respiration at a Glance" for high school students
+Scene/backdrop: clean white classroom handout background
+Subject: glucose turns into energy inside a cell; include glycolysis, Krebs cycle, and electron transport chain
+Style/medium: flat scientific diagram with consistent icons, arrows, and readable labels
+Composition/framing: landscape slide-style layout with clear hierarchy and generous whitespace
+Text (verbatim): "Cellular Respiration at a Glance", "Glucose", "Pyruvate", "ATP", "NADH", "FADH2", "CO2", "O2", "H2O"
+Constraints: scientifically plausible; avoid tiny text; no extra decoration; no watermark
+```
+
+### logo-brand
+```
+Use case: logo-brand
+Primary request: original logo for "Field & Flour", a local bakery
+Style/medium: vector logo mark; flat colors; minimal
+Composition/framing: single centered logo on a plain background with generous padding
+Constraints: strong silhouette, balanced negative space; original design only; no gradients unless essential; no trademarks; no watermark
+```
+
+### illustration-story
+```
+Use case: illustration-story
+Primary request: 4-panel comic about a pet left alone at home
+Scene/backdrop: cozy living room across panels
+Subject: pet reacting to the owner leaving, then relaxing, then returning to a composed pose
+Style/medium: comic illustration with clear panels
+Composition/framing: 4 equal-sized vertical panels, readable actions per panel
+Constraints: no text; no logos or trademarks; no watermark
+```
+
+### stylized-concept
+```
+Use case: stylized-concept
+Primary request: cavernous hangar interior with tall support beams and drifting fog
+Scene/backdrop: industrial hangar interior, deep scale, light haze
+Subject: compact shuttle parked near the center
+Style/medium: cinematic concept art, industrial realism
+Composition/framing: wide-angle, low-angle
+Lighting/mood: volumetric light rays cutting through fog
+Constraints: no logos or trademarks; no watermark
+```
+
+### ads-marketing
+```
+Use case: ads-marketing
+Primary request: campaign image for a streetwear brand called Thread
+Subject: group of friends hanging out together in a stylish urban setting
+Style/medium: polished youth streetwear campaign photography
+Composition/framing: vertical ad layout with natural poses and integrated headline space
+Lighting/mood: contemporary, energetic, tasteful
+Text (verbatim): "Yours to Create."
+Constraints: render the tagline exactly once; clean legible typography; no extra text; no watermarks; no unrelated logos
+```
+
+### productivity-visual
+```
+Use case: productivity-visual
+Primary request: one pitch-deck slide titled "Market Opportunity"
+Asset type: fundraising slide image
+Style/medium: clean modern deck slide, white background, crisp sans-serif typography
+Subject: TAM/SAM/SOM concentric-circle diagram plus a small growth bar chart from 2021 to 2026
+Composition/framing: 16:9 landscape slide, clear data hierarchy, polished spacing
+Text (verbatim): "Market Opportunity", "TAM: $42B", "SAM: $8.7B", "SOM: $340M", "AGI Research, 2024", "Internal analysis"
+Constraints: readable labels, no clip art, no stock photography, no decorative clutter, no watermark
+```
+
+### historical-scene
+```
+Use case: historical-scene
+Primary request: outdoor crowd scene in Bethel, New York on August 16, 1969
+Scene/backdrop: open field with period-appropriate staging
+Subject: crowd in period-accurate clothing, authentic environment
+Style/medium: photorealistic photo
+Composition/framing: wide shot, eye-level
+Constraints: period-accurate details; no modern objects; no logos or trademarks; no watermark
+```
+
+## Asset type templates (taxonomy-aligned)
+
+### Website assets template
+```
+Use case:
+Asset type:
+Primary request:
+Scene/backdrop:
+Subject:
+Style/medium:
+Composition/framing:
+Lighting/mood:
+Color palette:
+Constraints:
+```
+
+### Website assets example: minimal hero background
+```
+Use case: stylized-concept
+Asset type: landing page hero background
+Primary request: minimal abstract background with a soft gradient and subtle texture
+Style/medium: matte illustration / soft-rendered abstract background
+Composition/framing: wide composition with usable negative space for page copy
+Lighting/mood: gentle studio glow
+Color palette: restrained neutral palette
+Constraints: no text; no logos; no watermark
+```
+
+### Website assets example: feature section illustration
+```
+Use case: stylized-concept
+Asset type: feature section illustration
+Primary request: simple abstract shapes suggesting connection and flow
+Scene/backdrop: subtle light-gray backdrop with faint texture
+Style/medium: flat illustration; soft shadows; restrained contrast
+Composition/framing: centered cluster; open margins for UI
+Color palette: muted neutral palette
+Constraints: no text; no logos; no watermark
+```
+
+### Website assets example: blog header image
+```
+Use case: photorealistic-natural
+Asset type: blog header image
+Primary request: overhead desk scene with notebook, pen, and coffee cup
+Scene/backdrop: warm wooden tabletop
+Style/medium: photorealistic photo
+Composition/framing: wide crop with clean room for page copy
+Lighting/mood: soft morning light
+Constraints: no text; no logos; no watermark
+```
+
+### Game assets template
+```
+Use case: stylized-concept
+Asset type:
+Primary request:
+Scene/backdrop: (if applicable)
+Subject:
+Style/medium: ;
+Composition/framing: ; ;
+Lighting/mood: