fix: split log files daily and clean up logs older than 30 days#16
fix: split log files daily and clean up logs older than 30 days#16ActivePeter wants to merge 14 commits into
Conversation
ActivePeter
left a comment
There was a problem hiding this comment.
Blocking issue found.
deployment/templates/gen_bare_deploy_bash/selection_supervisor_launch_wait_block.sh.tmpl:1: this starts{{RUN_CMD}}in the background inside command substitution, but the PR removed the previous stdout/stderr redirection to$LOGFILE. In bash, a background job inside$(...)that keeps stdout open keeps the substitution pipe open, soSUPERVISOR_PID=$( ... & echo "$!" )waits until the supervisor exits instead of returning immediately. For any long-lived service, generatedstart_*.shhangs before it can runwait_service_probably_ready_pid_tree, affecting both standalone and atomic-group starts. Please detach the supervisor fds again, for example by redirecting stdout/stderr to"$LOGFILE"or/dev/nullbefore&, and keep a regression test that executes a generated start against a long-lived child.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding.
- The PR adds a new top-level
skills/tree with 100 files, including Codex/OpenAI skill bundles, scripts, icons, prompt skills, and installer helpers. This does not appear to be part of Fluxon runtime, tests, docs, or packaging: outside that tree the only references are exclusion rules/tests (scripts/source_selection_profiles.pyexcludesskills/, and packaging tests assertskills/...is not selected). That means the repository now carries a large tool-local asset tree while the project artifacts intentionally ignore it. This works against the repo rules to keep surfaces concise, avoid parallel/non-canonical module surfaces, and keep long-lived docs/tooling focused on Fluxon. Please removeskills/from this PR and keep those Codex skills installed via the user/agent skill mechanism or an external cache, not vendored in the Fluxon repository.
ActivePeter
left a comment
There was a problem hiding this comment.
Correction to my earlier review comment on deployment/templates/gen_bare_deploy_bash/selection_supervisor_launch_wait_block.sh.tmpl:1:
I re-tested the generated script in the real ${HOSTWORKDIR}/gen_bare_deploy_bash layout with a long-lived child. The start script returns after the 10s probable-ready gate, and the child output lands in the daily shard log. The reason is that generated selection_supervisor.py redirects its own stdout/stderr to the runtime log before entering the long-running loop, which closes the command-substitution pipe. So the earlier claim that SUPERVISOR_PID=$( ... & echo "$!" ) hangs until the supervisor exits is not correct for the current generated supervisor path. Please disregard that blocking conclusion.
The separate review finding about the vendored top-level skills/ tree still stands.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding.
fluxon_py/config.py:594adds an owner-modelarge_file_pathsrequirement, but the PR only applies it into_fluxon_kv_client_config_yaml_str()and leaves existing owner fixtures unsynchronized. Runningpython3 fluxon_py/tests/test_config.pyon the PR currently fails three registered checks:fluxon_client_config_yaml_shape(fluxon_py/tests/test_config.py:353),fluxonkv_runtime_defaults_are_internal(fluxon_py/tests/test_config.py:410), andfluxonkv_test_spec_config(fluxon_py/tests/test_config.py:474), all withfluxonkv_spec.large_file_paths is required for owner mode. This is a CI-level regression. It also leaves the Python public contract split:FluxonKvClientConfig(...)still accepts an owner config missinglarge_file_paths, and the zero-contribution path atfluxon_py/config.py:578still serializeslarge_file_pathseven though the Rust schema now rejects it. Please centralize the new role-based validation in the Python config contract, keep it aligned with Rust, and update the existing owner fixtures so the full config test suite passes.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding.
fluxon_rs/fluxon_kv/src/config.rs:1086makesfluxonkv_spec.large_file_pathsmandatory for owner mode, but two public owner startup examples still build owner configs without that field:examples/start_master_owner.py:117andexamples/start_kv_and_fs_svc.py:188. On this PR, both reproduce asValueError: fluxonkv_spec.large_file_paths is required for owner modewhen theirbuild_owner_config()output is passed throughFluxonKvClientConfig(...).to_fluxon_kv_client_config_yaml_str(), which is exactly the path used beforenew_store(...)starts the owner.examples/start_master_owner.pyis also the recommended service-plane entry in the KV/MQ user docs (fluxon_doc_cn/user_doc/用户 - 3 - KV-RPC接口.md:187,fluxon_doc_en/user_doc/User - 3 - KV and RPC Interface.md:162, and the service-plane/MQ pages). Please update these canonical examples and the copied doc snippets at the same time as the new owner contract, and add a lightweight test that serializes the example owner configs so public quick starts do not drift from the runtime schema again.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding.
fluxon_rs/fluxon_kv/src/config.rs:944now rejectsfluxonkv_spec.large_file_pathsfor every zero-contribution YAML config, but the side-transfer worker YAML fixtures in the same file still include that owner-only field and then callcfg.verify().unwrap()(fluxon_rs/fluxon_kv/src/config.rs:1919,fluxon_rs/fluxon_kv/src/config.rs:1961, andfluxon_rs/fluxon_kv/src/config.rs:1995). Those configs have nocontribute_to_cluster_pool_sizeandtest_spec_config.side_transfer_role: worker, so they enter the zero-contribution branch and fail before reaching the expected assertions. This also conflicts with the runtime generator, which correctly omitslarge_file_pathsfor side-worker YAML atfluxon_rs/fluxon_kv/src/lib.rs:842and asserts that omission atfluxon_rs/fluxon_kv/src/lib.rs:2746. Please remove the owner-only field from these side-worker YAML contract tests so the test fixtures match the new single external/side-worker contract. I could compile the targeted Rust test here, but the harness cannot execute in this checkout becauselibfluxon_commu_core.sois missing from the runtime library path; the branch contradiction above is still deterministic from the code.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding.
fluxon_test_stack/start_test_bed.py:1504still returns the bare service logical base path ($HOSTWORKDIR/log/<service>.log), and_collect_bare_runtime_statuses()then reads that path directly atfluxon_test_stack/start_test_bed.py:3415. This PR moved generated bare service stdout/stderr to daily shards throughdeployment/utils/selection_supervisor_codegen.py:753(<service>.<YYYY-MM-DD>.log), and the shared helper already exposeslog_shard.resolve_readable_log_path()for that exact base-to-shard lookup (deployment/utils/log_shard.py:102). With onlymaster.2026-06-23.logpresent, the current status code reportsmaster.logand omitsservice_log_tailfrom the failure detail, so start-testbed diagnostics lose the actual service log after the shard migration. Please resolve the readable shard path before returning/reporting the bare runtime log, and extenddeployment/tests/test_start_test_bed_bootstrap_log.py:155with a non-TiKV sharded-log case so this consumer stays aligned with the new log contract.
ActivePeter
left a comment
There was a problem hiding this comment.
Review summary: changes are still needed before this PR should merge.
The blocking issues I found are:
- The new top-level
skills/tree vendors 100 Codex/OpenAI-local skill files, while Fluxon artifacts only exclude it and do not use it. Please keep those outside this repository. - The new owner-mode
large_file_pathscontract breakspython3 fluxon_py/tests/test_config.pyin three registered checks and leaves Python/Rust config validation inconsistent. - Canonical public owner startup examples still omit owner
large_file_paths, so documented quick-start paths now fail before startup. - Rust side-transfer worker YAML tests still include owner-only
large_file_pathseven though zero-contribution YAML now rejects it. - Bare start-testbed runtime status still reads the logical base log path after bare service logs moved to daily shards, so failure diagnostics can miss the real service log tail.
Correction note: my earlier supervisor command-substitution hang comment was re-tested and explicitly retracted in a follow-up comment; that item should not be treated as an active finding.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the follow-up design/contract sweep.
fluxon_test_stack/test_runner.py:59andfluxon_test_stack/start_test_bed.py:30import the new daily-shard helper asfrom utils import log_shardafter addingdeployment/ordeployment/utilstosys.path. That is not a stable module surface in this repo becausesetup_and_pack/pack_test_stack_rsc.py:32already imports the package nameutilsfromsetup_and_pack/utils. In a long-lived Python process that importspack_test_stack_rsc.pyfirst and then importstest_runner.pywith the normal test-runner path, the second import fails deterministically withImportError: cannot import name 'log_shard' from 'utils' (.../setup_and_pack/utils/__init__.py). The same preloadedutilspackage also breaks importingstart_test_bed.py. This is an import-order bug and a naming-contract regression: the PR reuses the right log-shard implementation, but exposes it through an ambiguous top-level module name that collides with an existing repo helper package. Please load the canonicaldeployment/utils/log_shard.pyby an explicit path/module name or package-qualified surface, and cover the import-order case so teststack and packaging helpers can coexist in one interpreter.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the config/runtime contract sweep.
fluxon_rs/fluxon_kv/src/lib.rs:626now makes the publicload_client_config(...)path callbootstrap_zero_contribution_client_config(...), which waits inwait_for_external_bootstrap_bundle(...)until the ownershared.jsonandmmap.fileare present and valid (fluxon_rs/fluxon_kv/src/lib.rs:1615). That changes a config-load/verify API into a runtime attach operation for every zero-contribution config. The drift shows up directly influxon_rs/fluxon_kv/examples/kvcli_example.rs:249: the comment says "Validate config early for clearer errors", but the call at line 250 now waits for the owner bundle before initialization. Theninit_client(config_arg)re-entersrun_client_impl(...), which callsload_client_config(...)again atfluxon_rs/fluxon_kv/src/lib.rs:1853; for external configs it also callswait_for_external_bootstrap_bundle(...)again atfluxon_rs/fluxon_kv/src/lib.rs:1923. So one external CLI startup can perform owner-bundle bootstrap as an early validation side effect, then repeat the same bootstrap during initialization. This violates the repo contract to keep one semantic operation on one primary path, and makes the public API name/type misleading: callers asking only to load a config can now block on runtime state that is outside the config file. Please keepload_client_configas the pure load/verify contract and fold zero-contribution owner-bundle bootstrap into the actual runtime initialization path once, returning both the bootstrapped config and shared metadata there; or introduce a separately named runtime-bootstrap helper if that blocking attach step is intentionally public.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the test-runner scene-contract sweep.
fluxon_test_stack/test_runner.py:357now treats any scene id with theci_top_attention_prefix as allowed, but the runner-native CI command surface remains a closed set of three ids atfluxon_test_stack/test_runner.py:361andfluxon_test_stack/test_runner.py:7662. With a normal suite shape, a scene namedci_top_attention_unknownparses and expands successfully, then fails only when_build_ci_execution_plan(...)reachesfluxon_test_stack/test_runner.py:7748withscene[ci_top_attention_unknown] does not declare a runner-native CI command branch. This is exactly the kind of open-ended branch proliferation the repo rules try to avoid: the schema validator accepts a broader public scene surface than the executor implements. Please make the allowed scene set explicit and finite, for example by deriving validation from_runner_native_ci_scene_ids()or by registering each top-attention entry through one canonical mapping that both validation and command generation use. That keeps new top-attention scenes exhaustive at parse time and avoids late runtime failures from a prefix-only admission rule.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the log-shard helper sweep.
- The Rust log-shard resolver does not match the Python helper's tolerant shard scan. In
fluxon_rs/fluxon_util/src/log.rs:337,latest_existing_daily_sharded_log_path(...)iterates same-prefix*.logfiles, butentry.file_name().to_str()?at line 351 andchrono::NaiveDate::parse_from_str(...).ok()?at line 359 returnNonefor the whole function when any matching entry has a non-UTF8 name or an invalid date suffix. The Python source of the same contract explicitly skips invalid shard names (deployment/utils/log_shard.py:89), so a single leftover file such asworkload__Deployment__demo.tmp.logorworkload__Deployment__demo.not-a-date.logcan make Rustresolve_readable_log_path(...)ignore a validworkload__Deployment__demo.2026-06-20.log. That directly affects ops log reading:fluxon_rs/fluxon_ops/src/lib.rs:3000falls back to the logical base path and then returnsstat log failedif the base file does not exist. Please make the Rust scanner skip malformed candidates and add a test besidefluxon_rs/fluxon_ops/src/lib.rs:14830orfluxon_rs/fluxon_util/tests/log_mgmt.rs:110that includes one invalid same-prefix shard plus one valid dated shard, so the Rust and Python shard contracts stay aligned.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the ops log-tail contract sweep.
fluxon_rs/fluxon_ops/src/lib.rs:3000resolves eachread_workload_logrequest from the logical base path to whichever daily shard is current/latest, but the cursor contract atfluxon_rs/fluxon_ops/src/lib.rs:592only returnsfile_size,start_offset, andend_offset; it never tells the caller which concrete shard file those offsets belong to. The embedded UI then stores onlyworkloadLogEndOffsetand pollsForwardwith that offset (fluxon_rs/fluxon_ops/src/lib.rs:8512). Afterdeployment/utils/selection_supervisor_codegen.py:753rolls workload stdout/stderr fromworkload.logtoworkload.<next-date>.log, the next poll transparently targets the new shard while reusing the old shard offset. If the new shard is smaller, the agent returnscursor out of rangeatfluxon_rs/fluxon_ops/src/lib.rs:3025; if it has already grown past that offset, the UI starts reading from the middle of the new shard and drops the beginning of the day's log. This is a contract bug in the new sharded log surface, not just a UI display issue: offsets are file-relative but the API does not expose a file identity. Please return a stableresolved_log_path/shard id inReadWorkloadLogRespand have the UI reset or append a shard-boundary marker when it changes, or make the RPC accept and validate the shard identity with the cursor. Add a regression test that simulates two daily shards and verifies a tail poll after rollover does not error and does not skip the new shard prefix.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the public-doc contract sweep.
- The PR moves KV runtime logs and profiles out of
shared_file_pathand intolarge_file_paths.log_root_path:fluxon_rs/fluxon_kv/src/lib.rs:1872buildskv_logs_dirfromconfig.large_file_paths.log_root_path, andfluxon_rs/fluxon_kv/src/lib.rs:2427does the same forkv_profiles_dir. The owner also publishes that split throughSharedJsonMeta.large_file_pathsatfluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs:1183, so external clients inherit it fromshared.jsonwhile zero-contribution YAML must omit the owner-only field. The user docs still describeshared_file_pathas the authority forshared.json, logs, and profiles in both languages (fluxon_doc_en/user_doc/User - 1 - Architecture and Concepts.md:81,fluxon_doc_cn/user_doc/用户 - 1 - 架构和概念.md:65,fluxon_doc_en/user_doc/User - 3 - KV and RPC Interface.md:388,fluxon_doc_cn/user_doc/用户 - 3 - KV-RPC接口.md:617). That now points users at the wrong directory for runtime artifacts and hides the new owner/external contract split. Please update the EN/CN user docs together: makeshared_file_pathonly theshared.json/metadata attachment authority, documentlarge_file_paths.log_root_pathandcache_root_pathas owner-side physical runtime roots, and show that external zero-contribution configs inherit them instead of declaring them.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the source-selection contract sweep.
scripts/git_source_selection.py:22now hard-codes onegit ls-filesmode for every source-selection profile, including--others --exclude-standard. That is fine for a CI source pack if the goal is to include local untracked-but-not-ignored test files, but the same helper also feeds the public build seed throughsetup_and_pack/public_workspace_contract.py:37, thensetup_and_pack/nix/pack_fluxonkv_pylib.py:213uses that list as the Pyo3 workspace/wheel input set andsetup_and_pack/nix/lib_layout.py:751materializes it into the bridge-prebuiltworkspace_seed. The published profile manifest then digests the whole workspace seed atsetup_and_pack/nix/pack_fluxonkv_pylib.py:2812. I reproduced the current behavior with a temp repo where mockedgit ls-files --cached --others --exclude-standard -zreturnedconfig.py\0local_probe.py\0;collect_public_workspace_input_relative_paths(...)returned bothfluxon_py/config.pyand the untrackedfluxon_py/local_probe.py. This makes wheel/cache digests and published workspace seeds depend on arbitrary local scratch files under selected roots, so two developers can publish different artifacts from the same commit, and an accidental untracked file can enter the public workspace. Please keep the shared profile abstraction, but make the untracked-file policy an explicit bounded branch: tracked-only forbuild_seed/published public workspace inputs, and opt-in include-untracked only forsource_packif that is really the CI artifact contract. Add a regression test that creates or mocks an untracked file underfluxon_py/and provescollect_public_workspace_input_relative_paths()excludes it while_collect_ci_source_relpaths()keeps whatever policy is intended for CI source packs.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the bare startup-gate sweep.
deployment/utils/proc_lifecycle_codegen.py:167documentswait_service_probably_ready_pid_tree(...)as requiring one supervised direct child PID to stay unchanged across the fixed startup window, but the implementation only records the first child PID at line 212 and then returns success as soon asnow >= deadline_tsif any child was ever observed (lines 219-226). Thestartup_window_secondsargument is validated but never used to measure stability duration. This is paired withdeployment/gen_bare_deploy_bash.py:50-53, which now sets both the required stable window and the entire startup deadline to 10 seconds, and the generated standalone/atomic paths pass those equal values at lines 621-622 and 695-696. A child that appears in the last fraction of the deadline is therefore accepted immediately even though it has not survived the advertised stable window, and atomic-group startup can advance to the next service before the previous member has been stable. I reproduced the helper directly by starting a parent that sleeps ~1.7s, then creates its only direct child, and callingwait_service_probably_ready_pid_tree svc <pid> 2 <now+2> '[repro]'; it returnedrc=0 elapsed_ms=1722, proving it did not wait for 2 seconds of child stability. Please either trackobserved_child_sinceand requirenow - observed_child_since >= startup_window_seconds, with a separate deadline that allows that full window after the child appears, or rename/remove the stability contract if the intended behavior is only "child appeared once before deadline". Add a regression test where the child appears late relative to the deadline and assert the helper does not report ready until the full stability window has elapsed.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the Python/Rust config contract sweep.
-
fluxon_py/config.py:579updates the zero-contribution forbidden-field list for the Python publicFluxonKvClientConfigpath, but it does not include the new owner-onlyfluxonkv_spec.large_file_paths. Rust now rejects that field in zero-contribution YAML atfluxon_rs/fluxon_kv/src/config.rs:944, and the design docs say externals inheritlarge_file_pathsfrom ownershared.json, so the Python facade should reject the same input before rendering YAML. On this PR, this snippet is accepted and emits YAML that Rust will later reject:from fluxon_py.config import FluxonKvClientConfig FluxonKvClientConfig({ 'instance_key': 'external_with_large_paths', 'fluxonkv_spec': { 'cluster_name': 'demo', 'shared_memory_path': '/tmp/fluxon_shm', 'shared_file_path': '/tmp/fluxon_shared', 'large_file_paths': { 'log_root_path': '/tmp/fluxon_logs', 'cache_root_path': '/tmp/fluxon_cache', }, }, }).to_fluxon_kv_client_config_yaml_str()
I also checked the existing forbidden fields through the same path:
etcd_addresses,sub_cluster, andredis_compatare rejected, whilelarge_file_pathsis accepted. Please addlarge_file_pathsto the Python zero-contribution forbidden list in both validation surfaces and add a Python contract test mirroring the Rustclient_config_zero_contribution_rejects_large_file_paths_in_yamltest.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the bare deploy contract sweep.
deployment/gen_bare_deploy_bash.py:52changesSTANDALONE_STARTUP_DEADLINE_SECONDSfrom 60 to 10, anddeployment/gen_bare_deploy_bash.py:53changesATOMIC_GROUP_STARTUP_DEADLINE_SECONDSfrom 10 minutes to 10 seconds. Those constants feed the generated public bare start scripts atdeployment/gen_bare_deploy_bash.py:618-623anddeployment/gen_bare_deploy_bash.py:693-697. The PR also removes the single group-levelGROUP_STARTUP_DEADLINE_TSand now gives each atomic-group service its ownSTARTUP_DEADLINE_TS=$(( $(date +%s) + 10 )), which the updated tests assert. That is a behavioral contraction of the production startup contract, not just a template refactor: any standalone service that needs more than 10s to expose a stable child, or any atomic-group member that needs more than 10s during rollout, now fails bare bootstrap even though the previous generated scripts allowed 60s / 600s. This conflicts with the repo rule to keep teststack/bare contracts explicit and bounded rather than letting test-oriented constants leak into the shared deploy path. Please keep the deadline constants at their previous production values or split test-only fast windows from the generated deployment defaults.
ActivePeter
left a comment
There was a problem hiding this comment.
Additional review finding from the shared-supervisor module-boundary sweep.
deployment/utils/selection_supervisor_codegen.py:72generates_load_log_shard_helper()with an open-ended search over the generated script directory, current working directory,cwd/deployment/utils, and everysys.pathentry, then loads the firstlog_shard.pyit finds. That undermines the PR's own single-source contract atdeployment/utils/selection_supervisor_codegen.py:5: both public materializers already publish the helper as a concrete sibling artifact (deployment/gen_bare_deploy_bash.py:115-121writesselection_supervisor.pyandlog_shard.pyinto the same outdir;fluxon_rs/fluxon_ops/src/lib.rs:984-1040writes both into the sameselection_supervisorruntime dir). With the current generated runtime, a missing/stale sibling silently falls through to cwd orsys.pathand can bind a different helper version, so bare and ops supervisors no longer have one canonical module pair. This is the kind of implementation probing the repo rules try to avoid: fail fast if the sibling helper is missing or incompatible, and make the generated supervisor load only the co-materializedPath(__file__).with_name("log_shard.py")(or pass an explicit helper path if that is meant to be a real contract).
| "git", | ||
| "ls-files", | ||
| "--cached", | ||
| "--others", |
There was a problem hiding this comment.
这会把所有 untracked 文件也带进同一个 source selection helper。collect_public_workspace_input_relative_paths() 走 build_seed profile,后面 pack_fluxonkv_pylib.py / lib_layout.py 会把结果物化为 public workspace seed;这样发布输入不再只由提交内容决定,本地临时文件、未 stage 的实验文件也会进入 release/cache digest。建议把 --others 收窄到明确的 dev/source-pack profile,public build_seed 默认只选 tracked 文件;需要额外文件时用显式 include。
|
|
||
| let path = self.log_dir.join(log_filename); | ||
| let logical_path = self.log_dir.join(log_filename); | ||
| let path = resolve_readable_log_path(&logical_path).unwrap_or(logical_path.clone()); |
There was a problem hiding this comment.
这里每次请求都会重新把 logical base 解析成 current/latest 日分片,但 ReadWorkloadLogResp 只返回 file_size/start_offset/end_offset,没有实际 shard/path identity。UI 后续用旧 end_offset 继续读时,如果跨午夜或 current shard 创建,offset 会被套到另一个文件上,结果可能缺日志、重复日志或 out-of-range。这个 API 需要在响应中返回并在后续请求中校验/传回 resolved shard identity,或者在 rollover 时显式要求客户端 reset。
| cwd = Path.cwd().resolve() | ||
| candidates.append(cwd / "__LOG_SHARD_HELPER_FILENAME__") | ||
| candidates.append(cwd / "deployment" / "utils" / "__LOG_SHARD_HELPER_FILENAME__") | ||
| for entry in sys.path: |
There was a problem hiding this comment.
生成的 selection_supervisor.py 和 log_shard.py 已经由 gen_bare / ops embedded runtime 同目录发布,这里再从 cwd、deployment/utils 和整个 sys.path 搜索会把运行时 authority 打开:同目录 helper 缺失或旧版本时,脚本会静默加载仓库里或环境里的另一个 log_shard.py。这违反“一条公共路径一个明确契约”的规约,也让生成脚本依赖启动 cwd。建议只接受 __file__.with_name(...) 的同目录 helper,缺失就 fail fast,或由 materializer 显式传入 helper path。
No description provided.