Skip to content

fix: split log files daily and clean up logs older than 30 days#16

Open
ActivePeter wants to merge 14 commits into
mainfrom
padev14
Open

fix: split log files daily and clean up logs older than 30 days#16
ActivePeter wants to merge 14 commits into
mainfrom
padev14

Conversation

@ActivePeter

Copy link
Copy Markdown
Collaborator

No description provided.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issue found.

  • deployment/templates/gen_bare_deploy_bash/selection_supervisor_launch_wait_block.sh.tmpl:1: this starts {{RUN_CMD}} in the background inside command substitution, but the PR removed the previous stdout/stderr redirection to $LOGFILE. In bash, a background job inside $(...) that keeps stdout open keeps the substitution pipe open, so SUPERVISOR_PID=$( ... & echo "$!" ) waits until the supervisor exits instead of returning immediately. For any long-lived service, generated start_*.sh hangs before it can run wait_service_probably_ready_pid_tree, affecting both standalone and atomic-group starts. Please detach the supervisor fds again, for example by redirecting stdout/stderr to "$LOGFILE" or /dev/null before &, and keep a regression test that executes a generated start against a long-lived child.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding.

  • The PR adds a new top-level skills/ tree with 100 files, including Codex/OpenAI skill bundles, scripts, icons, prompt skills, and installer helpers. This does not appear to be part of Fluxon runtime, tests, docs, or packaging: outside that tree the only references are exclusion rules/tests (scripts/source_selection_profiles.py excludes skills/, and packaging tests assert skills/... is not selected). That means the repository now carries a large tool-local asset tree while the project artifacts intentionally ignore it. This works against the repo rules to keep surfaces concise, avoid parallel/non-canonical module surfaces, and keep long-lived docs/tooling focused on Fluxon. Please remove skills/ from this PR and keep those Codex skills installed via the user/agent skill mechanism or an external cache, not vendored in the Fluxon repository.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction to my earlier review comment on deployment/templates/gen_bare_deploy_bash/selection_supervisor_launch_wait_block.sh.tmpl:1:

I re-tested the generated script in the real ${HOSTWORKDIR}/gen_bare_deploy_bash layout with a long-lived child. The start script returns after the 10s probable-ready gate, and the child output lands in the daily shard log. The reason is that generated selection_supervisor.py redirects its own stdout/stderr to the runtime log before entering the long-running loop, which closes the command-substitution pipe. So the earlier claim that SUPERVISOR_PID=$( ... & echo "$!" ) hangs until the supervisor exits is not correct for the current generated supervisor path. Please disregard that blocking conclusion.

The separate review finding about the vendored top-level skills/ tree still stands.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding.

  • fluxon_py/config.py:594 adds an owner-mode large_file_paths requirement, but the PR only applies it in to_fluxon_kv_client_config_yaml_str() and leaves existing owner fixtures unsynchronized. Running python3 fluxon_py/tests/test_config.py on the PR currently fails three registered checks: fluxon_client_config_yaml_shape (fluxon_py/tests/test_config.py:353), fluxonkv_runtime_defaults_are_internal (fluxon_py/tests/test_config.py:410), and fluxonkv_test_spec_config (fluxon_py/tests/test_config.py:474), all with fluxonkv_spec.large_file_paths is required for owner mode. This is a CI-level regression. It also leaves the Python public contract split: FluxonKvClientConfig(...) still accepts an owner config missing large_file_paths, and the zero-contribution path at fluxon_py/config.py:578 still serializes large_file_paths even though the Rust schema now rejects it. Please centralize the new role-based validation in the Python config contract, keep it aligned with Rust, and update the existing owner fixtures so the full config test suite passes.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding.

  • fluxon_rs/fluxon_kv/src/config.rs:1086 makes fluxonkv_spec.large_file_paths mandatory for owner mode, but two public owner startup examples still build owner configs without that field: examples/start_master_owner.py:117 and examples/start_kv_and_fs_svc.py:188. On this PR, both reproduce as ValueError: fluxonkv_spec.large_file_paths is required for owner mode when their build_owner_config() output is passed through FluxonKvClientConfig(...).to_fluxon_kv_client_config_yaml_str(), which is exactly the path used before new_store(...) starts the owner. examples/start_master_owner.py is also the recommended service-plane entry in the KV/MQ user docs (fluxon_doc_cn/user_doc/用户 - 3 - KV-RPC接口.md:187, fluxon_doc_en/user_doc/User - 3 - KV and RPC Interface.md:162, and the service-plane/MQ pages). Please update these canonical examples and the copied doc snippets at the same time as the new owner contract, and add a lightweight test that serializes the example owner configs so public quick starts do not drift from the runtime schema again.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding.

  • fluxon_rs/fluxon_kv/src/config.rs:944 now rejects fluxonkv_spec.large_file_paths for every zero-contribution YAML config, but the side-transfer worker YAML fixtures in the same file still include that owner-only field and then call cfg.verify().unwrap() (fluxon_rs/fluxon_kv/src/config.rs:1919, fluxon_rs/fluxon_kv/src/config.rs:1961, and fluxon_rs/fluxon_kv/src/config.rs:1995). Those configs have no contribute_to_cluster_pool_size and test_spec_config.side_transfer_role: worker, so they enter the zero-contribution branch and fail before reaching the expected assertions. This also conflicts with the runtime generator, which correctly omits large_file_paths for side-worker YAML at fluxon_rs/fluxon_kv/src/lib.rs:842 and asserts that omission at fluxon_rs/fluxon_kv/src/lib.rs:2746. Please remove the owner-only field from these side-worker YAML contract tests so the test fixtures match the new single external/side-worker contract. I could compile the targeted Rust test here, but the harness cannot execute in this checkout because libfluxon_commu_core.so is missing from the runtime library path; the branch contradiction above is still deterministic from the code.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding.

  • fluxon_test_stack/start_test_bed.py:1504 still returns the bare service logical base path ($HOSTWORKDIR/log/<service>.log), and _collect_bare_runtime_statuses() then reads that path directly at fluxon_test_stack/start_test_bed.py:3415. This PR moved generated bare service stdout/stderr to daily shards through deployment/utils/selection_supervisor_codegen.py:753 (<service>.<YYYY-MM-DD>.log), and the shared helper already exposes log_shard.resolve_readable_log_path() for that exact base-to-shard lookup (deployment/utils/log_shard.py:102). With only master.2026-06-23.log present, the current status code reports master.log and omits service_log_tail from the failure detail, so start-testbed diagnostics lose the actual service log after the shard migration. Please resolve the readable shard path before returning/reporting the bare runtime log, and extend deployment/tests/test_start_test_bed_bootstrap_log.py:155 with a non-TiKV sharded-log case so this consumer stays aligned with the new log contract.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary: changes are still needed before this PR should merge.

The blocking issues I found are:

  • The new top-level skills/ tree vendors 100 Codex/OpenAI-local skill files, while Fluxon artifacts only exclude it and do not use it. Please keep those outside this repository.
  • The new owner-mode large_file_paths contract breaks python3 fluxon_py/tests/test_config.py in three registered checks and leaves Python/Rust config validation inconsistent.
  • Canonical public owner startup examples still omit owner large_file_paths, so documented quick-start paths now fail before startup.
  • Rust side-transfer worker YAML tests still include owner-only large_file_paths even though zero-contribution YAML now rejects it.
  • Bare start-testbed runtime status still reads the logical base log path after bare service logs moved to daily shards, so failure diagnostics can miss the real service log tail.

Correction note: my earlier supervisor command-substitution hang comment was re-tested and explicitly retracted in a follow-up comment; that item should not be treated as an active finding.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the follow-up design/contract sweep.

  • fluxon_test_stack/test_runner.py:59 and fluxon_test_stack/start_test_bed.py:30 import the new daily-shard helper as from utils import log_shard after adding deployment/ or deployment/utils to sys.path. That is not a stable module surface in this repo because setup_and_pack/pack_test_stack_rsc.py:32 already imports the package name utils from setup_and_pack/utils. In a long-lived Python process that imports pack_test_stack_rsc.py first and then imports test_runner.py with the normal test-runner path, the second import fails deterministically with ImportError: cannot import name 'log_shard' from 'utils' (.../setup_and_pack/utils/__init__.py). The same preloaded utils package also breaks importing start_test_bed.py. This is an import-order bug and a naming-contract regression: the PR reuses the right log-shard implementation, but exposes it through an ambiguous top-level module name that collides with an existing repo helper package. Please load the canonical deployment/utils/log_shard.py by an explicit path/module name or package-qualified surface, and cover the import-order case so teststack and packaging helpers can coexist in one interpreter.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the config/runtime contract sweep.

  • fluxon_rs/fluxon_kv/src/lib.rs:626 now makes the public load_client_config(...) path call bootstrap_zero_contribution_client_config(...), which waits in wait_for_external_bootstrap_bundle(...) until the owner shared.json and mmap.file are present and valid (fluxon_rs/fluxon_kv/src/lib.rs:1615). That changes a config-load/verify API into a runtime attach operation for every zero-contribution config. The drift shows up directly in fluxon_rs/fluxon_kv/examples/kvcli_example.rs:249: the comment says "Validate config early for clearer errors", but the call at line 250 now waits for the owner bundle before initialization. Then init_client(config_arg) re-enters run_client_impl(...), which calls load_client_config(...) again at fluxon_rs/fluxon_kv/src/lib.rs:1853; for external configs it also calls wait_for_external_bootstrap_bundle(...) again at fluxon_rs/fluxon_kv/src/lib.rs:1923. So one external CLI startup can perform owner-bundle bootstrap as an early validation side effect, then repeat the same bootstrap during initialization. This violates the repo contract to keep one semantic operation on one primary path, and makes the public API name/type misleading: callers asking only to load a config can now block on runtime state that is outside the config file. Please keep load_client_config as the pure load/verify contract and fold zero-contribution owner-bundle bootstrap into the actual runtime initialization path once, returning both the bootstrapped config and shared metadata there; or introduce a separately named runtime-bootstrap helper if that blocking attach step is intentionally public.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the test-runner scene-contract sweep.

  • fluxon_test_stack/test_runner.py:357 now treats any scene id with the ci_top_attention_ prefix as allowed, but the runner-native CI command surface remains a closed set of three ids at fluxon_test_stack/test_runner.py:361 and fluxon_test_stack/test_runner.py:7662. With a normal suite shape, a scene named ci_top_attention_unknown parses and expands successfully, then fails only when _build_ci_execution_plan(...) reaches fluxon_test_stack/test_runner.py:7748 with scene[ci_top_attention_unknown] does not declare a runner-native CI command branch. This is exactly the kind of open-ended branch proliferation the repo rules try to avoid: the schema validator accepts a broader public scene surface than the executor implements. Please make the allowed scene set explicit and finite, for example by deriving validation from _runner_native_ci_scene_ids() or by registering each top-attention entry through one canonical mapping that both validation and command generation use. That keeps new top-attention scenes exhaustive at parse time and avoids late runtime failures from a prefix-only admission rule.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the log-shard helper sweep.

  • The Rust log-shard resolver does not match the Python helper's tolerant shard scan. In fluxon_rs/fluxon_util/src/log.rs:337, latest_existing_daily_sharded_log_path(...) iterates same-prefix *.log files, but entry.file_name().to_str()? at line 351 and chrono::NaiveDate::parse_from_str(...).ok()? at line 359 return None for the whole function when any matching entry has a non-UTF8 name or an invalid date suffix. The Python source of the same contract explicitly skips invalid shard names (deployment/utils/log_shard.py:89), so a single leftover file such as workload__Deployment__demo.tmp.log or workload__Deployment__demo.not-a-date.log can make Rust resolve_readable_log_path(...) ignore a valid workload__Deployment__demo.2026-06-20.log. That directly affects ops log reading: fluxon_rs/fluxon_ops/src/lib.rs:3000 falls back to the logical base path and then returns stat log failed if the base file does not exist. Please make the Rust scanner skip malformed candidates and add a test beside fluxon_rs/fluxon_ops/src/lib.rs:14830 or fluxon_rs/fluxon_util/tests/log_mgmt.rs:110 that includes one invalid same-prefix shard plus one valid dated shard, so the Rust and Python shard contracts stay aligned.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the ops log-tail contract sweep.

  • fluxon_rs/fluxon_ops/src/lib.rs:3000 resolves each read_workload_log request from the logical base path to whichever daily shard is current/latest, but the cursor contract at fluxon_rs/fluxon_ops/src/lib.rs:592 only returns file_size, start_offset, and end_offset; it never tells the caller which concrete shard file those offsets belong to. The embedded UI then stores only workloadLogEndOffset and polls Forward with that offset (fluxon_rs/fluxon_ops/src/lib.rs:8512). After deployment/utils/selection_supervisor_codegen.py:753 rolls workload stdout/stderr from workload.log to workload.<next-date>.log, the next poll transparently targets the new shard while reusing the old shard offset. If the new shard is smaller, the agent returns cursor out of range at fluxon_rs/fluxon_ops/src/lib.rs:3025; if it has already grown past that offset, the UI starts reading from the middle of the new shard and drops the beginning of the day's log. This is a contract bug in the new sharded log surface, not just a UI display issue: offsets are file-relative but the API does not expose a file identity. Please return a stable resolved_log_path/shard id in ReadWorkloadLogResp and have the UI reset or append a shard-boundary marker when it changes, or make the RPC accept and validate the shard identity with the cursor. Add a regression test that simulates two daily shards and verifies a tail poll after rollover does not error and does not skip the new shard prefix.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the public-doc contract sweep.

  • The PR moves KV runtime logs and profiles out of shared_file_path and into large_file_paths.log_root_path: fluxon_rs/fluxon_kv/src/lib.rs:1872 builds kv_logs_dir from config.large_file_paths.log_root_path, and fluxon_rs/fluxon_kv/src/lib.rs:2427 does the same for kv_profiles_dir. The owner also publishes that split through SharedJsonMeta.large_file_paths at fluxon_rs/fluxon_kv/src/client_seg_pool/mod.rs:1183, so external clients inherit it from shared.json while zero-contribution YAML must omit the owner-only field. The user docs still describe shared_file_path as the authority for shared.json, logs, and profiles in both languages (fluxon_doc_en/user_doc/User - 1 - Architecture and Concepts.md:81, fluxon_doc_cn/user_doc/用户 - 1 - 架构和概念.md:65, fluxon_doc_en/user_doc/User - 3 - KV and RPC Interface.md:388, fluxon_doc_cn/user_doc/用户 - 3 - KV-RPC接口.md:617). That now points users at the wrong directory for runtime artifacts and hides the new owner/external contract split. Please update the EN/CN user docs together: make shared_file_path only the shared.json/metadata attachment authority, document large_file_paths.log_root_path and cache_root_path as owner-side physical runtime roots, and show that external zero-contribution configs inherit them instead of declaring them.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the source-selection contract sweep.

  • scripts/git_source_selection.py:22 now hard-codes one git ls-files mode for every source-selection profile, including --others --exclude-standard. That is fine for a CI source pack if the goal is to include local untracked-but-not-ignored test files, but the same helper also feeds the public build seed through setup_and_pack/public_workspace_contract.py:37, then setup_and_pack/nix/pack_fluxonkv_pylib.py:213 uses that list as the Pyo3 workspace/wheel input set and setup_and_pack/nix/lib_layout.py:751 materializes it into the bridge-prebuilt workspace_seed. The published profile manifest then digests the whole workspace seed at setup_and_pack/nix/pack_fluxonkv_pylib.py:2812. I reproduced the current behavior with a temp repo where mocked git ls-files --cached --others --exclude-standard -z returned config.py\0local_probe.py\0; collect_public_workspace_input_relative_paths(...) returned both fluxon_py/config.py and the untracked fluxon_py/local_probe.py. This makes wheel/cache digests and published workspace seeds depend on arbitrary local scratch files under selected roots, so two developers can publish different artifacts from the same commit, and an accidental untracked file can enter the public workspace. Please keep the shared profile abstraction, but make the untracked-file policy an explicit bounded branch: tracked-only for build_seed/published public workspace inputs, and opt-in include-untracked only for source_pack if that is really the CI artifact contract. Add a regression test that creates or mocks an untracked file under fluxon_py/ and proves collect_public_workspace_input_relative_paths() excludes it while _collect_ci_source_relpaths() keeps whatever policy is intended for CI source packs.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the bare startup-gate sweep.

  • deployment/utils/proc_lifecycle_codegen.py:167 documents wait_service_probably_ready_pid_tree(...) as requiring one supervised direct child PID to stay unchanged across the fixed startup window, but the implementation only records the first child PID at line 212 and then returns success as soon as now >= deadline_ts if any child was ever observed (lines 219-226). The startup_window_seconds argument is validated but never used to measure stability duration. This is paired with deployment/gen_bare_deploy_bash.py:50-53, which now sets both the required stable window and the entire startup deadline to 10 seconds, and the generated standalone/atomic paths pass those equal values at lines 621-622 and 695-696. A child that appears in the last fraction of the deadline is therefore accepted immediately even though it has not survived the advertised stable window, and atomic-group startup can advance to the next service before the previous member has been stable. I reproduced the helper directly by starting a parent that sleeps ~1.7s, then creates its only direct child, and calling wait_service_probably_ready_pid_tree svc <pid> 2 <now+2> '[repro]'; it returned rc=0 elapsed_ms=1722, proving it did not wait for 2 seconds of child stability. Please either track observed_child_since and require now - observed_child_since >= startup_window_seconds, with a separate deadline that allows that full window after the child appears, or rename/remove the stability contract if the intended behavior is only "child appeared once before deadline". Add a regression test where the child appears late relative to the deadline and assert the helper does not report ready until the full stability window has elapsed.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the Python/Rust config contract sweep.

  • fluxon_py/config.py:579 updates the zero-contribution forbidden-field list for the Python public FluxonKvClientConfig path, but it does not include the new owner-only fluxonkv_spec.large_file_paths. Rust now rejects that field in zero-contribution YAML at fluxon_rs/fluxon_kv/src/config.rs:944, and the design docs say externals inherit large_file_paths from owner shared.json, so the Python facade should reject the same input before rendering YAML. On this PR, this snippet is accepted and emits YAML that Rust will later reject:

    from fluxon_py.config import FluxonKvClientConfig
    FluxonKvClientConfig({
        'instance_key': 'external_with_large_paths',
        'fluxonkv_spec': {
            'cluster_name': 'demo',
            'shared_memory_path': '/tmp/fluxon_shm',
            'shared_file_path': '/tmp/fluxon_shared',
            'large_file_paths': {
                'log_root_path': '/tmp/fluxon_logs',
                'cache_root_path': '/tmp/fluxon_cache',
            },
        },
    }).to_fluxon_kv_client_config_yaml_str()

    I also checked the existing forbidden fields through the same path: etcd_addresses, sub_cluster, and redis_compat are rejected, while large_file_paths is accepted. Please add large_file_paths to the Python zero-contribution forbidden list in both validation surfaces and add a Python contract test mirroring the Rust client_config_zero_contribution_rejects_large_file_paths_in_yaml test.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the bare deploy contract sweep.

  • deployment/gen_bare_deploy_bash.py:52 changes STANDALONE_STARTUP_DEADLINE_SECONDS from 60 to 10, and deployment/gen_bare_deploy_bash.py:53 changes ATOMIC_GROUP_STARTUP_DEADLINE_SECONDS from 10 minutes to 10 seconds. Those constants feed the generated public bare start scripts at deployment/gen_bare_deploy_bash.py:618-623 and deployment/gen_bare_deploy_bash.py:693-697. The PR also removes the single group-level GROUP_STARTUP_DEADLINE_TS and now gives each atomic-group service its own STARTUP_DEADLINE_TS=$(( $(date +%s) + 10 )), which the updated tests assert. That is a behavioral contraction of the production startup contract, not just a template refactor: any standalone service that needs more than 10s to expose a stable child, or any atomic-group member that needs more than 10s during rollout, now fails bare bootstrap even though the previous generated scripts allowed 60s / 600s. This conflicts with the repo rule to keep teststack/bare contracts explicit and bounded rather than letting test-oriented constants leak into the shared deploy path. Please keep the deadline constants at their previous production values or split test-only fast windows from the generated deployment defaults.

@ActivePeter ActivePeter left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional review finding from the shared-supervisor module-boundary sweep.

  • deployment/utils/selection_supervisor_codegen.py:72 generates _load_log_shard_helper() with an open-ended search over the generated script directory, current working directory, cwd/deployment/utils, and every sys.path entry, then loads the first log_shard.py it finds. That undermines the PR's own single-source contract at deployment/utils/selection_supervisor_codegen.py:5: both public materializers already publish the helper as a concrete sibling artifact (deployment/gen_bare_deploy_bash.py:115-121 writes selection_supervisor.py and log_shard.py into the same outdir; fluxon_rs/fluxon_ops/src/lib.rs:984-1040 writes both into the same selection_supervisor runtime dir). With the current generated runtime, a missing/stale sibling silently falls through to cwd or sys.path and can bind a different helper version, so bare and ops supervisors no longer have one canonical module pair. This is the kind of implementation probing the repo rules try to avoid: fail fast if the sibling helper is missing or incompatible, and make the generated supervisor load only the co-materialized Path(__file__).with_name("log_shard.py") (or pass an explicit helper path if that is meant to be a real contract).

"git",
"ls-files",
"--cached",
"--others",

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这会把所有 untracked 文件也带进同一个 source selection helper。collect_public_workspace_input_relative_paths()build_seed profile,后面 pack_fluxonkv_pylib.py / lib_layout.py 会把结果物化为 public workspace seed;这样发布输入不再只由提交内容决定,本地临时文件、未 stage 的实验文件也会进入 release/cache digest。建议把 --others 收窄到明确的 dev/source-pack profile,public build_seed 默认只选 tracked 文件;需要额外文件时用显式 include。

Comment thread fluxon_rs/fluxon_ops/src/lib.rs Outdated

let path = self.log_dir.join(log_filename);
let logical_path = self.log_dir.join(log_filename);
let path = resolve_readable_log_path(&logical_path).unwrap_or(logical_path.clone());

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里每次请求都会重新把 logical base 解析成 current/latest 日分片,但 ReadWorkloadLogResp 只返回 file_size/start_offset/end_offset,没有实际 shard/path identity。UI 后续用旧 end_offset 继续读时,如果跨午夜或 current shard 创建,offset 会被套到另一个文件上,结果可能缺日志、重复日志或 out-of-range。这个 API 需要在响应中返回并在后续请求中校验/传回 resolved shard identity,或者在 rollover 时显式要求客户端 reset。

cwd = Path.cwd().resolve()
candidates.append(cwd / "__LOG_SHARD_HELPER_FILENAME__")
candidates.append(cwd / "deployment" / "utils" / "__LOG_SHARD_HELPER_FILENAME__")
for entry in sys.path:

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

生成的 selection_supervisor.pylog_shard.py 已经由 gen_bare / ops embedded runtime 同目录发布,这里再从 cwd、deployment/utils 和整个 sys.path 搜索会把运行时 authority 打开:同目录 helper 缺失或旧版本时,脚本会静默加载仓库里或环境里的另一个 log_shard.py。这违反“一条公共路径一个明确契约”的规约,也让生成脚本依赖启动 cwd。建议只接受 __file__.with_name(...) 的同目录 helper,缺失就 fail fast,或由 materializer 显式传入 helper path。

Comment thread fluxon_test_stack/test_runner.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant