Skip to content

feat: 实现 commit、restore、show 文件系统多版本管理功能#2756

Open
Hao-Yu-la wants to merge 5 commits into
volcengine:mainfrom
Hao-Yu-la:feat_git_version
Open

feat: 实现 commit、restore、show 文件系统多版本管理功能#2756
Hao-Yu-la wants to merge 5 commits into
volcengine:mainfrom
Hao-Yu-la:feat_git_version

Conversation

@Hao-Yu-la

@Hao-Yu-la Hao-Yu-la commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Description

This PR introduces multi-version management (Snapshots) to OpenViking — a Git-based versioning layer built on top of VikingFS. It embeds an in-process Git service (powered by gitoxide) into the Rust RAGFS layer, maintaining one logical Git repository per account_id and exposing four version-control primitives: commit, log, show, and restore.

The feature lets users/Agents save an account's resource tree as a series of immutable snapshots, walk commit history, compare/read past versions, and restore the workspace (or a sub-directory) to any prior state. It is fully transparent to callers — no git binary is invoked, and Git objects/refs never leak into the viking:// user namespace.

Key design points:

  • Account-granular repository: all scopes (resources/, agent/, user/, session/) share one root tree, enabling atomic cross-scope snapshots.
  • Symmetric backends: Git objects/refs can be stored on the local filesystem (.ovgit) or S3-compatible object storage (TOS/S3/OSS), mirroring the resources backend.
  • Zero process overhead: implemented as a PyO3 binding sharing the existing Tokio runtime and config chain — no new HTTP server or subprocess.
  • Minimal intrusion: the core write path (content_write.py, viking_fs.write/rm/mv) is untouched; only new methods are added.
  • Forward-commit restore: restore never rewinds or rewrites history — it writes the diff back and creates a new commit on top of the current HEAD, keeping HEAD monotonic and history fully auditable.

The technical design document is available at git-version-control-design.md, along with a new usage guide (en / zh) and API reference (en / zh).

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Rust core (crates/ragfs/src/git/): new in-process Git service built on gitoxide, including GitService, ObjectStore/RefStore traits with local and s3 backends, tree builder, index store, commit/enumerate logic, config, and error types. Added git_s3_integration tests.
  • PyO3 binding (crates/ragfs-python/src/git.rs): exposes commit / log / show / restore on RAGFSBindingClient, invoked from Python via AsyncAGFSClient.
  • Python SDK & service layer: added snapshot_namespace.py (client.snapshot.*), storage/viking_fs.py snapshot methods, fs_service.py forwarders, error mapping, and sync/async client wiring.
  • HTTP API: new snapshot router under /api/v1/snapshot with commit / log / show / restore endpoints.
  • CLI (crates/ov_cli): new ov snapshot subcommands with rendered output, plus an HTTP client for the snapshot API.
  • Configuration: added a top-level git section (enabled, backend, default_branch, author_*, local/s3 settings); enabled by default. Snapshot data defaults to .ovgit under the workspace.
  • Documentation: added design doc, usage guides, and API references (en + zh); renamed the ragfs-cache doc to keep ordering.
  • Examples: added snapshot example scripts and ov.conf examples for local and S3/TOS backends.
  • Tests: added unit, integration, CLI, client, server, and storage tests covering Git bindings (local + S3), config, forwarders, namespace, and HTTP/SDK paths.

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Added extensive test coverage across layers: Rust S3 integration tests (git_s3_integration.rs), Python binding tests (local + S3), config validation, service forwarders, snapshot namespace, CLI, HTTP API, and SDK client tests.

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Additional Notes

  • Multi-version management is enabled by default; disabled it by adding a git section with enabled: false to ov.conf.
  • This is the first version and intentionally scopes to snapshot + view + restore. Out of scope for now: automatic commit hooks, branch merge/rebase/cherry-pick/push/pull, ref-rewind checkout, and versioning of vector-index data (vector indexes are rebuilt asynchronously by the watcher and must be re-triggered after restore).
  • Git data is stored outside the viking:// namespace (under .ovgit locally or {prefix}/{account}/... on S3), so users cannot accidentally read, modify, or delete it.

@github-actions

Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 85
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add Rust Git service and PyO3 bindings

Relevant files:

  • crates/ragfs/src/git/*
  • crates/ragfs-python/src/git.rs
  • openviking/pyagfs/init.py

Sub-PR theme: Integrate Git versioning into VikingFS and client layers

Relevant files:

  • openviking/storage/viking_fs.py
  • openviking/snapshot_namespace.py
  • openviking/async_client.py
  • openviking/client/local.py
  • openviking/service/fs_service.py
  • tests/agfs/test_viking_fs_git.py
  • tests/agfs/test_git_binding.py
  • tests/agfs/test_git_binding_s3.py

Sub-PR theme: Add HTTP API endpoints for snapshot operations

Relevant files:

  • openviking/server/routers/snapshot.py
  • openviking/server/routers/init.py
  • tests/server/test_api_snapshot.py

Sub-PR theme: Add documentation, examples, and CLI support

Relevant files:

  • docs/*
  • examples/snapshot/*
  • openviking_cli/utils/config/git_config.py
  • crates/ov_cli/src/commands/snapshot.rs

⚡ Recommended focus areas for review

Missing License Header

New test file does not include the standard copyright header, though the rule technically only applies to openviking/ and openviking_cli/ directories.

"""End-to-end tests for VikingFS git commit/restore/show/log Python layer.
Missing License Header

New test file does not include the standard copyright header, though the rule technically only applies to openviking/ and openviking_cli/ directories.

"""End-to-end tests for the git_commit/git_restore/git_show PyO3 bindings.

@github-actions

Copy link
Copy Markdown

PR Code Suggestions ✨

No code suggestions found for the PR.

fix: 修复commit时删除文件

fix: commit 的 fast path 1 添加 Racy-clean 机制

fix: 将 sdk 中的 git 命令改为 snapshot 命令,同步修改单测

fix: 多版本管理的文件存储目录改为 .ovgit

feat: snapshot cli 渲染

fix: 修复 restore 时将删除的文件回滚时,目录不存在的问题

feat: restore 命令的 project_dir 参数改为可选,不传时默认全目录回滚

feat: 更新文档

fix: 删除暂未使用的配置参数

feat: 新增示例脚本

fix: 修复示例代码

fix: 修复 restore 返回的 task id 任务完成状态

feat: 在 restore 修改文件系统时加锁

fix: fix openviking_sdk
qin-ctx

This comment was marked as duplicate.

@qin-ctx qin-ctx left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

本次审查重点覆盖 snapshot 功能的一致性、并发语义、配置契约和用户可复制性。整体分层方向清晰:Rust git service 承担对象、tree、commit 和 ref CAS;PyO3 负责窄绑定;Python service、HTTP、SDK 和 CLI 提供用户入口。但当前仍有 4 个阻塞问题需要在合并前修复,另有 2 个非阻塞问题建议同步处理。

阻塞问题

  1. restore() 在 branch ref CAS 前已经写回 VFS。CAS 冲突时,API 会返回失败,但工作区可能已被改写,且不会触发 reindex。
  2. local object store 使用固定 .tmp 路径。并发写入同一个尚不存在的 object 时,临时文件可能互相覆盖,破坏 put 的幂等语义。
  3. 配置层允许 cas_mode = "redis_lock",但 S3 backend 对该模式直接返回未实现。用户配置该模式后,snapshot commit/restore 会在 ref CAS 阶段失败。
  4. Rust/PyO3 git 层缺少 account 校验。local path 和 S3 key 直接拼接 account,直接 binding 调用存在账号目录或 key prefix 隔离风险。

非阻塞问题

  1. snapshot 示例配置包含 trailing comma,但配置加载器使用 json.loads,复制示例会解析失败。
  2. PR 描述仍说明 multi-version 默认关闭,但当前 GitConfig.enabled 实际默认开启,需要同步默认行为说明。

结论:暂不建议 merge。建议先修复以上阻塞问题,并处理当前 CI lint 失败。

Comment thread crates/ragfs/src/git/service.rs Outdated
.await?;

// 9. CAS-swap the branch ref. Map Conflict → ConcurrentCommit.
match self

@qin-ctx qin-ctx Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug][阻塞] restore() 当前在 branch ref CAS 之前已经把 diff.to_write 写入 VFS,并把 diff.to_delete 从 VFS 删除。若另一个 commit 在 resolve HEAD 与 cas_update 之间推进分支,cas_update 会返回 ConcurrentCommit,但工作区已经被 restore 内容改写;Python 层因为 restore 返回错误,也不会调度 reindex,最终形成“请求失败、工作区已变更、索引未更新”的不一致状态。建议将同一 account/ref 的 restore 与 commit 串行化,或调整为先完成 ref 一致性协议再写回 VFS,并增加 CAS 冲突后 VFS 不应变化的回归测试。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已完成修改

Comment thread crates/ragfs/src/git/backends/local.rs Outdated
}

// Write to temp file first, then rename for atomicity
let tmp_path = path.with_extension("tmp");

@qin-ctx qin-ctx Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug][阻塞] path.with_extension("tmp") 生成的是固定临时文件名,会破坏 object store 在并发场景下的幂等语义。两个并发 commit 写同一个尚不存在的 object 时,二者可能都通过 try_exists,随后共用同一个 .tmp;其中一个 rename 后,另一个 rename(tmp, path) 可能失败。建议使用唯一 temp path,并在 rename 冲突或失败后重新检查目标 object 是否已存在,存在则返回成功。

CasMode::Native => {
self.cas_native(account, ref_name, expected, new).await
}
CasMode::RedisLock => {

@qin-ctx qin-ctx Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug][阻塞] 配置层允许并暴露 cas_mode = "redis_lock",但 S3 backend 对该模式直接返回 RedisLock CAS mode not yet implemented。用户一旦配置该模式,snapshot commit/restore 会在第一次 ref CAS 时失败。建议短期在配置校验层拒绝 redis_lock 并从文档移除,或在本 PR 内实现 Redis lock CAS。

/// Get the filesystem path for an object.
fn object_path(&self, account: &str, oid: &ObjectId) -> PathBuf {
let hex = oid.to_hex().to_string();
self.base_dir

@qin-ctx qin-ctx Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug][阻塞] Rust/PyO3 git 层缺少与 Python validate_account_id 等价的 account 校验,但 local backend 会直接把 account 拼入 filesystem path,S3 backend 也会直接拼入 key prefix。直接调用 binding 时,如果传入 ../x 或带 / 的 account,可能突破账号目录或 key prefix 隔离。建议在 Rust GitService 边界统一校验 account,至少拒绝空值、...、slash、反斜杠和控制字符,并补充直接 binding 测试。

"api_base": "https://ark.cn-beijing.volces.com/api/v3",
"dimension": 1024,
"provider": "volcengine",
"input": "multimodal",

@qin-ctx qin-ctx Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug][非阻塞] 示例配置包含 trailing comma,但配置加载器使用 json.loads,复制该示例会直接触发 JSONDecodeError。local/S3 示例使用 python3 -m json.tool 校验均无法通过。建议将示例和文档片段改为严格 JSON,或明确改 loader 支持 JSON5/注释配置。

class GitConfig(BaseModel):
"""Git multi-version management configuration."""

enabled: bool = Field(

@qin-ctx qin-ctx Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug][非阻塞] 当前实现把 git multi-version 默认开启,但 PR 描述仍说明默认关闭。建议同步 PR 描述和文档中的默认行为,避免 reviewer 和用户误判启用范围与迁移风险。

@qin-ctx qin-ctx left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这次先发 5 个 blocking 问题。整体方向我觉得可以继续推进,但下面这些点会影响快照正确性、restore 一致性,或者让配置在不同入口表现不一样,建议修完再合。

let mut changed = 0usize;
for rel_path in candidates {
let abs = format!("/local/{}/{}", account, rel_path);
match self.vfs.stat(&abs).await {

@qin-ctx qin-ctx Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug] (blocking) 这里会把 paths 里的目录静默跳过。

上面 paths=Some(...) 时只是把用户传进来的路径原样放进 candidates;到了这里如果这个路径是目录,就直接 continue。这和 CLI 帮助里的例子不太一致:ov snapshot commit -m "docs only" --paths viking://docs 看起来是在提交整个 docs 目录,但实际会因为 docs 是目录而什么文件都没提交。

举个例子:用户改了 viking://docs/a.mdviking://docs/b.md,然后只想提交 docs 目录,于是传 paths=["docs"]。现在这次提交可能变成 noop,或者至少漏掉 docs 下的所有文件,后面 restore/log 看到的历史就不是用户以为的历史。

这里不一定要在这个 PR 里支持目录递归。最小修法可以更简单:如果 paths 里传进来的是目录,就返回一个明确错误,告诉用户当前只支持文件路径,并同步 CLI 帮助、文档和测试。如果产品预期就是支持 --paths viking://docs,那再递归展开目录。关键点是不能成功返回但悄悄漏提交。

// leaving caller-driven reindex and on-disk state consistent.
match self
.ref_store
.cas_update(account, &ref_name, Some(head_oid), new_commit_oid)

@qin-ctx qin-ctx Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug] (blocking) restore 现在是先把 branch ref 推到新 commit,再写回 VFS。这个顺序会留下一个半成功状态。

现在流程是:cas_update(..., new_commit_oid) 成功后,后面才开始 read_blob_payloadensure_parent_dirswriteremove。只要这些 VFS 操作中间任何一步失败,HEAD 已经指向“恢复后的 commit”,但真实文件树可能还停在旧状态,或者只写了一半。Python 层拿到异常后也不会调度 reindex,所以索引也可能继续和 ref 不一致。

举个例子:恢复要写回 100 个文件,CAS 成功后写到第 40 个时后端超时。用户再看 snapshot HEAD,会觉得已经恢复成功;但 VFS 里只有前 40 个文件变了,剩下 60 个还没变。这种状态后面很难靠普通重试判断清楚,因为 ref 已经变了。

建议先用最小修法把语义收住,不需要在这个 PR 里上很重的事务机制。至少补一个测试:模拟 CAS 成功后 VFS write 失败。然后让实现保证这类失败不会悄悄留下“ref 已前进、工作树没恢复完、也没有 reindex”的状态。具体可以选最简单可维护的方案,比如写回失败时返回足够明确的可恢复错误并安排/触发一致性修复,或者调整顺序让失败时 ref 不会先前进。

Comment thread crates/ragfs-python/src/git.rs
"bot@openviking.local".to_string()
}
fn default_s3_prefix() -> String {
"git".to_string()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug] (blocking) S3 默认 prefix 这里返回的是 git,但同文件的测试、Python 配置默认值和文档语义并不一致。

我本地跑了 cargo test --manifest-path crates/ragfs/Cargo.toml parses_s3_config_with_overrides --quiet,失败点就是这个默认值:测试期望 s3.prefix == ".ovgit",实际拿到的是 "git"。Python 配置那边默认也是 .ovgit

这个不只是测试写错那么简单。默认 prefix 决定对象和 refs 写到 bucket 的哪个目录下。如果有的入口默认写 .ovgit,有的入口默认写 git,用户用 CLI 提交、再用 native binding 读,就可能看起来“快照丢了”,其实是两边读写了不同前缀。

建议先定一个唯一默认值。如果产品上想用 .ovgit,这里改成 .ovgit,并统一示例;如果想用 git,那就反过来更新 Python 默认、测试和文档。关键是所有入口必须一致。

return Err(GitError::Other("empty path".into()));
}

let components: Vec<&str> = path.split('/').collect();

@qin-ctx qin-ctx Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Bug] (blocking) 这里的路径校验还不够,split_path 只拒绝空字符串和空组件,但没有拒绝 . / .. 这类路径组件。

高层 HTTP/SDK 入口可能已经把 URI 规范化过了,但这个 PR 也暴露了 native GitService / PyO3 binding 入口。直接 binding 可以把 paths=["../other/file.md"] 或 restore 的 project_dir="../other" 传进来;后面 service 会拼成类似 /local/{account}/../other/file.md 的路径,底层 localfs 再按真实文件路径解析。这样就有机会越过当前 account/project 的边界。

这个点和你们刚补的 account 校验是同一类问题:Rust GitService 作为 native 边界,最好自己也挡住这些明显危险的 path。

修法不用复杂:在进入 commit / show / restore 的地方,对用户传入的相对路径做一层明确校验即可。至少拒绝 ...、反斜杠、控制字符、前后斜杠和空组件;再补一个直接 binding 的测试,比如恶意 path 应该返回 InvalidPath,而不是继续访问 VFS。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

2 participants