feat(storage): optimize glob func #2776

Open
baojun-zhang wants to merge 1 commit into volcengine:main from baojun-zhang:optimize-glob

baojun-zhang commented Jun 22, 2026

Collaborator

Description

This PR lands the RAGFS-backed glob implementation for OpenViking, and completes the S3-side pagination optimization in this change set.

The new flow keeps the existing OpenViking visibility semantics in Python while moving candidate enumeration and glob pagination into Rust. For S3, glob_directory now uses scan-state pagination on top of ListObjectsV2, stops early once the requested page is filled, and keeps opaque continuation tokens scoped to the original query.
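The scan-state pagination described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PR's Rust implementation: `list_batch` is a hypothetical in-memory stand-in for one ListObjectsV2 call, and the token layout (a query fingerprint plus the last scanned key) is an assumed encoding chosen only to show how a token can stay scoped to its original query.

```python
import base64
import hashlib
import json
from bisect import bisect_right
from pathlib import PurePosixPath


def _fingerprint(root, pattern):
    # Ties a token to the (root, pattern) pair it was issued for.
    return hashlib.sha256(f"{root}\0{pattern}".encode()).hexdigest()[:16]


def _encode(root, pattern, after):
    payload = {"q": _fingerprint(root, pattern), "after": after}
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()


def _decode(token, root, pattern):
    payload = json.loads(base64.urlsafe_b64decode(token.encode()))
    if payload["q"] != _fingerprint(root, pattern):
        raise ValueError("continuation token belongs to a different query")
    return payload["after"]


def list_batch(sorted_keys, prefix, start_after, max_keys):
    """Hypothetical stand-in for one listing call: ordered keys after start_after."""
    start = bisect_right(sorted_keys, start_after) if start_after else 0
    batch = []
    for key in sorted_keys[start:]:
        if key.startswith(prefix):
            batch.append(key)
            if len(batch) == max_keys:
                break
    return batch


def glob_page(sorted_keys, root, pattern, page_size, token=None, scan_batch=1000):
    """Return (matches, next_token); next_token is None once the scan is exhausted."""
    cursor = _decode(token, root, pattern) if token else ""
    matches = []
    while True:
        batch = list_batch(sorted_keys, root, cursor, scan_batch)
        for key in batch:
            cursor = key  # scan state: the last key examined, matching or not
            if PurePosixPath(key).match(pattern):
                matches.append(key)
                if len(matches) == page_size:
                    # Early stop: the page is full, resume after this key later.
                    return matches, _encode(root, pattern, cursor)
        if len(batch) < scan_batch:
            return matches, None  # a short batch means the listing is exhausted
```

Because the cursor records the last key examined rather than the last match, a follow-up call never rescans keys that were already rejected.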

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

  • Added the RAGFS glob_directory contract with GlobEntry / GlobPage, plus the Python binding and client plumbing needed for VikingFS.glob().
  • Kept OpenViking access control and visibility filtering in Python, while moving backend candidate enumeration and pagination into Rust without changing PurePath.match()-compatible semantics.
  • Implemented S3 scan-state pagination based on ListObjectsV2, added scoped opaque continuation tokens, removed shadow/legacy rollout guidance from the design doc, and fixed S3 fetch size handling so the internal scan batch stays large enough for sparse-match workloads.
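To make the split of responsibilities in the list above concrete, here is a rough Python sketch. The field names on `GlobEntry` and `GlobPage` and the `is_visible` callback are assumptions for illustration; the PR names the two types but this page does not show their fields. The backend is assumed to return a page of candidates, and the Python layer drops entries the caller may not see while passing the continuation token through unchanged.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, List, Optional


@dataclass(frozen=True)
class GlobEntry:
    path: str            # assumed field: full path of the candidate
    is_dir: bool = False  # assumed field


@dataclass(frozen=True)
class GlobPage:
    entries: List[GlobEntry]
    next_token: Optional[str] = None  # opaque to the Python layer


def matches(path: str, pattern: str) -> bool:
    # PurePath.match() anchors a relative pattern at the right-hand end of the path.
    return PurePosixPath(path).match(pattern)


def filter_visible(page: GlobPage, is_visible: Callable[[str], bool]) -> GlobPage:
    """Apply visibility rules in Python; the token is forwarded untouched."""
    kept = [entry for entry in page.entries if is_visible(entry.path)]
    return GlobPage(entries=kept, next_token=page.next_token)
```

One consequence of filtering after enumeration is that a page can come back shorter than the requested size even when more results remain, so callers should rely on the token, not the page length, to detect the end.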

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

N/A

Additional Notes

  • The current S3 implementation still relies on query-root ListObjectsV2 scanning for generic glob patterns. This PR fixes the regression where the internal S3 scan batch was tied to the outward page size, but it does not introduce an index-based lookup layer.
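The cost of tying the internal scan batch to the outward page size is easy to estimate. The sketch below is a back-of-the-envelope model, not measured behavior: it assumes matches are spread evenly through the listing, and the numbers in the usage lines are hypothetical. The only hard fact it relies on is that ListObjectsV2 returns at most 1,000 keys per request.

```python
import math

S3_MAX_KEYS = 1000  # ListObjectsV2 returns at most 1,000 keys per request


def list_calls_to_fill_page(page_size, keys_per_match, scan_batch):
    """Estimate the list calls needed to collect page_size matches.

    keys_per_match is the assumed average number of keys scanned per match.
    """
    keys_to_scan = page_size * keys_per_match
    return math.ceil(keys_to_scan / min(scan_batch, S3_MAX_KEYS))


# Hypothetical sparse workload: 1 match per 100 keys, page size 10.
tied = list_calls_to_fill_page(page_size=10, keys_per_match=100, scan_batch=10)
decoupled = list_calls_to_fill_page(page_size=10, keys_per_match=100, scan_batch=1000)
print(tied, decoupled)
```

Under these assumptions the tied configuration issues 100 list calls for one page while the decoupled one issues a single call, which is why sparse-match patterns suffer most when the batch is small.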

@baojun-zhang

Collaborator Author

Performance Test (Glob)

Prerequisites

Import 100,000 file resources to prepare the glob test data.

Test Command

python3 perf/s3/glob/load_test_glob_100.py \
  --account-id 100k \
  --user-id 100k \
  --api-key 'M...0Yg' \
  --uri viking://resources \
  --pattern "*.yaml" \
  --iterations 1

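Result Summary (100 iterations)

| Scenario | total_sec | avg_ms | p50_ms | p95_ms | p99_ms | ok/fail | last_result_count |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Before optimization (same result size, with failures) | 3500.4586 | 35004.57 | 31816.59 | 57718.53 | 60005.40 | 96/4 | 256 |
| After optimization | 1340.2618 | 13402.61 | 13309.03 | 14124.81 | 14469.24 | 100/0 | 256 |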

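One-Sentence Conclusion

At the same result size (last_result_count=256), average glob query latency dropped from ~35.0 s to ~13.4 s per query, an average speedup of about 2.61x; the error rate fell from 4% to 0%, and tail latency (p99) tightened from ~60.0 s to ~14.5 s.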
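The headline ratios can be reproduced directly from the two summaries with result count 256. The snippet below only restates that arithmetic; the dictionary layout is an illustrative choice, and the values are copied from the summaries on this page.

```python
# Values copied from the two summaries with last_result_count=256.
before = {"avg_ms": 35004.57, "p99_ms": 60005.40, "ok": 96, "fail": 4}
after = {"avg_ms": 13402.61, "p99_ms": 14469.24, "ok": 100, "fail": 0}


def error_rate(summary):
    # Fraction of iterations that failed.
    return summary["fail"] / (summary["ok"] + summary["fail"])


avg_speedup = before["avg_ms"] / after["avg_ms"]
p99_speedup = before["p99_ms"] / after["p99_ms"]

print(f"avg speedup: {avg_speedup:.2f}x")
print(f"p99 speedup: {p99_speedup:.2f}x")
print(f"error rate: {error_rate(before):.0%} -> {error_rate(after):.0%}")
```

The improvement at p99 is larger than the improvement in the mean, which is consistent with the conclusion that the tail tightened more than the average.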

Raw Output

Before optimization

using existing tenant: account_id=100k user_id=100k
auth_mode=api_key
progress: 20/100, ok=20, fail=0
progress: 40/100, ok=40, fail=0
progress: 60/100, ok=60, fail=0
ERR i=77 status=500 data=None text=
progress: 80/100, ok=79, fail=1
ERR i=81 status=500 data=None text=
ERR i=82 status=500 data=None text=
ERR i=83 status=500 data=None text=
progress: 100/100, ok=96, fail=4

Summary:
  iterations: 100
  ok: 96
  fail: 4
  total_sec: 3500.4586
  avg_ms: 35004.57
  p50_ms: 31816.59
  p95_ms: 57718.53
  p99_ms: 60005.40
  last_result_count: 256

Summary:
  iterations: 100
  ok: 100
  fail: 0
  total_sec: 651.7339
  avg_ms: 6517.33
  p50_ms: 5810.93
  p95_ms: 9523.33
  p99_ms: 20123.16
  last_result_count: 130

After optimization

using existing tenant: account_id=100k user_id=100k
auth_mode=api_key
progress: 20/100, ok=20, fail=0
progress: 40/100, ok=40, fail=0
progress: 60/100, ok=60, fail=0
progress: 80/100, ok=80, fail=0
progress: 100/100, ok=100, fail=0

Summary:
  iterations: 100
  ok: 100
  fail: 0
  total_sec: 1340.2618
  avg_ms: 13402.61
  p50_ms: 13309.03
  p95_ms: 14124.81
  p99_ms: 14469.24
  last_result_count: 256

PS

github-project-automation (bot) moved this from Backlog to Done in OpenViking project on Jun 22, 2026
baojun-zhang reopened this on Jun 22, 2026