feat: incremental folder scan — skip unchanged files via MD5 manifest#239

Open
jwchmodx wants to merge 1 commit into HKUDS:main from jwchmodx:feat/incremental-folder-scan

jwchmodx (Contributor) commented Apr 3, 2026

Summary

Closes #156. Adds an incremental parameter to process_folder_complete; with incremental=True, files that have not changed since the last successful run are skipped on subsequent runs.

How it works

When incremental=True is passed:

  1. process_folder_complete computes the MD5 digest of every discovered file.
  2. The digest is compared against a lightweight JSON manifest stored in config.working_dir (one manifest per source folder, keyed by a hash of the folder path so multiple folders coexist safely).
  3. Unchanged files (digest matches the manifest) are silently skipped.
  4. New or changed files are processed normally.
  5. After the run, successfully processed files are marked in the manifest. Failed files are removed so they are automatically retried on the next call.

The incremental=False default means existing code is completely unaffected.
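
The skip and retry rules in the steps above can be sketched with two small pure functions. This is a minimal illustration, not the code in raganything/batch.py: the names plan_run and update_manifest, and the flat path-to-digest manifest layout, are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def plan_run(digests: Dict[str, str], manifest: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split discovered files into (to_process, skipped) by comparing digests."""
    to_process: List[str] = []
    skipped: List[str] = []
    for path, digest in sorted(digests.items()):
        if manifest.get(path) == digest:
            skipped.append(path)      # unchanged since the last successful run
        else:
            to_process.append(path)   # new file, or content changed
    return to_process, skipped


def update_manifest(
    manifest: Dict[str, str],
    digests: Dict[str, str],
    results: Dict[str, bool],
) -> Dict[str, str]:
    """Record successes; drop failures so they are retried on the next run."""
    updated = dict(manifest)
    for path, succeeded in results.items():
        if succeeded:
            updated[path] = digests[path]
        else:
            updated.pop(path, None)
    return updated


# Hypothetical second run: a.pdf is unchanged, b.pdf was edited, c.pdf is new.
manifest = {"a.pdf": "d1", "b.pdf": "d2"}
digests = {"a.pdf": "d1", "b.pdf": "d2-changed", "c.pdf": "d3"}

to_process, skipped = plan_run(digests, manifest)

# Suppose b.pdf processes successfully and c.pdf fails.
manifest = update_manifest(manifest, digests, {"b.pdf": True, "c.pdf": False})
```

Because c.pdf never enters the manifest, the next call to plan_run selects it again, which is the retry behaviour described in step 5.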

New API

await rag.process_folder_complete(
    folder_path="./documents",
    incremental=True,   # skip unchanged files
)

Changes

| File | Change |
| --- | --- |
| raganything/batch.py | Added incremental param + 4 helper methods + manifest update logic |
| tests/test_incremental_folder_scan.py | 12 new unit tests |

Test plan

  • test_processes_all_files_on_first_run — first run always processes everything
  • test_skips_unchanged_files_on_second_run — unchanged files are skipped
  • test_reprocesses_changed_file — modified file is re-processed
  • test_processes_newly_added_file — new files are picked up
  • test_failed_file_is_retried_next_run — failed files removed from manifest
  • test_non_incremental_does_not_create_manifest — no side effects when incremental=False
  • MD5 helpers + manifest I/O (3 + 3 tests)

All 12 pass; the 4 pre-existing failures in test_callbacks and test_chinese_cid_font are unrelated to this change.

🤖 Generated with Claude Code

…KUDS#156)

When `incremental=True` is passed, `process_folder_complete` computes the
MD5 digest of each discovered file and compares it against a per-folder
manifest stored in `config.working_dir`.  Files whose digest matches the
manifest are skipped; new or changed files are (re-)processed.  Failed files
are removed from the manifest so they are automatically retried on the next
run.

New helpers on BatchMixin:
- `_file_md5(path)` – compute hex MD5 of a file
- `_manifest_path(folder_path)` – locate the JSON manifest for a folder
- `_load_manifest(path)` / `_save_manifest(path, data)` – read/write manifest
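
For readers without the diff at hand, the four helpers could look roughly like the sketch below. It is written as standalone functions under assumptions: the manifest file name, the chunk size, the empty-dict fallback for a missing or corrupt manifest, and the write-then-rename save are illustrative choices, not necessarily what the PR does.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_path(working_dir: Path, folder_path: Path) -> Path:
    """Locate the manifest for a folder, keyed by a hash of its resolved path."""
    key = hashlib.md5(str(folder_path.resolve()).encode("utf-8")).hexdigest()
    # The file name pattern is hypothetical.
    return working_dir / f"incremental_manifest_{key}.json"


def load_manifest(path: Path) -> dict:
    """Read the manifest; a missing or unreadable file counts as empty."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_manifest(path: Path, data: dict) -> None:
    """Write the manifest via a temporary file, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(path)
```

Treating an unreadable manifest as empty means the worst case is a full reprocess rather than a crash.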

New test file: tests/test_incremental_folder_scan.py (12 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

LarFii (Collaborator) commented Apr 7, 2026

Thanks for your contribution!

  1. P1 Stale document versions are kept after a file changes
    In raganything/batch.py:147 the new incremental path decides whether to re-run a file, but when a changed file is reprocessed it just calls process_document_complete() again. That matters because raganything/processor.py:94 generates doc_id from document content, and raganything/processor.py:1556 uses that new content-based ID for insertion. I could not find any cleanup of the previous version before the reinsert. The result is that editing a file will create a new document identity while the old chunks/entities/relations remain in the index, so retrieval can return both old and new versions of the same source file. For an “incremental update” feature, that is a correctness bug, not just a UX gap.

  2. P1 The manifest key ignores processing configuration, so valid reruns are skipped
    The manifest entry in raganything/batch.py:159 only checks file-content MD5. If the user changes parse_method, parser backend/config, splitting settings, or even points the run at a different output layout, unchanged files are silently skipped and keep results produced under the old configuration. That is inconsistent with the existing parser cache design, which explicitly treats parsing config as part of cache validity in raganything/processor.py:164. So the new incremental layer can now suppress necessary reprocessing and leave stale outputs/index state behind.
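
One possible shape for addressing both points is sketched below, under stated assumptions. A plain dict stands in for the document index, the doc-<md5> ID format is hypothetical, and config_fingerprint, needs_processing, and reprocess are illustrative names rather than RAG-Anything APIs. Each manifest entry stores the file digest, a fingerprint of the processing settings, and the doc_id produced last time, so a changed file can have its previous version removed before reinsertion.

```python
import hashlib
import json
from typing import Optional


def config_fingerprint(settings: dict) -> str:
    """Stable hash of the processing settings that affect the output."""
    canonical = json.dumps(settings, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def content_doc_id(content: bytes) -> str:
    """Content-derived document ID (hypothetical format)."""
    return "doc-" + hashlib.md5(content).hexdigest()


def needs_processing(entry: Optional[dict], file_md5: str, fingerprint: str) -> bool:
    """Skip only when both the file content and the settings are unchanged."""
    return (
        entry is None
        or entry.get("md5") != file_md5
        or entry.get("config") != fingerprint
    )


def reprocess(index: dict, manifest: dict, path: str, content: bytes, settings: dict) -> bool:
    """Process `path` if needed, deleting its previous version first.

    Returns True if the file was (re)processed, False if it was skipped.
    """
    digest = hashlib.md5(content).hexdigest()
    fingerprint = config_fingerprint(settings)
    entry = manifest.get(path)
    if not needs_processing(entry, digest, fingerprint):
        return False
    if entry is not None:
        # Remove the stale version so old and new content never coexist.
        index.pop(entry["doc_id"], None)
    doc_id = content_doc_id(content)
    index[doc_id] = content
    manifest[path] = {"md5": digest, "config": fingerprint, "doc_id": doc_id}
    return True


# Illustrative run: first insert, unchanged rerun, edited file, changed settings.
index, manifest = {}, {}
first = reprocess(index, manifest, "a.txt", b"v1", {"parse_method": "auto"})
unchanged = reprocess(index, manifest, "a.txt", b"v1", {"parse_method": "auto"})
edited = reprocess(index, manifest, "a.txt", b"v2", {"parse_method": "auto"})
new_config = reprocess(index, manifest, "a.txt", b"v2", {"parse_method": "ocr"})
```

This sketch assumes one source file per doc_id. If two files can share identical content, the cleanup step would need reference counting or a path-scoped ID.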

Development

Successfully merging this pull request may close these issues.

[Feature Request]: support incremental scan of a folder and update the changed file based on date and md5