feat: incremental folder scan — skip unchanged files via MD5 manifest by jwchmodx · Pull Request #239 · HKUDS/RAG-Anything

jwchmodx · 2026-04-03T03:41:36Z

Summary

Closes #156 — adds incremental=True to process_folder_complete so that unchanged files are skipped on subsequent runs.

How it works

When incremental=True is passed:

1. process_folder_complete computes the MD5 digest of every discovered file.
2. The digest is compared against a lightweight JSON manifest stored in config.working_dir (one manifest per source folder, keyed by a hash of the folder path so multiple folders coexist safely).
3. Unchanged files (digest matches the manifest) are silently skipped.
4. New or changed files are processed normally.
5. After the run, successfully processed files are recorded in the manifest. Failed files are removed so they are automatically retried on the next call.
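
For readers who want to see the flow end to end, here is a minimal standalone sketch of the steps above. It is not the code from `raganything/batch.py`: the function names, the manifest file name, and the manifest layout (a flat mapping of file path to digest) are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_path(working_dir: Path, folder: Path) -> Path:
    """One manifest per source folder, keyed by a hash of the folder path."""
    key = hashlib.md5(str(folder.resolve()).encode("utf-8")).hexdigest()
    # The file name pattern is hypothetical, not taken from the PR.
    return working_dir / f"folder_manifest_{key}.json"


def load_manifest(path: Path) -> dict:
    """Read the manifest, treating a missing file as an empty manifest."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_manifest(path: Path, data: dict) -> None:
    """Write the manifest as JSON, creating the working directory if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")


def split_changed(files, manifest):
    """Partition files into (to_process, skipped) by comparing digests."""
    to_process, skipped = [], []
    for f in files:
        digest = file_md5(f)
        if manifest.get(str(f)) == digest:
            skipped.append(f)
        else:
            to_process.append((f, digest))
    return to_process, skipped


def update_manifest(manifest, results):
    """Record successes and drop failures; results holds (path, digest, ok)."""
    for path, digest, ok in results:
        if ok:
            manifest[str(path)] = digest
        else:
            # Removing the entry forces a retry on the next run.
            manifest.pop(str(path), None)
    return manifest
```

A caller would run `split_changed` before processing, process only the returned `to_process` entries, then pass the outcomes to `update_manifest` and `save_manifest`.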

The incremental=False default means existing code is completely unaffected.

New API

await rag.process_folder_complete(
    folder_path="./documents",
    incremental=True,   # skip unchanged files
)

Changes

File	Change
`raganything/batch.py`	Added `incremental` param + 4 helper methods + manifest update logic
`tests/test_incremental_folder_scan.py`	12 new unit tests

Test plan

test_processes_all_files_on_first_run — first run always processes everything
test_skips_unchanged_files_on_second_run — unchanged files are skipped
test_reprocesses_changed_file — modified file is re-processed
test_processes_newly_added_file — new files are picked up
test_failed_file_is_retried_next_run — failed files removed from manifest
test_non_incremental_does_not_create_manifest — no side effects when incremental=False
MD5 helpers + manifest I/O (3 + 3 tests)

All 12 pass; the 4 pre-existing failures in test_callbacks and test_chinese_cid_font are unrelated to this change.

🤖 Generated with Claude Code

…KUDS#156)

When `incremental=True` is passed, `process_folder_complete` computes the MD5 digest of each discovered file and compares it against a per-folder manifest stored in `config.working_dir`. Files whose digest matches the manifest are skipped; new or changed files are (re-)processed. Failed files are removed from the manifest so they are automatically retried on the next run.

New helpers on BatchMixin:
- `_file_md5(path)` – compute hex MD5 of a file
- `_manifest_path(folder_path)` – locate the JSON manifest for a folder
- `_load_manifest(path)` / `_save_manifest(path, data)` – read/write manifest

New test file: tests/test_incremental_folder_scan.py (12 tests)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

LarFii · 2026-04-07T12:59:48Z

Thanks for your contribution!

P1 Stale document versions are kept after a file changes
In raganything/batch.py:147 the new incremental path decides whether to re-run a file, but when a changed file is reprocessed it just calls process_document_complete() again. That matters because raganything/processor.py:94 generates doc_id from document content, and raganything/processor.py:1556 uses that new content-based ID for insertion. I could not find any cleanup of the previous version before the reinsert. The result is that editing a file will create a new document identity while the old chunks/entities/relations remain in the index, so retrieval can return both old and new versions of the same source file. For an “incremental update” feature, that is a correctness bug, not just a UX gap.
P1 The manifest key ignores processing configuration, so valid reruns are skipped
The manifest entry in raganything/batch.py:159 only checks file-content MD5. If the user changes parse_method, parser backend/config, splitting settings, or even points the run at a different output layout, unchanged files are silently skipped and keep results produced under the old configuration. That is inconsistent with the existing parser cache design, which explicitly treats parsing config as part of cache validity in raganything/processor.py:164. So the new incremental layer can now suppress necessary reprocessing and leave stale outputs/index state behind.
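
One possible way to address both points, sketched under assumptions rather than taken from the codebase: store a fingerprint of the processing settings and the last inserted `doc_id` next to the content MD5, so that a config change triggers a rerun and the caller knows which old document to delete before reinserting. The helper names, the manifest entry layout, and the example settings keys are hypothetical. The actual deletion call is omitted because the thread does not name one.

```python
import hashlib
import json


def config_fingerprint(settings: dict) -> str:
    """Hash the settings that affect output (which keys to include is an assumption)."""
    canonical = json.dumps(settings, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def plan_file(manifest: dict, file_key: str, content_md5: str, settings: dict):
    """Return (needs_processing, stale_doc_id).

    stale_doc_id is the previously recorded document ID that should be
    removed from the index before reinserting, or None if none is recorded.
    """
    entry = manifest.get(file_key)
    if entry is None:
        return True, None  # never seen: process, nothing to clean up
    unchanged = (
        entry.get("md5") == content_md5
        and entry.get("config") == config_fingerprint(settings)
    )
    if unchanged:
        return False, None  # same content and same settings: skip
    return True, entry.get("doc_id")  # content or settings changed


def record_success(manifest, file_key, content_md5, settings, doc_id):
    """Store everything needed to validate and clean up on the next run."""
    manifest[file_key] = {
        "md5": content_md5,
        "config": config_fingerprint(settings),
        "doc_id": doc_id,
    }
```

With this layout, an unchanged file is skipped only when both the content digest and the settings fingerprint match, and a changed file carries the old `doc_id` forward so the previous version can be cleaned up first.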

jwchmodx wants to merge 1 commit into HKUDS:main from jwchmodx:feat/incremental-folder-scan
