fix: prevent crashes from uninitialized LightRAG, env-var stripping, and parser cleanup by jwchmodx · Pull Request #240 · HKUDS/RAG-Anything

jwchmodx · 2026-04-03T03:57:21Z

Summary

This PR fixes three categories of bugs found in processor.py, config.py, and parser.py.

Bug fixes

processor.py — LightRAG accessed before initialization (crash risk)

_process_multimodal_content(): self.lightrag.doc_status was accessed at the top of the function (inside a try/except) before _ensure_lightrag_initialized() was called further down. If self.lightrag is None at that point the AttributeError is silently swallowed and the function continues in a broken state. Fix: call _ensure_lightrag_initialized() first and return early on failure.
process_document_complete_lightrag_api(): same pattern — self.lightrag.doc_status.get_by_id(doc_pre_id) was called on line 1678 before the try block that initialises LightRAG on line 1682. Fix: move the doc_status fetch after initialisation (it is already fetched again at line 1707, so the early call was redundant).
process_document_complete() and insert_content_list(): _ensure_lightrag_initialized() return value was ignored (await self._ensure_lightrag_initialized()). A failed initialisation would silently fall through and crash later on an unrelated line. Fix: capture and check the result, raising RuntimeError on failure.

config.py — missing .strip() on env-var splits

SUPPORTED_FILE_EXTENSIONS and CONTEXT_FILTER_CONTENT_TYPES are split on "," without stripping whitespace. A value like ".pdf, .jpg" produces [".pdf", " .jpg"]; the leading space means " .jpg" never matches any file extension check. Fix: add a list-comprehension with .strip().

parser.py — redundant platform.system() calls and unused TypeVar

platform.system() == "Windows" appeared in five separate functions, each preceded by a local import platform. This is evaluated at runtime on every call. Fix: hoist it to a single module-level constant _IS_WINDOWS and remove the five local imports. Also removed the unused TypeVar T import (it was imported and assigned but never referenced).

Test plan

Run existing test suite: pytest tests/
Verify RAGAnything still initializes and processes documents correctly when lightrag is pre-provided
Verify RAGAnything raises a clear error (not a cryptic AttributeError) when llm_model_func is missing
Set SUPPORTED_FILE_EXTENSIONS=".pdf, .jpg" (with space) and confirm both extensions are recognized
Confirm no regressions on Windows subprocess behavior

🤖 Generated with Claude Code

- processor.py: move _ensure_lightrag_initialized() before first self.lightrag access in _process_multimodal_content() and process_document_complete_lightrag_api(); raise on init failure instead of silently ignoring the return value - config.py: strip whitespace from SUPPORTED_FILE_EXTENSIONS and CONTEXT_FILTER_CONTENT_TYPES env-var splits so that values like ".pdf, .jpg" match correctly - parser.py: hoist platform.system() check to module-level constant _IS_WINDOWS, removing five redundant local `import platform` + repeated calls; remove unused TypeVar T Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ustness processor.py: - process_document_complete_lightrag_api(): the finally block unconditionally called `async with pipeline_status_lock` even when `pipeline_status_lock` is None (e.g. when LightRAG initialization fails before lines 1733-1734 execute). This caused an unhandled AttributeError that masked the original error. Added None guard and wrapped in try/except so the finally block never itself raises. parser.py (DoclingParser): - read_from_block(): base64 image URI was split on "," and the second element accessed unconditionally. When the URI has no comma (rare but valid — some exporters omit the "data:...;base64," prefix) this raises IndexError silently swallowed by the outer except. Now uses split(",", 1) with a len-check. - read_from_block_recursive(): $ref values like "#/body/0" were split by "/" without checking the resulting length, and int() was called on the third part without ValueError handling. Malformed refs now log a warning and are skipped instead of crashing the whole parse. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…atch query.py (aquery_with_multimodal, aquery_vlm_enhanced) and batch.py (process_folder_complete, process_documents_with_rag_batch) all called `await self._ensure_lightrag_initialized()` without checking the return value. _ensure_lightrag_initialized() returns {"success": False, "error": "..."} on failure (e.g. missing llm_model_func, parser not installed), so silently ignoring it caused a cryptic AttributeError later when self.lightrag was accessed while still None. All four call sites now capture the result and raise RuntimeError with the initialization error message, consistent with processor.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

processor.py (process_document_complete_lightrag_api): - current_doc_status UnboundLocalError: the variable was first assigned inside the try block (line 1706). If self.lightrag.doc_status.get_by_id() raises before the assignment completes, the except handler at line 1829 references current_doc_status before it was ever bound, causing UnboundLocalError. Fixed by pre-initialising current_doc_status = {} before the try block so the **-unpack always has a valid mapping. - Unsafe dict access on _ensure_lightrag_initialized() result: line 1688 used result["success"] with direct key access, which raises TypeError if the method returns None. Changed to the safe pattern `not result or not result.get("success")` already used elsewhere. - MineruExecutionError.error_msg type coercion: lines 1760-1762 only handled list and silently left non-string, non-list values as-is. Now uses str() conversion in both branches so the stored error_msg is always a plain string. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

LarFii · 2026-04-07T12:55:43Z

Thank you for your contribution!

I used Codex to review PR #240 against main, focusing on P0 and P1 issues. No P0 issues were found, but one P1 regression was identified.

[P1] Initialization failures are now silently dropped instead of being persisted as a failed document status. In processor.py, process_document_complete_lightrag_api() now returns False immediately when _ensure_lightrag_initialized() fails. The problem is that this return happens before the function ever loads or upserts doc_pre_id status, which only starts later at processor.py. On main, the same failure path explicitly wrote DocStatus.FAILED plus error_msg before returning. With this PR, parser/bootstrap/storage-init failures become “silent” from the status store’s perspective: callers get False, but any polling UI / API that relies on doc_status can see a stale or missing record and treat the job as not started rather than failed. That is a real behavior regression for observability and recovery flows, so I would block on restoring the failed-status write before return.

jwchmodx and others added 4 commits April 3, 2026 12:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent crashes from uninitialized LightRAG, env-var stripping, and parser cleanup#240

fix: prevent crashes from uninitialized LightRAG, env-var stripping, and parser cleanup#240
jwchmodx wants to merge 4 commits intoHKUDS:mainfrom
jwchmodx:fix/bugs-processor-parser-config

jwchmodx commented Apr 3, 2026

Uh oh!

LarFii commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jwchmodx commented Apr 3, 2026

Summary

Bug fixes

Test plan

Uh oh!

LarFii commented Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants