Skip to content

fix: prevent crashes from uninitialized LightRAG, env-var stripping, and parser cleanup#240

Open
jwchmodx wants to merge 4 commits intoHKUDS:mainfrom
jwchmodx:fix/bugs-processor-parser-config
Open

fix: prevent crashes from uninitialized LightRAG, env-var stripping, and parser cleanup#240
jwchmodx wants to merge 4 commits intoHKUDS:mainfrom
jwchmodx:fix/bugs-processor-parser-config

Conversation

@jwchmodx
Copy link
Copy Markdown
Contributor

@jwchmodx jwchmodx commented Apr 3, 2026

Summary

This PR fixes three categories of bugs found in processor.py, config.py, and parser.py.

Bug fixes

processor.py — LightRAG accessed before initialization (crash risk)

  • _process_multimodal_content(): self.lightrag.doc_status was accessed at the top of the function (inside a try/except) before _ensure_lightrag_initialized() was called further down. If self.lightrag is None at that point the AttributeError is silently swallowed and the function continues in a broken state. Fix: call _ensure_lightrag_initialized() first and return early on failure.
  • process_document_complete_lightrag_api(): same pattern — self.lightrag.doc_status.get_by_id(doc_pre_id) was called on line 1678 before the try block that initialises LightRAG on line 1682. Fix: move the doc_status fetch after initialisation (it is already fetched again at line 1707, so the early call was redundant).
  • process_document_complete() and insert_content_list(): _ensure_lightrag_initialized() return value was ignored (await self._ensure_lightrag_initialized()). A failed initialisation would silently fall through and crash later on an unrelated line. Fix: capture and check the result, raising RuntimeError on failure.

config.py — missing .strip() on env-var splits

SUPPORTED_FILE_EXTENSIONS and CONTEXT_FILTER_CONTENT_TYPES are split on "," without stripping whitespace. A value like ".pdf, .jpg" produces [".pdf", " .jpg"]; the leading space means " .jpg" never matches any file extension check. Fix: add a list-comprehension with .strip().

parser.py — redundant platform.system() calls and unused TypeVar

platform.system() == "Windows" appeared in five separate functions, each preceded by a local import platform. This is evaluated at runtime on every call. Fix: hoist it to a single module-level constant _IS_WINDOWS and remove the five local imports. Also removed the unused TypeVar T import (it was imported and assigned but never referenced).

Test plan

  • Run existing test suite: pytest tests/
  • Verify RAGAnything still initializes and processes documents correctly when lightrag is pre-provided
  • Verify RAGAnything raises a clear error (not a cryptic AttributeError) when llm_model_func is missing
  • Set SUPPORTED_FILE_EXTENSIONS=".pdf, .jpg" (with space) and confirm both extensions are recognized
  • Confirm no regressions on Windows subprocess behavior

🤖 Generated with Claude Code

jwchmodx and others added 4 commits April 3, 2026 12:56
- processor.py: move _ensure_lightrag_initialized() before first
  self.lightrag access in _process_multimodal_content() and
  process_document_complete_lightrag_api(); raise on init failure
  instead of silently ignoring the return value
- config.py: strip whitespace from SUPPORTED_FILE_EXTENSIONS and
  CONTEXT_FILTER_CONTENT_TYPES env-var splits so that values like
  ".pdf, .jpg" match correctly
- parser.py: hoist platform.system() check to module-level constant
  _IS_WINDOWS, removing five redundant local `import platform` +
  repeated calls; remove unused TypeVar T

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ustness

processor.py:
- process_document_complete_lightrag_api(): the finally block
  unconditionally called `async with pipeline_status_lock` even when
  `pipeline_status_lock` is None (e.g. when LightRAG initialization
  fails before lines 1733-1734 execute). This caused an unhandled
  AttributeError that masked the original error. Added None guard and
  wrapped in try/except so the finally block never itself raises.

parser.py (DoclingParser):
- read_from_block(): base64 image URI was split on "," and the second
  element accessed unconditionally. When the URI has no comma (rare but
  valid — some exporters omit the "data:...;base64," prefix) this raises
  IndexError silently swallowed by the outer except. Now uses split(",", 1)
  with a len-check.
- read_from_block_recursive(): $ref values like "#/body/0" were split
  by "/" without checking the resulting length, and int() was called
  on the third part without ValueError handling. Malformed refs now
  log a warning and are skipped instead of crashing the whole parse.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atch

query.py (aquery_with_multimodal, aquery_vlm_enhanced) and batch.py
(process_folder_complete, process_documents_with_rag_batch) all called
`await self._ensure_lightrag_initialized()` without checking the return
value. _ensure_lightrag_initialized() returns {"success": False, "error":
"..."} on failure (e.g. missing llm_model_func, parser not installed),
so silently ignoring it caused a cryptic AttributeError later when
self.lightrag was accessed while still None.

All four call sites now capture the result and raise RuntimeError with
the initialization error message, consistent with processor.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
processor.py (process_document_complete_lightrag_api):

- current_doc_status UnboundLocalError: the variable was first assigned
  inside the try block (line 1706). If self.lightrag.doc_status.get_by_id()
  raises before the assignment completes, the except handler at line 1829
  references current_doc_status before it was ever bound, causing
  UnboundLocalError. Fixed by pre-initialising current_doc_status = {}
  before the try block so the **-unpack always has a valid mapping.

- Unsafe dict access on _ensure_lightrag_initialized() result: line 1688
  used result["success"] with direct key access, which raises TypeError
  if the method returns None. Changed to the safe pattern
  `not result or not result.get("success")` already used elsewhere.

- MineruExecutionError.error_msg type coercion: lines 1760-1762 only
  handled list and silently left non-string, non-list values as-is.
  Now uses str() conversion in both branches so the stored error_msg
  is always a plain string.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@LarFii
Copy link
Copy Markdown
Collaborator

LarFii commented Apr 7, 2026

Thank you for your contribution!

I used Codex to review PR #240 against main, focusing on P0 and P1 issues. No P0 issues were found, but one P1 regression was identified.

  1. [P1] Initialization failures are now silently dropped instead of being persisted as a failed document status. In processor.py, process_document_complete_lightrag_api() now returns False immediately when _ensure_lightrag_initialized() fails. The problem is that this return happens before the function ever loads or upserts doc_pre_id status, which only starts later at processor.py. On main, the same failure path explicitly wrote DocStatus.FAILED plus error_msg before returning. With this PR, parser/bootstrap/storage-init failures become “silent” from the status store’s perspective: callers get False, but any polling UI / API that relies on doc_status can see a stale or missing record and treat the job as not started rather than failed. That is a real behavior regression for observability and recovery flows, so I would block on restoring the failed-status write before return.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants