
fix: reject invalid PDF decoder input #977

Merged
cybermaggedon merged 1 commit into trustgraph-ai:master from jmolz:codex/validate-pdf-decoder-input
Jun 9, 2026

Conversation


@jmolz jmolz commented Jun 7, 2026

Contributor

Summary

  • Validate decoded PDF bytes before invoking PyPDFLoader
  • Skip non-PDF librarian content such as HTML error pages without emitting page documents
  • Keep the existing inline and librarian paths covered by the PDF decoder unit tests

Root cause

The PDF decoder checked the MIME type in the librarian metadata but still trusted the fetched content bytes. If a URL returned an HTML error page while the metadata still identified the document as application/pdf, the decoder wrote those bytes to a .pdf temp file and handed them to the PDF loader.
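For illustration only, the kind of check described above can be sketched as follows. This is not the code from the merged commit: the names `looks_like_pdf`, `decode_pages`, and `load_pages` are hypothetical, and the sketch assumes that checking for the `%PDF-` header marker is a sufficient first-pass test of the bytes.

```python
from typing import Callable, List

PDF_MAGIC = b"%PDF-"  # a PDF file header begins with these bytes


def looks_like_pdf(data: bytes) -> bool:
    """Return True when the payload starts with the PDF header marker.

    Leading whitespace is tolerated; anything else (for example an HTML
    error page) is rejected.
    """
    return data[:1024].lstrip().startswith(PDF_MAGIC)


def decode_pages(data: bytes, load_pages: Callable[[bytes], List[str]]) -> List[str]:
    """Hypothetical decoder step: validate first, then delegate to a loader.

    `load_pages` stands in for whatever PDF loader is in use; it is only
    called when the bytes pass the header check.
    """
    if not looks_like_pdf(data):
        # Skip invalid input without emitting any page documents.
        return []
    return load_pages(data)
```

With this shape, an HTML error page served under an application/pdf label yields an empty result, and the loader never sees it. A stricter check could also attempt a full parse, at the cost of more work per document.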

Validation

  • PYTHONPATH=/private/tmp/trustgraph-test-shim /private/tmp/trustgraph-949-venv/bin/pytest tests/unit/test_decoding/test_pdf_decoder.py -q
  • /private/tmp/trustgraph-949-venv/bin/python -m py_compile trustgraph-flow/trustgraph/decoding/pdf/pdf_decoder.py tests/unit/test_decoding/test_pdf_decoder.py
  • git diff --check

Fixes #949.


github-actions Bot commented Jun 7, 2026


Contributor License Agreement ✅

All contributors have signed the CLA. Thank you!

@jmolz jmolz mentioned this pull request Jun 7, 2026
@cybermaggedon cybermaggedon self-assigned this Jun 9, 2026
@cybermaggedon cybermaggedon self-requested a review June 9, 2026 15:31

@cybermaggedon cybermaggedon left a comment

Contributor


Tidy change, the additional test is appreciated

@cybermaggedon cybermaggedon merged commit 28a51c2 into trustgraph-ai:master Jun 9, 2026
3 checks passed

jmolz commented Jun 9, 2026

Contributor Author

Thanks, appreciate the review and merge. Glad the focused test helped keep the fix tight.

@jmolz jmolz deleted the codex/validate-pdf-decoder-input branch June 9, 2026 16:24


Development

Successfully merging this pull request may close these issues.

Crash in document-decoder

2 participants