
Added functionality to more easily extract plain text #3

Closed
TheOriginalBytePlayer wants to merge 6 commits into omonien:master from TheOriginalBytePlayer:master

@TheOriginalBytePlayer

This pull request introduces several improvements to the PDF document API, focusing on enhancing the usability and functionality of the TPdfDocument class and exposing additional PDFium text extraction capabilities. The most important changes include updating the way PDF documents are loaded, adding indexed page access, and exposing a new function for retrieving character bounding boxes.

Enhancements to PDF document loading and access:

  • Changed TPdfDocument.LoadFromFile from a procedure to a function that returns a boolean indicating success, and updated its implementation to set the result based on loading status and page count.
  • Added an indexed Pages property to TPdfDocument for direct access to pages by index.

PDFium API exposure:

  • Added a new external procedure FPDFText_GetCharBox to expose PDFium's character bounding box retrieval functionality.

Copilot AI review requested due to automatic review settings January 13, 2026 21:38

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Review comment threads (collapsed): 9 on src/DX.Pdf.Document.pas (4 outdated), 1 on src/DX.Pdf.API.pas (outdated)
TheOriginalBytePlayer and others added 5 commits January 13, 2026 14:59
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@omonien
Owner

omonien commented Mar 31, 2026

Hi @TheOriginalBytePlayer, thanks for your interest in the project and for taking the time to submit this PR!

After careful review, I've decided not to merge this in its current form for the following reasons:

  1. FPDFText_GetText is already in master - this declaration would be a duplicate.
  2. Breaking API change: Changing LoadFromFile from procedure to function: Boolean breaks all existing callers. The current exception-based error handling via EPdfLoadException is the intended pattern and provides richer error information than a boolean return value.
  3. Pages[] indexed property: Returning a TPdfPage that the caller must free from an indexed property is a memory-leak risk - users typically don't expect ownership transfer from property getters. I do like the idea though and will implement a cached version where the document manages the page lifecycle internally.
  4. .gitignore: The /lib entry was added twice.

I'll be incorporating a properly designed Pages[] accessor with internal caching in an upcoming commit. Thanks again for your contribution!
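
For illustration only, here is a minimal sketch of what a document-owned page cache behind an indexed property might look like in Delphi. It is not taken from the DX.Pdf sources: TSketchDocument and TSketchPage are hypothetical stand-ins for TPdfDocument and TPdfPage, and no PDFium calls are made. The point it tries to show is that the getter hands out a borrowed reference while the document frees every page it created.

```pascal
program PageCacheSketch;

{$APPTYPE CONSOLE}
{$ASSERTIONS ON}

uses
  System.SysUtils,
  System.Generics.Collections;

type
  // Hypothetical stand-in for a page wrapper; not the real TPdfPage.
  TSketchPage = class
  private
    FIndex: Integer;
  public
    constructor Create(AIndex: Integer);
    property Index: Integer read FIndex;
  end;

  // Hypothetical stand-in for a document that owns its page objects.
  TSketchDocument = class
  private
    FPageCount: Integer;
    FPages: TObjectDictionary<Integer, TSketchPage>;
    function GetPage(AIndex: Integer): TSketchPage;
  public
    constructor Create(APageCount: Integer);
    destructor Destroy; override;
    property PageCount: Integer read FPageCount;
    // Callers borrow the returned page and must not free it.
    property Pages[AIndex: Integer]: TSketchPage read GetPage; default;
  end;

constructor TSketchPage.Create(AIndex: Integer);
begin
  inherited Create;
  FIndex := AIndex;
end;

constructor TSketchDocument.Create(APageCount: Integer);
begin
  inherited Create;
  FPageCount := APageCount;
  // doOwnsValues: the dictionary frees each cached page when it is destroyed.
  FPages := TObjectDictionary<Integer, TSketchPage>.Create([doOwnsValues]);
end;

destructor TSketchDocument.Destroy;
begin
  FPages.Free;
  inherited;
end;

function TSketchDocument.GetPage(AIndex: Integer): TSketchPage;
begin
  if (AIndex < 0) or (AIndex >= FPageCount) then
    raise EArgumentOutOfRangeException.CreateFmt(
      'Page index %d out of range', [AIndex]);
  // Create the page on first access, then keep handing out the same instance.
  if not FPages.TryGetValue(AIndex, Result) then
  begin
    Result := TSketchPage.Create(AIndex);
    FPages.Add(AIndex, Result);
  end;
end;

var
  Doc: TSketchDocument;
begin
  Doc := TSketchDocument.Create(3);
  try
    // Repeated access yields the cached instance; the caller frees nothing.
    Assert(Doc.Pages[1] = Doc.Pages[1]);
    Writeln(Doc[2].Index);
  finally
    Doc.Free; // releases all cached pages
  end;
end.
```

In this sketch an out-of-range index raises an exception rather than returning nil, which mirrors the exception-based error handling mentioned in point 2; a real implementation would also need to decide when cached pages are invalidated, for example when another file is loaded.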

@omonien omonien closed this Mar 31, 2026