Skip to content

Conversation

@k00ni
Copy link
Contributor

@k00ni k00ni commented Oct 28, 2025

Summary

This PR contains a few suggestions regarding the samples part of the test environment. You might had further thoughts since we spoke last time. These changes would help me when providing "broken" sample files in the future, because the amount and kind of information I have to provide was reduced and I can pinpoint "problems" more precisely.

Feel free to adapt/remove the code provided.

Adaptions

Added textPartsExpectedSomewhereInTheExtractedText

I added textPartsExpectedSomewhereInTheExtractedText as a new property to contents.yml. It is either empty/not there or a non-empty list of strings which are expected to be somewhere in the extracted text. This structure helps me to pinpoint certain text parts which are not extracted properly.

I've added the content of #273 to demonstrate the usage of this property. The related test-part fails currently:
https://github.com/PrinsFrank/pdfparser/actions/runs/18876963633/job/53869189967?pr=277#step:5:21

#273 can be closed if we use its content in this PR.

contents.yml: Only check if a value is not null

I suggest only checking values if they are different from null. In many cases I have no idea what I have to insert into the metadata, such as creation date or producer because the PDF in question is not mine. Only checking when its different from null allows us to construct real life cases but also don't have to bother, if these information are not available.

After this particular online service we talked about is available this might be obsolete :)

FileInfo

I also had to "fix" the date related part of FileInfo.php because I got the following error:

Message: PrinsFrank\PdfParser\Tests\Samples\Info\FileInfo::__construct():
Argument #8 ($creationDate) must be of type ?DateTimeImmutable, string given, called in /var/www/html/tests/Samples/Info/SampleProvider.php on line 33
Location: /var/www/html/tests/Samples/Info/FileInfo.php:12

Pages property made optional

Also made pages-property optional, so I don't have to input the raw content necessarily when creating such a file. Especially for PDFs with many pages this becomes cumbersome.

Questions

  1. What is the difference between a producer and a creator of PDF (contents.yml)?

@codecov-commenter
Copy link

codecov-commenter commented Oct 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.55%. Comparing base (5fa35ac) to head (b0185aa).

Additional details and impacted files
@@            Coverage Diff            @@
##               main     #277   +/-   ##
=========================================
  Coverage     52.55%   52.55%           
  Complexity     1685     1685           
=========================================
  Files            99       99           
  Lines          3048     3048           
=========================================
  Hits           1602     1602           
  Misses         1446     1446           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

creationDate: 2025-10-06T10:48:53.000000+0200
modificationDate: null
textPartsExpectedSomewhereInTheExtractedText:
- Liposuktion bei Lipödem
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This text part is from the related issue #272

Current only ipödem is extracted.

Image

Comment on lines +33 to +42
$creationDate = null;
if (is_string($content->creationDate)) {
$creationDate = new DateTimeImmutable($content->creationDate);
}

$modificationDate = null;
if (is_string($content->modificationDate)) {
$modificationDate = new DateTimeImmutable($content->modificationDate);
}

Copy link
Contributor Author

@k00ni k00ni Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without these I receive the following error in my local test setup:

Message: PrinsFrank\PdfParser\Tests\Samples\Info\FileInfo::__construct():
Argument #8 ($creationDate) must be of type ?DateTimeImmutable, string given, called in /var/www/html/tests/Samples/Info/SampleProvider.php on line 33
Location: /var/www/html/tests/Samples/Info/FileInfo.php:12

@k00ni k00ni force-pushed the feature/pdf-samples-extended branch from bc523a9 to b0185aa Compare November 7, 2025 07:35
@k00ni
Copy link
Contributor Author

k00ni commented Nov 10, 2025

@PrinsFrank Friendly ping: is that helpful and goes in the right direction? If not, could you please give me an indication of which direction you are going?

@PrinsFrank
Copy link
Owner

@k00ni Thanks again for your contribution! Sorry I didn't reply to you before, I'm a bit overwhelmed with the work that needs to be done on this project and spent some time on other things. ;)

I'm a bit torn on these changes. On one hand this will make it easier for people to open PRs with samples. On the other hand, it decreases the usability of the samples folder slightly, where after this change it cannot be used to fully check parse differences between revisions.

I'm still not sure where I want to go with the samples. Right now I use them to validate any code changes I do with regards to text sorting based on x/y values and the resulting text output, and to validate that images are correctly exported. To do that with the test samples we have, we'll need to have a detailed "current" state. That can only be the total state, because otherwise we cannot see when slight reordering happens for example.

One option I have been thinking about, is to automatically parse PDFs in issues that have been marked as "I agree that this file can be used for testing and have the authority to do so", to automatically open a PR with both that file and the current output as a test file. That does require some work, especially because it needs to be safe and not allow token extraction in Github Actions.

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants