-
-
Notifications
You must be signed in to change notification settings - Fork 9
Extended samples related test code #277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #277 +/- ##
=========================================
Coverage 52.55% 52.55%
Complexity 1685 1685
=========================================
Files 99 99
Lines 3048 3048
=========================================
Hits 1602 1602
Misses 1446 1446 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| creationDate: 2025-10-06T10:48:53.000000+0200 | ||
| modificationDate: null | ||
| textPartsExpectedSomewhereInTheExtractedText: | ||
| - Liposuktion bei Lipödem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| $creationDate = null; | ||
| if (is_string($content->creationDate)) { | ||
| $creationDate = new DateTimeImmutable($content->creationDate); | ||
| } | ||
|
|
||
| $modificationDate = null; | ||
| if (is_string($content->modificationDate)) { | ||
| $modificationDate = new DateTimeImmutable($content->modificationDate); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without these I receive the following error in my local test setup:
Message: PrinsFrank\PdfParser\Tests\Samples\Info\FileInfo::__construct():
Argument #8 ($creationDate) must be of type ?DateTimeImmutable, string given, called in /var/www/html/tests/Samples/Info/SampleProvider.php on line 33
Location: /var/www/html/tests/Samples/Info/FileInfo.php:12
* Implement ASCII85Decode * Add sample from issue PrinsFrank#153 with ASCII85Decode filter
bc523a9 to
b0185aa
Compare
|
@PrinsFrank Friendly ping: is that helpful and goes in the right direction? If not, could you please give me an indication of which direction you are going? |
|
@k00ni Thanks again for your contribution! Sorry I didn't reply to you before, I'm a bit overwhelmed with the work that needs to be done on this project and spent some time on other things. ;) I'm a bit torn on these changes. On one hand this will make it easier for people to open PRs with samples. On the other hand, it decreases the usability of the samples folder slightly, where after this change it cannot be used to fully check parse differences between revisions. I'm still not sure where I want to go with the samples. Right now I use them to validate any code changes I do with regards to text sorting based on x/y values and the resulting text output, and to validate that images are correctly exported. To do that with the test samples we have, we'll need to have a detailed "current" state. That can only be the total state, because otherwise we cannot see when slight reordering happens for example. One option I have been thinking about, is to automatically parse PDFs in issues that have been marked as "I agree that this file can be used for testing and have the authority to do so", to automatically open a PR with both that file and the current output as a test file. That does require some work, especially because it needs to be safe and not allow token extraction in Github Actions. What do you think? |

Summary
This PR contains a few suggestions regarding the samples part of the test environment. You might had further thoughts since we spoke last time. These changes would help me when providing "broken" sample files in the future, because the amount and kind of information I have to provide was reduced and I can pinpoint "problems" more precisely.
Feel free to adapt/remove the code provided.
Adaptions
Added textPartsExpectedSomewhereInTheExtractedText
I added
textPartsExpectedSomewhereInTheExtractedTextas a new property to contents.yml. It is either empty/not there or a non-empty list of strings which are expected to be somewhere in the extracted text. This structure helps me to pinpoint certain text parts which are not extracted properly.I've added the content of #273 to demonstrate the usage of this property. The related test-part fails currently:
https://github.com/PrinsFrank/pdfparser/actions/runs/18876963633/job/53869189967?pr=277#step:5:21
#273 can be closed if we use its content in this PR.
contents.yml: Only check if a value is not null
I suggest only checking values if they are different from null. In many cases I have no idea what I have to insert into the metadata, such as creation date or producer because the PDF in question is not mine. Only checking when its different from null allows us to construct real life cases but also don't have to bother, if these information are not available.
After this particular online service we talked about is available this might be obsolete :)
FileInfo
I also had to "fix" the date related part of FileInfo.php because I got the following error:
Pages property made optional
Also made pages-property optional, so I don't have to input the raw content necessarily when creating such a file. Especially for PDFs with many pages this becomes cumbersome.
Questions
contents.yml)?