working with xfa forms #129

Open
vision10 wants to merge 3 commits into cantoo-scribe:master from vision10:master
@vision10 vision10 commented Dec 1, 2025

What?

First attempt at adding support for preserving XFA (XML Forms Architecture) forms and for extracting and modifying the JavaScript embedded in XFA templates.

Why?

By default, pdf-lib strips XFA data when loading and saving PDFs, causing these forms to lose all functionality. Additionally, there was no way to programmatically access or modify the JavaScript code embedded in XFA templates.

How?

  1. **Preserve XFA forms**

    • Added preserveXFA option to PDFDocument.load() and PDFDocument.save()
    • When enabled, preserves the entire XFA array structure from the AcroForm dictionary
    • Prevents XFA data loss during PDF modification
  2. XFA JavaScript Extraction (getXFAJavaScripts())

    • New method that extracts all JavaScript from XFA template XML
    • Parses compressed PDF streams and XML structure
    • Returns array of {field: string, event: string, script: string} objects
    • Handles newlines inside XFA closing tags (such as </script\n>)
    • Uses backward search to determine field and event context for each script
  3. XFA JavaScript Modification (setXFAJavaScript(field, event, script))

    • New method to modify specific scripts by field name and event name
    • Finds matching field/event in XML, replaces script content
    • Creates new compressed stream with modified XML
    • Returns boolean indicating success/failure
    • Preserves XFA structure and all other scripts
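
The extraction steps described above can be sketched as a pure string function. This is only an illustration, not the code from this PR: the names `XfaScript`, `lastAttr` and `extractScripts` are hypothetical, and it assumes the event name is carried by the `activity` attribute of the enclosing `<event>` element.

```typescript
// Hypothetical shape mirroring the {field, event, script} objects described above.
interface XfaScript {
  field: string;
  event: string;
  script: string;
}

// Backward search: return the given attribute of the LAST matching opening
// tag found in `text`, or "unknown" when there is none.
function lastAttr(text: string, tag: string, attr: string): string {
  const pattern = new RegExp(`<${tag}\\b[^>]*\\b${attr}="([^"]*)"`, "gi");
  let value = "unknown";
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(text)) !== null) {
    value = m[1];
  }
  return value;
}

// Collect every <script> body, tolerating whitespace before the closing ">".
function extractScripts(xml: string): XfaScript[] {
  const scripts: XfaScript[] = [];
  const scriptPattern = /<script\b[^>]*>([\s\S]*?)<\/script\s*>/gi;
  let m: RegExpExecArray | null;
  while ((m = scriptPattern.exec(xml)) !== null) {
    // Everything before this script is used to infer its context.
    const before = xml.slice(0, m.index);
    scripts.push({
      field: lastAttr(before, "field", "name"),
      event: lastAttr(before, "event", "activity"),
      script: m[1].trim(),
    });
  }
  return scripts;
}
```

Note that a backward search picks the most recently opened `<field>`, even if that field was already closed, so scripts attached to a subform can be attributed to the wrong field. A real XML parser would avoid this.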

Technical details:

  • XFA data is stored as alternating name/stream pairs in a PDFArray
  • Template section contains the JavaScript in XML <script> elements
  • Implemented a special regex pattern to tolerate whitespace in XFA closing tags (<\/script\s*> instead of <\/script>)
  • Added PDFRef dereferencing for XFA array lookup
  • Uses decodePDFRawStream to handle compressed streams
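
To make the alternating name/stream layout and the reference lookup concrete, here is a toy model that does not use pdf-lib at all. Everything in it is an assumption made for illustration: `Ref`, `Entry`, `resolve` and `findPacket` are hypothetical names, packet names are plain strings, and stream contents are plain byte arrays.

```typescript
// Toy stand-ins for PDF objects: a name, raw stream bytes, or an indirect reference.
type Ref = { ref: number };
type Entry = string | Uint8Array | Ref;

function isRef(e: Entry): e is Ref {
  return typeof e === "object" && !(e instanceof Uint8Array);
}

// Follow indirect references until a direct object is reached.
function resolve(entry: Entry, objects: Map<number, Entry>): Entry {
  let current = entry;
  let hops = 0;
  while (isRef(current)) {
    if (++hops > 32) throw new Error("Reference chain too long");
    const next = objects.get(current.ref);
    if (next === undefined) throw new Error(`Dangling reference ${current.ref}`);
    current = next;
  }
  return current;
}

// The XFA array is read two entries at a time: [name, stream, name, stream, ...].
function findPacket(
  xfa: Entry[],
  name: string,
  objects: Map<number, Entry>,
): Uint8Array | undefined {
  for (let i = 0; i + 1 < xfa.length; i += 2) {
    if (resolve(xfa[i], objects) !== name) continue;
    const value = resolve(xfa[i + 1], objects);
    return value instanceof Uint8Array ? value : undefined;
  }
  return undefined;
}
```

In this model, `findPacket(xfa, "template", objects)` returns the bytes of the template packet whether the array stores the stream directly or through a reference.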

Testing?

  1. Unit Tests (7 tests in PDFDocumentXFA.spec.ts)

    • ✅ Extract XFA JavaScript from template (29 scripts from test PDF)
    • ✅ Returns empty array for non-XFA PDFs
    • ✅ Can modify XFA JavaScript
    • ✅ Returns false when modifying non-existent field
    • ✅ Preserves XFA structure after modification
    • ✅ Can save and reload PDF with modified XFA JavaScript
    • ✅ Extracts scripts from multiple events on same field
    • Uses assets/pdfs/with_xfa_fields.pdf (included in repo)
  2. Integration Testing

    • Tested with my own complex PDF and the ready-made one from the tests
    • Save/reload cycle preserves all modifications

New Dependencies?

No new production dependencies. The implementation uses existing dependencies:

  • pako (already in dependencies) - for stream compression/decompression
  • All XFA functionality built using existing pdf-lib core modules

Screenshots

Suggested Reading?

  • PDF 1.7 Specification Section 12.7.8 (Interactive Forms - XFA)
  • Adobe XFA Specification 3.3
  • AcroForm dictionary structure (Section 12.7.2)
  • Stream objects and filters (Sections 7.3.8 and 7.4)

Anything Else?

Documentation updates

README.md Outdated
// Make modifications...

// Save with XFA preservation
const pdfBytes = await pdfDoc.save({
Collaborator

Unless I missed it, this is not yet implemented in this PR

* @param newScript The new JavaScript code to set
* @returns True if the script was found and updated, false otherwise
*/
setXFAJavaScript(
Collaborator

You will probably need to escape xml from newScript.
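
A minimal sketch of the escaping suggested here, assuming the script is written as plain element content (not inside a CDATA section). `escapeXml` is a hypothetical helper name, not something from this PR.

```typescript
// Escape the characters that are unsafe in XML element content.
// "&" must be replaced first so that the entities added afterwards
// are not escaped a second time.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}
```

Without this, a script such as `if (a < b && c > 1) {}` would produce XML that is not well-formed once it is written back into the template. The extraction side would then need the matching unescape step.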


// Find and replace the script
// Note: XFA uses newlines in closing tags like </script\n>
const fieldPattern = new RegExp(
Collaborator

You should be able to use XML parsing instead of a regex

: acroForm;

const xfa = formDict.get(PDFName.of('XFA'));
if (!xfa || !(xfa instanceof PDFArray)) {
Collaborator

This will return false if xfa is a PDFRef


// Create new stream with modified XML
const newXmlBytes = new TextEncoder().encode(xmlString);
const newStream = this.context.stream(newXmlBytes);
Collaborator

If the stream is compressed, you'll need to call flateStream instead of stream
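
To illustrate what a Flate-compressed stream body holds, here is a sketch using Node's built-in `zlib` module instead of pdf-lib or pako. The helper names `deflateXml` and `inflateXml` are hypothetical. It only shows the byte-level round trip, and it assumes the stream dictionary separately declares the FlateDecode filter.

```typescript
import { deflateSync, inflateSync } from "zlib";

// Compress XML text into zlib-wrapped deflate data, the format that
// the PDF FlateDecode filter expects.
function deflateXml(xml: string): Uint8Array {
  return deflateSync(Buffer.from(xml, "utf8"));
}

// Reverse step: inflate the stream bytes and decode them as UTF-8 text.
function inflateXml(data: Uint8Array): string {
  return inflateSync(data).toString("utf8");
}
```

Writing uncompressed bytes into a stream whose dictionary still says FlateDecode, or the reverse, would leave the packet unreadable, which is presumably the concern behind this comment.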

const fieldNameMatch = beforeScript.match(
/<field[^>]*name="([^"]*)"[^>]*>/gi,
);
const fieldName = fieldNameMatch
Collaborator

Returning unknown for multiple fields might cause a problem when using setXFAJavaScript, no?

* ```
* @returns An array of objects containing script names and their JavaScript code.
*/
getDocumentJavaScripts(): Array<{ name: string; script: string }> {
Collaborator

This doesn't appear in the readme

? nameStr.substring(1)
: nameStr;
// Decode hex sequences like #28 -> (
return withoutSlash.replace(/#([\dA-Fa-f]{2})/g, (_, hex) =>
Collaborator

Hum... Seems fragile.
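
One possible source of fragility is that decoding each `#XX` escape on its own breaks names containing multi-byte UTF-8 sequences. The sketch below is an assumption about what a sturdier version could look like, not code from this PR. `decodePdfName` is a hypothetical name, and the input is assumed to hold one byte per character.

```typescript
// Decode a PDF name such as "/A#28b#29" into text.
// All bytes are collected first and then decoded together as UTF-8,
// so an escaped multi-byte character like "#C3#A9" decodes correctly.
function decodePdfName(raw: string): string {
  const body = raw.charAt(0) === "/" ? raw.slice(1) : raw;
  const bytes: number[] = [];
  let i = 0;
  while (i < body.length) {
    const hex = body.slice(i + 1, i + 3);
    if (body.charAt(i) === "#" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 3;
    } else {
      // A "#" not followed by two hex digits is kept literally.
      bytes.push(body.charCodeAt(i) & 0xff);
      i += 1;
    }
  }
  const percentEncoded = bytes
    .map((b) => "%" + (b < 16 ? "0" : "") + b.toString(16))
    .join("");
  try {
    return decodeURIComponent(percentEncoded);
  } catch (_err) {
    // Not valid UTF-8: fall back to one character per byte.
    return String.fromCharCode.apply(null, bytes);
  }
}
```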

* ```
* @returns An array of objects containing field names, events, and JavaScript code.
*/
getXFAJavaScripts(): Array<{ field: string; event: string; script: string }> {
Collaborator

It is probably possible to share parts of this function with setXFAJavaScript

fieldName: string,
eventName: string,
newScript: string,
): boolean {
Collaborator

I wonder if it would not be better to throw an error with details about why it failed instead of returning false. At least, we should consider logging it
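
As a rough idea of what throwing with details could look like, here is a small sketch. `XfaScriptError`, `ScriptSlot` and `assertScriptExists` are hypothetical names, and the list of known field/event pairs is assumed to come from whatever the extraction step returns.

```typescript
type XfaFailure = "field-not-found" | "event-not-found";

// Error carrying a machine-readable reason next to the human-readable message.
class XfaScriptError extends Error {
  readonly reason: XfaFailure;

  constructor(reason: XfaFailure, message: string) {
    super(message);
    this.name = "XfaScriptError";
    this.reason = reason;
  }
}

interface ScriptSlot {
  field: string;
  event: string;
}

// Throw a descriptive error when the requested field/event pair does not exist.
function assertScriptExists(slots: ScriptSlot[], field: string, event: string): void {
  const forField = slots.filter((s) => s.field === field);
  if (forField.length === 0) {
    throw new XfaScriptError(
      "field-not-found",
      `No XFA script is attached to field "${field}"`,
    );
  }
  if (!forField.some((s) => s.event === event)) {
    const known = forField.map((s) => s.event).join(", ");
    throw new XfaScriptError(
      "event-not-found",
      `Field "${field}" has no "${event}" script (found: ${known})`,
    );
  }
}
```

Callers who prefer the boolean behaviour can still wrap the call in `try`/`catch`, while the default path reports which part of the lookup failed.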

@vision10
Author

Thank you for the input. I'm not very familiar with the PDF spec.
I did some refactoring, hope it's better now.
