revpdf

revpdf is a triage-and-recovery toolkit for PDFs saved with incremental updates. It helps you inspect a PDF's edit history and safely roll it back to an earlier revision when later saves added unwanted markup.

Installation

pip install revpdf

Quick Start

1. Inspect a PDF

revpdf-inspect input.pdf

This prints revision numbers, byte ranges, and counts of suspicious annotation markers.

2. List object-level changes per revision

revpdf-list input.pdf

This identifies which objects each revision appends and distinguishes likely annotation-only changes from content-affecting changes.

3. Extract a chosen revision

revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf

4. Automated Sanitization (New)

revpdf-clean input.pdf --output cleaned.pdf --strategies acrobat samsung --manifest integrity.json

This uses best-guess heuristics to automatically identify and surgically remove platform-specific markup while regenerating the XRef stream for a clean file.

Forensic Features:

--manifest (-m): Generates a machine-readable JSON report mapping original object IDs to their cryptographic hashes.
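Global Signature: A cumulative SHA-256 fingerprint of all kept content that was hashed. Matching signatures indicate that the hashed objects of the original textbook data are byte-identical. Objects in the skipped tier (see Tiered Hashing) are not covered unless hashing is enabled for them.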

Forensic Integrity Module

revpdf is designed for high-assurance environments where data provenance is critical.

Tiered Hashing

To maintain high performance on large files, the integrity engine uses a tiered approach:

Hashed: Text content streams, page geometry, and structural objects.
Skipped: Large binary assets like embedded images and fonts (can be enabled via API).
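
The tiered approach can be sketched with the standard library alone. This is an illustration, not revpdf's implementation: the object kinds, the (object_id, kind, raw_bytes) tuple layout, the include_binary switch, and the way per-object hashes are folded into one signature are all assumptions made for the example.

```python
import hashlib

# Hypothetical kind labels for the skipped tier; revpdf's real
# classification is not documented in this README.
SKIPPED_KINDS = {"image", "font"}


def tiered_hashes(objects, include_binary=False):
    """Return {object_id: sha256 hex digest} for objects in the hashed tier.

    `objects` is an iterable of (object_id, kind, raw_bytes) tuples.
    """
    hashes = {}
    for object_id, kind, raw in objects:
        if kind in SKIPPED_KINDS and not include_binary:
            continue  # skipped tier: large binary assets
        hashes[object_id] = hashlib.sha256(raw).hexdigest()
    return hashes


def global_signature(hashes):
    """Fold per-object hashes into one digest, in object-ID order (assumed scheme)."""
    digest = hashlib.sha256()
    for object_id in sorted(hashes):
        digest.update(f"{object_id}:{hashes[object_id]}\n".encode("ascii"))
    return digest.hexdigest()
```

Sorting by object ID keeps the signature independent of the order in which objects are visited, which matters if two runs walk the file differently.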

Integrity Manifest (JSON)

The manifest provides a verifiable audit trail:

global_signature: The unique hash for the entire "cleaned" document state.
object_hashes: A dictionary of New_ID -> SHA-256_Hash.
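
A consumer of the manifest might start with a shape check like the one below. It assumes only the two field names listed above and that digests are stored as hexadecimal strings. The function name and the wording of the messages are hypothetical, and the real manifest may contain more fields.

```python
import json
import re

SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def manifest_problems(manifest_text):
    """Return a list of problems; an empty list means the shape looks right."""
    manifest = json.loads(manifest_text)
    problems = []

    signature = manifest.get("global_signature")
    if not isinstance(signature, str) or not SHA256_HEX.fullmatch(signature.lower()):
        problems.append("global_signature is not a SHA-256 hex digest")

    object_hashes = manifest.get("object_hashes")
    if not isinstance(object_hashes, dict):
        problems.append("object_hashes is not a dictionary")
        return problems

    # Every entry should map a new object ID to a 64-character hex digest.
    for new_id, digest in object_hashes.items():
        if not isinstance(digest, str) or not SHA256_HEX.fullmatch(digest.lower()):
            problems.append(f"object {new_id}: not a SHA-256 hex digest")
    return problems
```

This only validates structure. Proving that the content is unchanged still requires recomputing the hashes from the cleaned file with the same rules the tool used.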

Developer SDK (v0.2.0+)

revpdf now provides a tiered Python SDK with a high-performance Rust backend. It supports asyncio for non-blocking I/O.

Object-Model API

Designed for ease of use and integration into Python workflows.

import asyncio
from revpdf import PdfDocument, Sanitizer

async def main():
    # Load document (Lazy loading handled by Rust)
    doc = await PdfDocument.open("textbook.pdf")
    
    # Apply automated sanitization strategies
    sanitizer = Sanitizer(strategies=["acrobat", "samsung"])
    removed_count = await sanitizer.apply(doc)
    print(f"Identified {removed_count} objects for removal")
    
    # Surgical Save (Modern XRef Stream regeneration)
    manifest = await doc.save("cleaned.pdf", surgical=True)
    print("Surgical Save Complete!")
    print(f"Global Document Signature: {manifest.global_signature}")
    print(f"Verified Objects: {len(manifest.object_hashes)}")

if __name__ == "__main__":
    asyncio.run(main())

The Workflow

When this works well

This method is appropriate when the PDF was modified by incremental saves. In that format, each save appends a new revision to the end of the file.

Typical signs:

multiple %%EOF markers in the file
trailer dictionaries with /Prev pointers
later revisions containing annotation markers such as /Type /Annot, /Subtype /Stamp, /InkList, /AAPL:AKExtras, or /PPKType (draw)
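
A quick way to check for these signs is to count them in the raw bytes. The sketch below is a rough first pass with hypothetical function names. It looks at the whole file rather than at individual revisions, and it cannot see markers hidden inside compressed streams.

```python
import re

ANNOTATION_MARKERS = [
    b"/Type /Annot",
    b"/Subtype /Stamp",
    b"/InkList",
    b"/AAPL:AKExtras",
    b"/PPKType (draw)",
]


def incremental_save_signs(data):
    """Count the typical signs of incremental saves in raw PDF bytes."""
    return {
        "eof_markers": data.count(b"%%EOF"),
        "prev_pointers": len(re.findall(rb"/Prev\s+\d+", data)),
        "annotation_markers": {
            marker.decode("latin1"): data.count(marker)
            for marker in ANNOTATION_MARKERS
        },
    }


def looks_incremental(signs):
    """More than one %%EOF plus at least one /Prev pointer suggests appended revisions."""
    return signs["eof_markers"] > 1 and signs["prev_pointers"] >= 1
```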

Safety Rule

Only roll back to an earlier revision if the later revisions contain unwanted annotations or annotation appearance streams and do not replace the actual textbook page content you need to keep.

Do not roll back blindly if later revisions also change:

page content streams
text objects
fonts
images
page tree structure for real content changes

If those appear in the later revisions, you need a more careful repair strategy.

The Manual Workflow

1. Find the revision boundaries

Each incremental revision normally ends with %%EOF.

Run:

python3 - <<'PY'
from pathlib import Path

data = Path("input.pdf").read_bytes()
cursor = 0
index = 1
while True:
    pos = data.find(b"%%EOF", cursor)
    if pos == -1:
        break
    print(f"Revision {index}: EOF at byte {pos}")
    cursor = pos + 1
    index += 1
PY

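If you see more than one %%EOF, the file most likely contains multiple revisions. Confirm this with the trailer chain in the next step, because linearized PDFs also contain a second %%EOF near the start of the file even when they were saved only once.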

2. Inspect the trailer chain

A later revision's trailer usually points back to the previous revision's cross-reference section through a /Prev entry, whose value is a byte offset.

Run:

python3 - <<'PY'
from pathlib import Path

data = Path("input.pdf").read_bytes()
cursor = 0
while True:
    pos = data.find(b"%%EOF", cursor)
    if pos == -1:
        break
    snippet = data[max(0, pos - 400):pos + 20]
    print(snippet.decode("latin1", "replace"))
    print("-----")
    cursor = pos + 1
PY

This helps confirm that the file was saved incrementally rather than rewritten from scratch.
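
Instead of reading the trailer snippets by eye, the startxref and /Prev values can be pulled out with regular expressions. The sketch below is a simplification with a hypothetical function name. It assumes the values appear as plain text before each %%EOF, and it does not handle hybrid or linearized layouts specially.

```python
import re


def trailer_chain(data):
    """Return one entry per %%EOF with the startxref and /Prev values found before it."""
    entries = []
    start = 0
    for match in re.finditer(rb"%%EOF", data):
        segment = data[start:match.start()]
        startxref = re.findall(rb"startxref\s+(\d+)", segment)
        prev = re.findall(rb"/Prev\s+(\d+)", segment)
        entries.append({
            "eof_offset": match.start(),
            # Use the last occurrence in the segment: it belongs to this revision's trailer.
            "startxref": int(startxref[-1]) if startxref else None,
            "prev": int(prev[-1]) if prev else None,
        })
        start = match.end()
    return entries
```

In a simple incremental chain, each revision's /Prev value equals the startxref value of the revision before it, and the first revision has no /Prev at all.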

3. Search for suspicious annotation markers

Search the raw PDF bytes for common overlay markers:

rg -a -n '/Subtype /(Stamp|Ink|FreeText|Square|Circle|Highlight)|/InkList|/AAPL:AKExtras|/PPKType \(draw\)|Mobile User' input.pdf

Common signs of hand-drawn markup include:

/Subtype /Stamp
/Subtype /Ink
/InkList
/PPKType (draw)
/AAPL:AKExtras
Mobile User

4. Compare revisions, not just the whole file

The important question is not whether the PDF contains annotations somewhere. The important question is when those objects first appear.

For each appended revision, inspect whether it adds:

only annotation objects and annotation appearance streams
page /Annots references pointing to those annotations

That is usually safe to roll back.

If the appended revision adds or replaces actual page contents, treat it as unsafe for blind rollback.
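
A crude version of this check can be written as a byte-level heuristic. The sketch below borrows the two assessment labels that revpdf-list reports, but the patterns and the decision rule are assumptions for illustration and are not the tool's actual logic. It deliberately ignores redefined /Type /Page objects, because adding /Annots to a page requires rewriting that page object.

```python
import re

# Patterns that hint at annotation objects.
ANNOT_HINTS = [rb"/Type\s*/Annot\b", rb"/Subtype\s*/(Stamp|Ink)\b", rb"/InkList\b"]
# Patterns that hint at real content changes: fonts, images, page tree nodes.
CONTENT_HINTS = [rb"/Type\s*/Font", rb"/Subtype\s*/Image\b", rb"/Type\s*/Pages\b"]


def assess_revision(revision_bytes):
    """Classify the bytes appended by one revision."""
    has_annot = any(re.search(p, revision_bytes) for p in ANNOT_HINTS)
    has_content = any(re.search(p, revision_bytes) for p in CONTENT_HINTS)
    if has_annot and not has_content:
        return "likely_annotation_only"
    # Conservative default: anything unclear counts as unsafe for blind rollback.
    return "content_affecting_or_mixed"
```

The heuristic errs on the side of caution. A FreeText annotation that embeds its own font, for example, would be reported as mixed and would need a manual look.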

5. Choose the rollback target

Choose the last revision before the unwanted markers first appear.

Example logic:

revision 1: no unwanted annotation markers
revision 2: unwanted drawing markers appear
revision 3: more of the same unwanted drawing markers

In that case, revision 1 is the clean rollback target.
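
The same selection logic, expressed as a small function. The input is assumed to be a list with one unwanted-marker count per revision, in file order. The function name is hypothetical.

```python
def choose_rollback_target(marker_counts):
    """Return the 1-based index of the last revision before unwanted markers appear.

    Returns None when the list is empty or when revision 1 already contains
    markers, because rollback cannot help in either case.
    """
    for index, count in enumerate(marker_counts, start=1):
        if count > 0:
            return index - 1 if index > 1 else None
    # No markers anywhere: the latest revision is already clean.
    return len(marker_counts) or None
```

For the example above, the counts could be [0, 4, 2], which selects revision 1.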

6. Extract the earlier revision into a new file

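Once you know the correct revision boundary, copy the file up to and including that revision's final %%EOF marker. The script in step 1 prints the offset where the marker starts, so the safe end offset is that value plus 5 bytes for the marker itself, plus any line ending that follows it.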

Manual example:

head -c <SAFE_END_OFFSET> input.pdf > cleaned.pdf

Do this into a new output file. Leave the original untouched.
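
The end offset for a given revision can also be computed in Python. The sketch below is a minimal stand-in for the head -c step, with hypothetical function names. It treats every %%EOF in the file as a revision boundary, which is the same simplification the earlier steps make.

```python
def revision_end_offset(data, revision_index):
    """Return the offset just past the Nth %%EOF (1-based) and its line ending."""
    if revision_index < 1:
        raise ValueError("revision_index is 1-based")
    marker = b"%%EOF"
    pos = -1
    for _ in range(revision_index):
        pos = data.find(marker, pos + 1)
        if pos == -1:
            raise ValueError("fewer %%EOF markers than requested")
    end = pos + len(marker)
    # Keep a single trailing line ending (CRLF, LF, or CR) if one is present.
    if data[end:end + 2] == b"\r\n":
        end += 2
    elif data[end:end + 1] in (b"\n", b"\r"):
        end += 1
    return end


def truncate_to_revision(data, revision_index):
    """Return the bytes of the file through the chosen revision."""
    return data[:revision_end_offset(data, revision_index)]
```

Write the returned bytes to a new path, for example with pathlib's Path.write_bytes, so the original file stays untouched.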

7. Verify the cleaned file

Run:

pdfinfo cleaned.pdf

Then confirm the unwanted markers are gone:

python3 - <<'PY'
from pathlib import Path

data = Path("cleaned.pdf").read_bytes()
for token in [
    b"Mobile User",
    b"/PPKType (draw)",
    b"/Subtype /Stamp",
    b"/Subtype /Ink",
    b"/AAPL:AKExtras",
]:
    print(token.decode("latin1"), data.count(token))
PY

Check:

page count still matches what you expect
the cleaned file opens normally
the unwanted annotation markers are gone
the original content remains intact
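
For a rollback made by truncation, two of these checks can be automated: the cleaned file must be an exact byte prefix of the original, and the marker counts must be zero. The sketch below uses hypothetical names and applies only to truncation, not to files rewritten by revpdf-clean. Page count and visual checks still need pdfinfo and a viewer.

```python
DEFAULT_MARKERS = [
    b"Mobile User",
    b"/PPKType (draw)",
    b"/Subtype /Stamp",
    b"/Subtype /Ink",
    b"/AAPL:AKExtras",
]


def verify_rollback(original, cleaned, markers=DEFAULT_MARKERS):
    """Check that `cleaned` is an untouched prefix of `original` with no markers left."""
    remaining = {}
    for marker in markers:
        count = cleaned.count(marker)
        if count:
            remaining[marker.decode("latin1")] = count
    return {
        "is_prefix_of_original": original.startswith(cleaned),
        "ends_at_eof_marker": cleaned.rstrip(b"\r\n").endswith(b"%%EOF"),
        "remaining_markers": remaining,
    }
```

A prefix match means the kept revision was copied without modification, which is the strongest evidence truncation can give that the original content is intact.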

Reusable Commands

The package includes four core commands:

revpdf-inspect
revpdf-extract
revpdf-list
revpdf-clean

Inspect a PDF

revpdf-inspect input.pdf

This prints:

revision number
revision byte range
end offset
trailer /Prev
trailer /Size
counts of suspicious annotation markers in that revision
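
The byte ranges and per-revision marker counts can be approximated with a few lines of Python. This is a sketch under a simplifying assumption, namely that every %%EOF ends a revision. It is not the logic of revpdf-inspect, and the function names are hypothetical.

```python
def revision_ranges(data):
    """Return (start, end) byte ranges, one per %%EOF marker, end exclusive."""
    marker = b"%%EOF"
    ranges = []
    start = 0
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        ranges.append((start, end))
        start = end
        pos = data.find(marker, end)
    return ranges


def marker_counts_per_revision(data, markers):
    """Count how many marker occurrences each revision's byte range contains."""
    return [
        sum(data[start:end].count(marker) for marker in markers)
        for start, end in revision_ranges(data)
    ]
```

Counting inside each range, rather than over the whole file, shows when the markers first appear.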

List object-level changes per revision

revpdf-list input.pdf

This prints, for each revision:

object count
per-kind counts such as page, annot, xobject, font, and generic objects
whether each appended object was added, redefined, or repeated
a revision assessment such as likely_annotation_only or content_affecting_or_mixed
notable details such as /Rect, /Annots, /Contents, stream filters, compressed-object containers, and vendor markers when present

Useful options:

revpdf-list input.pdf --revision-index 3
revpdf-list input.pdf --show-baseline-objects
revpdf-list input.pdf --summary-only
revpdf-list input.pdf --json

Extract a chosen revision

revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf

This writes a new PDF containing only the bytes through the selected revision.

You can also extract by byte offset:

revpdf-extract input.pdf --end-offset 229086321 --output cleaned.pdf

Practical Decision Checklist

Use rollback when all of these are true:

the PDF has multiple revisions
the unwanted changes were introduced in later revisions
those later revisions are annotation-only or annotation-dominant
the earlier revision already contains the correct textbook content

Do not use blind rollback when any of these are true:

the later revisions contain actual content changes you need
you cannot tell whether the later objects are only annotations
the PDF was fully rewritten instead of saved incrementally

Notes

Some tools warn about broken or invalid linearization tables. That does not automatically mean the PDF is unusable.
Some annotation systems render hand-drawn marks as /Stamp objects with appearance streams rather than /Ink objects. Search broadly.
Some editors store changed objects inside compressed object streams (/ObjStm) and index them through xref streams, where a raw byte search cannot see them. The change report (revpdf-list) detects these containers and expands common Flate-compressed object streams.
Always work on a copy when the document is important.
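
When markers may be hidden inside Flate-compressed streams, a raw byte search misses them. The sketch below inflates every stream that zlib accepts so the decompressed bodies can be searched too. It is a simplified illustration with hypothetical names. It ignores /Length, predictor parameters, and filters other than Flate, and it is not the expansion logic revpdf uses.

```python
import re
import zlib

STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(data):
    """Yield the decompressed body of each stream that zlib can inflate."""
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the line ending that precedes "endstream".
            body = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, or encoded with another filter
        yield body


def count_markers_in_streams(data, markers):
    """Count marker occurrences inside all inflatable streams."""
    return sum(body.count(m) for body in inflate_streams(data) for m in markers)
```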

Recommended Sequence

Run revpdf-inspect.
Run revpdf-list to see exactly which objects each revision added or redefined.
Identify the first revision that introduces unwanted annotation markers.
Choose between:
- Rollback: Use revpdf-extract to truncate at a safe revision.
- Surgical Cleanup: Use revpdf-clean to remove specific markup layers while keeping recent content.
Verify the page count, metadata, and marker counts.
