revpdf is a triage-and-recovery toolkit for PDFs saved with incremental updates. It helps you inspect a PDF's edit history and safely roll it back to an earlier revision when later saves added unwanted markup.

    pip install revpdf

    revpdf-inspect input.pdf

This prints revision numbers, byte ranges, and counts of suspicious annotation markers.

    revpdf-list input.pdf

This identifies which objects each revision appends and distinguishes likely annotation-only changes from content-affecting changes.

    revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf

    revpdf-clean input.pdf --output cleaned.pdf --strategies acrobat samsung --manifest integrity.json

This uses best-guess heuristics to automatically identify and surgically remove platform-specific markup while regenerating the XRef stream for a clean file.
Forensic Features:
- --manifest (-m): Generates a machine-readable JSON report mapping original object IDs to their cryptographic hashes.
- Global Signature: A cumulative SHA-256 fingerprint of the kept content that is hashed, which lets you verify that the original textbook data is byte-identical.
revpdf is designed for high-assurance environments where data provenance is critical.
To maintain high performance on large files, the integrity engine uses a tiered approach:
- Hashed: Text content streams, page geometry, and structural objects.
- Skipped: Large binary assets like embedded images and fonts (can be enabled via API).
The manifest provides a verifiable audit trail:
- global_signature: The unique hash for the entire "cleaned" document state.
- object_hashes: A dictionary of New_ID -> SHA-256_Hash.
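
As a rough illustration, a manifest could be sanity-checked before it is archived. The sketch below assumes the JSON file exposes the two fields named above as top-level keys and writes each hash as a 64-character hexadecimal string; the layout actually produced by revpdf-clean may differ. check_manifest is a hypothetical helper, not part of revpdf.

```python
import re

# A SHA-256 digest written as hex is exactly 64 hexadecimal characters.
SHA256_HEX = re.compile(r"[0-9a-fA-F]{64}")


def check_manifest(manifest: dict) -> list[str]:
    """Return a list of problems found in a manifest dict (empty if none)."""
    problems = []

    # Assumed top-level key; taken from the field name in the text above.
    signature = manifest.get("global_signature")
    if not isinstance(signature, str) or not SHA256_HEX.fullmatch(signature):
        problems.append("global_signature is missing or not a SHA-256 hex digest")

    # Assumed top-level key holding the New_ID -> hash mapping.
    hashes = manifest.get("object_hashes")
    if not isinstance(hashes, dict):
        problems.append("object_hashes is missing or not a dictionary")
        return problems

    for object_id, digest in hashes.items():
        if not isinstance(digest, str) or not SHA256_HEX.fullmatch(digest):
            problems.append(f"object {object_id}: not a SHA-256 hex digest")
    return problems
```

A caller would typically load integrity.json with json.loads and treat a non-empty result as a reason to re-run the cleanup rather than trust the report.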
revpdf now provides a tiered Python SDK with a high-performance Rust backend. It supports asyncio for non-blocking I/O.
Designed for ease of use and integration into Python workflows.

    import asyncio

    from revpdf import PdfDocument, Sanitizer


    async def main():
        # Load document (lazy loading handled by Rust)
        doc = await PdfDocument.open("textbook.pdf")

        # Apply automated sanitization strategies
        sanitizer = Sanitizer(strategies=["acrobat", "samsung"])
        removed_count = await sanitizer.apply(doc)
        print(f"Identified {removed_count} objects for removal")

        # Surgical save (modern XRef stream regeneration)
        manifest = await doc.save("cleaned.pdf", surgical=True)
        print("Surgical Save Complete!")
        print(f"Global Document Signature: {manifest.global_signature}")
        print(f"Verified Objects: {len(manifest.object_hashes)}")


    if __name__ == "__main__":
        asyncio.run(main())

The rollback method is appropriate when the PDF was modified by incremental saves. In that format, each save appends a new revision to the end of the file.
Typical signs:
- multiple %%EOF markers in the file
- trailer dictionaries with /Prev pointers
- later revisions containing annotation markers such as /Type /Annot, /Subtype /Stamp, /InkList, /AAPL:AKExtras, or /PPKType (draw)
Only roll back to an earlier revision if the later revisions contain unwanted annotations or annotation appearance streams and do not replace the actual textbook page content you need to keep.
Do not roll back blindly if later revisions also change:
- page content streams
- text objects
- fonts
- images
- page tree structure for real content changes
If those appear in the later revisions, you need a more careful repair strategy.
Each incremental revision normally ends with %%EOF.
Run:

    python3 - <<'PY'
    from pathlib import Path

    data = Path("input.pdf").read_bytes()
    cursor = 0
    index = 1
    while True:
        pos = data.find(b"%%EOF", cursor)
        if pos == -1:
            break
        print(f"Revision {index}: EOF at byte {pos}")
        cursor = pos + 1
        index += 1
    PY

If you see more than one %%EOF, the file contains multiple revisions.
Each later trailer often points backward to the previous revision using /Prev.
Run:

    python3 - <<'PY'
    from pathlib import Path

    data = Path("input.pdf").read_bytes()
    cursor = 0
    while True:
        pos = data.find(b"%%EOF", cursor)
        if pos == -1:
            break
        snippet = data[max(0, pos - 400):pos + 20]
        print(snippet.decode("latin1", "replace"))
        print("-----")
        cursor = pos + 1
    PY

This helps confirm that the file was saved incrementally rather than rewritten from scratch.
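
Reading the printed snippets by eye works for a handful of revisions. For longer histories, the /Prev values could be pulled out directly. The following is a minimal sketch, not part of revpdf: prev_offsets is a hypothetical helper that scans the raw bytes, so it only sees /Prev entries in uncompressed dictionaries (classic trailers and xref stream dictionaries) and may also pick up a matching byte sequence inside uncompressed stream data.

```python
import re

# Match "/Prev <integer>" but skip indirect references such as "/Prev 12 0 R",
# which outline items use to point at their previous sibling.
PREV_RE = re.compile(rb"/Prev\s+(\d+)(?!\d)(?!\s+\d+\s+R)")


def prev_offsets(data: bytes) -> list[int]:
    """Return every /Prev byte offset found in the raw PDF bytes, in file order."""
    return [int(match.group(1)) for match in PREV_RE.finditer(data)]
```

An empty result on a file with several %%EOF markers is a hint that the markers may not all belong to real incremental revisions.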
Search the raw PDF bytes for common overlay markers:

    rg -a -n '/Subtype /(Stamp|Ink|FreeText|Square|Circle|Highlight)|/InkList|/AAPL:AKExtras|/PPKType \(draw\)|Mobile User' input.pdf

Common signs of hand-drawn markup include:
- /Subtype /Stamp
- /Subtype /Ink
- /InkList
- /PPKType (draw)
- /AAPL:AKExtras
- Mobile User
The important question is not whether the PDF contains annotations somewhere. The important question is when those objects first appear.
For each appended revision, inspect whether it adds:
- only annotation objects and annotation appearance streams
- page /Annots references pointing to those annotations
That is usually safe to roll back.
If the appended revision adds or replaces actual page contents, treat it as unsafe for blind rollback.
Choose the last revision before the unwanted markers first appear.
Example logic:
- revision 1: no unwanted annotation markers
- revision 2: unwanted drawing markers appear
- revision 3: more of the same unwanted drawing markers
In that case, revision 1 is the clean rollback target.
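
The selection rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than what revpdf-inspect does internally: revision_ends and last_clean_revision are hypothetical helpers, revisions are delimited purely by %%EOF markers, and only uncompressed, exactly spelled markers are detected (a marker written as /Subtype/Stamp or hidden inside a compressed object stream would be missed).

```python
from typing import Optional

EOF = b"%%EOF"
MARKERS = [
    b"/Subtype /Stamp",
    b"/Subtype /Ink",
    b"/InkList",
    b"/PPKType (draw)",
    b"/AAPL:AKExtras",
    b"Mobile User",
]


def revision_ends(data: bytes) -> list[int]:
    """Return the offset just past each %%EOF marker, one per revision."""
    ends, cursor = [], 0
    while True:
        pos = data.find(EOF, cursor)
        if pos == -1:
            return ends
        cursor = pos + len(EOF)
        ends.append(cursor)


def last_clean_revision(data: bytes) -> Optional[int]:
    """Return the 1-based index of the last revision before any marker appears.

    Returns None if the first revision already contains a marker or no %%EOF
    is found. If no revision contains a marker, the last revision is returned.
    """
    clean, start = None, 0
    for index, end in enumerate(revision_ends(data), start=1):
        if any(marker in data[start:end] for marker in MARKERS):
            break
        clean, start = index, end
    return clean
```

For the three-revision example above, last_clean_revision would return 1.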
Once you know the correct revision boundary, copy the file only up to that revision's final %%EOF.
Manual example:
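
    head -c <SAFE_END_OFFSET> input.pdf > cleaned.pdf

SAFE_END_OFFSET must include the %%EOF marker itself (and its trailing line break, if any), so it is at least the marker's starting byte plus 5. Write the result to a new output file and leave the original untouched.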
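
Where head is unavailable, or the offset has to be computed rather than typed, the same truncation could be done in Python. This is a minimal sketch, not the revpdf-extract implementation: safe_end_offset and truncate_copy are hypothetical names, and revisions are assumed to be delimited by %%EOF markers in file order.

```python
from pathlib import Path

EOF = b"%%EOF"


def safe_end_offset(data: bytes, revision_index: int) -> int:
    """Return the offset just past the chosen revision's %%EOF line.

    revision_index is 1-based. The result includes the marker and one
    trailing end-of-line sequence (CRLF, LF, or CR) when present.
    """
    if revision_index < 1:
        raise ValueError("revision_index is 1-based")
    cursor = 0
    for _ in range(revision_index):
        pos = data.find(EOF, cursor)
        if pos == -1:
            raise ValueError("file has fewer revisions than requested")
        cursor = pos + len(EOF)
    if data[cursor:cursor + 2] == b"\r\n":
        cursor += 2
    elif data[cursor:cursor + 1] in (b"\n", b"\r"):
        cursor += 1
    return cursor


def truncate_copy(source: str, target: str, revision_index: int) -> int:
    """Write the bytes through the chosen revision to a new file; return the offset."""
    data = Path(source).read_bytes()
    end = safe_end_offset(data, revision_index)
    Path(target).write_bytes(data[:end])  # the source file is never modified
    return end
```

Calling truncate_copy("input.pdf", "cleaned.pdf", 1) would keep only the first revision.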
Run:

    pdfinfo cleaned.pdf

Then confirm the unwanted markers are gone:

    python3 - <<'PY'
    from pathlib import Path

    data = Path("cleaned.pdf").read_bytes()
    for token in [
        b"Mobile User",
        b"/PPKType (draw)",
        b"/Subtype /Stamp",
        b"/Subtype /Ink",
        b"/AAPL:AKExtras",
    ]:
        print(token.decode("latin1"), data.count(token))
    PY

Check:
- page count still matches what you expect
- the cleaned file opens normally
- the unwanted annotation markers are gone
- the original content remains intact
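
Because a rollback is a pure truncation, one further check is possible for the last item: the cleaned file should be a byte-for-byte prefix of the original. The sketch below illustrates that idea; is_prefix_rollback is a hypothetical helper, and the check applies only to truncated files, not to output from revpdf-clean, which rewrites the file.

```python
def is_prefix_rollback(original: bytes, cleaned: bytes) -> bool:
    """Return True if cleaned is a strict byte prefix of original ending in %%EOF."""
    # A rollback must drop something, so the cleaned file is strictly shorter.
    if not cleaned or len(cleaned) >= len(original):
        return False
    # Every kept byte must match the original exactly.
    if not original.startswith(cleaned):
        return False
    # The cut must fall on a revision boundary, ignoring a trailing line break.
    return cleaned.rstrip(b"\r\n").endswith(b"%%EOF")
```

A False result means the output was cut at the wrong offset or was produced by something other than truncation.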
The package includes four core commands:
- revpdf-inspect
- revpdf-extract
- revpdf-list
- revpdf-clean

    revpdf-inspect input.pdf

This prints:
- revision number
- revision byte range
- end offset
- trailer /Prev
- trailer /Size
- counts of suspicious annotation markers in that revision

    revpdf-list input.pdf

This prints, for each revision:
- object count
- per-kind counts such as page, annot, xobject, font, and generic objects
- whether each appended object was added, redefined, or repeated
- a revision assessment such as likely_annotation_only or content_affecting_or_mixed
- notable details such as /Rect, /Annots, /Contents, stream filters, compressed-object containers, and vendor markers when present
Useful options:

    revpdf-list input.pdf --revision-index 3
    revpdf-list input.pdf --show-baseline-objects
    revpdf-list input.pdf --summary-only
    revpdf-list input.pdf --json

    revpdf-extract input.pdf --revision-index 1 --output cleaned.pdf

This writes a new PDF containing only the bytes through the selected revision.
You can also extract by byte offset:

    revpdf-extract input.pdf --end-offset 229086321 --output cleaned.pdf

Use rollback when all of these are true:
- the PDF has multiple revisions
- the unwanted changes were introduced in later revisions
- those later revisions are annotation-only or annotation-dominant
- the earlier revision already contains the correct textbook content
Do not use blind rollback when any of these are true:
- the later revisions contain actual content changes you need
- you cannot tell whether the later objects are only annotations
- the PDF was fully rewritten instead of saved incrementally
- Some tools warn about broken or invalid linearization tables. That does not automatically mean the PDF is unusable.
- Some annotation systems render hand-drawn marks as /Stamp objects with appearance streams rather than /Ink objects. Search broadly.
- Some editors store changed objects inside compressed object streams (/ObjStm) or xref streams. The change-report script now detects these and expands common Flate-compressed object streams.
- Always work on a copy when the document is important.

- Run revpdf-inspect.
- Run revpdf-list to see exactly which objects each revision added or redefined.
- Identify the first revision that introduces unwanted annotation markers.
- Choose between:
  - Rollback: Use revpdf-extract to truncate at a safe revision.
  - Surgical Cleanup: Use revpdf-clean to remove specific markup layers while keeping recent content.
- Verify the page count, metadata, and marker counts.