Background
extract/pdf (landed in #6) uses github.com/dslipak/pdf. Testing surfaced a concrete failure: a structurally valid PDF whose single page has no /Contents stream parses fine (NumPage()==1, page not null), but GetPlainText then hangs: it spins without returning.
This is worse than the panics the original Morris design anticipated:
- recover() cannot catch a hang, so the panic-recovery boundary is useless against it.
- The parser ignores context cancellation (no ctx awareness), so a caller cannot interrupt it cooperatively.
For a library that accepts untrusted PDFs, an uninterruptible hang is a denial-of-service vector: one crafted input ties up a worker indefinitely.
Current stopgap (shipped in #6, ADR 0007)
extract/pdf ships a wall-clock timeout watchdog (Extractor.Timeout, default 30s): parsing runs in a goroutine, Extract selects on result / ctx.Done() / timer, and returns extract.ErrMalformedSource on timeout.
This is knowingly a band-aid, not a fix:
- A timed-out parse leaks its goroutine — the worker stays stuck in the uninterruptible parser, so sustained hostile input grows memory. (Residual risk this stopgap does not close.)
- The timeout is wall-clock, conflating "hostile hang" with "legitimately huge PDF".
Spike goals
- … /Contents case? Can we pre-validate page structure to avoid it, or is it fixable upstream?
- … dslipak/pdf.
Outcome supersedes ADR 0007.
Until then
Do not point extract/pdf at high-volume untrusted input without external process isolation.
References
- docs/adr/0007-pdf-extraction-watchdog-stopgap.md
- docs/deferred-tooling.md
- Phase 1: chunk — boundary-aware chunker with token-budget enforcement #4