Spike: diagnose or replace dslipak/pdf (uninterruptible hang → DoS vector) #7

@dratner

Background

extract/pdf (landed in #6) uses github.com/dslipak/pdf. Testing surfaced a concrete failure: a structurally valid PDF whose single page has no /Contents stream parses successfully (NumPage() == 1, page not null), but GetPlainText then hangs: it spins and never returns.

This is worse than the panics the original Morris design anticipated:

  • recover() cannot catch a hang, so the panic-recovery boundary is useless against it.
  • The parser is not context-aware: it ignores cancellation, so a caller cannot interrupt it cooperatively.

For a library that accepts untrusted PDFs, an uninterruptible hang is a denial-of-service vector: one crafted input ties up a worker indefinitely.

Current stopgap (shipped in #6, ADR 0007)

extract/pdf ships a wall-clock timeout watchdog (Extractor.Timeout, default 30s): parsing runs in a goroutine, Extract selects on result / ctx.Done() / timer, and returns extract.ErrMalformedSource on timeout.

This is knowingly a band-aid, not a fix:

  • A timed-out parse leaks its goroutine: it stays stuck in the uninterruptible parser, so under sustained hostile input the stuck goroutines accumulate and memory grows. (Residual risk this stopgap does not close.)
  • The timeout is wall-clock, so it cannot tell a hostile hang apart from a legitimately huge PDF that is simply slow to parse.

Spike goals

  • Diagnose the hang: is it specific to the missing-/Contents case? Can we pre-validate page structure to avoid it, or is it fixable upstream?
  • Evaluate maintained alternatives to dslipak/pdf.
  • Decide on out-of-process parsing, which is the real fix for an uninterruptible parser (sandbox or separate process with a hard kill).

The outcome of this spike supersedes ADR 0007.

Until then

Do not point extract/pdf at high-volume untrusted input without external process isolation.

Labels: enhancement, help wanted
