DocDig is an Elixir wrapper around the Rust-based extractous library, exposing high-performance document and web-page text extraction via Rustler NIFs.
- Extract text from local files: PDF, DOCX, HTML, Markdown, etc.
- Fetch and extract from URLs.
- Extract from in-memory binaries.
- Perform OCR on image-only PDFs or images via Tesseract (customizable language).
- Optional bang (!) variants that raise on errors for concise workflows.
- Precompiled NIFs with rustler_precompiled support for zero-toolchain installs.
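
The bang variants in the list above follow the usual Elixir convention: the plain function returns a tagged tuple, and the `!` variant unwraps it or raises. Here is a minimal sketch of that pattern. The module `BangSketch`, its stubbed `extract/1`, and the `{:error, reason}` shape are hypothetical and are not DocDig's actual implementation.

```elixir
defmodule BangSketch do
  # Hypothetical stand-in for an extraction call: returns a tagged tuple
  # instead of raising. The {:error, reason} shape is an assumption.
  def extract(""), do: {:error, :empty_input}
  def extract(input) when is_binary(input), do: {:ok, {String.trim(input), %{}}}

  # Bang variant: unwraps the success tuple or raises on error.
  def extract!(input) do
    case extract(input) do
      {:ok, result} -> result
      {:error, reason} -> raise ArgumentError, "extraction failed: #{inspect(reason)}"
    end
  end
end

# The bang form keeps the happy path short when a crash is acceptable.
{text, _meta} = BangSketch.extract!("  hello  ")
IO.puts(text)
```

The tuple form suits code that needs to recover from failures; the bang form suits scripts and pipelines where a failure should stop execution.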
Add to your mix.exs:
def deps do
  [
    {:doc_dig, github: "elchemista/doc_dig", branch: "master"}
  ]
end

Then fetch and compile:

mix deps.get
mix compile

# Extract from a local Markdown file:
{:ok, {text, metadata}} = DocDig.extract_file("README.md")
IO.puts(text)
IO.inspect(metadata)
# Raise on failure:
{text, _meta} = DocDig.extract_file!("README.md")
# Extract from a URL:
{:ok, {html_text, _}} = DocDig.extract_url("https://example.com")
# Extract from in-memory binary (e.g. download via HTTPoison):
{:ok, %HTTPoison.Response{body: file_bytes}} = HTTPoison.get("https://example.com/sample.docx")
{:ok, {doc_text, _}} = DocDig.extract_bytes(file_bytes)
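
Because the non-bang functions return tagged tuples, several inputs can be processed without one failure aborting the rest. The sketch below assumes failures come back as `{:error, reason}`; the examples here only show the `{:ok, {text, metadata}}` shape. `ExtractBatch` and the stub extractor are hypothetical names used for illustration. In a real project the extractor argument could be `&DocDig.extract_file/1`.

```elixir
defmodule ExtractBatch do
  # Runs `extractor` over each path and splits the results into
  # successes (path => text) and failures (path => reason).
  def run(paths, extractor) when is_function(extractor, 1) do
    Enum.reduce(paths, %{ok: %{}, failed: %{}}, fn path, acc ->
      case extractor.(path) do
        {:ok, {text, _metadata}} -> put_in(acc, [:ok, path], text)
        {:error, reason} -> put_in(acc, [:failed, path], reason)
      end
    end)
  end
end

# Stub extractor so the sketch runs without DocDig installed.
stub = fn
  "missing.pdf" -> {:error, :enoent}
  path -> {:ok, {"text of " <> path, %{}}}
end

result = ExtractBatch.run(["README.md", "missing.pdf"], stub)
IO.inspect(result)
```

Passing the extractor as a function keeps the helper independent of the library and easy to exercise in tests.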
# Force OCR on a scanned PDF (German language):
{:ok, {ocr_text, _}} = DocDig.extract_file_ocr("invoice_scanned.pdf", "deu")

- Fork the repo
- Create a feature branch: git checkout -b feature/my-addition
- Run tests: mix test
- Submit a pull request
- extractous by Yobix AI and contributors
- Rustler by the Rustler team
- Tesseract OCR for OCR support
- Elixir and Erlang/OTP community
Licensed under Apache-2.0.