DocDig

DocDig is an Elixir wrapper around the Rust-based extractous library, exposing high-performance document and web-page text extraction via Rustler NIFs.

Features

Extract text from local files: PDF, DOCX, HTML, Markdown, etc.
Fetch and extract from URLs.
Extract from in-memory binaries.
Perform OCR on image-only PDFs or images via Tesseract (customizable language).
Optional bang (!) variants that raise on errors for concise workflows.
Precompiled NIFs via rustler_precompiled, so installation does not require a Rust toolchain.

Installation

Add to your mix.exs:

def deps do
  [
    {:doc_dig, github: "elchemista/doc_dig", branch: "master"}
  ]
end

Then fetch and compile:

mix deps.get
mix compile

Usage Examples

# Extract from a local Markdown file:
{:ok, {text, metadata}} = DocDig.extract_file("README.md")
IO.puts(text)
IO.inspect(metadata)

# Raise on failure:
{text, _meta} = DocDig.extract_file!("README.md")

# Extract from a URL:
{:ok, {html_text, _}} = DocDig.extract_url("https://example.com")

# Extract from in-memory binary (e.g. download via HTTPoison):
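# HTTPoison.get/1 returns a response struct; the raw bytes are in its :body field.
{:ok, %HTTPoison.Response{body: file_bytes}} = HTTPoison.get("https://example.com/sample.docx")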
{:ok, {doc_text, _}} = DocDig.extract_bytes(file_bytes)

# Force OCR on a scanned PDF (German language):
{:ok, {ocr_text, _}} = DocDig.extract_file_ocr("invoice_scanned.pdf", "deu")
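
The non-bang functions above return tagged tuples, so a caller can fall back to a default instead of crashing on a failed match. The sketch below is a hypothetical helper, not part of DocDig. It assumes failures come back as `{:error, reason}`, which the examples above imply but do not show, so check the library's actual return values before relying on it.

```elixir
defmodule ExtractHelper do
  # Hypothetical helper (not part of DocDig): unwraps an extraction result.
  # Assumes success is {:ok, {text, metadata}} as in the examples above,
  # and that failure is {:error, reason} (an assumption).
  def text_or_default(result, default \\ "")

  # Success: return the extracted text and ignore the metadata.
  def text_or_default({:ok, {text, _metadata}}, _default), do: text

  # Failure: return the caller-supplied default instead of raising.
  def text_or_default({:error, _reason}, default), do: default
end
```

It could be used as `DocDig.extract_file("README.md") |> ExtractHelper.text_or_default("(no text)")`. Use the bang variants instead when a failure should stop the pipeline.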

Contributing

Fork the repo
Create a feature branch: git checkout -b feature/my-addition
Run tests: mix test
Submit a pull request

Credits

extractous by Yobix AI and contributors
Rustler by the Rustler team
Tesseract OCR for OCR support
Elixir and Erlang/OTP community
