German-first PII redaction and pseudonymization for documents. Local by default. Reversible when you need it.
Noirdoc redacts names, addresses, phone numbers, IBANs, Steuer-IDs, SVNRs, and the rest — from PDFs, DOCX, XLSX, and plain text — without sending anything to a third party. Under the hood it's a rules-based Presidio pipeline by default, and an ensemble (Presidio + GLiNER + Flair) when the [full] extra is installed. It's built for real-world German documents and mixed DE/EN text — the kind of stuff Mittelstand actually runs through an LLM.
Status: alpha (
0.1.x). API will change before1.0. Pin the minor version.
- Python 3.12 or 3.13
- ~1 GB free disk if you install the
[full]extra (spaCy + Flair + GLiNER weights) - Optional: a Redis instance if you want shared mapping storage across workers (
[redis]extra)
# Baseline — Presidio + all file extractors + reversible mapper.
pip install noirdoc
# Full ensemble (adds GLiNER + Flair, large ML weights). Recommended for real work.
pip install noirdoc[full]
noirdoc models pull
# Optional distributed mapper backend.
pip install noirdoc[redis]For anything beyond toy examples, use noirdoc[full] — the ensemble catches what the baseline misses, especially on German lowercase text.
# One-shot redact (ephemeral mapping, discarded on exit).
noirdoc redact vertrag.pdf -o vertrag-clean.pdf
# Persistent namespace — placeholders stay consistent across files and sessions.
noirdoc redact --namespace mandant-mueller brief.docx -o brief-clean.docx
noirdoc reveal --namespace mandant-mueller brief-clean.docx -o brief-revealed.docx
noirdoc lookup --namespace mandant-mueller "<<PERSON_3>>"from noirdoc import Redactor
r = Redactor(namespace="mandant-mueller")
r.redact_file("vertrag.pdf", output="vertrag-clean.pdf")
r.redact_file("brief.docx", output="brief-clean.docx")
r.reveal_text(llm_response) # un-redact the model's replyInput:
Anna Müller, geboren am 12.03.1981 in München, erreichbar unter 0171-2345678, Steuer-ID 12 345 678 901, IBAN DE89 3704 0044 0532 0130 00.
Output:
<<PERSON_1>>, geboren am<<DATE_TIME_1>>in<<LOCATION_1>>, erreichbar unter<<PHONE_NUMBER_1>>, Steuer-ID<<DE_STEUER_ID_1>>, IBAN<<IBAN_CODE_1>>.
| Command | What it does |
|---|---|
noirdoc redact <files> |
Redact one or more files (accepts directories; -o FILE or --output-dir DIR). |
noirdoc reveal <file> |
Reverse pseudonyms back to originals (DOCX / XLSX / plain; --namespace required). |
noirdoc lookup <token> |
Resolve a pseudonym like <<PERSON_1>> to its original value. |
noirdoc ns list |
List persistent namespaces. |
noirdoc ns summary <name> |
Counts-only summary (entity totals + per-type counts). Safe to log. |
noirdoc ns show <name> --unsafe |
Print the full pseudonym↔original mapping as JSON. Reveals every original value. Requires --unsafe. |
noirdoc ns delete <name> |
Delete a namespace (prompts for confirmation). |
noirdoc models pull |
Download spaCy models and (optionally) GLiNER weights up front. |
Run noirdoc <cmd> --help for the full flag list on any subcommand.
A few honest caveats before you ship this into a pipeline:
- Best results need
[full]. On first use (or vianoirdoc models pull) the full extra downloads roughly 560 MB of weights: spaCyde_core_news_lg, Flairner-german-large, and a GLiNER multilingual model. Budget disk and bandwidth. - PDF reveal is not supported yet. Round-tripping placeholders back into a PDF is a hard problem (position drift, font metrics, image-based redactions). PDFs redact cleanly; reveal is pass-through. DOCX, XLSX, and plain text round-trip fully.
- Alpha API. Classes and CLI flags may change between
0.1.xand0.2.x. Pin accordingly. - Detector quality depends on the upstream models. Presidio + Flair + GLiNER do the heavy lifting. Noirdoc adds German-specific recognizers on top, but it does not train models.
Noirdoc defaults to German (language="de") with fallback to ["de", "en"] for mixed documents. What that actually means:
- Custom recognizers in
src/noirdoc/detection/presidio_detector.py:GermanPhoneRecognizer— German phone formats (0171-..., +49...)GermanSVNRRecognizer— Sozialversicherungsnummer with checksumGermanSteuerIDRecognizer— 11-digit Steuer-ID with checksumInvertedNameRecognizer— registered for bothdeandento catch "Nachname, Vorname" patterns
- Flair
ner-german-large(XLM-R, F1 92.3 % on CoNLL-03 DE) handles lowercase German text — the case where spaCy tends to drop names. - GLiNER multilingual catches entity types the others miss.
- German-style lowercase financial terms, German IBANs, German date formats, and German address patterns are covered in the test suite (
tests/test_presidio_detector.py).
If you're working with German legal, medical, HR, or financial documents, this is what the defaults are tuned for.
| Format | Redact | Reveal (round-trip) |
|---|---|---|
| ✓ | ✗ (pass-through) | |
| DOCX | ✓ | ✓ |
| XLSX | ✓ | ✓ |
| Plain text / CSV / MD / HTML | ✓ | ✓ |
| PPTX / images | ✓ | ✗ (pass-through) |
PDF reveal is an open contribution target — see CONTRIBUTING.md.
The [redis] extra ships a RedisMappingBackend that plugs into the lower-level MappingStore — the same primitive Noirdoc Cloud uses for request-scoped, encrypted, TTL-bounded mapping persistence across workers. It is not wired into Redactor(namespace=...), which persists to the local filesystem under ~/.noirdoc/namespaces/. Use MappingStore when you have multiple workers that need to share pseudonym mappings for the same request, or when you want encrypted-at-rest mappings with automatic expiry.
import asyncio
from cryptography.fernet import Fernet
from redis.asyncio import Redis
from noirdoc.mappings.backends.redis_backend import RedisMappingBackend
from noirdoc.mappings.store import MappingStore
async def main() -> None:
redis = Redis.from_url("redis://localhost:6379")
store = MappingStore(
backend=RedisMappingBackend(redis),
encryption_key=Fernet.generate_key(), # keep stable across workers
)
# store.save(request_id=..., tenant_id=..., mapper=...)
# mappings = await store.load(request_id)
asyncio.run(main())The encryption_key must be identical across workers that need to read the same mappings. MappingStore.save() accepts a ttl_days kwarg (default 30).
Don't want to run this yourself? Noirdoc Cloud is the hosted API wrapper: a privacy-preserving reverse proxy for LLM calls that uses this exact pipeline, plus multi-tenancy, audit, and provider key management. Compliance story: what's on GitHub is what the cloud runs.
This repo uses the shared noirdoc tooling standard (uv + ruff/mypy). Common tasks go through make:
make install # set up the dev environment
make check # lint + format-check + typecheck + test — run before pushing
make test # run fast tests (excludes slow ML-model tests)Run make help for the full list of targets (also: make lint, make fmt, make typecheck, make test-slow, make models).
Bug reports, detectors, and format support are all welcome. See CONTRIBUTING.md for dev setup, tests, and the recognizer pattern.
Report vulnerabilities via GitHub's private vulnerability reporting — see SECURITY.md. Please don't open public issues for security bugs.
See CHANGELOG.md. Follows Keep a Changelog and SemVer.
MIT © 2026 Antonio Maiolo / Nextaim GmbH. See LICENSE.
Built by Nextaim GmbH · noirdoc.de