Skip to content

noirdoc-ai/mask-engine

Repository files navigation

noirdoc — German-first PII redaction, local by default.

CI Python 3.12 | 3.13 License: MIT pre-commit enabled

noirdoc

German-first PII redaction and pseudonymization for documents. Local by default. Reversible when you need it.

Noirdoc redacts names, addresses, phone numbers, IBANs, Steuer-IDs, SVNRs, and the rest — from PDFs, DOCX, XLSX, and plain text — without sending anything to a third party. Under the hood it's a rules-based Presidio pipeline by default, and an ensemble (Presidio + GLiNER + Flair) when the [full] extra is installed. It's built for real-world German documents and mixed DE/EN text — the kind of stuff Mittelstand actually runs through an LLM.

Status: alpha (0.1.x). API will change before 1.0. Pin the minor version.

Prerequisites

  • Python 3.12 or 3.13
  • ~1 GB free disk if you install the [full] extra (spaCy + Flair + GLiNER weights)
  • Optional: a Redis instance if you want shared mapping storage across workers ([redis] extra)

Install

# Baseline — Presidio + all file extractors + reversible mapper.
pip install noirdoc

# Full ensemble (adds GLiNER + Flair, large ML weights). Recommended for real work.
pip install noirdoc[full]
noirdoc models pull

# Optional distributed mapper backend.
pip install noirdoc[redis]

For anything beyond toy examples, use noirdoc[full] — the ensemble catches what the baseline misses, especially on German lowercase text.

Quickstart

# One-shot redact (ephemeral mapping, discarded on exit).
noirdoc redact vertrag.pdf -o vertrag-clean.pdf

# Persistent namespace — placeholders stay consistent across files and sessions.
noirdoc redact --namespace mandant-mueller brief.docx -o brief-clean.docx
noirdoc reveal --namespace mandant-mueller brief-clean.docx -o brief-revealed.docx
noirdoc lookup --namespace mandant-mueller "<<PERSON_3>>"
from noirdoc import Redactor

r = Redactor(namespace="mandant-mueller")
r.redact_file("vertrag.pdf", output="vertrag-clean.pdf")
r.redact_file("brief.docx", output="brief-clean.docx")
r.reveal_text(llm_response)  # un-redact the model's reply

Input:

Anna Müller, geboren am 12.03.1981 in München, erreichbar unter 0171-2345678, Steuer-ID 12 345 678 901, IBAN DE89 3704 0044 0532 0130 00.

Output:

<<PERSON_1>>, geboren am <<DATE_TIME_1>> in <<LOCATION_1>>, erreichbar unter <<PHONE_NUMBER_1>>, Steuer-ID <<DE_STEUER_ID_1>>, IBAN <<IBAN_CODE_1>>.

Commands

Command What it does
noirdoc redact <files> Redact one or more files (accepts directories; -o FILE or --output-dir DIR).
noirdoc reveal <file> Reverse pseudonyms back to originals (DOCX / XLSX / plain; --namespace required).
noirdoc lookup <token> Resolve a pseudonym like <<PERSON_1>> to its original value.
noirdoc ns list List persistent namespaces.
noirdoc ns summary <name> Counts-only summary (entity totals + per-type counts). Safe to log.
noirdoc ns show <name> --unsafe Print the full pseudonym↔original mapping as JSON. Reveals every original value. Requires --unsafe.
noirdoc ns delete <name> Delete a namespace (prompts for confirmation).
noirdoc models pull Download spaCy models and (optionally) GLiNER weights up front.

Run noirdoc <cmd> --help for the full flag list on any subcommand.

Before you start

A few honest caveats before you ship this into a pipeline:

  • Best results need [full]. On first use (or via noirdoc models pull) the full extra downloads roughly 560 MB of weights: spaCy de_core_news_lg, Flair ner-german-large, and a GLiNER multilingual model. Budget disk and bandwidth.
  • PDF reveal is not supported yet. Round-tripping placeholders back into a PDF is a hard problem (position drift, font metrics, image-based redactions). PDFs redact cleanly; reveal is pass-through. DOCX, XLSX, and plain text round-trip fully.
  • Alpha API. Classes and CLI flags may change between 0.1.x and 0.2.x. Pin accordingly.
  • Detector quality depends on the upstream models. Presidio + Flair + GLiNER do the heavy lifting. Noirdoc adds German-specific recognizers on top, but it does not train models.

German-first

Noirdoc defaults to German (language="de") with fallback to ["de", "en"] for mixed documents. What that actually means:

  • Custom recognizers in src/noirdoc/detection/presidio_detector.py:
    • GermanPhoneRecognizer — German phone formats (0171-..., +49...)
    • GermanSVNRRecognizer — Sozialversicherungsnummer with checksum
    • GermanSteuerIDRecognizer — 11-digit Steuer-ID with checksum
    • InvertedNameRecognizer — registered for both de and en to catch "Nachname, Vorname" patterns
  • Flair ner-german-large (XLM-R, F1 92.3 % on CoNLL-03 DE) handles lowercase German text — the case where spaCy tends to drop names.
  • GLiNER multilingual catches entity types the others miss.
  • German-style lowercase financial terms, German IBANs, German date formats, and German address patterns are covered in the test suite (tests/test_presidio_detector.py).

If you're working with German legal, medical, HR, or financial documents, this is what the defaults are tuned for.

Supported formats

Format Redact Reveal (round-trip)
PDF ✗ (pass-through)
DOCX
XLSX
Plain text / CSV / MD / HTML
PPTX / images ✗ (pass-through)

PDF reveal is an open contribution target — see CONTRIBUTING.md.

Advanced: shared mapping storage

The [redis] extra ships a RedisMappingBackend that plugs into the lower-level MappingStore — the same primitive Noirdoc Cloud uses for request-scoped, encrypted, TTL-bounded mapping persistence across workers. It is not wired into Redactor(namespace=...), which persists to the local filesystem under ~/.noirdoc/namespaces/. Use MappingStore when you have multiple workers that need to share pseudonym mappings for the same request, or when you want encrypted-at-rest mappings with automatic expiry.

import asyncio
from cryptography.fernet import Fernet
from redis.asyncio import Redis

from noirdoc.mappings.backends.redis_backend import RedisMappingBackend
from noirdoc.mappings.store import MappingStore

async def main() -> None:
    redis = Redis.from_url("redis://localhost:6379")
    store = MappingStore(
        backend=RedisMappingBackend(redis),
        encryption_key=Fernet.generate_key(),  # keep stable across workers
    )
    # store.save(request_id=..., tenant_id=..., mapper=...)
    # mappings = await store.load(request_id)

asyncio.run(main())

The encryption_key must be identical across workers that need to read the same mappings. MappingStore.save() accepts a ttl_days kwarg (default 30).

Noirdoc Cloud

Don't want to run this yourself? Noirdoc Cloud is the hosted API wrapper: a privacy-preserving reverse proxy for LLM calls that uses this exact pipeline, plus multi-tenancy, audit, and provider key management. Compliance story: what's on GitHub is what the cloud runs.

Development

This repo uses the shared noirdoc tooling standard (uv + ruff/mypy). Common tasks go through make:

make install   # set up the dev environment
make check     # lint + format-check + typecheck + test — run before pushing
make test      # run fast tests (excludes slow ML-model tests)

Run make help for the full list of targets (also: make lint, make fmt, make typecheck, make test-slow, make models).

Contributing

Bug reports, detectors, and format support are all welcome. See CONTRIBUTING.md for dev setup, tests, and the recognizer pattern.

Security

Report vulnerabilities via GitHub's private vulnerability reporting — see SECURITY.md. Please don't open public issues for security bugs.

Changelog

See CHANGELOG.md. Follows Keep a Changelog and SemVer.

License

MIT © 2026 Antonio Maiolo / Nextaim GmbH. See LICENSE.


Built by Nextaim GmbH · noirdoc.de

About

German-first PII redaction and pseudonymization for documents. Local by default. Reversible when you need it.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Contributors