feat: add no-op NLP engine #2071
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a NoOpNlpEngine option to Presidio Analyzer to support running analyzers/recognizers without loading NLP models/artifacts (returning empty artifacts instead), and integrates it into engine/recognizer providers and model installation.
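As a rough illustration of the idea (not the code in this PR), a no-op engine can follow the null-object pattern: it exposes the same entry points as a real engine but never loads a model and always hands back empty artifacts. `EmptyArtifacts` and `NoOpEngineSketch` below are hypothetical stand-ins written for this sketch. Their method names only loosely mirror Presidio's NLP engine interface, and the real `NoOpNlpEngine` may differ.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class EmptyArtifacts:
    """Hypothetical stand-in for NlpArtifacts that carries no NLP output."""

    language: str
    entities: List[str] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)
    lemmas: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


class NoOpEngineSketch:
    """Illustrative engine: nothing to load, every text yields empty artifacts."""

    def __init__(self, supported_languages: Iterable[str] = ("en",)) -> None:
        self._languages = list(supported_languages)

    def load(self) -> None:
        # No model download or initialization happens here.
        return None

    def is_loaded(self) -> bool:
        # There is nothing to load, so the engine is always ready.
        return True

    def get_supported_languages(self) -> List[str]:
        return list(self._languages)

    def process_text(self, text: str, language: str) -> EmptyArtifacts:
        # The text is ignored; only the language is recorded.
        return EmptyArtifacts(language=language)

    def process_batch(
        self, texts: Iterable[str], language: str
    ) -> Iterator[Tuple[str, EmptyArtifacts]]:
        # Pair each input text with its (empty) artifacts.
        for text in texts:
            yield text, self.process_text(text, language)


engine = NoOpEngineSketch()
artifacts = engine.process_text("My phone number is 212-555-5555", "en")
batch = list(engine.process_batch(["a", "b"], "en"))
```

Recognizers that run their own inference can ignore the artifacts entirely, which is why an empty result is enough for them.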
Changes:
- Introduces NoOpNlpEngine and a default YAML config for it (conf/no_op.yaml).
- Updates NlpEngineProvider, model installation, and recognizer registry/provider to support/guard no-op behavior.
- Adds comprehensive tests for initialization, batch/text processing, and Analyzer/provider integration.
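The guard behavior mentioned above could look roughly like the following. This is a simplified sketch under assumptions, not the registry code from the PR: `RecognizerSpec`, its `needs_nlp_artifacts` flag, and `select_recognizers` are hypothetical names, and the flag values assigned to each recognizer are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RecognizerSpec:
    """Hypothetical description of a recognizer; not a Presidio class."""

    name: str
    needs_nlp_artifacts: bool


def select_recognizers(
    specs: List[RecognizerSpec], engine_name: str
) -> List[RecognizerSpec]:
    """Drop recognizers that rely on NLP artifacts when the engine is no-op."""
    if engine_name != "no_op":
        # A real engine produces artifacts, so every recognizer can stay.
        return list(specs)
    return [spec for spec in specs if not spec.needs_nlp_artifacts]


specs = [
    RecognizerSpec("SpacyRecognizer", needs_nlp_artifacts=True),
    RecognizerSpec("HuggingFaceNerRecognizer", needs_nlp_artifacts=False),
    RecognizerSpec("CreditCardRecognizer", needs_nlp_artifacts=False),
]
selected = [spec.name for spec in select_recognizers(specs, "no_op")]
```

Filtering at registration time means a misconfigured setup is caught when the registry is built, not when a recognizer later receives empty artifacts.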
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/nlp_engine/no_op_nlp_engine.py | Implements the new no-op NLP engine returning empty artifacts. |
| presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py | Registers no_op as an available engine and constructs it from config. |
| presidio-analyzer/presidio_analyzer/nlp_engine/__init__.py | Exposes NoOpNlpEngine in the package exports. |
| presidio-analyzer/presidio_analyzer/conf/no_op.yaml | Adds a config preset for the no-op engine. |
| presidio-analyzer/install_nlp_models.py | Skips model installation for no_op. |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py | Skips NLP recognizer registration and blocks NLP recognizer retrieval for NoOpNlpEngine. |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry_provider.py | Prevents configuring NLP recognizers when using NoOpNlpEngine. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py | Updates docs example to use no-op engine for recognizer-only flows. |
| presidio-analyzer/tests/conftest.py | Adds no_op to the parametrized NLP engines used in tests. |
| presidio-analyzer/tests/test_no_op_nlp_engine.py | Adds unit/integration tests for the no-op engine and its provider/analyzer behavior. |

    def _create_empty_nlp_artifacts(self, language: str) -> NlpArtifacts:
        return NlpArtifacts(
            entities=[],
            tokens=Doc(self._vocab, words=[]),
            tokens_indices=[],
            lemmas=[],
            nlp_engine=self,
            language=language,
            scores=[],
        )

NlpArtifacts does not accept a keywords constructor argument. Keywords are derived internally from lemmas in NlpArtifacts.__init__, so passing lemmas=[] keeps artifacts.keywords as an empty list.
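To make the point about derived keywords concrete, here is a minimal sketch of the filtering idea, assuming keywords are simply the lowercased lemmas that are neither stopwords nor punctuation. `derive_keywords`, `is_stopword`, and `is_punct` are hypothetical helpers written for this example. The actual logic in `NlpArtifacts` has more rules and delegates the stopword and punctuation checks to the NLP engine.

```python
from typing import Callable, List


def derive_keywords(
    lemmas: List[str],
    is_stopword: Callable[[str], bool],
    is_punct: Callable[[str], bool],
) -> List[str]:
    """Keep lowercased lemmas that are neither stopwords nor punctuation."""
    return [
        lemma.lower()
        for lemma in lemmas
        if not is_stopword(lemma) and not is_punct(lemma)
    ]


def is_stopword(word: str) -> bool:
    # Tiny illustrative stopword list; a real engine uses its model's list.
    return word.lower() in {"the", "is", "my"}


def is_punct(word: str) -> bool:
    # Treat a token with no alphanumeric characters as punctuation.
    return all(not ch.isalnum() for ch in word)


# With no lemmas there is nothing to filter, so the result is an empty list.
empty = derive_keywords([], is_stopword, is_punct)
non_empty = derive_keywords(["My", "phone", "number", "."], is_stopword, is_punct)
```

Because the list comprehension has nothing to iterate over, an empty `lemmas` input yields empty keywords without any call to the stopword or punctuation checks.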
Thanks! Have you seen the slim nlp engine? https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/nlp_engine/slim_spacy_nlp_engine.py Can we use this as is / adapt it?
Thanks for the pointer! I looked at The main difference is that We could technically adapt I went with a separate
Change Description
Adds a NoOpNlpEngine for analyzer configurations where the active recognizers do not need artifacts produced by an NLP engine.

This is mainly useful for recognizers such as HuggingFaceNerRecognizer, which runs model inference directly and does not need NLP artifacts from spaCy. In that setup, loading a spaCy model only adds startup cost.

With this change, users can select nlp_engine_name: no_op in analyzer configuration instead of configuring a real NLP model that will not be used.

Includes regression tests and an updated HuggingFaceNerRecognizer example.

Issue reference
Refs #2012
Checklist