CogStack · vladd-bit · Nov 19, 2025 · Nov 19, 2025
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 [![doc-build](https://github.com/CogStack/CogStack-NiFi/actions/workflows/doc-build.yml/badge.svg?branch=main)](https://github.com/CogStack/CogStack-NiFi/actions/workflows/doc-build.yml)
 [![elasticsearch-stack](https://github.com/CogStack/CogStack-NiFi/actions/workflows/docker-elasticsearch-stack.yml/badge.svg?branch=main)](https://github.com/CogStack/CogStack-NiFi/actions/workflows/docker-elasticsearch-stack.yml)
 
-## Introduction
+## 💡 Introduction
 
 This repository proposes a possible next step in the evolution of free-text data processing originally implemented in [CogStack-Pipeline](https://github.com/CogStack/CogStack-Pipeline), moving towards a more modular, Platform-as-a-Service (PaaS) approach.
 
@@ -38,7 +38,8 @@ Need help? Feel free to:
 | [`services`](./services) | NLP and auxiliary services, each with its own configs and resources. |
 | [`deploy`](./deploy)     | Example deployment setup, combining NiFi and related services. |
 | [`scripts`](./scripts)   | Helper scripts (e.g., setup tools, sample DB ingestion, Elasticsearch ingestion). |
-| [`data`](./data)         | Place any test or ingested data here. |
+| [`data`](./data)         | Place any test data or data to be ingested here. |
+| [`typings`](./typings)   | Stubs for code linting, type hints, etc. |
 
 ---
 

diff --git a/deploy/Makefile b/deploy/Makefile
@@ -86,7 +86,7 @@ start-git-ea:
 
 start-data-infra: start-nifi start-elastic start-samples
 
-start-all: start-data-infra start-jupyter
+start-all: start-data-infra start-jupyter start-medcat-service start-ocr-services
 
 .PHONY: start-all start-data-infra start-nifi start-elastic start-samples start-jupyter
 
@@ -155,7 +155,7 @@ stop-production-db:
 
 stop-data-infra: stop-nifi stop-elastic stop-samples
 
-stop-all: stop-data-infra stop-jupyter
+stop-all: stop-data-infra stop-jupyter stop-medcat-service stop-ocr-services
 
 .PHONY: stop-data-infra stop-nifi stop-elastic stop-samples stop-jupyter
 

diff --git a/docs/deploy/configuration.md b/docs/deploy/configuration.md
@@ -0,0 +1,79 @@
+
+
+
+## Environment variables
+
+Environment variables have been made available after release 1.0.
+The variables are configurable and are separated into security and general env vars. Furthermore, all services declared in the `services.yml` file have their variables in separate files.
+In most cases, modifying these variables should be the only thing needed to run a successful deployment.
+
+Multiple files are available, split into two categories:
+- service: located in `./deploy/`, these files are responsible for direct service configuration
+- security: located in `./security`; certificate-related settings are always in the files starting with `certificates_`, and user settings are located in the files ending with `_users`
+
+The variables declared in the `./deploy` folder are used in multiple config files, as follows:
+- `elasticsearch.env`, vars used in:
+    -   `./services/elasticsearch/config/(opensearch|elasticsearch).yml`
+    -   `./services/kibana/config/(opensearch|elasticsearch).yml` 
+    -   `./services/metricbeat/metricbeat.yml`
+    -   `./deploy/services.yml`, sections: `nifi`, `elasticsearch-1`, `elasticsearch-2`, `elasticsearch-3`, `kibana`, `metricbeat-1`, `metricbeat-2`
+
+- `nifi.env`, vars used in:
+    -   `./deploy/services.yml`, sections: `nifi`
+    -   `./nifi/conf/nifi.properties`
+
+- `jupyter.env`, vars used in:
+    -   `./deploy/services.yml`, sections: `jupyter`
+
+- `nlp_service.env`, vars used in:
+    -   `./deploy/services.yml`, sections: `nlp-medcat-service-production`
+
+- `database.env`, vars used in:
+    -   `./deploy/services.yml`, sections: `cogstack-databank-db`, `samples-db`
+
+- `general.env`, these vars are optional; declare any custom variables you want here. Used in the `nifi` section.
+
+Additional env files, used only for certificate generation and user accounts, are found in `./security`:
+- `certificates_elasticsearch.env`, used in `create_opensearch_*`/`create_es_native*` scripts
+- `certificates_general.env`, used in `create_root_ca.sh`
+- `certificates_nifi.env`, used in `nifi_toolkit_security.sh`
+- `database_users.env`
+- `elasticsearch_users.env`
+- `nginx_users.env`
+
+
+### Customization
+For custom deployments, copy all the `.env` files (which are not tracked by Git) and add deployment-specific configurations to these files. For example:
+
+```bash
+# create the target folder first, otherwise cp has nowhere to copy into
+mkdir -p deploy/new_deploy_folder
+cp deploy/*.env deploy/new_deploy_folder/
+cp security/*.env deploy/new_deploy_folder/
+```
+
+### Multiple deployments on the same machine
+When deploying multiple docker-compose projects on the same machine (e.g. for dev or testing), it can be useful to remove all container, volume and network names from the docker-compose file, and let [Docker create names](https://docs.docker.com/compose/reference/envvars/#compose_project_name) based on `COMPOSE_PROJECT_NAME` in `deploy/.env`. Docker will automatically create a Docker network and make sure that containers can find each other by service name.
+
+For example, when setting `COMPOSE_PROJECT_NAME=cogstack-prod`, Docker Compose will create a container named `cogstack-prod_elasticsearch-1_1` for the `elasticsearch-1` service (Compose V2 uses hyphens instead: `cogstack-prod-elasticsearch-1-1`). Within the NiFi container, which is running in the same Docker network, you can refer to that container using just the service name `elasticsearch-1`.
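+
+As a minimal sketch of the setting described above, the project name goes into `deploy/.env`. The value `cogstack-prod` is only the example name used in this section, and any other entries in that file are left out here:
+
+```bash
+# deploy/.env (illustrative fragment, not the full file)
+# Compose prefixes container, network and volume names with this value
+# when the Compose file does not set explicit names.
+COMPOSE_PROJECT_NAME=cogstack-prod
+```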
+
+<br>
+
+## <span style="color:red">Important security detail</span> 
+
+Please note that in the example service definitions, for ease of deployment and demonstration, SSL encryption is enabled among services (NiFi, ES, etc.). However, the certificates used are stored in this public repository, where anyone can see them, so **please** make sure to regenerate them before you go into production.
+
+## Services
+Please note that all the services are deployed using the [Docker](https://docker.io) engine, which requires the Docker daemon to be running.
+
+Please see [the available services](./services.md) for more details.
+
+
+## Workflows
+Apache NiFi gives users the ability to build very large and complex data flows.
+These data flows can later be saved as workflow *templates*, exported into XML format and shared with other users.
+We provide a few example templates for ingesting records from a database into Elasticsearch and for extracting NLP annotations from documents.
+
+### Deployment using Makefile
+For deployments based on the example workflows, please see [example workflows](./workflows.md) for more details.
+
+### Deployment using a custom Docker-compose
+When using a fork of this repository for a customized deployment, it can be useful to copy `services.yml` to a deployment-specific `docker-compose.yml`. In this Compose file you can specify the services you need for your instance and configure all parameters per service, as well as track this file in a branch in your own fork. This way you can use your own version control and rebase on `CogStack/CogStack-NiFi` `main` without running into merge conflicts.
diff --git a/docs/deploy/deployment.md b/docs/deploy/deployment.md
@@ -0,0 +1,228 @@
+
+# 📦 Deployment
+
+The [`deploy`](https://github.com/CogStack/CogStack-NiFi/tree/main/deploy/) directory contains an example dockerized deployment setup of the customised NiFi image, along with related services for document processing, NLP, and text analytics.
+
+Make sure you have read the [Prerequisites](./main.md) section before proceeding.
+
+## 🗂️ Key files
+
+- **`services.yml`** – defines the *core* services that are orchestrated directly from this repository via Docker Compose. (Kubernetes-based multi-container deployments are coming soon.)
+
+- **`Makefile`** – provides convenient commands for starting, stopping, and managing the deployment.
+
+- **`.env` files in `./deploy/`** – environment variables used across services:
+  - These variables apply **only to the services defined inside `services.yml`**.  
+  - Security-related `.env` files (certificates, users) are under **`./security`**.
+
+  These variables configure NiFi, Elasticsearch/OpenSearch, Kibana, Jupyter, Metricbeat, the sample DB, etc.
+
+## 🧩 Modular service design (important)
+
+This repository follows a **modular deployment model**:
+
+- Only the services defined in **`services.yml`** use the environment files located in **`./deploy/*.env`**.  
+- **All other services** included in the ecosystem are launched via `docker-compose` commands inside their own directories, for example:  
+
+  ```bash
+  ./services/<service_name>/docker/docker-compose.yml
+  ```
+
+- Each of these standalone services maintains **its own environment configuration** in:
+
+  ```bash
+  ./services/<service_name>/env/
+  ```
+
+This design allows each service to be:
+
+- independently configurable  
+- versioned and deployed in isolation  
+- consumed by other projects without modifying the core deployment  
+
+> These are the files you will most commonly modify when creating or adjusting a deployment.
+
+## ⚙️ Additional service configuration
+
+- Service-specific configurations are located under:  
+  [`./services`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/)
+- NiFi-specific configuration (properties, custom processors, drivers, Python scripts, etc.) is under:  
+  [`./nifi`](https://github.com/CogStack/CogStack-NiFi/tree/main/nifi/)
+
+## 🚀 Starting the Services
+
+All core services defined in `services.yml` can be started using the Makefile in the `deploy/` directory.
+
+For most services in the `services` folder that are not part of the core stack defined in `services.yml` and are pulled from external git submodule repositories, the start-up process is the same.
+
+### ▶️ Start each service individually
+
+You can start individual components of the CogStack-NiFi stack using the `make start-*` commands.  
+Each target loads all required environment variables automatically via `export_env_vars.sh`.
+
+This is useful for:
+
+- debugging a single service  
+- restarting only one component after config changes  
+- running lightweight subsets of the stack  
+- isolating problems or logs per service  
+
+---
+
+#### 🧩 Core NiFi Services
+
+```bash
+make start-nifi
+```
+
+Starts:
+
+- **nifi** — the Apache NiFi instance (main ETL/orchestration engine)  
+- **nifi-nginx** — reverse proxy/front-end for NiFi  
+- **nifi-registry-flow** — NiFi Registry backend that stores flow versions
+
+Use when you want to run, debug, or modify NiFi workflows without bringing up the entire ecosystem.
+
+---
+
+#### 🏗️ Start Core Data Infrastructure
+
+```bash
+make start-data-infra
+```
+
+Starts:
+
+- NiFi
+- NiFi Registry Flow
+- NiFi Nginx
+- Elasticsearch  
+- Samples DB  
+
+Ideal for running ingestion pipelines and ETL workflows.
+
+---
+
+#### 🛢️ Elasticsearch / OpenSearch Services
+
+```bash
+make start-elastic
+```
+
+Starts the standard 2-node Elasticsearch cluster + Kibana.
+
+```bash
+make start-elastic-cluster
+```
+
+Starts all 3 ES nodes. Useful for testing clustering, sharding, and replication.
+
+```bash
+make start-elastic-1
+make start-elastic-2
+make start-elastic-3
+```
+
+Start individual Elasticsearch nodes for debugging or failure-scenario testing.
+
+---
+
+#### 📈 Kibana
+
+```bash
+make start-kibana
+```
+
+Starts Kibana for inspecting logs, checking index mappings, monitoring ES health, and debugging pipelines.
+
+---
+
+#### 🗄️ Databases
+
+```bash
+make start-samples
+```
+
+Starts **samples-db**, the small example DB used for demo flows.
+
+```bash
+make start-production-db
+```
+
+Starts the **cogstack-databank-db** production database.
+
+Use when testing SQL ingestion or verifying DB-driven NiFi flows.
+
+---
+
+#### 📚 JupyterHub
+
+```bash
+make start-jupyter
+```
+
+Starts the CogStack JupyterHub instance. Used for notebooks, analysis, model testing, and visualisation.
+
+---
+
+#### 🧠 NLP Services (MedCAT & Trainer)
+
+```bash
+make start-medcat-service
+```
+
+Starts the MedCAT concept extraction inference API.
+
+```bash
+make start-medcat-service-deid
+```
+
+Starts the MedCAT DEID (de-identification) inference API.
+
+```bash
+make start-medcat-trainer
+```
+
+Starts the full MedCAT Trainer stack (Trainer UI + Solr + NGINX). Useful for annotation and supervised training tasks.
+
+---
+
+#### 📝 OCR Services
+
+```bash
+make start-ocr-services
+```
+
+Starts:
+
+- **ocr-service** — main OCR pipeline  
+- **ocr-service-text-only** — lightweight OCR/text extraction  
+
+Use for PDF ingestion, OCR debugging, and pipeline validation.
+
+---
+
+#### 🛠️ Miscellaneous Services (Gitea)
+
+```bash
+make start-git-ea
+```
+
+Starts the internal Gitea Git server used for local code/config storage.
+
+---
+
+### 🚀 Start the Entire Stack
+
+```bash
+make start-all
+```
+
+Starts everything:
+
+- Core infra
+- JupyterHub  
+- MedCAT NLP services  
+- OCR services  
+
+Use for complete deployments, demos, or full-stack development.