Merged
5 changes: 3 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@
[![doc-build](https://github.com/CogStack/CogStack-NiFi/actions/workflows/doc-build.yml/badge.svg?branch=main)](https://github.com/CogStack/CogStack-NiFi/actions/workflows/doc-build.yml)
[![elasticsearch-stack](https://github.com/CogStack/CogStack-NiFi/actions/workflows/docker-elasticsearch-stack.yml/badge.svg?branch=main)](https://github.com/CogStack/CogStack-NiFi/actions/workflows/docker-elasticsearch-stack.yml)

## Introduction
## 💡 Introduction

This repository proposes a possible next step in the evolution of free-text data processing originally implemented in [CogStack-Pipeline](https://github.com/CogStack/CogStack-Pipeline), moving towards a more modular, Platform-as-a-Service (PaaS) approach.

@@ -38,7 +38,8 @@ Need help? Feel free to:
| [`services`](./services) | NLP and auxiliary services, each with its own configs and resources. |
| [`deploy`](./deploy) | Example deployment setup, combining NiFi and related services. |
| [`scripts`](./scripts) | Helper scripts (e.g., setup tools, sample DB ingestion, Elasticsearch ingestion). |
| [`data`](./data) | Place any test or ingested data here. |
| [`data`](./data) | Place any test or data to be ingested here. |
| [`typings`](./typings) | Stubs for code linting/type-hint, etc. |

---

4 changes: 2 additions & 2 deletions deploy/Makefile
@@ -86,7 +86,7 @@ start-git-ea:

start-data-infra: start-nifi start-elastic start-samples

start-all: start-data-infra start-jupyter
start-all: start-data-infra start-jupyter start-medcat-service start-ocr-services

.PHONY: start-all start-data-infra start-nifi start-elastic start-samples start-jupyter

@@ -155,7 +155,7 @@ stop-production-db:

stop-data-infra: stop-nifi stop-elastic stop-samples

stop-all: stop-data-infra stop-jupyter
stop-all: stop-data-infra stop-jupyter stop-medcat-service stop-ocr-services

.PHONY: stop-data-infra stop-nifi stop-elastic stop-samples stop-jupyter

79 changes: 79 additions & 0 deletions docs/deploy/configuration.md
@@ -0,0 +1,79 @@



## Environment variables

As mentioned above, environment variables have been made available after release 1.0.
The variables are configurable and are split into security and general env files. In addition, every service declared in the `services.yml` file has its variables in a separate file.
In most cases, modifying these variables should be the only thing that is needed in order to run a successful deployment.

Multiple files are available, split into two categories:
- service: located in `./deploy/`, these are responsible for direct service configuration
- security: located in `./security`; certificate-related settings are always in the files starting with `certificates_`, and user settings are in the files ending with `_users`

The variables declared in the `./deploy` folder are used in multiple config files, as follows:
- `elasticsearch.env`, variables here are used in:
    - `./services/elasticsearch/config/(opensearch|elasticsearch).yml`
    - `./services/kibana/config/(opensearch|elasticsearch).yml`
    - `./services/metricbeat/metricbeat.yml`
    - `./deploy/services.yml`, sections: `nifi`, `elasticsearch-1`, `elasticsearch-2`, `elasticsearch-3`, `kibana`, `metricbeat-1`, `metricbeat-2`

- `nifi.env`, vars used in:
    - `./deploy/services.yml`, sections: `nifi`
    - `./nifi/conf/nifi.properties`

- `jupyter.env`, vars used in:
    - `./deploy/services.yml`, sections: `jupyter`

- `nlp_service.env`, vars used in:
    - `./deploy/services.yml`, sections: `nlp-medcat-service-production`

- `database.env`, vars used in:
    - `./deploy/services.yml`, sections: `cogstack-databank-db`, `samples-db`

- `general.env`, these vars are optional; declare any custom variables you want here. They are used in the `nifi` section.

Additional env files, used only for certificate generation and user accounts, are found in `./security`:
- `certificates_elasticsearch.env`, used in `create_opensearch_*`/`create_es_native*` scripts
- `certificates_general.env`, used in `create_root_ca.sh`
- `certificates_nifi.env`, used in `nifi_toolkit_security.sh`
- `database_users.env`
- `elasticsearch_users.env`
- `nginx_users.env`


### Customization
For custom deployments, copy all the `.env` files (which are not tracked by Git) and add deployment-specific configuration to the copies. For example:

```
cp deploy/*.env deploy/new_deploy_folder/
cp security/*.env deploy/new_deploy_folder/
```
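A slightly more defensive version of the copy step can be sketched as a small shell function. The name `copy_env_files` is a hypothetical helper, not something shipped in this repository. It creates the destination folder first and leaves files that already exist there untouched, so local edits are not overwritten by a second run.

```bash
# Hypothetical helper (not part of CogStack-NiFi).
# Usage: copy_env_files DEST_DIR SRC_DIR [SRC_DIR...]
copy_env_files() (
    dest="$1"; shift
    mkdir -p "$dest" || exit 1
    for src in "$@"; do
        for f in "$src"/*.env; do
            [ -e "$f" ] || continue              # the glob matched nothing
            target="$dest/$(basename "$f")"
            if [ -e "$target" ]; then
                echo "skipping existing $target" >&2
            else
                cp "$f" "$target"
            fi
        done
    done
)

# Example, run from the repository root:
#   copy_env_files deploy/new_deploy_folder deploy security
```

Note that the `*.env` pattern does not match the hidden `deploy/.env` file, so copy that one explicitly if the new deployment needs its own `COMPOSE_PROJECT_NAME`.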

### Multiple deployments on the same machine
When deploying multiple docker-compose projects on the same machine (e.g. for dev or testing), it can be useful to remove all container, volume, and network names from the docker-compose file and let [Docker create names](https://docs.docker.com/compose/reference/envvars/#compose_project_name) based on `COMPOSE_PROJECT_NAME` in `deploy/.env`. Docker will automatically create a Docker network and make sure that containers can find each other by container name.

For example, when setting `COMPOSE_PROJECT_NAME=cogstack-prod`, Docker Compose will create a container named `cogstack-prod_elasticsearch-1_1` for the `elasticsearch-1` service. Within the NiFi container, which is running in the same Docker network, you can refer to that container using just the service name `elasticsearch-1`.
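The naming rule can be illustrated with a tiny shell function. `compose_container_name` is a made-up helper for illustration only; it simply joins the project name, service name, and replica index. The underscore separator matches the example above (Compose V1 behaviour); Compose V2 joins the parts with hyphens by default, so the separator is left as an optional argument.

```bash
# Illustrative only: predict the container name Docker Compose derives from
# COMPOSE_PROJECT_NAME, the service name, and the replica index.
# $1 project, $2 service, $3 replica index (default 1), $4 separator (default _)
compose_container_name() {
    printf '%s%s%s%s%s\n' "$1" "${4:-_}" "$2" "${4:-_}" "${3:-1}"
}

compose_container_name cogstack-prod elasticsearch-1       # cogstack-prod_elasticsearch-1_1
compose_container_name cogstack-prod elasticsearch-1 1 -   # cogstack-prod-elasticsearch-1-1
```

Whatever the generated container name is, other containers on the same Compose network can keep using the plain service name (`elasticsearch-1`) as the hostname.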

<br>

## <span style="color:red">Important security detail</span>

Please note that in the example service definitions, SSL encryption is enabled among services (NiFi, ES, etc.) for ease of deployment and demonstration. However, the certificates used are stored in this public repository, where anyone can see them, so **please** make sure to re-generate them before you go into production.

## Services
Please note that all the services are deployed using the [Docker](https://docker.io) engine and require the Docker daemon to be running.

Please see [the available services](./services.md) for more details.


## Workflows
Apache NiFi provides users the ability to build very large and complex data flows.
These data flows can be later saved as workflow *templates*, exported into XML format and shared with other users.
We provide a few example templates for ingesting records from a database into Elasticsearch and for extracting NLP annotations from documents.

### Deployment using Makefile
For deployments based on the example workflows, please see [example workflows](./workflows.md) for more details.

### Deployment using a custom Docker-compose
When using a fork of this repository for a customized deployment, it can be useful to copy `services.yml` to a deployment-specific `docker-compose.yml`. In this Compose file you can specify the services you need for your instance and configure all parameters per service, and you can track the file in a branch in your own fork. This way you can use your own version control and rebase on `CogStack/CogStack-NiFi` master without running into merge conflicts.
228 changes: 228 additions & 0 deletions docs/deploy/deployment.md
@@ -0,0 +1,228 @@

# 📦 Deployment

The [`deploy`](https://github.com/CogStack/CogStack-NiFi/tree/main/deploy/) directory contains an example dockerized deployment setup of the customised NiFi image, along with related services for document processing, NLP, and text analytics.

Make sure you have read the [Prerequisites](./main.md) section before proceeding.

## 🗂️ Key files

- **`services.yml`** – defines the *core* services that are orchestrated directly from this repository via Docker Compose. (Kubernetes-based multi-container deployments are coming soon.)

- **`Makefile`** – provides convenient commands for starting, stopping, and managing the deployment.

- **`.env` files in `./deploy/`** – environment variables used across services:
- These variables apply **only to the services defined inside `services.yml`**.
- Security-related `.env` files (certificates, users) are under **`./security`**.

These variables configure NiFi, Elasticsearch/OpenSearch, Kibana, Jupyter, Metricbeat, the sample DB, etc.

## 🧩 Modular service design (important)

This repository follows a **modular deployment model**:

- Only the services defined in **`services.yml`** use the environment files located in **`./deploy/*.env`**.
- **All other services** included in the ecosystem are launched via `docker-compose` commands inside their own directories, for example:

```bash
./services/<service_name>/docker/docker-compose.yml
```

- Each of these standalone services maintains **its own environment configuration** in:

```bash
./services/<service_name>/env/
```

This design allows each service to be:

- independently configurable
- versioned and deployed in isolation
- consumed by other projects without modifying the core deployment
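The layout described above can be turned into a small dry-run helper. This is only a sketch under assumptions: `standalone_compose_cmd` is a hypothetical function, it assumes it is run from the repository root, and it prints the Compose command rather than executing it. The legacy `docker-compose` binary accepts the same `-f ... up -d` arguments.

```bash
# Hypothetical dry-run helper: print (do not execute) the Compose command for a
# standalone service laid out as services/<name>/docker/docker-compose.yml.
standalone_compose_cmd() (
    service_dir="services/$1"
    compose_file="$service_dir/docker/docker-compose.yml"
    if [ ! -f "$compose_file" ]; then
        echo "no compose file: $compose_file" >&2
        exit 1
    fi
    # List the service's own env files so they can be reviewed before start-up.
    for env_file in "$service_dir"/env/*; do
        if [ -f "$env_file" ]; then
            echo "# env: $env_file"
        fi
    done
    echo "docker compose -f $compose_file up -d"
)

# Example (replace <service_name> with a folder under ./services):
#   standalone_compose_cmd <service_name>
```

Printing the command first makes it easy to check which compose file and env files belong to a service before anything is started.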

> These are the files you will most commonly modify when creating or adjusting a deployment.

## ⚙️ Additional service configuration

- Service-specific configurations are located under:
[`./services`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/)
- NiFi-specific configuration (properties, custom processors, drivers, Python scripts, etc.) is under:
[`./nifi`](https://github.com/CogStack/CogStack-NiFi/tree/main/nifi/)

## 🚀 Starting the Services

All core services defined in `services.yml` can be started using the Makefile in the `deploy/` directory.

For most services in the `services` folder that are not part of the core stack defined in `services.yml` and are pulled from external git submodule repositories, the start-up process is the same.

### ▶️ Start each service individually

You can start individual components of the CogStack-NiFi stack using the `make start-*` commands.
Each target loads all required environment variables automatically via `export_env_vars.sh`.

This is useful for:

- debugging a single service
- restarting only one component after config changes
- running lightweight subsets of the stack
- isolating problems or logs per service
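The contents of `export_env_vars.sh` are not reproduced here. As a rough sketch of the general technique only, and assuming the env files contain plain `KEY=VALUE` lines that are valid shell, exporting them into the current shell could look like this (`load_env_files` is a hypothetical name):

```bash
# Sketch of the general technique, not the repository's export_env_vars.sh:
# export every KEY=VALUE pair from the given env files into the current shell.
load_env_files() {
    for env_file in "$@"; do
        if [ ! -f "$env_file" ]; then
            echo "missing env file: $env_file" >&2
            return 1
        fi
        # A bare file name would be looked up in PATH by '.', so anchor it to the cwd.
        case "$env_file" in
            */*) ;;
            *) env_file="./$env_file" ;;
        esac
        set -a          # export every variable assigned while sourcing
        . "$env_file"
        set +a
    done
}

# Example, run from the repository root:
#   load_env_files deploy/general.env deploy/nifi.env
```

Because the files are sourced as shell code, values containing spaces or special characters must be quoted for this approach to work.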

---

#### 🧩 Core NiFi Services

```bash
make start-nifi
```

Starts:

- **nifi** — the Apache NiFi instance (main ETL/orchestration engine)
- **nifi-nginx** — reverse proxy/front-end for NiFi
- **nifi-registry-flow** — NiFi Registry backend that stores flow versions

Use when you want to run, debug, or modify NiFi workflows without bringing up the entire ecosystem.

---

#### 🏗️ Core Data Infrastructure

```bash
make start-data-infra
```

Starts:

- NiFi
- NiFi Registry Flow
- NiFi Nginx
- Elasticsearch
- Samples DB

Ideal for running ingestion pipelines and ETL workflows.

---

#### 🛢️ Elasticsearch / OpenSearch Services

```bash
make start-elastic
```

Starts the standard 2-node Elasticsearch cluster + Kibana.

```bash
make start-elastic-cluster
```

Starts all 3 ES nodes. Useful for testing clustering, sharding, and replication.

```bash
make start-elastic-1
make start-elastic-2
make start-elastic-3
```

Start individual Elasticsearch nodes for debugging or failure-scenario testing.

---

#### 📈 Kibana

```bash
make start-kibana
```

Starts Kibana for inspecting logs, checking index mappings, monitoring ES health, and debugging pipelines.

---

#### 🗄️ Databases

```bash
make start-samples
```

Starts **samples-db**, the small example DB used for demo flows.

```bash
make start-production-db
```

Starts the **cogstack-databank-db** production database.

Use when testing SQL ingestion or verifying DB-driven NiFi flows.

---

#### 📚 JupyterHub

```bash
make start-jupyter
```

Starts the CogStack JupyterHub instance. Used for notebooks, analysis, model testing, and visualisation.

---

#### 🧠 NLP Services (MedCAT & Trainer)

```bash
make start-medcat-service
```

Starts the MedCAT concept extraction inference API.

```bash
make start-medcat-service-deid
```

Starts the MedCAT DEID (de-identification) inference API.

```bash
make start-medcat-trainer
```

Starts the full MedCAT Trainer stack (Trainer UI + Solr + NGINX). Useful for annotation and supervised training tasks.

---

#### 📝 OCR Services

```bash
make start-ocr-services
```

Starts:

- **ocr-service** — main OCR pipeline
- **ocr-service-text-only** — lightweight OCR/text extraction

Use for PDF ingestion, OCR debugging, and pipeline validation.

---

#### 🛠️ Miscellaneous Services (Gitea)

```bash
make start-git-ea
```

Starts the internal Gitea Git server used for local code/config storage.

---

### 🚀 Start the Entire Stack

```bash
make start-all
```

Starts everything:

- Core infra
- JupyterHub
- MedCAT NLP services
- OCR services

Use for complete deployments, demos, or full-stack development.
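`start-all` has no recipe of its own; it works purely through Make prerequisites. The self-contained sketch below reproduces the dependency chain from `deploy/Makefile` in a throwaway directory, with echo-only stand-ins replacing the real recipes (an assumption made so that nothing is actually started), to show the order in which the targets run.

```bash
# Build a throwaway Makefile that mirrors the start-* prerequisite chain.
demo_dir=$(mktemp -d)
cat > "$demo_dir/Makefile" <<'EOF'
# Echo-only stand-ins; the real recipes start Docker services.
start-nifi: ; @echo start-nifi
start-elastic: ; @echo start-elastic
start-samples: ; @echo start-samples
start-jupyter: ; @echo start-jupyter
start-medcat-service: ; @echo start-medcat-service
start-ocr-services: ; @echo start-ocr-services
start-data-infra: start-nifi start-elastic start-samples
start-all: start-data-infra start-jupyter start-medcat-service start-ocr-services
.PHONY: start-all start-data-infra start-nifi start-elastic start-samples
.PHONY: start-jupyter start-medcat-service start-ocr-services
EOF

# Run the aggregate target; each stand-in prints its own name.
start_order=$(cd "$demo_dir" && make -s start-all)
echo "$start_order"
```

Without `-j`, Make processes prerequisites left to right, so the core data infrastructure (NiFi, Elasticsearch, samples DB) comes up before JupyterHub, the MedCAT service, and the OCR services.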