From 01271bab6816171f1e57d51692df7256a33d1bb3 Mon Sep 17 00:00:00 2001 From: vladd-bit Date: Wed, 19 Nov 2025 09:30:47 +0000 Subject: [PATCH 1/2] Docs: updated main (prerequisities), deployment + troubleshooting sections. --- README.md | 5 +- deploy/Makefile | 4 +- docs/deploy/configuration.md | 79 ++++++ docs/deploy/deployment.md | 228 ++++++++++++++++++ docs/deploy/main.md | 279 +++++++--------------- docs/deploy/services.md | 77 ------ docs/deploy/troubleshooting.md | 129 ++++++++++ docs/index.rst | 4 +- docs/nifi/main.md | 5 +- docs/security/elasticsearch_opensearch.md | 28 +-- 10 files changed, 543 insertions(+), 295 deletions(-) create mode 100644 docs/deploy/configuration.md create mode 100644 docs/deploy/deployment.md create mode 100644 docs/deploy/troubleshooting.md diff --git a/README.md b/README.md index 0fb963b42..ced0a91f8 100644 --- a/README.md +++ b/README.md @@ -4,7 +4,7 @@ [![doc-build](https://github.com/CogStack/CogStack-NiFi/actions/workflows/doc-build.yml/badge.svg?branch=main)](https://github.com/CogStack/CogStack-NiFi/actions/workflows/doc-build.yml) [![elasticsearch-stack](https://github.com/CogStack/CogStack-NiFi/actions/workflows/docker-elasticsearch-stack.yml/badge.svg?branch=main)](https://github.com/CogStack/CogStack-NiFi/actions/workflows/docker-elasticsearch-stack.yml) -## Introduction +## 💡 Introduction This repository proposes a possible next step in the evolution of free-text data processing originally implemented in [CogStack-Pipeline](https://github.com/CogStack/CogStack-Pipeline), moving towards a more modular, Platform-as-a-Service (PaaS) approach. @@ -38,7 +38,8 @@ Need help? Feel free to: | [`services`](./services) | NLP and auxiliary services, each with its own configs and resources. | | [`deploy`](./deploy) | Example deployment setup, combining NiFi and related services. | | [`scripts`](./scripts) | Helper scripts (e.g., setup tools, sample DB ingestion, Elasticsearch ingestion). 
| -| [`data`](./data) | Place any test or ingested data here. | +| [`data`](./data) | Place any test data or data to be ingested here. | +| [`typings`](./typings) | Type stubs for code linting and type hints. | --- diff --git a/deploy/Makefile b/deploy/Makefile index 83e4e2de2..53f9812f9 100644 --- a/deploy/Makefile +++ b/deploy/Makefile @@ -86,7 +86,7 @@ start-git-ea: start-data-infra: start-nifi start-elastic start-samples -start-all: start-data-infra start-jupyter +start-all: start-data-infra start-jupyter start-medcat-service start-ocr-services .PHONY: start-all start-data-infra start-nifi start-elastic start-samples start-jupyter @@ -155,7 +155,7 @@ stop-production-db: stop-data-infra: stop-nifi stop-elastic stop-samples -stop-all: stop-data-infra stop-jupyter +stop-all: stop-data-infra stop-jupyter stop-medcat-service stop-ocr-services .PHONY: stop-data-infra stop-nifi stop-elastic stop-samples stop-jupyter diff --git a/docs/deploy/configuration.md b/docs/deploy/configuration.md new file mode 100644 index 000000000..43cd2090d --- /dev/null +++ b/docs/deploy/configuration.md @@ -0,0 +1,79 @@ + + + +## Environment variables + +Environment variables have been made available after release 1.0. +The variables are configurable and are separated into security and general env vars. Furthermore, all services declared in the `services.yml` file have their variables in separate files. +In most cases, modifying these variables should be the only thing that is needed in order to run a successful deployment. 
+ +Multiple files are available, split into two categories: +- service: located in `./deploy/`, these are responsible for direct service configuration +- security: located in `./security`, certificate-related settings are always in the files starting with `certificates_` and user settings are located in the files ending with `_users` + +The variables declared in the `./deploy` folder are used in multiple config files, as follows: +- `elasticsearch.env`, variables here are used in: + - `./services/elasticsearch/config/(opensearch|elasticsearch).yml` + - `./services/kibana/config/(opensearch|elasticsearch).yml` + - `./services/metricbeat/metricbeat.yml` + - `./deploy/services.yml` in the following sections: `nifi`, `elasticsearch-1`, `elasticsearch-2`, `elasticsearch-3`, `kibana`, `metricbeat-1`, `metricbeat-2` + +- `nifi.env`, vars used in: + - `./deploy/services.yml`, sections: `nifi` + - `./nifi/conf/nifi.properties` + +- `jupyter.env`, vars used in: + - `./deploy/services.yml`, sections: `jupyter` + +- `nlp_service.env`, vars used in: + - `./deploy/services.yml`, sections: `nlp-medcat-service-production` + +- `database.env`, vars used in: + - `./deploy/services.yml`, sections: `cogstack-databank-db`, `samples-db` + +- `general.env`, these vars are optional, declare any custom variables you want here, used in the `nifi` section + +Additional env files, used only for certificate generation and user accounts, found in `./security`: +- `certificates_elasticsearch.env`, used in `create_opensearch_*`/`create_es_native*` scripts +- `certificates_general.env`, used in `create_root_ca.sh` +- `certificates_nifi.env`, used in `nifi_toolkit_security.sh` +- `database_users.env` +- `elasticsearch_users.env` +- `nginx_users.env` + + +### Customization +For custom deployments, copy all the `.env` files (which are not tracked by Git) and add deployment-specific configurations to these files. 
For example: + +``` +cp deploy/*.env deploy/new_deploy_folder/ +cp security/*.env deploy/new_deploy_folder/ +``` + +### Multiple deployments on the same machine +When deploying multiple docker-compose projects on the same machine (e.g. for dev or testing), it can be useful to remove all container, volume and network names from the docker-compose file, and let [Docker create names](https://docs.docker.com/compose/reference/envvars/#compose_project_name) based on `COMPOSE_PROJECT_NAME` in `deploy/.env`. Docker will automatically create a Docker network and make sure that containers can find each other by container name. + +For example, when setting `COMPOSE_PROJECT_NAME=cogstack-prod`, Docker Compose will create a container named `cogstack-prod_elasticsearch-1_1` for the `elasticsearch-1` service. Within the NiFi container, which is running in the same Docker network, you can refer to that container using just the service name `elasticsearch-1`. + +
+ +## Important security detail + +Please note that in the example service definitions, for ease of deployment and demonstration, SSL encryption is enabled among services (NiFi, ES, etc.). However, the certificates that are used are stored in this public repository, where anyone can see them, so **please** make sure to re-generate them before you go into production. + +## Services +Please note that all the services are deployed using the [Docker](https://docker.io) engine and require the Docker daemon to be running. + +Please see [the available services](./services.md) for more details. + + +## Workflows +Apache NiFi provides users the ability to build very large and complex data flows. +These data flows can be later saved as workflow *templates*, exported into XML format and shared with other users. +We provide a few example templates for ingesting records from a database into Elasticsearch and for extracting NLP annotations from documents. + +### Deployment using Makefile +For deployments based on the example workflows, please see [example workflows](./workflows.md) for more details. + +### Deployment using a custom Docker-compose +When using a fork of this repository for a customized deployment, it can be useful to copy `services.yml` to a deployment-specific `docker-compose.yml`. In this Compose file you can specify the services you need for your instance and configure all parameters per service, as well as track this file in a branch in your own fork. This way you can use your own version control and rebase on `CogStack/CogStack-NiFi` master without running into merge conflicts. 
diff --git a/docs/deploy/deployment.md b/docs/deploy/deployment.md new file mode 100644 index 000000000..928da02b8 --- /dev/null +++ b/docs/deploy/deployment.md @@ -0,0 +1,228 @@ + +# 📦 Deployment + +The [`deploy`](https://github.com/CogStack/CogStack-NiFi/tree/main/deploy/) directory contains an example dockerized deployment setup of the customised NiFi image, along with related services for document processing, NLP, and text analytics. + +Make sure you have read the [Prerequisites](./main.md) section before proceeding. + +## 🗂️ Key files + +- **`services.yml`** – defines the *core* services that are orchestrated directly from this repository via Docker Compose. (Kubernetes-based multi-container deployments are coming soon.) + +- **`Makefile`** – provides convenient commands for starting, stopping, and managing the deployment. + +- **`.env` files in `./deploy/`** – environment variables used across services: + - These environment variables apply **only to the services defined inside `services.yml`**. + - Security-related `.env` files (certificates, users) are under **`/security`**. + + These variables configure NiFi, Elasticsearch/OpenSearch, Kibana, Jupyter, Metricbeat, the sample DB, etc. + +## 🧩 Modular service design (important) + +This repository follows a **modular deployment model**: + +- Only the services defined in **`services.yml`** use the environment files located in **`./deploy/*.env`**. 
+- **All other services** included in the ecosystem are launched via `docker-compose` commands inside their own directories, for example: + + ```bash + ./services//docker/docker-compose.yml + ``` + +- Each of these standalone services maintains **its own environment configuration** in: + + ```bash + ./services//env/ + ``` + +This design allows each service to be: + +- independently configurable +- versioned and deployed in isolation +- consumed by other projects without modifying the core deployment + +> These are the files you will most commonly modify when creating or adjusting a deployment. + +## ⚙️ Additional service configuration + +- Service-specific configurations are located under: + [`./services`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/) +- NiFi-specific configuration (properties, custom processors, drivers, Python scripts, etc.) is under: + [`./nifi`](https://github.com/CogStack/CogStack-NiFi/tree/main/nifi/) + +## 🚀 Starting the Services + +All core services defined in `services.yml` can be started using the Makefile in the `deploy/` directory. + +For most services in the `services` folder that are not part of the core stack defined in `services.yml` and are pulled from external git submodule repositories, the start-up process is the same. + +### ▶️ Start each service individually + +You can start individual components of the CogStack-NiFi stack using the `make start-*` commands. +Each target loads all required environment variables automatically via `export_env_vars.sh`. 
+ +This is useful for: + +- debugging a single service +- restarting only one component after config changes +- running lightweight subsets of the stack +- isolating problems or logs per service + +--- + +#### 🧩 Core NiFi Services + +```bash +make start-nifi +``` + +Starts: + +- **nifi** – the Apache NiFi instance (main ETL/orchestration engine) +- **nifi-nginx** – reverse proxy/front-end for NiFi +- **nifi-registry-flow** – NiFi Registry backend that stores flow versions + +Use when you want to run, debug, or modify NiFi workflows without bringing up the entire ecosystem. + +--- + +### 🏗️ Start Core Data Infrastructure + +```bash +make start-data-infra +``` + +Starts: + +- NiFi +- NiFi Registry Flow +- NiFi Nginx +- Elasticsearch +- Samples DB + +Ideal for running ingestion pipelines and ETL workflows. + +--- + +#### 🛢️ Elasticsearch / OpenSearch Services + +```bash +make start-elastic +``` + +Starts the standard 2-node Elasticsearch cluster + Kibana. + +```bash +make start-elastic-cluster +``` + +Starts all 3 ES nodes. Useful for testing clustering, sharding, and replication. + +```bash +make start-elastic-1 +make start-elastic-2 +make start-elastic-3 +``` + +Start individual Elasticsearch nodes for debugging or failure-scenario testing. + +--- + +#### 📈 Kibana + +```bash +make start-kibana +``` + +Starts Kibana for inspecting logs, checking index mappings, monitoring ES health, and debugging pipelines. + +--- + +#### 🗄️ Databases + +```bash +make start-samples +``` + +Starts **samples-db**, the small example DB used for demo flows. + +```bash +make start-production-db +``` + +Starts the **cogstack-databank-db** production database. + +Use when testing SQL ingestion or verifying DB-driven NiFi flows. + +--- + +#### 📚 JupyterHub + +```bash +make start-jupyter +``` + +Starts the CogStack JupyterHub instance. Used for notebooks, analysis, model testing, and visualisation. 
+ +--- + +#### 🧠 NLP Services (MedCAT & Trainer) + +```bash +make start-medcat-service +``` + +Starts the MedCAT concept extraction inference API. + +```bash +make start-medcat-service-deid +``` + +Starts the MedCAT DEID (de-identification) inference API. + +```bash +make start-medcat-trainer +``` + +Starts the full MedCAT Trainer stack (Trainer UI + Solr + NGINX). Useful for annotation and supervised training tasks. + +--- + +#### 📝 OCR Services + +```bash +make start-ocr-services +``` + +Starts: + +- **ocr-service** – main OCR pipeline +- **ocr-service-text-only** – lightweight OCR/text extraction + +Use for PDF ingestion, OCR debugging, and pipeline validation. + +--- + +#### 🛠️ Miscellaneous Services (Gitea) + +```bash +make start-git-ea +``` + +Starts the internal Gitea Git server used for local code/config storage. + +--- + +### 🚀 Start the Entire Stack + +```bash +make start-all +``` + +Starts everything: + +- Core infra +- JupyterHub +- MedCAT NLP services +- OCR services + +Use for complete deployments, demos, or full-stack development. diff --git a/docs/deploy/main.md b/docs/deploy/main.md index c851613a4..e10c9eda3 100755 --- a/docs/deploy/main.md +++ b/docs/deploy/main.md @@ -1,229 +1,116 @@ -# Prequisites +# 📋 Prerequisites -Software required on machine: - - git + git-lfs - - Docker +Please read carefully as there can be many points of failure when installing/deploying everything into a clean environment. -You can use the script with `SUDO` rights, located at `/scripts/installation_utils/install_docker_and_utils.sh`, it can be used on Debian/Ubuntu/CentOS/RedHAT RHEL 8 only, run it once and everything should be set up. -Consult the (`Docker installation steps`)[https://docs.docker.com/engine/install/debian/] if there are issues with the docker setup. +## 🖥️ OS Requirements -:::{warning} -IMPORTANT NOTE: Do a `git-lfs pull` so that you have everything downloaded from the repo (including bigger zipped files.). 
-::: - -# Deployment -[./deploy](https://github.com/CogStack/CogStack-NiFi/tree/main/deploy/) contains an example deployment of the customised NiFi image with related services for document processing, NLP and text analytics. - -The key files are: -- `services.yml` - defines all the available services in docker-compose format. K8s (i.e. multi container service deployments is coming soon...) -- `Makefile` - scripts for running docker-compose commands, -- `.env` - local environment variables definitions, deployment `.env` files are located in the `/deploy` folder, security `.env` files are located in the `/security` folder, containing users and certificate generation settings. -The above mentioned files should be the files that you will most likely need to change during a deployment. - -Individual service configurations are provided in [`./services`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/). - -Apache NiFi-related files are provided in [`./nifi`](https://github.com/CogStack/CogStack-NiFi/tree/main/nifi/) directory. - -
- -## Environment variables - -As mentioned above, environment variables have been made available after release 1.0. -The variables are configurable, and are separated, into security and general env vars, furthermore, all services declared in the `services.yml` file have their variables in separate files. -In most cases, modifying these variables should be the only thing that is needed in order to run a successful deployment. - -Multiple files are available, split into two categories: -- service: located in `./deploy/` are reponsible for direct service configuration -- security: located in `./security`, ceriticate related settings are always in the files starting with `certificates_` and user settings are located in the files ending with `_users` - -The variables declared in the `./deploy` folder are used in multiple config files, as follows: -- `elasticsearch.env`, variables here are used in : - - `./services/elasticsearch/config/(opensearch|elasticsearch).yml` - - `./services/kibana/config/(opensearch|elasticsearch).yml` - - `./services/metricbeat/metricbeat.yml` - - `./deploy/services.yml` in the following sections: `nifi`, `elasticsearch-1`, `elasticsearch-1`, `elasticsearch-3`, `kibana`, `metricbeat-1`,`metricbeat-2` - -- `nifi.env`, vars used in: - - `./deploy/services.yml`, sections: `nifi` - - `./nifi/conf/nifi.properties` - -- `jupyter.env`, vars used in: - - `./deploy/services.yml`, sections: `jupyter` +Please note that only the operating systems and versions listed below are supported; we do not provide support for anything that is not listed here. -- `nlp_service.env`, vars used in: - - `./deploy/services.yml`, sections: `nlp-medcat-service-production` +- Linux (Ubuntu 24.04 LTS+ and Debian 10+ are preferred; RHEL 9+). 
+- Windows 11+/Windows Server 2022+ (requires a [WSL 2.0](https://learn.microsoft.com/en-us/windows/wsl/about) installation with an Ubuntu image; for a working setup, follow [this guide](https://documentation.ubuntu.com/wsl/latest/howto/install-ubuntu-wsl2/) and come back here once it is working). +- macOS 15+ (Sequoia). -- `database.env`, vars used in: - - `./deploy/services.yml`, sections: `cogstack-databank-db`, `samples-db` +## 🧰 Software requirements (Linux/macOS) -- `general.env`, these vars are optional, declared any custom variables you want here, used in the `nifi` section +Software required on machine (the minimum/basics to get demos running): -Additional variablesenv files, used only or certificate generation and user accounts, found in `./security`: -- `certificates_elasticsearch.env`, used in `create_opensearch_*`/`create_es_native*` scripts -- `certificates_general.env`, used in `create_root_ca.sh` -- `certificates_nifi.env`, used in `nifi_toolkit_security.sh` -- `database_users.env` -- `elasticsearch_users.env` -- `nginx_users.env` +- make +- git + git-lfs +- Docker +- python3.11 +## 🔐 Other requirements (User Permissions/Firewall) -### Customization -For custom deployments, copy all the `.env` files (which are not tracked by Git) and add deployment specific configurations to these files. 
For example: + - a Linux account with 'admin' rights if possible; if not, you will need to get your IT team to take a look at this README and install the packages for you using the steps below (make sure they look at the [Docker rootless installation steps](https://docs.docker.com/engine/security/rootless)) + - firewall whitelisting of the following addresses: + - https://github.com/ + - https://hub.docker.com/ + - https://docker.io + - http://download.docker.com + - https://huggingface.co/ + - https://www.nltk.org + - https://pypi.org/ + - https://pypi.python.org + - https://files.pythonhosted.org + - https://pythonhosted.org -``` -cp deploy/*.env deploy/new_deploy_folder/ -cp security/*.env deploy/new_deploy_folder/ -``` +## ⚙️ Installation steps -### Multiple deployments on the same machine -When deploying multiple docker-compose projects on the same machine (e.g. for dev or testing), it can be useful to remove all containers, volume and network names from the docker-compose file, and let [Docker create names](https://docs.docker.com/compose/reference/envvars/#compose_project_name) based on `COMPOSE_PROJECT_NAME` in `deploy/.env`. Docker will automatically create a Docker network and makes sure that containers can find each other by container name. +The steps below assume you are the system admin, meaning you have `sudo` rights. +You can run the script located at `/scripts/installation_utils/install_docker_and_utils.sh` with `sudo` rights. It can be used on Debian (10+), Ubuntu (22.04+) and Red Hat RHEL 8/9 only; run it once and everything should be set up. -For example, when setting `COMPOSE_PROJECT_NAME=cogstack-prod`, Docker Compose will create a container named `cogstack-prod_elasticsearch-1_1` for the `elasticsearch-1` service. Within the NiFi container, which is running in the same Docker network, you can refer to that container using just the service name `elasticsearch-1`. +Execute the following commands in the root directory of the repo: -
- -## Important security detail - -Please note that in the example service defintions, for ease of deployment and demonstration, SSL encryption is enabled among services (NiFi, ES, etc.), however, the certificates that are used are in this public repository, anyone can see them, so **please** make sure to re-generate them when you go into production. - -## Services -Please note that all the services are deployed using [Docker](https://docker.io) engine and requires docker deamon to be running / functioning. - -Please see [the available services](./services.md) for more details. +1. `git-lfs pull` +2. (OPTIONAL: skip this step if you already have the software from [this section](#-software-requirements-linuxmacos) installed) `sudo bash ./scripts/installation_utils/install_docker_and_utils.sh`, and wait for it to finish; it may take a while to get all the packages. +3. `sudo bash ./scripts/git_update_submodules_in_repo.sh` +4. check that Docker works correctly: `docker pull hello-world` +5. if no errors, run: `docker run --rm hello-world`, it should run without issues +6. if there are any issues, check the warning sections below +:::{warning} +IMPORTANT NOTE: Do a `git-lfs pull` so that you have everything downloaded from the repo (including bigger zipped files). +::: -## Workflows -Apache NiFi provides users the ability to build very large and complex data flows. -These data flows can be later saved as workflow *templates*, exported into XML format and shared with other users. -We provide few example templates for ingesting the records from a database into Elasticsearch and to perform extraction of NLP annotations from documents. +:::{warning} +Ensure all Git submodules are initialized and updated: +`sudo bash ./scripts/git_update_submodules_in_repo.sh` +::: -### Deployment using Makefile -For deployments based on the example workflows, please see [example workflows](./workflows.md) for more details. +:::{warning} +Consult the following if there are issues with the Docker setup. 
+If Docker fails to install or `docker pull hello-world` does not work: -### Deployment using a custom Docker-compose -When using a fork of this repository for a customized deployments, it can be useful to copy `services.yml` to a deployment-specific `docker-compose.yml`. In this Compose file you can specify the services you need for your instance and configure all parameters per service, as well as track this file in a branch in your own fork. This way you can use your own version control and rebase on `CogStack/CogStack-NiFi` master without running into merge conflicts. + - Follow the official [Docker installation steps](https://docs.docker.com/engine/install/debian/) + - Ensure your user is in the docker group + - For non-sudo users, check Docker rootless mode and required post-install steps: + - https://docs.docker.com/engine/security/rootless/ + - https://docs.docker.com/engine/install/linux-postinstall/ +::: -## Troubleshooting +## ⚠️ Essential Elasticsearch Requirement -Always start with fresh containers and volumes, to make sure that there are no volumes from previous experimentations, make sure to always delete all/any cogstack running containers by executing: +:::{warning} +**Elasticsearch may fail to start unless `vm.max_map_count` is increased.** -`docker container rm samples-db elasticsearch-1 kibana nifi nlp-medcat-service-production tika-service nlp-gate-drugapp nlp-medcat-snomed nlp-gate-bioyodie medcat-trainer-ui medcat-trainer-nginx jupyter-hub -f` +If this value is too low, Elasticsearch will exit with the error: -followed by a cleanup or dangling volumes (careful as this will remove all volumes which are NOT being used by a container, if you want to remove specific volumes you will have to manually specifiy the volume names), otherwise, you can specify : + ```bash + bootstrap checks failed + max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144] + ``` -`docker volume prune -f` WARNING THIS WILL DELETE ALL 
UNUSED VOLUMES ON YOUR MACHINE!. Check the volume names used in services.yml file and delete them as necessary `dockr volume rm volume_name` +If you did **not** run the installation script, set it manually: -### Known Issues/errors -Common issues that can be encountered across services. -
-
+**Temporary (until reboot):** + ```bash + sudo sysctl -w vm.max_map_count=262144 + ``` -#### **Apple Silicon** +**Permanent (persists across reboots):** +Add the line below to `/etc/sysctl.conf`: -Many services cannot run natively on Apple Silicon (such as M1 and M2 architectures). Common error messages related to Apple silicon follow patterns similar to: -

- - `no match for platform in manifest` -

-

- - `no matching manifest for linux/arm64/v8 in the manifest list entries` -

-

- - `image with reference cogstacksystems/cogstack-ocr-service:0.2.4 was found but does not match the specified platform: wanted linux/arm64, actual: linux/amd64` -

-To solve these issues; Rosetta is required and enabled in Docker Desktop. Finally an environment variable is required to be set. + ```bash + vm.max_map_count=262144 + ``` -Rosetta can which can be installed via the following command: -``` -softwareupdate --install-rosetta -``` -When Rosetta and Docker Desktop are installed, Rosetta must be enabled. This done by going to Docker Desktop -> Setting -> General and enabling "Use Virtualization framework". After in the same settings go to "features in development" -> "Use Rosetta for x86/amd64 emulation on Apple Silicon". Finally execute the following command: -``` -export DOCKER_DEFAULT_PLATFORM=linux/amd64 -``` -to set the environment variable. These issues are known to occur on the "cogstack-nifi", "cogstack-ocr-services" and "jupyter-hub" services and may occur on others. +Or a one-liner: -#### **NiFi** + ```bash + sudo sh -c "echo 'vm.max_map_count=262144' >> /etc/sysctl.conf" + ``` -When dealing with contaminated deployments ( containers using volumes from previous instances ) : -

- - `NiFi only supports one mode of HTTP or HTTPS operation...` deleting the volumes should usually solve this issue, if not, please check the `nifi.properties` if there have been modifications done by yourself or a developer on it. -

- - building the NiFi image manually on a restricted system, this is usually not necessary, but if for some reason this needs to be done then some settings such as proxy configs might need to be set up in the `nifi/Dockerfile` epecially ones related to the `grape` application and dealing with external downloads. -

- - `keystore.jks`/`truststore.jks` related errors, remove the nifi container & related volumes then restart the nifi instance. -

- - `System Error: Invalid host header : this occurs when nifi host has not been properly configured`, please check the `/nifi/conf/nifi.properties` file and set the `nifi.web.proxy.host` property to the IP address of the server along with the port `:`, if this does not work then it is usually a proxy/network configuration problem (also check firewalls), another workaround would be to comment out the following subsections of the `nifi` service in the `services.yml` file : `ports:` and `networks` with all their child settings. After this is done the following property should be added `network_mode: host`, restart the instance using the `docker-compoes -f services.yml up -d nifi` command afterwards. -

- - Possible error when dealing with non-pgsql databases `due to Incorrect syntax near 'LIMIT'.; routing to failure: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near 'LIMIT'`, go to the GenerateTableFetch Process -> right-click -> configure -> change database type from Generic to -> MS SQL 2012 + or 2008 (if an older DB system is used) - - Possible error on Linux systems related to `nifi.properties` permission error and/or other files from the `nifi/conf/` folder, please see the [nifi doc](../nifi/main.md#important-note-about-nifi-properties) {nifi.properties} section. -

- - `Driver class org.postgresql.Driver is not found` or something similar for other MSSQL/SQL drivers, this is a known issue after NiFi version v1.20+, first, make sure you pull the latest version of the repository, then for the JAR file you are using, please execute the following command in order to verify its integrity `jar -tvf ./nifi/drivers/your_file_version.jar`, if this returns a list of files and NO errors then the files are not corrupted and can be loaded. On the NiFi side make sure to go to the `DBCPConnectionPool` controller service and verify the propertiesit a few times, make sure the file path is correct and in the following format: `file:///opt/nifi/drivers/postgresql-42.6.0.jar` for example. If all this fails stop nifi, delete all the Docker volumes associated with it -> restart NiFi, perform the above steps again. You can try forcefully starting the `GenerateTableFetch` or `QueryDatabaseTable` processors by enabling the `DBCPConnectionPool` even if an error popus up after clicking the verify button. -

- - `502 Bad Gateway`, NiFi simply not starting, even after waiting more than 2-3 minutes. This can occur due to a wide variety of issues, you can check the NiFi container log : β€œdocker logs -f --tail 1000 cogstack-nifi > my_log_file.txt” to capture the output easily. The most common cause is running out of memory, increase or decrease the limits in `nifi/conf/bootstrap.conf` according to your machine's spec, please read [bootstrap.conf](../nifi/main.md#bootstrapconf) -

- - `Unable to connect to ElasticSearch` using the `ElasticSearchClientService` NiFi controller, make sure the settings are correct (username,password,certificates, etc.) and then click `Apply`, disregard the errors and click `Enable` on the controller to forcefully reload the controller, stop it and then validate the settings, start it again after and it should work. +Then apply: -#### **Elasticsearch Errors** -
+ ```bash + sudo sysctl -p + ``` -##### **VM memory errors, failed bootstrap check** -
+> The `install_docker_and_utils.sh` script automatically configures this. +> You only need to set it manually if the script was skipped. +::: -It is quite a common issue for both opensearch and native-ES to error out when it comes to virtual memory allocation, this error typically comes in the form of : +## πŸ… Deploying services -``` -ERROR: [1] bootstrap checks failed -[1]: max virtual memory areas vm.max_map_count [65111] is too low, increase to at least [262144] -``` -To solve this one needs to simply execute : -
- - on Linux/Mac OS X : - ```sysctl -w vm.max_map_count=262144``` in terminal. - To make the same change systemwide plase add ```vm.max_map_count=262144``` to /etc/sysctl.conf and restart the dockerservice/machine. - An example of this can be found under /services/elasticsearch/sysctl.conf -
- - on Windows you need to enter the following commands in a powershell instance: -
- ```wsl -d docker-desktop``` -
- ```sysctl -w vm.max_map_count=262144``` - -For more on this issue please read: https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html - -
- -##### **OpenSearch: validating opensearch.yml hosts** -
- - -``` -FATAL Error: [config validation of [opensearch].hosts]: types that failed validation: -- [config validation of [opensearch].hosts.0]: expected URI with scheme [http|https]. -- [config validation of [opensearch].hosts.1]: could not parse array value from json input -``` - -This issue may appear after the recent switch to using fully customizable environment variables. Strings and ENV vars may be parsed differently depending on the shell version found on the host system. - -To solve this, the easiest way is to make sure to load the `elasticsearch.env` variables before starting the Elastic & Kibana containers by doing the following: - -``` - cd ./deploy/ - set -a - source elasticsearch.env - make start-elastic -``` - -Alternatively (if the script executes without issues): -``` - cd ./deploy/ - source export_env_vars.sh - make start-elastic -``` - - -### DB-samples issues - -``` No table data for samples_db``` -It is possible that you may have forgotten to pull the large files from the repo, please do : `git lfs pull` . -Delete the samples-db container and it's volumes and restart it, you should now see the data in the tables. \ No newline at end of file +If everything up to this point is running fine, then, congratulations, you should now be able to start looking at the [deployment section](./deployment.md) diff --git a/docs/deploy/services.md b/docs/deploy/services.md index 0447b0dc0..e3f688607 100644 --- a/docs/deploy/services.md +++ b/docs/deploy/services.md @@ -89,49 +89,6 @@ Bio-YODIE requires [UMLS](https://www.nlm.nih.gov/research/umls/index.html) reso MedCAT SNOMED CT model requires a prepared model based on [SNOMED CT](http://www.snomed.org/) dictionary with the model available in `RES_MEDCAT_SERVICE_MODEL_PRODUCTION_PATH` directory. These paths can be defined in `.env` file in the deployment directory. 
-### Bio-YODIE -[Bio-YODIE](https://github.com/GateNLP/Bio-YODIE) is a named entity linking application build using [GATE NLP](https://gate.ac.uk/) suite ([publication](https://arxiv.org/abs/1811.04860)). - -The application files are stored in [`nlp-services/applications/bio-yodie/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/bio-yodie) directory. - -The Bio-Yodie service configuration is stored in [`nlp-services/applications/bio-yodie/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/bio-yodie/config) directory - the key service configuration properties are defined in `application.properties` file. - - -### GATE - -**Important** -Please note that this application is provided just as a proof-of-concept of running GATE applications. - - -This simple application implements annotation of common drugs and medications. -It was created using [GATE NLP](https://gate.ac.uk/sale/tao/splitch13.html) suite and uses GATE ANNIE Gazetteer plugin. -The application was been created in GATE Developer studio and exported into `gapp` format. -This application is hence ready to be used by GATE and is stored in `nlp-services/applications/drug-app/gate` directory as `drug.gapp` alongside the used resources. - -The list of drugs and medications to annotate is based on a publicly available list of FDA-approved drugs and active ingredients. -The data can be downloaded directly from [Drugs@FDA database](https://www.fda.gov/drugs/informationondrugs/ucm079750.htm). - -This applications is being run using a NLP Service runner application that uses internally [GATE Embedded](https://gate.ac.uk/family/embedded.html) (for running GATE applications) and exposes a REST API. -The NLP Service necessary configuration files are stored in `nlp-services/applications/drug-app/config/` directory - the key service configuration properties are defined in `application.properties` file. 
- -If you would like to build the docker image with already initialized NLP application, service and necessary resources bundled, please use provided `Dockerfile` in the `nlp-services/applications/drug-app/` directory. - -To deploy an example GATE NLP Drug names extraction application as a service, type: -``` -make start-nlp-gate -``` -The command will deploy `nlp-gate-drugapp` service. -Please see below the description of the deployed NLP service. - -To stop the service, type: -``` -make stop-nlp-gate -``` - - -**Important** -This service will be discontinued in the near future, meaning it will be removed from the repo. - ### MedCAT [MedCAT](https://github.com/CogStack/MedCAT) is a named entity recognition and linking application for concept annotation from UMLS or any other source. @@ -139,7 +96,6 @@ MedCAT is deployed as a service exposing RESTful API using the implementation fr ### MedCAT Service - MedCAT Service resources are stored in [`./services/nlp-services/applications/medcat/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat) directory. The key configuration properties stored as environment variables are defined in [`./services/nlp-services/applications/medcat/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat/config) sub-directory. The models used by MedCAT are stored in [`./servies/nlp-services/applications/cat/models/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat/models). @@ -257,21 +213,6 @@ More configuration options are covered in [nifi-doc](../nifi/main.md). Other `.env` files are mounted but those are only useful for custom scripts where you plan to use certain vars from other services, check the `services.yml` nifi `env-file` section definition. -## Tika Service - -`tika-service` provides document text extraction functionality of [Apache Tika](https://tika.apache.org/). 
-[Tika Service](https://github.com/CogStack/tika-service) implements the actual Apache Tika functionality behind a RESTful API. - -The application data, alongside configuration file, is stored in [`./services/tika-service`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/tika-service) directory. - -When deployed Tika Service exposes port `8090` at `tika-service` container being available to all services within `cognet` Docker network, most importantly by `nifi` data processing engine. -The Tika service REST API endpoint for processing documents is available at `http://tika-service:8090/api/process`. - -For more details on configuration, API definition and example use of Tika Service please refer to [the official documentation](https://github.com/CogStack/tika-service). - -### ENV/CONF files: -- `/deploy/tika-service/config/application.yaml` - ## OCR Service The new `ocr-service` provides a new way to OCR documents at good speed, the equivalent in Tika-service but revwritten in Python and optimized. @@ -293,7 +234,6 @@ In the example deployment we use NLP applications running as a service exposing The current version of API specs is specified in [`./services/nlp-services/api-specs/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/api-specs) directory (both [Swagger](https://swagger.io/) and [OpenAPI](https://www.openapis.org/) specs). The applications are stored in [`./services/nlp-services/applications`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications). - ### NLP API All the NLP services implement a RESTful API that is defined in [OpenAPI specification](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/api-specs/openapi.yaml). 
@@ -307,23 +247,6 @@ Please see example Apache NiFi [workflows](./workflows.md) and [user scripts](ht For further details on the used API please refer to the [OpenAPI specification](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/api-specs/openapi.yaml) for the definition of the request and response payload. -### GATE NLP -`nlp-gate-drugapp` serves a simple drug names extraction NLP application using [GATE NLP Service](https://github.com/CogStack/gate-nlp-service). -This simple application implements annotation of common drugs and medications. -It was created using [GATE NLP](https://gate.ac.uk/sale/tao/splitch13.html) suite and uses GATE ANNIE Gazetteer plugin. -The GATE application definition and resources are available in directory [`./services/nlp-services/applications/drug-app`](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/applications/drug-app/). - -When deployed `nlp-gate-drugapp` exposes port `8095` on the container. -The port is also bound from container to the host machine `8095` port. -The service endpoint should be available to all the services running inside the `cognet` Docker network. -For example, to access the API endpoint to process a document by a service in `cognet` network, the endpoint address would be `http://nlp-gate-drugapp:8095/api/process`. - -As a side note, when deployed `nlp-gate-bioyodie` (assuming that the Bio-YODIE resources are properly set up with `RES_BIOYODIE_UMLS_PATH` variable), the service will only expose port `8095` on container. -Although the service won't be accessible from the host machine, but all the services inside the `cognet` network will be able to access it. - -For more information on the GATE NLP Service configuration and use please refer to [the official documentation](https://github.com/CogStack/gate-nlp-service). 
- - ### MedCAT NLP [MedCAT](https://github.com/CogStack/MedCAT) is a named entity recognition and linking application for concept annotation from UMLS or any other source. MedCAT deployment consists of [MedCAT NLP Service](https://github.com/CogStack/MedCATservice) serving NLP models via RESTful API and [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) for collecting annotations and refinement of the NLP models. diff --git a/docs/deploy/troubleshooting.md b/docs/deploy/troubleshooting.md new file mode 100644 index 000000000..dbbbcd36e --- /dev/null +++ b/docs/deploy/troubleshooting.md @@ -0,0 +1,129 @@ +# Troubleshooting + +Always start with fresh containers and volumes, to make sure that there are no volumes from previous experimentations, make sure to always delete all/any cogstack running containers by executing: + +`docker container rm samples-db elasticsearch-1 kibana nifi nlp-medcat-service-production tika-service nlp-gate-drugapp nlp-medcat-snomed nlp-gate-bioyodie medcat-trainer-ui medcat-trainer-nginx jupyter-hub -f` + +followed by a cleanup or dangling volumes (careful as this will remove all volumes which are NOT being used by a container, if you want to remove specific volumes you will have to manually specifiy the volume names), otherwise, you can specify : + +`docker volume prune -f` WARNING THIS WILL DELETE ALL UNUSED VOLUMES ON YOUR MACHINE!. Check the volume names used in services.yml file and delete them as necessary `dockr volume rm volume_name` + +## Known Issues/errors + +Common issues that can be encountered across services. +
+
+ +### **Apple Silicon** + +Many services cannot run natively on Apple Silicon (such as M1 and M2 architectures). Common error messages related to Apple silicon follow patterns similar to: +

+ - `no match for platform in manifest` +

+

+ - `no matching manifest for linux/arm64/v8 in the manifest list entries` +

+

+ - `image with reference cogstacksystems/cogstack-ocr-service:0.2.4 was found but does not match the specified platform: wanted linux/arm64, actual: linux/amd64` +

+To solve these issues, Rosetta must be installed and enabled in Docker Desktop. Finally, an environment variable must be set. + +Rosetta can be installed via the following command: + +```bash +softwareupdate --install-rosetta +``` + +When Rosetta and Docker Desktop are installed, Rosetta must be enabled. This is done by going to Docker Desktop -> Settings -> General and enabling "Use Virtualization framework". Afterwards, in the same settings, go to "features in development" -> "Use Rosetta for x86/amd64 emulation on Apple Silicon". Finally execute the following command: + +```bash +export DOCKER_DEFAULT_PLATFORM=linux/amd64 +``` + +to set the environment variable. These issues are known to occur on the "cogstack-nifi", "cogstack-ocr-services" and "jupyter-hub" services and may occur on others. + +### **NiFi** + +When dealing with contaminated deployments ( containers using volumes from previous instances ) : +

+ - `NiFi only supports one mode of HTTP or HTTPS operation...` deleting the volumes should usually solve this issue, if not, please check the `nifi.properties` if there have been modifications done by yourself or a developer on it. +

+ - building the NiFi image manually on a restricted system, this is usually not necessary, but if for some reason this needs to be done then some settings such as proxy configs might need to be set up in the `nifi/Dockerfile` especially ones related to the `grape` application and dealing with external downloads. +

+ - `keystore.jks`/`truststore.jks` related errors, remove the nifi container & related volumes then restart the nifi instance. +

+ - `System Error: Invalid host header : this occurs when nifi host has not been properly configured`, please check the `/nifi/conf/nifi.properties` file and set the `nifi.web.proxy.host` property to the IP address of the server along with the port `:`, if this does not work then it is usually a proxy/network configuration problem (also check firewalls), another workaround would be to comment out the following subsections of the `nifi` service in the `services.yml` file : `ports:` and `networks` with all their child settings. After this is done the following property should be added `network_mode: host`, restart the instance using the `docker-compose -f services.yml up -d nifi` command afterwards. +

+ - Possible error when dealing with non-pgsql databases `due to Incorrect syntax near 'LIMIT'.; routing to failure: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near 'LIMIT'`, go to the GenerateTableFetch Process -> right-click -> configure -> change database type from Generic to -> MS SQL 2012 + or 2008 (if an older DB system is used) + - Possible error on Linux systems related to `nifi.properties` permission error and/or other files from the `nifi/conf/` folder, please see the [nifi doc](../nifi/main.md#important-note-about-nifi-properties) {nifi.properties} section. +

+ - `Driver class org.postgresql.Driver is not found` or something similar for other MSSQL/SQL drivers, this is a known issue after NiFi version v1.20+, first, make sure you pull the latest version of the repository, then for the JAR file you are using, please execute the following command in order to verify its integrity `jar -tvf ./nifi/drivers/your_file_version.jar`, if this returns a list of files and NO errors then the files are not corrupted and can be loaded. On the NiFi side make sure to go to the `DBCPConnectionPool` controller service and verify the properties a few times, make sure the file path is correct and in the following format: `file:///opt/nifi/drivers/postgresql-42.6.0.jar` for example. If all this fails stop nifi, delete all the Docker volumes associated with it -> restart NiFi, perform the above steps again. You can try forcefully starting the `GenerateTableFetch` or `QueryDatabaseTable` processors by enabling the `DBCPConnectionPool` even if an error pops up after clicking the verify button. +

+ - `502 Bad Gateway`, NiFi simply not starting, even after waiting more than 2-3 minutes. This can occur due to a wide variety of issues, you can check the NiFi container log : “docker logs -f --tail 1000 cogstack-nifi > my_log_file.txt” to capture the output easily. The most common cause is running out of memory, increase or decrease the limits in `nifi/conf/bootstrap.conf` according to your machine's spec, please read [bootstrap.conf](../nifi/main.md#bootstrapconf) +

+ - `Unable to connect to ElasticSearch` using the `ElasticSearchClientService` NiFi controller, make sure the settings are correct (username,password,certificates, etc.) and then click `Apply`, disregard the errors and click `Enable` on the controller to forcefully reload the controller, stop it and then validate the settings, start it again after and it should work. + +### **Elasticsearch Errors** + +#### **VM memory errors, failed bootstrap check** + +It is quite a common issue for both opensearch and native-ES to error out when it comes to virtual memory allocation, this error typically comes in the form of : + +```bash +ERROR: [1] bootstrap checks failed +[1]: max virtual memory areas vm.max_map_count [65111] is too low, increase to at least [262144] +``` + +To solve this one needs to simply execute : +
+ - on Linux/Mac OS X : + ```sysctl -w vm.max_map_count=262144``` in terminal. + To make the same change systemwide please add ```vm.max_map_count=262144``` to /etc/sysctl.conf and restart the Docker service/machine. + An example of this can be found under /services/elasticsearch/sysctl.conf +
+ - on Windows you need to enter the following commands in a powershell instance: +
+ ```wsl -d docker-desktop``` +
+ ```sysctl -w vm.max_map_count=262144``` + +For more on this issue please read: https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html + +
+ +#### **OpenSearch: validating opensearch.yml hosts** + +```bash +FATAL Error: [config validation of [opensearch].hosts]: types that failed validation: +- [config validation of [opensearch].hosts.0]: expected URI with scheme [http|https]. +- [config validation of [opensearch].hosts.1]: could not parse array value from json input +``` + +This issue may appear after the recent switch to using fully customizable environment variables. Strings and ENV vars may be parsed differently depending on the shell version found on the host system. + +To solve this, the easiest way is to make sure to load the `elasticsearch.env` variables before starting the Elastic & Kibana containers by doing the following: + +```bash + cd ./deploy/ + set -a + source elasticsearch.env + make start-elastic +``` + +Alternatively (if the script executes without issues): + +```bash + cd ./deploy/ + source export_env_vars.sh + make start-elastic +``` + +### DB-samples issues + +```bash +No table data for samples_db +``` + +It is possible that you may have forgotten to pull the large files from the repo, please do : `git lfs pull`. + +Delete the samples-db container and it's volumes and restart it, you should now see the data in the tables. diff --git a/docs/index.rst b/docs/index.rst index 300a7dc8a..001b209e0 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -15,10 +15,10 @@ Welcome to CogStack-Nifi's documentation! nifi/main.md security/main.md deploy/main.md - deploy/services.md + deploy/deployment.md + deploy/troubleshooting.md deploy/workflows.md - Indices and tables ================== diff --git a/docs/nifi/main.md b/docs/nifi/main.md index dbbfb0375..42b1ae1fc 100644 --- a/docs/nifi/main.md +++ b/docs/nifi/main.md @@ -1,4 +1,5 @@ -# NiFi +# πŸ’§ NiFi + This directory contains files related with our custom Apache NiFi image and example deployment templates with associated services. 
Apache NiFi is used as a customizable data pipeline engine for controlling and executing data flow between used services. There are multiple workflow templates provided with custom user scripts to work with NiFi. @@ -16,7 +17,7 @@ Please read the following [article](https://nifi.apache.org/docs/nifi-docs/html/ Avro Schema:[official documentation](https://avro.apache.org/docs/1.11.1/) -## `NiFi directory layout : /nifi` +## `NiFi directory layout : /nifi` ``` β”œβ”€β”€ Dockerfile - contains the base definition of the NiFi image along with all the packages/addons installed diff --git a/docs/security/elasticsearch_opensearch.md b/docs/security/elasticsearch_opensearch.md index bacac0374..fde8a0aba 100644 --- a/docs/security/elasticsearch_opensearch.md +++ b/docs/security/elasticsearch_opensearch.md @@ -38,6 +38,20 @@ cd ../security --- +### βš™οΈ Version variable + +Set the ES/OS version in `deploy/elasticsearch.env` before launching containers: + +```bash +ELASTICSEARCH_VERSION=opensearch +# or +ELASTICSEARCH_VERSION=elasticsearch +``` + +This ensures the correct certificate directory (`elasticsearch` or `opensearch`) is mounted into containers. + +--- + ### 🧩 Common certificate layout Certificate naming and folder structure are consistent across both ES and OpenSearch: @@ -124,20 +138,6 @@ security/certificates/elastic/opensearch/ --- -## βš™οΈ Version variable - -Set the ES/OS version in `deploy/elasticsearch.env` before launching containers: - -```bash -ELASTICSEARCH_VERSION=opensearch -# or -ELASTICSEARCH_VERSION=elasticsearch -``` - -This ensures the correct certificate directory (`elasticsearch` or `opensearch`) is mounted into containers. - ---- - ### πŸ“ Kibana / OpenDashboard certificates | Platform | Required Certificates | Source Folder | From 1215c9936f4b73972c801584fdc1f441a3cf530b Mon Sep 17 00:00:00 2001 From: vladd-bit Date: Thu, 20 Nov 2025 08:54:51 +0000 Subject: [PATCH 2/2] Docs: revised services section. 
--- README.md | 6 +- deploy/Makefile | 4 +- deploy/services.yml | 2 +- docs/deploy/services.md | 884 ++++++++++++++++----------------- docs/deploy/troubleshooting.md | 20 +- docs/index.rst | 5 +- docs/news.md | 6 +- 7 files changed, 453 insertions(+), 474 deletions(-) diff --git a/README.md b/README.md index ced0a91f8..e9214e4b3 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ This repository proposes a possible next step in the evolution of free-text data **CogStack-NiFi** demonstrates how to use [Apache NiFi](https://nifi.apache.org/) as the central data workflow engine for clinical document processing, integrating services such as text extraction and natural language processing (NLP). Each component runs as a standalone service, with NiFi handling data routing between components and data sources/sinks. -All NLP services are expected to implement a uniform RESTful API, allowing seamless integration into existing pipelinesβ€”making it easy to incorporate any NLP application into the stack. +All NLP/ML/DATA services are expected to implement a uniform RESTful API, allowing seamless integration into existing pipelinesβ€”making it easy to incorporate any NLP application into the stack. --- @@ -48,13 +48,13 @@ Need help? Feel free to: **Prerequisites**: - Docker (mandatory) -- Basic knowledge of Python and Linux/UNIX systems +- Basic knowledge of Python and Linux/UNIX systems (Bash (simple commands only, we promise)) πŸ“– Official documentation: [cogstack-nifi.readthedocs.io](https://cogstack-nifi.readthedocs.io/en/latest/) πŸš€ New to the project? Start with the [deployment guide](https://cogstack-nifi.readthedocs.io/en/latest/deploy/main.html) for example setups and workflows. -🐞 For troubleshooting or bug reports, consult the [Known Issues section](https://cogstack-nifi.readthedocs.io/en/latest/deploy/main.html) before opening a ticket. 
+🐞 For troubleshooting or bug reports, consult the [known issues section](https://cogstack-nifi.readthedocs.io/en/latest/deploy/troubleshooting.html) before opening a ticket. --- diff --git a/deploy/Makefile b/deploy/Makefile index 53f9812f9..aff95d4af 100644 --- a/deploy/Makefile +++ b/deploy/Makefile @@ -73,7 +73,7 @@ start-medcat-service-deid: $(WITH_ENV) docker compose -f ../services/cogstack-nlp/medcat-service/docker/docker-compose.yml $(DC_START_CMD) nlp-medcat-service-production-deid start-medcat-trainer: - $(WITH_ENV) docker compose -f../services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml $(DC_START_CMD) medcattrainer nginx solr + $(WITH_ENV) docker compose -f ../services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml $(DC_START_CMD) medcattrainer nginx solr start-production-db: $(WITH_ENV) docker compose -f services.yml ${DC_START_CMD} cogstack-databank-db @@ -136,7 +136,7 @@ stop-jupyter: $(WITH_ENV) docker compose -f ../services/cogstack-jupyter-hub/docker/docker-compose.yml $(DC_STOP_CMD) cogstack-jupyter-hub stop-medcat-trainer: - $(WITH_ENV) docker compose -f../services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml $(DC_STOP_CMD) medcattrainer nginx solr + $(WITH_ENV) docker compose -f ../services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml $(DC_STOP_CMD) medcattrainer nginx solr stop-medcat-service: $(WITH_ENV) docker compose -f ../services/cogstack-nlp/medcat-service/docker/docker-compose.yml $(DC_STOP_CMD) nlp-medcat-service-production diff --git a/deploy/services.yml b/deploy/services.yml index c12f319fb..362308a57 100644 --- a/deploy/services.yml +++ b/deploy/services.yml @@ -215,7 +215,7 @@ services: - databank-vol:/var/lib/postgresql/data command: postgres -c "max_connections=${POSTGRES_DB_MAX_CONNECTIONS:-100}" ports: - - 5556:5432 + - 5558:5432 expose: - 5432 networks: diff --git a/docs/deploy/services.md b/docs/deploy/services.md index e3f688607..4568ccf0d 100644 --- a/docs/deploy/services.md +++ 
b/docs/deploy/services.md @@ -1,642 +1,622 @@ -# Available Services -This file covers the available services in the example deployment. +# 📦 Services -Apache NiFi-related files are provided in `../nifi` directory. - -Please note that all the services are deployed using [Docker](https://docker.io) engine and it needs to be present in the system. -Please see [example deployment](main.md) for more details on the used services and their configuration. +This section provides a complete overview of all services included in the CogStack-NiFi deployment. +All services run in Docker and interact within a shared internal Docker network. -## Overview +--- -The below image sums up how CogStack services work with eachother in an environment where all available components are used. +## 📊 Overview + +Below is a high-level architecture diagram illustrating how CogStack services communicate when all components are enabled: ![nifi-services](../_static/img/nifi_services.png) -## Primary services -All the services are defined in `services.yml` file and these are: -- `samples-db` - a PostgreSQL database with sample data to play with, -- `cogstack-databank-db` - production PostgreSQL database, has it's own scripts in `/services/cogstack-db/pgsql` -- `cogstack-databank-db-mssql` - production MSSQL database, has it's own scripts in `/services/cogstack-db/mssql`, this is just an alternative, needs a license. -- `nifi` - a single instance of Apache NiFi processor (with Zookeper embedded) with exposing a web user interface, -- `nifi-nginx` - used for reverse proxy to enable secure access to NiFi and other services. -- `tika-service` - the [Apache Tika](https://tika.apache.org/) running as a web service (see: [Tika Service repository](https://github.com/CogStack/tika-service/)). -- `ocr-service-1/ocr-service-2` - the new OCR text extraction tool that is a replacement of `tika-service`. 
-- `nlp-gate-drugapp` - an example drug names extraction NLP application using [GATE NLP Service runner exposing a REST API](https://github.com/CogStack/gate-nlp-service), -- `nlp-medcat-service-production` - [MedCAT](https://github.com/CogStack/MedCAT) NLP application running as a [web Service](https://github.com/CogStack/MedCATservice) and using an example model trained on [Med-Mentions](https://github.com/chanzuckerberg/MedMentions) corpus, -- `medcat-trainer-ui` - [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) web application used for training and refining MedCAT NLP models, -- `medcat-trainer-nginx` - a [NGINX](https://www.nginx.com/) reverse-proxy for MedCAT Trainer, -- `elasticsearch-1/elasticsearch-2` - a two-node cluster of Elasticsearch based on [OpenSearch for Elasticsearch](https://opensearch.org/) distribution, -- `metricbeat` - Elasticsearch Native only cluster monitoring service -- `filebeat` - log ingestion service for ElasticSearch Native -- `kibana` - Kibana user-interface based on [OpenSearch for Elasticsearch](https://opensearch.org/docs/latest/dashboards/index/) distribution, -- `jupyter-hub` - a single instance of [Jupyter Hub](https://jupyter.org/hub) for serving Jupyter Notebooks for interacting with the data. -- `git-ea` - Github-like web service, you can host your own repositories here if your organisation is strict security-wise - -**IMPORTANT** -Please note that some of the necessary configuration parameters, variables and paths are also defined in the [`services.yml`](https://github.com/CogStack/CogStack-NiFi/tree/main/deploy/services.yml) file. - -## Optional NLP services -In addition, there are defined such NLP services: -- `nlp-medcat-service-production` serving SNOMED CT model, -- `nlp-gate-bioyodie` - same as `nlp-gate-drugapp` but serving [Bio-YODIE](https://github.com/GateNLP/Bio-YODIE) NLP application. - -These services are optional and won't be started by default. 
-They were left in the `services.yml` file for informative purposes if one would be interested in deploying these having access to necessary resources. - -## Security -**Important** -Please note that for the demonstration purposes, the services are run with default built-in usernames / passwords. -Moreover, SSL encryption is also disabled or not set up in the configuration files. -For more information please see the [security](../security.md) - -## Deployment -The example deployment recipes are defined in `Makefile` file. -The commands that start services are prefixed with `start-` keyword, similarly the ones to stop are prefixed with `stop`. - -## Data ingestion and storage infrastructure -To deploy the data ingestion and storage infrastructure, type: -``` -make start-data-infra -``` +--- -The command will deploy services: `nifi`, `elasticsearch-1`, `kibana`, `tika-service`, `samples-db`. -Please see below the description of the services with the information on the accessibility. +## 🧩 Primary Services -To stop the services, type: -``` -make stop-data-infra -``` +The core services defined in `services.yml` include: -## Cleanup -To tear down all the containers and the data persisted in mounted volumes, type: -``` -make cleanup +- **samples-db** β€” PostgreSQL database populated with demo datasets. +- **cogstack-databank-db / cogstack-databank-db-mssql** β€” Production-grade PostgreSQL and optional MSSQL instances. +- **elasticsearch-1 / elasticsearch-2 / elasticsearch-3** β€” Multi-node Elasticsearch or OpenSearch cluster. +- **metricbeat / filebeat** β€” Elastic monitoring and log forwarder services. +- **nifi** β€” Apache NiFi single-node instance with embedded ZooKeeper. +- **nifi-nginx** β€” Reverse proxy providing secure access to NiFi. +- **ocr-service / ocr-service-text-only** β€” High-performance Python OCR and text extraction services. +- **nlp-medcat-service-production** β€” MedCAT NLP model service with REST API. 
+- **medcat-trainer-ui / medcat-trainer-nginx** — Web UI and reverse proxy for model training and refinement. + +- **kibana** — OpenSearch Dashboards UI. +- **jupyter-hub** — Fully featured data science interface. +- **git-ea** — Self-hosted Git service (Gitea). + +> **Note:** Important configuration options and environment variables for these services are managed in `services.yml` and the associated `.env` files under `deploy/` and `security/`. + +## 🗂️ Service Definitions + +All core services are defined in: + +```bash +deploy/services.yml ``` -## Services & definition description -All the essential details on the services configuration are defined in `services.yml` file. +They run inside the internal Docker network `cognet`. +Some services expose ports to the host for convenience. -Please note that all the services are running within a private `cognet` Docker network hence the endpoints are all accessible within the deployed services. -However, for the ease of use, some of the services have their ports bound from container to the host machine. +--- +## 🗣️ NLP/OCR and other services API Endpoints -## NLP services +Most web ETL & data-enrichment API services that we use will offer the following endpoints for querying. -**Important** -
-Please note that `nlp-medcat-service-production` and `nlp-gate-bioyodie` NLP services use license-restricted resources and these need to be provided by the user prior running these services. -Bio-YODIE requires [UMLS](https://www.nlm.nih.gov/research/umls/index.html) resources that need to be provided in the `RES_BIOYODIE_UMLS_PATH` directory. -MedCAT SNOMED CT model requires a prepared model based on [SNOMED CT](http://www.snomed.org/) dictionary with the model available in `RES_MEDCAT_SERVICE_MODEL_PRODUCTION_PATH` directory. -These paths can be defined in `.env` file in the deployment directory. +- **GET** `/api/info` +- **POST** `/api/process` +- **POST** `/api/process_bulk` +Useful for NiFi workflows (see `workflows.md`). -### MedCAT -[MedCAT](https://github.com/CogStack/MedCAT) is a named entity recognition and linking application for concept annotation from UMLS or any other source. -MedCAT is deployed as a service exposing RESTful API using the implementation from [MedCATservice](https://github.com/CogStack/MedCATservice). +--- -### MedCAT Service +## 🧬 MedCAT Service -MedCAT Service resources are stored in [`./services/nlp-services/applications/medcat/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat) directory. -The key configuration properties stored as environment variables are defined in [`./services/nlp-services/applications/medcat/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat/config) sub-directory. -The models used by MedCAT are stored in [`./servies/nlp-services/applications/cat/models/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat/models). 
-A default model to play with is provided, called `MedMen` and there is a script `./services/nlp-services/applications/medcat/models/download_medmen.sh` to download it, please make sure you are in the `./services/nlp-services/applications/medcat/models/` before executing the download script. +Runs a REST API for model inference using the [MedCAT library](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-v2), which performs clinical concept extraction and linking. -For more information on the MedCAT Service configuration and use please refer to [the official documentation](https://github.com/CogStack/MedCATservice). +The service has two operation modes: -**Important** -For the example deployment we provide a simple and publicly available MedCAT model. -However, custom and more advanced MedCAT models can be used based on license-restricted terminology dictionaries such as [UMLS](https://www.nlm.nih.gov/research/umls/index.html) or [SNOMED CT](http://www.snomed.org/). -Which model is being used by the deployed MedCAT Service is defined both in the MedCAT Service config file and the deployment configuration file (see: [deploy](main.md)). +- concept detection: extracts medical concepts and outputs the original text plus a list of annotations. +- de-id mode (also known as AnonCAT mode), for de-identifying documents: outputs the de-identified text (a future version will also output annotations describing what was de-identified). +### Access -To deploy MedCAT application stack, type: -``` -make start-nlp-medcat -``` -The command will deploy MedCAT NLP service ` nlp-medcat-service-production` with related MedCAT Trainer services `medcat-trainer-ui`, `medcat-trainer-nginx`. -Please see below the description of the deployed NLP services. 
+- `https://localhost:5555/api/info` - NER container, check if model loads successfully +- `https://localhost:5556/api/info` - DE-ID/AnonCAT container -To stop the services, type: -``` -make stop-nlp-medcat -``` +### Containers -#### ENV/CONF files: -- `/service/nlp-services/applications/medcat/config/env_app` - settings specifically related to the medcat service app, such as model(pack) file location(s) -- `/service/nlp-services/applications/medcat/config/env_medcat` - medcat specific settings +- `cogstack-medcat-service-production` - for concept NER +- `cogstack-medcat-service-production-deid` - for DE-ID/AnonCAT -## Jupyter Hub -To deploy Jupyter Hub, type: -``` -make start-jupyter -``` -Please see below the description of the Jupyter Hub. +### Service location & files -To stop the services, type: -``` -make stop-jupyter -``` -### ENV/CONF files: -- `/deploy/jupyter.env` +- dir: `/services/cogstack-nlp/medcat-service/` +- docker compose file: `/services/cogstack-nlp/medcat-service/docker/docker-compose.yml` +- env: located in `services/cogstack-nlp/medcat-service/env/` -## Database Stack + ```bash + app.env - controls APP settings (number of cpus used, log level, etc) used by the NER container cogstack-medcat-service-production + medcat.env - used by the NER container, controls MedCAT settings directly. + app_deid.env - used by the DE-ID container, same app setting control, the main difference being the `APP_DEID_MODE`. + medcat_deid.env - used by the DE-ID container, controls MedCAT settings directly + ``` -The samples DB uses PgSQL, but we also provide an MSSQL instance (no data on it however), that can be used in prod environments.Please see [the workflows section](workflows.md#configuring-db-connector) about how to configure the difference controllers and DB drivers. 
+### Ports +| Service | External Port | Internal Port | +|--------------------|---------------|----------------| +| NER (MedCAT) | `5555` | `5000` | +| DE-ID / AnonCAT | `5556` | `5000` | -### Samples DB -`samples-db` provides a [PostgreSQL](https://www.postgresql.org/) database that contains sample data to play with. -During start-up the data is loaded from a previously generated DB dump. +### Models -All the necessary resources, data and scripts are stored in `pgsamples/` directory. -During the service initialization, the script `init_db.sh` will populate the database with sample data read from a database dump stored in `db_dump` directory. -The directory [`./services/pgsamples/scripts`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/pgsamples/scripts) contains SQL schemas with scripts that will generate the database dump using sample data. +- A default MedMentions `MedMen` NER+L model (includes MetaCAT models) is available for public use but needs to be downloaded. +- To download a model head to the directory of the service `services/cogstack-nlp/medcat-service/scripts` +- Execute: `bash download_medmen.sh`, wait for download to complete. -When deployed the PostgreSQL database is exposed at port `5432` of the `samples-db` container. -The port is also bound from container to the host machine `5555` port. -The example data is stored in `db_samples` database. -Use user `test` with password `test` to connect to it. +### README -For an example deployment, a PostgreSQL database that contains some example data to play with was generated [synthetic records](https://github.com/synthetichealth/synthea) enrinched with free-text from [MTSamples](https://www.mtsamples.com/). -The free-text sample data is based on [MT Samples](https://www.mtsamples.com/) dataset with the structured fields generated by [Synthea](https://github.com/synthetichealth/synthea). 
+Please check the service's own [README.md](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-service) -The tables available in the database are: -- `patients` - structured patient information, -- `encounters` - structured encounters information, -- `observations` - structured observations information, -- `medical_reports_raw` - free-text documents in raw format (PDFs) `(*)`, -- `medical_reports_text` - free-text documents in clean, text format `(*)`, -- `medical_reports_processed` - for storing processed documents, empty `(*)`, -- `annotations_medcat` - for storing extracted MedCAT annotations, empty. +--- -The tables used in the deployment example are marked with `(*)`. +## 🛠️ MedCAT Trainer +Provides UI workflows for annotation, correction, and iterative model training. -#### ENV/CONF files: -- `/deploy/database.env` - currently only basic stuff like DB users/passwords are included +### Access -### Cogstack-db -This is a general database provided for production, it does not have any data in it beyond the defined cogstack_schema (this is not yet present) and annotation_schema. -Provided for both PGSQL and MSSQL. 
+- `https://localhost:8001` -In the future the `${DB_PROVIDER}` will be an environment variable that will take into account the db-provider you can select, possible values [`mssql`,`pgsql`] +### Containers -By default all the `.sql` files beginning with `annotations*` and `cogstack*` prefix in the `services/cogstack-db/${DB_PROVIDER}/schemas/` will be loaded.This is defined in the `services/cogstack-db/${DB_PROVIDER}/init_db.sh`.There should not be a need to change them as users can simply name their schemas accordingly.Place the desired `sql` files in the `schemas` folder and it will be picked up.To debug any issues with the container or with the SQL scripts please run the startup commands separately `docker-compose -f services.yml up cogstack-databank-db` or `docker-compose -f services.yml cogstack-databank-db-mssql` while in the `deploy/` folder. +- `medcattrainer` +- `medcattrainer_nginx` +- `mct_solr` -MSSQL note -The MSSQL container will require license activation for production as per [Microsoft's guideline](https://hub.docker.com/_/microsoft-mssql-server), setting the `MSSQL_PID` env variable to the correct license PID key should activate the product. +### Service location & files +- dir: `services/cogstack-nlp/medcat-trainer/` +- docker compose file: `services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml` +- env: `services/cogstack-nlp/medcat-trainer/envs/env-prod` -### ENV/CONF files: -- `/deploy/database.env` - currently only basic stuff like DB users/passwords are included +### Ports -## Apache NiFi -`nifi` serves a single-node instance of Apache NiFi that includes the data processing engine with user interface for defining data flows and monitoring. -Since this is a single-node NiFi instance, it also contains the default, embedded [Apache Zookeper](https://zookeeper.apache.org/) instance for managing state. +- external: `8001` -`nifi` container exposes port `8443` which is also bound to the host machine on port 8082. -
+### README -`nifi-nginx` contianer exposes the 8443 port directly, reverser-proxying the connection to nifi. -The Apache NiFi user interface can be hence accessed by navigating on the host (e.g.`localhost`) machine at `http://localhost:8443`. +Please check the service's own [README.md](https://github.com/CogStack/cogstack-nlp/blob/main/medcat-trainer/README.md) file and [docs](https://docs.cogstack.org/projects/medcat-trainer/en/latest/). -In this deployment example, we use a custom build Apache NiFi image with example user scripts and workflow templates. -For more information on configuration, user scripts and user templates that are embeded with the custom Apache NiFi image please refer to the [nifi](../nifi/main.md). -The available example workflows are covered in [workflows](./workflows.md) -Alternatively, please refer to [the official Apache NiFi documentation](https://nifi.apache.org/) for more details on actual use of Apache NiFi. +--- -### ENV/CONF files: -- `/deploy/nifi.env` - most notable settings are related to port mapping and proxy -- `/security/certificates_nifi.env` - define NiFi certificate settings here -- `/security/nifi_users.env` - defines the NiFi user credentials for single user auth & others -More configuration options are covered in [nifi-doc](../nifi/main.md). +## 📚 Jupyter Hub -Other `.env` files are mounted but those are only useful for custom scripts where you plan to use certain vars from other services, check the `services.yml` nifi `env-file` section definition. +A multi-user JupyterHub instance deployed via Docker. -## OCR Service +### Access -The new `ocr-service` provides a new way to OCR documents at good speed, the equivalent in Tika-service but revwritten in Python and optimized. 
+- `https://localhost:8888` -`ocr-service-1` - this container is used for OCR -`ocr-service-2` - this container is used for NON-OCR, meaning documents will simply have their text extracted if they contain text without images +### Containers -### ENV/CONF files: -- `/deploy/ocr_service.env` - for `ocr-service-1` -- `/deploy/ocr_service_text_only.env` - for `ocr-service-2`, NON-OCR instance +- `cogstack-jupyter-hub` +- `cogstack-jupyter-singleuser-` (per user container started by each user once hub is up) -**IMPORTANT** -All settings are decribed [here](https://github.com/CogStack/ocr-service/blob/master/README.md). +### Service location & files +- dir: `services/cogstack-jupyter-hub/` +- docker compose file: `services/cogstack-jupyter-hub/docker/` +- env: `services/cogstack-jupyter-hub/env/jupyter.env` -## NLP Services +### Supports -In the example deployment we use NLP applications running as a service exposing REST API. -The current version of API specs is specified in [`./services/nlp-services/api-specs/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/api-specs) directory (both [Swagger](https://swagger.io/) and [OpenAPI](https://www.openapis.org/) specs). -The applications are stored in [`./services/nlp-services/applications`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications). +- Per-user containers +- CPU/RAM limits (via `services/cogstack-jupyter-hub/env/jupyter.env`) +- Optional GPU support +- Notebook image selection -### NLP API -All the NLP services implement a RESTful API that is defined in [OpenAPI specification](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/api-specs/openapi.yaml). 
+### Ports -The available endpoints are: -- **GET** `/api/info` - for displaying general information about the used NLP application, -- **POST** `/api/process` - for processing text documents (single document mode), -- **POST** `/api/process_bulk` - for processing multiple text documents (bulk mode). +| Component | External Port | Internal Port(s) | +|-------------|---------------|------------------| +| JupyterHub | `8888` | `8087`, `443` | -When plugging-in the NLP services into Apache NiFi workflows, the endpoint for processing single or multiple documents will be used to extract the annotations from documents. -Please see example Apache NiFi [workflows](./workflows.md) and [user scripts](https://github.com/CogStack/Cogstack-Nifi/nifi/user-scripts) on using and parsing the payloads with NiFi. +### README -For further details on the used API please refer to the [OpenAPI specification](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/api-specs/openapi.yaml) for the definition of the request and response payload. +Please check the service's own [README.md](https://github.com/CogStack/cogstack-jupyter-hub/blob/main/README.md) file. -### MedCAT NLP -[MedCAT](https://github.com/CogStack/MedCAT) is a named entity recognition and linking application for concept annotation from UMLS or any other source. -MedCAT deployment consists of [MedCAT NLP Service](https://github.com/CogStack/MedCATservice) serving NLP models via RESTful API and [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) for collecting annotations and refinement of the NLP models. +--- -### MedCAT Service -` nlp-medcat-service-production` serves a basic UMLS model trained on MedMentions dataset via RESTful API. -The served model data is available in [`./services/nlp-services/applications/medcat/models/medmen/`](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/applications/medcat/models/medmen`) directory. 
+## 🧪 Samples DB (PostgreSQL) -When deployed ` nlp-medcat-service-production` exposes port `5000` on the container and binds it to port `5000` on the host machine. -For example, to access the API endpoint to process a document by a service from `cognet` Docker network, the endpoint address would be `http:// nlp-medcat-service-production:5000/api/process`. +Demo dataset with: -As a side note, when deployed `nlp-medcat-service-production` (assuming that the MedCAT SNOMED CT model is available and set via `RES_MEDCAT_SERVICE_MODEL_PRODUCTION_PATH` variable), the service will only expose port `5000` on container. -Although the service won't be accessible from the host machine, but all the services inside the `cognet` network will be able to access it. +- patients +- encounters +- observations +- raw medical reports +- cleaned reports +- annotation tables -For more information on the MedCAT NLP Service configuration and use please refer to [the official documentation](https://github.com/CogStack/MedCATservice). +### Access +- `localhost:5555` -#### ENV/CONF files: -- `/service/nlp-services/applications/medcat/config/env_app` - settings specifically related to the medcat service app, such as model(pack) file location(s) -- `/service/nlp-services/applications/medcat/config/env_medcat` - medcat specific settings +### Ports +- external: `5432` +- internal: `5432` -### MedCAT Trainer -Apart from MedCAT Service, there is provided [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) that serves a web application used for training and refining MedCAT NLP models. -Such trained models can be later saved as files and loaded into MedCAT Service. -Alternatively, the models can be loaded into custom application. +### Credentials -`medcat-trainer-ui` serves the MedCAT Trainer web application used for training and refining MedCAT NLP models. -Such trained models can be later saved as files and loaded into MedCAT Service. 
-Alternatively, the models can be loaded into custom application. +- user - `test`, password - `test` -As a companion service, `medcat-trainer-nginx` serves as a NGINX reverse-proxy for providing content from MedCAT Trainer web service. +--- -When deployed, `medcat-trainer-ui` exposes port `8000` on the container. -`medcat-trainer-nginx` exposes port `8000` on the container and binds it to port `8001` on the host machine - it proxies all the requests to the MedCAT Trainer web service. -The NGINX configuration is stored in [`./services/medcat-trainer/nginx`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/medcat-trainer/nginx) directory. +## 🏦 Cogstack databank production DB (Production only: PgSQL, MSSQL) -To access the MedCAT Trainer user interface and admin panel, one can use the default built-in credentials: user `admin` with password `admin`. +Empty database for production ingestion pipelines. +Supports both PostgreSQL and MSSQL. -For more information on the MedCAT Trainer configuration and use please refer to [the official documentation](https://github.com/CogStack/MedCATtrainer). +Place schema files inside and they will be loaded instantly on container startup: -MedCAT Trainer resources are stored in [`./services/medcat-trainer`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services//medcat-trainer) directory. -The key configuration is stored in [`./services/medcat-trainer/env`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/medcat-trainer/envs/env) file. +```bash +services/cogstack-db//schemas/ +``` +Where `` can be: `mssql`,`pgsql`. +### Credentials -## ELK stack +- PgSQL: user - `admin` password - `admin` +- MsSQL: user - `admin` password - `admin!COGSTACK2022` -There are two types of Elasticsearch versions available, apart from the native one there is a also OpenSearch, which is a fork of the original but developed & maintained by Amazon as an opensource alternative. 
+### Access -The example deployment uses [ELK stack](https://www.elastic.co/what-is/elk-stack) from [OpenSearch for Elasticsearch](https://opensearch.org/) distribution. -OpenSearch for Elasticsearch is a fully open-source, free and community-driven fork of Elasticseach. -It implements many of the commercial X-Pack components functionality, such as advanced security module, alerting module or SQL support. -Nonetheless, the standard core functionality and APIs of the official Elasticsearch and OpenSearch remain the same. -Hence, OpenSearch can be used as a drop-in replacement for the standard ELK stack. +- PgSQL: `localhost:5558` → container `5432` +- MSSQL: `localhost:1443` → container `1433` -The names of the services within the NiFi project are the same even though they have different names, we will refer to original Elasticsearch as ES native in the documentation. +### Containers -Services names Elasticsearch | OpenSearch : +- PgSQL: `cogstack-databank-db` +- MSSQL: `cogstack-databank-db-mssql` - - Elasticsearch <-> OpenSearch - - Kibana <-> OpenSearch Dashboards +### Service location & files -In essence the configuration is very similar, however, there are a few differences: +- docker compose file: `services.yml` +- dir: `services/cogstack-db/` +- env: + - `security/users/users_database.env` - controls DB user credentials + - `deploy/database.env` - general DB configs -| | Elasticsearch Native | OpenSearch | -| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------- | -| Subscription | paid licensing, will require [subscription](https://www.elastic.co/subscriptions), 30-day free trial available | Free | -| Plugins | Xpack (native), analysis-icu & elastiknn (3rd party), for more check this [link](https://www.elastic.co/guide/en/elasticsearch/plugins/8.9/index.html). 
| Xpack | -| Security | AD/LDAP/AWS/OpenID/Native auth | AD/LDAP/AWS/OpenID auth | +### Ports +| Database | External Port | Internal Port | +|----------|---------------|---------------| +| PgSQL | `5558` | `5432` | +| MSSQL | `1433` | `1433` | +--- +## 💧 Apache NiFi & NiFi Registry -**Important** -Please note that for the demonstration purposes SSL encryption has been disabled in Elasticsearch and Kibana. -For enabling it and generating self-signed certificates please refer directly to the `services.yml` file and [security.md](../security.md) in `docs` directory. -The security aspects are covered expensively in [the official OpenSearch for Elasticsearch documentation](https://opensearch.org/). +Primary ETL/processing engine. +This service is complex and is completely described in [this section](../nifi/main.md). -### Elasticsearch / Opensearch -Elasticsearch cluster is deployed as a single-node cluster with `elasticsearch-1` service. -It exposes port `9200` on the container and binds it to the same port on the host machine. -The service endpoint should be available to all the services running inside the `cognet` Docker network under address `http://elasticsearch-1:9200`. -The default user is : `admin` and password `admin`. -In the example deployment, the default, built-in configuration file is used with selected configuration options being overridden in `services.yml` file. -However, for manual tailoring the available configuration parameters are available in the `elasticsearch.yml` [configuration file](https://github.com/CogStack/CogStack-Nifi/services/elasticsearch/config/elasticsearch.yml). +### Credentials -For more information on use of Elasticsearch please refer either to [the official Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html) or [the official OpenSearch for Elasticsearch documentation](https://opensearch.org/). 
+- PgSQL: user - `admin` password - `cogstackNiFi` +### Access -### Kibana / Opensearch-Dashboard -`kibana` service implements the Kibana user interface for interacting with the data stored in Elasticsearch cluster. -It exposes port `5601` on the container and binds it to the same port on the host machine. -To access Kibana user interface from web browser on the host (e.g.`localhost`) machine one can use URL: `https://localhost:5601`. -The default user is : `admin` and password `admin`. -In the example deployment, the default, built-in configuration file is used with selected configuration options being overridden in `services.yml` file. -However, for manual tailoring the available configuration parameters are available in `kibana.yml` [configuration file](https://github.com/CogStack/CogStack-Nifi/services/kibana/config/kibana.yml). +`https://localhost:8443` (via nifi-nginx) -For more information on use of Kibana please refer either to [the official Kibana documentation](https://www.elastic.co/guide/en/kibana/current/index.html) or [the official OpenSearch for Elasticsearch documentation](https://opensearch.org/docs/latest/dashboards/index/). +### Containers +- NiFi: `cogstack-nifi` +- NiFi-Registry-flow: `cogstack-nifi-registry-flow` -#### ENV/CONF files: -- `/deploy/elasticsearch.env` - general settings for boith Kibana and ES , OpenSearch and OpenSearch-Dashboards -- `/security/certificates_elasticsearch.env` - you can control the settings for the SSL certificates here -- `/security/elasticsearch_users.env` - define system user credentials here +### Service location & files -You should not really need to ever modify these files, only the `.env` files should be modified. 
-- `/services/elasticsearch/config/elasticsearch.yml` - Elasticsearch -- `/services/kibana/config/elasticsearch.yml` - Elasticsearch Kibana -- `/services/elasticsearch/config/opensearch.yml` - Opensearch -- `/services/kibana/config/opensearch.yml` - Opensearch-Dashboards +- docker compose file: `services.yml` +- dir: `nifi/` +- env: + - `/deploy/nifi.env` - general NiFi & NiFi Registry flow settings, JVM memory, etc. + - `/security/nifi_users.env` - controls NiFi user credentials + - `/security/certificates_nifi.env` +### Ports -The used configuration files for ElasticSearch and Kibana are provided in [`./services/elasticsearch/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/elasticsearch/config) and [`./services/kibana/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/kibana/config) directories respectively for [`OpenSearch`](https://opensearch.org/docs/latest/install-and-configure/configuration/) and [`OpenSearch Dashboard`](https://opensearch.org/docs/latest/dashboards/index/). +| Component | External Port | Internal Port | +|---------------------|---------------|----------------| +| NiFi | `8443` | `8082`, `10000` | +| NiFi Registry Flow | `18443` | `8083` | +--- -### Security +## 🔎 ELK Stack (Elasticsearch / OpenSearch) -Please note that both ElasticSearch and Kibana use security module to manage user access permissions and roles. -However, for production use, proper users and roles need to be set up otherwise the default built-in ones will be used and with default passwords. +Backend search and indexing engine powering document storage, query, analytics, and NLP output retrieval. -In the example deployment, the default built-in user credentials are used, such as: - - OpenSearch user: `admin` with pass `admin`. - - ElasticSearch user: `elastic` with pass `kibanaserver` +This service is fully described in the Elasticsearch section of the documentation. 
-For more details on setting up the security certificates, users, roles and more in this example deployment please refer to [`security`](../security.md). +The repo supports both: -### Indexing & Ingesting data +- ElasticSearch (native) +- OpenSearch (Amazon fork) -Also note that in some scenarios a manual creation of index mapping may be a good idea prior to starting ingestion. Please look at Elasticsearch [mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html) and OpenSearch [mapping](https://opensearch.org/docs/2.4/opensearch/mappings/) docs on how to create the mapping before ingesting. - IMPORTANT: not creating the mapping of an index will result in ElasticSearch/OpenSearch automatically map all field datatypes as string, making fields such as date/timestamps not incredibly ! +Switch between modes via environment variables in `deploy/elasticsearch.env`. +### 🛢️ Elasticsearch / OpenSearch -A script `es_index_initializer.py` has been provided in [`./services/elasticsearch/scripts/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/elasticsearch/scripts) directory to help with that. +#### Credentials -### Installing and maintaining Elasticsearch/Opensearch +- OpenSearch: user - `admin`, password - `admin` +- ElasticSearch: user - `elastic`, password - `kibanaserver` -Please follow the instructions carefully and adapt where necessary. 
+#### Access -#### Switching between OpenSearch and ElasticSearch +- `http://localhost:9200` - Node 1 +- `http://localhost:9201` - Node 2 +- `http://localhost:9202` - Node 3 -You can switch by simple modifying the following variables: +#### Containers -- `ELASTICSEARCH_VERSION` - set to `elasticsearch` or `opensearch` -- `ELASTICSEARCH_DOCKER_IMAGE` - check the possible values in the `elasticsearch.env` file -- `ELASTICSEARCH_KIBANA_DOCKER_IMAGE` - check the possible values in the `elasticsearch.env` file -- `KIBANA_VERSION` - set to either `kibana` or `opensearch-dashboards` (note that opensearch-dashboards does not have an underscore in the name..) -- `KIBANA_CONFIG_FILE_VERSION` - set to either `kibana` or `opensearch_dashboards` +- `elasticsearch-1` +- `elasticsearch-2` +- `elasticsearch-3` -There are no metricbeat & filebeat equivalents provided for OpenSearch at the moment as part of this repo. +#### Ports -#### Setting up a fresh cluster with 3 nodes +- all ports need to be opened in the firewall to allow inter-node cluster communication; when nodes share a machine/VM each node uses a different port, while in production, if the nodes live on separate VMs/machines, every machine can use the same ports: `9200`, `9300`, `9600` +- internal: `9300`, `9301`, `9302`, `9600`, `9601`, `9602`, `9200`, `9201`, `9202` +- external: `9300`, `9301`, `9302`, `9600`, `9601`, `9602`, `9200`, `9201`, `9202` -Assuming you will respect the proper guidelines, you would need 3 machines to set things up. If not, then you can still set them up on one machine. 
+| Node | HTTP | Transport | Analyzer | +|------|------|-----------|----------| +| ES1 | `${ELASTICSEARCH_NODE_1_OUTPUT_PORT:-9200}` | `${ELASTICSEARCH_NODE_1_COMM_OUTPUT_PORT:-9300}` | `${ELASTICSEARCH_NODE_1_ANALYZER_OUTPUT_PORT:-9600}` | +| ES2 | `${ELASTICSEARCH_NODE_2_OUTPUT_PORT:-9201}` | `${ELASTICSEARCH_NODE_2_COMM_OUTPUT_PORT:-9301}` | `${ELASTICSEARCH_NODE_2_ANALYZER_OUTPUT_PORT:-9601}` | +| ES3 | `${ELASTICSEARCH_NODE_3_OUTPUT_PORT:-9202}` | `${ELASTICSEARCH_NODE_3_COMM_OUTPUT_PORT:-9302}` | `${ELASTICSEARCH_NODE_3_ANALYZER_OUTPUT_PORT:-9602}` | -Steps: -- go into the `/deploy/` folder, edit `elasticsearch.env` -- once you get the machine's IP addresses, modify the following variable on each machine `ELASTICSEARCH_NETWORK_HOST`, with the IP of each instance -- next, the env file will have a var for each server for settings such as: - - `node name`: ELASTICSEARCH_NODE_1_NAME - - `output port`: ELASTICSEARCH_NODE_1_OUTPUT_PORT - - `docker volume names`: ELASTICSEARCH_NODE_1_DATA_VOL_NAME. -- on all three servers this variable should be the same: `ELASTICSEARCH_SEED_HOSTS`, it should be set to all 3 ip addresess or machine names, respect the format as it is given in the file `ELASTICSEARCH_SEED_HOSTS=localhost,elasticsearch-2,elasticsearch-1,elasticsearch-3` for example, localhost must always be present -- change the cluster name if needed, by setting `ELASTICSEARCH_CLUSTER_NAME`. -- the intial cluster manager must be set via `ELASTICSEARCH_INITIAL_CLUSTER_MANAGER_NODES`, normally this can be either of the servers -- a setting you may change here IF needed is the `ELASTICSEARCH_NODE_1_NAME`, for each server, e.g: ELASTICSEARCH_NODE_1_NAME="test1", ELASTICSEARCH_NODE_2_NAME="test2", ELASTICSEARCH_NODE_3_NAME="test3". 
-- extra step for Kibana and Metricbeat we will need to add all three URLs to the nodes via the `ELASTICSEARCH_HOSTS` variable, e.g: ELASTICSEARCH_HOSTS='["https://elasticsearch-1:9200","https://elasticsearch-2:9200","https://elasticsearch-3:9200"]', please respect the quotes as shown in the file otherwise there can be parsing errors. -- update your license, set `ELASTICSEARCH_LICENSE_TYPE` from `trial` to `basic` if you are on ElasticSearch native and if you have a bought license! -- after you are finished please read [post-setup-todos](#post-setup-to-dos) +#### Service Location & files -#### Resource management +- docker compose: `deploy/services.yml` +- config: `services/elasticsearch/config/` +- env: + - `/deploy/elasticsearch.env` + - `/security/certificates_elasticsearch.env` + - `/security/elasticsearch_users.env` -You may want to also change the allocated number of CPUs to one instance/node, to do this, change the following variables: - - `ELASTICSEARCH_NODE_PROCESSORS`, default is 2 cores, max it out if you have a node dedicated for ES only. - - `ELASTICSEARCH_JAVA_OPTS`, default is to `-Xms2048m -Xmx2048m` only, the max allowed memory for HEAP is 32GB, read [this article](https://www.elastic.co/blog/managing-and-troubleshooting-elasticsearch-memory). 
+#### SSL & Certificates -##### Other settings -- OPTIONAL: you can change the location of the backup mounted volumes in the container if needed by setting the `ELASTICSEARCH_BACKUPS_PATH_REPO` var, please check the syntax so it matches the format of the provided string sample:["/mnt/es_data_backups","/mnt/es_config_backups"] -- OPTIONAL: you will need to setup the LDAP connection, if you are using LDAP, modify `ELASTICSEARCH_AD_URL`, `ELASTICSEARCH_AD_DOMAIN_NAME` and `ELASTICSEARCH_AD_TIMEOUT` (for timeout controls) also `ELASTICSEARCH_AD_UNMAPPED_GROUPS_AS_ROLES` for automatic LDAP group to role mapping (check [this](https://www.elastic.co/guide/en/enterprise-search/8.9/ldap-auth.html) for more info) -- OPTIONAL: additionally, you may want to have an email for your watcher jobs, this can be set via the `ELASTICSEARCH_EMAIL_ACCOUNT_PROFILE` variable and `ELASTICSEARCH_EMAIL_ACCOUNT_EMAIL_DEFAULTS`, the SMTP server must be set for this to work, so set `ELASTICSEARCH_EMAIL_SMTP_HOST` and `ELASTICSEARCH_EMAIL_SMTP_PORT` accordingly, look at the sample settings in the env file for guidance. +Certificates stored in: -#### Setting up Kibana/OpenSearch Dashboards -- if you wish to change the kibana instance name, change the `KIBANA_SERVER_NAME` var. -- the `ELASTICSEARCH_HOSTS` var must be set so that it contains the URLs of all the nodes in the cluster [check the previous section's last non-optional step](#setting-up-a-fresh-cluster-with-3-nodes) -- set `KIBANA_PUBLIC_BASE_URL` to the url of the server hosting Kibana/OS dashboards +```bash +/security/certificates/elastic// +``` -#### Setting up Metricbeat and Filebeat -- set `KIBANA_HOST` to the host of your Kibana server -- set `FILEBEAT_HOST` to the url of the server each FileBeat is on, it can be just `https://localhost:9200` or `https://0.0.0.0:9200`, if it does not work, then set it to the URL of each docker instance `https://elasticsearch-1` etc. 
-- set `FILEBEAT_USER` and `FILEBEAT_PASSWORD` in `./security/elasticsearch_users.env` if needed. +Settings in: -#### POST-SETUP TO DOs -You have to create accounts for the default users. Please use the provided scripts in the `/security` folder. +- `certificates_elasticsearch.env` -Set users in `elasticsearch_users.env` for either versions. -For ElasticSearch native, use: `create_es_native_credentials.sh`. -For OpenSearch use: `create_opensearch_users.sh`. +### 📊 Metricbeat & Filebeat -If you wish to also setup certificates, check the [security section](../security.md#elk-stack). +Lightweight Elastic stack agents used for **monitoring** and **log forwarding**. +They run alongside Elasticsearch to provide observability of the cluster and ingestion pipelines. +**Purpose:** -### Updating the version of the cluster +- **Metricbeat**: collects system & Elasticsearch metrics (CPU, memory, JVM, node health). +- **Filebeat**: ships container and service logs into Elasticsearch. - IMPORTANT: Make sure to disable any ingestion jobs before doing any of the update steps +Both run as independent containers in the deployment. -#### For ElasticSearch: -- please check [this link](https://www.elastic.co/guide/en/elastic-stack/8.9/upgrading-elastic-stack.html) for specific version guides. -- carefully read [this](https://www.elastic.co/guide/en/elastic-stack/current/upgrading-elasticsearch.html), there are a few steps that need to be completed via the Dev Console in Kibana and/or via `curl` in terminal.
-- take note of which Elastic version you are using and check if there are any extra steps that you might need to do, for example you cant upgrade from v7.1.0 to v8.9.2, you'd need to go v7.1.0->7.9.0 first then v8.1.0 -> v8.9.x, this is a pattern that will likely repeat for future versions -- there may be some additional steps that can be done via Kibana if the documentation says you may need to upgrade your indices to a later version, check [this](https://www.elastic.co/guide/en/elastic-stack/8.9/upgrading-elastic-stack.html#prepare-to-upgrade) as an example, upgrading from 7.x to 8.x requires a REINDEX operation on all indices! -- steps: - - make sure you stop ALL ingestion jobs +#### Containers - - this disables shard allocation:
`curl -u your_username -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d' - { - "persistent": { - "cluster.routing.allocation.enable": "primaries" - } - } - '` - - flush indices: `curl -u your_username -X POST "localhost:9200/_flush/synced?pretty"` - - wait for everything to complete, check to see if the health of all clusters is green and the shards are fine - - shut down all ES services, start with Kibana, Metricbeat, Filebeat and then the Elasticserch cluster : `docker container stop cogstack-kibana cogstack-metricbeat-1 cogstack-metricbeat-2 cogstack-filebeat-1 cogstack-filebeat-2 cogstack-filebeat-3`, `docker container stop elasticsearch-1 elasticsearch-2 elasticsearch-3`, obviously execute these on each - - change the relevant ENV VARS (change these in `deploy/elasticsearch.env`): ELASTICSEARCH_DOCKER_IMAGE="docker.elastic.co/elasticsearch/elasticsearch:8.3.3", ELASTICSEARCH_KIBANA_DOCKER_IMAGE="docker.elastic.co/kibana/kibana:8.3.3", METRICBEAT_IMAGE="docker.elastic.co/beats/metricbeat:8.3.3", FILEBEAT_IMAGE="docker.elastic.co/beats/filebeat:8.3.3" - - NOTE: all docker images must have the same version, e.g 8.3.3, otherwise there may be errors, please check this before starting the services. - - go to the `deploy` folder and start update the source env vars by executing `source export_env_vars.sh`, do a test to see if the new vars are set `echo $ELASTICSEARCH_DOCKER_IMAGE` for example - - start only the elastic instance on the correct cluster (assuming each node is on its own separate machine, as it should normally be), wait for startup to complete - - start the rest of the services and check for the health of each node - - re-enable shard allocation: -
`curl -u your_username -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d' - { - "persistent": { - "cluster.routing.allocation.enable": "all" - } - } - '` - - go to Kibana > System Monitor > Clusters and check the status of all the nodes & shards. +Metricbeat: -#### For OpenSearch: -- please check [this link](https://opensearch.org/docs/2.0/install-and-configure/upgrade-opensearch/index/) -- the follow the steps from the `For Elasticsearch` section above, the only diference is the curl command for disabling the shard allocation: - - `curl -u your_username -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d -{ - "persistent":{ - "cluster.routing.rebalance.enable": "primaries" - } -}` -- shut down kibana & the nodes -- change the relevant ENV vars in `deploy/elasticsearch.env` such as ELASTICSEARCH_KIBANA_DOCKER_IMAGE and ELASTICSEARCH_DOCKER_IMAGE. -- go to the `deploy` folder and start update the source env vars by executing `source export_env_vars.sh`, do a test to see if the new vars are set `echo $ELASTICSEARCH_DOCKER_IMAGE` for example -- all things should be working, re-enable allocation of shards: - - `curl -u your_username -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d -{ - "persistent":{ - "cluster.routing.rebalance.enable": "primaries" - } -}` +- `metricbeat-1` +- `metricbeat-2` +- `metricbeat-3` -## Jupyter Hub +Filebeat: -`jupyter-hub` service provides a single instance of Jupyter Hub to serve Jupyter Notebooks containers to users.In essence, the jupyter-hub container will spawn jupyter-singleuser containers for users, on the fly, as necessary.The settings applied to the jupyter-hub service in `services.yml` won't apply to the singleuser containers, please note that the singleuser containers and jupyter-hub container are entirely independent of one another. 
+- `filebeat-1` +- `filebeat-2` +- `filebeat-3` -It exposes port `8888` by default on the container and binds to the same port on the host machine. -Since `jupyter-hub` is running in the `cognet` Docker network it has access to all services available within it, hence can be used to read data directly from Elasticsearch or query NLP services. +#### **Service Location & Files** -For more information on the use and configuration of Jupyter Hub please refer to [the official Jupyter Hub documentation](https://jupyter.org/hub). +- compose: `deploy/services.yml` +- config: + - `services/metricbeat/metricbeat.yml` + - `services/filebeat/filebeat.yml` +- env: + - `/deploy/elasticsearch.env` + - `/security/elasticsearch_users.env` -The JupyterHub comes with an example Jupyter notebook that is stored in [`./services/jupyter-hub/notebooks`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/jupyter-hub/notebooks) directory. +#### **Ports** -### Access and account control -To access Jupyter Hub on the host machine (e.g.localhost), one can type in the browser `http://localhost:8888`. +No external ports exposed. +All communication occurs internally within the `cogstack-net` Docker network. +#### **Notes** -Creating accounts for other users is possible, just go to the admin page `https://localhost:8888/hub/admin#/`, click on add users and follow the instructions (make sure usernames are lower-cased and DO NOT contain symbols, if usernames contain uppercase they will be converted to lower case in the creation process). +- Elasticsearch must be running before Metricbeat or Filebeat start. +- Only Elastic-native Beats are available; OpenSearch-native Beats do not exist. +- Authentication/credentials come from `elasticsearch_users.env`. -The default password is blank, you can set the password for the admin user the first time you LOG IN, remember it. 
+### 📉 Kibana / OpenSearch Dashboards -Or you can set the password is defined by a local variable `JUPYTERHUB_PASSWORD` in `.env` file that is the password SHA-1 value if the authenticator is set to either LocalAuthenticator or Native read more in [jupyter doc](https://jupyterhub.readthedocs.io/en/stable/api/auth.html?highlight#) about this. +Web UI for exploring indexed data, visualising documents, managing index templates, monitoring the cluster, and debugging ingestion pipelines. -Users must use the "/work/"directory for their work, otherwise files might not get saved! +**Purpose:** -### User singleuser container image selection +- Search & browse Elasticsearch/OpenSearch indices +- Visualise ingestion outputs and cluster metrics +- Manage index patterns, dashboards, and Dev Tools +- Validate mappings and test queries used in NiFi flows -Users can be allowed to select their own image upon starting their container service, this is enabled by default, it can be turned off by setting `DOCKER_SELECT_NOTEBOOK_IMAGE_ALLOWED=false` in the `services.yml` file. +#### Host Access +- URL: **https://localhost:5601** -### GPU support within jupyter +#### Credentials -Pre-requisites (for Linux and Windows): - - for Linux, you need to install the nvidia-docker2 package / nvidia toolkit package that adds gpu spport for docker, official documentation [here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) - - this also needs to be done for Windows machines, please read the the documentation for WSL2 [here](https://docs.nvidia.com/cuda/wsl-user-guide/index.html) +- **OpenSearch Dashboards:** `admin` / `admin` +- **Elasticsearch Native:** `elastic` / `kibanaserver` -GPU support is disabled by default, to enable it, set `DOCKER_ENABLE_GPU_SUPPORT=true` in the `services.yml` file.Please note that only the `cogstacksystems/jupyter-singleuser-gpu:latest`/ `cogstack-gpu` should be used, as it is the only image that has the drivers installed.
+#### Containers -Do not attempt to use the gpu image on a non-gpu machine, it wont work and it will crash the container service. +- `cogstack-kibana` (OpenSearch Dashboards or Kibana depending on configuration) -### Resource limit control in Jupyter-Hub +#### **Service Location & Files** -It is possible to set CPU and RAM limits for admins and normal users, check the following properties in [/deploy/jupyter.env](../../deploy/jupyter.env). +- docker compose: `deploy/services.yml` +- config files: + - `services/kibana/config/elasticsearch.yml` (Elasticsearch) + - `services/kibana/config/opensearch.yml` (OpenSearch Dashboards) +- env: + - `/deploy/elasticsearch.env` + - `/security/certificates_elasticsearch.env` + - `/security/elasticsearch_users.env` -``` -# general user resource cap per container -RESOURCE_ALLOCATION_USER_CPU_LIMIT="2" -RESOURCE_ALLOCATION_USER_RAM_LIMIT="2G" +Image selection controlled by: -# admin resource cap per container -RESOURCE_ALLOCATION_ADMIN_CPU_LIMIT="2" -RESOURCE_ALLOCATION_ADMIN_RAM_LIMIT="4G" -``` +- `${ELASTICSEARCH_KIBANA_DOCKER_IMAGE}` +- `${KIBANA_VERSION}` +- `${KIBANA_CONFIG_FILE_VERSION}` -Go to the `/deploy` folder. -You will need to execute the `export_env_vars.sh` script in order to set these limits, BEFORE running the jupyter-hub container. +#### Ports -Check if the variables have been set by running: -``` - echo $RESOURCE_ALLOCATION_USER_CPU_LIMIT -``` +| Component | External | Internal | +|-----------|----------|----------| +| Kibana / OpenSearch Dashboards | `5601` | `5601` | -If no value is diplsayed then you will manually have to set it, run the following: -``` -set -a -source jupyter.env -set +a +#### Notes + +- Must be started after Elasticsearch/OpenSearch +- Connects automatically using `ELASTICSEARCH_HOSTS` +- TLS/user settings are applied from the `/security` env files + +--- + +## πŸ€– OCR Service + +High-performance document text extraction engine replacing legacy Tika for OCR + text processing. 
+In the near future it will be possible to use LLMs/custom models for OCR (pending v2 release, ETA 2026). + +The service comes in **two variants**: + +- **ocr-service**: full OCR pipeline (images → text) +- **ocr-service-text-only**: lightweight mode (text extraction only, no OCR) + +Both expose a simple REST API. + +**Purpose:** + +- Extract text from PDFs, images, and scanned documents +- Provide OCR via Tesseract (wrapped in optimised Python service) +- Provide fast plain text extraction for digital PDFs (text-only variant) +- Designed for large-scale throughput within NiFi ingestion pipelines + +### Access + +- ocr-service: `http://localhost:8090/api/process` +- ocr-service-text-only: `http://localhost:8091/api/process` + +### Containers + +- `ocr-service` +- `ocr-service-text-only` + +Both built from: + +```bash +cogstacksystems/cogstack-ocr-service: ``` -#### ENV/CONF files: +### Service Location & Files + +- docker compose file: `services/ocr-service/docker/docker-compose.yml` +- service directory: `services/ocr-service/` +- logs: + - Host: `services/ocr-service/log/` + - Container: `/ocr_service/log/` + +- env files: + - `deploy/general.env`: shared variables + - `services/ocr-service/env/ocr_service.env`: full OCR config + - `services/ocr-service/env/ocr_service_text_only.env`: overrides for text-only pipeline + +### Ports + +| Service | External | Internal | +|---------|----------|----------| +| ocr-service | `8090` | `8090` | +| ocr-service-text-only | `8091` | `8090` | + +Both expose the API internally on port `8090`. + +Please check the service's own [README.md](https://github.com/CogStack/ocr-service/blob/main/README.md) + +--- + +## 🗂️ Git-ea + +Self-hosted Git instance (Gitea). +Lightweight GitHub/GitLab-style service used for hosting repositories inside secure or offline environments.
+ +**Purpose:** + +- Internal code hosting for organisations without external Git access +- Repository management, issue tracking, wiki, and basic CI hooks +- Ideal for notebooks, configs, workflows, and internal project code + +### Access + +- URL: **http://localhost:3000** *(default Gitea port)* + +### Containers + +- `gitea` + +### Service Location & Files -- `/deploy/jupyter.env` - all you should ever set is located here -- `/services/jupyter-hub/jupyter_config.py` - only tamper if you know what you are doing, please see [config documentation](https://github.com/jupyterhub/jupyterhub-deploy-docker/blob/main/basic-example/jupyterhub_config.py) for detailed settings +- docker compose file: `deploy/services.yml` +- config file: `services/gitea/app.ini` +- env files: + - `/security/certificates_general.env` -**IMPORTANT**: -- `/services/jupyter-hub/userlist` - userlist that gets loaded once jupyter starts up, you will need to update this manually at the moment whenever a user is created -- `/services/jupyter-hub/teamlist` - teamlist that gets loaded once jupyter starts up +Persistent repository data is stored in the volume defined in `services.yml`. -Re-run the above if you change the values.Make sure to delete old instances of Jupyter-hub containers, and Jupyter single-user containers for each user.DO NOT delete their volumes, you don't want to delete their data! 
+### Ports -IMPORTANT NOTE: all environment variable(s) are described in detail in the env file comments in `/deploy/jupyter.env` +| Service | External | Internal | +|---------|----------|----------| +| Git-ea | `3000` | `3000` | +### Notes -### Security +- Supports repository migration from external Git servers +- Mirroring available when external access is allowed +- Can use CogStack certificates for HTTPS if configured -This service users NiFi's `../../security/root-ca.p12` and `../../security/root-ca.key` certificates,so if you have generated them for NiFi then there is nothing else to do, please see the [jupytherhub secion](../security.md#jupyterhub) for other security configs. +--- -## Git-ea +## 🧱 NGINX -This is a GitHub/GitLab equivalent.Feel free to use it if you organisation doesn't allow access to Github, etc. +*Note: this component may eventually be replaced by **Traefik** as the preferred reverse-proxy and ingress layer for CogStack deployments.* +NGINX is used as a lightweight reverse proxy to provide secure, unified access to internal CogStack services. +It handles HTTPS, routing, and access control for NiFi, MedCAT Trainer, and other components. -### Migrating Git repositories: +MedCAT-Trainer has its own nginx instance that runs independently. -Migrating git repos is straightforward.
+**Purpose:** -If you have an Git organisation (e.g COGSTACK) on your git-ea server, make sure you do the following steps: -- make sure you have the same organisation name created/existing on both servers, and that the source server has the repos you need migrating assigned to the organisation -- select -- the above option reveals a screen, select `Git` not `Gitea` -- in the next screen we can pick a user -- complete the migration as per the following example: - - get url of the source and dest servers : e.g cogstack1 (source) and cogstack2 (dest) respectively - - use a user and password that is able to manage the repo on cogstack2 - - untick the `mirror` option as we will not be using cogstack2 in future - - select and it should report success and the repo will be migrated into the COGSTACK organisation on the new server +- Secure external access to internal services +- Reverse proxy for NiFi, MedCAT Trainer, and service UIs +- TLS termination (optional) +- Basic auth / access control where required +Two variants are included: +- **nginx-nifi**: main proxy for NiFi and related services +- **nginx-medcat-trainer**: specialized proxy for MedCAT Trainer -### ENV/settings files: +Two variants: -- `/services/gitea/app.ini`` - this is the file you will need to edit manually for settings for now, ENV file will soon be available. +- **nginx-nifi**: main proxy for services +- **nginx-medcat-trainer**: dedicated trainer proxy +### Access -### Security +Examples (actual paths depend on config): -This service users NiFi's `../../security/root-ca.p12` and `../../security/root-ca.key` certificates, nothing else is required. +- NiFi: `https://localhost:8443` +- MedCAT Trainer: `https://localhost:8001` -## NGINX -Although by default not used in the deployment example, NGINX is primarily used as a reverse proxy, limiting the access to the used services that normally expose endpoint for the end-user.
-For a simple scenario, it can used only for securing access to Apache NiFi webservice endpoint. +Routing rules are defined in the NGINX configuration files. -All the necessary configuration files and scripts are located in [`./services/nginx/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nginx/config) directory where the user and password generation script `setup_passwd.sh`. +### Containers -### NGINX-NiFi +- `nifi-nginx`: main proxy for NiFi & NiFi Registry +- `medcat-trainer-nginx`: proxy dedicated to MedCAT Trainer -This is a specific nginx instance that is used directly by all services EXCEPT MedCAT Trainer, the trainer has it's own instance started separately with different rules. +### Service Location & Files -### NGINX-MEDCAT-TRAINER +- docker compose file: `deploy/services.yml`; trainer: `deploy/cogstack-nlp/medcat-trainer` +- config files: + - `services/nginx/config/nifi.conf` + - `services/nginx/config/medcat-trainer.conf` + - additional templates under `services/nginx/config/` +- env / certificates: + - `/security/certificates_general.env` + - `/security/certificates_nifi.env` +- Uses shared CogStack Root CA & NiFi certs (`root-ca.p12`, `root-ca.key`, `nifi.key`, `nifi.pem`) -Please refer to the trainer docs, [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) for more info on configuration. +### Ports +| Proxy Target | External | Internal | +|------------------|----------|----------| +| NiFi | `8443` | `8443` | +| NiFi Registry Flow | `18443` | `18443` | -#### Security +### Notes -This service users NiFi's `../../security/root-ca.p12` and `../../security/root-ca.key` certificates.
+- Provides HTTPS entrypoints for internal services +- Works with CogStack certificate bundle +- Trainer uses a separate NGINX instance for routing differences +- Modify NGINX configs only if comfortable with its syntax diff --git a/docs/deploy/troubleshooting.md index dbbbcd36e..15dc8c94f 100644 --- a/docs/deploy/troubleshooting.md @@ -1,4 +1,4 @@ -# Troubleshooting +# 📛 Troubleshooting Always start with fresh containers and volumes, to make sure that there are no volumes from previous experimentations, make sure to always delete all/any cogstack running containers by executing: @@ -8,13 +8,11 @@ followed by a cleanup or dangling volumes (careful as this will remove all volum `docker volume prune -f` WARNING THIS WILL DELETE ALL UNUSED VOLUMES ON YOUR MACHINE!. Check the volume names used in services.yml file and delete them as necessary `dockr volume rm volume_name` -## Known Issues/errors +## 🐞 Known Issues/errors Common issues that can be encountered across services. -
-
-### **Apple Silicon** +### 🍎 **Apple Silicon** Many services cannot run natively on Apple Silicon (such as M1 and M2 architectures). Common error messages related to Apple silicon follow patterns similar to:

@@ -24,7 +22,7 @@ Many services cannot run natively on Apple Silicon (such as M1 and M2 architectu - `no matching manifest for linux/arm64/v8 in the manifest list entries`



- - `image with reference cogstacksystems/cogstack-ocr-service:0.2.4 was found but does not match the specified platform: wanted linux/arm64, actual: linux/amd64` + - `image with reference cogstacksystems/cogstack-ocr-service:1.0.2 was found but does not match the specified platform: wanted linux/arm64, actual: linux/amd64`

To solve these issues; Rosetta is required and enabled in Docker Desktop. Finally an environment variable is required to be set. @@ -42,7 +40,7 @@ export DOCKER_DEFAULT_PLATFORM=linux/amd64 to set the environment variable. These issues are known to occur on the "cogstack-nifi", "cogstack-ocr-services" and "jupyter-hub" services and may occur on others. -### **NiFi** +### 🔧 **NiFi** When dealing with contaminated deployments ( containers using volumes from previous instances ) :

@@ -63,9 +61,9 @@ When dealing with contaminated deployments ( containers using volumes from previ

- `Unable to connect to ElasticSearch` using the `ElasticSearchClientService` NiFi controller, make sure the settings are correct (username,password,certificates, etc.) and then click `Apply`, disregard the errors and click `Enable` on the controller to forcefully reload the controller, stop it and then validate the settings, start it again after and it should work. -### **Elasticsearch Errors** +### 🛢️ **Elasticsearch Errors** -#### **VM memory errors, failed bootstrap check** +#### ⚡ **VM memory errors, failed bootstrap check** It is quite a common issue for both opensearch and native-ES to error out when it comes to virtual memory allocation, this error typically comes in the form of : @@ -91,7 +89,7 @@ For more on this issue please read: https://www.elastic.co/guide/en/elasticsearc
-#### **OpenSearch: validating opensearch.yml hosts** +#### 📄 **OpenSearch: validating opensearch.yml hosts** ```bash FATAL Error: [config validation of [opensearch].hosts]: types that failed validation: @@ -118,7 +116,7 @@ Alternatively (if the script executes without issues): make start-elastic ``` -### DB-samples issues +### 🗃️ DB-samples issues ```bash No table data for samples_db diff --git a/docs/index.rst b/docs/index.rst index 001b209e0..4c171c0c4 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -12,12 +12,13 @@ Welcome to CogStack-Nifi's documentation! main.md news.md - nifi/main.md - security/main.md deploy/main.md deploy/deployment.md deploy/troubleshooting.md deploy/workflows.md + nifi/main.md + security/main.md + Indices and tables ================== diff --git a/docs/news.md b/docs/news.md index 53f8de7d2..98627fe70 100644 --- a/docs/news.md +++ b/docs/news.md @@ -1,10 +1,10 @@ -# News +# 📰 News This document covers important news with regards to the components of CogStack as a whole, any major security issues or major changes that might break existing deployments are covered here along with how to handle them.

-## 13-12-2021 LOG4J Vulnerabity +## 🛑 13-12-2021 LOG4J Vulnerability Since the discovery of the Log4J package vulnerability (https://www.ncsc.gov.uk/news/apache-log4j-vulnerability) it is necessary and recommended to update all existing deployments of CogStack. @@ -22,7 +22,7 @@ For NiFI: - re-pull (docker pull cogstacksystems/cogstack-nifi:latest) - re-pull the tika image (docker pull cogstacksystems/tika-service:latest) -## 01-10-2025 NiFi 2.0 Release +## 🚀 01-10-2025 NiFi 2.0 Release New version of NiFi along with the long awaited NiFi registry flow released: