From 01271bab6816171f1e57d51692df7256a33d1bb3 Mon Sep 17 00:00:00 2001
From: vladd-bit
Date: Wed, 19 Nov 2025 09:30:47 +0000
Subject: [PATCH] Docs: updated main (prerequisities), deployment + troubleshooting sections.

---
 README.md                                 |   5 +-
 deploy/Makefile                           |   4 +-
 docs/deploy/configuration.md              |  79 ++++++
 docs/deploy/deployment.md                 | 228 ++++++++++++++++++
 docs/deploy/main.md                       | 279 +++++++---------------
 docs/deploy/services.md                   |  77 ------
 docs/deploy/troubleshooting.md            | 129 ++++++++++
 docs/index.rst                            |   4 +-
 docs/nifi/main.md                         |   5 +-
 docs/security/elasticsearch_opensearch.md |  28 +--
 10 files changed, 543 insertions(+), 295 deletions(-)
 create mode 100644 docs/deploy/configuration.md
 create mode 100644 docs/deploy/deployment.md
 create mode 100644 docs/deploy/troubleshooting.md

diff --git a/README.md b/README.md
index 0fb963b42..ced0a91f8 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 [![doc-build](https://github.com/CogStack/CogStack-NiFi/actions/workflows/doc-build.yml/badge.svg?branch=main)](https://github.com/CogStack/CogStack-NiFi/actions/workflows/doc-build.yml)
 [![elasticsearch-stack](https://github.com/CogStack/CogStack-NiFi/actions/workflows/docker-elasticsearch-stack.yml/badge.svg?branch=main)](https://github.com/CogStack/CogStack-NiFi/actions/workflows/docker-elasticsearch-stack.yml)

-## Introduction
+## 💡 Introduction

 This repository proposes a possible next step in the evolution of free-text data processing originally implemented in [CogStack-Pipeline](https://github.com/CogStack/CogStack-Pipeline), moving towards a more modular, Platform-as-a-Service (PaaS) approach.

@@ -38,7 +38,8 @@ Need help? Feel free to:
 | [`services`](./services) | NLP and auxiliary services, each with its own configs and resources. |
 | [`deploy`](./deploy) | Example deployment setup, combining NiFi and related services. |
 | [`scripts`](./scripts) | Helper scripts (e.g., setup tools, sample DB ingestion, Elasticsearch ingestion). |
-| [`data`](./data) | Place any test or ingested data here. |
+| [`data`](./data) | Place any test data or data to be ingested here. |
+| [`typings`](./typings) | Stubs for code linting, type hints, etc. |

 ---

diff --git a/deploy/Makefile b/deploy/Makefile
index 83e4e2de2..53f9812f9 100644
--- a/deploy/Makefile
+++ b/deploy/Makefile
@@ -86,7 +86,7 @@ start-git-ea:

 start-data-infra: start-nifi start-elastic start-samples

-start-all: start-data-infra start-jupyter
+start-all: start-data-infra start-jupyter start-medcat-service start-ocr-services

 .PHONY: start-all start-data-infra start-nifi start-elastic start-samples start-jupyter

@@ -155,7 +155,7 @@ stop-production-db:

 stop-data-infra: stop-nifi stop-elastic stop-samples

-stop-all: stop-data-infra stop-jupyter
+stop-all: stop-data-infra stop-jupyter stop-medcat-service stop-ocr-services

 .PHONY: stop-data-infra stop-nifi stop-elastic stop-samples stop-jupyter

diff --git a/docs/deploy/configuration.md b/docs/deploy/configuration.md
new file mode 100644
index 000000000..43cd2090d
--- /dev/null
+++ b/docs/deploy/configuration.md
@@ -0,0 +1,79 @@
+
+
+
+## Environment variables
+
+As mentioned above, environment variables have been made available after release 1.0.
+The variables are configurable and are split into security and general env vars; in addition, each service declared in the `services.yml` file has its variables in a separate file.
+In most cases, modifying these variables should be the only thing needed to run a successful deployment.
+
+Multiple files are available, split into two categories:
+- service: located in `./deploy/`, these are responsible for direct service configuration
+- security: located in `./security`; certificate-related settings are always in the files starting with `certificates_`, and user settings are in the files ending with `_users`
+
+The variables declared in the `./deploy` folder are used in multiple config files, as follows:
+- `elasticsearch.env`, variables here are used in:
+    - `./services/elasticsearch/config/(opensearch|elasticsearch).yml`
+    - `./services/kibana/config/(opensearch|elasticsearch).yml`
+    - `./services/metricbeat/metricbeat.yml`
+    - `./deploy/services.yml` in the following sections: `nifi`, `elasticsearch-1`, `elasticsearch-2`, `elasticsearch-3`, `kibana`, `metricbeat-1`, `metricbeat-2`
+
+- `nifi.env`, vars used in:
+    - `./deploy/services.yml`, sections: `nifi`
+    - `./nifi/conf/nifi.properties`
+
+- `jupyter.env`, vars used in:
+    - `./deploy/services.yml`, sections: `jupyter`
+
+- `nlp_service.env`, vars used in:
+    - `./deploy/services.yml`, sections: `nlp-medcat-service-production`
+
+- `database.env`, vars used in:
+    - `./deploy/services.yml`, sections: `cogstack-databank-db`, `samples-db`
+
+- `general.env`, these vars are optional; declare any custom variables you want here, used in the `nifi` section
+
+Additional env files, used only for certificate generation and user accounts, found in `./security`:
+- `certificates_elasticsearch.env`, used in `create_opensearch_*`/`create_es_native*` scripts
+- `certificates_general.env`, used in `create_root_ca.sh`
+- `certificates_nifi.env`, used in `nifi_toolkit_security.sh`
+- `database_users.env`
+- `elasticsearch_users.env`
+- `nginx_users.env`
+
+
+### Customization
+For custom deployments, copy all the `.env` files (which are not tracked by Git) and add deployment-specific configurations to these files. For example:
+
+```
+cp deploy/*.env deploy/new_deploy_folder/
+cp security/*.env deploy/new_deploy_folder/
+```
+
+### Multiple deployments on the same machine
+When deploying multiple docker-compose projects on the same machine (e.g. for dev or testing), it can be useful to remove all container, volume and network names from the docker-compose file, and let [Docker create names](https://docs.docker.com/compose/reference/envvars/#compose_project_name) based on `COMPOSE_PROJECT_NAME` in `deploy/.env`. Docker will automatically create a Docker network and make sure that containers can find each other by container name.
+
+For example, when setting `COMPOSE_PROJECT_NAME=cogstack-prod`, Docker Compose will create a container named `cogstack-prod_elasticsearch-1_1` for the `elasticsearch-1` service. Within the NiFi container, which is running in the same Docker network, you can refer to that container using just the service name `elasticsearch-1`.
+
+
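The naming rule described above can be sketched in a few lines of shell. This is only an illustration, assuming the `<project>_<service>_<index>` pattern from the example; `container_name_for` is a hypothetical helper and is not part of this repository or of Docker. Docker Compose V2 joins the parts with hyphens rather than underscores by default, so the separator is left as a parameter.

```shell
# Hypothetical helper (not part of this repository or of Docker):
# builds the container name Compose derives when the compose file
# sets no explicit container_name.
# Args: project name, service name, separator (default "_"), replica index (default 1).
container_name_for() {
  project="$1"
  service="$2"
  sep="${3:-_}"
  index="${4:-1}"
  printf '%s%s%s%s%s\n' "$project" "$sep" "$service" "$sep" "$index"
}

# Project name as it would be set in deploy/.env
COMPOSE_PROJECT_NAME=cogstack-prod

# Underscore pattern, as in the example above
container_name_for "$COMPOSE_PROJECT_NAME" elasticsearch-1

# Hyphen separator, the default in Docker Compose V2
container_name_for "$COMPOSE_PROJECT_NAME" elasticsearch-1 -
```

Whichever separator applies, other containers on the same Docker network can still reach the service by its service name (`elasticsearch-1`), so workflows should not need to hard-code the generated container name.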
+
+## Important security detail
+
+Please note that in the example service definitions, SSL encryption is enabled among services (NiFi, ES, etc.) for ease of deployment and demonstration. However, the certificates used are stored in this public repository where anyone can see them, so **please** make sure to re-generate them when you go into production.
+
+## Services
+Please note that all the services are deployed using the [Docker](https://docker.io) engine, which requires the Docker daemon to be running.
+
+Please see [the available services](./services.md) for more details.
+
+
+## Workflows
+Apache NiFi provides users the ability to build very large and complex data flows.
+These data flows can be later saved as workflow *templates*, exported into XML format and shared with other users.
+We provide a few example templates for ingesting the records from a database into Elasticsearch and for extracting NLP annotations from documents.
+
+### Deployment using Makefile
+For deployments based on the example workflows, please see [example workflows](./workflows.md) for more details.
+
+### Deployment using a custom Docker-compose
+When using a fork of this repository for a customized deployment, it can be useful to copy `services.yml` to a deployment-specific `docker-compose.yml`. In this Compose file you can specify the services you need for your instance and configure all parameters per service, as well as track this file in a branch in your own fork. This way you can use your own version control and rebase on `CogStack/CogStack-NiFi` master without running into merge conflicts.
diff --git a/docs/deploy/deployment.md b/docs/deploy/deployment.md
new file mode 100644
index 000000000..928da02b8
--- /dev/null
+++ b/docs/deploy/deployment.md
@@ -0,0 +1,228 @@
+
+# 📦 Deployment
+
+The [`deploy`](https://github.com/CogStack/CogStack-NiFi/tree/main/deploy/) directory contains an example dockerized deployment setup of the customised NiFi image, along with related services for document processing, NLP, and text analytics.
+
+Make sure you have read the [Prerequisites](./main.md) section before proceeding.
+
+## 🗂️ Key files
+
+- **`services.yml`** – defines the *core* services that are orchestrated directly from this repository via Docker Compose. (Kubernetes-based multi-container deployments are coming soon.)
+
+- **`Makefile`** – provides convenient commands for starting, stopping, and managing the deployment.
+
+- **`.env` files in `./deploy/`** – environment variables used across services:
+    - These environment variables apply **only to the services defined inside `services.yml`**.
+    - Security-related `.env` files (certificates, users) are under **`/security`**.
+
+  These variables configure NiFi, Elasticsearch/OpenSearch, Kibana, Jupyter, Metricbeat, the sample DB, etc.
+
+## 🧩 Modular service design (important)
+
+This repository follows a **modular deployment model**:
+
+- Only the services defined in **`services.yml`** use the environment files located in **`./deploy/*.env`**.
+- **All other services** included in the ecosystem are launched via `docker-compose` commands inside their own directories, for example:
+
+  ```bash
+  ./services//docker/docker-compose.yml
+  ```
+
+- Each of these standalone services maintains **its own environment configuration** in:
+
+  ```bash
+  ./services//env/
+  ```
+
+This design allows each service to be:
+
+- independently configurable
+- versioned and deployed in isolation
+- consumed by other projects without modifying the core deployment
+
+> These are the files you will most commonly modify when creating or adjusting a deployment.
+
+## ⚙️ Additional service configuration
+
+- Service-specific configurations are located under:
+  [`./services`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/)
+- NiFi-specific configuration (properties, custom processors, drivers, Python scripts, etc.) is under:
+  [`./nifi`](https://github.com/CogStack/CogStack-NiFi/tree/main/nifi/)
+
+## 🚀 Starting the Services
+
+All core services defined in `services.yml` can be started using the Makefile in the `deploy/` directory.
+
+For most services in the `services` folder that are not part of the core stack defined in `services.yml` and are pulled from external git submodule repositories, the start-up process is the same.
+
+### ▢️ Start each service individually
+
+You can start individual components of the CogStack-NiFi stack using the `make start-*` commands.
+Each target loads all required environment variables automatically via `export_env_vars.sh`.
+
+This is useful for:
+
+- debugging a single service
+- restarting only one component after config changes
+- running lightweight subsets of the stack
+- isolating problems or logs per service
+
+---
+
+#### 🧩 Core NiFi Services
+
+```bash
+make start-nifi
+```
+
+Starts:
+
+- **nifi** – the Apache NiFi instance (main ETL/orchestration engine)
+- **nifi-nginx** – reverse proxy/front-end for NiFi
+- **nifi-registry-flow** – NiFi Registry backend that stores flow versions
+
+Use when you want to run, debug, or modify NiFi workflows without bringing up the entire ecosystem.
+
+---
+
+### 🏗️ Start Core Data Infrastructure
+
+```bash
+make start-data-infra
+```
+
+Starts:
+
+- NiFi
+- NiFi Registry Flow
+- NiFi Nginx
+- Elasticsearch
+- Samples DB
+
+Ideal for running ingestion pipelines and ETL workflows.
+
+---
+
+#### 🛢️ Elasticsearch / OpenSearch Services
+
+```bash
+make start-elastic
+```
+
+Starts the standard 2-node Elasticsearch cluster + Kibana.
+
+```bash
+make start-elastic-cluster
+```
+
+Starts all 3 ES nodes. Useful for testing clustering, sharding, and replication.
+
+```bash
+make start-elastic-1
+make start-elastic-2
+make start-elastic-3
+```
+
+Start individual Elasticsearch nodes for debugging or failure-scenario testing.
+
+---
+
+#### 📈 Kibana
+
+```bash
+make start-kibana
+```
+
+Starts Kibana for inspecting logs, checking index mappings, monitoring ES health, and debugging pipelines.
+
+---
+
+#### 🗄️ Databases
+
+```bash
+make start-samples
+```
+
+Starts **samples-db**, the small example DB used for demo flows.
+
+```bash
+make start-production-db
+```
+
+Starts the **cogstack-databank-db** production database.
+
+Use when testing SQL ingestion or verifying DB-driven NiFi flows.
+
+---
+
+#### 📚 JupyterHub
+
+```bash
+make start-jupyter
+```
+
+Starts the CogStack JupyterHub instance. Used for notebooks, analysis, model testing, and visualisation.
+
+---
+
+#### 🧠 NLP Services (MedCAT & Trainer)
+
+```bash
+make start-medcat-service
+```
+
+Starts the MedCAT concept extraction inference API.
+
+```bash
+make start-medcat-service-deid
+```
+
+Starts the MedCAT DEID (de-identification) inference API.
+
+```bash
+make start-medcat-trainer
+```
+
+Starts the full MedCAT Trainer stack (Trainer UI + Solr + NGINX). Useful for annotation and supervised training tasks.
+
+---
+
+#### 📝 OCR Services
+
+```bash
+make start-ocr-services
+```
+
+Starts:
+
+- **ocr-service** – main OCR pipeline
+- **ocr-service-text-only** – lightweight OCR/text extraction
+
+Use for PDF ingestion, OCR debugging, and pipeline validation.
+
+---
+
+#### 🛠️ Miscellaneous Services (Gitea)
+
+```bash
+make start-git-ea
+```
+
+Starts the internal Gitea Git server used for local code/config storage.
+
+---
+
+### 🚀 Start the Entire Stack
+
+```bash
+make start-all
+```
+
+Starts everything:
+
+- Core infra
+- JupyterHub
+- MedCAT NLP services
+- OCR services
+
+Use for complete deployments, demos, or full-stack development.
diff --git a/docs/deploy/main.md b/docs/deploy/main.md
index c851613a4..e10c9eda3 100755
--- a/docs/deploy/main.md
+++ b/docs/deploy/main.md
@@ -1,229 +1,116 @@
-# Prequisites
+# 📋 Prerequisites
-Software required on machine:
-    - git + git-lfs
-    - Docker
+Please read carefully as there can be many points of failure when installing/deploying everything into a clean environment.
-You can use the script with `SUDO` rights, located at `/scripts/installation_utils/install_docker_and_utils.sh`, it can be used on Debian/Ubuntu/CentOS/RedHAT RHEL 8 only, run it once and everything should be set up.
-Consult the (`Docker installation steps`)[https://docs.docker.com/engine/install/debian/] if there are issues with the docker setup.
+## 🖥️ OS Requirements
-:::{warning}
-IMPORTANT NOTE: Do a `git-lfs pull` so that you have everything downloaded from the repo (including bigger zipped files.).
-:::
-
-# Deployment
-[./deploy](https://github.com/CogStack/CogStack-NiFi/tree/main/deploy/) contains an example deployment of the customised NiFi image with related services for document processing, NLP and text analytics.
-
-The key files are:
-- `services.yml` - defines all the available services in docker-compose format. K8s (i.e. multi container service deployments is coming soon...)
-- `Makefile` - scripts for running docker-compose commands,
-- `.env` - local environment variables definitions, deployment `.env` files are located in the `/deploy` folder, security `.env` files are located in the `/security` folder, containing users and certificate generation settings.
-The above mentioned files should be the files that you will most likely need to change during a deployment.
-
-Individual service configurations are provided in [`./services`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/).
-
-Apache NiFi-related files are provided in [`./nifi`](https://github.com/CogStack/CogStack-NiFi/tree/main/nifi/) directory.
-
-
-
-## Environment variables
-
-As mentioned above, environment variables have been made available after release 1.0.
-The variables are configurable, and are separated, into security and general env vars, furthermore, all services declared in the `services.yml` file have their variables in separate files.
-In most cases, modifying these variables should be the only thing that is needed in order to run a successful deployment.
-
-Multiple files are available, split into two categories:
-- service: located in `./deploy/` are reponsible for direct service configuration
-- security: located in `./security`, ceriticate related settings are always in the files starting with `certificates_` and user settings are located in the files ending with `_users`
-
-The variables declared in the `./deploy` folder are used in multiple config files, as follows:
-- `elasticsearch.env`, variables here are used in :
-    - `./services/elasticsearch/config/(opensearch|elasticsearch).yml`
-    - `./services/kibana/config/(opensearch|elasticsearch).yml`
-    - `./services/metricbeat/metricbeat.yml`
-    - `./deploy/services.yml` in the following sections: `nifi`, `elasticsearch-1`, `elasticsearch-1`, `elasticsearch-3`, `kibana`, `metricbeat-1`,`metricbeat-2`
-
-- `nifi.env`, vars used in:
-    - `./deploy/services.yml`, sections: `nifi`
-    - `./nifi/conf/nifi.properties`
-
-- `jupyter.env`, vars used in:
-    - `./deploy/services.yml`, sections: `jupyter`
+Please note that only the OS versions listed below are supported; we will not provide support for anything not listed here.
-- `nlp_service.env`, vars used in:
-    - `./deploy/services.yml`, sections: `nlp-medcat-service-production`
+- Linux (Ubuntu 24.04 LTS+ and Debian 10+ are preferred; RHEL 9+).
+- Windows 11+/Windows Server 2022+ (requires a [WSL 2.0](https://learn.microsoft.com/en-us/windows/wsl/about) installation with an Ubuntu image; for a working setup, follow [this guide](https://documentation.ubuntu.com/wsl/latest/howto/install-ubuntu-wsl2/) and come back here when things are working).
+- macOS 15+ (Sequoia).
-- `database.env`, vars used in:
-    - `./deploy/services.yml`, sections: `cogstack-databank-db`, `samples-db`
+## 🧰 Software requirements (Linux/macOS)
-- `general.env`, these vars are optional, declared any custom variables you want here, used in the `nifi` section
+Software required on machine (the minimum/basics to get demos running):
-Additional variablesenv files, used only or certificate generation and user accounts, found in `./security`:
-- `certificates_elasticsearch.env`, used in `create_opensearch_*`/`create_es_native*` scripts
-- `certificates_general.env`, used in `create_root_ca.sh`
-- `certificates_nifi.env`, used in `nifi_toolkit_security.sh`
-- `database_users.env`
-- `elasticsearch_users.env`
-- `nginx_users.env`
+- make
+- git + git-lfs
+- Docker
+- python3.11
+## 🔐 Other requirements (User Permissions/Firewall)
-### Customization
-For custom deployments, copy all the `.env` files (which are not tracked by Git) and add deployment specific configurations to these files. For example:
+  - a Linux account with 'admin' rights if possible; if not, you will need to get your IT team to look at this README and install the packages for you using the steps below (make sure they look at the [Docker rootless installation steps](https://docs.docker.com/engine/security/rootless))
+  - firewall whitelisting of the following addresses:
+    - https://github.com/
+    - https://hub.docker.com/
+    - https://docker.io
+    - http://download.docker.com
+    - https://huggingface.co/
+    - https://www.nltk.org
+    - https://pypi.org/
+    - https://pypi.python.org
+    - https://files.pythonhosted.org
+    - https://pythonhosted.org
-```
-cp deploy/*.env deploy/new_deploy_folder/
-cp security/*.env deploy/new_deploy_folder/
-```
+## ⚙️ Installation steps
-### Multiple deployments on the same machine
-When deploying multiple docker-compose projects on the same machine (e.g. for dev or testing), it can be useful to remove all containers, volume and network names from the docker-compose file, and let [Docker create names](https://docs.docker.com/compose/reference/envvars/#compose_project_name) based on `COMPOSE_PROJECT_NAME` in `deploy/.env`. Docker will automatically create a Docker network and makes sure that containers can find each other by container name.
+The steps below assume you are the system admin, meaning you have `SUDO` rights.
+You can use the script located at `/scripts/installation_utils/install_docker_and_utils.sh` with `SUDO` rights. It can be used on Debian (10+), Ubuntu (22.04+) and Red Hat RHEL 8/9 only; run it once and everything should be set up.
-For example, when setting `COMPOSE_PROJECT_NAME=cogstack-prod`, Docker Compose will create a container named `cogstack-prod_elasticsearch-1_1` for the `elasticsearch-1` service. Within the NiFi container, which is running in the same Docker network, you can refer to that container using just the service name `elasticsearch-1`.
+Execute the following commands in the root directory of the repo:
-
-
-## Important security detail
-
-Please note that in the example service defintions, for ease of deployment and demonstration, SSL encryption is enabled among services (NiFi, ES, etc.), however, the certificates that are used are in this public repository, anyone can see them, so **please** make sure to re-generate them when you go into production.
-
-## Services
-Please note that all the services are deployed using [Docker](https://docker.io) engine and requires docker deamon to be running / functioning.
-
-Please see [the available services](./services.md) for more details.
+1. `git-lfs pull`
+2. (OPTIONAL if you already have the software from [this section](#-software-requirements-linuxmacos) installed) `sudo bash ./scripts/installation_utils/install_docker_and_utils.sh` and wait for it to finish; it may take a while to get all the packages.
+3. `sudo bash ./scripts/git_update_submodules_in_repo.sh`
+4. check that docker works correctly: `docker pull hello-world`
+5. if there are no errors, run `docker run --rm hello-world`; it should run without issues
+6. if there are any issues, check the warning sections below
+:::{warning}
+IMPORTANT NOTE: Do a `git-lfs pull` so that you have everything downloaded from the repo (including bigger zipped files.).
+:::
-## Workflows
-Apache NiFi provides users the ability to build very large and complex data flows.
-These data flows can be later saved as workflow *templates*, exported into XML format and shared with other users.
-We provide few example templates for ingesting the records from a database into Elasticsearch and to perform extraction of NLP annotations from documents.
+:::{warning}
+Ensure all Git submodules are initialized and updated:
+`sudo bash ./scripts/git_update_submodules_in_repo.sh`
+:::
-### Deployment using Makefile
-For deployments based on the example workflows, please see [example workflows](./workflows.md) for more details.
+:::{warning}
+Consult the if there are issues with the docker setup.
+If Docker fails to install or `docker pull hello-world` does not work:
-### Deployment using a custom Docker-compose
-When using a fork of this repository for a customized deployments, it can be useful to copy `services.yml` to a deployment-specific `docker-compose.yml`. In this Compose file you can specify the services you need for your instance and configure all parameters per service, as well as track this file in a branch in your own fork. This way you can use your own version control and rebase on `CogStack/CogStack-NiFi` master without running into merge conflicts.
+  - Follow the official [Docker installation steps](https://docs.docker.com/engine/install/debian/)
+  - Ensure your user is in the docker group
+  - For non-sudo users, check Docker rootless mode and required post-install steps:
+    - https://docs.docker.com/engine/security/rootless/
+    - https://docs.docker.com/engine/install/linux-postinstall/
+:::
-## Troubleshooting
+## ⚠️ Essential Elasticsearch Requirement
-Always start with fresh containers and volumes, to make sure that there are no volumes from previous experimentations, make sure to always delete all/any cogstack running containers by executing:
+:::{warning}
+**Elasticsearch may fail to start unless `vm.max_map_count` is increased.**
-`docker container rm samples-db elasticsearch-1 kibana nifi nlp-medcat-service-production tika-service nlp-gate-drugapp nlp-medcat-snomed nlp-gate-bioyodie medcat-trainer-ui medcat-trainer-nginx jupyter-hub -f`
+If this value is too low, Elasticsearch will exit with the error:
-followed by a cleanup or dangling volumes (careful as this will remove all volumes which are NOT being used by a container, if you want to remove specific volumes you will have to manually specifiy the volume names), otherwise, you can specify :
+  ```bash
+  bootstrap checks failed
+  max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
+  ```
-`docker volume prune -f` WARNING THIS WILL DELETE ALL UNUSED VOLUMES ON YOUR MACHINE!. Check the volume names used in services.yml file and delete them as necessary `dockr volume rm volume_name`
+If you did **not** run the installation script, set it manually:
-### Known Issues/errors
-Common issues that can be encountered across services.
-
-
+**Temporary (until reboot):**
+  ```bash
+  sudo sysctl -w vm.max_map_count=262144
+  ```
-#### **Apple Silicon**
+**Permanent (persists across reboots):**
+Add the line below to `/etc/sysctl.conf`:
-Many services cannot run natively on Apple Silicon (such as M1 and M2 architectures). Common error messages related to Apple silicon follow patterns similar to:
-
-  - `no match for platform in manifest`
-
-
-  - `no matching manifest for linux/arm64/v8 in the manifest list entries`
-
-
-  - `image with reference cogstacksystems/cogstack-ocr-service:0.2.4 was found but does not match the specified platform: wanted linux/arm64, actual: linux/amd64`
-
-To solve these issues; Rosetta is required and enabled in Docker Desktop. Finally an environment variable is required to be set.
+  ```bash
+  vm.max_map_count=262144
+  ```
-Rosetta can which can be installed via the following command:
-```
-softwareupdate --install-rosetta
-```
-When Rosetta and Docker Desktop are installed, Rosetta must be enabled. This done by going to Docker Desktop -> Setting -> General and enabling "Use Virtualization framework". After in the same settings go to "features in development" -> "Use Rosetta for x86/amd64 emulation on Apple Silicon". Finally execute the following command:
-```
-export DOCKER_DEFAULT_PLATFORM=linux/amd64
-```
-to set the environment variable. These issues are known to occur on the "cogstack-nifi", "cogstack-ocr-services" and "jupyter-hub" services and may occur on others.
+Or a one-liner:
-#### **NiFi**
+  ```bash
+  sudo sh -c "echo 'vm.max_map_count=262144' >> /etc/sysctl.conf"
+  ```
-When dealing with contaminated deployments ( containers using volumes from previous instances ) :
-
-  - `NiFi only supports one mode of HTTP or HTTPS operation...` deleting the volumes should usually solve this issue, if not, please check the `nifi.properties` if there have been modifications done by yourself or a developer on it.
-
-  - building the NiFi image manually on a restricted system, this is usually not necessary, but if for some reason this needs to be done then some settings such as proxy configs might need to be set up in the `nifi/Dockerfile` epecially ones related to the `grape` application and dealing with external downloads.
-
-  - `keystore.jks`/`truststore.jks` related errors, remove the nifi container & related volumes then restart the nifi instance.
-
-  - `System Error: Invalid host header : this occurs when nifi host has not been properly configured`, please check the `/nifi/conf/nifi.properties` file and set the `nifi.web.proxy.host` property to the IP address of the server along with the port `:`, if this does not work then it is usually a proxy/network configuration problem (also check firewalls), another workaround would be to comment out the following subsections of the `nifi` service in the `services.yml` file : `ports:` and `networks` with all their child settings. After this is done the following property should be added `network_mode: host`, restart the instance using the `docker-compoes -f services.yml up -d nifi` command afterwards.
-
-  - Possible error when dealing with non-pgsql databases `due to Incorrect syntax near 'LIMIT'.; routing to failure: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near 'LIMIT'`, go to the GenerateTableFetch Process -> right-click -> configure -> change database type from Generic to -> MS SQL 2012 + or 2008 (if an older DB system is used)
-  - Possible error on Linux systems related to `nifi.properties` permission error and/or other files from the `nifi/conf/` folder, please see the [nifi doc](../nifi/main.md#important-note-about-nifi-properties) {nifi.properties} section.
-
-  - `Driver class org.postgresql.Driver is not found` or something similar for other MSSQL/SQL drivers, this is a known issue after NiFi version v1.20+, first, make sure you pull the latest version of the repository, then for the JAR file you are using, please execute the following command in order to verify its integrity `jar -tvf ./nifi/drivers/your_file_version.jar`, if this returns a list of files and NO errors then the files are not corrupted and can be loaded. On the NiFi side make sure to go to the `DBCPConnectionPool` controller service and verify the propertiesit a few times, make sure the file path is correct and in the following format: `file:///opt/nifi/drivers/postgresql-42.6.0.jar` for example. If all this fails stop nifi, delete all the Docker volumes associated with it -> restart NiFi, perform the above steps again. You can try forcefully starting the `GenerateTableFetch` or `QueryDatabaseTable` processors by enabling the `DBCPConnectionPool` even if an error popus up after clicking the verify button.
-
-  - `502 Bad Gateway`, NiFi simply not starting, even after waiting more than 2-3 minutes. This can occur due to a wide variety of issues, you can check the NiFi container log : “docker logs -f --tail 1000 cogstack-nifi > my_log_file.txt” to capture the output easily. The most common cause is running out of memory, increase or decrease the limits in `nifi/conf/bootstrap.conf` according to your machine's spec, please read [bootstrap.conf](../nifi/main.md#bootstrapconf)
-
-  - `Unable to connect to ElasticSearch` using the `ElasticSearchClientService` NiFi controller, make sure the settings are correct (username,password,certificates, etc.) and then click `Apply`, disregard the errors and click `Enable` on the controller to forcefully reload the controller, stop it and then validate the settings, start it again after and it should work.
+Then apply:
-#### **Elasticsearch Errors**
-
+  ```bash
+  sudo sysctl -p
+  ```
-##### **VM memory errors, failed bootstrap check**
-
+> The `install_docker_and_utils.sh` script automatically configures this.
+> You only need to set it manually if the script was skipped.
+:::
-It is quite a common issue for both opensearch and native-ES to error out when it comes to virtual memory allocation, this error typically comes in the form of :
+## 🏅 Deploying services
-```
-ERROR: [1] bootstrap checks failed
-[1]: max virtual memory areas vm.max_map_count [65111] is too low, increase to at least [262144]
-```
-To solve this one needs to simply execute :
-
-  - on Linux/Mac OS X :
-    ```sysctl -w vm.max_map_count=262144``` in terminal.
-    To make the same change systemwide plase add ```vm.max_map_count=262144``` to /etc/sysctl.conf and restart the dockerservice/machine.
-    An example of this can be found under /services/elasticsearch/sysctl.conf
-
-  - on Windows you need to enter the following commands in a powershell instance:
-
-    ```wsl -d docker-desktop```
-
-    ```sysctl -w vm.max_map_count=262144```
-
-For more on this issue please read: https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
-
-
-
-##### **OpenSearch: validating opensearch.yml hosts**
-
- - -``` -FATAL Error: [config validation of [opensearch].hosts]: types that failed validation: -- [config validation of [opensearch].hosts.0]: expected URI with scheme [http|https]. -- [config validation of [opensearch].hosts.1]: could not parse array value from json input -``` - -This issue may appear after the recent switch to using fully customizable environment variables. Strings and ENV vars may be parsed differently depending on the shell version found on the host system. - -To solve this, the easiest way is to make sure to load the `elasticsearch.env` variables before starting the Elastic & Kibana containers by doing the following: - -``` - cd ./deploy/ - set -a - source elasticsearch.env - make start-elastic -``` - -Alternatively (if the script executes without issues): -``` - cd ./deploy/ - source export_env_vars.sh - make start-elastic -``` - - -### DB-samples issues - -``` No table data for samples_db``` -It is possible that you may have forgotten to pull the large files from the repo, please do : `git lfs pull` . -Delete the samples-db container and it's volumes and restart it, you should now see the data in the tables. \ No newline at end of file +If everything up to this point is running fine, then, congratulations, you should now be able to start looking at the [deployment section](./deployment.md) diff --git a/docs/deploy/services.md b/docs/deploy/services.md index 0447b0dc0..e3f688607 100644 --- a/docs/deploy/services.md +++ b/docs/deploy/services.md @@ -89,49 +89,6 @@ Bio-YODIE requires [UMLS](https://www.nlm.nih.gov/research/umls/index.html) reso MedCAT SNOMED CT model requires a prepared model based on [SNOMED CT](http://www.snomed.org/) dictionary with the model available in `RES_MEDCAT_SERVICE_MODEL_PRODUCTION_PATH` directory. These paths can be defined in `.env` file in the deployment directory. 
-### Bio-YODIE -[Bio-YODIE](https://github.com/GateNLP/Bio-YODIE) is a named entity linking application build using [GATE NLP](https://gate.ac.uk/) suite ([publication](https://arxiv.org/abs/1811.04860)). - -The application files are stored in [`nlp-services/applications/bio-yodie/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/bio-yodie) directory. - -The Bio-Yodie service configuration is stored in [`nlp-services/applications/bio-yodie/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/bio-yodie/config) directory - the key service configuration properties are defined in `application.properties` file. - - -### GATE - -**Important** -Please note that this application is provided just as a proof-of-concept of running GATE applications. - - -This simple application implements annotation of common drugs and medications. -It was created using [GATE NLP](https://gate.ac.uk/sale/tao/splitch13.html) suite and uses GATE ANNIE Gazetteer plugin. -The application was been created in GATE Developer studio and exported into `gapp` format. -This application is hence ready to be used by GATE and is stored in `nlp-services/applications/drug-app/gate` directory as `drug.gapp` alongside the used resources. - -The list of drugs and medications to annotate is based on a publicly available list of FDA-approved drugs and active ingredients. -The data can be downloaded directly from [Drugs@FDA database](https://www.fda.gov/drugs/informationondrugs/ucm079750.htm). - -This applications is being run using a NLP Service runner application that uses internally [GATE Embedded](https://gate.ac.uk/family/embedded.html) (for running GATE applications) and exposes a REST API. -The NLP Service necessary configuration files are stored in `nlp-services/applications/drug-app/config/` directory - the key service configuration properties are defined in `application.properties` file. 
- -If you would like to build the docker image with already initialized NLP application, service and necessary resources bundled, please use provided `Dockerfile` in the `nlp-services/applications/drug-app/` directory. - -To deploy an example GATE NLP Drug names extraction application as a service, type: -``` -make start-nlp-gate -``` -The command will deploy `nlp-gate-drugapp` service. -Please see below the description of the deployed NLP service. - -To stop the service, type: -``` -make stop-nlp-gate -``` - - -**Important** -This service will be discontinued in the near future, meaning it will be removed from the repo. - ### MedCAT [MedCAT](https://github.com/CogStack/MedCAT) is a named entity recognition and linking application for concept annotation from UMLS or any other source. @@ -139,7 +96,6 @@ MedCAT is deployed as a service exposing RESTful API using the implementation fr ### MedCAT Service - MedCAT Service resources are stored in [`./services/nlp-services/applications/medcat/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat) directory. The key configuration properties stored as environment variables are defined in [`./services/nlp-services/applications/medcat/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat/config) sub-directory. The models used by MedCAT are stored in [`./servies/nlp-services/applications/cat/models/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat/models). @@ -257,21 +213,6 @@ More configuration options are covered in [nifi-doc](../nifi/main.md). Other `.env` files are mounted but those are only useful for custom scripts where you plan to use certain vars from other services, check the `services.yml` nifi `env-file` section definition. -## Tika Service - -`tika-service` provides document text extraction functionality of [Apache Tika](https://tika.apache.org/). 
-[Tika Service](https://github.com/CogStack/tika-service) implements the actual Apache Tika functionality behind a RESTful API. - -The application data, alongside configuration file, is stored in [`./services/tika-service`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/tika-service) directory. - -When deployed Tika Service exposes port `8090` at `tika-service` container being available to all services within `cognet` Docker network, most importantly by `nifi` data processing engine. -The Tika service REST API endpoint for processing documents is available at `http://tika-service:8090/api/process`. - -For more details on configuration, API definition and example use of Tika Service please refer to [the official documentation](https://github.com/CogStack/tika-service). - -### ENV/CONF files: -- `/deploy/tika-service/config/application.yaml` - ## OCR Service The new `ocr-service` provides a new way to OCR documents at good speed, the equivalent in Tika-service but revwritten in Python and optimized. @@ -293,7 +234,6 @@ In the example deployment we use NLP applications running as a service exposing The current version of API specs is specified in [`./services/nlp-services/api-specs/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/api-specs) directory (both [Swagger](https://swagger.io/) and [OpenAPI](https://www.openapis.org/) specs). The applications are stored in [`./services/nlp-services/applications`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications). - ### NLP API All the NLP services implement a RESTful API that is defined in [OpenAPI specification](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/api-specs/openapi.yaml). 
@@ -307,23 +247,6 @@ Please see example Apache NiFi [workflows](./workflows.md) and [user scripts](ht For further details on the used API please refer to the [OpenAPI specification](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/api-specs/openapi.yaml) for the definition of the request and response payload. -### GATE NLP -`nlp-gate-drugapp` serves a simple drug names extraction NLP application using [GATE NLP Service](https://github.com/CogStack/gate-nlp-service). -This simple application implements annotation of common drugs and medications. -It was created using [GATE NLP](https://gate.ac.uk/sale/tao/splitch13.html) suite and uses GATE ANNIE Gazetteer plugin. -The GATE application definition and resources are available in directory [`./services/nlp-services/applications/drug-app`](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/applications/drug-app/). - -When deployed `nlp-gate-drugapp` exposes port `8095` on the container. -The port is also bound from container to the host machine `8095` port. -The service endpoint should be available to all the services running inside the `cognet` Docker network. -For example, to access the API endpoint to process a document by a service in `cognet` network, the endpoint address would be `http://nlp-gate-drugapp:8095/api/process`. - -As a side note, when deployed `nlp-gate-bioyodie` (assuming that the Bio-YODIE resources are properly set up with `RES_BIOYODIE_UMLS_PATH` variable), the service will only expose port `8095` on container. -Although the service won't be accessible from the host machine, but all the services inside the `cognet` network will be able to access it. - -For more information on the GATE NLP Service configuration and use please refer to [the official documentation](https://github.com/CogStack/gate-nlp-service). 
- - ### MedCAT NLP [MedCAT](https://github.com/CogStack/MedCAT) is a named entity recognition and linking application for concept annotation from UMLS or any other source. MedCAT deployment consists of [MedCAT NLP Service](https://github.com/CogStack/MedCATservice) serving NLP models via RESTful API and [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) for collecting annotations and refinement of the NLP models. diff --git a/docs/deploy/troubleshooting.md b/docs/deploy/troubleshooting.md new file mode 100644 index 000000000..dbbbcd36e --- /dev/null +++ b/docs/deploy/troubleshooting.md @@ -0,0 +1,129 @@ +# Troubleshooting + +Always start with fresh containers and volumes. To make sure that no volumes are left over from previous experiments, delete any running CogStack containers by executing: + +`docker container rm samples-db elasticsearch-1 kibana nifi nlp-medcat-service-production tika-service nlp-gate-drugapp nlp-medcat-snomed nlp-gate-bioyodie medcat-trainer-ui medcat-trainer-nginx jupyter-hub -f` + +followed by a cleanup of dangling volumes. Be careful: this removes all volumes which are NOT being used by a container. To remove only specific volumes you have to specify the volume names manually; otherwise, you can run: + +`docker volume prune -f` WARNING: THIS WILL DELETE ALL UNUSED VOLUMES ON YOUR MACHINE! To remove specific volumes, check the volume names used in the `services.yml` file and delete them as necessary with `docker volume rm volume_name`. + +## Known Issues/errors + +Common issues that can be encountered across services. +
+
+ +### **Apple Silicon** + +Many services cannot run natively on Apple Silicon (such as M1 and M2 architectures). Common error messages related to Apple silicon follow patterns similar to: +

+ - `no match for platform in manifest` +

+

+ - `no matching manifest for linux/arm64/v8 in the manifest list entries` +

+

+ - `image with reference cogstacksystems/cogstack-ocr-service:0.2.4 was found but does not match the specified platform: wanted linux/arm64, actual: linux/amd64` +

+To solve these issues, Rosetta must be installed and enabled in Docker Desktop, and an environment variable must be set. + +Rosetta can be installed via the following command: + +```bash +softwareupdate --install-rosetta +``` + +Once Rosetta and Docker Desktop are installed, Rosetta must be enabled. This is done by going to Docker Desktop -> Settings -> General and enabling "Use Virtualization framework". Then, in the same settings, go to "features in development" -> "Use Rosetta for x86/amd64 emulation on Apple Silicon". Finally, execute the following command to set the environment variable: + +```bash +export DOCKER_DEFAULT_PLATFORM=linux/amd64 +``` + +These issues are known to occur on the "cogstack-nifi", "cogstack-ocr-services" and "jupyter-hub" services and may occur on others. + +### **NiFi** + +When dealing with contaminated deployments (containers using volumes from previous instances): +

+ - `NiFi only supports one mode of HTTP or HTTPS operation...` deleting the volumes should usually solve this issue, if not, please check the `nifi.properties` if there have been modifications done by yourself or a developer on it. +

+ - building the NiFi image manually on a restricted system: this is usually not necessary, but if for some reason it needs to be done, some settings such as proxy configs might need to be set up in the `nifi/Dockerfile`, especially ones related to the `grape` application and dealing with external downloads. +

+ - `keystore.jks`/`truststore.jks` related errors, remove the nifi container & related volumes then restart the nifi instance. +

+ - `System Error: Invalid host header`: this occurs when the NiFi host has not been properly configured. Check the `/nifi/conf/nifi.properties` file and set the `nifi.web.proxy.host` property to the IP address of the server along with the port `:`. If this does not work, it is usually a proxy/network configuration problem (also check firewalls). Another workaround is to comment out the following subsections of the `nifi` service in the `services.yml` file: `ports:` and `networks`, with all their child settings. Once this is done, add the property `network_mode: host`, then restart the instance using the `docker-compose -f services.yml up -d nifi` command. +

+ - Possible error when dealing with non-pgsql databases `due to Incorrect syntax near 'LIMIT'.; routing to failure: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near 'LIMIT'`, go to the GenerateTableFetch Process -> right-click -> configure -> change database type from Generic to -> MS SQL 2012 + or 2008 (if an older DB system is used) + - Possible error on Linux systems related to `nifi.properties` permission error and/or other files from the `nifi/conf/` folder, please see the [nifi doc](../nifi/main.md#important-note-about-nifi-properties) {nifi.properties} section. +

+ - `Driver class org.postgresql.Driver is not found`, or something similar for other MSSQL/SQL drivers: this is a known issue after NiFi version v1.20+. First, make sure you pull the latest version of the repository. Then, for the JAR file you are using, execute the following command to verify its integrity: `jar -tvf ./nifi/drivers/your_file_version.jar`. If this returns a list of files and NO errors, the file is not corrupted and can be loaded. On the NiFi side, go to the `DBCPConnectionPool` controller service and verify the properties a few times; make sure the file path is correct and in the following format: `file:///opt/nifi/drivers/postgresql-42.6.0.jar`, for example. If all this fails, stop NiFi, delete all the Docker volumes associated with it, restart NiFi and perform the above steps again. You can try forcefully starting the `GenerateTableFetch` or `QueryDatabaseTable` processors by enabling the `DBCPConnectionPool` even if an error pops up after clicking the verify button. +

+ - `502 Bad Gateway`, or NiFi simply not starting even after waiting more than 2-3 minutes. This can have a wide variety of causes; check the NiFi container log with `docker logs -f --tail 1000 cogstack-nifi > my_log_file.txt` to capture the output easily. The most common cause is running out of memory: increase or decrease the limits in `nifi/conf/bootstrap.conf` according to your machine's spec, see [bootstrap.conf](../nifi/main.md#bootstrapconf). +

+ - `Unable to connect to ElasticSearch` using the `ElasticSearchClientService` NiFi controller: make sure the settings are correct (username, password, certificates, etc.) and click `Apply`. Disregard the errors and click `Enable` on the controller to forcefully reload it, then stop it, validate the settings and start it again; it should now work. + +### **Elasticsearch Errors** + +#### **VM memory errors, failed bootstrap check** + +It is quite common for both OpenSearch and native Elasticsearch to error out on virtual memory allocation. The error typically comes in the form of: + +```bash +ERROR: [1] bootstrap checks failed +[1]: max virtual memory areas vm.max_map_count [65111] is too low, increase to at least [262144] +``` + +To solve this, simply execute: +
+ - on Linux/Mac OS X : + ```sysctl -w vm.max_map_count=262144``` in terminal. + To make the same change systemwide please add ```vm.max_map_count=262144``` to /etc/sysctl.conf and restart the Docker service/machine. + An example of this can be found under /services/elasticsearch/sysctl.conf +
+ - on Windows you need to enter the following commands in a powershell instance: +
+ ```wsl -d docker-desktop``` +
+ ```sysctl -w vm.max_map_count=262144``` + +For more on this issue please read: https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html + +
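As a rough pre-flight check before starting the Elasticsearch/OpenSearch containers, the comparison behind the bootstrap check above can be sketched as follows. This is a minimal sketch under stated assumptions, not part of the repository: the helper names `needs_raise` and `check_max_map_count` are hypothetical, and it assumes a Linux host (or the `docker-desktop` WSL instance) where `/proc/sys/vm/max_map_count` is readable.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of CogStack-NiFi.
# Minimum value quoted in the bootstrap check error message.
REQUIRED=262144

# Succeeds (exit status 0) when the current value is below the required minimum.
needs_raise() {
    local current="$1" required="$2"
    [ "$current" -lt "$required" ]
}

# Reads the kernel setting and prints a hint; never changes the system itself.
check_max_map_count() {
    local proc_file="/proc/sys/vm/max_map_count"
    if [ ! -r "$proc_file" ]; then
        echo "cannot read $proc_file (not a Linux host?), skipping check"
        return 0
    fi
    local current
    current="$(cat "$proc_file")"
    if needs_raise "$current" "$REQUIRED"; then
        echo "vm.max_map_count=$current is too low; run: sudo sysctl -w vm.max_map_count=$REQUIRED"
    else
        echo "vm.max_map_count=$current is sufficient"
    fi
}

check_max_map_count
```

The script only reports the current state; applying the change is left to the `sysctl -w` command shown above so that nothing is modified without the operator's consent.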
+ +#### **OpenSearch: validating opensearch.yml hosts** + +```bash +FATAL Error: [config validation of [opensearch].hosts]: types that failed validation: +- [config validation of [opensearch].hosts.0]: expected URI with scheme [http|https]. +- [config validation of [opensearch].hosts.1]: could not parse array value from json input +``` + +This issue may appear after the recent switch to using fully customizable environment variables. Strings and ENV vars may be parsed differently depending on the shell version found on the host system. + +To solve this, the easiest way is to make sure to load the `elasticsearch.env` variables before starting the Elastic & Kibana containers by doing the following: + +```bash + cd ./deploy/ + set -a + source elasticsearch.env + make start-elastic +``` + +Alternatively (if the script executes without issues): + +```bash + cd ./deploy/ + source export_env_vars.sh + make start-elastic +``` + +### DB-samples issues + +```bash +No table data for samples_db +``` + +It is possible that you may have forgotten to pull the large files from the repo; if so, run `git lfs pull`. + +Delete the samples-db container and its volumes and restart it; you should now see the data in the tables. diff --git a/docs/index.rst b/docs/index.rst index 300a7dc8a..001b209e0 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -15,10 +15,10 @@ Welcome to CogStack-Nifi's documentation! nifi/main.md security/main.md deploy/main.md - deploy/services.md + deploy/deployment.md + deploy/troubleshooting.md deploy/workflows.md - Indices and tables ================== diff --git a/docs/nifi/main.md b/docs/nifi/main.md index dbbfb0375..42b1ae1fc 100644 --- a/docs/nifi/main.md +++ b/docs/nifi/main.md @@ -1,4 +1,5 @@ -# NiFi +# 💧 NiFi + This directory contains files related with our custom Apache NiFi image and example deployment templates with associated services. 
Apache NiFi is used as a customizable data pipeline engine for controlling and executing data flow between used services. There are multiple workflow templates provided with custom user scripts to work with NiFi. @@ -16,7 +17,7 @@ Please read the following [article](https://nifi.apache.org/docs/nifi-docs/html/ Avro Schema:[official documentation](https://avro.apache.org/docs/1.11.1/) -## `NiFi directory layout : /nifi` +## `NiFi directory layout : /nifi` ``` ├── Dockerfile - contains the base definition of the NiFi image along with all the packages/addons installed diff --git a/docs/security/elasticsearch_opensearch.md b/docs/security/elasticsearch_opensearch.md index bacac0374..fde8a0aba 100644 --- a/docs/security/elasticsearch_opensearch.md +++ b/docs/security/elasticsearch_opensearch.md @@ -38,6 +38,20 @@ cd ../security --- +### ⚙️ Version variable + +Set the ES/OS version in `deploy/elasticsearch.env` before launching containers: + +```bash +ELASTICSEARCH_VERSION=opensearch +# or +ELASTICSEARCH_VERSION=elasticsearch +``` + +This ensures the correct certificate directory (`elasticsearch` or `opensearch`) is mounted into containers. + +--- + ### 🧩 Common certificate layout Certificate naming and folder structure are consistent across both ES and OpenSearch: @@ -124,20 +138,6 @@ security/certificates/elastic/opensearch/ --- -## ⚙️ Version variable - -Set the ES/OS version in `deploy/elasticsearch.env` before launching containers: - -```bash -ELASTICSEARCH_VERSION=opensearch -# or -ELASTICSEARCH_VERSION=elasticsearch -``` - -This ensures the correct certificate directory (`elasticsearch` or `opensearch`) is mounted into containers. - ---- - ### πŸ“ Kibana / OpenDashboard certificates | Platform | Required Certificates | Source Folder |