diff --git a/README.md b/README.md
index ced0a91f..e9214e4b 100644
--- a/README.md
+++ b/README.md
@@ -10,7 +10,7 @@ This repository proposes a possible next step in the evolution of free-text data
 **CogStack-NiFi** demonstrates how to use [Apache NiFi](https://nifi.apache.org/) as the central data workflow engine for clinical document processing, integrating services such as text extraction and natural language processing (NLP). Each component runs as a standalone service, with NiFi handling data routing between components and data sources/sinks.
-All NLP services are expected to implement a uniform RESTful API, allowing seamless integration into existing pipelines—making it easy to incorporate any NLP application into the stack.
+All NLP/ML/DATA services are expected to implement a uniform RESTful API, allowing seamless integration into existing pipelines—making it easy to incorporate any NLP application into the stack.
 ---
@@ -48,13 +48,13 @@ Need help? Feel free to:
 **Prerequisites**:
 - Docker (mandatory)
-- Basic knowledge of Python and Linux/UNIX systems
+- Basic knowledge of Python and Linux/UNIX systems (Bash (simple commands only, we promise))
 📖 Official documentation: [cogstack-nifi.readthedocs.io](https://cogstack-nifi.readthedocs.io/en/latest/)
 🚀 New to the project? Start with the [deployment guide](https://cogstack-nifi.readthedocs.io/en/latest/deploy/main.html) for example setups and workflows.
-🐞 For troubleshooting or bug reports, consult the [Known Issues section](https://cogstack-nifi.readthedocs.io/en/latest/deploy/main.html) before opening a ticket.
+🐞 For troubleshooting or bug reports, consult the [known issues section](https://cogstack-nifi.readthedocs.io/en/latest/deploy/troubleshooting.html) before opening a ticket.
 ---
diff --git a/deploy/Makefile b/deploy/Makefile
index 53f9812f..aff95d4a 100644
--- a/deploy/Makefile
+++ b/deploy/Makefile
@@ -73,7 +73,7 @@ start-medcat-service-deid:
 	$(WITH_ENV) docker compose -f ../services/cogstack-nlp/medcat-service/docker/docker-compose.yml $(DC_START_CMD) nlp-medcat-service-production-deid
 start-medcat-trainer:
-	$(WITH_ENV) docker compose -f../services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml $(DC_START_CMD) medcattrainer nginx solr
+	$(WITH_ENV) docker compose -f ../services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml $(DC_START_CMD) medcattrainer nginx solr
 start-production-db:
 	$(WITH_ENV) docker compose -f services.yml ${DC_START_CMD} cogstack-databank-db
@@ -136,7 +136,7 @@ stop-jupyter:
 	$(WITH_ENV) docker compose -f ../services/cogstack-jupyter-hub/docker/docker-compose.yml $(DC_STOP_CMD) cogstack-jupyter-hub
 stop-medcat-trainer:
-	$(WITH_ENV) docker compose -f../services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml $(DC_STOP_CMD) medcattrainer nginx solr
+	$(WITH_ENV) docker compose -f ../services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml $(DC_STOP_CMD) medcattrainer nginx solr
 stop-medcat-service:
 	$(WITH_ENV) docker compose -f ../services/cogstack-nlp/medcat-service/docker/docker-compose.yml $(DC_STOP_CMD) nlp-medcat-service-production
diff --git a/deploy/services.yml b/deploy/services.yml
index c12f319f..362308a5 100644
--- a/deploy/services.yml
+++ b/deploy/services.yml
@@ -215,7 +215,7 @@ services:
 - databank-vol:/var/lib/postgresql/data
 command: postgres -c "max_connections=${POSTGRES_DB_MAX_CONNECTIONS:-100}"
 ports:
-- 5556:5432
+- 5558:5432
 expose:
 - 5432
 networks:
diff --git a/docs/deploy/services.md b/docs/deploy/services.md
index e3f68860..4568ccf0 100644
--- a/docs/deploy/services.md
+++ b/docs/deploy/services.md
@@ -1,642 +1,622 @@
-# Available Services
-This file covers the available services in the example deployment.
+# 📦 Services
-Apache NiFi-related files are provided in `../nifi` directory.
-
-Please note that all the services are deployed using [Docker](https://docker.io) engine and it needs to be present in the system.
-Please see [example deployment](main.md) for more details on the used services and their configuration.
+This section provides a complete overview of all services included in the CogStack-NiFi deployment.
+All services run in Docker and interact within a shared internal Docker network.
-## Overview
+---
-The below image sums up how CogStack services work with eachother in an environment where all available components are used.
+## 📊 Overview
+
+Below is a high-level architecture diagram illustrating how CogStack services communicate when all components are enabled:
 ![nifi-services](../_static/img/nifi_services.png)
-## Primary services
-All the services are defined in `services.yml` file and these are:
-- `samples-db` - a PostgreSQL database with sample data to play with,
-- `cogstack-databank-db` - production PostgreSQL database, has it's own scripts in `/services/cogstack-db/pgsql`
-- `cogstack-databank-db-mssql` - production MSSQL database, has it's own scripts in `/services/cogstack-db/mssql`, this is just an alternative, needs a license.
-- `nifi` - a single instance of Apache NiFi processor (with Zookeper embedded) with exposing a web user interface,
-- `nifi-nginx` - used for reverse proxy to enable secure access to NiFi and other services.
-- `tika-service` - the [Apache Tika](https://tika.apache.org/) running as a web service (see: [Tika Service repository](https://github.com/CogStack/tika-service/)).
-- `ocr-service-1/ocr-service-2` - the new OCR text extraction tool that is a replacement of `tika-service`.
-- `nlp-gate-drugapp` - an example drug names extraction NLP application using [GATE NLP Service runner exposing a REST API](https://github.com/CogStack/gate-nlp-service),
-- `nlp-medcat-service-production` - [MedCAT](https://github.com/CogStack/MedCAT) NLP application running as a [web Service](https://github.com/CogStack/MedCATservice) and using an example model trained on [Med-Mentions](https://github.com/chanzuckerberg/MedMentions) corpus,
-- `medcat-trainer-ui` - [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) web application used for training and refining MedCAT NLP models,
-- `medcat-trainer-nginx` - a [NGINX](https://www.nginx.com/) reverse-proxy for MedCAT Trainer,
-- `elasticsearch-1/elasticsearch-2` - a two-node cluster of Elasticsearch based on [OpenSearch for Elasticsearch](https://opensearch.org/) distribution,
-- `metricbeat` - Elasticsearch Native only cluster monitoring service
-- `filebeat` - log ingestion service for ElasticSearch Native
-- `kibana` - Kibana user-interface based on [OpenSearch for Elasticsearch](https://opensearch.org/docs/latest/dashboards/index/) distribution,
-- `jupyter-hub` - a single instance of [Jupyter Hub](https://jupyter.org/hub) for serving Jupyter Notebooks for interacting with the data.
-- `git-ea` - Github-like web service, you can host your own repositories here if your organisation is strict security-wise
-
-**IMPORTANT**
-Please note that some of the necessary configuration parameters, variables and paths are also defined in the [`services.yml`](https://github.com/CogStack/CogStack-NiFi/tree/main/deploy/services.yml) file.
-
-## Optional NLP services
-In addition, there are defined such NLP services:
-- `nlp-medcat-service-production` serving SNOMED CT model,
-- `nlp-gate-bioyodie` - same as `nlp-gate-drugapp` but serving [Bio-YODIE](https://github.com/GateNLP/Bio-YODIE) NLP application.
-
-These services are optional and won't be started by default.
-They were left in the `services.yml` file for informative purposes if one would be interested in deploying these having access to necessary resources.
-
-## Security
-**Important**
-Please note that for the demonstration purposes, the services are run with default built-in usernames / passwords.
-Moreover, SSL encryption is also disabled or not set up in the configuration files.
-For more information please see the [security](../security.md)
-
-## Deployment
-The example deployment recipes are defined in `Makefile` file.
-The commands that start services are prefixed with `start-` keyword, similarly the ones to stop are prefixed with `stop`.
-
-## Data ingestion and storage infrastructure
-To deploy the data ingestion and storage infrastructure, type:
-```
-make start-data-infra
-```
+---
-The command will deploy services: `nifi`, `elasticsearch-1`, `kibana`, `tika-service`, `samples-db`.
-Please see below the description of the services with the information on the accessibility.
+## 🧩 Primary Services
-To stop the services, type:
-```
-make stop-data-infra
-```
+The core services defined in `services.yml` include:
-## Cleanup
-To tear down all the containers and the data persisted in mounted volumes, type:
-```
-make cleanup
+- **samples-db** — PostgreSQL database populated with demo datasets.
+- **cogstack-databank-db / cogstack-databank-db-mssql** — Production-grade PostgreSQL and optional MSSQL instances.
+- **elasticsearch-1 / elasticsearch-2 / elasticsearch-3** — Multi-node Elasticsearch or OpenSearch cluster.
+- **metricbeat / filebeat** — Elastic monitoring and log forwarder services.
+- **nifi** — Apache NiFi single-node instance with embedded ZooKeeper.
+- **nifi-nginx** — Reverse proxy providing secure access to NiFi.
+- **ocr-service / ocr-service-text-only** — High-performance Python OCR and text extraction services.
+- **nlp-medcat-service-production** — MedCAT NLP model service with REST API.
+- **medcat-trainer-ui / medcat-trainer-nginx** — Web UI and reverse proxy for model training and refinement.
+
+- **kibana** — OpenSearch Dashboards UI.
+- **jupyter-hub** — Fully featured data science interface.
+- **git-ea** — Self‑hosted Git service (Gitea).
+
+> **Note:** Important configuration options and environment variables for these services are managed in `services.yml` and the associated `.env` files under `deploy/` and `security/`.
+
+## 🗂️ Service Definitions
+
+All core services are defined in:
+
+```bash
+deploy/services.yml
 ```
-## Services & definition description
-All the essential details on the services configuration are defined in `services.yml` file.
+They run inside the internal Docker network `cognet`.
+Some services expose ports to the host for convenience.
-Please note that all the services are running within a private `cognet` Docker network hence the endpoints are all accessible within the deployed services.
-However, for the ease of use, some of the services have their ports bound from container to the host machine.
+---
+## 🗣️ NLP/OCR and other services API Endpoints
-## NLP services
+Most web ETL & data-enrichment API services that we use will offer the following endpoints for querying.
-**Important**
-
-Please note that `nlp-medcat-service-production` and `nlp-gate-bioyodie` NLP services use license-restricted resources and these need to be provided by the user prior running these services.
-Bio-YODIE requires [UMLS](https://www.nlm.nih.gov/research/umls/index.html) resources that need to be provided in the `RES_BIOYODIE_UMLS_PATH` directory.
-MedCAT SNOMED CT model requires a prepared model based on [SNOMED CT](http://www.snomed.org/) dictionary with the model available in `RES_MEDCAT_SERVICE_MODEL_PRODUCTION_PATH` directory.
-These paths can be defined in `.env` file in the deployment directory.
+- **GET** `/api/info`
+- **POST** `/api/process`
+- **POST** `/api/process_bulk`
+Useful for NiFi workflows (see `workflows.md`).
-### MedCAT
-[MedCAT](https://github.com/CogStack/MedCAT) is a named entity recognition and linking application for concept annotation from UMLS or any other source.
-MedCAT is deployed as a service exposing RESTful API using the implementation from [MedCATservice](https://github.com/CogStack/MedCATservice).
+---
-### MedCAT Service
+## 🧬 MedCAT Service
-MedCAT Service resources are stored in [`./services/nlp-services/applications/medcat/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat) directory.
-The key configuration properties stored as environment variables are defined in [`./services/nlp-services/applications/medcat/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat/config) sub-directory.
-The models used by MedCAT are stored in [`./servies/nlp-services/applications/cat/models/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications/medcat/models).
-A default model to play with is provided, called `MedMen` and there is a script `./services/nlp-services/applications/medcat/models/download_medmen.sh` to download it, please make sure you are in the `./services/nlp-services/applications/medcat/models/` before executing the download script.
+Runs a REST API for model inference using the [MedCAT library](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-v2), which performs clinical concept extraction and linking.
-For more information on the MedCAT Service configuration and use please refer to [the official documentation](https://github.com/CogStack/MedCATservice).
+The service has two operation modes:
-**Important**
-For the example deployment we provide a simple and publicly available MedCAT model.
-However, custom and more advanced MedCAT models can be used based on license-restricted terminology dictionaries such as [UMLS](https://www.nlm.nih.gov/research/umls/index.html) or [SNOMED CT](http://www.snomed.org/).
-Which model is being used by the deployed MedCAT Service is defined both in the MedCAT Service config file and the deployment configuration file (see: [deploy](main.md)).
+- concept detection: extracts medical concepts and outputs the original text plus a list of annotations.
+- de-id mode (aka AnonCAT mode), for de-identifying documents: outputs de-identified text (a future version will also output annotations describing what was de-identified).
+### Access
-To deploy MedCAT application stack, type:
-```
-make start-nlp-medcat
-```
-The command will deploy MedCAT NLP service ` nlp-medcat-service-production` with related MedCAT Trainer services `medcat-trainer-ui`, `medcat-trainer-nginx`.
-Please see below the description of the deployed NLP services.
+- `https://localhost:5555/api/info` - NER container, check if model loads successfully
+- `https://localhost:5556/api/info` - DE-ID/AnonCAT container
-To stop the services, type:
-```
-make stop-nlp-medcat
-```
+### Containers
-#### ENV/CONF files:
-- `/service/nlp-services/applications/medcat/config/env_app` - settings specifically related to the medcat service app, such as model(pack) file location(s)
-- `/service/nlp-services/applications/medcat/config/env_medcat` - medcat specific settings
+- `cogstack-medcat-service-production` - for concept NER
+- `cogstack-medcat-service-production-deid` - for DE-ID/AnonCAT
-## Jupyter Hub
-To deploy Jupyter Hub, type:
-```
-make start-jupyter
-```
-Please see below the description of the Jupyter Hub.
+### Service location & files
-To stop the services, type:
-```
-make stop-jupyter
-```
-### ENV/CONF files:
-- `/deploy/jupyter.env`
+- dir: `/services/cogstack-nlp/medcat-service/`
+- docker compose file: `/services/cogstack-nlp/medcat-service/docker/docker-compose.yml`
+- env: located in `services/cogstack-nlp/medcat-service/env/`
-## Database Stack
+  ```bash
+  app.env - controls APP settings (number of cpus used, log level, etc) used by the NER container cogstack-medcat-service-production
+  medcat.env - used by the NER container, controls MedCAT settings directly.
+  app_deid.env - used by the DE-ID container, same app setting control, the main difference being the `APP_DEID_MODE`.
+  medcat_deid.env - used by the DE-ID container, controls MedCAT settings directly
+  ```
-The samples DB uses PgSQL, but we also provide an MSSQL instance (no data on it however), that can be used in prod environments.Please see [the workflows section](workflows.md#configuring-db-connector) about how to configure the difference controllers and DB drivers.
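As a rough illustration of the MedCAT Service access URLs and API endpoints documented in this diff, the sketch below builds (but does not send) requests for `/api/info` and `/api/process`. The helper names and the JSON body shape `{"content": {"text": ...}}` are assumptions made for illustration and are not taken from this change; the service's own README is the place to confirm the real request schema.

```python
import json
from urllib.request import Request

# Host ports as listed in the MedCAT Service access section of this diff.
NER_BASE_URL = "https://localhost:5555"
DEID_BASE_URL = "https://localhost:5556"


def build_info_request(base_url: str) -> Request:
    """Build (without sending) a GET request for the /api/info endpoint."""
    return Request(f"{base_url}/api/info", method="GET")


def build_process_request(base_url: str, text: str) -> Request:
    """Build (without sending) a POST request for the /api/process endpoint.

    ASSUMPTION: the body shape {"content": {"text": ...}} is illustrative
    only; check the service README for the actual schema.
    """
    body = json.dumps({"content": {"text": text}}).encode("utf-8")
    return Request(
        f"{base_url}/api/process",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually call a running container, one of these objects could be passed to `urllib.request.urlopen`; a deployment using self-signed certificates would likely also need a suitable SSL context, which is left out here.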
+### Ports
+| Service | External Port | Internal Port |
+|--------------------|---------------|----------------|
+| NER (MedCAT) | `5555` | `5000` |
+| DE-ID / AnonCAT | `5556` | `5000` |
-### Samples DB
-`samples-db` provides a [PostgreSQL](https://www.postgresql.org/) database that contains sample data to play with.
-During start-up the data is loaded from a previously generated DB dump.
+### Models
-All the necessary resources, data and scripts are stored in `pgsamples/` directory.
-During the service initialization, the script `init_db.sh` will populate the database with sample data read from a database dump stored in `db_dump` directory.
-The directory [`./services/pgsamples/scripts`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/pgsamples/scripts) contains SQL schemas with scripts that will generate the database dump using sample data.
+- A default MedMentions `MedMen` NER+L model (includes MetaCAT models) is available for public use but needs to be downloaded.
+- To download a model, head to the service's scripts directory: `services/cogstack-nlp/medcat-service/scripts`
+- Execute: `bash download_medmen.sh`, then wait for the download to complete.
-When deployed the PostgreSQL database is exposed at port `5432` of the `samples-db` container.
-The port is also bound from container to the host machine `5555` port.
-The example data is stored in `db_samples` database.
-Use user `test` with password `test` to connect to it.
+### README
-For an example deployment, a PostgreSQL database that contains some example data to play with was generated [synthetic records](https://github.com/synthetichealth/synthea) enrinched with free-text from [MTSamples](https://www.mtsamples.com/).
-The free-text sample data is based on [MT Samples](https://www.mtsamples.com/) dataset with the structured fields generated by [Synthea](https://github.com/synthetichealth/synthea).
+Please check the service's own [README.md](https://github.com/CogStack/cogstack-nlp/tree/main/medcat-service)
-The tables available in the database are:
-- `patients` - structured patient information,
-- `encounters` - structured encounters information,
-- `observations` - structured observations information,
-- `medical_reports_raw` - free-text documents in raw format (PDFs) `(*)`,
-- `medical_reports_text` - free-text documents in clean, text format `(*)`,
-- `medical_reports_processed` - for storing processed documents, empty `(*)`,
-- `annotations_medcat` - for storing extracted MedCAT annotations, empty.
+---
-The tables used in the deployment example are marked with `(*)`.
+## 🛠️ MedCAT Trainer
+Provides UI workflows for annotation, correction, and iterative model training.
-#### ENV/CONF files:
-- `/deploy/database.env` - currently only basic stuff like DB users/passwords are included
+### Access
-### Cogstack-db
-This is a general database provided for production, it does not have any data in it beyond the defined cogstack_schema (this is not yet present) and annotation_schema.
-Provided for both PGSQL and MSSQL.
+- `https://localhost:8001`
-In the future the `${DB_PROVIDER}` will be an environment variable that will take into account the db-provider you can select, possible values [`mssql`,`pgsql`]
+### Containers
-By default all the `.sql` files beginning with `annotations*` and `cogstack*` prefix in the `services/cogstack-db/${DB_PROVIDER}/schemas/` will be loaded.This is defined in the `services/cogstack-db/${DB_PROVIDER}/init_db.sh`.There should not be a need to change them as users can simply name their schemas accordingly.Place the desired `sql` files in the `schemas` folder and it will be picked up.To debug any issues with the container or with the SQL scripts please run the startup commands separately `docker-compose -f services.yml up cogstack-databank-db` or `docker-compose -f services.yml cogstack-databank-db-mssql` while in the `deploy/` folder.
+- `medcattrainer`
+- `medcattrainer_nginx`
+- `mct_solr`
-MSSQL note
-The MSSQL container will require license activation for production as per [Microsoft's guideline](https://hub.docker.com/_/microsoft-mssql-server), setting the `MSSQL_PID` env variable to the correct license PID key should activate the product.
+### Service location & files
+- dir: `services/cogstack-nlp/medcat-trainer/`
+- docker compose file: `services/cogstack-nlp/medcat-trainer/docker-compose-prod.yml`
+- env: `services/cogstack-nlp/medcat-trainer/envs/env-prod`
-### ENV/CONF files:
-- `/deploy/database.env` - currently only basic stuff like DB users/passwords are included
+### Ports
-## Apache NiFi
-`nifi` serves a single-node instance of Apache NiFi that includes the data processing engine with user interface for defining data flows and monitoring.
-Since this is a single-node NiFi instance, it also contains the default, embedded [Apache Zookeper](https://zookeeper.apache.org/) instance for managing state.
+- external: `8001`
-`nifi` container exposes port `8443` which is also bound to the host machine on port 8082.
-
+### README
-`nifi-nginx` contianer exposes the 8443 port directly, reverser-proxying the connection to nifi.
-The Apache NiFi user interface can be hence accessed by navigating on the host (e.g.`localhost`) machine at `http://localhost:8443`.
+Please check the service's own [README.md](https://github.com/CogStack/cogstack-nlp/blob/main/medcat-trainer/README.md) file and [docs](https://docs.cogstack.org/projects/medcat-trainer/en/latest/).
-In this deployment example, we use a custom build Apache NiFi image with example user scripts and workflow templates.
-For more information on configuration, user scripts and user templates that are embeded with the custom Apache NiFi image please refer to the [nifi](../nifi/main.md).
-The available example workflows are covered in [workflows](./workflows.md)
-Alternatively, please refer to [the official Apache NiFi documentation](https://nifi.apache.org/) for more details on actual use of Apache NiFi.
+---
-### ENV/CONF files:
-- `/deploy/nifi.env` - most notable settings are related to port mapping and proxy
-- `/security/certificates_nifi.env` - define NiFi certificate settings here
-- `/security/nifi_users.env` - defines the NiFi user credentials for single user auth & others
-More configuration options are covered in [nifi-doc](../nifi/main.md).
+## 📚 Jupyter Hub
-Other `.env` files are mounted but those are only useful for custom scripts where you plan to use certain vars from other services, check the `services.yml` nifi `env-file` section definition.
+A multi-user JupyterHub instance deployed via Docker.
-## OCR Service
+### Access
-The new `ocr-service` provides a new way to OCR documents at good speed, the equivalent in Tika-service but revwritten in Python and optimized.
+- `https://localhost:8888`
-`ocr-service-1` - this container is used for OCR
-`ocr-service-2` - this container is used for NON-OCR, meaning documents will simply have their text extracted if they contain text without images
+### Containers
-### ENV/CONF files:
-- `/deploy/ocr_service.env` - for `ocr-service-1`
-- `/deploy/ocr_service_text_only.env` - for `ocr-service-2`, NON-OCR instance
+- `cogstack-jupyter-hub`
+- `cogstack-jupyter-singleuser-` (per user container started by each user once hub is up)
-**IMPORTANT**
-All settings are decribed [here](https://github.com/CogStack/ocr-service/blob/master/README.md).
+### Service location & files
+- dir: `services/cogstack-jupyter-hub/`
+- docker compose file: `services/cogstack-jupyter-hub/docker/`
+- env: `services/cogstack-jupyter-hub/env/jupyter.env`
-## NLP Services
+### Supports
-In the example deployment we use NLP applications running as a service exposing REST API.
-The current version of API specs is specified in [`./services/nlp-services/api-specs/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/api-specs) directory (both [Swagger](https://swagger.io/) and [OpenAPI](https://www.openapis.org/) specs).
-The applications are stored in [`./services/nlp-services/applications`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services/applications).
+- Per-user containers
+- CPU/RAM limits (via `services/cogstack-jupyter-hub/env/jupyter.env`)
+- Optional GPU support
+- Notebook image selection
-### NLP API
-All the NLP services implement a RESTful API that is defined in [OpenAPI specification](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/api-specs/openapi.yaml).
+### Ports
-The available endpoints are:
-- **GET** `/api/info` - for displaying general information about the used NLP application,
-- **POST** `/api/process` - for processing text documents (single document mode),
-- **POST** `/api/process_bulk` - for processing multiple text documents (bulk mode).
+| Component | External Port | Internal Port(s) |
+|-------------|---------------|------------------|
+| JupyterHub | `8888` | `8087`, `443` |
-When plugging-in the NLP services into Apache NiFi workflows, the endpoint for processing single or multiple documents will be used to extract the annotations from documents.
-Please see example Apache NiFi [workflows](./workflows.md) and [user scripts](https://github.com/CogStack/Cogstack-Nifi/nifi/user-scripts) on using and parsing the payloads with NiFi.
+### README
-For further details on the used API please refer to the [OpenAPI specification](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/api-specs/openapi.yaml) for the definition of the request and response payload.
+Please check the service's own [README.md](https://github.com/CogStack/cogstack-jupyter-hub/blob/main/README.md) file.
-### MedCAT NLP
-[MedCAT](https://github.com/CogStack/MedCAT) is a named entity recognition and linking application for concept annotation from UMLS or any other source.
-MedCAT deployment consists of [MedCAT NLP Service](https://github.com/CogStack/MedCATservice) serving NLP models via RESTful API and [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) for collecting annotations and refinement of the NLP models.
+---
-### MedCAT Service
-` nlp-medcat-service-production` serves a basic UMLS model trained on MedMentions dataset via RESTful API.
-The served model data is available in [`./services/nlp-services/applications/medcat/models/medmen/`](https://github.com/CogStack/CogStack-Nifi/services/nlp-services/applications/medcat/models/medmen`) directory.
+## 🧪 Samples DB (PostgreSQL)
-When deployed ` nlp-medcat-service-production` exposes port `5000` on the container and binds it to port `5000` on the host machine.
-For example, to access the API endpoint to process a document by a service from `cognet` Docker network, the endpoint address would be `http:// nlp-medcat-service-production:5000/api/process`.
+Demo dataset with:
-As a side note, when deployed `nlp-medcat-service-production` (assuming that the MedCAT SNOMED CT model is available and set via `RES_MEDCAT_SERVICE_MODEL_PRODUCTION_PATH` variable), the service will only expose port `5000` on container.
-Although the service won't be accessible from the host machine, but all the services inside the `cognet` network will be able to access it.
+- patients
+- encounters
+- observations
+- raw medical reports
+- cleaned reports
+- annotation tables
-For more information on the MedCAT NLP Service configuration and use please refer to [the official documentation](https://github.com/CogStack/MedCATservice).
+### Access
+- `localhost:5555`
-#### ENV/CONF files:
-- `/service/nlp-services/applications/medcat/config/env_app` - settings specifically related to the medcat service app, such as model(pack) file location(s)
-- `/service/nlp-services/applications/medcat/config/env_medcat` - medcat specific settings
+### Ports
+- external: `5432`
+- internal: `5432`
-### MedCAT Trainer
-Apart from MedCAT Service, there is provided [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) that serves a web application used for training and refining MedCAT NLP models.
-Such trained models can be later saved as files and loaded into MedCAT Service.
-Alternatively, the models can be loaded into custom application.
+### Credentials
-`medcat-trainer-ui` serves the MedCAT Trainer web application used for training and refining MedCAT NLP models.
-Such trained models can be later saved as files and loaded into MedCAT Service.
-Alternatively, the models can be loaded into custom application.
+- user - `test`, password - `test`
-As a companion service, `medcat-trainer-nginx` serves as a NGINX reverse-proxy for providing content from MedCAT Trainer web service.
+---
-When deployed, `medcat-trainer-ui` exposes port `8000` on the container.
-`medcat-trainer-nginx` exposes port `8000` on the container and binds it to port `8001` on the host machine - it proxies all the requests to the MedCAT Trainer web service.
-The NGINX configuration is stored in [`./services/medcat-trainer/nginx`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/medcat-trainer/nginx) directory.
+## 🏦 Cogstack databank production DB (Production only: PgSQL, MSSQL)
-To access the MedCAT Trainer user interface and admin panel, one can use the default built-in credentials: user `admin` with password `admin`.
+Empty database for production ingestion pipelines.
+Supports both PostgreSQL and MSSQL.
-For more information on the MedCAT Trainer configuration and use please refer to [the official documentation](https://github.com/CogStack/MedCATtrainer).
+Place schema files inside and they will be loaded instantly on container startup:
-MedCAT Trainer resources are stored in [`./services/medcat-trainer`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nlp-services//medcat-trainer) directory.
-The key configuration is stored in [`./services/medcat-trainer/env`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/medcat-trainer/envs/env) file.
+```bash
+services/cogstack-db//schemas/
+```
+Where `` can be: `mssql`,`pgsql`.
+### Credentials
-## ELK stack
+- PgSQL: user - `admin` password - `admin`
+- MsSQL: user - `admin` password - `admin!COGSTACK2022`
-There are two types of Elasticsearch versions available, apart from the native one there is a also OpenSearch, which is a fork of the original but developed & maintained by Amazon as an opensource alternative.
+### Access -The example deployment uses [ELK stack](https://www.elastic.co/what-is/elk-stack) from [OpenSearch for Elasticsearch](https://opensearch.org/) distribution. -OpenSearch for Elasticsearch is a fully open-source, free and community-driven fork of Elasticseach. -It implements many of the commercial X-Pack components functionality, such as advanced security module, alerting module or SQL support. -Nonetheless, the standard core functionality and APIs of the official Elasticsearch and OpenSearch remain the same. -Hence, OpenSearch can be used as a drop-in replacement for the standard ELK stack. +- PgSQL: `localhost:5558` β†’ container `5432` +- MSSQL: `localhost:1443` β†’ container `1433` -The names of the services within the NiFi project are the same even though they have different names, we will refer to original Elasticsearch as ES native in the documentation. +### Containers -Services names Elasticsearch | OpenSearch : +- PgSQL: `cogstack-databank-db` +- MSSQL: `cogstack-databank-db-mssql` - - Elasticsearch <-> OpenSearch - - Kibana <-> OpenSearch Dashboards +### Service location & files -In essence the configuration is very similar, however, there are a few differences: +- docker compose file: `services.yml` +- dir: `services/cogstack-db/` +- env: + - `security/users/users_database.env` - controlers DB user credentials + - `deploy/database.env` - general DB configs -| | Elasticsearch Native | OpenSearch | -| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------- | -| Subscription | paid licensing, will require [subscription](https://www.elastic.co/subscriptions), 30-day free trial available | Free | -| Plugins | Xpack (native), analysis-icu & elastiknn (3rd party), for more check this [link](https://www.elastic.co/guide/en/elasticsearch/plugins/8.9/index.html). 
| Xpack | -| Security | AD/LDAP/AWS/OpenID/Native auth | AD/LDAP/AWS/OpenID auth | +### Ports +| Database | External Port | Internal Port | +|----------|---------------|---------------| +| PgSQL | `5558` | `5432` | +| MSSQL | `1433` | `1433` | +--- +## 💧 Apache NiFi & NiFi Registry -**Important** -Please note that for the demonstration purposes SSL encryption has been disabled in Elasticsearch and Kibana. -For enabling it and generating self-signed certificates please refer directly to the `services.yml` file and [security.md](../security.md) in `docs` directory. -The security aspects are covered expensively in [the official OpenSearch for Elasticsearch documentation](https://opensearch.org/). +Primary ETL/processing engine. +This service is complex and is completely described in [this section](../nifi/main.md). -### Elasticsearch / Opensearch -Elasticsearch cluster is deployed as a single-node cluster with `elasticsearch-1` service. -It exposes port `9200` on the container and binds it to the same port on the host machine. -The service endpoint should be available to all the services running inside the `cognet` Docker network under address `http://elasticsearch-1:9200`. -The default user is : `admin` and password `admin`. -In the example deployment, the default, built-in configuration file is used with selected configuration options being overridden in `services.yml` file. -However, for manual tailoring the available configuration parameters are available in the `elasticsearch.yml` [configuration file](https://github.com/CogStack/CogStack-Nifi/services/elasticsearch/config/elasticsearch.yml). +### Credentials -For more information on use of Elasticsearch please refer either to [the official Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html) or [the official OpenSearch for Elasticsearch documentation](https://opensearch.org/).
+- PgSQL: user - `admin` password - `cogstackNiFi` +### Access -### Kibana / Opensearch-Dashboard -`kibana` service implements the Kibana user interface for interacting with the data stored in Elasticsearch cluster. -It exposes port `5601` on the container and binds it to the same port on the host machine. -To access Kibana user interface from web browser on the host (e.g.`localhost`) machine one can use URL: `https://localhost:5601`. -The default user is : `admin` and password `admin`. -In the example deployment, the default, built-in configuration file is used with selected configuration options being overridden in `services.yml` file. -However, for manual tailoring the available configuration parameters are available in `kibana.yml` [configuration file](https://github.com/CogStack/CogStack-Nifi/services/kibana/config/kibana.yml). +`https://localhost:8443` (via nifi-nginx) -For more information on use of Kibana please refer either to [the official Kibana documentation](https://www.elastic.co/guide/en/kibana/current/index.html) or [the official OpenSearch for Elasticsearch documentation](https://opensearch.org/docs/latest/dashboards/index/). +### Containers +- NiFi: `cogstack-nifi` +- NiFi-Registry-flow: `cogstack-nifi-registry-flow` -#### ENV/CONF files: -- `/deploy/elasticsearch.env` - general settings for boith Kibana and ES , OpenSearch and OpenSearch-Dashboards -- `/security/certificates_elasticsearch.env` - you can control the settings for the SSL certificates here -- `/security/elasticsearch_users.env` - define system user credentials here +### Service location & files -You should not really need to ever modify these files, only the `.env` files should be modified. 
-- `/services/elasticsearch/config/elasticsearch.yml` - Elasticsearch -- `/services/kibana/config/elasticsearch.yml` - Elasticsearch Kibana -- `/services/elasticsearch/config/opensearch.yml` - Opensearch -- `/services/kibana/config/opensearch.yml` - Opensearch-Dashboards +- docker compose file: `services.yml` +- dir: `nifi/` +- env: + - `/deploy/nifi.env` - general NiFi & NiFi Registry flow settings, JVM memory, etc. + - `/security/nifi_users.env` - controls DB user credentials + - `/security/certificates_nifi.env` +### Ports -The used configuration files for ElasticSearch and Kibana are provided in [`./services/elasticsearch/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/elasticsearch/config) and [`./services/kibana/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/kibana/config) directories respectively for [`OpenSearch`](https://opensearch.org/docs/latest/install-and-configure/configuration/) and [`OpenSearch Dashboard`](https://opensearch.org/docs/latest/dashboards/index/). +| Component | External Port | Internal Port | +|---------------------|---------------|----------------| +| NiFi | `8443` | `8082`, `10000` | +| NiFi Registry Flow | `18443` | `8083` | +--- -### Security +## 🔎 ELK Stack (Elasticsearch / OpenSearch) -Please note that both ElasticSearch and Kibana use security module to manage user access permissions and roles. -However, for production use, proper users and roles need to be set up otherwise the default built-in ones will be used and with default passwords. +Backend search and indexing engine powering document storage, query, analytics, and NLP output retrieval. -In the example deployment, the default built-in user credentials are used, such as: - - OpenSearch user: `admin` with pass `admin`. - - ElasticSearch user: `elastic` with pass `kibanaserver` +This service is fully described in the Elasticsearch section of the documentation.
-For more details on setting up the security certificates, users, roles and more in this example deployment please refer to [`security`](../security.md). +The repo supports both: -### Indexing & Ingesting data +- ElasticSearch (native) +- OpenSearch (Amazon fork) -Also note that in some scenarios a manual creation of index mapping may be a good idea prior to starting ingestion. Please look at Elasticsearch [mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html) and OpenSearch [mapping](https://opensearch.org/docs/2.4/opensearch/mappings/) docs on how to create the mapping before ingesting. - IMPORTANT: not creating the mapping of an index will result in ElasticSearch/OpenSearch automatically map all field datatypes as string, making fields such as date/timestamps not incredibly ! +Switch between modes via environment variables in `deploy/elasticsearch.env`. +### 🛢️ Elasticsearch / OpenSearch -A script `es_index_initializer.py` has been provided in [`./services/elasticsearch/scripts/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/elasticsearch/scripts) directory to help with that. +#### Credentials -### Installing and maintaining Elasticsearch/Opensearch +- OpenSearch: user - `admin`, password - `admin` +- ElasticSearch: user - `elastic`, password - `kibanaserver` -Please follow the instructions carefully and adapt where necessary.
+#### Access -#### Switching between OpenSearch and ElasticSearch +- `http://localhost:9200` - Node 1 +- `http://localhost:9201` - Node 2 +- `http://localhost:9202` - Node 3 -You can switch by simple modifying the following variables: +#### Containers -- `ELASTICSEARCH_VERSION` - set to `elasticsearch` or `opensearch` -- `ELASTICSEARCH_DOCKER_IMAGE` - check the possible values in the `elasticsearch.env` file -- `ELASTICSEARCH_KIBANA_DOCKER_IMAGE` - check the possible values in the `elasticsearch.env` file -- `KIBANA_VERSION` - set to either `kibana` or `opensearch-dashboards` (note that opensearch-dashboards does not have an underscore in the name..) -- `KIBANA_CONFIG_FILE_VERSION` - set to either `kibana` or `opensearch_dashboards` +- `elasticsearch-1` +- `elasticsearch-2` +- `elasticsearch-3` -There are no metricbeat & filebeat equivalents provided for OpenSearch at the moment as part of this repo. +#### Ports -#### Setting up a fresh cluster with 3 nodes +- all ports must be opened in the firewall to allow communication between cluster nodes; when the nodes share one machine/VM, each node needs its own port, whereas in production, with each node on a separate VM/machine, every machine can use the same ports: `9200`, `9300`, `9600` +- internal: `9300`, `9301`, `9302`, `9600`, `9601`, `9602`, `9200`, `9201`, `9202` +- external: `9300`, `9301`, `9302`, `9600`, `9601`, `9602`, `9200`, `9201`, `9202` -Assuming you will respect the proper guidelines, you would need 3 machines to set things up. If not, then you can still set them up on one machine.
+| Node | HTTP | Transport | Analyzer | +|------|------|-----------|----------| +| ES1 | `${ELASTICSEARCH_NODE_1_OUTPUT_PORT:-9200}` | `${ELASTICSEARCH_NODE_1_COMM_OUTPUT_PORT:-9300}` | `${ELASTICSEARCH_NODE_1_ANALYZER_OUTPUT_PORT:-9600}` | +| ES2 | `${ELASTICSEARCH_NODE_2_OUTPUT_PORT:-9201}` | `${ELASTICSEARCH_NODE_2_COMM_OUTPUT_PORT:-9301}` | `${ELASTICSEARCH_NODE_2_ANALYZER_OUTPUT_PORT:-9601}` | +| ES3 | `${ELASTICSEARCH_NODE_3_OUTPUT_PORT:-9202}` | `${ELASTICSEARCH_NODE_3_COMM_OUTPUT_PORT:-9302}` | `${ELASTICSEARCH_NODE_3_ANALYZER_OUTPUT_PORT:-9602}` | -Steps: -- go into the `/deploy/` folder, edit `elasticsearch.env` -- once you get the machine's IP addresses, modify the following variable on each machine `ELASTICSEARCH_NETWORK_HOST`, with the IP of each instance -- next, the env file will have a var for each server for settings such as: - - `node name`: ELASTICSEARCH_NODE_1_NAME - - `output port`: ELASTICSEARCH_NODE_1_OUTPUT_PORT - - `docker volume names`: ELASTICSEARCH_NODE_1_DATA_VOL_NAME. -- on all three servers this variable should be the same: `ELASTICSEARCH_SEED_HOSTS`, it should be set to all 3 ip addresess or machine names, respect the format as it is given in the file `ELASTICSEARCH_SEED_HOSTS=localhost,elasticsearch-2,elasticsearch-1,elasticsearch-3` for example, localhost must always be present -- change the cluster name if needed, by setting `ELASTICSEARCH_CLUSTER_NAME`. -- the intial cluster manager must be set via `ELASTICSEARCH_INITIAL_CLUSTER_MANAGER_NODES`, normally this can be either of the servers -- a setting you may change here IF needed is the `ELASTICSEARCH_NODE_1_NAME`, for each server, e.g: ELASTICSEARCH_NODE_1_NAME="test1", ELASTICSEARCH_NODE_2_NAME="test2", ELASTICSEARCH_NODE_3_NAME="test3". 
-- extra step for Kibana and Metricbeat we will need to add all three URLs to the nodes via the `ELASTICSEARCH_HOSTS` variable, e.g: ELASTICSEARCH_HOSTS='["https://elasticsearch-1:9200","https://elasticsearch-2:9200","https://elasticsearch-3:9200"]', please respect the quotes as shown in the file otherwise there can be parsing errors. -- update your license, set `ELASTICSEARCH_LICENSE_TYPE` from `trial` to `basic` if you are on ElasticSearch native and if you have a bought license! -- after you are finished please read [post-setup-todos](#post-setup-to-dos) +#### Service Location & files -#### Resource management +- docker compose: `deploy/services.yml` +- config: `services/elasticsearch/config/` +- env: + - `/deploy/elasticsearch.env` + - `/security/certificates_elasticsearch.env` + - `/security/elasticsearch_users.env` -You may want to also change the allocated number of CPUs to one instance/node, to do this, change the following variables: - - `ELASTICSEARCH_NODE_PROCESSORS`, default is 2 cores, max it out if you have a node dedicated for ES only. - - `ELASTICSEARCH_JAVA_OPTS`, default is to `-Xms2048m -Xmx2048m` only, the max allowed memory for HEAP is 32GB, read [this article](https://www.elastic.co/blog/managing-and-troubleshooting-elasticsearch-memory). 
+#### SSL & Certificates -##### Other settings -- OPTIONAL: you can change the location of the backup mounted volumes in the container if needed by setting the `ELASTICSEARCH_BACKUPS_PATH_REPO` var, please check the syntax so it matches the format of the provided string sample:["/mnt/es_data_backups","/mnt/es_config_backups"] -- OPTIONAL: you will need to setup the LDAP connection, if you are using LDAP, modify `ELASTICSEARCH_AD_URL`, `ELASTICSEARCH_AD_DOMAIN_NAME` and `ELASTICSEARCH_AD_TIMEOUT` (for timeout controls) also `ELASTICSEARCH_AD_UNMAPPED_GROUPS_AS_ROLES` for automatic LDAP group to role mapping (check [this](https://www.elastic.co/guide/en/enterprise-search/8.9/ldap-auth.html) for more info) -- OPTIONAL: additionally, you may want to have an email for your watcher jobs, this can be set via the `ELASTICSEARCH_EMAIL_ACCOUNT_PROFILE` variable and `ELASTICSEARCH_EMAIL_ACCOUNT_EMAIL_DEFAULTS`, the SMTP server must be set for this to work, so set `ELASTICSEARCH_EMAIL_SMTP_HOST` and `ELASTICSEARCH_EMAIL_SMTP_PORT` accordingly, look at the sample settings in the env file for guidance. +Certificates stored in: -#### Setting up Kibana/OpenSearch Dashboards -- if you wish to change the kibana instance name, change the `KIBANA_SERVER_NAME` var. -- the `ELASTICSEARCH_HOSTS` var must be set so that it contains the URLs of all the nodes in the cluster [check the previous section's last non-optional step](#setting-up-a-fresh-cluster-with-3-nodes) -- set `KIBANA_PUBLIC_BASE_URL` to the url of the server hosting Kibana/OS dashboards +```bash +/security/certificates/elastic// +``` -#### Setting up Metricbeat and Filebeat -- set `KIBANA_HOST` to the host of your Kibana server -- set `FILEBEAT_HOST` to the url of the server each FileBeat is on, it can be just `https://localhost:9200` or `https://0.0.0.0:9200`, if it does not work, then set it to the URL of each docker instance `https://elasticsearch-1` etc. 
-- set `FILEBEAT_USER` and `FILEBEAT_PASSWORD` in `./security/elasticsearch_users.env` if needed. +Settings in: -#### POST-SETUP TO DOs -You have to create accounts for the default users. Please use the provided scripts in the `/security` folder. +- `certificates_elasticsearch.env` -Set users in `elasticsearch_users.env` for either versions. -For ElasticSearch native, use: `create_es_native_credentials.sh`. -For OpenSearch use: `create_opensearch_users.sh`. +### 📊 Metricbeat & Filebeat -If you wish to also setup certificates, check the [security section](../security.md#elk-stack). +Lightweight Elastic stack agents used for **monitoring** and **log forwarding**. +They run alongside Elasticsearch to provide observability of the cluster and ingestion pipelines. +**Purpose:** -### Updating the version of the cluster +- **Metricbeat** - collects system & Elasticsearch metrics (CPU, memory, JVM, node health). +- **Filebeat** - ships container and service logs into Elasticsearch. - IMPORTANT: Make sure to disable any ingestion jobs before doing any of the update steps +Both run as independent containers in the deployment. -#### For ElasticSearch: -- please check [this link](https://www.elastic.co/guide/en/elastic-stack/8.9/upgrading-elastic-stack.html) for specific version guides. -- carefully read [this](https://www.elastic.co/guide/en/elastic-stack/current/upgrading-elasticsearch.html), there are a few steps that need to be completed via the Dev Console in Kibana and/or via `curl` in terminal.
-- take note of which Elastic version you are using and check if there are any extra steps that you might need to do, for example you cant upgrade from v7.1.0 to v8.9.2, you'd need to go v7.1.0->7.9.0 first then v8.1.0 -> v8.9.x, this is a pattern that will likely repeat for future versions -- there may be some additional steps that can be done via Kibana if the documentation says you may need to upgrade your indices to a later version, check [this](https://www.elastic.co/guide/en/elastic-stack/8.9/upgrading-elastic-stack.html#prepare-to-upgrade) as an example, upgrading from 7.x to 8.x requires a REINDEX operation on all indices! -- steps: - - make sure you stop ALL ingestion jobs +#### Containers - - this disables shard allocation:
`curl -u your_username -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d' - { - "persistent": { - "cluster.routing.allocation.enable": "primaries" - } - } - '` - - flush indices: `curl -u your_username -X POST "localhost:9200/_flush/synced?pretty"` - - wait for everything to complete, check to see if the health of all clusters is green and the shards are fine - - shut down all ES services, start with Kibana, Metricbeat, Filebeat and then the Elasticserch cluster : `docker container stop cogstack-kibana cogstack-metricbeat-1 cogstack-metricbeat-2 cogstack-filebeat-1 cogstack-filebeat-2 cogstack-filebeat-3`, `docker container stop elasticsearch-1 elasticsearch-2 elasticsearch-3`, obviously execute these on each - - change the relevant ENV VARS (change these in `deploy/elasticsearch.env`): ELASTICSEARCH_DOCKER_IMAGE="docker.elastic.co/elasticsearch/elasticsearch:8.3.3", ELASTICSEARCH_KIBANA_DOCKER_IMAGE="docker.elastic.co/kibana/kibana:8.3.3", METRICBEAT_IMAGE="docker.elastic.co/beats/metricbeat:8.3.3", FILEBEAT_IMAGE="docker.elastic.co/beats/filebeat:8.3.3" - - NOTE: all docker images must have the same version, e.g 8.3.3, otherwise there may be errors, please check this before starting the services. - - go to the `deploy` folder and start update the source env vars by executing `source export_env_vars.sh`, do a test to see if the new vars are set `echo $ELASTICSEARCH_DOCKER_IMAGE` for example - - start only the elastic instance on the correct cluster (assuming each node is on its own separate machine, as it should normally be), wait for startup to complete - - start the rest of the services and check for the health of each node - - re-enable shard allocation: -
`curl -u your_username -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d' - { - "persistent": { - "cluster.routing.allocation.enable": "all" - } - } - '` - - go to Kibana > System Monitor > Clusters and check the status of all the nodes & shards. +Metricbeat: -#### For OpenSearch: -- please check [this link](https://opensearch.org/docs/2.0/install-and-configure/upgrade-opensearch/index/) -- the follow the steps from the `For Elasticsearch` section above, the only diference is the curl command for disabling the shard allocation: - - `curl -u your_username -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d -{ - "persistent":{ - "cluster.routing.rebalance.enable": "primaries" - } -}` -- shut down kibana & the nodes -- change the relevant ENV vars in `deploy/elasticsearch.env` such as ELASTICSEARCH_KIBANA_DOCKER_IMAGE and ELASTICSEARCH_DOCKER_IMAGE. -- go to the `deploy` folder and start update the source env vars by executing `source export_env_vars.sh`, do a test to see if the new vars are set `echo $ELASTICSEARCH_DOCKER_IMAGE` for example -- all things should be working, re-enable allocation of shards: - - `curl -u your_username -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d -{ - "persistent":{ - "cluster.routing.rebalance.enable": "primaries" - } -}` +- `metricbeat-1` +- `metricbeat-2` +- `metricbeat-3` -## Jupyter Hub +Filebeat: -`jupyter-hub` service provides a single instance of Jupyter Hub to serve Jupyter Notebooks containers to users.In essence, the jupyter-hub container will spawn jupyter-singleuser containers for users, on the fly, as necessary.The settings applied to the jupyter-hub service in `services.yml` won't apply to the singleuser containers, please note that the singleuser containers and jupyter-hub container are entirely independent of one another. 
+- `filebeat-1` +- `filebeat-2` +- `filebeat-3` -It exposes port `8888` by default on the container and binds to the same port on the host machine. -Since `jupyter-hub` is running in the `cognet` Docker network it has access to all services available within it, hence can be used to read data directly from Elasticsearch or query NLP services. +#### **Service Location & Files** -For more information on the use and configuration of Jupyter Hub please refer to [the official Jupyter Hub documentation](https://jupyter.org/hub). +- compose: `deploy/services.yml` +- config: + - `services/metricbeat/metricbeat.yml` + - `services/filebeat/filebeat.yml` +- env: + - `/deploy/elasticsearch.env` + - `/security/elasticsearch_users.env` -The JupyterHub comes with an example Jupyter notebook that is stored in [`./services/jupyter-hub/notebooks`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/jupyter-hub/notebooks) directory. +#### **Ports** -### Access and account control -To access Jupyter Hub on the host machine (e.g.localhost), one can type in the browser `http://localhost:8888`. +No external ports exposed. +All communication occurs internally within the `cogstack-net` Docker network. +#### **Notes** -Creating accounts for other users is possible, just go to the admin page `https://localhost:8888/hub/admin#/`, click on add users and follow the instructions (make sure usernames are lower-cased and DO NOT contain symbols, if usernames contain uppercase they will be converted to lower case in the creation process). +- Elasticsearch must be running before Metricbeat or Filebeat start. +- Only Elastic-native Beats are available; OpenSearch-native Beats do not exist. +- Authentication/credentials come from `elasticsearch_users.env`. -The default password is blank, you can set the password for the admin user the first time you LOG IN, remember it. 
+### 📉 Kibana / OpenSearch Dashboards -Or you can set the password is defined by a local variable `JUPYTERHUB_PASSWORD` in `.env` file that is the password SHA-1 value if the authenticator is set to either LocalAuthenticator or Native read more in [jupyter doc](https://jupyterhub.readthedocs.io/en/stable/api/auth.html?highlight#) about this. +Web UI for exploring indexed data, visualising documents, managing index templates, monitoring the cluster, and debugging ingestion pipelines. -Users must use the "/work/"directory for their work, otherwise files might not get saved! +**Purpose:** -### User singleuser container image selection +- Search & browse Elasticsearch/OpenSearch indices +- Visualise ingestion outputs and cluster metrics +- Manage index patterns, dashboards, and Dev Tools +- Validate mappings and test queries used in NiFi flows -Users can be allowed to select their own image upon starting their container service, this is enabled by default, it can be turned off by setting `DOCKER_SELECT_NOTEBOOK_IMAGE_ALLOWED=false` in the `services.yml` file. +#### Host Access +- URL: **https://localhost:5601** -### GPU support within jupyter +#### Credentials -Pre-requisites (for Linux and Windows): - - for Linux, you need to install the nvidia-docker2 package / nvidia toolkit package that adds gpu spport for docker, official documentation [here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) - - this also needs to be done for Windows machines, please read the the documentation for WSL2 [here](https://docs.nvidia.com/cuda/wsl-user-guide/index.html) +- **OpenSearch Dashboards:** `admin` / `admin` +- **Elasticsearch Native:** `elastic` / `kibanaserver` -GPU support is disabled by default, to enable it, set `DOCKER_ENABLE_GPU_SUPPORT=true` in the `services.yml` file.Please note that only the `cogstacksystems/jupyter-singleuser-gpu:latest`/ `cogstack-gpu` should be used, as it is the only image that has the drivers installed.
+#### Containers -Do not attempt to use the gpu image on a non-gpu machine, it wont work and it will crash the container service. +- `cogstack-kibana` (OpenSearch Dashboards or Kibana depending on configuration) -### Resource limit control in Jupyter-Hub +#### **Service Location & Files** -It is possible to set CPU and RAM limits for admins and normal users, check the following properties in [/deploy/jupyter.env](../../deploy/jupyter.env). +- docker compose: `deploy/services.yml` +- config files: + - `services/kibana/config/elasticsearch.yml` (Elasticsearch) + - `services/kibana/config/opensearch.yml` (OpenSearch Dashboards) +- env: + - `/deploy/elasticsearch.env` + - `/security/certificates_elasticsearch.env` + - `/security/elasticsearch_users.env` -``` -# general user resource cap per container -RESOURCE_ALLOCATION_USER_CPU_LIMIT="2" -RESOURCE_ALLOCATION_USER_RAM_LIMIT="2G" +Image selection controlled by: -# admin resource cap per container -RESOURCE_ALLOCATION_ADMIN_CPU_LIMIT="2" -RESOURCE_ALLOCATION_ADMIN_RAM_LIMIT="4G" -``` +- `${ELASTICSEARCH_KIBANA_DOCKER_IMAGE}` +- `${KIBANA_VERSION}` +- `${KIBANA_CONFIG_FILE_VERSION}` -Go to the `/deploy` folder. -You will need to execute the `export_env_vars.sh` script in order to set these limits, BEFORE running the jupyter-hub container. +#### Ports -Check if the variables have been set by running: -``` - echo $RESOURCE_ALLOCATION_USER_CPU_LIMIT -``` +| Component | External | Internal | +|-----------|----------|----------| +| Kibana / OpenSearch Dashboards | `5601` | `5601` | -If no value is diplsayed then you will manually have to set it, run the following: -``` -set -a -source jupyter.env -set +a +#### Notes + +- Must be started after Elasticsearch/OpenSearch +- Connects automatically using `ELASTICSEARCH_HOSTS` +- TLS/user settings are applied from the `/security` env files + +--- + +## 🤖 OCR Service + +High-performance document text extraction engine replacing legacy Tika for OCR + text processing.
+In the near future it will be possible to use LLMs/custom models for OCR (pending the v2 release, ETA 2026). + +The service comes in **two variants**: + +- **ocr-service** - full OCR pipeline (images → text) +- **ocr-service-text-only** - lightweight mode (text extraction only, no OCR) + +Both expose a simple REST API. + +**Purpose:** + +- Extract text from PDFs, images, and scanned documents +- Provide OCR via Tesseract (wrapped in an optimised Python service) +- Provide fast plain text extraction for digital PDFs (text-only variant) +- Designed for large-scale throughput within NiFi ingestion pipelines + +### Access + +- ocr-service: `http://localhost:8090/api/process` +- ocr-service-text-only: `http://localhost:8091/api/process` + +### Containers + +- `ocr-service` +- `ocr-service-text-only` + +Both built from: + +```bash +cogstacksystems/cogstack-ocr-service: ``` -#### ENV/CONF files: +### Service Location & Files + +- docker compose file: `services/ocr-service/docker/docker-compose.yml` +- service directory: `services/ocr-service/` +- logs: + - Host: `services/ocr-service/log/` + - Container: `/ocr_service/log/` + +- env files: + - `deploy/general.env` - shared variables + - `services/ocr-service/env/ocr_service.env` - full OCR config + - `services/ocr-service/env/ocr_service_text_only.env` - overrides for text-only pipeline + +### Ports + +| Service | External | Internal | +|---------|----------|----------| +| ocr-service | `8090` | `8090` | +| ocr-service-text-only | `8091` | `8090` | + +Both expose the API internally on port `8090`. + +Please check the service's own [README.md](https://github.com/CogStack/ocr-service/blob/main/README.md) + +--- + +## 🗂️ Git-ea + +Self-hosted Git instance (Gitea). +Lightweight GitHub/GitLab-style service used for hosting repositories inside secure or offline environments.
+ +**Purpose:** + +- Internal code hosting for organisations without external Git access +- Repository management, issue tracking, wiki, and basic CI hooks +- Ideal for notebooks, configs, workflows, and internal project code + +### Access + +- URL: **http://localhost:3000** *(default Gitea port)* + +### Containers + +- `gitea` + +### Service Location & Files -- `/deploy/jupyter.env` - all you should ever set is located here -- `/services/jupyter-hub/jupyter_config.py` - only tamper if you know what you are doing, please see [config documentation](https://github.com/jupyterhub/jupyterhub-deploy-docker/blob/main/basic-example/jupyterhub_config.py) for detailed settings +- docker compose file: `deploy/services.yml` +- config file: `services/gitea/app.ini` +- env files: + - `/security/certificates_general.env` -**IMPORTANT**: -- `/services/jupyter-hub/userlist` - userlist that gets loaded once jupyter starts up, you will need to update this manually at the moment whenever a user is created -- `/services/jupyter-hub/teamlist` - teamlist that gets loaded once jupyter starts up +Persistent repository data is stored in the volume defined in `services.yml`. -Re-run the above if you change the values.Make sure to delete old instances of Jupyter-hub containers, and Jupyter single-user containers for each user.DO NOT delete their volumes, you don't want to delete their data! 
+### Ports -IMPORTANT NOTE: all environment variable(s) are described in detail in the env file comments in `/deploy/jupyter.env` +| Service | External | Internal | +|---------|----------|----------| +| Git-ea | `3000` | `3000` | +### Notes -### Security +- Supports repository migration from external Git servers +- Mirroring available when external access is allowed +- Can use CogStack certificates for HTTPS if configured -This service users NiFi's `../../security/root-ca.p12` and `../../security/root-ca.key` certificates,so if you have generated them for NiFi then there is nothing else to do, please see the [jupytherhub secion](../security.md#jupyterhub) for other security configs. +--- -## Git-ea +## 🧱 NGINX -This is a GitHub/GitLab equivalent.Feel free to use it if you organisation doesn't allow access to Github, etc. +*Note: this component may eventually be replaced by **Traefik** as the preferred reverse‑proxy and ingress layer for CogStack deployments.* +NGINX is used as a lightweight reverse proxy to provide secure, unified access to internal CogStack services. +It handles HTTPS, routing, and access control for NiFi, MedCAT Trainer, and other components. -### Migrating Git repositories: +MedCAT-Trainer has its own nginx instance that runs independently. -Migrating git repos is straightforward.
+**Purpose:** -If you have an Git organisation (e.g COGSTACK) on your git-ea server, make sure you do the following steps: -- make sure you have the same organisation name created/existing on both servers, and that the source server has the repos you need migrating assigned to the organisation -- select -- the above option reveals a screen, select `Git` not `Gitea` -- in the next screen we can pick a user -- complete the migration as per the following example: - - get url of the source and dest servers : e.g cogstack1 (source) and cogstack2 (dest) respectively - - use a user and password that is able to manage the repo on cogstack2 - - untick the `mirror` option as we will not be using cogstack2 in future - - select and it should report success and the repo will be migrated into the COGSTACK organisation on the new server +- Secure external access to internal services +- Reverse proxy for NiFi, MedCAT Trainer, and service UIs +- TLS termination (optional) +- Basic auth / access control where required +Two variants are included: +- **nginx-nifi** - main proxy for NiFi and related services +- **nginx-medcat-trainer** - specialized proxy for MedCAT Trainer -### ENV/settings files: +Two variants: -- `/services/gitea/app.ini`` - this is the file you will need to edit manually for settings for now, ENV file will soon be available. +- **nginx-nifi** - main proxy for services +- **nginx-medcat-trainer** - dedicated trainer proxy +### Access -### Security +Examples (actual paths depend on config): -This service users NiFi's `../../security/root-ca.p12` and `../../security/root-ca.key` certificates, nothing else is required. +- NiFi: `https://localhost:8443` +- MedCAT Trainer: `https://localhost:8001` -## NGINX -Although by default not used in the deployment example, NGINX is primarily used as a reverse proxy, limiting the access to the used services that normally expose endpoint for the end-user.
-For a simple scenario, it can used only for securing access to Apache NiFi webservice endpoint. +Routing rules are defined in the NGINX configuration files. -All the necessary configuration files and scripts are located in [`./services/nginx/config/`](https://github.com/CogStack/CogStack-NiFi/tree/main/services/nginx/config) directory where the user and password generation script `setup_passwd.sh`. +### Containers -### NGINX-NiFi +- `nifi-nginx` - main proxy for NiFi & NiFi Registry +- `medcat-trainer-nginx` - proxy dedicated to MedCAT Trainer -This is a specific nginx instance that is used directly by all services EXCEPT MedCAT Trainer, the trainer has it's own instance started separately with different rules. +### Service Location & Files -### NGINX-MEDCAT-TRAINER +- docker compose file: `deploy/services.yml`, trainer - `deploy/cogstack-nlp/medcat-trainer` +- config files: + - `services/nginx/config/nifi.conf` + - `services/nginx/config/medcat-trainer.conf` + - additional templates under `services/nginx/config/` +- env / certificates: + - `/security/certificates_general.env` + - `/security/certificates_nifi.env` +- Uses shared CogStack Root CA & NiFi certs (`root-ca.p12`, `root-ca.key`, `nifi.key`, `nifi.pem`) -Please refer to the trainer docs, [MedCAT Trainer](https://github.com/CogStack/MedCATtrainer) for more info on configuration. +### Ports +| Proxy Target | External | Internal | +|------------------|----------|----------| +| NiFi | `8443` | `8443` | +| NiFi Registry Flow | `18443` | `18443` | -#### Security +### Notes -This service users NiFi's `../../security/root-ca.p12` and `../../security/root-ca.key` certificates.
+- Provides HTTPS entrypoints for internal services +- Works with CogStack certificate bundle +- Trainer uses a separate NGINX instance for routing differences +- Modify NGINX configs only if comfortable with its syntax diff --git a/docs/deploy/troubleshooting.md b/docs/deploy/troubleshooting.md index dbbbcd36..15dc8c94 100644 --- a/docs/deploy/troubleshooting.md +++ b/docs/deploy/troubleshooting.md @@ -1,4 +1,4 @@ -# Troubleshooting +# 📛 Troubleshooting Always start with fresh containers and volumes, to make sure that there are no volumes from previous experimentations, make sure to always delete all/any cogstack running containers by executing: @@ -8,13 +8,11 @@ followed by a cleanup or dangling volumes (careful as this will remove all volum `docker volume prune -f` WARNING THIS WILL DELETE ALL UNUSED VOLUMES ON YOUR MACHINE!. Check the volume names used in services.yml file and delete them as necessary `dockr volume rm volume_name` -## Known Issues/errors +## 🐞 Known Issues/errors Common issues that can be encountered across services. -
-
-### **Apple Silicon** +### 🍎 **Apple Silicon** Many services cannot run natively on Apple Silicon (such as M1 and M2 architectures). Common error messages related to Apple silicon follow patterns similar to:

@@ -24,7 +22,7 @@ Many services cannot run natively on Apple Silicon (such as M1 and M2 architectu - `no matching manifest for linux/arm64/v8 in the manifest list entries`



- - `image with reference cogstacksystems/cogstack-ocr-service:0.2.4 was found but does not match the specified platform: wanted linux/arm64, actual: linux/amd64` + - `image with reference cogstacksystems/cogstack-ocr-service:1.0.2 was found but does not match the specified platform: wanted linux/arm64, actual: linux/amd64`

To solve these issues; Rosetta is required and enabled in Docker Desktop. Finally an environment variable is required to be set. @@ -42,7 +40,7 @@ export DOCKER_DEFAULT_PLATFORM=linux/amd64 to set the environment variable. These issues are known to occur on the "cogstack-nifi", "cogstack-ocr-services" and "jupyter-hub" services and may occur on others. -### **NiFi** +### 🔧 **NiFi** When dealing with contaminated deployments ( containers using volumes from previous instances ) :

@@ -63,9 +61,9 @@ When dealing with contaminated deployments ( containers using volumes from previ

- `Unable to connect to ElasticSearch` using the `ElasticSearchClientService` NiFi controller, make sure the settings are correct (username,password,certificates, etc.) and then click `Apply`, disregard the errors and click `Enable` on the controller to forcefully reload the controller, stop it and then validate the settings, start it again after and it should work. -### **Elasticsearch Errors** +### 🛢️ **Elasticsearch Errors** -#### **VM memory errors, failed bootstrap check** +#### ⚑ **VM memory errors, failed bootstrap check** It is quite a common issue for both opensearch and native-ES to error out when it comes to virtual memory allocation, this error typically comes in the form of : @@ -91,7 +89,7 @@ For more on this issue please read: https://www.elastic.co/guide/en/elasticsearc
-#### **OpenSearch: validating opensearch.yml hosts** +#### 📄 **OpenSearch: validating opensearch.yml hosts** ```bash FATAL Error: [config validation of [opensearch].hosts]: types that failed validation: @@ -118,7 +116,7 @@ Alternatively (if the script executes without issues): make start-elastic ``` -### DB-samples issues +### 🗃️ DB-samples issues ```bash No table data for samples_db diff --git a/docs/index.rst b/docs/index.rst index 001b209e..ee75d9bf 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -12,12 +12,12 @@ Welcome to CogStack-Nifi's documentation! main.md news.md - nifi/main.md - security/main.md deploy/main.md deploy/deployment.md deploy/troubleshooting.md deploy/workflows.md + nifi/main.md + security/main.md Indices and tables ================== diff --git a/docs/news.md b/docs/news.md index 53f8de7d..98627fe7 100644 --- a/docs/news.md +++ b/docs/news.md @@ -1,10 +1,10 @@ -# News +# 📰 News This document covers important news with regards to the components of CogStack as a whole, any major security issues or major changes that might break existing deployments are covered here along with how to handle them.

-## 13-12-2021 LOG4J Vulnerabity +## 🛑 13-12-2021 LOG4J Vulnerabity Since the discovery of the Log4J package vulnerability (https://www.ncsc.gov.uk/news/apache-log4j-vulnerability) it is necessary and recommended to update all existing deployments of CogStack. @@ -22,7 +22,7 @@ For NiFI: - re-pull (docker pull cogstacksystems/cogstack-nifi:latest) - re-pull the tika image (docker pull cogstacksystems/tika-service:latest) -## 01-10-2025 NiFi 2.0 Release +## 🚀 01-10-2025 NiFi 2.0 Release New version of NiFi along with the long awaited NiFi registry flow released: