diff --git a/docs/deployment/dashboard.md b/docs/deployment/dashboard.md
new file mode 100644
index 0000000..78604ae
--- /dev/null
+++ b/docs/deployment/dashboard.md
@@ -0,0 +1,242 @@
---
sidebar_position: 5
title: Dashboard
description: Real-time web dashboard for monitoring and managing the Chimaera runtime cluster.
---

# Runtime Dashboard

The `context_visualizer` package provides a lightweight Flask web application that lets you inspect and manage a live Chimaera runtime cluster from your browser. It connects to the runtime using the same client API used by application code and surfaces cluster topology, per-node worker statistics, system resource utilization, block device stats, pool configuration, and the active YAML config.

## Prerequisites

- IOWarp installed with Python support (`WRP_CORE_ENABLE_PYTHON=ON`)
- A running Chimaera runtime (`chimaera runtime start`)
- Python dependencies: `flask`, `pyyaml`, `msgpack`

Install the Python dependencies with any of:

```bash
pip install flask pyyaml msgpack
# or
pip install iowarp-core[visualizer]
# or (conda)
conda install flask pyyaml python-msgpack
```

## Starting the Dashboard

```bash
python -m context_visualizer
```

Then open [http://127.0.0.1:5000](http://127.0.0.1:5000) in your browser.

### CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--host` | `127.0.0.1` | Bind address. Use `0.0.0.0` to expose on all interfaces. |
| `--port` | `5000` | Listen port. |
| `--debug` | *(off)* | Enable Flask debug mode (auto-reload, verbose errors). |

```bash
# Expose on all interfaces, non-default port
python -m context_visualizer --host 0.0.0.0 --port 8080

# Debug mode (development only)
python -m context_visualizer --debug
```

## Pages

### Topology (`/`) {#topology}

The landing page shows a live grid of all nodes in the cluster.
Each node card displays:

- **Hostname** and **IP address**
- **Status badge** (alive)
- **CPU**, **RAM**, and **GPU** utilization bars (GPU shown only when GPUs are present)
- **Restart** and **Shutdown** action buttons

The search bar supports filtering by node ID (single `3`, range `1-20`, comma-separated `1,3,5`) or by hostname/IP substring.

Clicking a node card navigates to the per-node detail page.

### Node Detail (`/node/<node_id>`) {#node-detail}

A per-node drilldown page showing:

- **Worker statistics** — per-worker queue depth, blocked tasks, processed count, and more
- **System stats** — time-series CPU, RAM, GPU, and HBM utilization
- **Block device stats** — per-bdev pool throughput and capacity

### Pools (`/pools`)

Lists all pools defined in the `compose` section of the active configuration file:

| Column | Description |
|--------|-------------|
| **Module** | ChiMod shared-library name (`mod_name`) |
| **Pool Name** | User-defined pool name |
| **Pool ID** | Unique pool identifier |
| **Query** | Routing policy (`local`, `dynamic`, `broadcast`) |

### Config (`/config`)

Displays the full contents of the active YAML configuration file as formatted JSON, for quick inspection without opening a terminal.

## REST API

All pages are backed by a JSON API. You can query these endpoints directly for scripting or integration with other monitoring tools.

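For scripting, the endpoints are easy to consume from Python with only the standard library. A minimal sketch; the payload shape (an object with a `nodes` list whose entries carry `hostname`, `cpu`, and `ram` fields) is an assumption based on the topology endpoint's description, so check a live response and adjust the field names:

```python
import json
import urllib.request


def summarize_nodes(topology):
    """Reduce a topology payload to (hostname, cpu, ram) tuples.

    The field names here are assumptions; match them to the real
    /api/topology response.
    """
    return [(n.get("hostname"), n.get("cpu"), n.get("ram"))
            for n in topology.get("nodes", [])]


def fetch_topology(base_url="http://127.0.0.1:5000"):
    """GET /api/topology and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/topology", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for host, cpu, ram in summarize_nodes(fetch_topology()):
        print(f"{host}: cpu={cpu}% ram={ram}%")
```
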
### Cluster-wide

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/topology` | GET | List all nodes with hostname, IP, CPU/RAM/GPU utilization |
| `/api/system` | GET | High-level system overview (connected, worker/queue/blocked/processed counts) |
| `/api/workers` | GET | Per-worker stats plus a fleet summary (local node) |
| `/api/pools` | GET | Pool list from the `compose` section of the config |
| `/api/config` | GET | Full active configuration as JSON |

### Per-node

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/node/<node_id>/workers` | GET | Worker stats for a specific node |
| `/api/node/<node_id>/system_stats` | GET | System resource utilization entries for a specific node |
| `/api/node/<node_id>/bdev_stats` | GET | Block device stats for a specific node |

### Node Management

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/topology/node/<node_id>/shutdown` | POST | Gracefully shut down a node via SSH |
| `/api/topology/node/<node_id>/restart` | POST | Restart a node via SSH |

Shutdown and restart are performed by SSHing from the dashboard host to the target node and running `chimaera runtime stop` or `chimaera runtime restart`. This avoids the problem of a node killing itself mid-RPC. The SSH connection uses `StrictHostKeyChecking=no` and `ConnectTimeout=5`.

**Shutdown response:**

```json
{
  "success": true,
  "returncode": 0,
  "stdout": "",
  "stderr": ""
}
```

Exit codes `0` and `134` (SIGABRT from `std::abort()` in `InitiateShutdown`) are both treated as success.

**Restart** uses `nohup` so the SSH session returns immediately while the node restarts in the background.

All endpoints return `Content-Type: application/json`. On error they return an appropriate HTTP status code (e.g., `503` if the runtime is unreachable, `404` if a node is not found) with an `"error"` field in the response body.

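Scripts that drive these endpoints can fold that error contract into a small helper. A sketch; `explain_status` and `api_get` are hypothetical helpers, not part of the dashboard itself:

```python
import json
import urllib.error
import urllib.request


def explain_status(status, body):
    """Translate a dashboard response into a short description,
    following the contract above (503 = runtime unreachable,
    404 = node not found); otherwise fall back to the "error" field."""
    if status == 503:
        return "runtime unreachable"
    if status == 404:
        return "node not found"
    return body.get("error", "ok" if status == 200 else f"HTTP {status}")


def api_get(url):
    """GET a dashboard endpoint; return (status, decoded JSON body)
    instead of raising on HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as e:
        return e.code, json.loads(e.read() or b"{}")
```
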
### Examples

```bash
# Get cluster topology
curl http://127.0.0.1:5000/api/topology

# Get system overview
curl http://127.0.0.1:5000/api/system

# Get worker stats for node 2
curl http://127.0.0.1:5000/api/node/2/workers

# Shut down node 3
curl -X POST http://127.0.0.1:5000/api/topology/node/3/shutdown

# Restart node 3
curl -X POST http://127.0.0.1:5000/api/topology/node/3/restart
```

## Configuration File Discovery

The dashboard reads the same config file as the runtime, using the same search order:

| Source | Priority |
|--------|----------|
| `CHI_SERVER_CONF` environment variable | **1st** |
| `WRP_RUNTIME_CONF` environment variable | **2nd** |
| `~/.chimaera/chimaera.yaml` | **3rd** |

See [Configuration](./configuration) for details on the config file format.

## Connection Lifecycle

The dashboard connects to the runtime lazily — on the first request that needs live data. If the runtime is not yet running when the dashboard starts, it will show a disconnected state and retry on subsequent requests. Shutdown is handled automatically via `atexit` so the client is finalized cleanly when the server process exits.

## Docker / Remote Access

When running the runtime inside Docker or on a remote host, bind the dashboard to all interfaces and forward the port:

```bash
# On the host running the runtime
python -m context_visualizer --host 0.0.0.0 --port 5000
```

```yaml
# docker-compose.yml — expose the dashboard port alongside the runtime
services:
  iowarp:
    image: iowarp/deploy-cpu:latest
    ports:
      - "9413:9413"  # Chimaera RPC
      - "5000:5000"  # Dashboard
    command: >
      bash -c "chimaera runtime start &
               python -m context_visualizer --host 0.0.0.0"
```

:::warning
The dashboard has no authentication. Do not expose it on a public network without a reverse proxy that enforces access control.
:::

## Try It: Interactive Docker Cluster {#interactive-cluster}

An interactive test environment is provided that spins up a **4-node Chimaera cluster** with the dashboard so you can explore all features from your browser.

### Location

```
context-runtime/test/integration/interactive/
├── docker-compose.yml   # 4-node runtime cluster
├── hostfile             # Node IP addresses (172.28.0.10-13)
├── wrp_conf.yaml        # Runtime configuration
└── run.sh               # Launcher script
```

### How It Works

- **4 Docker containers** (`iowarp-interactive-node1` through `node4`) run the Chimaera runtime on a private `172.28.0.0/16` network, each with `sshd` for SSH-based shutdown/restart
- **Node 1** also runs the dashboard alongside its runtime
- The script connects the devcontainer to the Docker network and starts a local port-forward so that `localhost:5000` reaches the dashboard inside Docker — VS Code then auto-forwards this to your host browser
- SSH keys are distributed via a shared Docker volume so the dashboard can authenticate to all nodes

### Running

```bash
cd context-runtime/test/integration/interactive

# Foreground (Ctrl-C to stop)
bash run.sh

# Or run in the background
bash run.sh start

# Follow runtime container logs
bash run.sh logs

# Stop everything (cluster + dashboard)
bash run.sh stop
```

Once the cluster is up (~15 seconds), open [http://localhost:5000](http://localhost:5000) to browse the topology, click into individual nodes, and use the Restart/Shutdown buttons.

If running from a devcontainer or a host where the workspace is at a different path, set `HOST_WORKSPACE`:

```bash
HOST_WORKSPACE=/host/path/to/workspace bash run.sh
```
diff --git a/docs/sdk/context-assimilation-engine/overview.md b/docs/sdk/context-assimilation-engine/overview.md
new file mode 100644
index 0000000..dd1214b
--- /dev/null
+++ b/docs/sdk/context-assimilation-engine/overview.md
@@ -0,0 +1,273 @@
# Context Assimilation Engine (CAE) Overview

## Introduction

The Context Assimilation Engine (CAE) is a Chimaera module (`wrp_cae::core`) that ingests external data sources into the IOWarp runtime. It reads data from files, HDF5 datasets, or remote Globus endpoints and stores them as blobs in the Context Transfer Engine (CTE). The CAE is registered as a ChiMod container with pool ID `400.0`.

## Architecture

```
          +-----------+
          |  Client   |  (wrp_cae::core::Client)
          +-----+-----+
                | AsyncParseOmni / AsyncProcessHdf5Dataset
                v
     +----------+-----------+
     |       Runtime        |  (wrp_cae::core::Runtime : chi::Container)
     +----------+-----------+
                |
    +-----------+-----------+
    |                       |
ParseOmni()        ProcessHdf5Dataset()
    |
    v
+---+----------------+
| AssimilatorFactory |  factory.Get(src_url)
+---+----------------+
    |
    +------------+------------+
    |            |            |
BinaryFile    Hdf5File     GlobusFile
Assimilator   Assimilator  Assimilator
    |            |            |
    +------------+------------+
                 |
                 v
    CTE Client (wrp_cte::core::Client)
    Put / Get blob operations
```

### Key Components

| Component | Header | Description |
|-----------|--------|-------------|
| `Runtime` | `wrp_cae/core/core_runtime.h` | Container server-side logic |
| `Client` | `wrp_cae/core/core_client.h` | Client-side async API |
| `AssimilatorFactory` | `wrp_cae/core/factory/assimilator_factory.h` | Creates assimilators by source protocol |
| `BaseAssimilator` | `wrp_cae/core/factory/base_assimilator.h` | Abstract interface for all assimilators |
| `AssimilationCtx` | `wrp_cae/core/factory/assimilation_ctx.h` | Serializable transfer descriptor |

### Namespace and Pool ID

- **Namespace:** `wrp_cae::core`
- **Pool ID:** `constexpr chi::PoolId kCaePoolId(400, 0)` (defined in `constants.h`)
- **ChiMod library name:** Derived from `chimaera_mod.yaml` (`module_name: core`, `namespace: wrp_cae`)

## Factory Pattern

The CAE uses a factory pattern to select the correct assimilator based on the source URL protocol.

### AssimilatorFactory

`AssimilatorFactory::Get(const std::string& src)` parses the protocol from the source URI and returns the appropriate `BaseAssimilator` subclass:

| Protocol | URI Format | Assimilator | Build Flag |
|----------|-----------|-------------|------------|
| `file` | `file::/path/to/file` | `BinaryFileAssimilator` | Always enabled |
| `hdf5` | `hdf5::/path/file.h5:/dataset` | `Hdf5FileAssimilator` | `-DWRP_CORE_ENABLE_HDF5=ON` |
| `globus` | `globus://<endpoint>/<path>` | `GlobusFileAssimilator` | `-DCAE_ENABLE_GLOBUS=ON` |

The factory also detects Globus web URLs (`https://app.globus.org/...`) and routes them to `GlobusFileAssimilator`.

Protocol extraction supports two URI styles:

- Standard: `protocol://path` (extracts text before `://`)
- Custom: `protocol::path` (extracts text before `::`)

### BaseAssimilator Interface

All assimilators implement the `BaseAssimilator` abstract class:

```cpp
class BaseAssimilator {
 public:
  virtual ~BaseAssimilator() = default;
  virtual chi::TaskResume Schedule(const AssimilationCtx& ctx,
                                   int& error_code) = 0;
};
```

`Schedule` is a **coroutine**. It uses `co_await` internally to perform async CTE blob operations (create tag, put data). The `error_code` output parameter returns 0 on success.

### Concrete Assimilators

**BinaryFileAssimilator** reads local files in chunks. It extracts the file path from `ctx.src`, respects `range_off` and `range_size` for partial reads, and stores the data as CTE blobs.

**Hdf5FileAssimilator** opens an HDF5 file, discovers all datasets using the HDF5 visitor API, applies include/exclude glob filters from the `AssimilationCtx`, and stores each dataset as a tagged CTE blob with tensor metadata (type and dimensions). It also exposes `ProcessDataset()` publicly for distributed per-dataset processing.

**GlobusFileAssimilator** handles Globus transfers. It supports Globus-to-Globus transfers (via the Globus transfer API with submission IDs and polling) and Globus-to-local downloads (via HTTPS). Authentication tokens are passed through `ctx.src_token`.

## AssimilationCtx

`AssimilationCtx` is the serializable descriptor for a single data transfer:

```cpp
struct AssimilationCtx {
  std::string src;         // Source URI (e.g., "file::/path/to/file")
  std::string dst;         // Destination URI (e.g., "iowarp::tag_name")
  std::string format;      // Data format ("binary", "hdf5")
  std::string depends_on;  // Dependency on another transfer (empty = none)
  size_t range_off;        // Byte offset for partial reads
  size_t range_size;       // Byte count (0 = entire file)
  std::string src_token;   // Source authentication token
  std::string dst_token;   // Destination authentication token
  std::vector<std::string> include_patterns;  // Glob patterns to include
  std::vector<std::string> exclude_patterns;  // Glob patterns to exclude
};
```

Serialization uses the [cereal](https://uscilab.github.io/cereal/) library with binary archives. The client serializes a `std::vector<AssimilationCtx>` into the `ParseOmniTask`, and the runtime deserializes it on the server side.

## Method IDs

Defined in `chimaera_mod.yaml`:

| Method | ID | Description |
|--------|----|-------------|
| `kCreate` | 0 | Container creation |
| `kDestroy` | 1 | Container destruction |
| `kMonitor` | 9 | Container state monitoring |
| `kParseOmni` | 10 | Parse OMNI YAML and schedule transfers |
| `kProcessHdf5Dataset` | 11 | Process a single HDF5 dataset (distributed) |

## Execution Lifecycle

### 1. Client Initialization

```cpp
#include <wrp_cae/core/core_client.h>

// Initialize the global CAE client singleton
// This also initializes the CTE client internally
WRP_CAE_CLIENT_INIT(config_path);

// Access the client via macro
auto* client = WRP_CAE_CLIENT;
```

`WRP_CAE_CLIENT_INIT` creates the CAE container pool via `AsyncCreate`, which triggers `Runtime::Create` on the server side. The runtime initializes its internal CTE client using `wrp_cte::core::kCtePoolId`.

### 2. Load and Parse OMNI File

The typical entry point is the `wrp_cae_omni` utility:

```bash
wrp_cae_omni /path/to/transfers.yaml
```

Programmatically:

```cpp
// Load OMNI YAML into AssimilationCtx vector
auto contexts = LoadOmni("/path/to/transfers.yaml");

// Submit to CAE runtime
auto future = client->AsyncParseOmni(contexts);
future.Wait();
```

### 3. Runtime Processes Transfers

`Runtime::ParseOmni` executes on a Chimaera worker thread as a coroutine:

1. **Deserialize** the `std::vector<AssimilationCtx>` from the task's binary payload
2. **Create** an `AssimilatorFactory` with the CTE client
3. **For each context:**
   a. Call `factory.Get(ctx.src)` to obtain the correct assimilator
   b. `co_await assimilator->Schedule(ctx, error_code)` to execute the transfer
   c. The assimilator reads data from the source and writes CTE blobs asynchronously
4. **Return** `result_code_`, `error_message_`, and `num_tasks_scheduled_`

### 4. Distributed HDF5 Processing

For HDF5 files with many datasets, the CAE can distribute dataset processing across nodes:

```cpp
auto future = client->AsyncProcessHdf5Dataset(
    chi::PoolQuery::Physical(node_id),  // Route to specific node
    "/path/to/file.h5",
    "/dataset/path",
    "tag_prefix");
```

`Runtime::ProcessHdf5Dataset` opens the HDF5 file, creates an `Hdf5FileAssimilator`, and calls `ProcessDataset()` for the specified dataset.

### 5. Coroutine Execution Model

All runtime methods are C++20 coroutines returning `chi::TaskResume`.
When an assimilator needs to perform an async CTE operation (e.g., put a blob), it uses `co_await` to suspend execution. The Chimaera scheduler resumes the coroutine when the CTE operation completes, allowing the worker thread to process other tasks while waiting.

## Client API Reference

### `AsyncCreate`

```cpp
chi::Future AsyncCreate(
    const chi::PoolQuery& pool_query,
    const std::string& pool_name,
    const chi::PoolId& custom_pool_id,
    const CreateParams& params = CreateParams());
```

Creates the CAE container pool. Submitted to the admin pool for `GetOrCreatePool` processing.

### `AsyncParseOmni`

```cpp
chi::Future AsyncParseOmni(
    const std::vector<AssimilationCtx>& contexts);
```

Serializes the contexts vector and submits a `ParseOmniTask` to the CAE runtime. The task is routed locally (`PoolQuery::Local()`).

### `AsyncProcessHdf5Dataset`

```cpp
chi::Future AsyncProcessHdf5Dataset(
    const chi::PoolQuery& pool_query,
    const std::string& file_path,
    const std::string& dataset_path,
    const std::string& tag_prefix);
```

Processes a single HDF5 dataset. Use `PoolQuery::Physical(node_id)` to route to a specific node for distributed processing.

## Adding a New Assimilator

To add support for a new data source protocol:

1. **Create a header** in `core/include/wrp_cae/core/factory/`:

```cpp
class MyAssimilator : public BaseAssimilator {
 public:
  explicit MyAssimilator(std::shared_ptr<wrp_cte::core::Client> cte_client);
  chi::TaskResume Schedule(const AssimilationCtx& ctx,
                           int& error_code) override;

 private:
  std::shared_ptr<wrp_cte::core::Client> cte_client_;
};
```

2. **Implement `Schedule`** in `core/src/factory/`. Use `co_await` for async CTE operations. Set `error_code = 0` on success.

3. **Register in the factory** (`assimilator_factory.cc`):

```cpp
} else if (protocol == "myproto") {
  return std::make_unique<MyAssimilator>(cte_client_);
}
```

4. **Add build guards** if the assimilator has optional dependencies (e.g., `#ifdef MY_ENABLE_FLAG`).

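Putting steps 1-4 together, a new `Schedule` implementation typically has the following shape. This is a skeleton only: the chunked read loop and the CTE put calls are elided, since their exact API is defined by the CTE client rather than by this guide, and `myproto` is a hypothetical protocol name:

```cpp
chi::TaskResume MyAssimilator::Schedule(const AssimilationCtx& ctx,
                                        int& error_code) {
  // 1. Extract the source path from ctx.src ("myproto::<path>" style).
  // 2. Read the source in chunks, honoring ctx.range_off / ctx.range_size.
  // 3. co_await the CTE client's async put for each chunk.
  // 4. Report the outcome through the output parameter (0 = success).
  error_code = 0;
  co_return;
}
```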
## Build Configuration

| CMake Option | Default | Description |
|-------------|---------|-------------|
| `WRP_CORE_ENABLE_HDF5` | OFF | Enable HDF5 assimilator (requires libhdf5) |
| `CAE_ENABLE_GLOBUS` | OFF | Enable Globus assimilator (requires POCO) |

## Related Documentation

- [OMNI File Format](omni.md) - YAML configuration for data transfers
- [Module Development Guide](../context-runtime/2.module_dev_guide.md) - ChiMod development
- [CTE Documentation](../context-transfer-engine/cte.md) - CTE storage documentation