diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index b132c47..f6d4011 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml @@ -1,6 +1,9 @@ include: project: rationai/digital-pathology/templates/ci-templates - file: Python-Lint.gitlab-ci.yml + file: + - Python-Lint.gitlab-ci.yml + - MkDocs.gitlab-ci.yml stages: - lint + - deploy diff --git a/README.md b/README.md index f6ea4ce..66f9fd6 100644 --- a/README.md +++ b/README.md @@ -1,93 +1,110 @@ # Model Service +Model deployment infrastructure for RationAI using Ray Serve on Kubernetes. +This repository contains: -## Getting started +- A KubeRay `RayService` manifest (`ray-service.yaml`) for deploying Ray Serve on Kubernetes. +- Model implementations under `models/` (reference: `models/binary_classifier.py`). +- Documentation under `docs/` (MkDocs). -To make it easy for you to get started with GitLab, here's a list of recommended next steps. +## Documentation -Already a pro? Just edit this README.md and make it your own. Want to make it easy? [Use the template at the bottom](#editing-this-readme)! +- MkDocs content: `docs/` +- Key pages: + - `docs/get-started/quick-start.md` + - `docs/guides/deployment-guide.md` + - `docs/guides/adding-models.md` + - `docs/guides/configuration-reference.md` + - `docs/guides/troubleshooting.md` + - `docs/architecture/overview.md` + - `docs/architecture/request-lifecycle.md` + - `docs/architecture/queues-and-backpressure.md` + - `docs/architecture/batching.md` -## Add your files +## Quick Start (Kubernetes) -- [ ] [Create](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#create-a-file) or [upload](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#upload-a-file) files -- [ ] [Add files using the command line](https://docs.gitlab.com/topics/git/add_files/#add-files-to-a-git-repository) or push an existing Git repository with the following command: +Full walkthrough: `docs/get-started/quick-start.md`. 
-``` -cd existing_repo -git remote add origin https://gitlab.ics.muni.cz/rationai/infrastructure/model-service2.git -git branch -M master -git push -uf origin master -``` +### Prerequisites -## Integrate with your tools +- Kubernetes cluster with KubeRay operator installed +- `kubectl` configured for the cluster -- [ ] [Set up project integrations](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service2/-/settings/integrations) +### Deploy -## Collaborate with your team +```bash +kubectl apply -f ray-service.yaml -n [namespace] +kubectl get rayservice rayservice-models -n [namespace] +``` -- [ ] [Invite team members and collaborators](https://docs.gitlab.com/ee/user/project/members/) -- [ ] [Create a new merge request](https://docs.gitlab.com/ee/user/project/merge_requests/creating_merge_requests.html) -- [ ] [Automatically close issues from merge requests](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#closing-issues-automatically) -- [ ] [Enable merge request approvals](https://docs.gitlab.com/ee/user/project/merge_requests/approvals/) -- [ ] [Set auto-merge](https://docs.gitlab.com/user/project/merge_requests/auto_merge/) +### Access locally -## Test and Deploy +```bash +kubectl port-forward -n [namespace] svc/rayservice-models-serve-svc 8000:8000 +``` -Use the built-in continuous integration in GitLab. 
+### Test the reference model (`BinaryClassifier`) -- [ ] [Get started with GitLab CI/CD](https://docs.gitlab.com/ee/ci/quick_start/) -- [ ] [Analyze your code for known vulnerabilities with Static Application Security Testing (SAST)](https://docs.gitlab.com/ee/user/application_security/sast/) -- [ ] [Deploy to Kubernetes, Amazon EC2, or Amazon ECS using Auto Deploy](https://docs.gitlab.com/ee/topics/autodevops/requirements.html) -- [ ] [Use pull-based deployments for improved Kubernetes management](https://docs.gitlab.com/ee/user/clusters/agent/) -- [ ] [Set up protected environments](https://docs.gitlab.com/ee/ci/environments/protected_environments.html) +The reference deployment in `ray-service.yaml` exposes an app at the route prefix: -*** +- `/prostate-classifier-1` -# Editing this README +`models/binary_classifier.py` expects a **request body that is LZ4-compressed raw bytes** of a single RGB tile: -When you're ready to make this README your own, just edit this file and use the handy template below (or feel free to structure it however you want - this is just a starting point!). Thanks to [makeareadme.com](https://www.makeareadme.com/) for this template. +- dtype: `uint8` +- shape: `(tile_size, tile_size, 3)` +- byte order: row-major (NumPy default) -## Suggestions for a good README +Example (Python): -Every project is different, so consider which of these sections apply to yours. The sections used in the template are suggestions for most open source projects. Also keep in mind that while a README can be too long and detailed, too long is better than too short. If you think your README is too long, consider utilizing another form of documentation rather than cutting out information. +```bash +pip install numpy lz4 requests +``` -## Name -Choose a self-explaining name for your project. +```python +import lz4.frame +import numpy as np +import requests -## Description -Let people know what your project can do specifically. 
Provide context and add a link to any reference visitors might be unfamiliar with. A list of Features or a Background subsection can also be added here. If there are alternatives to your project, this is a good place to list differentiating factors. +tile_size = 512 # must match RayService user_config.tile_size +tile = np.zeros((tile_size, tile_size, 3), dtype=np.uint8) -## Badges -On some READMEs, you may see small images that convey metadata, such as whether or not all the tests are passing for the project. You can use Shields to add some to your README. Many services also have instructions for adding a badge. +payload = lz4.frame.compress(tile.tobytes()) -## Visuals -Depending on what you are making, it can be a good idea to include screenshots or even a video (you'll frequently see GIFs rather than actual videos). Tools like ttygif can help, but check out Asciinema for a more sophisticated method. +resp = requests.post( + "http://localhost:8000/prostate-classifier-1/", + data=payload, + headers={"Content-Type": "application/octet-stream"}, + timeout=60, +) +resp.raise_for_status() +print(resp.json() if resp.headers.get("content-type", "").startswith("application/json") else resp.text) +``` -## Installation -Within a particular ecosystem, there may be a common way of installing things, such as using Yarn, NuGet, or Homebrew. However, consider the possibility that whoever is reading your README is a novice and would like more guidance. Listing specific steps helps remove ambiguity and gets people to using your project as quickly as possible. If it only runs in a specific context like a particular programming language version or operating system or has dependencies that have to be installed manually, also add a Requirements subsection. +## Repository Structure -## Usage -Use examples liberally, and show the expected output if you can. 
It's helpful to have inline the smallest example of usage that you can demonstrate, while providing links to more sophisticated examples if they are too long to reasonably include in the README. +``` +model-service/ +├── models/ # Model implementations +│ └── binary_classifier.py +├── providers/ # Model loading providers +│ └── model_provider.py +├── docs/ # Documentation +├── ray-service.yaml # Kubernetes RayService configuration +├── pyproject.toml # Python dependencies +└── README.md +``` ## Support -Tell people where they can go to for help. It can be any combination of an issue tracker, a chat room, an email address, etc. - -## Roadmap -If you have ideas for releases in the future, it is a good idea to list them in the README. -## Contributing -State if you are open to contributions and what your requirements are for accepting them. +- **Issues:** Report bugs or request features via [GitLab Issues](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/issues) +- **Contact:** RationAI team at Masaryk University -For people who want to make changes to your project, it's helpful to have some documentation on how to get started. Perhaps there is a script that they should run or some environment variables that they need to set. Make these steps explicit. These instructions could also be useful to your future self. - -You can also document commands to lint the code or run tests. These steps help to ensure high code quality and reduce the likelihood that the changes inadvertently break something. Having instructions for running tests is especially helpful if it requires external setup, such as starting a Selenium server for testing in a browser. +## License -## Authors and acknowledgment -Show your appreciation to those who have contributed to the project. +This project is part of the RationAI infrastructure and is available for use by authorized members of the RationAI group. -## License -For open source projects, say how it is licensed. 
+## Authors -## Project status -If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers. +Developed and maintained by the RationAI team at Masaryk University, Faculty of Informatics. diff --git a/docs/architecture/batching.md b/docs/architecture/batching.md new file mode 100644 index 0000000..b52ce0d --- /dev/null +++ b/docs/architecture/batching.md @@ -0,0 +1,82 @@ +# Batching (How It Works Under the Hood) + +Batching in Ray Serve is a **replica-local request coalescing** mechanism. + +It improves throughput when your model can process multiple inputs more efficiently together (common for GPU inference). + +## Where batching happens + +Batching happens **inside each replica process**. + +Requests only become eligible for batching after they: + +1. enter through the proxy and handle queueing/backpressure, and +2. get routed to a specific replica + +See also: **[Request lifecycle](request-lifecycle.md)**. + +## The API surface (what you configure) + +In user code, batching is enabled by decorating an **async** method with `@serve.batch`: + +- `max_batch_size`: upper bound for how many requests are grouped into one batch execution +- `batch_wait_timeout_s`: maximum time to wait (since the first queued item) before flushing a smaller batch + +Serve expects the batched handler to return **one result per input** (same batch length, same order). + +## What Serve actually does internally + +Conceptually, each replica maintains an internal structure like: + +- an in-memory buffer of pending calls +- a background “flush” loop that decides when to execute a batch +- per-request futures/promises that get completed when the batch finishes + +### 1. 
Collection phase (buffering) + +Incoming requests that hit the batched method are appended to a replica-local buffer. + +Each buffered entry stores: + +- the request arguments (or decoded payload) +- a future representing that request’s eventual response + +### 2. Flush conditions (size or time) + +The buffer is flushed when either condition becomes true: + +- **Size trigger**: buffer length reaches `max_batch_size` +- **Time trigger**: `batch_wait_timeout_s` elapses since the **first** item currently in the buffer + +This is why batching can increase latency at low QPS: a request may wait up to `batch_wait_timeout_s` for more arrivals. + +### 3. Execution phase (single call) + +Serve invokes your batched handler **once** with a list of inputs. + +This is where you typically vectorize: + +- stack/concat tensors +- run one forward pass +- split/scatter outputs back + +### 4. Scatter phase (complete futures) + +When the batched handler returns a list of outputs, Serve resolves the stored futures in order. + +Each original HTTP request then completes independently with its corresponding output. + +## Configuration & Tuning + +For a deep dive into how batching interacts with concurrency limits (specifically why `max_ongoing_requests` must be larger than `max_batch_size`), see **[Queues and backpressure](queues-and-backpressure.md)**. + +Quick tips: + +- Increase `max_batch_size` if the model benefits from larger batches and you have headroom. +- Increase `batch_wait_timeout_s` to favor fuller batches; decrease it to favor latency. 
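The collection, flush, and scatter phases above can be sketched as a small self-contained asyncio batcher. This is an illustrative toy (`ToyBatcher` is not Ray Serve's actual implementation; Serve's real buffer lives behind `@serve.batch`), but the size and time triggers work the same way:

```python
import asyncio

class ToyBatcher:
    """Toy sketch of a replica-local batching buffer (illustrative, not Serve's code)."""

    def __init__(self, handler, max_batch_size=4, batch_wait_timeout_s=0.05):
        self._handler = handler    # async fn: list of inputs -> list of outputs
        self._max = max_batch_size
        self._timeout = batch_wait_timeout_s
        self._buffer = []          # pending (input, future) pairs
        self._timer = None         # time trigger for the current buffer

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        fut = loop.create_future()           # per-request future (collection phase)
        self._buffer.append((item, fut))
        if len(self._buffer) >= self._max:
            self._flush()                    # size trigger
        elif self._timer is None:
            # Time trigger counts from the FIRST item currently in the buffer.
            self._timer = loop.call_later(self._timeout, self._flush)
        return await fut

    def _flush(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._buffer = self._buffer, []
        if batch:
            asyncio.ensure_future(self._run(batch))

    async def _run(self, batch):
        inputs = [item for item, _ in batch]
        outputs = await self._handler(inputs)      # execution phase: ONE call
        for (_, fut), out in zip(batch, outputs):  # scatter phase: resolve in order
            fut.set_result(out)

async def demo():
    async def double_batch(xs):
        return [x * 2 for x in xs]

    batcher = ToyBatcher(double_batch, max_batch_size=3, batch_wait_timeout_s=0.01)
    # First three submits flush on the size trigger; the last two wait for the timer.
    return await asyncio.gather(*(batcher.submit(i) for i in range(5)))

print(asyncio.run(demo()))  # [0, 2, 4, 6, 8]
```

Note how the last two requests in `demo()` pay up to `batch_wait_timeout_s` of extra latency — the same low-QPS trade-off described above.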
+ +## Next + +- Request flow including queue points: [Request lifecycle](request-lifecycle.md) +- Queueing and rejection controls: [Queues and backpressure](queues-and-backpressure.md) +- “Knobs” reference and meanings: [Configuration reference](../guides/configuration-reference.md) diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md new file mode 100644 index 0000000..254576e --- /dev/null +++ b/docs/architecture/overview.md @@ -0,0 +1,159 @@ +# Architecture Overview + +This section provides a structured overview of Model Service's architecture. + +If you are new to the project, start here and then follow the links to the deeper pages. + +## System Architecture + +Model Service is built on Kubernetes + KubeRay + Ray Serve: + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ Head Node │ +│ │ +│ ┌──────────────┐ ┌───────────────────┐ │ +│ │ Controller │◄───────────────────┤ HTTP Proxy │◄──── Client Request +│ │ (Autoscaler) │ Update Config │ (Ingress) │ │ +│ └──────┬───────┘ └─────────┬─────────┘ │ +│ │ │ │ +└───────────┼──────────────────────────────────────┼───────────────┘ + │ Manage │ Route + ▼ ▼ +┌──────────────────────────────────────────────────────────────────┐ +│ Worker Nodes │ +│ │ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ Application 1 │ │ +│ │ ┌──────────────────────┐ ┌──────────────────────┐ │ │ +│ │ │ Deployment A │ │ Deployment B │ │ │ +│ │ │ ┌────────┐ ┌────────┐│ │ ┌────────┐ ┌────────┐│ │ │ +│ │ │ │Replica │ │Replica ││ │ │Replica │ │Replica ││ │ │ +│ │ │ └────────┘ └────────┘│ │ └────────┘ └────────┘│ │ │ +│ │ └──────────────────────┘ └──────────────────────┘ │ │ +│ └────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ Application 2 │ │ +│ │ ... 
│ │ +│ └────────────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────┘ +``` + +## Core Concepts & Hierarchy + +The client's request flows through several layers of the system: + +HTTP Proxy → Head Node → Worker Node → Application → Deployment → Replica. + +The main components are: + +1. **Ray Service (The Platform)**: The Kubernetes Custom Resource (CR) that defines the entire Ray cluster and the Serve application(s) running on top of it. +2. **Ray Cluster**: The physical set of Kubernetes pods, consisting of a **Head Node** and multiple **Worker Nodes**. +3. **Infrastructure Actors**: + - **Controller**: Manages the control plane, API calls, and autoscaling (does not handle requests). + - **HTTP Proxy**: Ingress point that routes requests to applications. +4. **Serve Application (The Service Boundary)**: A standalone version of your code, including all its deployments and logic. Defined by an import path (e.g., `models.binary_classifier:app`). +5. **Serve Deployment (The Functional Unit)**: A managed group of replicas. It defines scaling rules (`num_replicas`, `num_cpus`) and versioning. +6. **Replica (The Execution Unit)**: A single Ray actor process running the deployment code inside a Worker Node. + +### Serve application vs Serve deployment + +- **Application**: deployable service boundary (routing, code entrypoint, runtime env). +- **Deployment**: scaling unit (replicas), concurrency/queue limits, and resource options. + +### Internal Mechanisms + +For detailed information on how batching works, including the configuration API and internal buffering mechanisms, see [Batching](batching.md). + +For request lifecycle and queueing details, see [Request Lifecycle](request-lifecycle.md) and [Queues and Backpressure](queues-and-backpressure.md). 
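The Application → Deployment → Replica hierarchy described above maps directly onto the `serveConfigV2` section of the RayService CR. A hedged fragment (deployment options and values here are illustrative, not the repository's exact configuration):

```yaml
# Illustrative serveConfigV2 fragment — names/values are examples only.
applications:
  - name: prostate-classifier-1            # Serve Application (service boundary)
    import_path: models.binary_classifier:app
    route_prefix: /prostate-classifier-1
    deployments:
      - name: BinaryClassifier             # Serve Deployment (scaling unit)
        num_replicas: 2                    # -> two Replica actors (execution units)
        ray_actor_options:
          num_cpus: 1                      # logical resources per replica
```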
+
+## Scaling Architecture
+
+### Horizontal Scaling (Replicas)
+
+Models scale horizontally by adding/removing replicas:
+
+```
+Load: ████████░░ (80%)
+Replicas: [R1] [R2] [R3]
+
+Load: ████████████████ (160%)
+Replicas: [R1] [R2] [R3] [R4] [R5] [R6]
+```
+
+**Autoscaling Triggers:**
+
+- `target_ongoing_requests`: Target requests per replica
+- Scale up when: requests > (replicas × target)
+- Scale down when: requests < (replicas × target)
+
+### Cluster Scaling (Worker Pods)
+
+The Ray cluster itself also scales horizontally, at the node level, by adding/removing worker pods:
+
+```yaml
+workerGroupSpecs:
+  - groupName: cpu-workers
+    minReplicas: 0
+    maxReplicas: 4
+```
+
+### Resource Sizing (Pods vs Replicas)
+
+It is important to distinguish between **Kubernetes Resources** (Pods) and **Ray Resources** (Replicas).
+
+- **Replica Sizing (`ray_actor_options`)**: Defines how much logical resource one model copy needs (e.g., `num_cpus: 1`).
+- **Pod Sizing (`resources.limits`)**: Defines how big the physical container is.
+
+**Rule of Thumb**: Ensure your Pods are large enough to fit at least one (or N) replicas plus overhead (Python runtime, Object Store).
+i.e., `Pod CPU >= Replicas × num_cpus + Overhead`.
+
+## Autoscaling Architecture
+
+The Ray Serve Autoscaler runs inside the **Controller** actor and manages the number of replicas dynamically.
+
+1. **Metrics Collection**: Replicas and DeploymentHandles push metrics (queue size, active queries) to the Controller.
+2. **Decision Making**: The Autoscaler periodically checks these metrics against targets (like `target_ongoing_requests`).
+3. **Scaling Action**: The Controller adds or removes Replica actors to meet demand.
+
+## Fault Tolerance
+
+Ray Serve is designed to be resilient to failures:
+
+- **Replica Failure**: If a Replica actor crashes, the Controller detects it and starts a new one to replace it. Request routing automatically updates.
+- **Proxy Failure**: If the Proxy actor fails, the Controller restarts it.

+- **Controller Failure**: If the Controller itself fails, Ray (via GCS) restarts it. Autoscaling pauses during downtime but resumes upon recovery. +- **Node Failure**: KubeRay (managing the cluster) detects node failures and provisions new pods. Ray Serve then eventually schedules actors on the new nodes. + +## Design Principles + +1. **Declarative Configuration**: Infrastructure defined in YAML, managed by GitOps (`RayService` CR). +2. **Separation of Concerns**: Model Code (Python), Infrastructure (K8s), Configuration (User Config). +3. **Elastic Scaling**: Scale to zero when idle, scale up on demand. +4. **Developer Experience**: Simple model implementation, easy local testing. + +## Metrics & Debugging + +Common commands: + +```bash +kubectl get pods -n [namespace] +kubectl top pods -n [namespace] +kubectl logs -n [namespace] +kubectl describe rayservice -n [namespace] +``` + +Ray can export Prometheus metrics (when metrics collection/export is enabled): + +- Request latency +- Request throughput +- Replica count +- Resource usage + +## Next Steps + +- [Request lifecycle](request-lifecycle.md) +- [Deployment guide](../guides/deployment-guide.md) +- [Configuration reference](../guides/configuration-reference.md) +- [Adding new models](../guides/adding-models.md) diff --git a/docs/architecture/queues-and-backpressure.md b/docs/architecture/queues-and-backpressure.md new file mode 100644 index 0000000..965a482 --- /dev/null +++ b/docs/architecture/queues-and-backpressure.md @@ -0,0 +1,57 @@ +# Queues and Backpressure + +To maintain stability and prevent overload, Ray Serve implements queueing mechanisms at multiple levels. Understanding these queues is critical for tuning latency and handling load spikes. + +## Simplified Queue Model + +There are two main places a request can wait: + +1. **Proxy Handle Queue**: Waiting to be assigned to a replica. +2. **Replica Execution Queue**: Assigned to a replica, waiting for execution (or batching). + +## 1. 
Proxy-Side Queue (`max_queued_requests`)
+
+When a request arrives at the HTTP Proxy (or via a Deployment Handle), it is routed to a logical deployment. If all of that deployment's replicas are busy, the request waits in a queue managed by the proxy/handle.
+
+- **Config**: `max_queued_requests` (in the deployment spec)
+- **Behavior**:
+  - Controls the maximum number of requests allowed to wait for assignment.
+  - If the queue is full, new requests are immediately rejected with a **503 Service Unavailable** error (or a `BackPressureError` for Python handle calls).
+
+### Why limit this?
+
+Without a limit, a system under heavy load might accept requests until it runs out of memory or latency becomes unacceptable. Fail-fast behavior is often preferred over unbounded waiting.
+
+## 2. Replica-Side Queue (`max_ongoing_requests`)
+
+Once a request is assigned to a specific replica, it counts as "ongoing" for that replica.
+
+- **Config**: `max_ongoing_requests` (in the deployment spec)
+- **Behavior**:
+  - Limits how many concurrent requests a single replica can process _or_ have buffered.
+  - If a replica is at its limit, the proxy considers it "busy" and will not assign new requests to it (they will wait in the Proxy Queue instead).
+
+### Usage with Batching
+
+If you use `@serve.batch`, requests sitting in the [batching buffer](batching.md) count towards `max_ongoing_requests`.
+
+- **Warning**: If `max_ongoing_requests` is set too low (e.g., lower than `max_batch_size`), you might throttle your own batching mechanism because the replica will never accept enough requests to fill a batch.
+
+## Backpressure flow
+
+1. **Client** sends a request.
+2. **HTTP Proxy** receives it.
+3. **Check Replica capacity**: Are there replicas with `ongoing_requests < max_ongoing_requests`?
+   - **Yes**: Forward request to one of them.
+   - **No**: Enqueue request in the Proxy Queue.
+4. **Check Proxy Queue capacity**: Is `current_queue_size < max_queued_requests`?
+   - **Yes**: Request waits.
+   - **No**: Reject request immediately (Fail).
+
+## Tuning Guidelines
+
+| Scenario                       | Recommendation                                                                                            |
+| :----------------------------- | :-------------------------------------------------------------------------------------------------------- |
+| **High Throughput / Batching** | Increase `max_ongoing_requests` to ensure replicas can buffer enough work to form full batches.           |
+| **Latency Sensitive**          | Decrease `max_queued_requests` to fail fast rather than returning slow responses after a long queue wait. |
+| **Memory Constrained**         | Lower both values to prevent OOM errors by limiting the number of incomplete requests in system memory.   |
diff --git a/docs/architecture/request-lifecycle.md b/docs/architecture/request-lifecycle.md
new file mode 100644
index 0000000..4c6292c
--- /dev/null
+++ b/docs/architecture/request-lifecycle.md
@@ -0,0 +1,186 @@
+# Request Lifecycle in Detail
+
+This document traces the path of a single inference request through the Model Service stack, from the external HTTP client down to the Ray Core task execution.
+
+It also highlights **where requests queue** and which settings control queueing vs rejection.
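Two settings that recur throughout this page — `max_ongoing_requests` (replica-side concurrency cap) and `max_queued_requests` (caller-side queue bound) — are configured per deployment. A hedged `serveConfigV2` fragment (deployment name and values are illustrative):

```yaml
deployments:
  - name: BinaryClassifier
    max_ongoing_requests: 16   # per-replica cap on in-flight work
    max_queued_requests: 64    # caller-side (proxy handle) queue bound; overflow is rejected
```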
+ +## High-Level Flow + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ External Client │ +│ │ +│ HTTP Request │ +│ │ │ +│ ▼ │ +│ ┌───────────────┐ │ +│ │ K8s Service │ │ +│ │ / Ingress │ │ +│ └───────┬───────┘ │ +│ │ Route │ +└──────────┼───────────────────────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────────────┐ +│ Head / Worker Nodes │ +│ │ +│ ┌───────────────────────────────┐ │ +│ │ HTTP Proxy Actor │ │ +│ │ (ServeHTTPProxy) │ │ +│ └───────────────┬───────────────┘ │ +│ │ create DeploymentHandle │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Deployment Handle │ │ +│ │ (Client-side queue) │ │ +│ └───────────────┬───────────────┘ │ +│ │ enqueue / backpressure │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Serve Router │ │ +│ │ (Replica selection) │ │ +│ └───────────────┬───────────────┘ │ +│ │ PushTask RPC │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Replica Actor │ │ +│ │ (Deployment instance) │ │ +│ └───────────────┬───────────────┘ │ +│ │ execute │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Ray Worker Process │ │ +│ └───────────────┬───────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ User Model Code │ │ +│ │ (Inference / Logic) │ │ +│ └───────────────┬───────────────┘ │ +│ │ result │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Plasma Object Store │◄── large objects │ +│ └───────────────┬───────────────┘ │ +│ │ ObjectRef / inline │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Replica Actor │ │ +│ └───────────────┬───────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ HTTP Proxy Actor │ │ +│ └───────────────┬───────────────┘ │ +│ │ HTTP Response │ +└──────────────────┼───────────────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────────────┐ +│ External Client │ +│ 200 OK Response │ 
+└──────────────────────────────────────────────────────────────────────┘ +``` + +## Step-by-Step Breakdown + +### 1. Ingress (HTTP Proxy) + +**Component**: `ServeHTTPProxy` actor (running on Head or Worker nodes). + +1. **Receive**: The request hits the Uvicorn server running inside the Proxy actor. +2. **Route Matching**: The proxy inspects the URL path to match it against active **Applications** and their **Ingress Deployments**. +3. **Handle Creation**: The proxy uses a `DeploymentHandle` to forward the request. It does **not** send the request directly to a replica yet. + +### 2. Queueing & Backpressure (Deployment Handle) + +**Component**: `DeploymentHandle` (client-side in the Proxy). + +The request enters a **Handle Queue** managed by the caller (the Proxy). + +- **Assignment**: The handle checks for available slots in the target Deployment. +- **Backpressure**: If replicas are saturated (`max_ongoing_requests`), the request stays in this queue instead of being pushed to a replica. +- **Rejection**: If the handle queue grows beyond `max_queued_requests`, the request is rejected with an overload-style error (client-visible backpressure). + +**Where this queue lives**: inside the process that is making the call (here: the HTTP Proxy). It is not a replica-local queue. + +### 3. Replica Assignment (Ray Core) + +**Component**: `ServeRouter` & `Ray Core`. + +When a slot is available: + +1. **Routing**: The router selects a specific Replica actor ID based on the policy (e.g., `PowerOfTwoChoices`). +2. **RPC**: The request is serialized and sent via Ray's internal gRPC protocol to the selected actor. + +> **Under the Hood: Ray Task Lifecycle** +> +> - **Submission**: The router behaves like a Ray Core driver submitting a task. +> - **Worker Lease**: Ray guarantees the actor exists. If the actor had crashed, the Ray Controller would have already requested a new worker lease from the **Raylet** to restart it. 
+> - **PushTask**: The `PushTask` RPC carries the request data. + +### 4. Execution (Worker & Replica) + +**Component**: `RayWorker` process. + +1. **Receive**: The Worker process hosting the Replica actor receives the message. +2. **Deserialization**: + - **Small Data**: Unpickled directly from the message. + - **Large Data**: If the request payload is large, it may be retrieved from the **Plasma Object Store** (shared memory). +3. **Asyncio Loop**: The request enters the actor's entrypoint (usually `__call__`). +4. **Replica Concurrency Limit**: The replica will not run more than `max_ongoing_requests` concurrently. Requests beyond that should not be dispatched to this replica; instead they remain queued at the caller-side handle. +5. **Batching** (Optional): If `@serve.batch` is used, the request may wait in a replica-local batching buffer until either `max_batch_size` is reached or `batch_wait_timeout_s` expires (see [Batching](batching.md)). +6. **Inference**: The model code runs (e.g. `model.predict(input)`). + +### 5. Response & Return + +**Component**: Shared Memory & Network. + +1. **Completion**: The function returns a result. +2. **Storage**: + - **Small Result**: Sent back directly in the RPC response. + - **Large Result**: Stored in the local Plasma Store; only an `ObjectRef` is returned. +3. **Forwarding**: The HTTP Proxy waits for the result (resolving the `ObjectRef` if necessary) and writes the HTTP response body. +4. **Client**: The client receives the `200 OK`. + +## Where queues are handled (and where requests get rejected) + +Ray Serve has multiple queue-like stages. They serve different purposes and are controlled by different knobs. + +For deep-dive explanation and tuning advice, see **[Queues and Backpressure](queues-and-backpressure.md)**. + +### 1. Proxy-side “handle queue” (caller-side) + +When an HTTP request hits Ray Serve, the proxy forwards it through a `DeploymentHandle`. 
+That handle maintains a **caller-side queue** of requests waiting to be assigned to a replica. + +This is where `max_queued_requests` applies. + +- If replicas are busy (because of per-replica concurrency limits), the request waits here. +- If the queue grows beyond `max_queued_requests`, the request is rejected (client-visible backpressure). + +### 2. Routing / replica selection + +Once a request can be dispatched, Ray Serve selects a replica. + +This stage is not intended to be a long-term queue - it is primarily where the system decides _which_ replica gets the request next. + +### 3. Replica concurrency slots (“ongoing requests”) + +Each replica enforces a cap on concurrent in-flight work via `max_ongoing_requests`. + +- If a replica already has `max_ongoing_requests` in progress, new work should not be scheduled onto it. +- “Ongoing” includes requests that are actively executing _or_ are awaiting completion (e.g., waiting for I/O or for a batch to flush). + +### 4. Replica-local batching buffer (optional) + +If you use `@serve.batch`, requests assigned to the replica can enter a **batching buffer** inside the replica. + +This buffer is flushed when either: + +- it reaches `max_batch_size`, or +- `batch_wait_timeout_s` elapses since the first buffered request + +This buffer is not controlled by `max_queued_requests` (that limit is caller-side). + +**[Queues and backpressure](queues-and-backpressure.md)** explains specifically how `max_ongoing_requests` and `max_queued_requests` interact. diff --git a/docs/get-started/quick-start.md b/docs/get-started/quick-start.md new file mode 100644 index 0000000..1abc779 --- /dev/null +++ b/docs/get-started/quick-start.md @@ -0,0 +1,100 @@ +# Quick Start + +This guide will help you deploy your first model using Model Service in just a few minutes. 
+ +## Prerequisites + +Before you begin, ensure you have: + +- Access to a Kubernetes cluster with KubeRay operator installed +- `kubectl` configured to access your cluster +- Basic familiarity with Kubernetes concepts + +Don't have KubeRay installed? +See the [Installation Guide](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/kuberay-operator-installation.html) for instructions on setting up KubeRay. + +## Step 1: Clone the Repository + +```bash +git clone https://gitlab.ics.muni.cz/rationai/infrastructure/model-service.git +cd model-service +``` + +## Step 2: Review the Configuration + +The repository includes a sample RayService configuration in `ray-service.yaml`. This deploys a binary classifier model for prostate tissue analysis. + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-models +spec: + serveConfigV2: | + applications: + - name: prostate-classifier-1 + import_path: models.binary_classifier:app + route_prefix: /prostate-classifier-1 + # ... configuration continues +``` + +## Step 3: Deploy the Service + +Apply the RayService configuration to your cluster. 
+
+Replace `[namespace]` with the target namespace (e.g., `rationai-notebooks-ns` or `rationai-jobs-ns`):
+
+```bash
+kubectl apply -f ray-service.yaml -n [namespace]
+```
+
+## Step 4: Monitor Deployment
+
+Check the deployment status:
+
+```bash
+# Check RayService status
+kubectl get rayservice rayservice-models -n [namespace]
+
+# Check Ray cluster pods
+kubectl get pods -n [namespace]
+```
+
+If the RayService is not becoming ready, inspect events and status:
+
+```bash
+kubectl describe rayservice rayservice-models -n [namespace]
+```
+
+## Step 5: Access the Service Locally
+
+Once deployed, you can port-forward the service to access it locally:
+
+```bash
+# Port-forward to access the service locally
+kubectl port-forward -n [namespace] svc/rayservice-models-serve-svc 8000:8000
+```
+
+## Step 6: Delete the Deployment
+
+To delete the deployed RayService, run:
+
+```bash
+kubectl delete -f ray-service.yaml -n [namespace]
+```
+
+## Troubleshooting
+
+### Connection Issues
+
+Ensure your cluster has proper network policies and that the namespace has access to required resources (MLflow, proxy, etc.).
+
+## Next Steps
+
+Congratulations! You've successfully deployed your first model with Model Service.
+
+Now you can:
+
+- [Learn how to add your own models](../guides/adding-models.md)
+- [Understand the architecture](../architecture/overview.md)
+- [Read the deployment guide](../guides/deployment-guide.md)
+- [Check the configuration reference](../guides/configuration-reference.md)
+- [Troubleshoot common issues](../guides/troubleshooting.md)
diff --git a/docs/guides/adding-models.md b/docs/guides/adding-models.md
new file mode 100644
index 0000000..008fcf5
--- /dev/null
+++ b/docs/guides/adding-models.md
@@ -0,0 +1,338 @@
+# Adding New Models
+
+This guide explains how to integrate your own machine learning models into Model Service.
+
+## Overview
+
+To add a new model, you need to:
+
+1. Create a model class with Ray Serve decorators
+2. Implement the inference logic
+3. 
Configure the RayService YAML +4. Deploy and test + +## Model Implementation + +### Basic Structure + +Create a Python file in the `models/` directory: + +```python +from ray import serve +from starlette.requests import Request + +@serve.deployment(ray_actor_options={"num_cpus": 2}) +class MyModel: + def __init__(self): + # Load your model here + pass + + async def __call__(self, request: Request): + # Handle inference requests + data = await request.json() + # Process data and return prediction + result = self.predict(data) + return {"prediction": result} + + def predict(self, data: dict): + # Replace with your own inference logic + return data + +app = MyModel.bind() +``` + +The repository's reference model `BinaryClassifier` uses FastAPI ingress + batched inference and expects a **compressed binary payload** (not JSON). For simple JSON models, the examples above are fine; for high-throughput image inference, consider the batching and ingress patterns shown below. + +### Key Components + +#### 1. Deployment Decorator + +The `@serve.deployment` decorator marks your class as a Ray Serve deployment: + +```python +@serve.deployment( + ray_actor_options={ + "num_cpus": 2, # CPUs per replica + "num_gpus": 0, # GPUs per replica + "memory": 2 * 1024**3, # Memory in bytes + } +) +class MyModel: + ... +``` + +#### 2. Initialization + +Load your model in `__init__`. This method corresponds to the replica **startup phase**. + +```python +def __init__(self): + # This runs ONCE when the replica starts. + # The replica is NOT ready for traffic until this returns. + import torch + + self.model = torch.load("model.pt") + self.model.eval() + print("Model loaded successfully") +``` + +#### 3. Resource Packing (Fractional CPUs/GPUs) + +Ray allows fractional resource requests. This lets you pack multiple small replicas onto a single node. + +```python +# Run 4 replicas on a single 1-CPU node (0.25 * 4 = 1.0) +@serve.deployment(ray_actor_options={"num_cpus": 0.25}) +``` + +#### 4. 
Inference Method + +Implement `__call__` or other methods for handling requests: + +```python +async def __call__(self, request: Request): + data = await request.json() + input_data = self.preprocess(data["input"]) + + with torch.no_grad(): + output = self.model(input_data) + + return {"prediction": self.postprocess(output)} +``` + +## Advanced Features + +### Dynamic Configuration + +Use `reconfigure()` to update model settings without redeployment: + +```python +from typing import TypedDict + +class Config(TypedDict): + threshold: float + batch_size: int + +@serve.deployment +class ConfigurableModel: + def __init__(self): + self.model = load_model() + + def reconfigure(self, config: Config): + self.threshold = config["threshold"] + self.batch_size = config["batch_size"] + print(f"Reconfigured: threshold={self.threshold}") +``` + +Update config via RayService YAML: + +```yaml +user_config: + threshold: 0.5 + batch_size: 32 +``` + +### Batching Requests + +Use `@serve.batch` for efficient batch processing: + +```python +@serve.deployment +class BatchedModel: + def __init__(self): + self.model = load_model() + + @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1) + async def predict_batch(self, inputs: list[np.ndarray]): + batch = np.stack(inputs) + outputs = self.model(batch) + return outputs.tolist() + + async def __call__(self, request: Request): + data = await request.json() + input_data = np.array(data["input"]) + result = await self.predict_batch(input_data) + return {"prediction": result} +``` + +For binary/image workloads, you can also batch raw `bytes` like the `BinaryClassifier` does (see `models/binary_classifier.py`). This avoids JSON overhead and lets you control batch sizing via `user_config` by calling `set_max_batch_size()` and `set_batch_wait_timeout_s()`. 
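The flush rule behind `@serve.batch` (a batch goes out when `max_batch_size` is reached or `batch_wait_timeout_s` elapses, whichever comes first) can be sketched with a toy, stdlib-only batcher. This is ours, purely for illustration of the semantics — it is not Ray Serve's implementation, and all names are made up:

```python
import asyncio

class MicroBatcher:
    """Toy model of @serve.batch semantics: flush when max_batch_size items
    are buffered, or when batch_wait_timeout_s has elapsed since the first
    buffered item."""

    def __init__(self, handler, max_batch_size: int, batch_wait_timeout_s: float):
        self.handler = handler  # async fn: list of items -> list of results
        self.max_batch_size = max_batch_size
        self.batch_wait_timeout_s = batch_wait_timeout_s
        self._buffer = []  # list of (item, future) pairs
        self._timeout_task = None

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        self._buffer.append((item, fut))
        if len(self._buffer) >= self.max_batch_size:
            await self._flush()  # size-triggered flush
        elif self._timeout_task is None:
            self._timeout_task = asyncio.create_task(self._flush_later())
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self.batch_wait_timeout_s)
        self._timeout_task = None
        await self._flush()  # timeout-triggered flush

    async def _flush(self):
        if self._timeout_task is not None:
            self._timeout_task.cancel()
            self._timeout_task = None
        batch, self._buffer = self._buffer, []
        if not batch:
            return
        results = await self.handler([item for item, _ in batch])
        for (_, fut), res in zip(batch, results):
            fut.set_result(res)

async def demo():
    batches = []

    async def double_all(items):
        batches.append(list(items))
        return [i * 2 for i in items]

    b = MicroBatcher(double_all, max_batch_size=2, batch_wait_timeout_s=0.05)
    results = await asyncio.gather(b.submit(1), b.submit(2), b.submit(3))
    return results, batches

print(asyncio.run(demo()))  # ([2, 4, 6], [[1, 2], [3]])
```

The two triggers trade throughput against latency: a large `max_batch_size` improves GPU utilization, while `batch_wait_timeout_s` bounds how long a lone request can sit in the buffer.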
+ +### Using FastAPI + +For advanced HTTP features, use FastAPI: + +```python +from fastapi import FastAPI, HTTPException +from pydantic import BaseModel + +class PredictionRequest(BaseModel): + input: list[float] + +class PredictionResponse(BaseModel): + prediction: float + confidence: float + +fastapi = FastAPI() + +@serve.deployment +@serve.ingress(fastapi) +class FastAPIModel: + def __init__(self): + self.model = load_model() + + @fastapi.post("/predict", response_model=PredictionResponse) + async def predict(self, request: PredictionRequest): + output = self.model(request.input) + return PredictionResponse( + prediction=float(output), + confidence=0.95 + ) + +app = FastAPIModel.bind() +``` + +## Loading Models from MLflow + +Use the model provider to load from MLflow: + +```python +# models/mlflow_model.py +from providers.model_provider import mlflow + +@serve.deployment +class MLflowModel: + def __init__(self): + # This will be set via user_config + self.model_path = None + + async def reconfigure(self, config): + model_uri = config["model"]["artifact_uri"] + self.model_path = mlflow(artifact_uri=model_uri) + + # Load model + import onnxruntime as ort + self.session = ort.InferenceSession(self.model_path) + + async def __call__(self, request: Request): + # Inference logic + ... 
+ +app = MLflowModel.bind() +``` + +Configure in YAML: + +```yaml +runtime_env: + env_vars: + MLFLOW_TRACKING_URI: http://mlflow.rationai-mlflow:5000 +user_config: + model: + artifact_uri: mlflow-artifacts:/65/abc123.../model.onnx +``` + +## RayService Configuration + +Add your model to `ray-service.yaml`: + +```yaml +spec: + serveConfigV2: | + applications: + - name: my-model + import_path: models.my_onnx_model:app + route_prefix: /my-model + runtime_env: + working_dir: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/archive/master/model-service-master.zip + pip: + - onnxruntime>=1.23.2 + - numpy + deployments: + - name: MyONNXModel + autoscaling_config: + min_replicas: 1 + max_replicas: 4 + ray_actor_options: + num_cpus: 2 + memory: 4294967296 # 4 GiB + runtime_env: + pip: + - onnxruntime>=1.23.2 +``` + +In this repository, the production `ray-service.yaml` installs model dependencies under `deployments[*].ray_actor_options.runtime_env.pip` (not only at `applications[*].runtime_env`). This is useful when different deployments need different dependencies. 
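The `memory` values in these manifests are raw bytes (e.g. `4294967296` for 4 GiB), which is easy to mistype. A tiny helper — ours, purely illustrative, not part of the repository — makes the conversion explicit:

```python
def gib_to_bytes(gib: float) -> int:
    """Convert GiB to the raw byte count expected by memory fields
    such as ray_actor_options.memory in ray-service.yaml."""
    return int(gib * 1024 ** 3)

print(gib_to_bytes(4))  # 4294967296
print(gib_to_bytes(5))  # 5368709120
```

Generating the value instead of typing it by hand avoids silently under- or over-provisioning a replica by an order of magnitude.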
+ +## GPU Models + +For GPU-accelerated models: + +```python +@serve.deployment(ray_actor_options={"num_gpus": 1}) +class GPUModel: + def __init__(self): + import torch + + self.device = torch.device("cuda") + self.model = torch.load("model.pt").to(self.device) + self.model.eval() + + async def __call__(self, request: Request): + data = await request.json() + input_tensor = torch.tensor(data["input"]).to(self.device) + + with torch.no_grad(): + output = self.model(input_tensor) + + return {"prediction": output.cpu().numpy().tolist()} +``` + +Configure GPU worker group: + +```yaml +workerGroupSpecs: + - groupName: gpu-workers + replicas: 0 + minReplicas: 0 + maxReplicas: 2 + template: + spec: + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A40 + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312-gpu + resources: + limits: + nvidia.com/gpu: 1 +``` + +## Deployment + +Deploy your model: + +```bash +kubectl apply -f ray-service.yaml -n [namespace] +``` + +Monitor deployment: + +```bash +kubectl get rayservice -n [namespace] +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=100 +``` + +## Best Practices + +1. **Error Handling**: Always wrap inference in try-except blocks +2. **Logging**: Use `print()` or `logging` for debugging (viewable in pod logs) +3. **Resource Limits**: Set appropriate CPU/memory/GPU limits +4. **Model Loading**: Cache models to avoid reloading on each request +5. **Input Validation**: Validate input data format and ranges +6. **Batching**: Use batching for throughput-intensive workloads +7. 
**Health Checks**: Implement health check endpoints for monitoring
+
+## Next Steps
+
+- [Deployment guide](deployment-guide.md)
+- [Configuration reference](configuration-reference.md)
+- [Architecture overview](../architecture/overview.md)
diff --git a/docs/guides/configuration-reference.md b/docs/guides/configuration-reference.md
new file mode 100644
index 0000000..602d502
--- /dev/null
+++ b/docs/guides/configuration-reference.md
@@ -0,0 +1,237 @@
+# Configuration Reference
+
+This page summarizes the **most important knobs** you will touch when configuring Model Service. For full API details, see the upstream Ray Serve and KubeRay documentation.
+
+## 1. RayService Skeleton
+
+```yaml
+apiVersion: ray.io/v1
+kind: RayService
+metadata:
+  name:
+  namespace: [namespace]
+spec:
+  serveConfigV2: |
+    # Ray Serve applications
+  rayClusterConfig:
+    # Ray cluster (head + workers)
+```
+
+Think of it as two parts:
+
+- **`serveConfigV2`**: what you serve (apps, deployments, autoscaling).
+- **`rayClusterConfig`**: where it runs (Ray version, worker groups, resources).
+
+## 2. Applications and Deployments
+
+### Applications (HTTP endpoints)
+
+```yaml
+serveConfigV2: |
+  applications:
+    - name: prostate-classifier
+      import_path: models.binary_classifier:app
+      route_prefix: /prostate-classifier
+      runtime_env:
+        working_dir: https://.../model-service-master.zip
+        pip:
+          - onnxruntime>=1.23.2
+```
+
+- `name`: logical app name (used in Ray dashboard/logs).
+- `import_path`: Python entrypoint (`module.path:variable`).
+- `route_prefix`: HTTP path under the Serve gateway.
+- `runtime_env`: dynamic environment setup (see [Adding New Models](adding-models.md)). 
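The `import_path` follows the `module.path:variable` convention. A small sketch — ours, using stdlib `importlib`, not Serve's actual loader — shows how such a string resolves to an object, which is useful for sanity-checking an entry before deploying:

```python
import importlib

def resolve_import_path(import_path: str):
    # "models.binary_classifier:app" -> module "models.binary_classifier",
    # attribute "app" defined at that module's top level.
    module_name, _, attr = import_path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# Demonstrated on a stdlib module, since "models" only exists inside the repo:
print(resolve_import_path("math:pi"))  # 3.141592653589793
```

If this kind of lookup fails locally (wrong module path, missing top-level variable), the same `import_path` will also fail inside the Serve application.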
+ +### Deployments (scaling + resources) + +```yaml +deployments: + - name: BinaryClassifier + max_ongoing_requests: 64 + max_queued_requests: 128 + autoscaling_config: + min_replicas: 0 + max_replicas: 4 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 6 + memory: 5368709120 # 5 GiB + user_config: + tile_size: 512 + threshold: 0.5 +``` + +- `autoscaling_config`: how many replicas and when to scale. +- `ray_actor_options`: per‑replica CPU/GPU/memory. +- `user_config`: free‑form dict passed to `reconfigure()` in your model. + +## 2.1 Backpressure and queueing settings (very important) + +These two knobs often get confused because they both “limit load”, but they act at different points in the request path. + +### `max_ongoing_requests` (replica-side concurrency) + +**What it is:** the maximum number of in-flight requests a _single replica_ is allowed to have at once. + +**What it controls:** per-replica concurrency and memory pressure. + +**What happens when exceeded:** requests should not be dispatched onto that replica; they must wait upstream (typically in the caller-side queue). + +### `max_queued_requests` (caller-side queue limit) + +**What it is:** the maximum number of requests that are allowed to wait in the caller-side queue _before_ a replica slot is available. + +**Where that queue lives:** in the component that is calling the deployment (commonly the HTTP Proxy when handling HTTP ingress). + +**What happens when exceeded:** requests are rejected (client-visible overload/backpressure). + +### Why the difference matters + +- `max_ongoing_requests` protects the replica from being overloaded. +- `max_queued_requests` decides whether you prefer waiting or rejecting during spikes. + +See: [Queues and Backpressure](../architecture/queues-and-backpressure.md). + +## 2.2 Autoscaling settings (what they actually mean) + +### `target_ongoing_requests` + +**What it is:** The desired average number of **ongoing (in-flight)** requests per replica. 
This is the **primary scaling driver**.
+
+**Formula:**
+$$ \text{Desired Replicas} = \left\lceil \frac{\text{Total Ongoing Requests}}{\texttt{target\_ongoing\_requests}} \right\rceil $$
+
+**Note:** "Total Ongoing Requests" refers to the **concurrency** (number of requests currently being processed or waiting in the queue), _not_ the Requests Per Second (RPS).
+
+**Example:**
+If your system receives 100 **concurrent** requests and `target_ongoing_requests` is set to 20, Serve will scale to 5 replicas.
+
+**How it influences scaling:**
+
+- **Lower value**: Scales up _earlier_. Use for latency-sensitive models or heavy tasks.
+- **Higher value**: Scales up _later_. Use for high-throughput models where a single replica can handle many concurrent requests.
+
+**Important interaction:** if you set `max_queued_requests` too low, requests may get rejected before ongoing requests rise enough for autoscaling to catch up.
+
+### `min_replicas` / `max_replicas`
+
+Hard bounds on how many replicas Serve is allowed to run for that deployment.
+
+- **Scale to Zero**: Set `min_replicas: 0` to allow the deployment to stop all replicas when idle. The first request will trigger a "cold start" (latency spike).
+- **High Availability**: Set `min_replicas: 2` (or more) to ensure at least two copies are always running, even if idle.
+
+### `upscale_delay_s` / `downscale_delay_s`
+
+Rules for how quickly the autoscaler reacts to load changes.
+
+- **`upscale_delay_s`**: The "patience" period before scaling up. The autoscaler sees high load, but waits this many seconds to confirm the spike is real before launching new replicas.
+  - _Risk_: Setting this too high makes the system sluggish to react to bursts.
+- **`downscale_delay_s`**: The "grace period" before scaling down. Even if load drops to zero, the autoscaler keeps replicas alive for this duration. 
+ - _Recommendation_: Keep this high to avoid "thrashing" (rapidly creating/destroying replicas) during short pauses in traffic. + +## 3. Ray Cluster (Workers and Autoscaling) + +```yaml +rayClusterConfig: + rayVersion: "2.52.1" + enableInTreeAutoscaling: true + headGroupSpec: + rayStartParams: + num-cpus: "0" # head only coordinates + template: + spec: + containers: + - name: ray-head + image: rayproject/ray:2.52.1-py312 + workerGroupSpecs: + - groupName: cpu-workers + replicas: 1 + minReplicas: 1 + maxReplicas: 10 + template: + spec: + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312 + resources: + requests: + cpu: "4" + memory: "8Gi" + limits: + cpu: "8" + memory: "16Gi" +``` + +**Key Interactions:** + +1. **Head Node Isolation**: `rayStartParams: { num-cpus: "0" }` on the head node prevents workloads from scheduling there. The head is reserved for the Control Plane. +2. **Worker Sizing**: `resources.requests` defines the physical guarantee. Your Pod must be bigger than your Replica (`ray_actor_options`). + - _Physical_: Pod Requests (e.g., 4 CPU) + - _Logical_: Model Replica Requirement (e.g., 2 CPU) + - _Result_: One Pod can fit 2 Replicas (plus overhead). + +## 4. Security and Placement (Optional but Recommended) + +```yaml +template: + spec: + securityContext: + runAsNonRoot: true + fsGroupChangePolicy: OnRootMismatch + seccompProfile: + type: RuntimeDefault + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A40 + containers: + - name: ray-worker + securityContext: + allowPrivilegeEscalation: false + runAsUser: 1000 + capabilities: + drop: ["ALL"] +``` + +Use these to: + +- Enforce non‑root containers and least privilege. +- Pin GPU workloads to specific node types. + +## 5. 
Putting It Together (Small Example) + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-example + namespace: rationai-notebooks-ns +spec: + serveConfigV2: | + applications: + - name: my-classifier + import_path: models.classifier:app + route_prefix: /classify + deployments: + - name: Classifier + autoscaling_config: + min_replicas: 1 + max_replicas: 5 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 4 + rayClusterConfig: + rayVersion: "2.52.1" + enableInTreeAutoscaling: true + headGroupSpec: + rayStartParams: + num-cpus: "0" + workerGroupSpecs: + - groupName: cpu-workers + minReplicas: 1 + maxReplicas: 5 +``` + +## Next Steps + +- [Deployment guide](deployment-guide.md) +- [Architecture overview](../architecture/overview.md) diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md new file mode 100644 index 0000000..34d7083 --- /dev/null +++ b/docs/guides/deployment-guide.md @@ -0,0 +1,476 @@ +# Deployment Guide + +Complete guide for deploying models to production with Model Service. + +## Prerequisites + +Before deploying to production, ensure: + +- [x] KubeRay operator installed +- [x] Namespace created (`rationai-notebooks-ns`) +- [x] Model tested locally +- [x] RayService YAML configured +- [x] MLflow accessible (if using MLflow) + +## Deployment Workflow + +### 1. 
Prepare Model Code + +Ensure your model is in the `models/` directory and properly structured: + +```python +# models/my_model.py +from ray import serve +from starlette.requests import Request + +@serve.deployment(ray_actor_options={"num_cpus": 2}) +class MyModel: + def __init__(self): + # Model initialization + self.model = self.load_model() + + def load_model(self): + # Load model logic + pass + + async def __call__(self, request: Request): + # Inference logic + data = await request.json() + result = self.model.predict(data["input"]) # replace with your own inference call + return {"prediction": result} + +app = MyModel.bind() +``` + +### 2. Create RayService Configuration + +Create or modify `ray-service.yaml`: + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-my-model + namespace: rationai-notebooks-ns +spec: + serveConfigV2: | + applications: + - name: my-model + import_path: models.my_model:app + route_prefix: /my-model + runtime_env: + working_dir: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/archive/master/model-service-master.zip + pip: + - numpy + - pandas + env_vars: + MODEL_VERSION: "1.0.0" + deployments: + - name: MyModel + autoscaling_config: + min_replicas: 1 + max_replicas: 5 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 4 + memory: 4294967296 # 4 GiB + runtime_env: + pip: + - numpy + - pandas + + rayClusterConfig: + rayVersion: 2.52.1 + enableInTreeAutoscaling: true + autoscalerOptions: + idleTimeoutSeconds: 300 + + headGroupSpec: + rayStartParams: + num-cpus: "0" + dashboard-host: "0.0.0.0" + template: + spec: + containers: + - name: ray-head + image: rayproject/ray:2.52.1-py312 + resources: + limits: + cpu: 2 + memory: 4Gi + + workerGroupSpecs: + - groupName: cpu-workers + replicas: 1 + minReplicas: 1 + maxReplicas: 10 + template: + spec: + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312 + resources: + limits: + cpu: 8 + memory: 16Gi +``` + +### 3. 
Deploy to Kubernetes
+
+Apply the configuration:
+
+```bash
+kubectl apply -f ray-service.yaml -n [namespace]
+```
+
+### 4. Monitor Deployment
+
+Watch the deployment progress:
+
+```bash
+# Watch RayService status
+kubectl get rayservice rayservice-my-model -n [namespace] -w
+
+# Check pods
+kubectl get pods -n [namespace] -l ray.io/cluster=rayservice-my-model
+
+# View head node logs
+kubectl logs -n [namespace] -l ray.io/node-type=head -f
+
+# View worker logs
+kubectl logs -n [namespace] -l ray.io/node-type=worker -f
+```
+
+Wait for status to show `Running` and application status to show `RUNNING`.
+
+### 5. Verify Deployment
+
+Check service endpoints:
+
+```bash
+# Get service details
+kubectl get svc -n [namespace]
+
+# Port forward to test
+kubectl port-forward -n [namespace] \
+  svc/rayservice-my-model-serve-svc 8000:8000
+```
+
+The example model in this repository (`models/binary_classifier.py`) uses FastAPI ingress and expects a **compressed binary request body** (LZ4), not JSON. A plain JSON `curl` test therefore works only for JSON-based models; to exercise `BinaryClassifier`, use a client that sends the expected LZ4-compressed payload.
+
+## Production Considerations
+
+### Resource Planning (Logical vs. Physical)
+
+Ray scheduling relies on **Logical Resources**, while Kubernetes manages **Physical Resources**. Confusion between them is the #1 cause of "Pending" pods.
+
+#### 1. Logical Resources (What Ray sees)
+
+Defined in your code via `ray_actor_options`. These are abstract "slots" used for scheduling.
+
+- `num_cpus: 4`: The actor needs 4 slots to run.
+- `memory: 4294967296` (bytes): Ray logical memory resource used for scheduling/admission control.
+
+Example (Python):
+
+```python
+from ray import serve
+
+@serve.deployment(
+    ray_actor_options={
+        "num_cpus": 4,
+        "memory": 4 * 1024**3,  # bytes (4 GiB)
+    }
+)
+class MyModel:
+    ...
+
+app = MyModel.bind()
+```
+
+#### 2. Physical Resources (What Kubernetes gives)
+
+Defined in `ray-service.yaml` under `workerGroupSpecs`. 
This is the actual container capacity. + +Example (Kubernetes YAML): + +```yaml +workerGroupSpecs: + - groupName: cpu-workers + replicas: 1 + template: + spec: + containers: + - name: ray-worker + resources: + requests: + cpu: 8 + memory: 16Gi + limits: + cpu: 12 + memory: 20Gi +``` + +#### 3. The "Overhead" Gap + +Ray system processes (Raylet, Dashboard Agent, Plasma Store) consume physical CPU and Memory that is **not** accounted for in logical slots. + +**Formula for Worker Pod Sizing:** + +```text +Physical Request >= (Sum of Replicas × Logical Request) + System Overhead +``` + +**Recommended Overhead Buffer:** + +- **CPU**: Add 0.5 - 2 CPU cores per pod for Ray system processes. +- **Memory**: Add 1-2 GiB + 30% of object store size. + +#### Example Calculation + +**Scenario:** Deploy 5 replicas of a model requiring 4 CPUs and 4GB RAM on a single node. + +1. **Logical Needs**: 5 replicas × 4 CPUs = **20 Logical CPUs**. +2. **Physical Overhead**: We estimate 2 CPUs for Raylet/System. +3. **Total Physical Request**: 20 + 2 = **22 CPUs**. + +_If you request only 20 CPUs in Kubernetes, Ray will detect that some CPU is used by the OS/Raylet and might only offer 19 logical slots, causing the 5th replica to hang._ + +### Autoscaling Configuration + +**Choose appropriate scaling parameters:** + +```yaml +autoscaling_config: + min_replicas: 1 + max_replicas: 10 + target_ongoing_requests: 20 + + # Advanced stabilization + upscale_delay_s: 30 + downscale_delay_s: 600 +``` + +**Key Tuning Recommendations:** + +1. **`target_ongoing_requests`**: + + - **Lower this value** (e.g., 5-10) for latency-sensitive models or if your model is CPU-heavy. This forces the system to scale out sooner. + - **Increase this value** (e.g., 50-100) for simple models where a single replica can juggle many async requests. + +2. **`upscale_delay_s`**: + + - Keep this low (e.g., `0s` to `30s`) so the system reacts quickly to traffic spikes. + +3. 
**`downscale_delay_s`**: + - Keep this high (e.g., `600s`) to avoid "thrashing". It is cheaper to keep an idle replica for 10 minutes than to re-initialize a heavy model (loading weights, etc.) every time traffic dips for a minute. + +For the exact formulas and definitions of these settings, see the [Configuration Reference](configuration-reference.md#22-autoscaling-settings-what-they-actually-mean). + +### High Availability + +**For production workloads:** + +```yaml +# Multiple replicas +autoscaling_config: + min_replicas: 2 # At least 2 for redundancy + +# Multiple workers +workerGroupSpecs: + - groupName: cpu-workers + minReplicas: 2 + maxReplicas: 10 +``` + +### Resource Limits + +**Always set resource limits:** + +```yaml +containers: + - name: ray-worker + resources: + requests: # Guaranteed resources + cpu: 8 + memory: 16Gi + limits: # Maximum resources + cpu: 12 + memory: 20Gi +``` + +### Network Configuration + +**Proxy settings:** + +```yaml +env: + - name: HTTPS_PROXY + value: "http://proxy.example.com:3128" +``` + +**Service configuration:** + +```yaml +# If you need external access +apiVersion: v1 +kind: Service +metadata: + name: rayservice-external +spec: + type: LoadBalancer + selector: + ray.io/cluster: rayservice-my-model + ports: + - port: 80 + targetPort: 8000 +``` + +## Multi-Model Deployment + +Deploy multiple models in one RayService: + +```yaml +serveConfigV2: | + applications: + - name: model-a + import_path: models.model_a:app + route_prefix: /model-a + deployments: + - name: ModelA + ray_actor_options: + num_cpus: 4 + + - name: model-b + import_path: models.model_b:app + route_prefix: /model-b + deployments: + - name: ModelB + ray_actor_options: + num_gpus: 1 +``` + +## Updating Deployments + +### Update Model Code + +1. Update code in repository +2. Commit and push changes +3. 
RayService will automatically fetch new code from `working_dir` URL + +### Update Configuration + +```bash +# Edit configuration +vim ray-service.yaml # or any IDE + +# Apply changes +kubectl apply -f ray-service.yaml -n [namespace] +``` + +KubeRay will reconcile the RayService and attempt a rolling-style update: + +- New replicas are created with the new config +- Traffic is routed to healthy replicas +- Old replicas are eventually removed + +### Update Model Weights + +If using MLflow: + +```yaml +user_config: + model: + artifact_uri: mlflow-artifacts:/65/NEW_RUN_ID/model.onnx +``` + +Apply update: + +```bash +kubectl apply -f ray-service.yaml -n [namespace] +``` + +## Rollback + +If deployment fails, rollback: + +```bash +# RayService is a Custom Resource (CRD), so Kubernetes "rollout" doesn't apply. +# Instead, view KubeRay status and events, then re-apply a known-good spec. + +# Inspect current state and recent events +kubectl get rayservice rayservice-my-model -n [namespace] -o yaml +kubectl describe rayservice rayservice-my-model -n [namespace] + +# Check Ray Serve controller logs (usually shows the root cause) +kubectl logs -n [namespace] -l ray.io/node-type=head --tail=200 +``` + +## Troubleshooting + +### Deployment Stuck + +**Check RayService status:** + +```bash +kubectl describe rayservice rayservice-my-model -n [namespace] +``` + +**Common issues:** + +- Image pull errors +- Insufficient resources +- Configuration errors +- Network issues + +### Application Not Starting + +**Check serve application logs:** + +```bash +# View dashboard +kubectl port-forward -n [namespace] svc/rayservice-my-model-head-svc 8265:8265 + +# Check logs +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=100 +``` + +**Common issues:** + +- Python import errors +- Model loading failures +- Dependency issues +- Resource limits + +### High Latency + +**Check metrics:** + +```bash +# Ray dashboard: http://localhost:8265 +kubectl port-forward -n [namespace] 
svc/rayservice-my-model-head-svc 8265:8265 +``` + +**Possible solutions:** + +- Increase replicas +- Enable batching +- Optimize model code +- Increase resources + +## Best Practices + +1. **Version Control**: Keep all YAML configs in Git +2. **Testing**: Test locally before deploying +3. **Monitoring**: Set up alerts for failures +4. **Resource Limits**: Always set limits to prevent resource hogging +5. **Gradual Rollout**: Update replicas gradually +6. **Documentation**: Document custom configurations +7. **Backup**: Keep backups of working configurations + +## Next Steps + +- [Configuration reference](configuration-reference.md) +- [Architecture overview](../architecture/overview.md) +- [Adding new models](adding-models.md) +- [Troubleshooting](troubleshooting.md) diff --git a/docs/guides/troubleshooting.md b/docs/guides/troubleshooting.md new file mode 100644 index 0000000..34ec5dd --- /dev/null +++ b/docs/guides/troubleshooting.md @@ -0,0 +1,198 @@ +# Troubleshooting + +This page lists the most common issues when deploying and running models in Model Service (Ray Serve on KubeRay). + +## Quick Triage Checklist + +Start here before digging deeper: + +```bash +kubectl get rayservice -n [namespace] +kubectl describe rayservice rayservice-models -n [namespace] +kubectl get pods -n [namespace] +``` + +Then inspect logs: + +```bash +kubectl logs -n [namespace] -l ray.io/node-type=head --tail=200 +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=200 +``` + +## RayService Shows `DEPLOY_FAILED` + +### What it usually means + +Ray Serve could not start the application or deployment. The root cause is typically visible in the Ray Serve controller logs. + +### What to do + +1. Describe the RayService for events: + +```bash +kubectl describe rayservice rayservice-models -n [namespace] +``` + +2. 
Open the Ray dashboard (helps with Serve deployment errors): + +```bash +kubectl port-forward -n [namespace] svc/rayservice-models-head-svc 8265:8265 +``` + +Visit `http://localhost:8265`. + +3. Look for Python import errors / missing dependencies: + +```bash +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=500 +``` + +## ImportError / ModuleNotFoundError + +### Symptoms + +- Serve deployment fails immediately. +- Logs show `ModuleNotFoundError: No module named ...`. + +### Causes + +- Dependency not installed in the runtime environment. +- Wrong `import_path`. +- `working_dir` does not contain the expected code. + +### Fix + +- Ensure `import_path` matches your file: + - Example: `models.binary_classifier:app` means there is `models/binary_classifier.py` defining `app = ...`. +- Add missing dependencies to `runtime_env.pip`. + +In this repository, dependencies are typically installed per deployment: + +```yaml +deployments: + - name: BinaryClassifier + ray_actor_options: + runtime_env: + pip: ["onnxruntime>=1.23.2", "mlflow<3.0", "lz4>=4.4.5"] +``` + +## Worker Crashes (OOMKilled) + +### Symptoms + +- Pods in `kubectl get pods` show status `OOMKilled` or high restart counts. +- `kubectl describe pod ...` shows "Last State: Terminated (Reason: OOMKilled)". +- Ray Dashboard shows unexpected actor deaths. + +### Causes + +- The model loaded into memory + the input batch size exceeds the container's memory limit. +- **Physical vs Logical Mismatch**: Ray was told the actor needs 2GB, so it scheduled it on a node, but the actual Python process used 4GB, causing Kubernetes to kill it. + +### Fix + +You must increase **both** the Ray logical allocation and the Kubernetes physical limit. + +1. Increase `ray_actor_options.memory` (Software limit): + + ```yaml + ray_actor_options: + memory: 4294967296 # 4 GiB + ``` + +2. 
Increase Kubernetes container limits (Hardware limit): + Ensure the `workerGroupSpecs` in `ray-service.yaml` provides **more** memory than the sum of all actors on that node plus overhead (~30%). + + ```yaml + resources: + limits: + memory: "6Gi" # Must cover the 4GB actor + Ray overhead + ``` + +## Autoscaling Not Working (Replicas Don’t Change) + +### Serve replicas not scaling + +Check your deployment has autoscaling configured: + +```yaml +autoscaling_config: + min_replicas: 0 + max_replicas: 4 + target_ongoing_requests: 32 +``` + +Also note: + +- Scale up/down is not instantaneous (delays and smoothing apply). +- If traffic is low, you may stay at `min_replicas`. + +### Worker pods not scaling + +Worker pod scaling requires cluster autoscaling enabled: + +```yaml +rayClusterConfig: + enableInTreeAutoscaling: true + autoscalerOptions: + idleTimeoutSeconds: 60 +``` + +Also ensure `workerGroupSpecs[*].minReplicas/maxReplicas` allow scaling. + +## Not Enough CPU / Memory (Pods Pending) + +### Symptoms + +- Pods stay in `Pending`. +- Events mention `Insufficient cpu` or `Insufficient memory`. + +### Fix + +1. **Check Physical vs Logical**: + + - _Physical_: Can K8s schedule the pod? `kubectl describe pod` will show if nodes are full. + - _Logical_: Can Ray schedule the actor? Check `ray status` or the dashboard. Ray might say "0/X CPUs available" even if the pod exists, because other actors consumed the slots. + +2. **Adjust Resources**: + - Reduce per-replica requirements (`ray_actor_options.num_cpus`, `memory`). + - Increase cluster capacity (maxReplicas) or per-worker limits. + +Inspect pod scheduling events: + +```bash +kubectl describe pod -n [namespace] +``` + +## MLflow / Artifact Download Problems + +### Symptoms + +- `mlflow.artifacts.download_artifacts` fails. +- Timeouts during replica initialization. + +### Fix + +- Ensure `MLFLOW_TRACKING_URI` is set and reachable from the cluster. +- Ensure the cluster has network access (proxy settings if needed). 
+- Verify the `artifact_uri` exists and permissions are correct. + +In `ray-service.yaml` this is typically configured via `env_vars`: + +```yaml +ray_actor_options: + runtime_env: + env_vars: + MLFLOW_TRACKING_URI: http://mlflow.rationai-mlflow:5000 +``` + +## Helpful Commands + +```bash +# list Serve and RayService resources +kubectl get rayservice -n [namespace] +kubectl get svc -n [namespace] + +# see all pods for a RayService +kubectl get pods -n [namespace] -l ray.io/cluster=rayservice-models +``` diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..b861842 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,92 @@ +# Model Service Documentation + +Welcome to the Model Service documentation. This service provides a scalable, production-ready infrastructure for deploying machine learning models for the RationAI project using Ray Serve on Kubernetes. + +## What is Model Service? + +Model Service is a deployment framework that enables: + +- **Scalable Model Serving**: Automatically scale model replicas based on request load +- **Distributed Inference**: Distribute inference workloads across multiple workers and nodes +- **Resource Management**: Efficiently manage CPU and GPU resources in Kubernetes +- **Model Versioning**: Integration with MLflow for model lifecycle management +- **Production-oriented**: Built on Ray Serve and KubeRay, with autoscaling and failure recovery features + +## Key Features + +### Auto-Scaling + +Model Service automatically adjusts the number of model replicas based on incoming request volume, ensuring optimal resource utilization and response times. + +### Multi-Model Deployment + +Deploy multiple models simultaneously with isolated resource allocations and independent scaling policies. + +### GPU/CPU Support + +Flexible resource allocation supporting both CPU-based and GPU-accelerated models with hardware-specific worker groups. 
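For illustration, hardware-specific worker groups are expressed in the `RayService` manifest as separate entries under `workerGroupSpecs`, each with its own scaling bounds. A minimal sketch (the group names, image tags, replica bounds, and resource sizes below are placeholders, not this repository's actual values):

```yaml
rayClusterConfig:
  workerGroupSpecs:
    # CPU pool for lightweight models
    - groupName: cpu-workers
      minReplicas: 1
      maxReplicas: 4
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0
              resources:
                limits:
                  cpu: "4"
                  memory: 8Gi
    # GPU pool for accelerated models; can scale to zero when idle
    - groupName: gpu-workers
      minReplicas: 0
      maxReplicas: 2
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: "1"
                  memory: 16Gi
```

Because each group scales independently, CPU-only and GPU models can share one cluster without competing for the same pods.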
+ +### Kubernetes Native + +Leverages KubeRay operator for seamless integration with Kubernetes, enabling declarative configuration and GitOps workflows. + +## Why Ray Serve? + +Model Service is built on top of Ray Serve because it combines a simple developer experience with strong production capabilities: + +- **Unified batch and online inference**: The same Ray cluster can handle real-time HTTP requests and large batch jobs, which matches RationAI's mix of interactive and offline pathology workloads. +- **Python‑native API**: Models are implemented as regular Python classes or functions with decorators, making it easy for researchers to contribute without learning a heavy framework. +- **Autoscaling built in**: Ray Serve natively scales replicas based on request pressure and integrates with Ray's cluster autoscaler to add/remove worker pods. +- **Multi‑model support**: Multiple independent applications and deployments can run side‑by‑side on one cluster while isolating resources per model. + +Alternative approaches (plain Kubernetes deployments, custom Flask/FastAPI services, or specialized serving stacks like TorchServe or TF Serving) either lack first‑class autoscaling orchestration across many models, or are tightly coupled to specific ML frameworks. Ray Serve, together with KubeRay, lets us: + +- Express all infrastructure declaratively in a single `RayService` resource. +- Share the same cluster across heterogeneous models and hardware (CPU/GPU). +- Keep the operational surface smaller by relying on one general‑purpose serving layer instead of many ad‑hoc microservices. 
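To make the "autoscaling built in" point concrete: Serve sizes a deployment from its current request load relative to `target_ongoing_requests`, clamped to the configured replica bounds. A simplified pure-Python sketch of that rule (the real controller additionally applies metric smoothing and upscale/downscale delays; the function name here is illustrative):

```python
import math

def desired_replicas(total_ongoing_requests: int,
                     target_ongoing_requests: int,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified replica-sizing rule: aim for roughly
    target_ongoing_requests per replica, clamped to the bounds."""
    raw = math.ceil(total_ongoing_requests / target_ongoing_requests)
    return max(min_replicas, min(max_replicas, raw))

# With min_replicas=0, max_replicas=4, target_ongoing_requests=32:
print(desired_replicas(0, 32, 0, 4))    # 0 -> idle deployments scale to zero
print(desired_replicas(100, 32, 0, 4))  # 4 -> ceil(100 / 32)
print(desired_replicas(300, 32, 0, 4))  # 4 -> capped at max_replicas
```

Note the scale-to-zero case: `min_replicas: 0` frees resources for rarely used models, at the cost of a cold start on the first request after an idle period.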
+ +## Use Cases + +Model Service is designed for: + +- **Pathology Image Analysis**: Deploy models for tissue classification, nuclei detection, and other pathology tasks +- **Batch Processing**: Handle large-scale inference workloads efficiently +- **Real-time Inference**: Serve predictions with low latency for interactive applications +- **Research Experiments**: Quickly deploy and test new model versions + +## Documentation Contents + +### Get Started + +- [**Quick Start**](get-started/quick-start.md): Deploy the reference model in minutes. + +### Guides + +- [**Adding Models**](guides/adding-models.md): How to write, package, and integrate your own Python models. +- [**Deployment Guide**](guides/deployment-guide.md): Production checklist, resource planning (CPU/GPU), and networking. +- [**Configuration Reference**](guides/configuration-reference.md): Detailed explanation of `ray-service.yaml` settings. +- [**Troubleshooting**](guides/troubleshooting.md): Common errors (OOM, hang scenarios) and solutions. + +### Architecture Deep Dive + +- [**Overview**](architecture/overview.md): High-level system design and component hierarchy. +- [**Request Lifecycle**](architecture/request-lifecycle.md): Trace a request from Ingress to Worker. +- [**Queues & Backpressure**](architecture/queues-and-backpressure.md): Understanding flow control and overload protection. +- [**Batching**](architecture/batching.md): How request coalescing works under the hood. + +## Getting Help + +- **Documentation**: Browse the guides and reference materials in this documentation +- **Issues**: Report bugs or request features via [GitLab Issues](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/issues) +- **Contact**: Reach out to the RationAI team at Masaryk University + +## Next Steps + +Ready to get started? Follow our [Quick Start Guide](get-started/quick-start.md) to deploy your first model. 
+ +## Glossary + +- **RayService**: KubeRay custom resource that manages a Ray cluster plus a Ray Serve application, including updates. +- **Deployment (Ray Serve)**: A scalable unit (replicas) that runs your model code. +- **Replica**: One running instance of a deployment. +- **Worker group (KubeRay)**: A set of Ray worker pods (e.g., CPU or GPU workers) with independent scaling bounds. diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..e4a1e22 --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,48 @@ +site_name: Model Service Documentation +site_description: Model deployment infrastructure for RationAI using Ray Serve on Kubernetes +repo_url: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service +edit_uri: -/edit/master/docs/ + +theme: + name: material + palette: + primary: indigo + accent: indigo + features: + - navigation.tabs + - navigation.sections + - toc.integrate + - search.suggest + - search.highlight + - content.code.copy + +nav: + - Home: index.md + - Get Started: + - Quick Start: get-started/quick-start.md + - Installation: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/kuberay-operator-installation.html + - First Deployment: https://docs.ray.io/en/latest/serve/production-guide/kubernetes.html + - Guides: + - Deployment Guide: guides/deployment-guide.md + - Adding New Models: guides/adding-models.md + - Configuration Reference: guides/configuration-reference.md + - Troubleshooting: guides/troubleshooting.md + - Architecture: + - Overview: architecture/overview.md + - Request Lifecycle: architecture/request-lifecycle.md + - Batching: architecture/batching.md + - Queues and Backpressure: architecture/queues-and-backpressure.md + +markdown_extensions: + - admonition + - pymdownx.details + - pymdownx.superfences + - pymdownx.highlight: + anchor_linenums: true + - pymdownx.inlinehilite + - pymdownx.snippets + - pymdownx.tabbed: + alternate_style: true + - tables + - toc: + permalink: true \ No newline at end of file 
diff --git a/pyproject.toml b/pyproject.toml index c42c605..be98026 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -19,3 +19,4 @@ dependencies = [ [dependency-groups] dev = ["mypy>=1.18.2", "ruff>=0.14.6"] +docs = ["mkdocs>=1.6.0", "mkdocs-material>=9.6.0", "pymdown-extensions>=10.0"]