From 12a3b91b81cd0eaf6fbb381cb28fd05afd7c88d2 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Thu, 18 Dec 2025 19:20:06 +0100 Subject: [PATCH 01/17] feat: add 1st docs iteration --- docs/architecture/overview.md | 334 ++++++++++++++++ docs/get-started/quick-start.md | 100 +++++ docs/guides/adding-models.md | 325 ++++++++++++++++ docs/guides/deployment-guide.md | 447 ++++++++++++++++++++++ docs/guides/troubleshooting.md | 159 ++++++++ docs/index.md | 72 ++++ docs/reference/configuration-reference.md | 171 +++++++++ mkdocs.yml | 46 +++ 8 files changed, 1654 insertions(+) create mode 100644 docs/architecture/overview.md create mode 100644 docs/get-started/quick-start.md create mode 100644 docs/guides/adding-models.md create mode 100644 docs/guides/deployment-guide.md create mode 100644 docs/guides/troubleshooting.md create mode 100644 docs/index.md create mode 100644 docs/reference/configuration-reference.md create mode 100644 mkdocs.yml diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md new file mode 100644 index 0000000..48447d2 --- /dev/null +++ b/docs/architecture/overview.md @@ -0,0 +1,334 @@ +# Architecture Overview + +This document provides an overview of Model Service's architecture, components, and design principles. + +## System Architecture + +Model Service is built on a multi-layered architecture: + +``` +┌─────────────────────────────────────────────────────────┐ +│ Client Applications │ +│ (API Consumers, Web Apps, Notebooks) │ +└────────────────────┬────────────────────────────────────┘ + │ HTTP/HTTPS + ▼ +┌─────────────────────────────────────────────────────────┐ +│ Kubernetes Service │ +│ (Load Balancer / Ingress) │ +└────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ Ray Serve │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Model A │ │ Model B │ │ Model C │ │ +│ │ (Replicas) │ │ (Replicas) │ │ (Replicas) │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +└────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ Ray Cluster │ +│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ +│ │ Head Node │ │ Worker 1 │ │ Worker 2 │ ... │ +│ └────────────┘ └────────────┘ └────────────┘ │ +└─────────────────────────────────────────────────────────┘ +``` + +## Core Components + +### 1. Ray Cluster + +The foundation of Model Service, providing distributed computing infrastructure. + +**Head Node:** + +- Cluster coordination and management +- Dashboard for monitoring +- Scheduling decisions +- Does not run model workloads (no CPU/GPU assigned) + +**Worker Nodes:** + +- Execute model inference workloads +- Can be CPU-only or GPU-enabled + +- Auto-scale based on demand +- Different worker groups for different hardware types + +### 2. Ray Serve + +Application layer for serving ML models as HTTP endpoints. + +**Features:** + +- HTTP request routing +- Load balancing across replicas +- Request batching +- Automatic retry and fault tolerance +- Dynamic model configuration + +### 3. KubeRay Operator + +Kubernetes operator that manages Ray clusters. + +**Responsibilities:** + +- Cluster lifecycle management (create, update, delete) +- Autoscaling worker nodes +- Health monitoring +- Configuration reconciliation + +### 4. Model Implementations + +Your ML models wrapped with Ray Serve decorators. 
+ +**Structure:** + +```python +@serve.deployment +class YourModel: + def __init__(self): + # Model loading + + async def __call__(self, request): + # Inference logic +``` + +## Data Flow + +### Inference Request Flow + +1. **Client Request**: HTTP POST to model endpoint +2. **Service Routing**: Kubernetes service routes to Ray Serve +3. **Load Balancing**: Ray Serve distributes to available replica +4. **Model Processing**: Replica executes inference +5. **Response**: Result returned to client + +``` +Client → K8s Service → Ray Serve Router → Model Replica → Response + ↓ + (Autoscaler) + ↓ + Add/Remove + Replicas +``` + +### Model Loading Flow + +1. **Initialization**: Ray Serve creates model replica +2. **Environment Setup**: Install dependencies from runtime_env +3. **Model Download**: Fetch from MLflow/storage +4. **Loading**: Initialize model in memory +5. **Ready**: Replica accepts requests + +## Scaling Architecture + +### Horizontal Scaling (Replicas) + +Models scale horizontally by adding/removing replicas: + +``` +Load: ████████░░ (80%) +Replicas: [R1] [R2] [R3] + +Load: ████████████████ (160%) +Replicas: [R1] [R2] [R3] [R4] [R5] [R6] +``` + +**Autoscaling Triggers:** + +- `target_ongoing_requests`: Target requests per replica +- Scale up when: requests > (replicas × target) +- Scale down when: requests < (replicas × target) + +### Vertical Scaling (Workers) + +Ray cluster scales by adding/removing worker pods: + +```yaml +workerGroupSpecs: + - groupName: cpu-workers + minReplicas: 0 + maxReplicas: 4 +``` + +**Triggers:** + +- Resource pressure (CPU, memory, GPU) +- Idle timeout (scale to zero) +- Manual scaling + +## Resource Management + +### CPU Resources + +```yaml +ray_actor_options: + num_cpus: 6 # CPUs per replica + +containers: + resources: + requests: + cpu: 12 # CPUs per worker pod +``` + +**Calculation:** + +- Worker pod CPUs ≥ (replicas × num_cpus) +- Leave headroom for system processes + +### Memory Resources + +```yaml +ray_actor_options: + memory: 5368709120 # 5 GiB per replica + +containers: + resources: + limits: + memory: 10Gi # Memory per worker pod +``` + +### GPU Resources + +```yaml +ray_actor_options: + num_gpus: 1 # GPUs per replica + +nodeSelector: + nvidia.com/gpu.product: NVIDIA-A40 + +resources: + limits: + nvidia.com/gpu: 1 # GPUs per worker pod +``` + +## High Availability + +### Fault Tolerance + +**Ray Cluster:** + +- Head node failure → Cluster recreated by KubeRay +- Worker failure → Workload rescheduled to other workers +- Network partition → Automatic reconnection + +**Ray Serve:** + +- Replica failure → Requests routed to healthy replicas +- Failed replicas automatically restarted +- Graceful shutdown on updates + +### Zero-Downtime Updates + +Model updates use blue-green deployment: + +1. New version deployed alongside old +2. Traffic gradually shifted to new version +3. Old version removed when no active requests + +```yaml +spec: + serveConfigV2: | + applications: + - name: my-model-v2 # New version + # ... 
new configuration +``` + +## Security + +### Pod Security + +All pods run with security constraints: + +```yaml +securityContext: + runAsNonRoot: true + runAsUser: 1000 + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + seccompProfile: + type: RuntimeDefault +``` + +### Network Security + +- Internal service communication only +- Ingress controls external access +- Proxy support for external dependencies + +## Monitoring & Observability + +### Ray Dashboard + +Web UI for cluster monitoring: + +- Resource utilization +- Active tasks +- Node status +- Serve deployments + +### Kubernetes Monitoring + +Standard Kubernetes tools: + +```bash +kubectl get pods -n [namespace] +kubectl top pods -n [namespace] +kubectl logs -n [namespace] +kubectl describe rayservice -n [namespace] +``` + +### Metrics + +Ray exports Prometheus metrics: + +- Request latency +- Request throughput +- Replica count +- Resource usage + +## Design Principles + +### 1. Declarative Configuration + +Infrastructure defined in YAML, managed by GitOps: + +```yaml +apiVersion: ray.io/v1 +kind: RayService +# ... configuration +``` + +### 2. Separation of Concerns + +- **Model Code**: Python implementation +- **Infrastructure**: Kubernetes manifests +- **Configuration**: user_config section + +### 3. Elastic Scaling + +- Scale to zero when idle +- Scale up on demand +- Efficient resource utilization + +### 4. Fault Tolerance + +- Automatic recovery from failures +- No single point of failure (except data plane) +- Graceful degradation + +### 5. Developer Experience + +- Simple model implementation +- Easy local testing +- Fast iteration cycle + +## Next Steps + +- [Deployment guide](../guides/deployment-guide.md) +- [Configuration reference](../reference/configuration-reference.md) +- [Adding new models](../guides/adding-models.md) diff --git a/docs/get-started/quick-start.md b/docs/get-started/quick-start.md new file mode 100644 index 0000000..f96fceb --- /dev/null +++ b/docs/get-started/quick-start.md @@ -0,0 +1,100 @@ +# Quick Start + +This guide will help you deploy your first model using Model Service in just a few minutes. + +## Prerequisites + +Before you begin, ensure you have: + +- Access to a Kubernetes cluster with KubeRay operator installed +- `kubectl` configured to access your cluster +- Basic familiarity with Kubernetes concepts + +Don't have KubeRay installed? +See the [Installation Guide](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/kuberay-operator-installation.html) for instructions on setting up KubeRay. + +## Step 1: Clone the Repository + +```bash +git clone https://gitlab.ics.muni.cz/rationai/infrastructure/model-service.git +cd model-service +``` + +## Step 2: Review the Configuration + +The repository includes a sample RayService configuration in `ray-service.yaml`. This deploys a binary classifier model for prostate tissue analysis. + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-models +spec: + serveConfigV2: | + applications: + - name: prostate-classifier-1 + import_path: models.binary_classifier:app + route_prefix: /prostate-classifier-1 + # ... configuration continues +``` + +## Step 3: Deploy the Service + +Apply the RayService configuration to your cluster. 
+ +Replace [namespace] with the desired namespace (e.g., `rationai-notebooks-ns, rationai-jobs-ns` etc.): + +```bash +kubectl apply -f ray-service.yaml -n [namespace] +``` + +## Step 4: Monitor Deployment + +Check the deployment status: + +```bash +# Check RayService status +kubectl get rayservice rayservice-models -n [namespace] + +# Check Ray cluster pods +kubectl get pods -n [namespace] +``` + +If the RayService is not becoming ready, inspect events and status: + +```bash +kubectl describe rayservice rayservice-models -n [namespace] +``` + +## Step 5: Local Access the Service + +Once deployed, you can port-forward the service to access it locally: + +```bash +# Port-forward to access the service locally +kubectl port-forward svc/rayservice-models-serve-svc -n [namespace] 8000:8000 +``` + +## Step 6: Delete the Deployed Model + +To delete the deployed RayService, run: + +```bash +kubectl delete -f ray-service.yaml -n [namespace] +``` + +## Next Steps + +Congratulations! You've successfully deployed your first model with Model Service. + +Now you can: + +- [Learn how to add your own models](../guides/adding-models.md) +- [Understand the architecture](../architecture/overview.md) +- [Read the deployment guide](../guides/deployment-guide.md) +- [Check configuration reference](../reference/configuration-reference.md) +- [Troubleshooting](../guides/troubleshooting.md) + +### Connection Issues + +Ensure your cluster has proper network policies and that the namespace has access to required resources (MLflow, proxy, etc.). diff --git a/docs/guides/adding-models.md b/docs/guides/adding-models.md new file mode 100644 index 0000000..fd79a7e --- /dev/null +++ b/docs/guides/adding-models.md @@ -0,0 +1,325 @@ +# Adding New Models + +This guide explains how to integrate your own machine learning models into Model Service. + +## Overview + +To add a new model, you need to: + +1. Create a model class with Ray Serve decorators +2. Implement the inference logic +3. Configure the RayService YAML +4. Deploy and test + +## Model Implementation + +### Basic Structure + +Create a Python file in the `models/` directory: + +```python +from ray import serve +from starlette.requests import Request + +@serve.deployment(ray_actor_options={"num_cpus": 2}) +class MyModel: + def __init__(self): + # Load your model here + pass + + async def __call__(self, request: Request): + # Handle inference requests + data = await request.json() + # Process data and return prediction + return {"prediction": result} + +app = MyModel.bind() +``` + +!!! note +The repository's reference model `BinaryClassifier` uses FastAPI ingress + batched inference and expects a **compressed binary payload** (not JSON). For simple JSON models, the examples above are fine; for high-throughput image inference, consider the batching and ingress patterns shown below. + +### Key Components + +#### 1. Deployment Decorator + +The `@serve.deployment` decorator marks your class as a Ray Serve deployment: + +```python +@serve.deployment( + ray_actor_options={ + "num_cpus": 2, # CPUs per replica + "num_gpus": 0, # GPUs per replica + "memory": 2 * 1024**3, # Memory in bytes + } +) +class MyModel: + ... +``` + +#### 2. Initialization + +Load your model in `__init__`: + +```python +def __init__(self): + import torch + + self.model = torch.load("model.pt") + self.model.eval() + print("Model loaded successfully") +``` + +#### 3. 
Inference Method + +Implement `__call__` or other methods for handling requests: + +```python +async def __call__(self, request: Request): + data = await request.json() + input_data = self.preprocess(data["input"]) + + with torch.no_grad(): + output = self.model(input_data) + + return {"prediction": self.postprocess(output)} +``` + +## Advanced Features + +### Dynamic Configuration + +Use `reconfigure()` to update model settings without redeployment: + +```python +from typing import TypedDict + +class Config(TypedDict): + threshold: float + batch_size: int + +@serve.deployment +class ConfigurableModel: + def __init__(self): + self.model = load_model() + + async def reconfigure(self, config: Config): + self.threshold = config["threshold"] + self.batch_size = config["batch_size"] + print(f"Reconfigured: threshold={self.threshold}") +``` + +Update config via RayService YAML: + +```yaml +user_config: + threshold: 0.5 + batch_size: 32 +``` + +### Batching Requests + +Use `@serve.batch` for efficient batch processing: + +```python +@serve.deployment +class BatchedModel: + def __init__(self): + self.model = load_model() + + @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1) + async def predict_batch(self, inputs: list[np.ndarray]): + batch = np.stack(inputs) + outputs = self.model(batch) + return outputs.tolist() + + async def __call__(self, request: Request): + data = await request.json() + input_data = np.array(data["input"]) + result = await self.predict_batch(input_data) + return {"prediction": result} +``` + +!!! tip +For binary/image workloads, you can also batch raw `bytes` like the `BinaryClassifier` does (see `models/binary_classifier.py`). This avoids JSON overhead and lets you control batch sizing via `user_config` by calling `set_max_batch_size()` and `set_batch_wait_timeout_s()`. + +### Using FastAPI + +For advanced HTTP features, use FastAPI: + +```python +from fastapi import FastAPI, HTTPException +from pydantic import BaseModel + +class PredictionRequest(BaseModel): + input: list[float] + +class PredictionResponse(BaseModel): + prediction: float + confidence: float + +fastapi = FastAPI() + +@serve.deployment +@serve.ingress(fastapi) +class FastAPIModel: + def __init__(self): + self.model = load_model() + + @fastapi.post("/predict", response_model=PredictionResponse) + async def predict(self, request: PredictionRequest): + output = self.model(request.input) + return PredictionResponse( + prediction=float(output), + confidence=0.95 + ) + +app = FastAPIModel.bind() +``` + +## Loading Models from MLflow + +Use the model provider to load from MLflow: + +```python +# models/mlflow_model.py +from providers.model_provider import mlflow + +@serve.deployment +class MLflowModel: + def __init__(self): + # This will be set via user_config + self.model_path = None + + async def reconfigure(self, config): + model_uri = config["model"]["artifact_uri"] + self.model_path = mlflow(artifact_uri=model_uri) + + # Load model + import onnxruntime as ort + self.session = ort.InferenceSession(self.model_path) + + async def __call__(self, request: Request): + # Inference logic + ... 
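+        # A minimal sketch of what the inference step could look like, assuming a
+        # single-input ONNX model and a JSON payload {"input": [...]}; both are
+        # illustrative assumptions, not this repository's actual request format
+        # (and it presumes `import numpy as np` at the top of the file):
+        #
+        #   data = await request.json()
+        #   input_array = np.array(data["input"], dtype=np.float32)
+        #   input_name = self.session.get_inputs()[0].name
+        #   outputs = self.session.run(None, {input_name: input_array})
+        #   return {"prediction": outputs[0].tolist()}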
+ +app = MLflowModel.bind() +``` + +Configure in YAML: + +```yaml +runtime_env: + env_vars: + MLFLOW_TRACKING_URI: http://mlflow.rationai-mlflow:5000 +user_config: + model: + artifact_uri: mlflow-artifacts:/65/abc123.../model.onnx +``` + +## RayService Configuration + +Add your model to `ray-service.yaml`: + +```yaml +spec: + serveConfigV2: | + applications: + - name: my-model + import_path: models.my_onnx_model:app + route_prefix: /my-model + runtime_env: + working_dir: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/archive/master/model-service-master.zip + pip: + - onnxruntime>=1.23.2 + - numpy + deployments: + - name: MyONNXModel + autoscaling_config: + min_replicas: 1 + max_replicas: 4 + ray_actor_options: + num_cpus: 2 + memory: 4294967296 # 4 GiB + runtime_env: + pip: + - onnxruntime>=1.23.2 +``` + +!!! note +In this repo, the production `ray-service.yaml` installs model dependencies under `deployments[*].ray_actor_options.runtime_env.pip` (not only at `applications[*].runtime_env`). This is useful when different deployments need different dependencies. + +## GPU Models + +For GPU-accelerated models: + +```python +@serve.deployment(ray_actor_options={"num_gpus": 1}) +class GPUModel: + def __init__(self): + import torch + + self.device = torch.device("cuda") + self.model = torch.load("model.pt").to(self.device) + self.model.eval() + + async def __call__(self, request: Request): + data = await request.json() + input_tensor = torch.tensor(data["input"]).to(self.device) + + with torch.no_grad(): + output = self.model(input_tensor) + + return {"prediction": output.cpu().numpy().tolist()} +``` + +Configure GPU worker group: + +```yaml +workerGroupSpecs: + - groupName: gpu-workers + replicas: 0 + minReplicas: 0 + maxReplicas: 2 + template: + spec: + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A40 + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312-gpu + resources: + limits: + nvidia.com/gpu: 1 +``` + +## Deployment + +Deploy your model: + +```bash +kubectl apply -f ray-service.yaml -n [namespace] +``` + +Monitor deployment: + +```bash +kubectl get rayservice -n [namespace] +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=100 +``` + +## Best Practices + +1. **Error Handling**: Always wrap inference in try-except blocks +2. **Logging**: Use `print()` or `logging` for debugging (viewable in pod logs) +3. **Resource Limits**: Set appropriate CPU/memory/GPU limits +4. **Model Loading**: Cache models to avoid reloading on each request +5. **Input Validation**: Validate input data format and ranges +6. **Batching**: Use batching for throughput-intensive workloads +7. **Health Checks**: Implement health check endpoints for monitoring + +## Next Steps + +- [Deployment guide](deployment-guide.md) +- [Configuration reference](../reference/configuration-reference.md) +- [Architecture overview](../architecture/overview.md) diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md new file mode 100644 index 0000000..cdfc110 --- /dev/null +++ b/docs/guides/deployment-guide.md @@ -0,0 +1,447 @@ +# Deployment Guide + +Complete guide for deploying models to production with Model Service. + +## Prerequisites + +Before deploying to production, ensure: + +- [x] KubeRay operator installed +- [x] Namespace created (`rationai-notebooks-ns`) +- [x] Model tested locally +- [x] RayService YAML configured +- [x] MLflow accessible (if using MLflow) + +## Deployment Workflow + +### 1. 
Prepare Model Code + +Ensure your model is in the `models/` directory and properly structured: + +```python +# models/my_model.py +from ray import serve +from starlette.requests import Request + +@serve.deployment(ray_actor_options={"num_cpus": 2}) +class MyModel: + def __init__(self): + # Model initialization + self.model = self.load_model() + + def load_model(self): + # Load model logic + pass + + async def __call__(self, request: Request): + # Inference logic + data = await request.json() + result = self.model.predict(data["input"]) + return {"prediction": result} + +app = MyModel.bind() +``` + +### 2. Create RayService Configuration + +Create or modify `ray-service.yaml`: + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-my-model + namespace: rationai-notebooks-ns +spec: + serveConfigV2: | + applications: + - name: my-model + import_path: models.my_model:app + route_prefix: /my-model + runtime_env: + working_dir: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/archive/master/model-service-master.zip + pip: + - numpy + - pandas + env_vars: + MODEL_VERSION: "1.0.0" + deployments: + - name: MyModel + autoscaling_config: + min_replicas: 1 + max_replicas: 5 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 4 + memory: 4294967296 # 4 GiB + runtime_env: + pip: + - numpy + - pandas + + rayClusterConfig: + rayVersion: 2.52.1 + enableInTreeAutoscaling: true + autoscalerOptions: + idleTimeoutSeconds: 300 + + headGroupSpec: + rayStartParams: + num-cpus: "0" + dashboard-host: "0.0.0.0" + template: + spec: + containers: + - name: ray-head + image: rayproject/ray:2.52.1-py312 + resources: + limits: + cpu: 2 + memory: 4Gi + + workerGroupSpecs: + - groupName: cpu-workers + replicas: 1 + minReplicas: 1 + maxReplicas: 10 + template: + spec: + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312 + resources: + limits: + cpu: 8 + memory: 16Gi +``` + +### 3. Deploy to Kubernetes + +Apply the configuration: + +```bash +kubectl apply -f ray-service.yaml -n [namespace] +``` + +### 4. Monitor Deployment + +Watch the deployment progress: + +```bash +# Watch RayService status +kubectl get rayservice rayservice-my-model -n [namespace] -w + +# Check pods +kubectl get pods -n [namespace] -l ray.io/cluster + +# View head node logs +kubectl logs -n [namespace] -l ray.io/node-type=head -f + +# View worker logs +kubectl logs -n [namespace] -l ray.io/node-type=worker -f +``` + +Wait for status to show `Running` and application status to show `RUNNING`. + +### 5. Verify Deployment + +Check service endpoints: + +```bash +# Get service details +kubectl get svc -n [namespace] + +# Port forward to test +kubectl port-forward -n rationai-notebooks-ns \ + svc/rayservice-my-model-serve-svc 8000:8000 +``` + +!!! note +The example model in this repository (`models/binary_classifier.py`) uses FastAPI ingress and expects a **compressed binary request body** (LZ4), not JSON. The JSON `curl` example below is valid for JSON-based models but does not apply to `BinaryClassifier`. + +Test the endpoint: + +```bash +curl -X POST http://localhost:8000/my-model \ + -H "Content-Type: application/json" \ + -d '{"input": [1.0, 2.0, 3.0]}' +``` + +## Production Considerations + +### Resource Planning + +**Calculate resource requirements:** + +1. **Per-replica resources:** + + - CPU: Based on model complexity + - Memory: Model size + working memory + overhead + - GPU: Number of GPUs needed + +2. 
**Total cluster resources:** + + ``` + Total CPUs = max_replicas × num_cpus + overhead + Total Memory = max_replicas × memory + overhead + ``` + +3. **Example calculation:** + + ``` + Model: 4 CPU, 4GB per replica + Max replicas: 5 + + Required per worker: 5 × 4 = 20 CPUs, 5 × 4GB = 20GB + Overhead: +2 CPUs, +4GB for system + + Worker resources: 22 CPUs, 24GB memory + ``` + +### Autoscaling Configuration + +**Choose appropriate scaling parameters:** + +```yaml +autoscaling_config: + min_replicas: 1 # Always keep 1 running + max_replicas: 10 # Scale up to 10 + target_ongoing_requests: 20 # Target load per replica + + # Advanced options + upscale_delay_s: 30 # Wait 30s before scaling up + downscale_delay_s: 600 # Wait 10m before scaling down +``` + +**Scaling behavior:** + +- **Cold start**: Set `min_replicas: 0` for scale-to-zero +- **Always available**: Set `min_replicas: 1` or higher +- **High traffic**: Increase `max_replicas` and `target_ongoing_requests` +- **Batch processing**: Use higher `target_ongoing_requests` + +### High Availability + +**For production workloads:** + +```yaml +# Multiple replicas +autoscaling_config: + min_replicas: 2 # At least 2 for redundancy + +# Multiple workers +workerGroupSpecs: + - groupName: cpu-workers + minReplicas: 2 + maxReplicas: 10 +``` + +### Resource Limits + +**Always set resource limits:** + +```yaml +containers: + - name: ray-worker + resources: + requests: # Guaranteed resources + cpu: 8 + memory: 16Gi + limits: # Maximum resources + cpu: 12 + memory: 20Gi +``` + +### Network Configuration + +**Proxy settings:** + +```yaml +env: + - name: HTTP_PROXY + value: "http://proxy.example.com:3128" + - name: HTTPS_PROXY + value: "http://proxy.example.com:3128" + - name: NO_PROXY + value: ".svc.cluster.local,.cluster.local" +``` + +**Service configuration:** + +```yaml +# If you need external access +apiVersion: v1 +kind: Service +metadata: + name: rayservice-external +spec: + type: LoadBalancer + selector: + ray.io/cluster: rayservice-my-model + ports: + - port: 80 + targetPort: 8000 +``` + +## Multi-Model Deployment + +Deploy multiple models in one RayService: + +```yaml +serveConfigV2: | + applications: + - name: model-a + import_path: models.model_a:app + route_prefix: /model-a + deployments: + - name: ModelA + ray_actor_options: + num_cpus: 4 + + - name: model-b + import_path: models.model_b:app + route_prefix: /model-b + deployments: + - name: ModelB + ray_actor_options: + num_gpus: 1 +``` + +Access models: + +```bash +curl http://service:8000/model-a -d '{"input": ...}' +curl http://service:8000/model-b -d '{"input": ...}' +``` + +## Updating Deployments + +### Update Model Code + +1. Update code in repository +2. Commit and push changes +3. RayService will automatically fetch new code from `working_dir` URL + +### Update Configuration + +```bash +# Edit configuration +vim ray-service.yaml + +# Apply changes +kubectl apply -f ray-service.yaml +``` + +Ray will perform rolling update: + +- New replicas created with new config +- Traffic gradually shifted +- Old replicas removed + +### Update Model Weights + +If using MLflow: + +```yaml +user_config: + model: + artifact_uri: mlflow-artifacts:/65/NEW_RUN_ID/model.onnx +``` + +Apply update: + +```bash +kubectl apply -f ray-service.yaml +``` + +## Rollback + +If deployment fails, rollback: + +```bash +# RayService is a Custom Resource (CRD), so Kubernetes "rollout" doesn't apply. +# Instead, view KubeRay status and events, then re-apply a known-good spec. 
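+
+# If the spec itself is rejected or reconciliation stalls, the KubeRay operator
+# logs usually show why (namespace and deployment name below are illustrative;
+# adjust them to your KubeRay installation):
+kubectl logs -n kuberay-system deployment/kuberay-operator --tail=200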
+ +# Inspect current state and recent events +kubectl get rayservice rayservice-my-model -n rationai-notebooks-ns -o yaml +kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns + +# Check Ray Serve controller logs (usually shows the root cause) +kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=head --tail=200 +``` + +Or manually apply previous configuration: + +```bash +git checkout HEAD~1 ray-service.yaml +kubectl apply -f ray-service.yaml +``` + +## Troubleshooting + +### Deployment Stuck + +**Check RayService status:** + +```bash +kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns +``` + +**Common issues:** + +- Image pull errors +- Insufficient resources +- Configuration errors +- Network issues + +### Application Not Starting + +**Check serve application logs:** + +```bash +# View dashboard +kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 + +# Check logs +kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=worker --tail=100 +``` + +**Common issues:** + +- Python import errors +- Model loading failures +- Dependency issues +- Resource limits + +### High Latency + +**Check metrics:** + +```bash +# Ray dashboard: http://localhost:8265 +kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 +``` + +**Possible solutions:** + +- Increase replicas +- Enable batching +- Optimize model code +- Increase resources + +## Best Practices + +1. **Version Control**: Keep all YAML configs in Git +2. **Testing**: Test locally before deploying +3. **Monitoring**: Set up alerts for failures +4. **Resource Limits**: Always set limits to prevent resource hogging +5. **Gradual Rollout**: Update replicas gradually +6. **Documentation**: Document custom configurations +7. **Backup**: Keep backup of working configurations + +## Next Steps + +- [Configuration reference](../reference/configuration-reference.md) +- [Architecture overview](../architecture/overview.md) +- [Adding new models](adding-models.md) +- [Troubleshooting](troubleshooting.md) diff --git a/docs/guides/troubleshooting.md b/docs/guides/troubleshooting.md new file mode 100644 index 0000000..39046b0 --- /dev/null +++ b/docs/guides/troubleshooting.md @@ -0,0 +1,159 @@ +# Troubleshooting + +This page lists the most common issues when deploying and running models in Model Service (Ray Serve on KubeRay). + +## Quick Triage Checklist + +Start here before digging deeper: + +```bash +kubectl get rayservice -n [namespace] +kubectl describe rayservice rayservice-models -n [namespace] +kubectl get pods -n [namespace] +``` + +Then inspect logs: + +```bash +kubectl logs -n [namespace] -l ray.io/node-type=head --tail=200 +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=200 +``` + +## RayService Shows `DEPLOY_FAILED` + +### What it usually means + +Ray Serve could not start the application or deployment. The root cause is typically visible in the Ray Serve controller logs. + +### What to do + +1. Describe the RayService for events: + +```bash +kubectl describe rayservice rayservice-models -n [namespace] +``` + +2. Open the Ray dashboard (helps with Serve deployment errors): + +```bash +kubectl port-forward -n [namespace] svc/rayservice-models-head-svc 8265:8265 +``` + +Visit `http://localhost:8265`. + +3. Look for Python import errors / missing dependencies: + +```bash +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=500 +``` + +## ImportError / ModuleNotFoundError + +### Symptoms + +- Serve deployment fails immediately. 
+- Logs show `ModuleNotFoundError: No module named ...`. + +### Causes + +- Dependency not installed in the runtime environment. +- Wrong `import_path`. +- `working_dir` does not contain the expected code. + +### Fix + +- Ensure `import_path` matches your file: + - Example: `models.binary_classifier:app` means there is `models/binary_classifier.py` defining `app = ...`. +- Add missing dependencies to `runtime_env.pip`. + +In this repository, dependencies are typically installed per deployment: + +```yaml +deployments: + - name: BinaryClassifier + ray_actor_options: + runtime_env: + pip: ["onnxruntime>=1.23.2", "mlflow<3.0", "lz4>=4.4.5"] +``` + +## Autoscaling Not Working (Replicas Don’t Change) + +### Serve replicas not scaling + +Check your deployment has autoscaling configured: + +```yaml +autoscaling_config: + min_replicas: 0 + max_replicas: 4 + target_ongoing_requests: 32 +``` + +Also note: + +- Scale up/down is not instantaneous (delays and smoothing apply). +- If traffic is low, you may stay at `min_replicas`. + +### Worker pods not scaling + +Worker pod scaling requires cluster autoscaling enabled: + +```yaml +rayClusterConfig: + enableInTreeAutoscaling: true + autoscalerOptions: + idleTimeoutSeconds: 60 +``` + +Also ensure `workerGroupSpecs[*].minReplicas/maxReplicas` allow scaling. + +## Not Enough CPU / Memory (Pods Pending) + +### Symptoms + +- Pods stay in `Pending`. +- Events mention `Insufficient cpu` or `Insufficient memory`. + +### Fix + +- Reduce per-replica requirements (`ray_actor_options.num_cpus`, `memory`). +- Increase cluster capacity or adjust worker pod resources. + +Inspect pod scheduling events: + +```bash +kubectl describe pod -n [namespace] +``` + +## MLflow / Artifact Download Problems + +### Symptoms + +- `mlflow.artifacts.download_artifacts` fails. +- Timeouts during replica initialization. + +### Fix + +- Ensure `MLFLOW_TRACKING_URI` is set and reachable from the cluster. +- Ensure the cluster has network access (proxy settings if needed). +- Verify the `artifact_uri` exists and permissions are correct. + +In `ray-service.yaml` this is typically configured via `env_vars`: + +```yaml +ray_actor_options: + runtime_env: + env_vars: + MLFLOW_TRACKING_URI: http://mlflow.rationai-mlflow:5000 +``` + +## Helpful Commands + +```bash +# list Serve and RayService resources +kubectl get rayservice -n [namespace] +kubectl get svc -n [namespace] + +# see all pods for a RayService +kubectl get pods -n [namespace] -l ray.io/cluster=rayservice-models +``` diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..865ea5a --- /dev/null +++ b/docs/index.md @@ -0,0 +1,72 @@ +# Model Service Documentation + +Welcome to the Model Service documentation. This service provides a scalable, production-ready infrastructure for deploying machine learning models for the RationAI project using Ray Serve on Kubernetes. + +## What is Model Service? 
+ +Model Service is a deployment framework that enables: + +- **Scalable Model Serving**: Automatically scale model replicas based on request load +- **Distributed Inference**: Distribute inference workloads across multiple workers and nodes +- **Resource Management**: Efficiently manage CPU and GPU resources in Kubernetes +- **Model Versioning**: Integration with MLflow for model lifecycle management +- **Production Ready**: Built on Ray Serve with fault tolerance and high availability + +## Key Features + +### Auto-Scaling + +Model Service automatically adjusts the number of model replicas based on incoming request volume, ensuring optimal resource utilization and response times. + +### Multi-Model Deployment + +Deploy multiple models simultaneously with isolated resource allocations and independent scaling policies. + +### GPU/CPU Support + +Flexible resource allocation supporting both CPU-based and GPU-accelerated models with hardware-specific worker groups. + +### Kubernetes Native + +Leverages KubeRay operator for seamless integration with Kubernetes, enabling declarative configuration and GitOps workflows. + +## Why Ray Serve? + +Model Service is built on top of Ray Serve because it combines a simple developer experience with strong production capabilities: + +- **Unified batch and online inference**: The same Ray cluster can handle real-time HTTP requests and large batch jobs, which matches RationAI's mix of interactive and offline pathology workloads. +- **Python‑native API**: Models are implemented as regular Python classes or functions with decorators, making it easy for researchers to contribute without learning a heavy framework. +- **Autoscaling built in**: Ray Serve natively scales replicas based on request pressure and integrates with Ray's cluster autoscaler to add/remove worker pods. +- **Multi‑model support**: Multiple independent applications and deployments can run side‑by‑side on one cluster while isolating resources per model. + +Alternative approaches (plain Kubernetes deployments, custom Flask/FastAPI services, or specialized serving stacks like TorchServe or TF Serving) either lack first‑class autoscaling orchestration across many models, or are tightly coupled to specific ML frameworks. Ray Serve, together with KubeRay, lets us: + +- Express all infrastructure declaratively in a single `RayService` resource. +- Share the same cluster across heterogeneous models and hardware (CPU/GPU). +- Keep the operational surface smaller by relying on one general‑purpose serving layer instead of many ad‑hoc microservices. + +## Use Cases + +Model Service is designed for: + +- **Pathology Image Analysis**: Deploy models for tissue classification, nuclei detection, and other pathology tasks +- **Batch Processing**: Handle large-scale inference workloads efficiently +- **Real-time Inference**: Serve predictions with low latency for interactive applications +- **Research Experiments**: Quickly deploy and test new model versions + +## Getting Help + +- **Documentation**: Browse the guides and reference materials in this documentation +- **Issues**: Report bugs or request features via [GitLab Issues](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/issues) +- **Contact**: Reach out to the RationAI team at Masaryk University + +## Next Steps + +Ready to get started? Follow our [Quick Start Guide](get-started/quick-start.md) to deploy your first model. 
+ +## Glossary + +- **RayService**: KubeRay custom resource that manages a Ray cluster plus a Ray Serve application, including updates. +- **Deployment (Ray Serve)**: A scalable unit (replicas) that runs your model code. +- **Replica**: One running instance of a deployment. +- **Worker group (KubeRay)**: A set of Ray worker pods (e.g., CPU or GPU workers) with independent scaling bounds. diff --git a/docs/reference/configuration-reference.md b/docs/reference/configuration-reference.md new file mode 100644 index 0000000..1e4424c --- /dev/null +++ b/docs/reference/configuration-reference.md @@ -0,0 +1,171 @@ +# Configuration Reference + +This page summarizes the **most important knobs** you will touch when configuring Model Service. For full API details, see the upstream Ray Serve and KubeRay documentation. + +## 1. RayService Skeleton + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: + namespace: [namespace] +spec: + serveConfigV2: | + # Ray Serve applications + rayClusterConfig: + # Ray cluster (head + workers) +``` + +Think of it as two parts: + +- **`serveConfigV2`**: what you serve (apps, deployments, autoscaling). +- **`rayClusterConfig`**: where it runs (Ray version, worker groups, resources). + +## 2. Applications and Deployments + +### Applications (HTTP endpoints) + +```yaml +serveConfigV2: | + applications: + - name: prostate-classifier + import_path: models.binary_classifier:app + route_prefix: /prostate-classifier + runtime_env: + working_dir: https://.../model-service-master.zip + pip: + - onnxruntime>=1.23.2 +``` + +- `name`: logical app name (used in Ray dashboard/logs). +- `import_path`: Python entrypoint (`module.path:variable`). +- `route_prefix`: HTTP path under the Serve gateway. +- `runtime_env`: code location + extra Python deps. + +### Deployments (scaling + resources) + +```yaml +deployments: + - name: BinaryClassifier + max_ongoing_requests: 64 + max_queued_requests: 128 + autoscaling_config: + min_replicas: 0 + max_replicas: 4 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 6 + memory: 5368709120 # 5 GiB + user_config: + tile_size: 512 + threshold: 0.5 +``` + +- `autoscaling_config`: how many replicas and when to scale. +- `ray_actor_options`: per‑replica CPU/GPU/memory. +- `user_config`: free‑form dict passed to `reconfigure()` in your model. + +## 3. Ray Cluster (Workers and Autoscaling) + +```yaml +rayClusterConfig: + rayVersion: "2.52.1" + enableInTreeAutoscaling: true + headGroupSpec: + rayStartParams: + num-cpus: "0" # head only coordinates + template: + spec: + containers: + - name: ray-head + image: rayproject/ray:2.52.1-py312 + workerGroupSpecs: + - groupName: cpu-workers + replicas: 1 + minReplicas: 1 + maxReplicas: 10 + template: + spec: + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312 + resources: + requests: + cpu: "4" + memory: "8Gi" + limits: + cpu: "8" + memory: "16Gi" +``` + +Focus on: + +- `rayVersion`: must match images you use. +- `workerGroupSpecs[*].{replicas,minReplicas,maxReplicas}`: cluster‑level scaling bounds. +- `resources.requests/limits`: how big each worker pod is. + +## 4. 
Security and Placement (Optional but Recommended) + +```yaml +template: + spec: + securityContext: + runAsNonRoot: true + fsGroupChangePolicy: OnRootMismatch + seccompProfile: + type: RuntimeDefault + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A40 + containers: + - name: ray-worker + securityContext: + allowPrivilegeEscalation: false + runAsUser: 1000 + capabilities: + drop: ["ALL"] +``` + +Use these to: + +- Enforce non‑root containers and least privilege. +- Pin GPU workloads to specific node types. + +## 5. Putting It Together (Small Example) + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-example + namespace: rationai-notebooks-ns +spec: + serveConfigV2: | + applications: + - name: my-classifier + import_path: models.classifier:app + route_prefix: /classify + deployments: + - name: Classifier + autoscaling_config: + min_replicas: 1 + max_replicas: 5 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 4 + rayClusterConfig: + rayVersion: "2.52.1" + enableInTreeAutoscaling: true + headGroupSpec: + rayStartParams: + num-cpus: "0" + workerGroupSpecs: + - groupName: cpu-workers + minReplicas: 1 + maxReplicas: 5 +``` + +## Next Steps + +- [Deployment guide](../guides/deployment-guide.md) +- [Architecture overview](../architecture/overview.md) diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..3a255cd --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,46 @@ +site_name: Model Service Documentation +site_description: Model deployment infrastructure for RationAI using Ray Serve on Kubernetes +repo_url: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service +edit_uri: edit/master/docs/ + +theme: + name: material + palette: + primary: indigo + accent: indigo + features: + - navigation.tabs + - navigation.sections + - toc.integrate + - search.suggest + - search.highlight + - content.code.copy + +nav: + - Home: index.md + - Get Started: + - Quick Start: get-started/quick-start.md + - Installation: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/kuberay-operator-installation.html + - First Deployment: https://docs.ray.io/en/latest/serve/production-guide/kubernetes.html + - Guides: + - Deployment Guide: guides/deployment-guide.md + - Adding New Models: guides/adding-models.md + - Troubleshooting: guides/troubleshooting.md + - Architecture: + - Overview: architecture/overview.md + - Reference: + - Configuration Reference: reference/configuration-reference.md + +markdown_extensions: + - admonition + - pymdownx.details + - pymdownx.superfences + - pymdownx.highlight: + anchor_linenums: true + - pymdownx.inlinehilite + - pymdownx.snippets + - pymdownx.tabbed: + alternate_style: true + - tables + - toc: + permalink: true \ No newline at end of file From 43bc37dd53a39e980a31ea31971a707ed5e8f5e0 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Tue, 6 Jan 2026 13:58:38 +0100 Subject: [PATCH 02/17] feat: 2nd docs iteration --- README.md | 132 ++++++++++-------- docs/architecture/overview.md | 25 ++-- docs/get-started/quick-start.md | 12 +- docs/guides/adding-models.md | 7 +- .../configuration-reference.md | 2 +- docs/guides/deployment-guide.md | 57 +++----- docs/index.md | 2 +- mkdocs.yml | 5 +- 8 files changed, 113 insertions(+), 129 deletions(-) rename docs/{reference => guides}/configuration-reference.md (98%) diff --git a/README.md b/README.md index f6ea4ce..6636fae 100644 --- a/README.md +++ b/README.md @@ -1,93 +1,107 @@ # Model Service +Model deployment infrastructure for RationAI using Ray Serve on 
Kubernetes. +This repository contains: -## Getting started +- A KubeRay `RayService` manifest (`ray-service.yaml`) for deploying Ray Serve on Kubernetes. +- Model implementations under `models/` (reference: `models/binary_classifier.py`). +- Documentation under `docs/` (MkDocs). -To make it easy for you to get started with GitLab, here's a list of recommended next steps. +## Documentation -Already a pro? Just edit this README.md and make it your own. Want to make it easy? [Use the template at the bottom](#editing-this-readme)! +- MkDocs content: `docs/` +- Key pages: + - `docs/get-started/quick-start.md` + - `docs/guides/deployment-guide.md` + - `docs/guides/adding-models.md` + - `docs/guides/configuration-reference.md` + - `docs/guides/troubleshooting.md` + - `docs/architecture/overview.md` -## Add your files +## Quick Start (Kubernetes) -- [ ] [Create](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#create-a-file) or [upload](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#upload-a-file) files -- [ ] [Add files using the command line](https://docs.gitlab.com/topics/git/add_files/#add-files-to-a-git-repository) or push an existing Git repository with the following command: +Full walkthrough: `docs/get-started/quick-start.md`. -``` -cd existing_repo -git remote add origin https://gitlab.ics.muni.cz/rationai/infrastructure/model-service2.git -git branch -M master -git push -uf origin master -``` +### Prerequisites -## Integrate with your tools +- Kubernetes cluster with KubeRay operator installed +- `kubectl` configured for the cluster -- [ ] [Set up project integrations](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service2/-/settings/integrations) +### Deploy -## Collaborate with your team +```bash +kubectl apply -f ray-service.yaml -n rationai-notebooks-ns +kubectl get rayservice rayservice-models -n rationai-notebooks-ns +``` -- [ ] [Invite team members and collaborators](https://docs.gitlab.com/ee/user/project/members/) -- [ ] [Create a new merge request](https://docs.gitlab.com/ee/user/project/merge_requests/creating_merge_requests.html) -- [ ] [Automatically close issues from merge requests](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#closing-issues-automatically) -- [ ] [Enable merge request approvals](https://docs.gitlab.com/ee/user/project/merge_requests/approvals/) -- [ ] [Set auto-merge](https://docs.gitlab.com/user/project/merge_requests/auto_merge/) +### Access locally -## Test and Deploy +```bash +kubectl port-forward -n rationai-notebooks-ns svc/rayservice-models-serve-svc 8000:8000 +``` -Use the built-in continuous integration in GitLab. 
+### Test the reference model (`BinaryClassifier`) -- [ ] [Get started with GitLab CI/CD](https://docs.gitlab.com/ee/ci/quick_start/) -- [ ] [Analyze your code for known vulnerabilities with Static Application Security Testing (SAST)](https://docs.gitlab.com/ee/user/application_security/sast/) -- [ ] [Deploy to Kubernetes, Amazon EC2, or Amazon ECS using Auto Deploy](https://docs.gitlab.com/ee/topics/autodevops/requirements.html) -- [ ] [Use pull-based deployments for improved Kubernetes management](https://docs.gitlab.com/ee/user/clusters/agent/) -- [ ] [Set up protected environments](https://docs.gitlab.com/ee/ci/environments/protected_environments.html) +The reference deployment in `ray-service.yaml` exposes an app at the route prefix: -*** +- `/prostate-classifier-1` -# Editing this README +`models/binary_classifier.py` expects a **request body that is LZ4-compressed raw bytes** of a single RGB tile: -When you're ready to make this README your own, just edit this file and use the handy template below (or feel free to structure it however you want - this is just a starting point!). Thanks to [makeareadme.com](https://www.makeareadme.com/) for this template. +- dtype: `uint8` +- shape: `(tile_size, tile_size, 3)` +- byte order: row-major (NumPy default) -## Suggestions for a good README +Example (Python): -Every project is different, so consider which of these sections apply to yours. The sections used in the template are suggestions for most open source projects. Also keep in mind that while a README can be too long and detailed, too long is better than too short. If you think your README is too long, consider utilizing another form of documentation rather than cutting out information. +```bash +pip install numpy lz4 requests +``` -## Name -Choose a self-explaining name for your project. +```python +import lz4.frame +import numpy as np +import requests -## Description -Let people know what your project can do specifically. Provide context and add a link to any reference visitors might be unfamiliar with. A list of Features or a Background subsection can also be added here. If there are alternatives to your project, this is a good place to list differentiating factors. +tile_size = 512 # must match RayService user_config.tile_size +tile = np.zeros((tile_size, tile_size, 3), dtype=np.uint8) -## Badges -On some READMEs, you may see small images that convey metadata, such as whether or not all the tests are passing for the project. You can use Shields to add some to your README. Many services also have instructions for adding a badge. +payload = lz4.frame.compress(tile.tobytes()) -## Visuals -Depending on what you are making, it can be a good idea to include screenshots or even a video (you'll frequently see GIFs rather than actual videos). Tools like ttygif can help, but check out Asciinema for a more sophisticated method. +resp = requests.post( + "http://localhost:8000/prostate-classifier-1/", + data=payload, + headers={"Content-Type": "application/octet-stream"}, + timeout=60, +) +resp.raise_for_status() +print(resp.json() if resp.headers.get("content-type", "").startswith("application/json") else resp.text) +``` -## Installation -Within a particular ecosystem, there may be a common way of installing things, such as using Yarn, NuGet, or Homebrew. However, consider the possibility that whoever is reading your README is a novice and would like more guidance. Listing specific steps helps remove ambiguity and gets people to using your project as quickly as possible. 
If it only runs in a specific context like a particular programming language version or operating system or has dependencies that have to be installed manually, also add a Requirements subsection. +## Repository Structure -## Usage -Use examples liberally, and show the expected output if you can. It's helpful to have inline the smallest example of usage that you can demonstrate, while providing links to more sophisticated examples if they are too long to reasonably include in the README. +``` +model-service/ +├── models/ # Model implementations +│ └── binary_classifier.py +├── providers/ # Model loading providers +│ └── model_provider.py +├── docs/ # Documentation +├── ray-service.yaml # Kubernetes RayService configuration +├── pyproject.toml # Python dependencies +└── README.md +``` ## Support -Tell people where they can go to for help. It can be any combination of an issue tracker, a chat room, an email address, etc. - -## Roadmap -If you have ideas for releases in the future, it is a good idea to list them in the README. -## Contributing -State if you are open to contributions and what your requirements are for accepting them. +- **Issues:** Report bugs or request features via [GitLab Issues](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/issues) +- **Contact:** RationAI team at Masaryk University -For people who want to make changes to your project, it's helpful to have some documentation on how to get started. Perhaps there is a script that they should run or some environment variables that they need to set. Make these steps explicit. These instructions could also be useful to your future self. - -You can also document commands to lint the code or run tests. These steps help to ensure high code quality and reduce the likelihood that the changes inadvertently break something. Having instructions for running tests is especially helpful if it requires external setup, such as starting a Selenium server for testing in a browser. +## License -## Authors and acknowledgment -Show your appreciation to those who have contributed to the project. +This project is part of the RationAI infrastructure and is available for use by authorized members of the RationAI group. -## License -For open source projects, say how it is licensed. +## Authors -## Project status -If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers. +Developed and maintained by the RationAI team at Masaryk University, Faculty of Informatics. diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md index 48447d2..6d3f3bc 100644 --- a/docs/architecture/overview.md +++ b/docs/architecture/overview.md @@ -47,13 +47,12 @@ The foundation of Model Service, providing distributed computing infrastructure. 
- Cluster coordination and management - Dashboard for monitoring - Scheduling decisions -- Does not run model workloads (no CPU/GPU assigned) +- Typically configured with 0 CPU/GPU for user workloads (cluster coordination only) **Worker Nodes:** - Execute model inference workloads - Can be CPU-only or GPU-enabled - - Auto-scale based on demand - Different worker groups for different hardware types @@ -211,23 +210,21 @@ resources: **Ray Cluster:** -- Head node failure → Cluster recreated by KubeRay -- Worker failure → Workload rescheduled to other workers -- Network partition → Automatic reconnection +- Worker pod failure → Ray reschedules work when possible, KubeRay recreates pods +- Head pod failure → KubeRay typically restarts the head, but the cluster (and Serve) may be briefly unavailable during recovery +- In general, expect automatic recovery from many pod-level failures, but not strict “no-downtime” guarantees **Ray Serve:** - Replica failure → Requests routed to healthy replicas - Failed replicas automatically restarted -- Graceful shutdown on updates +- Graceful shutdown is supported when configured properly, but does not guarantee zero dropped requests. -### Zero-Downtime Updates +### Updates and Downtime -Model updates use blue-green deployment: +RayService updates are reconciled by KubeRay and are designed to minimize downtime, but the exact behavior depends on your Ray/KubeRay versions and the change being applied. -1. New version deployed alongside old -2. Traffic gradually shifted to new version -3. Old version removed when no active requests +In practice, updates may temporarily run old and new replicas at the same time while shifting traffic to healthy replicas. ```yaml spec: @@ -284,7 +281,7 @@ kubectl describe rayservice -n [namespace] ### Metrics -Ray exports Prometheus metrics: +Ray can export Prometheus metrics (when metrics collection/export is enabled): - Request latency - Request throughput @@ -318,7 +315,7 @@ kind: RayService ### 4. Fault Tolerance - Automatic recovery from failures -- No single point of failure (except data plane) +- Ray Head Node is a logical single point of control - failures are recoverable but may cause brief service disruption - Graceful degradation ### 5. Developer Experience @@ -330,5 +327,5 @@ kind: RayService ## Next Steps - [Deployment guide](../guides/deployment-guide.md) -- [Configuration reference](../reference/configuration-reference.md) +- [Configuration reference](../guides/configuration-reference.md) - [Adding new models](../guides/adding-models.md) diff --git a/docs/get-started/quick-start.md b/docs/get-started/quick-start.md index f96fceb..e2084d9 100644 --- a/docs/get-started/quick-start.md +++ b/docs/get-started/quick-start.md @@ -72,7 +72,7 @@ Once deployed, you can port-forward the service to access it locally: ```bash # Port-forward to access the service locally -kubectl port-forward svc/rayservice-models-serve-svc -n [namespace] 8000:8000 +kubectl port-forward -n [namespace] svc/rayservice-models-serve-svc 8000:8000 ``` ## Step 6: Delete the Deployed Model @@ -83,6 +83,10 @@ To delete the deployed RayService, run: kubectl delete -f ray-service.yaml -n [namespace] ``` +### Connection Issues + +Ensure your cluster has proper network policies and that the namespace has access to required resources (MLflow, proxy, etc.). + ## Next Steps Congratulations! You've successfully deployed your first model with Model Service. 
@@ -92,9 +96,5 @@ Now you can: - [Learn how to add your own models](../guides/adding-models.md) - [Understand the architecture](../architecture/overview.md) - [Read the deployment guide](../guides/deployment-guide.md) -- [Check configuration reference](../reference/configuration-reference.md) +- [Check configuration reference](../guides/configuration-reference.md) - [Troubleshooting](../guides/troubleshooting.md) - -### Connection Issues - -Ensure your cluster has proper network policies and that the namespace has access to required resources (MLflow, proxy, etc.). diff --git a/docs/guides/adding-models.md b/docs/guides/adding-models.md index fd79a7e..75b73a7 100644 --- a/docs/guides/adding-models.md +++ b/docs/guides/adding-models.md @@ -36,7 +36,6 @@ class MyModel: app = MyModel.bind() ``` -!!! note The repository's reference model `BinaryClassifier` uses FastAPI ingress + batched inference and expects a **compressed binary payload** (not JSON). For simple JSON models, the examples above are fine; for high-throughput image inference, consider the batching and ingress patterns shown below. ### Key Components @@ -140,7 +139,6 @@ class BatchedModel: return {"prediction": result} ``` -!!! tip For binary/image workloads, you can also batch raw `bytes` like the `BinaryClassifier` does (see `models/binary_classifier.py`). This avoids JSON overhead and lets you control batch sizing via `user_config` by calling `set_max_batch_size()` and `set_batch_wait_timeout_s()`. ### Using FastAPI @@ -246,8 +244,7 @@ spec: - onnxruntime>=1.23.2 ``` -!!! note -In this repo, the production `ray-service.yaml` installs model dependencies under `deployments[*].ray_actor_options.runtime_env.pip` (not only at `applications[*].runtime_env`). This is useful when different deployments need different dependencies. +In this repository, the production `ray-service.yaml` installs model dependencies under `deployments[*].ray_actor_options.runtime_env.pip` (not only at `applications[*].runtime_env`). This is useful when different deployments need different dependencies. 
## GPU Models @@ -321,5 +318,5 @@ kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=100 ## Next Steps - [Deployment guide](deployment-guide.md) -- [Configuration reference](../reference/configuration-reference.md) +- [Configuration reference](configuration-reference.md) - [Architecture overview](../architecture/overview.md) diff --git a/docs/reference/configuration-reference.md b/docs/guides/configuration-reference.md similarity index 98% rename from docs/reference/configuration-reference.md rename to docs/guides/configuration-reference.md index 1e4424c..cd9771d 100644 --- a/docs/reference/configuration-reference.md +++ b/docs/guides/configuration-reference.md @@ -167,5 +167,5 @@ spec: ## Next Steps -- [Deployment guide](../guides/deployment-guide.md) +- [Deployment guide](deployment-guide.md) - [Architecture overview](../architecture/overview.md) diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md index cdfc110..88ee041 100644 --- a/docs/guides/deployment-guide.md +++ b/docs/guides/deployment-guide.md @@ -132,7 +132,7 @@ Watch the deployment progress: kubectl get rayservice rayservice-my-model -n [namespace] -w # Check pods -kubectl get pods -n [namespace] -l ray.io/cluster +kubectl get pods -n [namespace] -l ray.io/cluster=rayservice-my-model # View head node logs kubectl logs -n [namespace] -l ray.io/node-type=head -f @@ -152,21 +152,12 @@ Check service endpoints: kubectl get svc -n [namespace] # Port forward to test -kubectl port-forward -n rationai-notebooks-ns \ +kubectl port-forward -n [namespace] \ svc/rayservice-my-model-serve-svc 8000:8000 ``` -!!! note The example model in this repository (`models/binary_classifier.py`) uses FastAPI ingress and expects a **compressed binary request body** (LZ4), not JSON. The JSON `curl` example below is valid for JSON-based models but does not apply to `BinaryClassifier`. -Test the endpoint: - -```bash -curl -X POST http://localhost:8000/my-model \ - -H "Content-Type: application/json" \ - -d '{"input": [1.0, 2.0, 3.0]}' -``` - ## Production Considerations ### Resource Planning @@ -307,13 +298,6 @@ serveConfigV2: | num_gpus: 1 ``` -Access models: - -```bash -curl http://service:8000/model-a -d '{"input": ...}' -curl http://service:8000/model-b -d '{"input": ...}' -``` - ## Updating Deployments ### Update Model Code @@ -326,17 +310,17 @@ curl http://service:8000/model-b -d '{"input": ...}' ```bash # Edit configuration -vim ray-service.yaml +vim ray-service.yaml # or any IDE # Apply changes -kubectl apply -f ray-service.yaml +kubectl apply -f ray-service.yaml -n [namespace] ``` -Ray will perform rolling update: +KubeRay will reconcile the RayService and attempt a rolling-style update: -- New replicas created with new config -- Traffic gradually shifted -- Old replicas removed +- New replicas are created with the new config +- Traffic is routed to healthy replicas +- Old replicas are eventually removed ### Update Model Weights @@ -351,7 +335,7 @@ user_config: Apply update: ```bash -kubectl apply -f ray-service.yaml +kubectl apply -f ray-service.yaml -n [namespace] ``` ## Rollback @@ -363,18 +347,11 @@ If deployment fails, rollback: # Instead, view KubeRay status and events, then re-apply a known-good spec. 
# Inspect current state and recent events -kubectl get rayservice rayservice-my-model -n rationai-notebooks-ns -o yaml -kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns +kubectl get rayservice rayservice-my-model -n [namespace] -o yaml +kubectl describe rayservice rayservice-my-model -n [namespace] # Check Ray Serve controller logs (usually shows the root cause) -kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=head --tail=200 -``` - -Or manually apply previous configuration: - -```bash -git checkout HEAD~1 ray-service.yaml -kubectl apply -f ray-service.yaml +kubectl logs -n [namespace] -l ray.io/node-type=head --tail=200 ``` ## Troubleshooting @@ -384,7 +361,7 @@ kubectl apply -f ray-service.yaml **Check RayService status:** ```bash -kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns +kubectl describe rayservice rayservice-my-model -n [namespace] ``` **Common issues:** @@ -400,10 +377,10 @@ kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns ```bash # View dashboard -kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 +kubectl port-forward -n [namespace] svc/rayservice-my-model-head-svc 8265:8265 # Check logs -kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=worker --tail=100 +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=100 ``` **Common issues:** @@ -419,7 +396,7 @@ kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=worker --tail=100 ```bash # Ray dashboard: http://localhost:8265 -kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 +kubectl port-forward -n [namespace] svc/rayservice-my-model-head-svc 8265:8265 ``` **Possible solutions:** @@ -441,7 +418,7 @@ kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 ## Next Steps -- [Configuration reference](../reference/configuration-reference.md) +- [Configuration reference](configuration-reference.md) - [Architecture overview](../architecture/overview.md) - [Adding new models](adding-models.md) - [Troubleshooting](troubleshooting.md) diff --git a/docs/index.md b/docs/index.md index 865ea5a..0ef4ee7 100644 --- a/docs/index.md +++ b/docs/index.md @@ -10,7 +10,7 @@ Model Service is a deployment framework that enables: - **Distributed Inference**: Distribute inference workloads across multiple workers and nodes - **Resource Management**: Efficiently manage CPU and GPU resources in Kubernetes - **Model Versioning**: Integration with MLflow for model lifecycle management -- **Production Ready**: Built on Ray Serve with fault tolerance and high availability +- **Production-oriented**: Built on Ray Serve and KubeRay, with autoscaling and failure recovery features ## Key Features diff --git a/mkdocs.yml b/mkdocs.yml index 3a255cd..ca710b3 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,7 +1,7 @@ site_name: Model Service Documentation site_description: Model deployment infrastructure for RationAI using Ray Serve on Kubernetes repo_url: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service -edit_uri: edit/master/docs/ +edit_uri: -/edit/master/docs/ theme: name: material @@ -25,11 +25,10 @@ nav: - Guides: - Deployment Guide: guides/deployment-guide.md - Adding New Models: guides/adding-models.md + - Configuration Reference: guides/configuration-reference.md - Troubleshooting: guides/troubleshooting.md - Architecture: - Overview: architecture/overview.md - - Reference: - - Configuration Reference: reference/configuration-reference.md markdown_extensions: - admonition From 
00603480a45fd0f701abc04f5b27dacb4a660144 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Fri, 9 Jan 2026 17:27:56 +0100 Subject: [PATCH 03/17] feat: add dependency group + CI pipelines --- .gitlab-ci.yml | 9 +++++++++ README.md | 6 +++--- pyproject.toml | 1 + 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index b132c47..67bffdb 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml @@ -4,3 +4,12 @@ include: stages: - lint + - docs + +docs:build: + stage: docs + image: python:3.12 + script: + - python -m pip install --upgrade pip + - python -m pip install mkdocs mkdocs-material pymdown-extensions + - mkdocs build --strict diff --git a/README.md b/README.md index 6636fae..fc1970a 100644 --- a/README.md +++ b/README.md @@ -31,14 +31,14 @@ Full walkthrough: `docs/get-started/quick-start.md`. ### Deploy ```bash -kubectl apply -f ray-service.yaml -n rationai-notebooks-ns -kubectl get rayservice rayservice-models -n rationai-notebooks-ns +kubectl apply -f ray-service.yaml -n [namespace] +kubectl get rayservice rayservice-models -n [namespace] ``` ### Access locally ```bash -kubectl port-forward -n rationai-notebooks-ns svc/rayservice-models-serve-svc 8000:8000 +kubectl port-forward -n [namespace] svc/rayservice-models-serve-svc 8000:8000 ``` ### Test the reference model (`BinaryClassifier`) diff --git a/pyproject.toml b/pyproject.toml index 21a2bfc..7eb9861 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -17,3 +17,4 @@ dependencies = [ [dependency-groups] dev = ["mypy>=1.18.2", "ruff>=0.14.6"] +docs = ["mkdocs>=1.6.0", "mkdocs-material>=9.6.0", "pymdown-extensions>=10.0"] From 3d61e036527d55097e427ffa5953af05c90ca7bc Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Thu, 18 Dec 2025 19:20:06 +0100 Subject: [PATCH 04/17] feat: add 1st docs iteration --- docs/architecture/overview.md | 334 ++++++++++++++++ docs/get-started/quick-start.md | 100 +++++ docs/guides/adding-models.md | 325 ++++++++++++++++ docs/guides/deployment-guide.md | 447 ++++++++++++++++++++++ docs/guides/troubleshooting.md | 159 ++++++++ docs/index.md | 72 ++++ docs/reference/configuration-reference.md | 171 +++++++++ mkdocs.yml | 46 +++ 8 files changed, 1654 insertions(+) create mode 100644 docs/architecture/overview.md create mode 100644 docs/get-started/quick-start.md create mode 100644 docs/guides/adding-models.md create mode 100644 docs/guides/deployment-guide.md create mode 100644 docs/guides/troubleshooting.md create mode 100644 docs/index.md create mode 100644 docs/reference/configuration-reference.md create mode 100644 mkdocs.yml diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md new file mode 100644 index 0000000..48447d2 --- /dev/null +++ b/docs/architecture/overview.md @@ -0,0 +1,334 @@ +# Architecture Overview + +This document provides an overview of Model Service's architecture, components, and design principles. 
+ +## System Architecture + +Model Service is built on a multi-layered architecture: + +``` +┌─────────────────────────────────────────────────────────┐ +│ Client Applications │ +│ (API Consumers, Web Apps, Notebooks) │ +└────────────────────┬────────────────────────────────────┘ + │ HTTP/HTTPS + ▼ +┌─────────────────────────────────────────────────────────┐ +│ Kubernetes Service │ +│ (Load Balancer / Ingress) │ +└────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ Ray Serve │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ +│ │ Model A │ │ Model B │ │ Model C │ │ +│ │ (Replicas) │ │ (Replicas) │ │ (Replicas) │ │ +│ └──────────────┘ └──────────────┘ └──────────────┘ │ +└────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────┐ +│ Ray Cluster │ +│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ +│ │ Head Node │ │ Worker 1 │ │ Worker 2 │ ... │ +│ └────────────┘ └────────────┘ └────────────┘ │ +└─────────────────────────────────────────────────────────┘ +``` + +## Core Components + +### 1. Ray Cluster + +The foundation of Model Service, providing distributed computing infrastructure. + +**Head Node:** + +- Cluster coordination and management +- Dashboard for monitoring +- Scheduling decisions +- Does not run model workloads (no CPU/GPU assigned) + +**Worker Nodes:** + +- Execute model inference workloads +- Can be CPU-only or GPU-enabled + +- Auto-scale based on demand +- Different worker groups for different hardware types + +### 2. Ray Serve + +Application layer for serving ML models as HTTP endpoints. + +**Features:** + +- HTTP request routing +- Load balancing across replicas +- Request batching +- Automatic retry and fault tolerance +- Dynamic model configuration + +### 3. KubeRay Operator + +Kubernetes operator that manages Ray clusters. + +**Responsibilities:** + +- Cluster lifecycle management (create, update, delete) +- Autoscaling worker nodes +- Health monitoring +- Configuration reconciliation + +### 4. Model Implementations + +Your ML models wrapped with Ray Serve decorators. + +**Structure:** + +```python +@serve.deployment +class YourModel: + def __init__(self): + # Model loading + + async def __call__(self, request): + # Inference logic +``` + +## Data Flow + +### Inference Request Flow + +1. **Client Request**: HTTP POST to model endpoint +2. **Service Routing**: Kubernetes service routes to Ray Serve +3. **Load Balancing**: Ray Serve distributes to available replica +4. **Model Processing**: Replica executes inference +5. **Response**: Result returned to client + +``` +Client → K8s Service → Ray Serve Router → Model Replica → Response + ↓ + (Autoscaler) + ↓ + Add/Remove + Replicas +``` + +### Model Loading Flow + +1. **Initialization**: Ray Serve creates model replica +2. **Environment Setup**: Install dependencies from runtime_env +3. **Model Download**: Fetch from MLflow/storage +4. **Loading**: Initialize model in memory +5. 
**Ready**: Replica accepts requests + +## Scaling Architecture + +### Horizontal Scaling (Replicas) + +Models scale horizontally by adding/removing replicas: + +``` +Load: ████████░░ (80%) +Replicas: [R1] [R2] [R3] + +Load: ████████████████ (160%) +Replicas: [R1] [R2] [R3] [R4] [R5] [R6] +``` + +**Autoscaling Triggers:** + +- `target_ongoing_requests`: Target requests per replica +- Scale up when: requests > (replicas × target) +- Scale down when: requests < (replicas × target) + +### Vertical Scaling (Workers) + +Ray cluster scales by adding/removing worker pods: + +```yaml +workerGroupSpecs: + - groupName: cpu-workers + minReplicas: 0 + maxReplicas: 4 +``` + +**Triggers:** + +- Resource pressure (CPU, memory, GPU) +- Idle timeout (scale to zero) +- Manual scaling + +## Resource Management + +### CPU Resources + +```yaml +ray_actor_options: + num_cpus: 6 # CPUs per replica + +containers: + resources: + requests: + cpu: 12 # CPUs per worker pod +``` + +**Calculation:** + +- Worker pod CPUs ≥ (replicas × num_cpus) +- Leave headroom for system processes + +### Memory Resources + +```yaml +ray_actor_options: + memory: 5368709120 # 5 GiB per replica + +containers: + resources: + limits: + memory: 10Gi # Memory per worker pod +``` + +### GPU Resources + +```yaml +ray_actor_options: + num_gpus: 1 # GPUs per replica + +nodeSelector: + nvidia.com/gpu.product: NVIDIA-A40 + +resources: + limits: + nvidia.com/gpu: 1 # GPUs per worker pod +``` + +## High Availability + +### Fault Tolerance + +**Ray Cluster:** + +- Head node failure → Cluster recreated by KubeRay +- Worker failure → Workload rescheduled to other workers +- Network partition → Automatic reconnection + +**Ray Serve:** + +- Replica failure → Requests routed to healthy replicas +- Failed replicas automatically restarted +- Graceful shutdown on updates + +### Zero-Downtime Updates + +Model updates use blue-green deployment: + +1. New version deployed alongside old +2. Traffic gradually shifted to new version +3. Old version removed when no active requests + +```yaml +spec: + serveConfigV2: | + applications: + - name: my-model-v2 # New version + # ... new configuration +``` + +## Security + +### Pod Security + +All pods run with security constraints: + +```yaml +securityContext: + runAsNonRoot: true + runAsUser: 1000 + allowPrivilegeEscalation: false + capabilities: + drop: ["ALL"] + seccompProfile: + type: RuntimeDefault +``` + +### Network Security + +- Internal service communication only +- Ingress controls external access +- Proxy support for external dependencies + +## Monitoring & Observability + +### Ray Dashboard + +Web UI for cluster monitoring: + +- Resource utilization +- Active tasks +- Node status +- Serve deployments + +### Kubernetes Monitoring + +Standard Kubernetes tools: + +```bash +kubectl get pods -n [namespace] +kubectl top pods -n [namespace] +kubectl logs -n [namespace] +kubectl describe rayservice -n [namespace] +``` + +### Metrics + +Ray exports Prometheus metrics: + +- Request latency +- Request throughput +- Replica count +- Resource usage + +## Design Principles + +### 1. Declarative Configuration + +Infrastructure defined in YAML, managed by GitOps: + +```yaml +apiVersion: ray.io/v1 +kind: RayService +# ... configuration +``` + +### 2. Separation of Concerns + +- **Model Code**: Python implementation +- **Infrastructure**: Kubernetes manifests +- **Configuration**: user_config section + +### 3. Elastic Scaling + +- Scale to zero when idle +- Scale up on demand +- Efficient resource utilization + +### 4. 
Fault Tolerance + +- Automatic recovery from failures +- No single point of failure (except data plane) +- Graceful degradation + +### 5. Developer Experience + +- Simple model implementation +- Easy local testing +- Fast iteration cycle + +## Next Steps + +- [Deployment guide](../guides/deployment-guide.md) +- [Configuration reference](../reference/configuration-reference.md) +- [Adding new models](../guides/adding-models.md) diff --git a/docs/get-started/quick-start.md b/docs/get-started/quick-start.md new file mode 100644 index 0000000..f96fceb --- /dev/null +++ b/docs/get-started/quick-start.md @@ -0,0 +1,100 @@ +# Quick Start + +This guide will help you deploy your first model using Model Service in just a few minutes. + +## Prerequisites + +Before you begin, ensure you have: + +- Access to a Kubernetes cluster with KubeRay operator installed +- `kubectl` configured to access your cluster +- Basic familiarity with Kubernetes concepts + +Don't have KubeRay installed? +See the [Installation Guide](https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/kuberay-operator-installation.html) for instructions on setting up KubeRay. + +## Step 1: Clone the Repository + +```bash +git clone https://gitlab.ics.muni.cz/rationai/infrastructure/model-service.git +cd model-service +``` + +## Step 2: Review the Configuration + +The repository includes a sample RayService configuration in `ray-service.yaml`. This deploys a binary classifier model for prostate tissue analysis. + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-models +spec: + serveConfigV2: | + applications: + - name: prostate-classifier-1 + import_path: models.binary_classifier:app + route_prefix: /prostate-classifier-1 + # ... configuration continues +``` + +## Step 3: Deploy the Service + +Apply the RayService configuration to your cluster. + +Replace [namespace] with the desired namespace (e.g., `rationai-notebooks-ns, rationai-jobs-ns` etc.): + +```bash +kubectl apply -f ray-service.yaml -n [namespace] +``` + +## Step 4: Monitor Deployment + +Check the deployment status: + +```bash +# Check RayService status +kubectl get rayservice rayservice-models -n [namespace] + +# Check Ray cluster pods +kubectl get pods -n [namespace] +``` + +If the RayService is not becoming ready, inspect events and status: + +```bash +kubectl describe rayservice rayservice-models -n [namespace] +``` + +## Step 5: Local Access the Service + +Once deployed, you can port-forward the service to access it locally: + +```bash +# Port-forward to access the service locally +kubectl port-forward svc/rayservice-models-serve-svc -n [namespace] 8000:8000 +``` + +## Step 6: Delete the Deployed Model + +To delete the deployed RayService, run: + +```bash +kubectl delete -f ray-service.yaml -n [namespace] +``` + +## Next Steps + +Congratulations! You've successfully deployed your first model with Model Service. + +Now you can: + +- [Learn how to add your own models](../guides/adding-models.md) +- [Understand the architecture](../architecture/overview.md) +- [Read the deployment guide](../guides/deployment-guide.md) +- [Check configuration reference](../reference/configuration-reference.md) +- [Troubleshooting](../guides/troubleshooting.md) + +### Connection Issues + +Ensure your cluster has proper network policies and that the namespace has access to required resources (MLflow, proxy, etc.). 
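As a quick sanity check after port-forwarding (Step 5), you can query the Ray Serve proxy directly. This is a sketch that assumes the default Serve HTTP port and a Ray version exposing the proxy's `/-/routes` and `/-/healthz` endpoints:

```bash
# List route prefixes currently served (should include /prostate-classifier-1)
curl http://localhost:8000/-/routes

# Liveness check of the Serve HTTP proxy
curl http://localhost:8000/-/healthz
```

If these respond but your model calls fail, the problem is usually the request payload or the model itself rather than networking.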
diff --git a/docs/guides/adding-models.md b/docs/guides/adding-models.md new file mode 100644 index 0000000..fd79a7e --- /dev/null +++ b/docs/guides/adding-models.md @@ -0,0 +1,325 @@ +# Adding New Models + +This guide explains how to integrate your own machine learning models into Model Service. + +## Overview + +To add a new model, you need to: + +1. Create a model class with Ray Serve decorators +2. Implement the inference logic +3. Configure the RayService YAML +4. Deploy and test + +## Model Implementation + +### Basic Structure + +Create a Python file in the `models/` directory: + +```python +from ray import serve +from starlette.requests import Request + +@serve.deployment(ray_actor_options={"num_cpus": 2}) +class MyModel: + def __init__(self): + # Load your model here + pass + + async def __call__(self, request: Request): + # Handle inference requests + data = await request.json() + # Process data and return prediction + return {"prediction": result} + +app = MyModel.bind() +``` + +!!! note +The repository's reference model `BinaryClassifier` uses FastAPI ingress + batched inference and expects a **compressed binary payload** (not JSON). For simple JSON models, the examples above are fine; for high-throughput image inference, consider the batching and ingress patterns shown below. + +### Key Components + +#### 1. Deployment Decorator + +The `@serve.deployment` decorator marks your class as a Ray Serve deployment: + +```python +@serve.deployment( + ray_actor_options={ + "num_cpus": 2, # CPUs per replica + "num_gpus": 0, # GPUs per replica + "memory": 2 * 1024**3, # Memory in bytes + } +) +class MyModel: + ... +``` + +#### 2. Initialization + +Load your model in `__init__`: + +```python +def __init__(self): + import torch + + self.model = torch.load("model.pt") + self.model.eval() + print("Model loaded successfully") +``` + +#### 3. Inference Method + +Implement `__call__` or other methods for handling requests: + +```python +async def __call__(self, request: Request): + data = await request.json() + input_data = self.preprocess(data["input"]) + + with torch.no_grad(): + output = self.model(input_data) + + return {"prediction": self.postprocess(output)} +``` + +## Advanced Features + +### Dynamic Configuration + +Use `reconfigure()` to update model settings without redeployment: + +```python +from typing import TypedDict + +class Config(TypedDict): + threshold: float + batch_size: int + +@serve.deployment +class ConfigurableModel: + def __init__(self): + self.model = load_model() + + async def reconfigure(self, config: Config): + self.threshold = config["threshold"] + self.batch_size = config["batch_size"] + print(f"Reconfigured: threshold={self.threshold}") +``` + +Update config via RayService YAML: + +```yaml +user_config: + threshold: 0.5 + batch_size: 32 +``` + +### Batching Requests + +Use `@serve.batch` for efficient batch processing: + +```python +@serve.deployment +class BatchedModel: + def __init__(self): + self.model = load_model() + + @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1) + async def predict_batch(self, inputs: list[np.ndarray]): + batch = np.stack(inputs) + outputs = self.model(batch) + return outputs.tolist() + + async def __call__(self, request: Request): + data = await request.json() + input_data = np.array(data["input"]) + result = await self.predict_batch(input_data) + return {"prediction": result} +``` + +!!! tip +For binary/image workloads, you can also batch raw `bytes` like the `BinaryClassifier` does (see `models/binary_classifier.py`). 
This avoids JSON overhead and lets you control batch sizing via `user_config` by calling `set_max_batch_size()` and `set_batch_wait_timeout_s()`. + +### Using FastAPI + +For advanced HTTP features, use FastAPI: + +```python +from fastapi import FastAPI, HTTPException +from pydantic import BaseModel + +class PredictionRequest(BaseModel): + input: list[float] + +class PredictionResponse(BaseModel): + prediction: float + confidence: float + +fastapi = FastAPI() + +@serve.deployment +@serve.ingress(fastapi) +class FastAPIModel: + def __init__(self): + self.model = load_model() + + @fastapi.post("/predict", response_model=PredictionResponse) + async def predict(self, request: PredictionRequest): + output = self.model(request.input) + return PredictionResponse( + prediction=float(output), + confidence=0.95 + ) + +app = FastAPIModel.bind() +``` + +## Loading Models from MLflow + +Use the model provider to load from MLflow: + +```python +# models/mlflow_model.py +from providers.model_provider import mlflow + +@serve.deployment +class MLflowModel: + def __init__(self): + # This will be set via user_config + self.model_path = None + + async def reconfigure(self, config): + model_uri = config["model"]["artifact_uri"] + self.model_path = mlflow(artifact_uri=model_uri) + + # Load model + import onnxruntime as ort + self.session = ort.InferenceSession(self.model_path) + + async def __call__(self, request: Request): + # Inference logic + ... + +app = MLflowModel.bind() +``` + +Configure in YAML: + +```yaml +runtime_env: + env_vars: + MLFLOW_TRACKING_URI: http://mlflow.rationai-mlflow:5000 +user_config: + model: + artifact_uri: mlflow-artifacts:/65/abc123.../model.onnx +``` + +## RayService Configuration + +Add your model to `ray-service.yaml`: + +```yaml +spec: + serveConfigV2: | + applications: + - name: my-model + import_path: models.my_onnx_model:app + route_prefix: /my-model + runtime_env: + working_dir: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/archive/master/model-service-master.zip + pip: + - onnxruntime>=1.23.2 + - numpy + deployments: + - name: MyONNXModel + autoscaling_config: + min_replicas: 1 + max_replicas: 4 + ray_actor_options: + num_cpus: 2 + memory: 4294967296 # 4 GiB + runtime_env: + pip: + - onnxruntime>=1.23.2 +``` + +!!! note +In this repo, the production `ray-service.yaml` installs model dependencies under `deployments[*].ray_actor_options.runtime_env.pip` (not only at `applications[*].runtime_env`). This is useful when different deployments need different dependencies. 
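Tying this back to the batching tip above, the following is a minimal sketch (class, helper, and config key names are illustrative, not the actual `BinaryClassifier` code) of a deployment that batches raw `bytes` and exposes its batch parameters through `user_config`:

```python
from ray import serve
from starlette.requests import Request


@serve.deployment
class BytesBatchedModel:
    def __init__(self):
        # load_model() is an assumed helper, as in the earlier examples
        self.model = load_model()

    async def reconfigure(self, config: dict):
        # Values come from user_config in ray-service.yaml and can be
        # changed without redeploying the whole RayService.
        self.handle_batch.set_max_batch_size(config.get("max_batch_size", 32))
        self.handle_batch.set_batch_wait_timeout_s(config.get("batch_wait_timeout_s", 0.1))

    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.1)
    async def handle_batch(self, payloads: list[bytes]):
        # Each element is one request body; a real implementation would
        # decode these and run a single batched forward pass.
        return [self.model.predict(p) for p in payloads]

    async def __call__(self, request: Request):
        raw = await request.body()  # raw (possibly compressed) bytes, not JSON
        return await self.handle_batch(raw)


app = BytesBatchedModel.bind()
```

The matching `user_config` block in `ray-service.yaml` would then carry the `max_batch_size` and `batch_wait_timeout_s` keys read in `reconfigure()`.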
+ +## GPU Models + +For GPU-accelerated models: + +```python +@serve.deployment(ray_actor_options={"num_gpus": 1}) +class GPUModel: + def __init__(self): + import torch + + self.device = torch.device("cuda") + self.model = torch.load("model.pt").to(self.device) + self.model.eval() + + async def __call__(self, request: Request): + data = await request.json() + input_tensor = torch.tensor(data["input"]).to(self.device) + + with torch.no_grad(): + output = self.model(input_tensor) + + return {"prediction": output.cpu().numpy().tolist()} +``` + +Configure GPU worker group: + +```yaml +workerGroupSpecs: + - groupName: gpu-workers + replicas: 0 + minReplicas: 0 + maxReplicas: 2 + template: + spec: + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A40 + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312-gpu + resources: + limits: + nvidia.com/gpu: 1 +``` + +## Deployment + +Deploy your model: + +```bash +kubectl apply -f ray-service.yaml -n [namespace] +``` + +Monitor deployment: + +```bash +kubectl get rayservice -n [namespace] +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=100 +``` + +## Best Practices + +1. **Error Handling**: Always wrap inference in try-except blocks +2. **Logging**: Use `print()` or `logging` for debugging (viewable in pod logs) +3. **Resource Limits**: Set appropriate CPU/memory/GPU limits +4. **Model Loading**: Cache models to avoid reloading on each request +5. **Input Validation**: Validate input data format and ranges +6. **Batching**: Use batching for throughput-intensive workloads +7. **Health Checks**: Implement health check endpoints for monitoring + +## Next Steps + +- [Deployment guide](deployment-guide.md) +- [Configuration reference](../reference/configuration-reference.md) +- [Architecture overview](../architecture/overview.md) diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md new file mode 100644 index 0000000..cdfc110 --- /dev/null +++ b/docs/guides/deployment-guide.md @@ -0,0 +1,447 @@ +# Deployment Guide + +Complete guide for deploying models to production with Model Service. + +## Prerequisites + +Before deploying to production, ensure: + +- [x] KubeRay operator installed +- [x] Namespace created (`rationai-notebooks-ns`) +- [x] Model tested locally +- [x] RayService YAML configured +- [x] MLflow accessible (if using MLflow) + +## Deployment Workflow + +### 1. Prepare Model Code + +Ensure your model is in the `models/` directory and properly structured: + +```python +# models/my_model.py +from ray import serve +from starlette.requests import Request + +@serve.deployment(ray_actor_options={"num_cpus": 2}) +class MyModel: + def __init__(self): + # Model initialization + self.model = self.load_model() + + def load_model(self): + # Load model logic + pass + + async def __call__(self, request: Request): + # Inference logic + data = await request.json() + result = self.model.predict(data["input"]) + return {"prediction": result} + +app = MyModel.bind() +``` + +### 2. 
Create RayService Configuration + +Create or modify `ray-service.yaml`: + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-my-model + namespace: rationai-notebooks-ns +spec: + serveConfigV2: | + applications: + - name: my-model + import_path: models.my_model:app + route_prefix: /my-model + runtime_env: + working_dir: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/archive/master/model-service-master.zip + pip: + - numpy + - pandas + env_vars: + MODEL_VERSION: "1.0.0" + deployments: + - name: MyModel + autoscaling_config: + min_replicas: 1 + max_replicas: 5 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 4 + memory: 4294967296 # 4 GiB + runtime_env: + pip: + - numpy + - pandas + + rayClusterConfig: + rayVersion: 2.52.1 + enableInTreeAutoscaling: true + autoscalerOptions: + idleTimeoutSeconds: 300 + + headGroupSpec: + rayStartParams: + num-cpus: "0" + dashboard-host: "0.0.0.0" + template: + spec: + containers: + - name: ray-head + image: rayproject/ray:2.52.1-py312 + resources: + limits: + cpu: 2 + memory: 4Gi + + workerGroupSpecs: + - groupName: cpu-workers + replicas: 1 + minReplicas: 1 + maxReplicas: 10 + template: + spec: + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312 + resources: + limits: + cpu: 8 + memory: 16Gi +``` + +### 3. Deploy to Kubernetes + +Apply the configuration: + +```bash +kubectl apply -f ray-service.yaml -n [namespace] +``` + +### 4. Monitor Deployment + +Watch the deployment progress: + +```bash +# Watch RayService status +kubectl get rayservice rayservice-my-model -n [namespace] -w + +# Check pods +kubectl get pods -n [namespace] -l ray.io/cluster + +# View head node logs +kubectl logs -n [namespace] -l ray.io/node-type=head -f + +# View worker logs +kubectl logs -n [namespace] -l ray.io/node-type=worker -f +``` + +Wait for status to show `Running` and application status to show `RUNNING`. + +### 5. Verify Deployment + +Check service endpoints: + +```bash +# Get service details +kubectl get svc -n [namespace] + +# Port forward to test +kubectl port-forward -n rationai-notebooks-ns \ + svc/rayservice-my-model-serve-svc 8000:8000 +``` + +!!! note +The example model in this repository (`models/binary_classifier.py`) uses FastAPI ingress and expects a **compressed binary request body** (LZ4), not JSON. The JSON `curl` example below is valid for JSON-based models but does not apply to `BinaryClassifier`. + +Test the endpoint: + +```bash +curl -X POST http://localhost:8000/my-model \ + -H "Content-Type: application/json" \ + -d '{"input": [1.0, 2.0, 3.0]}' +``` + +## Production Considerations + +### Resource Planning + +**Calculate resource requirements:** + +1. **Per-replica resources:** + + - CPU: Based on model complexity + - Memory: Model size + working memory + overhead + - GPU: Number of GPUs needed + +2. **Total cluster resources:** + + ``` + Total CPUs = max_replicas × num_cpus + overhead + Total Memory = max_replicas × memory + overhead + ``` + +3. 
**Example calculation:** + + ``` + Model: 4 CPU, 4GB per replica + Max replicas: 5 + + Required per worker: 5 × 4 = 20 CPUs, 5 × 4GB = 20GB + Overhead: +2 CPUs, +4GB for system + + Worker resources: 22 CPUs, 24GB memory + ``` + +### Autoscaling Configuration + +**Choose appropriate scaling parameters:** + +```yaml +autoscaling_config: + min_replicas: 1 # Always keep 1 running + max_replicas: 10 # Scale up to 10 + target_ongoing_requests: 20 # Target load per replica + + # Advanced options + upscale_delay_s: 30 # Wait 30s before scaling up + downscale_delay_s: 600 # Wait 10m before scaling down +``` + +**Scaling behavior:** + +- **Cold start**: Set `min_replicas: 0` for scale-to-zero +- **Always available**: Set `min_replicas: 1` or higher +- **High traffic**: Increase `max_replicas` and `target_ongoing_requests` +- **Batch processing**: Use higher `target_ongoing_requests` + +### High Availability + +**For production workloads:** + +```yaml +# Multiple replicas +autoscaling_config: + min_replicas: 2 # At least 2 for redundancy + +# Multiple workers +workerGroupSpecs: + - groupName: cpu-workers + minReplicas: 2 + maxReplicas: 10 +``` + +### Resource Limits + +**Always set resource limits:** + +```yaml +containers: + - name: ray-worker + resources: + requests: # Guaranteed resources + cpu: 8 + memory: 16Gi + limits: # Maximum resources + cpu: 12 + memory: 20Gi +``` + +### Network Configuration + +**Proxy settings:** + +```yaml +env: + - name: HTTP_PROXY + value: "http://proxy.example.com:3128" + - name: HTTPS_PROXY + value: "http://proxy.example.com:3128" + - name: NO_PROXY + value: ".svc.cluster.local,.cluster.local" +``` + +**Service configuration:** + +```yaml +# If you need external access +apiVersion: v1 +kind: Service +metadata: + name: rayservice-external +spec: + type: LoadBalancer + selector: + ray.io/cluster: rayservice-my-model + ports: + - port: 80 + targetPort: 8000 +``` + +## Multi-Model Deployment + +Deploy multiple models in one RayService: + +```yaml +serveConfigV2: | + applications: + - name: model-a + import_path: models.model_a:app + route_prefix: /model-a + deployments: + - name: ModelA + ray_actor_options: + num_cpus: 4 + + - name: model-b + import_path: models.model_b:app + route_prefix: /model-b + deployments: + - name: ModelB + ray_actor_options: + num_gpus: 1 +``` + +Access models: + +```bash +curl http://service:8000/model-a -d '{"input": ...}' +curl http://service:8000/model-b -d '{"input": ...}' +``` + +## Updating Deployments + +### Update Model Code + +1. Update code in repository +2. Commit and push changes +3. RayService will automatically fetch new code from `working_dir` URL + +### Update Configuration + +```bash +# Edit configuration +vim ray-service.yaml + +# Apply changes +kubectl apply -f ray-service.yaml +``` + +Ray will perform rolling update: + +- New replicas created with new config +- Traffic gradually shifted +- Old replicas removed + +### Update Model Weights + +If using MLflow: + +```yaml +user_config: + model: + artifact_uri: mlflow-artifacts:/65/NEW_RUN_ID/model.onnx +``` + +Apply update: + +```bash +kubectl apply -f ray-service.yaml +``` + +## Rollback + +If deployment fails, rollback: + +```bash +# RayService is a Custom Resource (CRD), so Kubernetes "rollout" doesn't apply. +# Instead, view KubeRay status and events, then re-apply a known-good spec. 
+ +# Inspect current state and recent events +kubectl get rayservice rayservice-my-model -n rationai-notebooks-ns -o yaml +kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns + +# Check Ray Serve controller logs (usually shows the root cause) +kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=head --tail=200 +``` + +Or manually apply previous configuration: + +```bash +git checkout HEAD~1 ray-service.yaml +kubectl apply -f ray-service.yaml +``` + +## Troubleshooting + +### Deployment Stuck + +**Check RayService status:** + +```bash +kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns +``` + +**Common issues:** + +- Image pull errors +- Insufficient resources +- Configuration errors +- Network issues + +### Application Not Starting + +**Check serve application logs:** + +```bash +# View dashboard +kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 + +# Check logs +kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=worker --tail=100 +``` + +**Common issues:** + +- Python import errors +- Model loading failures +- Dependency issues +- Resource limits + +### High Latency + +**Check metrics:** + +```bash +# Ray dashboard: http://localhost:8265 +kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 +``` + +**Possible solutions:** + +- Increase replicas +- Enable batching +- Optimize model code +- Increase resources + +## Best Practices + +1. **Version Control**: Keep all YAML configs in Git +2. **Testing**: Test locally before deploying +3. **Monitoring**: Set up alerts for failures +4. **Resource Limits**: Always set limits to prevent resource hogging +5. **Gradual Rollout**: Update replicas gradually +6. **Documentation**: Document custom configurations +7. **Backup**: Keep backup of working configurations + +## Next Steps + +- [Configuration reference](../reference/configuration-reference.md) +- [Architecture overview](../architecture/overview.md) +- [Adding new models](adding-models.md) +- [Troubleshooting](troubleshooting.md) diff --git a/docs/guides/troubleshooting.md b/docs/guides/troubleshooting.md new file mode 100644 index 0000000..39046b0 --- /dev/null +++ b/docs/guides/troubleshooting.md @@ -0,0 +1,159 @@ +# Troubleshooting + +This page lists the most common issues when deploying and running models in Model Service (Ray Serve on KubeRay). + +## Quick Triage Checklist + +Start here before digging deeper: + +```bash +kubectl get rayservice -n [namespace] +kubectl describe rayservice rayservice-models -n [namespace] +kubectl get pods -n [namespace] +``` + +Then inspect logs: + +```bash +kubectl logs -n [namespace] -l ray.io/node-type=head --tail=200 +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=200 +``` + +## RayService Shows `DEPLOY_FAILED` + +### What it usually means + +Ray Serve could not start the application or deployment. The root cause is typically visible in the Ray Serve controller logs. + +### What to do + +1. Describe the RayService for events: + +```bash +kubectl describe rayservice rayservice-models -n [namespace] +``` + +2. Open the Ray dashboard (helps with Serve deployment errors): + +```bash +kubectl port-forward -n [namespace] svc/rayservice-models-head-svc 8265:8265 +``` + +Visit `http://localhost:8265`. + +3. Look for Python import errors / missing dependencies: + +```bash +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=500 +``` + +## ImportError / ModuleNotFoundError + +### Symptoms + +- Serve deployment fails immediately. 
+- Logs show `ModuleNotFoundError: No module named ...`. + +### Causes + +- Dependency not installed in the runtime environment. +- Wrong `import_path`. +- `working_dir` does not contain the expected code. + +### Fix + +- Ensure `import_path` matches your file: + - Example: `models.binary_classifier:app` means there is `models/binary_classifier.py` defining `app = ...`. +- Add missing dependencies to `runtime_env.pip`. + +In this repository, dependencies are typically installed per deployment: + +```yaml +deployments: + - name: BinaryClassifier + ray_actor_options: + runtime_env: + pip: ["onnxruntime>=1.23.2", "mlflow<3.0", "lz4>=4.4.5"] +``` + +## Autoscaling Not Working (Replicas Don’t Change) + +### Serve replicas not scaling + +Check your deployment has autoscaling configured: + +```yaml +autoscaling_config: + min_replicas: 0 + max_replicas: 4 + target_ongoing_requests: 32 +``` + +Also note: + +- Scale up/down is not instantaneous (delays and smoothing apply). +- If traffic is low, you may stay at `min_replicas`. + +### Worker pods not scaling + +Worker pod scaling requires cluster autoscaling enabled: + +```yaml +rayClusterConfig: + enableInTreeAutoscaling: true + autoscalerOptions: + idleTimeoutSeconds: 60 +``` + +Also ensure `workerGroupSpecs[*].minReplicas/maxReplicas` allow scaling. + +## Not Enough CPU / Memory (Pods Pending) + +### Symptoms + +- Pods stay in `Pending`. +- Events mention `Insufficient cpu` or `Insufficient memory`. + +### Fix + +- Reduce per-replica requirements (`ray_actor_options.num_cpus`, `memory`). +- Increase cluster capacity or adjust worker pod resources. + +Inspect pod scheduling events: + +```bash +kubectl describe pod -n [namespace] +``` + +## MLflow / Artifact Download Problems + +### Symptoms + +- `mlflow.artifacts.download_artifacts` fails. +- Timeouts during replica initialization. + +### Fix + +- Ensure `MLFLOW_TRACKING_URI` is set and reachable from the cluster. +- Ensure the cluster has network access (proxy settings if needed). +- Verify the `artifact_uri` exists and permissions are correct. + +In `ray-service.yaml` this is typically configured via `env_vars`: + +```yaml +ray_actor_options: + runtime_env: + env_vars: + MLFLOW_TRACKING_URI: http://mlflow.rationai-mlflow:5000 +``` + +## Helpful Commands + +```bash +# list Serve and RayService resources +kubectl get rayservice -n [namespace] +kubectl get svc -n [namespace] + +# see all pods for a RayService +kubectl get pods -n [namespace] -l ray.io/cluster=rayservice-models +``` diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..865ea5a --- /dev/null +++ b/docs/index.md @@ -0,0 +1,72 @@ +# Model Service Documentation + +Welcome to the Model Service documentation. This service provides a scalable, production-ready infrastructure for deploying machine learning models for the RationAI project using Ray Serve on Kubernetes. + +## What is Model Service? 
+ +Model Service is a deployment framework that enables: + +- **Scalable Model Serving**: Automatically scale model replicas based on request load +- **Distributed Inference**: Distribute inference workloads across multiple workers and nodes +- **Resource Management**: Efficiently manage CPU and GPU resources in Kubernetes +- **Model Versioning**: Integration with MLflow for model lifecycle management +- **Production Ready**: Built on Ray Serve with fault tolerance and high availability + +## Key Features + +### Auto-Scaling + +Model Service automatically adjusts the number of model replicas based on incoming request volume, ensuring optimal resource utilization and response times. + +### Multi-Model Deployment + +Deploy multiple models simultaneously with isolated resource allocations and independent scaling policies. + +### GPU/CPU Support + +Flexible resource allocation supporting both CPU-based and GPU-accelerated models with hardware-specific worker groups. + +### Kubernetes Native + +Leverages KubeRay operator for seamless integration with Kubernetes, enabling declarative configuration and GitOps workflows. + +## Why Ray Serve? + +Model Service is built on top of Ray Serve because it combines a simple developer experience with strong production capabilities: + +- **Unified batch and online inference**: The same Ray cluster can handle real-time HTTP requests and large batch jobs, which matches RationAI's mix of interactive and offline pathology workloads. +- **Python‑native API**: Models are implemented as regular Python classes or functions with decorators, making it easy for researchers to contribute without learning a heavy framework. +- **Autoscaling built in**: Ray Serve natively scales replicas based on request pressure and integrates with Ray's cluster autoscaler to add/remove worker pods. +- **Multi‑model support**: Multiple independent applications and deployments can run side‑by‑side on one cluster while isolating resources per model. + +Alternative approaches (plain Kubernetes deployments, custom Flask/FastAPI services, or specialized serving stacks like TorchServe or TF Serving) either lack first‑class autoscaling orchestration across many models, or are tightly coupled to specific ML frameworks. Ray Serve, together with KubeRay, lets us: + +- Express all infrastructure declaratively in a single `RayService` resource. +- Share the same cluster across heterogeneous models and hardware (CPU/GPU). +- Keep the operational surface smaller by relying on one general‑purpose serving layer instead of many ad‑hoc microservices. + +## Use Cases + +Model Service is designed for: + +- **Pathology Image Analysis**: Deploy models for tissue classification, nuclei detection, and other pathology tasks +- **Batch Processing**: Handle large-scale inference workloads efficiently +- **Real-time Inference**: Serve predictions with low latency for interactive applications +- **Research Experiments**: Quickly deploy and test new model versions + +## Getting Help + +- **Documentation**: Browse the guides and reference materials in this documentation +- **Issues**: Report bugs or request features via [GitLab Issues](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/issues) +- **Contact**: Reach out to the RationAI team at Masaryk University + +## Next Steps + +Ready to get started? Follow our [Quick Start Guide](get-started/quick-start.md) to deploy your first model. 
+ +## Glossary + +- **RayService**: KubeRay custom resource that manages a Ray cluster plus a Ray Serve application, including updates. +- **Deployment (Ray Serve)**: A scalable unit (replicas) that runs your model code. +- **Replica**: One running instance of a deployment. +- **Worker group (KubeRay)**: A set of Ray worker pods (e.g., CPU or GPU workers) with independent scaling bounds. diff --git a/docs/reference/configuration-reference.md b/docs/reference/configuration-reference.md new file mode 100644 index 0000000..1e4424c --- /dev/null +++ b/docs/reference/configuration-reference.md @@ -0,0 +1,171 @@ +# Configuration Reference + +This page summarizes the **most important knobs** you will touch when configuring Model Service. For full API details, see the upstream Ray Serve and KubeRay documentation. + +## 1. RayService Skeleton + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: + namespace: [namespace] +spec: + serveConfigV2: | + # Ray Serve applications + rayClusterConfig: + # Ray cluster (head + workers) +``` + +Think of it as two parts: + +- **`serveConfigV2`**: what you serve (apps, deployments, autoscaling). +- **`rayClusterConfig`**: where it runs (Ray version, worker groups, resources). + +## 2. Applications and Deployments + +### Applications (HTTP endpoints) + +```yaml +serveConfigV2: | + applications: + - name: prostate-classifier + import_path: models.binary_classifier:app + route_prefix: /prostate-classifier + runtime_env: + working_dir: https://.../model-service-master.zip + pip: + - onnxruntime>=1.23.2 +``` + +- `name`: logical app name (used in Ray dashboard/logs). +- `import_path`: Python entrypoint (`module.path:variable`). +- `route_prefix`: HTTP path under the Serve gateway. +- `runtime_env`: code location + extra Python deps. + +### Deployments (scaling + resources) + +```yaml +deployments: + - name: BinaryClassifier + max_ongoing_requests: 64 + max_queued_requests: 128 + autoscaling_config: + min_replicas: 0 + max_replicas: 4 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 6 + memory: 5368709120 # 5 GiB + user_config: + tile_size: 512 + threshold: 0.5 +``` + +- `autoscaling_config`: how many replicas and when to scale. +- `ray_actor_options`: per‑replica CPU/GPU/memory. +- `user_config`: free‑form dict passed to `reconfigure()` in your model. + +## 3. Ray Cluster (Workers and Autoscaling) + +```yaml +rayClusterConfig: + rayVersion: "2.52.1" + enableInTreeAutoscaling: true + headGroupSpec: + rayStartParams: + num-cpus: "0" # head only coordinates + template: + spec: + containers: + - name: ray-head + image: rayproject/ray:2.52.1-py312 + workerGroupSpecs: + - groupName: cpu-workers + replicas: 1 + minReplicas: 1 + maxReplicas: 10 + template: + spec: + containers: + - name: ray-worker + image: rayproject/ray:2.52.1-py312 + resources: + requests: + cpu: "4" + memory: "8Gi" + limits: + cpu: "8" + memory: "16Gi" +``` + +Focus on: + +- `rayVersion`: must match images you use. +- `workerGroupSpecs[*].{replicas,minReplicas,maxReplicas}`: cluster‑level scaling bounds. +- `resources.requests/limits`: how big each worker pod is. + +## 4. 
Security and Placement (Optional but Recommended) + +```yaml +template: + spec: + securityContext: + runAsNonRoot: true + fsGroupChangePolicy: OnRootMismatch + seccompProfile: + type: RuntimeDefault + nodeSelector: + nvidia.com/gpu.product: NVIDIA-A40 + containers: + - name: ray-worker + securityContext: + allowPrivilegeEscalation: false + runAsUser: 1000 + capabilities: + drop: ["ALL"] +``` + +Use these to: + +- Enforce non‑root containers and least privilege. +- Pin GPU workloads to specific node types. + +## 5. Putting It Together (Small Example) + +```yaml +apiVersion: ray.io/v1 +kind: RayService +metadata: + name: rayservice-example + namespace: rationai-notebooks-ns +spec: + serveConfigV2: | + applications: + - name: my-classifier + import_path: models.classifier:app + route_prefix: /classify + deployments: + - name: Classifier + autoscaling_config: + min_replicas: 1 + max_replicas: 5 + target_ongoing_requests: 32 + ray_actor_options: + num_cpus: 4 + rayClusterConfig: + rayVersion: "2.52.1" + enableInTreeAutoscaling: true + headGroupSpec: + rayStartParams: + num-cpus: "0" + workerGroupSpecs: + - groupName: cpu-workers + minReplicas: 1 + maxReplicas: 5 +``` + +## Next Steps + +- [Deployment guide](../guides/deployment-guide.md) +- [Architecture overview](../architecture/overview.md) diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..3a255cd --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,46 @@ +site_name: Model Service Documentation +site_description: Model deployment infrastructure for RationAI using Ray Serve on Kubernetes +repo_url: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service +edit_uri: edit/master/docs/ + +theme: + name: material + palette: + primary: indigo + accent: indigo + features: + - navigation.tabs + - navigation.sections + - toc.integrate + - search.suggest + - search.highlight + - content.code.copy + +nav: + - Home: index.md + - Get Started: + - Quick Start: get-started/quick-start.md + - Installation: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/kuberay-operator-installation.html + - First Deployment: https://docs.ray.io/en/latest/serve/production-guide/kubernetes.html + - Guides: + - Deployment Guide: guides/deployment-guide.md + - Adding New Models: guides/adding-models.md + - Troubleshooting: guides/troubleshooting.md + - Architecture: + - Overview: architecture/overview.md + - Reference: + - Configuration Reference: reference/configuration-reference.md + +markdown_extensions: + - admonition + - pymdownx.details + - pymdownx.superfences + - pymdownx.highlight: + anchor_linenums: true + - pymdownx.inlinehilite + - pymdownx.snippets + - pymdownx.tabbed: + alternate_style: true + - tables + - toc: + permalink: true \ No newline at end of file From 53e187305dff9614093c50492777a0a2bf2635f2 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Tue, 6 Jan 2026 13:58:38 +0100 Subject: [PATCH 05/17] feat: 2nd docs iteration --- README.md | 132 ++++++++++-------- docs/architecture/overview.md | 25 ++-- docs/get-started/quick-start.md | 12 +- docs/guides/adding-models.md | 7 +- .../configuration-reference.md | 2 +- docs/guides/deployment-guide.md | 57 +++----- docs/index.md | 2 +- mkdocs.yml | 5 +- 8 files changed, 113 insertions(+), 129 deletions(-) rename docs/{reference => guides}/configuration-reference.md (98%) diff --git a/README.md b/README.md index f6ea4ce..6636fae 100644 --- a/README.md +++ b/README.md @@ -1,93 +1,107 @@ # Model Service +Model deployment infrastructure for RationAI using Ray Serve on 
Kubernetes. +This repository contains: -## Getting started +- A KubeRay `RayService` manifest (`ray-service.yaml`) for deploying Ray Serve on Kubernetes. +- Model implementations under `models/` (reference: `models/binary_classifier.py`). +- Documentation under `docs/` (MkDocs). -To make it easy for you to get started with GitLab, here's a list of recommended next steps. +## Documentation -Already a pro? Just edit this README.md and make it your own. Want to make it easy? [Use the template at the bottom](#editing-this-readme)! +- MkDocs content: `docs/` +- Key pages: + - `docs/get-started/quick-start.md` + - `docs/guides/deployment-guide.md` + - `docs/guides/adding-models.md` + - `docs/guides/configuration-reference.md` + - `docs/guides/troubleshooting.md` + - `docs/architecture/overview.md` -## Add your files +## Quick Start (Kubernetes) -- [ ] [Create](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#create-a-file) or [upload](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#upload-a-file) files -- [ ] [Add files using the command line](https://docs.gitlab.com/topics/git/add_files/#add-files-to-a-git-repository) or push an existing Git repository with the following command: +Full walkthrough: `docs/get-started/quick-start.md`. -``` -cd existing_repo -git remote add origin https://gitlab.ics.muni.cz/rationai/infrastructure/model-service2.git -git branch -M master -git push -uf origin master -``` +### Prerequisites -## Integrate with your tools +- Kubernetes cluster with KubeRay operator installed +- `kubectl` configured for the cluster -- [ ] [Set up project integrations](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service2/-/settings/integrations) +### Deploy -## Collaborate with your team +```bash +kubectl apply -f ray-service.yaml -n rationai-notebooks-ns +kubectl get rayservice rayservice-models -n rationai-notebooks-ns +``` -- [ ] [Invite team members and collaborators](https://docs.gitlab.com/ee/user/project/members/) -- [ ] [Create a new merge request](https://docs.gitlab.com/ee/user/project/merge_requests/creating_merge_requests.html) -- [ ] [Automatically close issues from merge requests](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#closing-issues-automatically) -- [ ] [Enable merge request approvals](https://docs.gitlab.com/ee/user/project/merge_requests/approvals/) -- [ ] [Set auto-merge](https://docs.gitlab.com/user/project/merge_requests/auto_merge/) +### Access locally -## Test and Deploy +```bash +kubectl port-forward -n rationai-notebooks-ns svc/rayservice-models-serve-svc 8000:8000 +``` -Use the built-in continuous integration in GitLab. 
+### Test the reference model (`BinaryClassifier`) -- [ ] [Get started with GitLab CI/CD](https://docs.gitlab.com/ee/ci/quick_start/) -- [ ] [Analyze your code for known vulnerabilities with Static Application Security Testing (SAST)](https://docs.gitlab.com/ee/user/application_security/sast/) -- [ ] [Deploy to Kubernetes, Amazon EC2, or Amazon ECS using Auto Deploy](https://docs.gitlab.com/ee/topics/autodevops/requirements.html) -- [ ] [Use pull-based deployments for improved Kubernetes management](https://docs.gitlab.com/ee/user/clusters/agent/) -- [ ] [Set up protected environments](https://docs.gitlab.com/ee/ci/environments/protected_environments.html) +The reference deployment in `ray-service.yaml` exposes an app at the route prefix: -*** +- `/prostate-classifier-1` -# Editing this README +`models/binary_classifier.py` expects a **request body that is LZ4-compressed raw bytes** of a single RGB tile: -When you're ready to make this README your own, just edit this file and use the handy template below (or feel free to structure it however you want - this is just a starting point!). Thanks to [makeareadme.com](https://www.makeareadme.com/) for this template. +- dtype: `uint8` +- shape: `(tile_size, tile_size, 3)` +- byte order: row-major (NumPy default) -## Suggestions for a good README +Example (Python): -Every project is different, so consider which of these sections apply to yours. The sections used in the template are suggestions for most open source projects. Also keep in mind that while a README can be too long and detailed, too long is better than too short. If you think your README is too long, consider utilizing another form of documentation rather than cutting out information. +```bash +pip install numpy lz4 requests +``` -## Name -Choose a self-explaining name for your project. +```python +import lz4.frame +import numpy as np +import requests -## Description -Let people know what your project can do specifically. Provide context and add a link to any reference visitors might be unfamiliar with. A list of Features or a Background subsection can also be added here. If there are alternatives to your project, this is a good place to list differentiating factors. +tile_size = 512 # must match RayService user_config.tile_size +tile = np.zeros((tile_size, tile_size, 3), dtype=np.uint8) -## Badges -On some READMEs, you may see small images that convey metadata, such as whether or not all the tests are passing for the project. You can use Shields to add some to your README. Many services also have instructions for adding a badge. +payload = lz4.frame.compress(tile.tobytes()) -## Visuals -Depending on what you are making, it can be a good idea to include screenshots or even a video (you'll frequently see GIFs rather than actual videos). Tools like ttygif can help, but check out Asciinema for a more sophisticated method. +resp = requests.post( + "http://localhost:8000/prostate-classifier-1/", + data=payload, + headers={"Content-Type": "application/octet-stream"}, + timeout=60, +) +resp.raise_for_status() +print(resp.json() if resp.headers.get("content-type", "").startswith("application/json") else resp.text) +``` -## Installation -Within a particular ecosystem, there may be a common way of installing things, such as using Yarn, NuGet, or Homebrew. However, consider the possibility that whoever is reading your README is a novice and would like more guidance. Listing specific steps helps remove ambiguity and gets people to using your project as quickly as possible. 
If it only runs in a specific context like a particular programming language version or operating system or has dependencies that have to be installed manually, also add a Requirements subsection. +## Repository Structure -## Usage -Use examples liberally, and show the expected output if you can. It's helpful to have inline the smallest example of usage that you can demonstrate, while providing links to more sophisticated examples if they are too long to reasonably include in the README. +``` +model-service/ +├── models/ # Model implementations +│ └── binary_classifier.py +├── providers/ # Model loading providers +│ └── model_provider.py +├── docs/ # Documentation +├── ray-service.yaml # Kubernetes RayService configuration +├── pyproject.toml # Python dependencies +└── README.md +``` ## Support -Tell people where they can go to for help. It can be any combination of an issue tracker, a chat room, an email address, etc. - -## Roadmap -If you have ideas for releases in the future, it is a good idea to list them in the README. -## Contributing -State if you are open to contributions and what your requirements are for accepting them. +- **Issues:** Report bugs or request features via [GitLab Issues](https://gitlab.ics.muni.cz/rationai/infrastructure/model-service/-/issues) +- **Contact:** RationAI team at Masaryk University -For people who want to make changes to your project, it's helpful to have some documentation on how to get started. Perhaps there is a script that they should run or some environment variables that they need to set. Make these steps explicit. These instructions could also be useful to your future self. - -You can also document commands to lint the code or run tests. These steps help to ensure high code quality and reduce the likelihood that the changes inadvertently break something. Having instructions for running tests is especially helpful if it requires external setup, such as starting a Selenium server for testing in a browser. +## License -## Authors and acknowledgment -Show your appreciation to those who have contributed to the project. +This project is part of the RationAI infrastructure and is available for use by authorized members of the RationAI group. -## License -For open source projects, say how it is licensed. +## Authors -## Project status -If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers. +Developed and maintained by the RationAI team at Masaryk University, Faculty of Informatics. diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md index 48447d2..6d3f3bc 100644 --- a/docs/architecture/overview.md +++ b/docs/architecture/overview.md @@ -47,13 +47,12 @@ The foundation of Model Service, providing distributed computing infrastructure. 
- Cluster coordination and management - Dashboard for monitoring - Scheduling decisions -- Does not run model workloads (no CPU/GPU assigned) +- Typically configured with 0 CPU/GPU for user workloads (cluster coordination only) **Worker Nodes:** - Execute model inference workloads - Can be CPU-only or GPU-enabled - - Auto-scale based on demand - Different worker groups for different hardware types @@ -211,23 +210,21 @@ resources: **Ray Cluster:** -- Head node failure → Cluster recreated by KubeRay -- Worker failure → Workload rescheduled to other workers -- Network partition → Automatic reconnection +- Worker pod failure → Ray reschedules work when possible, KubeRay recreates pods +- Head pod failure → KubeRay typically restarts the head, but the cluster (and Serve) may be briefly unavailable during recovery +- In general, expect automatic recovery from many pod-level failures, but not strict “no-downtime” guarantees **Ray Serve:** - Replica failure → Requests routed to healthy replicas - Failed replicas automatically restarted -- Graceful shutdown on updates +- Graceful shutdown is supported when configured properly, but does not guarantee zero dropped requests. -### Zero-Downtime Updates +### Updates and Downtime -Model updates use blue-green deployment: +RayService updates are reconciled by KubeRay and are designed to minimize downtime, but the exact behavior depends on your Ray/KubeRay versions and the change being applied. -1. New version deployed alongside old -2. Traffic gradually shifted to new version -3. Old version removed when no active requests +In practice, updates may temporarily run old and new replicas at the same time while shifting traffic to healthy replicas. ```yaml spec: @@ -284,7 +281,7 @@ kubectl describe rayservice -n [namespace] ### Metrics -Ray exports Prometheus metrics: +Ray can export Prometheus metrics (when metrics collection/export is enabled): - Request latency - Request throughput @@ -318,7 +315,7 @@ kind: RayService ### 4. Fault Tolerance - Automatic recovery from failures -- No single point of failure (except data plane) +- Ray Head Node is a logical single point of control - failures are recoverable but may cause brief service disruption - Graceful degradation ### 5. Developer Experience @@ -330,5 +327,5 @@ kind: RayService ## Next Steps - [Deployment guide](../guides/deployment-guide.md) -- [Configuration reference](../reference/configuration-reference.md) +- [Configuration reference](../guides/configuration-reference.md) - [Adding new models](../guides/adding-models.md) diff --git a/docs/get-started/quick-start.md b/docs/get-started/quick-start.md index f96fceb..e2084d9 100644 --- a/docs/get-started/quick-start.md +++ b/docs/get-started/quick-start.md @@ -72,7 +72,7 @@ Once deployed, you can port-forward the service to access it locally: ```bash # Port-forward to access the service locally -kubectl port-forward svc/rayservice-models-serve-svc -n [namespace] 8000:8000 +kubectl port-forward -n [namespace] svc/rayservice-models-serve-svc 8000:8000 ``` ## Step 6: Delete the Deployed Model @@ -83,6 +83,10 @@ To delete the deployed RayService, run: kubectl delete -f ray-service.yaml -n [namespace] ``` +### Connection Issues + +Ensure your cluster has proper network policies and that the namespace has access to required resources (MLflow, proxy, etc.). + ## Next Steps Congratulations! You've successfully deployed your first model with Model Service. 
@@ -92,9 +96,5 @@ Now you can: - [Learn how to add your own models](../guides/adding-models.md) - [Understand the architecture](../architecture/overview.md) - [Read the deployment guide](../guides/deployment-guide.md) -- [Check configuration reference](../reference/configuration-reference.md) +- [Check configuration reference](../guides/configuration-reference.md) - [Troubleshooting](../guides/troubleshooting.md) - -### Connection Issues - -Ensure your cluster has proper network policies and that the namespace has access to required resources (MLflow, proxy, etc.). diff --git a/docs/guides/adding-models.md b/docs/guides/adding-models.md index fd79a7e..75b73a7 100644 --- a/docs/guides/adding-models.md +++ b/docs/guides/adding-models.md @@ -36,7 +36,6 @@ class MyModel: app = MyModel.bind() ``` -!!! note The repository's reference model `BinaryClassifier` uses FastAPI ingress + batched inference and expects a **compressed binary payload** (not JSON). For simple JSON models, the examples above are fine; for high-throughput image inference, consider the batching and ingress patterns shown below. ### Key Components @@ -140,7 +139,6 @@ class BatchedModel: return {"prediction": result} ``` -!!! tip For binary/image workloads, you can also batch raw `bytes` like the `BinaryClassifier` does (see `models/binary_classifier.py`). This avoids JSON overhead and lets you control batch sizing via `user_config` by calling `set_max_batch_size()` and `set_batch_wait_timeout_s()`. ### Using FastAPI @@ -246,8 +244,7 @@ spec: - onnxruntime>=1.23.2 ``` -!!! note -In this repo, the production `ray-service.yaml` installs model dependencies under `deployments[*].ray_actor_options.runtime_env.pip` (not only at `applications[*].runtime_env`). This is useful when different deployments need different dependencies. +In this repository, the production `ray-service.yaml` installs model dependencies under `deployments[*].ray_actor_options.runtime_env.pip` (not only at `applications[*].runtime_env`). This is useful when different deployments need different dependencies. 
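+
+For illustration, a minimal sketch of what per-deployment dependencies can look like (the deployment name, packages, and versions are examples, not the exact production values):
+
+```yaml
+deployments:
+  - name: BinaryClassifier
+    ray_actor_options:
+      num_cpus: 1
+      runtime_env:
+        pip:
+          - onnxruntime>=1.23.2
+          - lz4>=4.4.5
+```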
## GPU Models @@ -321,5 +318,5 @@ kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=100 ## Next Steps - [Deployment guide](deployment-guide.md) -- [Configuration reference](../reference/configuration-reference.md) +- [Configuration reference](configuration-reference.md) - [Architecture overview](../architecture/overview.md) diff --git a/docs/reference/configuration-reference.md b/docs/guides/configuration-reference.md similarity index 98% rename from docs/reference/configuration-reference.md rename to docs/guides/configuration-reference.md index 1e4424c..cd9771d 100644 --- a/docs/reference/configuration-reference.md +++ b/docs/guides/configuration-reference.md @@ -167,5 +167,5 @@ spec: ## Next Steps -- [Deployment guide](../guides/deployment-guide.md) +- [Deployment guide](deployment-guide.md) - [Architecture overview](../architecture/overview.md) diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md index cdfc110..88ee041 100644 --- a/docs/guides/deployment-guide.md +++ b/docs/guides/deployment-guide.md @@ -132,7 +132,7 @@ Watch the deployment progress: kubectl get rayservice rayservice-my-model -n [namespace] -w # Check pods -kubectl get pods -n [namespace] -l ray.io/cluster +kubectl get pods -n [namespace] -l ray.io/cluster=rayservice-my-model # View head node logs kubectl logs -n [namespace] -l ray.io/node-type=head -f @@ -152,21 +152,12 @@ Check service endpoints: kubectl get svc -n [namespace] # Port forward to test -kubectl port-forward -n rationai-notebooks-ns \ +kubectl port-forward -n [namespace] \ svc/rayservice-my-model-serve-svc 8000:8000 ``` -!!! note The example model in this repository (`models/binary_classifier.py`) uses FastAPI ingress and expects a **compressed binary request body** (LZ4), not JSON. The JSON `curl` example below is valid for JSON-based models but does not apply to `BinaryClassifier`. -Test the endpoint: - -```bash -curl -X POST http://localhost:8000/my-model \ - -H "Content-Type: application/json" \ - -d '{"input": [1.0, 2.0, 3.0]}' -``` - ## Production Considerations ### Resource Planning @@ -307,13 +298,6 @@ serveConfigV2: | num_gpus: 1 ``` -Access models: - -```bash -curl http://service:8000/model-a -d '{"input": ...}' -curl http://service:8000/model-b -d '{"input": ...}' -``` - ## Updating Deployments ### Update Model Code @@ -326,17 +310,17 @@ curl http://service:8000/model-b -d '{"input": ...}' ```bash # Edit configuration -vim ray-service.yaml +vim ray-service.yaml # or any IDE # Apply changes -kubectl apply -f ray-service.yaml +kubectl apply -f ray-service.yaml -n [namespace] ``` -Ray will perform rolling update: +KubeRay will reconcile the RayService and attempt a rolling-style update: -- New replicas created with new config -- Traffic gradually shifted -- Old replicas removed +- New replicas are created with the new config +- Traffic is routed to healthy replicas +- Old replicas are eventually removed ### Update Model Weights @@ -351,7 +335,7 @@ user_config: Apply update: ```bash -kubectl apply -f ray-service.yaml +kubectl apply -f ray-service.yaml -n [namespace] ``` ## Rollback @@ -363,18 +347,11 @@ If deployment fails, rollback: # Instead, view KubeRay status and events, then re-apply a known-good spec. 
# Inspect current state and recent events -kubectl get rayservice rayservice-my-model -n rationai-notebooks-ns -o yaml -kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns +kubectl get rayservice rayservice-my-model -n [namespace] -o yaml +kubectl describe rayservice rayservice-my-model -n [namespace] # Check Ray Serve controller logs (usually shows the root cause) -kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=head --tail=200 -``` - -Or manually apply previous configuration: - -```bash -git checkout HEAD~1 ray-service.yaml -kubectl apply -f ray-service.yaml +kubectl logs -n [namespace] -l ray.io/node-type=head --tail=200 ``` ## Troubleshooting @@ -384,7 +361,7 @@ kubectl apply -f ray-service.yaml **Check RayService status:** ```bash -kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns +kubectl describe rayservice rayservice-my-model -n [namespace] ``` **Common issues:** @@ -400,10 +377,10 @@ kubectl describe rayservice rayservice-my-model -n rationai-notebooks-ns ```bash # View dashboard -kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 +kubectl port-forward -n [namespace] svc/rayservice-my-model-head-svc 8265:8265 # Check logs -kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=worker --tail=100 +kubectl logs -n [namespace] -l ray.io/node-type=worker --tail=100 ``` **Common issues:** @@ -419,7 +396,7 @@ kubectl logs -n rationai-notebooks-ns -l ray.io/node-type=worker --tail=100 ```bash # Ray dashboard: http://localhost:8265 -kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 +kubectl port-forward -n [namespace] svc/rayservice-my-model-head-svc 8265:8265 ``` **Possible solutions:** @@ -441,7 +418,7 @@ kubectl port-forward svc/rayservice-my-model-head-svc 8265:8265 ## Next Steps -- [Configuration reference](../reference/configuration-reference.md) +- [Configuration reference](configuration-reference.md) - [Architecture overview](../architecture/overview.md) - [Adding new models](adding-models.md) - [Troubleshooting](troubleshooting.md) diff --git a/docs/index.md b/docs/index.md index 865ea5a..0ef4ee7 100644 --- a/docs/index.md +++ b/docs/index.md @@ -10,7 +10,7 @@ Model Service is a deployment framework that enables: - **Distributed Inference**: Distribute inference workloads across multiple workers and nodes - **Resource Management**: Efficiently manage CPU and GPU resources in Kubernetes - **Model Versioning**: Integration with MLflow for model lifecycle management -- **Production Ready**: Built on Ray Serve with fault tolerance and high availability +- **Production-oriented**: Built on Ray Serve and KubeRay, with autoscaling and failure recovery features ## Key Features diff --git a/mkdocs.yml b/mkdocs.yml index 3a255cd..ca710b3 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,7 +1,7 @@ site_name: Model Service Documentation site_description: Model deployment infrastructure for RationAI using Ray Serve on Kubernetes repo_url: https://gitlab.ics.muni.cz/rationai/infrastructure/model-service -edit_uri: edit/master/docs/ +edit_uri: -/edit/master/docs/ theme: name: material @@ -25,11 +25,10 @@ nav: - Guides: - Deployment Guide: guides/deployment-guide.md - Adding New Models: guides/adding-models.md + - Configuration Reference: guides/configuration-reference.md - Troubleshooting: guides/troubleshooting.md - Architecture: - Overview: architecture/overview.md - - Reference: - - Configuration Reference: reference/configuration-reference.md markdown_extensions: - admonition From 
d17771309ebabad7657b3a476581a194dac5d827 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Fri, 9 Jan 2026 17:27:56 +0100 Subject: [PATCH 06/17] feat: add dependency group + CI pipelines --- .gitlab-ci.yml | 9 +++++++++ README.md | 6 +++--- pyproject.toml | 1 + 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index b132c47..67bffdb 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml @@ -4,3 +4,12 @@ include: stages: - lint + - docs + +docs:build: + stage: docs + image: python:3.12 + script: + - python -m pip install --upgrade pip + - python -m pip install mkdocs mkdocs-material pymdown-extensions + - mkdocs build --strict diff --git a/README.md b/README.md index 6636fae..fc1970a 100644 --- a/README.md +++ b/README.md @@ -31,14 +31,14 @@ Full walkthrough: `docs/get-started/quick-start.md`. ### Deploy ```bash -kubectl apply -f ray-service.yaml -n rationai-notebooks-ns -kubectl get rayservice rayservice-models -n rationai-notebooks-ns +kubectl apply -f ray-service.yaml -n [namespace] +kubectl get rayservice rayservice-models -n [namespace] ``` ### Access locally ```bash -kubectl port-forward -n rationai-notebooks-ns svc/rayservice-models-serve-svc 8000:8000 +kubectl port-forward -n [namespace] svc/rayservice-models-serve-svc 8000:8000 ``` ### Test the reference model (`BinaryClassifier`) diff --git a/pyproject.toml b/pyproject.toml index c42c605..be98026 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -19,3 +19,4 @@ dependencies = [ [dependency-groups] dev = ["mypy>=1.18.2", "ruff>=0.14.6"] +docs = ["mkdocs>=1.6.0", "mkdocs-material>=9.6.0", "pymdown-extensions>=10.0"] From cd822ac4190d852c6c33b4673d45009e8bc95386 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Sun, 18 Jan 2026 17:52:01 +0100 Subject: [PATCH 07/17] fix: gitlab template --- .gitlab-ci.yml | 14 ++++---------- 1 file changed, 4 insertions(+), 10 deletions(-) diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index 67bffdb..f6d4011 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml @@ -1,15 +1,9 @@ include: project: rationai/digital-pathology/templates/ci-templates - file: Python-Lint.gitlab-ci.yml + file: + - Python-Lint.gitlab-ci.yml + - MkDocs.gitlab-ci.yml stages: - lint - - docs - -docs:build: - stage: docs - image: python:3.12 - script: - - python -m pip install --upgrade pip - - python -m pip install mkdocs mkdocs-material pymdown-extensions - - mkdocs build --strict + - deploy From 8cdce494ae620d200ab722cbc44e2474b649e8eb Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 11:04:42 +0100 Subject: [PATCH 08/17] feat: add more detailed pages --- docs/architecture/batching.md | 82 +++++ docs/architecture/overview.md | 324 +++++-------------- docs/architecture/queues-and-backpressure.md | 57 ++++ docs/architecture/request-lifecycle.md | 186 +++++++++++ mkdocs.yml | 3 + 5 files changed, 404 insertions(+), 248 deletions(-) create mode 100644 docs/architecture/batching.md create mode 100644 docs/architecture/queues-and-backpressure.md create mode 100644 docs/architecture/request-lifecycle.md diff --git a/docs/architecture/batching.md b/docs/architecture/batching.md new file mode 100644 index 0000000..2d4faff --- /dev/null +++ b/docs/architecture/batching.md @@ -0,0 +1,82 @@ +# Batching (How It Works Under the Hood) + +Batching in Ray Serve is a **replica-local request coalescing** mechanism. + +It improves throughput when your model can process multiple inputs more efficiently together (common for GPU inference). 
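+
+For orientation, a minimal sketch of a batched handler (the decorator and its parameters are described below; the handler body is illustrative):
+
+```python
+from ray import serve
+
+
+@serve.deployment
+class BatchedModel:
+    def __init__(self):
+        self.model = None  # placeholder: load your real model here
+
+    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
+    async def handle_batch(self, inputs: list):
+        # Called once per flushed batch; must return one result per input, in order.
+        return [{"batch_size": len(inputs)} for _ in inputs]
+
+    async def __call__(self, request):
+        # Each request is submitted individually; Serve coalesces them into batches.
+        return await self.handle_batch(request)
+
+
+app = BatchedModel.bind()
+```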
+ +## Where batching happens + +Batching happens **inside each replica process**. + +Requests only become eligible for batching after they: + +1. enter through the proxy and handle queueing/backpressure, and +2. get routed to a specific replica + +See also: [Request lifecycle](request-lifecycle.md). + +## The API surface (what you configure) + +In user code, batching is enabled by decorating an async method with `@serve.batch`: + +- `max_batch_size`: upper bound for how many requests are grouped into one batch execution +- `batch_wait_timeout_s`: maximum time to wait (since the first queued item) before flushing a smaller batch + +Serve expects the batched handler to return **one result per input** (same batch length, same order). + +## What Serve actually does internally + +Conceptually, each replica maintains an internal structure like: + +- an in-memory buffer of pending calls +- a background “flush” loop that decides when to execute a batch +- per-request futures/promises that get completed when the batch finishes + +### 1. Collection phase (buffering) + +Incoming requests that hit the batched method are appended to a replica-local buffer. + +Each buffered entry stores: + +- the request arguments (or decoded payload) +- a future representing that request’s eventual response + +### 2. Flush conditions (size or time) + +The buffer is flushed when either condition becomes true: + +- **Size trigger**: buffer length reaches `max_batch_size` +- **Time trigger**: `batch_wait_timeout_s` elapses since the **first** item currently in the buffer + +This is why batching can increase latency at low QPS: a request may wait up to `batch_wait_timeout_s` for more arrivals. + +### 3. Execution phase (single call) + +Serve invokes your batched handler **once** with a list of inputs. + +This is where you typically vectorize: + +- stack/concat tensors +- run one forward pass +- split/scatter outputs back + +### 4. Scatter phase (complete futures) + +When the batched handler returns a list of outputs, Serve resolves the stored futures in order. + +Each original HTTP request then completes independently with its corresponding output. + +## Configuration & Tuning + +For a deep dive into how batching interacts with concurrency limits (specifically why `max_ongoing_requests` must be larger than `max_batch_size`), see **[Queues and backpressure](queues-and-backpressure.md)**. + +Quick tips: + +- Increase `max_batch_size` if the model benefits from larger batches and you have headroom. +- Increase `batch_wait_timeout_s` to favor fuller batches; decrease it to favor latency. + +## Next + +- Request flow including queue points: [Request lifecycle](request-lifecycle.md) +- Queueing and rejection controls: [Queues and backpressure](queues-and-backpressure.md) +- “Knobs” reference and meanings: [Configuration reference](../guides/configuration-reference.md) diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md index 6d3f3bc..254576e 100644 --- a/docs/architecture/overview.md +++ b/docs/architecture/overview.md @@ -1,126 +1,72 @@ # Architecture Overview -This document provides an overview of Model Service's architecture, components, and design principles. +This section provides a structured overview of Model Service's architecture. + +If you are new to the project, start here and then follow the links to the deeper pages. 
## System Architecture -Model Service is built on a multi-layered architecture: +Model Service is built on Kubernetes + KubeRay + Ray Serve: ``` -┌─────────────────────────────────────────────────────────┐ -│ Client Applications │ -│ (API Consumers, Web Apps, Notebooks) │ -└────────────────────┬────────────────────────────────────┘ - │ HTTP/HTTPS - ▼ -┌─────────────────────────────────────────────────────────┐ -│ Kubernetes Service │ -│ (Load Balancer / Ingress) │ -└────────────────────┬────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────┐ -│ Ray Serve │ -│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ -│ │ Model A │ │ Model B │ │ Model C │ │ -│ │ (Replicas) │ │ (Replicas) │ │ (Replicas) │ │ -│ └──────────────┘ └──────────────┘ └──────────────┘ │ -└────────────────────┬────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────┐ -│ Ray Cluster │ -│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ -│ │ Head Node │ │ Worker 1 │ │ Worker 2 │ ... │ -│ └────────────┘ └────────────┘ └────────────┘ │ -└─────────────────────────────────────────────────────────┘ +┌──────────────────────────────────────────────────────────────────┐ +│ Head Node │ +│ │ +│ ┌──────────────┐ ┌───────────────────┐ │ +│ │ Controller │◄───────────────────┤ HTTP Proxy │◄──── Client Request +│ │ (Autoscaler) │ Update Config │ (Ingress) │ │ +│ └──────┬───────┘ └─────────┬─────────┘ │ +│ │ │ │ +└───────────┼──────────────────────────────────────┼───────────────┘ + │ Manage │ Route + ▼ ▼ +┌──────────────────────────────────────────────────────────────────┐ +│ Worker Nodes │ +│ │ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ Application 1 │ │ +│ │ ┌──────────────────────┐ ┌──────────────────────┐ │ │ +│ │ │ Deployment A │ │ Deployment B │ │ │ +│ │ │ ┌────────┐ ┌────────┐│ │ ┌────────┐ ┌────────┐│ │ │ +│ │ │ │Replica │ │Replica ││ │ │Replica │ │Replica ││ │ │ +│ │ │ └────────┘ └────────┘│ │ └────────┘ └────────┘│ │ │ +│ │ └──────────────────────┘ └──────────────────────┘ │ │ +│ └────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ Application 2 │ │ +│ │ ... │ │ +│ └────────────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────┘ ``` -## Core Components - -### 1. Ray Cluster - -The foundation of Model Service, providing distributed computing infrastructure. - -**Head Node:** - -- Cluster coordination and management -- Dashboard for monitoring -- Scheduling decisions -- Typically configured with 0 CPU/GPU for user workloads (cluster coordination only) - -**Worker Nodes:** - -- Execute model inference workloads -- Can be CPU-only or GPU-enabled -- Auto-scale based on demand -- Different worker groups for different hardware types - -### 2. Ray Serve - -Application layer for serving ML models as HTTP endpoints. +## Core Concepts & Hierarchy -**Features:** +The client's request flows through several layers of the system: -- HTTP request routing -- Load balancing across replicas -- Request batching -- Automatic retry and fault tolerance -- Dynamic model configuration +HTTP Proxy → Head Node → Worker Node → Application → Deployment → Replica. -### 3. KubeRay Operator +The main components are: -Kubernetes operator that manages Ray clusters. +1. 
**Ray Service (The Platform)**: The Kubernetes Custom Resource (CR) that defines the entire Ray cluster and the Serve application(s) running on top of it. +2. **Ray Cluster**: The physical set of Kubernetes pods, consisting of a **Head Node** and multiple **Worker Nodes**. +3. **Infrastructure Actors**: + - **Controller**: Manages the control plane, API calls, and autoscaling (does not handle requests). + - **HTTP Proxy**: Ingress point that routes requests to applications. +4. **Serve Application (The Service Boundary)**: A standalone version of your code, including all its deployments and logic. Defined by an import path (e.g., `models.binary_classifier:app`). +5. **Serve Deployment (The Functional Unit)**: A managed group of replicas. It defines scaling rules (`num_replicas`, `num_cpus`) and versioning. +6. **Replica (The Execution Unit)**: A single Ray actor process running the deployment code inside a Worker Node. -**Responsibilities:** +### Serve application vs Serve deployment -- Cluster lifecycle management (create, update, delete) -- Autoscaling worker nodes -- Health monitoring -- Configuration reconciliation +- **Application**: deployable service boundary (routing, code entrypoint, runtime env). +- **Deployment**: scaling unit (replicas), concurrency/queue limits, and resource options. -### 4. Model Implementations +### Internal Mechanisms -Your ML models wrapped with Ray Serve decorators. +For detailed information on how batching works, including the configuration API and internal buffering mechanisms, see [Batching](batching.md). -**Structure:** - -```python -@serve.deployment -class YourModel: - def __init__(self): - # Model loading - - async def __call__(self, request): - # Inference logic -``` - -## Data Flow - -### Inference Request Flow - -1. **Client Request**: HTTP POST to model endpoint -2. **Service Routing**: Kubernetes service routes to Ray Serve -3. **Load Balancing**: Ray Serve distributes to available replica -4. **Model Processing**: Replica executes inference -5. **Response**: Result returned to client - -``` -Client → K8s Service → Ray Serve Router → Model Replica → Response - ↓ - (Autoscaler) - ↓ - Add/Remove - Replicas -``` - -### Model Loading Flow - -1. **Initialization**: Ray Serve creates model replica -2. **Environment Setup**: Install dependencies from runtime_env -3. **Model Download**: Fetch from MLflow/storage -4. **Loading**: Initialize model in memory -5. **Ready**: Replica accepts requests +For request lifecycle and queueing details, see [Request Lifecycle](request-lifecycle.md) and [Queues and Backpressure](queues-and-backpressure.md). 
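+
+A minimal Python sketch of how these concepts map to code (following the decorator pattern used by the models in this repository; the class itself is illustrative):
+
+```python
+from ray import serve
+
+
+@serve.deployment  # Deployment: the scaling unit (replicas, resources, user_config)
+class MyModel:
+    async def __call__(self, request):
+        return {"ok": True}
+
+
+app = MyModel.bind()  # Application: the entrypoint referenced by `import_path`
+```
+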
## Scaling Architecture @@ -153,124 +99,43 @@ workerGroupSpecs: maxReplicas: 4 ``` -**Triggers:** - -- Resource pressure (CPU, memory, GPU) -- Idle timeout (scale to zero) -- Manual scaling - -## Resource Management - -### CPU Resources - -```yaml -ray_actor_options: - num_cpus: 6 # CPUs per replica - -containers: - resources: - requests: - cpu: 12 # CPUs per worker pod -``` - -**Calculation:** - -- Worker pod CPUs ≥ (replicas × num_cpus) -- Leave headroom for system processes - -### Memory Resources - -```yaml -ray_actor_options: - memory: 5368709120 # 5 GiB per replica - -containers: - resources: - limits: - memory: 10Gi # Memory per worker pod -``` - -### GPU Resources +### Resource Sizing (Pods vs Replicas) -```yaml -ray_actor_options: - num_gpus: 1 # GPUs per replica - -nodeSelector: - nvidia.com/gpu.product: NVIDIA-A40 - -resources: - limits: - nvidia.com/gpu: 1 # GPUs per worker pod -``` - -## High Availability - -### Fault Tolerance - -**Ray Cluster:** - -- Worker pod failure → Ray reschedules work when possible, KubeRay recreates pods -- Head pod failure → KubeRay typically restarts the head, but the cluster (and Serve) may be briefly unavailable during recovery -- In general, expect automatic recovery from many pod-level failures, but not strict “no-downtime” guarantees - -**Ray Serve:** +It is important to distinguish between **Kubernetes Resources** (Pods) and **Ray Resources** (Replicas). -- Replica failure → Requests routed to healthy replicas -- Failed replicas automatically restarted -- Graceful shutdown is supported when configured properly, but does not guarantee zero dropped requests. +- **Replica Sizing (`ray_actor_options`)**: Defines how much logical resource one model copy needs (e.g., `num_cpus: 1`). +- **Pod Sizing (`resources.limits`)**: Defines how big the physical container is. -### Updates and Downtime +**Rule of Thumb**: Ensure your Pods are large enough to fit at least one (or N) replicas plus overhead (Python runtime, Object Store). +i.e., `Pod CPU >= Replicas × num_cpus + Overhead`. -RayService updates are reconciled by KubeRay and are designed to minimize downtime, but the exact behavior depends on your Ray/KubeRay versions and the change being applied. +## Autoscaling Architecture -In practice, updates may temporarily run old and new replicas at the same time while shifting traffic to healthy replicas. - -```yaml -spec: - serveConfigV2: | - applications: - - name: my-model-v2 # New version - # ... new configuration -``` - -## Security - -### Pod Security - -All pods run with security constraints: - -```yaml -securityContext: - runAsNonRoot: true - runAsUser: 1000 - allowPrivilegeEscalation: false - capabilities: - drop: ["ALL"] - seccompProfile: - type: RuntimeDefault -``` +The Ray Serve Autoscaler runs inside the **Controller** actor and manages the number of replicas dynamically. -### Network Security +1. **Metrics Collection**: Replicas and DeploymentHandle push metrics (queue size, active queries) to the Controller. +2. **Decision Making**: The Autoscaler periodically checks these metrics against targets (like `target_ongoing_requests`). +3. **Scaling Action**: The Controller adds or removes Replica actors to meet demand. 
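+
+A minimal sketch of the per-deployment knobs the autoscaler evaluates (values are illustrative; see the [Configuration reference](../guides/configuration-reference.md) for details):
+
+```yaml
+autoscaling_config:
+  min_replicas: 1
+  max_replicas: 10
+  target_ongoing_requests: 20   # desired in-flight requests per replica
+  upscale_delay_s: 30           # wait before adding replicas
+  downscale_delay_s: 600        # wait before removing replicas
+```
+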
-- Internal service communication only -- Ingress controls external access -- Proxy support for external dependencies +## Fault Tolerance -## Monitoring & Observability +Ray Serve is designed to be resilient to failures: -### Ray Dashboard +- **Replica Failure**: If a Replica actor crashes, the Controller detects it and starts a new one to replace it. Request routing automatically updates. +- **Proxy Failure**: If the Proxy actor fails, the Controller restarts it. +- **Controller Failure**: If the Controller itself fails, Ray (via GCS) restarts it. Autoscaling pauses during downtime but resumes upon recovery. +- **Node Failure**: KubeRay (managing the cluster) detects node failures and provisions new pods. Ray Serve then eventually schedules actors on the new nodes. -Web UI for cluster monitoring: +## Design Principles -- Resource utilization -- Active tasks -- Node status -- Serve deployments +1. **Declarative Configuration**: Infrastructure defined in YAML, managed by GitOps (`RayService` CR). +2. **Separation of Concerns**: Model Code (Python), Infrastructure (K8s), Configuration (User Config). +3. **Elastic Scaling**: Scale to zero when idle, scale up on demand. +4. **Developer Experience**: Simple model implementation, easy local testing. -### Kubernetes Monitoring +## Metrics & Debugging -Standard Kubernetes tools: +Common commands: ```bash kubectl get pods -n [namespace] @@ -279,8 +144,6 @@ kubectl logs -n [namespace] kubectl describe rayservice -n [namespace] ``` -### Metrics - Ray can export Prometheus metrics (when metrics collection/export is enabled): - Request latency @@ -288,44 +151,9 @@ Ray can export Prometheus metrics (when metrics collection/export is enabled): - Replica count - Resource usage -## Design Principles - -### 1. Declarative Configuration - -Infrastructure defined in YAML, managed by GitOps: - -```yaml -apiVersion: ray.io/v1 -kind: RayService -# ... configuration -``` - -### 2. Separation of Concerns - -- **Model Code**: Python implementation -- **Infrastructure**: Kubernetes manifests -- **Configuration**: user_config section - -### 3. Elastic Scaling - -- Scale to zero when idle -- Scale up on demand -- Efficient resource utilization - -### 4. Fault Tolerance - -- Automatic recovery from failures -- Ray Head Node is a logical single point of control - failures are recoverable but may cause brief service disruption -- Graceful degradation - -### 5. Developer Experience - -- Simple model implementation -- Easy local testing -- Fast iteration cycle - ## Next Steps +- [Request lifecycle](request-lifecycle.md) - [Deployment guide](../guides/deployment-guide.md) - [Configuration reference](../guides/configuration-reference.md) - [Adding new models](../guides/adding-models.md) diff --git a/docs/architecture/queues-and-backpressure.md b/docs/architecture/queues-and-backpressure.md new file mode 100644 index 0000000..fbfd4d8 --- /dev/null +++ b/docs/architecture/queues-and-backpressure.md @@ -0,0 +1,57 @@ +# Queues and Backpressure + +To maintain stability and prevent overload, Ray Serve implements queueing mechanisms at multiple levels. Understanding these queues is critical for tuning latency and handling load spikes. + +## Simplified Queue Model + +There are two main places a request can wait: + +1. **Proxy Handle Queue**: Waiting to be assigned to a replica. +2. **Replica Execution Queue**: Assigned to a replica, waiting for execution (or batching). + +## 1. 
Proxy-Side Queue (`max_queued_requests`) + +When a request arrives at the HTTP Proxy (or via a Deployment Handle), it is routed to a logical deployment. If all specific replicas are busy, the request waits in a queue managed by the proxy/handle. + +- **Config**: `max_queued_requests` (in `deployment_config`) +- **Behavior**: + - Controls the maximum number of requests allowed to wait for assignment. + - If the queue is full, new requests are immediately rejected with a **503 Service Unavailable** error (or a `BackpressureError` in Python). + +### Why limit this? + +Without a limit, a system under heavy load might accept requests until it runs out of memory or latency becomes unacceptable. Fail-fast behavior is often preferred over unbounded waiting. + +## 2. Replica-Side Queue (`max_ongoing_requests`) + +Once a request is assigned to a specific replica, it counts as "ongoing" for that replica. + +- **Config**: `max_ongoing_requests` (in `deployment_config`) +- **Behavior**: + - Limits how many concurrent requests a single replica can process _or_ have buffered. + - If a replica is at its limit, the proxy considers it "busy" and will not assign new requests to it (they will wait in the Proxy Queue instead). + +### Usage with Batching + +If you use `@serve.batch`, requests sitting in the [batching buffer](batching.md) count towards `max_ongoing_requests`. + +- **Warning**: If `max_ongoing_requests` is set too low (e.g., lower than `max_batch_size`), you might throttle your own batching mechanism because the replica will never accept enough requests to fill a batch. + +## Backpressure flow + +1. **Client** sends a request. +2. **HTTP Proxy** receives it. +3. **Check Replica capacity**: Are there replicas with `ongoing_requests < max_ongoing_requests`? + - **Yes**: Forward request to one of them. + - **No**: Enqueue request in the Proxy Queue. +4. **Check Proxy Queue capacity**: Is `current_queue_size < max_queued_requests`? + - **Yes**: Request waits. + - **No**: Reject request immediately (Fail). + +## Tuning Guidelines + +| Scenario | Recommendation | +| :----------------------------- | :------------------------------------------------------------------------------------------------------ | +| **High Throughput / Batching** | Increase `max_ongoing_requests` to ensure replicas can buffer enough work to form full batches. | +| **Latency Sensitive** | Decrease `max_queued_requests` to fail fast rather than returning stale responses after a long wait. | +| **Memory Constrained** | Lower both values to prevent OOM errors by limiting the number of incomplete requests in system memory. | diff --git a/docs/architecture/request-lifecycle.md b/docs/architecture/request-lifecycle.md new file mode 100644 index 0000000..4c6292c --- /dev/null +++ b/docs/architecture/request-lifecycle.md @@ -0,0 +1,186 @@ +# Request Lifecycle in Detail + +This document traces the path of a single inference request through the Model Service stack, from the external HTTP client down to the Ray Core task execution. + +It also highlights **where requests queue** and which settings control queueing vs rejection. 
+ +## High-Level Flow + +``` +┌──────────────────────────────────────────────────────────────────────┐ +│ External Client │ +│ │ +│ HTTP Request │ +│ │ │ +│ ▼ │ +│ ┌───────────────┐ │ +│ │ K8s Service │ │ +│ │ / Ingress │ │ +│ └───────┬───────┘ │ +│ │ Route │ +└──────────┼───────────────────────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────────────┐ +│ Head / Worker Nodes │ +│ │ +│ ┌───────────────────────────────┐ │ +│ │ HTTP Proxy Actor │ │ +│ │ (ServeHTTPProxy) │ │ +│ └───────────────┬───────────────┘ │ +│ │ create DeploymentHandle │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Deployment Handle │ │ +│ │ (Client-side queue) │ │ +│ └───────────────┬───────────────┘ │ +│ │ enqueue / backpressure │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Serve Router │ │ +│ │ (Replica selection) │ │ +│ └───────────────┬───────────────┘ │ +│ │ PushTask RPC │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Replica Actor │ │ +│ │ (Deployment instance) │ │ +│ └───────────────┬───────────────┘ │ +│ │ execute │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Ray Worker Process │ │ +│ └───────────────┬───────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ User Model Code │ │ +│ │ (Inference / Logic) │ │ +│ └───────────────┬───────────────┘ │ +│ │ result │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Plasma Object Store │◄── large objects │ +│ └───────────────┬───────────────┘ │ +│ │ ObjectRef / inline │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ Replica Actor │ │ +│ └───────────────┬───────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────┐ │ +│ │ HTTP Proxy Actor │ │ +│ └───────────────┬───────────────┘ │ +│ │ HTTP Response │ +└──────────────────┼───────────────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────────────┐ +│ External Client │ +│ 200 OK Response │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +## Step-by-Step Breakdown + +### 1. Ingress (HTTP Proxy) + +**Component**: `ServeHTTPProxy` actor (running on Head or Worker nodes). + +1. **Receive**: The request hits the Uvicorn server running inside the Proxy actor. +2. **Route Matching**: The proxy inspects the URL path to match it against active **Applications** and their **Ingress Deployments**. +3. **Handle Creation**: The proxy uses a `DeploymentHandle` to forward the request. It does **not** send the request directly to a replica yet. + +### 2. Queueing & Backpressure (Deployment Handle) + +**Component**: `DeploymentHandle` (client-side in the Proxy). + +The request enters a **Handle Queue** managed by the caller (the Proxy). + +- **Assignment**: The handle checks for available slots in the target Deployment. +- **Backpressure**: If replicas are saturated (`max_ongoing_requests`), the request stays in this queue instead of being pushed to a replica. +- **Rejection**: If the handle queue grows beyond `max_queued_requests`, the request is rejected with an overload-style error (client-visible backpressure). + +**Where this queue lives**: inside the process that is making the call (here: the HTTP Proxy). It is not a replica-local queue. + +### 3. Replica Assignment (Ray Core) + +**Component**: `ServeRouter` & `Ray Core`. + +When a slot is available: + +1. **Routing**: The router selects a specific Replica actor ID based on the policy (e.g., `PowerOfTwoChoices`). +2. 
**RPC**: The request is serialized and sent via Ray's internal gRPC protocol to the selected actor. + +> **Under the Hood: Ray Task Lifecycle** +> +> - **Submission**: The router behaves like a Ray Core driver submitting a task. +> - **Worker Lease**: Ray guarantees the actor exists. If the actor had crashed, the Ray Controller would have already requested a new worker lease from the **Raylet** to restart it. +> - **PushTask**: The `PushTask` RPC carries the request data. + +### 4. Execution (Worker & Replica) + +**Component**: `RayWorker` process. + +1. **Receive**: The Worker process hosting the Replica actor receives the message. +2. **Deserialization**: + - **Small Data**: Unpickled directly from the message. + - **Large Data**: If the request payload is large, it may be retrieved from the **Plasma Object Store** (shared memory). +3. **Asyncio Loop**: The request enters the actor's entrypoint (usually `__call__`). +4. **Replica Concurrency Limit**: The replica will not run more than `max_ongoing_requests` concurrently. Requests beyond that should not be dispatched to this replica; instead they remain queued at the caller-side handle. +5. **Batching** (Optional): If `@serve.batch` is used, the request may wait in a replica-local batching buffer until either `max_batch_size` is reached or `batch_wait_timeout_s` expires (see [Batching](batching.md)). +6. **Inference**: The model code runs (e.g. `model.predict(input)`). + +### 5. Response & Return + +**Component**: Shared Memory & Network. + +1. **Completion**: The function returns a result. +2. **Storage**: + - **Small Result**: Sent back directly in the RPC response. + - **Large Result**: Stored in the local Plasma Store; only an `ObjectRef` is returned. +3. **Forwarding**: The HTTP Proxy waits for the result (resolving the `ObjectRef` if necessary) and writes the HTTP response body. +4. **Client**: The client receives the `200 OK`. + +## Where queues are handled (and where requests get rejected) + +Ray Serve has multiple queue-like stages. They serve different purposes and are controlled by different knobs. + +For deep-dive explanation and tuning advice, see **[Queues and Backpressure](queues-and-backpressure.md)**. + +### 1. Proxy-side “handle queue” (caller-side) + +When an HTTP request hits Ray Serve, the proxy forwards it through a `DeploymentHandle`. +That handle maintains a **caller-side queue** of requests waiting to be assigned to a replica. + +This is where `max_queued_requests` applies. + +- If replicas are busy (because of per-replica concurrency limits), the request waits here. +- If the queue grows beyond `max_queued_requests`, the request is rejected (client-visible backpressure). + +### 2. Routing / replica selection + +Once a request can be dispatched, Ray Serve selects a replica. + +This stage is not intended to be a long-term queue - it is primarily where the system decides _which_ replica gets the request next. + +### 3. Replica concurrency slots (“ongoing requests”) + +Each replica enforces a cap on concurrent in-flight work via `max_ongoing_requests`. + +- If a replica already has `max_ongoing_requests` in progress, new work should not be scheduled onto it. +- “Ongoing” includes requests that are actively executing _or_ are awaiting completion (e.g., waiting for I/O or for a batch to flush). + +### 4. Replica-local batching buffer (optional) + +If you use `@serve.batch`, requests assigned to the replica can enter a **batching buffer** inside the replica. 
+ +This buffer is flushed when either: + +- it reaches `max_batch_size`, or +- `batch_wait_timeout_s` elapses since the first buffered request + +This buffer is not controlled by `max_queued_requests` (that limit is caller-side). + +**[Queues and backpressure](queues-and-backpressure.md)** explains specifically how `max_ongoing_requests` and `max_queued_requests` interact. diff --git a/mkdocs.yml b/mkdocs.yml index ca710b3..e4a1e22 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -29,6 +29,9 @@ nav: - Troubleshooting: guides/troubleshooting.md - Architecture: - Overview: architecture/overview.md + - Request Lifecycle: architecture/request-lifecycle.md + - Batching: architecture/batching.md + - Queues and Backpressure: architecture/queues-and-backpressure.md markdown_extensions: - admonition From a1cc2753784063ed4819c43458294077cf554b64 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 11:05:18 +0100 Subject: [PATCH 09/17] feat: add pages to readme --- README.md | 3 +++ docs/architecture/batching.md | 4 ++-- docs/index.md | 20 ++++++++++++++++++++ 3 files changed, 25 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index fc1970a..66f9fd6 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,9 @@ This repository contains: - `docs/guides/configuration-reference.md` - `docs/guides/troubleshooting.md` - `docs/architecture/overview.md` + - `docs/architecture/request-lifecycle.md` + - `docs/architecture/queues-and-backpressure.md` + - `docs/architecture/batching.md` ## Quick Start (Kubernetes) diff --git a/docs/architecture/batching.md b/docs/architecture/batching.md index 2d4faff..b52ce0d 100644 --- a/docs/architecture/batching.md +++ b/docs/architecture/batching.md @@ -13,11 +13,11 @@ Requests only become eligible for batching after they: 1. enter through the proxy and handle queueing/backpressure, and 2. get routed to a specific replica -See also: [Request lifecycle](request-lifecycle.md). +See also: **[Request lifecycle](request-lifecycle.md)**. ## The API surface (what you configure) -In user code, batching is enabled by decorating an async method with `@serve.batch`: +In user code, batching is enabled by decorating an **async** method with `@serve.batch`: - `max_batch_size`: upper bound for how many requests are grouped into one batch execution - `batch_wait_timeout_s`: maximum time to wait (since the first queued item) before flushing a smaller batch diff --git a/docs/index.md b/docs/index.md index 0ef4ee7..b5ad660 100644 --- a/docs/index.md +++ b/docs/index.md @@ -54,6 +54,26 @@ Model Service is designed for: - **Real-time Inference**: Serve predictions with low latency for interactive applications - **Research Experiments**: Quickly deploy and test new model versions +## Documentation Contents + +### Get Started + +- [**Quick Start**](get-started/quick-start.md): Deploy the reference empty model in minutes. + +### Guides + +- [**Adding Models**](guides/adding-models.md): How to write, package, and integrate your own Python models. +- [**Deployment Guide**](guides/deployment-guide.md): Production checklist, resource planning (CPU/GPU), and networking. +- [**Configuration Reference**](guides/configuration-reference.md): Detailed explanation of `ray-service.yaml` settings. +- [**Troubleshooting**](guides/troubleshooting.md): Common errors (OOM, hang scenarios) and solutions. + +### Architecture Deep Dive + +- [**Overview**](architecture/overview.md): High-level system design and component hierarchy. 
+- [**Request Lifecycle**](architecture/request-lifecycle.md): Trace a request from Ingress to Worker. +- [**Queues & Backpressure**](architecture/queues-and-backpressure.md): Understanding flow control and overload protection. +- [**Batching**](architecture/batching.md): How request coalescing works under the hood. + ## Getting Help - **Documentation**: Browse the guides and reference materials in this documentation From 83e72e4cd03daaae75c036648452f586061aeca7 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 11:06:09 +0100 Subject: [PATCH 10/17] feat: add details --- docs/guides/adding-models.md | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/docs/guides/adding-models.md b/docs/guides/adding-models.md index 75b73a7..4bcfb42 100644 --- a/docs/guides/adding-models.md +++ b/docs/guides/adding-models.md @@ -58,10 +58,12 @@ class MyModel: #### 2. Initialization -Load your model in `__init__`: +Load your model in `__init__`. This method corresponds to the replica **startup phase**. ```python def __init__(self): + # This runs ONCE when the replica starts. + # The replica is NOT ready for traffic until this returns. import torch self.model = torch.load("model.pt") @@ -69,7 +71,16 @@ def __init__(self): print("Model loaded successfully") ``` -#### 3. Inference Method +#### 3. Resource Packing (Fractional CPUs/GPUs) + +Ray allows fractional resource requests. This lets you pack multiple small replicas onto a single node. + +```python +# Run 4 replicas on a single 1-CPU node (0.25 * 4 = 1.0) +@serve.deployment(ray_actor_options={"num_cpus": 0.25}) +``` + +#### 4. Inference Method Implement `__call__` or other methods for handling requests: From cff36fb09deec4e0599f985f16d1ee79ffb70e56 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 11:22:09 +0100 Subject: [PATCH 11/17] feat: add more details of requesting --- docs/guides/configuration-reference.md | 76 ++++++++++++++++++++++++-- 1 file changed, 71 insertions(+), 5 deletions(-) diff --git a/docs/guides/configuration-reference.md b/docs/guides/configuration-reference.md index cd9771d..4f9c165 100644 --- a/docs/guides/configuration-reference.md +++ b/docs/guides/configuration-reference.md @@ -41,7 +41,7 @@ serveConfigV2: | - `name`: logical app name (used in Ray dashboard/logs). - `import_path`: Python entrypoint (`module.path:variable`). - `route_prefix`: HTTP path under the Serve gateway. -- `runtime_env`: code location + extra Python deps. +- `runtime_env`: dynamic environment setup (see [Managing Dependencies](../guides/adding-models.md#6-managing-dependencies)). ### Deployments (scaling + resources) @@ -66,6 +66,70 @@ deployments: - `ray_actor_options`: per‑replica CPU/GPU/memory. - `user_config`: free‑form dict passed to `reconfigure()` in your model. +## 2.1 Backpressure and queueing settings (very important) + +These two knobs often get confused because they both “limit load”, but they act at different points in the request path. + +### `max_ongoing_requests` (replica-side concurrency) + +**What it is:** the maximum number of in-flight requests a _single replica_ is allowed to have at once. + +**What it controls:** per-replica concurrency and memory pressure. + +**What happens when exceeded:** requests should not be dispatched onto that replica; they must wait upstream (typically in the caller-side queue). 
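+
+A minimal sketch of where both knobs sit in the deployment spec (values are illustrative, not tuned recommendations):
+
+```yaml
+deployments:
+  - name: BinaryClassifier
+    max_ongoing_requests: 32    # per-replica concurrency cap
+    max_queued_requests: 128    # caller-side queue limit (explained next)
+```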
+ +### `max_queued_requests` (caller-side queue limit) + +**What it is:** the maximum number of requests that are allowed to wait in the caller-side queue _before_ a replica slot is available. + +**Where that queue lives:** in the component that is calling the deployment (commonly the HTTP Proxy when handling HTTP ingress). + +**What happens when exceeded:** requests are rejected (client-visible overload/backpressure). + +### Why the difference matters + +- `max_ongoing_requests` protects the replica from being overloaded. +- `max_queued_requests` decides whether you prefer waiting or rejecting during spikes. + +See: [Queues and Backpressure](../architecture/queues-and-backpressure.md). + +## 2.2 Autoscaling settings (what they actually mean) + +### `target_ongoing_requests` + +**What it is:** The desired average number of **ongoing (in-flight)** requests per replica. This is the **primary scaling driver**. + +**Formula:** +$$ \text{Desired Replicas} = \left\lceil \frac{\text{Total Ongoing Requests}}{\text{target\_ongoing\_requests}} \right\rceil $$ + +**Note:** "Total Ongoing Requests" refers to the **concurrency** (number of requests currently being processed or waiting in the queue), _not_ the Requests Per Second (RPS). + +**Example:** +If your system receives 100 **concurrent** requests and `target_ongoing_requests` is set to 20, Serve will scale to 5 replicas. + +**How it influences scaling:** + +- **Lower value**: Scales up _earlier_. Use for latency-sensitive models or heavy tasks. +- **Higher value**: Scales up _later_. Use for high-throughput models where a single replica can handle many concurrent requests. + +**Important interaction:** if you set `max_queued_requests` too low, requests may get rejected before ongoing requests rise enough for autoscaling to catch up. + +### `min_replicas` / `max_replicas` + +Hard bounds on how many replicas Serve is allowed to run for that deployment. + +- **Scale to Zero**: Set `min_replicas: 0` to allow the deployment to stop all replicas when idle. The first request will trigger a "cold start" (latency spike). +- **High Availability**: Set `min_replicas: 2` (or more) to ensure at least two copies are always running, even if idle. + +### `upscale_delay_s` / `downscale_delay_s` + +Rules for how quickly the autoscaler reacts to load changes. + +- **`upscale_delay_s`**: The "patience" period before scaling up. The autoscaler sees high load, but waits this many seconds to confirm the spike is real before launching new replicas. + - _Risk_: Setting this too high makes the system sluggish to react to bursts. +- **`downscale_delay_s`**: The "grace period" before scaling down. Even if load drops to zero, the autoscaler keeps replicas alive for this duration. + - _Recommendation_: Keep this high to avoid "thrashing" (rapidly creating/destroying replicas) during short pauses in traffic. + ## 3. Ray Cluster (Workers and Autoscaling) ```yaml @@ -99,11 +163,13 @@ rayClusterConfig: memory: "16Gi" ``` -Focus on: +**Key Interactions:** -- `rayVersion`: must match images you use. -- `workerGroupSpecs[*].{replicas,minReplicas,maxReplicas}`: cluster‑level scaling bounds. -- `resources.requests/limits`: how big each worker pod is. +1. **Head Node Isolation**: `rayStartParams: { num-cpus: "0" }` on the head node prevents workloads from scheduling there. The head is reserved for the Control Plane. +2. **Worker Sizing**: `resources.requests` defines the physical guarantee. Your Pod must be bigger than your Replica (`ray_actor_options`). 
+ - _Physical_: Pod Requests (e.g., 4 CPU) + - _Logical_: Model Replica Requirement (e.g., 2 CPU) + - _Result_: One Pod can fit 2 Replicas (plus overhead). ## 4. Security and Placement (Optional but Recommended) From 39d4bbf1febdbfdb667a7856fa66e6b8afc0c55a Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 11:22:53 +0100 Subject: [PATCH 12/17] feat: more detailed logical/physical --- docs/guides/deployment-guide.md | 82 +++++++++++++++++++-------------- 1 file changed, 48 insertions(+), 34 deletions(-) diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md index 88ee041..ba210a1 100644 --- a/docs/guides/deployment-guide.md +++ b/docs/guides/deployment-guide.md @@ -160,34 +160,45 @@ The example model in this repository (`models/binary_classifier.py`) uses FastAP ## Production Considerations -### Resource Planning +### Resource Planning (Logical vs. Physical) -**Calculate resource requirements:** +Ray scheduling relies on **Logical Resources**, while Kubernetes manages **Physical Resources**. Confusion between them is the #1 cause of "Pending" pods. -1. **Per-replica resources:** +#### 1. Logical Resources (What Ray sees) - - CPU: Based on model complexity - - Memory: Model size + working memory + overhead - - GPU: Number of GPUs needed +Defined in your code via `ray_actor_options`. These are abstract "slots" used for scheduling. -2. **Total cluster resources:** +- `num_cpus: 4`: The actor needs 4 slots to run. +- `memory: 4Gi`: The actor needs 4Gi of _tracked_ heap/object store memory. - ``` - Total CPUs = max_replicas × num_cpus + overhead - Total Memory = max_replicas × memory + overhead - ``` +#### 2. Physical Resources (What Kubernetes gives) -3. **Example calculation:** +Defined in `ray-service.yaml` under `workerGroupSpecs`. This is the actual container capacity. - ``` - Model: 4 CPU, 4GB per replica - Max replicas: 5 +#### 3. The "Overhead" Gap - Required per worker: 5 × 4 = 20 CPUs, 5 × 4GB = 20GB - Overhead: +2 CPUs, +4GB for system +Ray system processes (Raylet, Dashboard Agent, Plasma Store) consume physical CPU and Memory that is **not** accounted for in logical slots. - Worker resources: 22 CPUs, 24GB memory - ``` +**Formula for Worker Pod Sizing:** + +```text +Physical Request >= (Sum of Replicas × Logical Request) + System Overhead +``` + +**Recommended Overhead Buffer:** + +- **CPU**: Add 0.5 - 2 CPU cores per pod for Ray system processes. +- **Memory**: Add 1-2 GiB + 30% of object store size. + +#### Example Calculation + +**Scenario:** Deploy 5 replicas of a model requiring 4 CPUs and 4GB RAM on a single node. + +1. **Logical Needs**: 5 replicas × 4 CPUs = **20 Logical CPUs**. +2. **Physical Overhead**: We estimate 2 CPUs for Raylet/System. +3. **Total Physical Request**: 20 + 2 = **22 CPUs**. 
+ +_If you request only 20 CPUs in Kubernetes, Ray will detect that some CPU is used by the OS/Raylet and might only offer 19 logical slots, causing the 5th replica to hang._ ### Autoscaling Configuration @@ -195,21 +206,28 @@ The example model in this repository (`models/binary_classifier.py`) uses FastAP ```yaml autoscaling_config: - min_replicas: 1 # Always keep 1 running - max_replicas: 10 # Scale up to 10 - target_ongoing_requests: 20 # Target load per replica + min_replicas: 1 + max_replicas: 10 + target_ongoing_requests: 20 - # Advanced options - upscale_delay_s: 30 # Wait 30s before scaling up - downscale_delay_s: 600 # Wait 10m before scaling down + # Advanced stabilization + upscale_delay_s: 30 + downscale_delay_s: 600 ``` -**Scaling behavior:** +**Key Tuning Recommendations:** + +1. **`target_ongoing_requests`**: + - **Lower this value** (e.g., 5-10) for latency-sensitive models or if your model is CPU-heavy. This forces the system to scale out sooner. + - **Increase this value** (e.g., 50-100) for simple models where a single replica can juggle many async requests. -- **Cold start**: Set `min_replicas: 0` for scale-to-zero -- **Always available**: Set `min_replicas: 1` or higher -- **High traffic**: Increase `max_replicas` and `target_ongoing_requests` -- **Batch processing**: Use higher `target_ongoing_requests` +2. **`upscale_delay_s`**: + - Keep this low (e.g., `0s` to `30s`) so the system reacts quickly to traffic spikes. + +3. **`downscale_delay_s`**: + - Keep this high (e.g., `600s`) to avoid "thrashing". It is cheaper to keep an idle replica for 10 minutes than to re-initialize a heavy model (loading weights, etc.) every time traffic dips for a minute. + +For the exact formulas and definitions of these settings, see the [Configuration Reference](configuration-reference.md#2-autoscaling-configuration). ### High Availability @@ -249,12 +267,8 @@ containers: ```yaml env: - - name: HTTP_PROXY - value: "http://proxy.example.com:3128" - name: HTTPS_PROXY value: "http://proxy.example.com:3128" - - name: NO_PROXY - value: ".svc.cluster.local,.cluster.local" ``` **Service configuration:** From c7a4f64943389797fe10ca6b400f4bacfef28e3b Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 11:23:24 +0100 Subject: [PATCH 13/17] feat: add OOMK crash --- docs/guides/troubleshooting.md | 43 ++++++++++++++++++++++++++++++++-- 1 file changed, 41 insertions(+), 2 deletions(-) diff --git a/docs/guides/troubleshooting.md b/docs/guides/troubleshooting.md index 39046b0..34ec5dd 100644 --- a/docs/guides/troubleshooting.md +++ b/docs/guides/troubleshooting.md @@ -76,6 +76,39 @@ deployments: pip: ["onnxruntime>=1.23.2", "mlflow<3.0", "lz4>=4.4.5"] ``` +## Worker Crashes (OOMKilled) + +### Symptoms + +- Pods in `kubectl get pods` show status `OOMKilled` or high restart counts. +- `kubectl describe pod ...` shows "Last State: Terminated (Reason: OOMKilled)". +- Ray Dashboard shows unexpected actor deaths. + +### Causes + +- The model loaded into memory + the input batch size exceeds the container's memory limit. +- **Physical vs Logical Mismatch**: Ray was told the actor needs 2GB, so it scheduled it on a node, but the actual Python process used 4GB, causing Kubernetes to kill it. + +### Fix + +You must increase **both** the Ray logical allocation and the Kubernetes physical limit. + +1. Increase `ray_actor_options.memory` (Software limit): + + ```yaml + ray_actor_options: + memory: 4294967296 # 4 GiB + ``` + +2. 
Increase Kubernetes container limits (Hardware limit): + Ensure the `workerGroupSpecs` in `ray-service.yaml` provides **more** memory than the sum of all actors on that node plus overhead (~30%). + + ```yaml + resources: + limits: + memory: "6Gi" # Must cover the 4GB actor + Ray overhead + ``` + ## Autoscaling Not Working (Replicas Don’t Change) ### Serve replicas not scaling @@ -116,8 +149,14 @@ Also ensure `workerGroupSpecs[*].minReplicas/maxReplicas` allow scaling. ### Fix -- Reduce per-replica requirements (`ray_actor_options.num_cpus`, `memory`). -- Increase cluster capacity or adjust worker pod resources. +1. **Check Physical vs Logical**: + + - _Physical_: Can K8s schedule the pod? `kubectl describe pod` will show if nodes are full. + - _Logical_: Can Ray schedule the actor? Check `ray status` or the dashboard. Ray might say "0/X CPUs available" even if the pod exists, because other actors consumed the slots. + +2. **Adjust Resources**: + - Reduce per-replica requirements (`ray_actor_options.num_cpus`, `memory`). + - Increase cluster capacity (maxReplicas) or per-worker limits. Inspect pod scheduling events: From faa8dfe2d8ef554623216f854e0e2a8bb8385c42 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 11:27:55 +0100 Subject: [PATCH 14/17] fix: broken link --- docs/guides/deployment-guide.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md index ba210a1..7db1c35 100644 --- a/docs/guides/deployment-guide.md +++ b/docs/guides/deployment-guide.md @@ -218,10 +218,12 @@ autoscaling_config: **Key Tuning Recommendations:** 1. **`target_ongoing_requests`**: + - **Lower this value** (e.g., 5-10) for latency-sensitive models or if your model is CPU-heavy. This forces the system to scale out sooner. - **Increase this value** (e.g., 50-100) for simple models where a single replica can juggle many async requests. 2. **`upscale_delay_s`**: + - Keep this low (e.g., `0s` to `30s`) so the system reacts quickly to traffic spikes. 3. **`downscale_delay_s`**: From 621fb398a6d3a2a06e930a4a87dcd7264bdf34ab Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 12:14:33 +0100 Subject: [PATCH 15/17] fix: spelling --- docs/architecture/queues-and-backpressure.md | 4 ++-- docs/get-started/quick-start.md | 4 ++-- docs/guides/adding-models.md | 11 ++++++++--- docs/guides/deployment-guide.md | 6 +++--- docs/index.md | 2 +- 5 files changed, 16 insertions(+), 11 deletions(-) diff --git a/docs/architecture/queues-and-backpressure.md b/docs/architecture/queues-and-backpressure.md index fbfd4d8..965a482 100644 --- a/docs/architecture/queues-and-backpressure.md +++ b/docs/architecture/queues-and-backpressure.md @@ -13,7 +13,7 @@ There are two main places a request can wait: When a request arrives at the HTTP Proxy (or via a Deployment Handle), it is routed to a logical deployment. If all specific replicas are busy, the request waits in a queue managed by the proxy/handle. -- **Config**: `max_queued_requests` (in `deployment_config`) +- **Config**: `max_queued_requests` (in the deployment spec) - **Behavior**: - Controls the maximum number of requests allowed to wait for assignment. - If the queue is full, new requests are immediately rejected with a **503 Service Unavailable** error (or a `BackpressureError` in Python). 
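
To make the rejection behavior concrete, here is a minimal client-side sketch (the endpoint URL and retry policy are illustrative, not part of the repository) showing how a caller can back off when the queue limit triggers a 503:

```python
import time
import requests

def post_with_backoff(url: str, payload: dict, attempts: int = 5):
    """Retry a request that was rejected because max_queued_requests was exceeded."""
    for attempt in range(attempts):
        response = requests.post(url, json=payload)
        if response.status_code != 503:
            response.raise_for_status()
            return response.json()
        # A 503 here signals backpressure: the caller-side queue is full.
        time.sleep(2 ** attempt)
    raise RuntimeError("Service is still overloaded after retries")

# Hypothetical usage:
# result = post_with_backoff("http://localhost:8000/my-model", {"input": [1, 2, 3]})
```
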
@@ -26,7 +26,7 @@ Without a limit, a system under heavy load might accept requests until it runs o Once a request is assigned to a specific replica, it counts as "ongoing" for that replica. -- **Config**: `max_ongoing_requests` (in `deployment_config`) +- **Config**: `max_ongoing_requests` (in the deployment spec) - **Behavior**: - Limits how many concurrent requests a single replica can process _or_ have buffered. - If a replica is at its limit, the proxy considers it "busy" and will not assign new requests to it (they will wait in the Proxy Queue instead). diff --git a/docs/get-started/quick-start.md b/docs/get-started/quick-start.md index e2084d9..1abc779 100644 --- a/docs/get-started/quick-start.md +++ b/docs/get-started/quick-start.md @@ -66,7 +66,7 @@ If the RayService is not becoming ready, inspect events and status: kubectl describe rayservice rayservice-models -n [namespace] ``` -## Step 5: Local Access the Service +## Step 5: Access the Service Locally Once deployed, you can port-forward the service to access it locally: @@ -75,7 +75,7 @@ Once deployed, you can port-forward the service to access it locally: kubectl port-forward -n [namespace] svc/rayservice-models-serve-svc 8000:8000 ``` -## Step 6: Delete the Deployed Model +## Step 6: Delete the Deployment To delete the deployed RayService, run: diff --git a/docs/guides/adding-models.md b/docs/guides/adding-models.md index 4bcfb42..008fcf5 100644 --- a/docs/guides/adding-models.md +++ b/docs/guides/adding-models.md @@ -30,8 +30,13 @@ class MyModel: async def __call__(self, request: Request): # Handle inference requests data = await request.json() - # Process data and return prediction - return {"prediction": result} + # Process data and return prediction + result = self.predict(data) + return {"prediction": result} + + def predict(self, data: dict): + # Replace with your own inference logic + return data app = MyModel.bind() ``` @@ -113,7 +118,7 @@ class ConfigurableModel: def __init__(self): self.model = load_model() - async def reconfigure(self, config: Config): + def reconfigure(self, config: Config): self.threshold = config["threshold"] self.batch_size = config["batch_size"] print(f"Reconfigured: threshold={self.threshold}") diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md index 7db1c35..703516a 100644 --- a/docs/guides/deployment-guide.md +++ b/docs/guides/deployment-guide.md @@ -36,7 +36,7 @@ class MyModel: async def __call__(self, request: Request): # Inference logic data = await request.json() - result = self.model.predict(data["input"]) + result = self.model.predict(data["input"]) # replace with your own inference call return {"prediction": result} app = MyModel.bind() @@ -229,7 +229,7 @@ autoscaling_config: 3. **`downscale_delay_s`**: - Keep this high (e.g., `600s`) to avoid "thrashing". It is cheaper to keep an idle replica for 10 minutes than to re-initialize a heavy model (loading weights, etc.) every time traffic dips for a minute. -For the exact formulas and definitions of these settings, see the [Configuration Reference](configuration-reference.md#2-autoscaling-configuration). +For the exact formulas and definitions of these settings, see the [Configuration Reference](configuration-reference.md#22-autoscaling-settings-what-they-actually-mean). ### High Availability @@ -430,7 +430,7 @@ kubectl port-forward -n [namespace] svc/rayservice-my-model-head-svc 8265:8265 4. **Resource Limits**: Always set limits to prevent resource hogging 5. **Gradual Rollout**: Update replicas gradually 6. 
**Documentation**: Document custom configurations -7. **Backup**: Keep backup of working configurations +7. **Backup**: Keep backups of working configurations ## Next Steps diff --git a/docs/index.md b/docs/index.md index b5ad660..b861842 100644 --- a/docs/index.md +++ b/docs/index.md @@ -58,7 +58,7 @@ Model Service is designed for: ### Get Started -- [**Quick Start**](get-started/quick-start.md): Deploy the reference empty model in minutes. +- [**Quick Start**](get-started/quick-start.md): Deploy the reference model in minutes. ### Guides From 09d88f443d6755b44bd833321141fb541358dbc7 Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 13:28:10 +0100 Subject: [PATCH 16/17] feat: add code examples --- docs/guides/deployment-guide.md | 40 +++++++++++++++++++++++++++++++-- 1 file changed, 38 insertions(+), 2 deletions(-) diff --git a/docs/guides/deployment-guide.md b/docs/guides/deployment-guide.md index 703516a..34d7083 100644 --- a/docs/guides/deployment-guide.md +++ b/docs/guides/deployment-guide.md @@ -36,7 +36,7 @@ class MyModel: async def __call__(self, request: Request): # Inference logic data = await request.json() - result = self.model.predict(data["input"]) # replace with your own inference call + result = self.model.predict(data["input"]) # replace with your own inference call return {"prediction": result} app = MyModel.bind() @@ -169,12 +169,48 @@ Ray scheduling relies on **Logical Resources**, while Kubernetes manages **Physi Defined in your code via `ray_actor_options`. These are abstract "slots" used for scheduling. - `num_cpus: 4`: The actor needs 4 slots to run. -- `memory: 4Gi`: The actor needs 4Gi of _tracked_ heap/object store memory. +- `memory: 4294967296` (bytes): Ray logical memory resource used for scheduling/admission control. + +Example (Python): + +```python +from ray import serve + +@serve.deployment( + ray_actor_options={ + "num_cpus": 4, + "memory": 4 * 1024**3, # bytes (4 GiB) + } +) +class MyModel: + ... + +app = MyModel.bind() +``` #### 2. Physical Resources (What Kubernetes gives) Defined in `ray-service.yaml` under `workerGroupSpecs`. This is the actual container capacity. +Example (Kubernetes YAML): + +```yaml +workerGroupSpecs: + - groupName: cpu-workers + replicas: 1 + template: + spec: + containers: + - name: ray-worker + resources: + requests: + cpu: 8 + memory: 16Gi + limits: + cpu: 12 + memory: 20Gi +``` + #### 3. The "Overhead" Gap Ray system processes (Raylet, Dashboard Agent, Plasma Store) consume physical CPU and Memory that is **not** accounted for in logical slots. From 111fac16838625ef00defc4b65d8443f327c5dde Mon Sep 17 00:00:00 2001 From: JiriStipek <567776@muni.cz> Date: Mon, 19 Jan 2026 13:28:27 +0100 Subject: [PATCH 17/17] fix: fox the formula --- docs/guides/configuration-reference.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/configuration-reference.md b/docs/guides/configuration-reference.md index e5a5547..602d502 100644 --- a/docs/guides/configuration-reference.md +++ b/docs/guides/configuration-reference.md @@ -100,7 +100,7 @@ See: [Queues and Backpressure](../architecture/queues-and-backpressure.md). **What it is:** The desired average number of **ongoing (in-flight)** requests per replica. This is the **primary scaling driver**. 
**Formula:**
-$$ \text{Desired Replicas} = \left\lceil \frac{\text{Total Ongoing Requests}}{\text{target\_ongoing\_requests}} \right\rceil $$
+$$ \text{Desired Replicas} = \left\lceil \frac{\text{Total Ongoing Requests}}{\text{target}\_\text{ongoing}\_\text{requests}} \right\rceil $$

**Note:** "Total Ongoing Requests" refers to the **concurrency** (number of requests currently being processed or waiting in the queue), _not_ the Requests Per Second (RPS).
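
As a quick sanity check of the formula, a few lines of Python reproduce the replica count from this file's own example (100 concurrent requests with a per-replica target of 20); the numbers are illustrative only:

```python
import math

total_ongoing_requests = 100   # concurrent (in-flight) requests, not RPS
target_ongoing_requests = 20   # configured per-replica target

desired_replicas = math.ceil(total_ongoing_requests / target_ongoing_requests)
print(desired_replicas)  # 5
```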