| 1 | +# IT-Stack Capacity Planning Guide |
| 2 | + |
| 3 | +**Document:** 15 |
| 4 | +**Location:** `docs/02-implementation/15-capacity-planning.md` |
| 5 | +**Last Updated:** March 2026 |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## Overview |
| 10 | + |
| 11 | +This document covers hardware sizing, service resource baselines, user-count projections, storage growth estimates, and scale-out plans for the IT-Stack platform. |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## Current Hardware Layout (8-Server Production) |
| 16 | + |
| 17 | +| Server | Role | CPU | RAM | Storage | Services | |
| 18 | +|--------|------|-----|-----|---------|----------| |
| 19 | +| lab-id1 | Identity | 4 vCPU | 16 GB | 100 GB SSD | FreeIPA, Keycloak | |
| 20 | +| lab-db1 | Database | 8 vCPU | 32 GB | 500 GB NVMe | PostgreSQL, Redis, Elasticsearch | |
| 21 | +| lab-app1 | Collaboration | 8 vCPU | 24 GB | 2 TB HDD + 100 GB SSD | Nextcloud, Mattermost, Jitsi | |
| 22 | +| lab-comm1 | Communications | 4 vCPU | 16 GB | 200 GB SSD | iRedMail, Zammad, Zabbix | |
| 23 | +| lab-proxy1 | Reverse Proxy | 2 vCPU | 8 GB | 50 GB SSD | Traefik, Graylog | |
| 24 | +| lab-pbx1 | VoIP | 2 vCPU | 8 GB | 100 GB SSD | FreePBX (Asterisk) | |
| 25 | +| lab-biz1 | Business | 8 vCPU | 24 GB | 200 GB SSD | SuiteCRM, Odoo, OpenKM | |
| 26 | +| lab-mgmt1 | IT Management | 4 vCPU | 16 GB | 100 GB SSD | Taiga, Snipe-IT, GLPI | |
| 27 | +| **Total** | | **40 vCPU** | **144 GB** | **~3.35 TB** | 20 services | |
| 28 | + |
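The Total row can be re-derived whenever a server is resized. The following is a minimal sketch, assuming the per-server figures from the table above; the dictionary layout and the `fleet_totals` name are illustrative and not part of the IT-Stack tooling.

```python
# Per-server specs copied from the hardware layout table.
# Storage is in GB; lab-app1 combines its 2 TB HDD and 100 GB SSD.
SERVERS = {
    "lab-id1":    {"vcpu": 4, "ram_gb": 16, "storage_gb": 100},
    "lab-db1":    {"vcpu": 8, "ram_gb": 32, "storage_gb": 500},
    "lab-app1":   {"vcpu": 8, "ram_gb": 24, "storage_gb": 2100},
    "lab-comm1":  {"vcpu": 4, "ram_gb": 16, "storage_gb": 200},
    "lab-proxy1": {"vcpu": 2, "ram_gb": 8,  "storage_gb": 50},
    "lab-pbx1":   {"vcpu": 2, "ram_gb": 8,  "storage_gb": 100},
    "lab-biz1":   {"vcpu": 8, "ram_gb": 24, "storage_gb": 200},
    "lab-mgmt1":  {"vcpu": 4, "ram_gb": 16, "storage_gb": 100},
}


def fleet_totals(servers):
    """Sum each resource column across all servers."""
    totals = {"vcpu": 0, "ram_gb": 0, "storage_gb": 0}
    for spec in servers.values():
        for key in totals:
            totals[key] += spec[key]
    return totals


TOTALS = fleet_totals(SERVERS)
print(TOTALS)
```

Keeping the inventory in one structure like this makes it easier to spot a Total row that has drifted from the per-server values.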
| 29 | +--- |
| 30 | + |
| 31 | +## Service Resource Baselines (Idle / Active / Peak) |
| 32 | + |
| 33 | +### Identity & Security |
| 34 | + |
| 35 | +| Service | RAM (idle) | RAM (active) | CPU (idle) | CPU (peak) | Notes | |
| 36 | +|---------|-----------|--------------|-----------|-----------|-------| |
| 37 | +| FreeIPA | 800 MB | 1.2 GB | < 5% | 30% | LDAP operations spike | |
| 38 | +| Keycloak | 512 MB | 1.0 GB | < 5% | 40% | Login storms at day start | |
| 39 | + |
| 40 | +### Database & Cache |
| 41 | + |
| 42 | +| Service | RAM (idle) | RAM (active) | Storage growth | Notes | |
| 43 | +|---------|-----------|--------------|---------------|-------| |
| 44 | +| PostgreSQL | 2 GB | 4–8 GB | ~5 GB/month (100 users) | `shared_buffers` = 8 GB on lab-db1 | |
| 45 | +| Redis | 256 MB | 512 MB | Bounded by `maxmemory` | Session cache + queues | |
| 46 | +| Elasticsearch | 4 GB | 6 GB | ~10 GB/month | Log index ILM: 30-day retention | |
| 47 | + |
| 48 | +### Collaboration |
| 49 | + |
| 50 | +| Service | RAM (idle) | RAM (active) | Storage (user data) | Notes | |
| 51 | +|---------|-----------|--------------|---------------------|-------| |
| 52 | +| Nextcloud | 512 MB | 1.5 GB PHP | ~5 GB/user (generous) | PHP-FPM pool | |
| 53 | +| Mattermost | 256 MB | 512 MB | ~1 GB/year (100 users) | Binary: very efficient | |
| 54 | +| Jitsi | 1 GB | 3–8 GB | Minimal | JVB spikes on large calls | |
| 55 | + |
| 56 | +### Communications |
| 57 | + |
| 58 | +| Service | RAM (idle) | RAM (active) | Notes | |
| 59 | +|---------|-----------|--------------|-------| |
| 60 | +| iRedMail | 512 MB | 1 GB | Postfix + Dovecot + SpamAssassin | |
| 61 | +| Zammad | 1 GB | 2 GB | Rails + Elasticsearch | |
| 62 | +| Zabbix | 512 MB | 1 GB | Agent data volume scales with host count | |
| 63 | + |
| 64 | +### Business Systems |
| 65 | + |
| 66 | +| Service | RAM (idle) | RAM (active) | Notes | |
| 67 | +|---------|-----------|--------------|-------| |
| 68 | +| SuiteCRM | 256 MB | 512 MB PHP | PHP-FPM; scales with concurrent users | |
| 69 | +| Odoo | 512 MB | 1.5 GB | Multi-worker; rule of thumb is (2 × CPU cores) + 1 workers | |
| 70 | +| OpenKM | 512 MB | 1 GB | Tomcat JVM | |
| 71 | + |
| 72 | +### IT Management |
| 73 | + |
| 74 | +| Service | RAM (idle) | RAM (active) | Notes | |
| 75 | +|---------|-----------|--------------|-------| |
| 76 | +| Taiga | 256 MB | 512 MB | Django/Gunicorn | |
| 77 | +| Snipe-IT | 256 MB | 512 MB | Laravel/PHP-FPM | |
| 78 | +| GLPI | 256 MB | 512 MB | PHP-FPM | |
| 79 | + |
| 80 | +--- |
| 81 | + |
| 82 | +## User Count Projections |
| 83 | + |
| 84 | +### Tier: 50 Users (Current production sizing — comfortable headroom) |
| 85 | + |
| 86 | +| Resource | Current Capacity | 50-user Usage | Headroom | |
| 87 | +|----------|-----------------|--------------|---------| |
| 88 | +| Keycloak sessions | Unbounded | ~200 active | Large | |
| 89 | +| PostgreSQL connections | 1000 max | ~150 | 85% | |
| 90 | +| Redis memory | 4 GB | ~500 MB | 87% | |
| 91 | +| Nextcloud storage | 2 TB | ~250 GB | 87% | |
| 92 | +| Email inbox | 200 GB | ~25 GB | 87% | |
| 93 | +| lab-db1 RAM | 32 GB | ~12 GB | 62% | |
| 94 | +| lab-app1 RAM | 24 GB | ~18 GB | 25% | |
| 95 | + |
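The Headroom column appears to be the unused share of capacity, rounded down to a whole percent. A minimal sketch of that calculation, assuming both arguments use the same unit (connections, GB, and so on); `headroom_pct` is an illustrative name:

```python
def headroom_pct(capacity, usage):
    """Return unused capacity as a percentage, rounded down.

    Both arguments must be in the same unit. Integer inputs are
    expected so the floor division stays exact.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - usage) * 100 // capacity


# Rows from the 50-user table: PostgreSQL connections and lab-app1 RAM (GB).
print(headroom_pct(1000, 150))
print(headroom_pct(24, 18))
```

Rounding down is a deliberately conservative choice: it never reports more headroom than actually exists.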
| 96 | +### Tier: 100 Users (Standard recommendation: no compute upgrades needed, storage expansion only) |
| 97 | + |
| 98 | +| Resource | 100-user Usage | Notes | |
| 99 | +|----------|---------------|-------| |
| 100 | +| PostgreSQL connections | ~300 | PgBouncer recommended at 200+ | |
| 101 | +| Nextcloud storage | ~500 GB | Increase lab-app1 storage to 3 TB | |
| 102 | +| lab-db1 RAM | ~20 GB | Still within 32 GB limit | |
| 103 | +| lab-app1 RAM | ~22 GB | At capacity; tune PHP workers | |
| 104 | +| Mattermost | ~1 GB | Minimal impact | |
| 105 | + |
| 106 | +**Actions at 100 users:** |
| 107 | +1. Enable PgBouncer connection pooling on lab-db1 |
| 108 | +2. Increase the Nextcloud PHP-FPM `pm.max_children` setting to 30 |
| 109 | +3. Expand lab-app1 data volume to 3 TB |
| 110 | +4. Raise the Elasticsearch JVM heap from 4 GB to 8 GB |
| 111 | + |
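A common rule of thumb for sizing `pm.max_children` is to divide the memory budgeted for PHP-FPM by the average memory of one worker. The sketch below assumes the 1.5 GB active PHP figure from the Collaboration baseline and a hypothetical 50 MB per worker; measure the real per-worker footprint on lab-app1 before relying on the result.

```python
def estimate_max_children(php_budget_mb, avg_worker_mb):
    """Estimate a PHP-FPM pm.max_children value from a memory budget.

    php_budget_mb: RAM reserved for the PHP-FPM pool, in MB.
    avg_worker_mb: measured average resident memory per worker, in MB.
    """
    if avg_worker_mb <= 0:
        raise ValueError("avg_worker_mb must be positive")
    # Always allow at least one worker, even with a tiny budget.
    return max(1, php_budget_mb // avg_worker_mb)


# Assumption: 1.5 GB budget (1536 MB) and ~50 MB per worker (hypothetical).
print(estimate_max_children(1536, 50))
```

This is only an estimate; heavier Nextcloud apps raise the per-worker footprint and lower the safe worker count.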
| 112 | +### Tier: 200 Users (Scale-out required for Jitsi and Nextcloud) |
| 113 | + |
| 114 | +| Component | Required Action | |
| 115 | +|-----------|----------------| |
| 116 | +| lab-app1 | Upgrade to 16 vCPU / 48 GB RAM, or split Nextcloud to dedicated VM | |
| 117 | +| lab-db1 | Upgrade to 16 vCPU / 64 GB RAM; increase `shared_buffers` to 16 GB | |
| 118 | +| Jitsi | Add second JVB (Jitsi Videobridge) node for concurrent calls | |
| 119 | +| FreePBX | Add second Asterisk node for concurrent call handling | |
| 120 | +| Elasticsearch | Increase heap to 16 GB or add data node | |
| 121 | + |
| 122 | +**Estimated storage at 200 users (Year 1):** |
| 123 | +``` |
| 124 | +Nextcloud: 200 users × 5 GB = 1.0 TB (files) |
| 125 | +Email: 200 users × 500 MB = 100 GB |
| 126 | +PostgreSQL: ~50 GB total (all databases) |
| 127 | +Elasticsearch: ~120 GB (logs, 30-day retention) |
| 128 | +Media (Jitsi): ~200 GB (recordings, if enabled) |
| 129 | +Total: ~1.5 TB |
| 130 | +``` |
| 131 | + |
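The estimate above can be reproduced with a short calculation. This is a sketch under the same assumptions as the listing (5 GB of Nextcloud files and 500 MB of email per user, plus the fixed PostgreSQL, Elasticsearch, and Jitsi figures for the 200-user tier); the function and parameter names are illustrative.

```python
def year_one_storage_gb(users,
                        nextcloud_gb_per_user=5,
                        email_gb_per_user=0.5,
                        postgres_gb=50,
                        elasticsearch_gb=120,
                        jitsi_recordings_gb=200):
    """Return a per-component Year 1 storage breakdown in GB.

    The fixed components default to the 200-user estimates, so only
    the per-user parts scale when a different user count is passed.
    """
    breakdown = {
        "nextcloud": users * nextcloud_gb_per_user,
        "email": users * email_gb_per_user,
        "postgresql": postgres_gb,
        "elasticsearch": elasticsearch_gb,
        "jitsi": jitsi_recordings_gb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


print(year_one_storage_gb(200))
```

Set `jitsi_recordings_gb=0` if recordings are disabled, since that component only applies when recording is enabled.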
| 132 | +### Tier: 500 Users (Enterprise — multi-node required) |
| 133 | + |
| 134 | +| Service | Scale-out approach | |
| 135 | +|---------|--------------------| |
| 136 | +| Keycloak | Active-active cluster (2 nodes, PostgreSQL backend) | |
| 137 | +| PostgreSQL | Primary + 2 read replicas + PgBouncer pool | |
| 138 | +| Nextcloud | Dedicated app server (8 vCPU / 32 GB) + object storage backend | |
| 139 | +| Mattermost | Add Enterprise features or cluster mode | |
| 140 | +| Jitsi | JVB cluster (3–5 nodes) | |
| 141 | +| FreePBX | Asterisk cluster with shared filesystem | |
| 142 | +| Elasticsearch | 3-node data cluster | |
| 143 | +| Redis | Sentinel or Cluster mode | |
| 144 | + |
| 145 | +### Tier: 1,000+ Users |
| 146 | + |
| 147 | +At 1,000+ users, the architecture shifts to a **microservice-per-cluster** pattern: |
| 148 | +- All stateful services on dedicated nodes |
| 149 | +- Load balancer tier in front of Traefik |
| 150 | +- Object storage (MinIO or Azure Blob) for Nextcloud/Mattermost files |
| 151 | +- Dedicated monitoring cluster (Zabbix + Graylog on separate hardware) |
| 152 | +- PostgreSQL HA with Patroni + PgBouncer + HAProxy |
| 153 | + |
| 154 | +--- |
| 155 | + |
| 156 | +## Storage Growth Projections |
| 157 | + |
| 158 | +| Storage Type | Per-user/month | 50 users/year | 100 users/year | 200 users/year | |
| 159 | +|-------------|---------------|--------------|---------------|---------------| |
| 160 | +| Nextcloud files | 500 MB | 300 GB | 600 GB | 1.2 TB | |
| 161 | +| Email (Dovecot) | 200 MB | 120 GB | 240 GB | 480 GB | |
| 162 | +| PostgreSQL | ~15 MB | 10 GB | 18 GB | 36 GB | |
| 163 | +| Elasticsearch logs | N/A (time-based) | 100 GB | 100 GB | 200 GB | |
| 164 | +| FreePBX recordings | 100 MB | 60 GB | 120 GB | 240 GB | |
| 165 | +| Zabbix history | N/A (host-based) | 20 GB | 25 GB | 35 GB | |
| 166 | +| **Total (Year 1)** | | **~610 GB** | **~1.1 TB** | **~2.2 TB** | |
| 167 | + |
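The per-user rows in the table follow a simple product: monthly growth per user × users × 12 months. A minimal sketch, assuming decimal units (1 GB = 1000 MB) and the per-user rates from the table; the time-based and host-based rows (Elasticsearch, Zabbix) are left out because they do not scale per user.

```python
# Per-user monthly growth in MB, copied from the table.
PER_USER_MB_PER_MONTH = {
    "nextcloud": 500,
    "email": 200,
    "postgresql": 15,
    "freepbx": 100,
}


def yearly_growth_gb(users, months=12):
    """Project storage growth in GB for each per-user storage type."""
    return {
        name: mb * users * months / 1000
        for name, mb in PER_USER_MB_PER_MONTH.items()
    }


for tier in (50, 100, 200):
    print(tier, yearly_growth_gb(tier))
```

The PostgreSQL rate in the table is approximate (~15 MB), so the computed value for 50 users comes out slightly below the rounded figure shown in the table.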
| 168 | +**Storage expansion triggers:** |
| 169 | +- lab-app1 (Nextcloud): Add storage when `/var/lib/nextcloud/data` reaches 75% capacity |
| 170 | +- lab-db1 (PostgreSQL): Add disk when database volume reaches 70% capacity; consider tablespace migration |
| 171 | +- lab-db1 (Elasticsearch, which stores the Graylog indices from lab-proxy1): the ILM policy rolls over and deletes indices automatically; monitor disk usage with the Kibana Disk Gauge |
| 172 | + |
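The percentage triggers above can be checked with a small script. This sketch uses Python's standard `shutil.disk_usage`; the function names are illustrative, and any path other than `/var/lib/nextcloud/data` would be an assumption about where the volume is mounted.

```python
import shutil


def used_pct(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes * 100


def check_volume(path, threshold_pct):
    """Return (needs_expansion, used_percent) for the volume holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    pct = used_pct(usage.total, usage.used)
    return pct >= threshold_pct, pct


# Example call on lab-app1 (path taken from the trigger list, 75% threshold):
#   check_volume("/var/lib/nextcloud/data", 75)
```

The percentage may differ slightly from `df` output on filesystems that reserve blocks for root, so treat the thresholds as approximate.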
| 173 | +--- |
| 174 | + |
| 175 | +## Azure VM Sizing Reference |
| 176 | + |
| 177 | +If running on Azure instead of physical hardware (see [Azure Lab Guide](#) for details): |
| 178 | + |
| 179 | +### Option B: Single VM (Labs 01–05, up to ~50 users) |
| 180 | + |
| 181 | +| VM Size | vCPU | RAM | Disk | Cost/hr | Recommended for | |
| 182 | +|---------|------|-----|------|---------|----------------| |
| 183 | +| Standard_E16s_v4 | 16 | 128 GB | P30 × 1 | ~$1.01 | Lab testing, < 25 users | |
| 184 | +| Standard_E32s_v4 | 32 | 256 GB | P30 × 2 | ~$2.02 | All services, < 50 users | |
| 185 | + |
| 186 | +### Option A: 8-VM Production (Full Lab 06 stack) |
| 187 | + |
| 188 | +| Server | Azure VM | vCPU | RAM | Disk | est. Cost/hr | |
| 189 | +|--------|----------|------|-----|------|-------------| |
| 190 | +| lab-id1 | Standard_D4s_v4 | 4 | 16 GB | P10 | $0.19 | |
| 191 | +| lab-db1 | Standard_E8s_v4 | 8 | 64 GB | P30 | $0.50 | |
| 192 | +| lab-app1 | Standard_D8s_v4 | 8 | 32 GB | P30 | $0.38 | |
| 193 | +| lab-comm1 | Standard_D4s_v4 | 4 | 16 GB | P10 | $0.19 | |
| 194 | +| lab-proxy1 | Standard_D2s_v4 | 2 | 8 GB | P10 | $0.10 | |
| 195 | +| lab-pbx1 | Standard_D2s_v4 | 2 | 8 GB | P10 | $0.10 | |
| 196 | +| lab-biz1 | Standard_D8s_v4 | 8 | 32 GB | P30 | $0.38 | |
| 197 | +| lab-mgmt1 | Standard_D4s_v4 | 4 | 16 GB | P10 | $0.19 | |
| 198 | +| **Total** | | **40** | **192 GB** | | **~$2.03/hr** | |
| 199 | + |
| 200 | +At 8 hours/day: ~$485/month (pay-as-you-go), or ~$218/month if all VMs run as Spot VMs (~55% savings) |
| 201 | + |
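The monthly figure follows from the hourly rates in the table. A minimal sketch, assuming 30 billing days per month and a flat discount applied to every VM; the function name and the `discount` parameter are illustrative, and real Spot pricing varies by region and over time.

```python
# Estimated hourly pay-as-you-go rates (USD) from the 8-VM table.
HOURLY_USD = {
    "lab-id1": 0.19,
    "lab-db1": 0.50,
    "lab-app1": 0.38,
    "lab-comm1": 0.19,
    "lab-proxy1": 0.10,
    "lab-pbx1": 0.10,
    "lab-biz1": 0.38,
    "lab-mgmt1": 0.19,
}


def monthly_cost(hourly_rates, hours_per_day=8, days=30, discount=0.0):
    """Estimate monthly cost in USD for VMs running part of each day.

    discount is a fraction (0.55 means 55% cheaper than list price).
    """
    hourly_total = sum(hourly_rates.values())
    return hourly_total * hours_per_day * days * (1 - discount)


print(round(monthly_cost(HOURLY_USD), 2))
print(round(monthly_cost(HOURLY_USD, discount=0.55), 2))
```

With 30 days assumed, both results land within a few dollars of the rounded figures quoted above. To model a mixed fleet, call the function separately for the Spot and pay-as-you-go subsets and add the results.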
| 202 | +- **Use Azure Spot VMs for:** lab-app1, lab-biz1, lab-mgmt1 |
| 203 | +- **Keep pay-as-you-go for:** lab-id1, lab-db1, lab-proxy1 (stateful services that can't tolerate eviction) |
| 204 | + |
| 205 | +--- |
| 206 | + |
| 207 | +## Scale-Out Plan Per Service |
| 208 | + |
| 209 | +| Service | First Scale-Out | Second Scale-Out | Max Tested | |
| 210 | +|---------|----------------|-----------------|-----------| |
| 211 | +| FreeIPA | Add replica (lab-id2) at 300+ users | Multi-site DNS delegation | 10,000 users | |
| 212 | +| Keycloak | Cluster mode (`KC_CACHE_STACK=kubernetes` or `jdbc-ping`) | Add KC node | 50,000 sessions | |
| 213 | +| PostgreSQL | Read replica for Nextcloud/Zammad at 200+ users | Patroni HA cluster | 10,000 conn/s | |
| 214 | +| Redis | Redis Sentinel at 500+ users | Redis Cluster at 1000+ | Horizontally scalable | |
| 215 | +| Nextcloud | External object storage (S3/MinIO) at 500 GB+ | Horizontal app nodes | PB scale | |
| 216 | +| Mattermost | Enterprise cluster at 1000+ users | Add nodes | 10,000 users/node | |
| 217 | +| Jitsi | Second JVB at 100+ concurrent calls | JVB auto-scaling | 1000+ concurrent | |
| 218 | +| Elasticsearch | Add data node at 200 GB index size | Add coordinating node | Horizontally scalable | |
| 219 | +| Zabbix | Zabbix Proxy for remote sites | HA server pair | 10,000 hosts | |
| 220 | +| Graylog | Add processing node at 10K msg/sec | Elasticsearch cluster | Horizontally scalable | |
| 221 | + |
| 222 | +--- |
| 223 | + |
| 224 | +## Performance Benchmarks (Reference) |
| 225 | + |
| 226 | +These values were measured on the 8-server layout with 25 concurrent users: |
| 227 | + |
| 228 | +| Metric | Value | Tool | |
| 229 | +|--------|-------|------| |
| 230 | +| Nextcloud file upload (100 MB) | 12 MB/s | Nextcloud client | |
| 231 | +| Mattermost message throughput | 500 msg/sec | Artillery | |
| 232 | +| Keycloak login latency (p95) | 280 ms | k6 | |
| 233 | +| PostgreSQL query latency (simple SELECT, p99) | 2 ms | pgbench | |
| 234 | +| Zabbix check interval (10,000 items) | 30 sec | Zabbix internal | |
| 235 | +| Traefik request latency (p95) | 8 ms | Prometheus histogram | |
| 236 | +| Jitsi call quality (5-person) | 4.2 MOS | Jitsi test | |
| 237 | + |
| 238 | +--- |
| 239 | + |
| 240 | +*Generated by IT-Stack project. See `claude.md` for full project context.* |