NetAI Stack SE

On-Premise AI Infrastructure for Intel Arc GPUs

Data Sovereignty — Your Data Stays Yours

NetAI Stack SE is built for organizations that cannot afford cloud data leakage. Law firms, SMEs, and compliance-driven businesses run this stack entirely on-premise:

No cloud inference: All LLM queries execute locally on your Intel Arc Pro B50.
No telemetry: No data leaves your network unless you explicitly configure external integrations.
GDPR-ready: Patient/client data, legal documents, and internal knowledge bases remain under your physical control. Enhanced with Microsoft Presidio for automatic PII redaction (German + English).
EU AI Act compliant: Full transparency documentation and prompt injection protection via Mezzo-Prompt-Guard-v2-Base on isolated iGPU.
Dual-GPU Partitioning: Compliance services (PII-Guard, Security-Guard) offloaded to Alder Lake iGPU to preserve Battlemage dGPU VRAM for the main LLM.

Compliance Features

GDPR Compliance (DSGVO)

The stack includes automated PII detection and redaction:

PII-Guard Service: Intercepts all web search queries before they reach SearXNG.
Microsoft Presidio: Detects names, locations, IBANs, phone numbers, emails, and more. Optimized for German and English.
Endpoints: /redact (standalone redaction), /search (redact + forward to SearXNG), /search/redacted (debug/compliance view).
Automatic Redaction: Replaces PII with generic placeholders (e.g., [REDACTED_NAME], [REDACTED_LOCATION]).
Data Minimization: GDPR Article 5 compliance headers on proxied responses (X-PIGuard-Redacted, X-PIGuard-Compliance).
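
The placeholder replacement described above can be illustrated with a stdlib-only sketch. It is not the PII-Guard implementation: the service relies on Microsoft Presidio, while this example swaps in two simple regular expressions. The function name `redact`, the pattern set, and the `[REDACTED_EMAIL]` / `[REDACTED_IBAN]` labels are assumptions modeled on the `[REDACTED_NAME]` convention.

```python
import re

# Hypothetical patterns for illustration; the real service relies on
# Microsoft Presidio recognizers, not on these two regexes.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,4})?\b"),
}


def redact(text: str) -> tuple[str, int]:
    """Replace each match with a [REDACTED_<TYPE>] placeholder and count replacements."""
    total = 0
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED_{label}]", text)
        total += count
    return text, total
```

The returned count corresponds conceptually to the value reported in the X-PIGuard-Redacted header.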

EU AI Act Compliance (Article 52)

The stack includes AI safety measures for professional use:

Security-Guard Service: Filters all inference requests for prompt injection before they reach the Cascade LLM.
Mezzo-Prompt-Guard-v2-Base (IQ4_XS): Highly specialized safety model (~450 MB VRAM) running on the iGPU.
Fallback: Heuristic regex-based classifier when llama-server is unavailable.
Prompt Injection Detection: Blocks jailbreak attempts, system prompt extraction, and malicious code.
Human-in-the-Loop: All blocked requests are logged for review with transparency metadata.
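
A regex-based fallback of the kind mentioned above might look like the following minimal sketch. The patterns, the category names, and the function name `classify` are hypothetical; the rules used by the actual fallback classifier are not documented here. The return value mirrors the `safe` / `category` / `score` fields of the `/classify` response.

```python
import re

# Hypothetical patterns for illustration only; the patterns used by the
# real fallback classifier are not documented here.
INJECTION_PATTERNS = {
    "jailbreak": re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    "system_prompt_extraction": re.compile(r"(reveal|show|print).{0,40}system prompt", re.I),
}


def classify(prompt: str) -> dict:
    """Return a result shaped like the /classify response: safe, category, score."""
    for category, pattern in INJECTION_PATTERNS.items():
        if pattern.search(prompt):
            return {"safe": False, "category": category, "score": 1.0}
    return {"safe": True, "category": "benign", "score": 0.0}
```

A heuristic like this only catches phrasings it has a pattern for, which is why it serves as a fallback rather than the primary classifier.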

Architecture

User → Caddy (:443) → LibreChat (:3080/chat) → Security-Guard (:8778) → Cascade LLM (:3000) ─┬─ small (low complexity) → Auxiliary LFM2.5-VL-1.6B (:8082)
                                                                                         ├─ large (high complexity) → Inference Qwen3.6-35B (:8080)
                                                                                         └─ confidence < 0.7        → reroute to Inference (fallback)
                   → SearXNG (:8080) via PII-Guard (:8777)
                   → Hermes Agent (:9119 dashboard, :8642 API) ───→ Cascade LLM (:3000) [direct, bypasses Security-Guard]
                   → Hermes WebUI (:8787)
                   → Beszel (:8090 monitoring)
                   → SuperTonic TTS (:8800)
                   → Parakeet STT (:5092)
                   → Auth-Validator (:8081) → LibreChat (:3080) [forward_auth backend]

Core Services

Service	Container	Port(s)	Purpose
Inference	`netai-inference`	8080	llama.cpp with Intel SYCL, hosts Qwen3.6-35B (large model)
Auxiliary	`netai-auxiliary`	8082	llama.cpp with Intel SYCL, hosts LFM2.5-VL-1.6B (small multimodal model)
Cascade LLM	`netai-cascade-llm`	3000	Routes requests by complexity + confidence between small/large models
Frontend	`netai-librechat`	3080	LibreChat (chat interface, RAG, multi-model, MCP)
Agent	`netai-hermes`	8642, 9119	Hermes Agent (Telegram bot + dashboard)
Hermes WebUI	`netai-hermes-webui`	8787	Full web interface for Hermes
Reverse Proxy	`netai-caddy`	80, 443	TLS termination, path-based routing, URL rewriting (`replace-response` module)
Web Search	`netai-searxng`	8080	Privacy-respecting meta search engine
PII-Guard	`netai-pii-guard`	8777	GDPR PII redaction proxy between LibreChat and SearXNG
Security-Guard	`netai-security-guard`	8778	Mezzo Prompt Guard (+ internal llama-server :8779) on iGPU
TTS	`netai-tts`	8800	SuperTonic TTS (OpenAI-compatible)
STT	`netai-stt`	5092	Parakeet TDT (OpenAI-compatible)
Knowledge Graph	`netai-lightrag`	8020	LightRAG — graph-enhanced RAG with entity extraction, hybrid search (local/global/naive)
Monitoring	`netai-beszel`	8090	Lightweight system monitoring (CPU, RAM, disk, Docker, GPU)
Auth-Validator	`netai-auth-validator`	8081	Validates LibreChat sessions for Caddy forward_auth

Cascade LLM — Model Router

The Cascade LLM (netai-cascade-llm, port 3000) is the central routing layer between all clients (LibreChat via Security-Guard, Hermes Agent directly) and the two inference backends.

Routing Architecture

Request → Cascade LLM
            │
            ├─ has_image?
            │   ├─ Yes + complexity > 0.5  → Large Multimodal (Qwen3.6-35B)
            │   └─ Yes + complexity ≤ 0.5  → Small Multimodal (LFM2.5-VL-1.6B)
            │
            └─ text-only?
                ├─ complexity > 0.5  → Large Text (Qwen3.6-35B)
                └─ complexity ≤ 0.5  → Small Model + confidence check
                                        │
                                        ├─ confidence ≥ 0.7 → keep small response
                                        └─ confidence < 0.7 → reroute to Large Text
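
The decision tree can be written as a small function. This is a sketch of the diagram only, not the router's source; the name `choose_backend` and the returned labels are assumptions.

```python
def choose_backend(has_image: bool, complexity: float, threshold: float = 0.5) -> str:
    """Map a request to a backend label following the routing tree as drawn."""
    if complexity > threshold:
        return "large_multimodal" if has_image else "large_text"
    # At or below the threshold the small model answers. The tree shows the
    # confidence check on the text-only branch.
    return "small_multimodal" if has_image else "small_with_confidence_check"
```

For example, `choose_backend(False, 0.5)` stays on the small model because the cutoff is strict (`>`), matching the `≤ 0.5` branch of the diagram.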

Step 1: Complexity Scoring

evaluate_complexity() computes a score between 0.0 and 1.0 from two equally weighted factors:

Factor	Weight	Details
Message length	50%	`min(chars / 1000, 1.0)`
Keywords	50%	`analyze deeply`, `write code`, `expert`, `reasoning`, `logic`, `complex` (+0.2 each, capped at 1.0)
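
One way to read the table is sketched below. The real `evaluate_complexity()` may differ: this version assumes that each keyword counts once, that the keyword subscore is capped at 1.0 before weighting, and that the two factors are summed with equal weights.

```python
KEYWORDS = ("analyze deeply", "write code", "expert", "reasoning", "logic", "complex")


def evaluate_complexity(message: str) -> float:
    """Score a message from 0.0 to 1.0 using length and keyword hits."""
    length_score = min(len(message) / 1000, 1.0)
    text = message.lower()
    # Assumption: each keyword counts once, regardless of repetitions.
    hits = sum(1 for keyword in KEYWORDS if keyword in text)
    keyword_score = min(0.2 * hits, 1.0)
    # Assumption: the two factors are combined as an equal-weight sum.
    return 0.5 * length_score + 0.5 * keyword_score
```

Under these assumptions a short prompt needs several keyword hits to cross the default cutoff of 0.5, while a prompt of 1000 or more characters contributes 0.5 from length alone and crosses it with a single keyword.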

Step 2: Confidence-Based Fallback (non-streaming only)

When the small model is selected for a non-streaming request:

Request is forwarded with logprobs: true enabled
Cascade LLM receives the full response and extracts token-level log probabilities
confidence = exp(mean(token_logprobs))
If confidence ≥ CONFIDENCE_THRESHOLD (default 0.7): the small model response is returned immediately with an x-confidence header
If confidence < CONFIDENCE_THRESHOLD: the small model response is discarded and the original request is rerouted to the large model

This catches cases where the small model is uncertain — ambiguous queries, domain-specific questions, or edge cases the complexity heuristic misjudged.
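
The confidence formula is the geometric mean of the per-token probabilities. A minimal sketch follows; the function names are assumptions, as is the handling of an empty token list. In the OpenAI chat completions format the per-token values appear under `choices[0].logprobs.content[*].logprob`; whether the router reads them from exactly that path is also an assumption.

```python
import math


def confidence_from_logprobs(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp(mean(token_logprobs))."""
    if not token_logprobs:
        return 0.0  # Assumption: an empty response counts as zero confidence.
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def keep_small_response(token_logprobs: list[float], threshold: float = 0.7) -> bool:
    """True if the small model's answer meets the confidence threshold."""
    return confidence_from_logprobs(token_logprobs) >= threshold
```

Because the mean is taken in log space, a few very unlikely tokens pull the score down sharply, which is the behavior wanted for detecting an uncertain answer.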

Configurable Environment Variables

Env Var	Default	Purpose
`SMALL_MLLM_URL`	`http://netai-auxiliary:8080/v1/chat/completions`	Small multimodal model endpoint
`LARGE_MLLM_URL`	`http://netai-inference:8080/v1/chat/completions`	Large multimodal model endpoint
`LARGE_TEXT_URL`	`http://netai-inference:8080/v1/chat/completions`	Large text-only model endpoint
`ROUTER_THRESHOLD`	`0.5`	Complexity cutoff (0.0–1.0)
`CONFIDENCE_THRESHOLD`	`0.7`	Minimum confidence to keep small model response (0.0–1.0)
`LARGE_MODEL_MULTIMODAL`	`true`	Whether the large model supports images
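
Reading these variables with their documented defaults could look like the sketch below. The `RouterConfig` class and `load_config` function are hypothetical names, and the boolean parsing rule is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterConfig:
    small_mllm_url: str
    large_mllm_url: str
    large_text_url: str
    router_threshold: float
    confidence_threshold: float
    large_model_multimodal: bool


def load_config(env=os.environ) -> RouterConfig:
    """Read the documented variables, falling back to the documented defaults."""
    large = "http://netai-inference:8080/v1/chat/completions"
    return RouterConfig(
        small_mllm_url=env.get("SMALL_MLLM_URL", "http://netai-auxiliary:8080/v1/chat/completions"),
        large_mllm_url=env.get("LARGE_MLLM_URL", large),
        large_text_url=env.get("LARGE_TEXT_URL", large),
        router_threshold=float(env.get("ROUTER_THRESHOLD", "0.5")),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.7")),
        # Assumption: any casing of "true" enables the flag.
        large_model_multimodal=env.get("LARGE_MODEL_MULTIMODAL", "true").lower() == "true",
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.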

Data Flows

LLM Inference (AI Act Protected)

LibreChat sends user prompt to Security-Guard.
Security-Guard classifies prompt via Mezzo-Prompt-Guard (running on iGPU).
SAFE: Prompt forwarded to Cascade LLM.
Cascade LLM evaluates complexity → routes to small or large model.
If small model: confidence check via logprobs → may reroute to large.
Response returned through Security-Guard → LibreChat → User.
UNSAFE: Blocked with 403 + audit log.

Hermes Agent (Bypasses Security-Guard)

Hermes Agent sends directly to Cascade LLM (avoids duplicating safety checks).
Cascade LLM routes by complexity and confidence as described above.
Hermes uses LFM2.5-VL-1.6B on auxiliary for context compression.
Web search goes through MCP server → PII-Guard → SearXNG.

Web Search with PII Redaction

LibreChat → PII-Guard (:8777) → SearXNG (:8080)
                ↓
         Presidio Analyzer (spaCy EN+DE, custom regex)
                ↓
         Anonymized query + compliance headers

Hermes Agent Web Search via MCP Server

Hermes Agent uses an MCP (Model Context Protocol) server for web search functionality:

Hermes Agent → MCP Server (stdio) → PII-Guard (:8777) → SearXNG (:8080)

MCP Server: config/hermes-agent/mcp-server-search.py (Python stdlib, JSON-RPC 2.0)
Tool: web_search(query, limit) — routes through PII-Guard for GDPR compliance
Registered in config/hermes-agent/config.yaml under mcp_servers.netai-search
Chosen because Hermes' built-in web search only works with supported LLM API providers (OpenAI, Anthropic, etc.)
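
For orientation, handling a single `tools/call` request over JSON-RPC 2.0 can be sketched as follows. This is not the content of mcp-server-search.py: the function name `handle_request`, the injected `search` callable, and the default limit are assumptions, and a full MCP server also has to answer `initialize` and `tools/list`.

```python
import json


def handle_request(message: dict, search) -> dict:
    """Answer one JSON-RPC 2.0 request; `search` is a callable(query, limit) -> list."""
    request_id = message.get("id")
    params = message.get("params") or {}
    if message.get("method") == "tools/call" and params.get("name") == "web_search":
        args = params.get("arguments", {})
        # Assumption: a default limit of 5 when the caller omits it.
        results = search(args["query"], args.get("limit", 5))
        content = [{"type": "text", "text": json.dumps(results)}]
        return {"jsonrpc": "2.0", "id": request_id, "result": {"content": content}}
    # -32601 is the JSON-RPC 2.0 "Method not found" error code.
    error = {"code": -32601, "message": "Method not found"}
    return {"jsonrpc": "2.0", "id": request_id, "error": error}
```

In the stack, the `search` callable would be the part that sends the query to PII-Guard rather than to SearXNG directly, so that redaction cannot be skipped.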

Quick Start

1. Prerequisites

Ubuntu 24.04 LTS (kernel 6.8+)
Intel Arc Pro B50 (Battlemage) GPU
Model files in models/Qwen3.6/:
- Qwen3.6-35B-A3B-UD-IQ2_M.gguf
- mmproj-F16.gguf
Model file in models/LFM2.5/:
- LFM2.5-VL-1.6B-UD-IQ4_XS.gguf
Model file in models/Mezzo-Prompt_guard-v2-Base/:
- Mezzo-Prompt-Guard-v2-Base.IQ4_XS.gguf
Model file in models/Parakeet/ (auto-downloaded on startup)

2. Configure Environment

cp .env.example .env
# Edit .env and set:
#   DOMAIN, ADMIN_EMAIL, ADMIN_PASSWORD, TELEGRAM_TOKEN (optional)
#   SSL_CERT_PATH and SSL_KEY_PATH (Let's Encrypt paths)
#   SEARXNG_SECRET (generate with: openssl rand -hex 32)
#   LIBRECHAT_JWT_SECRET and JWT_REFRESH_SECRET (generate with: openssl rand -hex 32)
#   BESZEL_KEY and BESZEL_TOKEN (generate with: openssl rand -hex 32)

3. Run Setup

./setup.sh

This will:

Update system packages
Install Intel GPU drivers (intel-opencl-icd, intel-level-zero-gpu, level-zero)
Install Docker & Docker Compose if missing
Add your user to render and video groups
Auto-detect Intel GPU DRI devices and write them to .env
Validate that model files are present
Validate SSL certificates

Log out and back in after setup completes so group membership takes effect.

4. Build Custom Images

docker compose build auth-validator caddy security-guard pii-guard speech-stt

5. Start Services

docker compose up -d

6. Set Up Beszel Monitoring

bash scripts/setup-beszel.sh

This auto-creates the admin account using ADMIN_EMAIL / ADMIN_PASSWORD from .env.

7. Register Admin User

Open your browser to https://<your-domain> and register the first user. The first registered user becomes the admin (requires ALLOW_REGISTRATION=true and ALLOW_UNVERIFIED_EMAIL_LOGIN=true).

8. Access Services

Endpoint	Description
`https://<your-domain>/`	LibreChat
`https://<your-domain>/agent/`	Hermes Agent Dashboard
`https://<your-domain>/agent-api/`	Hermes Agent API (OpenAI-compatible)
`https://<your-domain>/hermes-webui/`	Hermes Web UI (full web interface)
`https://<your-domain>/search/`	SearXNG (Web Search)
`https://<your-domain>/tts/`	SuperTonic TTS API (OpenAI-compatible)
`https://<your-domain>/speech-stt/`	Parakeet STT API (OpenAI-compatible)
`https://<your-domain>/beszel/`	Beszel Monitoring Dashboard
`https://<your-domain>/inference/`	llama.cpp Inference API
`https://<your-domain>/pii-guard/`	PII-Guard API
`https://<your-domain>/security-guard/`	Security-Guard API
`https://<your-domain>/knowledge/`	LightRAG Knowledge Graph API (query, upload documents)
`https://<your-domain>/lanshare/`	LAN Share files served via LibreChat
`\\<HOST>\netai-lanshare`	SMB/CIFS LAN share (Windows file explorer)

Note: HTTP (port 80) automatically redirects to HTTPS (port 443).

LAN Share — SMB/CIFS File Sharing

The stack includes an SMB file share for exchanging documents across your local network. It is integrated with LibreChat for AI-assisted document processing.

Access via Windows File Explorer:

\\<HOST-IP-ADDRESS>\netai-lanshare

Access via Linux/macOS:

smbclient //<HOST-IP-ADDRESS>/netai-lanshare -U netai

Access via Browser:

https://<DOMAIN>/lanshare/

Credentials: User netai, password from LANSHARE_PASSWORD in .env.

LibreChat Integration: Files placed in the LAN share are automatically available in LibreChat at /lanshare/. You can reference them in chat prompts and LibreChat will include their content as context. PDFs, images, and text documents placed here can be used for RAG-style queries.

Backup Hint: This directory is a host bind mount at ./data/netai-lanshare/. Include it in your regular backup routine.

9. Run the Test Suite (Optional)

# Fast smoke test (~30-40s) — containers, network, auth, data stores, backup
pytest tests/ -m "not slow" -v

# Full suite including chat completion (~2-3 min)
pytest tests/ -v

# Compliance-specific tests
pytest tests/ -m "pii_guard" -v
pytest tests/ -m "security_guard" -v

# Cascade LLM routing tests
pytest tests/test_llm_gateway.py -v

Caddy Reverse Proxy Routes

Path	Upstream	Notes
`/`	`librechat:3080`	LibreChat
`/search/`	`searxng:8080`	With HTML URL rewriting via `replace`
`/agent/`	`agent-hermes:9119`	Dashboard, HTML/JS/CSS rewriting, auth via LibreChat
`/agent-api/`	`agent-hermes:8642`	Gateway API, auth via LibreChat
`/hermes-webui/`	`hermes-webui:8787`	HTML rewriting, auth via LibreChat
`/beszel/`	`netai-beszel:8090`	Beszel dashboard (WebSocket for real-time), auth via LibreChat
`/tts/`	`supertonic-tts:8800`	SuperTonic TTS API
`/inference/`	`netai-inference:8080`	llama.cpp API
`/pii-guard/`	`pii-guard:8777`	PII-Guard API
`/security-guard/`	`security-guard:8778`	Security-Guard API
`/speech-stt/`	`speech-stt:5092`	Parakeet STT API
`/knowledge/`	`lightrag:8020`	LightRAG Knowledge Graph API (query, document upload)

API Endpoints Reference

All internal services are reachable via Docker network (netai-stack-se). The Caddy reverse proxy makes selected endpoints available externally on port 443 (HTTPS). Unless noted, endpoints require no authentication when accessed from the internal Docker network; external routes via Caddy may require a valid LibreChat JWT session.

Service	Internal URL	External Path	Protocol	Auth	Purpose
LibreChat	`http://librechat:3080`	`/`	OpenAI-compatible REST	JWT Bearer	Chat frontend, conversation management, preset config
Cascade LLM	`http://netai-cascade-llm:3000`	— (internal only)	OpenAI-compatible REST	None	Complexity + confidence routing between small and large LLM
Security-Guard	`http://security-guard:8778`	`/security-guard/`	OpenAI-compatible REST	None (external via Caddy)	Prompt injection detection, transparent proxy to Cascade LLM
PII-Guard	`http://pii-guard:8777`	`/pii-guard/`	REST	None	PII redaction, SearXNG proxy with GDPR compliance
Inference (raw)	`http://inference-server:8080`	`/inference/`	OpenAI-compatible REST	None	Raw llama.cpp (Qwen3.6-35B) — no guard, no routing
Auxiliary (raw)	`http://auxiliary-server:8080`	— (internal only)	OpenAI-compatible REST	None	Raw llama.cpp (LFM2.5-VL-1.6B) — small model backend
Hermes Agent	`http://agent-hermes:8642`	`/agent-api/`	OpenAI-compatible REST	LibreChat JWT	Hermes Agent gateway API (tool calling, code execution)
SearXNG	`http://searxng:8080`	`/search/`	REST + Web UI	None	Privacy-respecting meta search engine
TTS	`http://supertonic-tts:8800`	`/tts/`	OpenAI-compatible REST	None	Text-to-Speech (SuperTonic)
STT	`http://speech-stt:5092`	`/speech-stt/`	OpenAI-compatible REST	None	Speech-to-Text (Parakeet TDT)
Beszel	`http://beszel:8090`	`/beszel/`	REST + WebSocket	LibreChat JWT	System monitoring dashboard
Auth-Validator	`http://auth-validator:8081`	— (internal only)	REST	JWT Bearer	Caddy forward_auth backend — validates LibreChat sessions
LightRAG	`http://lightrag:8020`	`/knowledge/`	REST (OpenAI-compatible)	API Key (optional)	Graph-enhanced RAG — entity extraction, hybrid search, document ingestion
MongoDB	`mongodb://mongo:27017`	— (internal only)	MongoDB Wire	None	LibreChat database (internal Docker network only)
LAN Share	`smb://<HOST>:445/netai-lanshare`	`\\<HOST>\netai-lanshare`	SMB/CIFS	SMB user `netai`	File share, mounted in LibreChat at `/lanshare/`
Caddy	`caddy:443`	`https://<DOMAIN>/`	HTTP/HTTPS	LibreChat JWT (selected routes)	TLS termination, path-based reverse proxy

Endpoint Details

LibreChat API (OpenAI-Compatible Chat + Proprietary)

Method	Path	Auth	Description
`POST`	`/api/auth/login`	None	Login with email + password, returns JWT
`POST`	`/api/auth/refresh`	JWT	Refresh expired JWT token
`POST`	`/api/auth/logout`	JWT	Invalidate current session
`GET`	`/api/auth/user`	JWT	Get current user profile
`GET`	`/api/config`	JWT	Get client configuration (includes modelSpecs)
`GET`	`/api/models`	JWT	List available models
`POST`	`/api/chat/completions`	JWT	Chat completions (OpenAI-compatible format)
`POST`	`/api/chat/stream`	JWT	Streaming chat completions (SSE)
`GET`	`/api/convos`	JWT	List conversations
`GET`	`/api/convos/:id`	JWT	Get single conversation
`DELETE`	`/api/convos/:id`	JWT	Delete conversation
`POST`	`/api/convos/clear`	JWT	Clear all conversations
`GET`	`/api/presets`	JWT	List model presets
`POST`	`/api/presets`	JWT	Create preset
`PUT`	`/api/presets/:id`	JWT	Update preset
`DELETE`	`/api/presets/:id`	JWT	Delete preset
`POST`	`/api/endpoints`	JWT	Register custom endpoints
`POST`	`/api/agents`	JWT	Create agent (if agents enabled)
`GET`	`/api/agents`	JWT	List agents
`GET`	`/api/agents/:id`	JWT	Get agent details
`POST`	`/api/agents/:id/completions`	JWT	Agent-based chat completion
`POST`	`/api/files/upload`	JWT	Upload file for RAG context
`GET`	`/api/files/:id`	JWT	Get uploaded file
`POST`	`/api/tags`	JWT	Create tag
`GET`	`/api/tags`	JWT	List tags

Auth: Include Authorization: Bearer <JWT_TOKEN> header. Token obtained from POST /api/auth/login.

Security-Guard (Prompt Injection Protection)

Method	Path	Auth	Description
`POST`	`/v1/chat/completions`	None	Transparent proxy — classifies prompt via Mezzo, forwards to Cascade LLM if safe (supports streaming SSE)
`GET`	`/v1/models`	None	Passthrough model discovery
`POST`	`/classify`	None	Standalone prompt classification — returns `{"safe": true/false, "category": "...", "score": N}`
`POST`	`/filter`	None	Block unsafe prompts (403), forward safe ones to inference
`POST`	`/inference`	None	Legacy guarded completion endpoint
`GET`	`/health`	None	Health check

Response headers: X-MeZzo-Safe, X-MeZzo-Category, X-MeZzo-Score
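
On the client side, the documented `/classify` response can be parsed into a typed value. The sketch below handles only the response fields listed in the table; the request body format is not documented here, so it is left out. The `Classification` class and `parse_classification` function are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    safe: bool
    category: str
    score: float


def parse_classification(body: str) -> Classification:
    """Parse the documented /classify response fields: safe, category, score."""
    data = json.loads(body)
    return Classification(bool(data["safe"]), str(data["category"]), float(data["score"]))
```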

PII-Guard (GDPR Redaction)

Method	Path	Auth	Description
`POST`	`/redact`	None	Standalone PII redaction — send JSON `{"text": "..."}`, receive redacted text
`POST`	`/search`	None	Proxy — redacts query, forwards to SearXNG, returns JSON results
`GET`	`/search/redacted`	None	Debug/compliance — shows what would be redacted without forwarding
`GET`	`/health`	None	Health check

Response headers: X-PIGuard-Redacted (count), X-PIGuard-Compliance (GDPR article reference)
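
A request to `/redact` can be built with the Python standard library, as in this sketch. The function name is an assumption and the default base URL is the internal address from the reference table. The request is only constructed, not sent, and the response body is not parsed because its exact shape is not documented here.

```python
import json
import urllib.request


def build_redact_request(text: str, base_url: str = "http://pii-guard:8777") -> urllib.request.Request:
    """Build (but do not send) a POST /redact request with the documented JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/redact",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(request)` returns a response whose `headers.get("X-PIGuard-Redacted")` carries the redaction count.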

Cascade LLM (Internal Router — No External Route)

Method	Path	Auth	Description
`POST`	`/v1/chat/completions`	None	OpenAI-compatible — evaluates complexity, routes to small/large model. Supports `logprobs: true` for confidence-based fallback. Non-streaming only: reroutes to large model when confidence < threshold.
`GET`	`/health`	None	Health check

Request body: Standard OpenAI chat completions format. Add "image_url" content parts for multimodal routing.

Response headers (when the small model response is kept): x-confidence (float 0.0–1.0)

Inference & Auxiliary (Raw llama.cpp — No Guard)

Method	Path	Auth	Description
`POST`	`/v1/chat/completions`	None	Raw llama.cpp completions (OpenAI-compatible). No prompt injection check, no confidence routing.
`GET`	`/v1/models`	None	List loaded models
`GET`	`/health`	None	Health check

TTS (SuperTonic — OpenAI-Compatible)

Method	Path	Auth	Description
`POST`	`/v1/audio/speech`	None	Text-to-Speech. Accepts `{"model": "tts-1", "input": "...", "voice": "nova", "response_format": "wav"}`
`GET`	`/health`	None	Health check

Voices: alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, marin, cedar, sage, verse

STT (Parakeet TDT — OpenAI-Compatible)

Method	Path	Auth	Description
`POST`	`/v1/audio/transcriptions`	None	Speech-to-Text. Accepts multipart form with audio file. Returns `{"text": "..."}`
`GET`	`/health`	None	Health check

Hermes Agent API (OpenAI-Compatible)

Method	Path	Auth	Description
`POST`	`/v1/chat/completions`	LibreChat JWT	Chat with Hermes Agent — supports tool calling, MCP tools, code execution
`GET`	`/health`	None	Health check

External path: https://<DOMAIN>/agent-api/v1/chat/completions

Note: External access requires a valid LibreChat session (auth-validator validates the JWT before proxying).

SearXNG (Web Search)

Method	Path	Auth	Description
`GET`	`/`	None	Web UI (HTML)
`POST`	`/search`	None	JSON search API — accepts `{"q": "...", "format": "json", "language": "de-DE"}`
`GET`	`/health`	None	Health check

External path: https://<DOMAIN>/search/. Queries from LibreChat go through PII-Guard for GDPR compliance.

Auth-Validator (Internal — No External Route)

Method	Path	Auth	Description
`GET`	`/`	JWT Bearer	Validates LibreChat JWT. Returns 200 if valid, 401 if invalid/expired. Used by Caddy's `forward_auth` directive.

LAN Share (SMB/CIFS)

Protocol	Path	Auth	Description
SMB/CIFS	`\\<HOST>\netai-lanshare`	SMB user `netai`	Read/write file share. Also mounted in LibreChat at `/lanshare/` for browser access.

LightRAG (Graph-Enhanced RAG)

Method	Path	Auth	Description
`POST`	`/documents/upload`	API Key	Upload document — LightRAG extracts entities + relationships, builds knowledge graph
`POST`	`/query`	API Key	Query the knowledge graph. Modes: `naive` (vector-only), `local` (entity-level), `global` (community-level), `hybrid` (all)
`GET`	`/graph`	API Key	Retrieve the knowledge graph structure (entities + relationships)
`GET`	`/health`	None	Health check
`POST`	`/v1/chat/completions`	API Key	OpenAI-compatible endpoint — accepts standard chat format, returns RAG-augmented responses

Auth: Optional API key via Authorization: Bearer <key> header. Default: sk-no-key-required.

Document ingestion: Use the provided script to bulk-import documents:

# Ingest all files from the LAN share
./scripts/ingest-docs.sh

# Ingest a single document
./scripts/ingest-docs.sh /path/to/document.pdf

# Watch the LAN share for new files (requires inotify-tools)
sudo apt install inotify-tools
./scripts/ingest-docs.sh --watch

JWT Authentication Flow

POST /api/auth/login with {"email": "...", "password": "..."} → returns token (JWT) and refreshToken
Include Authorization: Bearer <token> in subsequent requests
When the token expires, POST /api/auth/refresh with {"token": "<refreshToken>"} → returns new token pair
For external routes requiring auth (marked "LibreChat JWT"), Caddy's forward_auth middleware automatically validates the JWT against auth-validator:8081
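
The first two steps of the flow can be sketched with the standard library. The helper names are assumptions, and the requests are only built here, not sent; the base URL is whatever address LibreChat is reachable at.

```python
import json
import urllib.request


def login_request(base_url: str, email: str, password: str) -> urllib.request.Request:
    """Step 1: POST credentials to /api/auth/login."""
    body = json.dumps({"email": email, "password": password}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(
        f"{base_url}/api/auth/login", data=body, headers=headers, method="POST"
    )


def authorized_request(base_url: str, path: str, token: str) -> urllib.request.Request:
    """Step 2: attach the JWT as a Bearer token to a later request."""
    return urllib.request.Request(
        f"{base_url}{path}", headers={"Authorization": f"Bearer {token}"}
    )
```

A caller would send the login request with `urllib.request.urlopen`, read `token` from the JSON response, and pass it to `authorized_request` for calls such as `/api/models`.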

Intel SYCL Notes

This stack uses the SYCL backend via the official ghcr.io/ggml-org/llama.cpp:server-intel-b9641 image.
Ensure your kernel is 6.8 or newer for native Xe/i915 support on Battlemage.
The environment variable ONEAPI_DEVICE_SELECTOR=*:gpu is passed to the inference container.
Important: llama.cpp SYCL backend only supports discrete Intel Arc GPUs (Xe-HPG+, like Arc Pro B50). Integrated Xe-LP GPUs (UHD 770) are not enumerated by the SYCL backend and cannot be used for inference. The inference-server container is strictly bound to the discrete Arc GPU.
The auxiliary-server container also uses SYCL and shares the same dGPU for the small multimodal model (LFM2.5-VL-1.6B).
Security-Guard uses the iGPU (/dev/dri/card0) separately for Mezzo-Prompt-Guard inference.
There are no NVIDIA/CUDA dependencies in this stack.

SSL / TLS

Caddy handles TLS termination. For production, configure Let's Encrypt certificates. For development, self-signed certificates can be used:

# Place certificates at paths specified in .env
# Caddy will use them for TLS

Testing

A pytest-based integration test suite validates containers, network paths, APIs, authentication, chat completions, data stores, and backups.

Prerequisites: pytest and requests must be installed (pip install pytest requests).

# Fast smoke test (containers, network, auth, data stores, backup — ~30-40s)
pytest tests/ -m "not slow" -v

# Full suite including chat completion (~2-3 min)
pytest tests/ -v

# Parallel execution (requires pytest-xdist)
pytest tests/ -m "not slow" -v -n auto

# Compliance-specific
pytest tests/ -m "pii_guard" -v
pytest tests/ -m "security_guard" -v

# Cascade LLM
pytest tests/test_llm_gateway.py -v

Test File	Coverage
`test_containers.py`	Docker container status and health checks
`test_network_paths.py`	Caddy routing and inter-service DNS
`test_api_endpoints.py`	LLM inference, LibreChat, Hermes API, SearXNG
`test_authentication.py`	LibreChat login and token validation
`test_chat_completion.py`	End-to-end chat via LibreChat API
`test_data_stores.py`	SQLite integrity, ChromaDB, uploads
`test_backup.py`	Backup script execution and archive validation
`test_pii_guard.py`	GDPR compliance, PII redaction, search proxy
`test_security_guard.py`	Prompt injection detection, filtering, Article 52
`test_beszel.py`	Beszel monitoring metrics
`test_llm_gateway.py`	Cascade LLM routing, complexity scoring, confidence fallback, streaming
`test_hermes_webui.py`	Hermes compose config, Caddy proxy, .env.example
`test_hermes_playwright.py`	Hermes API server and LibreChat browser tests

Backup & Restore

LibreChat stores all data in MongoDB (persistent volume mongo-data).

Create a Backup

docker compose exec mongo mongodump --archive=/backups/librechat-$(date +%Y%m%d).archive

Restore from Backup

docker compose exec -T mongo mongorestore --archive < backup-file.archive

Troubleshooting

GPU Not Detected

lspci | grep -iE 'vga|display' | grep -i intel

If empty, verify the GPU is seated and the kernel module is loaded:

sudo dmesg | grep -iwE 'i915|xe'
sudo intel_gpu_top

Permission Denied on /dev/dri

Ensure your user is in the video and render groups, then log out and back in:

sudo usermod -aG video,render "$USER"

Model File Missing

setup.sh will warn you if the expected GGUF files are absent. Download them via Hugging Face CLI or wget and place them in their respective directories under models/.

Security-Guard Not Starting

Verify the Mezzo-Prompt-Guard model file is present at the path specified by SECURITY_GUARD_MODEL_PATH in .env or the default path in docker-compose.yml.

License

This deployment configuration is provided as-is for B2B on-premise deployments. Model weights and upstream container images are subject to their respective licenses.
