Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
109 changes: 33 additions & 76 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,55 +1,48 @@
# Small-Language-Model Server
# Small Language Model Server

[![CI Pipeline](https://github.com/XyLearningProgramming/slm_server/actions/workflows/ci.yml/badge.svg)](https://github.com/XyLearningProgramming/slm_server/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/XyLearningProgramming/slm_server/branch/main/graph/badge.svg)](https://codecov.io/gh/XyLearningProgramming/slm_server)
[![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://hub.docker.com/r/x3huang/slm_server)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

🚀 A light model server that serves small language models (default: `Qwen3-0.6B-GGUF`) as a **thin wrapper** around `llama-cpp` exposing the OpenAI-compatible `/chat/completions` API. Core logic is just <100 lines under `./slm_server/app.py`!
A lightweight model server that serves small language models (default: Qwen3-0.6B-GGUF) as a thin wrapper around llama-cpp with OpenAI-compatible `/chat/completions` API. Core logic is <100 lines in `./slm_server/app.py`.

> This is still a WIP project. Issues, pull-requests are welcome. I mainly use this repo to deploy a SLM model as part of the backend on my own site [x3huang.dev](https://x3huang.dev/) while trying my best to keep this repo model-agonistic.
## Features

## ✨ Features
- **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
- **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
- **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing
- **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
- **Simple configuration** - Environment-based config with sensible defaults

![Thin wrapper around llama cpp](./docs/20250712_slm_img1.jpg)
## Use Cases

- 🔌 **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
- ⚡ **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
- 📊 **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing (all configurable)
- 🚀 **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
- 🔧 **Simple configuration** - Environment-based config with sensible defaults
- **Self-hosting** - Deploy small models under resource constraints
- **Privacy-first inference** - No user content logging, complete data control
- **Development environments** - Local LLM testing and prototyping
- **Edge deployments** - Lightweight inference in constrained environments
- **API standardization** - Unified OpenAI-compatible interface for small models

## 🚀 Quick Start
## Quick Start

### Local Development

```bash
# 1. Get your model
# Download model
./scripts/download.sh # Downloads default Qwen3-0.6B-GGUF
# OR place your own GGUF model in models/ directory

# 2. Install dependencies
# Install and start
uv sync

# 3. Configure (optional)
cp .env.example .env # Edit as needed

# 4. Start the server
./scripts/start.sh
```

### Docker

```bash
# Pull and run
docker run -p 8000:8000 -v $(pwd)/models:/app/models x3huang/slm_server/general

# Or build locally
docker build -t slm-server .
docker run -p 8000:8000 -v $(pwd)/models:/app/models slm_server
```

### Test the API
### Test API

```bash
curl -X POST http://localhost:8000/api/v1/chat/completions \
Expand All @@ -61,57 +54,26 @@ curl -X POST http://localhost:8000/api/v1/chat/completions \
}'
```

## 🎯 Why SLM Server?

- **🎯 Unified access** - Single point of entry for SLM inference with concurrency control
- **💰 Cost-effective** - Perfect for self-hosting small models under resource constraints
- **🔒 Privacy-matters** - No user content logging, complete data control
- **⚡ Performance** - As thin wrapper around `llama-cpp`

## 📊 Observability Stack

All observability components are **configurable** and **enabled by default** for production readiness.

### 📝 Structured Logging
Request lifecycle logging with trace correlation:

```log
2025-07-21 09:52:32,475 INFO [slm_server.utils] 2025-07-21 09:52:32,475 INFO [slm_server.utils] [utils.py:341] [trace_id=e4a2ed019bd6fe95d611d7b29b90db4f span_id=c8fcaa72b8732e29 resource.service.name= trace_sampled=True] - [SLM] starting streaming: {'max_tokens': 2048, 'temperature': 0.7, 'input_messages': 1, 'input_content_length': 15}

2025-07-21 09:52:36,496 INFO [slm_server.utils] [utils.py:404] [trace_id=e4a2ed019bd6fe95d611d7b29b90db4f span_id=c8fcaa72b8732e29 resource.service.name= trace_sampled=True] - [SLM] completed streaming: {'duration_ms': 4021.32, 'output_content_length': 468, 'total_tokens': 111, 'completion_tokens': 108, 'completion_tokens_per_second': 26.86, 'total_tokens_per_second': 27.6, 'chunk_count': 108, 'avg_chunk_delay_ms': 37.23, 'first_token_delay_ms': 38.19, 'avg_chunk_size': 259.45, 'avg_chunk_content_size': 4.25, 'chunks_with_content': 108, 'empty_chunks': 2}
```

### 📈 Prometheus Metrics
Available at `/metrics` endpoint:
- Request latency and throughput
- Token generation rates
- Model memory usage
- Error rates and types
## Observability

### 🔍 OpenTelemetry Tracing
Distributed tracing with:
- Request flow visualization, each stream response as extra event if any
- Performance bottleneck identification
All observability components are configurable and enabled by default:

## ⚙️ Configuration
- **Structured Logging** - Request lifecycle logging with trace correlation
- **Prometheus Metrics** - Available at `/metrics` (latency, throughput, token rates, memory usage)
- **OpenTelemetry Tracing** - Distributed tracing with request flow visualization

Configure via environment variables (prefix: `SLM_`) or `.env` file.
## Configuration

See [`./slm_server/config.py`](./slm_server/config.py) for complete configuration options.
Configure via environment variables (prefix: `SLM_`) or `.env` file. See [`./slm_server/config.py`](./slm_server/config.py) for all options.

## 🚢 Deployment
## Deployment

### Kubernetes with Helm

```bash
# Deploy to production
helm upgrade --install slm-server ./deploy/helm \
--namespace backend \
--values ./deploy/helm/values.yaml

# Monitor deployment
kubectl get pods -n backend
kubectl logs -f deployment/slm-server -n backend
```

### Docker Compose
Expand All @@ -125,43 +87,38 @@ services:
- "8000:8000"
volumes:
- ./models:/app/models
# Optional
environment:
- slm_server_PATH=/app/models/your-model.gguf
```

## 🧪 Development
## Development

### Running Tests
### Testing

```bash
# Unit tests
uv run pytest tests/ --ignore=tests/e2e/

# End-to-end tests (with server pulled up)
# End-to-end tests
uv run python ./tests/e2e/main.py

# With coverage
uv run pytest tests/ --ignore=tests/e2e/ --cov=slm_server --cov-report=html --cov-report=term-missing
uv run pytest tests/ --ignore=tests/e2e/ --cov=slm_server --cov-report=html
```

### Code Quality

```bash
# Linting and formatting
uv run ruff check .
uv run ruff format .
```

## 📚 API Documentation
## API Documentation

Once running, visit:
- **Interactive docs**: http://localhost:8000/docs
- **OpenAPI spec**: http://localhost:8000/openapi.json
- **Health check**: http://localhost:8000/health

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## License

MIT License - see [LICENSE](LICENSE) file for details.
3 changes: 3 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,9 @@ select = ["C", "E", "F", "W"]
[dependency-groups]
dev = [
"httpx>=0.28.1",
"langchain>=0.3.26",
"langchain-core>=0.3.71",
"langchain-openai>=0.3.28",
"pytest>=8.4.1",
"pytest-cov>=4.0.0",
"ruff>=0.12.3",
Expand Down
5 changes: 5 additions & 0 deletions pytest.ini
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
[pytest]
markers =
api: marks tests as api tests
api_non_streaming: marks tests as api and non_streaming tests
langchain: marks tests as langchain compatibility tests
99 changes: 62 additions & 37 deletions slm_server/app.py
Original file line number Diff line number Diff line change
@@ -1,23 +1,27 @@
import asyncio
import json
import traceback
from http import HTTPStatus
from typing import Annotated, AsyncGenerator

from fastapi import Depends, FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from llama_cpp import Llama
from llama_cpp import CreateChatCompletionStreamResponse, Llama

from slm_server.config import Settings, get_settings
from slm_server.logging import setup_logging
from slm_server.metrics import setup_metrics
from slm_server.model import (
ChatCompletionRequest,
ChatCompletionResponse,
ChatCompletionStreamResponse,
EmbeddingRequest,
)
from slm_server.trace import setup_tracing
from slm_server.utils import (
set_atrribute_response,
set_atrribute_response_stream,
set_attribute_cancelled,
set_attribute_response_embedding,
slm_embedding_span,
slm_span,
)

Expand All @@ -28,6 +32,11 @@
MAX_CONCURRENCY = 1
# Default timeout message in detail field.
DETAIL_SEM_TIMEOUT = "Server is busy, please try again later."
# Status code for semaphore timeout.
STATUS_CODE_SEM_TIMEOUT = HTTPStatus.REQUEST_TIMEOUT
# Status code for unexpected errors.
# This is used when the server encounters an error that is not handled
STATUS_CODE_EXCEPTION = HTTPStatus.INTERNAL_SERVER_ERROR


def get_llm_semaphor() -> asyncio.Semaphore:
Expand All @@ -46,9 +55,10 @@ def get_llm(settings: Annotated[Settings, Depends(get_settings)]) -> Llama:
verbose=settings.logging.verbose,
seed=settings.seed,
logits_all=False,
embedding=False,
embedding=True,
use_mlock=True, # Use mlock to prevent memory swapping
use_mmap=True, # Use memory-mapped files for faster access
chat_format="chatml-function-calling",
)
return get_llm._instance

Expand Down Expand Up @@ -77,18 +87,17 @@ def get_app() -> FastAPI:


async def lock_llm_semaphor(
req: ChatCompletionRequest,
sem: Annotated[asyncio.Semaphore, Depends(get_llm_semaphor)],
settings: Annotated[Settings, Depends(get_settings)],
) -> AsyncGenerator[None, None]:
"""Context manager to acquire and release the LLM semaphore with a timeout."""
try:
await asyncio.wait_for(
sem.acquire(), timeout=req.wait_timeout or settings.s_timeout
)
await asyncio.wait_for(sem.acquire(), settings.s_timeout)
yield None
except asyncio.TimeoutError:
raise HTTPException(status_code=503, detail=DETAIL_SEM_TIMEOUT)
raise HTTPException(
status_code=STATUS_CODE_SEM_TIMEOUT, detail=DETAIL_SEM_TIMEOUT
)
finally:
if sem.locked():
sem.release()
Expand All @@ -98,42 +107,36 @@ async def run_llm_streaming(
llm: Llama, req: ChatCompletionRequest
) -> AsyncGenerator[str, None]:
"""Generator that runs the LLM and yields SSE chunks under lock."""
with slm_span(req, is_streaming=True) as (span, messages_for_llm):
completion_stream = await asyncio.to_thread(
llm.create_chat_completion,
messages=messages_for_llm,
max_tokens=req.max_tokens,
temperature=req.temperature,
stream=True,
)
with slm_span(req, is_streaming=True) as span:
try:
completion_stream = await asyncio.to_thread(
llm.create_chat_completion,
**req.model_dump(),
)

# Use traced iterator that automatically handles chunk spans
# and parent span updates
for chunk in completion_stream:
response_model = ChatCompletionStreamResponse.model_validate(chunk)
set_atrribute_response_stream(span, response_model)
yield f"data: {response_model.model_dump_json()}\n\n"
# Use traced iterator that automatically handles chunk spans
# and parent span updates
chunk: CreateChatCompletionStreamResponse
for chunk in completion_stream:
set_atrribute_response_stream(span, chunk)
yield f"data: {json.dumps(chunk)}\n\n"

yield "data: [DONE]\n\n"
yield "data: [DONE]\n\n"
except asyncio.CancelledError:
# Handle cancellation gracefully during sse.
set_attribute_cancelled(span)


async def run_llm_non_streaming(
llm: Llama, req: ChatCompletionRequest
) -> ChatCompletionResponse:
async def run_llm_non_streaming(llm: Llama, req: ChatCompletionRequest):
"""Runs the LLM for a non-streaming request under lock."""
with slm_span(req, is_streaming=False) as (span, messages_for_llm):
with slm_span(req, is_streaming=False) as span:
completion_result = await asyncio.to_thread(
llm.create_chat_completion,
messages=messages_for_llm,
max_tokens=req.max_tokens,
temperature=req.temperature,
stream=False,
**req.model_dump(),
)
set_atrribute_response(span, completion_result)

response_model = ChatCompletionResponse.model_validate(completion_result)
set_atrribute_response(span, response_model)

return response_model
return completion_result


@app.post("/api/v1/chat/completions")
Expand All @@ -156,7 +159,29 @@ async def create_chat_completion(
except Exception:
# Catch any other unexpected errors
error_str = traceback.format_exc()
raise HTTPException(status_code=500, detail=error_str)
raise HTTPException(status_code=STATUS_CODE_EXCEPTION, detail=error_str)


@app.post("/api/v1/embeddings")
async def create_embeddings(
req: EmbeddingRequest,
llm: Annotated[Llama, Depends(get_llm)],
_: Annotated[None, Depends(lock_llm_semaphor)],
):
"""Create embeddings for the given input text(s)."""
try:
with slm_embedding_span(req) as span:
# Use llama-cpp-python's create_embedding method directly
embedding_result = await asyncio.to_thread(
llm.create_embedding,
**req.model_dump(),
)
# Convert llama-cpp response using model_validate like chat completion
set_attribute_response_embedding(span, embedding_result)
return embedding_result
except Exception:
error_str = traceback.format_exc()
raise HTTPException(status_code=STATUS_CODE_EXCEPTION, detail=error_str)


@app.get("/health")
Expand Down
Loading