XyLearningProgramming · XyLearningProgramming · Jul 25, 2025 · Jul 23, 2025 · Jul 24, 2025 · Jul 25, 2025
diff --git a/README.md b/README.md
@@ -1,55 +1,48 @@
-# Small-Language-Model Server
+# Small Language Model Server
 
 [![CI Pipeline](https://github.com/XyLearningProgramming/slm_server/actions/workflows/ci.yml/badge.svg)](https://github.com/XyLearningProgramming/slm_server/actions/workflows/ci.yml)
 [![codecov](https://codecov.io/gh/XyLearningProgramming/slm_server/branch/main/graph/badge.svg)](https://codecov.io/gh/XyLearningProgramming/slm_server)
 [![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://hub.docker.com/r/x3huang/slm_server)
 [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
 
-🚀 A light model server that serves small language models (default: `Qwen3-0.6B-GGUF`) as a **thin wrapper** around `llama-cpp` exposing the OpenAI-compatible `/chat/completions` API. Core logic is just <100 lines under `./slm_server/app.py`!
+A lightweight model server that serves small language models (default: Qwen3-0.6B-GGUF) as a thin wrapper around llama-cpp with OpenAI-compatible `/chat/completions` API. Core logic is <100 lines in `./slm_server/app.py`.
 
-> This is still a WIP project. Issues, pull-requests are welcome. I mainly use this repo to deploy a SLM model as part of the backend on my own site [x3huang.dev](https://x3huang.dev/) while trying my best to keep this repo model-agonistic. 
+## Features
 
-## ✨ Features
+- **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
+- **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
+- **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing
+- **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
+- **Simple configuration** - Environment-based config with sensible defaults
 
-![Thin wrapper around llama cpp](./docs/20250712_slm_img1.jpg)
+## Use Cases
 
-- 🔌 **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
-- ⚡ **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
-- 📊 **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing (all configurable)
-- 🚀 **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
-- 🔧 **Simple configuration** - Environment-based config with sensible defaults
+- **Self-hosting** - Deploy small models under resource constraints
+- **Privacy-first inference** - No user content logging, complete data control
+- **Development environments** - Local LLM testing and prototyping
+- **Edge deployments** - Lightweight inference in constrained environments
+- **API standardization** - Unified OpenAI-compatible interface for small models
 
-## 🚀 Quick Start
+## Quick Start
 
 ### Local Development
 
 ```bash
-# 1. Get your model
+# Download model
 ./scripts/download.sh  # Downloads default Qwen3-0.6B-GGUF
-# OR place your own GGUF model in models/ directory
 
-# 2. Install dependencies
+# Install and start
 uv sync
-
-# 3. Configure (optional)
-cp .env.example .env  # Edit as needed
-
-# 4. Start the server
 ./scripts/start.sh
 ```
 
 ### Docker
 
 ```bash
-# Pull and run
 docker run -p 8000:8000 -v $(pwd)/models:/app/models x3huang/slm_server/general
-
-# Or build locally
-docker build -t slm-server .
-docker run -p 8000:8000 -v $(pwd)/models:/app/models slm_server
 ```
 
-### Test the API
+### Test API
 
 ```bash
 curl -X POST http://localhost:8000/api/v1/chat/completions \
@@ -61,57 +54,26 @@ curl -X POST http://localhost:8000/api/v1/chat/completions \
   }'
 ```
 
-## 🎯 Why SLM Server?
-
-- **🎯 Unified access** - Single point of entry for SLM inference with concurrency control
-- **💰 Cost-effective** - Perfect for self-hosting small models under resource constraints
-- **🔒 Privacy-matters** - No user content logging, complete data control
-- **⚡ Performance** - As thin wrapper around `llama-cpp`
-
-## 📊 Observability Stack
-
-All observability components are **configurable** and **enabled by default** for production readiness.
-
-### 📝 Structured Logging
-Request lifecycle logging with trace correlation:
-
-```log
-2025-07-21 09:52:32,475 INFO [slm_server.utils] 2025-07-21 09:52:32,475 INFO [slm_server.utils] [utils.py:341] [trace_id=e4a2ed019bd6fe95d611d7b29b90db4f span_id=c8fcaa72b8732e29 resource.service.name= trace_sampled=True] - [SLM] starting streaming: {'max_tokens': 2048, 'temperature': 0.7, 'input_messages': 1, 'input_content_length': 15}
-
-2025-07-21 09:52:36,496 INFO [slm_server.utils] [utils.py:404] [trace_id=e4a2ed019bd6fe95d611d7b29b90db4f span_id=c8fcaa72b8732e29 resource.service.name= trace_sampled=True] - [SLM] completed streaming: {'duration_ms': 4021.32, 'output_content_length': 468, 'total_tokens': 111, 'completion_tokens': 108, 'completion_tokens_per_second': 26.86, 'total_tokens_per_second': 27.6, 'chunk_count': 108, 'avg_chunk_delay_ms': 37.23, 'first_token_delay_ms': 38.19, 'avg_chunk_size': 259.45, 'avg_chunk_content_size': 4.25, 'chunks_with_content': 108, 'empty_chunks': 2}
-```
-
-### 📈 Prometheus Metrics
-Available at `/metrics` endpoint:
-- Request latency and throughput
-- Token generation rates
-- Model memory usage
-- Error rates and types
+## Observability
 
-### 🔍 OpenTelemetry Tracing
-Distributed tracing with:
-- Request flow visualization, each stream response as extra event if any
-- Performance bottleneck identification
+All observability components are configurable and enabled by default:
 
-## ⚙️ Configuration
+- **Structured Logging** - Request lifecycle logging with trace correlation
+- **Prometheus Metrics** - Available at `/metrics` (latency, throughput, token rates, memory usage)
+- **OpenTelemetry Tracing** - Distributed tracing with request flow visualization
 
-Configure via environment variables (prefix: `SLM_`) or `.env` file.
+## Configuration
 
-See [`./slm_server/config.py`](./slm_server/config.py) for complete configuration options.
+Configure via environment variables (prefix: `SLM_`) or `.env` file. See [`./slm_server/config.py`](./slm_server/config.py) for all options.
 
-## 🚢 Deployment
+## Deployment
 
 ### Kubernetes with Helm
 
 ```bash
-# Deploy to production
 helm upgrade --install slm-server ./deploy/helm \
   --namespace backend \
   --values ./deploy/helm/values.yaml
-
-# Monitor deployment
-kubectl get pods -n backend
-kubectl logs -f deployment/slm-server -n backend
 ```
 
 ### Docker Compose
@@ -125,43 +87,38 @@ services:
       - "8000:8000"
     volumes:
       - ./models:/app/models
-    # Optional
     environment:
       - slm_server_PATH=/app/models/your-model.gguf
 ```
 
-## 🧪 Development
+## Development
 
-### Running Tests
+### Testing
 
 ```bash
 # Unit tests
 uv run pytest tests/ --ignore=tests/e2e/
 
-# End-to-end tests (with server pulled up)
+# End-to-end tests
 uv run python ./tests/e2e/main.py
 
 # With coverage
-uv run pytest tests/ --ignore=tests/e2e/ --cov=slm_server --cov-report=html --cov-report=term-missing
+uv run pytest tests/ --ignore=tests/e2e/ --cov=slm_server --cov-report=html
 ```
 
 ### Code Quality
 
 ```bash
-# Linting and formatting
 uv run ruff check .
 uv run ruff format .
 ```
 
-## 📚 API Documentation
+## API Documentation
 
-Once running, visit:
 - **Interactive docs**: http://localhost:8000/docs
 - **OpenAPI spec**: http://localhost:8000/openapi.json
 - **Health check**: http://localhost:8000/health
 
-## 📄 License
-
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-
+## License
 
+MIT License - see [LICENSE](LICENSE) file for details.
diff --git a/pyproject.toml b/pyproject.toml
@@ -26,6 +26,9 @@ select = ["C", "E", "F", "W"]
 [dependency-groups]
 dev = [
     "httpx>=0.28.1",
+    "langchain>=0.3.26",
+    "langchain-core>=0.3.71",
+    "langchain-openai>=0.3.28",
     "pytest>=8.4.1",
     "pytest-cov>=4.0.0",
     "ruff>=0.12.3",

diff --git a/pytest.ini b/pytest.ini
@@ -0,0 +1,5 @@
+[pytest]
+markers =
+    api: marks tests as api tests
+    api_non_streaming: marks tests as api and non_streaming tests
+    langchain: marks tests as langchain compatibility tests
diff --git a/slm_server/app.py b/slm_server/app.py
@@ -1,23 +1,27 @@
 import asyncio
+import json
 import traceback
+from http import HTTPStatus
 from typing import Annotated, AsyncGenerator
 
 from fastapi import Depends, FastAPI, HTTPException
 from fastapi.responses import StreamingResponse
-from llama_cpp import Llama
+from llama_cpp import CreateChatCompletionStreamResponse, Llama
 
 from slm_server.config import Settings, get_settings
 from slm_server.logging import setup_logging
 from slm_server.metrics import setup_metrics
 from slm_server.model import (
     ChatCompletionRequest,
-    ChatCompletionResponse,
-    ChatCompletionStreamResponse,
+    EmbeddingRequest,
 )
 from slm_server.trace import setup_tracing
 from slm_server.utils import (
     set_atrribute_response,
     set_atrribute_response_stream,
+    set_attribute_cancelled,
+    set_attribute_response_embedding,
+    slm_embedding_span,
     slm_span,
 )
 
@@ -28,6 +32,11 @@
 MAX_CONCURRENCY = 1
 # Default timeout message in detail field.
 DETAIL_SEM_TIMEOUT = "Server is busy, please try again later."
+# Status code for semaphore timeout.
+STATUS_CODE_SEM_TIMEOUT = HTTPStatus.REQUEST_TIMEOUT
+# Status code for unexpected errors.
+# This is used when the server encounters an error that is not handled
+STATUS_CODE_EXCEPTION = HTTPStatus.INTERNAL_SERVER_ERROR
 
 
 def get_llm_semaphor() -> asyncio.Semaphore:
@@ -46,9 +55,10 @@ def get_llm(settings: Annotated[Settings, Depends(get_settings)]) -> Llama:
             verbose=settings.logging.verbose,
             seed=settings.seed,
             logits_all=False,
-            embedding=False,
+            embedding=True,
             use_mlock=True,  # Use mlock to prevent memory swapping
             use_mmap=True,  # Use memory-mapped files for faster access
+            chat_format="chatml-function-calling",
         )
     return get_llm._instance
 
@@ -77,18 +87,17 @@ def get_app() -> FastAPI:
 
 
 async def lock_llm_semaphor(
-    req: ChatCompletionRequest,
     sem: Annotated[asyncio.Semaphore, Depends(get_llm_semaphor)],
     settings: Annotated[Settings, Depends(get_settings)],
 ) -> AsyncGenerator[None, None]:
     """Context manager to acquire and release the LLM semaphore with a timeout."""
     try:
-        await asyncio.wait_for(
-            sem.acquire(), timeout=req.wait_timeout or settings.s_timeout
-        )
+        await asyncio.wait_for(sem.acquire(), settings.s_timeout)
         yield None
     except asyncio.TimeoutError:
-        raise HTTPException(status_code=503, detail=DETAIL_SEM_TIMEOUT)
+        raise HTTPException(
+            status_code=STATUS_CODE_SEM_TIMEOUT, detail=DETAIL_SEM_TIMEOUT
+        )
     finally:
         if sem.locked():
             sem.release()
@@ -98,42 +107,36 @@ async def run_llm_streaming(
     llm: Llama, req: ChatCompletionRequest
 ) -> AsyncGenerator[str, None]:
     """Generator that runs the LLM and yields SSE chunks under lock."""
-    with slm_span(req, is_streaming=True) as (span, messages_for_llm):
-        completion_stream = await asyncio.to_thread(
-            llm.create_chat_completion,
-            messages=messages_for_llm,
-            max_tokens=req.max_tokens,
-            temperature=req.temperature,
-            stream=True,
-        )
+    with slm_span(req, is_streaming=True) as span:
+        try:
+            completion_stream = await asyncio.to_thread(
+                llm.create_chat_completion,
+                **req.model_dump(),
+            )
 
-        # Use traced iterator that automatically handles chunk spans
-        # and parent span updates
-        for chunk in completion_stream:
-            response_model = ChatCompletionStreamResponse.model_validate(chunk)
-            set_atrribute_response_stream(span, response_model)
-            yield f"data: {response_model.model_dump_json()}\n\n"
+            # Use traced iterator that automatically handles chunk spans
+            # and parent span updates
+            chunk: CreateChatCompletionStreamResponse
+            for chunk in completion_stream:
+                set_atrribute_response_stream(span, chunk)
+                yield f"data: {json.dumps(chunk)}\n\n"
 
-        yield "data: [DONE]\n\n"
+            yield "data: [DONE]\n\n"
+        except asyncio.CancelledError:
+            # Handle cancellation gracefully during sse.
+            set_attribute_cancelled(span)
 
 
-async def run_llm_non_streaming(
-    llm: Llama, req: ChatCompletionRequest
-) -> ChatCompletionResponse:
+async def run_llm_non_streaming(llm: Llama, req: ChatCompletionRequest):
     """Runs the LLM for a non-streaming request under lock."""
-    with slm_span(req, is_streaming=False) as (span, messages_for_llm):
+    with slm_span(req, is_streaming=False) as span:
         completion_result = await asyncio.to_thread(
             llm.create_chat_completion,
-            messages=messages_for_llm,
-            max_tokens=req.max_tokens,
-            temperature=req.temperature,
-            stream=False,
+            **req.model_dump(),
         )
+        set_atrribute_response(span, completion_result)
 
-        response_model = ChatCompletionResponse.model_validate(completion_result)
-        set_atrribute_response(span, response_model)
-
-        return response_model
+        return completion_result
 
 
 @app.post("/api/v1/chat/completions")
@@ -156,7 +159,29 @@ async def create_chat_completion(
     except Exception:
         # Catch any other unexpected errors
         error_str = traceback.format_exc()
-        raise HTTPException(status_code=500, detail=error_str)
+        raise HTTPException(status_code=STATUS_CODE_EXCEPTION, detail=error_str)
+
+
+@app.post("/api/v1/embeddings")
+async def create_embeddings(
+    req: EmbeddingRequest,
+    llm: Annotated[Llama, Depends(get_llm)],
+    _: Annotated[None, Depends(lock_llm_semaphor)],
+):
+    """Create embeddings for the given input text(s)."""
+    try:
+        with slm_embedding_span(req) as span:
+            # Use llama-cpp-python's create_embedding method directly
+            embedding_result = await asyncio.to_thread(
+                llm.create_embedding,
+                **req.model_dump(),
+            )
+            # Convert llama-cpp response using model_validate like chat completion
+            set_attribute_response_embedding(span, embedding_result)
+            return embedding_result
+    except Exception:
+        error_str = traceback.format_exc()
+        raise HTTPException(status_code=STATUS_CODE_EXCEPTION, detail=error_str)
 
 
 @app.get("/health")