diff --git a/README.md b/README.md
index 840a337..cc34ca7 100644
--- a/README.md
+++ b/README.md
@@ -1,55 +1,48 @@
-# Small-Language-Model Server
+# Small Language Model Server
 
 [![CI Pipeline](https://github.com/XyLearningProgramming/slm_server/actions/workflows/ci.yml/badge.svg)](https://github.com/XyLearningProgramming/slm_server/actions/workflows/ci.yml)
 [![codecov](https://codecov.io/gh/XyLearningProgramming/slm_server/branch/main/graph/badge.svg)](https://codecov.io/gh/XyLearningProgramming/slm_server)
 [![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://hub.docker.com/r/x3huang/slm_server)
 [![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
 
-🚀 A light model server that serves small language models (default: `Qwen3-0.6B-GGUF`) as a **thin wrapper** around `llama-cpp` exposing the OpenAI-compatible `/chat/completions` API. Core logic is just <100 lines under `./slm_server/app.py`!
+A lightweight model server that serves small language models (default: Qwen3-0.6B-GGUF) as a thin wrapper around llama-cpp, exposing an OpenAI-compatible `/chat/completions` API. The core logic is under 100 lines in `./slm_server/app.py`.
 
-> This is still a WIP project. Issues, pull-requests are welcome. I mainly use this repo to deploy a SLM model as part of the backend on my own site [x3huang.dev](https://x3huang.dev/) while trying my best to keep this repo model-agonistic. 
+## Features
 
-## ✨ Features
+- **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
+- **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
+- **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing
+- **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
+- **Simple configuration** - Environment-based config with sensible defaults
 
-![Thin wrapper around llama cpp](./docs/20250712_slm_img1.jpg)
+## Use Cases
 
-- 🔌 **OpenAI-compatible API** - Drop-in replacement with `/chat/completions` endpoint and streaming support
-- ⚡ **Llama.cpp integration** - High-performance inference optimized for limited CPU and memory resources
-- 📊 **Production observability** - Built-in logging, Prometheus metrics, and OpenTelemetry tracing (all configurable)
-- 🚀 **Enterprise deployment** - Complete CI/CD pipeline with unit tests, e2e tests, Helm charts, and Docker support
-- 🔧 **Simple configuration** - Environment-based config with sensible defaults
+- **Self-hosting** - Deploy small models under resource constraints
+- **Privacy-first inference** - No user content logging, complete data control
+- **Development environments** - Local LLM testing and prototyping
+- **Edge deployments** - Lightweight inference in constrained environments
+- **API standardization** - Unified OpenAI-compatible interface for small models
 
-## 🚀 Quick Start
+## Quick Start
 
 ### Local Development
 
 ```bash
-# 1. Get your model
+# Download model
 ./scripts/download.sh  # Downloads default Qwen3-0.6B-GGUF
-# OR place your own GGUF model in models/ directory
 
-# 2. Install dependencies
+# Install and start
 uv sync
-
-# 3. Configure (optional)
-cp .env.example .env  # Edit as needed
-
-# 4. Start the server
 ./scripts/start.sh
 ```
 
 ### Docker
 
 ```bash
-# Pull and run
 docker run -p 8000:8000 -v $(pwd)/models:/app/models x3huang/slm_server/general
-
-# Or build locally
-docker build -t slm-server .
-docker run -p 8000:8000 -v $(pwd)/models:/app/models slm_server
 ```
 
-### Test the API
+### Test API
 
 ```bash
 curl -X POST http://localhost:8000/api/v1/chat/completions \
@@ -61,57 +54,26 @@ curl -X POST http://localhost:8000/api/v1/chat/completions \
   }'
 ```
 
-## 🎯 Why SLM Server?
-
-- **🎯 Unified access** - Single point of entry for SLM inference with concurrency control
-- **💰 Cost-effective** - Perfect for self-hosting small models under resource constraints
-- **🔒 Privacy-matters** - No user content logging, complete data control
-- **⚡ Performance** - As thin wrapper around `llama-cpp`
-
-## 📊 Observability Stack
-
-All observability components are **configurable** and **enabled by default** for production readiness.
-
-### 📝 Structured Logging
-Request lifecycle logging with trace correlation:
-
-```log
-2025-07-21 09:52:32,475 INFO [slm_server.utils] 2025-07-21 09:52:32,475 INFO [slm_server.utils] [utils.py:341] [trace_id=e4a2ed019bd6fe95d611d7b29b90db4f span_id=c8fcaa72b8732e29 resource.service.name= trace_sampled=True] - [SLM] starting streaming: {'max_tokens': 2048, 'temperature': 0.7, 'input_messages': 1, 'input_content_length': 15}
-
-2025-07-21 09:52:36,496 INFO [slm_server.utils] [utils.py:404] [trace_id=e4a2ed019bd6fe95d611d7b29b90db4f span_id=c8fcaa72b8732e29 resource.service.name= trace_sampled=True] - [SLM] completed streaming: {'duration_ms': 4021.32, 'output_content_length': 468, 'total_tokens': 111, 'completion_tokens': 108, 'completion_tokens_per_second': 26.86, 'total_tokens_per_second': 27.6, 'chunk_count': 108, 'avg_chunk_delay_ms': 37.23, 'first_token_delay_ms': 38.19, 'avg_chunk_size': 259.45, 'avg_chunk_content_size': 4.25, 'chunks_with_content': 108, 'empty_chunks': 2}
-```
-
-### 📈 Prometheus Metrics
-Available at `/metrics` endpoint:
-- Request latency and throughput
-- Token generation rates
-- Model memory usage
-- Error rates and types
+## Observability
 
-### 🔍 OpenTelemetry Tracing
-Distributed tracing with:
-- Request flow visualization, each stream response as extra event if any
-- Performance bottleneck identification
+All observability components are configurable and enabled by default:
 
-## ⚙️ Configuration
+- **Structured Logging** - Request lifecycle logging with trace correlation
+- **Prometheus Metrics** - Available at `/metrics` (latency, throughput, token rates, memory usage)
+- **OpenTelemetry Tracing** - Distributed tracing with request flow visualization
 
-Configure via environment variables (prefix: `SLM_`) or `.env` file.
+## Configuration
 
-See [`./slm_server/config.py`](./slm_server/config.py) for complete configuration options.
+Configure via environment variables (prefix: `SLM_`) or `.env` file. See [`./slm_server/config.py`](./slm_server/config.py) for all options.
 
-## 🚢 Deployment
+## Deployment
 
 ### Kubernetes with Helm
 
 ```bash
-# Deploy to production
 helm upgrade --install slm-server ./deploy/helm \
   --namespace backend \
   --values ./deploy/helm/values.yaml
-
-# Monitor deployment
-kubectl get pods -n backend
-kubectl logs -f deployment/slm-server -n backend
 ```
 
 ### Docker Compose
@@ -125,43 +87,38 @@ services:
       - "8000:8000"
     volumes:
       - ./models:/app/models
-    # Optional
     environment:
       - slm_server_PATH=/app/models/your-model.gguf
 ```
 
-## 🧪 Development
+## Development
 
-### Running Tests
+### Testing
 
 ```bash
 # Unit tests
 uv run pytest tests/ --ignore=tests/e2e/
 
-# End-to-end tests (with server pulled up)
+# End-to-end tests
 uv run python ./tests/e2e/main.py
 
 # With coverage
-uv run pytest tests/ --ignore=tests/e2e/ --cov=slm_server --cov-report=html --cov-report=term-missing
+uv run pytest tests/ --ignore=tests/e2e/ --cov=slm_server --cov-report=html
 ```
 
 ### Code Quality
 
 ```bash
-# Linting and formatting
 uv run ruff check .
 uv run ruff format .
 ```
 
-## 📚 API Documentation
+## API Documentation
 
-Once running, visit:
 - **Interactive docs**: http://localhost:8000/docs
 - **OpenAPI spec**: http://localhost:8000/openapi.json
 - **Health check**: http://localhost:8000/health
 
-## 📄 License
-
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-
+## License
 
+MIT License - see the [LICENSE](LICENSE) file for details.
\ No newline at end of file
diff --git a/pyproject.toml b/pyproject.toml
index cf4ec02..7c93f92 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -26,6 +26,9 @@ select = ["C", "E", "F", "W"]
 [dependency-groups]
 dev = [
     "httpx>=0.28.1",
+    "langchain>=0.3.26",
+    "langchain-core>=0.3.71",
+    "langchain-openai>=0.3.28",
     "pytest>=8.4.1",
     "pytest-cov>=4.0.0",
     "ruff>=0.12.3",
diff --git a/pytest.ini b/pytest.ini
new file mode 100644
index 0000000..d29f63d
--- /dev/null
+++ b/pytest.ini
@@ -0,0 +1,5 @@
+[pytest]
+markers =
+    api: marks tests as api tests
+    api_non_streaming: marks tests as api and non_streaming tests
+    langchain: marks tests as langchain compatibility tests
diff --git a/slm_server/app.py b/slm_server/app.py
index f604c49..e825f2b 100644
--- a/slm_server/app.py
+++ b/slm_server/app.py
@@ -1,23 +1,27 @@
 import asyncio
+import json
 import traceback
+from http import HTTPStatus
 from typing import Annotated, AsyncGenerator
 
 from fastapi import Depends, FastAPI, HTTPException
 from fastapi.responses import StreamingResponse
-from llama_cpp import Llama
+from llama_cpp import CreateChatCompletionStreamResponse, Llama
 
 from slm_server.config import Settings, get_settings
 from slm_server.logging import setup_logging
 from slm_server.metrics import setup_metrics
 from slm_server.model import (
     ChatCompletionRequest,
-    ChatCompletionResponse,
-    ChatCompletionStreamResponse,
+    EmbeddingRequest,
 )
 from slm_server.trace import setup_tracing
 from slm_server.utils import (
     set_atrribute_response,
     set_atrribute_response_stream,
+    set_attribute_cancelled,
+    set_attribute_response_embedding,
+    slm_embedding_span,
     slm_span,
 )
 
@@ -28,6 +32,11 @@
 MAX_CONCURRENCY = 1
 # Default timeout message in detail field.
 DETAIL_SEM_TIMEOUT = "Server is busy, please try again later."
+# Status code for semaphore timeout.
+STATUS_CODE_SEM_TIMEOUT = HTTPStatus.REQUEST_TIMEOUT
+# Status code for unexpected errors.
+# Used when the server hits an error that is not handled elsewhere.
+STATUS_CODE_EXCEPTION = HTTPStatus.INTERNAL_SERVER_ERROR
 
 
 def get_llm_semaphor() -> asyncio.Semaphore:
@@ -46,9 +55,10 @@ def get_llm(settings: Annotated[Settings, Depends(get_settings)]) -> Llama:
             verbose=settings.logging.verbose,
             seed=settings.seed,
             logits_all=False,
-            embedding=False,
+            embedding=True,
             use_mlock=True,  # Use mlock to prevent memory swapping
             use_mmap=True,  # Use memory-mapped files for faster access
+            chat_format="chatml-function-calling",
         )
     return get_llm._instance
 
@@ -77,18 +87,17 @@ def get_app() -> FastAPI:
 
 
 async def lock_llm_semaphor(
-    req: ChatCompletionRequest,
     sem: Annotated[asyncio.Semaphore, Depends(get_llm_semaphor)],
     settings: Annotated[Settings, Depends(get_settings)],
 ) -> AsyncGenerator[None, None]:
     """Context manager to acquire and release the LLM semaphore with a timeout."""
     try:
-        await asyncio.wait_for(
-            sem.acquire(), timeout=req.wait_timeout or settings.s_timeout
-        )
+        await asyncio.wait_for(sem.acquire(), settings.s_timeout)
         yield None
     except asyncio.TimeoutError:
-        raise HTTPException(status_code=503, detail=DETAIL_SEM_TIMEOUT)
+        raise HTTPException(
+            status_code=STATUS_CODE_SEM_TIMEOUT, detail=DETAIL_SEM_TIMEOUT
+        )
     finally:
         if sem.locked():
             sem.release()
@@ -98,42 +107,36 @@ async def run_llm_streaming(
     llm: Llama, req: ChatCompletionRequest
 ) -> AsyncGenerator[str, None]:
     """Generator that runs the LLM and yields SSE chunks under lock."""
-    with slm_span(req, is_streaming=True) as (span, messages_for_llm):
-        completion_stream = await asyncio.to_thread(
-            llm.create_chat_completion,
-            messages=messages_for_llm,
-            max_tokens=req.max_tokens,
-            temperature=req.temperature,
-            stream=True,
-        )
+    with slm_span(req, is_streaming=True) as span:
+        try:
+            completion_stream = await asyncio.to_thread(
+                llm.create_chat_completion,
+                **req.model_dump(),
+            )
 
-        # Use traced iterator that automatically handles chunk spans
-        # and parent span updates
-        for chunk in completion_stream:
-            response_model = ChatCompletionStreamResponse.model_validate(chunk)
-            set_atrribute_response_stream(span, response_model)
-            yield f"data: {response_model.model_dump_json()}\n\n"
+            # Record each chunk on the span, then forward it to the
+            # client as a server-sent event.
+            chunk: CreateChatCompletionStreamResponse
+            for chunk in completion_stream:
+                set_atrribute_response_stream(span, chunk)
+                yield f"data: {json.dumps(chunk)}\n\n"
 
-        yield "data: [DONE]\n\n"
+            yield "data: [DONE]\n\n"
+        except asyncio.CancelledError:
+            # Handle cancellation gracefully during SSE streaming.
+            set_attribute_cancelled(span)
 
 
-async def run_llm_non_streaming(
-    llm: Llama, req: ChatCompletionRequest
-) -> ChatCompletionResponse:
+async def run_llm_non_streaming(llm: Llama, req: ChatCompletionRequest):
     """Runs the LLM for a non-streaming request under lock."""
-    with slm_span(req, is_streaming=False) as (span, messages_for_llm):
+    with slm_span(req, is_streaming=False) as span:
         completion_result = await asyncio.to_thread(
             llm.create_chat_completion,
-            messages=messages_for_llm,
-            max_tokens=req.max_tokens,
-            temperature=req.temperature,
-            stream=False,
+            **req.model_dump(),
         )
+        set_atrribute_response(span, completion_result)
 
-        response_model = ChatCompletionResponse.model_validate(completion_result)
-        set_atrribute_response(span, response_model)
-
-        return response_model
+        return completion_result
 
 
 @app.post("/api/v1/chat/completions")
@@ -156,7 +159,29 @@ async def create_chat_completion(
     except Exception:
         # Catch any other unexpected errors
         error_str = traceback.format_exc()
-        raise HTTPException(status_code=500, detail=error_str)
+        raise HTTPException(status_code=STATUS_CODE_EXCEPTION, detail=error_str)
+
+
+@app.post("/api/v1/embeddings")
+async def create_embeddings(
+    req: EmbeddingRequest,
+    llm: Annotated[Llama, Depends(get_llm)],
+    _: Annotated[None, Depends(lock_llm_semaphor)],
+):
+    """Create embeddings for the given input text(s)."""
+    try:
+        with slm_embedding_span(req) as span:
+            # Use llama-cpp-python's create_embedding method directly
+            embedding_result = await asyncio.to_thread(
+                llm.create_embedding,
+                **req.model_dump(),
+            )
+            # Record response attributes on the span; return the raw llama-cpp dict.
+            set_attribute_response_embedding(span, embedding_result)
+            return embedding_result
+    except Exception:
+        error_str = traceback.format_exc()
+        raise HTTPException(status_code=STATUS_CODE_EXCEPTION, detail=error_str)
 
 
 @app.get("/health")
diff --git a/slm_server/model.py b/slm_server/model.py
index 61a2b37..a04a46e 100644
--- a/slm_server/model.py
+++ b/slm_server/model.py
@@ -1,71 +1,90 @@
-import time
-import uuid
-from typing import List, Optional
-
+from llama_cpp.llama_types import (
+    ChatCompletionFunction,
+    ChatCompletionRequestFunctionCall,
+    ChatCompletionRequestMessage,
+    ChatCompletionRequestResponseFormat,
+    ChatCompletionTool,
+    ChatCompletionToolChoiceOption,
+)
 from pydantic import BaseModel, Field
 
 
-def generate_chat_id():
-    return f"chatcmpl-{uuid.uuid4().hex}"
-
-
-def generate_timestamp():
-    return int(time.time())
-
-
-class ChatMessage(BaseModel):
-    role: str
-    content: str
-
-
 class ChatCompletionRequest(BaseModel):
-    messages: List[ChatMessage]
-    model: Optional[str] = Field(
-        "Qwen3-0.6B-GGUF", description="Model name used, not important."
+    messages: list[ChatCompletionRequestMessage] = Field(
+        description="List of chat completion messages in the conversation"
     )
-    temperature: float = Field(0.7, ge=0.0, le=2.0)
-    max_tokens: int = Field(2048, gt=0)
-    stream: bool = Field(False)
-    wait_timeout: Optional[float] = Field(
-        0, description="Max wait timeout to request sem. Default to server settings."
+    functions: list[ChatCompletionFunction] | None = Field(
+        default=None, description="List of functions available for the model to call"
+    )
+    function_call: ChatCompletionRequestFunctionCall | None = Field(
+        default=None, description="Controls which function the model should call"
+    )
+    tools: list[ChatCompletionTool] | None = Field(
+        default=None, description="List of tools available for the model to use"
+    )
+    tool_choice: ChatCompletionToolChoiceOption | None = Field(
+        default=None, description="Controls which tool the model should use"
+    )
+    temperature: float = Field(
+        default=0.2, description="Sampling temperature (0.0 to 2.0)"
+    )
+    top_p: float = Field(default=0.95, description="Nucleus sampling parameter")
+    top_k: int = Field(default=40, description="Top-k sampling parameter")
+    min_p: float = Field(
+        default=0.05, description="Minimum probability threshold for sampling"
+    )
+    typical_p: float = Field(default=1.0, description="Typical sampling parameter")
+    stream: bool = Field(default=False, description="Whether to stream the response")
+    stop: str | list[str] | None = Field(
+        default=None, description="Stop sequences to end generation"
+    )
+    seed: int | None = Field(
+        default=None, description="Random seed for reproducible generation"
+    )
+    response_format: ChatCompletionRequestResponseFormat | None = Field(
+        default=None, description="Response format specification"
+    )
+    max_tokens: int | None = Field(
+        default=None, description="Maximum number of tokens to generate"
+    )
+    presence_penalty: float = Field(
+        default=0.0, description="Presence penalty (-2.0 to 2.0)"
+    )
+    frequency_penalty: float = Field(
+        default=0.0, description="Frequency penalty (-2.0 to 2.0)"
+    )
+    repeat_penalty: float = Field(
+        default=1.0, description="Repetition penalty (1.0 = no penalty)"
+    )
+    tfs_z: float = Field(default=1.0, description="Tail free sampling parameter")
+    mirostat_mode: int = Field(
+        default=0, description="Mirostat sampling mode (0=disabled, 1=v1, 2=v2)"
+    )
+    mirostat_tau: float = Field(default=5.0, description="Mirostat target entropy")
+    mirostat_eta: float = Field(default=0.1, description="Mirostat learning rate")
+    model: str | None = Field(default=None, description="Model identifier")
+    # These cannot be properly serialized with pydantic, so they are omitted for now.
+    #
+    # logits_processor: LogitsProcessorList | None = Field(
+    #     default=None, description="List of logits processors to apply"
+    # )
+    # grammar: LlamaGrammar | None = Field(
+    #     default=None, description="Grammar constraints for generation"
+    # )
+    logit_bias: dict[int, float] | None = Field(
+        default=None, description="Logit bias adjustments for specific tokens"
+    )
+    logprobs: bool | None = Field(
+        default=None, description="Whether to return log probabilities"
+    )
+    top_logprobs: int | None = Field(
+        default=None, description="Number of top log probabilities to return"
     )
 
 
-class ChatCompletionChoice(BaseModel):
-    index: int
-    message: ChatMessage
-    finish_reason: Optional[str]
-
-
-class Usage(BaseModel):
-    prompt_tokens: int
-    completion_tokens: int
-    total_tokens: int
-
-
-class ChatCompletionResponse(BaseModel):
-    id: str = Field(default_factory=generate_chat_id)
-    object: str = "chat.completion"
-    created: int = Field(default_factory=generate_timestamp)
-    model: str
-    choices: List[ChatCompletionChoice]
-    usage: Usage
-
-
-class DeltaMessage(BaseModel):
-    role: Optional[str] = None
-    content: Optional[str] = None
-
-
-class ChatCompletionStreamChoice(BaseModel):
-    index: int
-    delta: DeltaMessage
-    finish_reason: Optional[str] = None
-
-
-class ChatCompletionStreamResponse(BaseModel):
-    id: str = Field(default_factory=generate_chat_id)
-    object: str = "chat.completion.chunk"
-    created: int = Field(default_factory=generate_timestamp)
-    model: str
-    choices: List[ChatCompletionStreamChoice]
+# Embeddings API Models
+class EmbeddingRequest(BaseModel):
+    input: str | list[str]
+    model: str | None = Field(
+        default=None, description="Model name, not important for our server"
+    )
diff --git a/slm_server/utils.py b/slm_server/utils.py
deleted file mode 100644
index 99e95fa..0000000
--- a/slm_server/utils.py
+++ /dev/null
@@ -1,616 +0,0 @@
-import logging
-import traceback
-from contextlib import contextmanager
-
-from llama_cpp import ChatCompletionStreamResponse
-from opentelemetry import trace
-from opentelemetry.sdk.trace import Span
-from opentelemetry.sdk.trace.export import SpanProcessor
-from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult
-from opentelemetry.trace import Status, StatusCode
-from prometheus_client import Counter, Histogram
-
-from slm_server.model import ChatCompletionRequest, ChatCompletionResponse
-
-# Constants for span naming and attributes
-MODEL_NAME = "llama-cpp"
-SPAN_PREFIX = "slm"
-
-# Span names
-SPAN_CHAT_COMPLETION = f"{SPAN_PREFIX}.chat_completion"
-
-# Event names
-EVENT_CHUNK_GENERATED = f"{SPAN_PREFIX}.chunk_generated"
-
-# Event attribute names
-EVENT_ATTR_CHUNK_SIZE = f"{SPAN_PREFIX}.chunk_size"
-EVENT_ATTR_CHUNK_CONTENT_SIZE = f"{SPAN_PREFIX}.chunk_content_size"
-# EVENT_ATTR_CHUNK_CONTENT = f"{SPAN_PREFIX}.chunk_content"
-# EVENT_ATTR_FINISH_REASON = f"{SPAN_PREFIX}.finish_reason"
-
-# Attribute names
-ATTR_MODEL = f"{SPAN_PREFIX}.model"
-ATTR_STREAMING = f"{SPAN_PREFIX}.streaming"
-ATTR_MAX_TOKENS = f"{SPAN_PREFIX}.max_tokens"
-ATTR_TEMPERATURE = f"{SPAN_PREFIX}.temperature"
-ATTR_INPUT_MESSAGES = f"{SPAN_PREFIX}.input.messages"
-ATTR_INPUT_CONTENT_LENGTH = f"{SPAN_PREFIX}.input.content_length"
-ATTR_OUTPUT_CONTENT_LENGTH = f"{SPAN_PREFIX}.output.content_length"
-ATTR_CHUNK_COUNT = f"{SPAN_PREFIX}.output.chunk_count"
-ATTR_CHUNK_SIZE = f"{SPAN_PREFIX}.chunk.size"
-ATTR_PROMPT_TOKENS = f"{SPAN_PREFIX}.usage.prompt_tokens"
-ATTR_COMPLETION_TOKENS = f"{SPAN_PREFIX}.usage.completion_tokens"
-ATTR_TOTAL_TOKENS = f"{SPAN_PREFIX}.usage.total_tokens"
-ATTR_FORCE_SAMPLE = f"{SPAN_PREFIX}.force_sample"
-
-# Performance timing attributes
-ATTR_FIRST_TOKEN_DELAY = f"{SPAN_PREFIX}.timing.first_token_delay_ms"
-ATTR_TOKENS_PER_SECOND = f"{SPAN_PREFIX}.timing.completion_tokens_per_second"
-ATTR_TOTAL_TOKENS_PER_SECOND = f"{SPAN_PREFIX}.timing.total_tokens_per_second"
-ATTR_CHUNK_DELAY = f"{SPAN_PREFIX}.timing.chunk_delay_ms"
-ATTR_CHUNK_DURATION = f"{SPAN_PREFIX}.timing.chunk_duration_ms"
-ATTR_TOTAL_DURATION = f"{SPAN_PREFIX}.timing.total_duration_ms"
-ATTR_CHUNK_CONTENT_SIZE = f"{SPAN_PREFIX}.chunk.content_size"
-
-# Calculated metric names (used as keys in calculate_performance_metrics)
-METRIC_TOTAL_DURATION = ATTR_TOTAL_DURATION
-METRIC_TOKENS_PER_SECOND = ATTR_TOKENS_PER_SECOND
-METRIC_TOTAL_TOKENS_PER_SECOND = ATTR_TOTAL_TOKENS_PER_SECOND
-METRIC_CHUNK_DELAY = ATTR_CHUNK_DELAY
-METRIC_FIRST_TOKEN_DELAY = ATTR_FIRST_TOKEN_DELAY
-METRIC_AVG_CHUNK_SIZE = f"{SPAN_PREFIX}.metrics.avg_chunk_size"
-METRIC_AVG_CHUNK_CONTENT_SIZE = f"{SPAN_PREFIX}.metrics.avg_chunk_content_size"
-METRIC_MAX_CHUNK_SIZE = f"{SPAN_PREFIX}.metrics.max_chunk_size"
-METRIC_MIN_CHUNK_SIZE = f"{SPAN_PREFIX}.metrics.min_chunk_size"
-METRIC_CHUNKS_WITH_CONTENT = f"{SPAN_PREFIX}.metrics.chunks_with_content"
-METRIC_EMPTY_CHUNKS = f"{SPAN_PREFIX}.metrics.empty_chunks"
-
-# Log data keys (for consistent logging format)
-LOG_KEY_MAX_TOKENS = "max_tokens"
-LOG_KEY_TEMPERATURE = "temperature"
-LOG_KEY_INPUT_MESSAGES = "input_messages"
-LOG_KEY_INPUT_CONTENT_LENGTH = "input_content_length"
-LOG_KEY_DURATION_MS = "duration_ms"
-LOG_KEY_OUTPUT_CONTENT_LENGTH = "output_content_length"
-LOG_KEY_TOTAL_TOKENS = "total_tokens"
-LOG_KEY_COMPLETION_TOKENS = "completion_tokens"
-LOG_KEY_COMPLETION_TOKENS_PER_SECOND = "completion_tokens_per_second"
-LOG_KEY_TOTAL_TOKENS_PER_SECOND = "total_tokens_per_second"
-LOG_KEY_CHUNK_COUNT = "chunk_count"
-LOG_KEY_AVG_CHUNK_DELAY_MS = "avg_chunk_delay_ms"
-LOG_KEY_FIRST_TOKEN_DELAY_MS = "first_token_delay_ms"
-LOG_KEY_AVG_CHUNK_SIZE = "avg_chunk_size"
-LOG_KEY_AVG_CHUNK_CONTENT_SIZE = "avg_chunk_content_size"
-LOG_KEY_CHUNKS_WITH_CONTENT = "chunks_with_content"
-LOG_KEY_EMPTY_CHUNKS = "empty_chunks"
-
-# Prometheus metric names and descriptions
-PROMETHEUS_COMPLETION_DURATION = "slm_completion_duration_seconds"
-PROMETHEUS_COMPLETION_DURATION_DESC = "SLM completion duration in seconds"
-PROMETHEUS_TOKEN_COUNT = "slm_tokens_total"
-PROMETHEUS_TOKEN_COUNT_DESC = "Total tokens processed"
-PROMETHEUS_COMPLETION_TOKENS_PER_SECOND = "slm_completion_tokens_per_second"
-PROMETHEUS_COMPLETION_TOKENS_PER_SECOND_DESC = (
-    "Completion token generation rate (tokens/sec)"
-)
-PROMETHEUS_TOTAL_TOKENS_PER_SECOND = "slm_total_tokens_per_second"
-PROMETHEUS_TOTAL_TOKENS_PER_SECOND_DESC = (
-    "Total token throughput including prompt processing (tokens/sec)"
-)
-PROMETHEUS_FIRST_TOKEN_DELAY = "slm_first_token_delay_ms"
-PROMETHEUS_FIRST_TOKEN_DELAY_DESC = "Time to first token in milliseconds (streaming)"
-PROMETHEUS_CHUNK_DELAY = "slm_chunk_delay_ms"
-PROMETHEUS_CHUNK_DELAY_DESC = "Average chunk delay in milliseconds (streaming)"
-PROMETHEUS_CHUNK_DURATION = "slm_chunk_duration_ms"
-PROMETHEUS_CHUNK_DURATION_DESC = "Individual chunk processing duration in milliseconds"
-PROMETHEUS_ERROR_TOTAL = "slm_errors_total"
-PROMETHEUS_ERROR_TOTAL_DESC = "Total SLM errors"
-PROMETHEUS_CHUNK_COUNT = "slm_chunks_total"
-PROMETHEUS_CHUNK_COUNT_DESC = "Number of chunks in streaming response"
-
-# Log message templates
-LOG_MSG_STARTING_CALL = "[SLM] starting {}: {}"
-LOG_MSG_COMPLETED_CALL = "[SLM] completed {}: {}"
-LOG_MSG_FAILED_CALL = "[SLM] failed: {}"
-
-
-# Get tracer
-tracer = trace.get_tracer(__name__)
-logger = logging.getLogger(__name__)
-
-
-def set_atrribute_response(span: Span, response: ChatCompletionResponse):
-    """Set response attributes automatically."""
-    # Non-streaming response
-    if response.usage:
-        span.set_attribute(ATTR_PROMPT_TOKENS, response.usage.prompt_tokens)
-        span.set_attribute(ATTR_COMPLETION_TOKENS, response.usage.completion_tokens)
-        span.set_attribute(ATTR_TOTAL_TOKENS, response.usage.total_tokens)
-
-    if response.choices and response.choices[0].message:
-        content = response.choices[0].message.content or ""
-        span.set_attribute(ATTR_OUTPUT_CONTENT_LENGTH, len(content))
-
-
-def set_atrribute_response_stream(span: Span, response: ChatCompletionStreamResponse):
-    """Record streaming chunk as an event and accumulate tokens."""
-    chunk_content = ""
-    if (
-        response.choices
-        and response.choices[0].delta
-        and response.choices[0].delta.content
-    ):
-        chunk_content = response.choices[0].delta.content
-
-    chunk_json = response.model_dump_json()
-
-    # Record chunk as an event
-    chunk_event = {
-        EVENT_ATTR_CHUNK_SIZE: len(chunk_json),
-        EVENT_ATTR_CHUNK_CONTENT_SIZE: len(chunk_content),
-        # EVENT_ATTR_CHUNK_CONTENT: chunk_content,
-        # EVENT_ATTR_FINISH_REASON: response.choices[0].finish_reason or 0
-        # if response.choices
-        # else None,
-    }
-    span.add_event(EVENT_CHUNK_GENERATED, chunk_event)
-
-    # Only count chunks with actual content
-    if not chunk_content:
-        return
-
-    # Accumulate tokens directly on the span (only for recording spans)
-    if span.is_recording():
-        current_completion_tokens = span.attributes.get(ATTR_COMPLETION_TOKENS, 0)
-        span.set_attribute(ATTR_COMPLETION_TOKENS, current_completion_tokens + 1)
-
-        # Update total content length
-        current_output_length = span.attributes.get(ATTR_OUTPUT_CONTENT_LENGTH, 0)
-        span.set_attribute(
-            ATTR_OUTPUT_CONTENT_LENGTH, current_output_length + len(chunk_content)
-        )
-
-        # Update total tokens (assuming we have prompt tokens from initial setup)
-        prompt_tokens = span.attributes.get(ATTR_PROMPT_TOKENS, 0)
-        total_tokens = prompt_tokens + current_completion_tokens + 1
-        span.set_attribute(ATTR_TOTAL_TOKENS, total_tokens)
-
-        # Update chunk count
-        current_chunk_count = span.attributes.get(ATTR_CHUNK_COUNT, 0)
-        span.set_attribute(ATTR_CHUNK_COUNT, current_chunk_count + 1)
-
-
-@contextmanager
-def slm_span(req: ChatCompletionRequest, is_streaming: bool):
-    """Create SLM span with automatic timing and error handling."""
-    span_name = (
-        f"{SPAN_CHAT_COMPLETION}.{'streaming' if is_streaming else 'non_streaming'}"
-    )
-
-    # Pre-calculate attributes before starting span
-    messages_for_llm = [msg.model_dump() for msg in req.messages]
-    input_content_length = sum(len(msg.get("content", "")) for msg in messages_for_llm)
-
-    # Set initial attributes that will be available in on_start
-    initial_attributes = {
-        ATTR_MODEL: MODEL_NAME,
-        ATTR_STREAMING: is_streaming,
-        ATTR_MAX_TOKENS: req.max_tokens or 0,
-        ATTR_TEMPERATURE: req.temperature,
-        ATTR_INPUT_MESSAGES: len(messages_for_llm),
-        ATTR_INPUT_CONTENT_LENGTH: input_content_length,
-    }
-
-    # Add prompt tokens estimate for streaming
-    if is_streaming:
-        # Estimate prompt tokens for streaming
-        # (rough approximation: 1 token per 4 chars)
-        estimated_prompt_tokens = (
-            max(1, input_content_length // 4) if is_streaming else 0
-        )
-        initial_attributes[ATTR_PROMPT_TOKENS] = estimated_prompt_tokens
-
-    with tracer.start_as_current_span(span_name, attributes=initial_attributes) as span:
-        try:
-            yield span, messages_for_llm
-
-        except Exception:
-            # Use native error handling
-            error_str = traceback.format_exc()
-            span.set_status(Status(StatusCode.ERROR, error_str))
-            span.set_attribute(ATTR_FORCE_SAMPLE, True)
-            raise
-
-
-def calculate_performance_metrics(span: Span):
-    """Calculate performance metrics for a span after it has ended."""
-    if not (span.end_time and span.start_time):
-        return {}
-
-    attrs = span.attributes or {}
-    duration_ms = (span.end_time - span.start_time) / 1_000_000
-
-    # Get token counts
-    total_tokens = attrs.get(ATTR_TOTAL_TOKENS, 0)
-    completion_tokens = attrs.get(ATTR_COMPLETION_TOKENS, 0)
-
-    metrics = {
-        METRIC_TOTAL_DURATION: duration_ms,
-        METRIC_TOKENS_PER_SECOND: 0,
-        METRIC_TOTAL_TOKENS_PER_SECOND: 0,
-    }
-
-    # Calculate tokens per second
-    if duration_ms > 0:
-        duration_s = duration_ms / 1000
-        if completion_tokens > 0:
-            metrics[METRIC_TOKENS_PER_SECOND] = completion_tokens / duration_s
-        if total_tokens > 0:
-            metrics[METRIC_TOTAL_TOKENS_PER_SECOND] = total_tokens / duration_s
-
-    # Calculate streaming-specific metrics
-    is_streaming = attrs.get(ATTR_STREAMING, False)
-    if is_streaming:
-        chunk_count = attrs.get(ATTR_CHUNK_COUNT, 0)
-        if chunk_count > 0 and duration_ms > 0:
-            metrics[METRIC_CHUNK_DELAY] = duration_ms / chunk_count
-
-        # Calculate chunk metrics from events
-        chunk_metrics = _calculate_chunk_metrics_from_events(span.events)
-        metrics.update(chunk_metrics)
-
-        # First token delay - find first chunk with content
-        first_content_event = None
-        for event in span.events:
-            if event.name == EVENT_CHUNK_GENERATED:
-                first_content_event = event
-                break
-
-        if first_content_event:
-            first_token_delay = first_content_event.timestamp - span.start_time
-            metrics[METRIC_FIRST_TOKEN_DELAY] = first_token_delay / 1_000_000
-
-    return metrics
-
-
-def _calculate_chunk_metrics_from_events(events):
-    """Calculate chunk-related metrics from span events."""
-    chunk_events = [e for e in events if e.name == EVENT_CHUNK_GENERATED]
-
-    if not chunk_events:
-        return {}
-
-    chunk_sizes = []
-    chunk_content_sizes = []
-    chunks_with_content = 0
-    empty_chunks = 0
-
-    for event in chunk_events:
-        attrs = event.attributes or {}
-
-        chunk_size = attrs.get(EVENT_ATTR_CHUNK_SIZE, 0)
-        chunk_content_size = attrs.get(EVENT_ATTR_CHUNK_CONTENT_SIZE, 0)
-        # chunk_content = attrs.get(EVENT_ATTR_CHUNK_CONTENT, "")
-
-        chunk_sizes.append(chunk_size)
-        chunk_content_sizes.append(chunk_content_size)
-
-        if chunk_content_size:
-            chunks_with_content += 1
-        else:
-            empty_chunks += 1
-
-    metrics = {}
-
-    if chunk_sizes:
-        metrics[METRIC_AVG_CHUNK_SIZE] = sum(chunk_sizes) / len(chunk_sizes)
-        metrics[METRIC_MAX_CHUNK_SIZE] = max(chunk_sizes)
-        metrics[METRIC_MIN_CHUNK_SIZE] = min(chunk_sizes)
-
-    if chunk_content_sizes:
-        metrics[METRIC_AVG_CHUNK_CONTENT_SIZE] = sum(chunk_content_sizes) / len(
-            chunk_content_sizes
-        )
-
-    metrics[METRIC_CHUNKS_WITH_CONTENT] = chunks_with_content
-    metrics[METRIC_EMPTY_CHUNKS] = empty_chunks
-
-    return metrics
-
-
-class SLMLoggingSpanProcessor(SpanProcessor):
-    """Span processor for SLM logging using constants."""
-
-    def __init__(self):
-        self.logger = logging.getLogger(__name__)
-
-    def on_start(self, span, parent_context=None):
-        """Log span start."""
-        if not span.name.startswith(SPAN_CHAT_COMPLETION):
-            return
-
-        attrs = span.attributes or {}
-        is_streaming = attrs.get(ATTR_STREAMING, False)
-        log_data = {
-            LOG_KEY_MAX_TOKENS: attrs.get(ATTR_MAX_TOKENS, 0),
-            LOG_KEY_TEMPERATURE: attrs.get(ATTR_TEMPERATURE, 0.0),
-            LOG_KEY_INPUT_MESSAGES: attrs.get(ATTR_INPUT_MESSAGES, 0),
-            LOG_KEY_INPUT_CONTENT_LENGTH: attrs.get(ATTR_INPUT_CONTENT_LENGTH, 0),
-        }
-        mode = "streaming" if is_streaming else "non-streaming"
-        self.logger.info(LOG_MSG_STARTING_CALL.format(mode, log_data))
-
-    def on_end(self, span):
-        """Log span completion or error."""
-        if not span.name.startswith(SPAN_PREFIX):
-            return
-
-        attrs = span.attributes or {}
-
-        # Skip non-main spans (we no longer use chunk spans)
-        if not span.name.startswith(SPAN_CHAT_COMPLETION):
-            return
-
-        # Use native error status
-        if span.status.status_code == StatusCode.ERROR:
-            self.logger.error(LOG_MSG_FAILED_CALL.format(span.status.description))
-            return
-
-        # Calculate performance metrics (but don't try to set them on ended span)
-        performance_metrics = calculate_performance_metrics(span)
-        # Merge calculated metrics with existing attributes for logging
-        attrs = dict(attrs)
-        attrs.update(performance_metrics)
-        is_streaming = attrs.get(ATTR_STREAMING, False)
-        mode = "streaming" if is_streaming else "non-streaming"
-
-        log_data = {
-            LOG_KEY_DURATION_MS: round(attrs.get(METRIC_TOTAL_DURATION, 0), 2),
-            LOG_KEY_OUTPUT_CONTENT_LENGTH: attrs.get(ATTR_OUTPUT_CONTENT_LENGTH, 0),
-            LOG_KEY_TOTAL_TOKENS: attrs.get(ATTR_TOTAL_TOKENS, 0),
-            LOG_KEY_COMPLETION_TOKENS: attrs.get(ATTR_COMPLETION_TOKENS, 0),
-            LOG_KEY_COMPLETION_TOKENS_PER_SECOND: round(
-                attrs.get(METRIC_TOKENS_PER_SECOND, 0), 2
-            ),
-            LOG_KEY_TOTAL_TOKENS_PER_SECOND: round(
-                attrs.get(METRIC_TOTAL_TOKENS_PER_SECOND, 0), 2
-            ),
-        }
-
-        # Add streaming-specific metrics
-        if is_streaming:
-            log_data.update(
-                {
-                    LOG_KEY_CHUNK_COUNT: attrs.get(ATTR_CHUNK_COUNT, 0),
-                    LOG_KEY_AVG_CHUNK_DELAY_MS: round(
-                        attrs.get(METRIC_CHUNK_DELAY, 0), 2
-                    ),
-                    LOG_KEY_FIRST_TOKEN_DELAY_MS: round(
-                        attrs.get(METRIC_FIRST_TOKEN_DELAY, 0), 2
-                    ),
-                    LOG_KEY_AVG_CHUNK_SIZE: round(
-                        attrs.get(METRIC_AVG_CHUNK_SIZE, 0), 2
-                    ),
-                    LOG_KEY_AVG_CHUNK_CONTENT_SIZE: round(
-                        attrs.get(METRIC_AVG_CHUNK_CONTENT_SIZE, 0), 2
-                    ),
-                    LOG_KEY_CHUNKS_WITH_CONTENT: attrs.get(
-                        METRIC_CHUNKS_WITH_CONTENT, 0
-                    ),
-                    LOG_KEY_EMPTY_CHUNKS: attrs.get(METRIC_EMPTY_CHUNKS, 0),
-                }
-            )
-
-        self.logger.info(LOG_MSG_COMPLETED_CALL.format(mode, log_data))
-
-    def shutdown(self):
-        pass
-
-    def force_flush(self, timeout_millis: int = 30000):
-        return True
-
-
-class SLMMetricsSpanProcessor(SpanProcessor):
-    """Span processor for SLM metrics using constants."""
-
-    def __init__(self):
-        # Duration metrics
-        self.completion_duration = Histogram(
-            PROMETHEUS_COMPLETION_DURATION,
-            PROMETHEUS_COMPLETION_DURATION_DESC,
-            labelnames=["model", "streaming", "status"],
-            buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0],
-        )
-
-        # Token metrics
-        self.token_count = Histogram(
-            PROMETHEUS_TOKEN_COUNT,
-            PROMETHEUS_TOKEN_COUNT_DESC,
-            labelnames=["model", "streaming", "token_type"],
-            buckets=[10, 50, 100, 500, 1000, 2000, 5000, 10000],
-        )
-
-        # Throughput metrics - completion tokens (generation rate)
-        self.completion_tokens_per_second = Histogram(
-            PROMETHEUS_COMPLETION_TOKENS_PER_SECOND,
-            PROMETHEUS_COMPLETION_TOKENS_PER_SECOND_DESC,
-            labelnames=["model", "streaming"],
-            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
-        )
-
-        # Throughput metrics - total tokens (including prompt processing)
-        self.total_tokens_per_second = Histogram(
-            PROMETHEUS_TOTAL_TOKENS_PER_SECOND,
-            PROMETHEUS_TOTAL_TOKENS_PER_SECOND_DESC,
-            labelnames=["model", "streaming"],
-            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
-        )
-
-        # First token delay (streaming only)
-        self.first_token_delay = Histogram(
-            PROMETHEUS_FIRST_TOKEN_DELAY,
-            PROMETHEUS_FIRST_TOKEN_DELAY_DESC,
-            labelnames=["model"],
-            buckets=[10, 50, 100, 200, 500, 1000, 2000, 5000],
-        )
-
-        # Chunk delay metrics (streaming only)
-        self.chunk_delay = Histogram(
-            PROMETHEUS_CHUNK_DELAY,
-            PROMETHEUS_CHUNK_DELAY_DESC,
-            labelnames=["model"],
-            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
-        )
-
-        # Chunk duration metrics
-        self.chunk_duration = Histogram(
-            PROMETHEUS_CHUNK_DURATION,
-            PROMETHEUS_CHUNK_DURATION_DESC,
-            labelnames=["model"],
-            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
-        )
-
-        # Error rate
-        self.error_total = Counter(
-            PROMETHEUS_ERROR_TOTAL,
-            PROMETHEUS_ERROR_TOTAL_DESC,
-            labelnames=["model", "streaming", "error_type"],
-        )
-
-        # Chunk count for streaming
-        self.chunk_count = Histogram(
-            PROMETHEUS_CHUNK_COUNT,
-            PROMETHEUS_CHUNK_COUNT_DESC,
-            labelnames=["model"],
-            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
-        )
-
-    def on_start(self, span, parent_context=None):
-        pass
-
-    def on_end(self, span):  # noqa: C901
-        """Record metrics on span end."""
-        if not span.name.startswith(SPAN_PREFIX):
-            return
-
-        attrs = span.attributes or {}
-        model = attrs.get(ATTR_MODEL, "unknown")
-
-        # Skip non-main spans (we no longer use chunk spans)
-        if not span.name.startswith(SPAN_CHAT_COMPLETION):
-            return
-
-        is_streaming = attrs.get(ATTR_STREAMING, False)
-        streaming_label = "streaming" if is_streaming else "non_streaming"
-
-        # Calculate performance metrics first
-        performance_metrics = calculate_performance_metrics(span)
-        # Merge calculated metrics with existing attributes
-        all_attrs = dict(attrs)
-        all_attrs.update(performance_metrics)
-
-        # Duration using calculated metric
-        duration_ms = all_attrs.get(METRIC_TOTAL_DURATION, 0)
-        duration_s = duration_ms / 1000 if duration_ms > 0 else 0
-        status = "success" if span.status.status_code == StatusCode.OK else "error"
-
-        self.completion_duration.labels(
-            model=model, streaming=streaming_label, status=status
-        ).observe(duration_s)
-
-        # Error tracking
-        if span.status.status_code == StatusCode.ERROR:
-            error_type = (
-                type(span.status.description).__name__
-                if span.status.description
-                else "unknown"
-            )
-            self.error_total.labels(
-                model=model, streaming=streaming_label, error_type=error_type
-            ).inc()
-            return
-
-        # Token metrics
-        prompt_tokens = all_attrs.get(ATTR_PROMPT_TOKENS, 0)
-        completion_tokens = all_attrs.get(ATTR_COMPLETION_TOKENS, 0)
-
-        if prompt_tokens > 0:
-            self.token_count.labels(
-                model=model, streaming=streaming_label, token_type="prompt"
-            ).observe(prompt_tokens)
-
-        if completion_tokens > 0:
-            self.token_count.labels(
-                model=model, streaming=streaming_label, token_type="completion"
-            ).observe(completion_tokens)
-
-        # Throughput metrics using calculated metrics
-        completion_tps = all_attrs.get(METRIC_TOKENS_PER_SECOND, 0)
-        if completion_tps > 0:
-            self.completion_tokens_per_second.labels(
-                model=model, streaming=streaming_label
-            ).observe(completion_tps)
-
-        total_tps = all_attrs.get(METRIC_TOTAL_TOKENS_PER_SECOND, 0)
-        if total_tps > 0:
-            self.total_tokens_per_second.labels(
-                model=model, streaming=streaming_label
-            ).observe(total_tps)
-
-        # Streaming-specific metrics
-        if is_streaming:
-            # Chunk count
-            chunk_count = all_attrs.get(ATTR_CHUNK_COUNT, 0)
-            if chunk_count > 0:
-                self.chunk_count.labels(model=model).observe(chunk_count)
-
-            # First token delay
-            first_token_delay_ms = all_attrs.get(METRIC_FIRST_TOKEN_DELAY, 0)
-            if first_token_delay_ms > 0:
-                self.first_token_delay.labels(model=model).observe(first_token_delay_ms)
-
-            # Average chunk delay
-            chunk_delay_ms = all_attrs.get(METRIC_CHUNK_DELAY, 0)
-            if chunk_delay_ms > 0:
-                self.chunk_delay.labels(model=model).observe(chunk_delay_ms)
-
-    def shutdown(self):
-        pass
-
-    def force_flush(self, timeout_millis: int = 30000):
-        return True
-
-
-class ErrorAwareSampler(Sampler):
-    """Sampler that forces sampling on errors."""
-
-    attr_force_sample = ATTR_FORCE_SAMPLE
-
-    def __init__(self, base_sampler: Sampler):
-        self.base_sampler = base_sampler
-
-    def should_sample(
-        self,
-        parent_context,
-        trace_id,
-        name,
-        kind=None,
-        attributes=None,
-        links=None,
-        trace_state=None,
-    ):
-        # Force sample if error attribute is set
-        if attributes and attributes.get(self.attr_force_sample):
-            return SamplingResult(
-                decision=Decision.RECORD_AND_SAMPLE,
-                attributes=attributes,
-                trace_state=trace_state,
-            )
-
-        # Use base sampler otherwise
-        return self.base_sampler.should_sample(
-            parent_context, trace_id, name, kind, attributes, links, trace_state
-        )
-
-    def get_description(self):
-        return f"ErrorAwareSampler(base={self.base_sampler})"
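Review note on the sampler above: it forces sampling when the force-sample attribute is present in the attributes passed to `should_sample`. As far as I can tell, the OpenTelemetry SDK consults the sampler when a span is started, so an attribute set later through `span.set_attribute` (as the `except` branch does) would not be visible to that decision. The sketch below illustrates the delegation pattern with dependency-free stand-ins; `AlwaysOffStandIn` and `ForceOnAttribute` are hypothetical names and are not the OpenTelemetry API.

```python
# Dependency-free sketch of "force-sample on attribute, otherwise delegate".
# These classes are illustrative stand-ins, not OpenTelemetry types.

FORCE_SAMPLE_KEY = "slm.force_sample"  # same value as ATTR_FORCE_SAMPLE


class AlwaysOffStandIn:
    """Stand-in base sampler that never samples."""

    def should_sample(self, name, attributes=None):
        return False


class ForceOnAttribute:
    """Sample when the force flag is in the start attributes, else delegate."""

    def __init__(self, base_sampler, key=FORCE_SAMPLE_KEY):
        self.base_sampler = base_sampler
        self.key = key

    def should_sample(self, name, attributes=None):
        if attributes and attributes.get(self.key):
            return True
        return self.base_sampler.should_sample(name, attributes)


sampler = ForceOnAttribute(AlwaysOffStandIn())

# Flag supplied in the attributes at span start: the span is sampled.
forced = sampler.should_sample("slm.chat_completion", {FORCE_SAMPLE_KEY: True})

# Flag absent at span start: the decision falls through to the base sampler.
delegated = sampler.should_sample("slm.chat_completion", {"slm.streaming": True})

print(forced, delegated)
```

If error spans must always be exported, the flag would need to be known at span start, or the decision would have to move to a later stage such as a span processor or collector-side tail sampling.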
diff --git a/slm_server/utils/__init__.py b/slm_server/utils/__init__.py
new file mode 100644
index 0000000..015f0b2
--- /dev/null
+++ b/slm_server/utils/__init__.py
@@ -0,0 +1,6 @@
+# Re-export all functions and classes for backward compatibility
+from .constants import *  # noqa: F403, F401
+from .metrics import *  # noqa: F403, F401
+from .processors import *  # noqa: F403, F401
+from .sampler import *  # noqa: F403, F401
+from .spans import *  # noqa: F403, F401
diff --git a/slm_server/utils/constants.py b/slm_server/utils/constants.py
new file mode 100644
index 0000000..6506887
--- /dev/null
+++ b/slm_server/utils/constants.py
@@ -0,0 +1,115 @@
+# Constants for span naming and attributes
+MODEL_NAME = "llama-cpp"
+SPAN_PREFIX = "slm"
+
+# Span names
+SPAN_CHAT_COMPLETION = f"{SPAN_PREFIX}.chat_completion"
+SPAN_EMBEDDING = f"{SPAN_PREFIX}.embedding"
+
+# Event names
+EVENT_CHUNK_GENERATED = f"{SPAN_PREFIX}.chunk_generated"
+
+# Event attribute names
+EVENT_ATTR_CHUNK_SIZE = f"{SPAN_PREFIX}.chunk_size"
+EVENT_ATTR_CHUNK_CONTENT_SIZE = f"{SPAN_PREFIX}.chunk_content_size"
+
+# Attribute names
+ATTR_MODEL = f"{SPAN_PREFIX}.model"
+ATTR_STREAMING = f"{SPAN_PREFIX}.streaming"
+ATTR_MAX_TOKENS = f"{SPAN_PREFIX}.max_tokens"
+ATTR_TEMPERATURE = f"{SPAN_PREFIX}.temperature"
+ATTR_INPUT_MESSAGES = f"{SPAN_PREFIX}.input.messages"
+ATTR_INPUT_CONTENT_LENGTH = f"{SPAN_PREFIX}.input.content_length"
+ATTR_OUTPUT_CONTENT_LENGTH = f"{SPAN_PREFIX}.output.content_length"
+ATTR_CHUNK_COUNT = f"{SPAN_PREFIX}.output.chunk_count"
+ATTR_CHUNK_SIZE = f"{SPAN_PREFIX}.chunk.size"
+ATTR_PROMPT_TOKENS = f"{SPAN_PREFIX}.usage.prompt_tokens"
+ATTR_COMPLETION_TOKENS = f"{SPAN_PREFIX}.usage.completion_tokens"
+ATTR_TOTAL_TOKENS = f"{SPAN_PREFIX}.usage.total_tokens"
+ATTR_FORCE_SAMPLE = f"{SPAN_PREFIX}.force_sample"
+
+# Embedding attributes
+ATTR_INPUT_COUNT = f"{SPAN_PREFIX}.input.count"
+ATTR_OUTPUT_COUNT = f"{SPAN_PREFIX}.output.count"
+
+# Performance timing attributes
+ATTR_FIRST_TOKEN_DELAY = f"{SPAN_PREFIX}.timing.first_token_delay_ms"
+ATTR_TOKENS_PER_SECOND = f"{SPAN_PREFIX}.timing.completion_tokens_per_second"
+ATTR_TOTAL_TOKENS_PER_SECOND = f"{SPAN_PREFIX}.timing.total_tokens_per_second"
+ATTR_CHUNK_DELAY = f"{SPAN_PREFIX}.timing.chunk_delay_ms"
+ATTR_CHUNK_DURATION = f"{SPAN_PREFIX}.timing.chunk_duration_ms"
+ATTR_TOTAL_DURATION = f"{SPAN_PREFIX}.timing.total_duration_ms"
+ATTR_CHUNK_CONTENT_SIZE = f"{SPAN_PREFIX}.chunk.content_size"
+
+# Calculated metric names (used as keys in calculate_performance_metrics)
+METRIC_TOTAL_DURATION = ATTR_TOTAL_DURATION
+METRIC_TOKENS_PER_SECOND = ATTR_TOKENS_PER_SECOND
+METRIC_TOTAL_TOKENS_PER_SECOND = ATTR_TOTAL_TOKENS_PER_SECOND
+METRIC_CHUNK_DELAY = ATTR_CHUNK_DELAY
+METRIC_FIRST_TOKEN_DELAY = ATTR_FIRST_TOKEN_DELAY
+METRIC_AVG_CHUNK_SIZE = f"{SPAN_PREFIX}.metrics.avg_chunk_size"
+METRIC_AVG_CHUNK_CONTENT_SIZE = f"{SPAN_PREFIX}.metrics.avg_chunk_content_size"
+METRIC_MAX_CHUNK_SIZE = f"{SPAN_PREFIX}.metrics.max_chunk_size"
+METRIC_MIN_CHUNK_SIZE = f"{SPAN_PREFIX}.metrics.min_chunk_size"
+METRIC_CHUNKS_WITH_CONTENT = f"{SPAN_PREFIX}.metrics.chunks_with_content"
+METRIC_EMPTY_CHUNKS = f"{SPAN_PREFIX}.metrics.empty_chunks"
+
+# Embedding metrics
+METRIC_EMBEDDINGS_PER_SECOND = f"{SPAN_PREFIX}.metrics.embeddings_per_second"
+
+# Log data keys (for consistent logging format)
+LOG_KEY_MAX_TOKENS = "max_tokens"
+LOG_KEY_TEMPERATURE = "temperature"
+LOG_KEY_INPUT_MESSAGES = "input_messages"
+LOG_KEY_INPUT_CONTENT_LENGTH = "input_content_length"
+LOG_KEY_DURATION_MS = "duration_ms"
+LOG_KEY_OUTPUT_CONTENT_LENGTH = "output_content_length"
+LOG_KEY_TOTAL_TOKENS = "total_tokens"
+LOG_KEY_COMPLETION_TOKENS = "completion_tokens"
+LOG_KEY_COMPLETION_TOKENS_PER_SECOND = "completion_tokens_per_second"
+LOG_KEY_TOTAL_TOKENS_PER_SECOND = "total_tokens_per_second"
+LOG_KEY_CHUNK_COUNT = "chunk_count"
+LOG_KEY_AVG_CHUNK_DELAY_MS = "avg_chunk_delay_ms"
+LOG_KEY_FIRST_TOKEN_DELAY_MS = "first_token_delay_ms"
+LOG_KEY_AVG_CHUNK_SIZE = "avg_chunk_size"
+LOG_KEY_AVG_CHUNK_CONTENT_SIZE = "avg_chunk_content_size"
+LOG_KEY_CHUNKS_WITH_CONTENT = "chunks_with_content"
+LOG_KEY_EMPTY_CHUNKS = "empty_chunks"
+
+# Embedding log keys
+LOG_KEY_INPUT_COUNT = "input_count"
+LOG_KEY_OUTPUT_COUNT = "output_count"
+LOG_KEY_EMBEDDINGS_PER_SECOND = "embeddings_per_second"
+
+# Prometheus metric names and descriptions
+PROMETHEUS_COMPLETION_DURATION = "slm_completion_duration_seconds"
+PROMETHEUS_COMPLETION_DURATION_DESC = "SLM completion duration in seconds"
+PROMETHEUS_TOKEN_COUNT = "slm_tokens_total"
+PROMETHEUS_TOKEN_COUNT_DESC = "Total tokens processed"
+PROMETHEUS_COMPLETION_TOKENS_PER_SECOND = "slm_completion_tokens_per_second"
+PROMETHEUS_COMPLETION_TOKENS_PER_SECOND_DESC = (
+    "Completion token generation rate (tokens/sec)"
+)
+PROMETHEUS_TOTAL_TOKENS_PER_SECOND = "slm_total_tokens_per_second"
+PROMETHEUS_TOTAL_TOKENS_PER_SECOND_DESC = (
+    "Total token throughput including prompt processing (tokens/sec)"
+)
+PROMETHEUS_FIRST_TOKEN_DELAY = "slm_first_token_delay_ms"
+PROMETHEUS_FIRST_TOKEN_DELAY_DESC = "Time to first token in milliseconds (streaming)"
+PROMETHEUS_CHUNK_DELAY = "slm_chunk_delay_ms"
+PROMETHEUS_CHUNK_DELAY_DESC = "Average chunk delay in milliseconds (streaming)"
+PROMETHEUS_CHUNK_DURATION = "slm_chunk_duration_ms"
+PROMETHEUS_CHUNK_DURATION_DESC = "Individual chunk processing duration in milliseconds"
+PROMETHEUS_ERROR_TOTAL = "slm_errors_total"
+PROMETHEUS_ERROR_TOTAL_DESC = "Total SLM errors"
+PROMETHEUS_CHUNK_COUNT = "slm_chunks_total"
+PROMETHEUS_CHUNK_COUNT_DESC = "Number of chunks in streaming response"
+
+# Embedding metrics
+PROMETHEUS_EMBEDDINGS_PER_SECOND = "slm_embeddings_per_second"
+PROMETHEUS_EMBEDDINGS_PER_SECOND_DESC = "Embeddings generated per second"
+
+# Log message templates
+LOG_MSG_STARTING_CALL = "[SLM] starting {}: {}"
+LOG_MSG_COMPLETED_CALL = "[SLM] completed {}: {}"
+LOG_MSG_FAILED_CALL = "[SLM] failed: {}"
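Review note on the constants above: span attribute names are namespaced under `SPAN_PREFIX`, while log keys are short, un-prefixed names. A minimal sketch of how the processors appear to combine them, with the four constants copied from this file and the attribute value (128) made up for illustration:

```python
# Constants copied from slm_server/utils/constants.py in this diff.
SPAN_PREFIX = "slm"
ATTR_MAX_TOKENS = f"{SPAN_PREFIX}.max_tokens"
LOG_KEY_MAX_TOKENS = "max_tokens"
LOG_MSG_STARTING_CALL = "[SLM] starting {}: {}"

# Hypothetical span attributes; a real span would carry more keys.
span_attributes = {ATTR_MAX_TOKENS: 128}

# Translate the namespaced attribute into the short log key, then format.
log_data = {LOG_KEY_MAX_TOKENS: span_attributes.get(ATTR_MAX_TOKENS, 0)}
message = LOG_MSG_STARTING_CALL.format("streaming", log_data)
print(message)  # [SLM] starting streaming: {'max_tokens': 128}
```

The log payload is the `str()` of a dict, so it is human-readable but not valid JSON (single quotes).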
diff --git a/slm_server/utils/metrics.py b/slm_server/utils/metrics.py
new file mode 100644
index 0000000..d9dcc07
--- /dev/null
+++ b/slm_server/utils/metrics.py
@@ -0,0 +1,126 @@
+from opentelemetry.sdk.trace import Span
+
+from .constants import (
+    ATTR_CHUNK_COUNT,
+    ATTR_COMPLETION_TOKENS,
+    ATTR_OUTPUT_COUNT,
+    ATTR_STREAMING,
+    ATTR_TOTAL_TOKENS,
+    EVENT_ATTR_CHUNK_CONTENT_SIZE,
+    EVENT_ATTR_CHUNK_SIZE,
+    EVENT_CHUNK_GENERATED,
+    METRIC_AVG_CHUNK_CONTENT_SIZE,
+    METRIC_AVG_CHUNK_SIZE,
+    METRIC_CHUNK_DELAY,
+    METRIC_CHUNKS_WITH_CONTENT,
+    METRIC_EMBEDDINGS_PER_SECOND,
+    METRIC_EMPTY_CHUNKS,
+    METRIC_FIRST_TOKEN_DELAY,
+    METRIC_MAX_CHUNK_SIZE,
+    METRIC_MIN_CHUNK_SIZE,
+    METRIC_TOKENS_PER_SECOND,
+    METRIC_TOTAL_DURATION,
+    METRIC_TOTAL_TOKENS_PER_SECOND,
+    SPAN_EMBEDDING,
+)
+
+
+def calculate_performance_metrics(span: Span):  # noqa: C901
+    """Calculate performance metrics for a span after it has ended."""
+    if not (span.end_time and span.start_time):
+        return {}
+
+    attrs = span.attributes or {}
+    duration_ms = (span.end_time - span.start_time) / 1_000_000
+
+    # Get token counts
+    total_tokens = attrs.get(ATTR_TOTAL_TOKENS, 0)
+    completion_tokens = attrs.get(ATTR_COMPLETION_TOKENS, 0)
+
+    metrics = {
+        METRIC_TOTAL_DURATION: duration_ms,
+        METRIC_TOKENS_PER_SECOND: 0,
+        METRIC_TOTAL_TOKENS_PER_SECOND: 0,
+    }
+
+    # Calculate tokens per second
+    if duration_ms > 0:
+        duration_s = duration_ms / 1000
+        if completion_tokens > 0:
+            metrics[METRIC_TOKENS_PER_SECOND] = completion_tokens / duration_s
+        if total_tokens > 0:
+            metrics[METRIC_TOTAL_TOKENS_PER_SECOND] = total_tokens / duration_s
+
+    # Calculate streaming-specific metrics
+    is_streaming = attrs.get(ATTR_STREAMING, False)
+    if is_streaming:
+        chunk_count = attrs.get(ATTR_CHUNK_COUNT, 0)
+        if chunk_count > 0 and duration_ms > 0:
+            metrics[METRIC_CHUNK_DELAY] = duration_ms / chunk_count
+
+        # Calculate chunk metrics from events
+        chunk_metrics = _calculate_chunk_metrics_from_events(span.events)
+        metrics.update(chunk_metrics)
+
+        # First token delay - timestamp of the first chunk event (content not checked)
+        first_content_event = None
+        for event in span.events:
+            if event.name == EVENT_CHUNK_GENERATED:
+                first_content_event = event
+                break
+
+        if first_content_event:
+            first_token_delay = first_content_event.timestamp - span.start_time
+            metrics[METRIC_FIRST_TOKEN_DELAY] = first_token_delay / 1_000_000
+
+    elif span.name == SPAN_EMBEDDING:
+        output_count = attrs.get(ATTR_OUTPUT_COUNT, 0)
+        if output_count > 0 and duration_ms > 0:
+            metrics[METRIC_EMBEDDINGS_PER_SECOND] = output_count / (duration_ms / 1000)
+
+    return metrics
+
+
+def _calculate_chunk_metrics_from_events(events):
+    """Calculate chunk-related metrics from span events."""
+    chunk_events = [e for e in events if e.name == EVENT_CHUNK_GENERATED]
+
+    if not chunk_events:
+        return {}
+
+    chunk_sizes = []
+    chunk_content_sizes = []
+    chunks_with_content = 0
+    empty_chunks = 0
+
+    for event in chunk_events:
+        attrs = event.attributes or {}
+
+        chunk_size = attrs.get(EVENT_ATTR_CHUNK_SIZE, 0)
+        chunk_content_size = attrs.get(EVENT_ATTR_CHUNK_CONTENT_SIZE, 0)
+        # chunk_content = attrs.get(EVENT_ATTR_CHUNK_CONTENT, "")
+
+        chunk_sizes.append(chunk_size)
+        chunk_content_sizes.append(chunk_content_size)
+
+        if chunk_content_size:
+            chunks_with_content += 1
+        else:
+            empty_chunks += 1
+
+    metrics = {}
+
+    if chunk_sizes:
+        metrics[METRIC_AVG_CHUNK_SIZE] = sum(chunk_sizes) / len(chunk_sizes)
+        metrics[METRIC_MAX_CHUNK_SIZE] = max(chunk_sizes)
+        metrics[METRIC_MIN_CHUNK_SIZE] = min(chunk_sizes)
+
+    if chunk_content_sizes:
+        metrics[METRIC_AVG_CHUNK_CONTENT_SIZE] = sum(chunk_content_sizes) / len(
+            chunk_content_sizes
+        )
+
+    metrics[METRIC_CHUNKS_WITH_CONTENT] = chunks_with_content
+    metrics[METRIC_EMPTY_CHUNKS] = empty_chunks
+
+    return metrics
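Review note on `calculate_performance_metrics`: the arithmetic can be checked on plain numbers. The sketch below mirrors the formulas under two simplifying assumptions: the chunk count is taken from the number of chunk timestamps (the real code reads it from a span attribute), and the dictionary keys are shortened. `throughput_metrics` is a hypothetical helper, not part of the module.

```python
def throughput_metrics(
    start_ns, end_ns, completion_tokens, total_tokens, chunk_timestamps_ns=()
):
    """Recompute duration, throughput, and streaming delays from raw numbers."""
    # Span timestamps are in nanoseconds; convert to milliseconds.
    duration_ms = (end_ns - start_ns) / 1_000_000
    metrics = {
        "total_duration_ms": duration_ms,
        "completion_tokens_per_second": 0,
        "total_tokens_per_second": 0,
    }

    if duration_ms > 0:
        duration_s = duration_ms / 1000
        if completion_tokens > 0:
            metrics["completion_tokens_per_second"] = completion_tokens / duration_s
        if total_tokens > 0:
            metrics["total_tokens_per_second"] = total_tokens / duration_s

    if chunk_timestamps_ns:
        if duration_ms > 0:
            # Average delay: total duration spread evenly over all chunks.
            metrics["chunk_delay_ms"] = duration_ms / len(chunk_timestamps_ns)
        # First token delay: first chunk timestamp relative to span start.
        metrics["first_token_delay_ms"] = (
            chunk_timestamps_ns[0] - start_ns
        ) / 1_000_000

    return metrics


# A 2-second streaming call: 50 completion tokens, 80 total, 4 chunks.
example = throughput_metrics(
    start_ns=0,
    end_ns=2_000_000_000,
    completion_tokens=50,
    total_tokens=80,
    chunk_timestamps_ns=[250_000_000, 1_000_000_000, 1_500_000_000, 2_000_000_000],
)
print(example)
```

Note that both throughput figures divide by the full span duration, so the completion rate includes prompt processing time and understates the pure generation speed.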
diff --git a/slm_server/utils/processors.py b/slm_server/utils/processors.py
new file mode 100644
index 0000000..63597ee
--- /dev/null
+++ b/slm_server/utils/processors.py
@@ -0,0 +1,391 @@
+import logging
+
+from opentelemetry.sdk.trace.export import SpanProcessor
+from opentelemetry.trace import StatusCode
+from prometheus_client import Counter, Histogram
+
+from .constants import (
+    ATTR_CHUNK_COUNT,
+    ATTR_COMPLETION_TOKENS,
+    ATTR_INPUT_CONTENT_LENGTH,
+    ATTR_INPUT_COUNT,
+    ATTR_INPUT_MESSAGES,
+    ATTR_MAX_TOKENS,
+    ATTR_MODEL,
+    ATTR_OUTPUT_CONTENT_LENGTH,
+    ATTR_OUTPUT_COUNT,
+    ATTR_PROMPT_TOKENS,
+    ATTR_STREAMING,
+    ATTR_TEMPERATURE,
+    ATTR_TOTAL_TOKENS,
+    LOG_KEY_AVG_CHUNK_CONTENT_SIZE,
+    LOG_KEY_AVG_CHUNK_DELAY_MS,
+    LOG_KEY_AVG_CHUNK_SIZE,
+    LOG_KEY_CHUNK_COUNT,
+    LOG_KEY_CHUNKS_WITH_CONTENT,
+    LOG_KEY_COMPLETION_TOKENS,
+    LOG_KEY_COMPLETION_TOKENS_PER_SECOND,
+    LOG_KEY_DURATION_MS,
+    LOG_KEY_EMBEDDINGS_PER_SECOND,
+    LOG_KEY_EMPTY_CHUNKS,
+    LOG_KEY_FIRST_TOKEN_DELAY_MS,
+    LOG_KEY_INPUT_CONTENT_LENGTH,
+    LOG_KEY_INPUT_COUNT,
+    LOG_KEY_INPUT_MESSAGES,
+    LOG_KEY_MAX_TOKENS,
+    LOG_KEY_OUTPUT_CONTENT_LENGTH,
+    LOG_KEY_OUTPUT_COUNT,
+    LOG_KEY_TEMPERATURE,
+    LOG_KEY_TOTAL_TOKENS,
+    LOG_KEY_TOTAL_TOKENS_PER_SECOND,
+    LOG_MSG_COMPLETED_CALL,
+    LOG_MSG_FAILED_CALL,
+    LOG_MSG_STARTING_CALL,
+    METRIC_AVG_CHUNK_CONTENT_SIZE,
+    METRIC_AVG_CHUNK_SIZE,
+    METRIC_CHUNK_DELAY,
+    METRIC_CHUNKS_WITH_CONTENT,
+    METRIC_EMBEDDINGS_PER_SECOND,
+    METRIC_EMPTY_CHUNKS,
+    METRIC_FIRST_TOKEN_DELAY,
+    METRIC_TOKENS_PER_SECOND,
+    METRIC_TOTAL_DURATION,
+    METRIC_TOTAL_TOKENS_PER_SECOND,
+    PROMETHEUS_CHUNK_COUNT,
+    PROMETHEUS_CHUNK_COUNT_DESC,
+    PROMETHEUS_CHUNK_DELAY,
+    PROMETHEUS_CHUNK_DELAY_DESC,
+    PROMETHEUS_CHUNK_DURATION,
+    PROMETHEUS_CHUNK_DURATION_DESC,
+    PROMETHEUS_COMPLETION_DURATION,
+    PROMETHEUS_COMPLETION_DURATION_DESC,
+    PROMETHEUS_COMPLETION_TOKENS_PER_SECOND,
+    PROMETHEUS_COMPLETION_TOKENS_PER_SECOND_DESC,
+    PROMETHEUS_EMBEDDINGS_PER_SECOND,
+    PROMETHEUS_EMBEDDINGS_PER_SECOND_DESC,
+    PROMETHEUS_ERROR_TOTAL,
+    PROMETHEUS_ERROR_TOTAL_DESC,
+    PROMETHEUS_FIRST_TOKEN_DELAY,
+    PROMETHEUS_FIRST_TOKEN_DELAY_DESC,
+    PROMETHEUS_TOKEN_COUNT,
+    PROMETHEUS_TOKEN_COUNT_DESC,
+    PROMETHEUS_TOTAL_TOKENS_PER_SECOND,
+    PROMETHEUS_TOTAL_TOKENS_PER_SECOND_DESC,
+    SPAN_CHAT_COMPLETION,
+    SPAN_EMBEDDING,
+    SPAN_PREFIX,
+)
+from .metrics import calculate_performance_metrics
+
+
+class SLMLoggingSpanProcessor(SpanProcessor):
+    """Span processor for SLM logging using constants."""
+
+    def __init__(self):
+        self.logger = logging.getLogger(__name__)
+
+    def on_start(self, span, parent_context=None):
+        """Log span start."""
+        if not span.name.startswith(SPAN_PREFIX):
+            return
+
+        attrs = span.attributes or {}
+        log_data = {}
+        mode = "unknown"
+
+        if span.name.startswith(SPAN_CHAT_COMPLETION):
+            is_streaming = attrs.get(ATTR_STREAMING, False)
+            log_data = {
+                LOG_KEY_MAX_TOKENS: attrs.get(ATTR_MAX_TOKENS, 0),
+                LOG_KEY_TEMPERATURE: attrs.get(ATTR_TEMPERATURE, 0.0),
+                LOG_KEY_INPUT_MESSAGES: attrs.get(ATTR_INPUT_MESSAGES, 0),
+                LOG_KEY_INPUT_CONTENT_LENGTH: attrs.get(ATTR_INPUT_CONTENT_LENGTH, 0),
+            }
+            mode = "streaming" if is_streaming else "non-streaming"
+
+        elif span.name == SPAN_EMBEDDING:
+            log_data = {
+                LOG_KEY_INPUT_COUNT: attrs.get(ATTR_INPUT_COUNT, 0),
+                LOG_KEY_INPUT_CONTENT_LENGTH: attrs.get(ATTR_INPUT_CONTENT_LENGTH, 0),
+            }
+            mode = "embedding"
+
+        self.logger.info(LOG_MSG_STARTING_CALL.format(mode, log_data))
+
+    def on_end(self, span):
+        """Log span completion or error."""
+        if not span.name.startswith(SPAN_PREFIX):
+            return
+
+        attrs = span.attributes or {}
+
+        # Use native error status
+        if span.status.status_code == StatusCode.ERROR:
+            self.logger.error(LOG_MSG_FAILED_CALL.format(span.status.description))
+            return
+
+        # Calculate performance metrics (but don't try to set them on ended span)
+        performance_metrics = calculate_performance_metrics(span)
+        # Merge calculated metrics with existing attributes for logging
+        attrs = dict(attrs)
+        attrs.update(performance_metrics)
+
+        log_data = {
+            LOG_KEY_DURATION_MS: round(attrs.get(METRIC_TOTAL_DURATION, 0), 2),
+            LOG_KEY_TOTAL_TOKENS: attrs.get(ATTR_TOTAL_TOKENS, 0),
+            LOG_KEY_TOTAL_TOKENS_PER_SECOND: round(
+                attrs.get(METRIC_TOTAL_TOKENS_PER_SECOND, 0), 2
+            ),
+        }
+
+        mode = "unknown"
+
+        if span.name.startswith(SPAN_CHAT_COMPLETION):
+            is_streaming = attrs.get(ATTR_STREAMING, False)
+            mode = "streaming" if is_streaming else "non-streaming"
+            log_data.update(
+                {
+                    LOG_KEY_OUTPUT_CONTENT_LENGTH: attrs.get(
+                        ATTR_OUTPUT_CONTENT_LENGTH, 0
+                    ),
+                    LOG_KEY_COMPLETION_TOKENS: attrs.get(ATTR_COMPLETION_TOKENS, 0),
+                    LOG_KEY_COMPLETION_TOKENS_PER_SECOND: round(
+                        attrs.get(METRIC_TOKENS_PER_SECOND, 0), 2
+                    ),
+                }
+            )
+            if is_streaming:
+                log_data.update(
+                    {
+                        LOG_KEY_CHUNK_COUNT: attrs.get(ATTR_CHUNK_COUNT, 0),
+                        LOG_KEY_AVG_CHUNK_DELAY_MS: round(
+                            attrs.get(METRIC_CHUNK_DELAY, 0), 2
+                        ),
+                        LOG_KEY_FIRST_TOKEN_DELAY_MS: round(
+                            attrs.get(METRIC_FIRST_TOKEN_DELAY, 0), 2
+                        ),
+                        LOG_KEY_AVG_CHUNK_SIZE: round(
+                            attrs.get(METRIC_AVG_CHUNK_SIZE, 0), 2
+                        ),
+                        LOG_KEY_AVG_CHUNK_CONTENT_SIZE: round(
+                            attrs.get(METRIC_AVG_CHUNK_CONTENT_SIZE, 0), 2
+                        ),
+                        LOG_KEY_CHUNKS_WITH_CONTENT: attrs.get(
+                            METRIC_CHUNKS_WITH_CONTENT, 0
+                        ),
+                        LOG_KEY_EMPTY_CHUNKS: attrs.get(METRIC_EMPTY_CHUNKS, 0),
+                    }
+                )
+
+        elif span.name == SPAN_EMBEDDING:
+            mode = "embedding"
+            log_data.update(
+                {
+                    LOG_KEY_INPUT_COUNT: attrs.get(ATTR_INPUT_COUNT, 0),
+                    LOG_KEY_OUTPUT_COUNT: attrs.get(ATTR_OUTPUT_COUNT, 0),
+                    LOG_KEY_EMBEDDINGS_PER_SECOND: round(
+                        attrs.get(METRIC_EMBEDDINGS_PER_SECOND, 0), 2
+                    ),
+                }
+            )
+
+        self.logger.info(LOG_MSG_COMPLETED_CALL.format(mode, log_data))
+
+    def shutdown(self):
+        pass
+
+    def force_flush(self, timeout_millis: int = 30000):
+        return True
+
+
+class SLMMetricsSpanProcessor(SpanProcessor):
+    """Span processor for SLM metrics using constants."""
+
+    def __init__(self):
+        # Duration metrics
+        self.completion_duration = Histogram(
+            PROMETHEUS_COMPLETION_DURATION,
+            PROMETHEUS_COMPLETION_DURATION_DESC,
+            labelnames=["model", "streaming", "status"],
+            buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0, 50.0],
+        )
+
+        # Token metrics
+        self.token_count = Histogram(
+            PROMETHEUS_TOKEN_COUNT,
+            PROMETHEUS_TOKEN_COUNT_DESC,
+            labelnames=["model", "streaming", "token_type"],
+            buckets=[10, 50, 100, 500, 1000, 2000, 5000, 10000],
+        )
+
+        # Throughput metrics - completion tokens (generation rate)
+        self.completion_tokens_per_second = Histogram(
+            PROMETHEUS_COMPLETION_TOKENS_PER_SECOND,
+            PROMETHEUS_COMPLETION_TOKENS_PER_SECOND_DESC,
+            labelnames=["model", "streaming"],
+            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
+        )
+
+        # Throughput metrics - total tokens (including prompt processing)
+        self.total_tokens_per_second = Histogram(
+            PROMETHEUS_TOTAL_TOKENS_PER_SECOND,
+            PROMETHEUS_TOTAL_TOKENS_PER_SECOND_DESC,
+            labelnames=["model", "streaming"],
+            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
+        )
+
+        # First token delay (streaming only)
+        self.first_token_delay = Histogram(
+            PROMETHEUS_FIRST_TOKEN_DELAY,
+            PROMETHEUS_FIRST_TOKEN_DELAY_DESC,
+            labelnames=["model"],
+            buckets=[10, 50, 100, 200, 500, 1000, 2000, 5000],
+        )
+
+        # Chunk delay metrics (streaming only)
+        self.chunk_delay = Histogram(
+            PROMETHEUS_CHUNK_DELAY,
+            PROMETHEUS_CHUNK_DELAY_DESC,
+            labelnames=["model"],
+            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
+        )
+
+        # Chunk duration metrics
+        self.chunk_duration = Histogram(
+            PROMETHEUS_CHUNK_DURATION,
+            PROMETHEUS_CHUNK_DURATION_DESC,
+            labelnames=["model"],
+            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
+        )
+
+        # Error rate
+        self.error_total = Counter(
+            PROMETHEUS_ERROR_TOTAL,
+            PROMETHEUS_ERROR_TOTAL_DESC,
+            labelnames=["model", "streaming"],
+        )
+
+        # Chunk count for streaming
+        self.chunk_count = Histogram(
+            PROMETHEUS_CHUNK_COUNT,
+            PROMETHEUS_CHUNK_COUNT_DESC,
+            labelnames=["model"],
+            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
+        )
+
+        # Embedding metrics
+        self.embeddings_per_second = Histogram(
+            PROMETHEUS_EMBEDDINGS_PER_SECOND,
+            PROMETHEUS_EMBEDDINGS_PER_SECOND_DESC,
+            labelnames=["model"],
+            buckets=[1, 5, 10, 20, 50, 100, 200, 500],
+        )
+
+    def on_start(self, span, parent_context=None):
+        pass
+
+    def on_end(self, span):  # noqa: C901
+        """Record metrics on span end."""
+        if not span.name.startswith(SPAN_PREFIX):
+            return
+
+        attrs = span.attributes or {}
+        model = attrs.get(ATTR_MODEL, "unknown")
+        status = "error" if span.status.status_code == StatusCode.ERROR else "success"
+
+        # Calculate performance metrics first
+        performance_metrics = calculate_performance_metrics(span)
+        # Merge calculated metrics with existing attributes
+        all_attrs = dict(attrs)
+        all_attrs.update(performance_metrics)
+
+        duration_ms = all_attrs.get(METRIC_TOTAL_DURATION, 0)
+        duration_s = duration_ms / 1000 if duration_ms > 0 else 0
+
+        if span.name.startswith(SPAN_CHAT_COMPLETION):
+            is_streaming = attrs.get(ATTR_STREAMING, False)
+            streaming_label = "streaming" if is_streaming else "non_streaming"
+
+            self.completion_duration.labels(
+                model=model, streaming=streaming_label, status=status
+            ).observe(duration_s)
+
+            if span.status.status_code == StatusCode.ERROR:
+                self.error_total.labels(
+                    model=model,
+                    streaming=streaming_label,
+                ).inc()
+                return
+
+            prompt_tokens = all_attrs.get(ATTR_PROMPT_TOKENS, 0)
+            completion_tokens = all_attrs.get(ATTR_COMPLETION_TOKENS, 0)
+
+            if prompt_tokens > 0:
+                self.token_count.labels(
+                    model=model, streaming=streaming_label, token_type="prompt"
+                ).observe(prompt_tokens)
+
+            if completion_tokens > 0:
+                self.token_count.labels(
+                    model=model, streaming=streaming_label, token_type="completion"
+                ).observe(completion_tokens)
+
+            completion_tps = all_attrs.get(METRIC_TOKENS_PER_SECOND, 0)
+            if completion_tps > 0:
+                self.completion_tokens_per_second.labels(
+                    model=model, streaming=streaming_label
+                ).observe(completion_tps)
+
+            total_tps = all_attrs.get(METRIC_TOTAL_TOKENS_PER_SECOND, 0)
+            if total_tps > 0:
+                self.total_tokens_per_second.labels(
+                    model=model, streaming=streaming_label
+                ).observe(total_tps)
+
+            if is_streaming:
+                chunk_count = all_attrs.get(ATTR_CHUNK_COUNT, 0)
+                if chunk_count > 0:
+                    self.chunk_count.labels(model=model).observe(chunk_count)
+
+                first_token_delay_ms = all_attrs.get(METRIC_FIRST_TOKEN_DELAY, 0)
+                if first_token_delay_ms > 0:
+                    self.first_token_delay.labels(model=model).observe(
+                        first_token_delay_ms
+                    )
+
+                chunk_delay_ms = all_attrs.get(METRIC_CHUNK_DELAY, 0)
+                if chunk_delay_ms > 0:
+                    self.chunk_delay.labels(model=model).observe(chunk_delay_ms)
+
+        elif span.name == SPAN_EMBEDDING:
+            self.completion_duration.labels(
+                model=model, streaming="embedding", status=status
+            ).observe(duration_s)
+
+            if span.status.status_code == StatusCode.ERROR:
+                self.error_total.labels(model=model, streaming="embedding").inc()
+                return
+
+            prompt_tokens = all_attrs.get(ATTR_PROMPT_TOKENS, 0)
+            if prompt_tokens > 0:
+                self.token_count.labels(
+                    model=model, streaming="embedding", token_type="prompt"
+                ).observe(prompt_tokens)
+
+            total_tps = all_attrs.get(METRIC_TOTAL_TOKENS_PER_SECOND, 0)
+            if total_tps > 0:
+                self.total_tokens_per_second.labels(
+                    model=model, streaming="embedding"
+                ).observe(total_tps)
+
+            embeddings_per_second = all_attrs.get(METRIC_EMBEDDINGS_PER_SECOND, 0)
+            if embeddings_per_second > 0:
+                self.embeddings_per_second.labels(model=model).observe(
+                    embeddings_per_second
+                )
+
+    def shutdown(self):
+        pass
+
+    def force_flush(self, timeout_millis: int = 30000):
+        return True
diff --git a/slm_server/utils/sampler.py b/slm_server/utils/sampler.py
new file mode 100644
index 0000000..5b25c9d
--- /dev/null
+++ b/slm_server/utils/sampler.py
@@ -0,0 +1,38 @@
+from opentelemetry.sdk.trace.sampling import Decision, Sampler, SamplingResult
+
+from .constants import ATTR_FORCE_SAMPLE
+
+
+class ErrorAwareSampler(Sampler):
+    """Sampler that forces sampling on errors."""
+
+    attr_force_sample = ATTR_FORCE_SAMPLE
+
+    def __init__(self, base_sampler: Sampler):
+        self.base_sampler = base_sampler
+
+    def should_sample(
+        self,
+        parent_context,
+        trace_id,
+        name,
+        kind=None,
+        attributes=None,
+        links=None,
+        trace_state=None,
+    ):
+        # Force sample if error attribute is set
+        if attributes and attributes.get(self.attr_force_sample):
+            return SamplingResult(
+                decision=Decision.RECORD_AND_SAMPLE,
+                attributes=attributes,
+                trace_state=trace_state,
+            )
+
+        # Use base sampler otherwise
+        return self.base_sampler.should_sample(
+            parent_context, trace_id, name, kind, attributes, links, trace_state
+        )
+
+    def get_description(self):
+        return f"ErrorAwareSampler(base={self.base_sampler})"
diff --git a/slm_server/utils/spans.py b/slm_server/utils/spans.py
new file mode 100644
index 0000000..77a4e80
--- /dev/null
+++ b/slm_server/utils/spans.py
@@ -0,0 +1,226 @@
+import logging
+import traceback
+from contextlib import contextmanager
+
+from llama_cpp import ChatCompletionStreamResponse
+from opentelemetry import trace
+from opentelemetry.sdk.trace import Span
+from opentelemetry.trace import Status, StatusCode
+
+from llama_cpp.llama_types import (
+    CreateChatCompletionResponse as ChatCompletionResponse,
+    CreateEmbeddingResponse as EmbeddingResponse,
+)
+from slm_server.model import (
+    ChatCompletionRequest,
+    EmbeddingRequest,
+)
+
+from .constants import (
+    ATTR_CHUNK_COUNT,
+    ATTR_COMPLETION_TOKENS,
+    ATTR_FORCE_SAMPLE,
+    ATTR_INPUT_CONTENT_LENGTH,
+    ATTR_INPUT_COUNT,
+    ATTR_INPUT_MESSAGES,
+    ATTR_MAX_TOKENS,
+    ATTR_MODEL,
+    ATTR_OUTPUT_CONTENT_LENGTH,
+    ATTR_OUTPUT_COUNT,
+    ATTR_PROMPT_TOKENS,
+    ATTR_STREAMING,
+    ATTR_TEMPERATURE,
+    ATTR_TOTAL_TOKENS,
+    EVENT_ATTR_CHUNK_CONTENT_SIZE,
+    EVENT_ATTR_CHUNK_SIZE,
+    EVENT_CHUNK_GENERATED,
+    MODEL_NAME,
+    SPAN_CHAT_COMPLETION,
+    SPAN_EMBEDDING,
+)
+
+# Get tracer
+tracer = trace.get_tracer(__name__)
+logger = logging.getLogger(__name__)
+
+
+def set_atrribute_response(span: Span, response: ChatCompletionResponse | dict):
+    """Set response attributes automatically."""
+    # Non-streaming response - handle both dict and object responses
+    if isinstance(response, dict):
+        # Handle dict response
+        usage = response.get("usage")
+        if usage:
+            span.set_attribute(ATTR_PROMPT_TOKENS, usage.get("prompt_tokens", 0))
+            span.set_attribute(
+                ATTR_COMPLETION_TOKENS, usage.get("completion_tokens", 0)
+            )
+            span.set_attribute(ATTR_TOTAL_TOKENS, usage.get("total_tokens", 0))
+
+        choices = response.get("choices", [])
+        if choices and choices[0].get("message"):
+            content = choices[0]["message"].get("content") or ""
+            span.set_attribute(ATTR_OUTPUT_CONTENT_LENGTH, len(content))
+    else:
+        # Handle object response (attribute access)
+        if response.usage:
+            span.set_attribute(ATTR_PROMPT_TOKENS, response.usage.prompt_tokens)
+            span.set_attribute(ATTR_COMPLETION_TOKENS, response.usage.completion_tokens)
+            span.set_attribute(ATTR_TOTAL_TOKENS, response.usage.total_tokens)
+
+        if response.choices and response.choices[0].message:
+            content = response.choices[0].message.content or ""
+            span.set_attribute(ATTR_OUTPUT_CONTENT_LENGTH, len(content))
+
+
+def set_atrribute_response_stream(
+    span: Span, response: ChatCompletionStreamResponse | dict
+):
+    """Record streaming chunk as an event and accumulate tokens."""
+    chunk_content = ""
+    if isinstance(response, dict):
+        # Handle dict response
+        choices = response.get("choices", [])
+        if choices and choices[0].get("delta") and choices[0]["delta"].get("content"):
+            chunk_content = choices[0]["delta"]["content"]
+        chunk_json = str(response)  # Simple string representation for dict
+    else:
+        # Handle object response (attribute access)
+        if (
+            response.choices
+            and response.choices[0].delta
+            and response.choices[0].delta.content
+        ):
+            chunk_content = response.choices[0].delta.content
+        chunk_json = response.model_dump_json()
+
+    # Record chunk as an event
+    chunk_event = {
+        EVENT_ATTR_CHUNK_SIZE: len(chunk_json),
+        EVENT_ATTR_CHUNK_CONTENT_SIZE: len(chunk_content),
+        # EVENT_ATTR_CHUNK_CONTENT: chunk_content,
+        # EVENT_ATTR_FINISH_REASON: response.choices[0].finish_reason or 0
+        # if response.choices
+        # else None,
+    }
+    span.add_event(EVENT_CHUNK_GENERATED, chunk_event)
+
+    # Only count chunks with actual content
+    if not chunk_content:
+        return
+
+    # Accumulate tokens directly on the span (only for recording spans)
+    if span.is_recording():
+        current_completion_tokens = span.attributes.get(ATTR_COMPLETION_TOKENS, 0)
+        span.set_attribute(ATTR_COMPLETION_TOKENS, current_completion_tokens + 1)
+
+        # Update total content length
+        current_output_length = span.attributes.get(ATTR_OUTPUT_CONTENT_LENGTH, 0)
+        span.set_attribute(
+            ATTR_OUTPUT_CONTENT_LENGTH, current_output_length + len(chunk_content)
+        )
+
+        # Update total tokens (assuming we have prompt tokens from initial setup)
+        prompt_tokens = span.attributes.get(ATTR_PROMPT_TOKENS, 0)
+        total_tokens = prompt_tokens + current_completion_tokens + 1
+        span.set_attribute(ATTR_TOTAL_TOKENS, total_tokens)
+
+        # Update chunk count
+        current_chunk_count = span.attributes.get(ATTR_CHUNK_COUNT, 0)
+        span.set_attribute(ATTR_CHUNK_COUNT, current_chunk_count + 1)
+
+
+def set_attribute_response_embedding(span: Span, response: EmbeddingResponse | dict):
+    """Set embedding response attributes automatically."""
+    if isinstance(response, dict):
+        # Handle dict response
+        usage = response.get("usage")
+        if usage:
+            span.set_attribute(ATTR_PROMPT_TOKENS, usage.get("prompt_tokens", 0))
+            span.set_attribute(ATTR_TOTAL_TOKENS, usage.get("total_tokens", 0))
+        data = response.get("data")
+        if data:
+            span.set_attribute(ATTR_OUTPUT_COUNT, len(data))
+    else:
+        # Handle object response (attribute access)
+        if response.usage:
+            span.set_attribute(ATTR_PROMPT_TOKENS, response.usage.prompt_tokens)
+            span.set_attribute(ATTR_TOTAL_TOKENS, response.usage.total_tokens)
+        if response.data:
+            span.set_attribute(ATTR_OUTPUT_COUNT, len(response.data))
+
+
+def set_attribute_cancelled(span: Span, reason: str = "client disconnected"):
+    """Set span status to error for cancellation."""
+    span.set_status(Status(StatusCode.ERROR, description=reason))
+
+
+@contextmanager
+def slm_span(req: ChatCompletionRequest, is_streaming: bool):
+    """Create SLM span with automatic timing and error handling."""
+    span_name = (
+        f"{SPAN_CHAT_COMPLETION}.{'streaming' if is_streaming else 'non_streaming'}"
+    )
+
+    # Pre-calculate attributes before starting span
+    messages_for_llm = req.messages
+    input_content_length = sum(len(m.get("content") or "") for m in messages_for_llm)
+
+    # Set initial attributes that will be available in on_start
+    initial_attributes = {
+        ATTR_MODEL: MODEL_NAME,
+        ATTR_STREAMING: is_streaming,
+        ATTR_MAX_TOKENS: req.max_tokens or 0,
+        ATTR_TEMPERATURE: req.temperature,
+        ATTR_INPUT_MESSAGES: len(messages_for_llm),
+        ATTR_INPUT_CONTENT_LENGTH: input_content_length,
+    }
+
+    # Add a prompt-token estimate for streaming requests; the streaming
+    # chunk handler adds completion tokens on top of this value to keep
+    # ATTR_TOTAL_TOKENS current.
+    if is_streaming:
+        # Rough approximation: 1 token per 4 characters of input,
+        # with a floor of 1.
+        estimated_prompt_tokens = max(1, input_content_length // 4)
+        initial_attributes[ATTR_PROMPT_TOKENS] = estimated_prompt_tokens
+
+    with tracer.start_as_current_span(span_name, attributes=initial_attributes) as span:
+        try:
+            yield span
+
+        except Exception:
+            # Use native error handling
+            error_str = traceback.format_exc()
+            span.set_status(Status(StatusCode.ERROR, error_str))
+            span.set_attribute(ATTR_FORCE_SAMPLE, True)
+            raise
+
+
+@contextmanager
+def slm_embedding_span(req: EmbeddingRequest):
+    """Create SLM span for embedding requests."""
+    span_name = SPAN_EMBEDDING
+
+    if isinstance(req.input, list):
+        input_count = len(req.input)
+        input_content_length = sum(len(text) for text in req.input)
+    else:
+        input_count = 1
+        input_content_length = len(req.input)
+
+    initial_attributes = {
+        ATTR_MODEL: MODEL_NAME,
+        ATTR_INPUT_COUNT: input_count,
+        ATTR_INPUT_CONTENT_LENGTH: input_content_length,
+    }
+
+    with tracer.start_as_current_span(span_name, attributes=initial_attributes) as span:
+        try:
+            yield span
+
+        except Exception:
+            error_str = traceback.format_exc()
+            span.set_status(Status(StatusCode.ERROR, error_str))
+            span.set_attribute(ATTR_FORCE_SAMPLE, True)
+            raise
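Review note on `slm_server/utils/spans.py`: the streaming path estimates prompt tokens from the input length and then counts each non-empty chunk as one completion token. The sketch below restates that bookkeeping against a plain dict so it can be run without OpenTelemetry. The function names and dict keys are hypothetical stand-ins for the span attributes, and tolerating a missing or `None` content is an assumption about inputs rather than something the diff states.

```python
def estimate_prompt_tokens(messages: list[dict]) -> int:
    """Approximate prompt tokens as one token per four characters (minimum 1)."""
    # `content` may be missing or None, so fall back to an empty string.
    total_chars = sum(len(m.get("content") or "") for m in messages)
    return max(1, total_chars // 4)


def accumulate_chunk(attrs: dict, chunk_content: str) -> None:
    """Update running counters for one streamed chunk, treated as one token."""
    if not chunk_content:
        return  # chunks without content are not counted
    attrs["completion_tokens"] = attrs.get("completion_tokens", 0) + 1
    attrs["output_length"] = attrs.get("output_length", 0) + len(chunk_content)
    attrs["total_tokens"] = attrs.get("prompt_tokens", 0) + attrs["completion_tokens"]


# "Hello there!" has 12 characters, so the estimate is 12 // 4 tokens.
attrs = {
    "prompt_tokens": estimate_prompt_tokens(
        [{"role": "user", "content": "Hello there!"}]
    )
}
for piece in ["Hi", "", " there"]:
    accumulate_chunk(attrs, piece)
```

Because one chunk is counted as one token, the resulting totals are approximations of throughput, not exact usage figures.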
diff --git a/tests/e2e/conftest.py b/tests/e2e/conftest.py
new file mode 100644
index 0000000..ff5f931
--- /dev/null
+++ b/tests/e2e/conftest.py
@@ -0,0 +1,51 @@
+import subprocess
+import time
+
+import httpx
+import pytest
+
+
+def is_server_running(port=8000):
+    """Check if a server is running on the specified port by checking the /health endpoint."""
+    try:
+        response = httpx.get(f"http://localhost:{port}/health", timeout=1)
+        return response.status_code == 200
+    except httpx.RequestError:
+        return False
+
+@pytest.fixture(scope="session")
+def server():
+    """
+    A session-scoped fixture that starts the SLM server if it's not already running.
+    It tears down the server process after all tests in the session are complete.
+    """
+    if is_server_running():
+        print("Server is already running. Tests will proceed against the existing server.")
+        yield
+        return
+
+    print("Starting server...")
+    # Start the server as a background process
+    process = subprocess.Popen(["./scripts/start.sh"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+    
+    # Wait for the server to be ready
+    for _ in range(30):  # 30 seconds timeout
+        if is_server_running():
+            print("Server started successfully.")
+            break
+        time.sleep(1)
+    else:
+        process.kill()  # stop the child first; communicate() waits for it to exit
+        stdout, stderr = process.communicate()
+        pytest.fail(f"Server did not start within the timeout period. Stdout: {stdout.decode()}, Stderr: {stderr.decode()}", pytrace=False)
+
+    yield
+
+    print("Tearing down server...")
+    process.terminate()
+    try:
+        process.wait(timeout=10)
+    except subprocess.TimeoutExpired:
+        print("Server did not terminate gracefully, killing it.")
+        process.kill()
+    print("Server torn down.")
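Review note on `tests/e2e/conftest.py`: the fixture polls the health endpoint once per second for up to 30 attempts. A minimal sketch of the same idea as a reusable helper with an explicit deadline follows. `wait_until` is a hypothetical name and is not part of this PR; the predicate would be something like `is_server_running`.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool], timeout: float = 30.0, interval: float = 1.0
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example: a predicate that succeeds on its third call.
calls = {"n": 0}


def ready_on_third_call() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


became_ready = wait_until(ready_on_third_call, timeout=1.0, interval=0.01)
```

Using a monotonic deadline keeps the total wait bounded even when each probe is slow, which a fixed attempt count does not guarantee.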
diff --git a/tests/e2e/main.py b/tests/e2e/main.py
deleted file mode 100644
index 565322a..0000000
--- a/tests/e2e/main.py
+++ /dev/null
@@ -1,65 +0,0 @@
-import asyncio
-import json
-
-import httpx
-
-
-async def test_chat_completion_non_streaming():
-    print("Testing non-streaming chat completion...")
-    async with httpx.AsyncClient() as client:
-        response = await client.post(
-            "http://localhost:8000/api/v1/chat/completions",
-            json={
-                "messages": [{"role": "user", "content": "Hello /no think"}],
-                "stream": False,
-            },
-            timeout=30,
-        )
-        assert response.status_code == 200
-        response_data = response.json()
-        print(f"Non-streaming response: {response_data}")
-        assert "choices" in response_data
-        assert len(response_data["choices"]) > 0
-        assert "message" in response_data["choices"][0]
-        assert "content" in response_data["choices"][0]["message"]
-
-
-async def test_chat_completion_streaming():
-    print("\nTesting streaming chat completion...")
-    async with httpx.AsyncClient() as client:
-        async with client.stream(
-            "POST",
-            "http://localhost:8000/api/v1/chat/completions",
-            json={
-                "messages": [{"role": "user", "content": "Hello /no think"}],
-                "stream": True,
-            },
-            timeout=30,
-        ) as response:
-            assert response.status_code == 200
-            print("Streaming response:")
-            async for chunk in response.aiter_bytes():
-                if chunk.strip():
-                    # Decode bytes to string and remove the 'data: ' prefix
-                    data_str = chunk.decode("utf-8").replace("data: ", "").strip()
-                    if data_str == "[DONE]":
-                        print("\nStream finished.")
-                        break
-                    try:
-                        # Parse the JSON data
-                        response_data = json.loads(data_str)
-                        print(response_data, end="", flush=True)
-                        assert "choices" in response_data
-                        assert len(response_data["choices"]) > 0
-                        assert "delta" in response_data["choices"][0]
-                    except json.JSONDecodeError:
-                        print(f"\nError decoding JSON: {data_str}")
-
-
-async def main():
-    await test_chat_completion_non_streaming()
-    await test_chat_completion_streaming()
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
diff --git a/tests/e2e/test_api.py b/tests/e2e/test_api.py
new file mode 100644
index 0000000..1eee6e2
--- /dev/null
+++ b/tests/e2e/test_api.py
@@ -0,0 +1,114 @@
+
+import json
+import pytest
+import httpx
+
+@pytest.mark.api
+@pytest.mark.api_non_streaming
+def test_chat_completion_non_streaming(server):
+    """Test non-streaming chat completion API."""
+    with httpx.Client() as client:
+        response = client.post(
+            "http://localhost:8000/api/v1/chat/completions",
+            json={
+                "messages": [{"role": "user", "content": "Hello /no think"}],
+                "stream": False,
+            },
+            timeout=30,
+        )
+        response.raise_for_status()
+        response_data = response.json()
+        assert "choices" in response_data
+        assert len(response_data["choices"]) > 0
+        assert "message" in response_data["choices"][0]
+        assert "content" in response_data["choices"][0]["message"]
+
+@pytest.mark.api
+def test_chat_completion_streaming(server):
+    """Test streaming chat completion API."""
+    with httpx.Client() as client:
+        with client.stream(
+            "POST",
+            "http://localhost:8000/api/v1/chat/completions",
+            json={
+                "messages": [{"role": "user", "content": "Hello /no think"}],
+                "stream": True,
+            },
+            timeout=30,
+        ) as response:
+            response.raise_for_status()
+            for line in response.iter_lines():
+                if line.startswith("data: "):
+                    data_str = line[len("data: "):].strip()
+                    if data_str == "[DONE]":
+                        break
+                    try:
+                        response_data = json.loads(data_str)
+                        assert "choices" in response_data
+                        assert len(response_data["choices"]) > 0
+                        assert "delta" in response_data["choices"][0]
+                    except json.JSONDecodeError:
+                        pytest.fail(f"Error decoding JSON: {data_str}")
+
+@pytest.mark.api
+@pytest.mark.api_non_streaming
+def test_embeddings(server):
+    """Test embeddings API."""
+    with httpx.Client() as client:
+        response = client.post(
+            "http://localhost:8000/api/v1/embeddings",
+            json={
+                "input": "Hello world"
+            },
+        )
+        response.raise_for_status()
+        response_data = response.json()
+        assert response_data["object"] == "list"
+        assert len(response_data["data"]) == 1
+        assert "embedding" in response_data["data"][0]
+        assert len(response_data["data"][0]["embedding"]) > 0
+
+        # Test with multiple inputs
+        response = client.post(
+            "http://localhost:8000/api/v1/embeddings",
+            json={
+                "input": ["Hello", "World"],
+                "model": "Qwen3-0.6B-GGUF"
+            },
+        )
+        response.raise_for_status()
+        response_data = response.json()
+        assert len(response_data["data"]) == 2
+
+
+@pytest.mark.api
+@pytest.mark.api_non_streaming
+def test_embeddings_multiple(server):
+    """Test embeddings API with list inputs."""
+    with httpx.Client() as client:
+        response = client.post(
+            "http://localhost:8000/api/v1/embeddings",
+            json={
+                "input": ["Hello, world"]
+            },
+        )
+        response.raise_for_status()
+        response_data = response.json()
+        assert response_data["object"] == "list"
+        assert len(response_data["data"]) == 1
+        assert "embedding" in response_data["data"][0]
+        assert len(response_data["data"][0]["embedding"]) > 0
+
+        # Test with multiple inputs
+        response = client.post(
+            "http://localhost:8000/api/v1/embeddings",
+            json={
+                "input": ["Hello", "World"],
+                "model": "Qwen3-0.6B-GGUF"
+            },
+            timeout=30,
+        )
+        response.raise_for_status()
+        response_data = response.json()
+        assert len(response_data["data"]) == 2
+
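Review note on `tests/e2e/test_api.py`: the streaming test strips the `data: ` prefix and decodes each payload as JSON until it sees `[DONE]`. A minimal sketch of that parsing as a standalone generator follows, assuming each event arrives as a single `data: <json>` line. `iter_sse_payloads` is a hypothetical helper, not part of this PR.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from SSE lines until the [DONE] sentinel."""
    prefix = "data: "
    for line in lines:
        # Skip blank separator lines and non-data fields such as "event:".
        if not line.startswith(prefix):
            continue
        data = line[len(prefix):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


sample = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    'data: {"choices": [{"delta": {}}]}',
    "data: [DONE]",
]
payloads = list(iter_sse_payloads(sample))
```

Working on lines instead of raw byte chunks avoids decode failures when one network chunk carries several events or splits an event in two.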
diff --git a/tests/e2e/test_langchain_compatibility.py b/tests/e2e/test_langchain_compatibility.py
new file mode 100644
index 0000000..dec0398
--- /dev/null
+++ b/tests/e2e/test_langchain_compatibility.py
@@ -0,0 +1,349 @@
+
+import pytest
+from langchain_openai import ChatOpenAI, OpenAIEmbeddings
+from langchain.prompts import PromptTemplate
+from langchain_core.messages import HumanMessage
+from langchain.agents import create_tool_calling_agent, create_react_agent, AgentExecutor
+from langchain.tools import tool
+from langchain_core.prompts import ChatPromptTemplate
+
+@pytest.mark.langchain
+def test_basic_chat_llm_call(server):
+    """Test basic ChatOpenAI call through LangChain interface."""
+    chat_llm = ChatOpenAI(
+        base_url="http://localhost:8000/api/v1",
+        api_key="dummy-key",
+        temperature=0.7,
+        max_tokens=150,
+    )
+    messages = [HumanMessage(content="Hello, can you say 'LangChain test successful'?")]
+    response = chat_llm.invoke(messages)
+    assert isinstance(response.content, str)
+    assert len(response.content) > 0
+    print(f"TEST LANGCHAIN RESPONSE: {response.content}")
+
+@pytest.mark.langchain
+def test_llm_chain_integration(server):
+    """Test modern RunnableSequence chain integration with our server."""
+    chat_llm = ChatOpenAI(
+        base_url="http://localhost:8000/api/v1",
+        api_key="dummy-key",
+        temperature=0.7,
+        max_tokens=150,
+    )
+    prompt = PromptTemplate(
+        input_variables=["topic"],
+        template="Write a short paragraph about {topic}. Keep it under 100 words."
+    )
+    chain = prompt | chat_llm
+    response = chain.invoke({"topic": "artificial intelligence"})
+    assert isinstance(response.content, str)
+    assert len(response.content) > 0
+    print(f"TEST LANGCHAIN RESPONSE: {response.content}")
+
+@pytest.mark.langchain  
+def test_agent_with_calculator_tool(server):
+    """Test agent with calculator tool for mathematical operations."""
+    
+    # Define a simple calculator tool
+    @tool
+    def calculator(expression: str) -> str:
+        """Evaluate a mathematical expression safely. Input should be a string like '25 + 15' or '40 * 3'."""
+        try:
+            # Simple evaluation for basic arithmetic
+            # Only allow basic operations for security
+            allowed_chars = set('0123456789+-*/.() ')
+            if not all(c in allowed_chars for c in expression):
+                return "Error: Only basic arithmetic operations are allowed"
+            
+            result = eval(expression)
+            return str(result)
+        except Exception as e:
+            return f"Error: {str(e)}"
+    
+    # Create the LLM
+    llm = ChatOpenAI(
+        base_url="http://localhost:8000/api/v1",
+        api_key="dummy-key",
+        temperature=0.1,
+        max_tokens=400,
+    )
+    
+    # Define tools list
+    tools = [calculator]
+    
+    # Create agent prompt
+    prompt = ChatPromptTemplate.from_messages([
+        ("system", """You are a helpful mathematical assistant with access to a calculator tool.
+
+When solving math problems:
+1. Use the calculator tool for any arithmetic operations
+2. Break down complex problems step by step  
+3. Show your work clearly
+4. Always use the calculator tool instead of doing math mentally
+
+The calculator tool accepts expressions like:
+- "25 + 15" 
+- "40 * 3"
+- "120 - 8"
+- "100 / 4"
+
+You MUST use the calculator tool for all mathematical operations."""),
+        ("human", "{input}"),
+        ("placeholder", "{agent_scratchpad}"),
+    ])
+    
+    # Test with a simpler problem first to ensure tool calling works
+    test_question = "What is 47 + 23? Please use the calculator to verify."
+    
+    try:
+        # Create the agent and cap it at five iterations
+        agent = create_tool_calling_agent(llm, tools, prompt)
+        agent_executor = AgentExecutor(
+            agent=agent, 
+            tools=tools, 
+            verbose=True, 
+            max_iterations=5,
+            early_stopping_method="generate"
+        )
+        
+        # Invoke the agent with the simple question defined above
+        response = agent_executor.invoke({"input": test_question})
+        
+        print("\n=== CALCULATOR AGENT TEST ===")
+        print(f"Question: {test_question}")
+        print(f"Response: {response['output']}")
+        print("=== END CALCULATOR AGENT TEST ===\n")
+        
+        # Basic assertions
+        assert isinstance(response, dict)
+        assert "output" in response
+        assert isinstance(response["output"], str)
+        assert len(response["output"]) > 0
+        
+        # Check if the response mentions calculation or contains the correct answer
+        output_lower = response["output"].lower()
+        assert any(word in output_lower for word in ["70", "calculate", "result", "answer"]), \
+            f"Response should contain the answer (70) or calculation reference, got: {response['output']}"
+            
+    except Exception as e:
+        print(f"Agent execution failed: {e}")
+        # If tool calling fails, fall back to basic LLM test
+        fallback_response = llm.invoke([HumanMessage(content="Calculate 47 + 23 and explain your reasoning.")])
+        assert isinstance(fallback_response.content, str)
+        assert len(fallback_response.content) > 0
+        print("\n=== FALLBACK RESPONSE ===")
+        print("Question: Calculate 47 + 23 and explain your reasoning.")
+        print(f"Response: {fallback_response.content}")
+        print("=== END FALLBACK RESPONSE ===\n")
+
+@pytest.mark.langchain
+def test_function_calling_capability(server):
+    """Test if the model can understand and respond to function calling requests."""
+    
+    # Create the LLM
+    llm = ChatOpenAI(
+        base_url="http://localhost:8000/api/v1",
+        api_key="dummy-key",
+        temperature=0.1,
+        max_tokens=200,
+    )
+    
+    # Test direct function calling format understanding
+    test_message = """I have access to a calculator function that can perform arithmetic. 
+    
+When I need to calculate something, I should call:
+calculator(expression="mathematical expression")
+
+Now, what is 154 + 267? I need to use the calculator function to get the exact answer."""
+    
+    response = llm.invoke([HumanMessage(content=test_message)])
+    
+    print("\n=== FUNCTION CALLING CAPABILITY TEST ===")
+    print(f"Question: {test_message}")
+    print(f"Response: {response.content}")
+    print("=== END FUNCTION CALLING CAPABILITY TEST ===\n")
+    
+    # Basic assertions
+    assert isinstance(response.content, str)
+    assert len(response.content) > 0
+    
+    # Check if model understands function calling concept
+    content_lower = response.content.lower()
+    has_calculator_ref = any(word in content_lower for word in ["calculator", "function", "call"])
+    has_answer = "421" in response.content or "154 + 267" in response.content
+    
+    print(f"Has calculator reference: {has_calculator_ref}")
+    print(f"Has correct answer or calculation: {has_answer}")
+    
+    # The test passes if model shows understanding of the concept, even if it doesn't actually call tools
+    assert has_calculator_ref or has_answer, f"Model should show understanding of function calling or provide answer, got: {response.content}"
+
+@pytest.mark.langchain
+def test_react_agent_complex_reasoning(server):
+    """Test ReAct agent with multiple tools for complex multi-step problem solving."""
+    
+    # Define multiple tools for complex scenarios
+    @tool
+    def calculator(expression: str) -> str:
+        """Evaluate a mathematical expression safely. Input should be a string like '25 + 15' or '40 * 3'."""
+        try:
+            # Simple evaluation for basic arithmetic
+            allowed_chars = set('0123456789+-*/.() ')
+            if not all(c in allowed_chars for c in expression):
+                return "Error: Only basic arithmetic operations are allowed"
+            result = eval(expression)
+            return str(result)
+        except Exception as e:
+            return f"Error: {str(e)}"
+    
+    @tool
+    def unit_converter(value: float, from_unit: str, to_unit: str) -> str:
+        """Convert between units. Supports: meters/feet, celsius/fahrenheit, kg/pounds."""
+        try:
+            if from_unit.lower() == "meters" and to_unit.lower() == "feet":
+                result = value * 3.28084
+                return f"{value} meters = {result:.2f} feet"
+            elif from_unit.lower() == "feet" and to_unit.lower() == "meters":
+                result = value / 3.28084
+                return f"{value} feet = {result:.2f} meters"
+            elif from_unit.lower() == "celsius" and to_unit.lower() == "fahrenheit":
+                result = (value * 9/5) + 32
+                return f"{value}°C = {result:.2f}°F"
+            elif from_unit.lower() == "fahrenheit" and to_unit.lower() == "celsius":
+                result = (value - 32) * 5/9
+                return f"{value}°F = {result:.2f}°C"
+            elif from_unit.lower() == "kg" and to_unit.lower() == "pounds":
+                result = value * 2.20462
+                return f"{value} kg = {result:.2f} pounds"
+            elif from_unit.lower() == "pounds" and to_unit.lower() == "kg":
+                result = value / 2.20462
+                return f"{value} pounds = {result:.2f} kg"
+            else:
+                return f"Error: Conversion from {from_unit} to {to_unit} not supported"
+        except Exception as e:
+            return f"Error: {str(e)}"
+    
+    @tool
+    def word_analyzer(text: str) -> str:
+        """Analyze text and return word count, character count, and other statistics."""
+        words = text.split()
+        chars = len(text)
+        chars_no_spaces = len(text.replace(' ', ''))
+        sentences = text.count('.') + text.count('!') + text.count('?')
+        return f"Words: {len(words)}, Characters: {chars}, Characters (no spaces): {chars_no_spaces}, Sentences: {sentences}"
+    
+    # Create the LLM
+    llm = ChatOpenAI(
+        base_url="http://localhost:8000/api/v1",
+        api_key="dummy-key",
+        temperature=0.2,
+        max_tokens=600,
+    )
+    
+    # Define tools list
+    tools = [calculator, unit_converter, word_analyzer]
+    
+    # Use a proper ReAct prompt with all required variables
+    react_prompt = ChatPromptTemplate.from_template("""
+Answer the following questions as best you can. You have access to the following tools:
+
+{tools}
+
+Use the following format:
+
+Question: the input question you must answer
+Thought: you should always think about what to do
+Action: the action to take, should be one of [{tool_names}]
+Action Input: the input to the action
+Observation: the result of the action
+... (this Thought/Action/Action Input/Observation can repeat N times)
+Thought: I now know the final answer
+Final Answer: the final answer to the original input question
+
+Begin!
+
+Question: {input}
+Thought:{agent_scratchpad}
+""")
+    
+    # Simplified multi-step problem
+    test_question = """Can you help me with two quick tasks:
+    1. Calculate 12.5 * 8.3 using the calculator
+    2. Convert 25 celsius to fahrenheit using the unit converter
+    
+    Please show your work for both steps."""
+    
+    try:
+        # Create the ReAct agent
+        agent = create_react_agent(llm, tools, react_prompt)
+        agent_executor = AgentExecutor(
+            agent=agent, 
+            tools=tools, 
+            verbose=True, 
+            max_iterations=8,
+            early_stopping_method="generate",
+            handle_parsing_errors=True
+        )
+        
+        print("\n=== REACT AGENT COMPLEX REASONING TEST ===")
+        print(f"Question: {test_question}")
+        print("--- Starting agent execution ---")
+        
+        # Execute the agent
+        response = agent_executor.invoke({"input": test_question})
+        
+        print("--- Agent execution completed ---")
+        print(f"Final Response: {response['output']}")
+        print("=== END REACT AGENT TEST ===\n")
+        
+        # Basic assertions
+        assert isinstance(response, dict)
+        assert "output" in response
+        assert isinstance(response["output"], str)
+        assert len(response["output"]) > 0
+        
+        # Check if response contains evidence of multi-step reasoning
+        output_lower = response["output"].lower()
+        
+        # Look for evidence of the two tasks
+        has_calculation = any(term in output_lower for term in ["103.75", "12.5", "8.3", "multiply"])
+        has_temp_conversion = any(term in output_lower for term in ["77", "fahrenheit", "celsius", "convert"])
+        
+        print("Analysis Results:")
+        print(f"- Has calculation (12.5 * 8.3): {has_calculation}")
+        print(f"- Has temperature conversion (25°C to °F): {has_temp_conversion}")
+        
+        # Test passes if at least one task is attempted
+        steps_completed = sum([has_calculation, has_temp_conversion])
+        print(f"- Steps completed: {steps_completed}/2")
+        
+        assert steps_completed >= 1, f"Expected at least 1 reasoning step, got {steps_completed}. Response: {response['output']}"
+        
+    except Exception as e:
+        print(f"ReAct agent execution failed: {e}")
+        # Fallback test - at least verify the LLM can handle the complex prompt
+        fallback_response = llm.invoke([HumanMessage(content=f"Solve this step by step: {test_question}")])
+        assert isinstance(fallback_response.content, str)
+        assert len(fallback_response.content) > 0
+        print("\n=== FALLBACK RESPONSE ===")
+        print(f"Question: Solve this step by step: {test_question}")
+        print(f"Response: {fallback_response.content}")
+        print("=== END FALLBACK RESPONSE ===\n")
+
+@pytest.mark.skip(reason="Not yet compatible with our server, since OpenAIEmbeddings passes tokenized input.")
+@pytest.mark.langchain
+def test_embeddings_compatibility(server):
+    """Test OpenAIEmbeddings compatibility with our server."""
+    embeddings = OpenAIEmbeddings(
+        base_url="http://localhost:8000/api/v1",
+        api_key="dummy-key",
+    )
+    texts = ["Hello world", "This is a test"]
+    result = embeddings.embed_documents(texts)
+    assert isinstance(result, list)
+    assert len(result) == 2
+    assert all(isinstance(embedding, list) for embedding in result)
+
+    query_result = embeddings.embed_query("Test query")
+    assert isinstance(query_result, list)
diff --git a/tests/test_app.py b/tests/test_app.py
index 933608b..d915744 100644
--- a/tests/test_app.py
+++ b/tests/test_app.py
@@ -35,6 +35,7 @@ def reset_mock():
     """Reset the mock before each test."""
     mock_llama.reset_mock()
     mock_llama.create_chat_completion.side_effect = None  # Clear any side effects
+    mock_llama.create_embedding.side_effect = None  # Clear any side effects for embedding
     
     # Patch the tracer in utils.py to use our test tracer
     local_tracer = tracer_provider.get_tracer(__name__)
@@ -129,7 +130,7 @@ def test_server_busy_exception():
             "/api/v1/chat/completions",
             json={"messages": [{"role": "user", "content": "Hello"}], "stream": False},
         )
-        assert response.status_code == 503
+        assert response.status_code == 408
         assert response.json()["detail"] == DETAIL_SEM_TIMEOUT
 
 
@@ -395,6 +396,220 @@ def test_streaming_call_with_empty_chunks():
     mock_llama.create_chat_completion.assert_called_once()
 
 
+def test_embeddings_endpoint_string_input():
+    """Tests the embeddings endpoint with string input."""
+    mock_llama.create_embedding.return_value = {
+        "object": "list",
+        "data": [
+            {
+                "object": "embedding",
+                "embedding": [0.1, -0.2, 0.3, -0.4, 0.5],
+                "index": 0
+            }
+        ],
+        "model": "test-model",
+        "usage": {
+            "prompt_tokens": 5,
+            "total_tokens": 5
+        }
+    }
+
+    response = client.post(
+        "/api/v1/embeddings",
+        json={"input": "Hello world", "model": "test-model"}
+    )
+
+    assert response.status_code == 200
+    response_data = response.json()
+    
+    assert response_data["object"] == "list"
+    assert len(response_data["data"]) == 1
+    assert response_data["data"][0]["object"] == "embedding"
+    assert response_data["data"][0]["embedding"] == [0.1, -0.2, 0.3, -0.4, 0.5]
+    assert response_data["data"][0]["index"] == 0
+    assert response_data["model"] == "test-model"
+    assert response_data["usage"]["prompt_tokens"] == 5
+    assert response_data["usage"]["total_tokens"] == 5
+    
+    # Verify the LLM was called correctly
+    mock_llama.create_embedding.assert_called_once_with(
+        input="Hello world",
+        model="test-model"
+    )
+
+
+def test_embeddings_endpoint_list_input():
+    """Tests the embeddings endpoint with list input."""
+    mock_llama.create_embedding.return_value = {
+        "object": "list", 
+        "data": [
+            {
+                "object": "embedding",
+                "embedding": [0.1, 0.2, 0.3],
+                "index": 0
+            },
+            {
+                "object": "embedding",
+                "embedding": [0.4, 0.5, 0.6],
+                "index": 1
+            }
+        ],
+        "model": "test-model",
+        "usage": {
+            "prompt_tokens": 10,
+            "total_tokens": 10
+        }
+    }
+
+    response = client.post(
+        "/api/v1/embeddings",
+        json={"input": ["First text", "Second text"], "model": "test-model"}
+    )
+
+    assert response.status_code == 200
+    response_data = response.json()
+    
+    assert response_data["object"] == "list"
+    assert len(response_data["data"]) == 2
+    assert response_data["data"][0]["embedding"] == [0.1, 0.2, 0.3]
+    assert response_data["data"][1]["embedding"] == [0.4, 0.5, 0.6]
+    assert response_data["usage"]["prompt_tokens"] == 10
+    
+    # Verify the LLM was called correctly
+    mock_llama.create_embedding.assert_called_once_with(
+        input=["First text", "Second text"],
+        model="test-model"
+    )
+
+
+def test_embeddings_endpoint_default_model():
+    """Tests the embeddings endpoint with default model."""
+    mock_llama.create_embedding.return_value = {
+        "object": "list",
+        "data": [
+            {
+                "object": "embedding",
+                "embedding": [0.1, 0.2],
+                "index": 0
+            }
+        ],
+        "model": "Qwen3-0.6B-GGUF",
+        "usage": {
+            "prompt_tokens": 3,
+            "total_tokens": 3
+        }
+    }
+
+    response = client.post(
+        "/api/v1/embeddings",
+        json={"input": "Test"}
+    )
+
+    assert response.status_code == 200
+    response_data = response.json()
+    
+    assert response_data["model"] == "Qwen3-0.6B-GGUF"
+    
+    # Verify default model was used
+    mock_llama.create_embedding.assert_called_once_with(
+        input="Test",
+        model=None  # Default model is None
+    )
+
+
+def test_embeddings_endpoint_error():
+    """Tests the embeddings endpoint error handling."""
+    mock_llama.create_embedding.side_effect = Exception("Embedding failed")
+
+    response = client.post(
+        "/api/v1/embeddings",
+        json={"input": "Test", "model": "test-model"}
+    )
+
+    assert response.status_code == 500
+    assert "Embedding failed" in response.json()["detail"]
+
+
+def test_embeddings_endpoint_empty_input():
+    """Tests the embeddings endpoint with empty input."""
+    mock_llama.create_embedding.return_value = {
+        "object": "list",
+        "data": [
+            {
+                "object": "embedding", 
+                "embedding": [0.0, 0.0],
+                "index": 0
+            }
+        ],
+        "model": "test-model",
+        "usage": {
+            "prompt_tokens": 0,
+            "total_tokens": 0
+        }
+    }
+
+    response = client.post(
+        "/api/v1/embeddings",
+        json={"input": "", "model": "test-model"}
+    )
+
+    assert response.status_code == 200
+    response_data = response.json()
+    
+    assert len(response_data["data"]) == 1
+    assert response_data["usage"]["prompt_tokens"] == 0
+    
+    # Verify empty string was passed through
+    mock_llama.create_embedding.assert_called_once_with(
+        input="",
+        model="test-model"
+    )
+
+
+def test_embeddings_endpoint_with_tracing_integration():
+    """Integration test for embeddings endpoint with complete tracing flow."""
+    mock_llama.create_embedding.return_value = {
+        "object": "list",
+        "data": [
+            {
+                "object": "embedding",
+                "embedding": [0.1, -0.2, 0.3, -0.4, 0.5, 0.6, -0.7, 0.8],
+                "index": 0
+            }
+        ],
+        "model": "test-model",
+        "usage": {
+            "prompt_tokens": 8,
+            "total_tokens": 8
+        }
+    }
+
+    response = client.post(
+        "/api/v1/embeddings",
+        json={
+            "input": "This is a test sentence for creating embeddings.",
+            "model": "test-model"
+        }
+    )
+
+    assert response.status_code == 200
+    response_data = response.json()
+    
+    # Verify response structure
+    assert response_data["object"] == "list"
+    assert len(response_data["data"]) == 1
+    assert len(response_data["data"][0]["embedding"]) == 8
+    assert response_data["usage"]["prompt_tokens"] == 8
+    assert response_data["usage"]["total_tokens"] == 8
+    
+    # Verify the LLM was called with correct parameters
+    mock_llama.create_embedding.assert_called_once()
+    call_args = mock_llama.create_embedding.call_args
+    
+    assert call_args[1]["input"] == "This is a test sentence for creating embeddings."
+    assert call_args[1]["model"] == "test-model"
+
+
 def test_request_validation_and_defaults():
     """Test request validation and default parameter handling."""
     # Test minimal request
@@ -422,8 +637,8 @@ def test_request_validation_and_defaults():
     
     # Verify defaults were applied
     call_args = mock_llama.create_chat_completion.call_args
-    assert call_args[1]["max_tokens"] == 2048  # Default value
-    assert call_args[1]["temperature"] == 0.7  # Default value
+    assert call_args[1]["max_tokens"] is None  # Default value
+    assert call_args[1]["temperature"] == 0.2  # Default value
     assert call_args[1]["stream"] is False     # Default value
 
 
diff --git a/tests/test_embedding.py b/tests/test_embedding.py
new file mode 100644
index 0000000..88cec97
--- /dev/null
+++ b/tests/test_embedding.py
@@ -0,0 +1,470 @@
+"""Tests for embedding functionality in slm_server."""
+
+from unittest.mock import Mock, patch
+
+import pytest
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import SimpleSpanProcessor
+from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
+from opentelemetry.trace import StatusCode
+
+from llama_cpp.llama_types import (
+    CreateEmbeddingResponse as EmbeddingResponse,
+    EmbeddingData,
+    EmbeddingUsage,
+)
+from slm_server.model import EmbeddingRequest
+from slm_server.utils import (
+    ATTR_INPUT_COUNT,
+    ATTR_INPUT_CONTENT_LENGTH,
+    ATTR_MODEL,
+    ATTR_OUTPUT_COUNT,
+    ATTR_PROMPT_TOKENS,
+    ATTR_TOTAL_TOKENS,
+    SPAN_EMBEDDING,
+    set_attribute_response_embedding,
+    slm_embedding_span,
+)
+
+
+@pytest.fixture
+def setup_tracing():
+    """Set up tracing with in-memory span exporter for testing."""
+    # Create a tracer provider with in-memory exporter
+    tracer_provider = TracerProvider()
+    memory_exporter = InMemorySpanExporter()
+    span_processor = SimpleSpanProcessor(memory_exporter)
+    tracer_provider.add_span_processor(span_processor)
+    
+    # Don't override global tracer provider - use local one
+    local_tracer = tracer_provider.get_tracer(__name__)
+    
+    yield memory_exporter, local_tracer
+    
+    # Clean up
+    memory_exporter.clear()
+
+
+class TestSetAttributeResponseEmbedding:
+    """Tests for set_attribute_response_embedding function."""
+    
+    def test_sets_embedding_attributes_correctly(self):
+        """Test that embedding response attributes are set correctly on span."""
+        mock_span = Mock()
+        
+        # Create embedding response with usage and data
+        response = EmbeddingResponse(
+            object="list",
+            data=[
+                EmbeddingData(
+                    object="embedding",
+                    embedding=[0.1, 0.2, -0.3, 0.4, -0.5],
+                    index=0
+                ),
+                EmbeddingData(
+                    object="embedding", 
+                    embedding=[0.6, -0.7, 0.8, -0.9, 1.0],
+                    index=1
+                )
+            ],
+            model="test-model",
+            usage=EmbeddingUsage(prompt_tokens=15, total_tokens=15)
+        )
+        
+        set_attribute_response_embedding(mock_span, response)
+        
+        # Verify attributes were set
+        mock_span.set_attribute.assert_any_call(ATTR_PROMPT_TOKENS, 15)
+        mock_span.set_attribute.assert_any_call(ATTR_TOTAL_TOKENS, 15)
+        mock_span.set_attribute.assert_any_call(ATTR_OUTPUT_COUNT, 2)  # 2 embeddings
+    
+    def test_handles_single_embedding(self):
+        """Test handling of single embedding response."""
+        mock_span = Mock()
+        
+        response = EmbeddingResponse(
+            object="list",
+            data=[
+                EmbeddingData(
+                    object="embedding",
+                    embedding=[0.1, 0.2, 0.3],
+                    index=0
+                )
+            ],
+            model="test-model",
+            usage=EmbeddingUsage(prompt_tokens=5, total_tokens=5)
+        )
+        
+        set_attribute_response_embedding(mock_span, response)
+        
+        # Should set output count to 1
+        mock_span.set_attribute.assert_any_call(ATTR_OUTPUT_COUNT, 1)
+    
+    def test_handles_empty_data(self):
+        """Test handling of empty embedding data."""
+        mock_span = Mock()
+        
+        response = EmbeddingResponse(
+            object="list",
+            data=[],
+            model="test-model",
+            usage=EmbeddingUsage(prompt_tokens=0, total_tokens=0)
+        )
+        
+        set_attribute_response_embedding(mock_span, response)
+        
+        # Should still set usage attributes but not output count since data is empty
+        mock_span.set_attribute.assert_any_call(ATTR_PROMPT_TOKENS, 0)
+        mock_span.set_attribute.assert_any_call(ATTR_TOTAL_TOKENS, 0)
+        # Verify output count was NOT set since data is empty
+        output_count_calls = [call for call in mock_span.set_attribute.call_args_list 
+                             if call[0][0] == ATTR_OUTPUT_COUNT]
+        assert len(output_count_calls) == 0
+    
+    def test_handles_usage_properly(self):
+        """Test that usage attributes are set when present."""
+        mock_span = Mock()
+        
+        response = EmbeddingResponse(
+            object="list",
+            data=[
+                EmbeddingData(
+                    object="embedding",
+                    embedding=[0.1, 0.2],
+                    index=0
+                )
+            ],
+            model="test-model",
+            usage=EmbeddingUsage(prompt_tokens=5, total_tokens=5)
+        )
+        
+        set_attribute_response_embedding(mock_span, response)
+        
+        # Should set both usage and output count attributes
+        mock_span.set_attribute.assert_any_call(ATTR_OUTPUT_COUNT, 1)
+        mock_span.set_attribute.assert_any_call(ATTR_PROMPT_TOKENS, 5)
+        mock_span.set_attribute.assert_any_call(ATTR_TOTAL_TOKENS, 5)
+
+
+class TestSlmEmbeddingSpan:
+    """Tests for slm_embedding_span context manager."""
+    
+    def test_sets_initial_attributes_string_input(self, setup_tracing):
+        """Test that initial attributes are set correctly for string input."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(
+            input="Hello world, this is a test input.",
+            model="test-model"
+        )
+        
+        # Patch the global tracer with our local one
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_embedding_span(request) as span:
+                pass
+        
+        # Get the finished span
+        spans = memory_exporter.get_finished_spans()
+        assert len(spans) == 1
+        
+        span = spans[0]
+        attrs = span.attributes
+        
+        assert span.name == SPAN_EMBEDDING
+        assert attrs[ATTR_MODEL] == "llama-cpp"
+        assert attrs[ATTR_INPUT_COUNT] == 1
+        assert attrs[ATTR_INPUT_CONTENT_LENGTH] > 0
+    
+    def test_sets_initial_attributes_list_input(self, setup_tracing):
+        """Test that initial attributes are set correctly for list input."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(
+            input=["First text", "Second text", "Third text"],
+            model="test-model"
+        )
+        
+        # Patch the global tracer with our local one
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_embedding_span(request) as span:
+                pass
+        
+        # Get the finished span
+        spans = memory_exporter.get_finished_spans()
+        assert len(spans) == 1
+        
+        span = spans[0]
+        attrs = span.attributes
+        
+        assert attrs[ATTR_INPUT_COUNT] == 3
+        assert attrs[ATTR_INPUT_CONTENT_LENGTH] > 0
+    
+    def test_handles_empty_string_input(self, setup_tracing):
+        """Test handling of empty string input."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(
+            input="",
+            model="test-model"
+        )
+        
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_embedding_span(request) as span:
+                pass
+        
+        spans = memory_exporter.get_finished_spans()
+        span = spans[0]
+        attrs = span.attributes
+        
+        assert attrs[ATTR_INPUT_COUNT] == 1
+        assert attrs[ATTR_INPUT_CONTENT_LENGTH] == 0
+    
+    def test_handles_empty_list_input(self, setup_tracing):
+        """Test handling of empty list input."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(
+            input=[],
+            model="test-model"
+        )
+        
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_embedding_span(request):
+                pass
+        
+        spans = memory_exporter.get_finished_spans()
+        span = spans[0]
+        attrs = span.attributes
+        
+        assert attrs[ATTR_INPUT_COUNT] == 0
+        assert attrs[ATTR_INPUT_CONTENT_LENGTH] == 0
+    
+    def test_handles_list_with_empty_strings(self, setup_tracing):
+        """Test handling of list containing empty strings."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(
+            input=["Hello", "", "World", ""],
+            model="test-model"
+        )
+        
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_embedding_span(request) as span:
+                pass
+        
+        spans = memory_exporter.get_finished_spans()
+        span = spans[0]
+        attrs = span.attributes
+        
+        assert attrs[ATTR_INPUT_COUNT] == 4
+        assert attrs[ATTR_INPUT_CONTENT_LENGTH] == 10  # len("Hello") + len("World") = 5 + 5
+    
+    def test_handles_exceptions(self, setup_tracing):
+        """Test exception handling in embedding span context."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(input="test", model="test-model")
+        
+        with pytest.raises(ValueError):
+            with patch('slm_server.utils.spans.tracer', local_tracer):
+                with slm_embedding_span(request) as span:
+                    raise ValueError("test embedding error")
+        
+        spans = memory_exporter.get_finished_spans()
+        span = spans[0]
+        
+        assert span.status.status_code == StatusCode.ERROR
+        assert "test embedding error" in span.status.description
+        assert span.attributes["slm.force_sample"] is True
+
+
+class TestEmbeddingModelValidation:
+    """Tests for embedding model validation."""
+    
+    def test_embedding_request_string_input(self):
+        """Test EmbeddingRequest with string input."""
+        request = EmbeddingRequest(
+            input="Test input text",
+            model="test-model"
+        )
+        
+        assert request.input == "Test input text"
+        assert request.model == "test-model"
+    
+    def test_embedding_request_list_input(self):
+        """Test EmbeddingRequest with list input."""
+        request = EmbeddingRequest(
+            input=["First", "Second", "Third"],
+            model="test-model"
+        )
+        
+        assert request.input == ["First", "Second", "Third"]
+        assert request.model == "test-model"
+    
+    def test_embedding_request_default_model(self):
+        """Test EmbeddingRequest with default model."""
+        request = EmbeddingRequest(input="Test")
+        
+        assert request.model is None  # Defaults to None; the model name is not important to the server
+    
+    def test_embedding_response_creation(self):
+        """Test EmbeddingResponse creation."""
+        response = EmbeddingResponse(
+            object="list",
+            data=[
+                EmbeddingData(
+                    object="embedding",
+                    embedding=[1.0, 2.0, 3.0],
+                    index=0
+                )
+            ],
+            model="test-model",
+            usage=EmbeddingUsage(prompt_tokens=10, total_tokens=10)
+        )
+        
+        assert response["object"] == "list"
+        assert len(response["data"]) == 1
+        assert response["data"][0]["embedding"] == [1.0, 2.0, 3.0]
+        assert response["data"][0]["index"] == 0
+        assert response["model"] == "test-model"
+        assert response["usage"]["prompt_tokens"] == 10
+        assert response["usage"]["total_tokens"] == 10
+    
+    def test_embedding_data_defaults(self):
+        """Test EmbeddingData with explicit object field."""
+        data = EmbeddingData(
+            object="embedding",
+            embedding=[0.1, 0.2, 0.3],
+            index=0
+        )
+        
+        assert data["object"] == "embedding"
+        assert data["embedding"] == [0.1, 0.2, 0.3]
+        assert data["index"] == 0
+
+
+class TestIntegrationEmbeddingFlow:
+    """Integration test for complete embedding flow."""
+    
+    def test_complete_embedding_flow_string_input(self, setup_tracing):
+        """Test complete flow of embedding request with string input."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(
+            input="This is a test sentence for embedding.",
+            model="test-model"
+        )
+        
+        # Patch the global tracer with our local one
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_embedding_span(request) as span:
+                # Simulate processing embedding
+                response = EmbeddingResponse(
+                    object="list",
+                    data=[
+                        EmbeddingData(
+                            object="embedding",
+                            embedding=[0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8],
+                            index=0
+                        )
+                    ],
+                    model="test-model",
+                    usage=EmbeddingUsage(prompt_tokens=8, total_tokens=8)
+                )
+                
+                set_attribute_response_embedding(span, response)
+        
+        # Get finished span and verify
+        spans = memory_exporter.get_finished_spans()
+        assert len(spans) == 1
+        
+        finished_span = spans[0]
+        
+        # Verify span attributes
+        assert finished_span.name == SPAN_EMBEDDING
+        assert finished_span.attributes[ATTR_MODEL] == "llama-cpp"
+        assert finished_span.attributes[ATTR_INPUT_COUNT] == 1
+        assert finished_span.attributes[ATTR_INPUT_CONTENT_LENGTH] > 0
+        assert finished_span.attributes[ATTR_OUTPUT_COUNT] == 1
+        assert finished_span.attributes[ATTR_PROMPT_TOKENS] == 8
+        assert finished_span.attributes[ATTR_TOTAL_TOKENS] == 8
+    
+    def test_complete_embedding_flow_list_input(self, setup_tracing):
+        """Test complete flow of embedding request with list input."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(
+            input=["First sentence.", "Second sentence.", "Third sentence."],
+            model="test-model"
+        )
+        
+        # Patch the global tracer with our local one
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_embedding_span(request) as span:
+                # Simulate processing multiple embeddings
+                response = EmbeddingResponse(
+                    object="list",
+                    data=[
+                        EmbeddingData(
+                            object="embedding",
+                            embedding=[0.1, 0.2, 0.3],
+                            index=0
+                        ),
+                        EmbeddingData(
+                            object="embedding",
+                            embedding=[0.4, 0.5, 0.6],
+                            index=1
+                        ),
+                        EmbeddingData(
+                            object="embedding",
+                            embedding=[0.7, 0.8, 0.9],
+                            index=2
+                        )
+                    ],
+                    model="test-model",
+                    usage=EmbeddingUsage(prompt_tokens=12, total_tokens=12)
+                )
+                
+                set_attribute_response_embedding(span, response)
+        
+        # Get finished span and verify
+        spans = memory_exporter.get_finished_spans()
+        assert len(spans) == 1
+        
+        finished_span = spans[0]
+        
+        # Verify span attributes
+        assert finished_span.attributes[ATTR_INPUT_COUNT] == 3
+        assert finished_span.attributes[ATTR_INPUT_CONTENT_LENGTH] > 0
+        assert finished_span.attributes[ATTR_OUTPUT_COUNT] == 3
+        assert finished_span.attributes[ATTR_PROMPT_TOKENS] == 12
+        assert finished_span.attributes[ATTR_TOTAL_TOKENS] == 12
+    
+    def test_embedding_flow_with_error(self, setup_tracing):
+        """Test embedding flow with error handling."""
+        memory_exporter, local_tracer = setup_tracing
+        
+        request = EmbeddingRequest(
+            input="This will cause an error.",
+            model="test-model"
+        )
+
+        with pytest.raises(RuntimeError):
+            with patch('slm_server.utils.spans.tracer', local_tracer):
+                with slm_embedding_span(request):
+                    raise RuntimeError("Embedding processing failed")
+
+        # Get finished span and verify error handling
+        spans = memory_exporter.get_finished_spans()
+        assert len(spans) == 1
+
+        finished_span = spans[0]
+
+        # Verify error status
+        assert finished_span.status.status_code == StatusCode.ERROR
+        assert "Embedding processing failed" in finished_span.status.description
+        assert finished_span.attributes["slm.force_sample"] is True
+
+        # Initial attributes are set on span entry, so they survive the error (the input string is 25 characters long)
+        assert finished_span.attributes[ATTR_INPUT_COUNT] == 1
+        assert finished_span.attributes[ATTR_INPUT_CONTENT_LENGTH] == 25
\ No newline at end of file
diff --git a/tests/test_utils.py b/tests/test_utils.py
index 26a4052..9a53533 100644
--- a/tests/test_utils.py
+++ b/tests/test_utils.py
@@ -9,16 +9,17 @@
 from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
 from opentelemetry.trace import Status, StatusCode, set_tracer_provider
 
-from slm_server.model import (
-    ChatCompletionRequest,
-    ChatCompletionResponse,
-    ChatCompletionStreamResponse,
-    ChatMessage,
-    Usage,
-    ChatCompletionChoice,
-    ChatCompletionStreamChoice,
-    DeltaMessage,
+from llama_cpp.llama_types import (
+    ChatCompletionRequestMessage,
+    ChatCompletionResponseMessage as ChatMessage,
+    CreateChatCompletionResponse as ChatCompletionResponse,
+    CreateChatCompletionStreamResponse as ChatCompletionStreamResponse,
+    CompletionUsage as Usage,
+    ChatCompletionResponseChoice as ChatCompletionChoice,
+    ChatCompletionStreamResponseChoice as ChatCompletionStreamChoice,
+    ChatCompletionStreamResponseDelta as DeltaMessage,
 )
+from slm_server.model import ChatCompletionRequest
 from slm_server.utils import (
     # EVENT_ATTR_CHUNK_CONTENT,
     EVENT_ATTR_CHUNK_CONTENT_SIZE,
@@ -38,13 +39,13 @@
     METRIC_TOTAL_TOKENS_PER_SECOND,
     SLMLoggingSpanProcessor,
     SLMMetricsSpanProcessor,
-    _calculate_chunk_metrics_from_events,
     calculate_performance_metrics,
     set_atrribute_response,
     set_atrribute_response_stream,
     slm_span,
-    tracer,
 )
+from slm_server.utils.metrics import _calculate_chunk_metrics_from_events
+from slm_server.utils.spans import tracer
 
 
 @pytest.fixture
@@ -342,8 +343,8 @@ def test_sets_initial_attributes(self, setup_tracing):
         )
         
         # Patch the global tracer with our local one
-        with patch('slm_server.utils.tracer', local_tracer):
-            with slm_span(request, is_streaming=True) as (span, messages):
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_span(request, is_streaming=True) as span:
                 pass
         
         # Get the finished span
@@ -370,8 +371,8 @@ def test_estimates_prompt_tokens_for_streaming(self, setup_tracing):
         )
         
         # Patch the global tracer with our local one
-        with patch('slm_server.utils.tracer', local_tracer):
-            with slm_span(request, is_streaming=True) as (span, messages):
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_span(request, is_streaming=True) as span:
                 pass
         
         spans = memory_exporter.get_finished_spans()
@@ -388,8 +389,8 @@ def test_handles_exceptions(self, setup_tracing):
         
         with pytest.raises(ValueError):
             # Patch the global tracer with our local one
-            with patch('slm_server.utils.tracer', local_tracer):
-                with slm_span(request, is_streaming=False) as (span, messages):
+            with patch('slm_server.utils.spans.tracer', local_tracer):
+                with slm_span(request, is_streaming=False) as span:
                     raise ValueError("test error")
         
         spans = memory_exporter.get_finished_spans()
@@ -607,7 +608,6 @@ def test_records_error_metrics(self):
             mock_labels.assert_called_with(
                 model="test-model", 
                 streaming="non_streaming", 
-                error_type="str"  # type of string description
             )
             mock_counter.inc.assert_called_once()
 
@@ -627,8 +627,8 @@ def test_complete_streaming_flow(self, setup_tracing):
         )
         
         # Patch the global tracer with our local one
-        with patch('slm_server.utils.tracer', local_tracer):
-            with slm_span(request, is_streaming=True) as (span, messages_for_llm):
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_span(request, is_streaming=True) as span:
                 # Simulate processing chunks
                 chunks = [
                     ChatCompletionStreamResponse(
@@ -699,8 +699,8 @@ def test_complete_non_streaming_flow(self, setup_tracing):
         )
         
         # Patch the global tracer with our local one
-        with patch('slm_server.utils.tracer', local_tracer):
-            with slm_span(request, is_streaming=False) as (span, messages_for_llm):
+        with patch('slm_server.utils.spans.tracer', local_tracer):
+            with slm_span(request, is_streaming=False) as span:
                 # Simulate processing response
                 response = ChatCompletionResponse(
                     model="test-model",
diff --git a/tests/test_utils_simple.py b/tests/test_utils_simple.py
index 158a5c6..f9bac61 100644
--- a/tests/test_utils_simple.py
+++ b/tests/test_utils_simple.py
@@ -4,16 +4,17 @@
 
 import pytest
 
-from slm_server.model import (
-    ChatCompletionRequest,
-    ChatCompletionResponse,
-    ChatCompletionStreamResponse,
-    ChatMessage,
-    Usage,
-    ChatCompletionChoice,
-    ChatCompletionStreamChoice,
-    DeltaMessage,
+from llama_cpp.llama_types import (
+    ChatCompletionRequestMessage,
+    ChatCompletionResponseMessage as ChatMessage,
+    CreateChatCompletionResponse as ChatCompletionResponse,
+    CreateChatCompletionStreamResponse as ChatCompletionStreamResponse,
+    CompletionUsage as Usage,
+    ChatCompletionResponseChoice as ChatCompletionChoice,
+    ChatCompletionStreamResponseChoice as ChatCompletionStreamChoice,
+    ChatCompletionStreamResponseDelta as DeltaMessage,
 )
+from slm_server.model import ChatCompletionRequest
 from slm_server.utils import (
     ATTR_CHUNK_COUNT,
     EVENT_ATTR_CHUNK_CONTENT_SIZE,
@@ -30,11 +31,11 @@
     METRIC_TOKENS_PER_SECOND,
     METRIC_TOTAL_DURATION,
     METRIC_TOTAL_TOKENS_PER_SECOND,
-    _calculate_chunk_metrics_from_events,
     calculate_performance_metrics,
     set_atrribute_response,
     set_atrribute_response_stream,
 )
+from slm_server.utils.metrics import _calculate_chunk_metrics_from_events
 
 
 class TestSetAttributeResponse:
diff --git a/uv.lock b/uv.lock
index 19a7679..5ed8bd2 100644
--- a/uv.lock
+++ b/uv.lock
@@ -42,6 +42,28 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/4f/52/34c6cf5bb9285074dc3531c437b3919e825d976fde097a7a73f79e726d03/certifi-2025.7.14-py3-none-any.whl", hash = "sha256:6b31f564a415d79ee77df69d757bb49a5bb53bd9f756cbbe24394ffd6fc1f4b2", size = 162722, upload_time = "2025-07-14T03:29:26.863Z" },
 ]
 
+[[package]]
+name = "cffi"
+version = "1.17.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "pycparser" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/fc/97/c783634659c2920c3fc70419e3af40972dbaf758daa229a7d6ea6135c90d/cffi-1.17.1.tar.gz", hash = "sha256:1c39c6016c32bc48dd54561950ebd6836e1670f2ae46128f67cf49e789c52824", size = 516621, upload_time = "2024-09-04T20:45:21.852Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/8d/f8/dd6c246b148639254dad4d6803eb6a54e8c85c6e11ec9df2cffa87571dbe/cffi-1.17.1-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:f3a2b4222ce6b60e2e8b337bb9596923045681d71e5a082783484d845390938e", size = 182989, upload_time = "2024-09-04T20:44:28.956Z" },
+    { url = "https://files.pythonhosted.org/packages/8b/f1/672d303ddf17c24fc83afd712316fda78dc6fce1cd53011b839483e1ecc8/cffi-1.17.1-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:0984a4925a435b1da406122d4d7968dd861c1385afe3b45ba82b750f229811e2", size = 178802, upload_time = "2024-09-04T20:44:30.289Z" },
+    { url = "https://files.pythonhosted.org/packages/0e/2d/eab2e858a91fdff70533cab61dcff4a1f55ec60425832ddfdc9cd36bc8af/cffi-1.17.1-cp313-cp313-manylinux_2_12_i686.manylinux2010_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:d01b12eeeb4427d3110de311e1774046ad344f5b1a7403101878976ecd7a10f3", size = 454792, upload_time = "2024-09-04T20:44:32.01Z" },
+    { url = "https://files.pythonhosted.org/packages/75/b2/fbaec7c4455c604e29388d55599b99ebcc250a60050610fadde58932b7ee/cffi-1.17.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:706510fe141c86a69c8ddc029c7910003a17353970cff3b904ff0686a5927683", size = 478893, upload_time = "2024-09-04T20:44:33.606Z" },
+    { url = "https://files.pythonhosted.org/packages/4f/b7/6e4a2162178bf1935c336d4da8a9352cccab4d3a5d7914065490f08c0690/cffi-1.17.1-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:de55b766c7aa2e2a3092c51e0483d700341182f08e67c63630d5b6f200bb28e5", size = 485810, upload_time = "2024-09-04T20:44:35.191Z" },
+    { url = "https://files.pythonhosted.org/packages/c7/8a/1d0e4a9c26e54746dc08c2c6c037889124d4f59dffd853a659fa545f1b40/cffi-1.17.1-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:c59d6e989d07460165cc5ad3c61f9fd8f1b4796eacbd81cee78957842b834af4", size = 471200, upload_time = "2024-09-04T20:44:36.743Z" },
+    { url = "https://files.pythonhosted.org/packages/26/9f/1aab65a6c0db35f43c4d1b4f580e8df53914310afc10ae0397d29d697af4/cffi-1.17.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:dd398dbc6773384a17fe0d3e7eeb8d1a21c2200473ee6806bb5e6a8e62bb73dd", size = 479447, upload_time = "2024-09-04T20:44:38.492Z" },
+    { url = "https://files.pythonhosted.org/packages/5f/e4/fb8b3dd8dc0e98edf1135ff067ae070bb32ef9d509d6cb0f538cd6f7483f/cffi-1.17.1-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:3edc8d958eb099c634dace3c7e16560ae474aa3803a5df240542b305d14e14ed", size = 484358, upload_time = "2024-09-04T20:44:40.046Z" },
+    { url = "https://files.pythonhosted.org/packages/f1/47/d7145bf2dc04684935d57d67dff9d6d795b2ba2796806bb109864be3a151/cffi-1.17.1-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:72e72408cad3d5419375fc87d289076ee319835bdfa2caad331e377589aebba9", size = 488469, upload_time = "2024-09-04T20:44:41.616Z" },
+    { url = "https://files.pythonhosted.org/packages/bf/ee/f94057fa6426481d663b88637a9a10e859e492c73d0384514a17d78ee205/cffi-1.17.1-cp313-cp313-win32.whl", hash = "sha256:e03eab0a8677fa80d646b5ddece1cbeaf556c313dcfac435ba11f107ba117b5d", size = 172475, upload_time = "2024-09-04T20:44:43.733Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/fc/6a8cb64e5f0324877d503c854da15d76c1e50eb722e320b15345c4d0c6de/cffi-1.17.1-cp313-cp313-win_amd64.whl", hash = "sha256:f6a16c31041f09ead72d69f583767292f750d24913dadacf5756b966aacb3f1a", size = 182009, upload_time = "2024-09-04T20:44:45.309Z" },
+]
+
 [[package]]
 name = "charset-normalizer"
 version = "3.4.2"
@@ -125,6 +147,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/3f/27/4570e78fc0bf5ea0ca45eb1de3818a23787af9b390c0b0a0033a1b8236f9/diskcache-5.6.3-py3-none-any.whl", hash = "sha256:5e31b2d5fbad117cc363ebaf6b689474db18a1f6438bc82358b024abd4c2ca19", size = 45550, upload_time = "2023-08-31T06:11:58.822Z" },
 ]
 
+[[package]]
+name = "distro"
+version = "1.9.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/fc/f8/98eea607f65de6527f8a2e8885fc8015d3e6f5775df186e443e0964a11c3/distro-1.9.0.tar.gz", hash = "sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed", size = 60722, upload_time = "2023-12-24T09:54:32.31Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/12/b3/231ffd4ab1fc9d679809f356cebee130ac7daa00d6d6f3206dd4fd137e9e/distro-1.9.0-py3-none-any.whl", hash = "sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2", size = 20277, upload_time = "2023-12-24T09:54:30.421Z" },
+]
+
 [[package]]
 name = "fastapi"
 version = "0.116.1"
@@ -151,6 +182,30 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/86/f1/62a193f0227cf15a920390abe675f386dec35f7ae3ffe6da582d3ade42c7/googleapis_common_protos-1.70.0-py3-none-any.whl", hash = "sha256:b8bfcca8c25a2bb253e0e0b0adaf8c00773e5e6af6fd92397576680b807e0fd8", size = 294530, upload_time = "2025-04-14T10:17:01.271Z" },
 ]
 
+[[package]]
+name = "greenlet"
+version = "3.2.3"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/c9/92/bb85bd6e80148a4d2e0c59f7c0c2891029f8fd510183afc7d8d2feeed9b6/greenlet-3.2.3.tar.gz", hash = "sha256:8b0dd8ae4c0d6f5e54ee55ba935eeb3d735a9b58a8a1e5b5cbab64e01a39f365", size = 185752, upload_time = "2025-06-05T16:16:09.955Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/b1/cf/f5c0b23309070ae93de75c90d29300751a5aacefc0a3ed1b1d8edb28f08b/greenlet-3.2.3-cp313-cp313-macosx_11_0_universal2.whl", hash = "sha256:500b8689aa9dd1ab26872a34084503aeddefcb438e2e7317b89b11eaea1901ad", size = 270732, upload_time = "2025-06-05T16:10:08.26Z" },
+    { url = "https://files.pythonhosted.org/packages/48/ae/91a957ba60482d3fecf9be49bc3948f341d706b52ddb9d83a70d42abd498/greenlet-3.2.3-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:a07d3472c2a93117af3b0136f246b2833fdc0b542d4a9799ae5f41c28323faef", size = 639033, upload_time = "2025-06-05T16:38:53.983Z" },
+    { url = "https://files.pythonhosted.org/packages/6f/df/20ffa66dd5a7a7beffa6451bdb7400d66251374ab40b99981478c69a67a8/greenlet-3.2.3-cp313-cp313-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:8704b3768d2f51150626962f4b9a9e4a17d2e37c8a8d9867bbd9fa4eb938d3b3", size = 652999, upload_time = "2025-06-05T16:41:37.89Z" },
+    { url = "https://files.pythonhosted.org/packages/51/b4/ebb2c8cb41e521f1d72bf0465f2f9a2fd803f674a88db228887e6847077e/greenlet-3.2.3-cp313-cp313-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:5035d77a27b7c62db6cf41cf786cfe2242644a7a337a0e155c80960598baab95", size = 647368, upload_time = "2025-06-05T16:48:21.467Z" },
+    { url = "https://files.pythonhosted.org/packages/8e/6a/1e1b5aa10dced4ae876a322155705257748108b7fd2e4fae3f2a091fe81a/greenlet-3.2.3-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:2d8aa5423cd4a396792f6d4580f88bdc6efcb9205891c9d40d20f6e670992efb", size = 650037, upload_time = "2025-06-05T16:13:06.402Z" },
+    { url = "https://files.pythonhosted.org/packages/26/f2/ad51331a157c7015c675702e2d5230c243695c788f8f75feba1af32b3617/greenlet-3.2.3-cp313-cp313-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:2c724620a101f8170065d7dded3f962a2aea7a7dae133a009cada42847e04a7b", size = 608402, upload_time = "2025-06-05T16:12:51.91Z" },
+    { url = "https://files.pythonhosted.org/packages/26/bc/862bd2083e6b3aff23300900a956f4ea9a4059de337f5c8734346b9b34fc/greenlet-3.2.3-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:873abe55f134c48e1f2a6f53f7d1419192a3d1a4e873bace00499a4e45ea6af0", size = 1119577, upload_time = "2025-06-05T16:36:49.787Z" },
+    { url = "https://files.pythonhosted.org/packages/86/94/1fc0cc068cfde885170e01de40a619b00eaa8f2916bf3541744730ffb4c3/greenlet-3.2.3-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:024571bbce5f2c1cfff08bf3fbaa43bbc7444f580ae13b0099e95d0e6e67ed36", size = 1147121, upload_time = "2025-06-05T16:12:42.527Z" },
+    { url = "https://files.pythonhosted.org/packages/27/1a/199f9587e8cb08a0658f9c30f3799244307614148ffe8b1e3aa22f324dea/greenlet-3.2.3-cp313-cp313-win_amd64.whl", hash = "sha256:5195fb1e75e592dd04ce79881c8a22becdfa3e6f500e7feb059b1e6fdd54d3e3", size = 297603, upload_time = "2025-06-05T16:20:12.651Z" },
+    { url = "https://files.pythonhosted.org/packages/d8/ca/accd7aa5280eb92b70ed9e8f7fd79dc50a2c21d8c73b9a0856f5b564e222/greenlet-3.2.3-cp314-cp314-macosx_11_0_universal2.whl", hash = "sha256:3d04332dddb10b4a211b68111dabaee2e1a073663d117dc10247b5b1642bac86", size = 271479, upload_time = "2025-06-05T16:10:47.525Z" },
+    { url = "https://files.pythonhosted.org/packages/55/71/01ed9895d9eb49223280ecc98a557585edfa56b3d0e965b9fa9f7f06b6d9/greenlet-3.2.3-cp314-cp314-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:8186162dffde068a465deab08fc72c767196895c39db26ab1c17c0b77a6d8b97", size = 683952, upload_time = "2025-06-05T16:38:55.125Z" },
+    { url = "https://files.pythonhosted.org/packages/ea/61/638c4bdf460c3c678a0a1ef4c200f347dff80719597e53b5edb2fb27ab54/greenlet-3.2.3-cp314-cp314-manylinux2014_ppc64le.manylinux_2_17_ppc64le.whl", hash = "sha256:f4bfbaa6096b1b7a200024784217defedf46a07c2eee1a498e94a1b5f8ec5728", size = 696917, upload_time = "2025-06-05T16:41:38.959Z" },
+    { url = "https://files.pythonhosted.org/packages/22/cc/0bd1a7eb759d1f3e3cc2d1bc0f0b487ad3cc9f34d74da4b80f226fde4ec3/greenlet-3.2.3-cp314-cp314-manylinux2014_s390x.manylinux_2_17_s390x.whl", hash = "sha256:ed6cfa9200484d234d8394c70f5492f144b20d4533f69262d530a1a082f6ee9a", size = 692443, upload_time = "2025-06-05T16:48:23.113Z" },
+    { url = "https://files.pythonhosted.org/packages/67/10/b2a4b63d3f08362662e89c103f7fe28894a51ae0bc890fabf37d1d780e52/greenlet-3.2.3-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:02b0df6f63cd15012bed5401b47829cfd2e97052dc89da3cfaf2c779124eb892", size = 692995, upload_time = "2025-06-05T16:13:07.972Z" },
+    { url = "https://files.pythonhosted.org/packages/5a/c6/ad82f148a4e3ce9564056453a71529732baf5448ad53fc323e37efe34f66/greenlet-3.2.3-cp314-cp314-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl", hash = "sha256:86c2d68e87107c1792e2e8d5399acec2487a4e993ab76c792408e59394d52141", size = 655320, upload_time = "2025-06-05T16:12:53.453Z" },
+    { url = "https://files.pythonhosted.org/packages/5c/4f/aab73ecaa6b3086a4c89863d94cf26fa84cbff63f52ce9bc4342b3087a06/greenlet-3.2.3-cp314-cp314-win_amd64.whl", hash = "sha256:8c47aae8fbbfcf82cc13327ae802ba13c9c36753b67e760023fd116bc124a62a", size = 301236, upload_time = "2025-06-05T16:15:20.111Z" },
+]
+
 [[package]]
 name = "grpcio"
 version = "1.73.1"
@@ -248,6 +303,143 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/62/a1/3d680cbfd5f4b8f15abc1d571870c5fc3e594bb582bc3b64ea099db13e56/jinja2-3.1.6-py3-none-any.whl", hash = "sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67", size = 134899, upload_time = "2025-03-05T20:05:00.369Z" },
 ]
 
+[[package]]
+name = "jiter"
+version = "0.10.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/ee/9d/ae7ddb4b8ab3fb1b51faf4deb36cb48a4fbbd7cb36bad6a5fca4741306f7/jiter-0.10.0.tar.gz", hash = "sha256:07a7142c38aacc85194391108dc91b5b57093c978a9932bd86a36862759d9500", size = 162759, upload_time = "2025-05-18T19:04:59.73Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/2e/b0/279597e7a270e8d22623fea6c5d4eeac328e7d95c236ed51a2b884c54f70/jiter-0.10.0-cp313-cp313-macosx_10_12_x86_64.whl", hash = "sha256:e0588107ec8e11b6f5ef0e0d656fb2803ac6cf94a96b2b9fc675c0e3ab5e8644", size = 311617, upload_time = "2025-05-18T19:04:02.078Z" },
+    { url = "https://files.pythonhosted.org/packages/91/e3/0916334936f356d605f54cc164af4060e3e7094364add445a3bc79335d46/jiter-0.10.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:cafc4628b616dc32530c20ee53d71589816cf385dd9449633e910d596b1f5c8a", size = 318947, upload_time = "2025-05-18T19:04:03.347Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/8e/fd94e8c02d0e94539b7d669a7ebbd2776e51f329bb2c84d4385e8063a2ad/jiter-0.10.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:520ef6d981172693786a49ff5b09eda72a42e539f14788124a07530f785c3ad6", size = 344618, upload_time = "2025-05-18T19:04:04.709Z" },
+    { url = "https://files.pythonhosted.org/packages/6f/b0/f9f0a2ec42c6e9c2e61c327824687f1e2415b767e1089c1d9135f43816bd/jiter-0.10.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:554dedfd05937f8fc45d17ebdf298fe7e0c77458232bcb73d9fbbf4c6455f5b3", size = 368829, upload_time = "2025-05-18T19:04:06.912Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/57/5bbcd5331910595ad53b9fd0c610392ac68692176f05ae48d6ce5c852967/jiter-0.10.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:5bc299da7789deacf95f64052d97f75c16d4fc8c4c214a22bf8d859a4288a1c2", size = 491034, upload_time = "2025-05-18T19:04:08.222Z" },
+    { url = "https://files.pythonhosted.org/packages/9b/be/c393df00e6e6e9e623a73551774449f2f23b6ec6a502a3297aeeece2c65a/jiter-0.10.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:5161e201172de298a8a1baad95eb85db4fb90e902353b1f6a41d64ea64644e25", size = 388529, upload_time = "2025-05-18T19:04:09.566Z" },
+    { url = "https://files.pythonhosted.org/packages/42/3e/df2235c54d365434c7f150b986a6e35f41ebdc2f95acea3036d99613025d/jiter-0.10.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2e2227db6ba93cb3e2bf67c87e594adde0609f146344e8207e8730364db27041", size = 350671, upload_time = "2025-05-18T19:04:10.98Z" },
+    { url = "https://files.pythonhosted.org/packages/c6/77/71b0b24cbcc28f55ab4dbfe029f9a5b73aeadaba677843fc6dc9ed2b1d0a/jiter-0.10.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:15acb267ea5e2c64515574b06a8bf393fbfee6a50eb1673614aa45f4613c0cca", size = 390864, upload_time = "2025-05-18T19:04:12.722Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/d3/ef774b6969b9b6178e1d1e7a89a3bd37d241f3d3ec5f8deb37bbd203714a/jiter-0.10.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:901b92f2e2947dc6dfcb52fd624453862e16665ea909a08398dde19c0731b7f4", size = 522989, upload_time = "2025-05-18T19:04:14.261Z" },
+    { url = "https://files.pythonhosted.org/packages/0c/41/9becdb1d8dd5d854142f45a9d71949ed7e87a8e312b0bede2de849388cb9/jiter-0.10.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:d0cb9a125d5a3ec971a094a845eadde2db0de85b33c9f13eb94a0c63d463879e", size = 513495, upload_time = "2025-05-18T19:04:15.603Z" },
+    { url = "https://files.pythonhosted.org/packages/9c/36/3468e5a18238bdedae7c4d19461265b5e9b8e288d3f86cd89d00cbb48686/jiter-0.10.0-cp313-cp313-win32.whl", hash = "sha256:48a403277ad1ee208fb930bdf91745e4d2d6e47253eedc96e2559d1e6527006d", size = 211289, upload_time = "2025-05-18T19:04:17.541Z" },
+    { url = "https://files.pythonhosted.org/packages/7e/07/1c96b623128bcb913706e294adb5f768fb7baf8db5e1338ce7b4ee8c78ef/jiter-0.10.0-cp313-cp313-win_amd64.whl", hash = "sha256:75f9eb72ecb640619c29bf714e78c9c46c9c4eaafd644bf78577ede459f330d4", size = 205074, upload_time = "2025-05-18T19:04:19.21Z" },
+    { url = "https://files.pythonhosted.org/packages/54/46/caa2c1342655f57d8f0f2519774c6d67132205909c65e9aa8255e1d7b4f4/jiter-0.10.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:28ed2a4c05a1f32ef0e1d24c2611330219fed727dae01789f4a335617634b1ca", size = 318225, upload_time = "2025-05-18T19:04:20.583Z" },
+    { url = "https://files.pythonhosted.org/packages/43/84/c7d44c75767e18946219ba2d703a5a32ab37b0bc21886a97bc6062e4da42/jiter-0.10.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:14a4c418b1ec86a195f1ca69da8b23e8926c752b685af665ce30777233dfe070", size = 350235, upload_time = "2025-05-18T19:04:22.363Z" },
+    { url = "https://files.pythonhosted.org/packages/01/16/f5a0135ccd968b480daad0e6ab34b0c7c5ba3bc447e5088152696140dcb3/jiter-0.10.0-cp313-cp313t-win_amd64.whl", hash = "sha256:d7bfed2fe1fe0e4dda6ef682cee888ba444b21e7a6553e03252e4feb6cf0adca", size = 207278, upload_time = "2025-05-18T19:04:23.627Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/9b/1d646da42c3de6c2188fdaa15bce8ecb22b635904fc68be025e21249ba44/jiter-0.10.0-cp314-cp314-macosx_10_12_x86_64.whl", hash = "sha256:5e9251a5e83fab8d87799d3e1a46cb4b7f2919b895c6f4483629ed2446f66522", size = 310866, upload_time = "2025-05-18T19:04:24.891Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/0e/26538b158e8a7c7987e94e7aeb2999e2e82b1f9d2e1f6e9874ddf71ebda0/jiter-0.10.0-cp314-cp314-macosx_11_0_arm64.whl", hash = "sha256:023aa0204126fe5b87ccbcd75c8a0d0261b9abdbbf46d55e7ae9f8e22424eeb8", size = 318772, upload_time = "2025-05-18T19:04:26.161Z" },
+    { url = "https://files.pythonhosted.org/packages/7b/fb/d302893151caa1c2636d6574d213e4b34e31fd077af6050a9c5cbb42f6fb/jiter-0.10.0-cp314-cp314-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:3c189c4f1779c05f75fc17c0c1267594ed918996a231593a21a5ca5438445216", size = 344534, upload_time = "2025-05-18T19:04:27.495Z" },
+    { url = "https://files.pythonhosted.org/packages/01/d8/5780b64a149d74e347c5128d82176eb1e3241b1391ac07935693466d6219/jiter-0.10.0-cp314-cp314-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:15720084d90d1098ca0229352607cd68256c76991f6b374af96f36920eae13c4", size = 369087, upload_time = "2025-05-18T19:04:28.896Z" },
+    { url = "https://files.pythonhosted.org/packages/e8/5b/f235a1437445160e777544f3ade57544daf96ba7e96c1a5b24a6f7ac7004/jiter-0.10.0-cp314-cp314-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:e4f2fb68e5f1cfee30e2b2a09549a00683e0fde4c6a2ab88c94072fc33cb7426", size = 490694, upload_time = "2025-05-18T19:04:30.183Z" },
+    { url = "https://files.pythonhosted.org/packages/85/a9/9c3d4617caa2ff89cf61b41e83820c27ebb3f7b5fae8a72901e8cd6ff9be/jiter-0.10.0-cp314-cp314-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:ce541693355fc6da424c08b7edf39a2895f58d6ea17d92cc2b168d20907dee12", size = 388992, upload_time = "2025-05-18T19:04:32.028Z" },
+    { url = "https://files.pythonhosted.org/packages/68/b1/344fd14049ba5c94526540af7eb661871f9c54d5f5601ff41a959b9a0bbd/jiter-0.10.0-cp314-cp314-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:31c50c40272e189d50006ad5c73883caabb73d4e9748a688b216e85a9a9ca3b9", size = 351723, upload_time = "2025-05-18T19:04:33.467Z" },
+    { url = "https://files.pythonhosted.org/packages/41/89/4c0e345041186f82a31aee7b9d4219a910df672b9fef26f129f0cda07a29/jiter-0.10.0-cp314-cp314-manylinux_2_5_i686.manylinux1_i686.whl", hash = "sha256:fa3402a2ff9815960e0372a47b75c76979d74402448509ccd49a275fa983ef8a", size = 392215, upload_time = "2025-05-18T19:04:34.827Z" },
+    { url = "https://files.pythonhosted.org/packages/55/58/ee607863e18d3f895feb802154a2177d7e823a7103f000df182e0f718b38/jiter-0.10.0-cp314-cp314-musllinux_1_1_aarch64.whl", hash = "sha256:1956f934dca32d7bb647ea21d06d93ca40868b505c228556d3373cbd255ce853", size = 522762, upload_time = "2025-05-18T19:04:36.19Z" },
+    { url = "https://files.pythonhosted.org/packages/15/d0/9123fb41825490d16929e73c212de9a42913d68324a8ce3c8476cae7ac9d/jiter-0.10.0-cp314-cp314-musllinux_1_1_x86_64.whl", hash = "sha256:fcedb049bdfc555e261d6f65a6abe1d5ad68825b7202ccb9692636c70fcced86", size = 513427, upload_time = "2025-05-18T19:04:37.544Z" },
+    { url = "https://files.pythonhosted.org/packages/d8/b3/2bd02071c5a2430d0b70403a34411fc519c2f227da7b03da9ba6a956f931/jiter-0.10.0-cp314-cp314-win32.whl", hash = "sha256:ac509f7eccca54b2a29daeb516fb95b6f0bd0d0d8084efaf8ed5dfc7b9f0b357", size = 210127, upload_time = "2025-05-18T19:04:38.837Z" },
+    { url = "https://files.pythonhosted.org/packages/03/0c/5fe86614ea050c3ecd728ab4035534387cd41e7c1855ef6c031f1ca93e3f/jiter-0.10.0-cp314-cp314t-macosx_11_0_arm64.whl", hash = "sha256:5ed975b83a2b8639356151cef5c0d597c68376fc4922b45d0eb384ac058cfa00", size = 318527, upload_time = "2025-05-18T19:04:40.612Z" },
+    { url = "https://files.pythonhosted.org/packages/b3/4a/4175a563579e884192ba6e81725fc0448b042024419be8d83aa8a80a3f44/jiter-0.10.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3aa96f2abba33dc77f79b4cf791840230375f9534e5fac927ccceb58c5e604a5", size = 354213, upload_time = "2025-05-18T19:04:41.894Z" },
+]
+
+[[package]]
+name = "jsonpatch"
+version = "1.33"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "jsonpointer" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/42/78/18813351fe5d63acad16aec57f94ec2b70a09e53ca98145589e185423873/jsonpatch-1.33.tar.gz", hash = "sha256:9fcd4009c41e6d12348b4a0ff2563ba56a2923a7dfee731d004e212e1ee5030c", size = 21699, upload_time = "2023-06-26T12:07:29.144Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/73/07/02e16ed01e04a374e644b575638ec7987ae846d25ad97bcc9945a3ee4b0e/jsonpatch-1.33-py2.py3-none-any.whl", hash = "sha256:0ae28c0cd062bbd8b8ecc26d7d164fbbea9652a1a3693f3b956c1eae5145dade", size = 12898, upload_time = "2023-06-16T21:01:28.466Z" },
+]
+
+[[package]]
+name = "jsonpointer"
+version = "3.0.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/6a/0a/eebeb1fa92507ea94016a2a790b93c2ae41a7e18778f85471dc54475ed25/jsonpointer-3.0.0.tar.gz", hash = "sha256:2b2d729f2091522d61c3b31f82e11870f60b68f43fbc705cb76bf4b832af59ef", size = 9114, upload_time = "2024-06-10T19:24:42.462Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/71/92/5e77f98553e9e75130c78900d000368476aed74276eb8ae8796f65f00918/jsonpointer-3.0.0-py2.py3-none-any.whl", hash = "sha256:13e088adc14fca8b6aa8177c044e12701e6ad4b28ff10e65f2267a90109c9942", size = 7595, upload_time = "2024-06-10T19:24:40.698Z" },
+]
+
+[[package]]
+name = "langchain"
+version = "0.3.26"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "langchain-core" },
+    { name = "langchain-text-splitters" },
+    { name = "langsmith" },
+    { name = "pydantic" },
+    { name = "pyyaml" },
+    { name = "requests" },
+    { name = "sqlalchemy" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/7f/13/a9931800ee42bbe0f8850dd540de14e80dda4945e7ee36e20b5d5964286e/langchain-0.3.26.tar.gz", hash = "sha256:8ff034ee0556d3e45eff1f1e96d0d745ced57858414dba7171c8ebdbeb5580c9", size = 10226808, upload_time = "2025-06-20T22:23:01.174Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/f1/f2/c09a2e383283e3af1db669ab037ac05a45814f4b9c472c48dc24c0cef039/langchain-0.3.26-py3-none-any.whl", hash = "sha256:361bb2e61371024a8c473da9f9c55f4ee50f269c5ab43afdb2b1309cb7ac36cf", size = 1012336, upload_time = "2025-06-20T22:22:58.874Z" },
+]
+
+[[package]]
+name = "langchain-core"
+version = "0.3.71"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "jsonpatch" },
+    { name = "langsmith" },
+    { name = "packaging" },
+    { name = "pydantic" },
+    { name = "pyyaml" },
+    { name = "tenacity" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/23/ea/f7089f7557673b2ac71396ab4bd4322ec959fd7d3901232998ba22c8f953/langchain_core-0.3.71.tar.gz", hash = "sha256:03ce06ba86bd1fa202b7b704d81554306f9cf5a3044b80d9a8ea7d93eab08623", size = 567226, upload_time = "2025-07-22T19:55:59.122Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/32/1b/e9af4aac9623d63596c499f619082fa48c4b995696b6d2e8e98e53423809/langchain_core-0.3.71-py3-none-any.whl", hash = "sha256:cce6f3faae57d23bc4f2b41246b9dcf06b8dcdf52caaf6afd62b0849df20ba23", size = 442804, upload_time = "2025-07-22T19:55:57.879Z" },
+]
+
+[[package]]
+name = "langchain-openai"
+version = "0.3.28"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "langchain-core" },
+    { name = "openai" },
+    { name = "tiktoken" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/6b/1d/90cd764c62d5eb822113d3debc3abe10c8807d2c0af90917bfe09acd6f86/langchain_openai-0.3.28.tar.gz", hash = "sha256:6c669548dbdea325c034ae5ef699710e2abd054c7354fdb3ef7bf909dc739d9e", size = 753951, upload_time = "2025-07-14T10:50:44.076Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/91/56/75f3d84b69b8bdae521a537697375e1241377627c32b78edcae337093502/langchain_openai-0.3.28-py3-none-any.whl", hash = "sha256:4cd6d80a5b2ae471a168017bc01b2e0f01548328d83532400a001623624ede67", size = 70571, upload_time = "2025-07-14T10:50:42.492Z" },
+]
+
+[[package]]
+name = "langchain-text-splitters"
+version = "0.3.8"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "langchain-core" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/e7/ac/b4a25c5716bb0103b1515f1f52cc69ffb1035a5a225ee5afe3aed28bf57b/langchain_text_splitters-0.3.8.tar.gz", hash = "sha256:116d4b9f2a22dda357d0b79e30acf005c5518177971c66a9f1ab0edfdb0f912e", size = 42128, upload_time = "2025-04-04T14:03:51.521Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/8b/a3/3696ff2444658053c01b6b7443e761f28bb71217d82bb89137a978c5f66f/langchain_text_splitters-0.3.8-py3-none-any.whl", hash = "sha256:e75cc0f4ae58dcf07d9f18776400cf8ade27fadd4ff6d264df6278bb302f6f02", size = 32440, upload_time = "2025-04-04T14:03:50.6Z" },
+]
+
+[[package]]
+name = "langsmith"
+version = "0.4.8"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "httpx" },
+    { name = "orjson", marker = "platform_python_implementation != 'PyPy'" },
+    { name = "packaging" },
+    { name = "pydantic" },
+    { name = "requests" },
+    { name = "requests-toolbelt" },
+    { name = "zstandard" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/46/38/0da897697ce29fb78cdaacae2d0fa3a4bc2a0abf23f84f6ecd1947f79245/langsmith-0.4.8.tar.gz", hash = "sha256:50eccb744473dd6bd3e0fe024786e2196b1f8598f8defffce7ac31113d6c140f", size = 352414, upload_time = "2025-07-18T19:36:06.082Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/19/4f/481324462c44ce21443b833ad73ee51117031d41c16fec06cddbb7495b26/langsmith-0.4.8-py3-none-any.whl", hash = "sha256:ca2f6024ab9d2cd4d091b2e5b58a5d2cb0c354a0c84fe214145a89ad450abae0", size = 367975, upload_time = "2025-07-18T19:36:04.025Z" },
+]
+
 [[package]]
 name = "llama-cpp-python"
 version = "0.3.13"
@@ -318,6 +510,25 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/d4/ca/af82bf0fad4c3e573c6930ed743b5308492ff19917c7caaf2f9b6f9e2e98/numpy-2.3.1-cp313-cp313t-win_arm64.whl", hash = "sha256:eccb9a159db9aed60800187bc47a6d3451553f0e1b08b068d8b277ddfbb9b244", size = 10260376, upload_time = "2025-06-21T12:24:56.884Z" },
 ]
 
+[[package]]
+name = "openai"
+version = "1.97.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "anyio" },
+    { name = "distro" },
+    { name = "httpx" },
+    { name = "jiter" },
+    { name = "pydantic" },
+    { name = "sniffio" },
+    { name = "tqdm" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/a6/57/1c471f6b3efb879d26686d31582997615e969f3bb4458111c9705e56332e/openai-1.97.1.tar.gz", hash = "sha256:a744b27ae624e3d4135225da9b1c89c107a2a7e5bc4c93e5b7b5214772ce7a4e", size = 494267, upload_time = "2025-07-22T13:10:12.607Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/ee/35/412a0e9c3f0d37c94ed764b8ac7adae2d834dbd20e69f6aca582118e0f55/openai-1.97.1-py3-none-any.whl", hash = "sha256:4e96bbdf672ec3d44968c9ea39d2c375891db1acc1794668d8149d5fa6000606", size = 764380, upload_time = "2025-07-22T13:10:10.689Z" },
+]
+
 [[package]]
 name = "opentelemetry-api"
 version = "1.35.0"
@@ -514,13 +725,36 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/05/ca/20763fba2af06e73f0e666e46a32b5cdb9d2d75dcb5fd221f50c818cae43/opentelemetry_util_http-0.56b0-py3-none-any.whl", hash = "sha256:e26dd8c7f71da6806f1e65ac7cde189d389b8f152506146968f59b7a607dc8cf", size = 7645, upload_time = "2025-07-11T12:26:16.106Z" },
 ]
 
+[[package]]
+name = "orjson"
+version = "3.11.0"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/29/87/03ababa86d984952304ac8ce9fbd3a317afb4a225b9a81f9b606ac60c873/orjson-3.11.0.tar.gz", hash = "sha256:2e4c129da624f291bcc607016a99e7f04a353f6874f3bd8d9b47b88597d5f700", size = 5318246, upload_time = "2025-07-15T16:08:29.194Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/31/63/82d9b6b48624009d230bc6038e54778af8f84dfd54402f9504f477c5cfd5/orjson-3.11.0-cp313-cp313-macosx_10_15_x86_64.macosx_11_0_arm64.macosx_10_15_universal2.whl", hash = "sha256:4a8ba9698655e16746fdf5266939427da0f9553305152aeb1a1cc14974a19cfb", size = 240125, upload_time = "2025-07-15T16:07:35.976Z" },
+    { url = "https://files.pythonhosted.org/packages/16/3a/d557ed87c63237d4c97a7bac7ac054c347ab8c4b6da09748d162ca287175/orjson-3.11.0-cp313-cp313-macosx_15_0_arm64.whl", hash = "sha256:67133847f9a35a5ef5acfa3325d4a2f7fe05c11f1505c4117bb086fc06f2a58f", size = 129189, upload_time = "2025-07-15T16:07:37.486Z" },
+    { url = "https://files.pythonhosted.org/packages/69/5e/b2c9e22e2cd10aa7d76a629cee65d661e06a61fbaf4dc226386f5636dd44/orjson-3.11.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:5f797d57814975b78f5f5423acb003db6f9be5186b72d48bd97a1000e89d331d", size = 131953, upload_time = "2025-07-15T16:07:39.254Z" },
+    { url = "https://files.pythonhosted.org/packages/e2/60/760fcd9b50eb44d1206f2b30c8d310b79714553b9d94a02f9ea3252ebe63/orjson-3.11.0-cp313-cp313-manylinux_2_17_armv7l.manylinux2014_armv7l.whl", hash = "sha256:28acd19822987c5163b9e03a6e60853a52acfee384af2b394d11cb413b889246", size = 126922, upload_time = "2025-07-15T16:07:41.282Z" },
+    { url = "https://files.pythonhosted.org/packages/6a/7a/8c46daa867ccc92da6de9567608be62052774b924a77c78382e30d50b579/orjson-3.11.0-cp313-cp313-manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:e8d38d9e1e2cf9729658e35956cf01e13e89148beb4cb9e794c9c10c5cb252f8", size = 128787, upload_time = "2025-07-15T16:07:42.681Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/14/a2f1b123d85f11a19e8749f7d3f9ed6c9b331c61f7b47cfd3e9a1fedb9bc/orjson-3.11.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:05f094edd2b782650b0761fd78858d9254de1c1286f5af43145b3d08cdacfd51", size = 131895, upload_time = "2025-07-15T16:07:44.519Z" },
+    { url = "https://files.pythonhosted.org/packages/c8/10/362e8192df7528e8086ea712c5cb01355c8d4e52c59a804417ba01e2eb2d/orjson-3.11.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:6d09176a4a9e04a5394a4a0edd758f645d53d903b306d02f2691b97d5c736a9e", size = 133868, upload_time = "2025-07-15T16:07:46.227Z" },
+    { url = "https://files.pythonhosted.org/packages/f8/4e/ef43582ef3e3dfd2a39bc3106fa543364fde1ba58489841120219da6e22f/orjson-3.11.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2a585042104e90a61eda2564d11317b6a304eb4e71cd33e839f5af6be56c34d3", size = 128234, upload_time = "2025-07-15T16:07:48.123Z" },
+    { url = "https://files.pythonhosted.org/packages/d7/fa/02dabb2f1d605bee8c4bb1160cfc7467976b1ed359a62cc92e0681b53c45/orjson-3.11.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:d2218629dbfdeeb5c9e0573d59f809d42f9d49ae6464d2f479e667aee14c3ef4", size = 130232, upload_time = "2025-07-15T16:07:50.197Z" },
+    { url = "https://files.pythonhosted.org/packages/16/76/951b5619605c8d2ede80cc989f32a66abc954530d86e84030db2250c63a1/orjson-3.11.0-cp313-cp313-musllinux_1_2_armv7l.whl", hash = "sha256:613e54a2b10b51b656305c11235a9c4a5c5491ef5c283f86483d4e9e123ed5e4", size = 403648, upload_time = "2025-07-15T16:07:52.136Z" },
+    { url = "https://files.pythonhosted.org/packages/96/e2/5fa53bb411455a63b3713db90b588e6ca5ed2db59ad49b3fb8a0e94e0dda/orjson-3.11.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:9dac7fbf3b8b05965986c5cfae051eb9a30fced7f15f1d13a5adc608436eb486", size = 144572, upload_time = "2025-07-15T16:07:54.004Z" },
+    { url = "https://files.pythonhosted.org/packages/ad/d0/7d6f91e1e0f034258c3a3358f20b0c9490070e8a7ab8880085547274c7f9/orjson-3.11.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:93b64b254414e2be55ac5257124b5602c5f0b4d06b80bd27d1165efe8f36e836", size = 132766, upload_time = "2025-07-15T16:07:55.936Z" },
+    { url = "https://files.pythonhosted.org/packages/ff/f8/4d46481f1b3fb40dc826d62179f96c808eb470cdcc74b6593fb114d74af3/orjson-3.11.0-cp313-cp313-win32.whl", hash = "sha256:359cbe11bc940c64cb3848cf22000d2aef36aff7bfd09ca2c0b9cb309c387132", size = 134638, upload_time = "2025-07-15T16:07:57.343Z" },
+    { url = "https://files.pythonhosted.org/packages/85/3f/544938dcfb7337d85ee1e43d7685cf8f3bfd452e0b15a32fe70cb4ca5094/orjson-3.11.0-cp313-cp313-win_amd64.whl", hash = "sha256:0759b36428067dc777b202dd286fbdd33d7f261c6455c4238ea4e8474358b1e6", size = 129411, upload_time = "2025-07-15T16:07:58.852Z" },
+    { url = "https://files.pythonhosted.org/packages/43/0c/f75015669d7817d222df1bb207f402277b77d22c4833950c8c8c7cf2d325/orjson-3.11.0-cp313-cp313-win_arm64.whl", hash = "sha256:51cdca2f36e923126d0734efaf72ddbb5d6da01dbd20eab898bdc50de80d7b5a", size = 126349, upload_time = "2025-07-15T16:08:00.322Z" },
+]
+
 [[package]]
 name = "packaging"
-version = "25.0"
+version = "24.2"
 source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/a1/d4/1fc4078c65507b51b96ca8f8c3ba19e6a61c8253c72794544580a7b6c24d/packaging-25.0.tar.gz", hash = "sha256:d443872c98d677bf60f6a1f2f8c1cb748e8fe762d2bf9d3148b5599295b0fc4f", size = 165727, upload_time = "2025-04-19T11:48:59.673Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/d0/63/68dbb6eb2de9cb10ee4c9c14a0148804425e13c4fb20d61cce69f53106da/packaging-24.2.tar.gz", hash = "sha256:c228a6dc5e932d346bc5739379109d49e8853dd8223571c7c5b55260edc0b97f", size = 163950, upload_time = "2024-11-08T09:47:47.202Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/20/12/38679034af332785aac8774540895e234f4d07f7545804097de4b666afd8/packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484", size = 66469, upload_time = "2025-04-19T11:48:57.875Z" },
+    { url = "https://files.pythonhosted.org/packages/88/ef/eb23f262cca3c0c4eb7ab1933c3b1f03d021f2c48f54763065b6f0e321be/packaging-24.2-py3-none-any.whl", hash = "sha256:09abb1bccd265c01f4a3aa3f7a7db064b36514d2cba19a2f694fe6150451a759", size = 65451, upload_time = "2024-11-08T09:47:44.722Z" },
 ]
 
 [[package]]
@@ -583,6 +817,15 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/50/1b/6921afe68c74868b4c9fa424dad3be35b095e16687989ebbb50ce4fceb7c/psutil-7.0.0-cp37-abi3-win_amd64.whl", hash = "sha256:4cf3d4eb1aa9b348dec30105c55cd9b7d4629285735a102beb4441e38db90553", size = 244885, upload_time = "2025-02-13T21:54:37.486Z" },
 ]
 
+[[package]]
+name = "pycparser"
+version = "2.22"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/1d/b2/31537cf4b1ca988837256c910a668b553fceb8f069bedc4b1c826024b52c/pycparser-2.22.tar.gz", hash = "sha256:491c8be9c040f5390f5bf44a5b07752bd07f56edf992381b05c701439eec10f6", size = 172736, upload_time = "2024-03-30T13:22:22.564Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/13/a3/a812df4e2dd5696d1f351d58b8fe16a405b234ad2886a0dab9183fb78109/pycparser-2.22-py3-none-any.whl", hash = "sha256:c3702b6d3dd8c7abc1afa565d7e63d53a1d0bd86cdc24edd75470f4de499cfcc", size = 117552, upload_time = "2024-03-30T13:22:20.476Z" },
+]
+
 [[package]]
 name = "pydantic"
 version = "2.11.7"
@@ -688,6 +931,46 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/5f/ed/539768cf28c661b5b068d66d96a2f155c4971a5d55684a514c1a0e0dec2f/python_dotenv-1.1.1-py3-none-any.whl", hash = "sha256:31f23644fe2602f88ff55e1f5c79ba497e01224ee7737937930c448e4d0e24dc", size = 20556, upload_time = "2025-06-24T04:21:06.073Z" },
 ]
 
+[[package]]
+name = "pyyaml"
+version = "6.0.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/54/ed/79a089b6be93607fa5cdaedf301d7dfb23af5f25c398d5ead2525b063e17/pyyaml-6.0.2.tar.gz", hash = "sha256:d584d9ec91ad65861cc08d42e834324ef890a082e591037abe114850ff7bbc3e", size = 130631, upload_time = "2024-08-06T20:33:50.674Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/ef/e3/3af305b830494fa85d95f6d95ef7fa73f2ee1cc8ef5b495c7c3269fb835f/PyYAML-6.0.2-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:efdca5630322a10774e8e98e1af481aad470dd62c3170801852d752aa7a783ba", size = 181309, upload_time = "2024-08-06T20:32:43.4Z" },
+    { url = "https://files.pythonhosted.org/packages/45/9f/3b1c20a0b7a3200524eb0076cc027a970d320bd3a6592873c85c92a08731/PyYAML-6.0.2-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:50187695423ffe49e2deacb8cd10510bc361faac997de9efef88badc3bb9e2d1", size = 171679, upload_time = "2024-08-06T20:32:44.801Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/9a/337322f27005c33bcb656c655fa78325b730324c78620e8328ae28b64d0c/PyYAML-6.0.2-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:0ffe8360bab4910ef1b9e87fb812d8bc0a308b0d0eef8c8f44e0254ab3b07133", size = 733428, upload_time = "2024-08-06T20:32:46.432Z" },
+    { url = "https://files.pythonhosted.org/packages/a3/69/864fbe19e6c18ea3cc196cbe5d392175b4cf3d5d0ac1403ec3f2d237ebb5/PyYAML-6.0.2-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:17e311b6c678207928d649faa7cb0d7b4c26a0ba73d41e99c4fff6b6c3276484", size = 763361, upload_time = "2024-08-06T20:32:51.188Z" },
+    { url = "https://files.pythonhosted.org/packages/04/24/b7721e4845c2f162d26f50521b825fb061bc0a5afcf9a386840f23ea19fa/PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:70b189594dbe54f75ab3a1acec5f1e3faa7e8cf2f1e08d9b561cb41b845f69d5", size = 759523, upload_time = "2024-08-06T20:32:53.019Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/b2/e3234f59ba06559c6ff63c4e10baea10e5e7df868092bf9ab40e5b9c56b6/PyYAML-6.0.2-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:41e4e3953a79407c794916fa277a82531dd93aad34e29c2a514c2c0c5fe971cc", size = 726660, upload_time = "2024-08-06T20:32:54.708Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/0f/25911a9f080464c59fab9027482f822b86bf0608957a5fcc6eaac85aa515/PyYAML-6.0.2-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:68ccc6023a3400877818152ad9a1033e3db8625d899c72eacb5a668902e4d652", size = 751597, upload_time = "2024-08-06T20:32:56.985Z" },
+    { url = "https://files.pythonhosted.org/packages/14/0d/e2c3b43bbce3cf6bd97c840b46088a3031085179e596d4929729d8d68270/PyYAML-6.0.2-cp313-cp313-win32.whl", hash = "sha256:bc2fa7c6b47d6bc618dd7fb02ef6fdedb1090ec036abab80d4681424b84c1183", size = 140527, upload_time = "2024-08-06T20:33:03.001Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/de/02b54f42487e3d3c6efb3f89428677074ca7bf43aae402517bc7cca949f3/PyYAML-6.0.2-cp313-cp313-win_amd64.whl", hash = "sha256:8388ee1976c416731879ac16da0aff3f63b286ffdd57cdeb95f3f2e085687563", size = 156446, upload_time = "2024-08-06T20:33:04.33Z" },
+]
+
+[[package]]
+name = "regex"
+version = "2024.11.6"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/8e/5f/bd69653fbfb76cf8604468d3b4ec4c403197144c7bfe0e6a5fc9e02a07cb/regex-2024.11.6.tar.gz", hash = "sha256:7ab159b063c52a0333c884e4679f8d7a85112ee3078fe3d9004b2dd875585519", size = 399494, upload_time = "2024-11-06T20:12:31.635Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/90/73/bcb0e36614601016552fa9344544a3a2ae1809dc1401b100eab02e772e1f/regex-2024.11.6-cp313-cp313-macosx_10_13_universal2.whl", hash = "sha256:a6ba92c0bcdf96cbf43a12c717eae4bc98325ca3730f6b130ffa2e3c3c723d84", size = 483525, upload_time = "2024-11-06T20:10:45.19Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/3f/f1a082a46b31e25291d830b369b6b0c5576a6f7fb89d3053a354c24b8a83/regex-2024.11.6-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:525eab0b789891ac3be914d36893bdf972d483fe66551f79d3e27146191a37d4", size = 288324, upload_time = "2024-11-06T20:10:47.177Z" },
+    { url = "https://files.pythonhosted.org/packages/09/c9/4e68181a4a652fb3ef5099e077faf4fd2a694ea6e0f806a7737aff9e758a/regex-2024.11.6-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:086a27a0b4ca227941700e0b31425e7a28ef1ae8e5e05a33826e17e47fbfdba0", size = 284617, upload_time = "2024-11-06T20:10:49.312Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/fd/37868b75eaf63843165f1d2122ca6cb94bfc0271e4428cf58c0616786dce/regex-2024.11.6-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:bde01f35767c4a7899b7eb6e823b125a64de314a8ee9791367c9a34d56af18d0", size = 795023, upload_time = "2024-11-06T20:10:51.102Z" },
+    { url = "https://files.pythonhosted.org/packages/c4/7c/d4cd9c528502a3dedb5c13c146e7a7a539a3853dc20209c8e75d9ba9d1b2/regex-2024.11.6-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:b583904576650166b3d920d2bcce13971f6f9e9a396c673187f49811b2769dc7", size = 833072, upload_time = "2024-11-06T20:10:52.926Z" },
+    { url = "https://files.pythonhosted.org/packages/4f/db/46f563a08f969159c5a0f0e722260568425363bea43bb7ae370becb66a67/regex-2024.11.6-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:1c4de13f06a0d54fa0d5ab1b7138bfa0d883220965a29616e3ea61b35d5f5fc7", size = 823130, upload_time = "2024-11-06T20:10:54.828Z" },
+    { url = "https://files.pythonhosted.org/packages/db/60/1eeca2074f5b87df394fccaa432ae3fc06c9c9bfa97c5051aed70e6e00c2/regex-2024.11.6-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3cde6e9f2580eb1665965ce9bf17ff4952f34f5b126beb509fee8f4e994f143c", size = 796857, upload_time = "2024-11-06T20:10:56.634Z" },
+    { url = "https://files.pythonhosted.org/packages/10/db/ac718a08fcee981554d2f7bb8402f1faa7e868c1345c16ab1ebec54b0d7b/regex-2024.11.6-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:0d7f453dca13f40a02b79636a339c5b62b670141e63efd511d3f8f73fba162b3", size = 784006, upload_time = "2024-11-06T20:10:59.369Z" },
+    { url = "https://files.pythonhosted.org/packages/c2/41/7da3fe70216cea93144bf12da2b87367590bcf07db97604edeea55dac9ad/regex-2024.11.6-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:59dfe1ed21aea057a65c6b586afd2a945de04fc7db3de0a6e3ed5397ad491b07", size = 781650, upload_time = "2024-11-06T20:11:02.042Z" },
+    { url = "https://files.pythonhosted.org/packages/a7/d5/880921ee4eec393a4752e6ab9f0fe28009435417c3102fc413f3fe81c4e5/regex-2024.11.6-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:b97c1e0bd37c5cd7902e65f410779d39eeda155800b65fc4d04cc432efa9bc6e", size = 789545, upload_time = "2024-11-06T20:11:03.933Z" },
+    { url = "https://files.pythonhosted.org/packages/dc/96/53770115e507081122beca8899ab7f5ae28ae790bfcc82b5e38976df6a77/regex-2024.11.6-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:f9d1e379028e0fc2ae3654bac3cbbef81bf3fd571272a42d56c24007979bafb6", size = 853045, upload_time = "2024-11-06T20:11:06.497Z" },
+    { url = "https://files.pythonhosted.org/packages/31/d3/1372add5251cc2d44b451bd94f43b2ec78e15a6e82bff6a290ef9fd8f00a/regex-2024.11.6-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:13291b39131e2d002a7940fb176e120bec5145f3aeb7621be6534e46251912c4", size = 860182, upload_time = "2024-11-06T20:11:09.06Z" },
+    { url = "https://files.pythonhosted.org/packages/ed/e3/c446a64984ea9f69982ba1a69d4658d5014bc7a0ea468a07e1a1265db6e2/regex-2024.11.6-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:4f51f88c126370dcec4908576c5a627220da6c09d0bff31cfa89f2523843316d", size = 787733, upload_time = "2024-11-06T20:11:11.256Z" },
+    { url = "https://files.pythonhosted.org/packages/2b/f1/e40c8373e3480e4f29f2692bd21b3e05f296d3afebc7e5dcf21b9756ca1c/regex-2024.11.6-cp313-cp313-win32.whl", hash = "sha256:63b13cfd72e9601125027202cad74995ab26921d8cd935c25f09c630436348ff", size = 262122, upload_time = "2024-11-06T20:11:13.161Z" },
+    { url = "https://files.pythonhosted.org/packages/45/94/bc295babb3062a731f52621cdc992d123111282e291abaf23faa413443ea/regex-2024.11.6-cp313-cp313-win_amd64.whl", hash = "sha256:2b3361af3198667e99927da8b84c1b010752fa4b1115ee30beaa332cabc3ef1a", size = 273545, upload_time = "2024-11-06T20:11:15Z" },
+]
+
 [[package]]
 name = "requests"
 version = "2.32.4"
@@ -703,6 +986,18 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/7c/e4/56027c4a6b4ae70ca9de302488c5ca95ad4a39e190093d6c1a8ace08341b/requests-2.32.4-py3-none-any.whl", hash = "sha256:27babd3cda2a6d50b30443204ee89830707d396671944c998b5975b031ac2b2c", size = 64847, upload_time = "2025-06-09T16:43:05.728Z" },
 ]
 
+[[package]]
+name = "requests-toolbelt"
+version = "1.0.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "requests" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/f3/61/d7545dafb7ac2230c70d38d31cbfe4cc64f7144dc41f6e4e4b78ecd9f5bb/requests-toolbelt-1.0.0.tar.gz", hash = "sha256:7681a0a3d047012b5bdc0ee37d7f8f07ebe76ab08caeccfc3921ce23c88d5bc6", size = 206888, upload_time = "2023-05-01T04:11:33.229Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/3f/51/d4db610ef29373b879047326cbf6fa98b6c1969d6f6dc423279de2b1be2c/requests_toolbelt-1.0.0-py2.py3-none-any.whl", hash = "sha256:cccfdd665f0a24fcf4726e690f65639d272bb0637b9b92dfd91a5568ccf6bd06", size = 54481, upload_time = "2023-05-01T04:11:28.427Z" },
+]
+
 [[package]]
 name = "ruff"
 version = "0.12.3"
@@ -751,6 +1046,9 @@ dependencies = [
 [package.dev-dependencies]
 dev = [
     { name = "httpx" },
+    { name = "langchain" },
+    { name = "langchain-core" },
+    { name = "langchain-openai" },
     { name = "pytest" },
     { name = "pytest-cov" },
     { name = "ruff" },
@@ -776,6 +1074,9 @@ requires-dist = [
 [package.metadata.requires-dev]
 dev = [
     { name = "httpx", specifier = ">=0.28.1" },
+    { name = "langchain", specifier = ">=0.3.26" },
+    { name = "langchain-core", specifier = ">=0.3.71" },
+    { name = "langchain-openai", specifier = ">=0.3.28" },
     { name = "pytest", specifier = ">=8.4.1" },
     { name = "pytest-cov", specifier = ">=4.0.0" },
     { name = "ruff", specifier = ">=0.12.3" },
@@ -790,6 +1091,27 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/e9/44/75a9c9421471a6c4805dbf2356f7c181a29c1879239abab1ea2cc8f38b40/sniffio-1.3.1-py3-none-any.whl", hash = "sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2", size = 10235, upload_time = "2024-02-25T23:20:01.196Z" },
 ]
 
+[[package]]
+name = "sqlalchemy"
+version = "2.0.41"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "greenlet", marker = "(python_full_version < '3.14' and platform_machine == 'AMD64') or (python_full_version < '3.14' and platform_machine == 'WIN32') or (python_full_version < '3.14' and platform_machine == 'aarch64') or (python_full_version < '3.14' and platform_machine == 'amd64') or (python_full_version < '3.14' and platform_machine == 'ppc64le') or (python_full_version < '3.14' and platform_machine == 'win32') or (python_full_version < '3.14' and platform_machine == 'x86_64')" },
+    { name = "typing-extensions" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/63/66/45b165c595ec89aa7dcc2c1cd222ab269bc753f1fc7a1e68f8481bd957bf/sqlalchemy-2.0.41.tar.gz", hash = "sha256:edba70118c4be3c2b1f90754d308d0b79c6fe2c0fdc52d8ddf603916f83f4db9", size = 9689424, upload_time = "2025-05-14T17:10:32.339Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/d3/ad/2e1c6d4f235a97eeef52d0200d8ddda16f6c4dd70ae5ad88c46963440480/sqlalchemy-2.0.41-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:4eeb195cdedaf17aab6b247894ff2734dcead6c08f748e617bfe05bd5a218443", size = 2115491, upload_time = "2025-05-14T17:55:31.177Z" },
+    { url = "https://files.pythonhosted.org/packages/cf/8d/be490e5db8400dacc89056f78a52d44b04fbf75e8439569d5b879623a53b/sqlalchemy-2.0.41-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:d4ae769b9c1c7757e4ccce94b0641bc203bbdf43ba7a2413ab2523d8d047d8dc", size = 2102827, upload_time = "2025-05-14T17:55:34.921Z" },
+    { url = "https://files.pythonhosted.org/packages/a0/72/c97ad430f0b0e78efaf2791342e13ffeafcbb3c06242f01a3bb8fe44f65d/sqlalchemy-2.0.41-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a62448526dd9ed3e3beedc93df9bb6b55a436ed1474db31a2af13b313a70a7e1", size = 3225224, upload_time = "2025-05-14T17:50:41.418Z" },
+    { url = "https://files.pythonhosted.org/packages/5e/51/5ba9ea3246ea068630acf35a6ba0d181e99f1af1afd17e159eac7e8bc2b8/sqlalchemy-2.0.41-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:dc56c9788617b8964ad02e8fcfeed4001c1f8ba91a9e1f31483c0dffb207002a", size = 3230045, upload_time = "2025-05-14T17:51:54.722Z" },
+    { url = "https://files.pythonhosted.org/packages/78/2f/8c14443b2acea700c62f9b4a8bad9e49fc1b65cfb260edead71fd38e9f19/sqlalchemy-2.0.41-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:c153265408d18de4cc5ded1941dcd8315894572cddd3c58df5d5b5705b3fa28d", size = 3159357, upload_time = "2025-05-14T17:50:43.483Z" },
+    { url = "https://files.pythonhosted.org/packages/fc/b2/43eacbf6ccc5276d76cea18cb7c3d73e294d6fb21f9ff8b4eef9b42bbfd5/sqlalchemy-2.0.41-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:4f67766965996e63bb46cfbf2ce5355fc32d9dd3b8ad7e536a920ff9ee422e23", size = 3197511, upload_time = "2025-05-14T17:51:57.308Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/2e/677c17c5d6a004c3c45334ab1dbe7b7deb834430b282b8a0f75ae220c8eb/sqlalchemy-2.0.41-cp313-cp313-win32.whl", hash = "sha256:bfc9064f6658a3d1cadeaa0ba07570b83ce6801a1314985bf98ec9b95d74e15f", size = 2082420, upload_time = "2025-05-14T17:55:52.69Z" },
+    { url = "https://files.pythonhosted.org/packages/e9/61/e8c1b9b6307c57157d328dd8b8348ddc4c47ffdf1279365a13b2b98b8049/sqlalchemy-2.0.41-cp313-cp313-win_amd64.whl", hash = "sha256:82ca366a844eb551daff9d2e6e7a9e5e76d2612c8564f58db6c19a726869c1df", size = 2108329, upload_time = "2025-05-14T17:55:54.495Z" },
+    { url = "https://files.pythonhosted.org/packages/1c/fc/9ba22f01b5cdacc8f5ed0d22304718d2c758fce3fd49a5372b886a86f37c/sqlalchemy-2.0.41-py3-none-any.whl", hash = "sha256:57df5dc6fdb5ed1a88a1ed2195fd31927e705cad62dedd86b46972752a80f576", size = 1911224, upload_time = "2025-05-14T17:39:42.154Z" },
+]
+
 [[package]]
 name = "starlette"
 version = "0.47.1"
@@ -802,6 +1124,45 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/82/95/38ef0cd7fa11eaba6a99b3c4f5ac948d8bc6ff199aabd327a29cc000840c/starlette-0.47.1-py3-none-any.whl", hash = "sha256:5e11c9f5c7c3f24959edbf2dffdc01bba860228acf657129467d8a7468591527", size = 72747, upload_time = "2025-06-21T04:03:15.705Z" },
 ]
 
+[[package]]
+name = "tenacity"
+version = "9.1.2"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/0a/d4/2b0cd0fe285e14b36db076e78c93766ff1d529d70408bd1d2a5a84f1d929/tenacity-9.1.2.tar.gz", hash = "sha256:1169d376c297e7de388d18b4481760d478b0e99a777cad3a9c86e556f4b697cb", size = 48036, upload_time = "2025-04-02T08:25:09.966Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e5/30/643397144bfbfec6f6ef821f36f33e57d35946c44a2352d3c9f0ae847619/tenacity-9.1.2-py3-none-any.whl", hash = "sha256:f77bf36710d8b73a50b2dd155c97b870017ad21afe6ab300326b0371b3b05138", size = 28248, upload_time = "2025-04-02T08:25:07.678Z" },
+]
+
+[[package]]
+name = "tiktoken"
+version = "0.9.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "regex" },
+    { name = "requests" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/ea/cf/756fedf6981e82897f2d570dd25fa597eb3f4459068ae0572d7e888cfd6f/tiktoken-0.9.0.tar.gz", hash = "sha256:d02a5ca6a938e0490e1ff957bc48c8b078c88cb83977be1625b1fd8aac792c5d", size = 35991, upload_time = "2025-02-14T06:03:01.003Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/7a/11/09d936d37f49f4f494ffe660af44acd2d99eb2429d60a57c71318af214e0/tiktoken-0.9.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:2b0e8e05a26eda1249e824156d537015480af7ae222ccb798e5234ae0285dbdb", size = 1064919, upload_time = "2025-02-14T06:02:37.494Z" },
+    { url = "https://files.pythonhosted.org/packages/80/0e/f38ba35713edb8d4197ae602e80837d574244ced7fb1b6070b31c29816e0/tiktoken-0.9.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:27d457f096f87685195eea0165a1807fae87b97b2161fe8c9b1df5bd74ca6f63", size = 1007877, upload_time = "2025-02-14T06:02:39.516Z" },
+    { url = "https://files.pythonhosted.org/packages/fe/82/9197f77421e2a01373e27a79dd36efdd99e6b4115746ecc553318ecafbf0/tiktoken-0.9.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:2cf8ded49cddf825390e36dd1ad35cd49589e8161fdcb52aa25f0583e90a3e01", size = 1140095, upload_time = "2025-02-14T06:02:41.791Z" },
+    { url = "https://files.pythonhosted.org/packages/f2/bb/4513da71cac187383541facd0291c4572b03ec23c561de5811781bbd988f/tiktoken-0.9.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:cc156cb314119a8bb9748257a2eaebd5cc0753b6cb491d26694ed42fc7cb3139", size = 1195649, upload_time = "2025-02-14T06:02:43Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/5c/74e4c137530dd8504e97e3a41729b1103a4ac29036cbfd3250b11fd29451/tiktoken-0.9.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:cd69372e8c9dd761f0ab873112aba55a0e3e506332dd9f7522ca466e817b1b7a", size = 1258465, upload_time = "2025-02-14T06:02:45.046Z" },
+    { url = "https://files.pythonhosted.org/packages/de/a8/8f499c179ec900783ffe133e9aab10044481679bb9aad78436d239eee716/tiktoken-0.9.0-cp313-cp313-win_amd64.whl", hash = "sha256:5ea0edb6f83dc56d794723286215918c1cde03712cbbafa0348b33448faf5b95", size = 894669, upload_time = "2025-02-14T06:02:47.341Z" },
+]
+
+[[package]]
+name = "tqdm"
+version = "4.67.1"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "colorama", marker = "sys_platform == 'win32'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/a8/4b/29b4ef32e036bb34e4ab51796dd745cdba7ed47ad142a9f4a1eb8e0c744d/tqdm-4.67.1.tar.gz", hash = "sha256:f8aef9c52c08c13a65f30ea34f4e5aac3fd1a34959879d7e59e63027286627f2", size = 169737, upload_time = "2024-11-24T20:12:22.481Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/d0/30/dc54f88dd4a2b5dc8a0279bdd7270e735851848b762aeb1c1184ed1f6b14/tqdm-4.67.1-py3-none-any.whl", hash = "sha256:26445eca388f82e72884e0d580d5464cd801a3ea01e63e5601bdff9ba6a48de2", size = 78540, upload_time = "2024-11-24T20:12:19.698Z" },
+]
+
 [[package]]
 name = "typing-extensions"
 version = "4.14.1"
@@ -884,3 +1245,30 @@ sdist = { url = "https://files.pythonhosted.org/packages/e3/02/0f2892c661036d50e
 wheels = [
     { url = "https://files.pythonhosted.org/packages/2e/54/647ade08bf0db230bfea292f893923872fd20be6ac6f53b2b936ba839d75/zipp-3.23.0-py3-none-any.whl", hash = "sha256:071652d6115ed432f5ce1d34c336c0adfd6a884660d1e9712a256d3d3bd4b14e", size = 10276, upload_time = "2025-06-08T17:06:38.034Z" },
 ]
+
+[[package]]
+name = "zstandard"
+version = "0.23.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "cffi", marker = "platform_python_implementation == 'PyPy'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/ed/f6/2ac0287b442160a89d726b17a9184a4c615bb5237db763791a7fd16d9df1/zstandard-0.23.0.tar.gz", hash = "sha256:b2d8c62d08e7255f68f7a740bae85b3c9b8e5466baa9cbf7f57f1cde0ac6bc09", size = 681701, upload_time = "2024-07-15T00:18:06.141Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/80/f1/8386f3f7c10261fe85fbc2c012fdb3d4db793b921c9abcc995d8da1b7a80/zstandard-0.23.0-cp313-cp313-macosx_10_13_x86_64.whl", hash = "sha256:576856e8594e6649aee06ddbfc738fec6a834f7c85bf7cadd1c53d4a58186ef9", size = 788975, upload_time = "2024-07-15T00:16:16.005Z" },
+    { url = "https://files.pythonhosted.org/packages/16/e8/cbf01077550b3e5dc86089035ff8f6fbbb312bc0983757c2d1117ebba242/zstandard-0.23.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:38302b78a850ff82656beaddeb0bb989a0322a8bbb1bf1ab10c17506681d772a", size = 633448, upload_time = "2024-07-15T00:16:17.897Z" },
+    { url = "https://files.pythonhosted.org/packages/06/27/4a1b4c267c29a464a161aeb2589aff212b4db653a1d96bffe3598f3f0d22/zstandard-0.23.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:d2240ddc86b74966c34554c49d00eaafa8200a18d3a5b6ffbf7da63b11d74ee2", size = 4945269, upload_time = "2024-07-15T00:16:20.136Z" },
+    { url = "https://files.pythonhosted.org/packages/7c/64/d99261cc57afd9ae65b707e38045ed8269fbdae73544fd2e4a4d50d0ed83/zstandard-0.23.0-cp313-cp313-manylinux_2_17_ppc64le.manylinux2014_ppc64le.whl", hash = "sha256:2ef230a8fd217a2015bc91b74f6b3b7d6522ba48be29ad4ea0ca3a3775bf7dd5", size = 5306228, upload_time = "2024-07-15T00:16:23.398Z" },
+    { url = "https://files.pythonhosted.org/packages/7a/cf/27b74c6f22541f0263016a0fd6369b1b7818941de639215c84e4e94b2a1c/zstandard-0.23.0-cp313-cp313-manylinux_2_17_s390x.manylinux2014_s390x.whl", hash = "sha256:774d45b1fac1461f48698a9d4b5fa19a69d47ece02fa469825b442263f04021f", size = 5336891, upload_time = "2024-07-15T00:16:26.391Z" },
+    { url = "https://files.pythonhosted.org/packages/fa/18/89ac62eac46b69948bf35fcd90d37103f38722968e2981f752d69081ec4d/zstandard-0.23.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6f77fa49079891a4aab203d0b1744acc85577ed16d767b52fc089d83faf8d8ed", size = 5436310, upload_time = "2024-07-15T00:16:29.018Z" },
+    { url = "https://files.pythonhosted.org/packages/a8/a8/5ca5328ee568a873f5118d5b5f70d1f36c6387716efe2e369010289a5738/zstandard-0.23.0-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:ac184f87ff521f4840e6ea0b10c0ec90c6b1dcd0bad2f1e4a9a1b4fa177982ea", size = 4859912, upload_time = "2024-07-15T00:16:31.871Z" },
+    { url = "https://files.pythonhosted.org/packages/ea/ca/3781059c95fd0868658b1cf0440edd832b942f84ae60685d0cfdb808bca1/zstandard-0.23.0-cp313-cp313-musllinux_1_1_aarch64.whl", hash = "sha256:c363b53e257246a954ebc7c488304b5592b9c53fbe74d03bc1c64dda153fb847", size = 4936946, upload_time = "2024-07-15T00:16:34.593Z" },
+    { url = "https://files.pythonhosted.org/packages/ce/11/41a58986f809532742c2b832c53b74ba0e0a5dae7e8ab4642bf5876f35de/zstandard-0.23.0-cp313-cp313-musllinux_1_1_x86_64.whl", hash = "sha256:e7792606d606c8df5277c32ccb58f29b9b8603bf83b48639b7aedf6df4fe8171", size = 5466994, upload_time = "2024-07-15T00:16:36.887Z" },
+    { url = "https://files.pythonhosted.org/packages/83/e3/97d84fe95edd38d7053af05159465d298c8b20cebe9ccb3d26783faa9094/zstandard-0.23.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:a0817825b900fcd43ac5d05b8b3079937073d2b1ff9cf89427590718b70dd840", size = 4848681, upload_time = "2024-07-15T00:16:39.709Z" },
+    { url = "https://files.pythonhosted.org/packages/6e/99/cb1e63e931de15c88af26085e3f2d9af9ce53ccafac73b6e48418fd5a6e6/zstandard-0.23.0-cp313-cp313-musllinux_1_2_i686.whl", hash = "sha256:9da6bc32faac9a293ddfdcb9108d4b20416219461e4ec64dfea8383cac186690", size = 4694239, upload_time = "2024-07-15T00:16:41.83Z" },
+    { url = "https://files.pythonhosted.org/packages/ab/50/b1e703016eebbc6501fc92f34db7b1c68e54e567ef39e6e59cf5fb6f2ec0/zstandard-0.23.0-cp313-cp313-musllinux_1_2_ppc64le.whl", hash = "sha256:fd7699e8fd9969f455ef2926221e0233f81a2542921471382e77a9e2f2b57f4b", size = 5200149, upload_time = "2024-07-15T00:16:44.287Z" },
+    { url = "https://files.pythonhosted.org/packages/aa/e0/932388630aaba70197c78bdb10cce2c91fae01a7e553b76ce85471aec690/zstandard-0.23.0-cp313-cp313-musllinux_1_2_s390x.whl", hash = "sha256:d477ed829077cd945b01fc3115edd132c47e6540ddcd96ca169facff28173057", size = 5655392, upload_time = "2024-07-15T00:16:46.423Z" },
+    { url = "https://files.pythonhosted.org/packages/02/90/2633473864f67a15526324b007a9f96c96f56d5f32ef2a56cc12f9548723/zstandard-0.23.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:fa6ce8b52c5987b3e34d5674b0ab529a4602b632ebab0a93b07bfb4dfc8f8a33", size = 5191299, upload_time = "2024-07-15T00:16:49.053Z" },
+    { url = "https://files.pythonhosted.org/packages/b0/4c/315ca5c32da7e2dc3455f3b2caee5c8c2246074a61aac6ec3378a97b7136/zstandard-0.23.0-cp313-cp313-win32.whl", hash = "sha256:a9b07268d0c3ca5c170a385a0ab9fb7fdd9f5fd866be004c4ea39e44edce47dd", size = 430862, upload_time = "2024-07-15T00:16:51.003Z" },
+    { url = "https://files.pythonhosted.org/packages/a2/bf/c6aaba098e2d04781e8f4f7c0ba3c7aa73d00e4c436bcc0cf059a66691d1/zstandard-0.23.0-cp313-cp313-win_amd64.whl", hash = "sha256:f3513916e8c645d0610815c257cbfd3242adfd5c4cfa78be514e5a3ebb42a41b", size = 495578, upload_time = "2024-07-15T00:16:53.135Z" },
+]