
Project Specification: lite-chat

A complete blueprint for recreating the backend of this project using AI coding assistants. This document focuses on architecture, business logic, and data flow — not UI.


1. Project Overview

lite-chat is a lightweight, local AI chat application powered by Ollama. Users can have multi-turn conversations with locally-hosted LLMs, with support for streaming responses, image analysis, smart context management, and conversation history with undo/restore.

Core Capabilities

  • Multi-turn chat with any Ollama model
  • Real-time token streaming via Server-Sent Events (SSE)
  • Automatic conversation summarization to stay within context limits
  • Image attachment support (base64-encoded, sent to vision models)
  • AI-generated conversation titles
  • Response regeneration
  • Message editing and deletion with snapshot-based undo
  • Model management (list, pull new models)
  • User preferences (default model, system prompt)

2. Tech Stack

Backend

| Component | Technology | Version | Purpose |
| --- | --- | --- | --- |
| Framework | FastAPI | 0.115.0 | Async web framework with auto-generated OpenAPI docs |
| Server | Uvicorn | 0.30.6 | ASGI server |
| Database | SQLite via aiosqlite | 0.20.0 | Async SQLite — zero-config, file-based persistence |
| Validation | Pydantic | 2.9.2 | Request/response models with type validation |
| HTTP Client | httpx | 0.27.2 | Async HTTP for Ollama REST API calls |
| LLM Framework | LangChain + langchain-ollama | 0.3.1 / 0.2.0 | Imported, but the Ollama REST API is used directly via httpx |
| Testing | pytest + pytest-asyncio | 7.0+ / 0.21.0+ | Async test support with auto mode |

Frontend (for reference only)

| Component | Technology |
| --- | --- |
| Framework | Next.js 14 (App Router) |
| Language | TypeScript |
| Styling | Tailwind CSS |
| UI Library | shadcn/ui |
| State | Zustand |

External Dependency

| Service | URL | Purpose |
| --- | --- | --- |
| Ollama | http://localhost:11434 | Local LLM inference server |

3. Architecture

Layered Architecture

Routes (API handlers)
  └── Services (business logic)
       ├── Database (aiosqlite)
       └── Ollama (httpx REST calls)
  • Routes: Thin handlers that validate input, call services, return responses. No business logic.
  • Services: Stateless functions that receive db: aiosqlite.Connection as a parameter. All database and Ollama interactions happen here.
  • Database: Async SQLite with WAL mode and foreign keys enabled. Connection provided via FastAPI dependency injection (get_db async generator).
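
To make the split concrete, here is a minimal sketch of one endpoint flowing through the layers. The function names and SQL are assumptions for illustration, though the module layout follows section 11:

# routes/conversations.py -- thin handler, no business logic
import aiosqlite
from fastapi import APIRouter, Depends
from app.database import get_db
from app.services import conversation_service

router = APIRouter(prefix="/api/conversations")

@router.get("")
async def list_conversations(db: aiosqlite.Connection = Depends(get_db)):
    # Route only wires the request to the service and returns its result.
    return await conversation_service.list_conversations(db)

# services/conversation_service.py -- stateless, db passed in
async def list_conversations(db: aiosqlite.Connection) -> list[dict]:
    cursor = await db.execute(
        "SELECT * FROM conversations WHERE user_id = ? ORDER BY updated_at DESC",
        ("default",),
    )
    return [dict(row) for row in await cursor.fetchall()]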

Application Lifecycle

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: init DB schema + seed default user, check Ollama health
    await init_db()
    await check_ollama_health()  # Warning only, not fatal
    yield
    # Shutdown: log only

CORS

Wide open for local development:

allow_origins=["*"]
allow_credentials=True
allow_methods=["*"]
allow_headers=["*"]

Database Connection

Provided via async generator dependency:

async def get_db():
    db = await aiosqlite.connect(DB_PATH)
    db.row_factory = aiosqlite.Row  # Dict-like row access
    await db.execute("PRAGMA journal_mode=WAL")
    await db.execute("PRAGMA foreign_keys=ON")
    try:
        yield db
    finally:
        await db.close()

4. Configuration

All configuration lives in app/config.py as module-level constants:

| Constant | Default Value | Description |
| --- | --- | --- |
| DB_PATH | ../data/chat.db | SQLite database file path (relative to app/) |
| OLLAMA_BASE_URL | http://localhost:11434 | Ollama server base URL |
| DEFAULT_MODEL | qwen3.5:9b | Fallback model when the user has no preference |
| CONTEXT_FULL_EXCHANGES | 5 | Number of recent user+assistant exchange pairs to keep in full context |
| SUMMARY_MAX_TOKENS | 500 | Target summary conciseness (used in the prompt, not enforced) |
| SUMMARY_MODEL | None | Model for summarization; None = use the same model as chat |
| SSE_RETRY_TIMEOUT | 15000 | SSE retry timeout in milliseconds |
| DEFAULT_USER_ID | "default" | Single-user system — all data belongs to this user |
| DEFAULT_USER_NAME | "User" | Default display name |

5. Database Schema

SQLite with 4 tables. Schema is created on startup via init_db() using CREATE TABLE IF NOT EXISTS.

users

| Column | Type | Default | Description |
| --- | --- | --- | --- |
| id | TEXT PK | 'default' | Single-user system |
| name | TEXT | 'User' | Display name |
| default_model | TEXT | 'qwen3.5:9b' | Preferred LLM model |
| system_prompt | TEXT | NULL | Global system prompt |
| created_at | TIMESTAMP | CURRENT_TIMESTAMP | |
| updated_at | TIMESTAMP | CURRENT_TIMESTAMP | |

conversations

| Column | Type | Default | Description |
| --- | --- | --- | --- |
| id | TEXT PK | UUID | |
| user_id | TEXT FK→users | 'default' | |
| title | TEXT | 'New Chat' | Display title (AI-generated on first message) |
| model_name | TEXT | NULL | Model used for this conversation |
| system_prompt | TEXT | NULL | Per-conversation system prompt override |
| summary | TEXT | NULL | Rolling summary of old messages |
| summary_upto_msg_id | TEXT | NULL | Last message ID included in summary |
| created_at | TIMESTAMP | CURRENT_TIMESTAMP | |
| updated_at | TIMESTAMP | CURRENT_TIMESTAMP | Updated on any message or conversation change |

FK: user_id → users(id) ON DELETE CASCADE

messages

| Column | Type | Default | Constraint | Description |
| --- | --- | --- | --- | --- |
| id | TEXT PK | UUID | | |
| conversation_id | TEXT FK | | NOT NULL | |
| role | TEXT | | CHECK IN ('user', 'assistant', 'system') | |
| content | TEXT | | NOT NULL | Message text |
| image_base64 | TEXT | NULL | | Base64-encoded image (user messages only) |
| thinking | TEXT | NULL | | Model's reasoning/thinking output |
| tokens_used | INTEGER | 0 | | Total tokens (prompt + completion) |
| tokens_per_sec | REAL | 0 | | Generation speed |
| thinking_duration | INTEGER | 0 | | Seconds spent in thinking phase |
| is_summarized | BOOLEAN | 0 | | Marked true after inclusion in a summary |
| created_at | TIMESTAMP | CURRENT_TIMESTAMP | | |

FK: conversation_id → conversations(id) ON DELETE CASCADE

deleted_messages (snapshot/undo storage)

| Column | Type | Default | Description |
| --- | --- | --- | --- |
| id | TEXT PK | | Original message ID |
| snapshot_id | TEXT | | Groups messages deleted together (UUID) |
| conversation_id | TEXT | | Original conversation ID |
| role | TEXT | | |
| content | TEXT | | |
| image_base64 | TEXT | NULL | |
| thinking | TEXT | NULL | |
| tokens_used | INTEGER | 0 | |
| tokens_per_sec | REAL | 0 | |
| thinking_duration | INTEGER | 0 | |
| original_created_at | TIMESTAMP | | Preserves original ordering |
| deleted_at | TIMESTAMP | CURRENT_TIMESTAMP | |

No foreign keys — this table stores orphaned data for restoration.

Schema Migration

For existing databases, missing columns are added via ALTER TABLE ... ADD COLUMN wrapped in try/except (silently skips if column exists):

try:
    await db.execute("ALTER TABLE messages ADD COLUMN tokens_per_sec REAL DEFAULT 0")
except Exception:
    pass

Default User Seeding

On every startup, the app checks whether the 'default' user row exists and inserts it if missing.
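
A sketch of that check-then-insert step (the exact SQL is an assumption):

import aiosqlite

async def seed_default_user(db: aiosqlite.Connection) -> None:
    # Insert the single 'default' user row only if it does not exist yet.
    cursor = await db.execute("SELECT id FROM users WHERE id = ?", ("default",))
    if await cursor.fetchone() is None:
        await db.execute(
            "INSERT INTO users (id, name, default_model) VALUES (?, ?, ?)",
            ("default", "User", "qwen3.5:9b"),
        )
        await db.commit()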


6. API Endpoints

All endpoints are prefixed with /api/.

Chat

POST /api/chat — Send Message (SSE Stream)

Request:

{
  "conversation_id": "uuid | null",
  "model": "model-name | null",
  "message": "user message (required, min 1 char)",
  "image": "base64-string | null"
}

Behavior:

  1. Resolve model: request → user preference → DEFAULT_MODEL
  2. If no conversation_id: create new conversation, mark is_first_message = true
  3. If conversation_id provided: verify it exists, check if message_count == 0 for first message detection
  4. Save user message to DB
  5. If first message: generate AI title via Ollama (blocking, non-streamed), update conversation title, emit title SSE event
  6. Build context (system prompt + summary + last N messages)
  7. Stream response from Ollama with think: true
  8. On stream complete: save assistant message to DB with token stats

SSE Event Types:

| Type | Fields | When |
| --- | --- | --- |
| title | conversation_id, title | First message only, before content starts |
| thinking | content | Each thinking token from the model |
| content | content | Each content token from the model |
| done | conversation_id, message_id, user_message_id, tokens_used, tokens_per_sec, thinking_duration | Stream complete |
| error | content | On any error |

SSE Format: data: {JSON}\n\n — standard SSE with text/event-stream content type.
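
A tiny helper in this spirit produces that wire format (the name sse_event is illustrative, and it assumes the event type travels inside the JSON payload, as the table above suggests):

import json

def sse_event(event_type: str, **fields) -> str:
    # One SSE frame: a JSON payload behind the "data: " prefix, blank-line terminated.
    return f"data: {json.dumps({'type': event_type, **fields})}\n\n"

# sse_event("content", content="Hi") -> 'data: {"type": "content", "content": "Hi"}\n\n'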

POST /api/chat/regenerate — Regenerate Response

Request:

{
  "conversation_id": "uuid (required)",
  "message_id": "uuid of assistant message (required)",
  "model": "model-name | null"
}

Behavior:

  1. Verify conversation and message exist, message must be role = 'assistant'
  2. Delete the old assistant message from DB
  3. Rebuild context (user message is still there)
  4. Stream new response (same SSE format as /api/chat)
  5. Save new assistant message

Conversations

POST /api/conversations — Create Conversation

Request: ConversationCreate → title, model_name, system_prompt (all optional)
Response: 201 → ConversationResponse

GET /api/conversations — List All Conversations

Response: 200 → ConversationResponse[] ordered by updated_at DESC

  • Each includes message_count via LEFT JOIN COUNT

GET /api/conversations/{conv_id} — Get Conversation Detail

Response: 200 → ConversationDetailResponse (includes all messages ordered by created_at ASC)
Error: 404 if not found

PUT /api/conversations/{conv_id} — Update Conversation

Request: ConversationUpdate → title, system_prompt (both optional; only provided fields are updated)
Response: 200 → ConversationResponse

DELETE /api/conversations/{conv_id} — Delete Conversation

Response: 204 No Content. Messages are cascade-deleted by SQLite FK.

Messages

GET /api/conversations/{conv_id}/messages — Get Messages

Response: 200 → MessageResponse[]

PUT /api/conversations/{conv_id}/messages/{msg_id} — Update Message

Request: { "content": "new text" } Response: 200MessageResponse

DELETE /api/conversations/{conv_id}/messages/{msg_id} — Delete Message (Truncate + Snapshot)

This is NOT a simple delete. It deletes this message AND everything after it, creating a snapshot for undo.

Response: 200

{
  "snapshot_id": "uuid",
  "deleted_count": 3,
  "conversation_deleted": false
}

If all messages are deleted, the conversation itself is deleted and conversation_deleted: true.

POST /api/conversations/restore/{snapshot_id} — Restore (Undo)

Restores all messages from a snapshot. If the conversation was deleted, recreates it with title "Restored Chat".

Response: 200

{
  "conversation_id": "uuid",
  "restored_count": 3
}

Models

GET /api/models — List Available Models

Proxies GET /api/tags from Ollama. Filters out :latest tag variants when a specific tag exists with the same digest.

Response: 200 → { "models": [...] }

POST /api/models/pull — Pull New Model

Request: { "name": "llama3.2" } Response: Streams NDJSON progress from Ollama's pull endpoint.

Health

GET /api/health — Health Check

Response: 200

{
  "status": "healthy | degraded",
  "ollama": "connected | unreachable",
  "database": "connected"
}

User

GET /api/user — Get User Profile

Response: 200 → UserProfile

PUT /api/user — Update Preferences

Request: UserPreferencesUpdate → name, default_model, system_prompt (all optional)
Response: 200 → UserProfile


7. Authentication & Authorization

None. This is a single-user, local-only application. All data is scoped to a hardcoded DEFAULT_USER_ID = "default". The user table exists to store preferences, not for auth.

All conversation queries filter by user_id = DEFAULT_USER_ID. If multi-user support were added, you'd:

  1. Add auth middleware (JWT/session)
  2. Extract user ID from the request instead of using the constant
  3. Rely on the existing user_id foreign keys, which already support this

8. Business Logic

8.1 Chat Streaming Flow

User sends message
  → Resolve model (request > user pref > default)
  → Create conversation if needed
  → Save user message to DB
  → [First message] Generate AI title → emit "title" SSE event
  → Build context window
  → Stream from Ollama (think=true)
  → For each chunk:
      - thinking tokens → emit "thinking" SSE event
      - content tokens → emit "content" SSE event
      - Track timing for thinking_duration
  → On done:
      - Calculate tokens_per_sec from eval_count / eval_duration
      - Save assistant message with stats
      - Emit "done" SSE event

8.2 Smart Context Management

The system maintains a bounded context window to avoid exceeding model limits.

Configuration:

  • CONTEXT_FULL_EXCHANGES = 5 → keeps last 10 messages (5 user + 5 assistant) in full

Context Building (build_context):

1. Count total messages in conversation
2. If total > 10: trigger summarization of old messages
3. Build message list:
   a. System prompt (conversation-level > user-level > skip)
   b. Summary of old messages (if exists)
   c. Last 10 messages in chronological order
   d. [New user message is already included as it was saved before context build]
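
Put together, build_context could look like this sketch (get_user_system_prompt is an assumed helper for the user-level fallback):

async def build_context(db, conversation: dict, history: list[dict]) -> list[dict]:
    # history: all messages for the conversation, ordered by created_at ASC.
    context: list[dict] = []
    system = conversation.get("system_prompt") or await get_user_system_prompt(db)
    if system:
        context.append({"role": "system", "content": system})
    if conversation.get("summary"):
        context.append({"role": "system",
                        "content": "Summary of earlier messages: " + conversation["summary"]})
    for m in history[-10:]:  # last CONTEXT_FULL_EXCHANGES * 2 messages, kept in full
        entry = {"role": m["role"], "content": m["content"]}
        if m.get("image_base64"):
            entry["images"] = [m["image_base64"]]  # Ollama vision message format
        context.append(entry)
    return context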

Summarization Trigger:

  • Runs when total_messages > LAST_N_MESSAGES (10, i.e. CONTEXT_FULL_EXCHANGES × 2)
  • Only summarizes messages NOT already marked is_summarized
  • Excludes the last 10 messages from summarization

Summarization Process:

  1. Fetch unsummarized messages (excluding last 10)
  2. Fetch existing summary from conversation (if any)
  3. Call Ollama non-streamed with summarizer prompt:
    • System: "Condense into 2-3 paragraphs preserving key facts, decisions, preferences"
    • User: existing summary (if any) + new messages text
  4. Strip <think> tags from response
  5. Update conversations.summary and conversations.summary_upto_msg_id
  6. Mark old messages as is_summarized = 1
  7. On failure: silently skip (warning log)
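
Steps 4-6 reduce to a regex strip plus two UPDATEs; a sketch (SQL shape assumed):

import re

def strip_think(text: str) -> str:
    # Step 4: drop <think>...</think> blocks from the summarizer's output.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

async def apply_summary(db, conv_id: str, raw_summary: str, last_msg_id: str) -> None:
    # Steps 5-6: persist the rolling summary, then mark covered messages.
    await db.execute(
        "UPDATE conversations SET summary = ?, summary_upto_msg_id = ? WHERE id = ?",
        (strip_think(raw_summary), last_msg_id, conv_id),
    )
    await db.execute(
        "UPDATE messages SET is_summarized = 1 WHERE conversation_id = ? "
        "AND created_at <= (SELECT created_at FROM messages WHERE id = ?)",
        (conv_id, last_msg_id),
    )
    await db.commit()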

Image Handling in Context: User messages with images include "images": [base64_string] in the Ollama message format.

8.3 AI Title Generation

When the first message is sent to a conversation:

  1. Call Ollama non-streamed with a title generation prompt:
    • System: "Generate a very short title (3-6 words, max 50 chars) for a chat conversation..."
    • User: the first message content
    • stream: false, think: false
    • Timeout: 15 seconds
  2. Sanitize response:
    • Strip leading/trailing quotes
    • Remove <think>...</think> blocks (regex with DOTALL)
    • Reject empty or >60 char titles
  3. Fallback: first 50 chars of message + "..." if AI fails
  4. Model used: request model or qwen3.5:9b if none specified

The title is generated and emitted via SSE before the response starts streaming, so the frontend can update the sidebar immediately.
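
The sanitization step can be as small as this sketch, using the thresholds listed above:

import re

def sanitize_title(raw: str, first_message: str) -> str:
    title = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)  # strip reasoning blocks
    title = title.strip().strip('"\'')                               # strip stray quotes
    if not title or len(title) > 60:                                 # reject empty/overlong
        return first_message[:50] + "..."                            # fallback title
    return title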

8.4 Message Deletion with Undo (Truncation + Snapshot)

This is the most complex operation. When a message is "deleted" from the chat:

Truncate:

  1. Find the target message's created_at
  2. Select ALL messages with created_at >= target (this message and everything after it)
  3. Copy each to deleted_messages with a shared snapshot_id (UUID)
  4. Delete originals from messages table
  5. Check if any messages remain:
    • Yes: update conversation updated_at
    • No: delete the conversation entirely, set conversation_deleted = true
  6. Return { snapshot_id, deleted_count, conversation_deleted }
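
Steps 2-4 collapse into an INSERT ... SELECT plus a DELETE keyed on the same timestamp cutoff. A sketch (SQL assumed; the empty-conversation check from step 5 is omitted):

import uuid

async def truncate_with_snapshot(db, conv_id: str, msg_id: str) -> dict:
    # Copy the target message and everything after it into deleted_messages,
    # then remove the originals.
    snapshot_id = str(uuid.uuid4())
    await db.execute(
        """INSERT INTO deleted_messages
               (id, snapshot_id, conversation_id, role, content, image_base64,
                thinking, tokens_used, tokens_per_sec, thinking_duration,
                original_created_at)
           SELECT id, ?, conversation_id, role, content, image_base64,
                  thinking, tokens_used, tokens_per_sec, thinking_duration,
                  created_at
           FROM messages
           WHERE conversation_id = ?
             AND created_at >= (SELECT created_at FROM messages WHERE id = ?)""",
        (snapshot_id, conv_id, msg_id),
    )
    cursor = await db.execute(
        "DELETE FROM messages WHERE conversation_id = ? "
        "AND created_at >= (SELECT created_at FROM messages WHERE id = ?)",
        (conv_id, msg_id),
    )
    await db.commit()
    return {"snapshot_id": snapshot_id, "deleted_count": cursor.rowcount}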

Restore:

  1. Fetch all deleted_messages with the given snapshot_id
  2. Check if the conversation still exists:
    • Yes: use it
    • No: recreate with title = "Restored Chat"
  3. Re-insert all messages using INSERT OR IGNORE (preserves original_created_at)
  4. Delete the snapshot from deleted_messages
  5. Return { conversation_id, restored_count }

8.5 Response Regeneration

  1. Verify the target message is role = 'assistant'
  2. Delete it from DB (simple DELETE, no snapshot)
  3. Rebuild context (the preceding user message remains)
  4. Stream a fresh response from Ollama
  5. Save the new response

8.6 Model Management

List Models:

  • Proxies GET {OLLAMA_BASE_URL}/api/tags
  • Deduplication: if both model:latest and model:specific-tag exist with the same digest, removes the :latest variant
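
The dedup rule can be expressed like this sketch (field names follow Ollama's /api/tags response):

def dedupe_latest(models: list[dict]) -> list[dict]:
    # Drop "name:latest" when another tag of the same model shares its digest.
    digests_with_specific_tag = {
        m["digest"] for m in models if not m["name"].endswith(":latest")
    }
    return [
        m for m in models
        if not (m["name"].endswith(":latest") and m["digest"] in digests_with_specific_tag)
    ]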

Pull Model:

  • Proxies POST {OLLAMA_BASE_URL}/api/pull as NDJSON stream
  • Frontend receives progress updates line-by-line

Health Check:

  • GET {OLLAMA_BASE_URL}/ — expects 200 status code
  • Returns true/false, does not throw

8.7 User Preferences

  • Single user (id = "default")
  • get_or_create_default_user: creates the user row on first access if missing
  • Update: only modifies provided fields (Pydantic exclude_unset=True)
  • System prompt precedence: conversation-level > user-level > none
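
A sketch of that partial update (column names come from the users table in section 5):

async def update_user(db, update: "UserPreferencesUpdate") -> None:
    # Build a dynamic UPDATE from only the fields the client actually sent.
    fields = update.model_dump(exclude_unset=True)
    if not fields:
        return
    assignments = ", ".join(f"{col} = ?" for col in fields)  # cols come from the schema, not user input
    await db.execute(
        f"UPDATE users SET {assignments}, updated_at = CURRENT_TIMESTAMP WHERE id = ?",
        (*fields.values(), "default"),
    )
    await db.commit()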

9. Pydantic Schemas

Chat

class ChatRequest(BaseModel):
    conversation_id: str | None = None
    model: str | None = None
    message: str = Field(..., min_length=1)
    image: str | None = None  # base64

class RegenerateRequest(BaseModel):
    conversation_id: str
    message_id: str
    model: str | None = None

Conversation

class ConversationCreate(BaseModel):
    title: str | None = "New Chat"
    model_name: str | None = None
    system_prompt: str | None = None

class ConversationUpdate(BaseModel):
    title: str | None = None
    system_prompt: str | None = None

class ConversationResponse(BaseModel):
    id: str
    title: str
    model_name: str | None
    message_count: int = 0
    created_at: datetime
    updated_at: datetime

class MessageResponse(BaseModel):
    id: str
    role: str
    content: str
    thinking: str | None = None
    image_base64: str | None = None
    tokens_used: int = 0
    tokens_per_sec: float = 0.0
    thinking_duration: int = 0
    created_at: datetime

class ConversationDetailResponse(BaseModel):
    id: str
    title: str
    model_name: str | None
    system_prompt: str | None
    created_at: datetime
    updated_at: datetime
    messages: list[MessageResponse] = []

User

class UserProfile(BaseModel):
    id: str
    name: str
    default_model: str
    system_prompt: str | None = None

class UserPreferencesUpdate(BaseModel):
    name: str | None = None
    default_model: str | None = None
    system_prompt: str | None = None

Note: Schemas with model_name fields use ConfigDict(protected_namespaces=()) to avoid Pydantic's reserved namespace warning for model_* fields.


10. Third-Party Integrations

Ollama REST API

All LLM interactions use Ollama's REST API directly via httpx (not LangChain, despite it being a dependency).

Endpoints used:

| Ollama Endpoint | Used For | Streamed |
| --- | --- | --- |
| POST /api/chat | Chat responses | Yes (NDJSON lines) |
| POST /api/chat | Summarization | No (stream: false) |
| POST /api/chat | Title generation | No (stream: false) |
| GET /api/tags | List models | No |
| POST /api/pull | Pull models | Yes (NDJSON lines) |
| GET / | Health check | No |

Chat streaming format: Each NDJSON line from Ollama contains:

{
  "message": {
    "content": "token",
    "thinking": "thinking token (when think=true)"
  },
  "done": false
}

Final chunk has done: true and includes:

  • eval_count — number of generated tokens
  • eval_duration — generation time in nanoseconds
  • prompt_eval_count — number of prompt tokens
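
The stats saved on the done event (section 6) derive directly from these fields; since eval_duration is in nanoseconds, the conversion looks like this (function name illustrative):

def compute_stats(final_chunk: dict) -> tuple[int, float]:
    # tokens_used = prompt + completion; tokens_per_sec from the generation phase only.
    tokens_used = final_chunk.get("prompt_eval_count", 0) + final_chunk.get("eval_count", 0)
    seconds = final_chunk.get("eval_duration", 0) / 1e9
    return tokens_used, (final_chunk.get("eval_count", 0) / seconds if seconds else 0.0)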

Thinking mode: All chat requests use "think": true. Ollama models that support thinking (like Qwen) return reasoning in the thinking field. Non-thinking models simply return empty thinking fields.


11. Project Structure

backend/
├── app/
│   ├── main.py              # FastAPI app, lifespan, CORS, router registration
│   ├── config.py             # All constants
│   ├── database.py           # SQLite schema, get_db dependency, init_db
│   │
│   ├── routes/               # API endpoint handlers (thin)
│   │   ├── chat.py           # POST /api/chat, /api/chat/regenerate
│   │   ├── conversations.py  # CRUD + message ops + truncate/restore
│   │   ├── models.py         # GET /api/models, POST /api/models/pull, GET /api/health
│   │   └── user.py           # GET/PUT /api/user
│   │
│   ├── services/             # Business logic (stateless, db passed in)
│   │   ├── chat_service.py   # stream_chat, stream_regenerate, build_context, summarize
│   │   ├── conversation_service.py  # CRUD, truncate, restore, AI title
│   │   ├── model_service.py  # list_models, pull_model, health check
│   │   └── user_service.py   # get/update preferences
│   │
│   ├── schemas/              # Pydantic request/response models
│   │   ├── chat.py
│   │   ├── conversation.py
│   │   └── user.py
│   │
│   └── prompts/
│       └── summarize.py      # Summarizer system prompt + builder
│
├── test/                     # pytest tests
│   ├── conftest.py           # In-memory DB fixture, test helpers
│   ├── test_conversation_service.py   # 23 service unit tests
│   └── test_conversation_routes.py    # 10 API integration tests
│
├── data/
│   └── chat.db               # SQLite database (auto-created)
│
├── requirements.txt
└── pytest.ini

12. Environment & Setup

Prerequisites

  • Python 3.11+
  • Ollama installed and running at localhost:11434
  • At least one Ollama model pulled (e.g., ollama pull qwen3.5:9b)

Installation

cd backend
python -m venv .venv
source .venv/bin/activate       # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Running

uvicorn app.main:app --reload
# Runs at http://localhost:8000
# Swagger UI at http://localhost:8000/docs

Database

  • Auto-created at backend/data/chat.db on first startup
  • Schema is idempotent (CREATE TABLE IF NOT EXISTS)
  • Missing columns added via ALTER TABLE with try/except

Testing

cd backend
pytest test/ -v

  • Tests use in-memory SQLite (no file DB needed)
  • Ollama calls are mocked via unittest.mock
  • asyncio_mode = auto in pytest.ini — no @pytest.mark.asyncio needed

13. Key Design Decisions

  1. Direct Ollama REST vs LangChain: Despite LangChain being a dependency, all Ollama calls use httpx directly. This gives full control over streaming, thinking mode, and token statistics.

  2. SQLite over Postgres: Zero-config, file-based, perfect for a local single-user app. WAL mode enables concurrent reads during writes.

  3. Snapshot-based undo: Instead of soft-deletes on messages, deleted messages are moved to a separate deleted_messages table with a shared snapshot_id. This keeps the main messages table clean while supporting full undo.

  4. Rolling summarization: Old messages aren't deleted — they're marked is_summarized = 1 and a rolling summary replaces them in the context window. This preserves full history in the DB while keeping LLM context bounded.

  5. Title before response: AI title generation is a blocking call that completes before response streaming begins. This ensures the frontend has the conversation title immediately, avoiding a jarring delayed update.

  6. Stateless services: All service functions receive the DB connection as a parameter. No service holds state. This makes testing straightforward — just pass an in-memory DB.

  7. Single-user by design: The DEFAULT_USER_ID constant simplifies everything. Multi-user support would require adding auth middleware and replacing the constant with request-scoped user extraction, but the schema already supports it via user_id foreign keys.


14. Error Handling

  • Ollama unreachable: Chat endpoints return SSE error events. Health endpoint returns "degraded". Summarization and title generation silently fall back (warning logged).
  • Missing resources: Routes return 404 via HTTPException.
  • Validation errors: Pydantic returns 422 automatically.
  • All errors follow format: { "detail": "error message" }

15. LLM Prompts

Summarizer

System prompt:

"You are a conversation summarizer. Condense the following into a concise summary preserving key facts, decisions, user preferences, and important context. 2-3 paragraphs max."

User prompt construction:

[If existing summary exists:]
Existing summary:
{existing_summary}

New messages to incorporate:
USER: message1
ASSISTANT: message2
...
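
A builder matching that shape (a sketch; the actual prompts/summarize.py may differ):

def build_summary_prompt(existing_summary: str | None, messages: list[dict]) -> str:
    # Assemble the user prompt exactly in the layout shown above.
    lines: list[str] = []
    if existing_summary:
        lines += ["Existing summary:", existing_summary, ""]
    lines.append("New messages to incorporate:")
    lines += [f"{m['role'].upper()}: {m['content']}" for m in messages]
    return "\n".join(lines)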

Title Generator

System prompt:

"Generate a very short title (3-6 words, max 50 chars) for a chat conversation based on the user's first message. Return ONLY the title, nothing else. No quotes, no punctuation at the end, no explanation."