A complete blueprint for recreating the backend of this project using AI coding assistants. This document focuses on architecture, business logic, and data flow — not UI.
lite-chat is a lightweight, local AI chat application powered by Ollama. Users can have multi-turn conversations with locally-hosted LLMs, with support for streaming responses, image analysis, smart context management, and conversation history with undo/restore.
- Multi-turn chat with any Ollama model
- Real-time token streaming via Server-Sent Events (SSE)
- Automatic conversation summarization to stay within context limits
- Image attachment support (base64-encoded, sent to vision models)
- AI-generated conversation titles
- Response regeneration
- Message editing and deletion with snapshot-based undo
- Model management (list, pull new models)
- User preferences (default model, system prompt)
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Framework | FastAPI | 0.115.0 | Async web framework with auto-generated OpenAPI docs |
| Server | Uvicorn | 0.30.6 | ASGI server |
| Database | SQLite via aiosqlite | 0.20.0 | Async SQLite — zero-config, file-based persistence |
| Validation | Pydantic | 2.9.2 | Request/response models with type validation |
| HTTP Client | httpx | 0.27.2 | Async HTTP for Ollama REST API calls |
| LLM Framework | LangChain + langchain-ollama | 0.3.1 / 0.2.0 | Declared as a dependency, but Ollama's REST API is called directly via httpx |
| Testing | pytest + pytest-asyncio | 7.0+ / 0.21.0+ | Async test support with auto mode |
| Component | Technology |
|---|---|
| Framework | Next.js 14 (App Router) |
| Language | TypeScript |
| Styling | Tailwind CSS |
| UI Library | shadcn/ui |
| State | Zustand |
| Service | URL | Purpose |
|---|---|---|
| Ollama | http://localhost:11434 | Local LLM inference server |
```
Routes (API handlers)
  └── Services (business logic)
        ├── Database (aiosqlite)
        └── Ollama (httpx REST calls)
```
- Routes: Thin handlers that validate input, call services, return responses. No business logic.
- Services: Stateless functions that receive `db: aiosqlite.Connection` as a parameter. All database and Ollama interactions happen here.
- Database: Async SQLite with WAL mode and foreign keys enabled. Connection provided via FastAPI dependency injection (`get_db` async generator).
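A minimal sketch of this layering (illustrative handler; the real routes live in `app/routes/`):

```python
# Sketch of the routes -> services pattern (names illustrative, not the exact code).
from fastapi import APIRouter, Depends, HTTPException
import aiosqlite

from app.database import get_db
from app.services import conversation_service

router = APIRouter(prefix="/api/conversations")

@router.get("/{conversation_id}")
async def get_conversation(
    conversation_id: str,
    db: aiosqlite.Connection = Depends(get_db),
):
    # The route stays thin: validate, delegate, translate "not found" into HTTP.
    conversation = await conversation_service.get_conversation(db, conversation_id)
    if conversation is None:
        raise HTTPException(status_code=404, detail="Conversation not found")
    return conversation
```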
Startup and shutdown run through a FastAPI lifespan context manager:

```python
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: init DB schema + seed default user, check Ollama health
    await init_db()
    await check_ollama_health()  # Warning only, not fatal
    yield
    # Shutdown: log only
```

CORS is wide open for local development:
```python
allow_origins=["*"]
allow_credentials=True
allow_methods=["*"]
allow_headers=["*"]
```

The DB connection is provided via an async generator dependency:
```python
async def get_db():
    db = await aiosqlite.connect(DB_PATH)
    db.row_factory = aiosqlite.Row  # Dict-like row access
    await db.execute("PRAGMA journal_mode=WAL")
    await db.execute("PRAGMA foreign_keys=ON")
    try:
        yield db
    finally:
        await db.close()
```

All configuration lives in `app/config.py` as module-level constants:
| Constant | Default Value | Description |
|---|---|---|
| `DB_PATH` | `../data/chat.db` | SQLite database file path (relative to `app/`) |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server base URL |
| `DEFAULT_MODEL` | `qwen3.5:9b` | Fallback model when user has no preference |
| `CONTEXT_FULL_EXCHANGES` | `5` | Number of recent user+assistant exchange pairs to keep in full context |
| `SUMMARY_MAX_TOKENS` | `500` | Target summary conciseness (used in prompt, not enforced) |
| `SUMMARY_MODEL` | `None` | Model for summarization; `None` = use same model as chat |
| `SSE_RETRY_TIMEOUT` | `15000` | SSE retry timeout in milliseconds |
| `DEFAULT_USER_ID` | `"default"` | Single-user system — all data belongs to this user |
| `DEFAULT_USER_NAME` | `"User"` | Default display name |
SQLite with 4 tables. The schema is created on startup via `init_db()` using `CREATE TABLE IF NOT EXISTS`.
| Column | Type | Default | Description |
|---|---|---|---|
| `id` | TEXT PK | `'default'` | Single-user system |
| `name` | TEXT | `'User'` | Display name |
| `default_model` | TEXT | `'qwen3.5:9b'` | Preferred LLM model |
| `system_prompt` | TEXT | NULL | Global system prompt |
| `created_at` | TIMESTAMP | CURRENT_TIMESTAMP | |
| `updated_at` | TIMESTAMP | CURRENT_TIMESTAMP | |
| Column | Type | Default | Description |
|---|---|---|---|
| `id` | TEXT PK | UUID | |
| `user_id` | TEXT FK→users | `'default'` | |
| `title` | TEXT | `'New Chat'` | Display title (AI-generated on first message) |
| `model_name` | TEXT | NULL | Model used for this conversation |
| `system_prompt` | TEXT | NULL | Per-conversation system prompt override |
| `summary` | TEXT | NULL | Rolling summary of old messages |
| `summary_upto_msg_id` | TEXT | NULL | Last message ID included in summary |
| `created_at` | TIMESTAMP | CURRENT_TIMESTAMP | |
| `updated_at` | TIMESTAMP | CURRENT_TIMESTAMP | Updated on any message or conversation change |
FK: user_id → users(id) ON DELETE CASCADE
| Column | Type | Default | Constraint | Description |
|---|---|---|---|---|
| `id` | TEXT PK | UUID | | |
| `conversation_id` | TEXT FK | | NOT NULL | |
| `role` | TEXT | | CHECK IN ('user', 'assistant', 'system') | |
| `content` | TEXT | | NOT NULL | Message text |
| `image_base64` | TEXT | NULL | | Base64-encoded image (user messages only) |
| `thinking` | TEXT | NULL | | Model's reasoning/thinking output |
| `tokens_used` | INTEGER | 0 | | Total tokens (prompt + completion) |
| `tokens_per_sec` | REAL | 0 | | Generation speed |
| `thinking_duration` | INTEGER | 0 | | Seconds spent in thinking phase |
| `is_summarized` | BOOLEAN | 0 | | Marked true after inclusion in a summary |
| `created_at` | TIMESTAMP | CURRENT_TIMESTAMP | | |
FK: conversation_id → conversations(id) ON DELETE CASCADE
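The DDL this implies, as a sketch reconstructed from the column table above (not copied from `database.py`):

```python
# Reconstructed from the schema table; the real init_db() may differ in detail.
await db.execute("""
    CREATE TABLE IF NOT EXISTS messages (
        id TEXT PRIMARY KEY,
        conversation_id TEXT NOT NULL,
        role TEXT CHECK (role IN ('user', 'assistant', 'system')),
        content TEXT NOT NULL,
        image_base64 TEXT,
        thinking TEXT,
        tokens_used INTEGER DEFAULT 0,
        tokens_per_sec REAL DEFAULT 0,
        thinking_duration INTEGER DEFAULT 0,
        is_summarized BOOLEAN DEFAULT 0,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        FOREIGN KEY (conversation_id) REFERENCES conversations(id) ON DELETE CASCADE
    )
""")
```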
| Column | Type | Default | Description |
|---|---|---|---|
| `id` | TEXT PK | | Original message ID |
| `snapshot_id` | TEXT | | Groups messages deleted together (UUID) |
| `conversation_id` | TEXT | | Original conversation ID |
| `role` | TEXT | | |
| `content` | TEXT | | |
| `image_base64` | TEXT | NULL | |
| `thinking` | TEXT | NULL | |
| `tokens_used` | INTEGER | 0 | |
| `tokens_per_sec` | REAL | 0 | |
| `thinking_duration` | INTEGER | 0 | |
| `original_created_at` | TIMESTAMP | | Preserves original ordering |
| `deleted_at` | TIMESTAMP | CURRENT_TIMESTAMP | |
No foreign keys — this table stores orphaned data for restoration.
For existing databases, missing columns are added via `ALTER TABLE ... ADD COLUMN` wrapped in try/except (silently skipped if the column already exists):

```python
try:
    await db.execute("ALTER TABLE messages ADD COLUMN tokens_per_sec REAL DEFAULT 0")
except Exception:
    pass
```

On every startup, the app checks whether user `'default'` exists and inserts it if not.
All endpoints are prefixed with /api/.
Request:
```json
{
  "conversation_id": "uuid | null",
  "model": "model-name | null",
  "message": "user message (required, min 1 char)",
  "image": "base64-string | null"
}
```

Behavior:
- Resolve model: request → user preference → `DEFAULT_MODEL`
- If no `conversation_id`: create a new conversation, mark `is_first_message = true`
- If `conversation_id` provided: verify it exists, check if `message_count == 0` for first-message detection
- Save user message to DB
- If first message: generate AI title via Ollama (blocking, non-streamed), update conversation title, emit `title` SSE event
- Build context (system prompt + summary + last N messages)
- Stream response from Ollama with `think: true`
- On stream complete: save assistant message to DB with token stats
SSE Event Types:
| Type | Fields | When |
|---|---|---|
| `title` | `conversation_id`, `title` | First message only, before content starts |
| `thinking` | `content` | Each thinking token from model |
| `content` | `content` | Each content token from model |
| `done` | `conversation_id`, `message_id`, `user_message_id`, `tokens_used`, `tokens_per_sec`, `thinking_duration` | Stream complete |
| `error` | `content` | On any error |
SSE Format: `data: {JSON}\n\n` — standard SSE with `text/event-stream` content type.
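A sketch of how these events can be serialized (hypothetical helper name; assumes the event type travels as a `type` key inside the JSON payload, matching the field table above):

```python
import json

def sse_event(payload: dict) -> str:
    # Standard SSE framing: a "data:" line terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"

# Examples matching the event table:
# sse_event({"type": "content", "content": "Hello"})
# sse_event({"type": "done", "conversation_id": cid, "message_id": mid, ...})
```

With FastAPI, the async generator yielding these strings is wrapped in a `StreamingResponse(..., media_type="text/event-stream")`.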
Request:
```json
{
  "conversation_id": "uuid (required)",
  "message_id": "uuid of assistant message (required)",
  "model": "model-name | null"
}
```

Behavior:
- Verify conversation and message exist; the message must be `role = 'assistant'`
- Delete the old assistant message from DB
- Rebuild context (the user message is still there)
- Stream new response (same SSE format as `/api/chat`)
- Save new assistant message
Request: `ConversationCreate` — `title`, `model_name`, `system_prompt` (all optional)
Response: 201 — `ConversationResponse`
Response: 200 — `ConversationResponse[]` ordered by `updated_at` DESC
- Each includes `message_count` via LEFT JOIN COUNT
Response: 200 — `ConversationDetailResponse` (includes all messages ordered by `created_at` ASC)
Error: 404 if not found
Request: `ConversationUpdate` — `title`, `system_prompt` (both optional; only provided fields are updated)
Response: 200 — `ConversationResponse`
Response: 204 No Content. Messages are cascade-deleted by SQLite FK.
Response: 200 — `MessageResponse[]`
Request: `{ "content": "new text" }`
Response: 200 — `MessageResponse`
This is NOT a simple delete. It deletes this message AND everything after it, creating a snapshot for undo.
Response: 200
```json
{
  "snapshot_id": "uuid",
  "deleted_count": 3,
  "conversation_deleted": false
}
```

If all messages are deleted, the conversation itself is deleted and `conversation_deleted: true` is returned.
Restores all messages from a snapshot. If the conversation was deleted, recreates it with title "Restored Chat".
Response: 200
```json
{
  "conversation_id": "uuid",
  "restored_count": 3
}
```

Proxies `GET /api/tags` from Ollama. Filters out `:latest` tag variants when a specific tag exists with the same digest.
Response: 200 — `{ "models": [...] }`
Request: `{ "name": "llama3.2" }`
Response: Streams NDJSON progress from Ollama's pull endpoint.
Response: 200
```json
{
  "status": "healthy | degraded",
  "ollama": "connected | unreachable",
  "database": "connected"
}
```

Response: 200 — `UserProfile`
Request: `UserPreferencesUpdate` — `name`, `default_model`, `system_prompt` (all optional)
Response: 200 — `UserProfile`
None. This is a single-user, local-only application. All data is scoped to a hardcoded `DEFAULT_USER_ID = "default"`. The user table exists to store preferences, not for auth.
All conversation queries filter by `user_id = DEFAULT_USER_ID`. If multi-user support were added, you'd:
- Add auth middleware (JWT/session)
- Extract user ID from the request instead of using the constant
- The existing `user_id` foreign keys already support this
```
User sends message
  → Resolve model (request > user pref > default)
  → Create conversation if needed
  → Save user message to DB
  → [First message] Generate AI title → emit "title" SSE event
  → Build context window
  → Stream from Ollama (think=true)
  → For each chunk:
      - thinking tokens → emit "thinking" SSE event
      - content tokens → emit "content" SSE event
      - Track timing for thinking_duration
  → On done:
      - Calculate tokens_per_sec from eval_count / eval_duration
      - Save assistant message with stats
      - Emit "done" SSE event
```
The system maintains a bounded context window to avoid exceeding model limits.
Configuration:
- `CONTEXT_FULL_EXCHANGES = 5` → keeps the last 10 messages (5 user + 5 assistant) in full
Context Building (`build_context`):

1. Count total messages in the conversation
2. If total > 10: trigger summarization of old messages
3. Build the message list:
   a. System prompt (conversation-level > user-level > skip)
   b. Summary of old messages (if one exists)
   c. Last 10 messages in chronological order
   d. The new user message is already included, since it was saved before the context build
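A sketch of `build_context` following those steps (assumes the config constants are imported; `user_system_prompt` is passed in by the caller, and injecting the summary as a system message is an assumption):

```python
LAST_N_MESSAGES = CONTEXT_FULL_EXCHANGES * 2  # 5 exchanges -> 10 messages

async def build_context(db, conversation, user_system_prompt: str | None = None) -> list[dict]:
    cur = await db.execute(
        "SELECT COUNT(*) FROM messages WHERE conversation_id = ?",
        (conversation["id"],),
    )
    (total,) = await cur.fetchone()

    if total > LAST_N_MESSAGES:
        # Sketched in the summarization section below; a real implementation
        # would re-read the conversation row afterwards to pick up the new summary.
        await summarize_old_messages(db, conversation)

    context: list[dict] = []
    # Precedence: conversation-level system prompt > user-level > none
    system_prompt = conversation["system_prompt"] or user_system_prompt
    if system_prompt:
        context.append({"role": "system", "content": system_prompt})
    if conversation["summary"]:
        context.append({
            "role": "system",
            "content": f"Summary of the earlier conversation: {conversation['summary']}",
        })

    cur = await db.execute(
        "SELECT role, content, image_base64 FROM messages "
        "WHERE conversation_id = ? ORDER BY created_at DESC LIMIT ?",
        (conversation["id"], LAST_N_MESSAGES),
    )
    rows = list(reversed(await cur.fetchall()))  # restore chronological order
    for row in rows:
        msg = {"role": row["role"], "content": row["content"]}
        if row["image_base64"]:
            msg["images"] = [row["image_base64"]]
        context.append(msg)
    return context
```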
Summarization Trigger:
- Runs when `total_messages > LAST_N_MESSAGES` (10, i.e. `CONTEXT_FULL_EXCHANGES * 2`)
- Only summarizes messages NOT already marked `is_summarized`
- Excludes the last 10 messages from summarization
Summarization Process:
- Fetch unsummarized messages (excluding the last 10)
- Fetch the existing summary from the conversation (if any)
- Call Ollama non-streamed with the summarizer prompt:
  - System: "Condense into 2-3 paragraphs preserving key facts, decisions, preferences"
  - User: existing summary (if any) + new messages text
- Strip `<think>` tags from the response
- Update `conversations.summary` and `conversations.summary_upto_msg_id`
- Mark old messages as `is_summarized = 1`
- On failure: silently skip (warning log)
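A sketch of that process (the import of `SUMMARIZER_SYSTEM_PROMPT` from `app.prompts.summarize` is an assumed name; the prompt text is quoted at the end of this document):

```python
import logging
import re

import httpx

from app.config import DEFAULT_MODEL, OLLAMA_BASE_URL, SUMMARY_MODEL
from app.prompts.summarize import SUMMARIZER_SYSTEM_PROMPT  # name assumed

logger = logging.getLogger(__name__)
LAST_N_MESSAGES = 10  # CONTEXT_FULL_EXCHANGES * 2

async def summarize_old_messages(db, conversation) -> None:
    cur = await db.execute(
        "SELECT id, role, content FROM messages "
        "WHERE conversation_id = ? AND is_summarized = 0 ORDER BY created_at ASC",
        (conversation["id"],),
    )
    rows = await cur.fetchall()
    old = rows[:-LAST_N_MESSAGES]  # the most recent window is never summarized
    if not old:
        return

    text = "\n".join(f"{r['role'].upper()}: {r['content']}" for r in old)
    if conversation["summary"]:
        text = f"Existing summary:\n{conversation['summary']}\n\nNew messages to incorporate:\n{text}"

    try:
        async with httpx.AsyncClient(timeout=60) as client:
            resp = await client.post(f"{OLLAMA_BASE_URL}/api/chat", json={
                "model": SUMMARY_MODEL or conversation["model_name"] or DEFAULT_MODEL,
                "stream": False,
                "messages": [
                    {"role": "system", "content": SUMMARIZER_SYSTEM_PROMPT},
                    {"role": "user", "content": text},
                ],
            })
            summary = resp.json()["message"]["content"]
            summary = re.sub(r"<think>.*?</think>", "", summary, flags=re.DOTALL).strip()
    except Exception:
        logger.warning("Summarization failed; keeping previous summary")  # non-fatal by design
        return

    await db.execute(
        "UPDATE conversations SET summary = ?, summary_upto_msg_id = ? WHERE id = ?",
        (summary, old[-1]["id"], conversation["id"]),
    )
    await db.executemany(
        "UPDATE messages SET is_summarized = 1 WHERE id = ?",
        [(r["id"],) for r in old],
    )
    await db.commit()
```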
Image Handling in Context:
User messages with images include `"images": [base64_string]` in the Ollama message format.
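For example, a user turn with an attached image looks like this in the request payload (shape per Ollama's chat API):

```python
{
    "role": "user",
    "content": "What's in this picture?",
    "images": ["<base64-encoded image bytes>"],
}
```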
When the first message is sent to a conversation:
- Call Ollama non-streamed with a title generation prompt:
  - System: "Generate a very short title (3-6 words, max 50 chars) for a chat conversation..."
  - User: the first message content
- `stream: false`, `think: false`
- Timeout: 15 seconds
- Sanitize the response:
  - Strip leading/trailing quotes
  - Remove `<think>...</think>` blocks (regex with DOTALL)
  - Reject empty or >60-char titles
- Fallback: first 50 chars of the message + "..." if the AI call fails
- Model used: the request model, or `qwen3.5:9b` if none specified
The title is generated and emitted via SSE before the response starts streaming, so the frontend can update the sidebar immediately.
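A sketch of the sanitization step, implementing the rules above (function name is illustrative):

```python
import re

def sanitize_title(raw: str, first_message: str) -> str:
    # Remove <think>...</think> blocks (DOTALL so the match spans newlines),
    # then strip whitespace and stray surrounding quotes.
    title = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    title = title.strip().strip("\"'").strip()
    if not title or len(title) > 60:
        # Fallback: first 50 chars of the user's message
        return first_message[:50] + "..."
    return title
```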
This is the most complex operation. When a message is "deleted" from the chat:
Truncate:

- Find the target message's `created_at`
- Select ALL messages with `created_at >= target` (this message and everything after it)
- Copy each to `deleted_messages` with a shared `snapshot_id` (UUID)
- Delete the originals from the `messages` table
- Check whether any messages remain:
  - Yes: update the conversation's `updated_at`
  - No: delete the conversation entirely, set `conversation_deleted = true`
- Return `{ snapshot_id, deleted_count, conversation_deleted }`
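A sketch of the truncate step with aiosqlite (illustrative function name; column list follows the schema above):

```python
import uuid

async def truncate_from_message(db, conversation_id: str, message_id: str) -> dict:
    cur = await db.execute("SELECT created_at FROM messages WHERE id = ?", (message_id,))
    target_created_at = (await cur.fetchone())["created_at"]

    snapshot_id = str(uuid.uuid4())
    # Copy the target message and everything after it into the snapshot table...
    await db.execute(
        """INSERT INTO deleted_messages
               (id, snapshot_id, conversation_id, role, content, image_base64, thinking,
                tokens_used, tokens_per_sec, thinking_duration, original_created_at)
           SELECT id, ?, conversation_id, role, content, image_base64, thinking,
                  tokens_used, tokens_per_sec, thinking_duration, created_at
           FROM messages WHERE conversation_id = ? AND created_at >= ?""",
        (snapshot_id, conversation_id, target_created_at),
    )
    # ...then delete the originals.
    cur = await db.execute(
        "DELETE FROM messages WHERE conversation_id = ? AND created_at >= ?",
        (conversation_id, target_created_at),
    )
    deleted_count = cur.rowcount

    cur = await db.execute(
        "SELECT COUNT(*) FROM messages WHERE conversation_id = ?", (conversation_id,)
    )
    (remaining,) = await cur.fetchone()
    conversation_deleted = remaining == 0
    if conversation_deleted:
        await db.execute("DELETE FROM conversations WHERE id = ?", (conversation_id,))
    else:
        await db.execute(
            "UPDATE conversations SET updated_at = CURRENT_TIMESTAMP WHERE id = ?",
            (conversation_id,),
        )
    await db.commit()
    return {"snapshot_id": snapshot_id, "deleted_count": deleted_count,
            "conversation_deleted": conversation_deleted}
```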
Restore:

- Fetch all `deleted_messages` with the given `snapshot_id`
- Check whether the conversation still exists:
  - Yes: use it
  - No: recreate it with `title = "Restored Chat"`
- Re-insert all messages using `INSERT OR IGNORE` (preserving `original_created_at`)
- Delete the snapshot from `deleted_messages`
- Return `{ conversation_id, restored_count }`
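The mirror-image restore, as a sketch (assumes the snapshot exists; the route would 404 otherwise, and `DEFAULT_USER_ID` comes from config):

```python
async def restore_snapshot(db, snapshot_id: str) -> dict:
    cur = await db.execute(
        "SELECT * FROM deleted_messages WHERE snapshot_id = ? ORDER BY original_created_at",
        (snapshot_id,),
    )
    rows = await cur.fetchall()
    conversation_id = rows[0]["conversation_id"]

    # Recreate the conversation if the truncate deleted it.
    cur = await db.execute("SELECT 1 FROM conversations WHERE id = ?", (conversation_id,))
    if await cur.fetchone() is None:
        await db.execute(
            "INSERT INTO conversations (id, user_id, title) VALUES (?, ?, ?)",
            (conversation_id, DEFAULT_USER_ID, "Restored Chat"),
        )

    await db.executemany(
        """INSERT OR IGNORE INTO messages
               (id, conversation_id, role, content, image_base64, thinking,
                tokens_used, tokens_per_sec, thinking_duration, created_at)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        [(r["id"], r["conversation_id"], r["role"], r["content"], r["image_base64"],
          r["thinking"], r["tokens_used"], r["tokens_per_sec"], r["thinking_duration"],
          r["original_created_at"]) for r in rows],
    )
    await db.execute("DELETE FROM deleted_messages WHERE snapshot_id = ?", (snapshot_id,))
    await db.commit()
    return {"conversation_id": conversation_id, "restored_count": len(rows)}
```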
- Verify the target message is `role = 'assistant'`
- Delete it from the DB (simple DELETE, no snapshot)
- Rebuild the context (the preceding user message remains)
- Stream a fresh response from Ollama
- Save the new response
List Models:
- Proxies `GET {OLLAMA_BASE_URL}/api/tags`
- Deduplication: if both `model:latest` and `model:specific-tag` exist with the same digest, the `:latest` variant is removed
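A sketch of that dedup (assumes the `/api/tags` shape, where each model entry carries `name` and `digest`):

```python
def dedupe_latest(models: list[dict]) -> list[dict]:
    # Digests that already appear under a specific (non-:latest) tag
    specific_digests = {
        m["digest"] for m in models if not m["name"].endswith(":latest")
    }
    # Drop a :latest entry only when the same digest exists under a specific tag
    return [
        m for m in models
        if not (m["name"].endswith(":latest") and m["digest"] in specific_digests)
    ]
```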
Pull Model:
- Proxies `POST {OLLAMA_BASE_URL}/api/pull` as an NDJSON stream
- The frontend receives progress updates line-by-line
Health Check:
- `GET {OLLAMA_BASE_URL}/` — expects a 200 status code
- Returns `true`/`false`, does not throw
- Single user (`id = "default"`)
- `get_or_create_default_user`: creates the user row on first access if missing
- Update: only modifies provided fields (Pydantic `exclude_unset=True`)
- System prompt precedence: conversation-level > user-level > none
```python
# schemas/chat.py
from pydantic import BaseModel, Field

class ChatRequest(BaseModel):
    conversation_id: str | None = None
    model: str | None = None
    message: str = Field(..., min_length=1)
    image: str | None = None  # base64

class RegenerateRequest(BaseModel):
    conversation_id: str
    message_id: str
    model: str | None = None
```

```python
# schemas/conversation.py
from datetime import datetime
from pydantic import BaseModel

class ConversationCreate(BaseModel):
    title: str | None = "New Chat"
    model_name: str | None = None
    system_prompt: str | None = None

class ConversationUpdate(BaseModel):
    title: str | None = None
    system_prompt: str | None = None

class ConversationResponse(BaseModel):
    id: str
    title: str
    model_name: str | None
    message_count: int = 0
    created_at: datetime
    updated_at: datetime

class MessageResponse(BaseModel):
    id: str
    role: str
    content: str
    thinking: str | None = None
    image_base64: str | None = None
    tokens_used: int = 0
    tokens_per_sec: float = 0.0
    thinking_duration: int = 0
    created_at: datetime

class ConversationDetailResponse(BaseModel):
    id: str
    title: str
    model_name: str | None
    system_prompt: str | None
    created_at: datetime
    updated_at: datetime
    messages: list[MessageResponse] = []
```

```python
# schemas/user.py
from pydantic import BaseModel

class UserProfile(BaseModel):
    id: str
    name: str
    default_model: str
    system_prompt: str | None = None

class UserPreferencesUpdate(BaseModel):
    name: str | None = None
    default_model: str | None = None
    system_prompt: str | None = None
```

Note: Schemas with `model_name` fields use `ConfigDict(protected_namespaces=())` to avoid Pydantic's reserved-namespace warning for `model_*` fields.
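For instance (a minimal illustration of that setting; the real models may place it on a shared base class):

```python
from pydantic import BaseModel, ConfigDict

class ConversationResponse(BaseModel):
    # Allow model_name without tripping Pydantic's "model_" namespace warning
    model_config = ConfigDict(protected_namespaces=())

    id: str
    title: str
    model_name: str | None
```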
All LLM interactions use Ollama's REST API directly via httpx (not LangChain, despite it being a dependency).
Endpoints used:
| Ollama Endpoint | Used For | Streamed |
|---|---|---|
| `POST /api/chat` | Chat responses | Yes (NDJSON lines) |
| `POST /api/chat` | Summarization | No (`stream: false`) |
| `POST /api/chat` | Title generation | No (`stream: false`) |
| `GET /api/tags` | List models | No |
| `POST /api/pull` | Pull models | Yes (NDJSON lines) |
| `GET /` | Health check | No |
Chat streaming format: Each NDJSON line from Ollama contains:
```json
{
  "message": {
    "content": "token",
    "thinking": "thinking token (when think=true)"
  },
  "done": false
}
```

The final chunk has `done: true` and includes:
- `eval_count` — number of generated tokens
- `eval_duration` — generation time in nanoseconds
- `prompt_eval_count` — number of prompt tokens
Thinking mode: All chat requests use `"think": true`. Ollama models that support thinking (like Qwen) return reasoning in the `thinking` field. Non-thinking models simply return empty `thinking` fields.
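A sketch of consuming this stream and deriving the stats (httpx NDJSON iteration; assumes `OLLAMA_BASE_URL` is imported from config, and variable names are illustrative):

```python
import json

import httpx

async def stream_ollama_chat(model: str, messages: list[dict]):
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", f"{OLLAMA_BASE_URL}/api/chat", json={
            "model": model, "messages": messages, "stream": True, "think": True,
        }) as resp:
            async for line in resp.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)
                if chunk.get("done"):
                    eval_count = chunk.get("eval_count", 0)
                    eval_duration = chunk.get("eval_duration", 1)  # nanoseconds
                    tokens_per_sec = eval_count / (eval_duration / 1e9)
                    tokens_used = chunk.get("prompt_eval_count", 0) + eval_count
                    yield {"type": "done", "tokens_used": tokens_used,
                           "tokens_per_sec": round(tokens_per_sec, 2)}
                    break
                msg = chunk.get("message", {})
                if msg.get("thinking"):
                    yield {"type": "thinking", "content": msg["thinking"]}
                if msg.get("content"):
                    yield {"type": "content", "content": msg["content"]}
```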
```
backend/
├── app/
│   ├── main.py                       # FastAPI app, lifespan, CORS, router registration
│   ├── config.py                     # All constants
│   ├── database.py                   # SQLite schema, get_db dependency, init_db
│   │
│   ├── routes/                       # API endpoint handlers (thin)
│   │   ├── chat.py                   # POST /api/chat, /api/chat/regenerate
│   │   ├── conversations.py          # CRUD + message ops + truncate/restore
│   │   ├── models.py                 # GET /api/models, POST /api/models/pull, GET /api/health
│   │   └── user.py                   # GET/PUT /api/user
│   │
│   ├── services/                     # Business logic (stateless, db passed in)
│   │   ├── chat_service.py           # stream_chat, stream_regenerate, build_context, summarize
│   │   ├── conversation_service.py   # CRUD, truncate, restore, AI title
│   │   ├── model_service.py          # list_models, pull_model, health check
│   │   └── user_service.py           # get/update preferences
│   │
│   ├── schemas/                      # Pydantic request/response models
│   │   ├── chat.py
│   │   ├── conversation.py
│   │   └── user.py
│   │
│   └── prompts/
│       └── summarize.py              # Summarizer system prompt + builder
│
├── test/                             # pytest tests
│   ├── conftest.py                   # In-memory DB fixture, test helpers
│   ├── test_conversation_service.py  # 23 service unit tests
│   └── test_conversation_routes.py   # 10 API integration tests
│
├── data/
│   └── chat.db                       # SQLite database (auto-created)
│
├── requirements.txt
└── pytest.ini
```
- Python 3.11+
- Ollama installed and running at `localhost:11434`
- At least one Ollama model pulled (e.g., `ollama pull qwen3.5:9b`)
```bash
cd backend
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

```bash
uvicorn app.main:app --reload
# Runs at http://localhost:8000
# Swagger UI at http://localhost:8000/docs
```

- Auto-created at `backend/data/chat.db` on first startup
- Schema is idempotent (`CREATE TABLE IF NOT EXISTS`)
- Missing columns added via `ALTER TABLE` with try/except
```bash
cd backend
pytest test/ -v
```

- Tests use in-memory SQLite (no file DB needed)
- Ollama calls are mocked via `unittest.mock`
- `asyncio_mode = auto` in `pytest.ini` — no `@pytest.mark.asyncio` needed
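A sketch of the in-memory DB fixture (illustrative; `init_schema` is an assumed name for a helper that creates the tables on a given connection):

```python
import aiosqlite
import pytest

from app.database import init_schema  # assumed: schema-creation helper

@pytest.fixture
async def db():
    # ":memory:" gives each test a fresh, file-less database
    conn = await aiosqlite.connect(":memory:")
    conn.row_factory = aiosqlite.Row
    await conn.execute("PRAGMA foreign_keys=ON")
    await init_schema(conn)  # create tables on this connection
    yield conn
    await conn.close()
```

Because services are stateless and take `db` as a parameter, tests can pass this fixture straight in without patching globals.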
- **Direct Ollama REST vs LangChain**: Despite LangChain being a dependency, all Ollama calls use `httpx` directly. This gives full control over streaming, thinking mode, and token statistics.
- **SQLite over Postgres**: Zero-config, file-based, perfect for a local single-user app. WAL mode enables concurrent reads during writes.
- **Snapshot-based undo**: Instead of soft-deletes on messages, deleted messages are moved to a separate `deleted_messages` table with a shared `snapshot_id`. This keeps the main `messages` table clean while supporting full undo.
- **Rolling summarization**: Old messages aren't deleted — they're marked `is_summarized = 1` and a rolling summary replaces them in the context window. This preserves full history in the DB while keeping LLM context bounded.
- **Title before response**: AI title generation is a blocking call that completes before response streaming begins. This ensures the frontend has the conversation title immediately, avoiding a jarring delayed update.
- **Stateless services**: All service functions receive the DB connection as a parameter. No service holds state. This makes testing straightforward — just pass an in-memory DB.
- **Single-user by design**: The `DEFAULT_USER_ID` constant simplifies everything. Multi-user support would require adding auth middleware and replacing the constant with request-scoped user extraction, but the schema already supports it via `user_id` foreign keys.
- Ollama unreachable: Chat endpoints return SSE `error` events. The health endpoint returns `"degraded"`. Summarization and title generation silently fall back (warning logged).
- Missing resources: Routes return 404 via `HTTPException`.
- Validation errors: Pydantic returns 422 automatically.
- All errors follow the format `{ "detail": "error message" }`.
Summarizer system prompt:
"You are a conversation summarizer. Condense the following into a concise summary preserving key facts, decisions, user preferences, and important context. 2-3 paragraphs max."
User prompt construction:
```
[If existing summary exists:]
Existing summary:
{existing_summary}

New messages to incorporate:
USER: message1
ASSISTANT: message2
...
```
Title generation system prompt:
"Generate a very short title (3-6 words, max 50 chars) for a chat conversation based on the user's first message. Return ONLY the title, nothing else. No quotes, no punctuation at the end, no explanation."