In this hands-on workshop, you'll learn how to build and deploy a memory-efficient Generative AI stack using Docker. We'll create a Retrieval-Augmented Generation (RAG) application that can run on machines with as little as 6GB RAM - perfect for laptops, edge devices, or cost-conscious cloud deployments.
By the end, you'll have a fully functional AI-powered document Q&A system featuring:
- Ollama - Local LLM inference server
- ChromaDB - Vector database for semantic search
- LangChain - AI orchestration framework
- Streamlit - Interactive web interface
To follow along, you'll need:
- Docker Desktop installed (with Docker Compose v2)
- 6GB+ available RAM
- Basic familiarity with Docker concepts
- Terminal/command-line access
Time required:
- Full Workshop: 60-90 minutes
- Quick Start: 15-20 minutes
┌─────────────────────────────────────────────────────────────┐
│ User Browser │
│ http://localhost:8501 │
└─────────────────────────┬───────────────────────────────────┘
│
┌─────────────────────────▼───────────────────────────────────┐
│ Streamlit App (genai-app) │
│ Port: 8501 │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ • File Upload & Processing │ │
│ │ • Chat Interface │ │
│ │ • LangChain RAG Pipeline │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────┬──────────────────────────────────┬───────────────┘
│ │
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────────┐
│ Ollama LLM Server │ │ ChromaDB │
│ Port: 11434 │ │ Port: 8000 │
│ ┌───────────────────┐ │ │ ┌───────────────────────┐ │
│ │ tinyllama:1.1b │ │ │ │ Vector Embeddings │ │
│ │ nomic-embed-text │ │ │ │ Document Chunks │ │
│ └───────────────────┘ │ │ │ Semantic Search │ │
└─────────────────────────┘ └─────────────────────────────┘
| Component | Purpose | Memory Usage |
|---|---|---|
| Ollama | Runs LLM locally | ~1-3GB (model dependent) |
| ChromaDB | Stores & searches embeddings | ~256-512MB |
| Streamlit App | Web UI + RAG logic | ~512MB-1GB |
| Docker Overhead | Container runtime | ~500MB |
Total: ~4-6GB vs. 20GB+ for full GenAI Stack
Start by creating the project layout:
# Create project directory
mkdir lightweight-genai-stack
cd lightweight-genai-stack
# Create subdirectories
mkdir -p app
Create docker-compose.yml:
services:
# Ollama - LLM Server (lightweight model)
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
limits:
memory: 3G # Limit memory for Ollama (tinyllama is ~600MB)
healthcheck:
test: ["CMD", "ollama", "list"]
interval: 30s
timeout: 10s
retries: 5
start_period: 30s
restart: unless-stopped
# Model Puller - Downloads the lightweight model on startup
model-puller:
image: curlimages/curl:latest
container_name: model-puller
depends_on:
ollama:
condition: service_healthy
entrypoint: ["/bin/sh", "-c"]
command:
- |
echo "Waiting for Ollama to be ready..."
sleep 5
echo "Pulling tinyllama:1.1b model (ultra-lightweight, ~600MB)..."
curl -X POST http://ollama:11434/api/pull -d '{"name": "tinyllama:1.1b"}'
echo "Pulling nomic-embed-text for embeddings..."
curl -X POST http://ollama:11434/api/pull -d '{"name": "nomic-embed-text"}'
echo "Models downloaded successfully!"
# ChromaDB - Lightweight Vector Database
chromadb:
image: chromadb/chroma:latest
container_name: chromadb
ports:
- "8000:8000"
volumes:
- chroma_data:/chroma/chroma
environment:
- IS_PERSISTENT=TRUE
- ANONYMIZED_TELEMETRY=FALSE
deploy:
resources:
limits:
memory: 512M
restart: unless-stopped
# GenAI App - Streamlit RAG Application
genai-app:
build:
context: ./app
dockerfile: Dockerfile
container_name: genai-app
ports:
- "8501:8501"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- CHROMA_HOST=chromadb
- CHROMA_PORT=8000
- LLM_MODEL=tinyllama:1.1b
- EMBEDDING_MODEL=nomic-embed-text
depends_on:
ollama:
condition: service_healthy
chromadb:
condition: service_started
volumes:
- ./app:/app
- uploaded_docs:/app/uploads
deploy:
resources:
limits:
memory: 1G
restart: unless-stopped
volumes:
ollama_data:
chroma_data:
uploaded_docs:
networks:
default:
name: genai-network

A few things in this Compose file are worth calling out:

1. Service Dependencies with Health Checks

depends_on:
  ollama:
    condition: service_healthy

This ensures the app only starts after Ollama is fully ready, not just running.

2. Resource Limits

deploy:
  resources:
    limits:
      memory: 3G

Prevents any single container from consuming all available memory.

3. Named Volumes for Persistence

volumes:
  ollama_data:   # Persists downloaded models
  chroma_data:   # Persists vector embeddings

4. Init Container Pattern (model-puller)

model-puller:
  # Runs once to download models, then exits

This pattern ensures models are ready before the app needs them.
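If you want to reproduce that readiness check from the host, for example in a setup script, here is a minimal Python sketch (it assumes the requests package is installed; the helper name is illustrative, not part of the stack):

```python
import time
import requests

def wait_for_ollama(url="http://localhost:11434", timeout=120):
    """Poll the Ollama HTTP API until it answers, mirroring what the
    Compose healthcheck plus depends_on combination waits for."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{url}/api/tags", timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass
        time.sleep(2)
    return False

print("Ollama ready:", wait_for_ollama())
```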
Create app/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# Install system dependencies
RUN apt-get update && apt-get install -y \
curl \
&& rm -rf /var/lib/apt/lists/*
# Copy requirements first for better caching
COPY requirements.txt .
# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Create uploads directory
RUN mkdir -p /app/uploads
# Expose Streamlit port
EXPOSE 8501
# Health check
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
# Run Streamlit
ENTRYPOINT ["streamlit", "run", "main.py", "--server.port=8501", "--server.address=0.0.0.0"]
Key practices in this Dockerfile:
- Use slim base images - python:3.11-slim is ~150MB vs ~1GB for the full image
- Layer caching - copy requirements.txt before code for faster rebuilds
- Clean up apt cache - reduces image size
- Health checks - enables Docker to monitor container health
- Non-root user (optional enhancement for production)
Create app/requirements.txt:
streamlit==1.40.0
langchain==0.3.9
langchain-community==0.3.8
langchain-chroma==0.1.4
langchain-ollama==0.2.2
chromadb==0.5.20
pypdf==5.1.0
python-docx==1.1.2
unstructured==0.16.10
sentence-transformers==3.3.1
requests==2.32.3
Create app/main.py:
"""
Lightweight GenAI Stack - RAG Application
A memory-efficient GenAI stack using Ollama + ChromaDB + LangChain
Designed to run within 6GB RAM
"""
import os
import streamlit as st
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferWindowMemory
from langchain_community.document_loaders import (
PyPDFLoader,
TextLoader,
UnstructuredMarkdownLoader,
)
import tempfile
import time
# Configuration from environment
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
CHROMA_HOST = os.getenv("CHROMA_HOST", "localhost")
CHROMA_PORT = os.getenv("CHROMA_PORT", "8000")
LLM_MODEL = os.getenv("LLM_MODEL", "tinyllama:1.1b")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
# Page config
st.set_page_config(
page_title="Lightweight GenAI Stack",
page_icon="🤖",
layout="wide",
initial_sidebar_state="expanded"
)
# Custom CSS for better UI
st.markdown("""
<style>
.stApp {
max-width: 1200px;
margin: 0 auto;
}
.chat-message {
padding: 1rem;
border-radius: 0.5rem;
margin-bottom: 1rem;
}
.user-message {
background-color: #e3f2fd;
}
.assistant-message {
background-color: #f5f5f5;
}
</style>
""", unsafe_allow_html=True)
@st.cache_resource
def init_llm():
"""Initialize the Ollama LLM"""
return OllamaLLM(
model=LLM_MODEL,
base_url=OLLAMA_BASE_URL,
temperature=0.7,
num_ctx=4096, # Smaller context for memory efficiency
)
@st.cache_resource
def init_embeddings():
"""Initialize Ollama embeddings"""
return OllamaEmbeddings(
model=EMBEDDING_MODEL,
base_url=OLLAMA_BASE_URL,
)
def init_vectorstore():
"""Initialize ChromaDB vector store"""
embeddings = init_embeddings()
return Chroma(
collection_name="documents",
embedding_function=embeddings,
persist_directory="/app/chroma_db"
)
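# Note: as written, the vector store persists locally inside the app container
# rather than talking to the chromadb service from docker-compose.yml. To use
# that service instead (via CHROMA_HOST / CHROMA_PORT), one option, shown here
# only as a sketch, is to hand Chroma an HTTP client instead of a persist
# directory:
#
#   import chromadb
#   client = chromadb.HttpClient(host=CHROMA_HOST, port=int(CHROMA_PORT))
#   return Chroma(
#       client=client,
#       collection_name="documents",
#       embedding_function=embeddings,
#   )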
def process_uploaded_file(uploaded_file):
"""Process uploaded file and return documents"""
# Save to temp file
with tempfile.NamedTemporaryFile(delete=False, suffix=f".{uploaded_file.name.split('.')[-1]}") as tmp:
tmp.write(uploaded_file.getbuffer())
tmp_path = tmp.name
# Load based on file type
file_extension = uploaded_file.name.split('.')[-1].lower()
try:
if file_extension == 'pdf':
loader = PyPDFLoader(tmp_path)
elif file_extension == 'txt':
loader = TextLoader(tmp_path)
elif file_extension == 'md':
loader = UnstructuredMarkdownLoader(tmp_path)
else:
st.error(f"Unsupported file type: {file_extension}")
return None
documents = loader.load()
# Split documents
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # Smaller chunks for memory efficiency
chunk_overlap=50,
length_function=len,
)
splits = text_splitter.split_documents(documents)
return splits
finally:
os.unlink(tmp_path)
def check_ollama_health():
"""Check if Ollama is running and model is available"""
import requests
try:
response = requests.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5)
if response.status_code == 200:
models = response.json().get('models', [])
model_names = [m['name'] for m in models]
return LLM_MODEL in model_names or any(LLM_MODEL in m for m in model_names)
    except Exception:
pass
return False
def main():
st.title("🤖 Lightweight GenAI Stack")
st.caption(f"Running on {LLM_MODEL} | Memory-efficient RAG with ChromaDB")
# Sidebar
with st.sidebar:
st.header("📁 Document Upload")
# Health check
if check_ollama_health():
st.success(f"✅ Connected to Ollama ({LLM_MODEL})")
else:
st.error("❌ Ollama not ready. Model may still be downloading...")
st.info("Run `docker logs model-puller` to check progress")
uploaded_files = st.file_uploader(
"Upload documents for RAG",
type=['pdf', 'txt', 'md'],
accept_multiple_files=True,
help="Upload PDF, TXT, or Markdown files to chat with"
)
if uploaded_files:
if st.button("📥 Process Documents", type="primary"):
vectorstore = init_vectorstore()
with st.spinner("Processing documents..."):
for uploaded_file in uploaded_files:
docs = process_uploaded_file(uploaded_file)
if docs:
vectorstore.add_documents(docs)
st.success(f"✅ Processed: {uploaded_file.name}")
st.session_state.docs_loaded = True
st.rerun()
st.divider()
# Settings
st.header("⚙️ Settings")
temperature = st.slider("Temperature", 0.0, 1.0, 0.7, 0.1)
use_rag = st.checkbox("Use RAG (if documents loaded)", value=True)
if st.button("🗑️ Clear Chat History"):
st.session_state.messages = []
st.session_state.memory = None
st.rerun()
st.divider()
# System info
st.header("📊 System Info")
st.info(f"""
**Model:** {LLM_MODEL}
**Embeddings:** {EMBEDDING_MODEL}
**Vector DB:** ChromaDB
**Target RAM:** < 6GB
""")
# Initialize session state
if "messages" not in st.session_state:
st.session_state.messages = []
if "memory" not in st.session_state:
        st.session_state.memory = ConversationBufferWindowMemory(
            memory_key="chat_history",
            return_messages=True,
            output_key="answer",  # Tell memory which chain output to store (chain also returns sources)
            k=5  # Keep last 5 exchanges for memory efficiency
        )
# Display chat messages
for message in st.session_state.messages:
with st.chat_message(message["role"]):
st.markdown(message["content"])
# Chat input
if prompt := st.chat_input("Ask me anything..."):
# Add user message
st.session_state.messages.append({"role": "user", "content": prompt})
with st.chat_message("user"):
st.markdown(prompt)
# Generate response
with st.chat_message("assistant"):
with st.spinner("Thinking..."):
try:
llm = init_llm()
# Check if we should use RAG
use_retrieval = use_rag and st.session_state.get('docs_loaded', False)
if use_retrieval:
# RAG mode
vectorstore = init_vectorstore()
retriever = vectorstore.as_retriever(
search_kwargs={"k": 3} # Retrieve top 3 chunks
)
chain = ConversationalRetrievalChain.from_llm(
llm=llm,
retriever=retriever,
memory=st.session_state.memory,
return_source_documents=True,
)
                        result = chain.invoke({"question": prompt})
response = result["answer"]
# Show sources
if result.get("source_documents"):
with st.expander("📚 Sources"):
for i, doc in enumerate(result["source_documents"], 1):
st.markdown(f"**Source {i}:**")
st.markdown(doc.page_content[:300] + "...")
else:
# Direct LLM mode
response = llm.invoke(prompt)
st.markdown(response)
st.session_state.messages.append({"role": "assistant", "content": response})
except Exception as e:
error_msg = f"Error: {str(e)}"
st.error(error_msg)
st.session_state.messages.append({"role": "assistant", "content": error_msg})
if __name__ == "__main__":
main()
With all the files in place, launch the stack:
# Start all services in detached mode
docker compose up --build -d
# Watch the build and startup logs
docker compose logs -f
The model-puller container downloads the AI models. This takes a few minutes:
# Watch model download progress
docker logs -f model-puller
Expected output:
Waiting for Ollama to be ready...
Pulling tinyllama:1.1b model (ultra-lightweight, ~600MB)...
{"status":"pulling manifest"}
{"status":"pulling 2af3b81862c6","total":637699456,"completed":...}
...
{"status":"success"}
Pulling nomic-embed-text for embeddings...
...
{"status":"success"}
Models downloaded successfully!
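You can also confirm that both models landed by querying the Ollama tags endpoint from Python; a small sketch (assumes the requests package and the default port mapping):

```python
import requests

# List the models Ollama has pulled and check that both workshop models are present.
tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
names = [m["name"] for m in tags.get("models", [])]
for wanted in ("tinyllama:1.1b", "nomic-embed-text"):
    print(wanted, "->", "ready" if any(wanted in n for n in names) else "missing")
```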
# Check container status
docker compose ps
Expected output:
NAME IMAGE STATUS PORTS
chromadb chromadb/chroma:latest Up (healthy) 0.0.0.0:8000->8000/tcp
genai-app lightweight-genai-stack-genai-app Up (healthy) 0.0.0.0:8501->8501/tcp
ollama ollama/ollama:latest Up (healthy) 0.0.0.0:11434->11434/tcp
Open your browser and navigate to:
- Main App: http://localhost:8501
- Ollama API: http://localhost:11434
- ChromaDB API: http://localhost:8000/api/v2/heartbeat
- Open http://localhost:8501
- Type a question in the chat input: "What is Docker?"
- The LLM responds directly
- Click "Browse files" in the sidebar
- Upload a PDF, TXT, or Markdown file
- Click "📥 Process Documents"
- Wait for processing to complete
- Ask questions about the document content
The app includes a Learning Mode that visualizes the RAG pipeline in real-time:
- Enable "Show RAG Details" checkbox in the sidebar
- Ask a question about your documents
- Watch the RAG pipeline execute step-by-step:
- Step 1: Query converted to 768-dimensional embedding vector
- Step 2: Similarity search finds top 3 matching chunks
- Step 3: Retrieved context displayed with page numbers
- Step 4: LLM generates response using context
- View timing for each step to understand performance
- Expand "📚 View Source Chunks" to see the exact text used
- Vector Database Stats: Shows total chunks and documents
- Document Details: Expandable list of files with chunk counts
- Clear All Documents: Button to reset ChromaDB
You: What are the main topics covered in this document?
AI: Based on the document, the main topics are... [contextual response]
You: Can you summarize section 3?
AI: Section 3 discusses... [pulls from document chunks]
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Document │────▶│ Chunking │────▶│ Embedding │
│ Upload │ │ (500 chars) │ │ Model │
└──────────────┘ └──────────────┘ └──────┬───────┘
│
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ ChromaDB │◀────│ Vector │◀────│ Store │
│ Storage │ │ Index │ │ Vectors │
└──────┬───────┘ └──────────────┘ └──────────────┘
│
│ On Query:
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ User │────▶│ Embed │────▶│ Similarity │
│ Question │ │ Query │ │ Search │
└──────────────┘ └──────────────┘ └──────┬───────┘
│
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ LLM │◀────│ Context + │◀────│ Top K │
│ Response │ │ Question │ │ Chunks │
└──────────────┘ └──────────────┘ └──────────────┘
1. Document Chunking
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # Characters per chunk
chunk_overlap=50, # Overlap for context continuity
)
Smaller chunks = faster search, less memory, but may lose context.
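To get a feel for that trade-off, split a sample text at different sizes and compare the chunk counts; a small standalone sketch (the sample string is made up):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Compare how chunk_size changes the number of chunks for the same text.
sample = "Docker Compose lets you define multi-container applications. " * 20
for size in (200, 500, 1000):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=50)
    chunks = splitter.split_text(sample)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```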
2. Embedding Generation
embeddings = OllamaEmbeddings(
model="nomic-embed-text", # 137M parameter model
base_url=OLLAMA_BASE_URL,
)
Converts text to 768-dimensional vectors for semantic search.
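You can verify the dimensionality yourself once the models are pulled; a quick sketch run against the local Ollama instance:

```python
from langchain_ollama import OllamaEmbeddings

# Embed a short query and inspect the resulting vector.
emb = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
vector = emb.embed_query("What is Docker?")
print(len(vector))   # 768 dimensions
print(vector[:5])    # first few components
```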
3. Similarity Search
retriever = vectorstore.as_retriever(
search_kwargs={"k": 3} # Return top 3 most similar chunks
)
4. Context-Augmented Generation
chain = ConversationalRetrievalChain.from_llm(
llm=llm,
retriever=retriever,
return_source_documents=True, # Show where answers came from
)
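Putting the four steps together outside of Streamlit, here is a compact sketch using the same components (it assumes Ollama is reachable on localhost, both models are pulled, and a local sample.txt exists):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_ollama import OllamaLLM, OllamaEmbeddings

# 1) chunk the document, 2) embed and store the chunks,
# 3) retrieve the closest matches, 4) hand question plus context to the LLM
text = open("sample.txt").read()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_text(text)

embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
store = Chroma.from_texts(chunks, embedding=embeddings, collection_name="demo")

question = "What is this document about?"
context = "\n\n".join(d.page_content for d in store.similarity_search(question, k=3))

llm = OllamaLLM(model="tinyllama:1.1b", base_url="http://localhost:11434")
print(llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))
```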
# Real-time container stats
docker stats
# Expected output:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT
xxxx ollama 0.50% 1.2GiB / 3GiB
xxxx chromadb 0.10% 256MiB / 512MiB
xxxx genai-app 0.30% 512MiB / 1GiB
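If you prefer scripted monitoring over docker stats, the Docker SDK for Python exposes the same counters; a sketch (assumes pip install docker on the host and the container names used above):

```python
import docker

# Take one stats sample per container and report memory usage against its limit.
client = docker.from_env()
for name in ("ollama", "chromadb", "genai-app"):
    mem = client.containers.get(name).stats(stream=False)["memory_stats"]
    print(f"{name}: {mem['usage'] / 2**20:.0f} MiB / {mem['limit'] / 2**20:.0f} MiB")
```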
# All services
docker compose logs -f
# Specific service
docker compose logs -f ollama
# Last 100 lines
docker compose logs --tail=100 genai-app
# List available models
curl http://localhost:11434/api/tags | jq
# Generate text directly
curl http://localhost:11434/api/generate -d '{
"model": "tinyllama:1.1b",
"prompt": "What is Docker in one sentence?",
"stream": false
}' | jq -r '.response'
# Health check
curl http://localhost:8000/api/v2/heartbeat
# List collections
curl http://localhost:8000/api/v2/collections
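The same service is reachable from Python through the chromadb client; a sketch (assumes chromadb is installed on the host and port 8000 is published as above):

```python
import chromadb

# Connect to the ChromaDB container, check liveness, and count stored chunks.
# Note: if the app stores its vectors locally (see init_vectorstore above),
# this server-side collection may be empty.
client = chromadb.HttpClient(host="localhost", port=8000)
print("heartbeat:", client.heartbeat())
collection = client.get_or_create_collection("documents")
print("chunks in 'documents':", collection.count())
```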
Three utility scripts are included to help debug and understand the RAG pipeline:
docker exec genai-app python /app/chroma_stats.py
Shows total chunks, unique documents, and breakdown per file.
# Single query
docker exec genai-app python /app/rag_query.py "your question here"
# Interactive mode
docker exec -it genai-app python /app/rag_query.py
Returns similarity scores and retrieved chunks - useful for understanding why certain queries work better than others.
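The script itself isn't reproduced here, but a minimal sketch of the kind of lookup it performs (hypothetical code, assuming the same collection and settings as main.py) looks like this:

```python
import os
import sys

from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

# Embed the question, then pull back the closest chunks with their distances.
embeddings = OllamaEmbeddings(
    model=os.getenv("EMBEDDING_MODEL", "nomic-embed-text"),
    base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
)
vectorstore = Chroma(
    collection_name="documents",
    embedding_function=embeddings,
    persist_directory="/app/chroma_db",
)
query = sys.argv[1] if len(sys.argv) > 1 else "What is Docker?"
for doc, score in vectorstore.similarity_search_with_score(query, k=3):
    print(f"distance={score:.3f}  {doc.page_content[:80]}")
```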
docker exec genai-app python /app/test_chroma.py
Runs comprehensive tests including collection stats, sample documents, and test queries.
Edit docker-compose.yml:
# For better quality (needs 8GB RAM)
environment:
- LLM_MODEL=phi3:mini
# For multilingual support
environment:
- LLM_MODEL=qwen2.5:3b
Then update the model-puller command and restart:
docker compose down
docker compose up -d
If you have an NVIDIA GPU available, you can add a device reservation to the Ollama service for GPU acceleration:
services:
ollama:
image: ollama/ollama:latest
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
To scale out the web tier, run multiple replicas of the Streamlit app:
services:
genai-app:
deploy:
replicas: 3
# Add load balancer in front
Add support for .docx files:
- The loader is already in requirements: python-docx
- Add a new condition in process_uploaded_file():
  elif file_extension == 'docx':
      from langchain_community.document_loaders import Docx2txtLoader
      loader = Docx2txtLoader(tmp_path)
- Update the file uploader to accept .docx
Modify the chat to stream responses:
# Instead of:
response = llm.invoke(prompt)
# Use:
for chunk in llm.stream(prompt):
st.write(chunk)
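Streamlit also has a helper built for this pattern; a variant of the loop above (the pinned streamlit==1.40.0 includes st.write_stream, which renders chunks as they arrive and returns the full text):

```python
# Render tokens as they stream in and keep the concatenated response.
response = st.write_stream(llm.stream(prompt))
```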
Create a FastAPI wrapper for the RAG functionality:
- Add fastapi and uvicorn to requirements
- Create api.py with endpoints (a sketch follows below)
- Add a new service in docker-compose.yml
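A minimal sketch of what that api.py could look like (route names and the port are illustrative, not part of the workshop code):

```python
import os

from fastapi import FastAPI
from langchain_ollama import OllamaLLM
from pydantic import BaseModel

app = FastAPI(title="Lightweight GenAI API")
llm = OllamaLLM(
    model=os.getenv("LLM_MODEL", "tinyllama:1.1b"),
    base_url=os.getenv("OLLAMA_BASE_URL", "http://ollama:11434"),
)

class Question(BaseModel):
    prompt: str

@app.post("/ask")
def ask(question: Question):
    # Plain LLM call; wiring in the retriever would mirror the RAG path in main.py.
    return {"answer": llm.invoke(question.prompt)}

# Run with: uvicorn api:app --host 0.0.0.0 --port 8080
```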
Add a button to download chat history as JSON:
import json
if st.button("📥 Export Chat"):
chat_json = json.dumps(st.session_state.messages, indent=2)
st.download_button(
label="Download",
data=chat_json,
file_name="chat_history.json",
mime="application/json"
)
# Stop containers (preserves data)
docker compose down
# Stop and remove all data
docker compose down -v
# Remove built images
docker compose down --rmi all
# Remove unused Docker resources
docker system prune -a
# Check space usage
docker system df
In this workshop you've worked with:
- Docker Compose for multi-container AI applications
- Health checks and service dependencies
- Resource limits for memory-constrained environments
- Init containers for one-time setup tasks
- RAG architecture with LangChain
- Vector databases with ChromaDB
- Local LLM inference with Ollama (tinyllama:1.1b)
- Embeddings with nomic-embed-text (768-dimensional vectors)
- RAG debugging with test scripts (chroma_stats.py, rag_query.py)
| Concept | Docker Feature |
|---|---|
| Service orchestration | docker-compose.yml |
| Startup order | depends_on + condition: service_healthy |
| Memory management | deploy.resources.limits |
| Data persistence | Named volumes |
| One-time tasks | Init container pattern |
| Health monitoring | healthcheck directive |
- Explore the Docker GenAI Stack for production features
- Add authentication with Streamlit-Authenticator
- Implement document versioning
- Add support for more file types (HTML, CSV, etc.)
- Deploy to cloud with Docker Swarm or Kubernetes
- Ollama Documentation
- LangChain Documentation
- ChromaDB Documentation
- Streamlit Documentation
- Docker Compose Specification
Problem: the app reports that the model is not found or still downloading.
Solution:
# Check if model is downloading
docker logs model-puller
# Manually pull the model
docker exec ollama ollama pull tinyllama:1.1b
Problem: the stack uses too much memory, or containers are killed for exceeding their limits.
Solution:
# Use smaller model
docker exec ollama ollama pull tinyllama:1.1b
# Reduce container limits in docker-compose.yml
# Restart with: docker compose down && docker compose up -d
Problem: responses are very slow.
Solution:
- Use GPU if available
- Reduce num_ctx in the LLM configuration
- Use smaller/quantized models
- Increase container memory limits
Problem: answers ignore the documents you uploaded (RAG not working).
Solution:
- Check if documents were processed (look for success message)
- Verify ChromaDB is running: curl localhost:8000/api/v2/heartbeat
- Try reprocessing documents
- Clear ChromaDB: docker compose down -v and restart
Happy Building! 🚀