A local, privacy-focused desktop chat application powered by Qwen2.5-3B-Instruct-AWQ with vLLM acceleration.
Blog: https://medium.com/@harshitweb3/building-a-fully-private-gpt-549c0935d307
Download zip: https://mega.nz/file/L4RSDbqI#RbjqmLsRxVCZwUrwYBqh76VDXwODw5DkcGvuOuNIUU8
- 100% Local & Private - All data stays on your machine
- Chat Interface - Modern PyQt6-based UI with real-time token streaming
- Powered by vLLM - AWQ Marlin quantization for maximum efficiency
- Low VRAM Optimized - Runs on 4GB+ VRAM GPUs
- 2K-4K Context Window - Balanced performance for low-end hardware
- Fast Generation - 66-378 tokens/sec on RTX 5060 Laptop
- Apache 2.0 Licensed Model - Commercial use ready
Hardware:
- NVIDIA GPU with 4GB+ VRAM (GTX 1650, RTX 3050, RTX 4060, RTX 5060, etc.)
- CUDA-capable drivers (vLLM bundles CUDA runtime)
- 8GB+ system RAM
- 5GB disk space for model
Software:
- Python 3.10+
- Linux (tested on Ubuntu 22.04+)
- NVIDIA drivers 525+ (for CUDA 12 support)
```bash
cd private-gpt-app

# Install all dependencies
uv sync

# Run the application (dev mode)
uv run python run.py --dev

# Or without dev mode
uv run python run.py
```

On first launch, the app will:
- Check your GPU (4GB VRAM minimum)
- Auto-download Qwen2.5-3B-Instruct-AWQ from HuggingFace (~2.7GB)
- Load model with vLLM using AWQ Marlin kernels
- Ready to chat!
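The minimum-VRAM check can be done by querying `nvidia-smi`. The app's actual GPU check lives in `src/private_gpt_app/utils/`, so treat this standalone sketch as an assumption about the approach, not the app's exact code:

```python
import subprocess

def parse_vram_mib(smi_output: str) -> int:
    """Parse total VRAM (MiB) for GPU 0 from nvidia-smi's CSV output."""
    return int(smi_output.strip().splitlines()[0])

def query_total_vram_mib() -> int:
    """Ask nvidia-smi for GPU 0's total memory, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_mib(out)

def meets_minimum(vram_mib: int, minimum_mib: int = 3800) -> bool:
    # A "4GB" card often reports slightly under 4096 MiB, so leave a margin.
    return vram_mib >= minimum_mib
```

The margin in `meets_minimum` matters in practice: a GTX 1650 may report ~3910 MiB, which a strict `>= 4096` check would wrongly reject.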
Model Info:
- Model: Qwen/Qwen2.5-3B-Instruct-AWQ (Apache 2.0 license)
- Size: 2.7GB download, 1.93GB loaded
- Context: 2048 tokens (~1500 words)
- VRAM Usage: ~3.1GB total (model + KV cache)
- Automatically cached in `models/Qwen2.5-3B-Instruct-AWQ/`
The app provides a clean chat interface with:
- Real-time token streaming
- Message bubbles with user/assistant distinction
- Auto-scrolling to latest messages
- Responsive PyQt6 design
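Real-time streaming reduces to re-rendering the message after each token arrives instead of waiting for the full reply. A Qt-free sketch of the producer/consumer pattern (the real app wires this through vLLM and qasync; the names here are illustrative):

```python
import queue
import threading

def stream_tokens(tokens, out_q):
    """Producer: push tokens one at a time, then a None sentinel."""
    for tok in tokens:
        out_q.put(tok)
    out_q.put(None)

def consume(out_q, on_update):
    """Consumer: rebuild the message after every token, as a chat bubble would."""
    parts = []
    while True:
        tok = out_q.get()
        if tok is None:
            break
        parts.append(tok)
        on_update("".join(parts))  # the UI would set the bubble text here
    return "".join(parts)

q = queue.Queue()
threading.Thread(target=stream_tokens, args=(["Hel", "lo", ", wor", "ld"], q)).start()
updates = []
final = consume(q, updates.append)
print(final)
```

In the actual UI the `on_update` callback would also trigger the auto-scroll to the latest message.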
Edit `src/private_gpt_app/ui/main_window.py` to adjust VRAM/context trade-offs:

```python
VLLMService(
    gpu_memory_utilization=0.55,  # 4GB VRAM: 0.55 | 6GB: 0.65 | 8GB+: 0.70
    max_model_len=2048,           # 4GB VRAM: 2048 | 6GB: 3072 | 8GB+: 4096
    cpu_offload_gb=2.0,           # Offload 2GB to system RAM for headroom
)
```

GPU Compatibility Matrix:
- 4GB VRAM (GTX 1650, RTX 3050 4GB): 0.55 utilization, 2K context
- 6GB VRAM (RTX 3060 Mobile, RTX 2060): 0.65 utilization, 3K context
- 8GB+ VRAM (RTX 3070, RTX 4060, RTX 5060): 0.70 utilization, 4K context
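The matrix above maps directly to a small selection function. A hypothetical helper (not part of the app) that picks settings from reported VRAM, with thresholds left slightly under the nominal sizes because cards report a little less than advertised:

```python
def vllm_settings(vram_mib: int) -> dict:
    """Map reported VRAM (MiB) to the README's recommended vLLM settings."""
    if vram_mib >= 7500:   # 8GB+ class (RTX 3070, RTX 4060, RTX 5060)
        return {"gpu_memory_utilization": 0.70, "max_model_len": 4096}
    if vram_mib >= 5500:   # 6GB class (RTX 3060 Mobile, RTX 2060)
        return {"gpu_memory_utilization": 0.65, "max_model_len": 3072}
    # 4GB class (GTX 1650, RTX 3050 4GB) - the safe default
    return {"gpu_memory_utilization": 0.55, "max_model_len": 2048}
```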
```
private-gpt-app/
├── src/private_gpt_app/
│   ├── ui/                  # PyQt6 interface (main_window, chat_widget, message_bubble)
│   ├── backend/             # vLLM service with AWQ quantization
│   └── utils/               # GPU monitoring
├── data/
│   ├── faiss_index/         # (Future) Local vector database
│   └── crash_recovery/      # (Future) Auto-save temp files
├── models/
│   └── Qwen2.5-3B-Instruct-AWQ/  # Downloaded model files
└── docs/                    # Documentation
```
```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=private_gpt_app
```

When running in development mode, QSS stylesheets auto-reload on file changes.
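A minimal sketch of how such auto-reload can work, via mtime polling (whether the app uses polling or filesystem events is an assumption; `QssWatcher` is a hypothetical name):

```python
import os
import tempfile

class QssWatcher:
    """Detect changes to a .qss stylesheet by polling its modification time."""

    def __init__(self, path: str):
        self.path = path
        self.last = os.path.getmtime(path)

    def changed(self) -> bool:
        now = os.path.getmtime(self.path)
        if now != self.last:
            self.last = now
            return True
        return False

# quick demonstration with a throwaway file
_path = tempfile.NamedTemporaryFile(suffix=".qss", delete=False).name
watcher = QssWatcher(_path)
unchanged = watcher.changed()   # False: nothing touched yet
os.utime(_path, (0, 12345))     # simulate an edit by bumping the mtime
edited = watcher.changed()      # True: mtime differs from the last poll
os.unlink(_path)
```

In the app, a timer would call `changed()` periodically and re-apply the stylesheet on `True`.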
```bash
# Profile VRAM usage
uv run python -m memory_profiler src/private_gpt_app/main.py
```

The app is currently configured for maximum compatibility (4GB+ VRAM).
To adjust for your specific GPU, edit `src/private_gpt_app/ui/main_window.py`:

```python
# 4GB VRAM (current default)
gpu_memory_utilization=0.55
max_model_len=2048

# 6GB VRAM (recommended for better context)
gpu_memory_utilization=0.65
max_model_len=3072

# 8GB+ VRAM (optimal performance)
gpu_memory_utilization=0.70
max_model_len=4096
```

Trade-offs:
- Lower utilization = more stable, less risk of OOM
- Higher context = better conversation memory, more VRAM usage
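The context/VRAM trade-off is easy to quantify for the raw KV cache. Taking Qwen2.5-3B's dimensions as 36 layers and 2 KV heads of head_dim 128 with fp16 values (assumed here; verify against the model's `config.json`), each token costs a fixed number of bytes, so doubling context doubles the cache. Note that vLLM pre-allocates cache blocks up to `gpu_memory_utilization`, so the memory actually reserved is larger than this minimum:

```python
# Assumed Qwen2.5-3B dims (check config.json): 36 layers, 2 KV heads, head_dim 128
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 36, 2, 128, 2

def kv_cache_mib(context_len: int) -> float:
    """Minimum KV-cache size in MiB for one sequence of the given length."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES  # K and V planes
    return context_len * per_token / (1024 ** 2)

print(kv_cache_mib(2048))  # 72.0 MiB at the default 2K context
print(kv_cache_mib(4096))  # 144.0 MiB at 4K
```

The small per-token cost is a consequence of Qwen2.5's grouped-query attention (few KV heads), which is part of why the whole stack fits in ~3.1GB.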
If you see OOM (Out of Memory) errors:
- Reduce
gpu_memory_utilizationfrom 0.55 to 0.50 - Reduce
max_model_lenfrom 2048 to 1536 - Increase
cpu_offload_gbfrom 2.0 to 3.0 - Close other GPU applications (Chrome, games, etc.)
Check current GPU usage:

```bash
nvidia-smi
```

The model auto-downloads from HuggingFace on first run. If interrupted:
- Delete `models/Qwen2.5-3B-Instruct-AWQ/`
- Restart the app - the download will resume
If the app crashes and the GPU is still occupied:

```bash
pkill -9 -f "python.*run.py"
```

Ensure qasync is properly initialized. Check terminal output for vLLM errors.
MIT License - See LICENSE file for details
- Qwen team for the efficient 3B Instruct model
- vLLM team for the high-performance inference engine
- AWQ team for the quantization method
- PyQt6 for the cross-platform UI framework