A local, privacy-focused desktop chat application powered by Qwen2.5-3B-Instruct-AWQ with vLLM acceleration.
Blog: https://medium.com/@harshitweb3/building-a-fully-private-gpt-549c0935d307
Download zip: https://mega.nz/file/L4RSDbqI#RbjqmLsRxVCZwUrwYBqh76VDXwODw5DkcGvuOuNIUU8
- 100% Local & Private - All data stays on your machine
- Chat Interface - Modern PyQt6-based UI with real-time token streaming
- Powered by vLLM - AWQ Marlin quantization for maximum efficiency
- Low VRAM Optimized - Runs on 4GB+ VRAM GPUs
- 2K-4K Context Window - Balanced performance for low-end hardware
- Fast Generation - 66-378 tokens/sec on RTX 5060 Laptop
- Apache 2.0 Licensed Model - Commercial use ready
Hardware:
- NVIDIA GPU with 4GB+ VRAM (GTX 1650, RTX 3050, RTX 4060, RTX 5060, etc.)
- CUDA-capable drivers (vLLM bundles CUDA runtime)
- 8GB+ system RAM
- 5GB disk space for model
Software:
- Python 3.10+
- Linux (tested on Ubuntu 22.04+)
- NVIDIA drivers 525+ (for CUDA 12 support)
```bash
cd private-gpt-app

# Install all dependencies
uv sync

# Run the application (dev mode)
uv run python run.py --dev

# Or without dev mode
uv run python run.py
```

On first launch, the app will:
- Check your GPU (4GB VRAM minimum)
- Auto-download Qwen2.5-3B-Instruct-AWQ from HuggingFace (~2.7GB)
- Load model with vLLM using AWQ Marlin kernels
- Ready to chat!
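The minimum-VRAM check can be done by querying `nvidia-smi`. The app's actual GPU check lives in `src/private_gpt_app/utils/`, so treat this standalone sketch as an assumption about the approach, not the app's exact code:

```python
import subprocess

def parse_vram_mib(smi_output: str) -> int:
    """Parse total VRAM (MiB) for GPU 0 from nvidia-smi's CSV output."""
    return int(smi_output.strip().splitlines()[0])

def query_total_vram_mib() -> int:
    """Ask nvidia-smi for GPU 0's total memory, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_mib(out)

def meets_minimum(vram_mib: int, minimum_mib: int = 3800) -> bool:
    # A "4GB" card often reports slightly under 4096 MiB, so leave a margin.
    return vram_mib >= minimum_mib
```

The margin in `meets_minimum` matters in practice: a GTX 1650 may report ~3910 MiB, which a strict `>= 4096` check would wrongly reject.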
Model Info:
- Model: Qwen/Qwen2.5-3B-Instruct-AWQ (Apache 2.0 license)
- Size: 2.7GB download, 1.93GB loaded
- Context: 2048 tokens (~1500 words)
- VRAM Usage: ~3.1GB total (model + KV cache)
- Automatically cached in `models/Qwen2.5-3B-Instruct-AWQ/`
The app provides a clean chat interface with:
- Real-time token streaming
- Message bubbles with user/assistant distinction
- Auto-scrolling to latest messages
- Responsive PyQt6 design
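Real-time streaming reduces to re-rendering the message after each token arrives instead of waiting for the full reply. A Qt-free sketch of the producer/consumer pattern (the real app wires this through vLLM and qasync; the names here are illustrative):

```python
import queue
import threading

def stream_tokens(tokens, out_q):
    """Producer: push tokens one at a time, then a None sentinel."""
    for tok in tokens:
        out_q.put(tok)
    out_q.put(None)

def consume(out_q, on_update):
    """Consumer: rebuild the message after every token, as a chat bubble would."""
    parts = []
    while True:
        tok = out_q.get()
        if tok is None:
            break
        parts.append(tok)
        on_update("".join(parts))  # the UI would set the bubble text here
    return "".join(parts)

q = queue.Queue()
threading.Thread(target=stream_tokens, args=(["Hel", "lo", ", wor", "ld"], q)).start()
updates = []
final = consume(q, updates.append)
print(final)
```

In the actual UI the `on_update` callback would also trigger the auto-scroll to the latest message.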
Edit `src/private_gpt_app/ui/main_window.py` to adjust VRAM/context trade-offs:

```python
VLLMService(
    gpu_memory_utilization=0.55,  # 4GB VRAM: 0.55 | 6GB: 0.65 | 8GB+: 0.70
    max_model_len=2048,           # 4GB VRAM: 2048 | 6GB: 3072 | 8GB+: 4096
    cpu_offload_gb=2.0,           # Offload 2GB to system RAM for headroom
)
```

GPU Compatibility Matrix:
- 4GB VRAM (GTX 1650, RTX 3050 4GB): 0.55 utilization, 2K context
- 6GB VRAM (RTX 3060 Mobile, RTX 2060): 0.65 utilization, 3K context
- 8GB+ VRAM (RTX 3070, RTX 4060, RTX 5060): 0.70 utilization, 4K context
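The matrix above maps directly to a small selection function. A hypothetical helper (not part of the app) that picks settings from reported VRAM, with thresholds left slightly under the nominal sizes because cards report a little less than advertised:

```python
def vllm_settings(vram_mib: int) -> dict:
    """Map reported VRAM (MiB) to the README's recommended vLLM settings."""
    if vram_mib >= 7500:   # 8GB+ class (RTX 3070, RTX 4060, RTX 5060)
        return {"gpu_memory_utilization": 0.70, "max_model_len": 4096}
    if vram_mib >= 5500:   # 6GB class (RTX 3060 Mobile, RTX 2060)
        return {"gpu_memory_utilization": 0.65, "max_model_len": 3072}
    # 4GB class (GTX 1650, RTX 3050 4GB) - the safe default
    return {"gpu_memory_utilization": 0.55, "max_model_len": 2048}
```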
```
private-gpt-app/
├── src/private_gpt_app/
│   ├── ui/                  # PyQt6 interface (main_window, chat_widget, message_bubble)
│   ├── backend/             # vLLM service with AWQ quantization
│   └── utils/               # GPU monitoring
├── data/
│   ├── faiss_index/         # (Future) Local vector database
│   └── crash_recovery/      # (Future) Auto-save temp files
├── models/
│   └── Qwen2.5-3B-Instruct-AWQ/  # Downloaded model files
└── docs/                    # Documentation
```
```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=private_gpt_app
```

When running in development mode, QSS stylesheets auto-reload on file changes.
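A minimal sketch of how such auto-reload can work, via mtime polling (whether the app uses polling or filesystem events is an assumption; `QssWatcher` is a hypothetical name):

```python
import os
import tempfile

class QssWatcher:
    """Detect changes to a .qss stylesheet by polling its modification time."""

    def __init__(self, path: str):
        self.path = path
        self.last = os.path.getmtime(path)

    def changed(self) -> bool:
        now = os.path.getmtime(self.path)
        if now != self.last:
            self.last = now
            return True
        return False

# quick demonstration with a throwaway file
_path = tempfile.NamedTemporaryFile(suffix=".qss", delete=False).name
watcher = QssWatcher(_path)
unchanged = watcher.changed()   # False: nothing touched yet
os.utime(_path, (0, 12345))     # simulate an edit by bumping the mtime
edited = watcher.changed()      # True: mtime differs from the last poll
os.unlink(_path)
```

In the app, a timer would call `changed()` periodically and re-apply the stylesheet on `True`.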
```bash
# Profile VRAM usage
uv run python -m memory_profiler src/private_gpt_app/main.py
```

The app is currently configured for maximum compatibility (4GB+ VRAM).
To adjust for your specific GPU, edit `src/private_gpt_app/ui/main_window.py`:

```python
# 4GB VRAM (current default)
gpu_memory_utilization=0.55
max_model_len=2048

# 6GB VRAM (recommended for better context)
gpu_memory_utilization=0.65
max_model_len=3072

# 8GB+ VRAM (optimal performance)
gpu_memory_utilization=0.70
max_model_len=4096
```

Trade-offs:
- Lower utilization = more stable, less risk of OOM
- Higher context = better conversation memory, more VRAM usage
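The context/VRAM trade-off is easy to quantify for the raw KV cache. Taking Qwen2.5-3B's dimensions as 36 layers and 2 KV heads of head_dim 128 with fp16 values (assumed here; verify against the model's `config.json`), each token costs a fixed number of bytes, so doubling context doubles the cache. Note that vLLM pre-allocates cache blocks up to `gpu_memory_utilization`, so the memory actually reserved is larger than this minimum:

```python
# Assumed Qwen2.5-3B dims (check config.json): 36 layers, 2 KV heads, head_dim 128
LAYERS, KV_HEADS, HEAD_DIM, FP16_BYTES = 36, 2, 128, 2

def kv_cache_mib(context_len: int) -> float:
    """Minimum KV-cache size in MiB for one sequence of the given length."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES  # K and V planes
    return context_len * per_token / (1024 ** 2)

print(kv_cache_mib(2048))  # 72.0 MiB at the default 2K context
print(kv_cache_mib(4096))  # 144.0 MiB at 4K
```

The small per-token cost is a consequence of Qwen2.5's grouped-query attention (few KV heads), which is part of why the whole stack fits in ~3.1GB.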
If you see OOM (Out of Memory) errors:
- Reduce
gpu_memory_utilizationfrom 0.55 to 0.50 - Reduce
max_model_lenfrom 2048 to 1536 - Increase
cpu_offload_gbfrom 2.0 to 3.0 - Close other GPU applications (Chrome, games, etc.)
Check current GPU usage:

```bash
nvidia-smi
```

The model auto-downloads from HuggingFace on first run. If interrupted:
- Delete `models/Qwen2.5-3B-Instruct-AWQ/`
- Restart the app - the download will resume
If the app crashes and the GPU is still occupied:

```bash
pkill -9 -f "python.*run.py"
```

Ensure qasync is properly initialized. Check terminal output for vLLM errors.
MIT License - See LICENSE file for details
- Qwen team for the efficient 3B Instruct model
- vLLM team for the high-performance inference engine
- AWQ team for the quantization method
- PyQt6 for the cross-platform UI framework