
🦥 Quansloth: TurboQuant Local AI Server

   ____                         _       _   _     
  / __ \                       | |     | | | |    
 | |  | |_   _  __ _ _ __   ___| | ___ | |_| |__  
 | |  | | | | |/ _` | '_ \ / __| |/ _ \| __| '_ \ 
 | |__| | |_| | (_| | | | |\__ \ | (_) | |_| | | |
  \___\_\\__,_|\__,_|_| |_||___/_|\___/ \__|_| |_|
         [ POWERED BY TURBOQUANT+ | NVIDIA CUDA ]

License: Apache 2.0 | Platform: Linux / WSL2 | Backend: CUDA

Breaking the VRAM Wall: Built on an implementation of Google's TurboQuant (ICLR 2026), Quansloth brings elite KV cache compression to local LLM inference.

Quansloth is a fully private, air-gapped AI server that runs massive-context models natively on consumer hardware (such as an RTX 3060). By bridging a custom Gradio Python frontend with a highly optimized llama.cpp CUDA backend, Quansloth achieves extreme memory compression, cutting KV cache VRAM usage by up to 75%.

🛑 Why Quansloth? (No More GPU Crashes)

Standard LLM inference often hits a "Memory Wall" when processing long documents; as the context grows, the GPU runs out of memory (OOM) and the system crashes.

Quansloth prevents these crashes by:

  • 75% Cache Shrink: Compressing the model's "memory" (the KV cache) from 16-bit to 4-bit precision with TurboQuant.
  • Massive Context on Budget GPUs: Run 32k+ token contexts on a 6GB RTX 3060 that would normally require a 24GB RTX 4090.
  • Hardware-Level Stability: The interface monitors the CUDA backend to keep the model within your GPU's physical limits, enabling stable, long-form document analysis without fear of a system hang.
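
The 75% figure follows directly from the bit widths: a 4-bit cache is one quarter the size of a 16-bit one. A back-of-envelope sketch (the layer/head counts below are assumptions, roughly matching a Llama 3 8B-class model; the exact numbers depend on the GGUF you load):

```python
def kv_cache_bytes(context_len, n_layers=32, n_kv_heads=8, head_dim=128, bits=16):
    # Two tensors (K and V) per layer, each context_len x n_kv_heads x head_dim.
    elems = 2 * n_layers * context_len * n_kv_heads * head_dim
    return elems * bits // 8

fp16 = kv_cache_bytes(32_768, bits=16)
q4 = kv_cache_bytes(32_768, bits=4)
print(f"FP16 KV cache @ 32k ctx: {fp16 / 2**30:.1f} GiB")
print(f"4-bit KV cache @ 32k ctx: {q4 / 2**30:.1f} GiB ({100 * (1 - q4 / fp16):.0f}% saved)")
```

For these assumed dimensions, a 32k context costs about 4 GiB of KV cache at FP16 but only about 1 GiB at 4-bit, which is the difference between fitting on a 6GB card and not.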

📸 Interface Preview

Interface


🖥️ OS Compatibility

  • Windows 10/11: Fully Supported (via WSL2 Ubuntu). Features a 1-click .bat launcher.
  • Linux: Fully Supported (Native).
  • macOS: Not officially supported (the backend is optimized for NVIDIA CUDA GPUs).

✨ Features

  • TurboQuant Cache Compression: Run 8,192+ token contexts natively on 6GB GPUs without Out-Of-Memory (OOM) crashes.
  • Live Hardware Analytics: The UI intercepts the C++ engine logs to report your exact VRAM allocation and savings in real time.
  • Context Injector: Upload long documents (PDF, TXT, CSV, MD) directly into the chat stream to test the AI's memory limits.
  • Dual-Routing: Auto-scan your local models/ folder, or input custom absolute paths to load any .gguf file.
  • Cyberpunk UI: A sleek, fully responsive dark-mode dashboard built for power users.

🛠️ Prerequisites

  • Windows with WSL2 (Ubuntu) OR native Linux
  • NVIDIA GPU with updated drivers
  • Miniconda or Anaconda installed
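
Before installing, you can sanity-check that the required tools are on your PATH. A minimal sketch (the tool names are taken from the list above; adjust for your setup):

```shell
# Report whether each prerequisite tool is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "MISSING: $1"
  fi
}

for t in nvidia-smi conda python3; do
  check_tool "$t"
done
```

If `nvidia-smi` is missing inside WSL2, update your Windows NVIDIA driver; recent drivers expose the GPU to WSL2 automatically.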

🚀 Installation

1. Prepare Python Environment

conda create -n quansloth python=3.10 -y
conda activate quansloth

2. Clone Repository and Requirements

git clone https://github.com/PacifAIst/Quansloth.git
cd Quansloth
pip install -r requirements.txt

3. Run Installer

chmod +x install.sh
./install.sh

🎮 Usage

Adding Models

Download .gguf models (e.g., Llama 3 8B) and place them in:

models/
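
Quansloth auto-scans this folder (see the Dual-Routing feature). A minimal sketch of such a scan, assuming a flat, non-recursive layout; the real logic in `quansloth_gui.py` may differ:

```python
from pathlib import Path

def scan_models(models_dir="models"):
    """Return the sorted .gguf filenames found directly under models_dir."""
    root = Path(models_dir)
    if not root.is_dir():
        return []  # Folder absent: nothing to offer in the dropdown.
    return sorted(p.name for p in root.glob("*.gguf"))
```

Files with other extensions are ignored, so notes or checksums can live alongside the models.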

Start Server (Windows - 1 Click)

  • Use Launch_Quansloth.bat
  • Double-click → auto-launches WSL, Conda, and the server

Start Server (Linux / WSL)

conda activate quansloth
python quansloth_gui.py

Connect

Open http://127.0.0.1:7860 in your browser.

🎛️ Pro Tips

  • Symmetric (Turbo3) → Best overall compression
  • Asymmetric (Q8/Turbo4) → Better for Q4_K_M models
  • Monitor Hardware Stats for real-time VRAM savings
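
The symmetric/asymmetric distinction above is the standard quantization trade-off: symmetric schemes fix the zero-point at 0 and only store a scale, while asymmetric schemes also store a zero-point so the quantization grid can follow a shifted value range. A generic illustration with NumPy (this is not the TurboQuant kernel itself, just the underlying idea):

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    # Zero-point fixed at 0; scale chosen from the max magnitude.
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(float(np.abs(x).max()) / qmax, 1e-12)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_symmetric(q, scale):
    return q.astype(np.float32) * scale

def quantize_asymmetric(x, bits=4):
    # Scale and zero-point chosen from the actual [min, max] range.
    qmax = 2 ** bits - 1  # 15 for 4-bit
    lo, hi = float(x.min()), float(x.max())
    scale = max((hi - lo) / qmax, 1e-12)
    zero = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_asymmetric(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale
```

For roughly zero-centered activations, symmetric quantization wastes no levels and is simpler; for skewed ranges, the asymmetric zero-point keeps the grid aligned with the data, which is why the two modes suit different model quantizations.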

📜 License & Credits

  • License: This project is licensed under the Apache 2.0 License.
  • Core Technology: Built upon the TurboQuant+ implementation developed by TheTom (@TheTom).
  • Research & Algorithms: The underlying algorithm is based on research from Google Research (arXiv:2504.19874).
  • CUDA Kernels: Special thanks to Gabe Ortiz (signalnine) for porting the CUDA kernels.

👤 Author: Dr. Manuel Herrador 📧 mherrador@ujaen.es
University of Jaén (UJA), Spain


Made with ❤️ for the Local AI Community by PacifAIst
