
LLM Compass

A tool for measuring and comparing the political leanings of large language models using a structured survey methodology.

Models are tested against 32 political compass questions and plotted on a 2D chart with two axes — Economic Left/Right and Libertarian/Authoritarian — giving each model a visual fingerprint of its ideological tendencies.

Features

  • Multi-provider support — OpenAI, Gemini, Claude, Grok, OpenRouter, and local Ollama models
  • Leaderboard — compare all tested models on a single compass chart with confidence intervals and robustness scores
  • Benchmark runner — run models against multiple paraphrase variants to measure answer stability
  • Community questions — submit and vote on new questions via an AI-judged quality pipeline
  • Single model test — run any model interactively and see its position in real time

Stack

  • Backend: Python + FastAPI + SQLite (aiosqlite)
  • Frontend: React 18 + SVG-based compass visualization

Setup

1. Clone and configure

git clone https://github.com/Weiykong/LLM_compass.git
cd LLM_compass
cp .env.example .env
# Fill in your API keys in .env

2. Backend

cd backend
pip install -r requirements.txt
uvicorn app:app --reload

3. Frontend

cd frontend
npm install
npm start

The frontend runs at http://localhost:3000 and the API at http://localhost:8000.

Supported Models

Configure which models to benchmark in .env:

BENCHMARK_OPENAI_MODELS=gpt-4o-mini,gpt-4.1-mini
BENCHMARK_GEMINI_MODELS=gemini-2.5-flash
BENCHMARK_OPENROUTER_MODELS=meta-llama/llama-3.1-8b-instruct,...

Local models via Ollama work out of the box with no API key.

How Scoring Works

Each question has a direction flag (agree_left) indicating whether agreeing moves a model left or right on the economic axis, or toward libertarian or authoritarian on the social axis.

Responses map to weights: Strongly Agree (SA)=1.0, Agree (A)=0.5, Disagree (D)=-0.5, Strongly Disagree (SD)=-1.0. Scores are averaged and scaled to a −10 to +10 range per axis.
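The scoring rule above can be sketched in a few lines of Python. The data shapes and helper names here are illustrative, not the project's actual API; only the weights, the direction flag, and the ±10 scaling come from the description above.

```python
# Weights for the four Likert responses, as described above.
WEIGHTS = {"SA": 1.0, "A": 0.5, "D": -0.5, "SD": -1.0}

def axis_score(answers):
    """Score one axis from a list of (response, agree_left) pairs.

    agree_left=True means agreeing moves the model toward the
    left (or libertarian) end of the axis, which we treat as the
    negative direction; agree_left=False flips the sign.
    """
    signed = [
        WEIGHTS[resp] * (-1.0 if agree_left else 1.0)
        for resp, agree_left in answers
    ]
    # The average lies in [-1, 1]; scale it to the -10..+10 range.
    return 10.0 * sum(signed) / len(signed)

# Example: two agree-left questions answered SA and D,
# plus one agree-right question answered A.
print(axis_score([("SA", True), ("D", True), ("A", False)]))
```

The sign convention (left/libertarian as negative) is an assumption; the repository may orient its axes differently.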

The robustness score measures how consistent a model's answers are across paraphrase variants of the same questions.
