Self-hosted eval runner for AI agent skills.
Measure whether your skills actually work — on your own infrastructure.
Alpha software — Arctiq is in early development. Expect breaking changes. Contributors are welcome!
Arctiq helps skill authors and teams evaluate AI agent skills by running structured test suites against LLM providers and scoring the results with a judge model. Import skills from Git repositories, run evals, and compare results — all without sending data to third-party services.
```yaml
# docker-compose.yml
services:
  api:
    image: ghcr.io/vdekercd/arctiq/api:latest
    volumes:
      - arctiq-data:/data
    environment:
      - ASPNETCORE_ENVIRONMENT=Production
      - ConnectionStrings__DefaultConnection=Data Source=/data/arctiq.db
      - Cors__AllowedOrigins__0=http://localhost:3000
      - ARCTIQ_MASTER_KEY=${ARCTIQ_MASTER_KEY}
    restart: unless-stopped
  ui:
    image: ghcr.io/vdekercd/arctiq/ui:latest
    ports:
      - "3000:80"
    depends_on:
      - api
    restart: unless-stopped
volumes:
  arctiq-data:
```

```bash
# Generate an encryption key and start Arctiq
export ARCTIQ_MASTER_KEY=$(openssl rand -hex 32)
docker compose up -d
```

Open http://localhost:3000 and you're ready to go.
- Skill Management — Import skills from Git repos (GitHub, GitLab, Bitbucket, Azure DevOps). Edit instructions in-browser with full version history and diffs.
- Eval Execution — Run test suites against any LLM provider. Configure skill version, model, temperature, and a separate judge model for scoring assertions.
- Run Comparisons — Compare runs side-by-side: skill v1 vs v2, skill-on vs baseline. View pass rates, cost breakdowns, and output diffs.
- Multi-Provider — Supports OpenAI, Anthropic, Google Gemini, Mistral, Azure OpenAI, and Ollama.
- Self-Hosted — Runs entirely on your infrastructure. API keys are encrypted at rest (AES-256-GCM). No data leaves your network.
Environment variables for the API container:
| Variable | Description | Default |
|---|---|---|
| `ConnectionStrings__DefaultConnection` | Database connection string | `Data Source=/data/arctiq.db` |
| `ARCTIQ_MASTER_KEY` | 32-byte hex key for encrypting API keys (`openssl rand -hex 32`) | (none) |
| `Cors__AllowedOrigins__0` | Allowed origin for CORS | `http://localhost:3000` |
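As a quick sanity check before starting the containers, the sketch below tests whether a candidate `ARCTIQ_MASTER_KEY` decodes to exactly 32 bytes, which is what `openssl rand -hex 32` produces (64 hex characters). The helper name `is_valid_master_key` is hypothetical, and whether Arctiq applies an equivalent check at startup is an assumption.

```python
import secrets


def is_valid_master_key(value: str) -> bool:
    """Return True if value is 64 hex characters that decode to 32 bytes.

    Hypothetical helper for illustration; not part of Arctiq.
    """
    try:
        decoded = bytes.fromhex(value)
    except ValueError:
        # Raised for non-hex characters or an odd number of hex digits.
        return False
    # bytes.fromhex tolerates whitespace, so check the raw length as well.
    return len(value) == 64 and len(decoded) == 32


# secrets.token_hex(32) yields 64 hex characters, like `openssl rand -hex 32`.
example_key = secrets.token_hex(32)
print(is_valid_master_key(example_key))
```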
```bash
# API
cd src/Arctiq.API
dotnet run

# UI (separate terminal)
cd src/Arctiq.UI
npm install
npm run dev
```

The UI runs on http://localhost:3000 and proxies /api to the backend.
| Layer | Technology |
|---|---|
| Backend | C# / .NET 10, FastEndpoints |
| Frontend | React, TypeScript, Vite |
| Database | SQLite |
| Deployment | Docker Compose |
Arctiq discovers evals from a file named evals/evals.json inside each skill's directory in the Git repository.
```
your-skill-repo/
└── your-skill/
    ├── skill.md          # skill instructions
    └── evals/
        └── evals.json    # eval test cases
```
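To check locally which skills in a cloned repository would be picked up, a minimal sketch of the discovery rule might look like the following. The function name `find_eval_files` is hypothetical, and searching at any depth below the repository root is an assumption; Arctiq's actual lookup may be stricter.

```python
from pathlib import Path


def find_eval_files(repo_root: Path) -> dict[str, Path]:
    """Map each skill directory (relative to repo_root) to its evals/evals.json.

    Hypothetical helper; assumes skills may live at any depth in the repo.
    """
    found: dict[str, Path] = {}
    for eval_file in sorted(repo_root.glob("**/evals/evals.json")):
        # evals.json sits in <skill>/evals/, so the skill is two levels up.
        skill_dir = eval_file.parent.parent
        found[skill_dir.relative_to(repo_root).as_posix()] = eval_file
    return found
```

Calling `find_eval_files(Path("your-skill-repo"))` on the layout above would return a single entry keyed by `your-skill`.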
```json
{
  "evals": [
    {
      "id": 1,
      "name": "optional case name",
      "prompt": "The input prompt sent to the model",
      "assertions": [
        { "id": "a1", "text": "Response is concise and under 100 words", "weight": 1.0 },
        { "id": "a2", "text": "Answer mentions the correct library name", "weight": 2.0 }
      ]
    }
  ]
}
```

| Field | Required | Description |
|---|---|---|
| `evals[].id` | Yes | Integer: unique identifier for the test case |
| `evals[].prompt` | Yes | The prompt sent to the model under test |
| `evals[].name` | No | Human-readable label shown in the UI |
| `evals[].assertions` | No | List of criteria judged by the LLM judge |
| `assertions[].id` | Yes | String: unique identifier for the assertion |
| `assertions[].text` | Yes | The criterion evaluated by the judge model (YES/NO) |
| `assertions[].weight` | No | Scoring weight (default 1.0). Higher = more impact on the final score |
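Before pushing an `evals.json`, it can help to check it against the required fields listed above. The sketch below is one way to do that; `validate_evals` is a hypothetical helper, and treating assertion ids as unique per eval case (rather than per file) is an assumption, since the scope of uniqueness is not stated.

```python
import json


def validate_evals(document: dict) -> list[str]:
    """Return a list of problems; an empty list means no issues were found."""
    evals = document.get("evals")
    if not isinstance(evals, list):
        return ["top-level 'evals' must be a list"]

    problems: list[str] = []
    seen_case_ids: set[int] = set()
    for i, case in enumerate(evals):
        where = f"evals[{i}]"
        if not isinstance(case, dict):
            problems.append(f"{where} must be an object")
            continue
        case_id = case.get("id")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(case_id, bool) or not isinstance(case_id, int):
            problems.append(f"{where}.id must be an integer")
        elif case_id in seen_case_ids:
            problems.append(f"{where}.id duplicates an earlier case")
        else:
            seen_case_ids.add(case_id)
        if not isinstance(case.get("prompt"), str):
            problems.append(f"{where}.prompt must be a string")

        # Assumption: assertion ids only need to be unique within one case.
        seen_assertion_ids: set[str] = set()
        for j, assertion in enumerate(case.get("assertions") or []):
            a_where = f"{where}.assertions[{j}]"
            if not isinstance(assertion, dict):
                problems.append(f"{a_where} must be an object")
                continue
            a_id = assertion.get("id")
            if not isinstance(a_id, str):
                problems.append(f"{a_where}.id must be a string")
            elif a_id in seen_assertion_ids:
                problems.append(f"{a_where}.id duplicates an earlier assertion")
            else:
                seen_assertion_ids.add(a_id)
            if not isinstance(assertion.get("text"), str):
                problems.append(f"{a_where}.text must be a string")
    return problems


sample = json.loads('{"evals": [{"id": 1, "prompt": "Say hello"}]}')
print(validate_evals(sample))
```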
Each assertion is evaluated independently by a judge model using a strict YES/NO prompt. The final score is a weighted percentage:
score = sum(weight of passed assertions) / sum(weight of all assertions) × 100
An eval case with no assertions is recorded but produces no score.
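The formula can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not Arctiq's implementation: `weighted_score` and `passed_ids` are hypothetical names, and returning `None` when all weights sum to zero is an assumption made to avoid dividing by zero.

```python
from typing import Optional


def weighted_score(assertions: list[dict], passed_ids: set[str]) -> Optional[float]:
    """Weighted pass percentage, or None when there is nothing to score."""
    # A missing weight falls back to the documented default of 1.0.
    total = sum(a.get("weight", 1.0) for a in assertions)
    if total == 0:
        # Covers the no-assertions case (and, by assumption, all-zero weights).
        return None
    passed = sum(a.get("weight", 1.0) for a in assertions if a["id"] in passed_ids)
    return passed / total * 100


assertions = [
    {"id": "s1", "text": "Summary is 2 to 3 sentences long", "weight": 1.0},
    {"id": "s2", "text": "Summary does not introduce new facts", "weight": 2.0},
    {"id": "s3", "text": "Summary is written in plain English", "weight": 1.0},
]
# s1 and s3 pass, s2 fails: (1.0 + 1.0) / (1.0 + 2.0 + 1.0) × 100
print(weighted_score(assertions, {"s1", "s3"}))  # 50.0
```

Because the failed assertion carries twice the weight of the others, two passes out of three yield 50 rather than roughly 67.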
```json
{
  "evals": [
    {
      "id": 1,
      "name": "Summarise a short article",
      "prompt": "Summarise the following article in 2–3 sentences:\n\nThe James Webb Space Telescope...",
      "assertions": [
        { "id": "s1", "text": "Summary is 2 to 3 sentences long", "weight": 1.0 },
        { "id": "s2", "text": "Summary does not introduce facts not present in the article", "weight": 2.0 },
        { "id": "s3", "text": "Summary is written in plain English", "weight": 1.0 }
      ]
    },
    {
      "id": 2,
      "name": "Empty input handling",
      "prompt": "Summarise the following article:\n\n",
      "assertions": [
        { "id": "e1", "text": "Model asks for or acknowledges missing input rather than hallucinating a summary", "weight": 1.0 }
      ]
    }
  ]
}
```

If Arctiq is useful to you, consider sponsoring the project.
