Arctiq

Self-hosted eval runner for AI agent skills.
Measure whether your skills actually work — on your own infrastructure.

Alpha software — Arctiq is in early development. Expect breaking changes. Contributors are welcome!

Quick Start · Features · Configuration · Development · Contributing · License


What is Arctiq?

Arctiq helps skill authors and teams evaluate AI agent skills by running structured test suites against LLM providers and scoring the results with a judge model. Import skills from Git repositories, run evals, and compare results — all without sending data to third-party services.

Quick Start

# docker-compose.yml
services:
  api:
    image: ghcr.io/vdekercd/arctiq/api:latest
    volumes:
      - arctiq-data:/data
    environment:
      - ASPNETCORE_ENVIRONMENT=Production
      - ConnectionStrings__DefaultConnection=Data Source=/data/arctiq.db
      - Cors__AllowedOrigins__0=http://localhost:3000
      - ARCTIQ_MASTER_KEY=${ARCTIQ_MASTER_KEY}
    restart: unless-stopped

  ui:
    image: ghcr.io/vdekercd/arctiq/ui:latest
    ports:
      - "3000:80"
    depends_on:
      - api
    restart: unless-stopped

volumes:
  arctiq-data:

# Generate an encryption key and start Arctiq
export ARCTIQ_MASTER_KEY=$(openssl rand -hex 32)
docker compose up -d

Open http://localhost:3000 and you're ready to go.

Features

  • Skill Management — Import skills from Git repos (GitHub, GitLab, Bitbucket, Azure DevOps). Edit instructions in-browser with full version history and diffs.
  • Eval Execution — Run test suites against any LLM provider. Configure skill version, model, temperature, and a separate judge model for scoring assertions.
  • Run Comparisons — Compare runs side-by-side: skill v1 vs v2, skill-on vs baseline. View pass rates, cost breakdowns, and output diffs.
  • Multi-Provider — Supports OpenAI, Anthropic, Google Gemini, Mistral, Azure OpenAI, and Ollama.
  • Self-Hosted — Runs entirely on your infrastructure. API keys are encrypted at rest (AES-256-GCM). No data leaves your network.

Configuration

Environment variables for the API container:

| Variable | Description | Default |
| --- | --- | --- |
| ConnectionStrings__DefaultConnection | Database connection string | Data Source=/data/arctiq.db |
| ARCTIQ_MASTER_KEY | 32-byte key, hex-encoded (64 characters), used to encrypt API keys. Generate with openssl rand -hex 32 | |
| Cors__AllowedOrigins__0 | Allowed origin for CORS | http://localhost:3000 |
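
The key format can be checked before starting the stack. The following is a minimal Python sketch, assuming Arctiq expects exactly the 64 hexadecimal characters that openssl rand -hex 32 produces (this README does not spell out the validation rules). The helper names generate_master_key and is_valid_master_key are hypothetical and not part of Arctiq.

```python
import secrets
import string


def generate_master_key() -> str:
    """Return 32 random bytes as 64 hex characters, like `openssl rand -hex 32`."""
    return secrets.token_hex(32)


def is_valid_master_key(key: str) -> bool:
    """Assumed format check: exactly 64 hexadecimal characters."""
    return len(key) == 64 and all(c in string.hexdigits for c in key)


key = generate_master_key()
print(is_valid_master_key(key))
```

Keep the generated key somewhere safe: if it encrypts the stored API keys, as the table describes, losing it would presumably make those keys unreadable.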

Development

Prerequisites

  • .NET 10 SDK (for the API)
  • Node.js with npm (for the UI)

Run locally

# API
cd src/Arctiq.API
dotnet run

# UI (separate terminal)
cd src/Arctiq.UI
npm install
npm run dev

The UI runs on http://localhost:3000 and proxies /api to the backend.

Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | C# / .NET 10, FastEndpoints |
| Frontend | React, TypeScript, Vite |
| Database | SQLite |
| Deployment | Docker Compose |

Writing Evals for a Skill

Arctiq discovers evals from a file named evals/evals.json inside each skill's directory in the Git repository.

File location

your-skill-repo/
└── your-skill/
    ├── skill.md          # skill instructions
    └── evals/
        └── evals.json    # eval test cases
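
To check locally which skills in a checkout would expose evals, the layout above can be scanned with a few lines of Python. This is only a sketch of the convention as described here, not Arctiq's actual discovery code. It assumes skill directories sit directly under the repository root, as in the tree above, and find_eval_files is a hypothetical helper name.

```python
from pathlib import Path


def find_eval_files(repo_root: str) -> dict[str, Path]:
    """Map each skill directory name to its evals/evals.json, if present."""
    root = Path(repo_root)
    found = {}
    # Assumption: skills are direct children of the repository root.
    for path in sorted(root.glob("*/evals/evals.json")):
        if path.is_file():
            # path is <root>/<skill>/evals/evals.json, so the skill is two levels up.
            found[path.parent.parent.name] = path
    return found
```

Run against the example layout, find_eval_files("your-skill-repo") would return a single entry keyed by your-skill.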

Format

{
  "evals": [
    {
      "id": 1,
      "name": "optional case name",
      "prompt": "The input prompt sent to the model",
      "assertions": [
        { "id": "a1", "text": "Response is concise and under 100 words", "weight": 1.0 },
        { "id": "a2", "text": "Answer mentions the correct library name", "weight": 2.0 }
      ]
    }
  ]
}

Fields

| Field | Required | Description |
| --- | --- | --- |
| evals[].id | Yes | Integer. Unique identifier for the test case |
| evals[].prompt | Yes | The prompt sent to the model under test |
| evals[].name | No | Human-readable label shown in the UI |
| evals[].assertions | No | List of criteria judged by the LLM judge |
| assertions[].id | Yes | String. Unique identifier for the assertion |
| assertions[].text | Yes | The criterion evaluated by the judge model (YES/NO) |
| assertions[].weight | No | Scoring weight (default 1.0). Higher = more impact on the final score |
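
The field rules above can be checked before pushing a skill. The sketch below is a hypothetical pre-commit check written from the table, not Arctiq's own parser, so the real importer may be stricter or more lenient. It assumes assertion ids only need to be unique within one eval case and that weight, when present, must be a number. The function name validate_evals is made up for illustration.

```python
import json


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_evals(doc) -> list[str]:
    """Return a list of problems; an empty list means the document looks valid."""
    if not isinstance(doc, dict) or not isinstance(doc.get("evals"), list):
        return ['top level must be an object with an "evals" list']
    errors = []
    case_ids = set()
    for i, case in enumerate(doc["evals"]):
        where = f"evals[{i}]"
        if not isinstance(case, dict):
            errors.append(f"{where}: must be an object")
            continue
        case_id = case.get("id")
        if not isinstance(case_id, int) or isinstance(case_id, bool):
            errors.append(f"{where}.id: required integer")
        elif case_id in case_ids:
            errors.append(f"{where}.id: duplicate value {case_id}")
        else:
            case_ids.add(case_id)
        if not isinstance(case.get("prompt"), str):
            errors.append(f"{where}.prompt: required string")
        assertions = case.get("assertions", [])  # optional field
        if not isinstance(assertions, list):
            errors.append(f"{where}.assertions: must be a list")
            continue
        assertion_ids = set()  # assumption: ids are unique per case
        for j, assertion in enumerate(assertions):
            a_where = f"{where}.assertions[{j}]"
            if not isinstance(assertion, dict):
                errors.append(f"{a_where}: must be an object")
                continue
            a_id = assertion.get("id")
            if not isinstance(a_id, str):
                errors.append(f"{a_where}.id: required string")
            elif a_id in assertion_ids:
                errors.append(f"{a_where}.id: duplicate value {a_id!r}")
            else:
                assertion_ids.add(a_id)
            if not isinstance(assertion.get("text"), str):
                errors.append(f"{a_where}.text: required string")
            if "weight" in assertion and not _is_number(assertion["weight"]):
                errors.append(f"{a_where}.weight: must be a number")
    return errors


# This sample has one problem: the assertion lacks the required "text" field.
sample = json.loads('{"evals": [{"id": 1, "prompt": "Hi", "assertions": [{"id": "a1"}]}]}')
print(validate_evals(sample))
```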

Scoring

Each assertion is evaluated independently by a judge model using a strict YES/NO prompt. The final score is a weighted percentage:

score = sum(weight of passed assertions) / sum(weight of all assertions) × 100

An eval case with no assertions is recorded but produces no score.
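
The formula can be made concrete with a small Python sketch. It is an illustration of the arithmetic above, not Arctiq's implementation. The function name weighted_score is hypothetical, and returning None when the total weight is zero is an assumption, since that case is not described here.

```python
from typing import Optional


def weighted_score(assertions: list[dict], passed_ids: set[str]) -> Optional[float]:
    """Weighted pass percentage, or None when there is nothing to score."""
    if not assertions:
        return None  # a case with no assertions produces no score
    total = sum(a.get("weight", 1.0) for a in assertions)
    if total == 0:
        return None  # assumption: a zero total weight is treated as "no score"
    passed = sum(a.get("weight", 1.0) for a in assertions if a["id"] in passed_ids)
    return passed / total * 100


assertions = [
    {"id": "s1", "weight": 1.0},
    {"id": "s2", "weight": 2.0},
    {"id": "s3"},  # weight omitted, so the default 1.0 applies
]
# s1 and s3 pass (weight 2.0 of 4.0 in total), s2 fails.
print(weighted_score(assertions, {"s1", "s3"}))  # 50.0
```

Because s2 carries twice the weight of the others, failing it alone costs half the score, while failing only s1 would leave the case at 75.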

Example

{
  "evals": [
    {
      "id": 1,
      "name": "Summarise a short article",
      "prompt": "Summarise the following article in 2–3 sentences:\n\nThe James Webb Space Telescope...",
      "assertions": [
        { "id": "s1", "text": "Summary is 2 to 3 sentences long", "weight": 1.0 },
        { "id": "s2", "text": "Summary does not introduce facts not present in the article", "weight": 2.0 },
        { "id": "s3", "text": "Summary is written in plain English", "weight": 1.0 }
      ]
    },
    {
      "id": 2,
      "name": "Empty input handling",
      "prompt": "Summarise the following article:\n\n",
      "assertions": [
        { "id": "e1", "text": "Model asks for or acknowledges missing input rather than hallucinating a summary", "weight": 1.0 }
      ]
    }
  ]
}

License

MIT


If Arctiq is useful to you, consider sponsoring the project.
