ai-agents · ux-testing · gemini · mcp · google-cloud · ab-testing · landing-page
20 AI agents simulate your target audience evaluating your landing page — before you spend a cent on traffic.
Upload one design for friction analysis. Upload two to find the directional winner.
Built for the Google Cloud Rapid Agent Hackathon · Arize Track
20 cognitive walkers evaluating your design in parallel — live in the PM Command Center.
Most teams ship a design and hope for the best — or burn $10,000–$50,000 in paid traffic running an A/B test to find out which variant converts better. By then:
- The losing variant has already been shown to half your audience
- You know what won — but not why, or which persona drove it
- You have a number, not a fix
VeraTest flips this. Run 20 AI agents in 90 seconds. Each embodies a specific audience persona and evaluates your design the way a real user does — scoring six resonance dimensions, flagging friction, surfacing trust gaps, and telling you exactly what to fix.
No traffic. No guessing. No waiting.
| Mode | What you upload | What you get |
|---|---|---|
| Single design | One screenshot | Resonance score, friction themes, trust gaps, sprint stories |
| A/B pretest | Two variants | All of the above + directional winner with gap significance |
| N-variant (roadmap) | 3+ variants | Ranked resonance matrix |
Start with a single design analysis — it's the fastest way to understand your audience before you build variant B.
Every run produces a PM Command Center — a single-page report structured around the decisions a PM actually needs to make:
CommandRail — top of page, always visible. Validity badge, persona-fidelity badge ("18/20 in persona"), overall resonance score (or A vs B tug-of-war), coverage %, and a one-click markdown export for Notion or Linear.
Your Audience Personas — a PageSpeed-style hero of persona circles, one per audience segment, each ring showing its lean (positive/negative, or A/B). Click any circle to expand its full 6-dimension resonance breakdown, trust signals, and which variant it preferred.
What to do next — the single recommendation plus the top high/medium friction items rephrased as positive, shippable actions ("Add concrete use cases"), each tagged with the agent count it affects and a projected ability-score target.
Conversion Blockers & Wins — every friction point and what-worked theme in one table. Each row tagged with the cognitive dimension it hits (Motivation↑/↓, Ability↑/↓) and a recommended fix pulled from the agents' metacognitive reflections.
User Stories to Write — "As a [persona], I need [fix] so that I can [goal]" cards auto-generated from your high and medium friction themes, phrased as positive needs. Copy-to-clipboard.
Visual Reference — collapsible variant image reference (collapsed by default when the test is confounded).
VeraTest doesn't invent a methodology. It digitizes one. Every layer in the pipeline maps to an established practice from UX research, cognitive science, or experimental design — fields with decades of evidence behind them. The question isn't whether the methodology is sound. It's whether AI agents can execute it faithfully enough to be useful.
Multiple independent evaluators beat any single expert. Nielsen (1994) demonstrated that 5 independent evaluators find ~80% of usability issues; 15 find ~90%. Condorcet's jury theorem formalizes why: independent judges with better-than-random individual accuracy converge on the correct answer as panel size grows. VeraTest uses 20, each constrained to a different persona, which reduces the groupthink that a single LLM call would produce.
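To make the jury-theorem claim concrete, here is a minimal sketch that computes the probability that a strict majority of n independent judges is correct when each judge is right with the same probability p. The value p = 0.6 is purely illustrative and is not a measured property of VeraTest's agents. Agents drawn from one underlying model are also unlikely to be fully independent, so treat the numbers as an optimistic estimate.

```python
from math import comb


def majority_correct(n: int, p: float) -> float:
    """Probability that more than half of n independent judges are correct."""
    threshold = n // 2 + 1  # strict majority; a tie counts as a failure
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


# Illustrative only: p = 0.6 is an assumed per-judge accuracy.
for n in (1, 5, 21):
    print(n, round(majority_correct(n, 0.6), 3))
```

With p above 0.5 the majority's accuracy rises as the odd panel size grows; with p below 0.5 it falls, which is why per-judge quality still matters.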
Structured evaluation outperforms open-ended preference. Decades of industrial/organizational psychology show that structured interviews predict outcomes 2× better than unstructured ones (Schmidt & Hunter, 1998). The same principle applies to design evaluation. Asking "which is better?" produces confident noise. Walking through a defined protocol — visual impact, scanning path, trust signals, cognitive load — produces diagnostics.
System 1 → System 2 progression mirrors real cognition. Kahneman's dual-process theory isn't a hypothesis — it's textbook cognitive science. Humans process a landing page in two distinct phases: fast visual/emotional reaction (System 1), then slow deliberate reading and decision-making (System 2). VeraTest's Cognitive Walkers follow this sequence because that's how brains actually process a page, not because it's a convenient architecture choice.
The Fogg Behavior Model (B = MAP) drives the decision layer. One of the most cited frameworks in persuasive design: Behavior = Motivation × Ability × Prompt. Earlier versions of the model called the prompt a Trigger, and VeraTest keeps that name for its dimension. VeraTest's six resonance dimensions extend Fogg with Identity (Social Identity Theory), Situation (Situated Cognition), and Beliefs (Cognitive Dissonance Theory), producing a richer diagnostic than any single framework alone.
Counterbalancing and confound detection are Experimental Design 101. Showing Variant A first to half the panel and B first to the other half is the minimum methodological standard for any comparison study. Rejecting tests where both variants differ in language, brand, and layout simultaneously is what any research director would do. Most AI evaluation tools skip both. VeraTest skips neither.
Can language models actually simulate how different people evaluate a design? The evidence is early but directional:
- Argyle et al. (2023), "Out of One, Many" — demonstrated that LLMs can reproduce demographic subgroups' survey responses with surprising accuracy across age, income, and political affiliation. The paper calls this "silicon sampling."
- Aher et al. (2023), "Using LLMs to Simulate Multiple Humans" — replicated classic behavioral experiments (ultimatum game, Milgram-style studies) using LLM personas and got results that matched original human data.
- Park et al. (2023), "Generative Agents" — 25 LLM agents in a simulated town exhibited emergent social behaviors that human evaluators rated as more human-like than actual human transcripts.
VeraTest adds structural constraints that these papers identify as critical: persona locking (agents can't drift toward "helpful AI evaluator" mode), anti-cooperative prompting (agents are forced to behave like impatient, flawed humans), and metacognitive self-audit (agents check their own reasoning for persona leakage before submitting).
AI persona simulation is not a replacement for real traffic data. It's a replacement for the alternative — which, for most teams, is a PM's intuition, a Slack poll, or shipping the founder's favorite and hoping for the best.
The question isn't "is this as reliable as a 50,000-visitor A/B test with 95% statistical significance?" It's "is this more reliable than no test at all?" The methodology says yes. The emerging LLM research says yes. And VeraTest's own validation against known A/B outcomes provides a concrete accuracy number you can evaluate for yourself (see Validation).
A $50,000 A/B test is more rigorous. But you need a live product, real traffic, and 4–6 weeks. VeraTest gives you a directional signal in 90 seconds, before you've written a single line of code for Variant B — so you can build the right variant and save the real test for final confirmation.
On the 20-case balanced set (2026-06-11 run, free-tier Gemini under heavy 503 capacity pressure):
| Method | Accuracy | Decisive accuracy |
|---|---|---|
| random | 30.0% (6/20) | 30.0% (6/20) |
| always_a / always_b / heuristic | 50.0% (10/20) | 50.0% (10/20) |
| oneshot_gemini | 70.0% (14/20) | 70.0% (14/20) |
| VeraTest (full pipeline) | 45.0% (9/20, 10 abstained) | 90.0% (9/10) |
- When VeraTest committed to a verdict, it was right 9 of 10 (90%) vs one-shot Gemini's 70% — small n, treat as a directional signal, not a benchmark claim.
- One confident error in 20 cases. One-shot Gemini, which always answers, was confidently wrong 6 times. Under degraded conditions VeraTest abstains ("tie") rather than fabricating a verdict. For a decision-support tool, refusing to guess is the correct behavior, and the pipeline now enforces it explicitly: if fewer than 70% of the persona panel completes (SIMAB_SIM_QUORUM), the run fails loudly instead of synthesizing from thin evidence.
- 10 of 20 runs degraded to abstention that day due to Gemini free-tier 503s. Those score as wrong in the headline number (45%), which is why we report decisive accuracy separately and publish the raw per-case table in validation/report_*.md rather than a single flattering percentage.
Every report includes the full per-case prediction matrix, so you can audit exactly which cases each method got right. See Validation for the harness details.
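The two accuracy columns and the quorum rule can be sketched as follows. This is a minimal illustration, not the harness in validation/run.py: the function names are hypothetical, abstentions are modelled as None, and the exact way SIMAB_SIM_QUORUM is parsed is an assumption (here it is taken to be a percentage).

```python
from typing import Optional, Sequence


def accuracy_report(
    predictions: Sequence[Optional[str]], truths: Sequence[str]
) -> dict:
    """Score predictions where None means the method abstained ("tie")."""
    total = len(truths)
    decided = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in decided if p == t)
    return {
        "accuracy": correct / total,  # abstentions count as wrong
        "decisive_accuracy": correct / len(decided) if decided else None,
        "abstained": total - len(decided),
    }


def quorum_met(completed: int, panel_size: int, quorum_percent: int = 70) -> bool:
    """Hypothetical quorum gate: integer math avoids float rounding at the boundary."""
    return completed * 100 >= quorum_percent * panel_size


# Reproduces the VeraTest row: 9 correct, 1 wrong, 10 abstained.
truths = ["A"] * 20
predictions = ["A"] * 9 + ["B"] + [None] * 10
print(accuracy_report(predictions, truths))
```

On the counts from the table this yields 45% accuracy and 90% decisive accuracy, which is the gap the two columns are meant to expose.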
Six agents, six phases. Each one mirrors a role in a professional usability study. Remove any layer and the results break in the same way a sloppy research study produces misleading data.
Upload → Study Designer → Panel Recruiter → 20 × Cognitive Walkers → Bias Auditor → Insight Analyst → Report Narrators (×3)
| Phase | Agent | Research equivalent | Model | What breaks without it |
|---|---|---|---|---|
| 1 | Study Designer | Research director who reads the brief | Gemini Flash | You test noise — confounded comparisons produce uninterpretable data |
| 2 | Panel Recruiter | Recruiter assembling a representative panel | Gemini Flash | Niche 5% segments get equal voice to your core 60% audience |
| 3 | Cognitive Walker ×20 | Moderated cognitive walkthrough session | Gemini Flash-Lite | You're asking "which do you prefer?" — confident noise, no diagnostics |
| 4 | Bias Auditor | Methodologist checking data quality | Gemini Flash | Position effects silently corrupt your results |
| 5 | Insight Analyst | Analyst synthesizing session transcripts | Gemini Flash | You have 20 opinions; opinions aren't findings |
| 6 | Report Narrator ×3 | Research debrief writer | Gemini Flash | PMs get dashboards of numbers, not decisions for sprint planning |
Before a single agent evaluates your design, the Study Designer reads your image(s), extracts who your audience actually is, and checks for confounds. Different languages between variants? Different brand names? More than three simultaneous changes? It flags the test as uninterpretable — before you waste 20 evaluations on a comparison that can't produce a valid result.
The Panel Recruiter builds 20 persona cards, each with a specific segment, intent, decision style, patience threshold, and device. It allocates agents proportionally to each segment's traffic weight using the largest-remainder method. A segment representing 40% of your traffic gets 40% of your evaluators. The synthesis reflects your actual audience, not an equal-weighted fiction.
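The largest-remainder allocation can be sketched in a few lines. This is an illustration under assumptions, not the code in scenarios.py: the function name and segment names are hypothetical, and weights are given as integer percentage points so the arithmetic stays exact.

```python
def allocate_agents(weights: dict[str, int], panel_size: int = 20) -> dict[str, int]:
    """Largest-remainder allocation of panel seats to audience segments."""
    total = sum(weights.values())
    seats: dict[str, int] = {}
    remainders: dict[str, int] = {}
    for segment, weight in weights.items():
        # Whole seats first; keep the remainder for the tie-break round.
        seats[segment], remainders[segment] = divmod(panel_size * weight, total)
    leftover = panel_size - sum(seats.values())
    # Hand the remaining seats to the segments with the largest remainders.
    ranked = sorted(remainders, key=remainders.get, reverse=True)
    for segment in ranked[:leftover]:
        seats[segment] += 1
    return seats


# Hypothetical traffic split, in percent.
print(allocate_agents({"founders": 40, "developers": 35, "enterprise": 18, "students": 7}))
```

For this split the 40% segment receives 8 of 20 agents, and the one seat left over after rounding down goes to the segment with the largest fractional share.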
The core of VeraTest. Each Cognitive Walker embodies one persona and evaluates your design through a structured cognitive sequence:
| Step | Cognitive mode | What the agent does |
|---|---|---|
| Identity anchoring | Pre-evaluation | Locks to the persona: "What kind of person am I? What situation am I arriving from?" |
| Gut reaction | System 1 | Rates visual impact, reads spatial hierarchy. First impressions form in <500ms. |
| Scanning | System 1 | Follows the eye path dictated by decision style — F-pattern (analytical), Z-pattern (impulse), trust-first (cautious). |
| Deliberate evaluation | System 2 | Reads messaging, checks trust signals, scores alignment with existing beliefs. |
| Decision | System 2 | Fogg model: B = Motivation × Ability × Trigger. Is the path clear enough for this persona's patience level? |
| Self-audit | Metacognitive | "Could I be wrong? Am I responding as this persona, or as a helpful AI?" |
Every agent scores six resonance dimensions, producing a diagnostic fingerprint rather than a blunt preference:
| Dimension | What it captures | Framework origin |
|---|---|---|
| Motivation | Does the design activate the right desire? | Fogg Behavior Model |
| Identity | Does it speak to who they see themselves as? | Social Identity Theory |
| Situation | Does it match the context they're arriving from? | Situated Cognition |
| Beliefs | Does it align with what they already think is true? | Cognitive Dissonance Theory |
| Ability | Is the path to action clear enough for their patience? | Fogg Behavior Model |
| Trigger | Is the CTA well-timed and unmissable? | Fogg Behavior Model |
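One way to picture the six-dimension fingerprint is as a small record with the Fogg product computed from three of its fields. The sketch below is an assumption about shape only: the 0 to 1 scale, the class name, and the helper methods are hypothetical and do not come from models.py.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ResonanceScores:
    """Hypothetical per-agent fingerprint; each score is assumed to lie in [0, 1]."""

    motivation: float
    identity: float
    situation: float
    beliefs: float
    ability: float
    trigger: float

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")

    def fogg_product(self) -> float:
        # Multiplicative form: a zero in any factor zeroes the behavior.
        return self.motivation * self.ability * self.trigger

    def weakest_dimension(self) -> str:
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


scores = ResonanceScores(0.8, 0.7, 0.6, 0.9, 0.5, 1.0)
print(scores.fogg_product(), scores.weakest_dimension())
```

The multiplicative form is the point of the Fogg model: a strong motivation score cannot compensate for a path the persona cannot follow.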
Even-indexed agents see Variant A first; odd-indexed agents see Variant B first. The Bias Auditor checks whether the margin holds after controlling for presentation order. It also flags confidence collapse (suspiciously uniform scores), cohort imbalance, and rationale incoherence. If the result doesn't survive these checks, you see trust_level: low before the verdict — not after you've acted on it.
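The counterbalancing rule and a simple order check can be sketched as below. The parity rule comes from the text; the survival test is an assumption, since the Bias Auditor's actual statistic is not described here. It only asks whether the A-first and B-first groups agree on the sign of the margin.

```python
from statistics import mean


def presentation_order(agent_index: int) -> tuple[str, str]:
    """Even-indexed agents see A first; odd-indexed agents see B first."""
    return ("A", "B") if agent_index % 2 == 0 else ("B", "A")


def margin_survives_order(margins: list[float]) -> bool:
    """margins[i] is score_A - score_B from agent i (needs at least two agents).

    Hypothetical check: the result survives if both order groups lean the same way.
    """
    a_first = mean(m for i, m in enumerate(margins) if i % 2 == 0)
    b_first = mean(m for i, m in enumerate(margins) if i % 2 == 1)
    return a_first * b_first > 0


# A pure position effect: every agent prefers whichever variant it saw first.
print(margin_survives_order([0.3, -0.3] * 10))
# A real preference for A that holds in both order groups.
print(margin_survives_order([0.3, 0.1] * 10))
```

A margin that flips sign between the two groups is a position effect rather than a design effect, which is the case the trust_level: low warning is meant to catch.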
The Insight Analyst takes 20 individual evaluations and produces findings: directional winner, resonance gap with significance assessment, friction themes clustered by severity and agent count, what-worked themes, trust signal gaps, and a single recommendation for what to fix first. Twenty opinions become one synthesis.
Three parallel sub-agents each write one section of the PM report:
| Narrator | What it produces |
|---|---|
| Structural Diff | What's objectively different between the variants and how it maps to the resonance gap |
| Hypothesis | The single highest-leverage thing to test next, with a projected improvement target |
| Cohort Story | How each audience segment responded differently — the "why behind the why" |
Even-indexed agents see Variant A first. Odd-indexed agents see Variant B first. The Bias Auditor then checks whether the gap holds after controlling for presentation order — if it doesn't, you get a trust_level: low warning before seeing the verdict.
The Study Designer analyses your images before building scenarios. If it detects different brand names, different languages, or more than three simultaneous variables, it surfaces a confound_warning explaining exactly why the test is uninterpretable — before running 20 agents on a meaningless comparison.
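The three confound rules named above can be expressed as a small check. This is a sketch under assumptions: in VeraTest the facts are extracted from the images by the Study Designer, whereas here they are passed in directly, and the class, function, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantFacts:
    """Hypothetical summary of what was read from one screenshot."""

    language: str  # e.g. "en"
    brand: str  # brand name as it appears on the page


def confound_warning(
    a: VariantFacts,
    b: VariantFacts,
    changed_variables: list[str],
    max_changes: int = 3,
) -> Optional[str]:
    """Return a human-readable warning, or None when the comparison is clean."""
    reasons = []
    if a.language != b.language:
        reasons.append(f"languages differ ({a.language} vs {b.language})")
    if a.brand.casefold() != b.brand.casefold():
        reasons.append(f"brands differ ({a.brand} vs {b.brand})")
    if len(changed_variables) > max_changes:
        reasons.append(
            f"{len(changed_variables)} simultaneous changes (limit {max_changes})"
        )
    return "; ".join(reasons) if reasons else None


control = VariantFacts(language="en", brand="Acme")
challenger = VariantFacts(language="de", brand="Acme")
print(confound_warning(control, challenger, ["headline", "cta_color"]))
```

Returning the reasons as text, rather than a bare boolean, matches the stated goal of explaining why a test is uninterpretable.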
docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest
export PHOENIX_COLLECTOR_ENDPOINT=http://localhost:4317
Every run produces one trace tree in Phoenix: ~125 spans nested under a single root veratest_run.run_<id> agent span, with one child phase.* span per pipeline phase and one LLM span per Gemini call beneath it. Full prompts, image payloads, responses, retries, and timing. You can see exactly what the Study Designer extracted, what each Cognitive Walker decided and why, what the Bias Auditor flagged, and which retry recovered a 503. No black box.
veratest_run.run_<id> [AGENT] ~165s
├─ phase.study_designer [CHAIN]
├─ phase.panel_recruiter [CHAIN] (retries surface as sibling ERROR spans + llm.retry events)
├─ phase.cognitive_walkers [CHAIN] → 20 × sim_agent.N → GenerateContent
├─ phase.audit_and_synthesis [CHAIN]
├─ phase.report_narrators [CHAIN]
└─ phase.fidelity_auditor [CHAIN] → 20 × persona_consistency.evaluate [EVALUATOR]
Cloud Run note. The pipeline runs as a background task after the HTTP response returns. Cloud Run's default per-request CPU throttling can starve the OTLP export thread between requests and silently drop early spans. If you deploy there, enable always-allocated CPU:
gcloud run services update veratest-backend --no-cpu-throttling
This trades per-request billing for per-instance-time while warm; the background pipeline is materially safer with it on.
A 7th agent, FidelityAuditor, runs an LLM-as-a-Judge persona-consistency eval plus a code-based rationale-coherence check on every run. Drifted agents are written to a persistent Phoenix Dataset; on the next run targeting a similar audience, ScenarioBuilder queries that history and strengthens the prompt of any persona archetype that has drifted >25% of the time. The Command Center surfaces this as a "95% in character" badge, the answer to "how do I trust this?"
See scripts/run_calibration_experiment.py for the baseline-vs-tightened Phoenix Experiment that produces the visible before/after fidelity delta.
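The >25% drift rule can be sketched as a pure function over past audit records. This is an assumption about shape only: the real history lives in a Phoenix Dataset, while here it is a plain list of (archetype, drifted) pairs, and the function and archetype names are hypothetical.

```python
from collections import defaultdict


def archetypes_to_tighten(
    history: list[tuple[str, bool]], threshold: float = 0.25
) -> list[str]:
    """Return archetypes whose historical drift rate is strictly above the threshold."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [drifted, total]
    for archetype, drifted in history:
        counts[archetype][0] += int(drifted)
        counts[archetype][1] += 1
    return sorted(
        name for name, (drifted, total) in counts.items() if drifted / total > threshold
    )


# Hypothetical audit history: one archetype drifts half the time, one a quarter.
history = (
    [("skeptical_cto", True)] * 2
    + [("skeptical_cto", False)] * 2
    + [("impulse_buyer", True)]
    + [("impulse_buyer", False)] * 3
)
print(archetypes_to_tighten(history))
```

An archetype that drifted in exactly 25% of runs is left alone here, because the rule as stated is "more than 25%".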
# Install uv (skip if you already have it): https://docs.astral.sh/uv/
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/Fotomarky/VeraTest.git && cd VeraTest
# Backend
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev]"
# Free Gemini key — no credit card: https://aistudio.google.com/app/apikey
export GEMINI_API_KEY="your-key-here"
# Smoke tests (no API calls needed)
pytest tests/ -v
# Expected: 79 passed
uvicorn simab.main:app --reload --port 8000
# Frontend (new terminal)
cd frontend && npm install && npm run dev
Open http://localhost:3000. Upload one or two screenshots, write your conversion goal, click Run. Results stream in live as agents complete.
curl -X POST http://localhost:8000/api/runs \
-F "variant_a=@your-design.png" \
-F "goal=sign up for free trial" \
-F "audience=Startup founders evaluating CI tools"
curl -X POST http://localhost:8000/api/runs \
-F "variant_a=@control.png" \
-F "variant_b=@challenger.png" \
-F "goal=sign up for free trial" \
-F "audience=Startup founders evaluating CI tools"
simab/
├── agents/
│ ├── normalizer.py Phase 1 · Study Designer — image reading, persona extraction, confound detection
│ ├── scenarios.py Phase 2 · Panel Recruiter — traffic-weighted allocation, 20 micro-varied cards
│ ├── simulator.py Phase 3 · Cognitive Walker (×20) — 6-dimension resonance evaluation per persona
│ ├── auditor.py Phase 4 · Bias Auditor — cohort balance, score inflation, coherence checks
│ ├── synthesizer.py Phase 5 · Insight Analyst — friction clustering, gap computation, verdict
│ └── narrative.py Phase 6 · Report Narrators (×3) — diff, hypothesis, cohort story
├── models.py Pydantic schemas (single source of truth)
├── pipeline.py Sequential orchestration with async parallel sim phase
├── state.py SQLite WAL — distributed mutex for idempotent writes
├── main.py FastAPI — REST + SSE + share page + A2A endpoint
├── llm.py Gemini client — rate limiting, retries, JSON self-healing
└── agent.py Agent Builder (ADK) front door — wraps the pipeline as tools + Arize Phoenix MCP toolset
frontend/
└── app/
├── new/page.tsx Upload form — single or A/B mode
├── runs/[id]/page.tsx PM Command Center (SSE live updates)
└── components/
├── CommandRail.tsx Sticky verdict / resonance / fidelity header
├── ResultsHero.tsx Persona-circles hero (uses PersonaCard)
├── PersonaCard.tsx Single persona's resonance + trust deep-dive
├── PackmanTheater.tsx Pixelated agent animation while in-flight
├── WhatToDoNext.tsx Recommendation + positive next-step actions
├── BlockersMatrix.tsx Friction + wins table with cognitive badges
├── UserStoryScaffold.tsx Auto-generated user stories from friction
└── VisualEvidence.tsx Collapsible variant image reference
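The tree above describes state.py as SQLite in WAL mode with idempotent writes. A minimal sketch of that idea follows, using the standard-library sqlite3 module rather than aiosqlite. The table name, columns, and function names are hypothetical and are not the project's schema.

```python
import sqlite3


def open_state(path: str) -> sqlite3.Connection:
    """Open the state database in WAL mode and create a hypothetical results table."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # no effect for in-memory databases
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS agent_results (
            run_id   TEXT NOT NULL,
            agent_id TEXT NOT NULL,
            payload  TEXT NOT NULL,
            PRIMARY KEY (run_id, agent_id)
        )
        """
    )
    return conn


def record_result(conn: sqlite3.Connection, run_id: str, agent_id: str, payload: str) -> bool:
    """Insert once; a retry with the same key is a no-op. True if a row was written."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO agent_results VALUES (?, ?, ?)",
            (run_id, agent_id, payload),
        )
    return cursor.rowcount == 1


conn = open_state(":memory:")
print(record_result(conn, "run_1", "sim_agent.0", '{"lean": "A"}'))
print(record_result(conn, "run_1", "sim_agent.0", '{"lean": "B"}'))
```

The primary key makes the database the arbiter: a retried agent that writes the same (run_id, agent_id) twice cannot double-count in the synthesis.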
A framework-free core behind an Agent Builder front door. The 6-phase
pipeline coordinates through a single shared SQLite document — every agent
reads from and writes to one structured record (stigmergy) — so each run is
fully inspectable and every Gemini call is a direct OpenInference span, with
no framework intermediation to obscure what the agent saw and decided. That
transparent core is left untouched. In front of it sits a thin Google
Cloud Agent Builder layer (simab/agent.py): a single ADK LlmAgent
("VeraTest Concierge") that a PM chats with. It exposes the pipeline as two
tools (start_pretest, get_pretest_result) and mounts the Arize Phoenix
MCP server (@arizeai/phoenix-mcp) as a live MCP toolset, so it can query
traces, datasets, and prior runs at runtime. See
docs/agent-builder.md.
User ──chat──▶ ADK LlmAgent (Gemini via Vertex AI = Google Cloud Agent Builder)
├─ tool: start_pretest → existing pipeline.run_pipeline()
├─ tool: get_pretest_result → existing state.get_run()
└─ MCPToolset → @arizeai/phoenix-mcp (Arize partner MCP server, live)
One component satisfies all three Arize-track requirements at runtime:
Gemini (the agent's reasoning model, served via Vertex AI when
GOOGLE_GENAI_USE_VERTEXAI=TRUE), Google Cloud Agent Builder (ADK is its
official SDK), and the Arize partner MCP server (mounted as a live toolset).
pip install -e ".[agent]"
export GOOGLE_GENAI_USE_VERTEXAI=TRUE
export GOOGLE_CLOUD_PROJECT=veratest-497813 GOOGLE_CLOUD_LOCATION=us-central1
gcloud auth application-default login
export PHOENIX_BASE_URL=https://app.phoenix.arize.com PHOENIX_API_KEY=...
adk web simab # dev chat UI at http://localhost:8000
Deploy to Vertex AI Agent Engine with python scripts/deploy_agent_engine.py (use --dry-run to build without deploying), or run adk api_server simab as a second Cloud Run service. The 20-walker pipeline and SQLite state are unchanged.
| Component | Technology |
|---|---|
| Agent Builder | Google ADK LlmAgent (simab/agent.py) — Vertex AI Agent Engine / Cloud Run |
| Orchestration | Gemini 2.5 Flash (Study Designer, Panel Recruiter, Bias Auditor, Insight Analyst, Report Narrators) |
| Simulation | Gemini 2.5 Flash-Lite (20 parallel Cognitive Walkers — free tier: 1,500/day) |
| Observability | Arize Phoenix (OTLP tracing — full prompt + image + response per span) |
| Partner MCP | @arizeai/phoenix-mcp mounted as a live ADK MCP toolset |
| Backend | FastAPI + aiosqlite + SQLite WAL |
| Frontend | Next.js 14 App Router + Tailwind CSS |
| Deployment | Google Cloud Run (backend 2Gi/2CPU, frontend 512Mi) |
| MCP server | Python stdio, 4 tools |
| Tests | pytest + pytest-asyncio, 79 tests, ~2s |
# Backend
gcloud builds submit --tag gcr.io/$PROJECT_ID/veratest-backend:latest
gcloud run deploy veratest-backend \
--image gcr.io/$PROJECT_ID/veratest-backend:latest \
--region us-central1 --memory 2Gi --cpu 2 \
--set-secrets GEMINI_API_KEY=gemini-api-key:latest
# Frontend
gcloud builds submit frontend --config frontend/cloudbuild.yaml
gcloud run deploy veratest-frontend \
--image gcr.io/$PROJECT_ID/veratest-frontend:latest \
--region us-central1
# Or both at once
./gcp/deploy.sh $PROJECT_ID

| Method | Path | Purpose |
|---|---|---|
| POST | /api/runs | Create run (variant_a, optional variant_b, goal, audience) |
| GET | /api/runs/{id}/stream | SSE live progress |
| GET | /api/runs/{id} | Full run state |
| GET | /api/runs/{id}/export.md | Markdown export for Notion / Linear |
| GET | /share/{id} | Standalone HTML share page (no JS required) |
| GET | /api/runs/{id}/summary | PM-friendly plain-language summary |
| POST | /a2a/v1/tasks | Google A2A protocol |
| GET | /.well-known/agent-card.json | Agent marketplace discovery |
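To consume the /api/runs/{id}/stream endpoint from a script, you need a small Server-Sent Events parser. The sketch below follows the standard SSE framing (data: lines, blank line ends an event, lines starting with a colon are comments). It assumes each event's data is a JSON object; the keys shown in the sample are hypothetical, since the payload schema is not documented here.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON payload per SSE event from an iterable of text lines."""
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(":"):
            continue  # comment or keep-alive
        if line == "":
            if data_lines:  # a blank line dispatches the buffered event
                yield json.loads("\n".join(data_lines))
                data_lines = []
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # The spec strips a single leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)


# Sample stream with hypothetical payload keys.
sample = [
    ": keep-alive\n",
    'data: {"phase": "cognitive_walkers", "completed": 7}\n',
    "\n",
    'data: {"phase": "report_narrators", "completed": 3}\n',
    "\n",
]
for event in iter_sse_events(sample):
    print(event)
```

Any HTTP client that can iterate over response lines can feed this generator; the parser itself has no network dependency, which keeps it easy to test.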
pip install -e mcp/
{
"mcpServers": {
"veratest": {
"command": "python",
"args": ["-m", "simab_mcp"],
"env": { "SIMAB_API_URL": "http://localhost:8000" }
}
}
}

| Tool | What it does |
|---|---|
| run_pretest | Submit images + goal + audience, get run ID |
| get_pretest_result | Poll or block until complete, returns full synthesis |
| list_runs | Recent runs with status and verdict |
| list_personas | Browse the persona library |
The Arize track requires agents to introspect their operational data at runtime via the Phoenix MCP server. Drop this into any MCP client config — Claude Desktop, Gemini CLI, Cursor — alongside the VeraTest MCP server:
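cp mcp/phoenix-mcp.example.json ~/Library/Application\ Support/Claude/claude_desktop_config.json
This overwrites any existing Claude Desktop config, so merge the entries by hand if you already have one. Then ask Claude (or Gemini CLI):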
"Which personas drifted in the last 5 VeraTest runs, and what was their average rationale coherence?"
The Phoenix MCP server exposes Datasets, Experiments, Prompts, and Spans as MCP tools — so your assistant can query them directly, no SQL required.
Ask Claude: "Run a pretest on these two screenshots for trial signups from startup founders."
pytest tests/ -v
# 79 passed in ~2s
Covers: idempotent state writes under concurrent agents, schema compatibility, traffic-weighted allocator, resonance aggregation, trust gap ranking, markdown export, share-page self-containment, and the describe-mode HTTP surface (upload sanitization, orphan cleanup, agent-unavailable degradation).
Most AI evaluation tools ask you to trust them. VeraTest ships a falsifiable benchmark you can re-run yourself.
validation/run.py scores the full 20-agent pipeline against real A/B tests with publicly documented winners (abtestcases.com), alongside four baselines:
| Baseline | What it controls for |
|---|---|
| random | Floor: is anything better than a coin flip? |
| always_a / always_b | Position bias: published A/B cases skew toward B winning (publication bias) |
| heuristic | "The challenger usually wins" shortcut |
| oneshot_gemini | The one that matters: same model, same images, single prompt. Isolates the value of the multi-agent panel itself |
The dataset uses mirrored pairs — every case appears twice with A/B swapped — so a method can't score above 50% by exploiting position or publication bias. On the balanced set, always_a, always_b, and heuristic all land at exactly 50%, which is the design working.
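The mirroring argument is easy to verify in a few lines. This sketch uses placeholder file names and hypothetical helper names; it is not the code that builds validation/dataset_balanced.csv.

```python
from typing import Callable

Case = tuple[str, str, str]  # (image_a, image_b, winner), winner is "A" or "B"


def mirror(cases: list[Case]) -> list[Case]:
    """Return the original cases plus a copy of each with A and B swapped."""
    swapped = [(b, a, "B" if winner == "A" else "A") for a, b, winner in cases]
    return cases + swapped


def accuracy(predict: Callable[[str, str], str], cases: list[Case]) -> float:
    return sum(predict(a, b) == winner for a, b, winner in cases) / len(cases)


def always_a(image_a: str, image_b: str) -> str:
    return "A"


def always_b(image_a: str, image_b: str) -> str:
    return "B"


# Placeholder cases skewed toward B, as published A/B results tend to be.
cases = [("a1.png", "b1.png", "B"), ("a2.png", "b2.png", "B"), ("a3.png", "b3.png", "A")]
print(accuracy(always_b, cases), accuracy(always_b, mirror(cases)))
```

On the skewed original set the always_b shortcut looks better than chance; on the mirrored set every case has a twin with the opposite label, so any position-only strategy is pinned to 50%.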
python validation/run.py --dataset validation/dataset_balanced.csv --baselines all
Predictions checkpoint after every case, so an interrupted run resumes instead of restarting; abstentions are re-attempted automatically.
Headline numbers from the latest run are in Why this works · §4; per-case prediction matrices live in validation/report_*.md.
MIT — self-host, fork, and build freely. See LICENSE.

