feat(blueprints): Add v5.1-evaluating-ai-performance-in-women-peace-security-scenarios.yml #24

Open
frantj wants to merge 5 commits into weval-org:main from frantj:proposal/evaluating-ai-performance-in-women-peace-security-scenarios-1775231959253

Conversation

frantj commented Apr 6, 2026

v5.1 - compliance release due to deprecation of negative criteria.

This benchmark evaluates how well large language models (LLMs) integrate Women, Peace and Security (WPS) principles when advising on conflict and peace operations. It contains 24 scenarios across three prompt tiers, scored against 7 positive criteria and 2 negative criteria (converted to negative 'should' statements for platform compliance).

…mance-in-women-peace-security-scenarios.yml on new branch
…men-peace-security-scenarios.yml' to 'blueprints/users/frantj/v5.1-evaluating-ai-performance-in-women-peace-security-scenarios.yml'
…valuating-ai-performance-in-women-peace-security-scenarios.yml'
weval-bot Bot commented Apr 6, 2026

Evaluation started!

  • blueprints/users/frantj/v5.1-evaluating-ai-performance-in-women-peace-security-scenarios.yml - View Status
    ⚠️ Blueprint trimmed to fit PR evaluation limits (full evaluation runs after merge)

Note: 1 blueprint exceeded PR evaluation limits and was automatically trimmed:

  • Limited to 10 prompts, 5 models (CORE), 2 temps, 2 systems
  • Full evaluation with all prompts/models will run automatically after merge
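
For a rough sense of what these limits mean, here is a minimal Python sketch. It assumes the trim keeps the first N entries of each dimension and that the evaluation runs a full cross-product of prompts, models, temperatures, and system prompts; neither assumption is confirmed in this thread. The names `PR_LIMITS`, `trim_for_pr`, and `max_runs`, and all sample values, are hypothetical and are not part of weval.

```python
from itertools import product

# Caps quoted in the bot comment above; everything else here is hypothetical.
PR_LIMITS = {"prompts": 10, "models": 5, "temps": 2, "systems": 2}


def trim_for_pr(config, limits=PR_LIMITS):
    """Keep at most limits[key] entries per dimension (assumed: first-N truncation)."""
    return {key: list(config[key])[:cap] for key, cap in limits.items()}


def max_runs(config):
    """Upper bound on generations, assuming a full cross-product of all dimensions."""
    return len(list(product(*config.values())))


# Illustrative blueprint dimensions; counts and values are made up,
# with one prompt per scenario assumed for the 24 scenarios.
full = {
    "prompts": [f"scenario-{i}" for i in range(1, 25)],
    "models": [f"model-{i}" for i in range(1, 9)],
    "temps": [0.0, 0.5, 1.0],
    "systems": [None, "sys-a", "sys-b"],
}

trimmed = trim_for_pr(full)
print(max_runs(trimmed))  # 10 * 5 * 2 * 2 = 200
```

Under those assumptions, the trimmed PR run would produce at most 200 generations, while the post-merge run covers every configured prompt and model.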

Results will be posted here when complete.


Commit: a7072d5

weval-bot Bot commented Apr 6, 2026

Evaluation complete for blueprints/users/frantj/v5.1-evaluating-ai-performance-in-women-peace-security-scenarios.yml

View evaluation status | View full analysis

The blueprint has been successfully evaluated against all configured models.
