fix(ENG-12448): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8) by hiskudin · Pull Request #25 · StackOneHQ/defender

hiskudin · 2026-03-24T15:32:26Z

Summary

Upgrades quantized ONNX model to full-aug-dojo-jailbreak-jbv2
Syncs model, tokenizer and config files

Benchmark results

Metric	Before (baseline)	After (jbv2)	Delta
AgentShield score	73.7	79.8	+6.1
Composite	77.2	87.4	+10.2
OR Penalty	3.51	7.54	+4.03
Jailbreak detection	48.9%	66.7%	+17.8 pts
Prompt injection	79.5%	92.7%	+13.2 pts
DAN-variant	20%	80%	+60 pts
Multi-agent security	80.0%	88.6%	+8.6 pts
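
As a quick sanity check on the table above, the sketch below recomputes each Delta from the Before and After columns. It also tests one guess that this PR does not state: that the AgentShield score is roughly the composite minus the OR penalty. The `Row` type and the helper names `round2`, `deltaMismatches` and `impliedScore` are hypothetical and exist only for this illustration; the numbers are copied from the table.

```typescript
// Rows copied from the benchmark table in this PR (percentages as plain numbers).
type Row = { metric: string; before: number; after: number; delta: number };

const rows: Row[] = [
  { metric: "AgentShield score", before: 73.7, after: 79.8, delta: 6.1 },
  { metric: "Composite", before: 77.2, after: 87.4, delta: 10.2 },
  { metric: "OR Penalty", before: 3.51, after: 7.54, delta: 4.03 },
  { metric: "Jailbreak detection", before: 48.9, after: 66.7, delta: 17.8 },
  { metric: "Prompt injection", before: 79.5, after: 92.7, delta: 13.2 },
  { metric: "DAN-variant", before: 20, after: 80, delta: 60 },
  { metric: "Multi-agent security", before: 80.0, after: 88.6, delta: 8.6 },
];

// Round to two decimals so floating-point noise does not cause false mismatches.
const round2 = (x: number): number => Math.round(x * 100) / 100;

// Names of the rows whose reported delta differs from after - before.
function deltaMismatches(table: Row[]): string[] {
  return table
    .filter((r) => round2(r.after - r.before) !== r.delta)
    .map((r) => r.metric);
}

// ASSUMPTION (not stated in the PR): score is approximately composite - penalty.
function impliedScore(composite: number, penalty: number): number {
  return round2(composite - penalty);
}

console.log(deltaMismatches(rows));
console.log(impliedScore(77.2, 3.51), impliedScore(87.4, 7.54));
```

Under that assumption the implied scores are 73.69 and 79.86, both within 0.1 of the reported 73.7 and 79.8, so the guess fits the rounded figures but is not confirmed by them.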

Training additions (baseline → jbv2)

JasperLS/prompt-injections — jailbreak-focused dataset
jackhhao/jailbreak-classification — 527 examples (DAN, roleplay, persona-override)
lmsys/toxic-chat (jailbreaking=1) — 113 human-verified real jailbreaks
rubend18/ChatGPT-Jailbreak-Prompts — 79 classic named templates

🤖 Generated with Claude Code

Summary by cubic

Upgrades the classifier to the full-aug-dojo-jailbreak-jbv2 ONNX model per ENG-12448 to boost safety. AgentShield 73.7 → 79.8; jailbreak 48.9% → 66.7%; prompt-injection 79.5% → 92.7%.

New Features
- Replaced quantized model with full-aug-dojo-jailbreak-jbv2.
- Synced tokenizer files; set defaults: max_length 128, right-side padding/truncation.
- Updated model config to transformers_version 5.3.0.

Refactors
- Excluded src/classifiers/models/** in biome.json to skip linting large model blobs.
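
To make the tokenizer defaults listed above concrete (max_length 128, right-side padding and truncation), here is a minimal sketch of what they imply for one already-tokenized input. It is not the defender implementation: `TokenizerDefaults`, `JBV2_DEFAULTS`, `EncodedInput`, `fitToLength` and the pad token id `0` are hypothetical names and values chosen for illustration, and unlike a real tokenizer this sketch does not preserve special tokens when truncating.

```typescript
// Hypothetical shape holding only the three defaults named in the summary.
interface TokenizerDefaults {
  maxLength: number;
  paddingSide: "left" | "right";
  truncationSide: "left" | "right";
}

const JBV2_DEFAULTS: TokenizerDefaults = {
  maxLength: 128,
  paddingSide: "right",
  truncationSide: "right",
};

interface EncodedInput {
  inputIds: number[];
  attentionMask: number[]; // 1 for real tokens, 0 for padding
}

// Truncate or pad a token id sequence to exactly cfg.maxLength entries.
function fitToLength(
  tokenIds: number[],
  padTokenId: number,
  cfg: TokenizerDefaults = JBV2_DEFAULTS,
): EncodedInput {
  let ids = tokenIds.slice();
  if (ids.length > cfg.maxLength) {
    // Right-side truncation drops tokens from the end; left-side from the start.
    ids =
      cfg.truncationSide === "right"
        ? ids.slice(0, cfg.maxLength)
        : ids.slice(ids.length - cfg.maxLength);
  }
  const padCount = cfg.maxLength - ids.length;
  const pad = new Array<number>(padCount).fill(padTokenId);
  const maskReal = new Array<number>(ids.length).fill(1);
  const maskPad = new Array<number>(padCount).fill(0);
  // Right-side padding appends pad tokens after the real tokens.
  return cfg.paddingSide === "right"
    ? { inputIds: [...ids, ...pad], attentionMask: [...maskReal, ...maskPad] }
    : { inputIds: [...pad, ...ids], attentionMask: [...maskPad, ...maskReal] };
}

// Example: a short input is padded on the right up to 128 entries.
const encoded = fitToLength([101, 7, 8, 102], 0); // pad id 0 is an assumption
console.log(encoded.inputIds.length, encoded.attentionMask.slice(0, 6));
```

A fixed length keeps every input the same shape for the ONNX model, at the cost of dropping anything past the limit; with right-side truncation, that means the tail of a long prompt is never seen by the classifier.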

Written for commit 0ff4bed. Summary will update on new commits.

Replaces baseline ONNX model with full-aug-dojo-jailbreak-jbv2.

Training additions over baseline:
- jasperls: JasperLS jailbreak dataset
- jailbreakbench (527): DAN, roleplay, persona-override attacks
- toxic-chat (113): human-verified real jailbreaks
- chatgpt-jailbreaks (79): classic named templates

AgentShield: 73.7 → 79.8 (composite 77.2 → 87.4, penalty 3.51 → 7.54)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai

No issues found across 4 files

Copilot

Pull request overview

Upgrades the bundled MiniLM ONNX jailbreak/prompt-injection classifier artifacts to the full-aug-dojo-jailbreak-jbv2 variant, syncing model configuration files to match the new release and improve AgentShield benchmark scores.

Changes:

Updated tokenizer_config.json with additional padding/truncation fields and a new default max_length.
Updated config.json formatting and bumped the embedded transformers_version metadata.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

File	Description
`src/classifiers/models/minilm-full-aug/tokenizer_config.json`	Adds/updates tokenizer runtime defaults (padding/truncation/max length metadata).
`src/classifiers/models/minilm-full-aug/config.json`	Updates model configuration metadata (including `transformers_version`).

src/classifiers/models/minilm-full-aug/tokenizer_config.json

src/classifiers/models/minilm-full-aug/config.json

OMauriStkOne

LGTM

glebedel

LGTM

Copilot AI review requested due to automatic review settings March 24, 2026 15:32

Copilot started reviewing on behalf of hiskudin March 24, 2026 15:32

cubic-dev-ai bot reviewed Mar 24, 2026

hiskudin changed the title ~~fix: upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)~~ fix (ENG-12448): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8) Mar 24, 2026

hiskudin changed the title ~~fix (ENG-12448): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)~~ fix(ENG-12448): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8) Mar 24, 2026

chore(biome): add model files in biome ignore

8814745

Copilot AI reviewed Mar 24, 2026

src/classifiers/models/minilm-full-aug/tokenizer_config.json (resolved)

src/classifiers/models/minilm-full-aug/config.json (resolved)

chore(biome): add model files in biome ignore

0ff4bed

OMauriStkOne approved these changes Mar 24, 2026

glebedel approved these changes Mar 24, 2026

hiskudin merged commit 3061239 into main Mar 25, 2026
4 checks passed

stackone-devops-service-account mentioned this pull request Mar 25, 2026

chore(main): release defender 0.5.1 #26

Merged

hiskudin merged 3 commits into main from fix/update-model-jbv2

4 participants
