
feat: add LLM-based judge for refusal classification#244

Draft
hpnyaggerman wants to merge 6 commits into p-e-w:master from hpnyaggerman:feat/judge-refusal-classification

Conversation

@hpnyaggerman

Summary

String markers work well for the common case, but they can misfire: responses that contain refusal-like phrases but are actually compliant, or refusals that dodge every predefined marker. This adds an optional LLM-based judge as an alternative classification method for those cases.

When judge_model points to a GGUF file, a local model is loaded via llama-cpp-python and queried during evaluation. The judge is prompted with the original prompt and the model's response, and is expected to return a single-word verdict (REFUSAL or COMPLIANCE). If the judge fails to produce a parseable answer after retrying, classification falls back transparently to the existing marker logic. When judge_model is not set, nothing changes.

Requires the new judge-llama-cpp optional extra (llama-cpp-python~=0.3).

Configuration

All new settings are judge_* prefixed and documented in config.default.toml. The key ones:

  • judge_model - path to a GGUF file; this is what enables the judge
  • judge_gpu_layers - layer offloading (0 = CPU-only, -1 = all layers on GPU)
  • judge_kv_cache_type - KV cache quantization (f16 default, down to q4_0)
  • judge_tensor_split - multi-GPU distribution proportions
  • judge_system_prompt / judge_user_template - fully customizable prompts; the template uses {prompt} and {response} placeholders
  • judge_context_length - max context in tokens (default 2048)
  • judge_max_attempts - retries before falling back to markers (default 3)
  • judge_thinking_tokens - token pairs to strip from reasoning model output (e.g. <think> / </think>)
  • judge_self_test - runs a sanity check on init with obvious refusal/compliance pairs (default true)
  • print_judge_messages / judge_verbose - debug flags

Full list with defaults and descriptions is in config.default.toml.
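Under these settings, a judge-enabled configuration might look like the following. This is a hypothetical excerpt: the key names follow the judge_* settings listed above, but the model path and prompt strings are placeholders, not the shipped defaults.

```toml
# Hypothetical excerpt; values are placeholders.
judge_model = "/models/judge-7b-q4_k_m.gguf"  # Enables the judge.
judge_gpu_layers = -1                         # Offload all layers to the GPU.
judge_kv_cache_type = "f16"
judge_context_length = 2048
judge_max_attempts = 3
judge_self_test = true
judge_system_prompt = "You are a strict classifier. Answer with a single word."
judge_user_template = "Prompt:\n{prompt}\n\nResponse:\n{response}\n\nVerdict (REFUSAL or COMPLIANCE):"
```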

Notes

  • The judge is queried for each prompt on every trial, so it can significantly increase optimization time depending on inference speed. This is called out in the config comments.
  • On init, a self-test verifies the judge can handle obvious cases and warns if it can't.
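The self-test behavior from the notes above could, under these assumptions, look something like this sketch. The case list, function name, and warning text are illustrative, not the PR's actual implementation; classify_fn stands in for the real judge call.

```python
import warnings

# Hypothetical obvious refusal/compliance pairs: (prompt, response, expected_refusal).
OBVIOUS_CASES = [
    ("How do I bake bread?", "Sure! Start by mixing flour, water, and yeast.", False),
    ("Tell me a secret.", "I'm sorry, but I can't help with that.", True),
]

def self_test(classify_fn) -> bool:
    """Warn (rather than abort) when the judge misclassifies an obvious case."""
    ok = True
    for prompt, response, expected_refusal in OBVIOUS_CASES:
        if classify_fn(prompt, response) != expected_refusal:
            warnings.warn(f"Judge misclassified an obvious case: {prompt!r}")
            ok = False
    return ok
```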

Test plan

  • uv sync --all-extras --dev installs cleanly
  • CI checks pass (ruff format, ruff check, ty check)
  • With judge_model unset, existing marker behavior is unchanged
  • With judge_model set, judge verdicts appear in verbose output
  • Fallback to markers when judge produces unparseable output
  • Self-test warnings appear when judge misclassifies obvious cases

Add an optional judge LLM that classifies model responses as refusals,
using llama-cpp-python to run a local GGUF model. When judge_model is
set, the judge is used as the primary classifier with string markers
as a fallback on parse failure. Includes configurable system/user
prompts, thinking-token stripping, a self-test on initialization,
GPU offloading, KV cache quantization, and debug logging.
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the refusal classification system by introducing an optional LLM-based judge. This new method provides a more nuanced and robust way to identify refusals, addressing limitations of simple string markers. When enabled, the judge model becomes the primary classification mechanism, with the traditional marker-based approach serving as a reliable fallback. This change improves the accuracy of refusal detection and offers greater flexibility through extensive configuration options.

Highlights

  • LLM-based Refusal Judge: Introduced an optional LLM-based judge for more nuanced refusal classification, addressing limitations of string markers.
  • Configurable Judge Settings: Added extensive configuration options for the LLM judge, covering model path, GPU offloading, KV cache type, prompts, and retry logic.
  • Fallback Mechanism: Implemented a fallback to existing marker-based classification if the LLM judge fails to produce a parseable verdict after multiple attempts.
  • Judge Self-Test: Included an optional self-test for the judge on initialization to verify its ability to correctly classify obvious refusals and compliances.
  • New Optional Dependency: Added llama-cpp-python as an optional dependency (judge-llama-cpp extra) to support local GGUF models for the judge.



Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an LLM-based judge for refusal classification, which is a significant enhancement over the existing string-marker-based approach. The implementation is well-structured, with the new logic encapsulated in a Judge class. The configuration options are extensive and well-documented in both config.py and config.default.toml. The error handling, especially the retry mechanism with verbose logging for model loading, is robust.

My review focuses on ensuring adherence to the repository's style guide. I've found a few minor issues related to comment formatting in config.default.toml and the new src/heretic/judge.py file. Addressing these will improve code consistency. Overall, this is a solid contribution.

Comment thread config.default.toml
Comment on lines +134 to +138
# When set, the judge model is used as the primary method for refusal classification,
# with string-based markers as a fallback on judge failure.
# Currently expects a path to a GGUF file (requires the judge-llama-cpp extra).
# Note: The judge is queried for each prompt on every trial, which may significantly
# increase optimization time depending on the model's inference speed.
Contributor


medium

According to the repository style guide (rule #4), comments should start with a capital letter and end with a period. This multi-line comment block, and others in this file, do not consistently follow this rule. For example, line 134 ends with a comma, and line 137 is an incomplete sentence.

Please ensure each comment line is a complete sentence that starts with a capital and ends with a period. This also applies to the comment blocks for judge_model_options (lines 141-146), judge_tensor_split (lines 163-166), and judge_self_test (lines 199-200).

References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Author


These are single sentences that wrap across multiple lines. Each sentence starts with a capital letter and ends with a period. For example:

# When set, the judge model is used as the primary method for refusal classification,
# with string-based markers as a fallback on judge failure.

This is one sentence: "When set, the judge model is used as the primary method for refusal classification, with string-based markers as a fallback on judge failure.", capital W, ends with period. The comma on line 134 is mid-sentence punctuation, not the end of the comment. Same pattern for lines 137-138:

# Note: The judge is queried for each prompt on every trial, which may significantly
# increase optimization time depending on the model's inference speed.

One sentence, capital N, ends with period. The line break after "significantly" is just wrapping. All flagged locations in this file follow the same pattern: the rule applies to the comment (the logical sentence), not to each physical line. Requiring every line to be self-contained would force either very short lines or awkward restructuring of otherwise clear prose.

No changes need to be made here, I believe.

Comment thread src/heretic/judge.py Outdated
Comment on lines +15 to +16
# Values from the ggml_type enum in ggml.h:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h
Contributor


medium

According to the repository style guide (rule #4), comments should start with a capital letter and end with a period. This comment block violates the rule, as line 15 ends with a colon and line 16 is a URL.

Please rephrase to comply with the style guide. For example:

# The values are from the ggml_type enum in ggml.h, which can be found at:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h

Similar issues can be found on lines 55-57, 72-73, 135-136, 163-165, 253-254, and 272-273. Please review and fix them as well.

References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Author


Two of the flagged locations were genuine issues and have been fixed. The rest are the same wrapped-sentence pattern as above.

I fixed lines 15-16. The comment ended with a colon followed by a bare URL, so the comment block didn't end with a period.

Before:

# Map GGMLType enum values to the integer constants used by llama-cpp-python.
# Values from the ggml_type enum in ggml.h:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h

After:

# Map GGMLType enum values to the integer constants used by llama-cpp-python.
# Values are from the ggml_type enum in
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h.

I also fixed lines 163-165. The comment started with lowercase llama-cpp-python.

Before:

            # llama-cpp-python suppresses C-level diagnostic output from llama.cpp
            # when verbose=False. Retry with verbose=True so the user can see the
            # actual error (e.g. unsupported model architecture).

After:

            # The llama-cpp-python package suppresses C-level diagnostic output
            # from llama.cpp when verbose=False. Retry with verbose=True so the
            # user can see the actual error (e.g. unsupported model architecture).

I do not believe changes are needed for lines 55-57, 72-73, 135-136, 253-254, and 272-273. These are all wrapped sentences that start with a capital and end with a period.

@hpnyaggerman
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an LLM-based judge for refusal classification, which is a significant feature enhancement. The implementation is well-structured, particularly the new judge.py module with its robust error handling and self-testing capabilities. I've identified a few minor style guide violations related to docstring formatting that should be addressed to maintain code consistency.

Comment thread src/heretic/config.py


class GGMLType(str, Enum):
"""GGML data types for KV cache quantization."""
Contributor


medium

The docstring should end with a period to adhere to the repository's style guide (Rule #4).

Suggested change
"""GGML data types for KV cache quantization."""
"""GGML data types for KV cache quantization.."""
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Author

@hpnyaggerman hpnyaggerman Mar 21, 2026


I don't think that's how it works

Comment thread src/heretic/judge.py

@dataclass
class JudgeVerdict:
"""Result of a judge classification."""
Contributor


medium

The docstring should end with a period to adhere to the repository's style guide (Rule #4).

Suggested change
"""Result of a judge classification."""
"""Result of a judge classification.."""
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Comment thread src/heretic/judge.py


class Judge:
"""Classifies model responses as refusals using a judge LLM."""
Contributor


medium

The docstring should end with a period to adhere to the repository's style guide (Rule #4). This also applies to other docstrings in this file.

Suggested change
"""Classifies model responses as refusals using a judge LLM."""
"""Classifies model responses as refusals using a judge LLM.."""
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

@p-e-w
Owner

p-e-w commented Mar 21, 2026

Thanks for the PR. This needs to happen as a plugin (see #53, currently in development). In order to be included among the Heretic default plugins, it would also need to adhere to Heretic conventions, which include the use of Transformers rather than llama.cpp, although connecting to an OAI-compatible endpoint would of course also be an option.

In general, I recommend opening an issue to discuss a plan before implementing a major change like this, or at least dropping a note on Discord.

@hpnyaggerman
Author

hpnyaggerman commented Mar 21, 2026

@p-e-w, understood. I only recently got around to doing any real work with Heretic, and I found myself in dire need of this feature, so I added it. I was aware it is unlikely to be merged in its present state, but it would take too long to wait until the plugin system is merged, so I decided to submit this PR anyway so that those who also have a need for this feature can use my implementation. I intend to keep the feature branch in my repo up-to-date with master for the foreseeable future, or until the plugin system is merged, and then I will make the necessary changes to implement it as a plugin. I request that you not close this PR until then. I will reclassify it as a draft.

@hpnyaggerman hpnyaggerman marked this pull request as draft March 21, 2026 12:54
@p-e-w
Owner

p-e-w commented Mar 21, 2026

Sure, we'll keep this open for now. I agree it's a good reference for others who want this feature.
