
feat: add LLM-based judge for refusal classification#244

Draft
hpnyaggerman wants to merge 6 commits into p-e-w:master from hpnyaggerman:feat/judge-refusal-classification

Conversation

@hpnyaggerman

Summary

String markers work well for the common case, but they can misfire: responses that contain refusal-like phrases but are actually compliant, or refusals that dodge every predefined marker. This adds an optional LLM-based judge as an alternative classification method for those cases.

When judge_model points to a GGUF file, a local model is loaded via llama-cpp-python and queried during evaluation. The judge is prompted with the original prompt and the model's response, and is expected to return a single-word verdict (REFUSAL or COMPLIANCE). If the judge fails to produce a parseable answer after retrying, classification falls back transparently to the existing marker logic. When judge_model is not set, nothing changes.

Requires the new judge-llama-cpp optional extra (llama-cpp-python~=0.3).

Configuration

All new settings are judge_* prefixed and documented in config.default.toml. The key ones:

  • judge_model - path to a GGUF file; this is what enables the judge
  • judge_gpu_layers - layer offloading (0 = CPU-only, -1 = all layers on GPU)
  • judge_kv_cache_type - KV cache quantization (f16 default, down to q4_0)
  • judge_tensor_split - multi-GPU distribution proportions
  • judge_system_prompt / judge_user_template - fully customizable prompts; the template uses {prompt} and {response} placeholders
  • judge_context_length - max context in tokens (default 2048)
  • judge_max_attempts - retries before falling back to markers (default 3)
  • judge_thinking_tokens - token pairs to strip from reasoning model output (e.g. <think> / </think>)
  • judge_self_test - runs a sanity check on init with obvious refusal/compliance pairs (default true)
  • print_judge_messages / judge_verbose - debug flags

Full list with defaults and descriptions is in config.default.toml.
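Under these settings, a judge-enabled configuration might look like the following. This is a hypothetical excerpt: the key names follow the judge_* settings listed above, but the model path and prompt strings are placeholders, not the shipped defaults.

```toml
# Hypothetical excerpt; values are placeholders.
judge_model = "/models/judge-7b-q4_k_m.gguf"  # Enables the judge.
judge_gpu_layers = -1                         # Offload all layers to the GPU.
judge_kv_cache_type = "f16"
judge_context_length = 2048
judge_max_attempts = 3
judge_self_test = true
judge_system_prompt = "You are a strict classifier. Answer with a single word."
judge_user_template = "Prompt:\n{prompt}\n\nResponse:\n{response}\n\nVerdict (REFUSAL or COMPLIANCE):"
```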

Notes

  • The judge is queried for each prompt on every trial, so it can significantly increase optimization time depending on inference speed. This is called out in the config comments.
  • On init, a self-test verifies the judge can handle obvious cases and warns if it can't.
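The self-test behavior from the notes above could, under these assumptions, look something like this sketch. The case list, function name, and warning text are illustrative, not the PR's actual implementation; classify_fn stands in for the real judge call.

```python
import warnings

# Hypothetical obvious refusal/compliance pairs: (prompt, response, expected_refusal).
OBVIOUS_CASES = [
    ("How do I bake bread?", "Sure! Start by mixing flour, water, and yeast.", False),
    ("Tell me a secret.", "I'm sorry, but I can't help with that.", True),
]

def self_test(classify_fn) -> bool:
    """Warn (rather than abort) when the judge misclassifies an obvious case."""
    ok = True
    for prompt, response, expected_refusal in OBVIOUS_CASES:
        if classify_fn(prompt, response) != expected_refusal:
            warnings.warn(f"Judge misclassified an obvious case: {prompt!r}")
            ok = False
    return ok
```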

Test plan

  • uv sync --all-extras --dev installs cleanly
  • CI checks pass (ruff format, ruff check, ty check)
  • With judge_model unset, existing marker behavior is unchanged
  • With judge_model set, judge verdicts appear in verbose output
  • Fallback to markers when judge produces unparseable output
  • Self-test warnings appear when judge misclassifies obvious cases

Add an optional judge LLM that classifies model responses as refusals,
using llama-cpp-python to run a local GGUF model. When judge_model is
set, the judge is used as the primary classifier with string markers
as a fallback on parse failure. Includes configurable system/user
prompts, thinking-token stripping, a self-test on initialization,
GPU offloading, KV cache quantization, and debug logging.
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the refusal classification system by introducing an optional LLM-based judge. This new method provides a more nuanced and robust way to identify refusals, addressing limitations of simple string markers. When enabled, the judge model becomes the primary classification mechanism, with the traditional marker-based approach serving as a reliable fallback. This change improves the accuracy of refusal detection and offers greater flexibility through extensive configuration options.

Highlights

  • LLM-based Refusal Judge: Introduced an optional LLM-based judge for more nuanced refusal classification, addressing limitations of string markers.
  • Configurable Judge Settings: Added extensive configuration options for the LLM judge, covering model path, GPU offloading, KV cache type, prompts, and retry logic.
  • Fallback Mechanism: Implemented a fallback to existing marker-based classification if the LLM judge fails to produce a parseable verdict after multiple attempts.
  • Judge Self-Test: Included an optional self-test for the judge on initialization to verify its ability to correctly classify obvious refusals and compliances.
  • New Optional Dependency: Added llama-cpp-python as an optional dependency (judge-llama-cpp extra) to support local GGUF models for the judge.



Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an LLM-based judge for refusal classification, which is a significant enhancement over the existing string-marker-based approach. The implementation is well-structured, with the new logic encapsulated in a Judge class. The configuration options are extensive and well-documented in both config.py and config.default.toml. The error handling, especially the retry mechanism with verbose logging for model loading, is robust.

My review focuses on ensuring adherence to the repository's style guide. I've found a few minor issues related to comment formatting in config.default.toml and the new src/heretic/judge.py file. Addressing these will improve code consistency. Overall, this is a solid contribution.

Comment thread config.default.toml
Comment on lines +134 to +138
# When set, the judge model is used as the primary method for refusal classification,
# with string-based markers as a fallback on judge failure.
# Currently expects a path to a GGUF file (requires the judge-llama-cpp extra).
# Note: The judge is queried for each prompt on every trial, which may significantly
# increase optimization time depending on the model's inference speed.
Contributor


medium

According to the repository style guide (rule #4), comments should start with a capital letter and end with a period. This multi-line comment block, and others in this file, do not consistently follow this rule. For example, line 134 ends with a comma, and line 137 is an incomplete sentence.

Please ensure each comment line is a complete sentence that starts with a capital and ends with a period. This also applies to the comment blocks for judge_model_options (lines 141-146), judge_tensor_split (lines 163-166), and judge_self_test (lines 199-200).

References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Author


These are single sentences that wrap across multiple lines. Each sentence starts with a capital letter and ends with a period. For example:

# When set, the judge model is used as the primary method for refusal classification,
# with string-based markers as a fallback on judge failure.

This is one sentence: "When set, the judge model is used as the primary method for refusal classification, with string-based markers as a fallback on judge failure.", capital W, ends with period. The comma on line 134 is mid-sentence punctuation, not the end of the comment. Same pattern for lines 137-138:

# Note: The judge is queried for each prompt on every trial, which may significantly
# increase optimization time depending on the model's inference speed.

One sentence, capital N, ends with period. The line break after "significantly" is just wrapping. All flagged locations in this file follow the same pattern: the rule applies to the comment (the logical sentence), not to each physical line. Requiring every line to be self-contained would force either very short lines or awkward restructuring of otherwise clear prose.

No changes need to be made here, I believe.

Comment thread src/heretic/judge.py Outdated
Comment on lines +15 to +16
# Values from the ggml_type enum in ggml.h:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h
Contributor


medium

According to the repository style guide (rule #4), comments should start with a capital letter and end with a period. This comment block violates the rule, as line 15 ends with a colon and line 16 is a URL.

Please rephrase to comply with the style guide. For example:

# The values are from the ggml_type enum in ggml.h, which can be found at:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h

Similar issues can be found on lines 55-57, 72-73, 135-136, 163-165, 253-254, and 272-273. Please review and fix them as well.

References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Author


Two of the flagged locations were genuine issues and have been fixed. The rest are the same wrapped-sentence pattern as above.

I fixed lines 15-16. The comment ended with a colon followed by a bare URL, so the comment block didn't end with a period.

Before:

# Map GGMLType enum values to the integer constants used by llama-cpp-python.
# Values from the ggml_type enum in ggml.h:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h

After:

# Map GGMLType enum values to the integer constants used by llama-cpp-python.
# Values are from the ggml_type enum in
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h.

I also fixed lines 163-165. The comment started with lowercase llama-cpp-python.

Before:

            # llama-cpp-python suppresses C-level diagnostic output from llama.cpp
            # when verbose=False. Retry with verbose=True so the user can see the
            # actual error (e.g. unsupported model architecture).

After:

            # The llama-cpp-python package suppresses C-level diagnostic output
            # from llama.cpp when verbose=False. Retry with verbose=True so the
            # user can see the actual error (e.g. unsupported model architecture).

I do not believe changes are needed for lines 55-57, 72-73, 135-136, 253-254, and 272-273. These are all wrapped sentences that start with a capital and end with a period.

@hpnyaggerman
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an LLM-based judge for refusal classification, which is a significant feature enhancement. The implementation is well-structured, particularly the new judge.py module with its robust error handling and self-testing capabilities. I've identified a few minor style guide violations related to docstring formatting that should be addressed to maintain code consistency.

Comment thread src/heretic/config.py


class GGMLType(str, Enum):
"""GGML data types for KV cache quantization."""
Contributor


medium

The docstring should end with a period to adhere to the repository's style guide (Rule #4).

Suggested change
"""GGML data types for KV cache quantization."""
"""GGML data types for KV cache quantization.."""
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Author

@hpnyaggerman hpnyaggerman Mar 21, 2026


I don't think that's how it works

Comment thread src/heretic/judge.py

@dataclass
class JudgeVerdict:
"""Result of a judge classification."""
Contributor


medium

The docstring should end with a period to adhere to the repository's style guide (Rule #4).

Suggested change
"""Result of a judge classification."""
"""Result of a judge classification.."""
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

Comment thread src/heretic/judge.py


class Judge:
"""Classifies model responses as refusals using a judge LLM."""
Contributor


medium

The docstring should end with a period to adhere to the repository's style guide (Rule #4). This also applies to other docstrings in this file.

Suggested change
"""Classifies model responses as refusals using a judge LLM."""
"""Classifies model responses as refusals using a judge LLM.."""
References
  1. Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)

@p-e-w
Owner

p-e-w commented Mar 21, 2026

Thanks for the PR. This needs to happen as a plugin (see #53, currently in development). In order to be included among the Heretic default plugins, it would also need to adhere to Heretic conventions, which include the use of Transformers rather than llama.cpp, although connecting to an OAI-compatible endpoint would of course also be an option.

In general, I recommend opening an issue to discuss a plan before implementing a major change like this, or at least dropping a note on Discord.

@hpnyaggerman
Author

hpnyaggerman commented Mar 21, 2026

@p-e-w, understood. I only recently got around to doing any real work with Heretic, and I found myself in dire need of this feature, so I added it. I was aware it is unlikely to be merged in its present state, but it would take too long to wait until the plugin system is merged, so I decided to submit this PR anyway so that those who also have a need for this feature can use my implementation. I intend to keep the feature branch in my repo up-to-date with master for the foreseeable future, or until the plugin system is merged, and then I will make the necessary changes to implement it as a plugin. I request that you not close this PR until then. I will reclassify it as a draft.

@hpnyaggerman hpnyaggerman marked this pull request as draft March 21, 2026 12:54
@p-e-w
Owner

p-e-w commented Mar 21, 2026

Sure, we'll keep this open for now. I agree it's a good reference for others who want this feature.
