feat: add LLM-based judge for refusal classification #244
hpnyaggerman wants to merge 6 commits into p-e-w:master from
Conversation
Add an optional judge LLM that classifies model responses as refusals, using llama-cpp-python to run a local GGUF model. When judge_model is set, the judge is used as the primary classifier with string markers as a fallback on parse failure. Includes configurable system/user prompts, thinking-token stripping, a self-test on initialization, GPU offloading, KV cache quantization, and debug logging.
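As a rough illustration of the thinking-token stripping mentioned above: a judge model that emits a `<think>…</think>` reasoning block before its verdict could have that block removed before parsing. The tag name and function below are assumptions for illustration, not the PR's actual code.

```python
import re

# Hypothetical sketch: strip a <think>...</think> block (as emitted by some
# reasoning models) from the judge's raw output before parsing the verdict.
# re.DOTALL lets ".*?" match across newlines inside the block.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove a leading thinking block and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()
```

With this in place, `strip_thinking("<think>hmm...</think>\nREFUSAL")` yields just `"REFUSAL"`, ready for verdict parsing.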
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the refusal classification system by introducing an optional LLM-based judge. This new method provides a more nuanced and robust way to identify refusals, addressing limitations of simple string markers. When enabled, the judge model becomes the primary classification mechanism, with the traditional marker-based approach serving as a reliable fallback. This change improves the accuracy of refusal detection and offers greater flexibility through extensive configuration options.

Highlights
Code Review
This pull request introduces an LLM-based judge for refusal classification, which is a significant enhancement over the existing string-marker-based approach. The implementation is well-structured, with the new logic encapsulated in a Judge class. The configuration options are extensive and well-documented in both config.py and config.default.toml. The error handling, especially the retry mechanism with verbose logging for model loading, is robust.
My review focuses on ensuring adherence to the repository's style guide. I've found a few minor issues related to comment formatting in config.default.toml and the new src/heretic/judge.py file. Addressing these will improve code consistency. Overall, this is a solid contribution.
# When set, the judge model is used as the primary method for refusal classification,
# with string-based markers as a fallback on judge failure.
# Currently expects a path to a GGUF file (requires the judge-llama-cpp extra).
# Note: The judge is queried for each prompt on every trial, which may significantly
# increase optimization time depending on the model's inference speed.
According to the repository style guide (rule #4), comments should start with a capital letter and end with a period. This multi-line comment block, and others in this file, do not consistently follow this rule. For example, line 134 ends with a comma, and line 137 is an incomplete sentence.
Please ensure each comment line is a complete sentence that starts with a capital and ends with a period. This also applies to the comment blocks for judge_model_options (lines 141-146), judge_tensor_split (lines 163-166), and judge_self_test (lines 199-200).
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
These are single sentences that wrap across multiple lines. Each sentence starts with a capital letter and ends with a period. For example:
# When set, the judge model is used as the primary method for refusal classification,
# with string-based markers as a fallback on judge failure.

This is one sentence: "When set, the judge model is used as the primary method for refusal classification, with string-based markers as a fallback on judge failure.", capital W, ends with period. The comma on line 134 is mid-sentence punctuation, not the end of the comment. Same pattern for lines 137-138:
# Note: The judge is queried for each prompt on every trial, which may significantly
# increase optimization time depending on the model's inference speed.

One sentence, capital N, ends with period. The line break after "significantly" is just wrapping. All flagged locations in this file follow the same pattern: the rule applies to the comment (the logical sentence), not to each physical line. Requiring every line to be self-contained would force either very short lines or awkward restructuring of otherwise clear prose.
No changes need to be made here, I believe.
# Values from the ggml_type enum in ggml.h:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h
According to the repository style guide (rule #4), comments should start with a capital letter and end with a period. This comment block violates the rule, as line 15 ends with a colon and line 16 is a URL.
Please rephrase to comply with the style guide. For example:

# The values are from the ggml_type enum in ggml.h, which can be found at:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h

Similar issues can be found on lines 55-57, 72-73, 135-136, 163-165, 253-254, and 272-273. Please review and fix them as well.
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
Two of the flagged locations were genuine issues and have been fixed. The rest are the same wrapped-sentence pattern as above.
I fixed lines 15-16. The comment ended with a colon followed by a bare URL, so the comment block didn't end with a period.
Before:

# Map GGMLType enum values to the integer constants used by llama-cpp-python.
# Values from the ggml_type enum in ggml.h:
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h

After:

# Map GGMLType enum values to the integer constants used by llama-cpp-python.
# Values are from the ggml_type enum in
# https://github.com/ggml-org/ggml/blob/v0.9.8/include/ggml.h.

I also fixed lines 163-165. The comment started with lowercase "llama-cpp-python".
Before:

# llama-cpp-python suppresses C-level diagnostic output from llama.cpp
# when verbose=False. Retry with verbose=True so the user can see the
# actual error (e.g. unsupported model architecture).

After:

# The llama-cpp-python package suppresses C-level diagnostic output
# from llama.cpp when verbose=False. Retry with verbose=True so the
# user can see the actual error (e.g. unsupported model architecture).

I do not believe changes are needed for lines 55-57, 72-73, 135-136, 253-254, and 272-273. These are all wrapped sentences that start with a capital and end with a period.
/gemini review
Code Review
This pull request introduces an LLM-based judge for refusal classification, which is a significant feature enhancement. The implementation is well-structured, particularly the new judge.py module with its robust error handling and self-testing capabilities. I've identified a few minor style guide violations related to docstring formatting that should be addressed to maintain code consistency.
class GGMLType(str, Enum):
    """GGML data types for KV cache quantization."""
The docstring should end with a period to adhere to the repository's style guide (Rule #4).
| """GGML data types for KV cache quantization.""" | |
| """GGML data types for KV cache quantization..""" |
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
I don't think that's how it works
@dataclass
class JudgeVerdict:
    """Result of a judge classification."""
The docstring should end with a period to adhere to the repository's style guide (Rule #4).
| """Result of a judge classification.""" | |
| """Result of a judge classification..""" |
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
class Judge:
    """Classifies model responses as refusals using a judge LLM."""
The docstring should end with a period to adhere to the repository's style guide (Rule #4). This also applies to other docstrings in this file.
| """Classifies model responses as refusals using a judge LLM.""" | |
| """Classifies model responses as refusals using a judge LLM..""" |
References
- Comments should start with a capital letter and end with a period. They should use correct grammar and spelling. (link)
Thanks for the PR. This needs to happen as a plugin (see #53, currently in development). In order to be included among the Heretic default plugins, it would also need to adhere to Heretic conventions, which include the use of Transformers rather than llama.cpp, although connecting to an OAI-compatible endpoint would of course also be an option. In general, I recommend opening an issue to discuss a plan before implementing a major change like this, or at least dropping a note on Discord.
@p-e-w, understood. I only recently got around to doing any real work with Heretic, and I found myself in dire need of this feature, so I added it. I was aware it is unlikely to be merged in its present state, but it would take too long to wait until the plugin system is merged, so I decided to submit this PR anyway so that those who also have a need for this feature can use my implementation. I intend to keep the feature branch in my repo up-to-date with master for the foreseeable future, or until the plugin system is merged, and then I will make the necessary changes to implement it as a plugin. I request that you not close this PR until then. I will reclassify it as a draft.
Sure, we'll keep this open for now. I agree it's a good reference for others who want this feature. |
Force-pushed from 8c2b463 to fd5e1fb.
Summary
String markers work well for the common case, but they can misfire: responses that contain refusal-like phrases while actually complying, or refusals that avoid every predefined marker. This adds an optional LLM-based judge as an alternative classification method for those cases.
When judge_model points to a GGUF file, a local model is loaded via llama-cpp-python and queried during evaluation. The judge is prompted with the original prompt and the model's response, and is expected to produce a single-word verdict (REFUSAL or COMPLIANCE). If the judge fails to produce a parseable answer after retrying, classification falls through transparently to the existing marker logic. When judge_model is not set, nothing changes.
Requires the new judge-llama-cpp optional extra (llama-cpp-python~=0.3).
Configuration
All new settings are prefixed with judge_.
Full list with defaults and descriptions is in config.default.toml.
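For illustration only, enabling the judge might look like the fragment below. The option names judge_model and judge_self_test are taken from the discussion above, but the values and comments are assumptions; consult config.default.toml for the authoritative list and defaults.

```toml
# Hypothetical example; see config.default.toml for real defaults.
judge_model = "/models/judge-q4_k_m.gguf"  # Path to a local GGUF file.
judge_self_test = true                     # Run a sanity check on initialization.
```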
Notes
Test plan