Skip to content

Bug fix of specs/implementations in benchmark#67

Open
barabbs wants to merge 6 commits intotrishullab:mainfrom
barabbs:main
Open

Bug fix of specs/implementations in benchmark#67
barabbs wants to merge 6 commits intotrishullab:mainfrom
barabbs:main

Conversation

@barabbs
Copy link

@barabbs barabbs commented Mar 5, 2026

Bug fixes addressing errors listed in issue #65

In my tests, correct proofs were found for the following entries after fixing, so I am confident that the issues have been

sample_examples

human_eval

The human_eval entries problem_19.lean and problem_25.lean had no successful proof, but no system I tested was able to find counterexamples or proof of the negations after the fix, so the original issues should have been addressed at least.

I can provide more details for each of the fixes, if needed.

Copilot AI review requested due to automatic review settings March 5, 2026 15:58
@amit9oct
Copy link
Collaborator

amit9oct commented Mar 5, 2026

Bug fixes addressing errors listed in issue #65

In my tests, correct proofs were found for the following entries after fixing, so I am confident that the issues have been
..........

Thank you so much for your evaluations and possible fixes. I will go through your PR and will get the changes merged in a few days.

Since you mentioned finding some proofs, let me know if you are interested in being added to our leaderboard here: https://trishullab.github.io/clever/leaderboard.html

Thank you so much for putting in the hard work. I will go through your PR ASAP.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes multiple benchmark entries where the written specs didn’t match the intended behavior / reference implementations (per issue #65), improving solvability and correctness of the Lean4 benchmark problems.

Changes:

  • Adjusts several specs to align with intended semantics (e.g., prime checks require 2 ≤ n, palindrome-change spec, closest-pair duplicate handling, empty-substring case).
  • Fixes/updates a few implementations to match the corrected specs (notably rolling max initialization, factorization loop, polynomial root finding).
  • Updates minor API usages and indexing forms (e.g., Nat.digits, list indexing notation).

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/lean4/sample_examples/problem_0.lean Updates paren-balance implementation to detect excess closing parens via Option Nat.
src/lean4/human_eval/problem_1.lean Adds precondition requiring filtered input parentheses to be balanced before specifying behavior.
src/lean4/human_eval/problem_6.lean Aligns result type with count_max_paren_depth by switching from List Int to List Nat.
src/lean4/human_eval/problem_9.lean Fixes rolling-max initialization for non-empty lists.
src/lean4/human_eval/problem_13.lean Tightens GCD-like spec (nonnegative result + divisibility characterization).
src/lean4/human_eval/problem_18.lean Refines substring-occurrence spec (including empty-substring case) and occurrence definition.
src/lean4/human_eval/problem_19.lean Fixes sortedness helper initial flag; minor list-head access change.
src/lean4/human_eval/problem_20.lean Refines “closest elements” spec for duplicates and distinct-pair minimization; indexing cleanup.
src/lean4/human_eval/problem_25.lean Fixes factorization loop bounds/state and appends remaining prime factor.
src/lean4/human_eval/problem_31.lean Aligns primality spec with Nat.Prime convention (2 ≤ n ∧ ...).
src/lean4/human_eval/problem_32.lean Replaces non-robust Newton iteration with bounded bisection approach.
src/lean4/human_eval/problem_36.lean Corrects off-by-one references in spec (n-1 vs n).
src/lean4/human_eval/problem_73.lean Rewrites spec to characterize minimum changes needed to reach a palindrome.
src/lean4/human_eval/problem_78.lean Adjusts spec to count prime hex digits (1 per digit) and constrains to valid hex range.
src/lean4/human_eval/problem_82.lean Aligns prime-length spec with Nat.Prime convention (2 ≤ n ∧ ...).
src/lean4/human_eval/problem_101.lean Fixes word-splitting implementation (comma normalization) and strengthens spec invariants.
src/lean4/human_eval/problem_108.lean Updates digit-sum helper to use Nat.digits.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Collaborator

@amit9oct amit9oct left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed some problems (problem_{1,101, 108,13,18,19,20}.lean in human_eval). Will review more problems in a few days. Once again, thank you for raising this PR, there are some good catches, but there are some changes that don't lead to correct specification (I have provided details in my comments).

Copy link
Collaborator

@amit9oct amit9oct left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@barabbs Thank you very much for the contribution! I’ve reviewed the specifications, and most of the changes look correct. The only issue I noticed is in src/lean4/human_eval/problem_73.lean.

We really appreciate the effort from the community in helping maintain and improve the benchmark. If you have run any evaluations with your approach or models, please feel free to share them—we would be happy to update the leaderboard with your results.

-- on the test case [-6, 11, -6, 1] as the derivative of the polynomial
-- can be zero at if the guess is 0. In such cases, the implementation will not
-- converge. The implementation can be improved by using a better guess.
let lc := |xs.getLast!|;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this has the same caveat of not converging if the bound is not found. I don't think there is a clean solution for this with a proof that either the solution is found or we run out of fuel steps. If one wants, they can bake in fuel in the specification itself.

Copy link
Author

@barabbs barabbs Mar 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fix actually works:

  • findBound always succeeds because it starts from the Cauchy root bound (1 + Σ|aᵢ|/|aₙ|), which is a proven upper bound on all root magnitudes. For odd-degree polynomials with non-zero leading coefficient, the polynomial is guaranteed to have opposite signs at ±bound, so the sign-change check passes on the very first iteration.
  • Bisection fuel is coefficient-dependent, not fixed. The fuel scales as 1000000 + coeffBits * (xs.length + 1) * 40, where coeffBits is the total bit-size of all coefficients. The number of bisection steps needed for convergence is O(n · log(max|aᵢ|)), which this formula always exceeds. Since any concrete List Rat has finite bit-size coefficients, the fuel is always sufficient.

Also, my confidence in this is quite high due to having a working proof of the correctness theorem for this implementation

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense since this is an odd-degree polynomial, so at least one real root is guaranteed. Given the dependence of fuel on the coefficient, this should work.

I would love to see the proof, if possible please DM me that. If you have such proofs, I think you should definitely share your results. I would love to update our leaderboard with your results and experiments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is possibly one of the hardest problems in the benchmark.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not replying to you earlier on the matter of the leaderboard: I currently have experimented only on the proof-gen task, and with a custom setup different from the one you provided. I imagine this isn't very suitable to compete on the leaderboard.

I am however in the process of extending my tests to the other 3 tasks, and porting everything to your framework. I'll be sure to send everything your way as soon as I have it working!

[if you are still curious about this particular proof or any other I might have, please let me know and if I can I'll DM them to you :)]

Copy link
Collaborator

@amit9oct amit9oct left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@barabbs thank you so much for your contribution and interest in our benchmark. Most of your specification changes are correct except for problem 101. Please do those minor changes as discussed, and I will approve the PR.

Again, thank you so much for your fixes. You mentioned finding some hard proofs. Please DM me your results. I would love to update our leaderboard here: https://trishullab.github.io/clever/leaderboard.html

-- on the test case [-6, 11, -6, 1] as the derivative of the polynomial
-- can be zero at if the guess is 0. In such cases, the implementation will not
-- converge. The implementation can be improved by using a better guess.
let lc := |xs.getLast!|;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense since this is an odd-degree polynomial, so at least one real root is guaranteed. Given the dependence of fuel on the coefficient, this should work.

I would love to see the proof, if possible please DM me that. If you have such proofs, I think you should definitely share your results. I would love to update our leaderboard with your results and experiments.

-- on the test case [-6, 11, -6, 1] as the derivative of the polynomial
-- can be zero at if the guess is 0. In such cases, the implementation will not
-- converge. The implementation can be improved by using a better guess.
let lc := |xs.getLast!|;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is possibly one of the hardest problems in the benchmark.

Copy link
Collaborator

@amit9oct amit9oct left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@barabbs : Thank you so much for all the good work. I have approved your PR. Feel free to merge it.

@barabbs
Copy link
Author

barabbs commented Mar 18, 2026

@amit9oct It says that there's still 1 workflow awaiting approval, dosen't allow me to merge yet

@amit9oct
Copy link
Collaborator

@amit9oct It says that there's still 1 workflow awaiting approval, dosen't allow me to merge yet

Approved the workflow. Once this gets merged I will update the PyPi package version and do a new release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants