**luxinyu1** requested changes on Jul 18, 2025
> In `eval_cot.yaml`, set `eval_type` to `cot_eval` for reasoning-model evaluation, `per_model_gpu` to define the number of GPUs per model worker, `dp_gpu` to configure data parallelism, `n_samples` to specify the number of samples in evaluation, and `max_length` to set the maximum response length.
>
> The results for AIME24 and AIME25 will be stored in `outputs/{data_name}/{model_name}_{split}_{prompt_type}_{num_test_sample}_seed{seed}_t{temperature}_s{start}_e{end}_{prompt_type}_metrics.json`.
**Collaborator:** All outputs should use `./outputs/{model_name}` as the base path.
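A minimal sketch of the `eval_cot.yaml` settings described above. The key names follow the description; all values are illustrative and should be adapted to the actual hardware and benchmark:

```yaml
eval_type: cot_eval    # reasoning-model (chain-of-thought) evaluation
per_model_gpu: 1       # GPUs per model worker
dp_gpu: 4              # data-parallel degree
n_samples: 16          # number of samples in evaluation
max_length: 32768      # maximum response length
```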
> ```
>     'ms_id': '',
>     'hf_id': '',
>     'local': './data/mbpp_plus/mbpp_plus.jsonl',
> }
> ```
> 2. The `score` method in the `MBPPEvaluator` class for MBPP evaluation is incompatible with the `evalplus` library. The call to `evalplus`'s `evaluate` function needs to change from `self.eval(flags)` to `self.eval(**flags)`; the flags `base_only`, `i_just_wanna_run`, and `mini` should be set to `False`; and `evaluate`, which originally returned nothing, should be modified to return the `pass_at_k` results.
**Collaborator:** Could a monkey patch be used instead? This description is rather vague.
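The reviewer's monkey-patch suggestion can be sketched as follows. This is illustrative only: `evalplus_evaluate` below is a hypothetical stand-in module, and the real `evalplus` entry point, flag handling, and return shape may differ.

```python
import types

# Stand-in for the evalplus.evaluate module (hypothetical; in the real
# codebase this would be `import evalplus.evaluate`).
evalplus_evaluate = types.ModuleType("evalplus.evaluate")

def patched_evaluate(**flags):
    """Accept keyword flags, pin the defaults the text requires, and
    return the pass@k results instead of nothing."""
    flags.setdefault("base_only", False)
    flags.setdefault("i_just_wanna_run", False)
    flags.setdefault("mini", False)
    # ... the original evaluation logic would run here ...
    pass_at_k = {"pass@1": 0.0}  # placeholder result
    return pass_at_k

# Install the patch once, before MBPPEvaluator.score is called, so the
# vendored evalplus source never needs to be edited in place.
evalplus_evaluate.evaluate = patched_evaluate
```

With a patch like this applied at import time, `MBPPEvaluator.score` can call `evaluate(**flags)` without the library itself being modified.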
> 3. If you need to evaluate a new model in LiveCodeBench, modify `AutoAlign/src/autoalign/eval/livecodebench/lcb_runner/lm_styles.py`:
>    - To add a model with a new LMStyle, first add the style to the `LMStyle` class, then add the model's `LanguageModel` entry to `LanguageModelList`.
>    - To add a model under an existing LMStyle, add the model's `LanguageModel` entry to `LanguageModelList` directly.
> 4. Because a reasoning model may perform differently under few-shot and zero-shot settings, the evaluation templates can be modified as follows:
>    For the AIME24 and AIME25 evaluation templates, modify `PROMPT_TEMPLATES` in `AutoAlign/src/autoalign/eval/math_eval/utils.py`.
**Collaborator:** There is not much room to change the AIME24/25 template; just fix it to "Please reason step by step, and put your final answer within \boxed{{}}."
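The registration procedure from step 3 can be sketched as below. The class and field names mirror the description above but are assumptions about `lm_styles.py`, not its actual contents, and the model names are placeholders:

```python
from dataclasses import dataclass
from enum import Enum

class LMStyle(Enum):
    OpenAIChat = "OpenAIChat"    # pre-existing style (illustrative)
    MyNewStyle = "MyNewStyle"    # step 1: add the new style here

@dataclass(frozen=True)
class LanguageModel:
    model_name: str      # e.g. the HF repo id
    model_repr: str      # short display name
    model_style: LMStyle

LanguageModelList = [
    LanguageModel("existing-org/existing-model", "Existing-Model", LMStyle.OpenAIChat),
    # step 2: register the model under the new (or an existing) style
    LanguageModel("my-org/my-new-model", "My-New-Model", LMStyle.MyNewStyle),
]
```

A model reusing an existing style skips the enum change and only appends its `LanguageModel` entry to the list.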
>    For the LiveCodeBench evaluation templates, modify the `format_prompt_generation` function in `AutoAlign/src/autoalign/eval/livecodebench/lcb_runner/prompts/code_generation.py`.
>    For benchmarks evaluated through OpenCompass, locate the dataset's Python config under `AutoAlign/opencompass/opencompass/configs/datasets` and modify the `template` in `{data_name}_infer_cfg`.
**Collaborator:** Could this follow @xudong2001's earlier approach and read the bos/eos tokens from our `conversation.py`?
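Pinning the AIME24/25 template the way the reviewer requests might look like the sketch below. The tuple layout of `PROMPT_TEMPLATES` entries is an assumption for illustration; only the instruction string itself comes from the comment above:

```python
# Fixed chain-of-thought instruction from the review comment; the doubled
# braces survive str.format() as a literal \boxed{}.
FIXED_COT_INSTRUCTION = (
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)

# Hypothetical entry layout: (user template, assistant template, separator).
PROMPT_TEMPLATES = {
    "cot": (
        "{input}\n" + FIXED_COT_INSTRUCTION,
        "{output}",
        "\n\n",
    ),
}

# Rendering a question with the fixed template:
prompt = PROMPT_TEMPLATES["cot"][0].format(input="What is 2+2?")
```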
**luxinyu1** reviewed on Jul 18, 2025
> Evaluation reference results:
>
> | Model | AIME24 | AIME25 | LCB | MBPP+ | GPQA | IFEval Prompt-level-strict-accuracy | IFEval Inst-level-strict-accuracy | IFEval Prompt-level-loose-accuracy | IFEval Inst-level-loose-accuracy |
> |--------------|--------|--------|------|-------|-------|-------|-------|-------|-------|
> | distill-1.5b | 27.3 | 21.6 | 17.2 | 1.06 | 27.27 | 26.99 | 41.13 | 28.10 | 42.81 |
**Collaborator:** "distill-1.5b" is too vague; use the official model name. Also, why is MBPP+ only 1.06?
Removed redundant files from `eval`.