
Commit bac6017

Merge pull request #959 from OptimalScale/feature-readme
Change recommended evaluation repo in `README`

2 parents 89921d8 + bb7c6ea

1 file changed: README.md (1 addition, 15 deletions)
@@ -305,21 +305,7 @@ python ./examples/chatbot_gradio.py --deepspeed configs/ds_config_chatbot.json -
### Evaluation

-[LMFlow Benchmark](https://blog.gopenai.com/lmflow-benchmark-an-automatic-evaluation-framework-for-open-source-llms-ef5c6f142418) is an automatic evaluation framework for open-source large language models.
-We use negative log-likelihood (NLL) as the metric to evaluate different aspects of a language model: chit-chat, commonsense reasoning, and instruction-following abilities.
-
-You can directly run the LMFlow benchmark evaluation to obtain results and participate in the [LLM comparison](https://docs.google.com/spreadsheets/d/1JYh4_pxNzmNA9I0YM2epgRA7VXBIeIGS64gPJBg5NHA/edit?usp=sharing).
-For example, to run GPT2 XL, one may execute
-
-```sh
-bash ./scripts/run_benchmark.sh --model_name_or_path gpt2-xl
-```
-
-`--model_name_or_path` is required; you may fill in a Hugging Face model name or a local model path here.
-
-To check the evaluation results, you may check `benchmark.log` in `./output_dir/gpt2-xl_lmflow_chat_nll_eval`, `./output_dir/gpt2-xl_all_nll_eval`, and `./output_dir/gpt2-xl_commonsense_qa_eval`.
+We recommend using [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) for most evaluation purposes.

## Supported Features
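For readers migrating from the removed `run_benchmark.sh` workflow, a minimal sketch of evaluating the same `gpt2-xl` checkpoint with LM Evaluation Harness might look like the following; the `lm_eval` entry point, task names, and flags reflect recent releases of the harness and may differ in older versions, and the output path below is only an illustrative choice:

```sh
# Install a recent release of the harness (PyPI package: lm_eval).
pip install lm_eval

# Evaluate gpt2-xl on a couple of commonsense-reasoning tasks.
# --model hf accepts a Hugging Face model name or a local path,
# playing the role of the removed --model_name_or_path option.
lm_eval --model hf \
    --model_args pretrained=gpt2-xl \
    --tasks hellaswag,arc_easy \
    --batch_size 8 \
    --output_path ./output_dir/gpt2-xl_lm_eval
```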
