- `--batch_size` works for both open-source and API model evaluation. When evaluating open-source models, adjust `--batch_size` according to the available GPU memory; when evaluating API models, `--batch_size` specifies the number of parallel calls to the target API model. Set it according to your OpenAI user tier to avoid rate limits.
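For API models, the parallel-call behavior of `--batch_size` can be sketched with a thread pool. This is a generic illustration, not MixEval's actual implementation; `call_api` is a hypothetical stand-in for the real request to the target model.

```python
from concurrent.futures import ThreadPoolExecutor

def call_api(prompt):
    # Hypothetical stand-in for a call to the target API model.
    return f"response to: {prompt}"

def batched_inference(prompts, batch_size):
    # Issue up to batch_size calls in parallel, which is what
    # --batch_size controls when evaluating API models.
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return list(pool.map(call_api, prompts))
```

A higher `batch_size` here means more in-flight requests at once, which is exactly why it must stay below your provider's rate limit.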
- `--api_parallel_num` specifies the number of parallel calls to the model parser API. In general, if you are a Tier-5 user, you can set `--api_parallel_num` to 100 or more to parse results in 30 seconds.
- You can use `--max_gpu_memory` to specify the maximum memory per GPU for storing model weights. This leaves more memory for activations, so you can use longer context lengths or a larger `batch_size`. E.g., with 4 GPUs, we can set `--max_gpu_memory 5GiB` for `gemma_11_7b_instruct`.
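A per-GPU weight cap like this is commonly expressed as a `max_memory` dict passed to Hugging Face's `from_pretrained` together with `device_map="auto"`. The sketch below assumes that wiring; MixEval's internal handling may differ.

```python
def build_max_memory(num_gpus, per_gpu="5GiB"):
    # Map each GPU index to its weight-storage cap, e.g. {0: "5GiB", ...}.
    # The uncapped remainder of each GPU stays free for activations.
    return {i: per_gpu for i in range(num_gpus)}

max_memory = build_max_memory(4)
# Assumed usage with transformers (model id is an assumption):
# model = AutoModelForCausalLM.from_pretrained(
#     "google/gemma-1.1-7b-it", device_map="auto", max_memory=max_memory)
```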
- Model response files and scores will be saved to `<output_folder>/<model_name>/<benchmark>/<version>/`, for example, `mix_eval/data/model_responses/gemma_11_7b_instruct/mixeval_hard/2024-06-01/`. We take the **overall score** as the reported score in the Leaderboard.
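The output path is assembled from those four components; a small helper reproducing the example path (illustrative only, not MixEval's code):

```python
import os

def response_dir(output_folder, model_name, benchmark, version):
    # Builds <output_folder>/<model_name>/<benchmark>/<version>
    return os.path.join(output_folder, model_name, benchmark, version)

path = response_dir("mix_eval/data/model_responses",
                    "gemma_11_7b_instruct", "mixeval_hard", "2024-06-01")
```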
- There is a resuming mechanism: if you run an evaluation with the same config as a previous run, it will resume from where that run stopped.
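Conceptually, resuming works by checking which samples already have saved responses and skipping them. The sketch below illustrates that idea; the actual file format and record keys in MixEval are assumptions.

```python
import json
import os

def load_completed_ids(response_file):
    # Collect ids of samples already answered in a JSON-lines response
    # file, so a rerun with the same config can skip them.
    done = set()
    if os.path.exists(response_file):
        with open(response_file) as f:
            for line in f:
                done.add(json.loads(line)["id"])
    return done
```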
- If you are evaluating base models, set the `--extract_base_model_response` flag to retain only the meaningful part of the model's response during parsing, which yields more stable parsing results.
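Conceptually, such extraction cuts the base model's free-running continuation at the first stop marker, so the parser sees only the answer rather than further self-generated Q/A pairs. A simplified sketch; the actual markers and logic in MixEval are assumptions here.

```python
def extract_meaningful_part(text, stop_markers=("\n\n", "Question:")):
    # Keep only the text before the earliest stop marker; base models
    # often continue generating new questions after answering.
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()
```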
- If you are evaluating API models, you should add a line to `.env`. E.g., for an OpenAI key, you should add: `k_oai=<your openai api key>`. The key name here is `k_oai`. You can find the key name in the model's class. For example, `claude_3_haiku`'s key can be found in `mixeval.models.claude_3_haiku`'s `__init__` function: `api_key=os.getenv('k_ant')`, where `k_ant` is the key name.
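Putting it together for the OpenAI case: the `.env` line and the lookup the model class performs. Only the `os.getenv('k_oai')`/`os.getenv('k_ant')` pattern comes from the source; the helper below is an illustrative wrapper.

```python
import os

# .env line (from the docs above):
#   k_oai=<your openai api key>

def get_api_key(key_name):
    # Mirrors the api_key = os.getenv('k_oai') pattern used in the
    # model classes; fail loudly if the .env entry is missing.
    key = os.getenv(key_name)
    if key is None:
        raise RuntimeError(f"{key_name} not found; add it to .env")
    return key
```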