Merged
7 changes: 5 additions & 2 deletions .github/workflows/run-eval.yaml
@@ -71,11 +71,14 @@ jobs:
env:
MODEL_NAME: lfm-3b
MODEL_URL: ${{ vars.MODEL_URL }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
MODEL_API_KEY: ${{ secrets.MODEL_API_KEY }}
run: |
# let the model judge itself against the GPT-4 answers
bin/api/run_openai_judge.sh \
--model-name "$MODEL_NAME" \
--openai-api-key "$OPENAI_API_KEY" \
--judge-model-name "lfm-7b" \
--judge-model-url "$MODEL_URL" \
--judge-model-api-key "$MODEL_API_KEY" \
--parallel 3

- name: Process Judge Results
49 changes: 36 additions & 13 deletions README.md
@@ -21,17 +21,21 @@ bin/api/run_docker_eval.sh generate \

Results will be output in `llm_judge/data/japanese_mt_bench/model_answer/<model-name>.jsonl`

2. Run OpenAI judge:
2. Run judge:

The judge script will use the judge model to compare [GPT-4 results](llm_judge/data/mt_bench/reference_answer/gpt-4.jsonl) with the model results. The judge model defaults to GPT-4.

```bash
bin/api/run_docker_eval.sh judge \
--model-name <model-name> \
--openai-api-key <openai-api-key>
--judge-model-name <judge-model-name> \
--judge-model-url <judge-model-url> \
--judge-model-api-key <judge-model-api-key>
```

GPT judge results will be output to `llm_judge/data/japanese_mt_bench/model_judgment/gpt-4_<model-name>.jsonl`.
Judge results will be output to `llm_judge/data/japanese_mt_bench/model_judgment/<judge-model-name>_<model-name>.jsonl`.

The final scores will be output in `llm_judge/data/japanese_mt_bench/gpt4-score-<model-name>.json`.
The final scores will be output in `llm_judge/data/japanese_mt_bench/<judge-model-name>-score-<model-name>.json`.
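
The naming pattern above can be sketched as two small helpers. This is an illustrative sketch only, not code from the repository: `judgment_path`, `score_path`, and `BENCH_DIR` are hypothetical names introduced here.

```bash
#!/bin/bash
# Hypothetical helpers (not part of the repository) that mirror the
# output-path pattern described in the README.

BENCH_DIR="llm_judge/data/japanese_mt_bench"

# Path of the per-question judgments written by the judge step.
judgment_path() {
  local judge_model="$1" model="$2"
  printf '%s/model_judgment/%s_%s.jsonl\n' "$BENCH_DIR" "$judge_model" "$model"
}

# Path of the aggregated score file.
score_path() {
  local judge_model="$1" model="$2"
  printf '%s/%s-score-%s.json\n' "$BENCH_DIR" "$judge_model" "$model"
}

judgment_path gpt-4o lfm-3b-jp
score_path gpt-4o lfm-3b-jp
```

Note that with the default judge `gpt-4`, this pattern yields `gpt-4-score-<model-name>.json`, whereas the previously hard-coded name was `gpt4-score-<model-name>.json`. Anything that reads the old file name would need to be updated.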

### Examples

@@ -45,7 +49,9 @@ bin/api/run_docker_eval.sh generate \

bin/api/run_docker_eval.sh judge \
--model-name lfm-3b-jp \
--openai-api-key <OPENAI-API-KEY>
--judge-model-name gpt-4o \
--judge-model-url https://api.openai.com/v1 \
--judge-model-api-key <OPENAI-API-KEY>
```

Run eval for `lfm-3b-ichikara` on-prem:
@@ -71,7 +77,9 @@ bin/api/run_docker_eval.sh generate \

bin/api/run_docker_eval.sh judge \
--model-name lfm-3b-jp \
--openai-api-key <OPENAI-API-KEY>
--judge-model-name gpt-4o \
--judge-model-url https://api.openai.com/v1 \
--judge-model-api-key <OPENAI-API-KEY>
```

## Run Evaluation without Docker
@@ -111,16 +119,29 @@ Results will be output in `llm_judge/data/japanese_mt_bench/model_answer/<model-
2. Run the following scripts to generate GPT-4 judgement scores for the model answers.

```bash
bin/api/run_openai_judge.sh --model-name <model-name> --openai-api-key <OPENAI-API-KEY>
bin/api/run_openai_judge.sh \
--model-name <model-name> \
--judge-model-name <judge-model-name> \
--judge-model-url <judge-model-url> \
--judge-model-api-key <judge-model-api-key>

# examples:
bin/api/run_openai_judge.sh --model-name lfm-3b-jp --openai-api-key <OPENAI-API-KEY>
bin/api/run_openai_judge.sh --model-name lfm-3b-ichikara --openai-api-key <OPENAI-API-KEY>
bin/api/run_openai_judge.sh \
--model-name lfm-3b-jp \
--judge-model-name gpt-4o \
--judge-model-url https://api.openai.com/v1 \
--judge-model-api-key <OPENAI-API-KEY>

bin/api/run_openai_judge.sh \
--model-name lfm-3b-ichikara \
--judge-model-name gpt-4o \
--judge-model-url https://api.openai.com/v1 \
--judge-model-api-key <OPENAI-API-KEY>
```

GPT judge results will be output to `llm_judge/data/japanese_mt_bench/model_judgment/gpt-4_<model-name>.jsonl`.
Judge results will be output to `llm_judge/data/japanese_mt_bench/model_judgment/<judge-model-name>_<model-name>.jsonl`.

The final scores will be output in `llm_judge/data/japanese_mt_bench/gpt4-score-<model-name>.json`.
The final scores will be output in `llm_judge/data/japanese_mt_bench/<judge-model-name>-score-<model-name>.json`.

</details>

@@ -148,8 +169,10 @@ This applies to both `bin/api/run_docker_eval.sh judge` and `bin/api/run_openai_

| Argument | Description | Required |
| --- | --- | --- |
| `--model-name` | Model name | Yes |
| `--openai-api-key` | OpenAI API key | Yes |
| `--model-name` | Model name to be evaluated | Yes |
| `--judge-model-name` | Name of the judge model (default: gpt-4) | No |
| `--judge-model-url` | Base URL for the judge model API | Yes |
| `--judge-model-api-key` | API key for the judge model | Yes |
| `--parallel` | Number of parallel API calls | No. Default to 5. |

</details>
22 changes: 20 additions & 2 deletions bin/api/entrypoint.sh
@@ -47,6 +47,20 @@ elif [[ "$MODE" == "judge" ]]; then
# Extract arguments for judge mode
PARALLEL="5"
CI="false"
JUDGE_MODEL_NAME=${JUDGE_MODEL_NAME:-"gpt-4"}
JUDGE_MODEL_URL=${JUDGE_MODEL_URL:-""}
JUDGE_MODEL_API_KEY=${JUDGE_MODEL_API_KEY:-""}

# Ensure required parameters are set
if [[ -z "$JUDGE_MODEL_API_KEY" ]]; then
echo "Error: JUDGE_MODEL_API_KEY environment variable is required"
exit 1
fi

if [[ -z "$JUDGE_MODEL_URL" ]]; then
echo "Error: JUDGE_MODEL_URL environment variable is required"
exit 1
fi

while [[ $# -gt 0 ]]; do
case $1 in
@@ -67,14 +81,18 @@ elif [[ "$MODE" == "judge" ]]; then
# Generate judgments
python llm_judge/gen_judgment.py \
--model-list "$MODEL_NAME" \
--judge-model-name "$JUDGE_MODEL_NAME" \
--judge-model-url "$JUDGE_MODEL_URL" \
--judge-model-api-key "$JUDGE_MODEL_API_KEY" \
--parallel "$PARALLEL" \
--bench-name japanese_mt_bench

# Show results
python llm_judge/show_result.py \
--model-list "$MODEL_NAME" \
--judge-model-name "$JUDGE_MODEL_NAME" \
--ci "$CI" \
--bench-name japanese_mt_bench \
--input-file llm_judge/data/japanese_mt_bench/model_judgment/gpt-4_$MODEL_NAME.jsonl \
--output llm_judge/data/japanese_mt_bench/gpt4-score-$MODEL_NAME.json
--input-file "llm_judge/data/japanese_mt_bench/model_judgment/${JUDGE_MODEL_NAME}_$MODEL_NAME.jsonl" \
--output "llm_judge/data/japanese_mt_bench/${JUDGE_MODEL_NAME}-score-$MODEL_NAME.json"
fi
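
The two required-variable checks in the hunk above follow the same shape. A minimal sketch of how they could be folded into one helper is shown below, assuming bash indirect expansion is acceptable in this script. `require_env` is a hypothetical name and is not part of this change, and the values in the usage example are placeholders.

```bash
#!/bin/bash
# Hypothetical helper (not in the repository): returns non-zero and
# reports the first variable in the argument list that is unset or empty.
require_env() {
  local name
  for name in "$@"; do
    # ${!name} is bash indirect expansion: the value of the variable
    # whose name is stored in $name.
    if [[ -z "${!name:-}" ]]; then
      echo "Error: $name environment variable is required" >&2
      return 1
    fi
  done
}

# Example usage with placeholder values.
JUDGE_MODEL_URL="https://api.openai.com/v1"
JUDGE_MODEL_API_KEY="placeholder-key"
require_env JUDGE_MODEL_API_KEY JUDGE_MODEL_URL && echo "judge configuration looks complete"
```

In the entrypoint, a caller would follow a failed check with `exit 1`, as the explicit `if` blocks do.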
42 changes: 32 additions & 10 deletions bin/api/run_docker_eval.sh
@@ -22,10 +22,12 @@ print_usage() {
echo " --question-count Number of questions to evaluate (optional)"
echo
echo "Judge mode options:"
echo " --model-name Name of the model to evaluate"
echo " --openai-api-key OpenAI API key for GPT-4 judgment"
echo " --parallel Number of parallel processes (default: 5)"
echo " --ci CI mode (default: false)"
echo " --model-name Name of the model to evaluate"
echo " --judge-model-name Name of the judge model (default: gpt-4)"
echo " --judge-model-url Base URL for the judge model API"
echo " --judge-model-api-key API key for the judge model"
echo " --parallel Number of parallel processes (default: 5)"
echo " --ci CI mode (default: false)"
}

if [ $# -lt 1 ]; then
@@ -106,7 +108,9 @@ if [[ "$MODE" == "generate" ]]; then
elif [[ "$MODE" == "judge" ]]; then
# Process judge mode arguments
MODEL_NAME=""
OPENAI_API_KEY=""
JUDGE_MODEL_NAME="gpt-4"
JUDGE_MODEL_URL=""
JUDGE_MODEL_API_KEY=""
PARALLEL="5"
CI="false"

@@ -116,8 +120,17 @@ elif [[ "$MODE" == "judge" ]]; then
MODEL_NAME="$2"
shift 2
;;
--openai-api-key)
OPENAI_API_KEY="$2"

--judge-model-name)
JUDGE_MODEL_NAME="$2"
shift 2
;;
--judge-model-url)
JUDGE_MODEL_URL="$2"
shift 2
;;
--judge-model-api-key)
JUDGE_MODEL_API_KEY="$2"
shift 2
;;
--parallel)
@@ -142,8 +155,15 @@ elif [[ "$MODE" == "judge" ]]; then
exit 1
fi

if [[ -z "$OPENAI_API_KEY" ]]; then
echo "Error: --openai-api-key is required"
# Validate required parameters
if [[ -z "$JUDGE_MODEL_API_KEY" ]]; then
echo "Error: --judge-model-api-key is required"
print_usage
exit 1
fi

if [[ -z "$JUDGE_MODEL_URL" ]]; then
echo "Error: --judge-model-url is required"
print_usage
exit 1
fi
@@ -152,7 +172,9 @@ elif [[ "$MODE" == "judge" ]]; then
docker run --rm -it \
--network="host" \
-e MODEL_NAME="$MODEL_NAME" \
-e OPENAI_API_KEY="$OPENAI_API_KEY" \
-e JUDGE_MODEL_NAME="$JUDGE_MODEL_NAME" \
-e JUDGE_MODEL_URL="$JUDGE_MODEL_URL" \
-e JUDGE_MODEL_API_KEY="$JUDGE_MODEL_API_KEY" \
-v "$(pwd)/llm_judge:/app/llm_judge" \
liquidai/mt-bench:latest judge \
--parallel "$PARALLEL" \
53 changes: 37 additions & 16 deletions bin/api/run_openai_judge.sh
@@ -1,29 +1,43 @@
#!/bin/bash

print_usage() {
echo "Usage: $0 --openai-api-key <api_key> --model-name <model_name> --parallel <parallel>"
echo "Usage: $0 --model-name <model_name> [--judge-model-name <judge_model_name>] [--judge-model-url <url>] --judge-model-api-key <api_key> [--parallel <parallel>]"
echo
echo "Arguments:"
echo " --openai-api-key OpenAI API key"
echo " --model-name Model name"
echo " --parallel Number of parallel processes"
echo " --model-name Model name to be evaluated (required)"
echo " --judge-model-name Name of the judge model (default: gpt-4)"
echo " --judge-model-url Base URL for the judge model API (default: https://api.openai.com/v1)"
echo " --judge-model-api-key API key for the judge model (required)"
echo " --parallel Number of parallel processes (default: 5)"
echo " --ci CI mode (default: false)"
}

OPENAI_API_KEY=""
MODEL_NAME=""
JUDGE_MODEL_NAME="gpt-4"
JUDGE_MODEL_URL=""
JUDGE_MODEL_API_KEY=""
PARALLEL="5"
CI="false"

while [[ $# -gt 0 ]]; do
case $1 in
--openai-api-key)
OPENAI_API_KEY="$2"
shift 2
;;

--model-name)
MODEL_NAME="$2"
shift 2
;;
--judge-model-name)
JUDGE_MODEL_NAME="$2"
shift 2
;;
--judge-model-url)
JUDGE_MODEL_URL="$2"
shift 2
;;
--judge-model-api-key)
JUDGE_MODEL_API_KEY="$2"
shift 2
;;
--parallel)
PARALLEL="$2"
shift 2
@@ -40,28 +54,35 @@ while [[ $# -gt 0 ]]; do
esac
done

if [[ -z "$OPENAI_API_KEY" ]]; then
echo "Error: --openai-api-key is required"
# Validate required parameters
if [[ -z "$MODEL_NAME" ]]; then
echo "Error: --model-name is required"
print_usage
exit 1
fi

if [[ -z "$MODEL_NAME" ]]; then
echo "Error: --model-name is required"
if [[ -z "$JUDGE_MODEL_API_KEY" ]]; then
echo "Error: --judge-model-api-key is required"
print_usage
exit 1
fi

export OPENAI_API_KEY="$OPENAI_API_KEY"
export JUDGE_MODEL_NAME="$JUDGE_MODEL_NAME"
export JUDGE_MODEL_URL="$JUDGE_MODEL_URL"
export JUDGE_MODEL_API_KEY="$JUDGE_MODEL_API_KEY"
export PYTHONPATH=.

python llm_judge/gen_judgment.py \
--model-list "$MODEL_NAME" \
--judge-model-name "$JUDGE_MODEL_NAME" \
--judge-model-url "$JUDGE_MODEL_URL" \
--judge-model-api-key "$JUDGE_MODEL_API_KEY" \
--parallel "$PARALLEL" \
--bench-name japanese_mt_bench

python llm_judge/show_result.py --model-list "$MODEL_NAME" \
--judge-model-name "$JUDGE_MODEL_NAME" \
--ci "$CI" \
--bench-name japanese_mt_bench \
--input-file llm_judge/data/japanese_mt_bench/model_judgment/gpt-4_$MODEL_NAME.jsonl \
--output llm_judge/data/japanese_mt_bench/gpt4-score-$MODEL_NAME.json
--input-file "llm_judge/data/japanese_mt_bench/model_judgment/${JUDGE_MODEL_NAME}_$MODEL_NAME.jsonl" \
--output "llm_judge/data/japanese_mt_bench/${JUDGE_MODEL_NAME}-score-$MODEL_NAME.json"
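
The usage text in this script lists `https://api.openai.com/v1` as the default for `--judge-model-url`, while `JUDGE_MODEL_URL` is initialised to an empty string and is not validated here. If the documented default is the intended behaviour (an assumption, since the diff does not say), a fallback could be sketched as follows. `resolve_judge_url` and `DEFAULT_JUDGE_MODEL_URL` are hypothetical names, and the localhost URL is a placeholder.

```bash
#!/bin/bash
# Assumption: the default URL from the usage text is the intended fallback.
DEFAULT_JUDGE_MODEL_URL="https://api.openai.com/v1"

# Hypothetical helper: echo the given URL, or the default when it is empty.
resolve_judge_url() {
  local url="${1:-}"
  echo "${url:-$DEFAULT_JUDGE_MODEL_URL}"
}

resolve_judge_url ""                          # falls back to the default
resolve_judge_url "http://localhost:8000/v1"  # an explicit value wins
```

The alternative is to treat the URL as required, as `run_docker_eval.sh` and the README argument table do, and to drop the default from the usage text.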
2 changes: 1 addition & 1 deletion conversation.py
@@ -379,7 +379,7 @@ def register_conv_template(template: Conversation, override: bool = False):

def get_conv_template(name: str) -> Conversation:
"""Get a conversation template."""
print("Using template: ", name)
print("Using template:", name)
return conv_templates[name].copy()

