41 changes: 27 additions & 14 deletions docs/source/improve-workflows/evaluate.md
@@ -93,7 +93,12 @@ To evaluate a workflow, you can use the `nat eval` command. The `nat eval` comma

Note: If you would like to set up visualization dashboards for this initial evaluation, please refer to the **Visualizing Evaluation Results** section below.

To run and evaluate the simple example workflow, use the following command:
To run and evaluate the simple web query example workflow, first install the example with:
```bash
uv pip install -e examples/evaluation_and_profiling/simple_web_query_eval
```

Then, use the following command:
```bash
nat eval --config_file=examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config.yml
```
@@ -109,7 +114,7 @@ If you encounter rate limiting (`[429] Too Many Requests`) during evaluation, yo
llms:
nim_rag_eval_llm:
_type: nim
model_name: meta/llama-3.1-70b-instruct
model_name: nvidia/nemotron-3-nano
max_tokens: 8
base_url: http://localhost:8000/v1
```
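
If you point `base_url` at a locally hosted model as shown above, it can help to verify that the endpoint is reachable before starting a long evaluation run. The following is a minimal sketch, assuming the local server is OpenAI-compatible and exposes the standard `/v1/models` route; it is not part of NAT itself:

```python
import json
import urllib.request

# Hypothetical pre-flight check: confirm the locally hosted judge LLM endpoint
# responds before launching `nat eval`. Assumes an OpenAI-compatible server
# listening on the same base_url used in the eval config above.
BASE_URL = "http://localhost:8000/v1"


def check_judge_endpoint(base_url: str = BASE_URL) -> None:
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        models = json.load(resp)
    names = [m.get("id") for m in models.get("data", [])]
    print(f"Judge endpoint is up; available models: {names}")


if __name__ == "__main__":
    check_judge_endpoint()
```
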
@@ -119,11 +124,13 @@ If you encounter rate limiting (`[429] Too Many Requests`) during evaluation, yo
## Understanding the Evaluation Configuration
The `eval` section in the configuration file specifies the dataset and the evaluators to use. The following is an example of an `eval` section in a configuration file:

`examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config.yml`:
`examples/evaluation_and_profiling/simple_web_query_eval/configs/eval_config.yml` (some attributes have been omitted for brevity):
```yaml
eval:
general:
output_dir: ./.tmp/nat/examples/getting_started/simple_web_query/
output:
dir: ./.tmp/nat/examples/evaluation_and_profiling/simple_web_query_eval/eval/
cleanup: true
dataset:
_type: json
file_path: examples/evaluation_and_profiling/simple_web_query_eval/data/langsmith.json
@@ -260,23 +267,30 @@ These metrics use a judge LLM for evaluating the generated output and retrieved
llms:
nim_rag_eval_llm:
_type: nim
model_name: meta/llama-3.1-70b-instruct
model_name: nvidia/nemotron-3-nano-30b-a3b
max_tokens: 8
chat_template_kwargs:
enable_thinking: false
```
For these metrics, it is recommended to set `max_tokens` to 8 for the judge LLM. The judge LLM returns a floating-point score between 0 and 1 for each metric, where 1.0 indicates a perfect match between the expected output and the generated output.
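
As an illustration of how these scores might be consumed, the sketch below averages per-item judge scores and flags items that fall under a chosen threshold. The item names and values are hypothetical; adapt them to whatever your evaluator output file actually contains:

```python
# Hypothetical post-processing of judge scores: each score is a float in
# [0.0, 1.0], where 1.0 means the generated output matched the expected output.
scores = {"item_1": 0.92, "item_2": 0.41, "item_3": 1.0}  # example values

threshold = 0.7
average = sum(scores.values()) / len(scores)
failing = [item for item, score in scores.items() if score < threshold]

print(f"Average judge score: {average:.2f}")
print(f"Items below {threshold}: {failing}")
```
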

Evaluation quality depends on the judge LLM's ability to accurately assess the generated output and retrieved context. The following is the leaderboard for judge LLMs:
```
1) nvidia/Llama-3_3-Nemotron-Super-49B-v1
2) mistralai/mixtral-8x22b-instruct-v0.1
3) mistralai/mixtral-8x7b-instruct-v0.1
4) meta/llama-3.1-70b-instruct
5) meta/llama-3.3-70b-instruct
1) nvidia/Llama-3_3-Nemotron-Super-49B-v1
2) mistralai/mixtral-8x22b-instruct-v0.1
3) mistralai/mixtral-8x7b-instruct-v0.1
4) meta/llama-3.1-70b-instruct
5) meta/llama-3.3-70b-instruct
6) meta/llama-3.1-405b-instruct
7) mistralai/mistral-nemo-12b-instruct
8) nvidia/llama-3.1-nemotron-70b-instruct
9) meta/llama-3.1-8b-instruct
10) google/gemma-2-2b-it
```
<!-- Update the link here when ragas is updated -->
For a complete list of up-to-date judge LLMs, refer to the [Ragas NV metrics leadership board](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_nv_metrics.py)
For a complete list of up-to-date judge LLMs, refer to the [Ragas NV metrics leaderboard](https://github.com/vibrantlabsai/ragas/blob/v0.4.3/src/ragas/metrics/_nv_metrics.py).

For more information on the prompt used by the judge LLM, refer to the [Ragas NV metrics](https://github.com/explodinggradients/ragas/blob/v0.2.14/src/ragas/metrics/_nv_metrics.py). The prompt for these metrics is not configurable. If you need a custom prompt, you can use the [Tunable RAG Evaluator](#tunable-rag-evaluator) or implement your own evaluator using the [Custom Evaluator](../extend/custom-components/custom-evaluator.md) documentation.
For more information on the prompt used by the judge LLM, refer to the [Ragas NV metrics](https://github.com/vibrantlabsai/ragas/blob/v0.4.3/src/ragas/metrics/_nv_metrics.py). The prompt for these metrics is not configurable. If you need a custom prompt, you can use the [Tunable RAG Evaluator](#tunable-rag-evaluator) or implement your own evaluator using the [Custom Evaluator](../extend/custom-components/custom-evaluator.md) documentation.

### Trajectory Evaluator
This evaluator uses the intermediate steps generated by the workflow to evaluate the workflow trajectory. The evaluator configuration includes the evaluator type and any additional parameters required by the evaluator.
@@ -346,7 +360,7 @@ eval:
```

:::{note}
If `cleanup` is set to `true`, the entire output directory will be removed after the evaluation is complete. This is useful for temporary evaluations where you don't need to retain the output files. Use this option with caution, as it will delete all evaluation results including workflow outputs and evaluator outputs.
If `cleanup` is set to `true`, the entire output directory will be removed prior to performing the evaluation.
:::
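
The pre-run cleanup described in the note can be pictured with the following standalone sketch; this only illustrates the documented behavior and is not NAT's actual implementation:

```python
import shutil
from pathlib import Path


def prepare_output_dir(output_dir: str, cleanup: bool) -> Path:
    """Illustrative only: mirrors the documented `cleanup: true` behavior by
    removing the whole output directory before the evaluation writes to it."""
    path = Path(output_dir)
    if cleanup and path.exists():
        shutil.rmtree(path)  # removes prior workflow and evaluator outputs
    path.mkdir(parents=True, exist_ok=True)
    return path


# Example: wipe and recreate an output directory before an evaluation run.
prepare_output_dir("./.tmp/nat/eval_output/", cleanup=True)
```
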


@@ -1233,7 +1247,6 @@ eval:
dir: ./.tmp/nat/examples/simple_output/
cleanup: true
```
Output directory cleanup is disabled by default for easy troubleshooting.

#### Job eviction from output directory
When running multiple evaluations, especially with `append_job_id_to_output_dir` enabled, the output directory can accumulate a large number of job folders over time. You can control this growth using a job eviction policy.
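
As a rough mental model for such a policy, the sketch below keeps only the N most recently modified job folders and deletes the rest; the directory layout is an assumption for illustration, not NAT's actual eviction implementation:

```python
import shutil
from pathlib import Path


def evict_old_jobs(output_dir: str, max_jobs: int = 10) -> None:
    """Keep the `max_jobs` most recently modified job folders, delete the rest."""
    root = Path(output_dir)
    if not root.exists():
        return
    # Assumes each evaluation job writes into its own subdirectory, e.g. when
    # `append_job_id_to_output_dir` is enabled.
    job_dirs = sorted(
        (p for p in root.iterdir() if p.is_dir()),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for stale in job_dirs[max_jobs:]:
        shutil.rmtree(stale)


# Example usage:
# evict_old_jobs("./.tmp/nat/eval_output/", max_jobs=5)
```
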
packages/nvidia_nat_core/src/nat/data_models/evaluate_config.py
@@ -132,12 +132,16 @@ class EvalGeneralConfig(BaseModel):
"this creates a fresh workflow instance per eval item, resetting all stateful tools to their "
"initial state. Set to False to disable this behavior.")

# overwrite the output_dir with the output config if present
# If output_dir is defined and output is not, define an EvalOutputConfig with output_dir as the dir
@model_validator(mode="before")
@classmethod
def override_output_dir(cls, values):
if values.get("output") and values["output"].get("dir"):
values["output_dir"] = values["output"]["dir"]
output_config = values.get("output")
if output_config is None:
output_dir = values.get("output_dir")
if output_dir is not None:
values["output"] = EvalOutputConfig(dir=output_dir)

Comment on lines +139 to +144

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`output` no longer reliably overrides `output_dir` (config drift bug).

When `output` is present, this validator no longer syncs `output_dir`. But downstream code still reads `output_dir` (e.g., `packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/builder.py` line 133 and `packages/nvidia_nat_eval/src/nat/plugins/eval/runtime/evaluate.py` lines 432 and 511), so writes can go to the default path instead of `output.dir`.

💡 Proposed fix
 @model_validator(mode="before")
 @classmethod
 def override_output_dir(cls, values):
     output_config = values.get("output")
-    if output_config is None:
-        output_dir = values.get("output_dir")
-        if output_dir is not None:
-            values["output"] = EvalOutputConfig(dir=output_dir)
+    output_dir = values.get("output_dir")
+
+    if output_config is None:
+        if output_dir is not None:
+            values["output"] = EvalOutputConfig(dir=output_dir)
+    else:
+        # Keep legacy/expected precedence: output.dir overrides output_dir
+        if isinstance(output_config, dict):
+            out_dir = output_config.get("dir")
+            if out_dir is not None:
+                values["output_dir"] = out_dir
+        elif isinstance(output_config, EvalOutputConfig):
+            values["output_dir"] = output_config.dir
 
     return values
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `packages/nvidia_nat_core/src/nat/data_models/evaluate_config.py` around lines
139-144, the validator that syncs output and output_dir currently only sets
values["output"] when "output" is missing, which causes config drift when
"output" is present but output_dir is expected downstream; update the validator
so it always ensures values["output_dir"] matches values["output"].dir when
values["output"] is present, and conversely sets values["output"] =
EvalOutputConfig(dir=output_dir) when only "output_dir" is present—i.e., after
reading values.get("output") and values.get("output_dir"), if output is present
set values["output_dir"]=output.dir, else if output_dir is present set
values["output"]=EvalOutputConfig(dir=output_dir).

return values

@classmethod
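
To make the precedence discussed in the review comment above concrete, here is a self-contained sketch using stand-in Pydantic models (not the actual NAT classes) in which `output.dir` takes priority while `output_dir` is kept in sync for legacy readers:

```python
from pydantic import BaseModel, model_validator


class OutputConfig(BaseModel):
    """Stand-in for EvalOutputConfig; only the fields needed for the example."""
    dir: str = "/tmp/nat/default"
    cleanup: bool = False


class GeneralConfig(BaseModel):
    """Stand-in for EvalGeneralConfig demonstrating the two-way sync."""
    output_dir: str = "/tmp/nat/default"
    output: OutputConfig | None = None

    @model_validator(mode="before")
    @classmethod
    def sync_output_fields(cls, values: dict) -> dict:
        output = values.get("output")
        if output is None:
            # Legacy style: only output_dir given -> build an output config from it.
            if values.get("output_dir") is not None:
                values["output"] = {"dir": values["output_dir"]}
        else:
            # New style: output.dir wins, and output_dir is kept consistent
            # so code that still reads output_dir sees the same path.
            out_dir = output.get("dir") if isinstance(output, dict) else output.dir
            if out_dir is not None:
                values["output_dir"] = out_dir
        return values


cfg = GeneralConfig(output={"dir": "/tmp/nat/run1", "cleanup": True})
assert cfg.output_dir == "/tmp/nat/run1"  # no drift between the two fields
```
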
@@ -347,6 +347,9 @@ async def profile_workflow(self) -> ProfilerResults:

all_stats = [item.trajectory for item in self.eval_input.eval_input_items]

if len(all_stats) == 0 or all(len(stats) == 0 for stats in all_stats):
raise ValueError("No trajectories found for profiling.")

profiler_runner = ProfilerRunner(self.eval_config.general.profiler,
self.eval_config.general.output_dir,
write_output=self.config.write_output)
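
The guard added to `profile_workflow` rejects both an empty result set and a result set whose items all have empty trajectories; the small sketch below shows the same condition in isolation, with simplified data shapes for illustration:

```python
def has_no_trajectories(all_stats: list[list]) -> bool:
    """Mirror of the guard above: true when there is nothing to profile."""
    return len(all_stats) == 0 or all(len(stats) == 0 for stats in all_stats)


print(has_no_trajectories([]))                   # True: no eval items at all
print(has_no_trajectories([[], []]))             # True: items exist, but no steps
print(has_no_trajectories([["tool_call"], []]))  # False: at least one trajectory
```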