This is the full list of test instance patterns that can be used to evaluate how well LLMs understand workflow architectures.
To create test instances from the patterns, substitute all the parameters. Note that it might also be necessary to alter the question formulations to adapt them to the language of the specific workflow architecture.
- List of tasks (in workflow)
- List of tasks with a property
- Link existence, Task after task
- Next tasks in flow
- Flow cycle
- Flow start detection
- Flow end detection
- Missing link
Special types of links (e.g., conditional, exceptional):
Link operators (e.g., parallel flow - fork and join):
- Operator existence
- Parallel tasks (block) existence
- List tasks in a parallel (fork-join) block
- Parallel tasks to a specific task
- List all parallel tasks
- List tasks in an operator block (other than simple fork-join)
- Are tasks in correct order? (without conditional flow)
- Are all tasks in correct order? (without conditional flow)
- Determine task order (without conditional flow)
- Is conditional flow mutually exclusive?
- Next task in conditional flow
- Are tasks in correct order? (with conditional flow)
- Are all tasks in correct order? (with conditional flow)
- Determine task order (with conditional flow)
- Is loop infinite?
- Loop end condition
- Describe task functionality
- Inconsistent task name and description
- Inconsistent task name and other entities (e.g., parameters, data)
- Meaning (functionality) of tasks
- Describe workflow functionality
- Inconsistent workflow name and description
- Inconsistent descriptions of workflow and tasks
Rationale: Can the LLM list the tasks in a workflow?
Parameters:
- $N$: total number of tasks
- $W$: workflow name
- $T_1, \dots, T_N$: tasks in $W$
Architecture: Workflow
Question: List all tasks in workflow $W$.
Reference answer: $T_1, \dots, T_N$
Evaluation metric: Jaccard index
Example instance: "List all tasks in workflow 'MLTrainingAndEvaluation'."
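Substituting parameters into a pattern to obtain a test instance can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the dictionary layout, the `instantiate` helper, and the `{W}` placeholder syntax are assumptions, and the task names are taken from the examples in this document.

```python
# Hypothetical encoding of the "List of tasks" pattern (field names are assumptions).
PATTERN = {
    "question": "List all tasks in workflow '{W}'.",
    "metric": "jaccard",
}

def instantiate(pattern: dict, workflow_name: str, tasks: list) -> dict:
    """Substitute the parameters W and T_1..T_N into the pattern."""
    return {
        "question": pattern["question"].format(W=workflow_name),
        "reference_answer": set(tasks),  # order is irrelevant for this pattern
        "metric": pattern["metric"],
    }

instance = instantiate(
    PATTERN,
    "MLTrainingAndEvaluation",
    ["FeatureExtraction", "MLModelTraining", "MLModelEvaluation"],
)
print(instance["question"])
```

The question wording may still need to be adapted by hand to the terminology of the specific workflow language.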
Rationale: Can the LLM list the tasks in a workflow and filter them based on a specific property?
Parameters:
- $P$: property of the tasks (e.g., the task has a parameter)
- $N$: number of tasks satisfying $P$
- $T$: total number of tasks
- $W$: workflow name
Architecture: Workflow
Question: List all tasks in workflow $W$ that satisfy property $P$.
Reference answer: a set of $N$ tasks satisfying $P$ (depends on the parameter substitution for the test instance)
Evaluation metric: Jaccard index
Example instance: "List all tasks in workflow 'MLTrainingAndEvaluation' that have a parameter."
Rationale: Can the LLM determine if there is a flow link between two tasks?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, is there a $F$ link from $T_1$ to $T_2$?
Reference answer: yes
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation', is there a control flow link from 'MLModelTraining' to 'MLModelEvaluation'?"
Rationale: Can the LLM determine if there is a flow link between two tasks? (negative test)
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, is there a $F$ link from $T_1$ to $T_2$?
Reference answer: no
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation', is there a control flow link from 'MLModelEvaluation' to 'MLModelTraining'?"
Rationale: Can the LLM determine if there is a flow link between two tasks?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, does $T_2$ directly follow $T_1$ in $F$?
Reference answer: yes
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation', does 'MLModelEvaluation' directly follow 'MLModelTraining' in control flow?"
Rationale: Can the LLM determine if there is a flow link between two tasks? (negative test)
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow $W$, does $T_2$ directly follow $T_1$ in $F$?
Reference answer: no
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation', does 'MLModelTraining' directly follow 'MLModelEvaluation' in control flow?"
Rationale: Can the LLM determine to which tasks there is a flow link from a given task?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
- $T_0$: task in $W$
- $T_1, \dots, T_N$: tasks in $W$
Architecture: Workflow
Question: In workflow
Alternative question formulation: In workflow
Reference answer:
Evaluation metric: Jaccard index
Example instance: "In workflow 'MainWorkflow', which tasks come directly after 'Task2' in the control flow?", reference answer: { "Task1", "Task3" } (block of parallel tasks)
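For set-valued answers such as the one above, the Jaccard index can be computed as in the following sketch. The function name and the convention for two empty sets are assumptions made for this illustration.

```python
def jaccard_index(answer: set, reference: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not answer and not reference:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(answer & reference) / len(answer | reference)

reference = {"Task1", "Task3"}
print(jaccard_index({"Task1", "Task3"}, reference))  # 1.0
print(jaccard_index({"Task1"}, reference))           # 0.5
```

In practice the LLM's free-text answer has to be parsed into a set of task names first; that step is not shown here.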
Rationale: Can the LLM find cycles in a flow?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $C$: length of the cycle
- $W$: workflow name
Architecture: Workflow
Question: In workflow
Reference answer: yes
Evaluation metric: correctness
Rationale: Can the LLM find cycles in the flow? (negative test)
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $W$: workflow name
Architecture: Workflow
Question: In workflow
Reference answer: no
Evaluation metric: correctness
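When generating instances for the two flow cycle patterns, the reference answer can be derived automatically from the flow graph. The sketch below uses a depth-first search with three node states; the adjacency-dictionary representation and the function name are assumptions, not part of the pattern definition.

```python
def has_cycle(links: dict) -> bool:
    """Return True if the directed flow graph contains a cycle.

    `links` maps each task name to the list of tasks it links to.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(task) -> bool:
        color[task] = GRAY
        for nxt in links.get(task, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: nxt is on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[task] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in list(links))

acyclic = {"A": ["B"], "B": ["C"], "C": []}
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(has_cycle(acyclic), has_cycle(cyclic))  # False True
```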
Rationale: Can the LLM determine the start of a flow?
Parameters:
- $F$: flow type that has a unique start (e.g., control flow)
- $T$: the first task in flow $F$
- $W$: workflow name
Architecture: Workflow
Question: In workflow
Reference answer:
Evaluation metric: correctness
Rationale: Can the LLM determine the last task(s) of a flow?
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
- $T_1, \dots, T_N$: tasks in $W$ which are last in flow $F$ (can be one task or a list of tasks with no $F$ links from them)
Architecture: Workflow
Question: In workflow
Reference answer:
Evaluation metric: Jaccard index
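Reference answers for the flow start and flow end patterns can be computed from the link list: a start task has no incoming $F$ link and an end task has no outgoing $F$ link. The following sketch assumes links are given as `(source, target)` pairs; the function name and the example workflow are illustrative.

```python
def start_and_end_tasks(tasks, links):
    """Return (tasks without incoming links, tasks without outgoing links)."""
    sources = {src for src, _ in links}
    targets = {dst for _, dst in links}
    starts = set(tasks) - targets  # nothing links to these tasks
    ends = set(tasks) - sources    # these tasks link to nothing
    return starts, ends

tasks = ["FeatureExtraction", "MLModelTraining", "MLModelEvaluation"]
links = [
    ("FeatureExtraction", "MLModelTraining"),
    ("MLModelTraining", "MLModelEvaluation"),
]
starts, ends = start_and_end_tasks(tasks, links)
print(starts, ends)  # {'FeatureExtraction'} {'MLModelEvaluation'}
```

If `starts` contains more than one task, the flow has no unique start and the instance does not fit the flow start pattern.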
Rationale: Can the LLM detect that a flow is not connected (there is a missing link)?
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
Architecture: Workflow
Question: In workflow
Reference answer: no
Evaluation metric: correctness
Rationale: Can the LLM detect that the flow is not connected (there is a missing link)? (negative test)
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
Architecture: Workflow
Question: In workflow
Reference answer: yes
Evaluation metric: correctness
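One way to decide whether a flow is connected is to ignore link direction and check that every task can be reached from an arbitrary starting task. The sketch below does this with a breadth-first search. Treating "connected" as weak connectivity is an assumption of this example, as are the function name and the `(source, target)` link format.

```python
from collections import deque

def is_weakly_connected(tasks, links):
    """Return True if all tasks are connected when link direction is ignored."""
    tasks = list(tasks)
    if not tasks:
        return True
    neighbours = {t: set() for t in tasks}
    for src, dst in links:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen = {tasks[0]}
    queue = deque([tasks[0]])
    while queue:
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(tasks)

print(is_weakly_connected(["A", "B", "C"], [("A", "B"), ("B", "C")]))  # True
print(is_weakly_connected(["A", "B", "C"], [("A", "B")]))              # False
```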
Rationale: Can the LLM determine whether there is a flow link between two tasks?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $P$: property of the links (e.g., it is conditional)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow
Reference answer: yes
Evaluation metric: correctness
Example instance: "In workflow 'HyperparameterOptimization', is there a conditional control flow link from 'HyperparameterProposal' to 'MLModelValidation'?"
Rationale: Can the LLM determine if there is a flow link between two tasks? (negative test)
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $P$: property of the links (e.g., it is conditional)
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
Architecture: Workflow
Question: In workflow
Reference answer: no
Evaluation metric: correctness
Rationale: Can the LLM list the links from a given task and filter them based on a specific property?
Parameters:
- $F$: flow type (e.g., control flow, data flow)
- $P$: property of the links (e.g., it is conditional)
- $W$: workflow name
- $T_0$: task in $W$
- $T_1, \dots, T_N, \dots, T_L$: tasks in $W$ ($N < L$)
Architecture: Workflow
Question: To which tasks there is a flow
Reference answer:
Evaluation metric: Jaccard index
Example instance: "In workflow 'HyperparameterOptimization', to which tasks are there conditional flow links from 'HyperparameterProposal'?"
Note: It would also be possible to create a pattern that lists all links in a workflow satisfying a property. However, such a pattern is harder to evaluate, because we would have to define a format for encoding the links in the answer (encoding tasks is easier).
Rationale: Can the LLM understand operators (e.g., fork, join)? Can it determine whether there is an operator used in a workflow?
Pattern details to be added.
Rationale: Can the LLM determine whether some tasks can run in parallel?
Pattern details to be added.
Reference answer: yes
Evaluation metric: correctness
Rationale: Can the LLM determine which tasks can run in parallel (inside a particular fork-join block)?
Pattern details to be added.
Reference answer: list of tasks
Rationale: Can the LLM determine which tasks can run in parallel (to a particular task)?
Pattern details to be added.
Rationale: Can the LLM determine which tasks can run in parallel?
Pattern details to be added.
Reference answer: several lists of tasks that can be run in parallel (open question)
Rationale: Can the LLM correctly interpret operators other than simple fork-join?
Pattern details to be added.
Rationale: Can the LLM find dependencies (in a flow)?
Parameters:
- $W$: workflow name
- $E$: entity (e.g., task, data) in the workflow
- $T$: task in the workflow that depends on $E$ (through, e.g., control flow or data flow)
Architecture: Workflow
Question: In workflow $W$, does task $T$ depend on $E$?
Reference answer: yes
Evaluation metric: correctness
Example instances:
- "In workflow 'MLTrainingAndEvaluation' does task 'MLModelEvaluation' depend on task 'MLModelTraining'?"
- "In workflow 'MLTrainingAndEvaluation' does task 'MLModelTraining' depend on data 'TrainingData'?"
Rationale: Can the LLM find dependencies (in a flow)? (negative test)
Parameters:
- $W$: workflow name
- $E$: entity (e.g., task, data) in the workflow
- $T$: task in the workflow that does not depend on $E$ (through, e.g., control flow or data flow)
Architecture: Workflow
Question: In workflow $W$, does task $T$ depend on $E$?
Reference answer: no
Evaluation metric: correctness
Example instances:
- "In workflow 'MLTrainingAndEvaluation' does task 'MLModelTraining' depend on task 'MLModelEvaluation'?"
- "In workflow 'MLTrainingAndEvaluation' does task 'MLModelTraining' depend on data 'TestData'?"
Rationale: Can the LLM list dependencies (in a flow)?
Parameters:
- $W$: workflow name
- $E$: entity type (e.g., task, data)
- $E_1, \dots, E_K$: entities (of type $E$) in the workflow
- $T$: task in the workflow that depends on $E_1, \dots, E_K$
Architecture: Workflow
Question: List all entities of type $E$ that
Reference answer:
Evaluation metric: Jaccard index
Example instances:
- "List all data that task 'MLModelEvaluation' (from workflow 'MLTrainingAndEvaluation') depends on.", reference answer: { MLModel, TestFeatures }
- "List all tasks that must run before task 'MLModelEvaluation' in workflow 'MLTrainingAndEvaluation'.", reference answer: { FeatureExtraction, MLModelTraining }
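For task-to-task dependencies, the reference answer for the second example can be obtained by collecting all transitive predecessors of the task in the control flow. The sketch below assumes links are `(source, target)` pairs and that the workflow is a simple chain of the three tasks named in this document; the function name is hypothetical.

```python
def tasks_before(target, links):
    """Return every task from which `target` is reachable via the links."""
    predecessors = {}
    for src, dst in links:
        predecessors.setdefault(dst, set()).add(src)
    found = set()
    stack = [target]
    while stack:
        for pred in predecessors.get(stack.pop(), ()):
            if pred not in found:
                found.add(pred)
                stack.append(pred)
    return found

links = [
    ("FeatureExtraction", "MLModelTraining"),
    ("MLModelTraining", "MLModelEvaluation"),
]
print(tasks_before("MLModelEvaluation", links))
```

The same traversal works for data dependencies if the links describe data flow instead of control flow.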
Rationale: Can the LLM detect which task produces specific data?
Parameters:
- $W$: workflow name
- $D$: data (or other entity) in the workflow
- $T$: task which produces $D$ as its output
Architecture: Workflow
Question: In workflow $W$, which task produces $D$?
Reference answer: $T$
Evaluation metric: correctness
Example instance: "In workflow 'MLTrainingAndEvaluation' which task produces 'MLModel'?", reference answer: 'MLModelTraining'
Rationale: Can the LLM interpret task hierarchy well?
Parameters:
- $W$: workflow name
- $T$: task in the workflow $W$ that is composite (has sub-tasks)
- $S$: sub-task of $T$
Architecture: Workflow
Question: Is task $S$ a part of task $T$ (from workflow $W$)?
Reference answer: yes
Evaluation metric: correctness
Example instance: "Is task 'HyperparameterProposal' a part of task 'HyperparameterOptimization' (from workflow 'FailurePredictionInManufacture')?"
Rationale: Can the LLM interpret task hierarchy well? (negative test)
Parameters:
- $W$: workflow name
- $T$: task in the workflow $W$ (it might be composite)
- $S$: a different task from $W$ (or another workflow) that is not a sub-task of $T$
Architecture: Workflow
Question: Is task $S$ a part of task $T$ (from workflow $W$)?
Reference answer: no
Evaluation metric: correctness
Example instance: "Is task 'DataRetrieval' a part of task 'HyperparameterOptimization' (from workflow 'FailurePredictionInManufacture')?"
Rationale: Can the LLM interpret task hierarchy well?
This can be expressed via the "List of tasks with a property" pattern, where the property is that the task is a composite task (has nested sub-tasks).
Example instance: "List all tasks from 'AutoML' workflow that are composite."
Rationale: Can the LLM interpret task hierarchy well?
This can be expressed via the "List of tasks with a property" pattern, where the property is that the task is a part of a composite task.
Example instance: "List all tasks that are nested inside composite tasks of 'AutoML' workflow."
Rationale: Can the LLM detect infinite recursion in references (e.g., among sub-workflows)?
Parameters:
- $W$: workflow name
Architecture: Workflow
Question: In workflow
Reference answer: yes
Evaluation metric: correctness
Note: The recursion might also be more complicated than just a simple self-reference, i.e.,
Rationale: Can the LLM detect infinite recursion in references (e.g., among sub-workflows)? (negative test)
Parameters:
- $W$: workflow name
Architecture: Workflow
Question: In workflow
Reference answer: no
Evaluation metric: correctness
Rationale: Can the LLM notice that tasks are in incorrect order (not corresponding to the control flow)?
Parameters:
- $W$: workflow name
- $T_0, T_1, \dots, T_N$: tasks in the workflow $W$
Architecture: Workflow
Question: Does task
Reference answer: yes
Evaluation metric: correctness
Example instance:
- Architecture: Workflow with control flow: START -> FeatureExtraction -> ModelTraining -> ModelEvaluation -> END
- Question: Does 'FeatureExtraction' run before 'ModelEvaluation'?
- Reference answer: yes
Rationale: Can the LLM notice that tasks are in incorrect order (not corresponding to the control flow)? (negative test)
Parameters:
- $W$: workflow name
- $T_0, T_1, \dots, T_N$: tasks in the workflow $W$
Architecture: Workflow
Question: Does task
Reference answer: no
Evaluation metric: correctness
Example instance:
- Architecture: Workflow with control flow: START -> FeatureExtraction -> ModelTraining -> ModelEvaluation -> END
- Question: Does 'ModelEvaluation' run before 'FeatureExtraction'?
- Reference answer: no
Rationale: Can the LLM notice that tasks are in incorrect order (not corresponding to the control flow)?
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
Architecture: Workflow
Question: Can tasks $T_1, \dots, T_K$ run in this order?
Reference answer: yes
Evaluation metric: correctness
Example instance:
- Architecture: Workflow with control flow: START -> FeatureExtraction -> ModelTraining -> ModelEvaluation -> END
- Question: Can tasks 'FeatureExtraction', 'ModelTraining', 'ModelEvaluation' run in this order?
- Reference answer: yes
Rationale: Can the LLM notice that tasks are in incorrect order (not corresponding to the control flow)? (negative test)
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
Architecture: Workflow
Question: Can tasks $T_1, \dots, T_K$ run in this order?
Reference answer: no
Evaluation metric: correctness
Example instance:
- Architecture: Workflow with control flow: START -> FeatureExtraction -> ModelTraining -> ModelEvaluation -> END
- Question: Can tasks 'ModelTraining', 'FeatureExtraction', 'ModelEvaluation' run in this order?
- Reference answer: no
Rationale: Can the LLM order the tasks in a workflow without conditional flow to adhere to control flow links?
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
Architecture: Workflow
Question: List all the tasks in workflow $W$ in order in which they run.
Reference answer:
Evaluation metric: Damerau–Levenshtein distance (note: special care must be given to the order of parallel tasks)
Example instance: "List all the tasks in workflow 'MLTrainingAndEvaluation' in order in which they run.", reference answer: FeatureExtraction, MLModelTraining, MLModelEvaluation
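For ordered answers like the one above, the edit distance between the LLM's task sequence and the reference sequence can be computed as in the following sketch. It implements the optimal string alignment variant of the Damerau–Levenshtein distance (each pair of adjacent elements is transposed at most once); whether the evaluation uses this variant or the unrestricted one is an assumption here, and the handling of parallel tasks is not covered.

```python
def osa_distance(a, b):
    """Edit distance with insertions, deletions, substitutions and adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

reference = ["FeatureExtraction", "MLModelTraining", "MLModelEvaluation"]
swapped = ["FeatureExtraction", "MLModelEvaluation", "MLModelTraining"]
print(osa_distance(reference, reference))  # 0
print(osa_distance(swapped, reference))    # 1
```

Because the function works on lists of task names, a swapped pair of tasks costs one edit regardless of how long the names are.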
Rationale: Can the LLM understand conditional flow guards?
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
- $T_0, T_1, T_2$: tasks in the workflow
- $C_1, C_2$: conditions for conditional links (in flow $F$) that are mutually exclusive
Architecture: Workflow
Question: Are conditional links in flow
Reference answer: yes
Evaluation metric: correctness (
Example instance: Workflow with a parameter
Rationale: Can the LLM understand conditional flow guards? (negative test)
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
- $T_0, T_1, T_2$: tasks in the workflow
- $C_1, C_2$: conditions for conditional links (in flow $F$) that are not mutually exclusive
Architecture: Workflow
Question: Are conditional links in flow
Reference answer: no
Evaluation metric: correctness (
Example instance: Workflow with a parameter
Rationale: Can the LLM notice that tasks are in incorrect order (not corresponding to the control flow)?
Same as Are tasks in correct order? (without conditional flow), but some of the control flow links are conditional.
Rationale: Can the LLM notice that tasks are in incorrect order (not corresponding to the control flow)?
Same as Are all tasks in correct order? (without conditional flow), but some of the control flow links are conditional.
Rationale: Can the LLM evaluate conditional flow?
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
- $T_0, \dots, T_K$: tasks in the workflow
- $C_1, \dots, C_K$: conditions for conditional links (in flow $F$)
- $S$: situation (e.g., parameter values)
Architecture: Workflow
Question: Which task will follow
Reference answer:
Evaluation metric: correctness (
Example instance: "Which task will follow 'HyperparameterProposal' in control flow in workflow 'HyperparameterOptimization'?"
Rationale: Can the LLM understand the order of tasks in the workflow with conditional flow?
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
- $S$: situation (e.g., parameter values)
- $R_1, \dots, R_L$ ($L \le K$): tasks that will run when the workflow is executed with initial situation $S$
Architecture: Workflow
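Question: Given the initial situation $S$, list all the tasks that will run when workflow $W$ is executed in order in which they will run.
Reference answer: $R_1, \dots, R_L$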
Evaluation metric: Damerau–Levenshtein distance (note: special care must be given to the order of parallel tasks)
Example instance: "Given the initial situation p=0, list all the tasks that will run when workflow 'Workflow3' is executed in order in which they will run.", reference answer: Task7, Task8
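The reference answer for this kind of instance can be produced by simulating the conditional control flow for the given situation. The sketch below is a hypothetical illustration: the workflow, its task names, and the representation of guards as Python predicates are all assumptions and do not describe 'Workflow3' from the example.

```python
def run_trace(start, links, situation, max_steps=100):
    """Follow conditional links from `start` and return the tasks in run order.

    `links` maps a task to a list of (guard, next_task) pairs, where `guard`
    is a predicate over the situation. `max_steps` protects against loops.
    """
    trace = []
    task = start
    while task is not None and len(trace) < max_steps:
        trace.append(task)
        # Follow the first outgoing link whose guard holds in this situation.
        task = next(
            (nxt for guard, nxt in links.get(task, []) if guard(situation)),
            None,
        )
    return trace

def always(situation):
    return True

# Hypothetical workflow with one conditional branch on parameter p.
links = {
    "CheckInput": [
        (lambda s: s["p"] == 0, "FastPath"),
        (lambda s: s["p"] != 0, "SlowPath"),
    ],
    "FastPath": [(always, "Report")],
    "SlowPath": [(always, "Report")],
}
print(run_trace("CheckInput", links, {"p": 0}))  # ['CheckInput', 'FastPath', 'Report']
```

The sketch covers sequential flow only. Parallel blocks would need a different trace representation, which is why the metric note above calls for special care with parallel tasks.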
Rationale: Can the LLM determine that a loop is infinite?
Pattern details to be added.
Rationale: Can the LLM determine under which conditions a loop ends?
Pattern details to be added.
Rationale: Can the LLM determine that a trace of tasks can occur in a specific initial situation?
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
- $S$: situation (e.g., parameter values)
- $R_1, \dots, R_L$ ($L \le K$): tasks that will run when the workflow is executed with initial situation $S$
Architecture: Workflow
Question: Can the trace of tasks
Reference answer: yes
Evaluation metric: correctness
Example instance: "Can the trace of tasks 'FeatureExtraction', 'MLModelTraining', 'MLModelEvaluation' occur in workflow 'MLTrainingAndEvaluation'?"
Rationale: Can the LLM determine that a trace of tasks can occur in a specific initial situation? (negative test)
Parameters:
- $W$: workflow name
- $T_1, \dots, T_K$: tasks in the workflow $W$
- $R_1, \dots, R_L$: tasks (from $W$ or a different workflow)
Architecture: Workflow
Question: Can the trace of tasks
Reference answer: no
Evaluation metric: correctness
Example instance: "Can the trace of tasks 'TrainTestSplit', 'MLModelEvaluation', 'MLModelTraining' occur in workflow 'MLTrainingAndEvaluation'?"
Rationale: Can the LLM interpret conditional flow guards correctly?
Parameters:
- $F$: flow type (e.g., control flow)
- $W$: workflow name
- $T$: task in the workflow $W$
Architecture: Workflow
Question: Will the task
Reference answer: yes
Evaluation metric: correctness
In the source code and raw results, these patterns are labeled semantics.
Rationale: Can the LLM determine the meaning of a task based on its name, parameters, links to other tasks, ...?
Pattern details to be added.
Rationale: Can the LLM detect an inconsistency between a task's name and its description?
Parameters:
- $W$: workflow name
- $T$: task name
- $D_T$: task description that is inconsistent with name $T$
Architecture: Workflow
Question: Identify tasks in
Reference answer: The description of task
Evaluation metric: ROUGE or BERTScore
Example instance: Task named 'BinaryClassificationModelTraining' with description 'Training of a regression ML model'
Rationale: Can the LLM detect inconsistencies between a task's name and its linked entities (e.g., data, other tasks)?
Pattern details to be added.
Rationale: Can the LLM interpret the meaning of tasks adequately (e.g., task performs an operation that is not directly mentioned in its name)?
Parameters:
- $W$: workflow name
- $P$: property of task (e.g., the task is part of data preprocessing)
- $T$: task in $W$ with property $P$
Architecture: Workflow
Question: Is there a task with property $P$ in workflow $W$?
Reference answer: yes
Evaluation metric: correctness
Example instance: Task 'FeatureExtraction' is labeled as "data preprocessing", question: "Is there a data preprocessing task in 'MLTrainingAndEvaluation'?"
Rationale: Can the LLM interpret the meaning of tasks adequately (e.g., task performs an operation that is not directly mentioned in its name)? (negative test)
Parameters:
- $W$: workflow name
- $P$: property of task (e.g., the task is part of data preprocessing)
Architecture: Workflow
Question: Is there a task with property $P$ in workflow $W$?
Reference answer: no
Evaluation metric: correctness
Rationale: Can the LLM determine the meaning of a workflow based on its tasks?
Pattern details to be added.
Rationale: Can the LLM detect inconsistent workflow name and description?
Parameters:
- $W$: workflow name
- $D_W$: workflow description that is inconsistent with name $W$
Architecture: Workflow
Question: Does the name of workflow
Reference answer: The description of
Evaluation metric: ROUGE or BERTScore
Example instance: Workflow named 'BinaryClassification' with description 'Training and evaluation of a regression ML model'
Rationale: Can the LLM detect inconsistent descriptions of a workflow and its tasks?
Parameters:
- $W$: workflow name
- $D_W$: workflow description
- $D_T$: task description that is inconsistent with $D_W$
Architecture: Workflow containing a task with inconsistent description.
Question: Identify tasks in workflow
Reference answer: description of the inconsistency (depends on the test instance)
Evaluation metric: ROUGE or BERTScore
Example instance: A workflow specifying an ML pipeline where the ML goal is said to be "binary classification" in the workflow description. At the same time, the tasks perform training of a "regression" ML model (which is inconsistent with "binary classification").
Rationale: Can the LLM detect tasks that are clearly in the wrong order (semantically)?
Parameters:
- $W$: workflow name
- $T_1, T_2$: tasks in $W$
- $F$: flow type (e.g., control flow)
Architecture: Workflow
Question: Identify potential errors in the specification of
Reference answer: Task
Evaluation metric: ROUGE or BERTScore
Example instance: Workflow with task 'MLModelTraining' after 'MLModelEvaluation'.
Rationale: Can the LLM understand the meaning of tasks (e.g., a task performs an operation that is not directly mentioned in its name)?
Parameters:
- $W$: workflow name
- $P$: property of task (e.g., the task is part of data preprocessing)
- $T_1$: task in $W$ with property $P$
- $T_2$: task in $W$
Architecture: Workflow
Question: Does a task with property
Reference answer: yes
Evaluation metric: correctness
Example instance: Task 'FeatureExtraction' is labeled as "data preprocessing" and it precedes 'MLModelTraining', question: "Does a data preprocessing task run before 'MLModelTraining'?"
Note: Other variants of this pattern can also be created, e.g., "Does a task with property
Rationale: Can the LLM understand the meaning of tasks (e.g., a task performs an operation that is not directly mentioned in its name)? (negative test)
Parameters:
- $W$: workflow name
- $P$: property of task (e.g., the task is part of data preprocessing)
- $T_2$: task in $W$
Architecture: Workflow
Question: Does a task with property
Reference answer: no
Evaluation metric: correctness
Example instance: "Does a data preprocessing task run before 'DataRetrieval'?" (some of the later tasks might be labeled as "data preprocessing")
Note: Other variants of this pattern can also be created, e.g., "Does a task with property
- correctness: $1$ if the LLM's answer equals the reference answer, $0$ otherwise
- Jaccard index: the size of the intersection divided by the size of the union of the sets (LLM's answer set and the reference answer set)
- Damerau–Levenshtein distance: edit distance between two sequences allowing insertions, deletions, substitutions, and transposition (swap) of two adjacent elements
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): the word overlap of the reference answer and the LLM output
- BERTScore: the cosine similarity of word embeddings (that capture the meaning of words)
- manual: manual assessment of the LLM's output
- LLM as a judge (e.g., G-Eval): use another LLM to assess the LLM's output