bug: LLM grader reports 'no response provided' when system prompt is present in input #982

@christso

Description

Bug Description

When running evals with a system prompt in the input messages, the LLM grader frequently reports "No response content was provided to evaluate" even when the model actually generated a correct response. This was discovered while benchmarking the `with-superpowers` and `without-superpowers` experiments.

Reproduction

  1. Create an eval with a system message followed by a user message:

```yaml
tests:
  - id: test-1
    input:
      - role: system
        content: "You are a helpful assistant that thinks step by step..."
      - role: user
        content: "Solve this logic puzzle..."
    assertions:
      - type: llm-grader
        prompt: "Check if the response correctly solves the puzzle"
```
  2. Run against multiple targets:

```shell
agentv eval test.EVAL.yaml --target azure --experiment with-system-prompt
agentv eval test.EVAL.yaml --target gemini --experiment with-system-prompt
```
  3. Observe that the LLM grader returns low scores with reasoning like "No response content was provided to evaluate" even when the model's actual response is correct.

Observed behavior

| Experiment | Target | Actual Response | Grader Score | Grader Reasoning |
| --- | --- | --- | --- | --- |
| without-superpowers | gemini | Correct (A=Knave, B=Knave) | 1.0 | Correct analysis |
| with-superpowers | gemini | Correct (A=Knave, B=Knave) | 0.5 | "No response text was provided" |
| without-superpowers | azure | Correct | 0.97 | Good solution |
| with-superpowers | azure | Correct | 0.0 | "No solution steps were included" |

The `contains` evaluator confirms that the expected content IS present in the response. The issue is specific to the LLM grader's prompt construction when system messages are in the input.
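For context, a `contains` check can be run alongside the grader on the same test; a sketch extending the example config above (the exact field names for the `contains` evaluator are an assumption, not confirmed against the AgentV docs):

```yaml
assertions:
  - type: contains           # assumed evaluator name, as referenced in this report
    value: "A=Knave"         # hypothetical expected substring
  - type: llm-grader
    prompt: "Check if the response correctly solves the puzzle"
```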

Expected behavior

The LLM grader should correctly receive and evaluate the model's response regardless of whether the input includes system messages.

Environment

Likely cause

The LLM grader prompt construction in `packages/core/src/evaluation/evaluators/llm-grader.ts` may not correctly handle multi-turn inputs with system messages when building the evaluation context for the judge model.
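A minimal sketch of how such a bug could arise, assuming the grader looks up the model response by position in the message array (all names here are illustrative, not taken from the AgentV source):

```typescript
// Hypothetical reproduction of the suspected failure mode: if the grader
// prompt builder assumes a fixed message layout (user at index 0, response
// at index 1), a leading system message shifts the indices and the response
// lookup comes back empty.
type Message = { role: "system" | "user" | "assistant"; content: string };

// Fragile: breaks as soon as a system message precedes the user turn.
function getResponseNaive(messages: Message[]): string {
  return messages[1]?.role === "assistant" ? messages[1].content : "";
}

// Robust: scan for the last assistant message regardless of preceding roles.
function getResponseRobust(messages: Message[]): string {
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "assistant") return messages[i].content;
  }
  return "";
}

const transcript: Message[] = [
  { role: "system", content: "You are a helpful assistant..." },
  { role: "user", content: "Solve this logic puzzle..." },
  { role: "assistant", content: "A is a Knave and B is a Knave." },
];

console.log(getResponseNaive(transcript));  // "" -> grader sees no response
console.log(getResponseRobust(transcript)); // the actual answer
```

With the system message present, the naive lookup returns an empty string, which would explain the grader reasoning "No response content was provided to evaluate" despite a correct answer in the transcript.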

Metadata

Labels: bug (Something isn't working), core (Anything pertaining to core functionality of AgentV)

Status: Done
