feat(core): Content preprocessor for LLM graders to handle binary agent outputs #963

@christso

Description

Objective

Add a content preprocessor pipeline so LLM graders can evaluate agents that produce binary file outputs (e.g., .xlsx, .pdf, .docx). Currently, ContentFile blocks are defined in content.ts but silently ignored by the grading pipeline — the grader receives an empty candidate string.

Problem

  1. extractLastAssistantContent() in providers/types.ts:250 only extracts ContentText blocks — ContentFile is ignored
  2. LLM grader receives empty candidate when agent output is a file
  3. Built-in agent mode's read_file skips binary extensions
  4. Code grader's materializeContentForGrader() handles images but not files
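The gap can be illustrated with a minimal sketch of the content model (simplified, hypothetical type shapes; the real definitions live in content.ts and providers/types.ts):

```typescript
// Hypothetical simplified content model; actual types live in content.ts.
type ContentText = { type: "text"; text: string };
type ContentFile = { type: "file"; path: string; mimeType: string };
type ContentBlock = ContentText | ContentFile;

// Sketch of today's extraction behavior: only text blocks contribute,
// so an agent that answers with a single file yields an empty candidate.
function extractCandidate(blocks: ContentBlock[]): string {
  return blocks
    .filter((b): b is ContentText => b.type === "text")
    .map((b) => b.text)
    .join("\n");
}

const blocks: ContentBlock[] = [
  {
    type: "file",
    path: "report.xlsx",
    mimeType:
      "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  },
];
// extractCandidate(blocks) yields "" — the grader sees nothing to evaluate.
```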

Prerequisites — ContentFile Production

Before this feature is useful end-to-end, at least one provider must emit ContentFile blocks in agent output. Check whether any current providers (claude, codex, copilot) already produce ContentFile when agents write files, or whether provider-side changes are also needed. If provider work is required, it can be a parallel workstream — the preprocessor pipeline should be built to be testable with mock ContentFile blocks regardless.

Proposed Design

Add a preprocessor pipeline that converts ContentFile blocks to ContentText before graders see them.

Default behavior: read as text

Any ContentFile without a registered preprocessor is read as UTF-8 text. This covers csv, json, sql, md, yaml, html, xml, txt, and any other text-based format — no registration needed.

Preprocessors: only for formats that need transformation

Preprocessors exist only where a raw text read is insufficient (binary formats, or a text format that needs restructuring before grading). Core ships no built-in preprocessors; it provides only the registry and the default text read. Converter scripts are provided as examples that users copy into their projects and customize.

Resolution order:

  1. User-defined preprocessor in YAML → takes priority (overrides default text read)
  2. Default fallback → readFile(path, 'utf-8')
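The resolution order can be sketched as follows (names and signatures are illustrative, not the final API; the injectable `readText` parameter exists only to make the sketch testable):

```typescript
import { readFileSync } from "node:fs";

type ContentText = { type: "text"; text: string };
type ContentFile = { type: "file"; path: string; mimeType: string };
type Preprocessor = (file: ContentFile) => ContentText;

// Hypothetical registry; the real one would live in content-preprocessor.ts
// and be populated only by user-defined preprocessors.
const registry = new Map<string, Preprocessor>();

function preprocess(
  file: ContentFile,
  readText: (path: string) => string = (p) => readFileSync(p, "utf-8"),
): ContentText {
  // 1. A user-defined preprocessor wins, even for text-readable formats.
  const custom = registry.get(file.mimeType);
  if (custom) return custom(file);
  // 2. Default fallback: read the file as UTF-8 text.
  return { type: "text", text: readText(file.path) };
}
```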

Core implementation

  • Preprocessor registry (content-preprocessor.ts): Map<type, (ContentFile) => ContentText> — populated only by user-defined preprocessors
  • Format alias map: Short aliases resolve to MIME types (xlsx → full MIME string); unrecognized values are treated as literal MIME types, so a single type field covers both cases
  • Pipeline integration: Run preprocessing on ContentFile blocks before candidate extraction
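The alias map can be sketched as a plain lookup with pass-through for unrecognized values (the alias set shown is illustrative, not exhaustive):

```typescript
// Hypothetical alias table; unrecognized values pass through unchanged,
// so users can write either a short alias or a literal MIME type.
const FORMAT_ALIASES: Record<string, string> = {
  xlsx: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
  docx: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  pdf: "application/pdf",
  html: "text/html",
};

function resolveType(type: string): string {
  return FORMAT_ALIASES[type] ?? type;
}
```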

YAML config — scoping and syntax

Preprocessors are declared top-level in the eval file (shared by all evaluators). Per-evaluator override is possible but optional.

```yaml
# Top-level: applies to all evaluators in this file
preprocessors:
  - type: xlsx
    command: ["bun", "run", "scripts/preprocessors/xlsx-to-csv.ts"]
  - type: html
    command: ["bun", "run", "scripts/preprocessors/html-to-md.ts"]

tests:
  - id: report-check
    assertions:
      - type: llm-grader          # inherits xlsx/html preprocessors
        prompt: grade-report.txt
      - type: rubrics             # also inherits
        criteria:
          - Has revenue column

  - id: special-case
    assertions:
      - type: llm-grader
        preprocessors:            # per-evaluator override
          - type: xlsx
            command: ["bun", "run", "scripts/preprocessors/xlsx-to-json.ts"]
```

Command path resolution

Preprocessor command paths follow the same resolution as code-grader: the last element of the command array is resolved relative to searchRoots (eval file directory + project root) via resolveFileReference(). This keeps preprocessor scripts at a project-level location, not mixed into eval folders:

```
my-project/
  scripts/preprocessors/
    xlsx-to-csv.ts
    html-to-md.ts
  evals/
    dataset.eval.yaml          # references scripts/preprocessors/xlsx-to-csv.ts
```
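The resolution rule can be sketched as a first-match walk over the search roots (this assumes resolveFileReference-like semantics; the injectable `exists` parameter is only there to make the sketch testable):

```typescript
import { existsSync } from "node:fs";
import { resolve } from "node:path";

// Sketch: try each search root in order (eval file directory, then project
// root) and return the first candidate path that exists on disk.
function resolveScriptPath(
  script: string,
  searchRoots: string[],
  exists: (path: string) => boolean = existsSync,
): string | undefined {
  for (const root of searchRoots) {
    const candidate = resolve(root, script);
    if (exists(candidate)) return candidate;
  }
  return undefined;
}
```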

Integration points — hybrid approach

Use both integration strategies:

  1. At extraction boundary (for LLM graders): Modify or wrap extractLastAssistantContent() to run preprocessors on ContentFile blocks → all LLM graders benefit automatically
  2. At materialization (for code graders): Extend materializeContentForGrader() to write ContentFile blocks to temp files and pass paths to code-grader scripts — code graders may want raw file access, not just text
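The extraction-boundary half can be sketched as a pre-pass that rewrites ContentFile blocks to ContentText before the existing text-only extraction runs (type names and the `toText` callback are hypothetical; the real extractLastAssistantContent lives in providers/types.ts):

```typescript
type ContentText = { type: "text"; text: string };
type ContentFile = { type: "file"; path: string; mimeType: string };
type ContentBlock = ContentText | ContentFile;

// Pre-pass: convert file blocks to text blocks so every LLM grader
// benefits without per-grader changes.
function preprocessBlocks(
  blocks: ContentBlock[],
  toText: (file: ContentFile) => ContentText,
): ContentBlock[] {
  return blocks.map((b) => (b.type === "file" ? toText(b) : b));
}

// Existing text-only extraction, unchanged in spirit.
function extractCandidate(blocks: ContentBlock[]): string {
  return blocks
    .filter((b): b is ContentText => b.type === "text")
    .map((b) => b.text)
    .join("\n");
}
```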

Error handling

  • Binary file with no preprocessor: attempt text read → if it fails (invalid UTF-8), log warning, skip the block, note in grader evidence that file content was not evaluable
  • Preprocessor command fails: log stderr, skip the block, note in grader evidence
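Both failure paths share one shape: skip the block with a warning and surface the reason in grader evidence rather than failing the eval. A minimal sketch (names hypothetical):

```typescript
type ContentText = { type: "text"; text: string };
type ContentFile = { type: "file"; path: string; mimeType: string };

type PreprocessResult =
  | { ok: true; block: ContentText }
  | { ok: false; reason: string }; // surfaced in grader evidence

// Sketch: a failed text read (invalid UTF-8) or a failed preprocessor
// command logs a warning and skips the block instead of throwing.
function safePreprocess(
  file: ContentFile,
  run: (file: ContentFile) => ContentText,
): PreprocessResult {
  try {
    return { ok: true, block: run(file) };
  } catch (err) {
    console.warn(`preprocess failed for ${file.path}: ${String(err)}`);
    return { ok: false, reason: `file content was not evaluable: ${file.path}` };
  }
}
```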

Example converter scripts

Ship ready-to-copy converter scripts in examples/features/preprocessors/:

```
examples/features/preprocessors/
  scripts/preprocessors/
    xlsx-to-csv.ts             # xlsx → CSV (zero deps, uses built-in zip/XML parsing)
    html-to-md.ts              # HTML → markdown (zero deps, regex-based)
  evals/
    dataset.eval.yaml          # demonstrates top-level preprocessor config
  README.md                    # usage guide
```

Users copy converter scripts into their project's scripts/preprocessors/ and customize as needed (e.g., pick specific xlsx sheets, filter HTML elements, change output format).
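A converter script's contract is not pinned down by this issue; one plausible convention is that the script receives the file path as its first argument and writes converted text to stdout. A minimal regex-based sketch in that style (hypothetical; the shipped html-to-md.ts would be more thorough):

```typescript
// Hypothetical converter contract: file path in argv, converted text on stdout.
import { readFileSync } from "node:fs";

function htmlToMarkdown(html: string): string {
  // Minimal regex-based sketch: convert a few common elements, strip the rest.
  return html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gi, "# $1\n")
    .replace(/<strong[^>]*>(.*?)<\/strong>/gi, "**$1**")
    .replace(/<[^>]+>/g, "")
    .trim();
}

const [inputPath] = process.argv.slice(2);
if (inputPath) {
  process.stdout.write(htmlToMarkdown(readFileSync(inputPath, "utf-8")));
}
```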

Design Latitude

  • The preprocessor registry pattern is prescribed (aligns with Inspect AI Tier 1 approach from research)
  • Hybrid integration (extraction + materialization) is recommended but implementer may simplify if warranted
  • YAML config schema for custom preprocessors can be deferred to a follow-up if simpler to start with programmatic-only registration
  • Implementation details (sync vs async, exact function signatures) are flexible

Acceptance Signals

  • ContentFile blocks in agent output are converted to text before reaching LLM graders
  • Default: text-based files read as UTF-8 with no configuration needed
  • Top-level preprocessors config shared across all evaluators, with per-evaluator override
  • Custom preprocessors can be registered via YAML config (command scripts) and override default text read
  • Command path resolution matches code-grader behavior (resolveFileReference)
  • Existing text-only workflows are unaffected (non-breaking, ContentFile absent = no-op)
  • Code graders receive materialized file paths alongside text content
  • Binary files with no preprocessor produce a warning, not a failure
  • Example converter scripts: xlsx → CSV, HTML → markdown (TypeScript, zero deps)
  • Unit tests for preprocessor registry and pipeline
  • E2E test: eval with a mock agent that outputs a file, graded via preprocessor

Non-Goals

  • Multimodal LLM grading (sending files natively to vision models) — separate concern
  • Preprocessing for trace display or non-grader stages
  • Streaming preprocessing
  • Provider-side changes to emit ContentFile (separate issue if needed)
  • Built-in preprocessors in core (converters are examples, not built-ins)

Industry Context

| Framework  | Approach                                                      |
|------------|---------------------------------------------------------------|
| Inspect AI | Structured Content union preserved end-to-end (gold standard) |
| Braintrust | Attachment → S3, AttachmentReference passed to scorers        |
| promptfoo  | output: string only, no binary support                        |
| deepeval   | Slug injection into strings (anti-pattern)                    |

No framework has a first-class preprocessor primitive — this is an industry gap. The converter registry pattern (media type → converter function) is universal in adjacent domains (Apache Tika, LangChain document loaders, Unstructured.io).

Related

  • Research: agentevals-research/research/findings/binary-output-preprocessing/README.md
  • Multimodal content model research: agentevals-research/research/findings/multimodal-content-model/README.md

Metadata

Labels: core (Anything pertaining to core functionality of AgentV)
Status: Done