Struct Extract Eval

Domain-agnostic, field-level evaluation for LLM structured JSON extraction.

Why This Package

When an LLM extracts structured data from text, you need to know how good the extraction is. Exact match against a gold JSON is useless -- "New York" vs "NYC", or 42 vs 42.0, are semantically equivalent but fail string equality. And a single overall score tells you nothing about which fields are wrong or how they are wrong.

This package helps you:

  • Optimize prompts for LLM data extraction
  • Compare models for extraction quality
  • Compare extraction pipelines end to end

What It Provides

  • Per-field evaluation. Each field is scored independently with its own comparator and transform chain. Custom comparators can be registered for domain-specific needs.
  • Type-aware comparison. Strings, numbers, booleans, arrays, and nested objects are each handled appropriately (exact, numeric, oneof, semantic via LLM judge).
  • Semantic equivalence. Free-text fields can be judged by an LLM -- paraphrases count as correct, factual disagreements don't.
  • Diagnostic metrics. Precision, recall, and F1 at per-record and per-field level. Trace exactly which fields were missed (omissions), invented (hallucinations), or wrong (mismatches). Post-processors can reclassify results before metrics are computed (e.g. treating null values as absent for constrained-output tools).
  • Array alignment. Ordered, key-field, or Hungarian bipartite matching for arrays where element order may differ.
  • Post-processing. Plug in custom logic to adjust how fields are scored before final metrics are computed. Built-in and custom post-processors supported.
  • Single source of truth. All evaluation config lives in the schema as x-eval-* extension keys. One file, no drift.

Installation

pip install -e .                       # core only
pip install -e ".[dev]"                # core + dev tools
pip install -e ".[dev,methodology]"    # + IAA, contamination, shift detection

Requires Python >= 3.10.


Key Concepts

Terminology

Term              Meaning
Gold              Ground truth JSON instances (what the correct extraction looks like)
Extracted         LLM-produced JSON instances (what the extractor actually output)
Resolved schema   A simplified JSON schema with only type, properties, and items -- no $ref, allOf, anyOf, etc.
Eval schema       A resolved schema annotated with x-eval-* keys that tells the evaluator how to compare each field
Transform         A preprocessing step applied to both gold and extracted values before comparison (e.g. lowercase, strip). Configured via x-eval-transform.
Comparator        A function that scores one field by comparing gold and extracted values. Built-ins: exact, numeric, oneof. Custom comparators can be registered.
Batch comparator  A comparator that receives all fields in a record that use it and scores them together (e.g. LLM judge for semantic comparison, or compound comparators for grouped fields like name parts).
Post-processor    A function that reclassifies field results after scoring but before metrics are computed (e.g. treating null values as absent).

Default Field Statuses

Every leaf field gets one of these statuses after scoring:

Status         What happened and effect on metrics
match          Both sides present, values equivalent; raises both precision and recall
mismatch       Both sides present, values differ; lowers both precision and recall
omission       Key in gold but missing from extracted; lowers recall only
hallucination  Key in extracted but not in gold (or not in the schema); lowers precision only
skipped        Field marked x-eval-skip: true; excluded from all metrics
pending        Awaiting batch comparator dispatch (internal, resolved before metrics); excluded from all metrics
batch_error    Batch comparator failed for this field; excluded from all metrics

Field statuses can be changed by post-processors before metrics are computed.

Scoring: Precision, Recall, F1

Precision = matches / (matches + mismatches + hallucinations). "Of what the extractor produced, how much is correct?"

Recall = matches / (matches + mismatches + omissions). "Of what gold expected, how much did the extractor get right?"

F1 = harmonic mean of precision and recall.

Run-level metrics (mean_precision, mean_recall, mean_f1) are the arithmetic mean across all records.
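
As a quick worked example (hypothetical counts), a record with 3 matches, 1 mismatch, 2 omissions, and no hallucinations scores:

precision = 3 / (3 + 1 + 0) = 0.75
recall    = 3 / (3 + 1 + 2) = 0.50
f1        = 2 * 0.75 * 0.50 / (0.75 + 0.50) = 0.60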

The Scoring Table

In schema?  Gold has field?  Extracted has field?  Result
Yes         Yes              Yes                   Compare using the field's comparator (match/mismatch)
Yes         Yes              No                    Omission -- penalizes recall
Yes         No               Yes                   Hallucination -- penalizes precision
Yes         No               No                    Nothing -- field doesn't exist for this record
No          --               Yes                   Hallucination -- extractor invented an unknown field
No          --               No                    Nothing -- invisible to the evaluator

Important: All gold fields must be defined in the eval schema, so the evaluator knows how to score them. A gold instance can omit a field that is in the schema (it simply won't be scored for that record). validate_gold() raises an error if gold has fields not in the schema.
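
For example, with an eval schema that defines method and temperature, the following pair of records would be scored per the table above (field names and values are illustrative):

gold      = {"method": "sputtering", "temperature": 300}
extracted = {"method": "sputtering", "operator": "J. Doe"}

# method      -> match          (in schema, present on both sides, values equivalent)
# temperature -> omission       (in schema and gold, missing from extracted)
# operator    -> hallucination  (not in the schema, present in extracted)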


The Workflow

  Gold Instances          Existing JSON Schema
       |                        |
       v                        v
 infer_schema()      resolve_schema_references()
       |                        |
       +----------+-------------+
                  |
                  v
           Resolved Schema
          (type, properties,
           items only)
                  |
                  v
         annotate_xeval()
                  |
                  v
          Eval Schema (with x-eval-*)
                  |
                  v
          User reviews and edits
                  |
                  v
           validate_gold()          <-- recommended; otherwise scoring may be inaccurate
                  |
                  v
     evaluate(gold, extracted, schema)
                  |
                  v
             RunResult
       (precision, recall, F1,
        per-field breakdown,
        per-record detail)

Quick Start

from struct_extract_eval import evaluate, infer_schema, annotate_xeval, parse_eval_schema, validate_gold

gold = [
    {"method": "sputtering", "temperature": 300, "lab_id": "A1"},
    {"method": "evaporation", "temperature": 450, "lab_id": "B2"},
]
extracted = [
    {"method": "sputtering", "temperature": 301, "lab_id": "A1"},
    {"method": "evaporation", "temperature": 460, "lab_id": "B3"},
]

# 1. Infer schema from gold, add eval defaults
eval_schema = infer_schema(gold)
annotate_xeval(eval_schema)

# 2. Review and customize eval_schema (edit x-eval-* keys)

# 3. Parse the eval schema and validate gold against it to catch
#    errors and omissions before evaluation
parse_eval_schema(eval_schema)
validate_gold(gold, eval_schema)


# 4. Evaluate
result = evaluate(gold, extracted, schema=eval_schema)
print(f"F1: {result.mean_f1:.2f}")
print(f"Precision: {result.mean_precision:.2f}")
print(f"Recall: {result.mean_recall:.2f}")

for path, agg in result.per_field.items():
    print(f"  {path}: score={agg.mean_score:.2f}  matches={agg.matches}  "
          f"mismatches={agg.mismatches}  omissions={agg.omissions}")

Comparators

Built-in comparators (registered by default):

Comparator  Use case                               Score
exact       Booleans, enums, IDs, short strings    0 or 1. Strict type and value equality.
numeric     Numbers                                0 or 1. Within tolerance = 1, outside = 0. Default: exact equality.
oneof       Fields with known acceptable synonyms  1 if extracted matches any value in list, 0 otherwise.

Provided batch comparator (must be registered by the user before use):

Comparator  Use case                                  Score
semantic    Free-text fields (paraphrases, synonyms)  0 or 1. Uses an LLM judge. Short-circuits on exact string match. See examples/04_example_semantic.

Schema examples:

"x-eval-compare": "exact"
"x-eval-compare": {"numeric": {"tolerance": {"rel": 0.01}}}
"x-eval-compare": {"oneof": {"values": ["PVD", "Sputtering", "CVD"]}}
"x-eval-compare": "semantic"

Custom Comparators

Write a function that takes (gold, extracted, params) and returns a ComparatorResult, then register it:

from datetime import datetime

from struct_extract_eval.core.comparators.registry import register
from struct_extract_eval.core.comparators.comparator import ComparatorResult

def compare_date(gold, extracted, params):
    """Compare dates regardless of format."""
    formats = params.get("formats", ["%Y-%m-%d", "%b %d, %Y"])

    def parse(val):
        # Try each accepted format until one parses.
        for fmt in formats:
            try:
                return datetime.strptime(str(val), fmt)
            except ValueError:
                continue
        return None

    g, e = parse(gold), parse(extracted)
    return ComparatorResult(
        score=1.0 if (g and e and g == e) else 0.0,
        comparator="date",
    )

register("date", compare_date)

Then in the schema: "x-eval-compare": {"date": {"formats": ["%Y-%m-%d", "%b %d, %Y"]}}

Use overwrite=True to replace an existing custom registration (e.g. when re-running notebook cells): register("date", compare_date, overwrite=True). Built-in comparators (exact, numeric, oneof) cannot be overwritten.

Batch Comparators

Per-field comparators score one field at a time. Batch comparators receive all fields in a record that use them and score them together in one call. Two use cases:

  • LLM judge (semantic): batches multiple free-text fields into one API call for cost efficiency and consistency.
  • Compound comparators: groups sibling fields (e.g. surname + name) and scores them as a unit.

Batch comparators are not registered by default. See examples/04_example_semantic.ipynb and examples/05_example_compound.ipynb.



Transforms

Transforms preprocess both gold and extracted values before comparison. They are chained left to right and skipped when the value is null.

Transform             Params           What it does
lowercase             --               Convert to lowercase
strip                 --               Strip leading/trailing whitespace
normalize_whitespace  --               Collapse multiple spaces/newlines to single space
sort_tokens           --               Alphabetize whitespace-separated tokens
round_digits          {"digits": int}  Round numeric value to N decimal places

Schema: "x-eval-transform": ["strip", "lowercase"]
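
Conceptually (a minimal sketch, not the library's internal implementation), the chain above turns "  New York " and "new york" into the same string before the comparator runs:

def apply_transforms(value, chain):
    # Apply each transform left to right; null values skip the chain entirely.
    if value is None:
        return None
    for name in chain:
        if name == "strip":
            value = value.strip()
        elif name == "lowercase":
            value = value.lower()
    return value

chain = ["strip", "lowercase"]
apply_transforms("  New York ", chain) == apply_transforms("new york", chain)  # True -> exact comparator sees a match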


Array Alignment

Arrays need an alignment step before scoring: which gold element pairs with which extracted element?

Strategy           Config                                    When to use
Ordered (default)  No x-eval-align needed                    Order matters (time series, steps)
Key-field          {"match_by": "key_field", "key": "name"}  Elements have a unique ID
Hungarian          {"match_by": "hungarian"}                 No unique key, order doesn't matter

After alignment, matched pairs are scored recursively. Unmatched gold elements are omissions. Unmatched extracted elements are hallucinations.

See examples/03_example_arrays.ipynb.
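
As an illustration (field names are hypothetical, and the exact placement of x-eval-align may differ -- see the arrays notebook), a key-field alignment on an array of process steps could look like:

"steps": {
    "type": "array",
    "x-eval-align": {"match_by": "key_field", "key": "name"},
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "duration": {"type": "number"}
        }
    }
}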


Post-Processing

Post-processors run after field scoring and can reclassify field results, and thus influence the final metrics. Pass them to evaluate(post_process=...).

See the reclassify_nulls and propagate_batch_errors APIs.
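
A usage sketch (the exact import path for the built-in post-processors, and whether post_process takes a single callable or a list, are assumptions -- check the package):

from struct_extract_eval import evaluate
# hypothetical import path for the built-in post-processor
from struct_extract_eval import reclassify_nulls

result = evaluate(gold, extracted, schema=eval_schema, post_process=[reclassify_nulls])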


x-eval-* Extension Keys

All evaluation config lives in the JSON schema. No separate config file.

Key               Purpose                     Default             Example
x-eval-compare    Which comparator to use     inferred from type  "exact", {"numeric": {"tolerance": {"rel": 0.01}}}
x-eval-skip       Exclude field from scoring  false               true
x-eval-transform  Preprocessing chain         none                ["lowercase", "strip"]
x-eval-align      Array alignment strategy    ordered             {"match_by": "key_field", "key": "name"}

Config syntax: both x-eval-compare and x-eval-transform entries use the same two shapes:

  • String: "exact" (no parameters)
  • Single-key dict: {"numeric": {"tolerance": {"rel": 0.01}}} (with parameters, value must be a dict)
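
Putting the keys together, a fully annotated field in an eval schema might look like this (the notes field and its skip flag are purely illustrative):

"method": {
    "type": "string",
    "x-eval-transform": ["strip", "lowercase"],
    "x-eval-compare": "exact"
},
"notes": {
    "type": "string",
    "x-eval-skip": true
}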

Results

RunResult

Field                 Type                         Description
records               list[RecordResult]           All record results
mean_precision        float                        Mean across records
mean_recall           float                        Mean across records
mean_f1               float                        Mean across records
total_records         int                          Number of records evaluated
total_fields          int                          Total scored fields
total_omissions       int                          Fields missing from extracted
total_hallucinations  int                          Extra fields in extracted
total_batch_errors    int                          Fields where batch comparator failed
per_field             dict[str, FieldAggregation]  Per-field-path breakdown

RecordResult

Field          Type               Description
record_id      str | int          Record identifier
field_results  list[FieldResult]  Per-field scores and statuses
precision      float              Record-level precision
recall         float              Record-level recall
f1             float              Record-level F1

FieldAggregation

Field           Type   Description
mean_score      float  Average score for this field path
matches         int    Correct extractions
mismatches      int    Incorrect extractions
omissions       int    Times this field was missing
hallucinations  int    Times this field was hallucinated

The per_field breakdown is the primary diagnostic view -- it tells you which specific fields your extractor struggles with.


Examples

Step-by-step Jupyter notebooks in examples/:

Example               What it covers
00_example_schema     Getting a resolved schema (infer from gold or resolve existing schema)
01_example_simple     Simplest evaluation -- 3 steps, no customization
02_example_customize  Customizing the eval schema (oneof, tolerance, transforms, skip, custom comparator)
03_example_arrays     Array alignment (ordered, key-field, Hungarian)
04_example_semantic   Batch comparators and the LLM semantic judge
05_example_compound   Compound comparators (grouping sibling fields)

API Reference

Function                                 Purpose
infer_schema(instances)                  Infer resolved schema from gold instances
resolve_schema_references(schema)        Simplify a complex JSON Schema into a resolved schema
annotate_xeval(schema)                   Add x-eval-* defaults to a resolved schema (in-place)
set_type_default(json_type, comparator)  Change the default comparator for a JSON type
reset_type_defaults()                    Reset type-defaults mapping to built-in defaults
parse_eval_schema(schema)                Parse and validate eval schema, returns SchemaNode tree
validate_gold(gold, schema, ...)         Validate gold against schema (type errors, extra-field errors, missing-field warnings)
evaluate(gold, extracted, schema)        Evaluate gold vs extracted using a reviewed eval schema
register(name, fn, overwrite=False)      Register a custom comparator
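
For example, to make every number field tolerant by default instead of exactly equal (a sketch -- assuming set_type_default accepts the same string-or-dict comparator shapes used for x-eval-compare, and is importable from the top-level package like the other functions):

from struct_extract_eval import set_type_default, reset_type_defaults

set_type_default("number", {"numeric": {"tolerance": {"rel": 0.01}}})
# ... annotate_xeval() / evaluate() now default number fields to the tolerant comparator ...
reset_type_defaults()  # restore the built-in mapping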
