Release of initial benchmark #52
Merged
Changes from all commits (38 commits, all by harumiWeb):
- `98286fa` feat: Initialize benchmark project with extraction and evaluation pip…
- `f515153` feat: Add manifest and truth data for application forms, flowcharts, …
- `09164a4` fix
- `1f0924d` feat: Update Makefile and README for exstruct installation; enhance p…
- `ed3d7ea` fix: Correct flowchart ID and file paths in manifest.json
- `32bf771` feat: Add taskipy as a development dependency and update task definit…
- `3d01848` fix
- `c9a1464` feat: Update LLM client and CLI to support temperature parameter for …
- `ab8a2a3` feat: Update manifest and truth files for improved data extraction; a…
- `a40b4b1` feat: Add tax report case to manifest and corresponding truth data
- `a16086a` feat: Enhance scoring functions with normalization and support for ne…
- `a061e57` feat: Add SmartArt organization chart case to manifest with correspon…
- `522bb90` feat: Refactor extraction process to use ExStructEngine for improved …
- `1d466a7` feat: Add basic document case to manifest with corresponding truth data
- `183b81e` feat: Add total cost and call count tracking to ask function
- `3cad08a` feat: Update tax report question and truth data structure for improve…
- `f00b408` feat: Add normalization rules and scoring enhancements for improved e…
- `349f622` feat: Add alias rules for certificate of employment to normalization …
- `10bb9da` feat: Enhance benchmark report with interpretation guidelines for acc…
- `5ce4696` feat: Move summary output to the end of the report function for bette…
- `de3acfd` feat: Add evaluation protocol to README and report function for repro…
- `0a666c6` feat: Add reproducibility scripts for Windows PowerShell and macOS/Linux
- `9cb9571` feat: Add normalization rules and truth data for heatstroke and workf…
- `6681d84` feat: Add raw evaluation metrics and update README for new evaluation…
- `8fec6f5` fix: Format JSON structure for better readability and consistency
- `417da57` feat: Add Markdown conversion functionality and evaluation metrics fo…
- `55feb05` feat: Add food inspection record data and enhance Markdown evaluation…
- `5813c2c` feat: Add RUB specification document for Reconstruction Utility Bench…
- `a84535d` Add RUB (Reconstruction Utility Benchmark) support with manifest and …
- `dc05390` feat: Add RUB lite support with manifest and evaluation tasks
- `522c902` feat: Enhance Markdown functionality with full-document generation an…
- `f48afd1` feat: Refactor cost estimation to use a pricing dictionary for model …
- `17780b9` feat: Add public report generation with charts and update functionality
- `e582213` Add benchmark reports and publicize scripts
- `cbd3aba` feat: Add note about initial benchmark and future expansion
- `275d458` feat: Add benchmark section with reports and charts to documentation
- `552977d` feat: Exclude benchmark directory from coverage and linting checks
- `53890f1` fix: Update benchmark chart paths in documentation and scripts for co…
**New config file** (2 lines, excluding the benchmark directory from coverage and lint checks):

```yaml
exclude_paths:
  - "benchmark/**"
```
**`.env.example`** (new file):

```
OPENAI_API_KEY=your_key_here
# optional
OPENAI_ORG=
OPENAI_PROJECT=
```
**`.gitignore`** (new file):

```gitignore
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
drafts/
wheels/
*.egg-info

# Virtual environments
.venv
data/raw/
*.log
outputs/
.env
```
**`Makefile`** (new file):

```makefile
.PHONY: setup extract ask eval report all

setup:
	python -m pip install -U pip
	pip install -e ..
	pip install -e .

extract:
	exbench extract --case all --method all

ask:
	exbench ask --case all --method all --model gpt-4o

eval:
	exbench eval --case all --method all

report:
	exbench report

all: extract ask eval report
```
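The Makefile targets wrap the `exbench` CLI; for a quicker, cheaper smoke test the same commands can be invoked directly and restricted to a single method (here `exstruct`, one of the five methods compared). This is only a sketch reusing the flags shown in the Makefile above, not an additional documented mode:

```bash
# Run the pipeline for one method to limit LLM cost (flags as used in the Makefile above)
exbench extract --case all --method exstruct
exbench ask --case all --method exstruct --model gpt-4o
exbench eval --case all --method exstruct
exbench report
```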
**`README.md`** (new file, 194 lines, in the `benchmark/` directory):

# ExStruct Benchmark

This benchmark compares methods for answering questions about Excel documents using GPT-4o:

- exstruct
- openpyxl
- pdf (xlsx -> pdf -> text)
- html (xlsx -> html -> table text)
- image_vlm (xlsx -> pdf -> png -> GPT-4o vision)

## Requirements

- Python 3.11+
- LibreOffice (`soffice` in PATH)
- OPENAI_API_KEY in `.env`
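A quick way to confirm the LibreOffice requirement before running the benchmark (a minimal sketch; `soffice --version` is the standard LibreOffice CLI check, not a command provided by this repo):

```bash
# Verify that the soffice binary is on PATH (needed for the xlsx -> pdf/html conversions)
soffice --version || echo "soffice not found: install LibreOffice and add it to PATH"
```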
## Setup

```bash
cd benchmark
cp .env.example .env
pip install -e ..  # install exstruct from repo root
pip install -e .
```

## Run

```bash
make all
```

## Reproducibility script (Windows PowerShell)

```powershell
.\scripts\reproduce.ps1
```

Options:

- `-Case` (default: `all`)
- `-Method` (default: `all`)
- `-Model` (default: `gpt-4o`)
- `-Temperature` (default: `0.0`)
- `-SkipAsk` (skip LLM calls; uses existing responses)

## Reproducibility script (macOS/Linux)

```bash
./scripts/reproduce.sh
```

If you see a permission error, run:

```bash
chmod +x ./scripts/reproduce.sh
```

Options:

- `--case` (default: `all`)
- `--method` (default: `all`)
- `--model` (default: `gpt-4o`)
- `--temperature` (default: `0.0`)
- `--skip-ask` (skip LLM calls; uses existing responses)
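For example, to rebuild the evaluation and report for a single method from responses already on disk, without issuing new LLM calls (a sketch that only combines the flags listed above; `exstruct` is one of the five benchmarked methods):

```bash
# Re-evaluate the exstruct method using cached responses; no API calls are made
./scripts/reproduce.sh --method exstruct --skip-ask
```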
Outputs:

- `outputs/extracted/*`: extracted context (text or images)
- `outputs/prompts/*.jsonl`
- `outputs/responses/*.jsonl`
- `outputs/markdown/*/*.md`
- `outputs/markdown/responses/*.jsonl`
- `outputs/results/results.csv`
- `outputs/results/report.md`

## Public report (REPORT.md)

Generate chart images and update `REPORT.md` in the benchmark root:

```bash
python -m bench.cli report-public
```

This command writes plots under `outputs/plots/` and inserts them into `REPORT.md` between the chart markers.

## Public bundle (for publishing)

Create a clean, shareable bundle under `benchmark/public/`:

```bash
python scripts/publicize.py
```

Windows PowerShell:

```powershell
.\scripts\publicize.ps1
```

## Markdown conversion (optional)

Generate Markdown from the latest JSON responses:

```bash
python -m bench.cli markdown --case all --method all
```

Markdown scores (`score_md`, `score_md_precision`) are only computed when Markdown outputs exist under `outputs/markdown/responses/`.

If you want a deterministic renderer without LLM calls:

```bash
python -m bench.cli markdown --case all --method all --use-llm false
```

## RUB (lite)

RUB lite evaluates reconstruction utility using Markdown-only inputs.

Run Stage B tasks with the lite manifest:

```bash
python -m bench.cli rub-ask --task all --method all --manifest rub/manifest_lite.json
python -m bench.cli rub-eval --manifest rub/manifest_lite.json
python -m bench.cli rub-report
```

Outputs:

- `outputs/rub/results/rub_results.csv`
- `outputs/rub/results/report.md`

## Evaluation protocol (public)

To ensure reproducibility and fair comparison, follow these fixed settings:

- Model: gpt-4o (Responses API)
- Temperature: 0.0
- Prompt: fixed in `bench/llm/openai_client.py`
- Input contexts: generated by `bench.cli extract` using the same sources for all methods
- Normalization: the optional normalized track uses `data/normalization_rules.json`
- Evaluation: `bench.cli eval` produces Exact, Normalized, Raw, and Markdown scores
- Report: `bench.cli report` generates `report.md` and per-case detailed reports
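In practice, a full run under these fixed settings is just the reproduce script with every default made explicit (a sketch using only the flags documented above):

```bash
# Public-protocol run: all cases, all methods, gpt-4o at temperature 0.0
./scripts/reproduce.sh --case all --method all --model gpt-4o --temperature 0.0
```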
Recommended disclosure when publishing results:

- Model name + version, temperature, and date of run
- Full `normalization_rules.json` used for normalized scores
- Cost/token estimation method
- Any skipped cases and the reason (missing files, extraction failures)

## How to interpret results (public guide)

This benchmark reports four evaluation tracks to keep comparisons fair:

- Exact: strict string match with no normalization.
- Normalized: applies case-specific rules in `data/normalization_rules.json` to absorb formatting differences (aliases, split/composite labels).
- Raw: loose coverage/precision over flattened text tokens (schema-agnostic), intended to reflect raw data capture without penalizing minor label variations.
- Markdown: coverage/precision against canonical Markdown rendered from truth.

Recommended interpretation:

- Use **Exact** to compare end-to-end string fidelity (best for literal extraction).
- Use **Normalized** to compare **document understanding** across methods.
- Use **Raw** to compare how much ground-truth text is captured regardless of schema.
- Use **Markdown** to evaluate JSON-to-Markdown conversion quality.
- When methods disagree between tracks, favor Normalized for Excel-heavy layouts where labels are split/merged or phrased differently.
- Always cite both accuracy and cost metrics when presenting results publicly.

## Evaluation

The evaluator now writes four tracks:

- Exact: `score`, `score_ordered` (strict string match, current behavior)
- Normalized: `score_norm`, `score_norm_ordered` (applies case-specific rules)
- Raw: `score_raw`, `score_raw_precision` (loose coverage/precision)
- Markdown: `score_md`, `score_md_precision` (Markdown coverage/precision)

Normalization rules live in `data/normalization_rules.json` and are applied in `bench.cli eval`. Publish these rules alongside the benchmark to keep the normalized track transparent and reproducible.
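These track scores presumably land as columns in `outputs/results/results.csv` (listed under Outputs above). A minimal way to skim them, assuming a plain comma-separated file with a single header row, which this README does not spell out:

```bash
# Show the column names, then the first few result rows aligned for reading
head -1 outputs/results/results.csv | tr ',' '\n'
column -s, -t < outputs/results/results.csv | head -5
```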
## Notes

- The GPT-4o Responses API supports text and image inputs. See the docs:
  - [https://platform.openai.com/docs/api-reference/responses](https://platform.openai.com/docs/api-reference/responses)
  - [https://platform.openai.com/docs/guides/images-vision](https://platform.openai.com/docs/guides/images-vision)
- Pricing for gpt-4o used in cost estimation:
  - [https://platform.openai.com/docs/models/compare?model=gpt-4o](https://platform.openai.com/docs/models/compare?model=gpt-4o)
**`REPORT.md`** (new file, 84 lines):

# Benchmark Summary (Public)

This summary consolidates the latest results for the Excel document benchmark and RUB (structure query track). Use this file as a public-facing overview and link the full reports for reproducibility.

Sources:

- outputs/results/report.md (core benchmark)
- outputs/rub/results/report.md (RUB structure_query)

<!-- CHARTS_START -->
## Charts

*(three chart images, generated under `outputs/plots/` by `report-public`, are embedded here)*
<!-- CHARTS_END -->

## Scope

- Cases: 12 Excel documents
- Methods: exstruct, openpyxl, pdf, html, image_vlm
- Model: gpt-4o (Responses API)
- Temperature: 0.0
- Note: record the run date/time when publishing
- This is an initial benchmark (n=12) and will be expanded in future releases.

## Core Benchmark (extraction + scoring)

Key metrics from outputs/results/report.md:

- Exact accuracy (acc): best = pdf 0.607551; exstruct = 0.583802
- Normalized accuracy (acc_norm): best = pdf 0.856642; exstruct = 0.835538
- Raw coverage (acc_raw): best = exstruct 0.876495 (tied for top)
- Raw precision: best = exstruct 0.933691
- Markdown coverage (acc_md): best = pdf 0.700094; exstruct = 0.697269
- Markdown precision: best = exstruct 0.796101

Interpretation:

- pdf leads in Exact/Normalized, especially when literal string match matters.
- exstruct is strongest on Raw coverage/precision and Markdown precision, indicating robust capture and downstream-friendly structure.

## RUB (structure_query track)

RUB evaluates Stage B questions using Markdown-only inputs. The current track is "structure_query" (path selection).

Summary from outputs/rub/results/report.md:

- RUS: exstruct 0.166667 (tied for top with openpyxl at 0.166667)
- Partial F1: exstruct 0.436772 (best among methods)

Interpretation:

- exstruct is competitive for structure queries, but the margin is not large.
- This track is sensitive to question design; it rewards selection accuracy more than raw reconstruction.

## Positioning for RAG/LLM Preprocessing

Practical strengths shown by the current benchmark:

- High Raw coverage/precision (exstruct best)
- High Markdown precision (exstruct best)
- Near-top Normalized accuracy

Practical caveats:

- The Exact/Normalized top spot often goes to pdf.
- RUB structure_query shows only a modest advantage.

Recommended public framing:

- exstruct is a strong option when the goal is structured reuse (JSON/Markdown) for downstream LLM/RAG pipelines.
- pdf/VLM methods can be stronger for literal string fidelity or visual layout recovery.

## Known Limitations

- Absolute RUS values are low in some settings (sensitive to task design).
- Results vary by task type (forms/flows/diagrams vs. tables).
- Model changes (e.g., gpt-4.1) require separate runs and reporting; a sketch of such a run follows this list.
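For instance, repeating the model-dependent stages with a newer model might look like the following (a sketch: the commands and the `--model` flag come from this PR's Makefile, the model name is only an example, and its pricing would need to be present in the cost-estimation pricing dictionary):

```bash
# Re-ask with a different model, then re-evaluate and rebuild the report
exbench ask --case all --method all --model gpt-4.1
exbench eval --case all --method all
exbench report
```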
## Next Steps (optional)

- Add a reconstruction track that scores "structure rebuild" directly.
- Add task-specific structure queries (not only path selection).
- Publish the run date, model version, and normalization rules with results.