Release of initial benchmark #52
Merged
Changes from all commits (38 commits, all by harumiWeb):
- `98286fa` feat: Initialize benchmark project with extraction and evaluation pip…
- `f515153` feat: Add manifest and truth data for application forms, flowcharts, …
- `09164a4` fix
- `1f0924d` feat: Update Makefile and README for exstruct installation; enhance p…
- `ed3d7ea` fix: Correct flowchart ID and file paths in manifest.json
- `32bf771` feat: Add taskipy as a development dependency and update task definit…
- `3d01848` fix
- `c9a1464` feat: Update LLM client and CLI to support temperature parameter for …
- `ab8a2a3` feat: Update manifest and truth files for improved data extraction; a…
- `a40b4b1` feat: Add tax report case to manifest and corresponding truth data
- `a16086a` feat: Enhance scoring functions with normalization and support for ne…
- `a061e57` feat: Add SmartArt organization chart case to manifest with correspon…
- `522bb90` feat: Refactor extraction process to use ExStructEngine for improved …
- `1d466a7` feat: Add basic document case to manifest with corresponding truth data
- `183b81e` feat: Add total cost and call count tracking to ask function
- `3cad08a` feat: Update tax report question and truth data structure for improve…
- `f00b408` feat: Add normalization rules and scoring enhancements for improved e…
- `349f622` feat: Add alias rules for certificate of employment to normalization …
- `10bb9da` feat: Enhance benchmark report with interpretation guidelines for acc…
- `5ce4696` feat: Move summary output to the end of the report function for bette…
- `de3acfd` feat: Add evaluation protocol to README and report function for repro…
- `0a666c6` feat: Add reproducibility scripts for Windows PowerShell and macOS/Linux
- `9cb9571` feat: Add normalization rules and truth data for heatstroke and workf…
- `6681d84` feat: Add raw evaluation metrics and update README for new evaluation…
- `8fec6f5` fix: Format JSON structure for better readability and consistency
- `417da57` feat: Add Markdown conversion functionality and evaluation metrics fo…
- `55feb05` feat: Add food inspection record data and enhance Markdown evaluation…
- `5813c2c` feat: Add RUB specification document for Reconstruction Utility Bench…
- `a84535d` Add RUB (Reconstruction Utility Benchmark) support with manifest and …
- `dc05390` feat: Add RUB lite support with manifest and evaluation tasks
- `522c902` feat: Enhance Markdown functionality with full-document generation an…
- `f48afd1` feat: Refactor cost estimation to use a pricing dictionary for model …
- `17780b9` feat: Add public report generation with charts and update functionality
- `e582213` Add benchmark reports and publicize scripts
- `cbd3aba` feat: Add note about initial benchmark and future expansion
- `275d458` feat: Add benchmark section with reports and charts to documentation
- `552977d` feat: Exclude benchmark directory from coverage and linting checks
- `53890f1` fix: Update benchmark chart paths in documentation and scripts for co…
**New config file** (2 lines, excluding the benchmark directory from coverage and lint checks):

```yaml
exclude_paths:
  - "benchmark/**"
```
**`.env.example`** (new file):

```
OPENAI_API_KEY=your_key_here
# optional
OPENAI_ORG=
OPENAI_PROJECT=
```
**`.gitignore`** (new file):

```gitignore
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
drafts/
wheels/
*.egg-info

# Virtual environments
.venv
data/raw/
*.log
outputs/
.env
```
**`Makefile`** (new file):

```makefile
.PHONY: setup extract ask eval report all

setup:
	python -m pip install -U pip
	pip install -e ..
	pip install -e .

extract:
	exbench extract --case all --method all

ask:
	exbench ask --case all --method all --model gpt-4o

eval:
	exbench eval --case all --method all

report:
	exbench report

all: extract ask eval report
```
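The Makefile targets wrap the `exbench` CLI; for a quicker, cheaper smoke test the same commands can be invoked directly and restricted to a single method (here `exstruct`, one of the five methods compared). This is only a sketch reusing the flags shown in the Makefile above, not an additional documented mode:

```bash
# Run the pipeline for one method to limit LLM cost (flags as used in the Makefile above)
exbench extract --case all --method exstruct
exbench ask --case all --method exstruct --model gpt-4o
exbench eval --case all --method exstruct
exbench report
```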
**`README.md`** (new file, 194 lines, in the `benchmark/` directory):

# ExStruct Benchmark

This benchmark compares methods for answering questions about Excel documents using GPT-4o:

- exstruct
- openpyxl
- pdf (xlsx -> pdf -> text)
- html (xlsx -> html -> table text)
- image_vlm (xlsx -> pdf -> png -> GPT-4o vision)

## Requirements

- Python 3.11+
- LibreOffice (`soffice` in PATH)
- OPENAI_API_KEY in `.env`
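A quick way to confirm the LibreOffice requirement before running the benchmark (a minimal sketch; `soffice --version` is the standard LibreOffice CLI check, not a command provided by this repo):

```bash
# Verify that the soffice binary is on PATH (needed for the xlsx -> pdf/html conversions)
soffice --version || echo "soffice not found: install LibreOffice and add it to PATH"
```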
## Setup

```bash
cd benchmark
cp .env.example .env
pip install -e ..  # install exstruct from repo root
pip install -e .
```

## Run

```bash
make all
```

## Reproducibility script (Windows PowerShell)

```powershell
.\scripts\reproduce.ps1
```

Options:

- `-Case` (default: `all`)
- `-Method` (default: `all`)
- `-Model` (default: `gpt-4o`)
- `-Temperature` (default: `0.0`)
- `-SkipAsk` (skip LLM calls; uses existing responses)

## Reproducibility script (macOS/Linux)

```bash
./scripts/reproduce.sh
```

If you see a permission error, run:

```bash
chmod +x ./scripts/reproduce.sh
```

Options:

- `--case` (default: `all`)
- `--method` (default: `all`)
- `--model` (default: `gpt-4o`)
- `--temperature` (default: `0.0`)
- `--skip-ask` (skip LLM calls; uses existing responses)
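For example, to rebuild the evaluation and report for a single method from responses already on disk, without issuing new LLM calls (a sketch that only combines the flags listed above; `exstruct` is one of the five benchmarked methods):

```bash
# Re-evaluate the exstruct method using cached responses; no API calls are made
./scripts/reproduce.sh --method exstruct --skip-ask
```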
Outputs:

- `outputs/extracted/*`: extracted context (text or images)
- `outputs/prompts/*.jsonl`
- `outputs/responses/*.jsonl`
- `outputs/markdown/*/*.md`
- `outputs/markdown/responses/*.jsonl`
- `outputs/results/results.csv`
- `outputs/results/report.md`

## Public report (REPORT.md)

Generate chart images and update `REPORT.md` in the benchmark root:

```bash
python -m bench.cli report-public
```

This command writes plots under `outputs/plots/` and inserts them into `REPORT.md` between the chart markers.

## Public bundle (for publishing)

Create a clean, shareable bundle under `benchmark/public/`:

```bash
python scripts/publicize.py
```

Windows PowerShell:

```powershell
.\scripts\publicize.ps1
```

## Markdown conversion (optional)

Generate Markdown from the latest JSON responses:

```bash
python -m bench.cli markdown --case all --method all
```

Markdown scores (`score_md`, `score_md_precision`) are only computed when Markdown outputs exist under `outputs/markdown/responses/`.

If you want a deterministic renderer without LLM calls:

```bash
python -m bench.cli markdown --case all --method all --use-llm false
```

## RUB (lite)

RUB lite evaluates reconstruction utility using Markdown-only inputs.

Run Stage B tasks with the lite manifest:

```bash
python -m bench.cli rub-ask --task all --method all --manifest rub/manifest_lite.json
python -m bench.cli rub-eval --manifest rub/manifest_lite.json
python -m bench.cli rub-report
```

Outputs:

- `outputs/rub/results/rub_results.csv`
- `outputs/rub/results/report.md`

## Evaluation protocol (public)

To ensure reproducibility and fair comparison, follow these fixed settings:

- Model: gpt-4o (Responses API)
- Temperature: 0.0
- Prompt: fixed in `bench/llm/openai_client.py`
- Input contexts: generated by `bench.cli extract` using the same sources for all methods
- Normalization: the optional normalized track uses `data/normalization_rules.json`
- Evaluation: `bench.cli eval` produces Exact, Normalized, Raw, and Markdown scores
- Report: `bench.cli report` generates `report.md` and per-case detailed reports
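In practice, a full run under these fixed settings is just the reproduce script with every default made explicit (a sketch using only the flags documented above):

```bash
# Public-protocol run: all cases, all methods, gpt-4o at temperature 0.0
./scripts/reproduce.sh --case all --method all --model gpt-4o --temperature 0.0
```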
Recommended disclosure when publishing results:

- Model name + version, temperature, and date of run
- Full `normalization_rules.json` used for normalized scores
- Cost/token estimation method
- Any skipped cases and the reason (missing files, extraction failures)

## How to interpret results (public guide)

This benchmark reports four evaluation tracks to keep comparisons fair:

- Exact: strict string match with no normalization.
- Normalized: applies case-specific rules in `data/normalization_rules.json` to absorb formatting differences (aliases, split/composite labels).
- Raw: loose coverage/precision over flattened text tokens (schema-agnostic), intended to reflect raw data capture without penalizing minor label variations.
- Markdown: coverage/precision against canonical Markdown rendered from truth.

Recommended interpretation:

- Use **Exact** to compare end-to-end string fidelity (best for literal extraction).
- Use **Normalized** to compare **document understanding** across methods.
- Use **Raw** to compare how much ground-truth text is captured regardless of schema.
- Use **Markdown** to evaluate JSON-to-Markdown conversion quality.
- When methods disagree between tracks, favor Normalized for Excel-heavy layouts where labels are split/merged or phrased differently.
- Always cite both accuracy and cost metrics when presenting results publicly.

## Evaluation

The evaluator now writes four tracks:

- Exact: `score`, `score_ordered` (strict string match, current behavior)
- Normalized: `score_norm`, `score_norm_ordered` (applies case-specific rules)
- Raw: `score_raw`, `score_raw_precision` (loose coverage/precision)
- Markdown: `score_md`, `score_md_precision` (Markdown coverage/precision)

Normalization rules live in `data/normalization_rules.json` and are applied in `bench.cli eval`. Publish these rules alongside the benchmark to keep the normalized track transparent and reproducible.
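These track scores presumably land as columns in `outputs/results/results.csv` (listed under Outputs above). A minimal way to skim them, assuming a plain comma-separated file with a single header row, which this README does not spell out:

```bash
# Show the column names, then the first few result rows aligned for reading
head -1 outputs/results/results.csv | tr ',' '\n'
column -s, -t < outputs/results/results.csv | head -5
```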
## Notes

- The GPT-4o Responses API supports text and image inputs. See the docs:
  - [https://platform.openai.com/docs/api-reference/responses](https://platform.openai.com/docs/api-reference/responses)
  - [https://platform.openai.com/docs/guides/images-vision](https://platform.openai.com/docs/guides/images-vision)
- Pricing for gpt-4o used in cost estimation:
  - [https://platform.openai.com/docs/models/compare?model=gpt-4o](https://platform.openai.com/docs/models/compare?model=gpt-4o)
**`REPORT.md`** (new file, 84 lines):

# Benchmark Summary (Public)

This summary consolidates the latest results for the Excel document benchmark and RUB (structure query track). Use this file as a public-facing overview and link the full reports for reproducibility.

Sources:

- outputs/results/report.md (core benchmark)
- outputs/rub/results/report.md (RUB structure_query)

<!-- CHARTS_START -->
## Charts

*(three chart images, generated under `outputs/plots/` by `report-public`, are embedded here)*
<!-- CHARTS_END -->

## Scope

- Cases: 12 Excel documents
- Methods: exstruct, openpyxl, pdf, html, image_vlm
- Model: gpt-4o (Responses API)
- Temperature: 0.0
- Note: record the run date/time when publishing
- This is an initial benchmark (n=12) and will be expanded in future releases.

## Core Benchmark (extraction + scoring)

Key metrics from outputs/results/report.md:

- Exact accuracy (acc): best = pdf 0.607551; exstruct = 0.583802
- Normalized accuracy (acc_norm): best = pdf 0.856642; exstruct = 0.835538
- Raw coverage (acc_raw): best = exstruct 0.876495 (tied for top)
- Raw precision: best = exstruct 0.933691
- Markdown coverage (acc_md): best = pdf 0.700094; exstruct = 0.697269
- Markdown precision: best = exstruct 0.796101

Interpretation:

- pdf leads in Exact/Normalized, especially when literal string match matters.
- exstruct is strongest on Raw coverage/precision and Markdown precision, indicating robust capture and downstream-friendly structure.

## RUB (structure_query track)

RUB evaluates Stage B questions using Markdown-only inputs. The current track is "structure_query" (path selection).

Summary from outputs/rub/results/report.md:

- RUS: exstruct 0.166667 (tied for top with openpyxl at 0.166667)
- Partial F1: exstruct 0.436772 (best among methods)

Interpretation:

- exstruct is competitive for structure queries, but the margin is not large.
- This track is sensitive to question design; it rewards selection accuracy more than raw reconstruction.

## Positioning for RAG/LLM Preprocessing

Practical strengths shown by the current benchmark:

- High Raw coverage/precision (exstruct best)
- High Markdown precision (exstruct best)
- Near-top Normalized accuracy

Practical caveats:

- The Exact/Normalized top spot often goes to pdf.
- RUB structure_query shows only a modest advantage.

Recommended public framing:

- exstruct is a strong option when the goal is structured reuse (JSON/Markdown) for downstream LLM/RAG pipelines.
- pdf/VLM methods can be stronger for literal string fidelity or visual layout recovery.

## Known Limitations

- Absolute RUS values are low in some settings (sensitive to task design).
- Results vary by task type (forms/flows/diagrams vs. tables).
- Model changes (e.g., gpt-4.1) require separate runs and reporting; a sketch of such a run follows this list.
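For instance, repeating the model-dependent stages with a newer model might look like the following (a sketch: the commands and the `--model` flag come from this PR's Makefile, the model name is only an example, and its pricing would need to be present in the cost-estimation pricing dictionary):

```bash
# Re-ask with a different model, then re-evaluate and rebuild the report
exbench ask --case all --method all --model gpt-4.1
exbench eval --case all --method all
exbench report
```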
## Next Steps (optional)

- Add a reconstruction track that scores "structure rebuild" directly.
- Add task-specific structure queries (not only path selection).
- Publish the run date, model version, and normalization rules with results.