
# Case Study: SynthGL - 4-9x Faster GL Exports

## Company

SynthGL generates synthetic accounting data for fintech QA teams and AI model training. Integration teams at companies like Plaid, Ramp, and Mercury use it to test accounting connectors against deterministic general ledger (GL) data - journal entries, trial balances, financial statements - without touching production systems.

## The Workload

SynthGL exports GL journal entries to Excel workbooks. Each row is a 14-column journal line: transaction IDs, dates, account codes, debit/credit amounts, memos, and policy metadata. A typical export spans 5,000 to 50,000 rows.

These exports run continuously in CI pipelines, so Excel I/O speed directly shapes developer feedback loops.

Data profile: 14 columns per row (mixed types: strings, floats, dates, IDs). Real accounting data with realistic value distributions.
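A journal line of this shape can be sketched as a plain Python structure. This is an illustrative stand-in only: the field names and values below are hypothetical, not SynthGL's actual schema.

```python
from dataclasses import astuple, dataclass
from datetime import date


@dataclass
class JournalLine:
    """Illustrative 14-column GL journal line (field names are hypothetical)."""
    line_id: str
    entry_id: str
    txn_date: date
    account_code: str
    account_name: str
    debit: float
    credit: float
    currency: str
    memo: str
    department: str
    cost_center: str
    policy_tag: str
    created_by: str
    source_system: str


line = JournalLine(
    "GL-000001", "JE-0001", date(2024, 1, 31), "1000", "Cash",
    1250.00, 0.00, "USD", "Monthly SaaS revenue", "Sales",
    "CC-100", "rev-rec", "synthgl", "billing",
)

# Flatten to a 14-element mixed-type row, the shape a spreadsheet
# append call would consume.
row = list(astuple(line))
```

The mixed types (strings, a `date`, floats) are what make these rows representative of real accounting exports rather than uniform numeric data.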

## The Challenge

SynthGL relied on openpyxl for all Excel I/O. Below 1,000 rows, performance was fine. Larger test datasets exposed the bottleneck:

- A 50,000-row GL export took 2.9s to write and 4.6s to read back for validation
- The test suite generates dozens of bundles per run, making Excel I/O a significant share of total CI time
- Reading a workbook to verify exported data took longer than generating the data itself

## The Migration

SynthGL replaced openpyxl with WolfXL in one import change:

```diff
- from openpyxl import load_workbook, Workbook
+ from wolfxl import load_workbook, Workbook
```

For write-heavy paths, they adopted WolfXL's `append()` API, which buffers rows in Python and flushes them to Rust in a single call at save time:

```python
from wolfxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "GL Journal"

ws.append(["GLID", "EntryID", "TxnDate", "AccountCode", "Debit", "Credit", ...])
for line in journal_lines:
    ws.append([line.line_id, line.entry_id, line.date, line.account_code, ...])

wb.save("gl_export.xlsx")
```

For reads, `iter_rows(values_only=True)` pulls all values in a single Rust call:

```python
wb = load_workbook("gl_export.xlsx")
ws = wb[wb.sheetnames[0]]
for row in ws.iter_rows(values_only=True):
    validate(row)
wb.close()
```
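The `validate()` call above is SynthGL's own code, which the case study does not show. A minimal stand-in that checks the row shape and a basic accounting invariant might look like this (column positions and rules are assumptions for illustration):

```python
def validate(row):
    """Hypothetical row check: 14 columns, non-negative amounts,
    and each line carries a debit or a credit, but not both."""
    assert len(row) == 14, f"expected 14 columns, got {len(row)}"
    debit, credit = row[4], row[5]  # column positions are illustrative
    assert debit >= 0 and credit >= 0, "amounts must be non-negative"
    assert not (debit > 0 and credit > 0), "line is both debit and credit"


# A well-formed line passes silently.
validate(["GL-1", "JE-1", "2024-01-31", "1000",
          1250.00, 0.00, "USD", "memo", "Sales", "CC-100",
          "rev-rec", "synthgl", "billing", "saas"])
```

Because `iter_rows(values_only=True)` yields plain tuples rather than cell objects, a check like this touches no spreadsheet API at all, which is what keeps the read path fast.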

The rest of the codebase worked unchanged. Cell-access patterns like `ws.cell()` and `ws["A1"].value` kept working as before.

## Results

Benchmarked on Apple M-series, Python 3.12, mean of 5 runs per scale. Data generated by SynthGL's BundleGenerator using the saas company profile.
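A mean-of-N timing harness along these lines is straightforward to write with the standard library. This is a generic sketch, not SynthGL's actual benchmark script; the workload lambda stands in for whatever write or read path you want to measure.

```python
import statistics
import time


def bench(fn, rounds=5):
    """Run fn `rounds` times and return the mean wall-clock seconds."""
    times = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.mean(times)


# Example: time a trivial workload standing in for an Excel write.
mean_s = bench(lambda: sum(range(100_000)), rounds=5)
```

Using `time.perf_counter()` rather than `time.time()` matters here: it is monotonic and has the highest available resolution, so sub-millisecond timings like the 1,000-row runs below are meaningful.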

### Write Performance

| Scale | Cells | openpyxl | WolfXL | Speedup |
|-------------|---------|--------|--------|------|
| 1,000 rows | 14,000 | 0.070s | 0.015s | 4.7x |
| 5,000 rows | 70,000 | 0.278s | 0.076s | 3.7x |
| 10,000 rows | 140,000 | 0.614s | 0.151s | 4.1x |
| 20,000 rows | 280,000 | 1.219s | 0.299s | 4.1x |
| 50,000 rows | 700,000 | 2.967s | 0.761s | 3.9x |

### Read Performance

| Scale | Cells | openpyxl | WolfXL | Speedup |
|-------------|---------|--------|--------|------|
| 1,000 rows | 14,000 | 0.071s | 0.010s | 7.2x |
| 5,000 rows | 70,000 | 0.394s | 0.049s | 8.1x |
| 10,000 rows | 140,000 | 0.838s | 0.102s | 8.2x |
| 20,000 rows | 280,000 | 1.749s | 0.205s | 8.5x |
| 50,000 rows | 700,000 | 4.616s | 0.515s | 9.0x |

## Key Takeaways

- **Writes are 3.7-4.7x faster** across all dataset sizes using `append()` or `write_rows()`.
- **Reads get faster at scale:** 7.2x at 1K rows, 9.0x at 50K. WolfXL's fixed Rust initialization cost amortizes across more data, so larger files benefit most.
- **One import change, no rewrite:** batch APIs (`append`, `write_rows`, `iter_rows(values_only=True)`) delivered additional gains with minimal refactoring.
- **50K-row exports dropped from 7.6s to 1.3s** (write + read combined), cutting CI cycle time directly.
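The combined figure in the last takeaway follows directly from the 50,000-row lines of the two tables above:

```python
# 50,000-row totals, taken from the write and read tables
openpyxl_total = 2.967 + 4.616  # write + read, seconds
wolfxl_total = 0.761 + 0.515

print(round(openpyxl_total, 1))                 # 7.6 (s)
print(round(wolfxl_total, 1))                   # 1.3 (s)
print(round(openpyxl_total / wolfxl_total, 1))  # 5.9 (x, combined speedup)
```

Note the combined speedup (about 5.9x) sits between the 3.9x write and 9.0x read factors, weighted toward writes because they dominate WolfXL's share of the total.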

## Reproduce These Results

The benchmark script is open source:

```sh
pip install wolfxl openpyxl

# Clone SynthGL and run the benchmark
git clone https://github.com/SynthGL/synthgl.git
cd synthgl
uv sync --extra dev
uv run python3 scripts/benchmark_wolfxl.py \
    --scales 1000,5000,10000,20000,50000 \
    --rounds 5 \
    --output results/wolfxl_benchmark.json
```

Or benchmark WolfXL against any Excel file you already have:

```sh
pip install wolfxl
wolfxl bench your_file.xlsx
```