Research-grade tools for detecting and migrating structured equation sources in Word and related document formats, including MathType OLE, native OMML, Equation Editor 3.0 OLE, ODF MathML, and related bridge sources.
This project is Windows-first because the OMML conversion path uses Microsoft Office's MML2OMML.XSL, and the optional PDF validation path uses Word COM automation.
This repository is a research preview, not a guaranteed lossless converter.
Current published release: v0.2.0-research-preview.
The strongest deliverable-oriented route is still MathType OLE to MathML / OMML / editable Word equations. The detector-first executor also has source-core canonical MathML slices for native OMML and ODF MathML, plus a limited Equation Editor 3.0 path.
- Extracts oleObject*.bin files from a Word .docx.
- Scans supported DOCX, legacy .doc OLE, and ODF/FODT containers for formula source evidence.
- Writes formula-source manifests, routing reports, execution plans, canonical-target contracts, and evidence/blocker records.
- Converts MathType OLE / MTEF content to intermediate XML.
- Converts the intermediate XML to MathML.
- Normalizes common MathML defects found in MathType-to-MathML conversion output.
- Converts MathML to OMML with Office's MML2OMML.XSL.
- Replaces OLE formula objects in a copy of the original .docx.
- Produces a LaTeX validation preview and risk classification output.
- Converts the implemented limited Equation Editor 3.0 MTEF v2/v3 slice to canonical MathML with provenance for supported payloads.
- It does not guarantee pixel-identical layout after conversion.
- It does not guarantee semantic equivalence for every possible MathType equation.
- It does not claim universal support for every historical Equation Editor 3.0 document.
- It does not claim a statistically valid global Equation Editor 3.0 coverage percentage.
- It does not claim Word/DOCX/PDF deliverability for Equation Editor 3.0 output.
- It does not include proprietary or third-party sample documents.
- It does not vendor a JDK or large third-party runtime binaries.
- It does not replace legal review for documents that you do not own or cannot redistribute.
DOCX
-> word/embeddings/oleObject*.bin
-> MathType / MTEF XML
-> MathML
-> normalized MathML
-> OMML
-> new DOCX with editable Word math
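The first hop of this pipeline (DOCX to embedded OLE binaries) can be illustrated with the standard library alone, because a .docx is a ZIP archive. The following is a minimal sketch, not the repository's extraction script; the function names `list_ole_members` and `extract_ole_members` are hypothetical and exist only for this example.

```python
import fnmatch
import zipfile
from pathlib import Path

# Archive path pattern taken from the pipeline diagram above.
OLE_PATTERN = "word/embeddings/oleObject*.bin"


def list_ole_members(docx_path):
    """Return the names of embedded OLE objects in a DOCX, sorted by name."""
    with zipfile.ZipFile(docx_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if fnmatch.fnmatchcase(name, OLE_PATTERN)
        )


def extract_ole_members(docx_path, output_dir):
    """Copy each embedded OLE object into output_dir and return the written paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if fnmatch.fnmatchcase(name, OLE_PATTERN):
                target = output_dir / Path(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return sorted(written)
```

Calling `extract_ole_members("input.docx", "out/ole")` would leave one .bin file per embedded object in `out/ole`. The sketch stops there: decoding the MTEF payload inside each OLE container is the job of the third-party converters.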
- Windows.
- Python 3.11 or newer.
- Full Java JDK 17 or newer with javac.exe and jdk.charsets. A JRE or stripped runtime image is not enough for first-run compilation or JRuby/Nokogiri extraction.
- Microsoft Office with MML2OMML.XSL.
- Optional: pandoc for LaTeX validation previews.
- Optional: Microsoft Word desktop for PDF export validation.
Python packages:
python -m pip install -r requirements.txt
Optional visual PDF comparison packages:
python -m pip install -r requirements-visual.txt
Prepare third-party converter sources:
powershell -ExecutionPolicy Bypass -File .\scripts\bootstrap_third_party.ps1
The bootstrap script clones:
- transpect/mathtype-extension
- jure/mathtype_to_mathml
It also applies the local quality patch in patches/mathtype_to_mathml-quality-fixes.patch. You must comply with the licenses of those projects and their dependencies.
For the full external-tool requirements, known Java charset failure mode, and troubleshooting guidance, see Dependencies.
python -m venv .venv
.\.venv\Scripts\python -m pip install -r requirements.txt
powershell -ExecutionPolicy Bypass -File .\scripts\bootstrap_third_party.ps1
powershell -ExecutionPolicy Bypass -File .\run_docx_open_source_pipeline.ps1 `
-InputDocx .\input.docx `
-OutputDir .\out `
-MathtypeExtensionDir .\third_party\mathtype-extension `
-MathTypeToMathMlDir .\third_party\mathtype_to_mathml `
-Mml2OmmlXsl "C:\Program Files\Microsoft Office\root\Office16\MML2OMML.XSL"
If you do not have pandoc installed and only need the converted .docx, add -SkipLatexPreview to the pipeline command.
Main output:
- out\<input-name>.omml.docx: converted Word document.
- out\pipeline_summary.txt: conversion counts.
- out\converted\summary.csv: per-equation conversion summary.
- out\<input-name>.omml.validation.tex: LaTeX validation preview, unless -SkipLatexPreview is used.
- out\<input-name>.omml.ole_map.json: mapping between formulas and document context.
The repository now also contains an experimental shared core package for source detection and manifest generation:
- src/document_equation_migration/source_taxonomy.py
- src/document_equation_migration/manifest.py
- src/document_equation_migration/container_scan.py
- src/document_equation_migration/detectors/base.py
- src/document_equation_migration/detectors/registry.py
- src/document_equation_migration/cli.py
This shared core does not replace the existing MathType conversion scripts. It establishes a detector-first entry point that inventories formula sources before routing them to source-specific conversion paths.
Install the package locally for development:
python -m pip install -e ".[test]"
Scan a document and write a manifest, routing report, execution plan, plus a human-readable summary:
dem scan .\input.docx --output .\out\manifest.json --routing .\out\routing.json --execution-plan .\out\execution-plan.json --summary .\out\summary.txt
Equivalent module invocation:
python -m document_equation_migration.cli scan .\input.docx --output .\out\manifest.json --routing .\out\routing.json --execution-plan .\out\execution-plan.json
Supported detector-first source families currently include:
- mathtype-ole
- omml-native
- equation-editor-3-ole
- axmath-ole
- odf-native
- libreoffice-transformed
The detector-first CLI identifies formula sources and writes a manifest; it does not yet perform full MathML / OMML / LaTeX conversion for every source family.
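To make the idea of source detection concrete, here is a rough sketch of how formula sources in a DOCX body can be inventoried from `word/document.xml` text. It is not the repository's detector code. It relies on two widely used Word conventions: MathType objects carry a ProgID starting with `Equation.DSMT`, and Equation Editor 3.0 objects use `Equation.3`. The `other-ole` bucket, the `m:` namespace prefix, and the function name `inventory_sources` are assumptions made for this example.

```python
import re
from collections import Counter

# ProgID prefix -> source family. Only two families are mapped here; anything
# else lands in a hypothetical "other-ole" bucket (it may not be an equation).
PROGID_FAMILIES = (
    ("Equation.DSMT", "mathtype-ole"),
    ("Equation.3", "equation-editor-3-ole"),
)

PROGID_RE = re.compile(r'ProgID="([^"]+)"')
# Matches opening <m:oMath> tags only; <m:oMathPara> wrappers and closing
# tags are not counted. Assumes the conventional "m:" namespace prefix.
OMML_RE = re.compile(r"<m:oMath[\s>]")


def inventory_sources(document_xml):
    """Count formula source evidence per family in a document.xml string."""
    counts = Counter()
    for progid in PROGID_RE.findall(document_xml):
        for prefix, family in PROGID_FAMILIES:
            if progid.startswith(prefix):
                counts[family] += 1
                break
        else:
            counts["other-ole"] += 1
    omml_count = len(OMML_RE.findall(document_xml))
    if omml_count:
        counts["omml-native"] = omml_count
    return dict(counts)
```

A real detector would parse the XML and inspect the OLE payload rather than trust the ProgID alone, which is why the manifest records evidence instead of a final verdict.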
routing.json is a document-level route decision artifact. It includes:
- recommended_sequence: source families ordered by route priority
- route_plan: next action per source family
- manual_review_required and manual_review_reasons
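A downstream script might read these fields to print a short routing summary. The sketch below assumes that `recommended_sequence` is a list of family names, `manual_review_required` is a boolean, and `manual_review_reasons` is a list of strings; the exact JSON shapes are not documented here, so treat them as assumptions. The helper name `summarize_routing` is hypothetical.

```python
import json
from pathlib import Path


def load_routing(path):
    """Parse a routing.json file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_routing(routing):
    """Return display lines for a parsed routing.json dict.

    Assumed shapes: recommended_sequence is a list of family names,
    manual_review_reasons is a list of strings.
    """
    lines = []
    sequence = routing.get("recommended_sequence", [])
    for position, family in enumerate(sequence, start=1):
        lines.append(f"{position}. {family}")
    if routing.get("manual_review_required"):
        reasons = routing.get("manual_review_reasons", [])
        lines.append("manual review required: " + "; ".join(str(r) for r in reasons))
    return lines
```

For example, `print("\n".join(summarize_routing(load_routing(r".\out\routing.json"))))` would list the families in route order, followed by any manual-review reasons.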
execution-plan.json is a converter-oriented plan generated from routing.json. It includes:
- steps: source-family execution steps with provider name and ordered actions
- manual_review_required: aggregated gate for downstream validation
Preview the current execution plan without executing any converter commands:
dem run-plan .\out\execution-plan.json --dry-run --output .\out\execution-report.json
execution-report.json is a dry-run executor report. In the current milestone:
- source-specific providers expose concrete or explicitly gated dry-run bindings
- every source-family step includes a canonical_target block that names canonical-mathml as the shared structured target and records the current gate status for that source line
Run the currently supported execution bindings:
dem run-plan .\out\execution-plan.json --execute --output-dir .\out\execution --output .\out\execution-report.json
In the current milestone:
- omml can execute a native-preserving execution slice that extracts OMML XML fragments, writes a manifest, converts common presentation OMML structures into canonical MathML artifacts, performs a deterministic packaging pass, and records execution metadata
- mathtype is wired to the existing PowerShell/Python document pipeline, but external tools are blocked unless you explicitly pass --allow-external-tools; Word validation remains a separate gate
- equation3 provides an internal limited Equation Editor 3.0 MTEF v2/v3 to canonical MathML path for supported DOCX OLE embeddings and legacy .doc ObjectPool Equation Native streams; Word roundtrip remains downstream and is not claimed
- axmath is export-assisted and stays behind external export / validation gates; the project does not claim a native static AxMath parser, and canonical MathML evidence must come from reviewed export artifacts or a validated conversion step
- odf-native can execute a native MathML extraction slice from ODF/FODT content, while libreoffice-transformed remains a bridge provenance review gate
- render parity, Word opening, and PDF export are still validation gates; an execution report alone is not proof of deliverable Word output
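When the dry-run and execute invocations are driven from a script, it can help to build the argument list in one place so that external tools are never enabled by accident. The sketch below only assembles the flags already shown in this README (`--dry-run`, `--execute`, `--allow-external-tools`, `--output-dir`, `--output`). The function `build_run_plan_argv` is hypothetical, and requiring an output directory in execute mode is this sketch's choice, not a documented rule of the CLI.

```python
def build_run_plan_argv(plan_path, report_path, output_dir=None,
                        execute=False, allow_external_tools=False):
    """Assemble a `dem run-plan` argument list for dry-run or execute mode."""
    argv = ["dem", "run-plan", str(plan_path)]
    if execute:
        if output_dir is None:
            # Sketch-level rule: mirror the README commands, which always
            # pass --output-dir together with --execute.
            raise ValueError("execute mode needs an output directory")
        argv.append("--execute")
        if allow_external_tools:
            argv.append("--allow-external-tools")
        argv += ["--output-dir", str(output_dir)]
    else:
        if allow_external_tools:
            raise ValueError("external tools are only meaningful with execute=True")
        argv.append("--dry-run")
    argv += ["--output", str(report_path)]
    return argv
```

The resulting list can be passed to `subprocess.run(argv, check=True)`. Keeping `allow_external_tools` as an explicit keyword makes the opt-in visible at every call site.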
Execute-mode provider outputs are evidence-oriented:
- each provider output root should contain either validation-evidence.json or blocker-record.json
- validation-evidence.json and blocker-record.json should carry the same canonical_target contract used by the execution report
- validation-plan.json can exist as a supporting artifact, but it does not replace the evidence/blocker contract on its own
- validation-gated and review-gated statuses mean the slice produced traceable evidence or a review gate, not that deliverable conversion is complete
Only allow external MathType tools after the dry-run report has been inspected and Java / Office XSL / local script dependencies are ready:
dem run-plan .\out\execution-plan.json --execute --allow-external-tools --output-dir .\out\execution --output .\out\execution-report.json
For MathType live conversion, verify that JAVA_EXE / JAVAC_EXE point to a full JDK and that MML2OMML_XSL points to an Office-provided MML2OMML.XSL. A runtime missing jdk.charsets can fail during extraction with UnsupportedCharsetException: ISO-2022-JP.
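These checks can be scripted before enabling external tools. The sketch below treats JAVA_EXE, JAVAC_EXE, and MML2OMML_XSL as environment variables, which is an assumption about how the project reads them. It uses `java --list-modules` (available since Java 9) to look for `jdk.charsets`. The helper names `has_module` and `preflight` are hypothetical.

```python
import os
import subprocess
from pathlib import Path


def has_module(list_modules_output, module="jdk.charsets"):
    """True if `java --list-modules` output lists the module (lines look like name@version)."""
    for line in list_modules_output.splitlines():
        if line.strip().split("@", 1)[0] == module:
            return True
    return False


def preflight(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    java_exe = env.get("JAVA_EXE")
    if not java_exe:
        problems.append("JAVA_EXE is not set")
    else:
        try:
            result = subprocess.run([java_exe, "--list-modules"],
                                    capture_output=True, text=True, check=False)
        except OSError as exc:
            problems.append(f"cannot run JAVA_EXE: {exc}")
        else:
            if not has_module(result.stdout):
                problems.append("jdk.charsets is missing from the Java runtime")

    javac_exe = env.get("JAVAC_EXE")
    if not javac_exe or not Path(javac_exe).is_file():
        problems.append("JAVAC_EXE does not point to an existing file")

    xsl = env.get("MML2OMML_XSL")
    if not xsl or not Path(xsl).is_file():
        problems.append("MML2OMML_XSL does not point to an existing file")

    return problems
```

Running `preflight()` before passing `--allow-external-tools` should surface the missing-charset case early, before extraction can fail with `UnsupportedCharsetException`.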
Validate a target DOCX and write a reusable validation report artifact:
dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native
If an execute output already wrote execution metadata or validation evidence with a packaged validation target, resolve the DOCX directly from that JSON instead of reconstructing the path manually:
dem validate-docx --target-from-metadata .\out\execution\omml-native\package\execution-metadata.json --output-dir .\out\validation --provider omml --source-family omml-native
For deliverable-oriented Word validation, allow Word PDF export:
dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export
If you also have a reference PDF and the optional visual dependencies installed, run visual comparison:
dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export --reference-pdf .\out\reference.pdf --visual-compare
You can tighten or relax the shared visual gate explicitly:
dem validate-docx .\out\target.docx --output-dir .\out\validation --provider omml --source-family omml-native --allow-word-export --reference-pdf .\out\reference.pdf --visual-compare --visual-max-changed-ratio-per-page 0.02 --visual-max-unmatched-pages 0
validation-report.json distinguishes:
- deliverable-ready: target DOCX exists and Word PDF export passed
- review-gated: Word PDF export passed and visual compare ran, but the current visual gate threshold was exceeded
- research-only: structural evidence exists, but Word deliverability was not yet validated
- blocked: target file is missing, Word export failed, or requested visual comparison failed
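The four statuses can be read as a small decision function. The sketch below is one possible reading of the list above, not the validator's actual logic. The input vocabulary (`"passed"`, `"failed"`, `"not-run"`, `"review-gated"`, `"not-requested"`) and the precedence of the checks are assumptions, and `classify_validation` is a hypothetical name.

```python
def classify_validation(target_exists, word_export, visual_compare):
    """Map gate results to a validation-report status.

    word_export:    "passed" | "failed" | "not-run"                          (assumed vocabulary)
    visual_compare: "passed" | "review-gated" | "failed" | "not-requested"   (assumed vocabulary)
    """
    # Any hard failure blocks the output, regardless of the other gates.
    if not target_exists or word_export == "failed" or visual_compare == "failed":
        return "blocked"
    if word_export == "passed":
        # Export passed; a visual gate over threshold downgrades to review.
        if visual_compare == "review-gated":
            return "review-gated"
        return "deliverable-ready"
    # Target exists but Word deliverability was never checked.
    return "research-only"
```

Checking the blocking conditions first keeps a failed visual comparison from being masked by a successful Word export.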
Important: visual_compare = passed now means both "the compare pipeline ran" and "the current visual gate thresholds were met". If the compare pipeline runs but there is a page-count mismatch or the changed ratio exceeds the threshold, the visual check becomes review-gated instead of passed.
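As a rough illustration of that rule, the gate can be thought of as two threshold checks over the comparison result. In this sketch the defaults simply reuse the values from the example command (`0.02` and `0`) and are not claimed to be the tool's defaults. Treating "exceeds" as a strict greater-than comparison is also an assumption, and `visual_gate` is a hypothetical name.

```python
def visual_gate(changed_ratio_per_page, unmatched_pages,
                max_changed_ratio_per_page=0.02, max_unmatched_pages=0):
    """Return "passed" or "review-gated" for a completed visual comparison.

    changed_ratio_per_page: iterable of per-page changed ratios in [0, 1].
    unmatched_pages: number of pages without a counterpart (page-count mismatch).
    """
    worst_ratio = max(changed_ratio_per_page, default=0.0)
    if unmatched_pages > max_unmatched_pages:
        return "review-gated"
    if worst_ratio > max_changed_ratio_per_page:
        return "review-gated"
    return "passed"
```

For instance, `visual_gate([0.004, 0.031], unmatched_pages=0)` returns `"review-gated"` because the second page is above the 0.02 limit, even though the compare pipeline itself ran to completion.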
For the research-preview release, MathType conversion results should be interpreted conservatively:
- deliverable-ready is an automated candidate only when Word export passes, conversion/replacement counts are complete, and the configured visual gate passes.
- review-gated can be a manual-review candidate when Word export passes, conversion/replacement counts are complete, source and converted page counts match, unmatched pages are zero, and the visual drift is documented for human review.
- blocked means the output should not be presented as a usable converted document until the failed or missing gate is resolved.
Current real MathType evidence supports the guarded layout-preservation path as a manual-review candidate, not as a pixel-identical or lossless converter. The guarded layout option remains opt-in because its current factor is sample-derived and requires broader validation.
For a structured statement of the current claim boundary, evidence classes, and manual-review gate, see MathType evidence pack.
For the current Equation Editor 3.0 source-core claim boundary and public native-stream fixtures, see Equation Editor 3.0 evidence pack.
Before using a review-gated output in production, review the generated PDF, inspect changed pages, spot-check high-risk formulas, and keep the source document available for comparison.
Run the current test gate:
python -m pytest tests -q
After generating summary.csv and an OLE map, classify equations with:
python .\analyze_formula_risks.py `
.\out\converted\summary.csv `
.\out\input.omml.ole_map.json `
.\out\risk_analysis.json `
.\out\risk_analysis.txt
The categories are:
- auto_replace: simple formulas that did not trigger known risk rules.
- spot_check: complex formulas that deserve sampling.
- manual_review: formulas that match patterns associated with likely conversion defects.
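One way to act on these categories is to turn them into a review queue: every manual_review formula is checked, and a reproducible sample of spot_check formulas is drawn. The sketch below assumes the classification has already been loaded into a plain mapping of formula id to category. The layout of risk_analysis.json is not documented here, and `build_review_queue` and the 20% sampling fraction are illustrative choices only.

```python
import math
import random


def build_review_queue(categories, spot_check_fraction=0.2, seed=0):
    """Pick formulas for human review from a {formula_id: category} mapping."""
    manual = sorted(fid for fid, cat in categories.items() if cat == "manual_review")
    spot = sorted(fid for fid, cat in categories.items() if cat == "spot_check")

    # Sample at least one spot_check formula whenever any exist.
    sample_size = 0
    if spot:
        sample_size = min(len(spot), max(1, math.ceil(len(spot) * spot_check_fraction)))

    # A fixed seed keeps the sample stable between QA runs.
    sampled = sorted(random.Random(seed).sample(spot, sample_size))
    return {"manual_review": manual, "spot_check_sample": sampled}
```

Formulas in auto_replace are deliberately left out of the queue, since they did not trigger any known risk rule.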
Risk analysis is most useful when LaTeX previews are available, so QA runs should keep LaTeX previews enabled when possible.
Optional PDF validation requires Microsoft Word desktop:
powershell -ExecutionPolicy Bypass -File .\export_word_pdf.ps1 `
-InputDocx .\out\input.omml.docx `
-OutputPdf .\out\converted.pdf
Visual PDF comparison uses PyMuPDF and Pillow:
python -m pip install -r requirements-visual.txt
python .\compare_pdf_visual.py .\original.pdf .\converted.pdf .\out\visual_compare
- Architecture
- Dependencies
- Limitations
- MathType evidence pack
- Equation Editor 3.0 evidence pack
- Research-preview release notes
This repository's original code is licensed under the MIT License. Third-party tools referenced by this project keep their own licenses.