Preprocessing repo for the Minutes of the Austrian Cabinet Council 1919–1920 project. Holds the workflow for converting transcription DOCX files into project-compliant TEI-XML documents ready for editorial work in krp-data.
input/                            DOCX transcription files (gitignored)
preprocessing/
├── docx-to-tei.sh                Bash script for local TEIGarage conversion
├── teigarage-out/                generic TEI-XML output
└── xslts/
    └── upconvert.xsl             XSLT 3.0 stylesheet for upconversion
header-docs/                      TEI headers generated from JSON metadata
saxon/                            Saxon HE 12.5 + xmlresolver
data/templates/                   project-compliant TEI-XML output files
src/
└── generate_tei_templates.py     Python script for generating header-docs
build.xml                         Ant build file for running upconvert.xsl via Saxon
A Python script fetches protocol metadata from a Baserow JSON dump and generates one header file per protocol into header-docs/ (naming: krp-???_header.xml).
uv run src/generate_tei_templates.py
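As a rough illustration of this step (not the actual contents of generate_tei_templates.py), the sketch below writes one header file per record of a JSON dump. The field names `id` and `title`, the function names, the three-digit zero-padded number, and the minimal teiHeader layout are all assumptions made for the example; the real Baserow schema and header structure may differ.

```python
import json
from pathlib import Path
from xml.etree import ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_header(title: str) -> ET.Element:
    """Build a minimal teiHeader holding only a title (illustrative layout)."""
    header = ET.Element(f"{{{TEI_NS}}}teiHeader")
    file_desc = ET.SubElement(header, f"{{{TEI_NS}}}fileDesc")
    title_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}titleStmt")
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = title
    return header


def write_headers(dump_path: Path, out_dir: Path) -> list[Path]:
    """Write one krp-NNN_header.xml per record and return the written paths."""
    ET.register_namespace("", TEI_NS)
    records = json.loads(dump_path.read_text(encoding="utf-8"))
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        # Assumption: "id" is the protocol number, "title" a plain string.
        target = out_dir / f"krp-{int(record['id']):03d}_header.xml"
        ET.ElementTree(build_header(record["title"])).write(
            str(target), encoding="utf-8", xml_declaration=True
        )
        written.append(target)
    return written
```

Called as `write_headers(Path("dump.json"), Path("header-docs"))`, this would produce files such as header-docs/krp-007_header.xml for a record with `id` 7.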
Transcription DOCX files placed in input/ are converted to generic TEI-XML via a local TEIGarage Docker container. Output goes to preprocessing/teigarage-out/.
bash preprocessing/docx-to-tei.sh
Warning
The TEIGarage conversion does not preserve DOCX paragraph indentation. In the Stenogramm sections, the speaker-turn grouping, which the DOCX expresses through hanging indents, is therefore lost.
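Because this information does not survive the conversion, it may help to check in the DOCX itself which paragraphs carry a hanging indent. The sketch below is one possible way to do so with the Python standard library only; it is not part of this repo. It assumes the indent is set directly on the paragraph (a `w:ind` element with a `w:hanging` attribute) rather than inherited from a named style, and the function names are invented for the example.

```python
import zipfile
from xml.etree import ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def hanging_flags(document_xml: bytes) -> list[tuple[bool, str]]:
    """Return (has_hanging_indent, text) for each paragraph in document.xml."""
    root = ET.fromstring(document_xml)
    result = []
    for para in root.iter(f"{{{W_NS}}}p"):
        ind = para.find("w:pPr/w:ind", NS)
        # Only direct paragraph formatting is checked; indents that come
        # from a paragraph style are not detected by this sketch.
        hanging = ind is not None and ind.get(f"{{{W_NS}}}hanging") not in (None, "0")
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        result.append((hanging, text))
    return result


def hanging_flags_from_docx(path: str) -> list[tuple[bool, str]]:
    """Read word/document.xml from a DOCX (a ZIP archive) and inspect it."""
    with zipfile.ZipFile(path) as archive:
        return hanging_flags(archive.read("word/document.xml"))
```

The resulting flags could be compared against the paragraphs in preprocessing/teigarage-out/ to restore the grouping by hand or in a later processing step.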
An Ant build applies upconvert.xsl (XSLT 3.0, processed by Saxon HE 12.5) to each generic TEI-XML in preprocessing/teigarage-out/. The XSLT automatically merges the matching header-doc and transforms the body into a project-compliant structure. Output goes to data/templates/.
ant
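For orientation, the per-file transformation that the Ant build performs can be approximated as one Saxon command line per input file. The sketch below only builds those command lines and is not a replacement for build.xml: it assumes that every jar in saxon/ belongs on the classpath, that the TEIGarage output files end in .xml, and that the stylesheet needs no extra parameters. The function name is hypothetical, and the actual build may differ on all of these points.

```python
import os
from pathlib import Path


def saxon_commands(repo: Path) -> list[list[str]]:
    """Build one Saxon command line per file in preprocessing/teigarage-out/."""
    # Assumption: all jars in saxon/ (Saxon HE and xmlresolver) go on the classpath.
    jars = sorted((repo / "saxon").glob("*.jar"))
    classpath = os.pathsep.join(str(jar) for jar in jars)
    xsl = repo / "preprocessing" / "xslts" / "upconvert.xsl"
    out_dir = repo / "data" / "templates"
    commands = []
    for source in sorted((repo / "preprocessing" / "teigarage-out").glob("*.xml")):
        commands.append([
            "java", "-cp", classpath, "net.sf.saxon.Transform",
            f"-s:{source}",                 # source document
            f"-xsl:{xsl}",                  # stylesheet
            f"-o:{out_dir / source.name}",  # output keeps the input filename
        ])
    return commands
```

Each command could then be passed to `subprocess.run(command, check=True)` from the repo root.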
Caution
The upconvert.xsl stylesheet is a work in progress and does not yet generate actionable templates for editorial markup.
The files in data/templates/ (which preserve the transcription DOCX filenames) are ready to be copied into data/editions/ in the krp-data repo and renamed to krp-???.xml for editorial work.
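The copy-and-rename step could be scripted along the following lines. This is a minimal sketch under stated assumptions: the mapping from transcription filename to protocol number has to be supplied by hand (the repo does not define one here), the `???` placeholder is taken to mean a three-digit zero-padded number, and the function name is hypothetical.

```python
import shutil
from pathlib import Path


def copy_templates(templates: Path, editions: Path, numbers: dict[str, int]) -> list[Path]:
    """Copy templates into editions/, renaming each file to krp-NNN.xml."""
    editions.mkdir(parents=True, exist_ok=True)
    copied = []
    for name, number in sorted(numbers.items()):
        target = editions / f"krp-{number:03d}.xml"
        if target.exists():
            # Never overwrite a file that may already contain editorial work.
            raise FileExistsError(target)
        shutil.copy2(templates / name, target)
        copied.append(target)
    return copied
```

Refusing to overwrite existing targets is a deliberate choice in this sketch, since files in data/editions/ may already have been edited.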
Steps 1 and 3 (header generation and TEI-XML upconversion) are automated via GitHub Actions:
- write-headers (write-headers.yml): Generates TEI header-docs from Baserow metadata. Runs on pushes to src/ and can also be triggered manually.
- upconvert-tei (upconvert-tei.yml): Runs the Ant/Saxon TEI-XML upconversion. Triggered by pushes to preprocessing/teigarage-out/ or header-docs/, or manually.
The first Action generates up-to-date header-docs. This in turn triggers the TEI-XML upconversion, which also fires when new TEIGarage output is pushed.
Important
When run manually, the upconvert-tei Action uses whatever header-docs are currently in the repo. If in doubt, run write-headers first so that the TEI headers are up to date with the upstream metadata.