This document contains the command-line workflow and script reference.
If you are a regular user, prefer the UI in README.md.
Scripts and entry points covered:

- generate_training_data.py
- train_model.py
- inference_sbb.py
- ocr_on_detections.py
- pseudo_label_from_vlm.py
- layout_rule_filter.py
- run_pseudo_label_workflow.py
- scripts/download_openpecha_line_segmentation.py
- scripts/train_line_segmentation.py
- cli.py (unified diffusion + retrieval-encoder commands)
Install dependencies:

```bash
pip install -r requirements.txt
```

requirements.txt is the unified dependency file for CLI, UI, VLM, diffusion/LoRA, and retrieval encoder training.
Legacy files requirements-ui.txt, requirements-vlm.txt, and requirements-lora.txt remain as compatibility wrappers.
Use:

```bash
python cli.py -h
```

Available subcommands:
- prepare-texture-lora-dataset
- train-texture-lora
- texture-augment
- train-image-encoder
- train-text-encoder
- export-text-hierarchy
- gen-patches
- weak-ocr-label
- mine-mnn-pairs
- train-text-hierarchy-vit
- eval-text-hierarchy-vit
- faiss-text-hierarchy-search
- eval-faiss-crosspage
- prepare-donut-ocr-dataset
- eval-ocr-tokenizer
- train-donut-ocr
- run-donut-ocr-workflow
- download-openpecha-ocr-lines
- download-openpecha-line-segmentation
- train-line-segmentation
Generate synthetic YOLO training data:

```bash
python generate_training_data.py \
--train_samples 100 \
--val_samples 100 \
--font_path_tibetan ext/Microsoft\ Himalaya.ttf \
--font_path_chinese ext/simkai.ttf \
--dataset_name tibetan-yolo
```

Optional: apply LoRA-based texture augmentation directly during data generation:
```bash
python generate_training_data.py \
--train_samples 100 \
--val_samples 20 \
--font_path_tibetan ext/Microsoft\ Himalaya.ttf \
--font_path_chinese ext/simkai.ttf \
--dataset_name tibetan-yolo \
--lora_augment_path ./models/texture-lora-sdxl/texture_lora.safetensors \
--lora_augment_splits train \
--lora_augment_targets images
```

Train the detection model and export it:

```bash
python train_model.py --dataset tibetan-yolo --epochs 100 --export
```

Run inference on SBB pages by PPN with the trained model:

```bash
python inference_sbb.py --ppn 337138764X --model runs/detect/train/weights/best.pt
```

List available parsers:
```bash
python ocr_on_detections.py --list-parsers
```

Legacy parser:
```bash
python ocr_on_detections.py --source image.jpg --parser legacy --model runs/detect/train/weights/best.pt --lang bod
```

MinerU2.5 parser:
```bash
python ocr_on_detections.py --source image.jpg --parser mineru25 --mineru-command mineru
```

Transformer parser examples:
```bash
python ocr_on_detections.py --source image.jpg --parser paddleocr_vl
python ocr_on_detections.py --source image.jpg --parser qwen25vl
python ocr_on_detections.py --source image.jpg --parser qwen3_vl
python ocr_on_detections.py --source image.jpg --parser granite_docling
python ocr_on_detections.py --source image.jpg --parser deepseek_ocr
python ocr_on_detections.py --source image.jpg --parser florence2
python ocr_on_detections.py --source image.jpg --parser groundingdino
```

End-to-end (generate synthetic data + prepare manifests + train OCR model):
```bash
python cli.py run-donut-ocr-workflow \
--dataset_name tibetan-donut-ocr-label1 \
--dataset_output_dir ./datasets \
--font_path_tibetan "ext/Microsoft Himalaya.ttf" \
--font_path_chinese ext/simkai.ttf \
--train_samples 2000 \
--val_samples 200 \
--target_newline_token "<NL>" \
--model_output_dir ./models/donut-ocr-label1
```

Optionally, apply LoRA augmentation during the generation step:
```bash
python cli.py run-donut-ocr-workflow \
--dataset_name tibetan-donut-ocr-label1 \
--dataset_output_dir ./datasets \
--font_path_tibetan "ext/Microsoft Himalaya.ttf" \
--font_path_chinese ext/simkai.ttf \
--lora_augment_path ./models/texture-lora-sdxl/texture_lora.safetensors \
--lora_augment_splits train \
--lora_augment_targets images_and_ocr_crops \
--model_output_dir ./models/donut-ocr-label1
```

Manual step-by-step:
```bash
# A) Synthetic data + OCR crops/targets (label 1 only for crops)
python generate_training_data.py \
--dataset_name tibetan-donut-ocr-label1 \
--output_dir ./datasets \
--font_path_tibetan "ext/Microsoft Himalaya.ttf" \
--font_path_chinese ext/simkai.ttf \
--train_samples 2000 \
--val_samples 200 \
--save_rendered_text_targets \
--save_ocr_crops \
--ocr_crop_labels 1 \
--target_newline_token "<NL>"
```
```bash
# B) Prepare JSONL manifests from ocr_targets/ocr_crops (label_id=1)
python cli.py prepare-donut-ocr-dataset \
--dataset_dir ./datasets/tibetan-donut-ocr-label1 \
--output_dir ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1 \
--label_id 1
```
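Each manifest produced in step B is a JSONL file with one record per OCR crop. A minimal reader sketch; the field names here are assumptions for illustration and the real manifests may differ:

```python
import json

# Hypothetical record layout, e.g. {"image": "ocr_crops/0001_2.png", "text": "..."}.
with open("./datasets/tibetan-donut-ocr-label1/donut_ocr_label1/train_manifest.jsonl", encoding="utf-8") as f:
    for raw in f:
        record = json.loads(raw)
        image_path, target_text = record["image"], record["text"]  # assumed keys
```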
```bash
# C) Train VisionEncoderDecoder OCR model
python cli.py train-donut-ocr \
--train_manifest ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1/train_manifest.jsonl \
--val_manifest ./datasets/tibetan-donut-ocr-label1/donut_ocr_label1/val_manifest.jsonl \
--output_dir ./models/donut-ocr-label1 \
--model_name_or_path microsoft/trocr-base-stage1
```

Recommended for OpenPecha OCR line datasets (BoSentencePiece, no tokenizer retraining):
```bash
# A) Download and merge OpenPecha OCR HF datasets into train/test/eval line format
python cli.py download-openpecha-ocr-lines \
--output-dir ./datasets/openpecha_ocr_lines
```
```bash
# B) Prepare Donut manifests from line metadata (val auto-maps to eval)
python cli.py prepare-donut-ocr-dataset \
--dataset_dir ./datasets/openpecha_ocr_lines \
--output_dir ./datasets/openpecha_ocr_lines/donut_manifests \
--splits train,val \
--text_field text
```
```bash
# C) Compare BoSentencePiece vs baselines before training
python cli.py eval-ocr-tokenizer \
--manifests-dir ./datasets/openpecha_ocr_lines/donut_manifests \
--tokenizer openpecha/BoSentencePiece \
--with-baselines \
--output-json ./datasets/openpecha_ocr_lines/donut_manifests/tokenizer_compare.json
```
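Tokenizer comparisons of this kind usually come down to encoding efficiency on the target corpus. As a rough illustration, a tokens-per-character fertility metric (a sketch; not necessarily the metrics eval-ocr-tokenizer actually reports):

```python
from transformers import AutoTokenizer

def fertility(tokenizer_name: str, texts: list[str]) -> float:
    """Average number of tokens per character; lower means the tokenizer
    encodes the corpus more compactly."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_tokens = sum(len(tok.encode(t, add_special_tokens=False)) for t in texts)
    n_chars = sum(len(t) for t in texts)
    return n_tokens / max(n_chars, 1)
```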
```bash
# D) Train Donut OCR with the same tokenizer used in evaluation
python cli.py train-donut-ocr \
--train_manifest ./datasets/openpecha_ocr_lines/donut_manifests/train_manifest.jsonl \
--val_manifest ./datasets/openpecha_ocr_lines/donut_manifests/val_manifest.jsonl \
--output_dir ./models/donut-openpecha-ocr \
--model_name_or_path microsoft/trocr-base-stage1 \
--tokenizer_path openpecha/BoSentencePiece
```

Note: The Donut OCR training flow now always reuses the configured tokenizer path directly; there is no tokenizer retraining flag.
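For orientation, this is roughly what reusing a pretrained tokenizer means in a VisionEncoderDecoder setup. A minimal sketch, not the project's training code; that openpecha/BoSentencePiece loads via AutoTokenizer is an assumption here:

```python
from transformers import AutoTokenizer, VisionEncoderDecoderModel

# Load the tokenizer as-is; there is no retraining step.
tokenizer = AutoTokenizer.from_pretrained("openpecha/BoSentencePiece")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-stage1")

# Align the decoder vocabulary and special tokens with the reused tokenizer.
model.decoder.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.decoder_start_token_id = (
    tokenizer.bos_token_id if tokenizer.bos_token_id is not None else tokenizer.cls_token_id
)
```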
Download the Hugging Face line-coordinate dataset and convert it into an Ultralytics segment dataset:
```bash
python cli.py download-openpecha-line-segmentation \
--output-dir ./datasets/openpecha_line_segmentation
```

If you want to create a second dataset with vertically expanded line polygons, you can derive it from the raw base dataset:
```bash
python cli.py expand-line-segmentation-dataset \
--dataset ./datasets/openpecha_line_segmentation/data.yaml \
--output-dir ./datasets/openpecha_line_segmentation_padded \
--top-ratio 0.20 \
--bottom-ratio 0.20
```
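For intuition, the expansion pads each polygon vertically by a fraction of its own height. A minimal sketch assuming normalized YOLO-segment coordinates (not the CLI's exact implementation):

```python
def expand_polygon_vertically(points, top_ratio=0.20, bottom_ratio=0.20):
    """points: (x, y) tuples in normalized [0, 1] coordinates."""
    ys = [y for _, y in points]
    height = max(ys) - min(ys)
    mid = (max(ys) + min(ys)) / 2.0
    expanded = []
    for x, y in points:
        if y < mid:   # top half of the polygon moves up
            y = max(0.0, y - top_ratio * height)
        else:         # bottom half moves down
            y = min(1.0, y + bottom_ratio * height)
        expanded.append((x, y))
    return expanded
```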
If you want to filter tall/narrow line polygons out into a separate dataset root, use the dedicated filter CLI:

```bash
python cli.py filter-line-segmentation-dataset \
--dataset ./datasets/openpecha_line_segmentation_padded/data.yaml \
--output-dir ./datasets/openpecha_line_segmentation_padded_filtered \
--min-width-height-ratio 1.0
```
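"Tall/narrow" is judged against the polygon's bounding box. A sketch of the criterion as inferred from the flag name (an assumption, not the CLI's exact code):

```python
def keep_polygon(points, min_width_height_ratio=1.0):
    """Keep a polygon only if its bounding box is at least as wide as it is
    tall (at the default ratio of 1.0)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    return width / max(height, 1e-9) >= min_width_height_ratio
```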
Train a YOLO segmentation model on the converted dataset. The line-image preprocessing now belongs to the training run, not to the downloader:

```bash
python cli.py train-line-segmentation \
--dataset ./datasets/openpecha_line_segmentation/data.yaml \
--model yolo11n-seg.pt \
--image-preprocess-pipeline gray \
--epochs 100 \
--project ./runs/segment \
--name tibetan-line-seg
```

The OCR Workbench can then switch between "Classical CV line splitting" and "Pretrained YOLO Model".
The training command's --image-preprocess-pipeline defaults to gray, matching the Donut OCR gray preprocessing semantics (min_rgb, binarize=false), while the downloaded dataset itself stays raw.
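A sketch of what min_rgb grayscaling plausibly does, inferred from the name (colored ink stays as dark as black ink; no thresholding, since binarize=false). Not necessarily the project's exact code:

```python
import numpy as np
from PIL import Image

def min_rgb_gray(image: Image.Image) -> Image.Image:
    """Collapse RGB to gray via the per-pixel channel minimum."""
    arr = np.asarray(image.convert("RGB"))
    return Image.fromarray(arr.min(axis=2).astype(np.uint8), mode="L")
```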
Generate the patch dataset (patches/ + meta/patches.parquet) from page images:
```bash
python cli.py gen-patches \
--model ./models/layoutModels/layout_model.pt \
--input-dir ./sbb_images \
--output-dir ./datasets/text_patches \
--no-samples 100 \
--debug-dump 10
```

Optional: generate weak OCR labels:
```bash
python cli.py weak-ocr-label \
--dataset ./datasets/text_patches \
--meta ./datasets/text_patches/meta/patches.parquet \
--out ./datasets/text_patches/meta/weak_ocr.parquet \
--num_workers 8 \
--resume
```

Mine robust cross-page MNN positives:
```bash
python cli.py mine-mnn-pairs \
--dataset ./datasets/text_patches \
--meta ./datasets/text_patches/meta/patches.parquet \
--out ./datasets/text_patches/meta/mnn_pairs.parquet \
--config ./configs/mnn_mining.yaml \
--num-workers 8 \
--debug-dump 20
```
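MNN stands for mutual nearest neighbors: a cross-page pair is kept only when each patch picks the other as its nearest neighbor. A minimal cosine-similarity sketch; the actual miner is driven by configs/mnn_mining.yaml and is more robust:

```python
import numpy as np

def mutual_nearest_neighbors(emb_a: np.ndarray, emb_b: np.ndarray):
    """Return (i, j) pairs where patch i on page A and patch j on page B
    are each other's nearest neighbor under cosine similarity."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T
    nn_ab = sim.argmax(axis=1)   # nearest neighbor in B for each patch in A
    nn_ba = sim.argmax(axis=0)   # nearest neighbor in A for each patch in B
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```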
Train a pretrained ViT/DINOv2 retrieval encoder with mp-InfoNCE using mnn, ocr, or both weak positive sources:

```bash
python cli.py train-text-hierarchy-vit \
--dataset-dir ./datasets/text_patches \
--output-dir ./models/text_hierarchy_vit_mpnce \
--model-name-or-path facebook/dinov2-base \
--train-mode patch_mpnce \
--positive-sources both \
--pairs-parquet ./datasets/text_patches/meta/mnn_pairs.parquet \
--weak-ocr-parquet ./datasets/text_patches/meta/weak_ocr.parquet \
--phase1-epochs 2 \
--phase2-epochs 8 \
--unfreeze-last-n-blocks 2
```
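For reference, one common formulation of a multi-positive InfoNCE ("mp-InfoNCE") loss over a batch, where the positive mask is built from MNN pairs and/or shared weak OCR labels. A PyTorch sketch, not necessarily the project's exact loss:

```python
import torch
import torch.nn.functional as F

def mp_info_nce(embeddings: torch.Tensor, pos_mask: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """pos_mask[i, j] is True when j is a weak positive of anchor i."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))                    # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_terms = log_prob.masked_fill(~pos_mask, 0.0)     # keep only positive-pair terms
    per_anchor = -pos_terms.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return per_anchor[pos_mask.any(dim=1)].mean()        # anchors without positives are skipped
```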
Cross-page FAISS evaluation from exported embeddings (same-page results excluded):

```bash
python cli.py eval-faiss-crosspage \
--embeddings-npy ./models/text_hierarchy_vit_mpnce/faiss_embeddings.npy \
--embeddings-meta ./models/text_hierarchy_vit_mpnce/faiss_embeddings_meta.parquet \
--mnn-pairs ./datasets/text_patches/meta/mnn_pairs.parquet \
--output-dir ./models/text_hierarchy_vit_mpnce/eval_crosspage \
--recall-ks 1,5,10 \
--exclude-same-page
```
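The same-page exclusion just masks candidates from the query's own page before ranking. A sketch of recall@k under that mask, assuming precomputed similarities and known positives (a hypothetical helper, not the CLI's code):

```python
import numpy as np

def crosspage_recall_at_k(sim: np.ndarray, page_ids: np.ndarray, positives: dict, k: int = 10) -> float:
    """sim: (N, N) similarities; positives: anchor index -> set of positive indices."""
    hits = 0
    for i, pos in positives.items():
        scores = sim[i].copy()
        scores[page_ids == page_ids[i]] = -np.inf   # drop the anchor's own page
        top_k = np.argpartition(-scores, k)[:k]     # k highest-scoring candidates
        hits += bool(pos & set(top_k.tolist()))
    return hits / max(len(positives), 1)
```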
FAISS similarity search on a query crop (interactive inspection):

```bash
python cli.py faiss-text-hierarchy-search \
--query-image ./some_query.png \
--dataset-dir ./datasets/text_patches \
--backbone-dir ./models/text_hierarchy_vit_mpnce/text_hierarchy_vit_backbone \
--projection-head-path ./models/text_hierarchy_vit_mpnce/text_hierarchy_projection_head.pt \
--output-dir ./models/text_hierarchy_vit_mpnce/faiss_search \
--top-k 10
```
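Under the hood this is a normalized inner-product search over the exported embeddings. A sketch using FAISS directly; the first stored patch stands in for an embedded query crop (a real query would go through the trained backbone and projection head):

```python
import faiss
import numpy as np

# Build a cosine-similarity index over the exported embeddings.
emb = np.load("./models/text_hierarchy_vit_mpnce/faiss_embeddings.npy").astype("float32")
faiss.normalize_L2(emb)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine after L2 normalization
index.add(emb)

query = emb[:1].copy()                   # stand-in for an embedded query crop
scores, ids = index.search(query, 10)    # top-10 most similar patches
print(ids[0], scores[0])
```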
Export line/word hierarchy crops from page images:

```bash
python cli.py export-text-hierarchy \
--model ./models/layoutModels/layout_model.pt \
--input-dir ./sbb_images \
--output-dir ./datasets/text_hierarchy \
--no_samples 100
```

Train on the legacy hierarchy layout:
```bash
python cli.py train-text-hierarchy-vit \
--dataset-dir ./datasets/text_hierarchy \
--output-dir ./models/text_hierarchy_vit \
--train-mode legacy \
--model-name-or-path facebook/dinov2-base \
--target-height 64 \
--width-buckets 256,384,512,768 \
--max-width 1024
```

Evaluate legacy hierarchy retrieval quality:
```bash
python cli.py eval-text-hierarchy-vit \
--dataset-dir ./datasets/text_hierarchy \
--backbone-dir ./models/text_hierarchy_vit/text_hierarchy_vit_backbone \
--projection-head-path ./models/text_hierarchy_vit/text_hierarchy_projection_head.pt \
--output-dir ./models/text_hierarchy_vit/eval \
--recall-ks 1,5,10
```

To import the YOLO dataset into Label Studio, enable local file serving and convert the YOLO annotations into Label Studio tasks:

```bash
export LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
export LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=$(pwd)/datasets/tibetan-yolo

label-studio-converter import yolo \
-i datasets/tibetan-yolo/train \
-o ls-tasks.json \
--image-ext ".png" \
--image-root-url "/data/local-files/?d=train/images"
```

Start Label Studio:
```bash
label-studio
```

See also:

- Pseudo-labeling and Label Studio import details: README_PSEUDO_LABELING_LABEL_STUDIO.md
- Patch dataset generation: docs/dataset_generation.md
- MNN mining (cross-page positives): docs/mnn_mining.md
- Retrieval training (mp-InfoNCE + MNN/OCR): docs/retrieval_mpnce_training.md
- Weak OCR labeling: docs/weak_ocr.md
- Diffusion + LoRA details: docs/texture_augmentation.md
- Retrieval roadmap: docs/tibetan_ngram_retrieval_plan.md