An automatic multilingual sentence aligner optimized for CPU.
Bertalign-Fast is a lightweight, CPU-optimized version of Bertalign that uses modern static word embeddings (SWE) instead of contextualized word embeddings (CWE).
- 🚀 Fast on CPU - No GPU required
- 💡 Lightweight - Uses static word embeddings
- 🌍 Multilingual - Supports 30+ languages
- 🎯 Accurate - Maintains high alignment quality
- 🖥️ User-friendly GUI - Visual interface for alignment tasks
- 🔄 Multi-version alignment - Aligns multiple language pairs or document versions
- Use Bertalign when:
  - You have a GPU available
  - Maximum accuracy is critical
  - Processing time is not a constraint
  - You are working with complex literary or highly nuanced texts
- Use Bertalign-Fast when:
  - You need fast CPU inference
  - You are processing large volumes of text
  - You are running on resource-constrained systems
  - You are working with non-literary texts (news, technical documents, etc.)
git clone https://github.com/bfsujason/bertalign_fast.git
cd bertalign_fast
# Core only
pip install -r requirements.txt
# If you want the GUI
pip install pyqt5 igraph
# Download SWE model
python download_model.py
# Use mirror site if you cannot visit Hugging Face
python download_model.py --mirror hf-mirror.com

from bertalign_fast import BertalignFast
# Initialize aligner
aligner = BertalignFast()
# Load texts to be aligned
src_file = "data/demo/src/001.txt"
tgt_file = "data/demo/tgt/001.txt"
with open(src_file, "rt", encoding="utf-8") as f:
    src_text = f.read()
with open(tgt_file, "rt", encoding="utf-8") as f:
    tgt_text = f.read()
# Start aligning
aligner.align_sents(src_text, tgt_text)
# Print sentence indices and bead scores
print(aligner.alignment)
# Print aligned sentences
for src_sent, tgt_sent in aligner.bitext:
    print(f"{src_sent}\n{tgt_sent}\n")

Run on Intel Core i7-11800H CPU @ 2.30GHz, 16 GB RAM
| dataset | genre | language pair | # 1-1 alignments (%) |
|---|---|---|---|
| gov | political | zh-ja | 1574 (87.4%) |
| berg | yearbook | de-fr | 678 (74.0%) |
| mac | literary | zh-en | 2628 (59.8%) |
python aligner_eval.py --dataset gov
python aligner_eval.py --dataset berg
python aligner_eval.py --dataset mac

The running time includes both embedding and aligning.
| dataset | precision | recall | F1 | time |
|---|---|---|---|---|
| gov | 0.986 | 0.989 | 0.987 | 13.20s |
| berg | 0.918 | 0.923 | 0.921 | 4.79s |
| mac | 0.867 | 0.895 | 0.881 | 19.49s |
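For readers who want to relate the precision, recall, and F1 columns to each other, here is a minimal sketch of the standard definitions. It assumes strict matching, where a predicted alignment bead counts as correct only if an identical bead exists in the gold standard. The `score_alignment` helper and the bead representation (a pair of index tuples) are hypothetical and are not taken from `aligner_eval.py`, which may score alignments differently.

```python
from typing import Iterable, Tuple

# Assumed bead format: (source sentence indices, target sentence indices).
Bead = Tuple[Tuple[int, ...], Tuple[int, ...]]


def score_alignment(
    predicted: Iterable[Bead], gold: Iterable[Bead]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict bead matching."""
    pred_set = set(predicted)
    gold_set = set(gold)
    correct = len(pred_set & gold_set)  # beads identical in both sets

    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data: the second predicted bead misses a 1-2 alignment in the gold set.
gold_beads = [((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))]
pred_beads = [((0,), (0,)), ((1,), (1,)), ((2,), (3,))]
precision, recall, f1 = score_alignment(pred_beads, gold_beads)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under strict matching, a single wrong bead lowers both precision and recall, which is one reason texts with many non-1-1 alignments tend to be harder to score well on.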
Bertalign-Fast includes a graphical interface for alignment tasks.
The GUI was prototyped and refined using Claude Sonnet 4.5 as a coding assistant.
python aligner_gui.py

- Automatic language detection with manual override
- Align one source against multiple targets simultaneously
- Interactive table editing: split, merge, delete, and add rows
- Undo/Redo support (Ctrl+Z / Ctrl+Y)
- Mark/Unmark rows with a distinct color for review
- Export to TMX, TSV, and JSON formats
- Project save/load
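TSV and TMX files can also be produced from a script, for example after calling `align_sents` through the Python API. The following is a minimal sketch, assuming `pairs` is a list of `(source, target)` sentence pairs such as the ones yielded by `aligner.bitext`. The helpers `write_tsv` and `write_tmx` are hypothetical names, not part of Bertalign-Fast, and the files they write are not guaranteed to match the GUI's own exports.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _one_line(text: str) -> str:
    """Replace tabs and line breaks so each pair stays on one TSV row."""
    return " ".join(text.replace("\t", " ").split("\n")).strip()


def write_tsv(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    """Write one tab-separated source/target pair per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for src, tgt in pairs:
            f.write(f"{_one_line(src)}\t{_one_line(tgt)}\n")


def write_tmx(
    pairs: Iterable[Tuple[str, str]], path: str, src_lang: str, tgt_lang: str
) -> None:
    """Write a basic TMX 1.4 file with one translation unit per pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # placeholder tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    ET.ElementTree(tmx).write(path, encoding="utf-8", xml_declaration=True)
```

A typical call would be `write_tmx(list(aligner.bitext), "out.tmx", "zh", "en")`, with the language codes supplied by the caller.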
| Action | How |
|---|---|
| Edit cell | Double-click the cell |
| Mark rows | Select rows → right-click → Mark/Unmark |
| Split text | While editing → right-click → Split at cursor |
| Move text | While editing → right-click → Move Up/Down |
| Merge rows | Select rows → right-click → Merge |
The data/demo directory contains a Chinese source file and three target files, which can be used to demonstrate two workflows:
- Multi-language alignment: Align a Chinese source against both an English and a Polish human translation (zh-en-pl), producing a trilingual parallel corpus in one pass.
- Multi-version alignment: Align a Chinese source against a human and a ChatGPT English translation side by side, useful for comparing translation quality across different versions of the same text.
To try it: load the source file, click "Add Target File" to add multiple target files, then click "Start Alignment."
If you use Bertalign-Fast in your research, please cite:
@software{Bertalign-Fast,
author = {Lei Liu},
title = {Bertalign-Fast: An Accessible Multilingual Sentence Aligner with CPU Optimization and Interactive Proofreading},
year = {2026},
url = {https://github.com/bfsujason/bertalign_fast}
}

This project is licensed under the GPL-3.0 License.
Questions, bug reports, and feature requests are welcome. Please open an issue on GitHub.
Feedback on the following is especially appreciated:
- Language pairs: The evaluation only covers zh-ja, de-fr, and zh-en. If you test other pairs, we'd love to hear your results.
- Operating systems: The GUI has been tested on Windows. Reports from macOS and Linux users would be very helpful.

