This module implements a complete Neural Machine Translation (NMT) system from the ground up using PyTorch. It features a custom Whitespace Tokenizer and an enhanced Transformer architecture with SwiGLU activation, serving as a robust baseline for analyzing tokenization strategies.
Here is the breakdown of the core components in this implementation:
The central command center for the entire system.
- Responsibilities:
- Parses command-line arguments (dataset paths, model dimensions, hyperparameters).
- Initializes the Tokenizer, Model architecture, and Optimizer.
- Orchestrates the execution pipeline: Training, Validation, or Test/Inference.
- Modes: Supports flexible execution modes including `train`, `validate`, and `inference` (with Beam Search integration).
Implements a custom Whitespace Tokenizer specifically designed for analyzing linguistic impacts on MT.
- Features:
- Tokenization based on whitespace delimiters (preserving syllable boundaries for Vietnamese).
- Vocabulary construction and management.
- Sequence encoding/decoding with special token handling (`<pad>`, `<bos>`, `<eos>`).
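A minimal sketch of how such a whitespace tokenizer could be structured (class and method names here are illustrative, not the module's actual API):

```python
class WhitespaceTokenizer:
    """Illustrative whitespace tokenizer sketch; names are hypothetical."""

    def __init__(self):
        self.specials = ["<pad>", "<bos>", "<eos>", "<unk>"]
        self.token_to_id = {tok: i for i, tok in enumerate(self.specials)}
        self.id_to_token = list(self.specials)

    def build_vocab(self, sentences):
        # Splitting on whitespace keeps Vietnamese syllable boundaries intact.
        for sent in sentences:
            for tok in sent.split():
                if tok not in self.token_to_id:
                    self.token_to_id[tok] = len(self.id_to_token)
                    self.id_to_token.append(tok)

    def encode(self, sentence):
        unk = self.token_to_id["<unk>"]
        ids = [self.token_to_id.get(t, unk) for t in sentence.split()]
        return [self.token_to_id["<bos>"]] + ids + [self.token_to_id["<eos>"]]

    def decode(self, ids):
        toks = [self.id_to_token[i] for i in ids]
        return " ".join(t for t in toks if t not in self.specials)
```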
Contains the core Transformer logic built from scratch.
- Key Components:
- Encoder-Decoder: Full attention-based architecture.
- SwiGLU FFN: Replaces standard ReLU Feed-Forward Networks for enhanced expressiveness.
- Multi-head Attention: Standard parallel attention mechanism.
- Embeddings & Positional Encoding: Handles input representation and sequence order.
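The SwiGLU FFN computes `W_down(SiLU(W_gate · x) ⊙ W_up · x)` in place of the classic `W2(ReLU(W1 · x))`. A minimal PyTorch sketch (layer names are illustrative, not the module's actual attribute names):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: W_down(SiLU(W_gate x) * W_up x)."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        # The SiLU-activated gate modulates the up projection elementwise,
        # then the result is projected back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```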
Implements algorithms to generate translations from the trained model.
- Strategies:
- Greedy Decoding: Selects the highest probability token at each step.
- Beam Search: A more sophisticated approach supporting:
  - Customizable `beam_size`.
  - Length penalty optimization.
  - Early stopping upon generating `<eos>`.
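The strategy above can be sketched in a model-agnostic way. In this sketch, `next_log_probs` stands in for a model call, and the GNMT-style length penalty formula is a common choice, not necessarily this module's exact one:

```python
import math

def beam_search(next_log_probs, bos, eos, beam_size=4, max_len=20, alpha=0.6):
    """Generic beam search sketch; next_log_probs(prefix) -> {token: log_prob}."""
    beams = [([bos], 0.0)]                 # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:             # early stopping for finished beams
                finished.append((seq, score))
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]     # keep the best beam_size expansions
    finished.extend(beams)  # include leftovers; duplicates are harmless for max()
    penalty = lambda seq: ((5 + len(seq)) / 6) ** alpha  # GNMT length penalty
    return max(finished, key=lambda c: c[1] / penalty(c[0]))[0]
```

Greedy decoding is the `beam_size=1` special case of the same loop.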
Handles the preprocessing and loading of bilingual datasets (Source-Target).
- Functionality:
- Reads and normalizes train/valid/test files.
- Efficient batching logic with dynamic padding.
- Generates Attention Masks to ignore padding tokens during training.
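One common way to build such a padding mask in PyTorch (the `(batch, 1, 1, seq_len)` shape is the usual broadcasting convention for multi-head attention; the module's actual shapes may differ):

```python
import torch

def make_padding_mask(token_ids, pad_id=0):
    # True where a real token sits, False at padding positions.
    # Shape (batch, 1, 1, seq_len) broadcasts over heads and query positions.
    return (token_ids != pad_id).unsqueeze(1).unsqueeze(2)

batch = torch.tensor([[5, 7, 2, 0, 0],
                      [3, 2, 0, 0, 0]])
mask = make_padding_mask(batch)  # shape (2, 1, 1, 5)
```

Attention scores at `False` positions are typically set to a large negative value before the softmax, so padding tokens receive (near-)zero attention weight.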
Manages the optimization process.
- Workflow:
- Forward pass -> Loss computation -> Backpropagation.
- Gradient Clipping: Prevents exploding gradients.
- Early Stopping: Monitors validation loss to prevent overfitting.
- Checkpointing: Automatically saves the best model state based on validation metrics.
Tools for assessing model performance.
- Metrics: Primarily uses BLEU score (expandable to METEOR/ROUGE).
- Reporting:
- Compares model Predictions vs. References.
- Exports detailed results to CSV for qualitative analysis.
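For reference, a self-contained corpus-level BLEU (up to 4-grams, with brevity penalty, single reference per hypothesis); production code would typically rely on a library such as sacreBLEU instead:

```python
import math
from collections import Counter

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU for whitespace-tokenized text, one reference per hypothesis."""
    clipped = [0] * max_n      # clipped n-gram matches
    totals = [0] * max_n       # candidate n-gram counts
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_grams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
            r_grams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
            clipped[n - 1] += sum(min(c, r_grams[g]) for g, c in h_grams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(totals) == 0 or min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))  # brevity penalty
    return brevity * math.exp(log_prec)
```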
Ensure all dependencies are installed via pip:

```bash
pip install -r requirements.txt
```

Then run the system:

```bash
python3 main.py
```

This module contains the implementation for our Qwen-2.5-3B submission, featuring LoRA fine-tuning and a specialized Refine Pipeline.
```
.
├── input/                              # Data Storage
│   ├── train.en.txt                    # Raw source files
│   ├── train.vi.txt                    # Raw source files
│   ├── public_test.en.txt              # Raw test files
│   ├── public_test.vi.txt              # Raw test files
│   ├── clean_train.en.txt              # Generated by clean_data.py
│   ├── clean_train.vi.txt              # Generated by clean_data.py
│   ├── simple_medical_glossory.json    # Medical glossary resource
│   ├── vi_abbre.json                   # Vietnamese abbreviation mappings
│   ├── bidirectional_train_data.jsonl  # Generated by make_data_(gloss_and_vi_abbre).py
│   └── final_ultimate_train.jsonl      # Generated by make_data_final.py
│
├── output/                             # Artifacts & Results
│   ├── qwen_mt_3B_finetuned/           # LoRA Checkpoints (from train.py)
│   ├── merged_qwen_3b/                 # Merged Model (from merge_model.py)
│   ├── pipeline_result_en2vi.csv       # Final Inference Results (from pipeline_full.py)
│   └── pipeline_result_vi2en.csv       # Final Inference Results (from pipeline_full.py)
│
├── clean_data.py                       # Data cleaning utilities
├── make_data_(gloss_and_vi_abbre).py   # Glossary and abbreviation data preparation
├── make_data_final.py                  # Final data formatting & preparation script
├── train.py                            # QLoRA training script
├── merge_model.py                      # Adapter merging script
├── pipeline_full.py                    # End-to-end inference & evaluation
├── demo.py                             # Interactive demo (using Gradio)
└── requirements.txt                    # Python dependencies
```
- `input/`: Initially contains the four raw datasets (train and public_test pairs). During execution, the data preparation scripts generate cleaned and formatted JSON/JSONL files here for training.
- `output/`: The destination for all training artifacts:
  - `qwen_mt_3B_finetuned/`: Stores the LoRA adapter checkpoints saved during `train.py`.
  - `merged_qwen_3b/`: Stores the standalone model after merging the base model with LoRA adapters via `merge_model.py`.
  - `*.csv`: The final evaluation metrics and translation outputs generated by `pipeline_full.py`.
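For orientation, a bidirectional training record in the generated JSONL files might look like the following. The field names (`instruction`, `input`, `output`) are a hypothetical instruction-tuning layout, not necessarily the exact schema the `make_data_*` scripts produce:

```python
import json

# Hypothetical record layout for the bidirectional JSONL data;
# the actual field names used by the make_data_* scripts may differ.
pair = {"en": "The patient has a fever.", "vi": "Bệnh nhân bị sốt."}
records = [
    {"instruction": "Translate English to Vietnamese.",
     "input": pair["en"], "output": pair["vi"]},
    {"instruction": "Translate Vietnamese to English.",
     "input": pair["vi"], "output": pair["en"]},
]
# Each line of the .jsonl file is one such JSON object.
lines = [json.dumps(r, ensure_ascii=False) for r in records]
```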
Follow these steps to reproduce the full pipeline:

```bash
pip install -r requirements.txt
```

Clean and format the raw text files into training-ready datasets:
```bash
python3 clean_data.py
python3 "make_data_(gloss_and_vi_abbre).py"
python3 make_data_final.py
```

(The parentheses in the second filename must be quoted, as they are shell metacharacters.)

Fine-tune the Qwen-3B model using the processed data:
```bash
python3 train.py
```

Outputs: LoRA adapters saved in `output/qwen_mt_3B_finetuned/`.
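For orientation, a typical PEFT LoRA setup for a Qwen-style causal LM looks roughly like this. The rank, alpha, target modules, and model id below are common defaults and assumptions, not necessarily the values `train.py` actually uses:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model id is an assumption; train.py may load a different checkpoint.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

lora_cfg = LoraConfig(
    r=16,                       # adapter rank (illustrative value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)  # only the adapter weights are trainable
```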
Merge the trained LoRA adapters with the base model for faster inference:
```bash
python3 merge_model.py
```

Outputs: Full model saved in `output/merged_qwen_3b/`.
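Merging likely follows the standard PEFT pattern of loading the adapters onto the base model and folding them into the weights. A sketch (the base model id is an assumption; the adapter and output paths follow the repository layout above):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-3B-Instruct"  # assumed base checkpoint
base = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the trained LoRA adapters, then fold them into the base weights.
merged = PeftModel.from_pretrained(base, "output/qwen_mt_3B_finetuned").merge_and_unload()

# Save a standalone model (plus tokenizer) for faster inference.
merged.save_pretrained("output/merged_qwen_3b")
AutoTokenizer.from_pretrained(base_id).save_pretrained("output/merged_qwen_3b")
```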
Run the full Refine Pipeline to generate translations and calculate scores:
```bash
python3 pipeline_full.py
```

Outputs: `pipeline_result_en2vi.csv` and `pipeline_result_vi2en.csv` in the `output/` directory.
To launch a web interface (using Gradio) for testing individual sentences:
```bash
python3 demo.py
```

Or access our hosted demo via this link.