This module implements a complete Neural Machine Translation (NMT) system from the ground up using PyTorch. It features a custom Whitespace Tokenizer and an enhanced Transformer architecture with SwiGLU activation, serving as a robust baseline for analyzing tokenization strategies.
Here is the breakdown of the core components in this implementation:
The central command center for the entire system.
- Responsibilities:
- Parses command-line arguments (dataset paths, model dimensions, hyperparameters).
- Initializes the Tokenizer, Model architecture, and Optimizer.
- Orchestrates the execution pipeline: Training, Validation, or Test/Inference.
- Modes: Supports flexible execution modes including `train`, `validate`, and `inference` (with Beam Search integration).
Implements a custom Whitespace Tokenizer specifically designed for analyzing linguistic impacts on MT.
- Features:
- Tokenization based on whitespace delimiters (preserving syllable boundaries for Vietnamese).
- Vocabulary construction and management.
- Sequence encoding/decoding with special token handling (`<pad>`, `<bos>`, `<eos>`).
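A minimal sketch of how such a whitespace tokenizer could be structured (class and method names here are illustrative, not the module's actual API):

```python
class WhitespaceTokenizer:
    """Illustrative whitespace tokenizer sketch; names are hypothetical."""

    def __init__(self):
        self.specials = ["<pad>", "<bos>", "<eos>", "<unk>"]
        self.token_to_id = {tok: i for i, tok in enumerate(self.specials)}
        self.id_to_token = list(self.specials)

    def build_vocab(self, sentences):
        # Splitting on whitespace keeps Vietnamese syllable boundaries intact.
        for sent in sentences:
            for tok in sent.split():
                if tok not in self.token_to_id:
                    self.token_to_id[tok] = len(self.id_to_token)
                    self.id_to_token.append(tok)

    def encode(self, sentence):
        unk = self.token_to_id["<unk>"]
        ids = [self.token_to_id.get(t, unk) for t in sentence.split()]
        return [self.token_to_id["<bos>"]] + ids + [self.token_to_id["<eos>"]]

    def decode(self, ids):
        toks = [self.id_to_token[i] for i in ids]
        return " ".join(t for t in toks if t not in self.specials)
```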
Contains the core Transformer logic built from scratch.
- Key Components:
- Encoder-Decoder: Full attention-based architecture.
- SwiGLU FFN: Replaces standard ReLU Feed-Forward Networks for enhanced expressiveness.
- Multi-head Attention: Standard parallel attention mechanism.
- Embeddings & Positional Encoding: Handles input representation and sequence order.
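The SwiGLU FFN computes `W_down(SiLU(W_gate · x) ⊙ W_up · x)` in place of the classic `W2(ReLU(W1 · x))`. A minimal PyTorch sketch (layer names are illustrative, not the module's actual attribute names):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: W_down(SiLU(W_gate x) * W_up x)."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value projection
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        # The SiLU-activated gate modulates the up projection elementwise,
        # then the result is projected back to the model dimension.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```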
Implements algorithms to generate translations from the trained model.
- Strategies:
- Greedy Decoding: Selects the highest probability token at each step.
- Beam Search: A more sophisticated approach supporting:
  - Customizable `beam_size`.
  - Length penalty optimization.
  - Early stopping upon generating `<eos>`.
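The strategy above can be sketched in a model-agnostic way. In this sketch, `next_log_probs` stands in for a model call, and the GNMT-style length penalty formula is a common choice, not necessarily this module's exact one:

```python
import math

def beam_search(next_log_probs, bos, eos, beam_size=4, max_len=20, alpha=0.6):
    """Generic beam search sketch; next_log_probs(prefix) -> {token: log_prob}."""
    beams = [([bos], 0.0)]                 # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:             # early stopping for finished beams
                finished.append((seq, score))
                continue
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]     # keep the best beam_size expansions
    finished.extend(beams)  # include leftovers; duplicates are harmless for max()
    penalty = lambda seq: ((5 + len(seq)) / 6) ** alpha  # GNMT length penalty
    return max(finished, key=lambda c: c[1] / penalty(c[0]))[0]
```

Greedy decoding is the `beam_size=1` special case of the same loop.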
Handles the preprocessing and loading of bilingual datasets (Source-Target).
- Functionality:
- Reads and normalizes train/valid/test files.
- Efficient batching logic with dynamic padding.
- Generates Attention Masks to ignore padding tokens during training.
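One common way to build such a padding mask in PyTorch (the `(batch, 1, 1, seq_len)` shape is the usual broadcasting convention for multi-head attention; the module's actual shapes may differ):

```python
import torch

def make_padding_mask(token_ids, pad_id=0):
    # True where a real token sits, False at padding positions.
    # Shape (batch, 1, 1, seq_len) broadcasts over heads and query positions.
    return (token_ids != pad_id).unsqueeze(1).unsqueeze(2)

batch = torch.tensor([[5, 7, 2, 0, 0],
                      [3, 2, 0, 0, 0]])
mask = make_padding_mask(batch)  # shape (2, 1, 1, 5)
```

Attention scores at `False` positions are typically set to a large negative value before the softmax, so padding tokens receive (near-)zero attention weight.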
Manages the optimization process.
- Workflow:
- Forward pass -> Loss computation -> Backpropagation.
- Gradient Clipping: Prevents exploding gradients.
- Early Stopping: Monitors validation loss to prevent overfitting.
- Checkpointing: Automatically saves the best model state based on validation metrics.
Tools for assessing model performance.
- Metrics: Primarily uses BLEU score (expandable to METEOR/ROUGE).
- Reporting:
- Compares model Predictions vs. References.
- Exports detailed results to CSV for qualitative analysis.
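For reference, a self-contained corpus-level BLEU (up to 4-grams, with brevity penalty, single reference per hypothesis); production code would typically rely on a library such as sacreBLEU instead:

```python
import math
from collections import Counter

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU for whitespace-tokenized text, one reference per hypothesis."""
    clipped = [0] * max_n      # clipped n-gram matches
    totals = [0] * max_n       # candidate n-gram counts
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_grams = Counter(tuple(h[i:i + n]) for i in range(len(h) - n + 1))
            r_grams = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
            clipped[n - 1] += sum(min(c, r_grams[g]) for g, c in h_grams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(totals) == 0 or min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))  # brevity penalty
    return brevity * math.exp(log_prec)
```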
Ensure all dependencies are installed via pip:

```bash
pip install -r requirements.txt
```

Then run the system:

```bash
python3 main.py
```

This module contains the implementation for our Qwen-2.5-3B submission, featuring LoRA fine-tuning and a specialized Refine Pipeline.
```
.
├── input/                              # Data Storage
│   ├── train.en.txt                    # Raw source files
│   ├── train.vi.txt                    # Raw source files
│   ├── public_test.en.txt              # Raw test files
│   ├── public_test.vi.txt              # Raw test files
│   ├── clean_train.en.txt              # Generated by clean_data.py
│   ├── clean_train.vi.txt              # Generated by clean_data.py
│   ├── simple_medical_glossory.json    # Medical glossary resource
│   ├── vi_abbre.json                   # Vietnamese abbreviation mappings
│   ├── bidirectional_train_data.jsonl  # Generated by make_data_(gloss_and_vi_abbre).py
│   └── final_ultimate_train.jsonl      # Generated by make_data_final.py
│
├── output/                             # Artifacts & Results
│   ├── qwen_mt_3B_finetuned/           # LoRA Checkpoints (from train.py)
│   ├── merged_qwen_3b/                 # Merged Model (from merge_model.py)
│   ├── pipeline_result_en2vi.csv       # Final Inference Results (from pipeline_full.py)
│   └── pipeline_result_vi2en.csv       # Final Inference Results (from pipeline_full.py)
│
├── clean_data.py                       # Data cleaning utilities
├── make_data_(gloss_and_vi_abbre).py   # Glossary and abbreviation data preparation
├── make_data_final.py                  # Final data formatting & preparation script
├── train.py                            # QLoRA training script
├── merge_model.py                      # Adapter merging script
├── pipeline_full.py                    # End-to-end inference & evaluation
├── demo.py                             # Interactive demo (using Gradio)
└── requirements.txt                    # Python dependencies
```
- `input/`: Initially contains the four raw datasets (train and public_test pairs). During execution, the data preparation scripts generate cleaned and formatted JSON/JSONL files here for training.
- `output/`: The destination for all training artifacts:
  - `qwen_mt_3B_finetuned/`: Stores the LoRA adapter checkpoints saved during `train.py`.
  - `merged_qwen_3b/`: Stores the standalone model after merging the base model with LoRA adapters via `merge_model.py`.
  - `*.csv`: The final evaluation metrics and translation outputs generated by `pipeline_full.py`.
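For orientation, a bidirectional training record in the generated JSONL files might look like the following. The field names (`instruction`, `input`, `output`) are a hypothetical instruction-tuning layout, not necessarily the exact schema the `make_data_*` scripts produce:

```python
import json

# Hypothetical record layout for the bidirectional JSONL data;
# the actual field names used by the make_data_* scripts may differ.
pair = {"en": "The patient has a fever.", "vi": "Bệnh nhân bị sốt."}
records = [
    {"instruction": "Translate English to Vietnamese.",
     "input": pair["en"], "output": pair["vi"]},
    {"instruction": "Translate Vietnamese to English.",
     "input": pair["vi"], "output": pair["en"]},
]
# Each line of the .jsonl file is one such JSON object.
lines = [json.dumps(r, ensure_ascii=False) for r in records]
```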
Follow these steps to reproduce the full pipeline:

```bash
pip install -r requirements.txt
```

Clean and format the raw text files into training-ready datasets:
```bash
python3 clean_data.py
python3 "make_data_(gloss_and_vi_abbre).py"
python3 make_data_final.py
```

(The parentheses in the second filename must be quoted, as they are shell metacharacters.)

Fine-tune the Qwen-3B model using the processed data:
```bash
python3 train.py
```

Outputs: LoRA adapters saved in `output/qwen_mt_3B_finetuned/`.
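For orientation, a typical PEFT LoRA setup for a Qwen-style causal LM looks roughly like this. The rank, alpha, target modules, and model id below are common defaults and assumptions, not necessarily the values `train.py` actually uses:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model id is an assumption; train.py may load a different checkpoint.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

lora_cfg = LoraConfig(
    r=16,                       # adapter rank (illustrative value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)  # only the adapter weights are trainable
```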
Merge the trained LoRA adapters with the base model for faster inference:
```bash
python3 merge_model.py
```

Outputs: Full model saved in `output/merged_qwen_3b/`.
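Merging likely follows the standard PEFT pattern of loading the adapters onto the base model and folding them into the weights. A sketch (the base model id is an assumption; the adapter and output paths follow the repository layout above):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-3B-Instruct"  # assumed base checkpoint
base = AutoModelForCausalLM.from_pretrained(base_id)

# Attach the trained LoRA adapters, then fold them into the base weights.
merged = PeftModel.from_pretrained(base, "output/qwen_mt_3B_finetuned").merge_and_unload()

# Save a standalone model (plus tokenizer) for faster inference.
merged.save_pretrained("output/merged_qwen_3b")
AutoTokenizer.from_pretrained(base_id).save_pretrained("output/merged_qwen_3b")
```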
Run the full Refine Pipeline to generate translations and calculate scores:
```bash
python3 pipeline_full.py
```

Outputs: `pipeline_result_en2vi.csv` and `pipeline_result_vi2en.csv` in the `output/` directory.
To launch a web interface (using Gradio) for testing individual sentences:
```bash
python3 demo.py
```

Or access our hosted demo via this link.