Dainn98/MachineTranslation

Part 1: Machine Translation Transformer from Scratch

This module implements a complete Neural Machine Translation (NMT) system from the ground up using PyTorch. It features a custom Whitespace Tokenizer and an enhanced Transformer architecture with SwiGLU activation, serving as a robust baseline for analyzing tokenization strategies.

Project Structure & Components

Here is the breakdown of the core components in this implementation:

1. main.py (Entry Point)

The central command center for the entire system.

  • Responsibilities:
    • Parses command-line arguments (dataset paths, model dimensions, hyperparameters).
    • Initializes the Tokenizer, Model architecture, and Optimizer.
    • Orchestrates the execution pipeline: Training, Validation, or Test/Inference.
  • Modes: Supports flexible execution modes including train, validate, and inference (with Beam Search integration).
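The argument-and-mode wiring described above can be sketched with `argparse`. The flag names here (`--mode`, `--d-model`, `--beam-size`) are illustrative assumptions, not the actual names defined in `main.py`:

```python
import argparse

def build_parser():
    # Flag names are illustrative; see main.py for the real argument set.
    parser = argparse.ArgumentParser(description="NMT Transformer from scratch")
    parser.add_argument("--mode", choices=["train", "validate", "inference"],
                        default="train", help="execution mode")
    parser.add_argument("--d-model", type=int, default=512, help="model dimension")
    parser.add_argument("--beam-size", type=int, default=5,
                        help="beam width used in inference mode")
    return parser
```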

2. tokenizer/ (Text Processing)

Implements a custom Whitespace Tokenizer, designed specifically to analyze how tokenization strategies affect MT quality.

  • Features:
    • Tokenization based on whitespace delimiters (preserving syllable boundaries for Vietnamese).
    • Vocabulary construction and management.
    • Sequence encoding/decoding with special token handling (<pad>, <bos>, <eos>).
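A minimal sketch of what such a tokenizer can look like. The `<unk>` token and the method names are assumptions for illustration; the real implementation lives in `tokenizer/`:

```python
class WhitespaceTokenizer:
    """Sketch of a whitespace tokenizer with special-token handling."""

    def __init__(self):
        self.specials = ["<pad>", "<bos>", "<eos>", "<unk>"]
        self.token_to_id = {tok: i for i, tok in enumerate(self.specials)}
        self.id_to_token = list(self.specials)

    def build_vocab(self, corpus):
        # Splitting on whitespace keeps Vietnamese syllable boundaries intact.
        for line in corpus:
            for tok in line.split():
                if tok not in self.token_to_id:
                    self.token_to_id[tok] = len(self.id_to_token)
                    self.id_to_token.append(tok)

    def encode(self, text):
        unk = self.token_to_id["<unk>"]
        ids = [self.token_to_id.get(t, unk) for t in text.split()]
        return [self.token_to_id["<bos>"]] + ids + [self.token_to_id["<eos>"]]

    def decode(self, ids):
        return " ".join(
            self.id_to_token[i] for i in ids
            if self.id_to_token[i] not in ("<pad>", "<bos>", "<eos>")
        )
```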

3. model/ (Architecture)

Contains the core Transformer logic built from scratch.

  • Key Components:
    • Encoder-Decoder: Full attention-based architecture.
    • SwiGLU FFN: Replaces standard ReLU Feed-Forward Networks for enhanced expressiveness.
    • Multi-head Attention: Standard parallel attention mechanism.
    • Embeddings & Positional Encoding: Handles input representation and sequence order.
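The SwiGLU FFN mentioned above is commonly formulated as `W2(SiLU(W1 x) * W3 x)`. A minimal PyTorch sketch, assuming bias-free projections as in most SwiGLU variants (the exact dimensions and layer names in `model/` may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU feed-forward block: W2(SiLU(W1 x) * W3 x), replacing the ReLU FFN."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```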

4. decoding/ (Generation Strategies)

Implements algorithms to generate translations from the trained model.

  • Strategies:
    • Greedy Decoding: Selects the highest probability token at each step.
    • Beam Search: A more sophisticated approach supporting:
      • Customizable beam_size.
      • Length penalty optimization.
      • Early stopping mechanisms upon generating <eos>.
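A toy beam search illustrating the three features above. Scores are cumulative log-probabilities; the GNMT-style length penalty `((5 + len) / 6)^alpha` is one common choice and an assumption about this codebase, and `next_logprobs` stands in for a real model's next-token distribution:

```python
import math

def beam_search(next_logprobs, beam_size=3, max_len=10, eos=2, alpha=0.6):
    """Toy beam search. `next_logprobs(prefix)` maps a prefix to {token: logprob}."""
    beams = [([], 0.0)]            # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == eos:     # early stopping: hypothesis is complete
                penalty = ((5 + len(seq)) / 6) ** alpha  # GNMT length penalty
                finished.append((seq, score / penalty))
            else:
                beams.append((seq, score))
            if len(beams) == beam_size:
                break
        if not beams:              # every surviving hypothesis has ended
            break
    if finished:
        return max(finished, key=lambda c: c[1])[0]
    return beams[0][0]
```

Greedy decoding is the `beam_size=1` special case of the same loop.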

5. data/ (Data Pipeline)

Handles the preprocessing and loading of bilingual datasets (Source-Target).

  • Functionality:
    • Reads and normalizes train/valid/test files.
    • Efficient batching logic with dynamic padding.
    • Generates Attention Masks to ignore padding tokens during training.
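Dynamic padding and mask generation can be sketched as a collate function (pure Python lists for clarity; the real pipeline presumably returns tensors):

```python
def collate(batch, pad_id=0):
    """Pad a batch of token-id sequences to its longest member and build a mask."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # True marks real tokens; False marks padding that attention should ignore.
    # Assumes pad_id is reserved and never used as a real token id.
    mask = [[tok != pad_id for tok in seq] for seq in padded]
    return padded, mask
```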

6. train/ (Training Loop)

Manages the optimization process.

  • Workflow:
    • Forward pass -> Loss computation -> Backpropagation.
    • Gradient Clipping: Prevents exploding gradients.
    • Early Stopping: Monitors validation loss to prevent overfitting.
    • Checkpointing: Automatically saves the best model state based on validation metrics.
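Early stopping and best-checkpoint tracking can be sketched as a small helper; the class name and API are illustrative, and gradient clipping itself is typically a one-liner such as PyTorch's `torch.nn.utils.clip_grad_norm_`:

```python
class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # improvement: also the point to checkpoint
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```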

7. evaluate/ (Metrics & Analysis)

Tools for assessing model performance.

  • Metrics: Primarily uses BLEU score (expandable to METEOR/ROUGE).
  • Reporting:
    • Compares model Predictions vs. References.
    • Exports detailed results to CSV for qualitative analysis.
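A simplified single-reference BLEU (modified n-gram precision with a brevity penalty) shows what the metric computes; the repository most likely uses a standard library implementation rather than this sketch:

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```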

Quick Start

1. Prerequisites

Ensure all dependencies are installed via pip:

pip install -r requirements.txt

2. Training the Model

python3 main.py

Part 2: VLSP 2025 Shared Task Machine Translation

This module contains the implementation for our Qwen-2.5-3B submission, featuring QLoRA fine-tuning (LoRA adapters on a quantized base model) and a specialized Refine Pipeline.

Project Structure

.
├── input/                              # Data Storage
│   ├── train.en.txt                    # Raw source files
│   ├── train.vi.txt                    # Raw source files
│   ├── public_test.en.txt              # Raw test files
│   ├── public_test.vi.txt              # Raw test files
│   ├── clean_train.en.txt              # Generated by clean_data.py
│   ├── clean_train.vi.txt              # Generated by clean_data.py
│   ├── simple_medical_glossory.json    # Medical term glossary (provided)
│   ├── vi_abbre.json                   # Vietnamese abbreviations (provided)
│   ├── bidirectional_train_data.jsonl  # Generated by make_data_(gloss_and_vi_abbre).py
│   └── final_ultimate_train.jsonl      # Generated by make_data_final.py
│
├── output/                             # Artifacts & Results
│   ├── qwen_mt_3B_finetuned/           # LoRA Checkpoints (from train.py)
│   ├── merged_qwen_3b/                 # Merged Model (from merge_model.py)
│   ├── pipeline_result_en2vi.csv       # Final Inference Results (from pipeline_full.py)
│   └── pipeline_result_vi2en.csv       # Final Inference Results (from pipeline_full.py)
│
├── clean_data.py                       # Data cleaning utilities
├── make_data_(gloss_and_vi_abbre).py   # Glossary and abbreviation data preparation
├── make_data_final.py                  # Final Data formatting & preparation script
├── train.py                            # QLoRA Training script
├── merge_model.py                      # Adapter merging script
├── pipeline_full.py                    # End-to-end Inference & Evaluation
├── demo.py                             # Interactive Demo (Using Gradio)
└── requirements.txt                    # Python dependencies

Directory Details

input/: Initially contains the four raw datasets (train and public_test pairs). During execution, the data preparation scripts generate cleaned and formatted JSON/JSONL files here for training.

output/: The destination for all training artifacts:

  • qwen_mt_3B_finetuned/: Stores the LoRA adapter checkpoints saved during train.py.

  • merged_qwen_3b/: Stores the standalone model after merging the base model with LoRA adapters via merge_model.py.

  • *.csv: The final evaluation metrics and translation outputs generated by pipeline_full.py.

Execution Workflow

Follow these steps to reproduce the full pipeline:

1. Setup Environment

pip install -r requirements.txt

2. Data Preparation

Clean and format the raw text files into training-ready datasets:

python3 clean_data.py
python3 "make_data_(gloss_and_vi_abbre).py"
python3 make_data_final.py
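The bidirectional training records produced in this step might look like the following; the JSONL schema here is a hypothetical instruction-tuning format, and the actual layout is defined by the data preparation scripts:

```python
import json

def make_records(en, vi):
    # Hypothetical schema: one English->Vietnamese and one Vietnamese->English
    # record per sentence pair, so the model learns both directions.
    return [
        {"instruction": "Translate from English to Vietnamese.", "input": en, "output": vi},
        {"instruction": "Translate from Vietnamese to English.", "input": vi, "output": en},
    ]

def write_jsonl(records, path):
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```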

3. Training (QLoRA)

Fine-tune the Qwen-3B model using the processed data:

python3 train.py

Outputs: LoRA adapters saved in output/qwen_mt_3B_finetuned.

4. Model Merging

Merge the trained LoRA adapters with the base model for faster inference:

python3 merge_model.py

Outputs: Full model saved in output/merged_qwen_3b.

5. Inference & Evaluation

Run the full Refine Pipeline to generate translations and calculate scores:

python3 pipeline_full.py

Outputs: pipeline_result_en2vi.csv and pipeline_result_vi2en.csv in the output/ directory.

6. Interactive Demo (Optional)

To launch a web interface (using Gradio) for testing individual sentences:

python3 demo.py

Or access our hosted demo via this link.
