Swahili–English Neural Machine Translation Model

Transformer Architecture (Attention Is All You Need)

This repository contains an end-to-end Swahili–English Neural Machine Translation (NMT) system implemented with the Transformer architecture introduced in the landmark paper "Attention Is All You Need". The project includes dataset preprocessing, custom tokenizer creation, model definition, a training pipeline, and inference utilities. The goal is to build a fully functional sequence-to-sequence translation model without relying on external pretrained weights, while demonstrating a clean and reproducible implementation of the Transformer architecture.

1. Project Overview

The Transformer architecture eliminates recurrence and convolution by relying entirely on multi-head self-attention, enabling efficient parallelism and improved long-range sequence modeling (see the attention sketch after this list). This project applies that architecture to translate Swahili sentences into English using a dataset collected from open parallel corpora.

Key objectives of the project include:

- Build a custom tokenizer for both languages.
- Implement the original Transformer components from scratch.
- Train an encoder–decoder model following the "Attention Is All You Need" specification.
- Evaluate translation quality using BLEU scores.
- Provide an inference script for real-time translation.
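For reference, the operation at the heart of multi-head attention is scaled dot-product attention. The following is a minimal PyTorch sketch of that operation only; the function name, tensor shapes, and masking convention are illustrative assumptions, not this repository's code.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k); mask broadcasts to the score shape."""
    d_k = q.size(-1)
    # similarity scores between queries and keys, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, q_len, k_len)
    if mask is not None:
        # positions where mask == 0 are excluded from attention
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # attention distribution
    return weights @ v                                   # weighted sum of values
```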

3. Tokenizer Construction

A key objective of this project was to build the tokenizer manually rather than relying on prebuilt libraries.

Tokenizer design:

- Text normalization
  - Lowercasing
  - Removing non-language symbols
  - Basic punctuation handling
- Subword vocabulary construction
  - Built using Byte Pair Encoding (BPE)
  - Separate vocabularies for Swahili and English
  - Special tokens included for padding, unknown words, and sequence start/end
- Vocabulary size
  - Configurable; default is typically 8k–16k tokens per language
- Encoding and decoding utilities (see the sketch after this list)
  - Convert text to token IDs
  - Convert token IDs back to text
  - Handle unknown and padding tokens
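To make the encode/decode utilities concrete, here is a minimal sketch of what such a wrapper can look like. The special-token names, the `bpe_split` helper, and the `@@` continuation convention are assumptions for illustration, not this repository's actual interface.

```python
# Hypothetical sketch of encode/decode utilities around a learned BPE vocabulary.
PAD, UNK, SOS, EOS = "<pad>", "<unk>", "<sos>", "<eos>"

class BPETokenizer:
    def __init__(self, vocab):
        # vocab maps subword -> integer id; special tokens occupy reserved ids
        self.token_to_id = vocab
        self.id_to_token = {i: t for t, i in vocab.items()}

    def encode(self, text, bpe_split):
        # normalize, split into subwords with the learned merges, map to ids
        text = text.lower().strip()
        tokens = [SOS] + bpe_split(text) + [EOS]
        unk = self.token_to_id[UNK]
        return [self.token_to_id.get(t, unk) for t in tokens]

    def decode(self, ids):
        # drop special tokens and rejoin subwords (assuming '@@ ' marks continuations)
        tokens = [self.id_to_token[i] for i in ids
                  if self.id_to_token[i] not in (PAD, SOS, EOS)]
        return " ".join(tokens).replace("@@ ", "")
```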

4. Model Architecture

The model strictly follows the Attention Is All You Need architecture (a layer sketch follows this list):

Encoder

- Token embedding + positional encoding
- N identical layers
  - Multi-head self-attention
  - Position-wise feed-forward network
  - Layer normalization and residual connections

Decoder

- Masked multi-head self-attention
- Encoder–decoder cross-attention
- Feed-forward network
- Layer normalization and residual connections
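As an illustration of how one encoder layer composes these pieces, here is a minimal PyTorch sketch. Hyperparameter defaults and module names are assumptions, and it uses `nn.MultiheadAttention` for brevity where the repository implements attention from scratch.

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        # position-wise feed-forward network applied to every position independently
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # sub-layer 1: multi-head self-attention with residual connection + LayerNorm
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + self.drop(attn_out))
        # sub-layer 2: feed-forward network with residual connection + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```

The decoder layer adds a masked self-attention sub-layer (to prevent attending to future tokens) and an encoder–decoder cross-attention sub-layer, each wrapped in the same residual-plus-normalization pattern.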

5. Training Pipeline

Training was performed using PyTorch, with full teacher forcing and label smoothing (a minimal sketch of the loss and learning-rate schedule follows this list).

Training steps:

- Dataset preparation
  - Tokenize Swahili and English sentences
  - Pad sequences to uniform length
  - Create a DataLoader with batching and masking
- Loss function
  - Cross-entropy with label smoothing
  - Padding tokens excluded from the loss
- Optimizer
  - Adam with the Transformer learning rate schedule
  - Warm-up steps implemented as in the original paper
- Checkpointing
  - Saves model state, optimizer state, and tokenizers
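The sketch below shows one common way to wire up these pieces in PyTorch: cross-entropy with label smoothing that ignores padding, and the warm-up schedule from the paper implemented as an `LambdaLR`. The padding id, smoothing value, `d_model`, warm-up steps, and the stand-in model are illustrative assumptions, not values read from this repository.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed id of the padding token

# cross-entropy with label smoothing; padding positions do not contribute to the loss
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=PAD_ID)

def noam_lr(step, d_model=512, warmup=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = nn.Linear(512, 8000)  # stand-in for the Transformer model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

# inside the training loop:
#   loss = criterion(logits.view(-1, vocab_size), targets.view(-1))
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Setting the base learning rate to 1.0 lets the lambda return the actual learning rate at each step, which keeps the warm-up-then-decay curve identical to the formula in the paper.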
