This project demonstrates a machine translation system that translates Urdu text into English using the Hugging Face `transformers` library.
To get started, install the necessary dependencies:
pip install datasets
pip install transformers
pip install sacrebleu
pip install evaluate
pip install accelerate -U
The dataset used for this project consists of Urdu-English sentence pairs. Structure the dataset file with one pair per line, the two sentences separated by a tab.
- Load the dataset:
  - Read the dataset from a file.
  - Split the dataset into training, validation, and testing sets.
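The loading and splitting step can be sketched with the standard library alone. This is a minimal sketch, not the project's actual loader: the function names, the Urdu-first column order, the 80/10/10 split, and the fixed seed are all assumptions made for illustration.

```python
import random


def parse_pairs(lines):
    """Turn tab-separated lines into (urdu, english) tuples.

    Assumes the Urdu sentence comes first; malformed lines are skipped.
    """
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and all(part.strip() for part in parts):
            pairs.append((parts[0].strip(), parts[1].strip()))
    return pairs


def read_pairs(path):
    """Read the dataset file at `path` (a hypothetical location) as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return parse_pairs(handle)


def split_pairs(pairs, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and split the pairs; the fractions and seed are assumed values."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

Seeding the shuffle keeps the three splits reproducible between runs, which makes BLEU scores from different training runs comparable.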
- Convert to Hugging Face `Dataset` format:
  - Convert the data to dictionaries.
  - Create a `DatasetDict` with training, validation, and testing sets.
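One way the conversion could look is shown below. It is a sketch under assumptions: the column names `ur` and `en` and both helper names are illustrative choices, and `splits` is assumed to map split names to lists of (Urdu, English) tuples.

```python
def to_records(pairs):
    """Map (urdu, english) tuples to dictionaries with 'ur' and 'en' keys."""
    return [{"ur": ur, "en": en} for ur, en in pairs]


def build_dataset_dict(splits):
    """Wrap each split in a Dataset and collect them in a DatasetDict."""
    # Imported here so to_records stays usable without the datasets package.
    from datasets import Dataset, DatasetDict

    return DatasetDict(
        {name: Dataset.from_list(to_records(pairs)) for name, pairs in splits.items()}
    )
```

Calling `build_dataset_dict({"train": [...], "validation": [...], "test": [...]})` would then give a `DatasetDict` whose splits each have an `ur` and an `en` column.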
- Tokenization:
  - Tokenize the sentences using the `Helsinki-NLP/opus-mt-ur-en` tokenizer.
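The tokenization step might be written as a batch function suitable for `Dataset.map`. This is a sketch: the `ur`/`en` column names, the 128-token cap, and the helper names are assumptions, while `text_target` is the tokenizer argument that produces the `labels` field for the English side.

```python
MODEL_NAME = "Helsinki-NLP/opus-mt-ur-en"
MAX_LENGTH = 128  # assumed cap on tokens per sentence


def load_tokenizer(name=MODEL_NAME):
    """Download (or load from cache) the tokenizer for the checkpoint."""
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name)


def make_preprocess(tokenizer, max_length=MAX_LENGTH):
    """Build a batch function that tokenizes Urdu inputs and English targets."""

    def preprocess(batch):
        # text_target tokenizes the English side and stores it under "labels".
        return tokenizer(
            batch["ur"],
            text_target=batch["en"],
            max_length=max_length,
            truncation=True,
        )

    return preprocess
```

A typical call would be `dataset.map(make_preprocess(tokenizer), batched=True, remove_columns=["ur", "en"])`, which leaves only the token columns needed for training.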
- Define Model:
  - Load the pre-trained model and tokenizer from Hugging Face.
- Freeze Specific Layers:
  - Freeze the initial layers of the encoder and decoder to focus training on the remaining layers.
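Loading the model and freezing its early layers could be sketched as follows. The number of layers to freeze is not stated here, so the defaults of 4 are purely an assumption, and the helper names are illustrative. The sketch relies on the model's `get_encoder()` and `get_decoder()` methods, whose returned stacks expose their blocks as `.layers` for Marian checkpoints.

```python
def load_model(name="Helsinki-NLP/opus-mt-ur-en"):
    """Load the pre-trained translation model."""
    from transformers import AutoModelForSeq2SeqLM

    return AutoModelForSeq2SeqLM.from_pretrained(name)


def freeze_first_layers(layers, num_frozen):
    """Disable gradients for the first `num_frozen` modules in `layers`."""
    for layer in list(layers)[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False


def freeze_initial_layers(model, num_encoder=4, num_decoder=4):
    """Freeze the leading encoder and decoder blocks (counts are assumptions)."""
    freeze_first_layers(model.get_encoder().layers, num_encoder)
    freeze_first_layers(model.get_decoder().layers, num_decoder)
```

Parameters with `requires_grad = False` receive no gradient updates, so only the unfrozen layers change during fine-tuning.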
- Training Arguments:
  - Set training arguments such as learning rate, batch size, evaluation steps, etc.
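A possible set of training arguments is sketched below. Every value (learning rate, batch size, epochs, evaluation interval, output directory) is a placeholder assumption rather than the project's actual configuration. The argument that enables step-based evaluation is named `evaluation_strategy` in older `transformers` releases and `eval_strategy` in newer ones, so the sketch picks whichever the installed version defines.

```python
import dataclasses


def training_kwargs(output_dir="opus-mt-ur-en-finetuned"):
    """Return assumed hyperparameters as plain keyword arguments."""
    return {
        "output_dir": output_dir,           # hypothetical directory name
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 3,
        "weight_decay": 0.01,
        "eval_steps": 500,                  # evaluate every 500 optimizer steps
        "predict_with_generate": True,      # generate text so BLEU can be computed
    }


def build_training_args(**overrides):
    """Create Seq2SeqTrainingArguments from the defaults plus any overrides."""
    from transformers import Seq2SeqTrainingArguments

    kwargs = {**training_kwargs(), **overrides}
    field_names = {f.name for f in dataclasses.fields(Seq2SeqTrainingArguments)}
    # The strategy field was renamed between releases; use the one that exists.
    strategy_key = "eval_strategy" if "eval_strategy" in field_names else "evaluation_strategy"
    kwargs.setdefault(strategy_key, "steps")
    return Seq2SeqTrainingArguments(**kwargs)
```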
- Train:
  - Use the `Seq2SeqTrainer` to train the model.
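Wiring the pieces into a `Seq2SeqTrainer` might look like the sketch below. The `build_trainer` helper and its parameter names are hypothetical, and `tokenized` is assumed to be a `DatasetDict` with `train` and `validation` splits that already contain token columns.

```python
def build_trainer(model, tokenizer, args, tokenized, compute_metrics=None):
    """Assemble a Seq2SeqTrainer for fine-tuning (a sketch, not the project's exact setup)."""
    from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer

    # The collator pads inputs and labels to the longest sequence in each batch.
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    return Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        data_collator=collator,
        compute_metrics=compute_metrics,
    )
```

Training then comes down to `trainer.train()`, followed by `trainer.save_model(args.output_dir)` and `tokenizer.save_pretrained(args.output_dir)` so that the model and tokenizer can be reloaded together later.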
- Metric:
  - Use BLEU score for evaluation with the `evaluate` library.
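A `compute_metrics` function for BLEU could be sketched as follows, assuming the `sacrebleu` metric from the `evaluate` library (consistent with the `sacrebleu` dependency installed above). The helper names are illustrative. Label positions set to -100 are ignored by the loss, so they are swapped for the pad token before decoding.

```python
def postprocess_text(predictions, references):
    """Strip whitespace and wrap each reference in a list, as sacrebleu expects."""
    preds = [pred.strip() for pred in predictions]
    refs = [[ref.strip()] for ref in references]
    return preds, refs


def make_compute_metrics(tokenizer):
    """Build a compute_metrics callback that reports corpus-level BLEU."""
    import evaluate
    import numpy as np

    metric = evaluate.load("sacrebleu")

    def compute_metrics(eval_preds):
        preds, labels = eval_preds
        if isinstance(preds, tuple):
            preds = preds[0]
        # Replace the -100 padding marker so the ids can be decoded.
        preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
        result = metric.compute(predictions=decoded_preds, references=decoded_labels)
        return {"bleu": result["score"]}

    return compute_metrics
```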
- Evaluate:
  - Evaluate the model on the test dataset.
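Evaluation on the held-out split can reuse the trainer. In this sketch the helper name and the `test` prefix are assumptions; the `bleu` key only appears if the trainer was built with a `compute_metrics` callback that returns a `bleu` entry.

```python
def evaluate_on_test(trainer, test_dataset, prefix="test"):
    """Run evaluation on the test split and return (all metrics, BLEU or None)."""
    metrics = trainer.evaluate(eval_dataset=test_dataset, metric_key_prefix=prefix)
    # The trainer prefixes every metric name, e.g. "test_loss" and "test_bleu".
    return metrics, metrics.get(f"{prefix}_bleu")
```

Using a separate prefix keeps test metrics distinct from the `eval_` metrics logged on the validation split during training.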
- Load the Model:
  - Load the fine-tuned model and tokenizer.
- Translate Text:
  - Encode the Urdu text and generate the English translation.
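The last two steps, reloading the fine-tuned model and translating a sentence, might be sketched like this. The `model_dir` argument is a hypothetical path to wherever the model and tokenizer were saved, and the 128-token generation limit is an assumed value.

```python
def load_finetuned(model_dir):
    """Load the fine-tuned model and tokenizer from a local directory."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer


def translate(text, model, tokenizer, max_new_tokens=128):
    """Encode Urdu text, generate token ids, and decode them into English."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With a saved checkpoint, `model, tokenizer = load_finetuned("path/to/checkpoint")` followed by `translate("آپ کیسے ہیں؟", model, tokenizer)` would return the generated English string.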
After training, evaluate the model to check its performance on the test dataset. The evaluation results include the BLEU score alongside the other metrics the trainer reports, such as the evaluation loss.