Bangla ASR using Fast Conformer CTC (NVIDIA NeMo). This repository contains a Bangla Automatic Speech Recognition (ASR) model built with NVIDIA's NeMo framework and the Fast Conformer architecture with CTC loss. The model is trained on approximately 1,200 hours of the Mozilla Common Voice Bangla dataset.

BanglaASR

Bangla is the seventh most spoken language in the world, yet resources for automatic speech recognition (ASR) in Bangla are relatively scarce. To address this gap, we developed a robust ASR model using NVIDIA's NeMo framework. This model is designed to accurately transcribe spoken Bangla into text, which can be used in various applications like voice assistants, transcription services, and more.

This repository contains the code and the model. The model is trained on the Mozilla Common Voice Bangla dataset, leveraging the Fast Conformer architecture with a CTC loss function for efficient and accurate speech-to-text conversion in Bangla.

Fast Conformer

The Fast Conformer architecture improves on the standard Conformer by optimizing for both speed and accuracy. It achieves this through a combination of convolutional layers and self-attention mechanisms, making it well-suited for speech recognition tasks where computational efficiency is critical.

(Figure: architecture of the Conformer-CTC encoder.)

The key benefits of the Fast Conformer model:

  • High accuracy: Convolutional layers capture local dependencies, while self-attention layers handle global context, resulting in accurate transcriptions.
  • Efficient computation: The architecture is optimized for both speed and memory efficiency, making it ideal for deployment on real-time ASR systems.
  • CTC Loss: The model is trained using Connectionist Temporal Classification (CTC), a popular loss function for sequence prediction tasks where alignment between input and output sequences is unknown.
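To make the CTC objective concrete, here is a minimal sketch using PyTorch's built-in `CTCLoss` on dummy data. All shapes and the vocabulary size are illustrative assumptions; NeMo wires this up internally in its CTC models.

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 30   # input frames, batch size, vocab size (blank at index 0)
S = 10                # target (label) length per utterance

# Dummy encoder outputs as log-probabilities, shape (T, N, C) as CTCLoss expects
log_probs = torch.randn(T, N, C).log_softmax(dim=2)
# Random target label sequences; labels exclude the blank index
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

# CTC marginalizes over all alignments between the T input frames and S labels
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Because CTC sums over every valid alignment, no frame-level transcription alignment is needed at training time.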

For more details, see the Fast Conformer paper.

Dataset

The model is trained on the Mozilla Common Voice Bangla dataset, which contains approximately 1200 hours of Bangla speech data. This dataset includes diverse speakers, dialects, and recording conditions, making it well-suited for building a robust ASR model.

The Mozilla Common Voice project is a crowd-sourced initiative that collects transcribed speech in multiple languages, including Bangla, to support the development of ASR models. You can download the dataset from Mozilla Common Voice.
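Common Voice ships audio clips plus TSV metadata (columns include `path` and `sentence`), while NeMo's ASR data loaders expect a JSON-lines manifest with `audio_filepath`, `duration`, and `text` fields. A minimal conversion sketch follows; the file names and clips directory are assumptions about your local layout, and a real pipeline would compute true clip durations (e.g. with soundfile) instead of a placeholder:

```python
import csv
import json

def tsv_to_manifest(tsv_path, clips_dir, manifest_path, duration=0.0):
    """Write one JSON object per utterance in NeMo's manifest format."""
    with open(tsv_path, newline="", encoding="utf-8") as fin, \
         open(manifest_path, "w", encoding="utf-8") as fout:
        for row in csv.DictReader(fin, delimiter="\t"):
            entry = {
                "audio_filepath": f"{clips_dir}/{row['path']}",
                "duration": duration,  # placeholder; replace with real length
                "text": row["sentence"],
            }
            fout.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Run it once per split (train/dev/test TSV) and point the training config's `manifest_filepath` entries at the resulting files.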

Installation

  1. Create a conda environment:
conda create --name nemo_asr python=3.11
conda activate nemo_asr
  2. Install dependencies:
sudo apt-get update && sudo apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
python -m pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]"

Training

  • Architecture: Fast Conformer (CTC-based)
  • Framework: NVIDIA NeMo
  • Dataset: Mozilla Common Voice Bangla (1200 hours)
  • Training: 1000 epochs with a batch size of 32
  • Optimizations: Mixed precision training with AMP (Automatic Mixed Precision)
  • Word Error Rate (WER): After 1000 epochs, the model achieved a WER of 4.12%.

To train the model using NeMo, the following command can be used:

python train.py \
  --model-config-path ./configs/fast_conformer_ctc_bpe_bangla.yaml \
  --dataset-path ./datasets/bangla_common_voice \
  --num-gpus 4
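The command above points at `configs/fast_conformer_ctc_bpe_bangla.yaml`, which is not reproduced in this README. For orientation, a NeMo-style config excerpt might look like the following; every path and hyperparameter shown is an illustrative assumption, not the repository's actual configuration:

```yaml
name: "FastConformer-CTC-BPE-Bangla"   # hypothetical values throughout
model:
  sample_rate: 16000
  train_ds:
    manifest_filepath: ./datasets/bangla_common_voice/train_manifest.json
    batch_size: 32
  validation_ds:
    manifest_filepath: ./datasets/bangla_common_voice/dev_manifest.json
  tokenizer:
    dir: ./tokenizer/bangla_bpe
    type: bpe
  optim:
    name: adamw
    lr: 1.0e-3
trainer:
  devices: 4
  max_epochs: 1000
  precision: 16   # AMP mixed-precision training, as noted above
```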

Inference

There are several variants of FastConformer-CTC-BPE; among them, we trained only the Large one.

| Model   | d_model | n_layers | n_heads | Parameters | Training Status |
|---------|---------|----------|---------|------------|-----------------|
| Small   | 176     | 16       | 4       | 14 M       | X               |
| Medium  | 256     | 16       | 4       | 32 M       | X               |
| Large   | 512     | 17       | 8       | 120 M      | ✓ (this repo)   |
| XLarge  | 1024    | 24       | 8       | 616 M      | X               |
| XXLarge | 1024    | 42       | 8       | 1.2 B      | X               |

You can use the pre-trained model for inference with the following code snippet:

import nemo.collections.asr as nemo_asr

# Load the pre-trained ASR model
asr_model = nemo_asr.models.ASRModel.from_pretrained("nahidbrur/stt_bn_fastconformer_ctc_large")

# Transcribe an audio file
transcription = asr_model.transcribe(["./dataset/test/test_bangla.wav"])
print(f"Transcription: {transcription[0]}")
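The WER reported above is the standard word-level edit distance divided by the reference length. A minimal pure-Python sketch of that computation, for illustration only (NeMo ships its own WER metric for evaluation at scale):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 words
```

Feeding the model's transcriptions and the reference texts of a held-out set through such a function (or NeMo's metric) yields the corpus-level WER.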

Citation

If you use this model in your work, please cite:

@misc{BanglaASR,
  title={Fast Conformer Bangla ASR Model},
  author={Md Nurul Islam},
  howpublished={},
  year={2023}
}

Reference

  1. https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html#fast-conformer
