Assignment 2: Multilingual Speech Recognition


By completing this assignment, you will be able to develop an ASR system for any language. You are free to choose between two popular automatic speech recognition (ASR) frameworks: ESPnet and fairseq. The assignment consists of two sub-tasks that evaluate these frameworks on the two recommended languages, Guarani and Quechua. You can find the datasets here. Additional data can be found here.

Sub-Task 1: Individual ASR Models

For Sub-Task 1, students are required to create working ASR models using their chosen framework for Guarani. The primary objective is to build accurate ASR models and report the corresponding error rates. ASR for Quechua is optional.

Sub-Task 2: Multi-Task, Multilingual Joint Training

In Sub-Task 2, students will explore advanced techniques by implementing multi-task, multilingual joint training and by using pre-trained models such as XLSR/Wav2vec2. This sub-task also targets the two recommended languages, Guarani and Quechua, and requires reporting the Word Error Rate (WER) of the trained models. Quechua is optional. Completing both sub-tasks for Quechua earns a 20-point bonus.
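
Both sub-tasks ask you to report error rates, so it helps to be clear about how WER is computed. The sketch below is a minimal corpus-level WER, assuming whitespace tokenization and transcripts that have already been normalized. The function names are illustrative, and the scoring scripts shipped with your framework may differ in details such as normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words
```

Passing lists of characters instead of lists of words to edit_distance gives the character error rate (CER) in the same way.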

DL Frameworks

Choice of Framework: Students are encouraged to select either ESPnet or fairseq as their framework for this assignment.

Experiments with ESPnet

If you choose to work with ESPnet, you have the opportunity to explore various experiments and extend your ASR knowledge. Below are some recommended experiments you can undertake:

Running the Base Experiment

The base code for this experiment is provided through a Google Colab notebook that can be accessed here. Your first task is to run the entire notebook from start to end. If you plan to conduct further experiments outside of Google Colab, we recommend saving a copy to your GitHub account using "File -> Save a copy in GitHub."

Running Another Non-English Recipe

Run an existing ESPnet recipe for Guarani and, optionally, Quechua. A list of reference recipes is available here. Your goal is to modify one of these recipes and run it for the languages mentioned.

Train a Speech Recognition System on a New Dataset

  1. Following Stage 1, implement a bash script (local/ that includes:
    • Download the dataset
    • Split the dataset into train / dev / test
    • Perform text normalization, e.g., removing punctuation, unifying letter case, etc.
    • Prepare the data in Kaldi style
  2. Following Stages 2-4, apply speed perturbation if necessary and dump the audio files.
  3. Prepare the tokenization model as in Stage 5.
  4. Train the language model (LM) following Stages 6-9.
  5. Train the end-to-end speech recognition model (E2E-ASR), following Stages 10-11.
  6. Decode and score the dev and test sets with your system (LM + E2E-ASR), following Stages 12-13.
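
The data preparation in step 1 is usually where most of the work goes. The assignment asks for a bash script, so treat the following Python as a sketch of the logic only: it splits by speaker, normalizes text, and builds the four Kaldi-style files (wav.scp, text, utt2spk, spk2utt). The utterance dictionaries, their keys, and the function names are hypothetical. Keeping the apostrophe during normalization is an assumption made because it can be part of a word in some orthographies; adjust it for your data.

```python
import random
import unicodedata


def normalize_text(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace.

    Assumption: apostrophes are kept because they may belong to a word.
    """
    text = unicodedata.normalize("NFC", text).lower().replace("\u2019", "'")
    chars = [
        " " if unicodedata.category(ch).startswith("P") and ch != "'" else ch
        for ch in text
    ]
    return " ".join("".join(chars).split())


def split_by_speaker(utterances, dev_frac=0.1, test_frac=0.1, seed=0):
    """Assign whole speakers to train/dev/test so no speaker is shared."""
    speakers = sorted({u["speaker"] for u in utterances})
    random.Random(seed).shuffle(speakers)
    n_dev = max(1, int(len(speakers) * dev_frac))
    n_test = max(1, int(len(speakers) * test_frac))
    dev = set(speakers[:n_dev])
    test = set(speakers[n_dev:n_dev + n_test])
    splits = {"train": [], "dev": [], "test": []}
    for u in utterances:
        if u["speaker"] in dev:
            splits["dev"].append(u)
        elif u["speaker"] in test:
            splits["test"].append(u)
        else:
            splits["train"].append(u)
    return splits


def kaldi_files(utterances):
    """Return the contents of wav.scp, text, utt2spk and spk2utt as strings."""
    utts = sorted(utterances, key=lambda u: u["utt_id"])
    by_spk = {}
    for u in utts:
        by_spk.setdefault(u["speaker"], []).append(u["utt_id"])
    return {
        "wav.scp": "".join(f"{u['utt_id']} {u['wav']}\n" for u in utts),
        "text": "".join(
            f"{u['utt_id']} {normalize_text(u['text'])}\n" for u in utts
        ),
        "utt2spk": "".join(f"{u['utt_id']} {u['speaker']}\n" for u in utts),
        "spk2utt": "".join(
            f"{spk} {' '.join(ids)}\n" for spk, ids in sorted(by_spk.items())
        ),
    }
```

Kaldi-style tools expect these files to be sorted consistently, which is why utterance IDs are commonly prefixed with the speaker ID. Splitting by speaker rather than by utterance keeps dev and test speakers unseen during training.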

These experiments will provide you with a comprehensive understanding of ASR and ESPnet's capabilities.

Experiments with fairseq

If you choose fairseq, follow the steps below.

Running the Base Experiment

Start by running the base ASR experiment provided in fairseq. Follow the instructions in the official documentation to run the experiment on your chosen dataset. Carefully review and understand the configuration options, data preprocessing, and model training steps. A reference Git page is available here, and a reference Colab notebook is available here.

Changing the Preprocessing File

Fairseq, by default, includes a preprocessing file tailored for English datasets such as LibriSpeech. However, for this assignment, you will need to modify the preprocessing file to incorporate datasets from CommonVoice for the languages Quechua and Guarani. Follow these steps to make the necessary changes:

  1. Locate the fairseq preprocessing file for data preparation. In the fairseq repository, this file typically sits under a path like examples/speech_to_text/
  2. Open the preprocessing file using a text editor or integrated development environment (IDE).
  3. Identify the sections of the code that pertain to data loading, preprocessing, and tokenization. These sections are responsible for handling input data and transforming it into a format suitable for training ASR models.
  4. Within the preprocessing code, look for configurations related to data paths, data formats, and tokenization. You will need to update these configurations to specify the paths to the Quechua and Guarani datasets from CommonVoice.
  5. Ensure that you adapt the preprocessing steps to handle the specific characteristics of the Quechua and Guarani datasets. This may include text normalization, language-specific tokenization, and any other necessary data transformations.
  6. Test the modified preprocessing code by running it on the Quechua and Guarani datasets. Verify that the data is loaded, preprocessed, and tokenized correctly.
  7. Document any changes you make to the preprocessing file, including dataset paths, modifications to data loading functions, and any additional preprocessing steps. This documentation will be important for reproducibility and reporting in your assignment.

Modifying the preprocessing file allows you to adapt Fairseq to handle non-English datasets and ensures that your ASR system can work effectively with the Quechua and Guarani datasets from CommonVoice. Remember to test and validate your changes thoroughly before proceeding with model training.
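
Before editing the fairseq script itself, it can help to prototype the row mapping on its own. The sketch below assumes that a CommonVoice split file is tab-separated with client_id, path, and sentence columns, and that the target manifest uses the columns id, audio, n_frames, tgt_text, and speaker, as the speech_to_text examples appear to do. Check both against your own copies. The function names, the audio_root argument, and the n_frames lookup are hypothetical, and frame counts are assumed to be computed separately.

```python
import csv
import io

# Assumed manifest layout; verify against the fairseq version you use.
MANIFEST_COLUMNS = ["id", "audio", "n_frames", "tgt_text", "speaker"]


def common_voice_to_manifest(cv_tsv, audio_root, n_frames):
    """Map CommonVoice rows to manifest rows.

    cv_tsv: text content of a CommonVoice split file.
    audio_root: directory holding the clips (hypothetical path).
    n_frames: dict from clip file name to its length in frames;
        clips missing from it are skipped.
    """
    reader = csv.DictReader(
        io.StringIO(cv_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for row in reader:
        clip = row["path"]
        sentence = row["sentence"].strip()
        if not sentence or clip not in n_frames:
            continue  # drop empty transcripts and clips without features
        rows.append({
            "id": clip.rsplit(".", 1)[0],
            "audio": f"{audio_root}/{clip}",
            "n_frames": n_frames[clip],
            "tgt_text": sentence.lower(),  # placeholder normalization
            "speaker": row["client_id"],
        })
    return rows


def manifest_to_tsv(rows):
    """Serialize manifest rows as tab-separated text with a header line."""
    lines = ["\t".join(MANIFEST_COLUMNS)]
    for row in rows:
        lines.append("\t".join(str(row[col]) for col in MANIFEST_COLUMNS))
    return "\n".join(lines) + "\n"
```

Lowercasing stands in for the language-specific normalization described in step 5; replace it with whatever rules you settle on for Quechua and Guarani, and record them in your documentation.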

Customize Model Architecture

One of the strengths of fairseq is its flexibility in customizing model architectures. You can experiment with different neural network architectures, such as LSTM, Transformer, or Conformer, to improve ASR performance. Keep track of the changes you make and document their impact on recognition accuracy.

Augmenting Training Objectives

You may explore combining CTC with a language model (as in the RNN transducer), joint training with multiple languages, and pre-training with additional data.
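
For joint training with multiple languages, one common recipe is to prefix each transcript with a language token and to rebalance how often each language is sampled. The sketch below illustrates both ideas under stated assumptions: the language codes, the tag format, the function names, and the dataset sizes in the usage note are illustrative, not taken from the assignment data.

```python
def tag_and_merge(corpora):
    """Merge per-language corpora, prefixing each transcript with a language tag.

    corpora: dict mapping a language code to a list of (utt_id, transcript).
    Utterance IDs are prefixed with the language code to keep them unique.
    """
    merged = []
    for lang in sorted(corpora):
        for utt_id, text in corpora[lang]:
            merged.append((f"{lang}_{utt_id}", f"<{lang}> {text}"))
    return merged


def sampling_weights(sizes, alpha=0.5):
    """Per-language sampling probabilities proportional to (n_l / N) ** alpha.

    alpha = 1 samples in proportion to data size; smaller values
    upsample the lower-resource language.
    """
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** alpha for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}
```

With hypothetical sizes of 900 and 100 utterances and alpha = 0.5, the smaller language is sampled 25% of the time instead of 10%. If you use language tags, they also need to be registered as special symbols in your tokenizer so they are not split into subwords.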

Advanced Decoding Techniques

Investigate advanced decoding techniques such as beam search, shallow fusion, and subword units. Compare the results of different decoding strategies and analyze their impact on ASR accuracy.
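
To make beam search with shallow fusion concrete, the sketch below scores each candidate token as log p_asr + lm_weight * log p_lm and keeps the best beam_size prefixes at every step. It is a toy illustration, not the decoder of either framework: the step functions, the three-token vocabulary, and the probability tables are all made up for the example.

```python
import math


def beam_search(asr_step, lm_step, vocab, eos, beam_size=2,
                lm_weight=0.3, max_len=10):
    """Beam search with shallow fusion.

    asr_step and lm_step map a prefix (tuple of tokens) to a dict of
    next-token log-probabilities. Returns (best_sequence, score).
    """
    beams = [((), 0.0)]  # (prefix, cumulative fused score)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            asr_lp = asr_step(prefix)
            lm_lp = lm_step(prefix)
            for tok in vocab:
                fused = asr_lp[tok] + lm_weight * lm_lp[tok]
                candidates.append((prefix + (tok,), score + fused))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[1])


def _log_table(probs):
    return {tok: math.log(p) for tok, p in probs.items()}


# Toy models: the ASR model slightly prefers "a", the LM strongly prefers "b".
VOCAB = ["a", "b", "</s>"]
ASR_FIRST = _log_table({"a": 0.5, "b": 0.45, "</s>": 0.05})
ASR_REST = _log_table({"a": 0.05, "b": 0.05, "</s>": 0.9})
LM_FIRST = _log_table({"a": 0.1, "b": 0.8, "</s>": 0.1})
LM_REST = _log_table({"a": 0.1, "b": 0.1, "</s>": 0.8})


def toy_asr(prefix):
    return ASR_FIRST if not prefix else ASR_REST


def toy_lm(prefix):
    return LM_FIRST if not prefix else LM_REST
```

With these toy tables, calling beam_search(toy_asr, toy_lm, VOCAB, "</s>", lm_weight=0.0) returns the sequence the ASR model prefers, while lm_weight=1.0 lets the language model flip the first token. In practice, lm_weight and beam_size are tuned on the dev set.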

These experiments with Fairseq will enable you to delve deeper into ASR research and gain valuable insights into customizing and optimizing ASR systems.


This assignment will be assessed as follows:

  1. Report Submission: Submit a report for Assignment 2 of approximately 2 pages. The report should describe the implemented models, the methodology, and the results obtained in Sub-Task 1 and Sub-Task 2.
  2. Leaderboard Competition: We do not require participation in a leaderboard competition. The focus of this assignment is on individual effort and learning, rather than competition.
  3. Assessment of Effort: As long as you complete both Sub-Task 1 and Sub-Task 2 with reasonable effort, you will receive full credit for this assignment. The emphasis is on your learning and the application of ASR techniques rather than achieving specific performance metrics.

We encourage you to take this assignment as an opportunity to explore, learn, and experiment with ASR frameworks. If you have any questions or need clarifications, please don't hesitate to reach out to the course instructor or teaching assistant.


This assignment provides an opportunity for students to gain hands-on experience in automatic speech recognition using popular frameworks. The flexibility to choose between ESPnet and fairseq allows students to explore different approaches to ASR for Guarani and Quechua. Together, Sub-Task 1 and Sub-Task 2 cover both single-language and multilingual ASR model performance. The primary goal is for students to enhance their understanding of ASR techniques and their practical application.

Good luck with your assignment!