By completing this assignment, you will be able to develop an
automatic speech recognition (ASR) system for any language.
Students are free to choose between two popular ASR frameworks:
espnet and fairseq.
The assignment focuses on two sub-tasks to evaluate the
performance of these frameworks on the two recommended
languages, Guarani and Quechua.
You can find the datasets here. Additional data can be found here.
For Sub-Task 1, students are required to create working ASR
models using their chosen framework for Guarani. The primary
objective is to build accurate ASR models and report the
corresponding error rates. ASR for Quechua is optional.
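As an illustration of how the error rates for Sub-Task 1 can be computed, here is a minimal sketch of corpus-level word and character error rates based on edit distance. The function names `edit_distance` and `error_rate` are hypothetical helpers written for this example, not part of espnet or fairseq; both frameworks ship their own scoring tools, which you should prefer for the numbers you report.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion (reference token missing)
                            curr[j - 1] + 1,       # insertion (extra hypothesis token)
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level WER (unit="word") or CER (unit="char")."""
    if unit == "word":
        split = lambda s: s.split()
    else:
        split = lambda s: list(s.replace(" ", ""))
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total


# One substitution and one deletion against a four-word reference.
print(error_rate(["a b c d"], ["a x c"]))
```

Errors are summed over the whole test set before dividing by the total reference length, so long utterances weigh more than short ones. Averaging per-utterance rates instead would give a different number.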
In Sub-Task 2, students will explore advanced techniques by
implementing multi-task and multilingual joint training, and by
using pre-trained models such as XLSR/Wav2vec2. This sub-task also
focuses on the two recommended languages, Quechua and Guarani,
and requires reporting the Word Error Rate (WER) for the trained
models. Quechua is optional. Completing both sub-tasks for
Quechua earns a 20-point bonus.
Students may choose either espnet or fairseq as their framework
for this assignment.
If you choose to work with ESPnet, you have the opportunity to explore various experiments and extend your ASR knowledge. Below are some recommended experiments you can undertake:
The base code for this experiment is provided as a Google Colab notebook that can be accessed here. Your first task is to run the entire notebook from start to end. If you plan to conduct further experiments outside of Google Colab, we recommend saving a copy to your GitHub account using "File -> Save a copy on Github."
Run an existing ESPnet recipe for Guarani and, optionally, Quechua. The list of reference recipes is available here. Your goal is to adapt a recipe and run it for these languages.
These experiments will provide you with a comprehensive understanding of ASR and ESPnet's capabilities.
If you choose fairseq, please follow the steps below.
Start by running the base ASR experiment provided in Fairseq. Follow the instructions in the official documentation to run the experiment on your chosen dataset. Carefully review the configuration options, data preprocessing, and model training steps. A reference Git page is available here, and a reference Colab notebook is available here.
Fairseq, by default, includes a preprocessing file tailored for English datasets such as LibriSpeech. For this assignment, you will need to modify the preprocessing file to incorporate the CommonVoice datasets for Quechua and Guarani. To make the necessary changes, start from the file
examples/speech_to_text/prep_librispeech_data.py
Modifying the preprocessing file allows you to adapt Fairseq to handle non-English datasets and ensures that your ASR system can work effectively with the Quechua and Guarani datasets from CommonVoice. Remember to test and validate your changes thoroughly before proceeding with model training.
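The kind of change involved can be sketched as follows: read a CommonVoice-style TSV and turn each row into a manifest entry. This is a sketch under stated assumptions, not the actual fairseq code. It assumes the TSV has the columns `client_id`, `path`, and `sentence`, and that the target manifest uses the fields `id`, `audio`, `tgt_text`, and `speaker`. The helpers `normalize` and `commonvoice_to_manifest` are hypothetical names. The frame count per utterance is left out because it requires reading the audio.

```python
import csv
import io
import re


def normalize(text):
    """Lowercase and strip punctuation, keeping letters (including accented
    ones) and apostrophes, which mark the glottal stop in Guarani."""
    text = text.lower().replace("\u2019", "'")
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def commonvoice_to_manifest(tsv_file, clips_dir="clips"):
    """Convert rows of a CommonVoice-style TSV into manifest dictionaries.

    Assumed input columns: client_id, path, sentence.
    Assumed output fields: id, audio, tgt_text, speaker.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        text = normalize(row["sentence"])
        if not text:
            continue  # skip utterances with an empty transcript
        rows.append({
            "id": row["path"].rsplit(".", 1)[0],
            "audio": f"{clips_dir}/{row['path']}",
            "tgt_text": text,
            "speaker": row["client_id"],
        })
    return rows


# Tiny in-memory example instead of a real validated.tsv file.
sample = "client_id\tpath\tsentence\nspk1\tclip_001.mp3\tMba'éichapa, reiko?\n"
for entry in commonvoice_to_manifest(io.StringIO(sample)):
    print(entry)
```

The normalization rule is a design choice you should revisit for each language: stripping the apostrophe, for example, would merge distinct Guarani words.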
One of the strengths of Fairseq is its flexibility in
customizing model architectures. You can experiment with
different neural network architectures, such as LSTM,
Transformer, and Conformer, to improve ASR performance. Keep
track of the changes you make and document their impact on
recognition accuracy.
You may explore CTC combined with LM (RNN transducer), joint
training with multiple languages, and pre-training with
additional data.
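For readers new to CTC, the following minimal sketch shows the standard greedy decoding rule: take the best label per frame, merge consecutive repeats, then drop the blank symbol. The function name `ctc_greedy_decode` and the choice of index 0 for the blank are assumptions made for this example; check which blank index your framework uses.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated labels, then remove blanks.

    frame_ids: the arg-max label index for each acoustic frame.
    """
    output = []
    prev = None
    for label in frame_ids:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            output.append(label)
        prev = label
    return output


# The blank between the two runs of 1 keeps them as separate labels.
print(ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```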
Investigate advanced decoding techniques such as beam search, shallow fusion, and subword units. Compare the results of different decoding strategies and analyze their impact on ASR accuracy.
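To make beam search and shallow fusion concrete, here is a toy sketch. It assumes, for simplicity, that the ASR model's log-probabilities at each step do not depend on the prefix, which is not true of real attention or transducer decoders. The names `beam_search`, `lm_log_prob`, and `lm_weight` are hypothetical and do not correspond to any espnet or fairseq API. Shallow fusion here means adding a weighted language model log-probability to the ASR score of each candidate.

```python
import math


def beam_search(step_log_probs, lm_log_prob=None, lm_weight=0.0, beam_size=2):
    """Toy beam search with optional shallow fusion.

    step_log_probs: list of dicts mapping token -> ASR log-probability.
    lm_log_prob:    optional function (prefix, token) -> LM log-probability.
    Returns the surviving (prefix, score) pairs, best first.
    """
    beams = [((), 0.0)]
    for log_probs in step_log_probs:
        candidates = []
        for prefix, score in beams:
            for token, asr_lp in log_probs.items():
                new_score = score + asr_lp
                if lm_log_prob is not None:
                    # Shallow fusion: interpolate the LM score at decoding time.
                    new_score += lm_weight * lm_log_prob(prefix, token)
                candidates.append((prefix + (token,), new_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the best hypotheses
    return beams


steps = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"c": math.log(0.55), "d": math.log(0.45)},
]


def toy_lm(prefix, token):
    # Hypothetical language model that strongly prefers "b" and "d".
    return math.log(0.9) if token in ("b", "d") else math.log(0.1)


print(beam_search(steps)[0][0])                                    # ASR scores only
print(beam_search(steps, lm_log_prob=toy_lm, lm_weight=1.0)[0][0])  # with fusion
```

In this example the language model changes the top hypothesis, which is the effect you would measure when comparing decoding strategies. The LM weight and beam size are typically tuned on a development set.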
These experiments with Fairseq will enable you to delve deeper into ASR research and gain valuable insights into customizing and optimizing ASR systems.
Your assignment for this project will be assessed as follows:
We encourage you to take this assignment as an opportunity to explore, learn, and experiment with ASR frameworks. If you have any questions or need clarifications, please don't hesitate to reach out to the course instructor or teaching assistant.
This assignment provides an opportunity for students to gain
hands-on experience in automatic speech recognition using
popular frameworks. The flexibility to choose between espnet
and fairseq allows students to explore different
approaches to ASR in the context of the languages Quechua and
Guarani. Sub-Task 1 and Sub-Task 2 offer a comprehensive
evaluation of ASR model performance. The primary goal is for
students to enhance their understanding of ASR techniques and
their practical application.
For any questions or clarifications, please feel free to reach out to the course instructor or teaching assistant. Good luck with your assignment!