CMU 11-737: Multilingual Natural Language Processing

Course Description

CMU 11-737 is an advanced graduate-level course on natural language processing techniques applicable to many languages. Students who take this course should be able to develop linguistically motivated solutions to core and applied NLP tasks for any language. This includes understanding and mitigating the difficulties posed by lack of data in low-resourced languages or language varieties, and the necessity to model particular properties of the language of interest such as complex morphology or syntax. The course will introduce modeling solutions to these issues such as multilingual or cross-lingual methods, linguistically informed NLP models, and methods for effectively bootstrapping systems with limited data or human intervention. The project work will involve building an end-to-end NLP pipeline in a language you don’t know.

Instructor

Lei Li (Office Hours: Tuesday 4-5pm outside GHC 5417; book a slot here or drop in)

Teaching Assistants

TA Mailing list: cs11-737-fa2023-tas@cs.cmu.edu

Time and Location

Tuesday and Thursday, 2-3:20pm, Doherty Hall 1212 (in-person expected)

Prerequisites

We highly recommend that you have previously taken an NLP course (11-411, 11-611, or 11-711). Assignments involve building neural network models, and examples will be provided in PyTorch. If you are not familiar with PyTorch, we suggest you familiarize yourself with it through online tutorials (for example, Deep Learning for NLP with PyTorch) before starting the class.

Class Format

For each class there will be:

Homework Submission & Grading

Please submit your homework on Canvas or the specified platform. Each assignment will be given a grade of A+ (100), A (96), A- (92), B+ (88), B (85), B- (82), or below. Final grades will be determined by the weighted average of discussion participation, assignments, and the project. Cutoffs for final grades will be approximately 97+ A+, 93+ A, 90+ A-, 87+ B+, 83+ B, 80+ B-, etc., although we reserve some flexibility to change these thresholds slightly.
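As a rough illustration of how the weighted average and cutoffs combine, here is a minimal sketch. The cutoffs are the approximate ones quoted above; the component weights and example scores are hypothetical placeholders, since this page does not state the actual weights.

```python
# Approximate final-grade cutoffs quoted on this page (may shift slightly).
CUTOFFS = [(97, "A+"), (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-")]


def weighted_average(scores, weights):
    """Weighted mean of component scores, each on a 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight


def letter_grade(score):
    """Map a numeric score to the first letter whose cutoff it meets."""
    for cutoff, letter in CUTOFFS:
        if score >= cutoff:
            return letter
    return "below B-"


# HYPOTHETICAL weights and scores, for illustration only;
# the actual course weights are not given on this page.
example_weights = {"participation": 0.1, "assignments": 0.5, "project": 0.4}
example_scores = {"participation": 100, "assignments": 92, "project": 88}

final = weighted_average(example_scores, example_weights)
print(round(final, 1), letter_grade(final))
```

Because the thresholds are only approximate, treat the result as an estimate rather than an official grade.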

Discussion Forum

We will use the Ed platform for discussions (sign up here), but coming to office hours is also encouraged. You may also send private messages on the Ed platform.

Syllabus

| # | Date | Topic | Material | Homework |
|---|------|-------|----------|----------|
| 1 | 8/29 | Introduction | Slides | Reading List out |
| 2 | 8/31 | Typology: The Space of Languages | Slides | |
| 3 | 9/5 | Words and Morphology | Slides | |
| 4 | 9/7 | Sequence Labeling | Slides | HW1 out |
| 5 | 9/12 | Machine Translation Overview and Evaluation | Slides | |
| 6 | 9/14 | Neural Machine Translation Models | Slides | |
| 7 | 9/19 | Sequence Decoding | Slides | |
| 8 | 9/21 | Semi-supervised and Unsupervised MT | Slides | |
| 9 | 9/26 | Multilingual NMT | Slides | |
| 10 | 9/28 | Pre-training for NMT | Slides | |
| 11 | 10/3 | Speech Processing | Slides | HW1 Due |
| 12 | 10/5 | End-to-end Speech Recognition | Slides | Project Proposal Due, HW2 out |
| 13 | 10/10 | Text-to-speech, Tacotron 2 Notebook, FastSpeech 2 code | Slides | |
| 14 | 10/12 | Speech Representation Learning | Slides | |
| | 10/17 | Fall Break | | |
| | 10/19 | Fall Break | | |
| 15 | 10/24 | Speech Translation | Slides | |
| 16 | 10/26 | Streaming Speech Translation | Slides | |
| 17 | 10/31 | Guest Lecture: Juan Pino (Meta) | | |
| 18 | 11/2 | Language Contact and Change, Code Switching, Pidgins, Creoles | Slides | HW2 Due |
| | 11/7 | Democracy Day Holiday | | |
| 19 | 11/9 | Morphological Analysis and Inflection (Possible Guest Lecture by Kristine Stenzel) | | Mid-term Report Due |
| 20 | 11/14 | Learned Metrics | Slides | |
| 21 | 11/16 | Guest Lecture: Large Language Models for MT, Colin Cherry (Google) | | HW3 Due |
| 22 | 11/21 | Vocabulary Learning | Slides | |
| | 11/23 | Thanksgiving, no classes | Slides | |
| 23 | 11/28 | Non-Autoregressive Generation Models | | |
| 24 | 11/30 | Guest Lecture (Stephen Mayhew, Duolingo) | | |
| | 12/5 | Poster Presentations | | |
| | 12/7 | Image-Text Modeling for Multilingual NLP | Slides | Final Report Due |