COMP6211J 2025 Fall

Advanced Large-Scale Machine Learning System for Foundation Models

Overview

In recent years, foundation models have fundamentally changed the state of the art in artificial intelligence. As a result, the computation involved in training and serving foundation models has become one of the most important workloads running on modern computer systems. This course examines how to deploy such workloads efficiently from a systems perspective. Specifically, we will: i) explain how a modern machine learning system (e.g., PyTorch) works; ii) analyze the performance bottlenecks of machine learning computation on modern hardware (e.g., Nvidia GPUs); iii) discuss four main parallel strategies in foundation model training (data, pipeline, tensor model, and optimizer parallelism); and iv) examine the real-world deployment of foundation models, including domain-specific adaptations.

Syllabus

Date	Topic
W1 - 09/03, 09/05	Introduction and Logistics [Slides] & Stochastic Gradient Descent [Slides]
W2 - 09/10, 09/12	Auto-Differentiation [Slides] & Nvidia GPU Computation and Communication [Slides]
W3 - 09/17, 09/19	LLM Pretraining [Slides] & Data-, Pipeline- Parallelism [Slides]
W4 - 09/24, 09/26	Tensor Model-, Optimizer- Parallelism [Slides] & LLM Tuning and Utilization [Slides]
W5 - 10/03	Generative Inference Overview [Slides]
W6 - 10/08, 10/10	Algorithm Optimizations for Inference [Slides] & System Optimizations for Inference [Slides]
W7 - 10/15, 10/17	RAG and Domain Specific LLM Agent [Slides] & Course Review [Slides]
W8 - 10/22, 10/24	Presentation Sessions
W9 - 10/29, 10/31	Presentation Sessions
W10 - 11/05, 11/07	Presentation Sessions
W11 - 11/12, 11/14	Presentation Sessions
W12 - 11/19, 11/21	Presentation Sessions
W13 - 11/26, 11/28	Presentation Sessions

Lecture Record.

Grading

Course Report (70%):
- Literature review (50%):
  - Cover the relevant techniques exhaustively. (10%)
  - Understand the relevant techniques correctly. (15%)
  - Organize the techniques using good categorization. (15%)
  - The report is written in professional academic English. (10%)
  - Page limit: 4 pages in the NeurIPS template (excluding references).
- Research plan (20%):
  - The proposed research plan is executable. (10%)
  - The proposed research plan includes novelty and concrete design. (10%)
  - Page limit: 4 pages in the NeurIPS template (excluding references).
In-class Presentation (30%), covering the literature review only:
- Clearly organize the material and present the problem definition, related work, and methodology appropriately. (20%)
- Answer questions from the lecturers and other students appropriately. (5%)
- Submit short feedback for all the other presentation sessions. (5%)
- (Feedback from other students determines 70% of the grade for this part.)
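The weights above can be checked with a few lines of arithmetic. The following is a minimal sketch, not an official grading tool: the dictionary keys and the `final_score` helper are hypothetical names chosen for illustration, and the sketch does not model how peer feedback is blended into the presentation grade.

```python
# Rubric weights in percentage points, copied from the grading breakdown above.
# The keys are hypothetical short labels, not official names.
RUBRIC = {
    "review_coverage": 10,        # cover the relevant techniques exhaustively
    "review_correctness": 15,     # understand the techniques correctly
    "review_categorization": 15,  # organize the techniques with good categorization
    "review_writing": 10,         # professional academic English
    "plan_executable": 10,        # research plan is executable
    "plan_novelty_design": 10,    # novelty and concrete design
    "talk_organization": 20,      # organization and presentation of the material
    "talk_qa": 5,                 # answering questions
    "talk_feedback": 5,           # feedback on other presentation sessions
}


def final_score(fractions):
    """Return a course score out of 100.

    `fractions` maps a rubric label to the fraction of that item earned
    (0.0 to 1.0). Items that are missing count as zero.
    """
    for name, frac in fractions.items():
        if name not in RUBRIC:
            raise KeyError(f"unknown rubric item: {name}")
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"fraction for {name} must be in [0, 1]")
    return sum(weight * fractions.get(name, 0.0) for name, weight in RUBRIC.items())


# Sanity checks on the weights themselves.
report_weight = sum(w for k, w in RUBRIC.items() if not k.startswith("talk_"))
talk_weight = sum(w for k, w in RUBRIC.items() if k.startswith("talk_"))
```

Under these assumptions, `report_weight` is 70 and `talk_weight` is 30, which matches the two top-level headings, and `final_score` with every fraction set to 1.0 gives 100.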

Topics for Literature Review:

Topics and Reference, Topic Assignment, Presentation Schedule.

Relaxed-System-Lab/COMP6211J_Course_HKUST
