Course Overview
The objective of this course is to introduce students to state-of-the-art algorithms in large-scale machine learning and distributed optimization, with a particular focus on the emerging field of federated learning. Topics to be covered include, but are not limited to, the following (an illustrative sketch of one such algorithm, local-update SGD, appears after the list):
- Mini-batch SGD and its convergence analysis
- Momentum and variance reduction methods
- Synchronous and asynchronous SGD
- Local-update SGD
- Decentralized SGD
- Gradient compression/quantization
- Data heterogeneity in federated learning
- Computational heterogeneity in federated learning
- Client selection and partial participation in federated learning
- Differential privacy in federated learning
- Secure aggregation in federated learning
- Robustness to adversaries in federated learning
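To preview the style of algorithms analyzed in the course, here is a minimal illustrative sketch of local-update SGD (the template behind FedAvg) on a synthetic least-squares problem. The client count, data, and hyperparameters below are assumptions made purely for illustration and are not part of the course materials.

```python
import numpy as np

# Illustrative sketch of local-update SGD (FedAvg-style aggregation) on a
# synthetic least-squares problem. All data and hyperparameters are
# assumptions chosen for demonstration only.

rng = np.random.default_rng(0)
num_clients, dim = 4, 5
w_true = rng.normal(size=dim)

# Each client holds its own local dataset.
client_data = []
for _ in range(num_clients):
    X = rng.normal(size=(100, dim))
    y = X @ w_true + 0.1 * rng.normal(size=100)
    client_data.append((X, y))

def local_sgd_steps(w, X, y, lr=0.01, steps=10, batch=10):
    """Run a few mini-batch SGD steps on one client's local loss."""
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w = w - lr * grad
    return w

w_global = np.zeros(dim)
for _ in range(50):  # communication rounds
    # Each client starts from the current global model and takes local steps.
    local_models = [local_sgd_steps(w_global, X, y) for X, y in client_data]
    # The server averages the locally updated models.
    w_global = np.mean(local_models, axis=0)

print("distance to w_true:", np.linalg.norm(w_global - w_true))
```

Setting the number of local steps to 1 recovers fully synchronous mini-batch SGD; larger values reduce the number of communication rounds at the cost of client drift when the local datasets are heterogeneous, a trade-off studied later in the course.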
Prerequisites
- An introductory course in machine learning (18-461/661, 10-601/701, or equivalent) is required.
- Undergraduate-level training or coursework in algorithms, linear algebra, calculus, probability, and statistics is strongly encouraged.
- A background in programming will also be necessary for the problem sets; students are expected to be familiar with Python or to learn it during the course.
Comparison with Related Courses
- 18-660: Optimization: While 18-660 covers the fundamentals of convex and non-convex optimization and stochastic gradient descent, 18-667 will discuss state-of-the-art research papers in federated learning and optimization. 18-667 can be taken after or along with 18-660.
- 18-661: Introduction to Machine Learning: 18-661 covers a breadth of machine learning methods including linear and logistic regression, neural networks, SVMs, decision trees, and online and reinforcement learning. Many of these methods use stochastic gradient descent (SGD) to train the model parameters. In 18-667, we will dive deeper into SGD and, more specifically, its distributed implementations. While 18-661 covers classic and foundational concepts, 18-667 will discuss state-of-the-art research papers in federated learning and optimization. 18-667 can be taken after or along with 18-661.
Textbooks
Students are expected to read the research paper discussed in each lecture and review the lecture slides to prepare for the quizzes and homework assignments. Material covered in the first part of the class also appears in Prof. Joshi's book, Optimization Algorithms for Distributed Machine Learning, available through the CMU Libraries.
Piazza
We will use Piazza for class discussions. We strongly encourage students to post on this forum rather than emailing the course staff directly (this will be more efficient for both students and staff). Students should use Piazza to:
- Ask clarifying questions about the course material.
- Share useful resources with classmates (so long as they do not contain homework solutions).
- Look for students to form study and project groups.
- Answer questions posted by other students to solidify your own understanding of the material.
Tentative Grading Policy
Grades will be based on the following components:
- Homework (40%): There will be 4 equally weighted homeworks, each consisting of a mix of mathematical and implementation questions.
- You are given 3 late days (self-granted 24-hour extensions) that you can use without penalty. At most one late day can be used per assignment. Late-day usage will be tracked automatically via Gradescope.
- Solutions will be graded on both correctness and clarity. If you cannot solve a problem completely, you will get more partial credit by identifying the gaps in your argument than by attempting to cover them up.
- Three Quizzes (45%): Each quiz will be a mix of multiple-choice and descriptive questions, and will cover only the papers discussed during the lectures and recitations preceding that quiz.
- Class Project (15%): Students will form teams of 4 to conduct a detailed literature survey, original research, and/or an implementation study on one of the following project topics. Projects on a topic outside this list are also welcome -- please contact the instructor to discuss your idea. At the end of the semester, each team will submit a 4-page review paper and give a 15-minute project presentation.
- Survey of Stochastic Variance Reduction Methods
- Concept/Data Drift in Federated Learning
- Data Unlearning in Federated Learning
- Convergence Analysis of Differentially Private Distributed Optimization Algorithms
- Federated Reinforcement Learning
- Client Selection in Federated Learning
- Incentivizing Client Participation in Federated Learning
- Federated Multi-armed Bandits and Online Learning
- Model-Parallel Training, Split Federated Learning, and Independent Subnet Training
- Federated Training of Heterogeneously Sized Models
- One-shot Federated Learning and Model Fusion
- Efficient Distributed Inference on Large Models
- Parameter-efficient Federated Finetuning of LLMs
- Hyperparameter optimization in Distributed and Federated ML
Collaboration Policy
Group studying and collaborating on problem sets are encouraged, as working together is a great way to understand new material. Students are free to discuss the homework problems with anyone under the following conditions:
- Students must write their own solutions and understand the solutions that they wrote down. AI tools like ChatGPT are considered collaborators, and their use must be acknowledged.
- Students must list the names of their collaborators (i.e., anyone with whom the assignment was discussed).
- Students may not use old solution sets from other classes under any circumstances, unless the instructor grants special permission.
Schedule (subject to change)
Date | Lecture/Recitation | Readings | Announcements |
---|---|---|---|
08/26 | Intro and Logistics [Slides] | | |
08/28 | SGD and its Variants in Machine Learning [Slides] | | |
08/30 | Math Review | | HW1 release |
09/02 | Labor Day; No classes | | |
09/04 | SGD Convergence Analysis [Slides] [Annotated] | | |
09/06 | PyTorch Tutorial | | |
09/09 | Variance-reduced SGD, Distributed Synchronous SGD [Slides] [Annotated] | | |
09/11 | Asynchronous SGD, Hogwild [Slides] [Annotated] | | |
09/13 | Guest Lecture | | |
09/16 | Local-update SGD [Slides] [Annotated] | | |
09/18 | Adacomm, Elastic Averaging, Overlap SGD [Slides] [Annotated] | | |
09/20 | Concept Review and Practice | | HW1 due; HW2 release |
09/23 | Quiz 1 | | |
09/25 | Quantized and Sparsified Distributed SGD [Slides] [Annotated] | | |
09/27 | Guest Lecture: Leveraging Correlation in Sparsified SGD | | |
09/30 | Decentralized SGD [Slides] [Annotated] | | |
10/02 | Federated Learning Intro [Slides] [Annotated] | | |
10/04 | Guest Lecture on Decentralized SGD | | |
10/07 | Data Heterogeneity in FL [Slides] [Annotated] | | |
10/09 | Computational Heterogeneity in FL [Slides] [Annotated] | | |
10/11 | Guest Lecture: FedExp | | HW2 due; HW3 release |
10/14 | Fall Break | | |
10/16 | Fall Break | | |
10/18 | Fall Break | | |
10/21 | Client Selection and Partial Participation [Slides] [Annotated] | | |
10/23 | Personalized Federated Learning [Slides] [Annotated] | | |
10/25 | Concept Review and Practice | | |
10/28 | Quiz 2 | | |
10/30 | Multi-task Learning [Slides] [Annotated] | | |
11/01 | Guest Lecture | | HW3 due |
11/04 | Federated Min-max Optimization [Slides] [Annotated] | | |
11/06 | Fairness and Participation Incentives [Slides] [Annotated] | | |
11/08 | Guest Lecture | | Project titles and teams due; HW4 release |
11/11 | Differential Privacy in Distributed Optimization [Slides] [Annotated] | | |
11/13 | Secure Aggregation in Distributed Learning [Slides] | | |
11/15 | Guest Lecture | | |
11/18 | Robustness to Adversaries | | |
11/20 | Federated Learning in the LLM Era | | |
11/22 | Concept Review and Practice | | |
11/25 | Quiz 3 | | |
11/27 | Thanksgiving Break | | |
11/29 | Thanksgiving Break | | |
12/02 | Project Presentations | | |
12/04 | Project Presentations | | HW4 due |
12/06 | Project Presentations | | |
12/09 | | | Project reports due |