Announcements

3/20: Hadoop/Spark Setup Recitation on March 21, starting at 6:30pm in HBH 1206. It will demonstrate how to install Hadoop and Spark on your local machine and on AWS clusters.
3/20: You can find the course lectures posted on Blackboard.
3/21: Welcome to the class! Hope you will enjoy it :)

CLASS MEETS:

Time: Tue & Thu 9:00AM - 10:20AM
Place: HBH 1002

PEOPLE:

Instructor: Leman Akoglu

Office: HBH 2118C
Office hours: Thu 11am-12pm
Email: invert (cs.cmu.edu @ lakoglu)

Teaching Assistant: Abhinav Maurya

Office: HBH 3034
Office hours: Mon and Fri 5-6pm
Email: invert (andrew.cmu.edu @ amaurya)

Grader: Ming Lin

Email: invert (andrew.cmu.edu @ mingl3)

COURSE DESCRIPTION:

The rate and amount of data being generated in today's world by both humans and machines are unprecedented. Being able to store, manage, and analyze large-scale data has a critical impact on business intelligence, scientific discovery, and social and environmental challenges.

The goal of this course is to equip students with the understanding, knowledge, and practical skills to develop big data / machine learning solutions with state-of-the-art tools, particularly those in the Spark environment, with a focus on the programming models in MLlib, GraphX, and SparkSQL. Students will also gain hands-on experience with MapReduce and Apache Spark using real-world datasets. See the syllabus for more details.

This course is designed to give graduate-level students a thorough grounding in the technologies and best practices used in big data machine learning. The course assumes that students understand basic data analysis and machine learning concepts and have basic programming knowledge (preferably in Python or Java). Previous experience with Hadoop, Spark, or distributed computing is NOT required.

Learning Objectives

By the end of this class, students will

gain an understanding of the MapReduce paradigm and the Hadoop ecosystem
understand scalability challenges for common ML tasks
study distributed machine learning algorithms
understand details of SparkSQL, GraphX, and MLlib (Spark's ML library)
implement distributed pipelines in Apache Spark using real datasets

BULLETIN BOARD and other info

For course material, assignments, announcements, and grades, please see Blackboard.
For questions and discussions, please use Piazza.
Carnegie Mellon 2016-2017 official academic calendar

MISC - FUN:

Joke-1 Joke-2 Joke-3