All times listed are in Pittsburgh time (US Eastern Time)
Lectures, time and location: Currently, the plan is for lectures prior to Thanksgiving break to be held in person and streamed live at the same time (i.e., I teach in a classroom and start a Zoom session). After Thanksgiving, all instruction will be purely remote. Note that the Tue/Thu lectures are recorded; the Mon/Wed lectures are not.
Recitations: Fridays 1:30pm-2:50pm, remote (Zoom)
Instructor: George Chen (email: georgechen ♣ cmu.edu; replace "♣" with the "at" symbol)
Teaching assistants: Xinyu Yao (xinyuyao ♣ andrew.cmu.edu), Xuejian Wang (xuejianw ♣ andrew.cmu.edu)
Office hours:
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data is often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis: first exploratory data analysis, then predictive data analysis, matching the two parts of the schedule below.
We will be coding a lot in Python and dabbling a bit in GPU computing (Google Colab).
Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state which Python courses you have taken and what other Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: Homework 20%, quiz 1 40%, quiz 2 40%*
*Students with the most instructor-endorsed answers on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their quiz 2 score (a maximum of 5 bonus points; quiz 2 is out of 100 points prior to any bonus points being added).
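As a rough illustration of how the weights and the Piazza bonus combine, here is a minimal sketch. The function name `course_score` is hypothetical, and it assumes that homework is summarized as a single 0-100 average and that each quiz is scored out of 100; whether quiz 2 plus bonus may exceed 100 is not stated above, so the sketch does not cap it. This is not an official grade calculator.

```python
def course_score(homework, quiz1, quiz2, piazza_bonus=0.0):
    """Weighted course score on a 0-100 scale (illustrative sketch only).

    Assumes every component is already expressed as a score out of 100.
    """
    # The bonus is at most 5 points and is added directly to the quiz 2 score.
    bonus = min(max(piazza_bonus, 0.0), 5.0)
    quiz2_with_bonus = quiz2 + bonus
    # Weights from the grading breakdown: homework 20%, quiz 1 40%, quiz 2 40%.
    return 0.20 * homework + 0.40 * quiz1 + 0.40 * quiz2_with_bonus


if __name__ == "__main__":
    print(course_score(homework=80, quiz1=90, quiz2=70, piazza_bonus=5))
```

With these example numbers, the three weighted parts are 16 (homework), 36 (quiz 1), and 30 (quiz 2 with the full 5-point bonus), for a total of 82.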
Syllabus: [pdf]
Previous version of course (including lecture slides and demos): 95-865 Spring 2020 mini 3
Date | Topic | Supplemental Material |
---|---|---|
Part I. Exploratory data analysis | | |
Week 1: Oct 26-30 | HW1 was released Oct 26 (check Canvas)<br>Lecture 1 (Oct 26\|27): Course overview, analyzing text using frequencies<br>Lecture 2 (Oct 28\|29): Text analysis demo, co-occurrence analysis<br>Recitation (Oct 30): Basic Python review | |
Week 2: Nov 2-6 | Lecture 3 (Nov 2\|3): Finding possibly related entities<br>Lecture 4 (Nov 4\|5): Visualizing high-dimensional data (PCA)<br>Recitation (Nov 6): More on PCA, practice with argsort<br>HW1 due Friday Nov 6, 11:59pm | What is the maximum phi-squared/chi-squared value? (technical)<br>Causality additional reading:<br>PCA additional reading (technical): |
Week 3: Nov 9-13 | HW2 released at the start of the week<br>Lecture 5 (Nov 9\|10): Manifold learning (Isomap, t-SNE)<br>Lecture 6 (Nov 11\|12): Wrap up manifold learning, begin clustering (k-means)<br>Recitation (Nov 13): Quiz 1 review | Python examples for dimensionality reduction:<br>Some details on t-SNE including code (from a past UDA recitation):<br>Additional dimensionality reduction reading (technical):<br>Additional clustering reading (technical): |
Week 4: Nov 16-20 | Lecture 7 (Nov 16\|17): Clustering (k-means, GMMs)<br>Lecture 8 (Nov 18\|19): More clustering (automatically choosing k with CH-index, DP-GMMs, and DP-means)<br>Friday Nov 20: no recitation; Quiz 1 instead (upon opening the quiz, you have 80 minutes to complete it) | Python cluster evaluation:<br>DP-means paper (technical):<br>Hierarchical clustering reading (technical): |
Week 5: Nov 23-27 | Lecture 9 (Nov 23\|24): Topic modeling<br>Thanksgiving: no class Wednesday through Friday (note that to keep the two sections synced, there is no Wednesday class!) | Topic modeling reading: |
Part II. Predictive data analysis | | |
Week 6: Nov 30-Dec 4 | HW2 due Monday Nov 30, 11:59pm<br>HW3 released early in the week<br>Instruction becomes purely remote at the start of this week; do not show up to HBH 1204<br>Lecture 10 (Nov 30\|Dec 1): Intro to predictive data analytics (some terminology, k-NN classification, model evaluation)<br>Lecture 11 (Dec 2\|3): Wrap up predictive model evaluation, classical classifiers; intro to neural nets and deep learning<br>Lecture 12 during Dec 4 recitation slot: Wrap up intro to neural nets and deep learning; image analysis with convolutional neural nets | Some nuanced details on cross-validation (technical):<br>PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):<br>Additional reading:<br>Video introduction on neural nets:<br>Mike Jordan's Medium article on where AI is at (April 2018): |
Week 7: Dec 7-11 | Lecture 13 (Dec 7\|8): Time series analysis with recurrent neural nets<br>Lecture 14 (Dec 9\|10): More on deep learning and course wrap-up<br>Recitation: Quiz 2 review | Additional reading:<br>Some bonus reading (a student asked about image segmentation, and here's an introduction): |
Final exam period: Dec 14-20 | HW3 due Monday Dec 14, 11:59pm<br>Friday Dec 18: Quiz 2 (upon opening the quiz, you have 80 minutes to complete it) | |