Lectures:
Note that the current plan is for Section A4 and K4 lectures to be recorded.
Recitations:
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants:
Office hours (starting second week of class):
Regardless of which section you are in, you are welcome to attend the office hours of any member of the course staff. We have spread the office hours across the day to accommodate as many of your time zones as possible. I suggest adding all of the times below to your calendar using Google Calendar's time zone feature so that they are automatically converted to your local time (Pittsburgh time is listed as "Eastern Time - New York" and Adelaide time is listed as "Central Australia Time - Adelaide"). All office hours are held remotely over Zoom; the Zoom links are posted in Canvas.
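If you would rather check a time conversion yourself, here is a minimal sketch using Python's standard-library `zoneinfo` module (Python 3.9 or later, with a time zone database available on your system). The helper name `pittsburgh_to_adelaide` is made up for illustration, and the year 2021 is an assumption, since the schedule lists only months and days. The example converts the HW1 deadline of 11:59pm Wed Apr 7 Pittsburgh time.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# IANA zone names; these match the Google Calendar labels
# "Eastern Time - New York" and "Central Australia Time - Adelaide".
PITTSBURGH = ZoneInfo("America/New_York")
ADELAIDE = ZoneInfo("Australia/Adelaide")


def pittsburgh_to_adelaide(year, month, day, hour, minute):
    """Convert a Pittsburgh wall-clock time to Adelaide local time (hypothetical helper)."""
    pittsburgh_time = datetime(year, month, day, hour, minute, tzinfo=PITTSBURGH)
    return pittsburgh_time.astimezone(ADELAIDE)


# HW1 deadline: 11:59pm Wed Apr 7 Pittsburgh time (the year is an assumption).
deadline = pittsburgh_to_adelaide(2021, 4, 7, 23, 59)
print(deadline.strftime("%a %b %d, %I:%M%p"))
```

Because `zoneinfo` applies each location's daylight saving rules for the given date, the same helper should also work for dates on either side of a clock change.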
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis:
We will be coding lots of Python and dabbling a bit in GPU computing (Google Colab).
Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: Homework 20%, quiz 1 40%, quiz 2 40%*
*Students with the most instructor-endorsed answers on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their quiz 2 score (a maximum of 5 bonus points; quiz 2 is out of 100 points prior to any bonus points being added).
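As a rough illustration of how these weights combine, here is a minimal sketch. It assumes that homework and both quizzes are each scored out of 100, and that the Piazza bonus is added to the quiz 2 score before the 40% weight is applied, with no cap at 100. The function name `course_score` is made up for illustration, and this is not the official grade calculation.

```python
def course_score(homework, quiz1, quiz2, piazza_bonus=0.0):
    """Weighted course score on a 0-100 scale (illustrative only).

    Assumes every component is scored out of 100 and that the bonus
    is added to quiz 2 before weighting, without capping at 100.
    """
    if not 0 <= piazza_bonus <= 5:
        raise ValueError("the Piazza bonus is between 0 and 5 points")
    quiz2_with_bonus = quiz2 + piazza_bonus
    return 0.20 * homework + 0.40 * quiz1 + 0.40 * quiz2_with_bonus


# Example: 90 on homework, 80 on quiz 1, 85 on quiz 2, full 5-point bonus.
example = course_score(90, 80, 85, piazza_bonus=5)
print(round(example, 2))
```

Under these assumptions, the full 5-point bonus raises the overall score by 0.40 × 5 = 2 points.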
Syllabus (updated 3/22 11:24pm Pittsburgh time): [pdf]
Previous version of course (including lecture slides and demos): 95-865 Fall 2020 mini 2
Date | Topic | Supplemental Material |
---|---|---|
Part I. Exploratory data analysis | ||
Mon Mar 22 |
Lecture 1: Course overview
|
|
Wed Mar 24 |
Lecture 2: Basic text analysis, co-occurrence analysis
For the basic text analysis demo to work, please
install Anaconda Python 3, Jupyter, and spaCy first
|
|
Fri Mar 26 |
Recitation: Basic Python review
|
|
Mon Mar 29 |
Lecture 3: Finding possibly related entities
|
What is the maximum phi-squared/chi-squared value? (technical)
|
Wed Mar 31 |
Lecture 4: Visualizing high-dimensional data with PCA
|
Causality additional reading:
PCA additional reading (technical): |
Fri Apr 2 |
Recitation slot — Lecture 5: Manifold learning with Isomap
|
Python examples for dimensionality reduction: |
Mon Apr 5 |
No class (CMU break day) |
|
Wed Apr 7 |
Lecture 6: Wrap up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering phenomena
HW1 due 11:59pm Pittsburgh time |
See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical):
|
Fri Apr 9 |
Recitation: More on PCA, practice with argsort
|
|
Mon Apr 12 |
Lecture 7: Distance and similarity functions, clustering (k-means, GMMs)
|
Clustering additional reading (technical): |
Wed Apr 14 |
Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models)
|
Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical):
|
Fri Apr 16 | No class (CMU break day) | |
Mon Apr 19 |
Lecture 9: Topic modeling
|
Topic modeling reading:
|
Wed Apr 21 |
Lecture 10: Wrap up topic modeling; wrap up clustering; a glimpse of predictive data analytics
|
|
Thur Apr 22 |
HW2 due 11:59pm Pittsburgh time |
|
Fri Apr 23 |
Quiz 1:
|
|
Part II. Predictive data analysis | ||
Mon Apr 26 |
Lecture 11: Intro to predictive data analytics
|
Some nuanced details on cross-validation (technical):
|
Wed Apr 28 |
Lecture 12: Wrap up basic prediction concepts
|
|
Fri Apr 30 |
Recitation: More practice on model evaluation
|
|
Mon May 3 |
Lecture 13: Intro to neural nets and deep learning
|
PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
Additional reading:
Video introduction on neural nets:
Mike Jordan's Medium article on where AI is at (April 2018): |
Wed May 5 |
Lecture 14: Image analysis with convolutional neural nets
|
Additional reading:
|
Fri May 7 |
Recitation slot: Lecture 15 on time series analysis with recurrent neural nets; more deep learning topics; course wrap-up
The demo will not be covered during the recitation slot and is instead covered in the April 30 Section K4 recitation Zoom recording by your TA Erick (check Canvas Zoom recordings)
|
Additional reading:
|
Mon May 10 |
HW3 due 11:59pm Pittsburgh time |
|
Thur May 13 |
Quiz 2:
|
Date | Topic | Supplemental Material |
---|---|---|
Part I. Exploratory data analysis | ||
Wed Mar 24 |
Lecture 1: Course overview
|
|
Fri Mar 26 |
Lecture 2: Basic text analysis, co-occurrence analysis
For the basic text analysis demo to work, please
install Anaconda Python 3, Jupyter, and spaCy first
Recitation: Basic Python review
|
|
Wed Mar 31 |
Lecture 3: Finding possibly related entities
|
What is the maximum phi-squared/chi-squared value? (technical)
|
Fri Apr 2 |
No class (Good Friday) |
|
Wed Apr 7 |
Lecture 4: Visualizing high-dimensional data with PCA
|
Causality additional reading:
PCA additional reading (technical): |
Thur Apr 8 |
HW1 due 1:29pm Adelaide time (corresponds to 11:59pm Wed Apr 7 Pittsburgh time) |
|
Fri Apr 9 |
Lecture 5: Manifold learning with Isomap
Extended recitation slot (5:30pm-8:30pm Adelaide time): Lectures 6 and 7 on wrapping up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering (k-means, GMMs)
|
Python examples for dimensionality reduction:
t-SNE additional reading (technical):
Clustering additional reading (technical): |
Wed Apr 14 |
Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models)
Quiz 1 review session (7pm-8:30pm Adelaide time) |
Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical):
|
Fri Apr 16 |
Lecture 9: Wrap up clustering (density-based clustering with DBSCAN, final remarks); topic modeling
Recitation slot: Quiz 1 (80 minutes to match amount of time that will be given to Pittsburgh students) |
Topic modeling reading:
|
Part II. Predictive data analysis | ||
Wed Apr 21 |
Lecture 10: Intro to predictive data analytics
|
Some nuanced details on cross-validation (technical):
|
Fri Apr 23 |
HW2 due 1:29pm Adelaide time (corresponds to 11:59pm Thur Apr 22 Pittsburgh time)
Lecture 11: Wrap up basic prediction concepts; intro to neural nets and deep learning
Recitation slot — Lecture 12: Intro to neural nets and deep learning
|
PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
Additional reading:
Video introduction on neural nets:
Mike Jordan's Medium article on where AI is at (April 2018): |
Wed Apr 28 |
Lecture 13: Image analysis with convolutional neural nets
|
Additional reading:
|
Fri Apr 30 |
Lecture 14: Time series analysis with recurrent neural nets; some other deep learning topics; course wrap-up
Recitation: sentiment analysis with IMDB reviews; more on word embeddings and fine tuning; some PyTorch code examples
|
Additional reading:
|
Final exam period May 3-7 |
HW3 due May 6, 11:59pm Adelaide time
Quiz 2: May 7, 10:30am-11:50am Adelaide time |