95-865: Unstructured Data Analytics (Spring 2021 Mini 4)

Unstructured Data Analytics

Lectures:
Note that the current plan is for the Section A4 and K4 lectures to be recorded.

Recitations:

  • Sections A4/B4/Z4: Fridays 3:10pm-4:30pm Pittsburgh time, live over Zoom
  • Section K4: Fridays 5:30pm-7pm Adelaide time, live over Zoom

Instructor: George Chen (email: georgechen ♣ cmu.edu) - replace "♣" with the "at" symbol

    Teaching assistants:

    Office hours (starting second week of class):
    Regardless of which section you are in, you are welcome to attend any course staff member's office hours. We have scheduled office hours at scattered times to cover as many of your time zones as possible. I suggest adding all of the times below to your calendar via Google Calendar using its time zone feature so that they are automatically converted to your local time (Pittsburgh time is labeled "Eastern Time - New York" and Adelaide time is labeled "Central Australia Time - Adelaide"). Office hours are all held remotely over Zoom; Zoom links for office hours are posted in Canvas.

    Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

    course description

    Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know what structure underlies the data ahead of time, which is why the data are often referred to as "unstructured". This course takes a practical two-step approach to unstructured data analysis:

    1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
    2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
    Many examples are given for how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods in analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

    We will write a lot of Python code and dabble a bit in GPU computing (via Google Colab).

    Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

    Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

    Grading: Homework 20%, quiz 1 40%, quiz 2 40%*

    *Students with the most instructor-endorsed answers on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their quiz 2 score (a maximum of 5 bonus points; quiz 2 is out of 100 points prior to any bonus points being added).

    Syllabus (updated 3/22 11:24pm Pittsburgh time): [pdf]

    calendar (subject to revision)

    Previous version of course (including lecture slides and demos): 95-865 Fall 2020 mini 2

    Pittsburgh

    Date Topic Supplemental Material
    Part I. Exploratory data analysis
    Mon Mar 22

    Lecture 1: Course overview
    [slides]

    Wed Mar 24

    Lecture 2: Basic text analysis, co-occurrence analysis
    [slides]

    For the basic text analysis demo to work, please install Anaconda Python 3, Jupyter, and spaCy first
    [Jupyter notebook (basic text analysis)]

    Fri Mar 26

    Recitation: Basic Python review
    [Jupyter notebook]

    Mon Mar 29

    Lecture 3: Finding possibly related entities
    [slides]
    [Jupyter notebook (co-occurrence analysis)]

    What is the maximum phi-squared/chi-squared value? (technical)
    [stack exchange answer]
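    As a refresher on the idea behind the phi-squared/chi-squared question above, here is a minimal sketch of computing phi-squared from a 2x2 co-occurrence table. The counts are made-up numbers for illustration, not from any course dataset:

```python
import numpy as np

# Hypothetical 2x2 co-occurrence counts for a pair of entities:
# rows = entity A present/absent, columns = entity B present/absent
table = np.array([[30, 10],
                  [20, 40]])
n = table.sum()

# Expected counts under independence: outer product of the marginals / n
expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n

chi2 = ((table - expected) ** 2 / expected).sum()
phi2 = chi2 / n  # for a 2x2 table, phi-squared lies between 0 and 1
print(phi2)      # 0.1666... here
```

    Phi-squared is just chi-squared divided by the total count, which is why the maximum-value question for one also answers it for the other.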

    Wed Mar 31

    Lecture 4: Visualizing high-dimensional data with PCA
    [slides]
    [Jupyter notebook (PCA)]

    Causality additional reading:
    [Computational and Inferential Thinking, "Causality and Experiments" chapter]

    PCA additional reading (technical):
    [Abdi and Williams's PCA review]

    Fri Apr 2

    Recitation slot — Lecture 5: Manifold learning with Isomap
    [slides]
    [Jupyter notebook (manifold learning)]

    Python examples for dimensionality reduction:
    [scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

    Mon Apr 5

    No class (CMU break day)

    Wed Apr 7

    Lecture 6: Wrap up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering phenomena
    [slides]
    [Jupyter notebook (manifold learning); same demo as previous lecture]
    [required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
    For the demo below to work (on t-SNE with images), you will need to install some packages:
    pip install torch torchvision
    [Jupyter notebook (t-SNE with images)]
    [slides with some technical details for t-SNE]

    HW1 due 11:59pm Pittsburgh time

    See supplementary materials from the previous lecture; in addition, here's some reading for t-SNE (technical):
    [Simon Carbonnelle's t-SNE slides]
    [t-SNE webpage]

    Fri Apr 9

    Recitation: More on PCA, practice with argsort
    [Jupyter notebook]
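    For those who want a quick preview of the argsort practice, here is a minimal sketch (toy numbers, not from the recitation notebook): `np.argsort` returns the indices that would sort an array, which is handy for ranking items by score.

```python
import numpy as np

scores = np.array([0.2, 0.9, 0.5])
order = np.argsort(scores)  # indices that would sort scores ascending
print(order)                # [0 2 1]
print(order[::-1])          # reversed: indices from highest to lowest score
print(scores[order])        # [0.2 0.5 0.9]
```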

    Mon Apr 12

    Lecture 7: Distance and similarity functions, clustering (k-means, GMMs)
    [slides]
    [Jupyter notebook (PCA, t-SNE, clustering with drug data)]

    Clustering additional reading (technical):
    [see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
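    As a tiny illustration of the k-means idea from this lecture, here is a sketch on synthetic data (the data and parameters are made up; the lecture demo uses the real drug dataset instead):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# Fit k-means with k=2; each point gets assigned to its nearest center
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # should land near (0, 0) and (5, 5)
```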

    Wed Apr 14

    Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models)
    [slides]
    [Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]

    Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical):
    [Revisiting k-means: New Algorithms via Bayesian Nonparametrics]

    Fri Apr 16

    No class (CMU break day)

    Mon Apr 19

    Lecture 9: Topic modeling
    [slides]
    [Jupyter notebook (topic modeling with LDA)]

    Topic modeling reading:
    [David Blei's general intro to topic modeling]
    [(technical) Topic Modelling Meets Deep Neural Networks: A Survey]

    Wed Apr 21

    Lecture 10: Wrap up topic modeling; wrap up clustering; a glimpse of predictive data analytics
    [slides]
    [Jupyter notebook (topic modeling with LDA); same demo as previous lecture]
    [Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]

    Thur Apr 22

    HW2 due 11:59pm Pittsburgh time

    Fri Apr 23

    Quiz 1:

    • Recitation slot (3:10pm-4:30pm Pittsburgh time) for Sections A4/B4
    • 6:30pm-7:50pm Pittsburgh time for Section Z4

    Part II. Predictive data analysis
    Mon Apr 26

    Lecture 11: Intro to predictive data analytics
    [slides]

    Some nuanced details on cross-validation (technical):
    [Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
    [Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
    [Bias and variance as we change the number of folds in k-fold cross-validation]
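    As a minimal sketch of k-fold cross-validation (the dataset and classifier here are illustrative stand-ins, not the lecture's example): each fold serves once as the validation set while the model trains on the rest, and we average the validation scores.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold CV: 5 train/validate splits, each point validated exactly once
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores.mean())  # average validation accuracy across the 5 folds
```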

    Wed Apr 28

    Lecture 12: Wrap up basic prediction concepts
    [slides]
    [Jupyter notebook (prediction and model validation; same demo as last time)]

    Fri Apr 30

    Recitation: More practice on model evaluation
    [Jupyter notebook]

    Mon May 3

    Lecture 13: Intro to neural nets and deep learning
    [slides]
    For the neural net demo below to work, you will need to install some packages:
    pip install torch torchvision torchaudio
    pip install torchsummaryX
    python -m spacy download en
    pip install pytorch-nlp
    [Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]

    PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
    [PyTorch tutorial]
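    To make the first page of that tutorial concrete, here is a minimal sketch of going between NumPy arrays and PyTorch tensors, and of moving a tensor to whichever device is available:

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(a)  # NumPy -> PyTorch; shares memory on the CPU
b = t.numpy()            # PyTorch -> NumPy; also shares memory

# Tensors live on a device; move them explicitly if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_on_device = t.to(device)  # no-op on CPU, copies to GPU otherwise
print(t_on_device.device)
```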

    Additional reading:
    [Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

    Video introduction on neural nets:
    ["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

    Mike Jordan's Medium article on where AI is at (April 2018):
    ["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

    Wed May 5

    Lecture 14: Image analysis with convolutional neural nets
    [slides]
    [Jupyter notebook (handwritten digit recognition with neural nets; same demo as previous lecture)]

    Additional reading:
    [Stanford CS231n Convolutional Neural Networks for Visual Recognition]
    [(technical) Richard Zhang's fix for max pooling (presented at ICML 2019)]

    Fri May 7

    Recitation slot: Lecture 15 on time series analysis with recurrent neural nets; more deep learning topics; course wrap-up
    [slides]

    The demo will not be covered during the recitation slot and is instead covered in the April 30 Section K4 recitation Zoom recording by your TA Erick (check Canvas Zoom recordings)
    [slides]
    For the demo below to work, be sure to install the prerequisite packages as mentioned for the lecture 13 demo.
    [Jupyter notebook (sentiment analysis with IMDb reviews)]

    Additional reading:
    [Christopher Olah's "Understanding LSTM Networks"]
    [A tutorial on word2vec word embeddings]
    [A tutorial on BERT word embeddings]

    Mon May 10

    HW3 due 11:59pm Pittsburgh time

    Thur May 13

    Quiz 2:

    • 1pm-2:20pm Pittsburgh time for Sections A4/B4
    • 5:30pm-6:50pm Pittsburgh time for students in A4/B4 taking the alternate Z4 time
    • Students officially in Section Z4 have been sent separate instructions for their Quiz 2 (check Canvas)

    Adelaide

    Date Topic Supplemental Material
    Part I. Exploratory data analysis
    Wed Mar 24

    Lecture 1: Course overview
    [slides]

    Fri Mar 26

    Lecture 2: Basic text analysis, co-occurrence analysis
    [slides]

    For the basic text analysis demo to work, please install Anaconda Python 3, Jupyter, and spaCy first
    [Jupyter notebook (basic text analysis)]

    Recitation: Basic Python review
    [Jupyter notebook]

    Wed Mar 31

    Lecture 3: Finding possibly related entities
    [slides]
    [Jupyter notebook (co-occurrence analysis)]

    What is the maximum phi-squared/chi-squared value? (technical)
    [stack exchange answer]

    Fri Apr 2

    No class (Good Friday)

    Wed Apr 7

    Lecture 4: Visualizing high-dimensional data with PCA
    [slides]
    [Jupyter notebook (PCA)]

    Causality additional reading:
    [Computational and Inferential Thinking, "Causality and Experiments" chapter]

    PCA additional reading (technical):
    [Abdi and Williams's PCA review]

    Thur Apr 8

    HW1 due 1:29pm Adelaide time (corresponds to 11:59pm Wed Apr 7 Pittsburgh time)

    Fri Apr 9

    Lecture 5: Manifold learning with Isomap
    [slides]
    [Jupyter notebook (manifold learning)]

    Extended recitation slot (5:30pm-8:30pm Adelaide time): Lectures 6 and 7 on wrapping up manifold learning (t-SNE), a first look at analyzing images, and an introduction to clustering (k-means, GMMs)
    [slides]
    [Jupyter notebook (manifold learning); same demo as previous lecture]
    [required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]
    For the demo below to work (on t-SNE with images), you will need to install some packages:
    pip install torch torchvision
    [Jupyter notebook (t-SNE with images)]
    [Jupyter notebook (PCA, t-SNE, clustering with drug data)]
    [slides with some technical details for t-SNE]

    Python examples for dimensionality reduction:
    [scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

    t-SNE additional reading (technical):
    [Simon Carbonnelle's t-SNE slides]
    [t-SNE webpage]

    Clustering additional reading (technical):
    [see Section 14.3 of the book "Elements of Statistical Learning" on clustering]

    Wed Apr 14

    Lecture 8: More on clustering (interpreting clustering results, automatically choosing the number of clusters for GMM-related models)
    [slides]
    [Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]

    Quiz 1 review session (7pm-8:30pm Adelaide time)

    Reading on DP-means, DP mixture models for which a DP-GMM is a special case (technical):
    [Revisiting k-means: New Algorithms via Bayesian Nonparametrics]

    Fri Apr 16

    Lecture 9: Wrap up clustering (density-based clustering with DBSCAN, final remarks); topic modeling
    [slides]
    [Jupyter notebook (PCA, t-SNE, clustering on UCI drug consumption data); same demo as previous lecture]
    [Jupyter notebook (topic modeling with LDA)]

    Recitation slot: Quiz 1 (80 minutes to match amount of time that will be given to Pittsburgh students)

    Topic modeling reading:
    [David Blei's general intro to topic modeling]
    [(technical) Topic Modelling Meets Deep Neural Networks: A Survey]

    Part II. Predictive data analysis
    Wed Apr 21

    Lecture 10: Intro to predictive data analytics
    [slides]
    For the demo below to work, you will need to install some packages:
    pip install torch torchvision
    [Jupyter notebook (prediction and model validation)]

    Some nuanced details on cross-validation (technical):
    [Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
    [Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
    [Bias and variance as we change the number of folds in k-fold cross-validation]

    Fri Apr 23

    HW2 due 1:29pm Adelaide time (corresponds to 11:59pm Thur Apr 22 Pittsburgh time)

    Lecture 11: Wrap up basic prediction concepts; intro to neural nets and deep learning
    [slides]
    [Jupyter notebook (prediction and model validation; same demo as last time)]

    Recitation slot — Lecture 12: Intro to neural nets and deep learning
    [slides]
    For the neural net demo below to work, you will need to install some packages:
    pip install torch torchvision torchaudio
    pip install torchsummaryX
    python -m spacy download en
    pip install pytorch-nlp
    [Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]

    PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):
    [PyTorch tutorial]

    Additional reading:
    [Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

    Video introduction on neural nets:
    ["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]

    Mike Jordan's Medium article on where AI is at (April 2018):
    ["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

    Wed Apr 28

    Lecture 13: Image analysis with convolutional neural nets
    [slides]
    [Jupyter notebook (handwritten digit recognition with neural nets; same demo as previous lecture)]

    Additional reading:
    [Stanford CS231n Convolutional Neural Networks for Visual Recognition]
    [(technical) Richard Zhang's fix for max pooling (presented at ICML 2019)]

    Fri Apr 30

    Lecture 14: Time series analysis with recurrent neural nets; some other deep learning topics; course wrap-up
    [slides]

    Recitation: sentiment analysis with IMDb reviews; more on word embeddings and fine-tuning; some PyTorch code examples
    For the demo below to work, be sure to install the prerequisite packages as mentioned for the lecture 12 demo.
    [slides]
    [Jupyter notebook (sentiment analysis with IMDb reviews)]

    Additional reading:
    [Christopher Olah's "Understanding LSTM Networks"]
    [A tutorial on word2vec word embeddings]
    [A tutorial on BERT word embeddings]

    Final exam period May 3-7

    HW3 due May 6, 11:59pm Adelaide time

    Quiz 2, May 7 10:30am-11:50am Adelaide time