95-865: Unstructured Data Analytics
(Spring 2022 Mini 4)

Unstructured Data Analytics

⚠ This mini's schedule is a bit unusual in order to keep the course synced between the CMU Pittsburgh and Adelaide campuses.


Lectures:
Note that the current plan is for Section A4 and K4 lectures to be recorded.

There is also a Z4 section, only for Heinz students in a part-time distance-learning degree program, which follows a different format of the course but uses the same lectures and recitations.

Recitations:

Instructor: George Chen (email: georgechen ♣ cmu.edu) - replace "♣" with the "at" symbol

Teaching assistants:

♢ Erick is the TA on the Adelaide campus. All other TAs are with the Pittsburgh campus.

Office hours (starting second week of class):
Office hours are all held remotely over Zoom; the Zoom links are posted in Canvas. Regardless of which section you are in, you are welcome to attend office hours for any of the course staff; we have spread the times across the week to accommodate a variety of schedules. I suggest adding all of the times below to Google Calendar using its time zone feature, so that they are automatically converted to your local time (Pittsburgh time is labeled "Eastern Time - New York" and Adelaide time is labeled "Central Australia Time - Adelaide").

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know what structure underlies the data ahead of time, which is why such data are often referred to as "unstructured". This course takes a practical two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples are given of how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will write a lot of Python code and dabble a bit in GPU computing (via Google Colab).

Prerequisite: If you are a Heinz student, then you must have taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework (30%), Quiz 1 (35%), Quiz 2 (35%*)

*Students with the most instructor-endorsed posts on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their Quiz 2 score (a maximum of 10 bonus points, so that it is possible to get 110 out of 100 points on Quiz 2).

Letter grades are determined based on a curve.

Calendar (tentative)

Previous version of course (including lecture slides and demos): 95-865 Fall 2021 mini 2

Pittsburgh

⚠ Due to enrollment being much higher than normal, I've been asked to have on-campus A4/B4 students rotate between attending class in person and attending over Zoom: if everyone shows up in person, there will not be enough space. I will not strictly enforce who attends in which format and will instead leave the choice up to you, though I encourage you to alternate between the two; if the classroom is full, please watch the lecture elsewhere via Zoom. Live attendance is not required for lectures or recitations (aside from the recitations in which there is an exam). I plan to record Section A4 lectures (but not B4), which can be watched after lecture; recitations and review sessions will also be recorded.

Date Topic Supplemental Material
Part I. Exploratory data analysis
Mon Mar 14 Lecture 1: Course overview, analyzing text using frequencies
[slides]

Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture):

[slides]
Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class

HW1 released (check Canvas)
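As a small taste of the frequency-based text analysis in this lecture, a minimal sketch using only Python's standard library (the sentence and the crude whitespace tokenization are just for illustration; in class we tokenize with spaCy):

```python
from collections import Counter

text = "the cat sat on the mat and the cat napped"
tokens = text.split()  # crude whitespace tokenization (spaCy does far better)

# Raw term frequencies: how often does each word appear?
freq = Counter(tokens)
print(freq.most_common(3))  # the most frequent words dominate
```

In practice, raw counts like these get reweighted (e.g., removing stop words) before they become useful features.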

Wed Mar 16 Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy), co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis)]
What is the maximum phi-squared/chi-squared value? (technical)
[Stack Exchange answer]
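Tying into the Stack Exchange answer above: for a 2x2 co-occurrence table, phi-squared lies between 0 and 1. A minimal sketch of computing it from document co-occurrence counts (the toy corpus is made up for illustration):

```python
# Toy corpus: each document is the set of distinct terms it contains
docs = [{"data", "science"}, {"data", "analytics"},
        {"science", "lab"}, {"data", "science"}]

def phi_squared(term1, term2, docs):
    """Phi-squared from the 2x2 co-occurrence table of two terms."""
    a = sum(term1 in d and term2 in d for d in docs)      # both terms
    b = sum(term1 in d and term2 not in d for d in docs)  # term1 only
    c = sum(term1 not in d and term2 in d for d in docs)  # term2 only
    d_ = len(docs) - a - b - c                            # neither term
    denom = (a + b) * (c + d_) * (a + c) * (b + d_)
    return (a * d_ - b * c) ** 2 / denom if denom else 0.0

print(phi_squared("data", "science", docs))
```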
Thur Mar 17 Your TA Ziyuan will hold a Python review outside of class over Zoom (7pm-8:20pm) — find the Zoom link in Canvas (this review session is not mandatory but we encourage you to attend)
[Jupyter notebook]
Fri Mar 18 Recitation slot — Lecture 3: Co-occurrence analysis (cont'd), visualizing high-dimensional data
[slides]
[Jupyter notebook (co-occurrence analysis)]
Mon Mar 21 Lecture 4: PCA, manifold learning
[slides]
[Jupyter notebook (PCA)]


Additional reading (technical):

[Abdi and Williams's PCA review]
[The original Isomap paper (Tenenbaum et al 2000)]
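To complement the PCA review above, a minimal sketch of PCA via the SVD of centered data (the random data are just for illustration; in class we use scikit-learn's PCA):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 points in 5 dimensions

Xc = X - X.mean(axis=0)                # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt[:2].T                 # project onto top 2 principal components
explained_var = S**2 / (len(X) - 1)    # variance captured by each component
```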
Wed Mar 23 Lecture 5: Manifold learning (cont'd)
[slides]
[Jupyter notebook (manifold learning)]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]

HW1 due 11:59pm Pittsburgh time
Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Some technical details for t-SNE:

[slides]

Even more technical reading for t-SNE:

[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
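A minimal sketch of running t-SNE with scikit-learn on toy data (all parameter choices here, e.g. the perplexity, are illustrative; the required reading above explains why they matter):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy data: two well-separated 10-dimensional blobs
X = np.vstack([rng.normal(0, 1, (30, 10)), rng.normal(8, 1, (30, 10))])

# Perplexity must be smaller than the number of points; try several values!
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(emb.shape)  # every point gets a 2-D coordinate
```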
Fri Mar 25 Recitation: More on PCA, practice with argsort
[Jupyter notebook]
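The argsort pattern practiced in this recitation, in minimal form (the scores array is made up for illustration):

```python
import numpy as np

scores = np.array([0.3, 0.9, 0.5, 0.1])

order = np.argsort(scores)           # indices that would sort ascending
top2 = np.argsort(scores)[::-1][:2]  # indices of the 2 largest values

print(order)  # [3 0 2 1]
print(top2)   # [1 2]
```

This is exactly how we recover, e.g., the most important words for a principal component from its weight vector.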
Mon Mar 28 Lecture 6: Dimensionality reduction for images, intro to clustering
[slides]
[Jupyter notebook (dimensionality reduction with images)***]
***For the demo on t-SNE with images to work, you will need to install some packages:
pip install torch torchvision
[Jupyter notebook (dimensionality reduction and clustering with drug data)]
Clustering additional reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
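A minimal sketch of Lloyd's algorithm for k-means, one of the clustering methods covered here (in class we use scikit-learn's KMeans; this toy version just shows the alternating assign/update steps, and the two-blob data are illustrative):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init at data points
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centers = lloyd_kmeans(X, k=2)
```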
Wed Mar 30 Lecture 7: Clustering (cont'd)
[slides]
We continue using the same demo from last time:
[Jupyter notebook (dimensionality reduction and clustering with drug data)]
See supplemental clustering reading posted for previous lecture
Fri Apr 1 Recitation slot: Quiz 1 (80 minutes)
Mon Apr 4 Lecture 8: Clustering (cont'd), topic modeling
[slides]
[Jupyter notebook (topic modeling with LDA)]
Topic modeling reading:
[David Blei's general intro to topic modeling]
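A minimal sketch of fitting LDA with scikit-learn (the tiny corpus and the choice of 2 topics are purely illustrative):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cats and dogs and pets",
        "dogs chase cats",
        "stocks and bonds and markets",
        "markets move stocks"]

counts = CountVectorizer().fit_transform(docs)   # bag-of-words count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)            # per-document topic mixtures
```

Each row of `doc_topic` is a probability distribution over topics for that document.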
Part II. Predictive data analysis
Wed Apr 6 Lecture 9: Topic modeling (cont'd), intro to predictive data analysis
[slides]
Fri Apr 8 No class (CMU Spring Carnival)
Mon Apr 11 Lecture 10: Hyperparameter tuning, decision trees & forests, classifier evaluation
[slides]
[Jupyter notebook (prediction and model validation)]
Some nuanced details on cross-validation (technical):
[Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
[Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
[Bias and variance as we change the number of folds in k-fold cross-validation]
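A minimal sketch of tuning a hyperparameter with k-fold cross-validation in scikit-learn (the dataset and the depth grid are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Try a few values of a hyperparameter and keep the best CV score
best_depth, best_score = None, -1.0
for depth in [1, 2, 3, 5]:
    scores = cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X, y, cv=5)                        # 5-fold cross-validation
    if scores.mean() > best_score:
        best_depth, best_score = depth, scores.mean()
```

As the readings above caution, the best cross-validation score is itself an optimistic estimate; report final accuracy on a held-out test set.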
Wed Apr 13 No class — instead, watch the video recording of the Section K4 recitation titled "Clustering on unstructured data, more on topic models"; Section K4 held this recitation during CMU Pittsburgh's Spring Carnival (the Adelaide campus still has class at that time and does not get that Thursday and Friday off). The recording is on Canvas with the other cloud recordings (look for the April 8 Section K4 recitation)
[Jupyter notebook]

HW2 due 11:59pm Pittsburgh time
Fri Apr 15 Recitation: More on classifier evaluation
[slides]
[Jupyter notebook]
Mon Apr 18 Lecture 11: Intro to neural nets & deep learning
[slides]
For the neural net demo below to work, you will need to install some packages:
pip install torch torchvision torchaudio
pip install torchsummaryX
pip install pytorch-nlp
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]
Be sure to edit two pytorch-nlp files as indicated in the following slides (resolves some issues with recent updates to PyTorch & spaCy):
[slides]
Michael I. Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):

[PyTorch tutorial]

Additional reading:

[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Video introduction on neural nets:

["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]
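To complement the readings above, a single-hidden-layer network's forward pass sketched in NumPy (the layer sizes and random weights are illustrative; in class we use PyTorch, which also handles gradients and GPUs):

```python
import numpy as np

rng = np.random.default_rng(0)

# One hidden layer: 784 inputs (e.g. a 28x28 image) -> 30 hidden units -> 10 classes
W1, b1 = rng.normal(0, 0.1, (784, 30)), np.zeros(30)
W2, b2 = rng.normal(0, 0.1, (30, 10)), np.zeros(10)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)       # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())    # softmax (numerically stable)
    return e / e.sum()                   # probabilities over the 10 classes

probs = forward(rng.normal(size=784))
```

Training means adjusting W1, b1, W2, b2 to make these probabilities match the labels; that is what PyTorch's autograd automates.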
Wed Apr 20 Lecture 12: Image analysis with convolutional neural nets (also called CNNs or convnets)
[slides]
We continue using the demo from the previous lecture
Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling]
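The core operation of a convolutional layer, sketched minimally in NumPy (convnet "convolution" is really the cross-correlation below; the toy image and kernel are illustrative):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the image."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))            # an all-ones filter just sums each patch
out = conv2d_valid(image, kernel)
```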
Fri Apr 22 Recitation slot — Lecture 13: Time series analysis with recurrent neural nets (RNNs)
[slides]
[Jupyter notebook (sentiment analysis with IMDb reviews; requires UDA_pytorch_utils.py from the previous demo)]
Additional reading:
[Christopher Olah's "Understanding LSTM Networks"]
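A vanilla RNN step sketched in NumPy (LSTMs, per the reading above, replace this update with a gated one; the sizes and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16                        # toy input and hidden dimensions

Wx = rng.normal(0, 0.1, (d_in, d_hid))     # input-to-hidden weights
Wh = rng.normal(0, 0.1, (d_hid, d_hid))    # hidden-to-hidden weights
b = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    # The same weights are reused at every time step
    return np.tanh(x_t @ Wx + h_prev @ Wh + b)

h = np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):     # a length-5 toy sequence
    h = rnn_step(x_t, h)                   # final h summarizes the sequence
```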
Mon Apr 25 Lecture 14: Additional deep learning topics and course wrap-up
[slides]
Captum tutorial:
[Captum tutorial]

Additional reading:

[A tutorial on word2vec word embeddings]
[A tutorial on BERT word embeddings]
[Belkin et al's 2019 paper on "Reconciling modern machine learning practice and the bias-variance trade-off"]
Wed Apr 27 No class

HW3 due 11:59pm Pittsburgh time
Fri Apr 29 Recitation slot: Quiz 2 (80 minutes)

Adelaide

Date Topic Supplemental Material
Part I. Exploratory data analysis
Wed Mar 16 Lecture 1: Course overview, analyzing text using frequencies
[slides]

Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture):

[slides]
Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class

HW1 released (check Canvas)

Thur Mar 17 Your TA Erick will hold a Python review outside of class over Zoom (5:30pm-7:30pm) — find the Zoom link in Canvas (this review session is not mandatory but we encourage you to attend)
[Jupyter notebook]
Fri Mar 18 Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy), co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis)]

Recitation slot — Lecture 3: Co-occurrence analysis (cont'd), visualizing high-dimensional data

[slides]
[Jupyter notebook (co-occurrence analysis)]
What is the maximum phi-squared/chi-squared value? (technical)
[Stack Exchange answer]
Wed Mar 23 Lecture 4: PCA, manifold learning
[slides]
[Jupyter notebook (PCA)]


Additional reading (technical):

[Abdi and Williams's PCA review]
[The original Isomap paper (Tenenbaum et al 2000)]
Thur Mar 24 HW1 due 2:29pm Adelaide time (corresponds to 11:59pm Wed Mar 23 Pittsburgh time)
Fri Mar 25 Lecture 5: Manifold learning (cont'd)
[slides]
[Jupyter notebook (manifold learning)]
[required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)]

Recitation: More on PCA, practice with argsort

[Jupyter notebook]
Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]

Some technical details for t-SNE:

[slides]

Even more technical reading for t-SNE:

[Simon Carbonnelle's t-SNE slides]
[t-SNE webpage]
Wed Mar 30 Lecture 6: Dimensionality reduction for images, intro to clustering
[slides]
[Jupyter notebook (dimensionality reduction with images)***]
***For the demo on t-SNE with images to work, you will need to install some packages:
pip install torch torchvision
[Jupyter notebook (dimensionality reduction and clustering with drug data)]
Clustering additional reading (technical):
[see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
Fri Apr 1 Lecture 7: Clustering (cont'd)
[slides]
We continue using the same demo from last time:
[Jupyter notebook (dimensionality reduction and clustering with drug data)]

Recitation slot: Quiz 1 (80 minutes)
See supplemental clustering reading posted for previous lecture
Wed Apr 6 Lecture 8: Clustering (cont'd), topic modeling
[slides]
[Jupyter notebook (topic modeling with LDA)]
Topic modeling reading:
[David Blei's general intro to topic modeling]
Part II. Predictive data analysis
Fri Apr 8 Lecture 9: Topic modeling (cont'd), intro to predictive data analysis
[slides]

Recitation slot: Clustering on unstructured data, more on topic models

[Jupyter notebook]
Some nuanced details on cross-validation (technical):
[Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
[Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
[Bias and variance as we change the number of folds in k-fold cross-validation]
Wed Apr 13 Lecture 10: Hyperparameter tuning, decision trees & forests, classifier evaluation
[slides]
[Jupyter notebook (prediction and model validation)]
Thur Apr 14 HW2 due 1:29pm Adelaide time (corresponds to 11:59pm Wed Apr 13 Pittsburgh time)
Fri Apr 15 No class (Good Friday) — instead, watch the video recording from Section A4 of the recitation "More on classifier evaluation"
Wed Apr 20 Lecture 11: Intro to neural nets & deep learning
[slides]
For the neural net demo below to work, you will need to install some packages:
pip install torch torchvision torchaudio
pip install torchsummaryX
pip install pytorch-nlp
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)]
Be sure to edit two pytorch-nlp files as indicated in the following slides (resolves some issues with recent updates to PyTorch & spaCy):
[slides]
Michael I. Jordan's Medium article on where AI is at (April 2018):
["Artificial Intelligence - The Revolution Hasn't Happened Yet"]

PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU):

[PyTorch tutorial]

Additional reading:

[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]

Video introduction on neural nets:

["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown]
Fri Apr 22 Lecture 12: Image analysis with convolutional neural nets (also called CNNs or convnets)
[slides]
We continue using the demo from the previous lecture

Recitation slot — Extended lecture 13 (covers material from lecture 13 and a little bit of lecture 14 for Pittsburgh): Time series analysis with recurrent neural nets (RNNs); additional deep learning topics and course wrap-up

[slides]
[Jupyter notebook (sentiment analysis with IMDb reviews; requires UDA_pytorch_utils.py from the previous demo)]
Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition]
[(technical) Richard Zhang's fix for max pooling]
[Christopher Olah's "Understanding LSTM Networks"]
Thur Apr 28 HW3 due 1:29pm Adelaide time (corresponds to 11:59pm Wed Apr 27 Pittsburgh time)
Fri Apr 29 Quiz 2 (80 minutes), 4pm-5:20pm Adelaide time