95-865: Unstructured Data Analytics
(Fall 2021 Mini 2)

Lectures:

Section A2: Tuesdays and Thursdays 4:40pm-6:00pm, HBH 1002
Section B2: Mondays and Wednesdays 1:25pm-2:45pm, HBH 2008

Note that the current plan is for Section A2 to be recorded.

Recitations: Fridays 1:25pm-2:45pm, HBH A301

Instructor: George Chen (email: georgechen ♣ cmu.edu); replace "♣" with the "at" symbol

Teaching assistants:

Tess Niewood (tniewood ♣ andrew.cmu.edu)
Varun Jayachandran (varunjay ♣ andrew.cmu.edu)
Xuejian Wang (xuejianw ♣ andrew.cmu.edu)
Yuanpei Wang (yuanpeiw ♣ andrew.cmu.edu)

Office hours (starting second week of class):
The TA office hours are all virtual; see the course's Canvas homepage for the Zoom links.

George: Tuesdays 6:10pm-7:10pm, in-person HBH 2216, and also by appointment
Tess: Mondays 7:15pm-8:15pm, virtual
Varun: Thursdays 2pm-3pm, virtual
Xuejian Wang: Wednesdays 9am-10am, virtual
Yuanpei Wang: Mondays 5pm-6pm, virtual

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often called "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

Many examples show how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods for analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will write a lot of Python code and dabble a bit in GPU computing (via Google Colab).

Prerequisite: If you are a Heinz student, then you must have either (1) passed the Heinz Python exemption exam, or (2) taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework (30%), Quiz 1 (35%), Quiz 2 (35%*)

*Students with the most instructor-endorsed answers on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their quiz 2 score (a maximum of 5 bonus points; quiz 2 is out of 100 points prior to any bonus points being added).

Syllabus: [handout]

Calendar (subject to revision)

Previous version of course (including lecture slides and demos): 95-865 Spring 2021 mini 4

Date	Topic	Supplemental Material
Part I. Exploratory data analysis
Mon Oct 18 Tue Oct 19	Lecture 1: Course overview, analyzing text using frequencies [slides]. Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture): [slides]. Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class. HW1 released (check Canvas).
Wed Oct 20 Thur Oct 21	Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy), co-occurrence analysis [slides] [Jupyter notebook (basic text analysis)] [Jupyter notebook (co-occurrence analysis)]	What is the maximum phi-squared/chi-squared value? (technical) [stack exchange answer]
Fri Oct 22	Recitation: Basic Python review [Jupyter notebook]
Mon Oct 25 Tue Oct 26	Lecture 3: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA [slides] [Jupyter notebook (co-occurrence analysis; same demo as posted for Lecture 2)]	PCA additional reading (technical): [Abdi and Williams's PCA review]
Wed Oct 27 Thur Oct 28	Lecture 4: PCA (cont'd), manifold learning [slides] [Jupyter notebook (PCA)] [Jupyter notebook (manifold learning)] HW1 due 11:59pm on Thursday Oct 28	Python examples for dimensionality reduction: [scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]
Fri Oct 29	Recitation: More on PCA, argsort [Jupyter notebook]
Mon Nov 1 Tue Nov 2	Lecture 5: Manifold learning (cont'd), clustering [slides] [Jupyter notebook (manifold learning; same demo as from previous lecture)] [required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)] [Jupyter notebook (t-SNE with images)*] [Jupyter notebook (PCA, t-SNE, clustering with drug data)] *For the demo on t-SNE with images to work, you will need to install some packages: `pip install torch torchvision` HW2 released	Some technical details for t-SNE: [slides] Some reading for t-SNE (technical): [Simon Carbonnelle's t-SNE slides] [t-SNE webpage]
Wed Nov 3 Thur Nov 4	Lecture 6: Clustering (cont'd) — k-means, GMMs [slides] [Jupyter notebook (PCA, t-SNE, clustering with drug data; same demo as from previous lecture)]	Clustering additional reading (technical): [see Section 14.3 of the book "Elements of Statistical Learning" on clustering]
Fri Nov 5	No class: CMU break day for community engagement
Mon Nov 8 Tue Nov 9	Lecture 7: Clustering (cont'd) — interpreting GMMs, automatically selecting the number of clusters [slides] [Jupyter notebook (PCA, t-SNE, clustering with drug data; same demo as from previous lecture)]
Wed Nov 10 Thur Nov 11	Lecture 8: Topic modeling [slides] [Jupyter notebook (topic modeling with LDA)]	Topic modeling reading: [David Blei's general intro to topic modeling] [(technical; requires prior deep learning knowledge) Topic Modelling Meets Deep Neural Networks: A Survey]
Fri Nov 12	Recitation: More on topic models, and clustering on images (including strategies on how to make sense of the clusters) [Jupyter notebook]
Part II. Predictive data analysis
Mon Nov 15 Tue Nov 16	Lecture 9: Wrap up clustering; intro to predictive data analysis [slides] [Jupyter notebook (PCA, t-SNE, clustering with drug data; same demo as from previous clustering lectures)]	Reading on DP-means and DP mixture models, of which a DP-GMM is a special case (technical): [Revisiting k-means: New Algorithms via Bayesian Nonparametrics]
Wed Nov 17 Thur Nov 18	Lecture 10: Hyperparameter tuning, decision trees & forests, more on classifier evaluation [slides] [Jupyter notebook (prediction and model validation)] HW2 due 11:59pm on Thursday Nov 18	Some nuanced details on cross-validation (technical): [Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data] [Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)] [Bias and variance as we change the number of folds in k-fold cross-validation]
Fri Nov 19	Quiz 1 (80-minute exam) during recitation slot (in-person: 1:25pm-2:45pm HBH A301)
Mon Nov 22 Tue Nov 23	Lecture 11: Wrap up classifier evaluation; neural nets & deep learning [slides] HW3 released	Mike Jordan's Medium article on where AI is at (April 2018): ["Artificial Intelligence - The Revolution Hasn't Happened Yet"] PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU): [PyTorch tutorial] Additional reading: [Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning] Video introduction on neural nets: ["But what is a neural network? | Chapter 1, deep learning" by 3Blue1Brown]
Wed Nov 24 Thur Nov 25 Fri Nov 26	No class: Thanksgiving break 🦃
Mon Nov 29 Tue Nov 30	Lecture 12: Wrap up neural net basics; image analysis with convolutional neural nets (also called CNNs or convnets) [slides] For the neural net demo below to work, you will need to install some packages: `pip install torch torchvision torchaudio` `pip install torchsummaryX` `pip install pytorch-nlp` [Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)] Be sure to edit two pytorch-nlp files as indicated in the following slides (resolves some issues with recent updates to PyTorch & spaCy): [slides]	Additional reading: [Stanford CS231n Convolutional Neural Networks for Visual Recognition] [(technical) Richard Zhang's fix for max pooling]
Wed Dec 1 Thur Dec 2	Lecture 13: Time series analysis with recurrent neural nets [slides] [Jupyter notebook (sentiment analysis with IMDb reviews; requires UDA_pytorch_utils.py from the previous demo)]	Additional reading: [Christopher Olah's "Understanding LSTM Networks"]
Fri Dec 3	Recitation slot (in-person): Lecture 14 — Additional deep learning topics; course wrap-up [slides] [Captum tutorial]	Additional reading: [A tutorial on word2vec word embeddings] [A tutorial on BERT word embeddings] [Belkin et al's 2019 paper on "Reconciling modern machine learning practice and the bias-variance trade-off"]
Mon Dec 6	HW3 due 11:59pm
Fri Dec 10	Quiz 2 (80-minute exam): 8:30am HBH A301 (in-person)
