Lectures:
Note that the current plan is for Section A2 to be recorded.
Recitations: Fridays 1:25pm-2:45pm, HBH A301
Instructor: George Chen (email: georgechen ♣ cmu.edu; replace "♣" with the "at" symbol)
Teaching assistants:
Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often called "unstructured". This course takes a practical two-step approach to unstructured data analysis: first, we use exploratory data analysis to uncover structure in the data (Part I of the course); then, we use the structure found to make predictions (Part II).
We will be writing lots of Python and dabbling a bit with GPU computing (Google Colab).
Prerequisite: If you are a Heinz student, then you must have taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: math at the level of calculus and linear algebra may help you appreciate some of the material more fully.
Grading: Homework (30%), Quiz 1 (35%), Quiz 2 (35%*)
*Students with the most instructor-endorsed posts on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their Quiz 2 score (a maximum of 10 bonus points, so that it is possible to get 110 out of 100 points on Quiz 2).
Letter grades are determined based on a curve.
Syllabus: [pdf (updated Tue Oct 25)]
Previous version of course (including lecture slides and demos): 95-865 Spring 2022 mini 4
Date | Topic | Supplemental Material |
---|---|---|
Part I. Exploratory data analysis | ||
Tue Oct 25 |
Lecture 1: Course overview, analyzing text using frequencies
[slides]
Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture): [slides]
Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class.
HW1 released (check Canvas) |
|
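As a taste of the frequency analysis in Lecture 1, here is a minimal sketch using only Python's standard library (the lecture demo uses spaCy for proper tokenization; the toy sentence below is made up for illustration):

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"
tokens = text.lower().split()  # crude whitespace tokenization; spaCy does this far better
counts = Counter(tokens)

print(counts.most_common(2))  # [('the', 3), ('sat', 2)]
```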
Thur Oct 27 |
Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy), co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis)]
[Jupyter notebook (co-occurrence analysis)]
Your TAs Isabella and Sumedh will be running an optional Python review session at 7pm-8pm (check Canvas for the Zoom link) |
|
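Co-occurrence analysis boils down to counting how often pairs of words appear together. A minimal sketch on made-up toy documents (the actual demo works with real text and spaCy):

```python
from collections import Counter
from itertools import combinations

docs = [
    ["apple", "banana", "cherry"],
    ["apple", "banana"],
    ["banana", "cherry"],
]

co_counts = Counter()
for doc in docs:
    # count each unordered pair of distinct words appearing in the same document
    for pair in combinations(sorted(set(doc)), 2):
        co_counts[pair] += 1

print(co_counts[("apple", "banana")])  # 2 (they co-occur in the first two documents)
```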
Fri Oct 28 | No class (Tartan Community Day) | |
Tue Nov 1 |
Lecture 3: Co-occurrence analysis (cont'd), visualizing high-dimensional data
[slides] We continue to use the co-occurrence analysis demo from the previous lecture |
|
Thur Nov 3 |
Lecture 4: PCA, manifold learning
[slides] [Jupyter notebook (PCA)] |
Additional reading (technical):
[Abdi and Williams's PCA review] |
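PCA can be sketched in a few lines of NumPy: center the data, then project onto the top right singular vectors (the lecture demo uses scikit-learn; the random data here is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 points in 5 dimensions

# center the data, then use SVD to find the principal component directions
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T  # project onto the top 2 principal components

print(X_2d.shape)  # (100, 2)
```

The first projected coordinate always captures at least as much variance as the second, since singular values come back sorted in decreasing order.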
Fri Nov 4 |
Recitation slot: Lecture 5 — Manifold learning (cont'd)
[slides] [Jupyter notebook (manifold learning)] [required reading: "How to Use t-SNE Effectively" (Wattenberg et al, Distill 2016)] |
Python examples for dimensionality reduction:
[scikit-learn example (PCA, Isomap, t-SNE, and many other methods)]
Additional reading (technical):
[The original Isomap paper (Tenenbaum et al 2000)]
[some technical slides on t-SNE by George for 95-865]
[Simon Carbonnelle's much more technical t-SNE slides]
[t-SNE webpage]
New manifold learning method that is promising (PaCMAP):
[paper (Wang et al 2021) (technical)]
[code (github repo)] |
Tue Nov 8 |
Lecture 6: Dimensionality reduction for images, intro to clustering
[slides]
[Jupyter notebook (dimensionality reduction with images)***]
***For the demo on t-SNE with images to work, you will need to install some packages: pip install torch torchvision
[Jupyter notebook (dimensionality reduction and clustering with drug data)]
HW1 due 11:59pm |
Clustering additional reading (technical): [see Section 14.3 of the book "Elements of Statistical Learning" on clustering] |
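For the clustering intro, k-means (Lloyd's algorithm) can be sketched directly in NumPy by alternating assignment and mean-update steps; the demo itself uses scikit-learn, and the two well-separated blobs below are synthetic toy data:

```python
import numpy as np

def kmeans(X, k, n_iters=50, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of the points assigned to it
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# two well-separated 2D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
```

With blobs this far apart, the two clusters found should coincide with the two blobs (up to label swapping).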
Thur Nov 10 |
Lecture 7: Clustering (cont'd)
[slides] We continue using the same demo from last time: [Jupyter notebook (dimensionality reduction and clustering with drug data)] |
See supplemental clustering reading posted for previous lecture |
Fri Nov 11 |
Recitation: Clustering on images, and more details on PCA
[Jupyter notebook] |
|
Tue Nov 15 |
Lecture 8: Clustering (cont'd), topic modeling
[slides] [Jupyter notebook (topic modeling with LDA)] |
Topic modeling reading:
[David Blei's general intro to topic modeling] |
Part II. Predictive data analysis | ||
Thur Nov 17 |
Lecture 9: Topic modeling (cont'd), intro to predictive data analysis
[slides] |
Some nuanced details on cross-validation (technical):
[Andrew Ng's article Preventing "Overfitting" of Cross-Validation Data]
[Braga-Neto and Dougherty's article Is cross-validation valid for small-sample microarray classification? (this article applies more generally rather than only to microarray data from biology)]
[Bias and variance as we change the number of folds in k-fold cross-validation] |
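The cross-validation idea from lecture, as a minimal sketch: shuffle, split the data into k folds, and let each fold take a turn as the validation set (the model fitting itself is left as a placeholder):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k roughly equal folds after shuffling."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    return np.array_split(idx, k)

# sketch of the CV loop: each fold takes a turn as the validation set
folds = k_fold_indices(n=10, k=5)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # ... fit the model on train_idx, score it on val_idx, then average the k scores ...
```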
Fri Nov 18 | Quiz 1 (80-minute exam) | |
Tue Nov 22 |
Lecture 10: Hyperparameter tuning, decision trees & forests, classifier evaluation
[slides] [Jupyter notebook (prediction and model validation)] HW2 due 11:59pm |
|
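Classifier evaluation in Lecture 10 centers on metrics computed from a confusion matrix; a minimal sketch on toy binary predictions (the labels below are made up for illustration):

```python
# toy predictions vs. ground truth for a binary classifier
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# tally the four cells of the confusion matrix
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the predicted positives, how many were right?
recall = tp / (tp + fn)     # of the actual positives, how many did we catch?
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```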
Thur Nov 24 & Fri Nov 25 | No class (Thanksgiving) | |
Tue Nov 29 |
Lecture 11: Intro to neural nets & deep learning
[slides] For the neural net demo below to work, you will need to install some packages: pip install torch torchvision torchaudio torchtext torchinfo
[Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)] |
PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and to understand the basic explanation of how tensors can reside on either the CPU or a GPU): [PyTorch tutorial]
Additional reading:
[Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning]
Video introduction on neural nets:
["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown] |
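Before the PyTorch demo, it may help to see what a one-hidden-layer network computes, sketched in plain NumPy (random weights and a toy input, purely for illustration; the actual demos build and train networks in PyTorch):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer: 3 inputs -> 4 units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)  # output layer: 4 units -> 2 classes

x = np.array([1.0, -0.5, 2.0])
hidden = relu(W1 @ x + b1)                      # linear transform + nonlinearity
logits = W2 @ hidden + b2
probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the 2 classes

print(probs.sum())  # 1.0
```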
Thur Dec 1 |
Lecture 12: Image analysis with convolutional neural nets (also called CNNs or convnets)
[slides] We continue using the demo from the previous lecture |
Additional reading:
[Stanford CS231n Convolutional Neural Networks for Visual Recognition] [(technical) Richard Zhang's fix for max pooling] |
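The core operation in a convnet layer is sliding a small filter over an image; a minimal NumPy sketch of valid-mode 2D convolution (really cross-correlation, which is what deep learning libraries implement; the tiny image and filter are toy examples):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, as used in convnet layers."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            # dot product of the filter with the image patch under it
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
edge_kernel = np.array([[1.0, -1.0]])             # horizontal difference filter
out = conv2d(image, edge_kernel)
print(out.shape)  # (4, 3)
```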
Fri Dec 2 | Recitation: More on classifier evaluation
[slides] [Jupyter notebook] |
|
Tue Dec 6 |
Lecture 13: Time series analysis with recurrent neural nets (RNNs)
[slides] [Jupyter notebook (sentiment analysis with IMDb reviews; requires UDA_pytorch_utils.py from the previous demo)] |
Additional reading:
[Christopher Olah's "Understanding LSTM Networks"] |
Thur Dec 8 |
Lecture 14: Additional deep learning topics and course wrap-up
[slides] |
Software for explaining neural nets:
[Captum]
Additional reading:
[A tutorial on word2vec word embeddings]
[A tutorial on BERT word embeddings]
["Understanding Deep Learning (Still) Requires Rethinking Generalization" (Zhang et al 2021)]
["Reconciling modern machine learning practice and the bias-variance trade-off" (Belkin et al 2019)] |
Fri Dec 9 | Recitation slot: Quiz 2 review | |
Mon Dec 12 | HW3 due 11:59pm | |
Fri Dec 16 | Quiz 2 (80-minute exam): 1pm-2:20pm HBH A301 |