Lectures:
Note that the current plan is for Section B4 to be recorded.
Recitations (A4/B4): Fridays 5pm-6:20pm, HBH A301
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants:
Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis: exploratory data analysis (Part I of the schedule below) followed by predictive data analysis (Part II).
We will be coding lots of Python and dabbling a bit with GPU computing (Google Colab).
Prerequisite: If you are a Heinz student, then you must have taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.
Grading: Homework (30%), Quiz 1 (35%), Quiz 2 (35%*)
*Students with the most instructor-endorsed posts on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their Quiz 2 score (a maximum of 10 bonus points, so that it is possible to get 110 out of 100 points on Quiz 2).
Letter grades are determined based on a curve.
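To make the weighting above concrete, here is a minimal sketch of how a pre-curve course score could be computed. It assumes that each component is scored out of 100 and that the Piazza bonus is added to the Quiz 2 score before the 35% weight is applied; the function name `course_score` and its parameters are hypothetical, and the curve itself is not modeled.

```python
def course_score(homework, quiz1, quiz2, piazza_bonus=0.0):
    """Weighted course score before curving (illustrative sketch only).

    Assumes every component is on a 0-100 scale. The Piazza bonus is
    capped at 10 points and added directly to the Quiz 2 score, so the
    Quiz 2 component can reach 110.
    """
    bonus = min(max(piazza_bonus, 0.0), 10.0)  # clamp bonus to [0, 10]
    quiz2_with_bonus = quiz2 + bonus
    return 0.30 * homework + 0.35 * quiz1 + 0.35 * quiz2_with_bonus


if __name__ == "__main__":
    # Hypothetical student: 90 on homework, 80 on Quiz 1, 85 on Quiz 2,
    # plus 4 bonus points from instructor-endorsed Piazza posts.
    print(round(course_score(90, 80, 85, piazza_bonus=4), 2))
```

Because letter grades are curved, a score computed this way only indicates relative standing, not a specific letter grade.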
Previous version of course (including lecture slides and demos): 95-865 Fall 2024 mini 2
Date | Topic | Supplemental Materials |
---|---|---|
Part I. Exploratory data analysis | ||
Week 1 | ||
Mon Mar 10 | Lecture 1: Course overview, analyzing text using frequencies [slides] Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture): [slides] Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class | |
Tue Mar 11 | HW1 released | |
Wed Mar 12 | Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy) [slides] [Jupyter notebook (basic text analysis)] | |
Fri Mar 14 | Recitation slot, Lecture 3: Basic text analysis (cont'd), co-occurrence analysis [slides] [Jupyter notebook (basic text analysis using arrays)] [Jupyter notebook (co-occurrence analysis toy example)] | As we saw in class, PMI is defined in terms of log probabilities. Here's additional reading that provides some intuition on log probabilities (technical): [Section 1.2 of lecture notes from CMU 10-704 "Information Processing and Learning" Lecture 1 (Fall 2016) discusses the "information content" of random outcomes, which is expressed in terms of log probabilities] |
Week 2 | ||
Mon Mar 17 | Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA [slides] [Jupyter notebook (text generation using n-grams)] [Jupyter notebook (PCA)] | Additional reading (technical): [Abdi and Williams's PCA review] Supplemental videos: [StatQuest: PCA main ideas in only 5 minutes!!!] [StatQuest: Principal Component Analysis (PCA) Step-by-Step (note that this is a more technical introduction than mine using SVD/eigenvalues)] [StatQuest: PCA - Practical Tips] [StatQuest: PCA in Python (note that this video is more Pandas-focused whereas 95-865 is taught in a manner that is more numpy-focused to better prep for working with PyTorch later)] |
Wed Mar 19 | Lecture 5: PCA (cont'd), manifold learning (Isomap, MDS) [slides] [the first demo is actually wrapping up the PCA demo from the previous lecture: Jupyter notebook (PCA)] [Jupyter notebook (manifold learning)] | Additional reading (technical): [The original Isomap paper (Tenenbaum et al 2000)] Python examples for manifold learning: [scikit-learn example (Isomap, t-SNE, and many other methods)] Supplemental video: [StatQuest: t-SNE, clearly explained] Additional reading (technical): [some technical slides on t-SNE by George for 95-865] [Simon Carbonnelle's much more technical t-SNE slides] [t-SNE webpage] |
Fri Mar 21 | Recitation slot: More on dimensionality reduction [slides (how to save a Jupyter notebook as PDF)] [Jupyter notebook (more on PCA, argsort)] ["How to Use t-SNE Effectively" (Wattenberg et al 2016)] [Jupyter notebook (analyzing the 20 Newsgroups dataset)] | |
Week 3 | ||
Mon Mar 24 | HW1 due Monday Mar 24, 11:59pm. Lecture 6: Wrap up manifold learning, intro to clustering [slides] [Jupyter notebook (PCA and t-SNE with images)***] ***For the demo on PCA and t-SNE with images to work, you will need to install some packages: pip install torch torchvision | New manifold learning method that is promising (PaCMAP): [paper (Wang et al 2021) (technical)] [code (GitHub repo)] Additional reading on clustering (technical): [see Section 14.3 of the book "Elements of Statistical Learning"] Supplemental video: [StatQuest: K-means clustering (note: the elbow method is specific to using total variation (i.e., residual sum of squares) as a score function; the elbow method is not always the approach you should use with other score functions)] |
Tue Mar 25 | Quiz 1 review session: 7pm-8:30pm over Zoom (check Canvas -> Zoom for the link) | |
Wed Mar 26 | Lecture 7: Clustering [slides] [Jupyter notebook (preprocessing 20 Newsgroups dataset)] [Jupyter notebook (clustering 20 Newsgroups dataset)] | Clustering additional reading (technical): [same clustering reading suggested in the previous lecture: see Section 14.3 of the book "Elements of Statistical Learning"] |
Fri Mar 28 | Recitation slot: Quiz 1 (80-minute exam) — material coverage is up to and including Mon Mar 24's lecture (i.e., Lecture 6) | |
Week 4 | ||
Mon Mar 31 | Lecture 8: Clustering (cont'd) [slides] [we resume the demo from last time: Jupyter notebook (clustering 20 Newsgroups dataset)] [Jupyter notebook (toy GMM example to show when CH index actually works)] | Same supplemental materials as the previous lecture |
Wed Apr 2 | Lecture 9: Wrap up clustering, topic modeling [slides] [Jupyter notebook (clustering on text revisited using TF-IDF, normalizing using Euclidean norm)] [required reading: Jupyter notebook (clustering with images)] [Jupyter notebook (topic modeling with LDA)] | Topic modeling reading: [David Blei's general intro to topic modeling] [Maria Antoniak's practical guide for using LDA] |
Fri Apr 4 | No class (CMU Spring Carnival) 🎪 | |
Part II. Predictive data analysis | ||
Week 5 | ||
Mon Apr 7 | Lecture 10: Wrap up topic modeling; intro to predictive data analysis [slides] [Jupyter notebook (LDA: choosing the number of topics)] | |
Wed Apr 9 | Lecture 11: Wrap up intro to predictive data analysis; intro to neural nets & deep learning [slides] [Jupyter notebook (prediction and model validation)] | |
Fri Apr 11 | Recitation slot: Some key concepts for prediction [slides] [Jupyter notebook] | |
Week 6 | ||
Mon Apr 14 | HW2 due Monday Apr 14, 11:59pm. Lecture 12: Wrap up neural net basics; image analysis with convolutional neural nets (also called CNNs or convnets) [slides] For the neural net demo below to work, you will need to install some packages: pip install torch torchvision torchaudio torchtext torchinfo [Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)] | PyTorch tutorial (at the very least, go over the first page of this tutorial to familiarize yourself with going between NumPy arrays and PyTorch tensors, and also understand the basic explanation of how tensors can reside on either the CPU or a GPU): [PyTorch tutorial] Additional reading on basic neural networks: [Chapter 1 "Using neural nets to recognize handwritten digits" of the book Neural Networks and Deep Learning] Video introduction on neural nets: ["But what *is* a neural network? | Chapter 1, deep learning" by 3Blue1Brown] StatQuest series of videos on neural nets and deep learning: [YouTube playlist (note: there are a lot of videos in this playlist, some of which go into more detail than you're expected to know for 95-865; make sure that you understand concepts at the level of how they are presented in 95-865 lectures/recitations)] Supplemental reading and video for convolutional neural networks (CNNs): [Stanford CS231n Convolutional Neural Networks for Visual Recognition] [(technical) Richard Zhang's fix for max pooling] The StatQuest YouTube playlist above also includes a video on CNNs |
Wed Apr 16 | Lecture 13: Time series analysis with recurrent neural nets (RNNs) [slides] [we resume the demo from last time: Jupyter notebook (handwritten digit recognition with neural nets; be sure to scroll to the bottom to download UDA_pytorch_utils.py)] | See the supplemental materials from the previous lecture; note that the StatQuest neural net and deep learning YouTube playlist listed there includes a video on RNNs |
Fri Apr 18 | Recitation slot: Gradient descent, more on RNNs and time series analysis + how to use Colab [slides] [Jupyter notebook (Colab intro; meant to be run in Colab)] | |
Week 7 | ||
Mon Apr 21 | Lecture 14: Text generation with generative pretrained transformers (GPTs) | |
Wed Apr 23 | Lecture 15: Other deep learning topics; course wrap-up | |
Fri Apr 25 | Recitation slot: TBD | |
Final exam week | ||
Mon Apr 28 | HW3 due 11:59pm | |
Fri May 2 | Quiz 2 (80-minute exam): 1pm-2:20pm, location TBD. Quiz 2 focuses on material from Wed Mar 26's lecture (Lecture 7) onwards. Because of how the course is set up, material from Lecture 7 onwards at times naturally relates to material from Lectures 1–6, so some ideas from these earlier lectures could still show up on Quiz 2; even so, please focus your studying on material from Lecture 7 onwards. | |
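
The Lecture 3 supplemental note in the schedule above mentions that PMI (pointwise mutual information) is defined in terms of log probabilities. As a rough illustration, the sketch below computes PMI(a, b) = log2( P(a, b) / (P(a) P(b)) ) on a toy dataset. It assumes probabilities are estimated as the fraction of documents containing a term (or both terms) and uses log base 2; the course demos may count co-occurrences differently, and the function name `pmi_table` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def pmi_table(documents):
    """Return PMI (in bits) for each pair of distinct terms that co-occur.

    `documents` is a list of token lists. P(a) is estimated as the fraction
    of documents containing a, and P(a, b) as the fraction containing both.
    """
    n_docs = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        terms = sorted(set(doc))  # count each term at most once per document
        term_counts.update(terms)
        for i, a in enumerate(terms):
            for b in terms[i + 1:]:
                pair_counts[(a, b)] += 1  # keys are alphabetically ordered pairs

    table = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_docs
        p_a = term_counts[a] / n_docs
        p_b = term_counts[b] / n_docs
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


if __name__ == "__main__":
    toy_docs = [
        ["apple", "pie"],
        ["apple", "pie"],
        ["apple", "tart"],
        ["banana", "bread"],
    ]
    for pair, score in sorted(pmi_table(toy_docs).items()):
        print(pair, round(score, 3))
```

A positive PMI means the two terms appear together more often than they would if they occurred independently; pairs that never co-occur are omitted here because their PMI would be negative infinity.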