 
      
Lectures:

Note that the current plan is for Section C2 to be recorded.
Recitations (shared across sections): Fridays 2pm-3:20pm, HBH A301
Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol
Teaching assistants:
Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.
Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).
Companies, governments, and other organizations now collect massive amounts of data, such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often called "unstructured". This course takes a practical two-step approach to unstructured data analysis:

1. Exploratory data analysis: find structure in the data (e.g., frequency and co-occurrence analysis, dimensionality reduction, clustering, topic modeling).
2. Predictive data analysis: use the structure found to make predictions (e.g., with neural nets and deep learning).

We will code a lot of Python and dabble a bit in GPU computing (via Google Colab).
Prerequisite: If you are a Heinz student, then you must have taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.
Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more
Grading: Homework (30%), Quiz 1 (35%), Quiz 2 (35%*)
*Students with the most instructor-endorsed posts on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their Quiz 2 score (a maximum of 10 bonus points, so that it is possible to get 110 out of 100 points on Quiz 2).
Letter grades are determined based on a curve.
Previous version of course (including lecture slides and demos): 95-865 Spring 2025 mini 4
| Date | Topic | Supplemental Materials | 
|---|---|---|
| Part I. Exploratory data analysis | ||
| Week 1 | ||
| Mon Oct 20/Tue Oct 21 | Lecture 1: Course overview, analyzing text using frequencies [slides] | |
| Wed Oct 22/Thur Oct 23 | Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy) [lecture slides] [slides on how to install Anaconda Python 3 and spaCy (needed for HW1 and lecture demos)] Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class [Jupyter notebook (basic text analysis)] HW1 released (check Canvas) | |
| Fri Oct 24 | Recitation slot: Lecture 3 — Basic text analysis (cont'd), co-occurrence analysis [slides] [Jupyter notebook (basic text analysis using arrays)] [Jupyter notebook (co-occurrence analysis toy example)] | As we saw in class, PMI is defined in terms of log probabilities. Here's additional reading that provides some intuition on log probabilities (technical): [Section 1.2 of lecture notes from CMU 10-704 "Information Processing and Learning" Lecture 1 (Fall 2016) discusses "information content" of random outcomes, which are in terms of log probabilities] | 
| Week 2 | ||
| Mon Oct 27/Tue Oct 28 | Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA [slides] [Jupyter notebook (text generation using n-grams)] | Additional reading (technical): [Abdi and Williams's PCA review] Supplemental videos: [StatQuest: PCA main ideas in only 5 minutes!!!] [StatQuest: Principal Component Analysis (PCA) Step-by-Step (note that this is a more technical introduction than mine using SVD/eigenvalues)] [StatQuest: PCA - Practical Tips] [StatQuest: PCA in Python (note that this video is more Pandas-focused whereas 95-865 is taught in a manner that is more numpy-focused to better prep for working with PyTorch later)] | 
| Wed Oct 29/Thur Oct 30 | Lecture 5: PCA (cont'd), manifold learning (Isomap, MDS) [slides] [Jupyter notebook (PCA)] | Additional reading (technical): [The original Isomap paper (Tenenbaum et al 2000)] Python examples for manifold learning: [scikit-learn example (Isomap, t-SNE, and many other methods)] Supplemental video: [StatQuest: t-SNE, clearly explained] Additional reading (technical): [some technical slides on t-SNE by George for 95-865] [Simon Carbonnelle's much more technical t-SNE slides] [t-SNE webpage] | 
| Fri Oct 31 | Recitation slot: More on dimensionality reduction | |
| Week 3 | ||
| Mon Nov 3/Tue Nov 4 | HW1 due Monday Nov 3, 11:59pm. No lecture on Mon Nov 3/Tue Nov 4 (this is to keep the three sections of the class synced and to account for one of them not being held due to CMU's observance of Democracy Day) | |
| Wed Nov 5/Thur Nov 6 | Lecture 6: Wrap up manifold learning, intro to clustering | |
| Fri Nov 7 | Recitation slot: Lecture 7 — Clustering | |
| Week 4 | ||
| Mon Nov 10/Tue Nov 11 | Lecture 8: Clustering (cont'd) | |
| Wed Nov 12/Thur Nov 13 | Lecture 9: Wrap up clustering, topic modeling | |
| Fri Nov 14 | Recitation slot: Quiz 1 (80-minute exam) — material coverage is up to and including last Friday's (Nov 7) recitation | |
| Part II. Predictive data analysis | ||
| Week 5 | ||
| Mon Nov 17/Tue Nov 18 | Lecture 10: Wrap up topic modeling; intro to predictive data analysis | |
| Wed Nov 19/Thur Nov 20 | Lecture 11: Wrap up intro to predictive data analysis; intro to neural nets & deep learning | |
| Fri Nov 21 | Recitation slot: Some key concepts for prediction | |
| Week 6 | ||
| Mon Nov 24/Tue Nov 25 | HW2 due Monday Nov 24, 11:59pm. Lecture 12: Wrap up neural net basics; image analysis with convolutional neural nets (also called CNNs or convnets) | |
| Wed Nov 26—Fri Nov 28 | No class (Thanksgiving holiday) | |
| Week 7 | ||
| Mon Dec 1/Tue Dec 2 | Lecture 13: Time series analysis with recurrent neural nets (RNNs) | |
| Wed Dec 3/Thur Dec 4 | Lecture 14: Text generation with generative pretrained transformers; course wrap-up | |
| Fri Dec 5 | Recitation slot: TBD | |
| Final exam week | ||
| Mon Dec 8 | HW3 due 11:59pm | |
| Fri Dec 12 | Quiz 2 (80-minute exam): 1pm-2:20pm, location TBA. Quiz 2 focuses on material from weeks 4-7 (by how the course is set up, material from weeks 4-7 naturally relates at times to material from weeks 1-3, so some ideas from those earlier weeks could still show up on Quiz 2; please focus your studying on material from weeks 4-7) | |
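As a supplement to the co-occurrence analysis material from Lecture 3: PMI is defined in terms of log probabilities, PMI(a, b) = log[ P(a, b) / (P(a) P(b)) ]. Below is a minimal sketch of how PMI can be estimated from document-level co-occurrence counts. This is not the course's own demo code; the toy corpus and the presence-based counting scheme are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus (illustrative only): each document is a list of words.
docs = [
    ["data", "science", "is", "fun"],
    ["data", "analysis", "is", "useful"],
    ["science", "analysis"],
]

# Count, per document, which words appear (presence, not raw frequency).
word_counts = Counter()
pair_counts = Counter()
for doc in docs:
    words = set(doc)
    word_counts.update(words)
    for a in words:
        for b in words:
            if a < b:  # store each unordered pair once
                pair_counts[(a, b)] += 1

n = len(docs)

def pmi(a, b):
    """PMI(a, b) = log[ P(a, b) / (P(a) P(b)) ], with probabilities
    estimated from document-level co-occurrence counts."""
    a, b = sorted((a, b))
    p_ab = pair_counts[(a, b)] / n
    p_a = word_counts[a] / n
    p_b = word_counts[b] / n
    return math.log(p_ab / (p_a * p_b))

print(round(pmi("data", "is"), 3))  # prints 0.405, i.e., log(3/2)
```

A positive PMI means two words co-occur more often than independence would predict; real analyses typically also smooth counts or threshold rare pairs, which this sketch omits.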
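As a supplement to the PCA material from Lectures 4 and 5: since the course introduces PCA using SVD and is taught numpy-first, here is a minimal numpy sketch of PCA via SVD. This is not the course's own notebook; the tiny random dataset is an illustrative assumption.

```python
import numpy as np

# Hypothetical tiny dataset (illustrative only): 5 points in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Step 1: center each feature by subtracting its column mean.
X_centered = X - X.mean(axis=0)

# Step 2: SVD of the centered data; rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: project onto the top-k principal components.
k = 2
X_pca = X_centered @ Vt[:k].T

print(X_pca.shape)  # prints (5, 2)
```

The squared singular values S**2 are proportional to the variance captured by each component, which is how "fraction of variance explained" plots are typically computed.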