95-865: Unstructured Data Analytics
(Fall 2024 Mini 2)

Lectures:
Note that the current plan is for Section C2 to be recorded.

Recitations (shared across sections): Fridays 2pm-3:20pm, HBH A301

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistants:

Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post so that everyone can see (if you have a question, chances are other people can benefit from the answer as well!).

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why the data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.
Many examples are given for how these methods help solve real problems faced by organizations. Along the way, we encounter many of the most popular methods in analyzing unstructured data, from modern classics in manifold learning, clustering, and topic modeling to some of the latest developments in deep neural networks for analyzing text, images, and time series.

We will be coding lots of Python and dabbling a bit with GPU computing (Google Colab).

Note regarding foundation models (such as Large Language Models): As you are all likely aware, there are now technologies like (Chat)GPT, Gemini, Llama, etc., which will all keep getting better over time. If you use any of these in your homework, please cite them. For the purposes of the class, I will view these as external resources/collaborators. For exams, I want to make sure that you actually understand the material and are not just telling me what someone else or GPT/Gemini/etc. knows. This is important so that in the future, if you use AI technologies to assist you in your data analysis, you have enough background knowledge to check for yourself whether the AI is giving you a correct solution. For this reason, exams this semester will explicitly not allow electronics.

Prerequisite: If you are a Heinz student, then you must have taken 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state what Python courses you have taken/what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading: Homework (30%), Quiz 1 (35%), Quiz 2 (35%*)

*Students with the most instructor-endorsed posts on Piazza will receive a slight bonus at the end of the mini, which will be added directly to their Quiz 2 score (a maximum of 10 bonus points, so that it is possible to get 110 out of 100 points on Quiz 2).

Letter grades are determined based on a curve.

Syllabus: [handout]

Calendar for Sections A2/B2/C2 (tentative)

Previous version of course (including lecture slides and demos): 95-865 Spring 2024 mini 4

Date Topic Supplemental Materials
Part I. Exploratory data analysis
Mon Oct 21/Tue Oct 22 Lecture 1: Course overview, analyzing text using frequencies
[slides]

Please install Anaconda Python 3 and spaCy by following this tutorial (needed for HW1 and the demo next lecture):

[slides]
Note: Anaconda Python 3 includes support for Jupyter notebooks, which we use extensively in this class

HW1 released

Wed Oct 23/Thur Oct 24 Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy)
[slides]
[Jupyter notebook (basic text analysis)]
Fri Oct 25 Recitation slot (over Zoom for this week): Python review
[Jupyter notebook]
Mon Oct 28/Tue Oct 29 Lecture 3: Wrap-up basic text analysis, co-occurrence analysis
[slides]
[Jupyter notebook (basic text analysis using arrays)]
[Jupyter notebook (co-occurrence analysis toy example)]
As we saw in class, PMI is defined in terms of log probabilities. Here's additional reading that provides some intuition on log probabilities (technical):
[Section 1.2 of lecture notes from CMU 10-704 "Information Processing and Learning" Lecture 1 (Fall 2016) discusses "information content" of random outcomes, which are in terms of log probabilities]
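For intuition on how PMI combines log probabilities, here is a minimal sketch rather than the course's own demo code. It assumes probabilities are estimated as the fraction of documents containing a term (or both terms) and uses the natural log; the lecture may use a different co-occurrence window or log base. The function name `pmi` and the toy documents are hypothetical.

```python
import math

def pmi(docs, a, b):
    """Pointwise mutual information of terms a and b.

    Assumption: each document is a set of terms, and probabilities are
    estimated as the fraction of documents containing the term(s).
    """
    n = len(docs)
    p_a = sum(a in d for d in docs) / n
    p_b = sum(b in d for d in docs) / n
    p_ab = sum((a in d) and (b in d) for d in docs) / n
    if p_ab == 0:
        # The terms never co-occur, so log(0) is treated as -infinity.
        return float("-inf")
    # PMI = log P(a, b) - log P(a) - log P(b)
    return math.log(p_ab) - math.log(p_a) - math.log(p_b)

# Hypothetical toy corpus: four tiny "documents" represented as sets of words.
docs = [
    {"apple", "pie"},
    {"apple", "pie"},
    {"apple", "tree"},
    {"oak", "tree"},
]

print(pmi(docs, "apple", "pie"))   # positive: co-occur more than chance predicts
print(pmi(docs, "apple", "tree"))  # negative: co-occur less than chance predicts
```

A positive PMI means the pair appears together more often than it would if the two terms were independent; a negative PMI means less often.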
Wed Oct 30/Thur Oct 31 Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA
[slides]
[Jupyter notebook (text generation using n-grams)]
[Jupyter notebook (PCA)]
Additional reading (technical):
[Abdi and Williams's PCA review]

Supplemental videos:

[StatQuest: PCA main ideas in only 5 minutes!!!]
[StatQuest: Principal Component Analysis (PCA) Step-by-Step (note that this is a more technical introduction than mine using SVD/eigenvalues)]
[StatQuest: PCA - Practical Tips]
[StatQuest: PCA in Python (note that this video is more Pandas-focused whereas 95-865 is taught in a manner that is more numpy-focused to better prep for working with PyTorch later)]
Fri Nov 1 Recitation slot (Lecture 5): PCA (cont'd), manifold learning (Isomap, MDS)
[slides]
[the first demo is actually wrapping up the PCA demo from the previous lecture: Jupyter notebook (PCA)]
[Jupyter notebook (manifold learning)]
Additional reading (technical):
[The original Isomap paper (Tenenbaum et al 2000)]
Mon Nov 4/Tue Nov 5 HW1 due Monday Nov 4, 11:59pm

No class (note: Tue Nov 5 is Democracy Day)
Wed Nov 6/Thur Nov 7 Lecture 6: Manifold learning, intro to clustering
[slides]
[we wrap up the demo from last lecture: Jupyter notebook (manifold learning)]
[Jupyter notebook (PCA and t-SNE with images)***]
***For the demo on PCA and t-SNE with images to work, you will need to install some packages:
pip install torch torchvision

HW2 released (check Canvas)
Python examples for manifold learning:
[scikit-learn example (Isomap, t-SNE, and many other methods)]

Supplemental video:

[StatQuest: t-SNE, clearly explained]

Additional reading (technical):

[some technical slides on t-SNE by George for 95-865]
[Simon Carbonnelle's much more technical t-SNE slides]
[t-SNE webpage]
[clustering: see Section 14.3 of the book "Elements of Statistical Learning"]

New manifold learning method that is promising (PaCMAP):

[paper (Wang et al 2021) (technical)]
[code (github repo)]
Fri Nov 8 Recitation slot: More on PCA and manifold learning
[Jupyter notebook (more on PCA, argsort)]
["How to Use t-SNE Effectively" (Wattenberg et al 2016)]
Mon Nov 11/Tue Nov 12 Lecture 7: Clustering
[slides]
[Jupyter notebook (preprocessing 20 Newsgroups dataset)]
[Jupyter notebook (clustering 20 Newsgroups dataset)]

Quiz 1 review session Tue Nov 12, 7:30pm-8:50pm over Zoom (led by TAs Tanyue and Zekai)
Clustering additional reading (technical):
[same clustering reading suggested in the previous lecture: see Section 14.3 of the book "Elements of Statistical Learning"]

Supplemental video:

[StatQuest: K-means clustering (note: the elbow method is specific to using total variation (i.e., residual sum of squares) as a score function; the elbow method is not always the approach you should use with other score functions)]
Wed Nov 13/Thur Nov 14 Lecture 8: Clustering (cont'd)
[slides]
[we resume the demo from last time: Jupyter notebook (clustering 20 Newsgroups dataset)]
[Jupyter notebook (toy GMM example to show when CH index actually works)]
See the supplemental materials from the previous lecture
Fri Nov 15 Recitation slot: Quiz 1 (80-minute exam); material coverage is up to and including last Friday's (Nov 8) recitation
Mon Nov 18/Tue Nov 19 Lecture 9: Wrap up clustering, topic modeling
[slides]
[Jupyter notebook (clustering on text revisited using TF-IDF, normalizing using Euclidean norm)]
[required reading: Jupyter notebook (clustering with images)]
[Jupyter notebook (topic modeling with LDA)]
Topic modeling reading:
[David Blei's general intro to topic modeling]
[Maria Antoniak's practical guide for using LDA]
Part II. Predictive data analysis
Wed Nov 20/Thur Nov 21 Lecture 10: Wrap up topic modeling, intro to predictive data analysis
[slides]
[Jupyter notebook (LDA: choosing the number of topics)]
Fri Nov 22 Recitation slot (Lecture 11): Wrap up intro to predictive data analysis; intro to neural nets & deep learning
Mon Nov 25/Tue Nov 26 HW2 due Monday Nov 25, 11:59pm

Lecture 12: Wrap up neural net basics; image analysis with convolutional neural nets (also called CNNs or convnets)
Wed Nov 27–Fri Nov 29 No class (Thanksgiving holiday)
Mon Dec 2/Tue Dec 3 Lecture 13: Time series analysis with recurrent neural nets (RNNs)
Wed Dec 4/Thur Dec 5 Lecture 14: Text generation with RNNs and generative pre-trained transformers (GPTs); course wrap-up
Fri Dec 6 Recitation slot: TBD
Mon Dec 9 HW3 due 11:59pm
Fri Dec 13 Quiz 2 (80-minute exam): 1pm-2:20pm, HBH A301

Quiz 2 focuses on material from weeks 4–7. Note that, because of how the course is set up, material from weeks 4–7 naturally relates at times to material from weeks 1–3, so some ideas from the earlier weeks could still show up on Quiz 2. Please focus your studying on material from weeks 4–7.