94-775: Practical Unstructured Data Analytics
(Spring 2025 Mini 4; listed as 94-475 for undergrads)


Class time and location:

Instructor: George Chen (email: georgechen ♣ cmu.edu) ‐ replace "♣" with the "at" symbol

Teaching assistant: Johnna Sundberg (jsundber ♣ andrew.cmu.edu)

Office hours (starting second week of class): Check the course Canvas homepage for the office hour times and locations.

Contact: Please use Piazza (follow the link to it within Canvas) and, whenever possible, post publicly so that everyone can see your question; chances are that other people will benefit from the answer as well.

Course Description

Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video. How do we turn this heterogeneous mess of data into actionable insights? A common problem is that we often do not know ahead of time what structure underlies the data, which is why such data are often referred to as "unstructured". This course takes a practical, two-step approach to unstructured data analysis:

  1. We first examine how to identify possible structure present in the data via visualization and other exploratory methods.
  2. Once we have clues for what structure is present in the data, we turn toward exploiting this structure to make predictions.

Many examples are given of how these methods help solve real problems faced by organizations. The course includes a final project, which must address a policy question.

Note regarding GenAI (including large language models): As you are likely aware, there are now technologies such as (Chat)GPT, Gemini, Claude, Llama, DeepSeek, etc., all of which will keep getting better over time. If you use any of these in your homework, please cite them. For the purposes of the class, I will view these as external collaborators (no different than if you got help from a human friend). For exams, I want to make sure that you actually understand the material and are not just telling me what someone else or an AI assistant knows. This is important so that in the future, if you get help from an AI assistant (or a human) with your unstructured data analysis, you have enough background knowledge to check for yourself whether the solution the AI (or human) gives you is correct. For this reason, exams in this class will explicitly not allow electronics.

Prerequisite: If you are a Heinz student, then you must have already completed 95-791 "Data Mining" and also either 95-888 "Data-Focused Python" or 90-819 "Intermediate Programming with Python". If you are not a Heinz student and would like to take the course, please contact the instructor and clearly state which Python courses you have taken and what Python experience you have.

Helpful but not required: Math at the level of calculus and linear algebra may help you appreciate some of the material more.

Grading:

*Students with the most instructor-endorsed posts on Piazza will get a bonus of up to 20 points on Quiz 2 (so that it is possible to get 120 out of 100 points).

Letter grades are determined based on a curve.

Calendar (tentative)

Date Topic Supplemental Materials
Part I. Exploratory data analysis
Week 1
Tue Mar 11 Lecture 1: Course overview, analyzing text using frequencies

Thur Mar 13 Lecture 2: Basic text analysis demo (requires Anaconda Python 3 & spaCy)
Fri Mar 14 Recitation slot: Lecture 3 — Basic text analysis (cont'd), co-occurrence analysis
Week 2
Tue Mar 18 Lecture 4: Co-occurrence analysis (cont'd), visualizing high-dimensional data with PCA
Thur Mar 20 Lecture 5: PCA (cont'd), manifold learning (Isomap, MDS)
Fri Mar 21 Recitation slot: More on dimensionality reduction
Week 3
Tue Mar 25 HW1 due 11:59pm; Lecture 6: Manifold learning, intro to clustering
Thur Mar 27 Lecture 7: Clustering
Fri Mar 28 Recitation slot: Quiz 1 — material coverage: everything up to and including Fri Mar 21 (i.e., weeks 1-2)
Week 4
Tue Apr 1 Final project proposals due 11:59pm (1 email per group); Lecture 8: Clustering (cont'd)
Thur Apr 3 & Fri Apr 4 No class (CMU Spring Carnival) 🎪
Week 5
Tue Apr 8 Lecture 9: Wrap up clustering, topic modeling
Part II. Predictive data analysis
Thur Apr 10 Lecture 10: Intro to predictive data analysis
Fri Apr 11 Recitation slot: Quiz 2 — material coverage: Tue Mar 25 up to Tue Apr 8 (i.e., weeks 3-4 as well as Lecture 9)
Week 6
Tue Apr 15 HW2 due 11:59pm; Lecture 11: Intro to neural nets & deep learning
Thur Apr 17 Lecture 12: Image analysis with convolutional neural nets (also called CNNs or convnets)
Fri Apr 18 Recitation slot: TBD
Week 7
Tue Apr 22 Lecture 13: Text generation with generative pretrained transformers (GPTs)
Thur Apr 24 Lecture 14: Other deep learning topics; course wrap-up
Fri Apr 25 Recitation slot: Final project presentations
Final exam week
Mon Apr 28 Final project slide decks + Jupyter notebooks due 11:59pm by email (1 email per group)