94-775 & 94-475 Practical Unstructured Data Analytics

Course Information

Instructor: Woody Zhu

Contact: shixianz [AT] andrew [DOT] cmu [DOT] edu

Section B3: TR 3:30-4:50 and F 5:00-6:20

Room: HBH 1204

Course Description

Companies, governments, and other organizations now collect vast amounts of data in many forms, including text, images, audio, and video. The question is how to turn this diverse, disorganized data into useful information. A common difficulty is that the underlying structure of the data is not known before analysis, which is why the data is called "unstructured."

This course takes a hands-on approach to analyzing unstructured data. We first examine how to identify structure that may be present in the data, using visualization and other exploratory techniques. Once we have indications of what structure is present, we can use it to make predictions. Along the way we cover several widely used techniques for analyzing unstructured data, both established methods such as manifold learning, clustering, and topic modeling, and newer approaches such as deep neural networks for text, images, and time series.

Programming in Python with tools such as Jupyter Notebook or Colab is a significant component of the course. The use of ChatGPT is also encouraged throughout. See the course syllabus for more details.

Course Schedule

Tue, Jan 16

Lecture 1: Introduction [Slides]

Course overview and introduction to unstructured data

Thu, Jan 18

Lecture 2: Unstructured data modeling [Slides]

HW1 out

This lecture discusses traditional techniques for modeling unstructured data, including images, graphs, and text.

Fri, Jan 19

Recitation: Tutorials for Colab, spaCy, and sklearn

Tue, Jan 23

Lecture 3: Text analysis and PCA [Slides]

Covers basic text analysis techniques and begins the discussion of dimensionality reduction, starting with one of the most commonly used methods, Principal Component Analysis (PCA).
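As a rough illustration of what PCA does (not taken from the lecture materials), the sketch below centers a data matrix and projects it onto its top principal directions using NumPy's SVD. The function name `pca` and the synthetic two-column dataset are assumptions made for this example.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal directions."""
    # Center each feature so the directions describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Sample variance captured along each kept direction.
    explained_variance = (S[:n_components] ** 2) / (X.shape[0] - 1)
    return X_centered @ components.T, components, explained_variance

# Hypothetical data: the second column is roughly twice the first,
# so a single direction captures almost all of the variance.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])
Z, components, explained_variance = pca(X, n_components=1)
```

Here `Z` holds the one-dimensional representation of each point, and `components[0]` points (up to sign) along the line the data lies on.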

Thu, Jan 25

Lecture 4: Manifold learning [Slides]

Focuses on manifold learning, exploring two specific techniques: Isomap and t-SNE.

Fri, Jan 26

Recitation: Demo for text modeling and analysis

Tue, Jan 30

Lecture 5: Clustering part 1 [Slides]

Discusses clustering algorithms in general, then looks more closely at k-Means and Gaussian mixture models (GMMs).
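For readers who want a concrete picture of k-Means before the lecture, here is a minimal sketch of Lloyd's algorithm on one-dimensional data in plain Python. The helper names `assign` and `kmeans_1d`, the toy points, and the starting centers are all assumptions for illustration; in practice a library implementation such as the one in sklearn would be used.

```python
def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            for x in points]

def kmeans_1d(points, centers, n_iter=100):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = assign(points, centers)
        # Update step: each center moves to the mean of its members
        # (a center with no members stays where it is).
        new_centers = []
        for j in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # Converged: no center moved.
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical toy data with two obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, labels = kmeans_1d(points, centers=[0.0, 6.0])
```

The two alternating steps are the whole algorithm: assign points to the nearest center, then recompute each center as the mean of its assigned points.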

Thu, Feb 1

Lecture 6: Clustering part 2 [Slides]

HW2 out

Delves into more details about GMMs and draws the connection between GMMs and k-Means. Also discusses how to select their hyper-parameters.

Fri, Feb 2

Case study: Police 911 calls-for-service analysis

Tue, Feb 6

Lecture 7: Clustering part 3 and topic modeling [Slides]

Discusses two other clustering algorithms and gives a brief introduction to topic modeling.

Thu, Feb 8

Lecture 8: LDA and Intro to predictive analysis [Slides]

Focuses on one topic modeling method, Latent Dirichlet Allocation (LDA), and introduces predictive data analysis.

Fri, Feb 9

Quiz 1

Tue, Feb 13

Lecture 9: Classification [Slides]

Introduces two commonly used classification models, decision trees and random forests. Also covers how to select hyper-parameters through k-fold cross-validation.
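The idea behind k-fold cross-validation for hyper-parameter selection can be sketched in a few lines of plain Python. This is an illustrative outline under stated assumptions, not the lecture's code: `k_fold_indices`, `select_by_cv`, and the toy scoring function are hypothetical names, and the folds are contiguous and unshuffled for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def select_by_cv(candidates, evaluate, n_samples, k=5):
    """Return the candidate with the highest mean validation score.

    `evaluate(candidate, train_idx, val_idx)` is a user-supplied function
    (an assumption of this sketch) that fits a model on the training
    indices and returns its score on the validation indices.
    """
    def mean_score(candidate):
        scores = [evaluate(candidate, tr, va)
                  for tr, va in k_fold_indices(n_samples, k)]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)

# Toy stand-in for "fit a tree of this depth and score it": it ignores
# the data and simply prefers depth 3.
def toy_evaluate(depth, train_idx, val_idx):
    return -abs(depth - 3)

folds = list(k_fold_indices(10, 3))
best_depth = select_by_cv([1, 3, 5], toy_evaluate, n_samples=10, k=5)
```

Each sample is used for validation exactly once, and the hyper-parameter value with the best average validation score is kept.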

Thu, Feb 15

Lecture 10: Regression [Slides]

HW3 out

Focuses on linear regression and continues the discussion of hyper-parameter selection. Also covers other commonly used model evaluation metrics.

Fri, Feb 16

Case study: COVID-19 prediction and analysis

Tue, Feb 20

Lecture 11: Spatio-temporal modeling [Slides]

Introduces spatio-temporal data and techniques for modeling it, including generalized least squares, covariance functions, generalized linear models, and auto-regressive models.

Thu, Feb 22

Lecture 12: Deep learning part 1 [Slides]

Provides an overview of deep learning and neural networks, and briefly introduces widely used deep learning frameworks such as PyTorch.

Fri, Feb 23

Review session

Tue, Feb 27

Lecture 13: Deep learning part 2 [Slides]

This lecture centers on two specific types of deep neural networks: recurrent neural networks (RNNs) and convolutional neural networks (CNNs).

Thu, Feb 29

Lecture 14: Other advanced topics [Slides]

Explores the concepts of generative models, including VAEs, diffusion models, and Large Language Models, highlighting their evolution, applications, and significant contributions to multi-modality in AI.

Fri, Mar 1

Quiz 2