94-775 & 94-475 Practical Unstructured Data Analytics
Course Information
Section B3: TR 3:30-4:50 and F 5:00-6:20
Room: HBH 1204
Course Description
Companies, governments, and other organizations now collect huge amounts of data in many forms, including text, images, audio, and video. The question is how to turn this diverse, disorganized data into useful information. A common problem is that the underlying structure of the data is not known before it is analyzed, which is why such data is called "unstructured." This course takes a hands-on approach to analyzing unstructured data. We first look at how to find possible structure in the data using visualization and other exploratory techniques. Once we have indications of what structure is present, we use it to make predictions. Along the way, we cover several widely used techniques for analyzing unstructured data, both established methods such as manifold learning, clustering, and topic modeling, and newer approaches such as deep neural networks for text, images, and time series. Programming in Python, using tools such as Jupyter Notebook or Colab, is a significant component of the course. The use of ChatGPT is also encouraged throughout. See the course syllabus for more details.
Course Schedule
Thu, Jan 18
This lecture discusses traditional techniques for modeling unstructured data, including images, graphs, and text.
Fri, Jan 19
Recitation: Tutorials for Colab, spaCy, and sklearn
Tue, Jan 23
Covers basic text analysis techniques and begins the discussion of dimensionality reduction, including one of its most commonly used methods, Principal Component Analysis (PCA).
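As a rough illustration of what PCA does, the sketch below projects synthetic 5-dimensional data onto its top two principal components with sklearn. The data, the random seed, and the variable names are assumptions made for this example and are not taken from the lecture materials.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 points in 5-D
# that mostly vary along 2 hidden directions, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Keep the 2 directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                            # low-dimensional coordinates
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```

Because the synthetic data was built from two hidden directions, two components should retain nearly all of the variance; on real text or image features the retained fraction is usually much lower.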
Thu, Jan 25
Focuses on manifold learning, exploring two specific techniques: Isomap and t-SNE.
Fri, Jan 26
Recitation: Demo for text modeling and analysis
Tue, Jan 30
Discusses clustering algorithms in general and looks more closely at k-Means and Gaussian mixture models (GMMs).
Thu, Feb 1
Goes into more detail on GMMs and draws the connection between GMMs and k-Means. Also discusses how to select their hyper-parameters.
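One standard way to relate the two methods is that k-Means makes hard cluster assignments, while a GMM returns a probability for each cluster; k-Means can be viewed as a limiting case of a GMM with equal spherical covariances. The sketch below fits both with sklearn on synthetic blobs and picks the number of GMM components by the Bayesian information criterion (BIC). The data and the choice of BIC are assumptions for illustration; the lecture may use a different selection criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): three 2-D blobs.
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# k-Means: each point gets exactly one cluster label.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# GMM: try several component counts and keep the one with the lowest BIC.
bic = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic[k] = gmm.bic(X)
best_k = min(bic, key=bic.get)

best_gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)
hard = best_gmm.predict(X)        # most likely component per point
soft = best_gmm.predict_proba(X)  # per-component probabilities per point

print(best_k)
print(kmeans.cluster_centers_)
```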
Fri, Feb 2
Case study: Police 911 calls-for-service analysis
Tue, Feb 6
Discusses two other clustering algorithms and gives a brief introduction to topic modeling.
Thu, Feb 8
Focuses on one topic modeling method, Latent Dirichlet Allocation, and gives an introduction to predictive data analysis.
Fri, Feb 9
Quiz 1
Tue, Feb 13
Introduces two commonly used classification models, Decision Trees and Random Forests. Also covers how to select hyper-parameters through k-fold cross-validation.
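A minimal sketch of hyper-parameter selection by k-fold cross-validation, assuming sklearn's `GridSearchCV` and the bundled Iris dataset. The dataset, the parameter grid, and the choice of 5 folds are illustrative assumptions, not the course's own setup.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small built-in dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Candidate hyper-parameter values (hypothetical grid).
param_grid = {"max_depth": [2, 4, None], "n_estimators": [25, 50]}

# For each combination, train on 4 folds and validate on the 5th,
# rotating through all 5 folds, then average the validation accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)  # combination with the best mean CV accuracy
print(search.best_score_)   # that mean CV accuracy
```

The same pattern applies to a single decision tree by swapping in `DecisionTreeClassifier` and a grid over its own parameters.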
Thu, Feb 15
Focuses on linear regression and continues the discussion of how to select its hyper-parameters. We will also talk about other commonly used model evaluation metrics.
Fri, Feb 16
Case study: COVID-19 prediction and analysis
Tue, Feb 20
Introduces spatio-temporal data and related modeling techniques, including generalized least squares, covariance functions, generalized linear models, and autoregressive models.
Thu, Feb 22
Provides an overview of deep learning and neural networks. Also briefly introduces widely used deep learning frameworks, such as PyTorch.
Fri, Feb 23
Review session
Tue, Feb 27
This lecture centers on two specific types of deep neural networks: recurrent neural networks (RNNs) and convolutional neural networks (CNNs).
Thu, Feb 29
Explores the concepts of generative models, including VAEs, diffusion models, and Large Language Models, highlighting their evolution, applications, and significant contributions to multi-modality in AI.
Fri, Mar 1
Quiz 2