Below are descriptions of several data sets, and some suggested projects.
The first few are spelled out in greater detail. You are encouraged to select and flesh out one of these projects,
or make up your own well-specified project using these datasets. If you have other data sets you would like to work on,
we would consider that as well, provided you already have access to this data and a good idea of what to do with it.
Several of the project ideas are compiled from similar courses online.
A0. Object Detection
Dataset
You can download the dataset from here.
There are 20 categories of objects, ranging from car and bike to cat
and dog. It is also the most widely used benchmark dataset for the object
detection task.
Project Ideas:
There are two classic approaches to object detection: one is based on exhaustive
sliding-window search, such as the Deformable Part Model (http://people.cs.uchicago.edu/~rbg/latent/); the other is based on the Selective Search method (http://koen.me/research/selectivesearch/).
Selective Search has great potential to scale to larger datasets with
more categories. You can try the Selective Search idea and compare it to
DPM's performance (results are available online); a sliding-window sketch follows
the papers below.
Papers:
- “Object Detection with Discriminatively Trained Part-Based Models.” P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan. PAMI 2010.
- “Segmentation As Selective Search for Object Recognition.” Koen E. A. van de Sande, Jasper R. R. Uijlings, Theo Gevers, and Arnold W. M. Smeulders. ICCV 2011.
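To make the sliding-window idea concrete, here is a minimal, self-contained sketch (Python/numpy) of exhaustive window scoring plus greedy non-maximum suppression. The scorer is a toy placeholder; in a real detector it would be, say, a trained linear SVM over HOG features, scanned over multiple scales as well.

```python
import numpy as np

def sliding_window_detect(image, window=(64, 64), stride=16,
                          score_fn=None, thresh=0.6):
    """Exhaustively score every window position and return boxes
    (x, y, w, h, score) above the threshold. A real detector would
    also scan over multiple scales and use a trained scorer (e.g., a
    linear SVM over HOG features) instead of a toy score_fn."""
    h, w = window
    boxes = []
    for y in range(0, image.shape[0] - h + 1, stride):
        for x in range(0, image.shape[1] - w + 1, stride):
            s = score_fn(image[y:y + h, x:x + w])
            if s > thresh:
                boxes.append((x, y, w, h, s))
    return boxes

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h, score) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_thresh=0.3):
    """Greedy non-maximum suppression, highest score first."""
    kept = []
    for b in sorted(boxes, key=lambda b: -b[4]):
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

# Toy usage: plant a bright square and "detect" it by mean intensity.
img = np.random.rand(240, 320) * 0.5
img[112:176, 128:192] += 0.5
dets = sliding_window_detect(img, score_fn=lambda p: p.mean())
print(nms(dets))
```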
A. fMRI Brain Imaging
Dataset
Available here
This data set contains a time series of images of brain activation,
measured using fMRI, with one image every 500 msec. During this time,
human subjects performed 40 trials of a sentence-picture comparison
task (reading a sentence, observing a picture, and determining whether
the sentence correctly described the picture). Each of the 40 trials
lasts approximately 30 seconds. Each image contains approximately 5,000
voxels (3D pixels), across a large portion of the brain. Data is
available for 12 different human subjects.
Project A1: Bayes network classifiers for fMRI
Gaussian Naïve Bayes classifiers and SVMs have been used with this data
to predict when the subject was reading a sentence versus perceiving a
picture. Both of these classify 8-second windows of data into these two
classes, achieving around 85% classification accuracy [Mitchell et al.,
2004]. This project will explore going beyond the Gaussian Naïve Bayes
classifier (which assumes voxel activities are conditionally
independent), by training a Bayes network, in particular a TAN (tree-augmented
naïve Bayes) model [Friedman et al., 1997]; a structure-learning sketch follows
the papers below. Issues you'll need to confront include which
features to include (5000 voxels times 8 seconds of images is a lot of
features) for classifier input, whether to train brain-specific or
brain-independent classifiers, and a number of issues about efficient
computation with this fairly large data set.
Papers:
- “Learning to Decode Cognitive States from Brain Images.” T. Mitchell, R. Hutchinson, R. Niculescu, F. Pereira, X. Wang, M. Just, S. Newman. Machine Learning, 2004.
- “Bayesian Network Classifiers.” N. Friedman, D. Geiger, M. Goldszmidt. Machine Learning, 1997.
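As a starting point for the TAN structure-learning step, here is a rough sketch on synthetic stand-in data (the real input would be binarized voxel activities). It scores feature pairs by conditional mutual information given the class and keeps a maximum spanning tree, which is the backbone of a TAN model.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def cond_mutual_info(xi, xj, y):
    """I(Xi; Xj | Y) for binary features xi, xj and discrete labels y."""
    mi = 0.0
    for c in np.unique(y):
        m = (y == c)
        for a in (0, 1):
            for b in (0, 1):
                pab = np.mean((xi[m] == a) & (xj[m] == b))
                pa, pb = np.mean(xi[m] == a), np.mean(xj[m] == b)
                if pab > 0 and pa > 0 and pb > 0:
                    mi += m.mean() * pab * np.log(pab / (pa * pb))
    return mi

# Synthetic stand-in for binarized voxel features: a chain of correlated
# features whose base rate depends on the class (sentence vs. picture).
rng = np.random.default_rng(0)
n, d = 200, 8
y = rng.integers(0, 2, n)
X = np.zeros((n, d), dtype=int)
X[:, 0] = rng.random(n) < 0.3 + 0.4 * y
for j in range(1, d):
    flip = rng.random(n) < 0.2
    X[:, j] = np.where(flip, 1 - X[:, j - 1], X[:, j - 1])

# Pairwise conditional MI; TAN keeps its maximum spanning tree.
cmi = np.zeros((d, d))
for i in range(d):
    for j in range(i + 1, d):
        cmi[i, j] = cond_mutual_info(X[:, i], X[:, j], y)

# minimum_spanning_tree minimizes, so flip the weights (keeping all
# entries > 0, since zero entries are treated as missing edges).
upper = np.triu(np.ones((d, d), dtype=bool), k=1)
w = np.zeros((d, d))
w[upper] = cmi[upper].max() + 1e-6 - cmi[upper]
tree = minimum_spanning_tree(w)
print("TAN tree edges:", list(zip(*tree.nonzero())))
# A full TAN would root and direct this tree and add the class variable
# as a parent of every feature [Friedman et al., 1997].
```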
Project A2: Dimensionality reduction for fMRI data
Explore the use of dimensionality-reduction methods to improve classification accuracy with this data.
Given the extremely high dimension of the input (5000 voxels times 8 images) to the classifier, it is sensible to
explore methods for reducing this to a small number of dimensions. For example, consider PCA,
hidden layers of neural nets, or other relevant dimensionality-reduction methods.
PCA is an example of a method that finds lower dimension representations that minimize error in reconstructing the data.
In contrast, neural network hidden layers are lower dimensional representations of the inputs that
minimize classification error (but only find a local minimum).
Does one of these work better? Does it depend on parameters such as the number of training examples?
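A minimal sketch of the PCA route, using scikit-learn on synthetic stand-in data (the real input would be the 5000-voxel by 8-image windows). Cross-validating the number of retained components jointly with the classifier addresses the parameter-dependence question directly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the classifier input: n windows x d features
# (the real data is ~5000 voxels x 8 images per window).
rng = np.random.default_rng(0)
n, d = 80, 2000
y = rng.integers(0, 2, n)                         # sentence vs. picture
signal = rng.normal(size=d)
X = rng.normal(size=(n, d)) + 0.3 * y[:, None] * signal

# Judge each reduced dimension by downstream classification accuracy.
pipe = Pipeline([("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 40]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```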
Project A3: Feature selection/feature invention for fMRI classification
As in many high dimensional data sets, automatic selection of a subset of features can have a strong positive
impact on classifier accuracy. It has been found that selecting features by the difference in their activity
when the subject performs the task, relative to their activity while the subject is resting, is one useful strategy
[Mitchell et al., 2004]. In this project you could suggest, implement, and test alternative feature selection strategies
(e.g., consider the incremental value of adding a new feature to the current feature set, instead of scoring each feature
independently of the other features being selected), and see whether you can obtain higher classification accuracies.
Alternatively, you could consider methods for synthesizing new features (e.g., define the 'smoothed value' of a voxel in
terms of a spatial Gaussian kernel function applied to it and its neighbors, or define features by averaging voxels
whose time series are highly correlated).
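Here is a rough sketch of the incremental-value idea on synthetic stand-in data: each candidate feature is scored by the cross-validated accuracy it adds to the already-selected set, rather than independently of the other features.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: trials x voxels, with a few informative voxels.
rng = np.random.default_rng(0)
n, d = 120, 300
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, d))
X[:, :10] += 0.8 * y[:, None]

def greedy_forward_select(X, y, k=5):
    """Pick k features by the *incremental* CV accuracy each adds to
    the current set, rather than scoring each feature independently."""
    chosen = []
    for _ in range(k):
        best_j, best_acc = None, -1.0
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            acc = cross_val_score(GaussianNB(), X[:, chosen + [j]], y, cv=5).mean()
            if acc > best_acc:
                best_j, best_acc = j, acc
        chosen.append(best_j)
        print(f"added voxel {best_j}, cv accuracy {best_acc:.3f}")
    return chosen

greedy_forward_select(X, y)
```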
B. Brain network (Connectome)-based classification
This project involves classifying different human-subjects by their brain connectivity structure (or brain network, connectome).
Dataset
This dataset contains the brain connectivity graphs of 114 human subjects.
Each brain is segmented into 70 regions (or supervoxels). The network depicts the connectivity among these regions,
where weights on links represent the strength of the connection.
Meta-data on human-subjects include gender, age, IQ, etc. as well as scores obtained by tests
evaluating the math capability or creativity of the subjects.
Available here.
Project suggestions:
- Classify human-subjects into (1) male vs. female, (2) high-math capable vs. normal, (3) creative vs. normal; a baseline sketch follows below
- Dimensionality reduction and feature construction for improving accuracy (see A2 and A3 above).
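One simple baseline, sketched below on synthetic stand-in matrices (the labels here are random placeholders, so accuracy should hover around chance; with the real gender/math/creativity labels this becomes meaningful): flatten the upper triangle of each 70x70 connectivity matrix into an edge-weight feature vector and fit a sparse linear classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 114 subjects x 70x70 symmetric weighted graphs.
rng = np.random.default_rng(0)
n_subjects, n_regions = 114, 70
A = rng.random((n_subjects, n_regions, n_regions))
A = (A + A.transpose(0, 2, 1)) / 2          # symmetric connectivity
y = rng.integers(0, 2, n_subjects)          # placeholder labels

# Featurize each subject by the upper triangle of the adjacency matrix.
iu = np.triu_indices(n_regions, k=1)
X = A[:, iu[0], iu[1]]                      # shape: (114, 2415)

# L1 regularization doubles as edge selection in this high-dim setting.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
print(cross_val_score(clf, X, y, cv=5).mean())
```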
C. NBA statistics
Dataset
This dataset contains 2004-2005 NBA and ABA stats for
- Player regular season stats
- Player regular season career totals
- Player playoff stats
- Player playoff career totals
- Player all-star game stats
- Team regular season stats
- Complete draft history
- coaches_season.txt - nba coaching records by season
- coaches_career.txt - nba career coaching records
(currently all of the regular season)
Available here.
Project suggestions:
- You can try to predict the outcome of a given game; a minimal sketch follows this list.
- Detect groups of similar players, and run outlier detection on the players to find the outstanding ones.
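For the game-outcome idea, a minimal sketch is below. The column names and the join of season stats onto games are hypothetical (the actual files use different field names), and the outcome labels here are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical per-game table: each game joined with both teams' season
# averages. Column names are made up; the real files differ.
rng = np.random.default_rng(0)
n_games = 500
games = pd.DataFrame({
    "home_ppg": rng.normal(100, 5, n_games),       # points per game
    "away_ppg": rng.normal(100, 5, n_games),
    "home_win_pct": rng.random(n_games),
    "away_win_pct": rng.random(n_games),
})
# Synthetic outcome correlated with the stat differences.
logit = (0.1 * (games.home_ppg - games.away_ppg)
         + 2.0 * (games.home_win_pct - games.away_win_pct))
y = (logit + rng.normal(0, 1, n_games) > 0).astype(int)

X = games[["home_ppg", "away_ppg", "home_win_pct", "away_win_pct"]]
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())
```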
D. Physiological Data Modeling (BodyMedia)
Physiological data offers many challenges to the machine learning
community including dealing with large amounts of data, sequential
data, issues of sensor fusion, and a rich domain complete with noise,
hidden variables, and significant effects of context.
Dataset
1. Which sensors correspond to each column?
- characteristic1 age
- characteristic2 handedness
- sensor1 gsr_low_average
- sensor2 heat_flux_high_average
- sensor3 near_body_temp_average
- sensor4 pedometer
- sensor5 skin_temp_average
- sensor6 longitudinal_accelerometer_SAD
- sensor7 longitudinal_accelerometer_average
- sensor8 transverse_accelerometer_SAD
- sensor9 transverse_accelerometer_average
2. What are the activities behind each annotation?
The annotations for the contest were:
- 5102 = sleep
- 3104 = watching TV
Available here (external link broken, use the internal link).
Project suggestions:
- Behavior classification: classify the person's activity based on the sensor measurements; a windowed-feature sketch follows this list.
- Train a classifier to identify subjects as men or women (this information is given in the training data sequences)
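A rough sketch of the behavior-classification pipeline on simulated stand-in data: cut the sensor stream into fixed-length windows, summarize each window by per-sensor mean and standard deviation, and train a standard classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Simulated stand-in: a 9-channel sensor stream with per-minute
# annotations in contiguous activity blocks (sleep = 5102, TV = 3104).
rng = np.random.default_rng(0)
n_minutes, n_sensors, block = 600, 9, 30
labels = np.repeat(rng.choice([5102, 3104], n_minutes // block), block)
stream = rng.normal(size=(n_minutes, n_sensors))
stream[labels == 5102] *= 0.3               # sleep: lower variance

def window_features(stream, labels, w=10):
    """Summarize each w-minute window by per-sensor mean and std;
    label the window with its majority annotation."""
    Xs, ys = [], []
    for start in range(0, len(stream) - w + 1, w):
        seg = stream[start:start + w]
        Xs.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
        vals, counts = np.unique(labels[start:start + w], return_counts=True)
        ys.append(vals[counts.argmax()])
    return np.array(Xs), np.array(ys)

X, y = window_features(stream, labels)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```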
E. Face Recognition
Datasets
There are two data sets for this type of problem.
- The first dataset (CMU Machine Learning Faces) contains 640 images of faces.
The faces themselves are images of 20 former Machine Learning students and instructors,
with about 32 images of each person. Images vary by the pose (direction the person is looking),
expression (happy/sad), eyewear (sunglasses or not), etc. This gives you a chance to consider a
variety of classification problems ranging from person identification to sunglass detection.
The data, documentation, and associated code are available at the link.
* The same website provides an implementation of a neural network
classifier for this image data. The code is quite robust, and pretty
well documented.
- The second dataset (Facial Attractiveness Images) consists of 2253 female and 1745 male
rectified frontal face images scraped from the hotornot.com website by
Ryan White along with user ratings of attractiveness.
Project suggestions:
- Try SVMs on this data, and compare their performance to that of the provided neural networks; a baseline sketch follows this list.
- Apply a clustering algorithm to find "similar" faces.
- Learn a facial attractiveness classifier. A paper on the topic of predicting facial attractiveness can be found here.
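A minimal sketch of the SVM baseline, using the Olivetti faces bundled with scikit-learn as a stand-in (small grayscale face images, several per person); swap in the CMU faces once you have them downloaded.

```python
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# The Olivetti faces stand in for the CMU faces; person identification
# over 40 classes with 10 images each.
faces = fetch_olivetti_faces()
X, y = faces.data, faces.target

# A linear SVM on raw pixels is a strong baseline; compare its CV
# accuracy against the provided neural network's.
print(cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=5).mean())
```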
F. Character recognition (digits/letters)
Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research.
Datasets
We have three datasets on this topic.
- The first dataset
tackles the more general OCR task, on a small vocabulary of words:
(Note that the first letter of each word was removed, since these were
capital letters that would make the task harder for you.)
- The second dataset is the now "classic" digit recognition task for outgoing mail zip codes
- The third (and most challenging) data set consists of
scrambled text known as CAPTCHAs (which stands for Completely Automated
Public Turing test to tell Computers and Humans Apart) that were
designed by Luis Von Ahn at CMU to be difficult to automatically
recognize. (For more about CAPTCHAs, go to the Wikipedia article or
Captcha.net, where you will find several papers.)
Project suggestions:
- Learn a classifier to recognize the digit/letter
- Use an HMM to exploit correlations between neighboring letters in
the general OCR case to improve accuracy; see the Viterbi sketch after
this list. (Since ZIP codes don't have such constraints between
neighboring digits, HMMs will probably not help in the digit case.)
- Apply a clustering/dimensionality reduction algorithm on this data,
see if you get better classification on this lower dimensional space.
- Learn a classifier to decipher CAPTCHAs. You may want to begin by building a classifier to segment the image into separate letters.
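For the HMM suggestion, the core computation is Viterbi decoding that combines per-letter classifier scores with letter-bigram transition probabilities. A minimal sketch with toy numbers (the transition values here are made up; in practice you would estimate them from a word list):

```python
import numpy as np

def viterbi(emission_logp, trans_logp, prior_logp):
    """Most likely letter sequence given per-position classifier
    log-probabilities (T x 26) and letter-bigram log-transitions (26 x 26)."""
    T, K = emission_logp.shape
    delta = prior_logp + emission_logp[0]
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans_logp        # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emission_logp[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: uniform prior, transitions favoring a->t and t->e (as if
# estimated from a word list), and noisy per-letter classifier outputs.
K = 26
prior = np.full(K, -np.log(K))
trans = np.full((K, K), -np.log(K))
a, t, e = 0, 19, 4
trans[a, t] = trans[t, e] = np.log(0.5)
emis = np.full((3, K), np.log(0.01))
emis[0, a] = emis[1, t] = emis[2, e] = np.log(0.7)
print("".join(chr(ord("a") + i) for i in viterbi(emis, trans, prior)))  # "ate"
```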
G. Image Segmentation
The main goal of this project is to segment given images in a meaningful way.
Datasets
Berkeley collected three hundred images and paid students to hand-segment each one
(usually each image has multiple hand-segmentations).
Two hundred of these images are training images, and the remaining 100 are test images.
The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores,
and some other utility functions. It also includes code for a segmentation example.
Available resources can be found here.
A newer (and bigger) dataset of manually labeled images is here (images, ground-truth data and benchmarks).
Project G1: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on
discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning
algorithms to segment the images based on statistics calculated over regions.
One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available)
and merge the superpixels into larger segments. Come up with a set of features to represent the superpixels
(probably based on color and texture), a classifier/regression algorithm (suggestion: boosted decision trees)
that allows you to estimate the likelihood that two superpixels are in the same segment, and an algorithm
for segmentation based on those pairwise likelihoods. Since this project idea is fairly time-consuming,
focusing on a specific part of the project may also be acceptable.
For the midway report, you should be able to estimate the likelihood that two superpixels
are in the same segment and have a quantitative measure of how good your estimator is.
You should also have an outline of how to use the likelihood estimates to form the final segmentation.
The rest of the project will involve improving your likelihood estimation and your grouping algorithm,
and in generating final results.
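A rough sketch of the first two steps, oversegmentation and pairwise features, using scikit-image's Felzenszwalb implementation and one of its sample images as a stand-in. The mean-color features and adjacency extraction are deliberately simple; pair labels for the same-segment classifier would come from the hand segmentations.

```python
import numpy as np
from skimage.data import astronaut
from skimage.segmentation import felzenszwalb

# Oversegment a sample image into superpixels (Felzenszwalb 2004);
# astronaut() is just a stand-in for a Berkeley training image.
img = astronaut()
sp = felzenszwalb(img, scale=100, sigma=0.5, min_size=50)
n_sp = sp.max() + 1

# One simple per-superpixel feature: mean color. Real features should
# add texture statistics (e.g., filter-bank responses).
feats = np.array([img[sp == i].mean(axis=0) for i in range(n_sp)])

# Collect adjacent superpixel pairs by scanning horizontal/vertical
# neighbors; each pair gets a feature vector for a same-segment
# classifier (e.g., boosted decision trees), with labels derived from
# whether the two superpixels fall in the same human-marked segment.
pairs = set()
for a, b in ((sp[:, :-1], sp[:, 1:]), (sp[:-1, :], sp[1:, :])):
    mask = a != b
    edges = np.sort(np.stack([a[mask], b[mask]], axis=1), axis=1)
    pairs.update(map(tuple, edges.tolist()))
pair_feats = np.array([np.abs(feats[i] - feats[j]) for i, j in pairs])
print(f"{n_sp} superpixels, {len(pair_feats)} adjacent pairs")
```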
Papers:
- Some segmentation papers from Berkeley are available here
Project G2: Supervised vs. Unsupervised Segmentation Methods
Write two segmentation algorithms (these may be simpler than the one
above): a supervised method (such as logistic regression) and an
unsupervised method (such as K-means). Compare the results of the two
algorithms. For your write-up, describe the two classification methods
that you plan to use.
For the midway report, you should have completed at least one of your segmentation algorithms and have results for that algorithm.
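As a concrete unsupervised baseline, a K-means sketch that clusters pixels on color plus weighted position features (the 0.005 position weight is an arbitrary knob trading color similarity against spatial locality):

```python
import numpy as np
from skimage.data import astronaut
from skimage.transform import rescale
from sklearn.cluster import KMeans

# Cluster pixels on (color, position) features; astronaut() stands in
# for a Berkeley image. channel_axis requires scikit-image >= 0.19.
img = rescale(astronaut(), 0.25, channel_axis=2)   # downsample for speed
h, w, _ = img.shape
yy, xx = np.mgrid[0:h, 0:w]
feats = np.column_stack([
    img.reshape(-1, 3),        # RGB, scaled to [0, 1] by rescale
    0.005 * yy.ravel(),        # weighted coordinates; the 0.005 knob
    0.005 * xx.ravel(),        # trades color similarity vs. locality
])
segments = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(feats)
print(np.bincount(segments))   # pixel count per segment
segments = segments.reshape(h, w)
```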
Papers:
- Some segmentation papers from Berkeley are available here
H. Object Recognition
Dataset
The Caltech 256 dataset
contains images of 256 object categories taken at varying orientations,
varying lighting conditions, and with different backgrounds.
Available here.
Project suggestions:
- You can try to create an object recognition system which can
identify which object category is the best match for a given test
image.
- Apply clustering to learn object categories without supervision.
I. Election Contributions
Dataset
This dataset represents federal electoral campaign donations in the United States for the election years
1980 through 2006. The data, fully built, forms a tripartite, directed graph. Donors (individuals and corporations)
make contributions to Committees, which in turn make contributions to Candidates.
There is a many-to-many relationship between Donors and Committees, and also a many-to-many
relationship between Committees and Candidates.
Available here (data and documentation)
Project suggestions:
- Predict a committee's contribution rate, or preferred candidates,
based on its past contribution rate. Which features best indicate who donates to it?
- Predict how much a donor will contribute based on zip code, or
whether an occupation is listed (or, if you can analyze the text, what
occupation is listed).
- Predict how much money a candidate will receive based on party, state, or whether s/he is an incumbent/challenger/open seat.
- Discover clusters of donors/committees/candidates.
J. Sensor networks
Dataset
This dataset contains temperature, humidity, and light
measurements, along with the voltage level of the batteries at each
node, from this 54-node sensor network deployment. The data was collected
every 30 seconds, starting around 1am on February 28th, 2004.
This is a "real" dataset, with lots of missing data, noise, and
failed sensors giving outlier values, especially when battery levels
are low.
Available here.
Project suggestions:
- Compare various regression algorithms.
- Automatically detect failed sensors; a residual-based sketch follows this list.
- Learn graphical models (e.g. Bayes nets) representing the correlations between measurements at different nodes.
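For failed-sensor detection, one simple scheme is sketched below on simulated data: regress each node's readings on all the other nodes and flag nodes with unusually large residuals. The threshold here uses the simulation's known noise level; on real data you would need a robust estimate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated stand-in: temperature at 54 nodes; node 7 fails midway.
rng = np.random.default_rng(0)
T, n_nodes = 1000, 54
base = 20 + 5 * np.sin(np.linspace(0, 20, T))[:, None]   # shared trend
temps = base + rng.normal(0, 0.5, (T, n_nodes))
temps[600:, 7] = rng.normal(90, 10, 400)                  # failed sensor

# Regress each node on all the others; a healthy node is well explained
# by its neighbors, so a large residual spread flags a failure.
for node in range(n_nodes):
    others = np.delete(temps, node, axis=1)
    model = LinearRegression().fit(others, temps[:, node])
    resid = temps[:, node] - model.predict(others)
    # Threshold: 3x the simulated noise level. On real data, use a
    # robust spread estimate (e.g., MAD) instead of a known sigma.
    if resid.std() > 3 * 0.5:
        print(f"node {node}: residual std {resid.std():.1f} -> flagged")
```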
K. Twenty Newsgroups text classification
Dataset
This data set contains 1000 text articles posted to each of 20 online newsgroups,
for a total of 20,000 articles.
This data is useful for a variety of text classification and/or clustering projects.
The "label" of each article is which of the 20 newsgroups it belongs to.
The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
Available here.
* The same website provides an implementation of a Naive Bayes
classifier for this text data. The code is quite robust, and some
documentation is available.
Project suggestions:
- EM for text classification in the case where you have labels for
some documents, but not for others (see Nigam et al., and come up with
your own suggestions); a minimal EM sketch follows this list.
- Make up your own text learning problem/approach.
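A minimal sketch of the Nigam et al. style EM loop, using scikit-learn's copy of the 20 newsgroups data on a two-class subset; the M-step is implemented with sample weights so each document contributes to both classes in proportion to its posterior.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Two-class subset; pretend only 5% of the labels are observed.
cats = ["rec.sport.hockey", "sci.space"]
data = fetch_20newsgroups(subset="train", categories=cats)
X = CountVectorizer(max_features=5000, stop_words="english").fit_transform(data.data)
y = np.array(data.target)
rng = np.random.default_rng(0)
labeled = rng.random(len(y)) < 0.05

# EM: fit Naive Bayes on the labeled docs, then alternate between
# (E) soft-labeling all docs and (M) refitting on the soft labels.
nb = MultinomialNB().fit(X[labeled], y[labeled])
for it in range(5):
    proba = nb.predict_proba(X)                 # E-step posteriors
    proba[labeled] = np.eye(2)[y[labeled]]      # clamp observed labels
    # M-step via sample weights: each doc appears once per class,
    # weighted by its posterior for that class.
    Xrep = X[np.repeat(np.arange(X.shape[0]), 2)]
    yrep = np.tile([0, 1], X.shape[0])
    nb = MultinomialNB().fit(Xrep, yrep, sample_weight=proba.ravel())
    print(f"iter {it}: accuracy on all training docs {nb.score(X, y):.3f}")
```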
L. WebKB webpage classification
Dataset
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
Available here.
Project suggestions:
- You can try to learn classifiers to predict the type of a webpage from the text.
- Try to improve accuracy by (1) exploiting correlations between pages that point to each other,
and/or (2) segmenting the pages into meaningful parts (bio, publications, etc.)
Final note:
Kaggle has a long
list of (machine learning) problems! They help people who have such
problems meet people with the know-how to solve them. The problems are
cast as open competitions (with dollar awards).
You can consider picking up a problem from Kaggle (e.g.,
salary prediction, predicting which new questions asked on Stack
Overflow will be closed, diabetes classification, etc.); they often
have the data available, and you might even win a prize!
Additional datasets & problems
From Prof. Jitendra Malik's talk: '3 R's of Computer Vision'
- Semantic reconstruction: relatively hard problem, involves
semantically labeling objects in the reconstructed image such as doors,
walls, etc.
- Semantic segmentation (telling a story about the image): involves
for example (i) attribute classification (e.g., elderly white man with
a baseball hat), (ii) orientation (e.g., next to, behind, face to face)
- Face expression classification (smiling, angry, worried, suspicious, ...)
- Person pose (called poselets) classification (standing, arms
crossed, hand raised, ...); one can also do action detection using
probabilities of poselets as features (e.g., dancing, running, ...)
For feature construction one can use RGBD (D for depth) in contrast to historical RGB.
For labeling, one can use AMT where humans mark joints, arms, shoulders, etc. on example images.
Alternatively, gazing patterns of camera-recorded people can be used.
There are many many other datasets and machine learning problems out there.
You can choose to work with any of these datasets and define your own ML problems to solve that are interesting to you.
- UC Irvine has a ML repository that could be useful for your project.
Many of these data sets have been used extensively in ML research (although the datasets are often small)
- Sam Roweis also has a link to several datasets.
- Many online media datasets by Jure Leskovec (mostly network/graph data, but also tweets, reviews, etc.) as well as more data here.
For a nice read on several interesting prediction tasks on StackOverflow, see
this.
For a nice read on several interesting prediction tasks on Facebook and Wikipedia, see
this.
- arXiv Preprints:
A collection of preprints in the field of high-energy physics. Includes
the raw LaTeX source of each paper (so you can extract either
structured sentences or a bag-of-words) along with the graph of
citations between papers.
- TRECVID:
A competition for multimedia information retrieval. They keep a fairly
large archive of video data sets, along with featurizations of the data.
- Activity Modelling data:
Activity modelling is the task of inferring what the user is doing from
observations (e.g., motion sensors, microphones). This data set consists
of GPS motion data for two subjects tagged with labels like car,
working, athome, shopping.
A related paper using a Bayes net for this problem is here.
- Record Deduplication data:
The datasets provided below comprise lists of records, and the goal
is to identify, for any dataset, the set of records which refer to
unique entities. This problem is known by the varied names of
deduplication, identity uncertainty and record linkage.
One common approach is to cast the deduplication problem as a
classification problem. Consider the set of record-pairs, and classify
them as either "unique" or "not-unique". Some papers on record
deduplication include
this and this.
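A minimal sketch of the pairwise-classification framing on toy records: build string-similarity features for every record pair and fit a classifier against hand-labeled duplicates (real systems add blocking to avoid scoring all O(n^2) pairs).

```python
import numpy as np
from difflib import SequenceMatcher
from itertools import combinations
from sklearn.linear_model import LogisticRegression

# Toy records (name, address); real data has more fields and noise.
records = [
    ("John Smith", "123 Oak St"), ("Jon Smith", "123 Oak Street"),
    ("Mary Jones", "9 Elm Ave"), ("M. Jones", "9 Elm Avenue"),
    ("Bob Lee", "77 Pine Rd"),
]
duplicates = {(0, 1), (2, 3)}        # hand-labeled duplicate pairs

def pair_features(a, b):
    """Per-field string similarity for one record pair."""
    return [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]

pairs = list(combinations(range(len(records)), 2))
X = np.array([pair_features(records[i], records[j]) for i, j in pairs])
y = np.array([int(p in duplicates) for p in pairs])

clf = LogisticRegression().fit(X, y)
for (i, j), p in zip(pairs, clf.predict_proba(X)[:, 1]):
    print(records[i][0], "<->", records[j][0], f"dup prob {p:.2f}")
```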
- Enron e-mail data:
Consists of ~500K e-mails collected from Enron employees. It has been
used for research into information extraction, social network analysis,
and topic modeling.
For a possible project, (1) you can try to classify the text of an
e-mail message to decide who sent it, or (2)
you can try to predict the length of an email given the past emailing
history of the sender and recipients.
- NIPS Corpus data:
A data set based on papers from a machine learning conference (NIPS
volumes 1-12). The data can be viewed as a tripartite graph on authors,
papers, and words. Links represent authorship and the words used in a
paper. Additionally, papers are tagged with topics and we know which
year each paper was written. Potential projects include authorship
prediction, document clustering, and topic tracking.
- Precipitation data:
This dataset includes 45 years of daily precipitation data from the
Northwestern US. Ideas for projects include predicting rain levels,
and deciding where to place sensors to best predict rainfall.
See this for the latter and the citations therein.