Nonasymptotic theory for nearest neighbor methods in prediction
Despite nearest neighbor methods appearing in text as early as the
11th century in Alhazen's "Book of Optics", it was not until fairly
recently that arguably the most general, nonasymptotic theory for
nearest neighbor classification was developed by
Chaudhuri and Dasgupta (2014).
I've worked on a book that goes over some of the latest nonasymptotic theoretical guarantees
for nearest neighbor and related kernel regression and classification methods
both in general metric spaces, and in contemporary applications where
clustering structure appears (time series forecasting, recommendation
systems, medical image segmentation). The book also covers some recent
advances in approximate nearest neighbor search, explains why
decision tree and related ensemble methods are nearest neighbor
methods, and discusses the potential for far away neighbors to help
in prediction. I have also developed theory for nearest neighbor and
kernel survival analysis (ICML 2019) and helped organize a
related workshop at NeurIPS 2017
(slides are available for all the talks).
-
(Monograph/Book) "Explaining the Success of Nearest Neighbor Methods in Prediction"
George H. Chen, Devavrat Shah
Foundations and Trends in Machine Learning, May 2018
[DOI]
-
(Survival analysis) "Nearest Neighbor and Kernel Survival Analysis: Nonasymptotic Error Bounds and Strong Consistency Rates"
George H. Chen
International Conference on Machine Learning (ICML), June 2019
[arXiv] [code] [talk] [poster]
Chapter 5 of the above monograph is on theoretical results using clustering
structure. This chapter is based on my PhD thesis and provides a better
overview than my thesis does. Proofs for the chapter are deferred to my thesis:
-
"Latent Source Models for Nonparametric Inference"
George H. Chen
Ph.D. thesis, MIT, May 2015
[paper]
Received the George M. Sprowls award for best Ph.D. thesis in Computer Science at MIT
My thesis unifies and builds on the following trilogy of papers:
-
"A Latent Source Model for Patch-Based Image Segmentation"
George H. Chen, Devavrat Shah, Polina Golland
Medical Image Computing and Computer-Assisted Intervention (MICCAI), October 2015
[arXiv]
[paper]
[poster]
Note:
For a more comprehensive exposition of this paper, consider
reading Chapter 5 of my
Ph.D. thesis.
-
"A Latent Source Model for Online Collaborative Filtering"
♣
Guy Bresler, George H. Chen, Devavrat Shah
Neural Information Processing Systems (NeurIPS), December 2014
[arXiv - longer version]
[paper - short conference version]
[poster]
Selected for spotlight (one of 62 spotlights out of 1678 submissions)
Note:
An expanded version including intuition for how collaborative
filtering relates to an MAP item recommender and derivations for
the examples is in Chapter 4 of my
Ph.D. thesis;
the notation has also been changed to be more similar to the
rest of the trilogy of papers.
-
"A Latent Source Model for Nonparametric Time Series Classification"
♣
George H. Chen, Stanislav Nikolov, Devavrat Shah
Neural Information Processing Systems (NeurIPS), December 2013
[arXiv - longer version]
[paper - short conference version]
[poster]
Note:
An expanded version with a lower bound on the misclassification
rate and further discussion is in Chapter 3 of my
Ph.D. thesis.
Handling missing not at random data
In a variety of prediction problems, we have feature vectors with entries missing not at random. A standard approach is to impute the missing entries (i.e., solve a matrix completion problem) and then solve the prediction task using the imputed features. Even when there is no follow-up prediction task and we are only doing matrix completion, debiasing guarantees are incompletely understood. With Wei Ma, I am developing a new approach to estimating the probabilities of entries being missing, which are then used to debias matrix completion. Some preliminary progress is in our NeurIPS 2019 paper:
-
"Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption"
Wei Ma*, George H. Chen* (* = equal contribution)
Neural Information Processing Systems (NeurIPS), December 2019
[arXiv] [code] [poster] [slides]
Note: We have a longer version in preparation that analyzes a collection of missingness probability estimators and provides more debiasing guarantees.
Best paper (theoretical track) at INFORMS Data Mining and Decision Analytics Workshop 2019
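To make the debiasing idea concrete, here is a minimal sketch of inverse propensity weighting on synthetic data. It is an illustration under stated assumptions, not the estimator from the paper above: the missingness probabilities are taken as known rather than estimated, the "prediction" is a constant matrix, and the function names (`naive_mse`, `ipw_mse`, `simulate`) and all numeric settings are hypothetical.

```python
import numpy as np


def naive_mse(truth, prediction, observed):
    """Average squared error over observed entries only (ignores why entries are missing)."""
    sq_err = (truth - prediction) ** 2
    return float(np.sum(observed * sq_err) / np.sum(observed))


def ipw_mse(truth, prediction, observed, propensity):
    """Inverse-propensity-weighted squared error, averaged over all entries.

    Unobserved entries are multiplied by zero, so their true values are never used.
    """
    sq_err = (truth - prediction) ** 2
    return float(np.sum(observed * sq_err / propensity) / truth.size)


def simulate(seed=0, n_rows=200, n_cols=200):
    """Return (full-matrix MSE, naive MSE, IPW MSE) on synthetic ratings."""
    rng = np.random.default_rng(seed)
    # Synthetic ratings in {1, ..., 5}.
    truth = rng.integers(1, 6, size=(n_rows, n_cols)).astype(float)
    # A deliberately crude predictor: guess 4 everywhere.
    prediction = np.full_like(truth, 4.0)
    # Missing not at random: high ratings are far more likely to be revealed.
    # Assumption: these probabilities are known here, not estimated.
    propensity = np.where(truth >= 4, 0.9, 0.1)
    observed = rng.random(truth.shape) < propensity
    full = float(np.mean((truth - prediction) ** 2))
    return (full,
            naive_mse(truth, prediction, observed),
            ipw_mse(truth, prediction, observed, propensity))


if __name__ == "__main__":
    full, naive, ipw = simulate()
    print(f"full-matrix MSE: {full:.3f}")
    print(f"naive MSE on observed entries: {naive:.3f}")
    print(f"inverse-propensity-weighted MSE: {ipw:.3f}")
```

Because high ratings are over-represented among the revealed entries, the naive average understates the error of the constant predictor, while dividing each observed term by its probability of being observed corrects for this in expectation. In practice the probabilities are unknown and must themselves be estimated, which is the harder part of the problem.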
Machine learning for sustainable development
Automatically finding trucks in satellite images to help estimate truck traffic, with an application to freight activity estimation in developing countries:
-
"Truck Traffic Monitoring with Satellite Images"
Lynn H. Kaack, George H. Chen, M. Granger Morgan
ACM Conference on Computing and Sustainable Societies (COMPASS), July 2019
[arXiv]
(Also presented at the
International Conference on Machine Learning (ICML) Workshop on Climate Change, June 2019)
With a startup called CoolCrop, I am working on providing small and marginal
farmers in rural India with access to cost-effective refrigeration and
predictive analytics:
-
"An Interpretable Produce Price Forecasting System for Small Farmers in India using Collaborative Filtering and Adaptive Nearest Neighbors"
Wei Ma, Kendall Nowocin, Niraj Marathe, George H. Chen
Information and Communication Technologies and Development (ICTD), January 2019
[arXiv]
-
"Toward Reducing Crop Spoilage and Increasing Small Farmer Profits in India: a Simultaneous Hardware and Software Solution"
George H. Chen, Kendall Nowocin, Niraj Marathe
Information and Communication Technologies and Development (ICTD), November 2017
[arXiv]
Previously, as part of the startup GridForm, I analyzed satellite
images of enormous tracts of land to help plan development projects.
We focused on helping renewable energy companies bring electricity to
rural India, and we won the $10,000 grand prize at the 2014 MIT IDEAS
Global Challenge. Here's a joint paper with Kush Varshney and Brian
Abelson of DataKind:
-
"Targeting Villages for Rural Development Using Satellite Image Analysis"
Kush R. Varshney, George H. Chen, Brian Abelson, Kendall Nowocin, Vivek Sakhrani, Ling Xu, Brian L. Spatocco
Big Data, March 2015
[paper]
Quantifying persuasion
With Emaad Manzoor, Dokyun Lee, and Alan Montgomery, I'm working on quantifying what makes an argument persuasive by mining the ChangeMyView subreddit:
-
"Quantifying Strategic Persuasion — Measuring d(opinion)/d(argument) in Debates on Gun Control"
Emaad Ahmed Manzoor, Dokyun Lee, George H. Chen, Alan Montgomery
INFORMS Conference on Information Systems & Technology (CIST), October 2019
INFORMS Workshop on Data Science, October 2019
(Also previously presented at ISMS Marketing Science Conference, June 2019)
Online education videos
With Mi Zhou, Pedro Ferreira, and Michael D. Smith, I'm working on identifying which features of an online educational video help predict whether the video will be watched; we're using MasterClass data:
-
"Disrupting Class: Using Video Analytics and Machine Learning to Understand Student Behavior Online"
Mi Zhou, George H. Chen, Pedro Ferreira, Michael D. Smith
INFORMS Conference on Information Systems & Technology (CIST), October 2019
Workshop on Information Systems & Economics (WISE), December 2019
Forecasting patient outcomes in electronic health records
I'm working on survival analysis (predicting time durations until critical events) for healthcare.
Some preliminary results on predicting length of stay for pancreatitis patients admitted to an
intensive care unit were presented at the Machine Learning for Health workshop at NeurIPS:
-
"Survival-Supervised Topic Modeling with Anchor Words: Characterizing Pancreatitis Outcomes"
George H. Chen, Jeremy C. Weiss
Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning for Health, December 2017
[arXiv (short workshop version)]
(Also presented at Society for Medical Decision Making North American Meeting, October 2017)
Real-time medical image analysis
Various real-time medical imaging applications could be enabled by speeding up
dimensionality reduction, a subroutine used in many image analysis algorithms.
To do this, we create a sparse description of a manifold; our work relates to
sparse multivariate regression:
-
"Sparse Projections of Medical Images onto Manifolds"
George H. Chen, Christian Wachinger, Polina Golland
Information Processing in Medical Imaging (IPMI), June-July 2013
[arXiv]
[paper]
[poster]
Modeling brain activation patterns
My master's thesis presented a probabilistic model of brain
activation patterns evoked by functional stimuli such as reading
sentences; the model combines sparse coding and image alignment:
-
"Deformation-Invariant Sparse Coding"
George H. Chen
Master's thesis, MIT, May 2012
[paper]
[poster]
Preliminary version:
-
"Deformation-Invariant Sparse Coding for Modeling Spatial Variability of Functional Patterns in the Brain"
George H. Chen, Evelina G. Fedorenko, Nancy G. Kanwisher, Polina Golland
Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning and Interpretation in Neuroimaging, December 2011
[paper]
[talk slides]
Backpack with sensors for indoor modeling
I developed algorithms that track where this fancy backpack is
indoors using laser scanners.
After I graduated from Berkeley, this project progressed quite a bit!
Be sure to check out the latest developments on the
Video and Image Processing Lab's website.
Preliminary results:
-
"Indoor Localization and Visualization Using a Human-Operated Backpack System"
Timothy Liu, Matthew Carlberg, George Chen, Jacky Chen, John Kua, Avideh Zakhor
International Conference on Indoor Positioning and Indoor Navigation (IPIN), September 2010
[paper]
-
"Indoor Localization Algorithms for a Human-Operated Backpack System"
George Chen, John Kua, Stephen Shum, Nikhil Naikal, Matthew Carlberg, Avideh Zakhor
International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT), May 2010
[paper]
-
"Image Augmented Laser Scan Matching for Indoor Dead Reckoning"
Nikhil Naikal, John Kua, George Chen, Avideh Zakhor
International Conference on Intelligent Robots and Systems (IROS), October 2009
[paper]
Analyzing aerial images of cities
How to automatically find buildings, trees, ground, and water in
aerial LIDAR images:
-
"Classifying Urban Landscape in Aerial LIDAR Using 3D Shape Analysis"
Matthew Carlberg, Peiran Gao, George Chen, Avideh Zakhor
International Conference on Image Processing (ICIP), November 2009
[paper]
-
"2D Tree Detection in Large Urban Landscapes Using Aerial LIDAR Data"
George Chen, Avideh Zakhor
International Conference on Image Processing (ICIP), November 2009
[paper]