Research

My research mainly focuses on causal inference and machine learning. More specifically:

Model and feature selection in causal inference.
Estimation and inference for large-scale and high-dimensional electronic healthcare databases.
Online and offline ensemble machine learning.

Thesis:

Title:

Committe Member (alphabetical order):

Qualifying Exam:

Title:

Propensity Score Estimation with Machine Learning Methods on Electronic Health Databases with Healthcare Claims Data [Slides]

Committe Member (alphabetical order):

John F. Canny
Paul and Stacy Jacobs Distinguished Professor of Engineering

Alan Hubbard (Committe Chair)
Professor and Division Head of Biostatistics

Nicholas P. Jewell
Professor of Biostatistics and Statistics

Mark J. van der Laan (Advisor)
Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics

Master Thesis

Title:

A General Ensemble Learning Framework for Online and Offline Machine Learning [PDF]

Committe Member (alphabetical order):

Haiyan Huang
Professor of Statistics

Maya L. Petersen
Professor of Biostatistics

Mark J. van der Laan (Advisor, Committe Chair)
Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics

Other Interesting Applied Projects

Scene Recognition using Convolutional Neural Network

alt text

In this project, we appled Convolutional Neural Network (CNN) for scene recognition on MIT Place dataset. We investigated GoogLeNet, VGG net and Residual Net to classify the scene images. We achieved top 3 (out of 46 teams, 71 participants) performance in the Berkeley Computer Vision course (CS280) competition with respect to the prediction accuracy. [Link]

Prediction of Brain Responses to Visual Images on fMRI data

alt text

In this project, we studied on the responses in 20 voxels located in the region of the brain responsible for visual functions. There are 120 observations and each with 10921 features (wavelets). We first applied Sure Independence Screening for dimension reduction. Then we investigated Lasso , L2-boosting , Random Forest , Tree Boosting to predict the brain responds. We also applied Knockoff filter to find the significant coefficients while controling the false discover rate.

Organ Classification and Sengmentation for Drosophila Embryo Image based on Machine Learning and Conputer Vision Method.

alt text

This project mainly focused on distinguish gut, yolk, and epidermal/mesodermal tissue in embryo images. We constructed features from the raw embryo image based on multiple classical computer vision method (e.g. Histogram oriented gradient feature and Local Binary Pattern) with scikit-image in python, selected features based on variable importance from Random Forest and Tree Boosting. We studied different machine learning algorithms (Random Forest, Tree boosting, Lasso, SVM) to classify and segment the organ.

Discover of Subculture in US with Dialectometric Analysis using Unsupervised Learning Method.

alt text

In this project, we investigated unspervised learning to the dialectometric analysis of the Harvard Dialect Survey conducted in 2003. We use Multidimensional Scaling (MDS) to visualize the relationship of the questions and answers in the survey data. Based on Term Frequency-Inverse Document Frequency (TF-IDF) model, we proposed a new distance metric and applied k-medoids model to find out and study the sub-culture in America.