Final project: missing data & topcoded values

Prof. Chouldechova

There is a fair bit of missingness in the data set. There are several approaches to dealing with missing data:

Exclude
- You can omit observations with missing values (e.g., remove any rows that contain missing data)
Impute
- R has various packages (Amelia, mice, mi, impute() function from Hmisc, etc.) that can help with imputing missing values.
Think carefully about whether certain kinds of missingness are informative

The downsides of the Impute approach:

Imputation methods often rely on fairly strong assumptions concerning the process governing the appearance of missing values (assumptions such as MAR, missing at random; or MCAR, missing completely at random).
This is a lot of hassle to go through unless you want practice imputing values

Why the think carefully approach can be a good one:

For factor variables, you can treat missing values as just another factor level. Sometimes missingness can be informative (predictive), leading to a significant coefficient for the missing level.
- E.g., Just now we ran a logistic regression in which we used ? as one of the levels of the workingclass variable to indicate individuals whose working class is unknown. Having workingclass = ? turned out to be strong associated with earning under 50k a year.
For numeric variables, there's not much you can do. Just recode negative values to NA.

My recommendation

Start by thinking carefully about missing values
If nothing interesting turns up, go ahead and exclude them (code as NA, proceed accordingly)

Warning: Trying to impute can consume a lot of time
- Not guaranteed to produce better results than what you'd have if you just excluded all observations with missing values.

The income variable that you have available is topcoded.
For the top 2% of earners, you don't observe their actual income.
Instead, their income is recorded as the average of the top 2% of incomes.
Standard regression applied to data with a topcoded outcome is inconsistent.
- i.e., even if you had infinite data, your coefficient estimates won't converge to the “true” coefficients.

Tobit regression (censored regression).
- We didn't talk about this method in class
- It's not too difficult to understand if you already understand linear regression.
- A tutorial can be found here.
Try fitting the regression models / running hypothesis tests two ways
- First way: include the topcoded observations
- Second way: exclude all observations with topcoded outcomes
- If your estimates change a lot, then you probably don't want to use the topcoded observations
- If you go this route, be sure to explain what omiting the high earning individuals means for the scope of your conclusions.

My recommendation: Take approach (2), unless you want practice with tobit regression.