Prof. Chouldechova
There is a fair bit of missingness in the data set. There are several approaches to dealing with missing data:
Exclude
Impute: there are functions (e.g., the impute() function from Hmisc, etc.) that can help with imputing missing values.
Think carefully about whether certain kinds of missingness are informative
The downsides of the Impute approach:
Imputation methods often rely on fairly strong assumptions concerning the process governing the appearance of missing values (assumptions such as MAR, missing at random; or MCAR, missing completely at random).
This is a lot of hassle to go through unless you want practice imputing values
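As a rough illustration of the Exclude and Impute approaches, here is a minimal base R sketch on a toy data frame. The column names and values are made up for illustration, and the median fill is only one simple choice; Hmisc's impute() provides a packaged version of this kind of single-value fill.

```r
# Toy data; the column names and values are hypothetical.
df <- data.frame(
  age    = c(25, 40, NA, 31, 58, NA),
  income = c(30, 52, 41, NA, 75, 38)
)

# Exclude: drop every row that has at least one missing value.
df_complete <- na.omit(df)

# Impute: fill each missing value with the median of the observed values.
# This is an illustrative helper, not a function from any package.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
df_imputed <- as.data.frame(lapply(df, impute_median))
```

Note that the median fill implicitly assumes the missing values look like the observed ones, which is exactly the kind of assumption listed among the downsides.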
Why the think carefully approach can be a good one:
For factor variables, you can treat missing values as just another factor level. Sometimes missingness can be informative (predictive), leading to a significant coefficient for the missing level. For example, ? was one of the levels of the workingclass variable, indicating individuals whose working class is unknown. Having workingclass = ? turned out to be strongly associated with earning under 50k a year.
For numeric variables, there's not much you can do. Just recode negative values to NA.
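The two recodings described above might look like the following in base R. This is a sketch on toy data: the values, the hours column, and the "Unknown" label are assumptions made for illustration, not part of the actual data set.

```r
# Toy data; values are hypothetical, with "?" standing in for "unknown".
df <- data.frame(
  workingclass = c("Private", "?", "State-gov", "Private", "?", NA),
  hours        = c(40, -1, 35, -9, 50, 20),
  stringsAsFactors = FALSE
)

# Factor variable: keep "unknown" as its own level instead of dropping it,
# so a model can estimate a separate coefficient for it.
wc <- df$workingclass
wc[is.na(wc) | wc == "?"] <- "Unknown"
df$workingclass <- factor(wc)

# Numeric variable: negative codes are assumed to mark missing values here,
# so recode them to NA.
df$hours[df$hours < 0] <- NA
```

With the factor coded this way, a model formula that includes workingclass will report a coefficient for the Unknown level alongside the others.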
My recommendation
Start by thinking carefully about missing values
If nothing interesting turns up, go ahead and exclude them (code as NA, proceed accordingly)
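One quick way to check whether missingness looks informative is to compare the outcome across rows with and without a missing value. The sketch below uses a toy data frame; the variable names (workingclass, high_income) and the values are hypothetical.

```r
# Toy data; names and values are hypothetical.
df <- data.frame(
  workingclass = c("Private", NA, "Private", NA, "State-gov", NA,
                   "Private", "State-gov"),
  high_income  = c(1, 0, 1, 0, 0, 0, 1, 1)
)

# Flag missingness, then compare the outcome rate across the two groups.
df$wc_missing <- is.na(df$workingclass)
rate_by_missing <- tapply(df$high_income, df$wc_missing, mean)
print(rate_by_missing)

# If the rates look similar, excluding the incomplete rows is a reasonable default.
df_complete <- df[!df$wc_missing, ]
```

A large gap between the two rates, as in this toy example, suggests keeping the missing values as their own level rather than excluding them.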
The income variable that you have available is topcoded.
For the top 2% of earners, you don't observe their actual income.
Instead, their income is recorded as the average of the top 2% of incomes.
Standard regression applied to data with a topcoded outcome gives inconsistent coefficient estimates.
(1) Tobit regression (censored regression).
(2) Try fitting the regression models / running hypothesis tests two ways
My recommendation: Take approach (2), unless you want practice with tobit regression.
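For those who do want to try tobit regression, here is a minimal sketch on simulated data. It fits a right-censored Gaussian model with survreg() from the survival package, which is one common way to fit a tobit model in R. Everything here is an assumption for illustration: the data are simulated, the topcoding cutoff is known exactly (in real data it would have to be taken from the documentation or approximated), and the last fit reflects only one possible reading of "two ways", namely with and without the topcoded observations.

```r
library(survival)  # recommended package shipped with standard R installations

set.seed(1)
n <- 2000
x <- rnorm(n)
income_true <- 50 + 10 * x + rnorm(n, sd = 5)  # simulated, not real data

# Topcode the top 2%: replace their incomes with the average of that group.
cutoff   <- unname(quantile(income_true, 0.98))
topcoded <- income_true > cutoff
income   <- ifelse(topcoded, mean(income_true[topcoded]), income_true)

# Standard regression on the topcoded outcome.
fit_ols <- lm(income ~ x)

# Tobit: treat topcoded incomes as right-censored at the cutoff,
# with Gaussian errors. The event indicator is TRUE when income is observed.
income_cens <- pmin(income, cutoff)
fit_tobit <- survreg(Surv(income_cens, !topcoded, type = "right") ~ x,
                     dist = "gaussian")

# Assumed reading of "two ways": refit without the topcoded observations
# and compare against the fit on all observations.
fit_drop <- lm(income ~ x, subset = !topcoded)

print(round(rbind(ols   = coef(fit_ols),
                  tobit = coef(fit_tobit),
                  drop  = coef(fit_drop)), 2))
```

If the fits with and without the topcoded observations lead to the same conclusions, the topcoding is unlikely to be driving the results.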