Sample selection is a particular instance of the identification
problem. We are interested in a theory about the relationship between
two variables, say X and Y, that says that high values of
X induce high values of Y. So we get data on X and
Y to test the theory. A sample selection problem arises when all
the observations for which X is high correspond to high values
of another variable, say Z, because individuals with high Z
select themselves into having high X. Furthermore, Z has
a direct effect on Y, so it will be hard to measure the effect of
X on Y.

As an example, let Y denote earnings after leaving
school. Let X denote the number of years of schooling an individual
has had, and let Z denote an individual's ability (this is one of
the examples we looked at in class). There is a sample selection problem
here because high-ability individuals choose to stay in school longer.
Thus, they have high earnings both because they are smarter and
because they got more schooling. If we cannot measure ability well, we
will not be able to disentangle the separate effects of ability and schooling
on earnings.

Imagine that we compared years of schooling with earnings
without adjusting for the sample selection problem. Individuals with more
schooling are on average smarter, so they also earn more because of their ability.
We would find a statistically very strong effect of an extra year of schooling
on earnings, but in fact only a fraction of that would actually be due
to schooling, while the rest would be due to unmeasured ability differences.

To deal with this we would like to conduct some form
of experiment.
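The overstatement described above can be illustrated with a small simulation. This is only a sketch under assumed numbers, not an estimate from real data: the function name simulate_slope, the true return to schooling of 1.0, the ability effect of 2.0, and the selection strength of 2.0 are all hypothetical choices made for illustration. With self-selection, the naive slope of earnings (Y) on schooling (X) should come out well above the true return. When schooling is set independently of ability (Z), as an experiment would do, the slope should be close to the true return.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


def simulate_slope(selection, n=20000, true_return=1.0,
                   ability_effect=2.0, seed=0):
    """Regress simulated earnings on schooling and return the slope.

    All parameter values are illustrative assumptions.
    selection > 0: high-ability individuals choose more schooling.
    selection = 0: schooling is unrelated to ability (experiment-like).
    """
    rng = random.Random(seed)
    # Z: ability, which the analyst is assumed not to observe.
    ability = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # X: years of schooling, which depends on ability when selection > 0.
    schooling = [12.0 + selection * z + rng.gauss(0.0, 1.0) for z in ability]
    # Y: earnings, affected directly by both schooling and ability.
    earnings = [true_return * x + ability_effect * z + rng.gauss(0.0, 1.0)
                for x, z in zip(schooling, ability)]
    return ols_slope(schooling, earnings)


naive = simulate_slope(selection=2.0)         # self-selected schooling
experimental = simulate_slope(selection=0.0)  # schooling independent of ability
print(f"naive slope: {naive:.2f}")
print(f"experimental slope: {experimental:.2f}")
```

Under these assumed parameters, the naive slope in a large sample is 1 + 2 * 2 / 5 = 1.8. The extra 0.8 per year of schooling is the part of the apparent effect that is really due to ability.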