Consider a model in which we try to predict women's wages from their education and age. We have an artificially constructed sample of 2,000 women, but we have wage data for only 1,343 of them. The remaining 657 women were not working and so did not receive wages. We will start with a simple-minded model in which we estimate the regression using only the observations that have wage data.
First Try
use http://www.gseis.ucla.edu/courses/data/wages
univar wage education age
-------------- Quantiles --------------
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
wage 1343 23.69 6.31 5.88 19.31 23.51 28.05 45.81
education 2000 13.08 3.05 10.00 10.00 12.00 16.00 20.00
age 2000 36.21 8.29 20.00 30.00 36.00 42.00 59.00
-------------------------------------------------------------------------------
regress wage education age
Source | SS df MS Number of obs = 1343
-------------+------------------------------ F( 2, 1340) = 227.49
Model | 13524.0337 2 6762.01687 Prob > F = 0.0000
Residual | 39830.8609 1340 29.7245231 R-squared = 0.2535
-------------+------------------------------ Adj R-squared = 0.2524
Total | 53354.8946 1342 39.7577456 Root MSE = 5.452
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .8965829 .0498061 18.00 0.000 .7988765 .9942893
age | .1465739 .0187135 7.83 0.000 .109863 .1832848
_cons | 6.084875 .8896182 6.84 0.000 4.339679 7.830071
------------------------------------------------------------------------------
predict pwage
This analysis would be fine if, in fact, the missing wage data were missing completely at random. However, the decision to work or not was made by each individual woman. Thus, those who were not working constitute a self-selected sample, not a random sample. It is likely that some of the women who would earn low wages chose not to work, and this would account for much of the missing wage data. Thus, it is likely that we will overestimate the wages of the women in the population.
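To see why self-selection pushes the observed mean upward, here is a hypothetical simulation (in Python, not the course data) in which the decision to work depends partly on the wage itself:

```python
# Hypothetical illustration of selection bias (not the course data):
# when selection into the sample depends on the outcome, the mean of
# the observed wages overstates the mean of all wages.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
education = rng.uniform(10, 20, size=n)
wage = 5 + 1.0 * education + rng.normal(0, 5, size=n)  # latent wages, all women

# A woman works (so her wage is observed) when her wage plus an
# idiosyncratic taste for work clears a threshold.
works = wage + rng.normal(0, 5, size=n) > 20

print(wage.mean(), wage[works].mean())  # observed mean exceeds full-sample mean
```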
So, somehow, we need to account for the information that we have on the non-working women. Maybe we could replace the missing values with zeros. The variable wage0 does the trick.
Second Try
univar wage0
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
wage0 2000 15.91 12.27 0.00 0.00 19.39 25.77 45.81
-------------------------------------------------------------------------------
regress wage0 education age
Source | SS df MS Number of obs = 2000
-------------+------------------------------ F( 2, 1997) = 208.32
Model | 51956.6949 2 25978.3475 Prob > F = 0.0000
Residual | 249038.262 1997 124.70619 R-squared = 0.1726
-------------+------------------------------ Adj R-squared = 0.1718
Total | 300994.957 1999 150.572765 Root MSE = 11.167
------------------------------------------------------------------------------
wage0 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | 1.064572 .0844208 12.61 0.000 .8990101 1.230134
age | .3907662 .0310308 12.59 0.000 .3299101 .4516223
_cons | -12.16843 1.398146 -8.70 0.000 -14.91041 -9.426456
------------------------------------------------------------------------------
predict pwage0
This analysis is also troubling. It is true that we are using data from all 2,000 women, but zero is not a fair estimate of what the women would have earned had they chosen to work. It is likely that this model will underestimate the wages of women in the population. The solution to our quandary is to use the Heckman selection model (Gronau 1974; Lewis 1974; Heckman 1976). The Heckman selection model is a two-equation model. First, there is the regression equation, y = xb + u1, whose response is observed only for a selected subset of the sample. Second, there is the selection equation: y is observed when wg + u2 > 0, where u1 and u2 are bivariate normal with correlation rho. In our example, we have one model predicting wages and one model predicting whether a woman will be working. We will use married, children, education and age to predict selection. Check out this probit example.
generate s=wage~=.
tab s
s | Freq. Percent Cum.
------------+-----------------------------------
0 | 657 32.85 32.85
1 | 1343 67.15 100.00
------------+-----------------------------------
Total | 2000 100.00
probit s married children education age
Probit estimates Number of obs = 2000
LR chi2(4) = 478.32
Prob > chi2 = 0.0000
Log likelihood = -1027.0616 Pseudo R2 = 0.1889
------------------------------------------------------------------------------
s | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
married | .4308575 .074208 5.81 0.000 .2854125 .5763025
children | .4473249 .0287417 15.56 0.000 .3909922 .5036576
education | .0583645 .0109742 5.32 0.000 .0368555 .0798735
age | .0347211 .0042293 8.21 0.000 .0264318 .0430105
_cons | -2.467365 .1925635 -12.81 0.000 -2.844782 -2.089948
------------------------------------------------------------------------------
Now we are ready to try the full Heckman selection model.
Third Time's a Charm
heckman wage education age, select(married children education age)
/* can also be written as
heckman wage education age, select(s=married children education age) */
Heckman selection model Number of obs = 2000
(regression model with sample selection) Censored obs = 657
Uncensored obs = 1343
Wald chi2(2) = 508.44
Log likelihood = -5178.304 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage |
education | .9899537 .0532565 18.59 0.000 .8855729 1.094334
age | .2131294 .0206031 10.34 0.000 .1727481 .2535108
_cons | .4857752 1.077037 0.45 0.652 -1.625179 2.59673
-------------+----------------------------------------------------------------
select |
married | .4451721 .0673954 6.61 0.000 .3130794 .5772647
children | .4387068 .0277828 15.79 0.000 .3842534 .4931601
education | .0557318 .0107349 5.19 0.000 .0346917 .0767718
age | .0365098 .0041533 8.79 0.000 .0283694 .0446502
_cons | -2.491015 .1893402 -13.16 0.000 -2.862115 -2.119915
-------------+----------------------------------------------------------------
/athrho | .8742086 .1014225 8.62 0.000 .6754241 1.072993
/lnsigma | 1.792559 .027598 64.95 0.000 1.738468 1.84665
-------------+----------------------------------------------------------------
rho | .7035061 .0512264 .5885365 .7905862
sigma | 6.004797 .1657202 5.68862 6.338548
lambda | 4.224412 .3992265 3.441942 5.006881
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 61.20 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
predict pheckman
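heckman reports its ancillary parameters on transformed scales, /athrho and /lnsigma; the rho, sigma and lambda rows are obtained by back-transforming. A quick Python check, plugging in the values reported above:

```python
# Recover rho, sigma and lambda from heckman's transformed
# ancillary parameters (values taken from the output above).
import math

athrho = 0.8742086    # /athrho = arctanh(rho)
lnsigma = 1.792559    # /lnsigma = log(sigma)

rho = math.tanh(athrho)     # correlation of the two equations' residuals
sigma = math.exp(lnsigma)   # std. error of the wage-equation residuals
lam = rho * sigma           # lambda = rho * sigma

print(rho, sigma, lam)      # close to .7035061, 6.004797, 4.224412
```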
In addition to the two equations, heckman estimates rho (actually the inverse hyperbolic tangent of rho), the correlation of the residuals in the two equations, and sigma (actually the log of sigma), the standard error of the residuals of the wage equation. Lambda is rho*sigma. The output also includes a likelihood ratio test of rho = 0.
Recall that it was stated at the beginning that this dataset was artificially constructed. As it turns out, we do have full wage information on all 2,000 women. The variable wagefull contains the complete wage data. We can therefore run a regression using the full wage information as a comparison.
regress wagefull education age
Source | SS df MS Number of obs = 2000
-------------+------------------------------ F( 2, 1997) = 398.82
Model | 28053.371 2 14026.6855 Prob > F = 0.0000
Residual | 70234.8124 1997 35.1701614 R-squared = 0.2854
-------------+------------------------------ Adj R-squared = 0.2847
Total | 98288.1834 1999 49.168676 Root MSE = 5.9304
------------------------------------------------------------------------------
wagefull | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | 1.004456 .0448325 22.40 0.000 .9165328 1.092379
age | .1874822 .0164792 11.38 0.000 .155164 .2198004
_cons | 1.381099 .7424989 1.86 0.063 -.0750544 2.837253
------------------------------------------------------------------------------
predict pfull
If we compare (see below) the predicted wages from the first model (omit missing), the second model (substitute zero for missing) and the heckman model to the complete wage and predicted full wage values, we note the following:
univar pwage pwage0 pheckman wagefull pfull
-------------- Quantiles --------------
Variable n Mean S.D. Min .25 Mdn .75 Max
-------------------------------------------------------------------------------
pwage 2000 23.12 3.24 17.98 20.36 22.56 25.71 32.66
pwage0 2000 15.91 5.10 6.29 11.76 15.95 19.36 32.18
pheckman 2000 21.16 3.84 14.65 18.06 20.83 24.00 32.86
wagefull 2000 21.31 7.01 -1.68 16.46 21.18 26.14 45.81
pfull 2000 21.31 3.75 15.18 18.18 20.77 24.20 32.53
-------------------------------------------------------------------------------
Two-Stage Heckman Selection
It is possible to compute the Heckman selection model manually using a two-stage process. Recall the selection model from above, which we will now run with Stata's twostep option.
heckman wage education age, select(s = married children education age) twostep
Heckman selection model -- two-step estimates Number of obs = 2000
(regression model with sample selection) Censored obs = 657
Uncensored obs = 1343
Wald chi2(4) = 551.37
Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
wage |
education | .9825259 .0538821 18.23 0.000 .8769189 1.088133
age | .2118695 .0220511 9.61 0.000 .1686502 .2550888
_cons | .7340391 1.248331 0.59 0.557 -1.712645 3.180723
-------------+----------------------------------------------------------------
s |
married | .4308575 .074208 5.81 0.000 .2854125 .5763025
children | .4473249 .0287417 15.56 0.000 .3909922 .5036576
education | .0583645 .0109742 5.32 0.000 .0368555 .0798735
age | .0347211 .0042293 8.21 0.000 .0264318 .0430105
_cons | -2.467365 .1925635 -12.81 0.000 -2.844782 -2.089948
-------------+----------------------------------------------------------------
mills |
lambda | 4.001615 .6065388 6.60 0.000 2.812821 5.19041
-------------+----------------------------------------------------------------
rho | 0.67284
sigma | 5.9473529
lambda | 4.0016155 .6065388
------------------------------------------------------------------------------
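In the two-step table, rho is recovered as lambda/sigma; a quick check against the reported values (Python):

```python
# rho in the two-step table is simply lambda / sigma
# (values taken from the two-step output above).
lam = 4.0016155
sigma = 5.9473529

rho = lam / sigma
print(rho)  # close to the reported 0.67284
```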
We will begin with a probit model, then do some transformations to obtain the inverse Mills ratio, which is then included in a standard OLS regression.
probit s married children education age
Probit estimates Number of obs = 2000
LR chi2(4) = 478.32
Prob > chi2 = 0.0000
Log likelihood = -1027.0616 Pseudo R2 = 0.1889
------------------------------------------------------------------------------
s | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
married | .4308575 .074208 5.81 0.000 .2854125 .5763025
children | .4473249 .0287417 15.56 0.000 .3909922 .5036576
education | .0583645 .0109742 5.32 0.000 .0368555 .0798735
age | .0347211 .0042293 8.21 0.000 .0264318 .0430105
_cons | -2.467365 .1925635 -12.81 0.000 -2.844782 -2.089948
------------------------------------------------------------------------------
predict p1, xb
generate phi = (1/sqrt(2*_pi))*exp(-(p1^2/2)) /* standard normal density */
generate capphi = norm(p1)
generate invmills = phi/capphi
regress wage education age invmills
Source | SS df MS Number of obs = 1343
-------------+------------------------------ F( 3, 1339) = 173.01
Model | 14904.6806 3 4968.22688 Prob > F = 0.0000
Residual | 38450.214 1339 28.7156191 R-squared = 0.2793
-------------+------------------------------ Adj R-squared = 0.2777
Total | 53354.8946 1342 39.7577456 Root MSE = 5.3587
------------------------------------------------------------------------------
wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .9825259 .0504982 19.46 0.000 .8834616 1.08159
age | .2118695 .0206636 10.25 0.000 .171333 .252406
invmills | 4.001616 .5771027 6.93 0.000 2.869492 5.133739
_cons | .7340391 1.166214 0.63 0.529 -1.553766 3.021844
------------------------------------------------------------------------------
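The transformations above compute the inverse Mills ratio phi(xb)/Phi(xb), the standard normal density divided by its cumulative. A minimal stand-alone version in Python (stdlib only):

```python
# Inverse Mills ratio: standard normal density over standard normal cdf.
import math

def inv_mills(x):
    phi = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)  # pdf, as in 'generate phi'
    capphi = 0.5 * (1 + math.erf(x / math.sqrt(2)))      # cdf, as in 'generate capphi'
    return phi / capphi

print(inv_mills(0.0))  # phi(0)/Phi(0) = 0.3989.../0.5, about 0.7979
```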
Probit with Selection
Stata also includes another selection model, heckprob, which works in a manner very similar to heckman except that the response variable is binary. heckprob stands for Heckman probit estimation. We can illustrate heckprob using the same dataset, creating a binary response variable hw, for high wage.
summarize wage
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
wage | 1343 23.69217 6.305374 5.88497 45.80979
generate hw = wage>r(mean) if wage ~= .
(657 missing values generated)
tabulate hw
hw | Freq. Percent Cum.
------------+-----------------------------------
0 | 686 51.08 51.08
1 | 657 48.92 100.00
------------+-----------------------------------
Total | 1,343 100.00
tabulate hw, miss
hw | Freq. Percent Cum.
------------+-----------------------------------
0 | 686 34.30 34.30
1 | 657 32.85 67.15
. | 657 32.85 100.00
------------+-----------------------------------
Total | 2,000 100.00
We will begin just as we did in the heckman analysis, by analyzing hw for the 1,343 cases with complete data.
probit hw education age, nolog
Probit estimates Number of obs = 1343
LR chi2(2) = 246.68
Prob > chi2 = 0.0000
Log likelihood = -807.24513 Pseudo R2 = 0.1325
------------------------------------------------------------------------------
hw | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .1629735 .0126813 12.85 0.000 .1381186 .1878283
age | .0284745 .0046322 6.15 0.000 .0193956 .0375534
_cons | -3.29333 .2348306 -14.02 0.000 -3.753589 -2.83307
------------------------------------------------------------------------------
As before, this solution is less than satisfying because information from 657 individuals was left out: they self-selected out of the labor force.
Next, we will recode all of the missing values of hw to zero and try again.
gen hw0 = hw
(657 missing values generated)
replace hw0=0 if hw0 == .
(657 real changes made)
probit hw0 education age, nolog
Probit estimates Number of obs = 2000
LR chi2(2) = 366.87
Prob > chi2 = 0.0000
Log likelihood = -1082.7874 Pseudo R2 = 0.1449
------------------------------------------------------------------------------
hw0 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
education | .1436513 .0103329 13.90 0.000 .1233991 .1639035
age | .0392975 .0039473 9.96 0.000 .0315609 .0470341
_cons | -3.819543 .1952091 -19.57 0.000 -4.202146 -3.43694
------------------------------------------------------------------------------
Now we are using all of the observations, but by setting all of the missing values to zero we are implying that none of these observations would have been high wage had the individual chosen to work.
The solution, of course, is a Heckman selection model using heckprob.
heckprob hw education age, select(married children education age) nolog
Probit model with sample selection Number of obs = 2000
Censored obs = 657
Uncensored obs = 1343
Wald chi2(2) = 288.91
Log likelihood = -1817.402 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
hw |
education | .1615956 .012297 13.14 0.000 .1374939 .1856973
age | .0374595 .0043218 8.67 0.000 .0289889 .0459301
_cons | -3.913347 .2142349 -18.27 0.000 -4.33324 -3.493454
-------------+----------------------------------------------------------------
select |
married | .4455411 .0703079 6.34 0.000 .3077401 .5833421
children | .4443875 .0286673 15.50 0.000 .3882006 .5005744
education | .0569751 .0108367 5.26 0.000 .0357356 .0782145
age | .0347465 .0041812 8.31 0.000 .0265515 .0429414
_cons | -2.455993 .1908705 -12.87 0.000 -2.830092 -2.081894
-------------+----------------------------------------------------------------
/athrho | .9695628 .2283646 4.25 0.000 .5219765 1.417149
-------------+----------------------------------------------------------------
rho | .7485121 .1004187 .479224 .8890027
------------------------------------------------------------------------------
LR test of indep. eqns. (rho = 0): chi2(1) = 33.81 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
These results are not that different from those of the first probit model, but we can feel more confident about this analysis since it uses all of the information that is available.
Categorical Data Analysis Course
Phil Ender -- revised 3/23/05