Using the hsb2 dataset, consider the correlation between science and math.
use http://www.gseis.ucla.edu/courses/data/hsb2
corr science math
(obs=200)
| science math
-------------+------------------
science | 1.0000
math | 0.6307 1.0000
The correlation of 0.63 may be satisfyingly large, but it is also somewhat misleading.
It would be tempting to interpret the correlation as reflecting the relationship between a measure
of ability in science and ability in math. The problem is that both the science and math tests
are standardized written tests so that general academic skills and intelligence are likely to
influence the results of both, leading to an inflated correlation. In addition to the inflated
correlation there is a more subtle problem that can arise when you try to use these test scores in
a regression analysis. Consider the following model:

    science = b0 + b1*math + b2*female + e

Because portions of the variability of both science and math are jointly determined by general academic skills and intelligence, there is a strong likelihood that math will be correlated with the error term (the residuals) in this model. Such a correlation violates one of the basic independence assumptions of OLS regression. Using the reading and writing scores as indicators of general academic skills and intelligence, we can test this possibility with the following commands.
regress math female read write
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 3, 196) = 72.33
Model | 9176.66954 3 3058.88985 Prob > F = 0.0000
Residual | 8289.12546 196 42.2914564 R-squared = 0.5254
-------------+------------------------------ Adj R-squared = 0.5181
Total | 17465.795 199 87.7678141 Root MSE = 6.5032
------------------------------------------------------------------------------
math | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | -2.023984 .991051 -2.04 0.042 -3.978476 -.0694912
read | .385395 .0581257 6.63 0.000 .270763 .500027
write | .3888326 .0649587 5.99 0.000 .2607249 .5169402
_cons | 13.09825 2.80151 4.68 0.000 7.573277 18.62322
------------------------------------------------------------------------------
predict resmath, resid
regress science math female resmath
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 3, 196) = 72.86
Model | 10284.6648 3 3428.22159 Prob > F = 0.0000
Residual | 9222.83523 196 47.0552818 R-squared = 0.5272
-------------+------------------------------ Adj R-squared = 0.5200
Total | 19507.50 199 98.0276382 Root MSE = 6.8597
------------------------------------------------------------------------------
science | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | 1.007845 .0716667 14.06 0.000 .8665077 1.149182
female | -1.978643 .9748578 -2.03 0.044 -3.9012 -.0560859
resmath | -.7255874 .103985 -6.98 0.000 -.9306604 -.5205144
_cons | -.1296216 3.861937 -0.03 0.973 -7.745907 7.486664
------------------------------------------------------------------------------
The significant resmath coefficient indicates that there is a problem with using
math as a predictor of science. In a traditional linear regression model,
the response variable is considered to be endogenous and the predictors to be exogenous. An endogenous variable is a variable whose variation is explained by exogenous variables or by other endogenous variables in the model; an exogenous variable is one whose variability is determined by variables outside of the model.
When one or more of the predictor variables are endogenous, we encounter the problem of the variable being correlated with the error term (the residuals). The test of resmath above can be viewed as a test of the endogeneity of math, or more specifically, as a test of whether the OLS estimates in the model are consistent.
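The logic of this residual-inclusion test can be sketched outside of Stata as well. Here is a minimal simulation in Python (numpy only); the data-generating process and variable names are hypothetical stand-ins, not the hsb2 data. An unobserved common factor drives both the predictor and the outcome, and including the first-stage residuals in the outcome regression reveals the endogeneity:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical data-generating process (not the hsb2 data): an
# unobserved factor, "ability", drives both the predictor x (think
# math) and the outcome y (think science), so x is endogenous.
ability = rng.normal(size=n)
z = rng.normal(size=n)                 # exogenous variable (think read/write)
x = 0.8 * z + ability + rng.normal(size=n)
y = 1.0 * x + ability + rng.normal(size=n)

def ols(y, X):
    """OLS coefficients; X must already include a constant column."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Stage 1: regress the suspect predictor on the exogenous variable and
# keep the residuals -- the analogue of `predict resmath, resid`.
X1 = np.column_stack([const, z])
resx = x - X1 @ ols(x, X1)

# Augmented regression: y on x plus the first-stage residuals -- the
# analogue of `regress science math female resmath`.
X2 = np.column_stack([const, x, resx])
b = ols(y, X2)

# Because x is endogenous, the coefficient on the residuals (b[2]) is
# clearly nonzero; with an exogenous x it would be near zero.
print(b)
```

A significant residual coefficient in this augmented regression signals, as in the Stata output above, that the OLS estimates would be inconsistent.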
The ivreg command (instrumental variables regression, or two-stage least squares; 2SLS) is designed to be used in situations in which one or more predictors are endogenous. In essence, ivreg simultaneously estimates two equations:

    science = b0 + b1*math + b2*female + e1
    math    = g0 + g1*read + g2*write + g3*female + e2

The ivreg command for our example looks like this,
ivreg science female (math = read write)
Instrumental variables (2SLS) regression
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 2, 197) = 69.77
Model | 5920.63012 2 2960.31506 Prob > F = 0.0000
Residual | 13586.8699 197 68.9688827 R-squared = 0.3035
-------------+------------------------------ Adj R-squared = 0.2964
Total | 19507.50 199 98.0276382 Root MSE = 8.3048
------------------------------------------------------------------------------
science | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | 1.007845 .0867641 11.62 0.000 .836739 1.17895
female | -1.978643 1.180222 -1.68 0.095 -4.306134 .3488478
_cons | -.1296216 4.675495 -0.03 0.978 -9.350068 9.090824
------------------------------------------------------------------------------
Instrumented: math
Instruments: female read write
------------------------------------------------------------------------------
predict p1
estimates store ivreg
Next, we can use Stata's hausman command to test whether the differences between the ivreg
and OLS estimates are large enough to suggest that the OLS estimates are not consistent.
regress science math female
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 2, 197) = 68.38
Model | 7993.54995 2 3996.77498 Prob > F = 0.0000
Residual | 11513.95 197 58.4464469 R-squared = 0.4098
-------------+------------------------------ Adj R-squared = 0.4038
Total | 19507.50 199 98.0276382 Root MSE = 7.645
------------------------------------------------------------------------------
science | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | .6631901 .0578724 11.46 0.000 .549061 .7773191
female | -2.168396 1.086043 -2.00 0.047 -4.310159 -.026633
_cons | 18.11813 3.167133 5.72 0.000 11.8723 24.36397
------------------------------------------------------------------------------
predict p2
hausman ivreg . , constant sigmamore
---- Coefficients ----
| (b) (B) (b-B) sqrt(diag(V_b-V_B))
| ivreg . Difference S.E.
-------------+----------------------------------------------------------------
math | 1.007845 .6631901 .3446546 .0550478
female | -1.978643 -2.168396 .1897529 .0303071
_cons | -.1296216 18.11813 -18.24776 2.914507
------------------------------------------------------------------------------
b = consistent under Ho and Ha; obtained from ivreg
B = inconsistent under Ha, efficient under Ho; obtained from regress
Test: Ho: difference in coefficients not systematic
chi2(1) = (b-B)'[(V_b-V_B)^(-1)](b-B)
= 39.20
Prob>chi2 = 0.0000
Sure enough, there is a significant (chi-square = 39.2, df = 1, p = 0.0000) difference between
the ivreg and OLS coefficients, indicating clearly that OLS is an inconsistent
estimator in this equation. The conclusion is that the inconsistency is due to the endogeneity of math. The R-squared for the OLS model is much higher than the R-squared for the ivreg model, but this is because both science and math are correlated with the exogenous variables read and write.
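The statistic hausman reports is just the quadratic form shown in the output, (b-B)'[(V_b-V_B)^(-1)](b-B). A minimal sketch of that computation in Python follows; the coefficient vectors and covariance matrices below are made up for illustration, not taken from the output above:

```python
import numpy as np

# Hypothetical estimates: b from the consistent (IV) estimator,
# B from the estimator that is efficient under H0 (OLS).
b = np.array([1.01, -1.98])
B = np.array([0.66, -2.17])

# Hypothetical covariance matrices of the two estimators (made up;
# V_b - V_B must be positive definite for the statistic to be valid).
V_b = np.array([[0.0075, 0.0010],
                [0.0010, 1.40]])
V_B = np.array([[0.0033, 0.0005],
                [0.0005, 1.18]])

# Hausman statistic: (b-B)' [V_b - V_B]^(-1) (b-B)
d = b - B
chi2 = d @ np.linalg.solve(V_b - V_B, d)
print(chi2)
```

A large value of this statistic, referred to a chi-square distribution, leads to rejecting the hypothesis that the coefficient differences are not systematic.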
If we wanted to represent this model graphically as a path diagram, we would draw the exogenous variables (read, write and female) as squares and the endogenous variables (math and science) as circles.
Let's look at the variable science and the two predicted values, p1 from the ivreg model and p2 from the OLS model.
summarize science p1 p2
Variable | Obs Mean Std. Dev. Min Max
-------------+-----------------------------------------------------
science | 200 51.85 9.900891 26 74
p1 | 200 51.85 9.522247 31.15061 75.45872
p2 | 200 51.85 6.33787 37.83501 67.85739
corr science p1 p2
(obs=200)
| science p1 p2
-------------+---------------------------
science | 1.0000
p1 | 0.6387 1.0000
p2 | 0.6401 0.9977 1.0000
Finally, let's see how close we can come to the ivreg results by doing our own two-stage regression.
regress math read write female
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 3, 196) = 72.33
Model | 9176.66954 3 3058.88985 Prob > F = 0.0000
Residual | 8289.12546 196 42.2914564 R-squared = 0.5254
-------------+------------------------------ Adj R-squared = 0.5181
Total | 17465.795 199 87.7678141 Root MSE = 6.5032
------------------------------------------------------------------------------
math | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | .385395 .0581257 6.63 0.000 .270763 .500027
write | .3888326 .0649587 5.99 0.000 .2607249 .5169402
female | -2.023984 .991051 -2.04 0.042 -3.978476 -.0694912
_cons | 13.09825 2.80151 4.68 0.000 7.573277 18.62322
------------------------------------------------------------------------------
predict pmath
regress science pmath female
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 2, 197) = 95.92
Model | 9624.27766 2 4812.13883 Prob > F = 0.0000
Residual | 9883.22234 197 50.1686413 R-squared = 0.4934
-------------+------------------------------ Adj R-squared = 0.4882
Total | 19507.50 199 98.0276382 Root MSE = 7.083
------------------------------------------------------------------------------
science | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pmath | 1.007845 .0739996 13.62 0.000 .8619116 1.153778
female | -1.978643 1.006591 -1.97 0.051 -3.963721 .0064348
_cons | -.1296241 3.987652 -0.03 0.974 -7.993588 7.73434
------------------------------------------------------------------------------
/* these are the ivreg results from above */
Instrumental variables (2SLS) regression
Source | SS df MS Number of obs = 200
-------------+------------------------------ F( 2, 197) = 69.77
Model | 5920.63012 2 2960.31506 Prob > F = 0.0000
Residual | 13586.8699 197 68.9688827 R-squared = 0.3035
-------------+------------------------------ Adj R-squared = 0.2964
Total | 19507.50 199 98.0276382 Root MSE = 8.3048
------------------------------------------------------------------------------
science | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
math | 1.007845 .0867641 11.62 0.000 .836739 1.17895
female | -1.978643 1.180222 -1.68 0.095 -4.306134 .3488478
_cons | -.1296216 4.675495 -0.03 0.978 -9.350068 9.090824
------------------------------------------------------------------------------
Instrumented: math
Instruments: female read write
------------------------------------------------------------------------------
In the first regression, we regressed the endogenous predictor (math) on the three exogenous
variables. In the second regression, we used the predicted values of math (pmath) in
place of math itself. Note that the coefficients from the second regression
and from ivreg are the same, but that the standard errors are different. One final note: it is also possible to estimate this system of equations using three-stage least squares (3SLS). Stata's reg3 command can perform either 2SLS (equivalent to ivreg) or 3SLS, and it clearly illustrates the two-equation nature of the problem.
reg3 (science = math female)(math = read write female), 2sls
Two-stage least-squares regression
----------------------------------------------------------------------
Equation Obs Parms RMSE "R-sq" F-Stat P
----------------------------------------------------------------------
science 200 2 8.304751 0.3035 69.7726 0.0000
math 200 3 6.503188 0.5254 72.32879 0.0000
----------------------------------------------------------------------
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
science |
math | 1.007845 .0867641 11.62 0.000 .8372648 1.178424
female | -1.978643 1.180222 -1.68 0.094 -4.298981 .3416951
_cons | -.1296216 4.675495 -0.03 0.978 -9.321732 9.062489
-------------+----------------------------------------------------------------
math |
read | .385395 .0581257 6.63 0.000 .2711189 .4996712
write | .3888326 .0649587 5.99 0.000 .2611226 .5165425
female | -2.023984 .991051 -2.04 0.042 -3.972408 -.075559
_cons | 13.09825 2.80151 4.68 0.000 7.590429 18.60607
------------------------------------------------------------------------------
Endogenous variables: science math
Exogenous variables: female read write
------------------------------------------------------------------------------
reg3 (science = math female)(math = read write female)
Three-stage least squares regression
----------------------------------------------------------------------
Equation Obs Parms RMSE "R-sq" chi2 P
----------------------------------------------------------------------
science 200 2 8.24223 0.3035 141.6703 0.0000
math 200 3 6.438234 0.5253 221.4518 0.0000
----------------------------------------------------------------------
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
science |
math | 1.007845 .0861109 11.70 0.000 .8390704 1.176619
female | -1.978643 1.171337 -1.69 0.091 -4.274421 .3171349
_cons | -.1296216 4.640297 -0.03 0.978 -9.224436 8.965192
-------------+----------------------------------------------------------------
math |
read | .3772331 .0496322 7.60 0.000 .2799557 .4745104
write | .3981663 .0550155 7.24 0.000 .290338 .5059947
female | -2.078337 .9617418 -2.16 0.031 -3.963316 -.1933579
_cons | 13.06158 2.770268 4.71 0.000 7.631958 18.49121
------------------------------------------------------------------------------
Endogenous variables: science math
Exogenous variables: female read write
------------------------------------------------------------------------------
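As a cross-check on the two-stage logic above, here is a minimal simulation in Python (numpy only). The variables are synthetic stand-ins for the hsb2 data: z1 and z2 play the role of read and write, w of female, x of math, and y of science. It shows that running the two stages by hand reproduces the 2SLS coefficients exactly, since the direct 2SLS formula is algebraically the same regression:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-ins for the hsb2 variables, with an unobserved
# "ability" factor making x endogenous in the y-equation.
ability = rng.normal(size=n)
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
w = rng.integers(0, 2, size=n).astype(float)
x = 0.4 * z1 + 0.4 * z2 - 0.5 * w + ability + rng.normal(size=n)
y = 1.0 * x - 2.0 * w + ability + rng.normal(size=n)

const = np.ones(n)
Z = np.column_stack([const, w, z1, z2])   # all exogenous variables
X = np.column_stack([const, w, x])        # second-stage design matrix

def ols(y, X):
    """OLS coefficients; X must already include a constant column."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Manual two stages, mirroring `regress math read write female`
# followed by `regress science pmath female`.
xhat = Z @ ols(x, Z)
b_manual = ols(y, np.column_stack([const, w, xhat]))

# Direct 2SLS: project X onto the column space of the exogenous
# variables and solve (X'PX)^(-1) X'Py.
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)
b_2sls = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)

# The coefficient vectors are identical; only the standard errors
# differ, because the manual second stage uses the wrong residuals.
print(np.allclose(b_manual, b_2sls))
```

This is exactly the pattern seen above: the manual two-stage coefficients match ivreg's, while the reported standard errors (and hence the t-statistics) do not.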
Categorical Data Analysis Course
Phil Ender