Shrinkage
Estimating Shrinkage

Cross Validation
Double Cross Validation
Stata Cross Validation Example
We will begin by looking at the 1999 API data for 67 Orange County high schools and for 226 Los Angeles County high schools.
use http://www.philender.com/courses/data/ochi, clear
regress api99 pctmeal pctel yrrnd core avged pctemer
Source | SS df MS Number of obs = 67
-------------+------------------------------ F( 6, 60) = 77.36
Model | 846646.429 6 141107.738 Prob > F = 0.0000
Residual | 109446.288 60 1824.1048 R-squared = 0.8855
-------------+------------------------------ Adj R-squared = 0.8741
Total | 956092.716 66 14486.2533 Root MSE = 42.71
------------------------------------------------------------------------------
api99 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pctmeal | .6088963 .3348236 1.82 0.074 -.0608507 1.278643
pctel | -3.372816 .7189063 -4.69 0.000 -4.810842 -1.934789
yrrnd | -30.10476 24.75397 -1.22 0.229 -79.62008 19.41056
core | -4.97981 3.339713 -1.49 0.141 -11.66023 1.700611
avged | 69.37089 15.76158 4.40 0.000 37.84305 100.8987
pctemer | -.1026734 .8709507 -0.12 0.907 -1.844834 1.639487
_cons | 693.2872 105.3493 6.58 0.000 482.5573 904.0171
------------------------------------------------------------------------------
use http://www.philender.com/courses/data/lahi, clear
regress api99 pctmeal pctel yrrnd core avged pctemer
Source | SS df MS Number of obs = 226
-------------+------------------------------ F( 6, 219) = 229.23
Model | 3408806.48 6 568134.413 Prob > F = 0.0000
Residual | 542769.29 219 2478.39858 R-squared = 0.8626
-------------+------------------------------ Adj R-squared = 0.8589
Total | 3951575.77 225 17562.559 Root MSE = 49.784
------------------------------------------------------------------------------
api99 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pctmeal | -.4042542 .1405528 -2.88 0.004 -.6812634 -.127245
pctel | -.6719812 .3342255 -2.01 0.046 -1.330691 -.013271
yrrnd | -31.0319 11.23642 -2.76 0.006 -53.17725 -8.886547
core | -.9548313 1.404674 -0.68 0.497 -3.723241 1.813579
avged | 132.6249 9.269121 14.31 0.000 114.3568 150.893
pctemer | -1.967988 .3124097 -6.30 0.000 -2.583702 -1.352274
_cons | 329.3452 58.39626 5.64 0.000 214.2546 444.4358
------------------------------------------------------------------------------
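The Adj R-squared in each table is itself a formula-based shrinkage estimate. The sketch below recomputes it by hand for the Orange County model from results that regress stores in e(), assuming the usual adjustment 1 - (1 - R2)(n - 1)/(n - k - 1). The scalar name adj_by_hand is made up for illustration.

```stata
* Refit the Orange County model quietly so that e() holds its results
use http://www.philender.com/courses/data/ochi, clear
quietly regress api99 pctmeal pctel yrrnd core avged pctemer

* e(df_r) is the residual degrees of freedom, n - k - 1
scalar adj_by_hand = 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)

* The two displayed values should agree
display "adjusted R2 by hand   = " %6.4f adj_by_hand
display "adjusted R2 from e()  = " %6.4f e(r2_a)
```

With the values reported above (R-squared = 0.8855, n = 67, residual df = 60) this works out to about 0.874, in line with the Adj R-squared shown in the table.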
We will demonstrate cross validation by starting with the Orange County data.
use http://www.philender.com/courses/data/ochi, clear
regress api99 pctmeal pctel yrrnd core avged pctemer
Source | SS df MS Number of obs = 67
---------+------------------------------ F( 6, 60) = 77.36
Model | 846646.429 6 141107.738 Prob > F = 0.0000
Residual | 109446.288 60 1824.1048 R-squared = 0.8855
---------+------------------------------ Adj R-squared = 0.8741
Total | 956092.716 66 14486.2533 Root MSE = 42.71
------------------------------------------------------------------------------
api99 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
pctmeal | .6088963 .3348236 1.819 0.074 -.0608506 1.278643
pctel | -3.372816 .7189063 -4.692 0.000 -4.810842 -1.934789
yrrnd | -30.10476 24.75397 -1.216 0.229 -79.62008 19.41056
core | -4.97981 3.339713 -1.491 0.141 -11.66023 1.70061
avged | 69.37089 15.76158 4.401 0.000 37.84305 100.8987
pctemer | -.1026734 .8709507 -0.118 0.907 -1.844834 1.639487
_cons | 693.2872 105.3493 6.581 0.000 482.5573 904.0171
------------------------------------------------------------------------------
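Next we will load the Los Angeles County data. This dataset, lahi, has the same variables as the first dataset, ochi. Because the Orange County estimates are still in memory, predict applies the Orange County coefficients to the Los Angeles schools.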
use http://www.philender.com/courses/data/lahi, clear
predict p1
(option xb assumed; fitted values)
(10 missing values generated)
corr api99 p1
(obs=226)
| api99 p1
---------+------------------
api99 | 1.0000
p1 | 0.8522 1.0000
regress api99 p1
Source | SS df MS Number of obs = 226
-------------+------------------------------ F( 1, 224) = 594.14
Model | 2869665.82 1 2869665.82 Prob > F = 0.0000
Residual | 1081909.95 224 4829.95512 R-squared = 0.7262
-------------+------------------------------ Adj R-squared = 0.7250
Total | 3951575.77 225 17562.559 Root MSE = 69.498
------------------------------------------------------------------------------
api99 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
p1 | 1.236506 .0507285 24.37 0.000 1.13654 1.336472
_cons | -247.279 34.42901 -7.18 0.000 -315.1252 -179.4328
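------------------------------------------------------------------------------
Note that the cross-validated R2 of .7262 in the Los Angeles sample (the square of the correlation of .8522 between api99 and p1) is much lower than the R2 of .8855 from our original Orange County regression.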
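The comparison can also be computed directly rather than read off the two tables. The following is a minimal sketch, assuming the same two datasets and model as above. The scalar names r2_screen, r2_cross and shrink are made up for illustration.

```stata
* Fit the model on the Orange County sample and save its R-squared
use http://www.philender.com/courses/data/ochi, clear
quietly regress api99 pctmeal pctel yrrnd core avged pctemer
scalar r2_screen = e(r2)

* Load the Los Angeles sample; the Orange County estimates stay in memory
use http://www.philender.com/courses/data/lahi, clear
predict p1

* Cross-validated R-squared is the squared correlation of observed and predicted
quietly corr api99 p1
scalar r2_cross = r(rho)^2

* Shrinkage is the drop in R-squared when the equation is applied to new data
scalar shrink = r2_screen - r2_cross
display "original R2        = " %6.4f r2_screen
display "cross-validated R2 = " %6.4f r2_cross
display "shrinkage          = " %6.4f shrink
```

From the values reported above, the shrinkage is .8855 - .7262, or about .159.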
Stata Cross Validation Method 2
use http://www.philender.com/courses/data/ochi, clear
count
71
append using http://www.philender.com/courses/data/lahi
(label yn already defined)
count
307
generate sample=1 in 1/71
(236 missing values generated)
replace sample=2 in 72/l
(236 real changes made)
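Building the sample indicator from row numbers depends on knowing that the first 71 observations come from ochi. An alternative sketch, assuming a Stata version (11 or later) in which append supports the generate() option, lets Stata mark the source of each observation. The variable name source is made up for illustration.

```stata
* Start from the Orange County data (the master dataset)
use http://www.philender.com/courses/data/ochi, clear

* generate(source) marks master observations with 0 and appended ones with 1
append using http://www.philender.com/courses/data/lahi, generate(source)

* Recode to the 1/2 coding used in this example
generate sample = source + 1
tabulate sample
```

Given the counts shown above, tabulate should report 71 observations with sample equal to 1 and 236 with sample equal to 2.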
regress api99 pctmeal pctel yrrnd core avged pctemer if sample==1
Source | SS df MS Number of obs = 67
-------------+------------------------------ F( 6, 60) = 77.36
Model | 846646.429 6 141107.738 Prob > F = 0.0000
Residual | 109446.288 60 1824.1048 R-squared = 0.8855
-------------+------------------------------ Adj R-squared = 0.8741
Total | 956092.716 66 14486.2533 Root MSE = 42.71
------------------------------------------------------------------------------
api99 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pctmeal | .6088963 .3348236 1.82 0.074 -.0608507 1.278643
pctel | -3.372816 .7189063 -4.69 0.000 -4.810842 -1.934789
yrrnd | -30.10476 24.75397 -1.22 0.229 -79.62008 19.41056
core | -4.97981 3.339713 -1.49 0.141 -11.66023 1.700611
avged | 69.37089 15.76158 4.40 0.000 37.84305 100.8987
pctemer | -.1026734 .8709507 -0.12 0.907 -1.844834 1.639487
_cons | 693.2872 105.3493 6.58 0.000 482.5573 904.0171
------------------------------------------------------------------------------
predict pre
(option xb assumed; fitted values)
(14 missing values generated)
regress api99 pre if sample==2
Source | SS df MS Number of obs = 226
-------------+------------------------------ F( 1, 224) = 594.14
Model | 2869665.82 1 2869665.82 Prob > F = 0.0000
Residual | 1081909.95 224 4829.95512 R-squared = 0.7262
-------------+------------------------------ Adj R-squared = 0.7250
Total | 3951575.77 225 17562.559 Root MSE = 69.498
------------------------------------------------------------------------------
api99 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pre | 1.236506 .0507285 24.37 0.000 1.13654 1.336472
_cons | -247.279 34.42901 -7.18 0.000 -315.1252 -179.4328
------------------------------------------------------------------------------
In this cross validation the R2 has decreased from .8855 to .7262, the same result as with the first method.
Using regvalidate on the combined sample
The program regvalidate (findit regvalidate) uses resampling methods within a single sample to assess how well a model validates. We will demonstrate its use on the combined Los Angeles and Orange County samples, which are still in memory from the previous example.
regress api99 pctmeal pctel yrrnd core avged pctemer
Source | SS df MS Number of obs = 293
-------------+------------------------------ F( 6, 286) = 277.36
Model | 4538777.22 6 756462.87 Prob > F = 0.0000
Residual | 780037.386 286 2727.40345 R-squared = 0.8533
-------------+------------------------------ Adj R-squared = 0.8503
Total | 5318814.61 292 18215.1185 Root MSE = 52.225
------------------------------------------------------------------------------
api99 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
pctmeal | -.5078381 .1305415 -3.89 0.000 -.7647822 -.2508941
pctel | -.5148149 .3020057 -1.70 0.089 -1.109251 .0796209
yrrnd | -35.90818 10.89156 -3.30 0.001 -57.34596 -14.47041
core | -1.813285 1.368788 -1.32 0.186 -4.507461 .8808907
avged | 123.0551 8.487517 14.50 0.000 106.3492 139.761
pctemer | -2.53015 .2807462 -9.01 0.000 -3.08274 -1.977559
_cons | 399.1887 54.36759 7.34 0.000 292.1773 506.2001
------------------------------------------------------------------------------
regvalidate, reps(200)
original sample size = 293 reps = 200
regression model: regress api99 pctmeal pctel yrrnd core avged pctemer
orig train test diff orig adj
R-squared 0.8533 0.8585 0.8416 0.0169 0.8365
rss/n 2662.2436 2558.9878 2849.9416 -290.9538 2953.1974
fit slope 1.0000 1.0000 0.9876 0.0124 0.9876
fit _cons 0.0000 -0.0000 7.8432 -7.8432 7.8432
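The columns of this table appear to be linked by simple arithmetic: diff is train minus test, and orig adj is orig minus diff. The sketch below checks this for the R-squared row using the rounded values copied from the table. It is a reading of the output, not a description of how regvalidate works internally, and the scalar names are made up for illustration.

```stata
* Values copied from the R-squared row of the regvalidate table
scalar r2_orig  = 0.8533
scalar r2_train = 0.8585
scalar r2_test  = 0.8416

* diff: average drop from the training fit to the test fit
scalar r2_diff = r2_train - r2_test

* orig adj: original R-squared reduced by that drop
scalar r2_adj = r2_orig - r2_diff

display "diff     = " %6.4f r2_diff
display "orig adj = " %6.4f r2_adj
```

The rounded inputs give a diff of .0169 and an adjusted value of .8364. The table shows .8365, presumably because regvalidate works from unrounded values. The same arithmetic reproduces the rss/n row: 2662.2436 - (-290.9538) = 2953.1974.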
Linear Statistical Models Course
Phil Ender, 16oct10, 29Jan98