The effect of scaling predictor variables can be easily demonstrated using the variable read in the hsbdemo dataset. We will begin with a model regressing write on female and read.
Example 1
use http://www.philender.com/courses/data/hsbdemo, clear
regress write female read
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
The coefficient for read (.57) gives the expected change in write for a one-unit
increase in read, holding female constant. The concern here is that a one-unit
change in read might not be terribly meaningful. Suppose that research has
indicated that a 12-point change in read is meaningful. Here is what you could do.
generate read12 = read/12
regress write female read12
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32128     2  3928.16064           Prob > F      =  0.0000
    Residual |  10022.5537   197  50.8759072           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
      read12 |   6.790643   .5926186    11.46   0.000     5.621953    7.959334
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
Now a one-unit change in read12 predicts a 6.8-point change in write with female
held constant. A one-point change in read12 is equivalent to a 12-point change in read.
Note that the standardized (beta) coefficients are identical for read and read12.
regress write female read, beta noheader
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000                 .2889851
        read |   .5658869   .0493849    11.46   0.000                 .6121169
       _cons |   20.22837   2.713756     7.45   0.000                        .
------------------------------------------------------------------------------
regress write female read12, beta noheader
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000                 .2889851
      read12 |   6.790643   .5926186    11.46   0.000                 .6121169
       _cons |   20.22837   2.713756     7.45   0.000                        .
------------------------------------------------------------------------------
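The algebra behind this is general: rescaling a predictor by a constant only rescales that predictor's raw coefficient, while the standardized beta (b times sd of x over sd of y) is unchanged. Here is a minimal numpy sketch on simulated data -- not the hsbdemo file; the variable names and effect sizes below are invented for illustration:

```python
import numpy as np

# Simulated stand-ins for the hsbdemo variables (values are made up).
rng = np.random.default_rng(0)
n = 200
female = rng.integers(0, 2, n).astype(float)
read = rng.normal(52, 10, n)
write = 20 + 5.5 * female + 0.57 * read + rng.normal(0, 7, n)

def ols(y, *xs):
    """OLS fit with an intercept; returns [intercept, slope1, slope2, ...]."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b = ols(write, female, read)          # raw read
b12 = ols(write, female, read / 12)   # read rescaled by 12

# Dividing read by 12 multiplies its coefficient by 12; the other
# coefficients are untouched.
assert np.allclose(b12[2], 12 * b[2])
assert np.allclose(b12[:2], b[:2])

# Standardized beta = b * sd(x) / sd(y) -- identical for read and read/12.
beta = b[2] * read.std() / write.std()
beta12 = b12[2] * (read / 12).std() / write.std()
assert np.allclose(beta, beta12)
```

The assertions hold exactly (up to floating point) because OLS coefficients scale inversely with any linear rescaling of a predictor.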
Example 2

Now, what if reading were a categorical variable? We will divide read into five categories. Please realize that I am not suggesting that you take a continuous variable and break it up into categories; the point is only to show the effect of scaling read as a categorical variable.
egen readcat = cut(read), group(5) icodes
tabulate readcat
    readcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         39       19.50       19.50
          1 |         16        8.00       27.50
          2 |         62       31.00       58.50
          3 |         37       18.50       77.00
          4 |         46       23.00      100.00
------------+-----------------------------------
      Total |        200      100.00
tabstat read, by(readcat)
Summary for variables: read
     by categories of: readcat

 readcat |      mean
---------+----------
       0 |  38.61538
       1 |     44.25
       2 |  49.22581
       3 |  57.13514
       4 |  66.65217
---------+----------
   Total |     52.23
--------------------
Let's run a regression with dummy coded readcat.
regress write female i.readcat
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  5,   194) =   28.02
       Model |  7497.74329     5  1499.54866           Prob > F      =  0.0000
    Residual |  10381.1317   194  53.5109882           R-squared     =  0.4194
-------------+------------------------------           Adj R-squared =  0.4044
       Total |   17878.875   199   89.843593           Root MSE      =  7.3151

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.714997     1.0442     5.47   0.000     3.655556    7.774438
             |
     readcat |
          1  |   2.237243   2.173774     1.03   0.305    -2.050021    6.524508
          2  |   6.692244   1.495062     4.48   0.000     3.743581    9.640906
          3  |   11.49109   1.680671     6.84   0.000     8.176361    14.80583
          4  |   15.76366   1.596531     9.87   0.000     12.61487    18.91244
             |
       _cons |   41.65526   1.323366    31.48   0.000     39.04523    44.26529
------------------------------------------------------------------------------
testparm i.readcat
 ( 1)  1.readcat = 0
 ( 2)  2.readcat = 0
 ( 3)  3.readcat = 0
 ( 4)  4.readcat = 0

       F(  4,   194) =   29.53
            Prob > F =    0.0000
We see that overall readcat is a significant predictor of write. The
R2 for this model is .4194, as compared to .4394 when read is continuous.
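What testparm computes here is a partial F test: compare the residual sum of squares of the model with the four readcat dummies to the model without them. A hedged numpy sketch on simulated data (the quintile cut mimics egen's cut(read), group(5); the variable names and effect sizes are invented, so the F value will not match the output above):

```python
import numpy as np

# Simulated stand-ins for the hsbdemo variables (values are made up).
rng = np.random.default_rng(1)
n = 200
female = rng.integers(0, 2, n).astype(float)
read = rng.normal(52, 10, n)
write = 20 + 5.5 * female + 0.57 * read + rng.normal(0, 7, n)

# Quintile codes 0..4, roughly what `egen readcat = cut(read), group(5)` does.
edges = np.quantile(read, [0.2, 0.4, 0.6, 0.8])
readcat = np.searchsorted(edges, read)

def rss(y, X):
    """Residual sum of squares from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return r @ r

dummies = (readcat[:, None] == np.arange(1, 5)).astype(float)  # category 0 omitted
rss_full = rss(write, np.column_stack([female, dummies]))      # female + dummies
rss_red = rss(write, female[:, None])                          # female only

q, df_resid = 4, n - 6          # 4 restrictions; n - (intercept + female + 4 dummies)
F = ((rss_red - rss_full) / q) / (rss_full / df_resid)
print(f"F({q}, {df_resid}) = {F:.2f}")
```

Dropping the dummies can only increase the residual sum of squares; the F statistic measures whether that increase is larger than chance would allow.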
Next, let's use readcat in a model but treat it as a one degree of freedom linear
predictor.
regress write female readcat
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   69.99
       Model |  7426.82985     2  3713.41492           Prob > F      =  0.0000
    Residual |  10452.0452   197  53.0560668           R-squared     =  0.4154
-------------+------------------------------           Adj R-squared =  0.4095
       Total |   17878.875   199   89.843593           Root MSE      =   7.284

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.688666   1.037052     5.49   0.000     3.643518    7.733814
     readcat |   4.030212   .3713078    10.85   0.000     3.297964     4.76246
       _cons |   40.90897   1.141635    35.83   0.000     38.65757    43.16036
------------------------------------------------------------------------------
The linear form of readcat is still significant, but the R2 for the model has gone
down to .4154, a trivial difference for a gain of three degrees of freedom in the residual.

We can test whether the difference between using read and readcat is significant by including both in a model. The significant coefficient for read (below) suggests that the continuous form of read accounts for variability in write that is not captured by the categorical form.
regress write female i.readcat read
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  6,   193) =   25.62
       Model |  7926.37551     6  1321.06258           Prob > F      =  0.0000
    Residual |  9952.49949   193  51.5673549           R-squared     =  0.4433
-------------+------------------------------           Adj R-squared =  0.4260
       Total |   17878.875   199   89.843593           Root MSE      =   7.181

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.469592   1.028589     5.32   0.000     3.440875     7.49831
             |
     readcat |
          1  |  -.6631964   2.359184    -0.28   0.779     -5.31629    3.989897
          2  |   1.273686   2.384601     0.53   0.594    -3.429538     5.97691
          3  |   2.011662   3.678693     0.55   0.585    -5.243941    9.267264
          4  |   1.413838   5.218196     0.27   0.787    -8.878175    11.70585
             |
        read |   .5108452    .177188     2.88   0.004     .1613717    .8603187
       _cons |    22.0735    6.91511     3.19   0.002     8.434609    35.71239
------------------------------------------------------------------------------
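This last comparison is again a nested-model test: the one-degree-of-freedom partial F for read (the square of its t statistic) asks whether continuous read improves on the five categories. A rough numpy sketch on simulated data (all names and values here are illustrative, not from hsbdemo, so the statistic will differ from the output above):

```python
import numpy as np

# Simulated stand-ins for the hsbdemo variables (values are made up).
rng = np.random.default_rng(2)
n = 200
female = rng.integers(0, 2, n).astype(float)
read = rng.normal(52, 10, n)
write = 20 + 5.5 * female + 0.57 * read + rng.normal(0, 7, n)

# Quintile dummies for read (category 0 omitted), as in Example 2.
edges = np.quantile(read, [0.2, 0.4, 0.6, 0.8])
dummies = (np.searchsorted(edges, read)[:, None] == np.arange(1, 5)).astype(float)

def rss(y, *cols):
    """Residual sum of squares from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return r @ r

# Reduced model: female + readcat dummies.  Full model adds continuous read.
rss_red = rss(write, female, dummies)
rss_full = rss(write, female, dummies, read)

df_resid = n - 7   # intercept + female + 4 dummies + read
F = (rss_red - rss_full) / (rss_full / df_resid)   # 1-df partial F = t^2 for read
print(f"partial F(1, {df_resid}) = {F:.2f}")
```

Because the simulated outcome is linear in read, the categories discard some within-category information, which is exactly what the partial F for read picks up.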
Linear Statistical Models Course
Phil Ender, 20sep10, 22dec00