The effect of scaling predictor variables can be easily demonstrated using the variable read in the hsbdemo dataset. We will begin with a model regressing write on female and read.
Example 1
use http://www.philender.com/courses/data/hsbdemo, clear
regress write female read
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
    Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
        read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
The coefficient for read (.57) gives the expected change in write for a one-unit
increase in read, holding female constant. The concern here is that a one-unit
change in read might not be terribly meaningful. Suppose that research has
indicated that a 12-point change in read is meaningful. Here is what you could do.
generate read12 = read/12
regress write female read12
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   77.21
       Model |  7856.32128     2  3928.16064           Prob > F      =  0.0000
    Residual |  10022.5537   197  50.8759072           R-squared     =  0.4394
-------------+------------------------------           Adj R-squared =  0.4337
       Total |   17878.875   199   89.843593           Root MSE      =  7.1327

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
      read12 |   6.790643   .5926186    11.46   0.000     5.621953    7.959334
       _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
------------------------------------------------------------------------------
Now a one-unit change in read12 predicts a 6.8-point change in write with female
held constant. A one-point change in read12 is equivalent to a 12-point change in read.
Note that the standardized (beta) coefficients are identical for read and read12.
regress write female read, beta noheader
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000                 .2889851
        read |   .5658869   .0493849    11.46   0.000                 .6121169
       _cons |   20.22837   2.713756     7.45   0.000                        .
------------------------------------------------------------------------------
regress write female read12, beta noheader
------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
      female |   5.486894   1.014261     5.41   0.000                 .2889851
      read12 |   6.790643   .5926186    11.46   0.000                 .6121169
       _cons |   20.22837   2.713756     7.45   0.000                        .
------------------------------------------------------------------------------
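The algebra behind this is general: rescaling a predictor by a constant only rescales that predictor's raw coefficient, while the standardized beta (b times sd of x over sd of y) is unchanged. Here is a minimal numpy sketch on simulated data -- not the hsbdemo file; the variable names and effect sizes below are invented for illustration:

```python
import numpy as np

# Simulated stand-ins for the hsbdemo variables (values are made up).
rng = np.random.default_rng(0)
n = 200
female = rng.integers(0, 2, n).astype(float)
read = rng.normal(52, 10, n)
write = 20 + 5.5 * female + 0.57 * read + rng.normal(0, 7, n)

def ols(y, *xs):
    """OLS fit with an intercept; returns [intercept, slope1, slope2, ...]."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b = ols(write, female, read)          # raw read
b12 = ols(write, female, read / 12)   # read rescaled by 12

# Dividing read by 12 multiplies its coefficient by 12; the other
# coefficients are untouched.
assert np.allclose(b12[2], 12 * b[2])
assert np.allclose(b12[:2], b[:2])

# Standardized beta = b * sd(x) / sd(y) -- identical for read and read/12.
beta = b[2] * read.std() / write.std()
beta12 = b12[2] * (read / 12).std() / write.std()
assert np.allclose(beta, beta12)
```

The assertions hold exactly (up to floating point) because OLS coefficients scale inversely with any linear rescaling of a predictor.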
Example 2

Now, what if reading were a categorical variable? We will divide read into five categories. Please realize that I am not suggesting that you take a continuous variable and break it up into categories; the point is only to show the effect of scaling read as a categorical variable.
egen readcat = cut(read), group(5) icodes
tabulate readcat
    readcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         39       19.50       19.50
          1 |         16        8.00       27.50
          2 |         62       31.00       58.50
          3 |         37       18.50       77.00
          4 |         46       23.00      100.00
------------+-----------------------------------
      Total |        200      100.00
tabstat read, by(readcat)
Summary for variables: read
     by categories of: readcat

 readcat |      mean
---------+----------
       0 |  38.61538
       1 |     44.25
       2 |  49.22581
       3 |  57.13514
       4 |  66.65217
---------+----------
   Total |     52.23
--------------------
Let's run a regression with dummy coded readcat.
regress write female i.readcat
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  5,   194) =   28.02
       Model |  7497.74329     5  1499.54866           Prob > F      =  0.0000
    Residual |  10381.1317   194  53.5109882           R-squared     =  0.4194
-------------+------------------------------           Adj R-squared =  0.4044
       Total |   17878.875   199   89.843593           Root MSE      =  7.3151

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.714997     1.0442     5.47   0.000     3.655556    7.774438
             |
     readcat |
          1  |   2.237243   2.173774     1.03   0.305    -2.050021    6.524508
          2  |   6.692244   1.495062     4.48   0.000     3.743581    9.640906
          3  |   11.49109   1.680671     6.84   0.000     8.176361    14.80583
          4  |   15.76366   1.596531     9.87   0.000     12.61487    18.91244
             |
       _cons |   41.65526   1.323366    31.48   0.000     39.04523    44.26529
------------------------------------------------------------------------------
testparm i.readcat
 ( 1)  1.readcat = 0
 ( 2)  2.readcat = 0
 ( 3)  3.readcat = 0
 ( 4)  4.readcat = 0

       F(  4,   194) =   29.53
            Prob > F =    0.0000
We see that overall readcat is a significant predictor of write. The
R2 for this model is .4194, as compared to .4394 when read is continuous.
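What testparm computes here is a partial F test: compare the residual sum of squares of the model with the four readcat dummies to the model without them. A hedged numpy sketch on simulated data (the quintile cut mimics egen's cut(read), group(5); the variable names and effect sizes are invented, so the F value will not match the output above):

```python
import numpy as np

# Simulated stand-ins for the hsbdemo variables (values are made up).
rng = np.random.default_rng(1)
n = 200
female = rng.integers(0, 2, n).astype(float)
read = rng.normal(52, 10, n)
write = 20 + 5.5 * female + 0.57 * read + rng.normal(0, 7, n)

# Quintile codes 0..4, roughly what `egen readcat = cut(read), group(5)` does.
edges = np.quantile(read, [0.2, 0.4, 0.6, 0.8])
readcat = np.searchsorted(edges, read)

def rss(y, X):
    """Residual sum of squares from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return r @ r

dummies = (readcat[:, None] == np.arange(1, 5)).astype(float)  # category 0 omitted
rss_full = rss(write, np.column_stack([female, dummies]))      # female + dummies
rss_red = rss(write, female[:, None])                          # female only

q, df_resid = 4, n - 6          # 4 restrictions; n - (intercept + female + 4 dummies)
F = ((rss_red - rss_full) / q) / (rss_full / df_resid)
print(f"F({q}, {df_resid}) = {F:.2f}")
```

Dropping the dummies can only increase the residual sum of squares; the F statistic measures whether that increase is larger than chance would allow.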
Next, let's use readcat in a model but treat it as a one degree of freedom linear
predictor.
regress write female readcat
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  2,   197) =   69.99
       Model |  7426.82985     2  3713.41492           Prob > F      =  0.0000
    Residual |  10452.0452   197  53.0560668           R-squared     =  0.4154
-------------+------------------------------           Adj R-squared =  0.4095
       Total |   17878.875   199   89.843593           Root MSE      =   7.284

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.688666   1.037052     5.49   0.000     3.643518    7.733814
     readcat |   4.030212   .3713078    10.85   0.000     3.297964     4.76246
       _cons |   40.90897   1.141635    35.83   0.000     38.65757    43.16036
------------------------------------------------------------------------------
The linear form of readcat is still significant, but the R2 for the model has gone
down to .4154, a trivial difference for a gain of three degrees of freedom in the residual.

We can test whether the difference between using read and readcat is significant by including both in a model. The significant coefficient for read (below) suggests that the continuous form of read accounts for variability in write that is not captured by the categorical form.
regress write female i.readcat read
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  6,   193) =   25.62
       Model |  7926.37551     6  1321.06258           Prob > F      =  0.0000
    Residual |  9952.49949   193  51.5673549           R-squared     =  0.4433
-------------+------------------------------           Adj R-squared =  0.4260
       Total |   17878.875   199   89.843593           Root MSE      =   7.181

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   5.469592   1.028589     5.32   0.000     3.440875     7.49831
             |
     readcat |
          1  |  -.6631964   2.359184    -0.28   0.779     -5.31629    3.989897
          2  |   1.273686   2.384601     0.53   0.594    -3.429538     5.97691
          3  |   2.011662   3.678693     0.55   0.585    -5.243941    9.267264
          4  |   1.413838   5.218196     0.27   0.787    -8.878175    11.70585
             |
        read |   .5108452    .177188     2.88   0.004     .1613717    .8603187
       _cons |    22.0735    6.91511     3.19   0.002     8.434609    35.71239
------------------------------------------------------------------------------
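This last comparison is again a nested-model test: the one-degree-of-freedom partial F for read (the square of its t statistic) asks whether continuous read improves on the five categories. A rough numpy sketch on simulated data (all names and values here are illustrative, not from hsbdemo, so the statistic will differ from the output above):

```python
import numpy as np

# Simulated stand-ins for the hsbdemo variables (values are made up).
rng = np.random.default_rng(2)
n = 200
female = rng.integers(0, 2, n).astype(float)
read = rng.normal(52, 10, n)
write = 20 + 5.5 * female + 0.57 * read + rng.normal(0, 7, n)

# Quintile dummies for read (category 0 omitted), as in Example 2.
edges = np.quantile(read, [0.2, 0.4, 0.6, 0.8])
dummies = (np.searchsorted(edges, read)[:, None] == np.arange(1, 5)).astype(float)

def rss(y, *cols):
    """Residual sum of squares from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    r = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return r @ r

# Reduced model: female + readcat dummies.  Full model adds continuous read.
rss_red = rss(write, female, dummies)
rss_full = rss(write, female, dummies, read)

df_resid = n - 7   # intercept + female + 4 dummies + read
F = (rss_red - rss_full) / (rss_full / df_resid)   # 1-df partial F = t^2 for read
print(f"partial F(1, {df_resid}) = {F:.2f}")
```

Because the simulated outcome is linear in read, the categories discard some within-category information, which is exactly what the partial F for read picks up.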
Linear Statistical Models Course
Phil Ender, 20sep10, 22dec00