ED230B/C

Linear Statistical Models

Coding Categorical Variables

Updated for Stata 11

Consider the Following 4 Group Design:

Level     a1     a2     a3     a4   Total
           1      2      5     10
           3      3      6     10
           2      4      4      9
           2      3      5     11
Mean     2.0    3.0    5.0   10.0     5.0

Dummy Coding

For k groups, use k-1 coded vectors.

Uses only zeros and ones.

Reference group is coded with all zeros.

Each coded column is one degree of freedom.

Constant is the mean of the reference group.

Regression coefficients are the differences between each group mean and the reference group mean.

Dummy coded variables are also known as indicator variables.

input y  grp d1  d2  d3
 1   1   1   0   0 
 3   1   1   0   0
 2   1   1   0   0
 2   1   1   0   0
 2   2   0   1   0
 3   2   0   1   0
 4   2   0   1   0
 3   2   0   1   0
 5   3   0   0   1
 6   3   0   0   1
 4   3   0   0   1
 5   3   0   0   1
10   4   0   0   0
10   4   0   0   0
 9   4   0   0   0
11   4   0   0   0
end

tabstat y, by(grp)

Summary for variables: y
     by categories of: grp 

     grp |      mean
---------+----------
       1 |         2
       2 |         3
       3 |         5
       4 |        10
---------+----------
   Total |         5
--------------------

regress y d1 d2 d3

  Source |       SS       df       MS                  Number of obs =      16
---------+------------------------------               F(  3,    12) =   76.00
   Model |      152.00     3  50.6666667               Prob > F      =  0.0000
Residual |        8.00    12  .666666667               R-squared     =  0.9500
---------+------------------------------               Adj R-squared =  0.9375
   Total |      160.00    15  10.6666667               Root MSE      =   .8165

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      d1 |         -8   .5773503    -13.856   0.000      -9.257938   -6.742062
      d2 |         -7   .5773503    -12.124   0.000      -8.257938   -5.742062
      d3 |         -5   .5773503     -8.660   0.000      -6.257938   -3.742062
   _cons |         10   .4082483     24.495   0.000       9.110503     10.8895
------------------------------------------------------------------------------
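The dummy variables do not have to be typed in by hand; they can be built from grp. The following is a minimal sketch, assuming y and grp from the data above are in memory. The capture drop line is only there in case d1-d3 were already entered with input. lincom then recovers a group mean as the constant plus that group's coefficient.

```stata
* assumes y and grp are in memory; remove hand-entered dummies if present
capture drop d1 d2 d3

* a logical expression evaluates to 1 when true and 0 when false
generate d1 = (grp == 1)
generate d2 = (grp == 2)
generate d3 = (grp == 3)

regress y d1 d2 d3

* mean of group 1 = constant + d1 coefficient = 10 + (-8) = 2
lincom _cons + d1
```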

Factor variables, introduced in Stata 11, generate dummy coding automatically in most estimation commands. With the i. prefix the lowest level of the variable is the reference group by default.

regress y i.grp

      Source |       SS       df       MS              Number of obs =      16
-------------+------------------------------           F(  3,    12) =   76.00
       Model |         152     3  50.6666667           Prob > F      =  0.0000
    Residual |           8    12  .666666667           R-squared     =  0.9500
-------------+------------------------------           Adj R-squared =  0.9375
       Total |         160    15  10.6666667           Root MSE      =   .8165

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         grp |
          2  |          1   .5773503     1.73   0.109    -.2579382    2.257938
          3  |          3   .5773503     5.20   0.000     1.742062    4.257938
          4  |          8   .5773503    13.86   0.000     6.742062    9.257938
             |
       _cons |          2   .4082483     4.90   0.000     1.110503    2.889497
------------------------------------------------------------------------------
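With factor variables, the joint test of all the group coefficients can be requested with testparm. This is a short sketch, assuming the same y and grp data are in memory. Because grp is the only predictor, the result should match the overall F(3, 12) = 76.00 in the regression header.

```stata
regress y i.grp

* joint test that the coefficients for grp levels 2, 3 and 4 are all zero
testparm i.grp
```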

/* change reference group to grp 4 */

regress y ib4.grp

      Source |       SS       df       MS              Number of obs =      16
-------------+------------------------------           F(  3,    12) =   76.00
       Model |         152     3  50.6666667           Prob > F      =  0.0000
    Residual |           8    12  .666666667           R-squared     =  0.9500
-------------+------------------------------           Adj R-squared =  0.9375
       Total |         160    15  10.6666667           Root MSE      =   .8165

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         grp |
          1  |         -8   .5773503   -13.86   0.000    -9.257938   -6.742062
          2  |         -7   .5773503   -12.12   0.000    -8.257938   -5.742062
          3  |         -5   .5773503    -8.66   0.000    -6.257938   -3.742062
             |
       _cons |         10   .4082483    24.49   0.000     9.110503     10.8895
------------------------------------------------------------------------------
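Whichever reference group is chosen, the fitted group means are the same. As a sketch, assuming the same data are in memory, margins (also new in Stata 11) reports the predicted mean for each level of grp, and lincom compares two non-reference groups directly.

```stata
regress y ib4.grp

* predicted mean of y at each level of grp; with grp as the only
* predictor these are the group means 2, 3, 5 and 10
margins grp

* difference between the group 1 and group 2 coefficients:
* (-8) - (-7) = -1, the group 1 mean minus the group 2 mean
lincom 1.grp - 2.grp
```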


/* anova treats all predictors as categorical unless otherwise indicated */

anova y grp

                           Number of obs =      16     R-squared     =  0.9500
                           Root MSE      = .816497     Adj R-squared =  0.9375

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |      152.00     3  50.6666667      76.00     0.0000
                         |
                     grp |      152.00     3  50.6666667      76.00     0.0000
                         |
                Residual |        8.00    12  .666666667   
              -----------+----------------------------------------------------
                   Total |      160.00    15  10.6666667   

/* regress without arguments redisplays the anova results as a regression table */

regress

      Source |       SS       df       MS              Number of obs =      16
-------------+------------------------------           F(  3,    12) =   76.00
       Model |         152     3  50.6666667           Prob > F      =  0.0000
    Residual |           8    12  .666666667           R-squared     =  0.9500
-------------+------------------------------           Adj R-squared =  0.9375
       Total |         160    15  10.6666667           Root MSE      =   .8165

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         grp |
          2  |          1   .5773503     1.73   0.109    -.2579382    2.257938
          3  |          3   .5773503     5.20   0.000     1.742062    4.257938
          4  |          8   .5773503    13.86   0.000     6.742062    9.257938
             |
       _cons |          2   .4082483     4.90   0.000     1.110503    2.889497
------------------------------------------------------------------------------

Effect Coding

For k groups, use k-1 coded vectors.

Uses ones, zeros, and minus ones.

Reference group is coded -1.

Each coded column is one degree of freedom.

Constant is the unweighted grand mean.

Regression coefficients are differences between the group mean and the grand mean.

Effect coding is sometimes known as deviation coding.

 input y  grp e1  e2  e3
 1   1   1   0   0 
 3   1   1   0   0
 2   1   1   0   0
 2   1   1   0   0
 2   2   0   1   0
 3   2   0   1   0
 4   2   0   1   0
 3   2   0   1   0
 5   3   0   0   1
 6   3   0   0   1
 4   3   0   0   1
 5   3   0   0   1
10   4  -1  -1  -1
10   4  -1  -1  -1
 9   4  -1  -1  -1
11   4  -1  -1  -1
end

regress y e1 e2 e3

  Source |       SS       df       MS                  Number of obs =      16
---------+------------------------------               F(  3,    12) =   76.00
   Model |      152.00     3  50.6666667               Prob > F      =  0.0000
Residual |        8.00    12  .666666667               R-squared     =  0.9500
---------+------------------------------               Adj R-squared =  0.9375
   Total |      160.00    15  10.6666667               Root MSE      =   .8165

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      e1 |         -3   .3535534     -8.485   0.000      -3.770327   -2.229673
      e2 |         -2   .3535534     -5.657   0.000      -2.770327   -1.229673
      e3 |          0   .3535534      0.000   1.000      -.7703266    .7703266
   _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
------------------------------------------------------------------------------

test e1 e2 e3

 ( 1)  e1 = 0
 ( 2)  e2 = 0
 ( 3)  e3 = 0

       F(  3,    12) =   76.00
            Prob > F =    0.0000
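The effect codes can also be generated from grp instead of being typed in. The sketch below assumes y and grp are in memory and that group 4 is the reference group, as in the example above. Group 4 has no coefficient of its own, so its mean is recovered from the constant and the other three coefficients.

```stata
* assumes y and grp are in memory; remove hand-entered codes if present
capture drop e1 e2 e3

* +1 for the group's own level, -1 for the reference group (4), 0 otherwise
generate e1 = (grp == 1) - (grp == 4)
generate e2 = (grp == 2) - (grp == 4)
generate e3 = (grp == 3) - (grp == 4)

regress y e1 e2 e3

* group 4 is coded -1 on every vector, so its predicted mean is
* _cons - e1 - e2 - e3 = 5 - (-3) - (-2) - 0 = 10
lincom _cons - e1 - e2 - e3
```

The effect for group 4 is therefore 10 - 5 = 5, the negative of the sum of the other three effects.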

Orthogonal Coding

For k groups, use k-1 coded vectors.

All vectors are pairwise orthogonal.

Constant is unweighted grand mean.

Each coded column is one degree of freedom.

Example Using Orthogonal Coding

input y grp x1  x2  x3
 1   1   1   1   1 
 3   1   1   1   1
 2   1   1   1   1
 2   1   1   1   1
 2   2  -1   1   1
 3   2  -1   1   1
 4   2  -1   1   1
 3   2  -1   1   1
 5   3   0  -2   1
 6   3   0  -2   1
 4   3   0  -2   1
 5   3   0  -2   1
10   4   0   0  -3
10   4   0   0  -3
 9   4   0   0  -3
11   4   0   0  -3
end

table grp, contents(freq mean y sd y)

----------------------------------------------
      grp |      Freq.     mean(y)       sd(y)
----------+-----------------------------------
        1 |          4           2    .8164966
        2 |          4           3    .8164966
        3 |          4           5    .8164966
        4 |          4          10    .8164966
----------------------------------------------

corr x1 x2 x3
(obs=16)

             |       x1       x2       x3
-------------+---------------------------
          x1 |   1.0000
          x2 |   0.0000   1.0000
          x3 |   0.0000   0.0000   1.0000

Anova

anova y grp

                           Number of obs =      16     R-squared     =  0.9500
                           Root MSE      = .816497     Adj R-squared =  0.9375

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |      152.00     3  50.6666667      76.00     0.0000
                         |
                     grp |      152.00     3  50.6666667      76.00     0.0000
                         |
                Residual |        8.00    12  .666666667   
              -----------+----------------------------------------------------
                   Total |      160.00    15  10.6666667 
				   
Regression Analysis Using Orthogonal Coding

regress y x1 x2 x3

  Source |       SS       df       MS                  Number of obs =      16
---------+------------------------------               F(  3,    12) =   76.00
   Model |      152.00     3  50.6666667               Prob > F      =  0.0000
Residual |        8.00    12  .666666667               R-squared     =  0.9500
---------+------------------------------               Adj R-squared =  0.9375
   Total |      160.00    15  10.6666667               Root MSE      =   .8165

------------------------------------------------------------------------------
       y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
      x1 |        -.5   .2886751     -1.732   0.109      -1.128969    .1289691
      x2 |  -.8333333   .1666667     -5.000   0.000      -1.196469   -.4701979
      x3 |  -1.666667   .1178511    -14.142   0.000      -1.923442   -1.409891
   _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
------------------------------------------------------------------------------
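With equal group sizes, each coefficient under orthogonal coding equals the sum of (code times group mean) divided by the sum of the squared codes. The lines below are a quick arithmetic check using display and the group means 2, 3, 5 and 10 from the table above; they do not require any data in memory.

```stata
* b = sum(code * group mean) / sum(code^2), equal n per group

* x1 codes (1, -1, 0, 0): -1/2 = -.5
display (1*2 - 1*3) / (1^2 + (-1)^2)

* x2 codes (1, 1, -2, 0): -5/6 = -.8333
display (1*2 + 1*3 - 2*5) / (1^2 + 1^2 + (-2)^2)

* x3 codes (1, 1, 1, -3): -20/12 = -1.6667
display (1*2 + 1*3 + 1*5 - 3*10) / (1^2 + 1^2 + 1^2 + (-3)^2)
```

So x1 compares group 1 with group 2, x2 compares groups 1 and 2 with group 3, and x3 compares groups 1 through 3 with group 4.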

test x1 x2 x3

 ( 1)  x1 = 0
 ( 2)  x2 = 0
 ( 3)  x3 = 0

       F(  3,    12) =   76.00
            Prob > F =    0.0000

Orthogonal Coding Schema


Grp X1 X2 X3 X4 X5 X6 X7 X8 X9
 1   1  1  1  1  1  1  1  1  1
 2  -1  1  1  1  1  1  1  1  1
 3   0 -2  1  1  1  1  1  1  1
 4   0  0 -3  1  1  1  1  1  1
 5   0  0  0 -4  1  1  1  1  1
 6   0  0  0  0 -5  1  1  1  1
 7   0  0  0  0  0 -6  1  1  1
 8   0  0  0  0  0  0 -7  1  1
 9   0  0  0  0  0  0  0 -8  1
10   0  0  0  0  0  0  0  0 -9
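The schema follows a simple rule: vector j is 1 for groups 1 through j, -j for group j+1, and 0 for all later groups. A minimal sketch of generating the vectors in a loop is given below. It assumes grp is coded 1, 2, ..., k, and the variable names h1, h2, ... are hypothetical, chosen so they do not clash with x1-x3 entered above.

```stata
* k = number of groups; assumes grp takes the values 1, 2, ..., k
local k = 4
local km1 = `k' - 1

forvalues j = 1/`km1' {
    * 1 for groups 1..j, -j for group j+1, 0 for later groups
    generate h`j' = cond(grp <= `j', 1, cond(grp == `j' + 1, -`j', 0))
}

* with equal group sizes the vectors should be uncorrelated
corr h1-h`km1'
```

With k = 4 and the data above, h1-h3 should reproduce x1-x3.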

Linear Statistical Models Course

Phil Ender, 17sep10, 21Feb02, 17Mar98
