Linear Statistical Models: Regression

Categorical Predictors

Updated for Stata 11


Regression Analysis So Far...

  • Regression analysis so far has involved only continuous variables or, at the least, quasi-interval scaled variables.
  • There are many variables of interest that are categorical in nature.
  • Regression can include categorical variables, which in turn requires a coding scheme to indicate group membership.
  • Such predictors go by several names: qualitative regressors, categorical predictor variables, nominal predictor variables.

    Coding Methods for Categorical Variables

  • Categorical variables require some system for coding observations as to their group membership. At this time we will introduce two methods for coding categorical variables: dummy coding and effect coding.
  • Later, in Analysis of Variance, we will introduce Orthogonal Coding of categorical variables.

    Consider the Following 4 Group Design:

    Level     a1     a2     a3     a4    Total
               1      2      5     10
               3      3      6     10
               2      4      4      9
               2      3      5     11
    Mean     2.0    3.0    5.0   10.0     5.0

    Dummy Coding

  • Use only 1's and 0's.
  • For k groups, use k-1 coded vectors.
  • 1 indicates group membership.
  • Often the control group is coded with all 0's.
  • Each coded column is one degree of freedom.
  • Dummy variables are sometimes called indicator variables.

    Example Using Dummy Coding

     y  grp x1  x2  x3
     1   1  1   0   0 
     3   1  1   0   0
     2   1  1   0   0
     2   1  1   0   0
     2   2  0   1   0
     3   2  0   1   0
     4   2  0   1   0
     3   2  0   1   0
     5   3  0   0   1
     6   3  0   0   1
     4   3  0   0   1
     5   3  0   0   1
    10   4  0   0   0
    10   4  0   0   0
     9   4  0   0   0
    11   4  0   0   0
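
    If the data are entered with just y and grp, the three dummy vectors above can be generated rather than typed by hand. A minimal sketch, using the variable names from the table above (equivalently, tabulate with the generate() option builds the full set of indicators automatically, as shown later with the hsbdemo data):

    * create k-1 = 3 dummy vectors; group 4 is the reference group (all 0's)
    generate x1 = (grp==1)
    generate x2 = (grp==2)
    generate x3 = (grp==3)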
    

    Regression Analysis Using Dummy Coding

    regress y x1 x2 x3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          x1 |         -8   .5773503    -13.856   0.000      -9.257938   -6.742062
          x2 |         -7   .5773503    -12.124   0.000      -8.257938   -5.742062
          x3 |         -5   .5773503     -8.660   0.000      -6.257938   -3.742062
       _cons |         10   .4082483     24.495   0.000       9.110503     10.8895
    ------------------------------------------------------------------------------
    
    

    Interpretation of Coefficients

  • The constant, b0, is equal to the mean of the group coded with all 0's, i.e., Group 4.
  • b1 is equal to the difference between the mean for Group 1 and Group 4.
  • b2 is equal to the difference between the mean for Group 2 and Group 4.
  • b3 is equal to the difference between the mean for Group 3 and Group 4.
  • The t-test for each coefficient tests the difference between the mean of the group coded 1 on that vector and the mean of the group coded with all 0's.
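
    As a check, the group means can be recovered from the fitted coefficients. A minimal sketch, run right after the regress command above:

    display _b[_cons]             // mean of Group 4: 10
    display _b[_cons] + _b[x1]    // mean of Group 1: 10 + (-8) = 2
    display _b[_cons] + _b[x2]    // mean of Group 2: 10 + (-7) = 3
    display _b[_cons] + _b[x3]    // mean of Group 3: 10 + (-5) = 5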

    Effect Coding

  • Use only 1's, 0's and -1's.
  • For k groups, use k-1 coded vectors.
  • 1 indicates group membership.
  • Usually the control group is coded with all -1's.
  • Each coded column is one degree of freedom.

    Example Using Effect Coding

     y  grp x1  x2  x3
     1   1  1   0   0 
     3   1  1   0   0
     2   1  1   0   0
     2   1  1   0   0
     2   2  0   1   0
     3   2  0   1   0
     4   2  0   1   0
     3   2  0   1   0
     5   3  0   0   1
     6   3  0   0   1
     4   3  0   0   1
     5   3  0   0   1
    10   4 -1  -1  -1
    10   4 -1  -1  -1
     9   4 -1  -1  -1
    11   4 -1  -1  -1
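
    As with dummy coding, the effect-coded vectors can be generated from grp rather than typed by hand. A minimal sketch (drop the dummy-coded x1-x3 first if they are still in memory):

    * 1 for the group itself, -1 for group 4 (the reference group), 0 otherwise
    generate x1 = (grp==1) - (grp==4)
    generate x2 = (grp==2) - (grp==4)
    generate x3 = (grp==3) - (grp==4)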
    

    The Linear Model

    Yij = m + aj + ei(j)

    Where aj represents the treatment effect of the jth group. With effect coding the treatment effects are constrained to sum to zero.

    Regression Analysis Using Effect Coding

    regress y x1 x2 x3
    
      Source |       SS       df       MS                  Number of obs =      16
    ---------+------------------------------               F(  3,    12) =   76.00
       Model |      152.00     3  50.6666667               Prob > F      =  0.0000
    Residual |        8.00    12  .666666667               R-squared     =  0.9500
    ---------+------------------------------               Adj R-squared =  0.9375
       Total |      160.00    15  10.6666667               Root MSE      =   .8165
    
    ------------------------------------------------------------------------------
           y |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
          x1 |         -3   .3535534     -8.485   0.000      -3.770327   -2.229673
          x2 |         -2   .3535534     -5.657   0.000      -2.770327   -1.229673
          x3 |          0   .3535534      0.000   1.000      -.7703266    .7703266
       _cons |          5   .2041241     24.495   0.000       4.555252    5.444748
    ------------------------------------------------------------------------------
    

    Interpretation of Coefficients

  • The constant is equal to the grand mean of the dependent variable.
  • b1 is equal to the difference between the mean for Group 1 and the grand mean, i.e., treatment effect for Group 1.
  • b2 is equal to the difference between the mean for Group 2 and the grand mean, i.e., treatment effect for Group 2.
  • b3 is equal to the difference between the mean for Group 3 and the grand mean, i.e., treatment effect for Group 3.
  • The t-test for each coefficient tests whether the treatment effect for that group is significant.
  • The treatment effect for the group coded with -1's is -Σbk; in this case, -(-3 - 2 + 0) = 5.
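
    Because the treatment effects sum to zero, the effect and mean for the group coded with all -1's can be computed right after the regress command. A minimal sketch:

    display -(_b[x1] + _b[x2] + _b[x3])             // treatment effect for Group 4: -(-3 - 2 + 0) = 5
    display _b[_cons] - (_b[x1] + _b[x2] + _b[x3])  // mean for Group 4: 5 + 5 = 10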

    F-ratio Using R2
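
    The overall F-ratio can be computed from R2 as F = [R2/k] / [(1 - R2)/(n - k - 1)], where k is the number of coded vectors. A quick check against the 4-group example above (R2 = .95, k = 3, n = 16):

    display (.95/3) / ((1 - .95)/(16 - 3 - 1))    // = 76, matching F(3, 12) in the output above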

    An example using hsbdemo

    Let's analyze the hsbdemo data for the variable program type (prog) using write as the dependent variable. We will dummy code prog using the tabulate command with the generate option to create the dummy variables for us automatically.

    use http://www.gseis.ucla.edu/courses/data/hsbdemo, clear
    
    tab prog, gen(prog)
    
        type of |
        program |      Freq.     Percent        Cum.
    ------------+-----------------------------------
        general |         45       22.50       22.50
       academic |        105       52.50       75.00
       vocation |         50       25.00      100.00
    ------------+-----------------------------------
          Total |        200      100.00
    
    regress write prog2 prog3
    
      Source |       SS       df       MS                  Number of obs =     200
    ---------+------------------------------               F(  2,   197) =   21.27
       Model |  3175.69786     2  1587.84893               Prob > F      =  0.0000
    Residual |  14703.1771   197   74.635417               R-squared     =  0.1776
    ---------+------------------------------               Adj R-squared =  0.1693
       Total |   17878.875   199   89.843593               Root MSE      =  8.6392
    
    ------------------------------------------------------------------------------
       write |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
       prog2 |    4.92381   1.539279      3.199   0.002       1.888231    7.959388
       prog3 |  -4.573333   1.775183     -2.576   0.011      -8.074134   -1.072533
       _cons |   51.33333   1.287853     39.860   0.000       48.79359    53.87308
    ------------------------------------------------------------------------------
    
    test prog2 prog3
    
     ( 1)  prog2 = 0.0
     ( 2)  prog3 = 0.0
    
           F(  2,   197) =   21.27
                Prob > F =    0.0000

    It is also possible to have Stata perform dummy coding on-the-fly using factor variables.

    regress write i.prog
    
          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  2,   197) =   21.27
           Model |  3175.69786     2  1587.84893           Prob > F      =  0.0000
        Residual |  14703.1771   197   74.635417           R-squared     =  0.1776
    -------------+------------------------------           Adj R-squared =  0.1693
           Total |   17878.875   199   89.843593           Root MSE      =  8.6392
    
    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            prog |
              2  |    4.92381   1.539279     3.20   0.002     1.888231    7.959388
              3  |  -4.573333   1.775183    -2.58   0.011    -8.074134   -1.072533
                 |
           _cons |   51.33333   1.287853    39.86   0.000     48.79359    53.87308
    ------------------------------------------------------------------------------
    
    testparm i.prog
    
     ( 1)  2.prog = 0
     ( 2)  3.prog = 0
    
           F(  2,   197) =   21.27
                Prob > F =    0.0000
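
    With factor variables the predicted group means can also be obtained directly. A minimal sketch using the margins command, which is available after regress in Stata 11:

    margins prog    // predicted mean of write for each program type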

    Effect coding using manual coding

    In this example group one is the reference group, i.e., the group coded with all -1's.

    replace prog2 = -1 if prog==1
    replace prog3 = -1 if prog==1
    
    regress write prog2 prog3
    
    
          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  2,   197) =   21.27
           Model |  3175.69786     2  1587.84893           Prob > F      =  0.0000
        Residual |  14703.1771   197   74.635417           R-squared     =  0.1776
    -------------+------------------------------           Adj R-squared =  0.1693
           Total |   17878.875   199   89.843593           Root MSE      =  8.6392
    
    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           prog2 |   4.806984   .8161241     5.89   0.000     3.197523    6.416445
           prog3 |  -4.690159   .9626475    -4.87   0.000    -6.588576   -2.791742
           _cons |   51.45016   .6550731    78.54   0.000      50.1583    52.74201
    ------------------------------------------------------------------------------
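
    Because the three program groups have unequal n's, the constant here is the unweighted mean of the three group means rather than the overall mean of write. A quick check, using the group means implied by the dummy-coded run above (general = 51.33, academic = 51.33 + 4.92 = 56.26, vocation = 51.33 - 4.57 = 46.76):

    display (51.33333 + 56.25714 + 46.76000)/3    // = 51.45016, matching _cons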

    The ANOVA Alternative

    Many people picture anova software as being good only for classical experimental designs with categorical variables. However, the Stata anova command is actually regression in disguise. Consider the following regression that has both categorical and continuous variables and their interactions.

    use http://www.philender.com/courses/data/hsbdemo, clear
    
    tabulate prog, gen(prog)
    
        type of |
        program |      Freq.     Percent        Cum.
    ------------+-----------------------------------
        general |         45       22.50       22.50
       academic |        105       52.50       75.00
       vocation |         50       25.00      100.00
    ------------+-----------------------------------
          Total |        200      100.00
    
    regress write i.female i.prog##c.read i.prog##c.math
    
          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  9,   190) =   25.80
           Model |  9833.77329     9  1092.64148           Prob > F      =  0.0000
        Residual |  8045.10171   190  42.3426406           R-squared     =  0.5500
    -------------+------------------------------           Adj R-squared =  0.5287
           Total |   17878.875   199   89.843593           Root MSE      =  6.5071
    
    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        1.female |   5.706612   .9390611     6.08   0.000     3.854288    7.558937
                 |
            prog |
              2  |   5.872569   8.496026     0.69   0.490    -10.88608    22.63122
              3  |  -3.126916   9.509236    -0.33   0.743    -21.88415    15.63032
                 |
            read |   .5184569   .1172236     4.42   0.000     .2872301    .7496837
                 |
     prog#c.read |
              2  |  -.3111253   .1493291    -2.08   0.039    -.6056813   -.0165694
              3  |  -.2499231   .1666047    -1.50   0.135    -.5785557    .0787094
                 |
            math |   .2072995   .1436855     1.44   0.151    -.0761243    .4907233
                 |
     prog#c.math |
              2  |   .2062863   .1759469     1.17   0.242     -.140774    .5533465
              3  |   .2725577   .1950709     1.40   0.164    -.1122251    .6573405
                 |
           _cons |   12.12411    7.35309     1.65   0.101    -2.380063    26.62829
    ------------------------------------------------------------------------------
    
    testparm prog#c.math
    
     ( 1)  2.prog#c.math = 0
     ( 2)  3.prog#c.math = 0
    
           F(  2,   190) =    1.06
                Prob > F =    0.3482
    
    test prog#c.read
    
     ( 1)  2.prog#c.read = 0
     ( 2)  3.prog#c.read = 0
    
           F(  2,   190) =    2.25
                Prob > F =    0.1079
    
    test 1.female
    
     ( 1)  1.female = 0
    
           F(  1,   190) =   36.93
                Prob > F =    0.0000

    Admittedly, that wasn't a very interesting model, but it did illustrate one way to put together all the pieces involved in a model with categorical variables and interaction terms. Now, let's look at exactly the same model using the anova command.

    anova write i.female i.prog##c.read i.prog##c.math
    
                               Number of obs =     200     R-squared     =  0.5500
                               Root MSE      = 6.50712     Adj R-squared =  0.5287
    
                      Source |  Partial SS    df       MS           F     Prob > F
                  -----------+----------------------------------------------------
                       Model |  9833.77329     9  1092.64148      25.80     0.0000
                             |
                      female |  1563.67667     1  1563.67667      36.93     0.0000
                        prog |  66.1182428     2  33.0591214       0.78     0.4595
                        read |  1170.32031     1  1170.32031      27.64     0.0000
                   prog#read |  190.783714     2  95.3918572       2.25     0.1079
                        math |  1066.81222     1  1066.81222      25.19     0.0000
                   prog#math |  89.8348393     2  44.9174197       1.06     0.3482
                             |
                    Residual |  8045.10171   190  42.3426406   
                  -----------+----------------------------------------------------
                       Total |   17878.875   199   89.843593   
    
    regress
    
          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  9,   190) =   25.80
           Model |  9833.77329     9  1092.64148           Prob > F      =  0.0000
        Residual |  8045.10171   190  42.3426406           R-squared     =  0.5500
    -------------+------------------------------           Adj R-squared =  0.5287
           Total |   17878.875   199   89.843593           Root MSE      =  6.5071
    
    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        1.female |   5.706612   .9390611     6.08   0.000     3.854288    7.558937
                 |
            prog |
              2  |   5.872569   8.496026     0.69   0.490    -10.88608    22.63122
              3  |  -3.126916   9.509236    -0.33   0.743    -21.88415    15.63032
                 |
            read |   .5184569   .1172236     4.42   0.000     .2872301    .7496837
                 |
     prog#c.read |
              2  |  -.3111253   .1493291    -2.08   0.039    -.6056813   -.0165694
              3  |  -.2499231   .1666047    -1.50   0.135    -.5785557    .0787094
                 |
            math |   .2072995   .1436855     1.44   0.151    -.0761243    .4907233
                 |
     prog#c.math |
              2  |   .2062863   .1759469     1.17   0.242     -.140774    .5533465
              3  |   .2725577   .1950709     1.40   0.164    -.1122251    .6573405
                 |
           _cons |   12.12411    7.35309     1.65   0.101    -2.380063    26.62829
    ------------------------------------------------------------------------------

    The results are the same as the regression analysis; however, the set-up of the model to be tested was a little more straightforward. Let's try one more.

    ANOVA Example 2

    use http://www.philender.com/courses/data/htwt, clear
    
    anova weight i.female##c.height
    
                               Number of obs =    1000     R-squared     =  0.2795
                               Root MSE      = 8.20887     Adj R-squared =  0.2773
    
                      Source |  Partial SS    df       MS           F     Prob > F
               --------------+----------------------------------------------------
                       Model |  26034.4351     3  8678.14505     128.78     0.0000
                             |
                      female |  587.074483     1  587.074483       8.71     0.0032
                      height |  19197.3548     1  19197.3548     284.89     0.0000
               female#height |   547.82512     1   547.82512       8.13     0.0044
                             |
                    Residual |  67115.9985   996  67.3855406   
               --------------+----------------------------------------------------
                       Total |  93150.4336   999  93.2436773   
    
    regress
    
          Source |       SS       df       MS              Number of obs =    1000
    -------------+------------------------------           F(  3,   996) =  128.78
           Model |  26034.4351     3  8678.14505           Prob > F      =  0.0000
        Residual |  67115.9985   996  67.3855406           R-squared     =  0.2795
    -------------+------------------------------           Adj R-squared =  0.2773
           Total |  93150.4336   999  93.2436773           Root MSE      =  8.2089
    
    ------------------------------------------------------------------------------
          weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        1.female |   38.26321   12.96338     2.95   0.003     12.82455    63.70188
          height |   .7706638    .052059    14.80   0.000     .6685058    .8728217
                 |
          female#|
        c.height |
              1  |  -.2227448   .0781214    -2.85   0.004    -.3760463   -.0694434
                 |
           _cons |  -72.01376   8.892743    -8.10   0.000    -89.46442   -54.56309
    ------------------------------------------------------------------------------
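
    To see what the female#height interaction means in practice, the slope of weight on height can be estimated separately for each sex. A minimal sketch using margins after the regression above:

    margins female, dydx(height)    // slope of about .77 for males (female==0) and .77 - .22 = .55 for females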


    Linear Statistical Models Course

    Phil Ender, 18dec99