Discriminant Analysis

Multivariate Analysis
Discriminant Analysis

Discriminant analysis, also known as linear discriminant function analysis, combines aspects of multivariate analysis of varicance with the ability to classify observations into known categories. It is a useful adjunct in helping to interpret the results of manova. It provides an alternative to multinomial logistic regression that will function adequately with somewhat smaller sample sizes. Like so many multivariate techniques it does require that the dependent variables come from a multivariate normal distribution.

What does discriminant analysis do?

Tests multivariate differences between groups -- really the same as manova

Determines the dimensionality of the group differences

Provides information on the relative importance or contribution of each variable

Classifies known and unknown observations into groups

In the beginning...

Let Y be a linear combination of p dependent variables, such that

Total deviation SSCP

Let T be the Total deviation SSCP matrix, which is obtained from

Within Groups deviation SSCP

The within group sums of squares and cross products matrix is expressed as:

Between Groups deviation SSCP

where

which has n₁ rows of means from group 1, n₂ rows of means from groups 2, n_k rows of means from group k, and where

However, it is easier to obtain B by taking T - W.

The between groups sums of squares for the linear combination is:

Discriminant Analysis

We wish to select the elements of v such that is a maximum.

This occurs when (B - λW)v = 0. Thus, discriminant analysis reduces to finding the eigenvalues and eigenvectors of W^-1B which is often written E^-1H.

Computational Note

W^-1B is generally non-symmetric. Most matrix languages cannot compute the eigenvalues and vectors unless the matrix is symmetric. Therefore, you must perform some additional computations to obtain the eigenvalues and eigenvectors.

mat l = cholesky(W)          /* Cholesky decomposition */
mat u = l'                   /* u equals l transpose   */
mat a = inv(l)*B*inv(u)      /* a is symmetric */
mat symeigen uv e = a        /* e will have the eigenvalues of W^-1B */
mat v = inv(u)*uv            /* v will have the eigenvectors of W^-1B */

Aside

The cholesky function performs the Cholesky decomposition of a matrix such that if L = cholesky(A) then L'L = A, where L is lower triangular matrix. The formula could also be be written LU = A where U is upper triangular. Matrix A must be symmetric and nonnegative definite.

Number of Dimensions

The maximum number of discriminant dimensions is Min(p, k-1).

Trimming

Trim the eigenvalues and eigenvectors

Retain only as many as the number of eigenvalues greater than zero.

Let E be the trimmed matrix of eigenvalues.

Let V be the trimmed matrix of eigenvectors.

Standardized Discriminant Weights

Standardized discriminant weights or coefficients are used when all the variables are in standard score from.

Standardized discriminant coefficients are one of the pieces used in the interpretation of the discriminant analysis.

w_ii is a diagonal matrix whose elements are the square roots of the diagonal elements of W.

Then let

sv is the vector of standardized discriminant coefficients.

Structure Matrix

The structure matrix gives the correlations between the variables and the discriminant functions.

The structure matrix is another of the pieces used in the interpretation of the discriminant analysis.

Computing the structure matrix.

mat E = V'*W*V
mat F = inv(sqrt(diag(E)))
mat G = inv(sqet(diag(W)))
mat A = G*W*V*F

A is the structure matrix.

Canonical Correlations

Wilks' Lambda

Using Canonical Correlations

Let t = 1 - R_c²

then,

2-Group Case

Discuss the 2-group case.

>Chi-square

with degrees of freedom:

let q = k-1
For Chisquare₁ = pq
For Chisquare₂ = (p-1)*(q-1)
For Chisquare₃ = (p-2)*(q-2)

2-Group Example

There are new discriminant analyse procedures in Stata 10. We will be using the candisc or discrim lda command for these examples. candisc is a wrapper program for discrim lad which includes a number of commonly used options.

use http://www.philender.com/courses/data/honors, clear

describe

Contains data from http://www.gseis.ucla.edu/courses/data/honors.dta
  obs:           200                          
 vars:             7                          14 Dec 2001 09:19
 size:         6,400 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  
female          float  %9.0g       fl         
ses             float  %9.0g       sl         
lang            float  %9.0g                  language test score
math            float  %9.0g                  math score
science         float  %9.0g                  science score
honors          float  %9.0g                  

candisc female lang, group(honors)

Canonical linear discriminant analysis

      |                                 | Like- 
      | Canon.   Eigen-     Variance    | lihood
  Fcn | Corr.    value   Prop.   Cumul. | Ratio     F      df1    df2  Prob>F
  ----+---------------------------------+------------------------------------
    1 | 0.5166  .364005  1.0000  1.0000 | 0.7331  35.854     2    197  0.0000 e
  ---------------------------------------------------------------------------
  Ho: this and smaller canon. corr. are zero;                     e = exact F

Standardized canonical discriminant function coefficients

                 | function1 
    -------------+-----------
          female |  .3706778 
            lang |  .9822288 

Canonical structure

                 | function1 
    -------------+-----------
          female |  .2328554 
            lang |  .9302168 

Group means on canonical variables

          honors | function1 
    -------------+-----------
               0 | -.3604546 
               1 |  .9997513 

Resubstitution classification summary

    +---------+
    | Key     |
    |---------|
    | Number  |
    | Percent |
    +---------+
                 | Classified             
    True honors  |      0       1 |  Total
    -------------+----------------+-------
               0 |    114      33 |    147
                 |  77.55   22.45 | 100.00
                 |                |       
               1 |     16      37 |     53
                 |  30.19   69.81 | 100.00
    -------------+----------------+-------
           Total |    130      70 |    200
                 |  65.00   35.00 | 100.00
                 |                |       
          Priors | 0.5000  0.5000 | 
          
/* display raw discriminant coefficients */
matrix list e(L_unstd)

e(L_unstd)[3,1]
         function1
female   .74791853
  lang   .10957945
 _cons  -6.1309503
 
predict pr, pr
predict dsc, dsc
twoway line pr dsc, sort

3-Group Example

use http://www.philender.com/courses/data/hsb2, clear
(highschool and beyond (200 cases))

/* equivalent to one-way anova */

candisc write, group(prog) notable nostruct

Canonical linear discriminant analysis

      |                                 | Like- 
      | Canon.   Eigen-     Variance    | lihood
  Fcn | Corr.    value   Prop.   Cumul. | Ratio     F      df1    df2  Prob>F
  ----+---------------------------------+------------------------------------
    1 | 0.4215  .215987  1.0000  1.0000 | 0.8224  21.275     2    197  0.0000 e
  ---------------------------------------------------------------------------
  Ho: this and smaller canon. corr. are zero;                     e = exact F

Standardized canonical discriminant function coefficients

                 | function1 
    -------------+-----------
           write |         1 

Group means on canonical variables

            prog | function1 
    -------------+-----------
         general | -.1668754 
        academic |  .4030641 
        vocation | -.6962467 


/* compare with */

anova write prog

                           Number of obs =     200     R-squared     =  0.1776
                           Root MSE      = 8.63918     Adj R-squared =  0.1693

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  3175.69786     2  1587.84893      21.27     0.0000
                         |
                    prog |  3175.69786     2  1587.84893      21.27     0.0000
                         |
                Residual |  14703.1771   197   74.635417   
              -----------+----------------------------------------------------
                   Total |   17878.875   199   89.843593

display sqrt(e(r2))
.42145333

/* equivalent to one-way manova */

candisc write read math, group(prog)  nostruct

Canonical linear discriminant analysis

      |                                 | Like- 
      | Canon.   Eigen-     Variance    | lihood
  Fcn | Corr.    value   Prop.   Cumul. | Ratio     F      df1    df2  Prob>F
  ----+---------------------------------+------------------------------------
    1 | 0.5125  .356283  0.9874  0.9874 | 0.7340   10.87     6    390  0.0000 e
    2 | 0.0672  .004543  0.0126  1.0000 | 0.9955  .44518     2    196  0.6414 e
  ---------------------------------------------------------------------------
  Ho: this and smaller canon. corr. are zero;                     e = exact F

Standardized canonical discriminant function coefficients

                 | function1  function2 
    -------------+----------------------
           write |  .3310784   1.183414 
            read |  .2728524  -.4097932 
            math |  .5815538   -.655658 

Group means on canonical variables

            prog | function1  function2 
    -------------+----------------------
         general | -.3120021   .1190423 
        academic |  .5358515  -.0196809 
        vocation | -.8444861  -.0658081 

Resubstitution classification summary

    +---------+
    | Key     |
    |---------|
    | Number  |
    | Percent |
    +---------+
                 | Classified                             
    True prog    |  general  academic  vocation |    Total
    -------------+------------------------------+---------
         general |       11        17        17 |       45
                 |    24.44     37.78     37.78 |   100.00
                 |                              |         
        academic |       18        68        19 |      105
                 |    17.14     64.76     18.10 |   100.00
                 |                              |         
        vocation |       14         7        29 |       50
                 |    28.00     14.00     58.00 |   100.00
    -------------+------------------------------+---------
           Total |       43        92        65 |      200
                 |    21.50     46.00     32.50 |   100.00
                 |                              |         
          Priors |   0.3333    0.3333    0.3333 |     

/* display manova table */

estat manova

                           Number of obs =     200

                           W = Wilks' lambda      L = Lawley-Hotelling trace
                           P = Pillai's trace     R = Roy's largest root

                  Source |  Statistic     df   F(df1,    df2) =   F   Prob>F
              -----------+--------------------------------------------------
                    prog | W   0.7340      2     6.0   390.0    10.87 0.0000 e
                         | P   0.2672            6.0   392.0    10.08 0.0000 a
                         | L   0.3608            6.0   388.0    11.67 0.0000 a
                         | R   0.3563            3.0   196.0    23.28 0.0000 u
                         |--------------------------------------------------
                Residual |               197
              -----------+--------------------------------------------------
                   Total |               199
              --------------------------------------------------------------
                           e = exact, a = approximate, u = upper bound on F

label define sel 1 "G" 2 "A" 3 "V", modify

scoreplot, msymbol(i)



/* compare discriminant analysis with manova */

manova write read math = prog

                           Number of obs =     200

                           W = Wilks' lambda      L = Lawley-Hotelling trace
                           P = Pillai's trace     R = Roy's largest root

                  Source |  Statistic     df   F(df1,    df2) =   F   Prob>F
              -----------+--------------------------------------------------
                    prog | W   0.7340      2     6.0   390.0    10.87 0.0000 e
                         | P   0.2672            6.0   392.0    10.08 0.0000 a
                         | L   0.3608            6.0   388.0    11.67 0.0000 a
                         | R   0.3563            3.0   196.0    23.28 0.0000 u
                         |--------------------------------------------------
                Residual |               197
              -----------+--------------------------------------------------
                   Total |               199
              --------------------------------------------------------------
                           e = exact, a = approximate, u = upper bound on F

Fisher's Iris Data

This example makes use of the classic Iris data that R. A. Fisher used in developing the linear discriminant function.

use http://www.philender.com/courses/data/iris, clear

describe

Contains data from http://www.gseis.ucla.edu/courses/data/iris.dta
  obs:           150                          
 vars:             6                          17 Feb 2000 14:30
 size:         4,200 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
case            float  %9.0g                  
type            float  %10.0g      tl         type of iris
sl              float  %9.0g                  sepal length
sw              float  %9.0g                  sepal width
pl              float  %9.0g                  petal length
pw              float  %9.0g                  petal width
-------------------------------------------------------------------------------
Sorted by:  type

tabulate type

    type of |
       iris |      Freq.     Percent        Cum.
------------+-----------------------------------
     setosa |         50       33.33       33.33
 versicolor |         50       33.33       66.67
  virginica |         50       33.33      100.00
------------+-----------------------------------
      Total |        150      100.00


candisc sl sw pl pw, group(type)

Canonical linear discriminant analysis

      |                                 | Like- 
      | Canon.   Eigen-     Variance    | lihood
  Fcn | Corr.    value   Prop.   Cumul. | Ratio     F      df1    df2  Prob>F
  ----+---------------------------------+------------------------------------
    1 | 0.9848  32.1919  0.9912  0.9912 | 0.0234  199.15     8    288  0.0000 e
    2 | 0.4712  .285391  0.0088  1.0000 | 0.7780  13.794     3    145  0.0000 e
  ---------------------------------------------------------------------------
  Ho: this and smaller canon. corr. are zero;                     e = exact F

Standardized canonical discriminant function coefficients

                 | function1  function2 
    -------------+----------------------
              sl | -.4269549  -.0124077 
              sw | -.5212416  -.7352612 
              pl |  .9472573   .4010379 
              pw |  .5751607  -.5810398 

Canonical structure

                 | function1  function2 
    -------------+----------------------
              sl |  .2225959  -.3108118 
              sw | -.1190115  -.8636809 
              pl |  .7060654  -.1677014 
              pw |  .6331779  -.7372421 

Group means on canonical variables

            type | function1  function2 
    -------------+----------------------
          setosa |   -7.6076   -.215133 
      versicolor |  1.825049   .7278996 
       virginica |   5.78255  -.5127666 

Resubstitution classification summary

    +---------+
    | Key     |
    |---------|
    | Number  |
    | Percent |
    +---------+
                 | Classified                                     
    True type    |     setosa  versicolor   virginica |      Total
    -------------+------------------------------------+-----------
          setosa |         50           0           0 |         50
                 |     100.00        0.00        0.00 |     100.00
                 |                                    |           
      versicolor |          0          48           2 |         50
                 |       0.00       96.00        4.00 |     100.00
                 |                                    |           
       virginica |          0           1          49 |         50
                 |       0.00        2.00       98.00 |     100.00
    -------------+------------------------------------+-----------
           Total |         50          49          51 |        150
                 |      33.33       32.67       34.00 |     100.00
                 |                                    |           
          Priors |     0.3333      0.3333      0.3333 | 

estat manova

                           Number of obs =     150

                           W = Wilks' lambda      L = Lawley-Hotelling trace
                           P = Pillai's trace     R = Roy's largest root

                  Source |  Statistic     df   F(df1,    df2) =   F   Prob>F
              -----------+--------------------------------------------------
                    type | W   0.0234      2     8.0   288.0   199.15 0.0000 e
                         | P   1.1919            8.0   290.0    53.47 0.0000 a
                         | L  32.4773            8.0   286.0   580.53 0.0000 a
                         | R  32.1919            4.0   145.0  1166.96 0.0000 u
                         |--------------------------------------------------
                Residual |               147
              -----------+--------------------------------------------------
                   Total |               149
              --------------------------------------------------------------
                           e = exact, a = approximate, u = upper bound on F
     
label define tl 1 "S" 2 "C" 3 "V", modify

scoreplot, msymbol(i)

One More Example

use http://www.philender.com/courses/data/hsb2

egen grp = group(prog female)

/* findit tablist */

tablist grp prog female, sort(v) nol clean

    grp   prog   female   Freq  
      1      1        0     21  
      2      1        1     24  
      3      2        0     47  
      4      2        1     58  
      5      3        0     23  
      6      3        1     27

candisc read write math science socst, group(grp)

Canonical linear discriminant analysis

      |                                 | Like- 
      | Canon.   Eigen-     Variance    | lihood
  Fcn | Corr.    value   Prop.   Cumul. | Ratio     F      df1    df2  Prob>F
  ----+---------------------------------+------------------------------------
    1 | 0.5782  .502254  0.6132  0.6132 | 0.4975  5.8498    25  707.3  0.0000 a
    2 | 0.4363   .23516  0.2871  0.9003 | 0.7474  3.6511    16  584.2  0.0000 a
    3 | 0.2197  .050705  0.0619  0.9622 | 0.9231  1.7353     9  467.4  0.0786 a
    4 | 0.1728  .030785  0.0376  0.9997 | 0.9699  1.4843     4    386  0.2062 e
    5 | 0.0144  .000207  0.0003  1.0000 | 0.9998  .04025     1    194  0.8412 e
  ---------------------------------------------------------------------------
  Ho: this and smaller canon. corr. are zero;  e = exact F, a = approximate F

Standardized canonical discriminant function coefficients

                 | function1  function2  function3  function4  function5 
    -------------+-------------------------------------------------------
            read | -.1161169  -.4829687   .1886365   1.321818   .1538771 
           write | -.6602812   1.128027   -.212974   .1360569   .4061492 
            math |  -.463602  -.5339929   .8257989  -.7658781   .3940945 
         science |  .6107802  -.3927407  -1.125132  -.3845829   .2800101 
           socst | -.3563766   -.225085  -.2689487  -.3519051  -1.088045 

Canonical structure

                 | function1  function2  function3  function4  function5 
    -------------+-------------------------------------------------------
            read | -.5809872  -.5235904  -.2656326   .5296293   .1929744 
           write | -.8095514   .2094842   -.443327   .0312908   .3212865 
            math | -.6837577  -.5135297   .0156334  -.2991848   .4230919 
         science | -.2778489  -.4341668  -.7449201  -.1238876   .4050248 
           socst | -.7035231  -.2935881  -.3890976  -.0536654  -.5143777 

Group means on canonical variables

             grp | function1  function2  function3  function4  function5 
    -------------+-------------------------------------------------------
               1 |  .6640051  -.4587528  -.4351828   .2093051  -.0166562 
               2 |  .1637273   .6219646  -.1688396  -.3557218  -.0124101 
               3 | -.3608076  -.4947248  -.0729075  -.1062484   .0169581 
               4 | -.7733276   .0895048   .1406045   .0935383  -.0099369 
               5 |  1.315385  -.3706178   .4039811  -.0541723  -.0050263 
               6 |   .506798   .7885798  -.0307032   .1835678    .020094 

Resubstitution classification summary

    +---------+
    | Key     |
    |---------|
    | Number  |
    | Percent |
    +---------+
                 | Classified                                             
    True grp     |      1       2       3       4       5       6 |  Total
    -------------+------------------------------------------------+-------
               1 |      5       2       3       1       7       3 |     21
                 |  23.81    9.52   14.29    4.76   33.33   14.29 | 100.00
                 |                                                |       
               2 |      0      10       1       5       4       4 |     24
                 |   0.00   41.67    4.17   20.83   16.67   16.67 | 100.00
                 |                                                |       
               3 |      5       6      17      13       3       3 |     47
                 |  10.64   12.77   36.17   27.66    6.38    6.38 | 100.00
                 |                                                |       
               4 |      1       9      11      25       4       8 |     58
                 |   1.72   15.52   18.97   43.10    6.90   13.79 | 100.00
                 |                                                |       
               5 |      5       4       1       1      12       0 |     23
                 |  21.74   17.39    4.35    4.35   52.17    0.00 | 100.00
                 |                                                |       
               6 |      3       4       2       2       4      12 |     27
                 |  11.11   14.81    7.41    7.41   14.81   44.44 | 100.00
    -------------+------------------------------------------------+-------
           Total |     19      35      35      47      34      30 |    200
                 |   9.50   17.50   17.50   23.50   17.00   15.00 | 100.00
                 |                                                |       
          Priors | 0.1667  0.1667  0.1667  0.1667  0.1667  0.1667 |  

estat manova

                           Number of obs =     200

                           W = Wilks' lambda      L = Lawley-Hotelling trace
                           P = Pillai's trace     R = Roy's largest root

                  Source |  Statistic     df   F(df1,    df2) =   F   Prob>F
              -----------+--------------------------------------------------
                     grp | W   0.4975      5    25.0   707.3     5.85 0.0000 a
                         | P   0.6031           25.0   970.0     5.32 0.0000 a
                         | L   0.8191           25.0   942.0     6.17 0.0000 a
                         | R   0.5023            5.0   194.0    19.49 0.0000 u
                         |--------------------------------------------------
                Residual |               194
              -----------+--------------------------------------------------
                   Total |               199
              --------------------------------------------------------------
                           e = exact, a = approximate, u = upper bound on F

Multivariate Course Page

Phil Ender, 13jul07, 28oct05, 23apr05, 29Jan98

Multivariate Analysis Discriminant Analysis

Multivariate Analysis
Discriminant Analysis