Linear Statistical Models: Regression

Collinearity


First Thoughts

  • Certainly, multiple regression is capable of analyzing data with correlated independent variables.
  • However, problems can arise from situations in which two or more variables are highly intercorrelated.
  • There is no consensus about the meaning of the term collinearity.

    Simple Collinearity

  • When two variables are highly correlated.
  • Can be detected by looking at the zero-order correlations (a minimal sketch follows this list).
  • Typically, the correlations involved are in the .90s.
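
    A quick way to screen for simple collinearity is to inspect the zero-order correlations directly. A minimal Stata sketch, using the hsbdemo dataset from the collin examples below (the variable choice is illustrative):

    use http://www.philender.com/courses/data/hsbdemo, clear

    * pairwise (zero-order) correlations among candidate predictors
    correlate read write math science socst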

    Multicollinearity

  • Involves combinations of more than two variables.
  • Variables that are uncorrelated are said to be orthogonal.
  • Computation of regression coefficients involves inverting a matrix. If one variable is a perfect linear combination of two or more other variables then the inverse cannot be computed and the matrix is said to be singular.
  • Example: sat total = sat verbal + sat math
  • In matrix terms, a linear dependency exists when a row (or column) of a matrix can be obtained as a linear combination of other rows (or columns).
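
    The sat example can be reproduced in Stata: when one predictor is an exact linear combination of the others, regress detects the singularity and omits a term rather than attempting to invert a singular matrix. A minimal sketch on the hsbdemo data (total is a constructed variable, and science as the outcome is an arbitrary choice):

    use http://www.philender.com/courses/data/hsbdemo, clear

    * total is a perfect linear combination of read and write
    generate total = read + write

    * regress notes the collinearity and omits one of the three terms
    regress science read write total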

    Common Indicators of Collinearity

  • VIF -- variance inflation factor
  • Tolerance -- the reciprocal of the VIF

    Other Indicators of Collinearity

  • Condition index -- large values
  • Condition number -- large values
  • Eigenvalues -- small values, close to zero
  • Determinant of correlation matrix -- very small, close to zero
  • Diagonal of R^-1 (the inverse of the correlation matrix) -- these diagonal elements are the VIFs; large values indicate collinearity, values close to one are good (a matrix sketch follows this list).
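
    Several of these indicators can be computed directly from the correlation matrix of the predictors. A minimal sketch, using the correlation matrix that Stata's correlate command saves in r(C) (the variable list is illustrative):

    use http://www.philender.com/courses/data/hsbdemo, clear

    quietly correlate read write math science
    matrix R = r(C)

    * the diagonal elements of the inverse correlation matrix are the VIFs
    matrix Rinv = inv(R)
    matrix list Rinv

    * a determinant near zero signals collinearity
    display "det(R) = " det(R)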

    Effects of Collinearity

  • Imprecise estimates of regression coefficients.
  • Slight fluctuations in correlation may lead to large differences in regression coefficients.
  • Adding or dropping cases may lead to large differences in regression coefficients.
  • Increases the standard errors of the coefficients, thus weakening tests of significance.

  • Note: For two predictors, se(beta1) = sqrt[(1 - R2y.12) / ((1 - r12^2)(n - 3))]. As r12 increases, the denominator decreases, and the standard error of the coefficient increases.

  • Note: In general, se(betaj) = sqrt[(1 - R2y) / ((1 - R2j)(n - k - 1))], where R2j is the R2 from regressing predictor j on the remaining predictors (so 1/(1 - R2j) is the VIF). As R2j increases, the denominator decreases, and the standard error of the coefficient increases.
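
    A hypothetical numeric illustration of the general formula (all values here are made up): with R2y = .50, n = 100, and k = 3 predictors, raising R2j from .00 to .90 roughly triples the standard error.

    display sqrt((1 - .50)/((1 - .00)*(100 - 3 - 1)))    // R2j = .00: se is about .072
    display sqrt((1 - .50)/((1 - .90)*(100 - 3 - 1)))    // R2j = .90: se is about .228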

    Checking for Collinearity in Stata

  • Use the vif command after the regress command (a minimal sketch follows this list).
  • Alternatively, use the collin program, which can be downloaded from UCLA ATS; see the example in the next section.
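
    A minimal vif sketch on the hsbdemo data (the model is illustrative, not from the original notes):

    use http://www.philender.com/courses/data/hsbdemo, clear

    quietly regress science read write math socst
    vif      // reports VIF and 1/VIF (tolerance) for each predictor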

    Stata Example Using collin

    use http://www.philender.com/courses/data/hsbdemo, clear
     
    collin female schtyp read write math science socst
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
        female      1.25    1.12    0.8027      0.1973
        schtyp      1.02    1.01    0.9819      0.0181
          read      2.45    1.57    0.4080      0.5920
         write      2.52    1.59    0.3962      0.6038
          math      2.28    1.51    0.4378      0.5622
       science      2.12    1.46    0.4717      0.5283
         socst      1.91    1.38    0.5224      0.4776
    ----------------------------------------------------
      Mean VIF      1.94
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     3.4004          1.0000
        2     1.1347          1.7311
        3     0.9782          1.8644
        4     0.5229          2.5502
        5     0.3577          3.0831
        6     0.3299          3.2104
        7     0.2762          3.5087
    ---------------------------------
     Condition Number         3.5087 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.0643
    
    use http://www.philender.com/courses/data/lahigh, clear
    
    collin mathnce langnce mathpr langpr
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
       mathnce     24.20    4.92    0.0413      0.9587
       langnce     28.31    5.32    0.0353      0.9647
        mathpr     25.02    5.00    0.0400      0.9600
        langpr     29.09    5.39    0.0344      0.9656
    ----------------------------------------------------
      Mean VIF     26.65
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     3.3643          1.0000
        2     0.5926          2.3827
        3     0.0287         10.8179
        4     0.0143         15.3294
    ---------------------------------
     Condition Number        15.3294 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.0008
    
    Note that the percentile scores (mathpr, langpr) are essentially rescaled versions of the corresponding NCE scores, which is why the VIFs are so large; dropping them removes the collinearity.

    collin mathnce langnce
    
      Collinearity Diagnostics
    
                            SQRT                   R-
      Variable      VIF     VIF    Tolerance    Squared
    ----------------------------------------------------
       mathnce      1.90    1.38    0.5256      0.4744
       langnce      1.90    1.38    0.5256      0.4744
    ----------------------------------------------------
      Mean VIF      1.90
    
                               Cond
            Eigenval          Index
    ---------------------------------
        1     1.6888          1.0000
        2     0.3112          2.3295
    ---------------------------------
     Condition Number         2.3295 
     Eigenvalues & Cond Index computed from deviation sscp (no intercept)
     Det(correlation matrix)    0.5256

    Computational Examples

    The following computational examples show some of the effects of high collinearity on standardized regression coefficients. Examples B and D raise the correlation between predictors 2 and 3 from .10 to .85; Examples C and D raise the correlation of predictor 3 with Y from .50 to .52.

    Example A

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .10  .50
    3            -   .50
    Y                 -
    
     R2 = .56373   Det = .918
          Beta   Std Err     F
    1   .34314  .07001  24.025
    2   .39216  .06894  32.360
    3   .39216  .06894  32.360
    

    Example B

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .85  .50
    3            -   .50
    Y                 -
    
    R2 = .43079   Det = .2655
           Beta   Std Err     F
    1   .40960  .07872   27.073
    2   .22599  .14642    2.382
    3   .22599  .14642    2.382
    

    Example C

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .10  .50
    3            -   .52
    Y                 -
    
     R2 = .57983
          Beta   Std Err     F
    1   .33922  .06870  24.378
    2   .39085  .06765  33.376
    3   .41307  .06765  37.279
    

    Example D

         1   2    3    Y
    1   -  .20  .20  .50
    2       -   .85  .50
    3            -   .52
    Y                 -
    
    R2 = .44128
           Beta   Std Err     F
    1   .40734  .07799   27.277
    2   .16497  .14507    1.293
    3   .29831  .14507    4.229
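
    The betas in these examples can be reproduced directly from the correlation matrices: for standardized variables, beta = inv(Rxx)*rxy and R2 = rxy'*beta. A sketch for Example A using Stata's matrix commands (the reported standard errors and F values are consistent with n = 100, an inference not stated in the original):

    matrix R = (1, .2, .2 \ .2, 1, .1 \ .2, .1, 1)    // predictor correlations
    matrix rxy = (.5 \ .5 \ .5)                       // correlations with Y

    matrix b = inv(R) * rxy     // standardized betas: .34314, .39216, .39216
    matrix Rsq = rxy' * b       // R-squared: .56373
    matrix list b
    matrix list Rsq
    display "det = " det(R)     // .918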
    

    Remedies

  • Delete variables -- but this may cause specification errors.
  • Collect additional data.
  • Group variables into blocks.
  • Use principal components analysis or principal factor analysis.
  • Use ridge regression (a minimal sketch follows this list).
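
    For standardized variables, ridge regression replaces beta = inv(Rxx)*rxy with beta(lambda) = inv(Rxx + lambda*I)*rxy, trading a small amount of bias for a reduction in variance. A minimal sketch on the Example A correlations (lambda = 0.1 is an arbitrary illustrative value):

    matrix R = (1, .2, .2 \ .2, 1, .1 \ .2, .1, 1)
    matrix rxy = (.5 \ .5 \ .5)

    * the ridge constant shrinks the coefficients toward zero
    matrix bridge = inv(R + 0.1 * I(3)) * rxy
    matrix list bridge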


    Linear Statistical Models Course

    Phil Ender, 29Jan98