Classical Regression vs Logistic Regression
Different Assumptions
Logistic Regression Assumptions
Logit
Note: I would like to thank John Napier (1550-1617), lord of Merchiston (near Edinburgh), for developing the idea of logarithms.
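The logit, or log odds, is the transform at the heart of logistic regression. A quick illustration in Python (not part of the original Stata session):

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def invlogit(x):
    """Inverse logit (logistic function): maps any real x back into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(logit(0.5))             # 0.0 -- even odds
print(invlogit(logit(0.8)))   # recovers 0.8 (up to floating point)
```

The logit maps probabilities in (0, 1) onto the whole real line, which is what lets us model them with a linear predictor.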
About Logistic Regression
Interpreting Logistic Coefficients
Interpreting Odds Ratios
Example Dataset
input apt gender admit
8 1 1
7 1 0
5 1 1
3 1 0
3 1 0
5 1 1
7 1 1
8 1 1
5 1 1
5 1 1
4 0 0
7 0 1
3 0 1
2 0 0
4 0 0
2 0 0
3 0 0
4 0 1
3 0 0
2 0 0
end
Example 1: Categorical Independent Variable
logit admit i.gender
Iteration 0: log likelihood = -13.862944
Iteration 1: log likelihood = -12.222013
Iteration 2: log likelihood = -12.217286
Iteration 3: log likelihood = -12.217286
Logistic regression Number of obs = 20
LR chi2(1) = 3.29
Prob > chi2 = 0.0696
Log likelihood = -12.217286 Pseudo R2 = 0.1187
------------------------------------------------------------------------------
admit | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.gender | 1.694596 .9759001 1.74 0.082 -.2181333 3.607325
_cons | -.8472978 .6900656 -1.23 0.220 -2.199801 .5052058
------------------------------------------------------------------------------
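Because gender is the only predictor, these estimates can be recovered by hand from the 2x2 cross-tabulation of the data. A Python sketch of that arithmetic, with the counts taken from the 20 observations above:

```python
import math

# Cross-tabulate admit by gender from the example dataset:
# gender = 1: 7 admitted, 3 not;  gender = 0: 3 admitted, 7 not.
odds_g1 = 7 / 3   # odds of admission when gender == 1
odds_g0 = 3 / 7   # odds of admission when gender == 0

# _cons is the log odds for the reference group (gender == 0);
# the 1.gender coefficient is the log of the odds ratio.
cons       = math.log(odds_g0)            # ~ -0.8473, matching _cons
coef       = math.log(odds_g1 / odds_g0)  # ~  1.6946, matching 1.gender
odds_ratio = math.exp(coef)               # ~  5.4444, matching the , or output
print(round(cons, 6), round(coef, 6), round(odds_ratio, 6))
```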
logit admit i.gender, or
Logistic regression Number of obs = 20
LR chi2(1) = 3.29
Prob > chi2 = 0.0696
Log likelihood = -12.217286 Pseudo R2 = 0.1187
------------------------------------------------------------------------------
admit | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.gender | 5.444444 5.313233 1.74 0.082 .8040182 36.86729
------------------------------------------------------------------------------
Example 2: Continuous Independent Variable
logit admit apt
Iteration 0: log likelihood = -13.862944
Iteration 1: log likelihood = -9.6278718
Iteration 2: log likelihood = -9.3197603
Iteration 3: log likelihood = -9.3029734
Iteration 4: log likelihood = -9.3028914
Logit estimates Number of obs = 20
LR chi2(1) = 9.12
Prob > chi2 = 0.0025
Log likelihood = -9.3028914 Pseudo R2 = 0.3289
------------------------------------------------------------------------------
admit | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
apt | .9455112 .422872 2.236 0.025 .1166974 1.774325
_cons | -4.095248 1.83403 -2.233 0.026 -7.689881 -.5006154
------------------------------------------------------------------------------
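Two useful ways to read these estimates: exponentiating the apt coefficient gives the multiplicative change in the odds of admission per one-point increase in apt, and the inverse logit of the linear predictor gives a fitted probability. A Python sketch using the coefficients printed above (apt = 5 is just an illustrative value):

```python
import math

b_apt, b_cons = 0.9455112, -4.095248   # coefficients from the output above

# Each one-point increase in apt multiplies the odds by exp(b_apt):
odds_ratio = math.exp(b_apt)           # ~2.574, matching the , or replay

# Fitted probability at an illustrative apt score of 5:
xb = b_cons + b_apt * 5                # linear predictor (log odds)
p  = 1 / (1 + math.exp(-xb))           # inverse logit, ~0.65
print(round(odds_ratio, 6), round(p, 3))
```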
logit, or
Logit estimates Number of obs = 20
LR chi2(1) = 9.12
Prob > chi2 = 0.0025
Log likelihood = -9.3028914 Pseudo R2 = 0.3289
------------------------------------------------------------------------------
admit | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
apt | 2.574129 1.088527 2.236 0.025 1.123779 5.8963
------------------------------------------------------------------------------
Example 3: Categorical & Continuous Independent Variables
logit admit i.gender apt
Iteration 0: log likelihood = -13.862944
Iteration 1: log likelihood = -9.3188454
Iteration 2: log likelihood = -9.2822992
Iteration 3: log likelihood = -9.2820991
Iteration 4: log likelihood = -9.2820991
Logistic regression Number of obs = 20
LR chi2(2) = 9.16
Prob > chi2 = 0.0102
Log likelihood = -9.2820991 Pseudo R2 = 0.3304
------------------------------------------------------------------------------
admit | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.gender | .2671938 1.300911 0.21 0.837 -2.282545 2.816932
apt | .8982803 .4713918 1.91 0.057 -.0256307 1.822191
_cons | -4.028764 1.838393 -2.19 0.028 -7.631949 -.4255801
------------------------------------------------------------------------------
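The header statistics can be reproduced from the iteration log: the LR chi-square compares the final log likelihood with the intercept-only log likelihood at iteration 0, and the pseudo R2 (McFadden's) is one minus their ratio. In Python:

```python
ll_null = -13.862944   # iteration 0: intercept-only log likelihood
ll_full = -9.2820991   # final log likelihood for i.gender + apt

lr_chi2   = 2 * (ll_full - ll_null)   # ~9.16, df = 2 (two predictors)
pseudo_r2 = 1 - ll_full / ll_null     # McFadden's pseudo R2, ~0.3304
print(round(lr_chi2, 2), round(pseudo_r2, 4))
```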
logit, or
Logistic regression Number of obs = 20
LR chi2(2) = 9.16
Prob > chi2 = 0.0102
Log likelihood = -9.2820991 Pseudo R2 = 0.3304
------------------------------------------------------------------------------
admit | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.gender | 1.306294 1.699372 0.21 0.837 .1020242 16.72547
apt | 2.455377 1.157445 1.91 0.057 .974695 6.185398
------------------------------------------------------------------------------
Example 4: Honors Composition using HSB Dataset
use http://www.philender.com/courses/data/hsbdemo, clear
tabulate honors
honcomp | Freq. Percent Cum.
------------+-----------------------------------
0 | 147 73.50 73.50
1 | 53 26.50 100.00
------------+-----------------------------------
Total | 200 100.00
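The intercept-only log likelihood that starts the iteration log below can be computed directly from this marginal table, since the null model assigns every case the same fitted probability, 53/200 = .265. A Python check:

```python
import math

n1, n0 = 53, 147            # honors composition: 1 vs 0, from the tabulate above
p = n1 / (n1 + n0)          # marginal probability, 0.265

# Intercept-only (null) log likelihood: sum of log p over the 1s
# plus log(1 - p) over the 0s.
ll_null = n1 * math.log(p) + n0 * math.log(1 - p)
print(round(ll_null, 5))    # matches iteration 0 of the logit run below
```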
logit honors female i.ses read math
Iteration 0: log likelihood = -115.64441
Iteration 1: log likelihood = -75.969526
Iteration 2: log likelihood = -72.051616
Iteration 3: log likelihood = -71.994777
Iteration 4: log likelihood = -71.994756
Iteration 5: log likelihood = -71.994756
Logistic regression Number of obs = 200
LR chi2(5) = 87.30
Prob > chi2 = 0.0000
Log likelihood = -71.994756 Pseudo R2 = 0.3774
------------------------------------------------------------------------------
honors | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | 1.145726 .4513589 2.54 0.011 .2610792 2.030374
|
ses |
2 | -1.040402 .5791511 -1.80 0.072 -2.175517 .094713
3 | .0541296 .5945439 0.09 0.927 -1.111155 1.219414
|
read | .0687277 .0287044 2.39 0.017 .0124681 .1249873
math | .1358904 .0336875 4.03 0.000 .0698642 .2019166
_cons | -12.55332 1.838493 -6.83 0.000 -16.1567 -8.949939
------------------------------------------------------------------------------
testparm i.ses
( 1) [honors]2.ses = 0
( 2) [honors]3.ses = 0
chi2( 2) = 6.13
Prob > chi2 = 0.0466
logit, or
Logistic regression Number of obs = 200
LR chi2(5) = 87.30
Prob > chi2 = 0.0000
Log likelihood = -71.994756 Pseudo R2 = 0.3774
------------------------------------------------------------------------------
honors | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | 3.144724 1.419399 2.54 0.011 1.29833 7.616934
|
ses |
2 | .3533128 .2046215 -1.80 0.072 .1135494 1.099343
3 | 1.055621 .6276133 0.09 0.927 .3291785 3.385203
|
read | 1.071144 .0307466 2.39 0.017 1.012546 1.133134
math | 1.145556 .0385909 4.03 0.000 1.072363 1.223746
------------------------------------------------------------------------------
fitstat /* available from J. Scott Long via the Internet */
Measures of Fit for logit of honors
Log-Lik Intercept Only: -115.644 Log-Lik Full Model: -71.995
D(193): 143.990 LR(5): 87.299
Prob > LR: 0.000
McFadden's R2: 0.377 McFadden's Adj R2: 0.317
ML (Cox-Snell) R2: 0.354 Cragg-Uhler(Nagelkerke) R2: 0.516
McKelvey & Zavoina's R2: 0.549 Efron's R2: 0.404
Variance of y*: 7.296 Variance of error: 3.290
Count R2: 0.830 Adj Count R2: 0.358
AIC: 0.790 AIC*n: 157.990
BIC: -878.586 BIC': -60.808
BIC used by Stata: 175.779 AIC used by Stata: 155.990
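Several of these measures are simple functions of the two log likelihoods. A Python sketch of McFadden's R2 and the AIC and BIC as Stata computes them (k = 6 is my count of the estimated parameters: female, the two ses indicators, read, math, and the constant):

```python
import math

ll_null, ll_full = -115.64441, -71.994756   # log likelihoods from the output above
n, k = 200, 6                               # observations; assumed parameter count

mcfadden  = 1 - ll_full / ll_null           # ~0.377
aic_stata = -2 * ll_full + 2 * k            # ~155.990
bic_stata = -2 * ll_full + k * math.log(n)  # ~175.779
print(round(mcfadden, 3), round(aic_stata, 3), round(bic_stata, 3))
```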
lfit
Logistic model for honors, goodness-of-fit test
number of observations = 200
number of covariate patterns = 189
Pearson chi2(183) = 166.48
Prob > chi2 = 0.8040
lfit, group(10)
Logistic model for honors, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
number of observations = 200
number of groups = 10
Hosmer-Lemeshow chi2(8) = 12.91
Prob > chi2 = 0.1151
lstat
Logistic model for honors
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 31 12 | 43
- | 22 135 | 157
-----------+--------------------------+-----------
Total | 53 147 | 200
Classified + if predicted Pr(D) >= .5
True D defined as honors != 0
--------------------------------------------------
Sensitivity Pr( +| D) 58.49%
Specificity Pr( -|~D) 91.84%
Positive predictive value Pr( D| +) 72.09%
Negative predictive value Pr(~D| -) 85.99%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 8.16%
False - rate for true D Pr( -| D) 41.51%
False + rate for classified + Pr(~D| +) 27.91%
False - rate for classified - Pr( D| -) 14.01%
--------------------------------------------------
Correctly classified 83.00%
--------------------------------------------------
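All of the rates in this table follow from the four cell counts. A Python check:

```python
# Cell counts from the lstat classification table above:
tp, fp = 31, 12     # classified +, split by true D / ~D
fn, tn = 22, 135    # classified -, split by true D / ~D
n = tp + fp + fn + tn

sensitivity = tp / (tp + fn)    # Pr(+| D),  ~58.49%
specificity = tn / (tn + fp)    # Pr(-|~D),  ~91.84%
ppv = tp / (tp + fp)            # Pr( D|+),  ~72.09%
npv = tn / (tn + fn)            # Pr(~D|-),  ~85.99%
correct = (tp + tn) / n         # correctly classified, ~83.00%
print([round(100 * x, 2) for x in (sensitivity, specificity, ppv, npv, correct)])
```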
Linear Statistical Models Course
Phil Ender, 17sep10, 20dec00