Applied Categorical & Nonnormal Data Analysis

Regression with Censored or Truncated Data

Analyzing data that contain censored values or are truncated is common in many research disciplines. According to Hosmer and Lemeshow (1999), a censored value is one whose value is incomplete due to random factors for each subject. A truncated observation, on the other hand, is one which is incomplete due to a selection process in the design of the study. Thus, truncation changes the sample size while censoring does not.

We will begin by looking at analyzing data with censored values.

Regression with Censored Data

Regression models with censored data are sometimes called tobit models, named for the estimation that was originally developed by J. Tobin (1958).

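The log likelihood for the general model with censored data looks like

$$
\ln L = -\frac{1}{2}\sum_{j \in C} w_j \left\{ \left(\frac{y_j - x_j\beta}{\sigma}\right)^2 + \log 2\pi\sigma^2 \right\}
+ \sum_{j \in L} w_j \log \Phi\left(\frac{y_{Lj} - x_j\beta}{\sigma}\right)
+ \sum_{j \in R} w_j \log\left\{1 - \Phi\left(\frac{y_{Rj} - x_j\beta}{\sigma}\right)\right\}
+ \sum_{j \in I} w_j \log\left\{\Phi\left(\frac{y_{2j} - x_j\beta}{\sigma}\right) - \Phi\left(\frac{y_{1j} - x_j\beta}{\sigma}\right)\right\}
$$

where C are point observations, L are left-censored observations, R are right-censored observations, and I are intervals; y_Lj and y_Rj are the left and right censoring values, y_1j and y_2j are the interval endpoints, Φ is the standard cumulative normal distribution, and w_j is the normalized weight of the jth observation.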

Let's start off with an example from Long (1997). The data are available from www.indiana.edu/~jsl650 (the data file is called job1tob.dta). This example looks at the prestige of a scientist's first job. Job prestige values were not available for departments without graduate programs or for graduate programs rated below 1.0. These cases were coded as ones in the variable jobcen1 (the companion variable jobcen0 codes the same cases as zeros). In this example, some of the ones represent 'true' ones, while the others are censored values that are less than one but whose 'true' values are unknown.

First we will look at the OLS analysis with the censored data. With this approach all of the values scored as one are treated as if they were 'true' ones.

use http://www.gseis.ucla.edu/courses/data/job1tob

summarize jobcen0 jobcen1

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     jobcen0 |     408    1.990784    1.31612          0        4.8
     jobcen1 |     408    2.233431   .9736029          1        4.8

regress jobcen1 fem phd ment fel art cit

      Source |       SS       df       MS              Number of obs =     408
-------------+------------------------------           F(  6,   401) =   17.78
       Model |  81.0584763     6  13.5097461           Prob > F      =  0.0000
    Residual |  304.737915   401  .759944926           R-squared     =  0.2101
-------------+------------------------------           Adj R-squared =  0.1983
       Total |  385.796392   407  .947902683           Root MSE      =  .87175

------------------------------------------------------------------------------
     jobcen1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |  -.1391939   .0902344    -1.54   0.124    -.3165856    .0381977
         phd |   .2726826   .0493183     5.53   0.000     .1757278    .3696375
        ment |   .0011867   .0007012     1.69   0.091    -.0001917    .0025651
         fel |   .2341384   .0948206     2.47   0.014     .0477308    .4205461
         art |   .0228011   .0288843     0.79   0.430    -.0339824    .0795846
         cit |   .0044788   .0019687     2.28   0.023     .0006087     .008349
       _cons |   1.067184   .1661357     6.42   0.000     .7405785     1.39379
------------------------------------------------------------------------------

listcoef

regress (N=408): Unstandardized and Standardized Estimates 

 Observed SD: .97360294
 SD of Error: .8717482

---------------------------------------------------------------------------
 jobcen1 |      b         t     P>|t|    bStdX    bStdY   bStdXY      SDofX
---------+-----------------------------------------------------------------
     fem |  -0.13919   -1.543   0.124  -0.0680  -0.1430  -0.0698     0.4883
     phd |   0.27268    5.529   0.000   0.2601   0.2801   0.2671     0.9538
    ment |   0.00119    1.692   0.091   0.0778   0.0012   0.0799    65.5299
     fel |   0.23414    2.469   0.014   0.1139   0.2405   0.1170     0.4866
     art |   0.02280    0.789   0.430   0.0514   0.0234   0.0528     2.2561
     cit |   0.00448    2.275   0.023   0.1481   0.0046   0.1521    33.0599
---------------------------------------------------------------------------

Next, we will perform an OLS regression after dropping all of the cases that had been censored to one. In this analysis, any remaining ones are 'true' ones; the censored cases (coded zero in jobcen0) are deleted. In effect, we have truncated the sample by dropping all prestige ratings less than one.

regress jobcen0 fem phd ment fel art cit if jobcen0 ~= 0

      Source |       SS       df       MS              Number of obs =     309
-------------+------------------------------           F(  6,   302) =   12.69
       Model |  37.6365095     6  6.27275158           Prob > F      =  0.0000
    Residual |  149.290989   302  .494341024           R-squared     =  0.2013
-------------+------------------------------           Adj R-squared =  0.1855
       Total |  186.927499   308  .606907463           Root MSE      =  .70309

------------------------------------------------------------------------------
     jobcen0 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |   .1014513   .0854827     1.19   0.236    -.0667658    .2696685
         phd |   .2973797   .0467477     6.36   0.000     .2053873    .3893722
        ment |   .0007784   .0006113     1.27   0.204    -.0004247    .0019814
         fel |   .1405303   .0897917     1.57   0.119    -.0361662    .3172269
         art |   .0058978   .0248279     0.24   0.812    -.0429598    .0547554
         cit |   .0021032   .0016553     1.27   0.205    -.0011542    .0053607
       _cons |   1.412782   .1621386     8.71   0.000     1.093718    1.731846
------------------------------------------------------------------------------

listcoef

regress (N=309): Unstandardized and Standardized Estimates 

 Observed SD: .77904266
 SD of Error: .70309389

---------------------------------------------------------------------------
 jobcen0 |      b         t     P>|t|    bStdX    bStdY   bStdXY      SDofX
---------+-----------------------------------------------------------------
     fem |   0.10145    1.187   0.236   0.0481   0.1302   0.0618     0.4744
     phd |   0.29738    6.361   0.000   0.2758   0.3817   0.3540     0.9274
    ment |   0.00078    1.273   0.204   0.0541   0.0010   0.0695    69.5468
     fel |   0.14053    1.565   0.119   0.0662   0.1804   0.0850     0.4710
     art |   0.00590    0.238   0.812   0.0142   0.0076   0.0182     2.4000
     cit |   0.00210    1.271   0.205   0.0760   0.0027   0.0976    36.1466
---------------------------------------------------------------------------

Finally, we will estimate a model using the tobit method. It includes those cases that were censored to a value of one. We will declare the data to be left censored at 1.0. Using information in the sample, the tobit procedure computes the probability that a value of one is censored and uses the probability to aid in the estimation of the coefficients.

tobit jobcen1 fem phd ment fel art cit, ll(1)

Tobit estimates                                   Number of obs   =        408
                                                  LR chi2(6)      =      89.20
                                                  Prob > chi2     =     0.0000
Log likelihood = -560.25209                       Pseudo R2       =     0.0737

------------------------------------------------------------------------------
     jobcen1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         fem |  -.2368486   .1165795    -2.03   0.043    -.4660302   -.0076669
         phd |   .3225846   .0639198     5.05   0.000     .1969258    .4482435
        ment |   .0013436   .0008875     1.51   0.131    -.0004011    .0030884
         fel |   .3252657   .1224516     2.66   0.008     .0845403    .5659911
         art |   .0339053      .0365     0.93   0.353    -.0378493      .10566
         cit |     .00509   .0024751     2.06   0.040     .0002243    .0099557
       _cons |   .6854061    .218261     3.14   0.002     .2563307    1.114482
-------------+----------------------------------------------------------------
         _se |   1.087237    .046533           (Ancillary parameter)
------------------------------------------------------------------------------

  Obs. summary:         99  left-censored observations at jobcen1<=1
                       309     uncensored observations

listcoef

tobit (N=408): Unstandardized and Standardized Estimates 

 Observed SD: .97360294
   Latent SD: 1.21966
 SD of Error: 1.087237

---------------------------------------------------------------------------
 jobcen1 |      b         t     P>|t|    bStdX    bStdY   bStdXY      SDofX
---------+-----------------------------------------------------------------
     fem |  -0.23685   -2.032   0.043  -0.1156  -0.1942  -0.0948     0.4883
     phd |   0.32258    5.047   0.000   0.3077   0.2645   0.2523     0.9538
    ment |   0.00134    1.514   0.131   0.0880   0.0011   0.0722    65.5299
     fel |   0.32527    2.656   0.008   0.1583   0.2667   0.1298     0.4866
     art |   0.03391    0.929   0.353   0.0765   0.0278   0.0627     2.2561
     cit |   0.00509    2.057   0.040   0.1683   0.0042   0.1380    33.0599
---------------------------------------------------------------------------
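
The idea behind the tobit fit above can be written as a latent-variable model. The following is a sketch in the notation of standard textbook treatments; the symbols y*, τ and ε are introduced here for illustration and do not appear in the Stata output.

```latex
\begin{aligned}
y_i^* &= x_i\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \\
y_i &= \begin{cases} y_i^* & \text{if } y_i^* > \tau \\ \tau & \text{if } y_i^* \le \tau \end{cases} \\
\Pr(y_i^* \le \tau \mid x_i) &= \Phi\!\left(\frac{\tau - x_i\beta}{\sigma}\right)
\end{aligned}
```

In this example τ = 1, and the estimate of σ is the ancillary parameter _se reported in the tobit output (1.087).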

In the next example we have a variable called acadindx which is a weighted combination of standardized test scores and academic grades. The maximum possible score on acadindx is 200 but it is clear that the 26 students who scored 200 are not exactly equal in their academic abilities. In other words, there is variability in academic ability that is not being accounted for when students score 200 on acadindx. Acadindx is right censored and in this sample, we do not know which students have 'true' scores of 200 and which ones have censored scores.

We will begin by looking at a description of the data, some descriptive statistics, and correlations among the variables.

use http://www.gseis.ucla.edu/courses/data/acadindx2
(max possible on acadindx is 200)

describe

Contains data from acadindx.dta
  obs:           200                          max possible on acadindx is 200
 vars:             5                          19 Jan 2001 20:14
 size:         4,800 (99.7% of memory free)
-------------------------------------------------------------------------------
   1. id        float  %9.0g                  
   2. female    float  %9.0g       fl         
   3. reading   float  %9.0g                  
   4. writing   float  %9.0g                  
   5. acadindx  float  %9.0g                  academic index
-------------------------------------------------------------------------------

summarize

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
          id |     200       100.5   57.87918          1        200
      female |     200        .545   .4992205          0          1
     reading |     200       52.23   10.25294         28         76
     writing |     200      52.775   9.478586         31         67
    acadindx |     200     176.725   16.10485        143        200

count if acadindx==200
   26

corr acadindx female reading writing
(obs=200)

             | acadindx   female  reading  writing
-------------+------------------------------------
    acadindx |   1.0000
      female |  -0.0756   1.0000
     reading |   0.7105  -0.0531   1.0000
     writing |   0.6662   0.2565   0.5968   1.0000

Now, let's run a standard OLS regression on the data and generate predicted scores in p1.

regress acadindx female reading writing

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =  106.87
       Model |  32031.7937     3  10677.2646           Prob > F      =  0.0000
    Residual |  19582.0813   196   99.908578           R-squared     =  0.6206
-------------+------------------------------           Adj R-squared =  0.6148
       Total |   51613.875   199  259.366206           Root MSE      =  9.9954

------------------------------------------------------------------------------
    acadindx |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -5.436622    1.52325    -3.57   0.000    -8.440685   -2.432558
     reading |    .678742   .0893394     7.60   0.000     .5025521    .8549318
     writing |   .7672243   .0998418     7.68   0.000     .5703222    .9641263
       _cons |    103.747   4.305933    24.09   0.000      95.2551    112.2389
------------------------------------------------------------------------------


predict p1
(option xb assumed; fitted values)

The tobit command is one of the commands that can be used for regression with censored data. The syntax of the command is similar to regress, with the addition of the ul option to indicate that the right-censored value is 200. We will follow the tobit command by generating p2 containing the tobit predicted values.

tobit acadindx female reading writing, ul(200)

Tobit estimates                                   Number of obs   =        200
                                                  LR chi2(3)      =     191.51
                                                  Prob > chi2     =     0.0000
Log likelihood = -684.98404                       Pseudo R2       =     0.1226

------------------------------------------------------------------------------
    acadindx |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -6.279506   1.704417    -3.68   0.000     -9.64075   -2.918261
     reading |   .7863571   .1014259     7.75   0.000     .5863371     .986377
     writing |   .8102958    .110664     7.32   0.000     .5920577    1.028534
       _cons |   97.30504   4.865994    20.00   0.000     87.70892    106.9012
-------------+----------------------------------------------------------------
         _se |   10.91133   .5966562           (Ancillary parameter)
------------------------------------------------------------------------------

  Obs. summary:        174     uncensored observations
                        26 right-censored observations at acadindx>=200

predict p2
(option xb assumed; fitted values)

Summarizing the p1 and p2 scores shows that the tobit predicted values have a larger standard deviation and a greater range of values.

summarize acadindx p1 p2

    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
    acadindx |     200     176.725   16.10485        143        200
          p1 |     200     176.725   12.68715   148.2405   204.6992
          p2 |     200    177.7175   14.07343    146.122   208.9989

When we look at a listing of p1 and p2 for all students who scored the maximum of 200 on acadindx, we see that in all but one case (observation 150) the tobit predicted value is greater than the OLS predicted value. These predictions represent an estimate of what the variability would be if the values of acadindx could exceed 200.

list p1 p2 if acadindx==200

            p1         p2 
 32.  183.6515   184.6332  
 39.  194.5114   197.2149  
 57.  196.3706   199.5261  
 61.  198.2299   201.8373  
 68.  204.6992   208.9989  
 80.  195.4331   198.6566  
 82.  192.0327   194.7362  
 88.  190.4983   193.1156  
 95.  199.3286   203.3269  
100.  190.9407   193.2353  
103.  195.2271   199.2036  
132.  200.8631   204.9474  
136.  193.1315   196.2257  
143.  194.8429   197.8942  
146.  188.6457    190.793  
150.  163.7104   163.5542  
154.  197.7348   201.0875  
157.  195.1677   198.5848  
161.  184.5666   186.0862  
169.   186.344   188.3621  
170.  183.2158   184.5022  
174.  195.1677   198.5848  
180.  196.3706   199.5261  
192.  199.2693   202.7081  
194.  189.4063   191.6147  
200.  191.3316   194.5333
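
The comparison can also be checked directly rather than by eye. This is a minimal sketch, assuming p1 and p2 were created by the predict commands above and that the tobit fit is still the active estimation result; p3 is a hypothetical new variable name.

```stata
* How many of the students at the ceiling have a larger tobit prediction?
count if acadindx == 200 & p2 > p1

* Show any students for whom the tobit prediction is not larger
list id p1 p2 if acadindx == 200 & p2 <= p1

* Expected value of the censored outcome (capped at 200) from the tobit model
predict p3, ystar(., 200)
summarize acadindx p1 p2 p3
```

The ystar() prediction respects the ceiling of 200, so it should be closer to the observed scale than the linear prediction p2, which describes the uncensored latent score.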

Here is the syntax diagram for tobit:

tobit depvar [indepvars] [weight] [if exp] [in range], ll[(#)] ul[(#)]
[ level(#) offset(varname) maximize_options ]

You can declare both lower and upper censored values. The censored values are fixed in that the same lower and upper values apply to all observations.

There are two other commands in Stata that allow you more flexibility in doing regression with censored data.

cnreg estimates a model in which the censored values may vary from observation to observation.

intreg estimates a model where the response variable for each observation is either point data, interval data, left-censored data, or right-censored data.
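
As an illustration of these two commands, the sketch below refits the acadindx model, which should reproduce the tobit estimates with ul(200) since the likelihood is the same. The variable names acad_lo, acad_hi and cens are hypothetical, and cnreg may not be available in recent Stata releases.

```stata
* intreg: two variables hold the lower and upper endpoints of each observation
generate acad_lo = acadindx
generate acad_hi = acadindx
* A missing upper endpoint marks the observation as right censored at acad_lo
replace acad_hi = . if acadindx >= 200
intreg acad_lo acad_hi female reading writing

* cnreg: an indicator flags censoring (-1 left, 0 uncensored, 1 right)
generate cens = 0
replace cens = 1 if acadindx >= 200
cnreg acadindx female reading writing, censored(cens)
```

With intreg the censoring point can differ across observations simply by storing a different value in acad_lo for each right-censored case.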

It is also possible to estimate censored models using a semiparametric approach known as censored least absolute deviations (CLAD). We will demonstrate a CLAD solution with our last dataset using a Stata program clad (findit clad) that estimates the standard errors using the bootstrap method. CLAD procedures are especially useful in situations with heteroscedasticity, nonnormality, or lack of independence of the residuals.

clad acadindx female reading writing, ul(200) reps(200)

Initial sample size = 200
Final sample size = 189
Pseudo R2 = .41301816

Bootstrap statistics

Variable |   Reps   Observed       Bias   Std. Err.   [95% Conf. Interval]
---------+-------------------------------------------------------------------
  female |    200  -7.963542   1.273409   2.467652   -12.82964  -3.09744  (N)
         |                                           -11.29608 -1.762422  (P)
         |                                           -13.82872 -4.446603 (BC)
---------+-------------------------------------------------------------------
 reading |    200   .7578125  -.0278584   .1388206    .4840643  1.031561  (N)
         |                                            .4333717  1.036449  (P)
         |                                            .5205993  1.069915 (BC)
---------+-------------------------------------------------------------------
 writing |    200   .9505209  -.0278993   .1409488    .6725759  1.228466  (N)
         |                                            .6435294  1.223179  (P)
         |                                            .6999999  1.285714 (BC)
---------+-------------------------------------------------------------------
   const |    200     92.375   1.748577   5.200157    82.12051  102.6295  (N)
         |                                            84.47677  105.6032  (P)
         |                                            82.63322    101.48 (BC)
-----------------------------------------------------------------------------
                              N = normal, P = percentile, BC = bias-corrected

I will reformat the output from tobit and clad to assist in comparing the results. I have computed a t-test for clad, although I am not sure that the coefficient divided by the standard error is distributed as a t-statistic. I compute it just for comparison purposes.

tobit model
------------------------------------------------------------------------------
    acadindx |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -6.279506   1.704417    -3.68   0.000     -9.64075   -2.918261
     reading |   .7863571   .1014259     7.75   0.000     .5863371     .986377
     writing |   .8102958    .110664     7.32   0.000     .5920577    1.028534
       _cons |   97.30504   4.865994    20.00   0.000     87.70892    106.9012
-------------+----------------------------------------------------------------

clad model
Variable |   Observed   Std. Err.       t
---------+-------------------------------------------------------------------
  female |  -7.963542   2.467652    -3.23
 reading |   .7578125   .1388206     5.46
 writing |   .9505209   .1409488     6.74
   const |     92.375   5.200157    17.76
-----------------------------------------------------------------------------
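
The ratios in the clad table can be reproduced with display. This is only a sketch of the arithmetic, with the coefficients and bootstrap standard errors copied by hand from the clad output above.

```stata
* Ratio of each CLAD coefficient to its bootstrap standard error
display -7.963542 / 2.467652
display  .7578125 / .1388206
display  .9505209 / .1409488
display    92.375 / 5.200157
```

Because the bootstrap standard errors depend on the random draws, rerunning clad with a different seed would give slightly different ratios.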

Regression with Truncated Data

Truncated data occur when some observations are not included in the analysis because of the value of the variable, that is, the sample is drawn from a restricted part of the population. Truncation is a characteristic of the distribution from which the sample data are drawn. If x has a normal distribution with mean μ and standard deviation σ, then the density of the normal distribution truncated to a < x < b is

$$
f(x \mid a < x < b) = \frac{\frac{1}{\sigma}\,\phi\left(\frac{x-\mu}{\sigma}\right)}{\Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)}
$$

where φ and Φ are the density and distribution functions of the standard normal distribution.

Compared with the mean of an untruncated variable, the mean of the truncated variable is greater if the truncation is from below, and is smaller if the truncation is from above. Furthermore, truncation reduces the variance compared with the variance of the untruncated distribution.
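
These statements about the mean and variance follow from the standard moments of the truncated normal distribution. The sketch below uses textbook notation; the symbols α, κ, λ and δ are introduced here for illustration and are not part of the Stata output.

```latex
\begin{aligned}
\alpha &= \frac{a-\mu}{\sigma}, \qquad
\lambda(\alpha) = \frac{\phi(\alpha)}{1-\Phi(\alpha)}, \qquad
\delta(\alpha) = \lambda(\alpha)\,[\lambda(\alpha)-\alpha] \\
E[x \mid x > a] &= \mu + \sigma\,\lambda(\alpha) \\
\operatorname{Var}[x \mid x > a] &= \sigma^2\,[1-\delta(\alpha)], \qquad 0 < \delta(\alpha) < 1 \\
E[x \mid x < b] &= \mu - \sigma\,\frac{\phi(\kappa)}{\Phi(\kappa)}, \qquad \kappa = \frac{b-\mu}{\sigma}
\end{aligned}
```

Since λ(α) is positive, truncation from below raises the mean, and since δ(α) lies between 0 and 1, the truncated variance is always smaller than σ².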

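The log likelihood when a is the lower limit and b is the upper limit is

$$
\ln L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{j=1}^{n}(y_j - x_j\beta)^2 - \sum_{j=1}^{n}\log\left\{\Phi\left(\frac{b - x_j\beta}{\sigma}\right) - \Phi\left(\frac{a - x_j\beta}{\sigma}\right)\right\}
$$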

Let's return to Long's (1997) example on the prestige of a scientist's first job. This time we will estimate the model using regression for truncated data. We will truncate all values for job prestige that are less than one.

truncreg jobcen0 fem phd ment fel art cit, ll(1)
(note: 99 obs. truncated)

Truncated regression
Limit:   lower =          1                             Number of obs =    309
         upper =       +inf                             Wald chi2(6)  =  71.13
Log likelihood = -318.66024                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
     jobcen0 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
         fem |    .114156    .095124     1.20   0.230    -.0722837    .3005956
         phd |   .3413744   .0539561     6.33   0.000     .2356224    .4471263
        ment |   .0008171   .0006589     1.24   0.215    -.0004743    .0021085
         fel |   .1709118   .1011169     1.69   0.091    -.0272737    .3690974
         art |   .0072712   .0271957     0.27   0.789    -.0460314    .0605738
         cit |   .0021862    .001788     1.22   0.221    -.0013182    .0056905
       _cons |   1.187784   .1962769     6.05   0.000     .8030885     1.57248
-------------+----------------------------------------------------------------
sigma        |
       _cons |   .7379857   .0353198    20.89   0.000     .6687602    .8072112
------------------------------------------------------------------------------

Next, we will analyze the dataset, acadindx, that was used in the previous section. If acadindx is no longer loaded in memory you can obtain it with the following use command.

use http://www.gseis.ucla.edu/courses/data/acadindx2
(max possible on acadindx is 200)

Let's imagine that in order to get into a special honors program, students need to score at least 165 on acadindx. So we will drop all observations in which the value of acadindx is less than 165.

drop if acadindx<165
(53 observations deleted)

Now, let's estimate the same model that we used in the section on censored data, only this time we will pretend that a 200 for acadindx is not censored.

regress acadindx female reading writing

      Source |       SS       df       MS              Number of obs =     147
-------------+------------------------------           F(  3,   143) =   35.17
       Model |  7418.94448     3  2472.98149           Prob > F      =  0.0000
    Residual |  10053.7222   143  70.3057495           R-squared     =  0.4246
-------------+------------------------------           Adj R-squared =  0.4125
       Total |  17472.6667   146  119.675799           Root MSE      =  8.3849

------------------------------------------------------------------------------
    acadindx |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -5.081622   1.491473    -3.41   0.001    -8.029805    -2.13344
     reading |   .4263403   .0874548     4.87   0.000      .253469    .5992115
     writing |   .5426893   .1062605     5.11   0.000     .3326451    .7527336
       _cons |   132.9936    5.43257    24.48   0.000     122.2551    143.7322
------------------------------------------------------------------------------

It is clear that the estimates of the coefficients are distorted due to the fact that 53 observations are no longer in the dataset. This amounts to restriction of range on both the response variable and the predictor variables. What this means is that if our goal is to find the relation between acadindx and the predictor variables in the population, then the truncation of acadindx in our sample is going to lead to biased estimates. A better approach to analyzing these data is to use truncated regression. In Stata this can be accomplished using the truncreg command where the ll option is used to indicate the lower limit of acadindx scores used in the truncation.

truncreg acadindx female reading writing, ll(165)
(note: 3 obs. truncated)  

Truncated regression
Limit:   lower =        165                             Number of obs =    144
         upper =       +inf                             Wald chi2(3)  =  80.80
Log likelihood = -499.72027                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
    acadindx |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
eq1          |
      female |  -5.264574   1.697022    -3.10   0.002    -8.590676   -1.938472
     reading |   .4429962   .1024458     4.32   0.000     .2422061    .6437862
     writing |   .6816854   .1324375     5.15   0.000     .4221128    .9412581
       _cons |   123.6204   7.371454    16.77   0.000     109.1726    138.0681
-------------+----------------------------------------------------------------
sigma        |
       _cons |   8.817696    .625297    14.10   0.000     7.592136    10.04326
------------------------------------------------------------------------------

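The coefficients from the truncreg command differ from the OLS and represent an attempt to adjust the analysis for the arbitrary cutoff of acadindx scores at 165. Note that truncreg reports 3 more truncated observations and uses 144 cases rather than 147. This appears to be because truncreg treats the lower limit as a strict bound, so scores exactly equal to 165 are excluded as well.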
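
To see the two sets of estimates side by side, the results can be stored and tabulated. This is a minimal sketch, assuming a Stata release that provides estimates store and estimates table and that the truncated acadindx data are still in memory; the names ols and trreg are hypothetical labels for the stored results.

```stata
* OLS on the truncated sample
regress acadindx female reading writing
estimates store ols

* Truncated regression with the lower limit at 165
truncreg acadindx female reading writing, ll(165)
estimates store trreg

* Coefficients and standard errors from both models in one table
estimates table ols trreg, b(%9.4f) se(%9.4f)
```

Depending on the release, the truncreg coefficients may be listed under their own equation name, with sigma shown as a separate row.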

Categorical Data Analysis Course

Phil Ender