Education 231C

Applied Categorical & Nonnormal Data Analysis

Regression with Censored or Truncated Data


Analyzing data that contain censored values or are truncated is common in many research disciplines. According to Hosmer and Lemeshow (1999), a censored value is one whose value is incomplete due to random factors for each subject. A truncated observation, on the other hand, is one which is incomplete due to a selection process in the design of the study. Thus, truncation changes the sample size while censoring does not.
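The difference in sample size can be seen in a small simulation. This is only a sketch: the latent scores, the seed, and the cutoff of 1.0 are made-up values chosen for illustration, not data from any of the examples in these notes.

```python
import random

random.seed(231)

# Hypothetical latent scores; the distribution and cutoff are illustrative assumptions.
latent = [random.gauss(1.5, 1.0) for _ in range(1000)]
cutoff = 1.0

# Censoring: every case stays in the sample, but values below the cutoff
# are recorded as the cutoff itself.
censored = [max(y, cutoff) for y in latent]

# Truncation: cases below the cutoff never enter the sample at all.
truncated = [y for y in latent if y >= cutoff]

n_at_cutoff = sum(1 for y in censored if y == cutoff)

print("original n: ", len(latent))
print("censored n: ", len(censored), "(cases recorded at the cutoff:", n_at_cutoff, ")")
print("truncated n:", len(truncated))
```

The censored sample keeps all 1000 cases, with some of them piled up at the cutoff, while the truncated sample is smaller.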

We will begin by looking at analyzing data with censored values.

Regression with Censored Data

Regression models for censored data are sometimes called tobit models, after the estimator originally developed by J. Tobin (1958).

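The log likelihood for the general model with censored data can be written as

$$
\ln L = -\frac{1}{2}\sum_{j \in C} w_j\left[\left(\frac{y_j - x_j\beta}{\sigma}\right)^2 + \log 2\pi\sigma^2\right]
+ \sum_{j \in L} w_j \log \Phi\left(\frac{y_{Lj} - x_j\beta}{\sigma}\right)
+ \sum_{j \in R} w_j \log\left[1 - \Phi\left(\frac{y_{Rj} - x_j\beta}{\sigma}\right)\right]
+ \sum_{j \in I} w_j \log\left[\Phi\left(\frac{y_{2j} - x_j\beta}{\sigma}\right) - \Phi\left(\frac{y_{1j} - x_j\beta}{\sigma}\right)\right]
$$

where C are point observations, L are left-censored observations, R are right-censored observations, and I are intervals. Here y_{Lj} and y_{Rj} are the left and right censoring values, and y_{1j} and y_{2j} are the lower and upper endpoints of an interval. Φ is the standard normal cumulative distribution function, and w_j is the normalized weight of the jth observation.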
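The four kinds of contribution (point, left-censored, right-censored, and interval) can be sketched as a small function. This is an illustration under assumptions, not Stata's implementation: the function name `censored_loglik` and the tuple layout `(kind, y1, y2, x, w)` are invented here for the example.

```python
from math import log, pi
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cumulative distribution function


def censored_loglik(observations, beta, sigma):
    """Weighted log likelihood for point, left, right, and interval data.

    Each observation is a tuple (kind, y1, y2, x, w):
      kind -- "point", "left", "right", or "interval"
      y1   -- observed value, censoring value, or lower interval endpoint
      y2   -- upper interval endpoint (ignored unless kind == "interval")
      x    -- list of predictor values (include a 1.0 for the intercept)
      w    -- weight of the observation
    """
    total = 0.0
    for kind, y1, y2, x, w in observations:
        xb = sum(b * xi for b, xi in zip(beta, x))  # linear predictor
        if kind == "point":
            z = (y1 - xb) / sigma
            total += -0.5 * w * (z * z + log(2 * pi * sigma ** 2))
        elif kind == "left":        # true value is at or below y1
            total += w * log(PHI((y1 - xb) / sigma))
        elif kind == "right":       # true value is at or above y1
            total += w * log(1.0 - PHI((y1 - xb) / sigma))
        elif kind == "interval":    # true value lies between y1 and y2
            total += w * log(PHI((y2 - xb) / sigma) - PHI((y1 - xb) / sigma))
        else:
            raise ValueError("unknown observation kind: " + str(kind))
    return total


# Hypothetical mini-dataset: one observed value and one value left censored at 1.0.
data = [
    ("point", 2.3, None, [1.0, 0.5], 1.0),
    ("left", 1.0, None, [1.0, -0.2], 1.0),
]
print(censored_loglik(data, beta=[1.5, 1.0], sigma=1.0))
```

A tobit model with a fixed lower limit is the special case in which every censored case is of the "left" kind with the same censoring value.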

Let's start with an example from Long (1997); the data are available from www.indiana.edu/~jsl650 (the data file is called job1tob.dta). This example looks at the prestige of a scientist's first job. Job prestige values were not available for departments without graduate programs or for graduate programs rated below 1.0. These cases were coded as ones. In this example, some of the ones represent 'true' ones, while the others are censored values that are less than one but whose 'true' values are unknown.

First we will look at the OLS analysis with the censored data. With this approach all of the values scored as one are treated as if they were 'true' ones.

Next, we will perform an OLS regression after dropping all of the cases that had been censored to one. In this analysis, all of the ones are 'true' ones; the other values are deleted. We have truncated the sample by dropping all prestige ratings less than one.

Finally, we will estimate a model using the tobit method. It includes those cases that were censored to a value of one. We will declare the data to be left censored at 1.0. Using information in the sample, the tobit procedure computes the probability that a value of one is censored and uses the probability to aid in the estimation of the coefficients.

In the next example we have a variable called acadindx, which is a weighted combination of standardized test scores and academic grades. The maximum possible score on acadindx is 200, but it is clear that the 26 students who scored 200 are not exactly equal in their academic abilities. In other words, there is variability in academic ability that is not being accounted for when students score 200 on acadindx. Acadindx is right censored, and in this sample we do not know which students have 'true' scores of 200 and which ones have censored scores.
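To see why neither OLS shortcut is satisfactory, here is a small simulation in which the true slope is known. It is only a sketch: the data are simulated with a ceiling of 200 in the spirit of acadindx, and the predictor, the true slope of 1.5, the error variance, and the seed are all assumptions made for the illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope of the simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


random.seed(1997)
n = 5000
true_slope = 1.5  # assumed value for the simulation

# Hypothetical predictor and latent (uncensored) outcome.
xs = [random.gauss(50, 10) for _ in range(n)]
latent = [100 + true_slope * x + random.gauss(0, 10) for x in xs]

# Right censoring: scores above the ceiling are recorded as the ceiling.
ceiling = 200
observed = [min(y, ceiling) for y in latent]

# OLS on the latent outcome (not available in practice).
slope_latent = ols_slope(xs, latent)

# OLS treating every recorded 200 as a 'true' 200.
slope_censored = ols_slope(xs, observed)

# OLS after dropping the cases at the ceiling (this truncates the sample).
kept = [(x, y) for x, y in zip(xs, observed) if y < ceiling]
slope_dropped = ols_slope([x for x, _ in kept], [y for _, y in kept])

print("latent:  ", round(slope_latent, 3))
print("censored:", round(slope_censored, 3))
print("dropped: ", round(slope_dropped, 3))
```

Both the censored-data slope and the dropped-cases slope fall below the slope obtained from the latent outcome, which is the distortion that the tobit method is meant to correct.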

We will begin by looking at a description of the data, some descriptive statistics, and correlations among the variables.

Now, let's run a standard OLS regression on the data and generate predicted scores in p1.

The tobit command is one of the commands that can be used for regression with censored data. The syntax of the command is similar to regress, with the addition of the ul option to indicate that the right-censored value is 200. We will follow the tobit command by generating p2 containing the tobit predicted values.

Summarizing the p1 and p2 scores shows that the tobit predicted values have a larger standard deviation and a greater range of values. When we look at a listing of p1 and p2 for all students who scored the maximum of 200 on acadindx, we see that in every case the tobit predicted value is greater than the OLS predicted value. These predictions are an estimate of what the variability would be if the values of acadindx could exceed 200.

Here is the syntax diagram for tobit:

tobit depvar [indepvars] [weight] [if exp] [in range], ll[(#)] ul[(#)]
        [ level(#) offset(varname) maximize_options ]

You can declare both lower and upper censored values. The censored values are fixed in that the same lower and upper values apply to all observations.
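To make concrete what a fit with a fixed upper limit does, here is a minimal sketch that estimates a right-censored model on simulated data and compares its predictions with OLS. Two caveats apply. First, the data are simulated; the predictor, true slope of 1.5, and seed are assumptions. Second, the estimates are obtained with a simple EM iteration, which is a technique swapped in for the illustration and is not a description of how Stata's tobit command maximizes the likelihood. The function names `ols` and `tobit_upper_em` are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, pstdev

STD = NormalDist()  # standard normal distribution


def ols(xs, ys):
    """Intercept and slope of the simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def tobit_upper_em(xs, ys, ceiling, iterations=100):
    """EM estimates (b0, b1, sigma) when ys are right censored at ceiling."""
    n = len(xs)
    b0, b1 = ols(xs, ys)  # start from the naive OLS fit
    sigma = sqrt(sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys)) / n)
    for _ in range(iterations):
        filled, extra_var = [], 0.0
        for x, y in zip(xs, ys):
            if y < ceiling:
                filled.append(y)  # uncensored: use the observed score
            else:
                xb = b0 + b1 * x
                alpha = (ceiling - xb) / sigma
                lam = STD.pdf(alpha) / STD.cdf(-alpha)  # inverse Mills ratio
                filled.append(xb + sigma * lam)          # E[y* | y* > ceiling]
                # Var[y* | y* > ceiling], needed for the sigma update
                extra_var += sigma ** 2 * (1.0 - lam * (lam - alpha))
        b0, b1 = ols(xs, filled)
        rss = sum((f - b0 - b1 * x) ** 2 for x, f in zip(xs, filled))
        sigma = sqrt((rss + extra_var) / n)
    return b0, b1, sigma


# Simulated stand-in for a score with a ceiling of 200 (illustrative assumptions).
random.seed(200)
n = 2000
xs = [random.gauss(50, 10) for _ in range(n)]
latent = [100 + 1.5 * x + random.gauss(0, 10) for x in xs]
ceiling = 200
observed = [min(y, ceiling) for y in latent]

a_ols, b_ols = ols(xs, observed)
a_tob, b_tob, s_tob = tobit_upper_em(xs, observed, ceiling)

p1 = [a_ols + b_ols * x for x in xs]  # OLS predicted values
p2 = [a_tob + b_tob * x for x in xs]  # censored-model predicted values

print("OLS slope:", round(b_ols, 3), " censored-model slope:", round(b_tob, 3))
print("sd(p1):", round(pstdev(p1), 2), " sd(p2):", round(pstdev(p2), 2))
```

Because the censored-model slope is steeper than the OLS slope, its predicted values are more spread out, which mirrors the larger standard deviation of p2 relative to p1 described above.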

There are two other commands in Stata that allow you more flexibility in doing regression with censored data.

cnreg estimates a model in which the censored values may vary from observation to observation.

intreg estimates a model where the response variable for each observation is either point data, interval data, left-censored data, or right-censored data.

It is also possible to estimate censored models using a semiparametric approach known as censored least absolute deviations (CLAD). We will demonstrate a CLAD solution with our last dataset using a Stata program clad (findit clad) that estimates the standard errors using the bootstrap method. CLAD procedures are especially useful in situations with heteroscedasticity, nonnormality, or lack of independence of the residuals.

I will reformat the output from tobit and clad to assist in comparing the results. I have computed a t-test for clad, although I am not sure that the coefficient divided by the standard error is distributed as a t-statistic. I compute it just for comparison purposes.

Regression with Truncated Data

Truncated data occur when some observations are not included in the analysis because of the value of the variable; that is, the sample is drawn from a restricted part of the population. Truncation is a characteristic of the distribution from which the sample data are drawn. If x has a normal distribution with mean μ and standard deviation σ, then the density of the normal distribution truncated to a < x < b is

$$
f(x \mid a < x < b) = \frac{\dfrac{1}{\sigma}\,\phi\left(\dfrac{x - \mu}{\sigma}\right)}{\Phi\left(\dfrac{b - \mu}{\sigma}\right) - \Phi\left(\dfrac{a - \mu}{\sigma}\right)}
$$

where φ and Φ are the density and distribution functions of the standard normal distribution.

Compared with the mean of an untruncated variable, the mean of the truncated variable is greater if the truncation is from below, and is smaller if the truncation is from above. Furthermore, truncation reduces the variance compared with the variance of the untruncated distribution.
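These two statements can be checked numerically with the standard formulas for the moments of a truncated normal distribution. The function below is a sketch; the name `truncated_normal_moments` is invented for this example, and the truncation points used in the demonstration are arbitrary.

```python
from math import inf, isinf
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def _pdf(z):
    """Standard normal density, defined as 0 at plus or minus infinity."""
    return 0.0 if isinf(z) else STD.pdf(z)


def _z_pdf(z):
    """z times the standard normal density, defined as 0 at infinity."""
    return 0.0 if isinf(z) else z * STD.pdf(z)


def _cdf(z):
    """Standard normal distribution function with explicit infinite limits."""
    if z == inf:
        return 1.0
    if z == -inf:
        return 0.0
    return STD.cdf(z)


def truncated_normal_moments(mu, sigma, a=-inf, b=inf):
    """Mean and variance of a normal(mu, sigma) variable truncated to a < x < b."""
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    z = _cdf(beta) - _cdf(alpha)          # probability of falling inside (a, b)
    shift = (_pdf(alpha) - _pdf(beta)) / z
    mean = mu + sigma * shift
    var = sigma ** 2 * (1.0 + (_z_pdf(alpha) - _z_pdf(beta)) / z - shift ** 2)
    return mean, var


# Standard normal truncated from below at 0, and from above at 0.
print(truncated_normal_moments(0.0, 1.0, a=0.0))
print(truncated_normal_moments(0.0, 1.0, b=0.0))
```

Truncation from below moves the mean up, truncation from above moves it down, and in both cases the variance is smaller than the untruncated variance of 1.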

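The log likelihood when a is the lower limit and b is the upper limit is

$$
\ln L = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{j=1}^{n}\left(y_j - x_j\beta\right)^2 - \sum_{j=1}^{n}\log\left[\Phi\left(\frac{b - x_j\beta}{\sigma}\right) - \Phi\left(\frac{a - x_j\beta}{\sigma}\right)\right]
$$

where n is the number of observations in the truncated sample, y_j is the response, x_jβ is the linear predictor, and σ is the standard deviation of the untruncated errors.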
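As a sketch, the truncated-regression log likelihood can be written as a short function: the ordinary normal log likelihood minus, for each observation, the log of the probability of falling between the limits. The function name `truncreg_loglik` and its argument layout are assumptions made for this illustration, and the three data values are made up.

```python
from math import inf, log, pi
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def _cdf(z):
    """Standard normal distribution function with explicit infinite limits."""
    if z == inf:
        return 1.0
    if z == -inf:
        return 0.0
    return STD.cdf(z)


def truncreg_loglik(ys, xbs, sigma, a=-inf, b=inf):
    """Log likelihood of a normal regression truncated to a < y < b.

    ys    -- observed responses (all inside the limits)
    xbs   -- linear predictors x_j * beta, one per observation
    sigma -- standard deviation of the untruncated errors
    """
    total = 0.0
    for y, xb in zip(ys, xbs):
        z = (y - xb) / sigma
        # ordinary normal log density
        total += -0.5 * log(2 * pi * sigma ** 2) - 0.5 * z * z
        # adjustment for the probability of being observed at all
        total -= log(_cdf((b - xb) / sigma) - _cdf((a - xb) / sigma))
    return total


# Hypothetical values: three scores observed only because they are at least 165.
ys = [170.0, 182.0, 191.0]
xbs = [168.0, 180.0, 188.0]
print(truncreg_loglik(ys, xbs, sigma=10.0, a=165.0))
print(truncreg_loglik(ys, xbs, sigma=10.0))  # no truncation: plain normal log likelihood
```

With no limits supplied, the adjustment term is log(1) = 0 and the function reduces to the usual normal log likelihood.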

Let's return to Long's (1997) example on the prestige of a scientist's first job. This time we will estimate the model using regression for truncated data. We will truncate all values for job prestige that are less than one.

Next, we will analyze the dataset, acadindx, that was used in the previous section. If acadindx is no longer loaded in memory you can obtain it with the following use command.

Let's imagine that in order to get into a special honors program, students need to score at least 165 on acadindx. So we will drop all observations in which the value of acadindx is less than 165.

Now, let's estimate the same model that we used in the section on censored data, only this time we will pretend that a 200 for acadindx is not censored.

It is clear that the estimates of the coefficients are distorted due to the fact that 53 observations are no longer in the dataset. This amounts to restriction of range on both the response variable and the predictor variables. What this means is that if our goal is to find the relation between acadindx and the predictor variables in the population, then the truncation of acadindx in our sample is going to lead to biased estimates.

A better approach to analyzing these data is to use truncated regression. In Stata this can be accomplished using the truncreg command, where the ll option is used to indicate the lower limit of acadindx scores used in the truncation.

The coefficients from the truncreg command differ from the OLS coefficients and represent an attempt to adjust the analysis for the arbitrary cutoff of acadindx scores at 165.


Categorical Data Analysis Course

Phil Ender