Applied Categorical & Nonnormal Data Analysis

Loglinear Regression Models


Contingency tables are sufficient for analyzing associations in two-way tables. Three-way and higher tables have many more associations that may be explored, necessitating a flexible method for generating the expected frequencies under the different hypotheses. Loglinear regression models are one such approach. Loglinear regression models have the general form
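    log(μ) = λ + λx + λy + ... + λxy + ...

where μ is the expected cell frequency and there is a λ term for each variable in the table and for each interaction included in the model (written here generically for a table with variables x and y).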

We will use the ipf (iterated proportional fitting) command written by Adrian Mander to estimate the models. Use findit ipf in Stata 7 to locate the command. Iterated proportional fitting is a fast, efficient method for estimating the expected frequencies for various models. The drawback to using the ipf command is that it does not produce coefficients for the individual terms in the model. It produces likelihood-ratio and Pearson chi-squares along with the expected frequencies for each cell.

We will illustrate loglinear regression models for contingency tables using the acm dataset from Agresti. The acm file contains information about alcohol, cigarette, and marijuana use. Let's begin with a two-variable example looking at the relationship between cigarette smoking and marijuana use. We will start off with a traditional contingency table analysis that tests the independence of cigarettes and marijuana.

Two-Variable Example
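Here is a minimal Stata sketch of these analyses, assuming the cell counts are stored in freq and cigarette and marijuana use are coded as the binary indicators c and m (variable names assumed, not taken from the original output):

    use acm, clear

    * traditional contingency table test of independence
    tabulate c m [fw=freq], chi2 lrchi2

    * the same hypotheses fit as loglinear models via ipf
    ipf [fw=freq], fit(c+m)
    ipf [fw=freq], fit(c*m)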

The last model, using c*m, perfectly reproduces the observed frequencies because it uses information concerning both the joint and marginal distributions of cigarettes and marijuana. It is equivalent to testing c+m+c*m and is called the saturated model since it has as many parameters as cells and fits the data exactly.

The loglinear model for independence looks like this:
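    log(μ) = λ + λc + λm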

The loglinear model for the saturated case looks like this:
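    log(μ) = λ + λc + λm + λcm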

With three or more variables the questions that can be asked become more interesting than with only two. If we add alcohol to our model, then testing a+c+m is equivalent to running a 3-way contingency table test for independence. This is known as the mutual independence model. It isn't really very interesting and fortunately isn't all that common either.

The loglinear model for mutual independence would look something like this:
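    log(μ) = λ + λa + λc + λm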

A shorthand notation for this model would be (a,c,m).

Consider the following model:
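    log(μ) = λ + λa + λc + λm + λam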

This is the joint independence model, which asserts independence between cigarettes and the four combinations of levels of alcohol and marijuana. It can be denoted (c,am).

Consider the following conditional independence model:
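    log(μ) = λ + λa + λc + λm + λcm + λam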

This model accounts for the association between cigarettes and marijuana controlling for alcohol, and for the association between alcohol and marijuana controlling for cigarettes. It specifies the conditional independence of alcohol and cigarettes while controlling for marijuana. It can be denoted (cm,am).

A model that contains all of the two-way interactions is known as a homogeneous association model.
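It looks like this:

    log(μ) = λ + λa + λc + λm + λac + λam + λcm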

In this model each pair of variables is conditionally dependent given the third variable; it is denoted (am,cm,ac). The saturated model, (acm), with all possible interactions, looks like this:
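    log(μ) = λ + λa + λc + λm + λac + λam + λcm + λacm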

Three-Variable Example
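A sketch of the full series of three-variable models in Stata, using the same assumed variable names:

    ipf [fw=freq], fit(a+c+m)          // mutual independence, (a,c,m)
    ipf [fw=freq], fit(a*m+c)          // joint independence, (c,am)
    ipf [fw=freq], fit(c*m+a*m)        // conditional independence, (cm,am)
    ipf [fw=freq], fit(c*m+a*m+a*c)    // homogeneous association, (am,cm,ac)
    ipf [fw=freq], fit(a*c*m)          // saturated, (acm)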

Let's collect all of the results together into a table. It is possible to test λac = 0 in the (c*m+a*m+a*c) model. This is accomplished by computing the difference in the G2's for the (c*m+a*m) and (c*m+a*m+a*c) models, which is equivalent to a likelihood ratio test. The difference of 187.35 with 1 degree of freedom is statistically significant (P < .001) and supports retaining the a*c partial association.

Using the Poisson Distribution

Loglinear regression models can also be estimated using the Poisson distribution. In this example we will use the glm command with family(poisson) and link(log) to estimate the models for (a+c+m), (a*c+c*m+a*m) and (c*m+a*m). We will be discussing generalized linear models, glm, later in the course.

The constant in the glm model is the log of the expected frequency for the cell with a=0, c=0, m=0; thus exp(4.172538) = 64.879909. The coefficient for c is the difference in the logs of the expected frequencies from c=0 to c=1 when a=0 and m=0. The test of the coefficient is a test of the difference in the marginal proportions for cigarettes.

We will need to create three interaction terms, a*m, a*c and c*m. The response variable will be freq, the number of observations in each cell.
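A sketch of those commands, again assuming the indicators a, c and m and the counts in freq (the interaction variable names are mine):

    gen ac = a*c
    gen am = a*m
    gen cm = c*m

    glm freq a c m, family(poisson) link(log)            // (a+c+m)
    glm freq a c m ac am cm, family(poisson) link(log)   // (a*c+c*m+a*m)
    glm freq a c m am cm, family(poisson) link(log)      // (c*m+a*m)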

Note that the Deviance and the Pearson are the same as the G2 and χ2, respectively, in the table above.

Connection to Logistic Regression

We haven't begun our discussion of logistic regression yet, but this will serve as a demonstration that there is a relationship between logistic regression and loglinear regression models.

Logistic regression requires that we select one of the variables to be the response variable. Let's select m, marijuana.
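With the assumed variable names, one natural version of this model regresses m on a and c, weighting by the cell counts (a guess at the model used, since the original output is not shown):

    logit m a c [fw=freq]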

The likelihood ratio chi-square in this model is the same as that of the a*c+m loglinear regression model above.

hsbdemo Example

Here is an example derived from the hsbdemo dataset, in which write and read have been categorized by quantile and the data, along with female, contracted to frequencies.
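A sketch of how that setup might be constructed (quartiles are assumed for the quantile categorization, and the names wrt and rd are mine):

    use hsbdemo, clear
    xtile wrt = write, nq(4)
    xtile rd = read, nq(4)
    contract female wrt rd, freq(freq)
    ipf [fw=freq], fit(female*wrt+female*rd+wrt*rd)   // one possible model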


Categorical Data Analysis Course

Phil Ender