Applied Categorical & Nonnormal Data Analysis
Correspondence Analysis


Correspondence analysis is one more method for analyzing data in contingency tables. Developed in France, it is more commonly used in Europe than in North America. It is a descriptive, exploratory technique designed to analyze two-way and multi-way tables containing some measure of correspondence between the row and column variables. Its results provide information similar to that produced by principal components or factor analysis: they allow one to explore the structure of the categorical variables included in the table.

Correspondence analysis seeks to represent the relationships among the categories of row and column variables with a smaller number of latent dimensions. It produces a graphical representation of the relationships between the row and column categories in the same space.

We will illustrate correspondence analysis using the ca command (new in Stata 9) with the hsb2 dataset. The maximum number of dimensions is min(R-1, C-1), where R and C are the numbers of row and column categories. Since ses has three categories, C-1 = 2; race has four categories, so R-1 = 3, and the relationship between race and ses can have at most two dimensions. In this example, as you will see, the first dimension accounts for about 98% of the variability, so there is effectively only one dimension.

Some terminology: The mass of a category is its marginal relative frequency, that is, its row or column total divided by N. The total inertia is the chi-square value divided by N. It is partitioned into parts for each of the dimensions. Inertia can also be written as the weighted sum of the squared chi-square distances between each profile and the mean profile, with the masses as weights.
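To make these definitions concrete, here is a minimal sketch in Python (not Stata, and not part of the ca command). The 3x3 table of counts is made up purely for illustration and is not the hsb2 data; the function name ca_summary is likewise hypothetical. The sketch computes the masses, then the total inertia in two ways, once as chi-square divided by N and once as the mass-weighted sum of squared chi-square distances from each row profile to the mean profile, so you can confirm that the two agree.

```python
# Illustrative 3x3 table of counts (made up for this sketch; NOT the hsb2 data).
table = [
    [20, 10, 5],
    [10, 20, 10],
    [5, 10, 30],
]


def ca_summary(table):
    """Return masses, total inertia (computed two ways), and max dimensions."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]

    # Mass = marginal relative frequency of each category.
    row_mass = [t / n for t in row_tot]
    col_mass = [t / n for t in col_tot]  # this is also the mean row profile

    # Pearson chi-square statistic for the table.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Inertia as the mass-weighted sum of squared chi-square distances
    # between each row profile and the mean row profile.
    inertia_dist = 0.0
    for i, row in enumerate(table):
        profile = [x / row_tot[i] for x in row]
        d2 = sum((p - c) ** 2 / c for p, c in zip(profile, col_mass))
        inertia_dist += row_mass[i] * d2

    return {
        "row_mass": row_mass,
        "col_mass": col_mass,
        "inertia_from_chi2": chi2 / n,
        "inertia_from_distances": inertia_dist,
        "max_dims": min(len(table) - 1, len(table[0]) - 1),
    }


result = ca_summary(table)
print(result)
```

The same function also reports min(R-1, C-1), the maximum number of dimensions for the table. Splitting the inertia across dimensions requires a singular value decomposition, which this sketch deliberately leaves out.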

Example 1

Since both race and ses reflect socioeconomic factors, it is not surprising that they fall primarily onto a single dimension. The first graph shows that White and Asian are close to one another on Dimension 1, followed by Hispanic and, further away, African-American. The second graph indicates that high and middle ses are close to one another, with low ses much further away. There is nothing in this analysis to contradict one's common-sense interpretation of these variables.

Example 2

Next, we will run the same correspondence analysis separately by gender, with the numeric results suppressed.

Example 3

This is an example from anthropology involving Native American petroglyphs from eight different sites. We will be using camat, the matrix version of the correspondence analysis command. The petroglyphs are categorized into six different motifs: linear, animal, atlatl, curved, human and amorph (for amorphous). The data form an 8x6 table giving the number of each motif (columns) located at each of the eight sites (rows).

A quick look suggests that the atlatl motif has an unusual distribution, in that roughly 75% of the atlatl petroglyphs were found at a single site, site 2. Let's run the correspondence analysis and see what else we can find.
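The kind of "quick look" described above amounts to computing a column profile: each site's share of one motif's total. A minimal sketch in Python follows. The counts are hypothetical, chosen only so that the arithmetic is easy to follow, and are not the actual petroglyph data; the helper name column_shares is also an assumption made for this illustration.

```python
# Hypothetical counts (NOT the petroglyph data): rows are sites, columns are motifs.
motifs = ["linear", "atlatl"]
counts = [
    [12, 1],  # site 1
    [8, 9],   # site 2
    [10, 2],  # site 3
]


def column_shares(counts, col):
    """Share of one column's total that falls in each row (a column profile)."""
    column = [row[col] for row in counts]
    total = sum(column)
    return [x / total for x in column]


shares = column_shares(counts, motifs.index("atlatl"))

# Site numbers are 1-based, so add 1 to the index of the largest share.
top_site = max(range(len(shares)), key=shares.__getitem__) + 1
print(shares, top_site)
```

A motif whose profile is concentrated in one site like this lies far from the mean profile, so it tends to contribute heavily to the total inertia.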