Introduction
Cluster analysis techniques are not the only way to find unobserved groupings in your data. In fact, from several perspectives cluster analysis may not be the best way to determine these groupings. Several latent variable approaches are also available. In this unit we will explore two of them: latent profile analysis and latent class analysis.
The advantage of these approaches over cluster analysis is that they are model based and generate probabilities for group membership. It is possible to test these models and to analyze their goodness of fit. The downside is that they require specialized software that is more complex to run than typical statistical packages. We will demonstrate these techniques using the Mplus software from Muthén & Muthén. We will also use Stata for descriptive and subsidiary analyses.
Latent profile analysis will use continuous predictors, and latent class analysis will use binary predictors. For the continuous predictors we will use the reading, writing, math, science and social studies test scores from the hsb6a dataset. For the binary predictors we will do median splits on each of the tests to create hiread, hiwrite, himath, hisci and hiss.
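The median split described above can be sketched in a few lines. The following Python fragment is only an illustration and is not part of the original Stata workflow. The scores are made up, the function name median_split is hypothetical, and coding a score at or above the median as 1 is an assumption, since the text does not say how ties at the median were handled.

```python
from statistics import median

def median_split(scores):
    """Return 0/1 indicators: 1 if the score is at or above the median.

    Treating ties at the median as 'high' is an assumption of this sketch.
    """
    cut = median(scores)
    return [1 if s >= cut else 0 for s in scores]

# Hypothetical reading scores, not taken from hsb6a
read = [28.3, 44.2, 52.1, 57.0, 76.0]
hiread = median_split(read)
print(hiread)
```

With an exact median split and no ties, each binary variable would have a mean close to .5, which is a quick check to run against the summary statistics.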
Looking at the data
use hsb6a
describe
Contains data from hsb6a.dta
  obs:           600                          highschool and beyond (600 cases)
 vars:            23                          24 Oct 2003 14:18
 size:        31,200 (99.0% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              int    %9.0g
gender          byte   %9.0g       gl
race            byte   %12.0g      rl
ses             byte   %9.0g       sl
sch             byte   %9.0g       scl
prog            byte   %9.0g       pl
locus           float  %9.0g                  locus of control
concept         float  %9.0g                  self-concept
mot             float  %9.0g                  motivation
career          byte   %14.0g      cl         career choice
read            float  %9.0g                  reading score
write           float  %9.0g                  writing score
math            float  %9.0g                  math score
sci             float  %9.0g                  science score
ss              float  %9.0g                  social studies score
hiread          byte   %9.0g
hiwrite         byte   %9.0g
himath          byte   %9.0g
hisci           byte   %9.0g
hiss            byte   %9.0g
sum read write math sci ss hiread hiwrite himath hisci hiss
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       600    51.90183    10.10298       28.3         76
       write |       600    52.38483    9.726455       25.5       67.1
        math |       600      51.849    9.414736       31.8       75.5
         sci |       600    51.76333    9.706179         26       74.2
          ss |       600    52.04567    9.879228       25.7       70.5
-------------+--------------------------------------------------------
      hiread |       600        .525    .4997913          0          1
     hiwrite |       600         .54    .4988133          0          1
      himath |       600    .4966667    .5004061          0          1
       hisci |       600    .5266667     .499705          0          1
        hiss |       600    .6483333     .477889          0          1
A 2 Class Latent Profile Model
Data:
  File is I:\mplus\hsb6.dat ;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hiread hiwrite himath hisci hiss academic;
  Usevariables are
    read write math sci ss ;
  ! one categorical latent variable c with 2 classes
  classes = c(2);
Analysis:
  Type=mixture;
MODEL:
  ! starting values (*) for the class-specific means
  %C#1%
  [read math sci ss write * 30 ];
  %C#2%
  [read math sci ss write * 60];
OUTPUT:
  ! TECH8 prints the optimization history
  TECH8;
SAVEDATA:
  ! save posterior class probabilities and most likely class
  file is lca_ex1.txt ;
  save is cprob;
  format is free;
THE MODEL ESTIMATION TERMINATED NORMALLY
TESTS OF MODEL FIT
Loglikelihood
H0 Value -5213.102
Information Criteria
Number of Free Parameters 16
Akaike (AIC) 10458.203
Bayesian (BIC) 10517.464
Sample-Size Adjusted BIC 10466.721
(n* = (n + 2) / 24)
Entropy 0.865
FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
BASED ON ESTIMATED POSTERIOR PROBABILITIES
Class 1 123.03223 0.41011
Class 2 176.96777 0.58989
CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY CLASS MEMBERSHIP
Class Counts and Proportions
Class 1 120 0.40000
Class 2 180 0.60000
Average Class Probabilities by Class
1 2
Class 1 0.961 0.039
Class 2 0.043 0.957
MODEL RESULTS
Estimates S.E. Est./S.E.
CLASS 1
Means
READ 43.151 0.820 52.641
WRITE 44.524 1.024 43.485
MATH 43.860 0.757 57.947
SCI 43.322 1.051 41.239
SS 45.119 0.946 47.707
Variances
READ 49.035 4.175 11.745
WRITE 44.303 3.927 11.283
MATH 45.062 3.768 11.958
SCI 48.986 5.184 9.450
SS 55.410 4.445 12.465
CLASS 2
Means
READ 57.915 0.847 68.403
WRITE 58.115 0.625 93.039
MATH 57.136 0.800 71.386
SCI 56.729 0.668 84.953
SS 57.220 0.723 79.137
Variances
READ 49.035 4.175 11.745
WRITE 44.303 3.927 11.283
MATH 45.062 3.768 11.958
SCI 48.986 5.184 9.450
SS 55.410 4.445 12.465
LATENT CLASS REGRESSION MODEL PART
Means
C#1 -0.364 0.179 -2.032
QUALITY OF NUMERICAL RESULTS
Condition Number for the Information Matrix 0.462E-03
(ratio of smallest to largest eigenvalue)
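The information criteria in the output above can be recomputed from the loglikelihood and the number of free parameters. The Python sketch below is an illustration added for checking the arithmetic and is not part of the Mplus run. The function name is hypothetical, and the sample size of 300 is inferred from the class counts printed in the output (120 + 180) rather than stated in the text.

```python
import math

def information_criteria(loglik, n_params, n):
    """AIC, BIC and sample-size adjusted BIC from a log-likelihood."""
    deviance = -2.0 * loglik
    aic = deviance + 2 * n_params
    bic = deviance + n_params * math.log(n)
    # Adjusted BIC replaces n with n* = (n + 2) / 24, as printed in the output
    abic = deviance + n_params * math.log((n + 2) / 24)
    return aic, bic, abic

# Free parameters: 5 means in each of 2 classes, 5 variances held equal
# across classes, and 1 class-membership logit
n_params = 2 * 5 + 5 + 1

# n = 300 is inferred from the class counts in the output (120 + 180)
aic, bic, abic = information_criteria(-5213.102, n_params, 300)
print(round(aic, 2), round(bic, 2), round(abic, 2))
```

The results agree with the printed values of 10458.203, 10517.464 and 10466.721 to within rounding of the loglikelihood. The parameter count of 16 also matches the output, which is consistent with the variances being identical in the two classes.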
A 3 Class Latent Profile Model
Data:
  File is I:\mplus\hsb6.dat ;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hiread hiwrite himath hisci hiss academic;
  Usevariables are
    read write math sci ss ;
  ! one categorical latent variable c with 3 classes
  classes = c(3);
Analysis:
  Type=mixture;
MODEL:
  ! starting values (*) for the class-specific means
  %C#1%
  [read math sci ss write *30 ];
  %C#2%
  [read math sci ss write *45];
  %C#3%
  [read math sci ss write *60];
OUTPUT:
  TECH8;
SAVEDATA:
  ! save posterior class probabilities and most likely class
  file is lca_ex2.txt ;
  save is cprob;
  format is free;
THE MODEL ESTIMATION TERMINATED NORMALLY
TESTS OF MODEL FIT
Loglikelihood
H0 Value -5100.544
Information Criteria
Number of Free Parameters 22
Akaike (AIC) 10245.087
Bayesian (BIC) 10326.571
Sample-Size Adjusted BIC 10256.800
(n* = (n + 2) / 24)
Entropy 0.877
FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
BASED ON ESTIMATED POSTERIOR PROBABILITIES
Class 1 98.08460 0.32695
Class 2 137.86474 0.45955
Class 3 64.05066 0.21350
CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY CLASS MEMBERSHIP
Class Counts and Proportions
Class 1 99 0.33000
Class 2 138 0.46000
Class 3 63 0.21000
Average Class Probabilities by Class
1 2 3
Class 1 0.961 0.039 0.000
Class 2 0.021 0.940 0.039
Class 3 0.000 0.068 0.932
MODEL RESULTS
Estimates S.E. Est./S.E.
CLASS 1
Means
READ 41.866 0.614 68.208
WRITE 43.080 0.870 49.514
MATH 42.447 0.549 77.337
SCI 41.409 0.748 55.358
SS 44.232 0.819 54.010
Variances
READ 33.867 3.334 10.159
WRITE 40.042 4.168 9.607
MATH 28.667 2.980 9.619
SCI 34.199 3.411 10.027
SS 48.355 4.323 11.185
CLASS 2
Means
READ 53.058 0.726 73.044
WRITE 55.195 0.677 81.493
MATH 52.704 0.683 77.191
SCI 53.195 0.600 88.727
SS 53.377 0.745 71.657
Variances
READ 33.867 3.334 10.159
WRITE 40.042 4.168 9.607
MATH 28.667 2.980 9.619
SCI 34.199 3.411 10.027
SS 48.355 4.323 11.185
CLASS 3
Means
READ 64.588 0.949 68.070
WRITE 61.318 0.624 98.232
MATH 63.667 0.907 70.167
SCI 62.043 0.873 71.064
SS 62.139 0.827 75.163
Variances
READ 33.867 3.334 10.159
WRITE 40.042 4.168 9.607
MATH 28.667 2.980 9.619
SCI 34.199 3.411 10.027
SS 48.355 4.323 11.185
LATENT CLASS REGRESSION MODEL PART
Means
C#1 0.426 0.201 2.120
C#2 0.767 0.196 3.901
QUALITY OF NUMERICAL RESULTS
Condition Number for the Information Matrix 0.461E-03
(ratio of smallest to largest eigenvalue)
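The means of C#1 and C#2 in the latent class regression part are logits relative to the last class, so they can be converted into the class proportions reported under the estimated posterior probabilities. The Python sketch below illustrates that conversion; the function name is hypothetical and the code is an added check, not something produced by Mplus.

```python
import math

def class_proportions(logits):
    """Convert class logits (last class as reference, logit 0) to proportions."""
    expd = [math.exp(v) for v in logits] + [1.0]  # reference class: exp(0) = 1
    total = sum(expd)
    return [e / total for e in expd]

# Means of C#1 and C#2 from the three-class latent profile output
props = class_proportions([0.426, 0.767])
print([round(p, 3) for p in props])
```

The three proportions come out at approximately 0.327, 0.460 and 0.213, which matches the printed 0.32695, 0.45955 and 0.21350 up to rounding of the logits.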
A 2 Class Latent Class Model
Data:
  File is h:\mplus\hsb6.dat ;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hiread hiwrite himath hisci hiss academic;
  Usevariables are
    hiread hiwrite himath hisci hiss ;
  ! treat the five binary indicators as categorical
  categorical = hiread hiwrite himath hisci hiss;
  classes = c(2);
Analysis:
  Type=mixture;
MODEL:
  ! starting values (*) for the class-specific thresholds
  %C#1%
  [hiread$1 *2 himath$1 *2 hisci$1 *2 hiss$1 *2 hiwrite$1 *2 ];
  %C#2%
  [hiread$1 *-2 himath$1 *-2 hisci$1 *-2 hiss$1 *-2 hiwrite$1 *-2 ];
OUTPUT:
  TECH8;
SAVEDATA:
  ! save posterior class probabilities and most likely class
  file is lca_ex7.txt ;
  save is cprob;
  format is free;
THE MODEL ESTIMATION TERMINATED NORMALLY
TESTS OF MODEL FIT
Loglikelihood
H0 Value -849.157
Information Criteria
Number of Free Parameters 11
Akaike (AIC) 1720.315
Bayesian (BIC) 1761.057
Sample-Size Adjusted BIC 1726.171
(n* = (n + 2) / 24)
Entropy 0.815
Chi-Square Test of Model Fit for the Latent Class Indicator Model Part
Pearson Chi-Square
Value 44.642
Degrees of Freedom 20
P-Value 0.0012
Likelihood Ratio Chi-Square
Value 45.747
Degrees of Freedom 20
P-Value 0.0009
FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
BASED ON ESTIMATED POSTERIOR PROBABILITIES
Class 1 123.33019 0.41110
Class 2 176.66981 0.58890
CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY CLASS MEMBERSHIP
Class Counts and Proportions
Class 1 127 0.42333
Class 2 173 0.57667
Average Class Probabilities by Class
1 2
Class 1 0.930 0.070
Class 2 0.030 0.970
MODEL RESULTS
Estimates S.E. Est./S.E.
CLASS 1
CLASS 2
LATENT CLASS INDICATOR MODEL PART
Class 1
Thresholds
HIREAD$1 2.273 0.424 5.354
HIWRITE$1 1.376 0.276 4.990
HIMATH$1 2.081 0.399 5.209
HISCI$1 2.035 0.411 4.947
HISS$1 0.642 0.231 2.780
Class 2
Thresholds
HIREAD$1 -1.540 0.264 -5.823
HIWRITE$1 -1.488 0.244 -6.109
HIMATH$1 -1.217 0.217 -5.616
HISCI$1 -1.264 0.213 -5.927
HISS$1 -2.047 0.279 -7.328
LATENT CLASS REGRESSION MODEL PART
Means
C#1 -0.359 0.161 -2.231
LATENT CLASS INDICATOR MODEL PART IN PROBABILITY SCALE
Class 1
HIREAD
Category 1 0.907 0.036 25.221
Category 2 0.093 0.036 2.599
HIWRITE
Category 1 0.798 0.044 17.985
Category 2 0.202 0.044 4.542
HIMATH
Category 1 0.889 0.039 22.555
Category 2 0.111 0.039 2.816
HISCI
Category 1 0.884 0.042 21.036
Category 2 0.116 0.042 2.748
HISS
Category 1 0.655 0.052 12.564
Category 2 0.345 0.052 6.615
Class 2
HIREAD
Category 1 0.177 0.038 4.592
Category 2 0.823 0.038 21.417
HIWRITE
Category 1 0.184 0.037 5.031
Category 2 0.816 0.037 22.288
HIMATH
Category 1 0.228 0.038 5.980
Category 2 0.772 0.038 20.197
HISCI
Category 1 0.220 0.037 6.015
Category 2 0.780 0.037 21.288
HISS
Category 1 0.114 0.028 4.043
Category 2 0.886 0.028 31.304
QUALITY OF NUMERICAL RESULTS
Condition Number for the Information Matrix 0.654E-01
(ratio of smallest to largest eigenvalue)
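The thresholds in the latent class indicator part are on the logit scale, and the probability-scale section of the output is a direct transformation of them: the probability of category 1 is the inverse logit of the threshold. The Python sketch below, added as an illustration with a hypothetical function name, reproduces the Class 1 probabilities from the Class 1 thresholds.

```python
import math

def threshold_to_probs(threshold):
    """Return (P(category 1), P(category 2)) for a binary item threshold."""
    p_cat1 = 1.0 / (1.0 + math.exp(-threshold))  # inverse logit
    return p_cat1, 1.0 - p_cat1

# Class 1 thresholds from the two-class latent class output
class1 = {"HIREAD": 2.273, "HIWRITE": 1.376, "HIMATH": 2.081,
          "HISCI": 2.035, "HISS": 0.642}
for item, tau in class1.items():
    p1, p2 = threshold_to_probs(tau)
    print(f"{item:8s} {p1:.3f} {p2:.3f}")
```

A large positive threshold therefore means a high probability of being in the low category of the median split, which is why Class 1 reads as the low-scoring group and Class 2, with negative thresholds, as the high-scoring group.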
A 3 Class Latent Class Model
Data:
  File is h:\mplus\hsb6.dat ;
Variable:
  Names are
    id gender race ses sch prog locus concept mot career read write math
    sci ss hiread hiwrite himath hisci hiss academic;
  Usevariables are
    hiread hiwrite himath hisci hiss ;
  ! treat the five binary indicators as categorical
  categorical = hiread hiwrite himath hisci hiss;
  classes = c(3);
Analysis:
  Type=mixture;
MODEL:
  ! starting values (*) for the class-specific thresholds
  %C#1%
  [hiread$1 *2 himath$1 *2 hisci$1 *2 hiss$1 *2 hiwrite$1 *2 ];
  %C#2%
  [hiread$1 *0 himath$1 *0 hisci$1 *0 hiss$1 *0 hiwrite$1 *0 ];
  %C#3%
  [hiread$1 *-2 himath$1 *-2 hisci$1 *-2 hiss$1 *-2 hiwrite$1 *-2 ];
OUTPUT:
  TECH8;
SAVEDATA:
  ! save posterior class probabilities and most likely class
  file is lca_ex8.txt ;
  save is cprob;
  format is free;
THE MODEL ESTIMATION TERMINATED NORMALLY
TESTS OF MODEL FIT
Loglikelihood
H0 Value -839.066
Information Criteria
Number of Free Parameters 17
Akaike (AIC) 1712.132
Bayesian (BIC) 1775.096
Sample-Size Adjusted BIC 1721.182
(n* = (n + 2) / 24)
Entropy 0.682
Chi-Square Test of Model Fit for the Latent Class Indicator Model Part
Pearson Chi-Square
Value 21.369
Degrees of Freedom 14
P-Value 0.0925
Likelihood Ratio Chi-Square
Value 25.564
Degrees of Freedom 14
P-Value 0.0294
FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
BASED ON ESTIMATED POSTERIOR PROBABILITIES
Class 1 95.51732 0.31839
Class 2 127.98211 0.42661
Class 3 76.50058 0.25500
CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY CLASS MEMBERSHIP
Class Counts and Proportions
Class 1 94 0.31333
Class 2 130 0.43333
Class 3 76 0.25333
Average Class Probabilities by Class
1 2 3
Class 1 0.913 0.087 0.000
Class 2 0.074 0.826 0.099
Class 3 0.000 0.163 0.837
MODEL RESULTS
Estimates S.E. Est./S.E.
CLASS 1
CLASS 2
CLASS 3
LATENT CLASS INDICATOR MODEL PART
Class 1
Thresholds
HIREAD$1 2.883 0.671 4.296
HIWRITE$1 1.735 0.418 4.150
HIMATH$1 2.863 0.739 3.877
HISCI$1 3.007 0.861 3.492
HISS$1 0.991 0.319 3.106
Class 2
Thresholds
HIREAD$1 -0.392 0.348 -1.128
HIWRITE$1 -0.451 0.445 -1.013
HIMATH$1 -0.258 0.342 -0.754
HISCI$1 -0.453 0.269 -1.688
HISS$1 -1.201 0.400 -2.999
Class 3
Thresholds
HIREAD$1 -4.377 6.575 -0.666
HIWRITE$1 -15.000 0.000 0.000
HIMATH$1 -2.932 1.699 -1.726
HISCI$1 -2.257 0.986 -2.289
HISS$1 -3.761 2.143 -1.755
LATENT CLASS REGRESSION MODEL PART
Means
C#1 0.222 0.398 0.558
C#2 0.515 0.499 1.032
LATENT CLASS INDICATOR MODEL PART IN PROBABILITY SCALE
Class 1
HIREAD
Category 1 0.947 0.034 28.108
Category 2 0.053 0.034 1.574
HIWRITE
Category 1 0.850 0.053 15.951
Category 2 0.150 0.053 2.815
HIMATH
Category 1 0.946 0.038 25.073
Category 2 0.054 0.038 1.431
HISCI
Category 1 0.953 0.039 24.648
Category 2 0.047 0.039 1.219
HISS
Category 1 0.729 0.063 11.577
Category 2 0.271 0.063 4.298
Class 2
HIREAD
Category 1 0.403 0.084 4.819
Category 2 0.597 0.084 7.134
HIWRITE
Category 1 0.389 0.106 3.680
Category 2 0.611 0.106 5.775
HIMATH
Category 1 0.436 0.084 5.177
Category 2 0.564 0.084 6.702
HISCI
Category 1 0.389 0.064 6.090
Category 2 0.611 0.064 9.582
HISS
Category 1 0.231 0.071 3.249
Category 2 0.769 0.071 10.797
Class 3
HIREAD
Category 1 0.012 0.081 0.154
Category 2 0.988 0.081 12.253
HIWRITE
Category 1 0.000 0.000 0.000
Category 2 1.000 0.000 0.000
HIMATH
Category 1 0.051 0.082 0.620
Category 2 0.949 0.082 11.641
HISCI
Category 1 0.095 0.085 1.120
Category 2 0.905 0.085 10.700
HISS
Category 1 0.023 0.048 0.477
Category 2 0.977 0.048 20.530
QUALITY OF NUMERICAL RESULTS
Condition Number for the Information Matrix 0.323E-03
(ratio of smallest to largest eigenvalue)
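The degrees of freedom of the chi-square tests and the comparison of the two latent class models can be checked with a little arithmetic. The Python sketch below is an added illustration, not Mplus output. The function name is hypothetical, and the degrees-of-freedom formula assumes an unrestricted latent class model with binary items, where every threshold and class logit counts as a free parameter.

```python
def lca_chi_square_df(n_items, n_classes):
    """Degrees of freedom for the chi-square test with binary items."""
    n_patterns = 2 ** n_items                          # possible response patterns
    n_params = n_classes * n_items + (n_classes - 1)   # thresholds + class logits
    return n_patterns - 1 - n_params

print(lca_chi_square_df(5, 2))
print(lca_chi_square_df(5, 3))

# Information criteria copied from the two latent class outputs
fit = {2: {"AIC": 1720.315, "BIC": 1761.057},
       3: {"AIC": 1712.132, "BIC": 1775.096}}
for crit in ("AIC", "BIC"):
    best = min(fit, key=lambda k: fit[k][crit])
    print(crit, "is lowest for the", best, "class model")
```

The function returns 20 and 14, the degrees of freedom printed for the two-class and three-class models. For these outputs the AIC is lower for the three-class model while the BIC is lower for the two-class model, so the two criteria do not point to the same solution.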
Categorical Data Analysis Course
Phil Ender -- 24apr03