Introduction
Partition methods break the observations into distinct, nonoverlapping groups. There are many different partition methods; this unit illustrates two of them, kmeans and kmedians. Kmeans clustering is a popular nonhierarchical clustering technique. It is particularly appropriate when the number of clusters, or the approximate number of clusters, is known a priori. Unlike hierarchical cluster analysis, kmeans clustering does not produce a nested sequence of solutions for every number of clusters from 1 to n. Rather, the researcher provides the kmeans cluster analysis program with the number of clusters, k, and the program searches for the best solution with that number of clusters.
Kmeans cluster analysis programs begin by creating k clusters according to some arbitrary procedure. The program then calculates the mean, or centroid, of each cluster. If an observation is closer to the centroid of another cluster, it is made a member of that cluster, and the centroids are recalculated. This process is repeated until no observation is reassigned to a different cluster.
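The iteration described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Stata's implementation: the random starting partition, the max_iter guard, and the function names sqdist and kmeans are assumptions made for the sketch, and all observations are reassigned in one pass before the centroids are updated. The data are the 17 districts used in this unit (lep, read, math, lang); because the start differs from Stata's, the clusters need not match the output shown later.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed, max_iter=100):
    """Return (labels, centroids) for a k-cluster partition of points."""
    rng = random.Random(seed)
    # Arbitrary start (an assumption): deal shuffled observations into k groups.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, obs in enumerate(order):
        labels[obs] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Centroid = mean of each variable over the cluster's current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(col) for col in zip(*members))
        # Move every observation to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: sqdist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # nothing was reassigned, so stop
            break
        labels = new_labels
    return labels, centroids

# lep, read, math, lang for the 17 districts, in the order they are entered.
districts = [
    (.38, 626.5, 601.3, 605.3), (.18, 654.0, 647.1, 641.8),
    (.07, 677.2, 676.5, 670.5), (.09, 639.9, 640.3, 636.0),
    (.19, 614.7, 617.3, 606.2), (.12, 670.2, 666.0, 659.3),
    (.20, 651.1, 645.2, 643.4), (.41, 645.4, 645.8, 644.8),
    (.07, 683.5, 682.9, 674.3), (.39, 648.6, 647.8, 643.1),
    (.21, 650.4, 650.8, 643.9), (.24, 637.0, 636.9, 626.5),
    (.09, 641.1, 628.8, 629.4), (.12, 638.0, 627.7, 628.6),
    (.11, 661.4, 659.0, 651.8), (.22, 646.4, 646.2, 647.0),
    (.33, 634.1, 632.0, 627.8),
]

labels, centroids = kmeans(districts, k=3, seed=1)
for c, center in enumerate(centroids):
    print(c, labels.count(c), [round(v, 2) for v in center])
```

Each printed row shows a cluster number (0 to k-1 in this sketch), the number of districts in it, and the cluster means.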
The partitioning process is sensitive to the starting point. Stata selects a different random starting point each time the cluster command is executed. You can make Stata use a specified random starting point with the prandom() suboption of start(), for example start(prandom(1122334455)), which makes it possible to replicate analyses exactly.
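The role of the seed can be illustrated with a small Python sketch. It assumes that a prandom-style start means forming k random groups of observations, whose centers then serve as the starting centers; the function name random_partition and the dealing scheme are hypothetical and are not taken from Stata. The point is only that a fixed seed fixes the starting partition, and hence everything that follows from it.

```python
import random

def random_partition(n_obs, k, seed):
    """Assign each of n_obs observations to one of k starting groups."""
    rng = random.Random(seed)  # private generator: the seed alone decides the result
    order = list(range(n_obs))
    rng.shuffle(order)
    groups = [0] * n_obs
    for position, obs in enumerate(order):
        groups[obs] = position % k + 1  # groups are numbered 1..k
    return groups

first = random_partition(17, 3, seed=1122334455)
again = random_partition(17, 3, seed=1122334455)
other = random_partition(17, 3, seed=9988776655)

print(first == again)  # True: the same seed reproduces the same start
print(first == other)  # a different seed will usually give a different start
```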
Kmeans Cluster Analysis in Stata
input lep read math lang str3 district
.38 626.5 601.3 605.3 lau
.18 654.0 647.1 641.8 ccu
.07 677.2 676.5 670.5 bhu
.09 639.9 640.3 636.0 ing
.19 614.7 617.3 606.2 com
.12 670.2 666.0 659.3 smm
.20 651.1 645.2 643.4 bur
.41 645.4 645.8 644.8 gln
.07 683.5 682.9 674.3 pvu
.39 648.6 647.8 643.1 sgu
.21 650.4 650.8 643.9 abc
.24 637.0 636.9 626.5 pas
.09 641.1 628.8 629.4 lan
.12 638.0 627.7 628.6 plm
.11 661.4 659.0 651.8 tor
.22 646.4 646.2 647.0 dow
.33 634.1 632.0 627.8 lbu
end
cluster kmeans lep read math lang, k(3) name(cl3) start(prandom(1122334455))
tabstat lep read math lang, by(cl3)
Summary statistics: mean
  by categories of: cl3

     cl3 |       lep      read      math      lang
---------+----------------------------------------
       1 |      .225     631.9       624  620.6333
       2 |    .22625    649.65   647.775   643.975
       3 |  .0866667  676.9667  675.1333  668.0333
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------
tabulate district cl3
| cl3
district | 1 2 3 | Total
-----------+---------------------------------+----------
abc | 0 1 0 | 1
bhu | 0 0 1 | 1
bur | 0 1 0 | 1
ccu | 0 1 0 | 1
com | 1 0 0 | 1
dow | 0 1 0 | 1
gln | 0 1 0 | 1
ing | 0 1 0 | 1
lan | 1 0 0 | 1
lau | 1 0 0 | 1
lbu | 1 0 0 | 1
pas | 1 0 0 | 1
plm | 1 0 0 | 1
pvu | 0 0 1 | 1
sgu | 0 1 0 | 1
smm | 0 0 1 | 1
tor | 0 1 0 | 1
-----------+---------------------------------+----------
Total | 6 8 3 | 17
xi: mvreg lep read math lang = i.cl3
i.cl3 _Icl3_1-3 (naturally coded; _Icl3_1 omitted)
Equation Obs Parms RMSE "R-sq" F P
----------------------------------------------------------------------
lep 17 3 .1079696 0.2264 2.049005 0.1658
read 17 3 7.794986 0.8279 33.68502 0.0000
math 17 3 9.170697 0.8216 32.22942 0.0000
lang 17 3 8.15762 0.8356 35.57204 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep |
_Icl3_2 | .00125 .0583103 0.02 0.983 -.1238131 .1263131
_Icl3_3 | -.1383333 .0763461 -1.81 0.091 -.3020793 .0254127
_cons | .225 .0440784 5.10 0.000 .1304612 .3195388
-------------+----------------------------------------------------------------
read |
_Icl3_2 | 17.75002 4.209774 4.22 0.001 8.720949 26.77908
_Icl3_3 | 45.06668 5.511887 8.18 0.000 33.24486 56.8885
_cons | 631.9 3.18229 198.57 0.000 625.0747 638.7253
-------------+----------------------------------------------------------------
math |
_Icl3_2 | 23.77499 4.952742 4.80 0.000 13.15242 34.39757
_Icl3_3 | 51.13334 6.484662 7.89 0.000 37.22513 65.04156
_cons | 624 3.743921 166.67 0.000 615.9701 632.0299
-------------+----------------------------------------------------------------
lang |
_Icl3_2 | 23.34167 4.405619 5.30 0.000 13.89256 32.79078
_Icl3_3 | 47.39999 5.768309 8.22 0.000 35.0282 59.77179
_cons | 620.6333 3.330335 186.36 0.000 613.4905 627.7762
------------------------------------------------------------------------------
cluster kmeans lep read math lang, k(4) name(cl4) start(prandom(9988776655))
tabstat lep read math lang, by(cl4)
Summary statistics: mean
  by categories of: cl4

     cl4 |       lep      read      math      lang
---------+----------------------------------------
       1 |  .2457143  651.0429  648.8429  645.1143
       2 |  .0866667  676.9667  675.1333  668.0333
       3 |      .174    638.02    633.14    629.66
       4 |      .285     620.6     609.3    605.75
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------
tabulate district cl4
| cl4
district | 1 2 3 4 | Total
-----------+--------------------------------------------+----------
abc | 1 0 0 0 | 1
bhu | 0 1 0 0 | 1
bur | 1 0 0 0 | 1
ccu | 1 0 0 0 | 1
com | 0 0 0 1 | 1
dow | 1 0 0 0 | 1
gln | 1 0 0 0 | 1
ing | 0 0 1 0 | 1
lan | 0 0 1 0 | 1
lau | 0 0 0 1 | 1
lbu | 0 0 1 0 | 1
pas | 0 0 1 0 | 1
plm | 0 0 1 0 | 1
pvu | 0 1 0 0 | 1
sgu | 1 0 0 0 | 1
smm | 0 1 0 0 | 1
tor | 1 0 0 0 | 1
-----------+--------------------------------------------+----------
Total | 7 3 5 2 | 17
xi: mvreg lep read math lang = i.cl4
i.cl4 _Icl4_1-4 (naturally coded; _Icl4_1 omitted)
Equation Obs Parms RMSE "R-sq" F P
----------------------------------------------------------------------
lep 17 4 .1037779 0.3364 2.196513 0.1373
read 17 4 5.286934 0.9265 54.62785 0.0000
math 17 4 6.38132 0.9198 49.68041 0.0000
lang 17 4 4.338311 0.9568 96.01699 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep |
_Icl4_2 | -.1590476 .0716136 -2.22 0.045 -.3137593 -.0043359
_Icl4_3 | -.0717143 .0607661 -1.18 0.259 -.2029915 .0595629
_Icl4_4 | .0392857 .0832074 0.47 0.645 -.140473 .2190444
_cons | .2457143 .0392244 6.26 0.000 .1609752 .3304534
-------------+----------------------------------------------------------------
read |
_Icl4_2 | 25.92381 3.648331 7.11 0.000 18.04207 33.80555
_Icl4_3 | -13.02287 3.095712 -4.21 0.001 -19.71075 -6.334991
_Icl4_4 | -30.44286 4.238978 -7.18 0.000 -39.60061 -21.2851
_cons | 651.0429 1.998273 325.80 0.000 646.7259 655.3599
-------------+----------------------------------------------------------------
math |
_Icl4_2 | 26.29049 4.403529 5.97 0.000 16.77724 35.80374
_Icl4_3 | -15.70285 3.736518 -4.20 0.001 -23.77511 -7.630592
_Icl4_4 | -39.54286 5.116438 -7.73 0.000 -50.59626 -28.48947
_cons | 648.8429 2.411912 269.02 0.000 643.6322 654.0535
-------------+----------------------------------------------------------------
lang |
_Icl4_2 | 22.91904 2.993719 7.66 0.000 16.4515 29.38658
_Icl4_3 | -15.45429 2.540255 -6.08 0.000 -20.94217 -9.966399
_Icl4_4 | -39.36428 3.478387 -11.32 0.000 -46.87888 -31.84968
_cons | 645.1143 1.639728 393.43 0.000 641.5719 648.6567
------------------------------------------------------------------------------
cluster kmeans lep read math lang, k(5) name(cl5) start(prandom(7654321123))
tabstat lep read math lang, by(cl5)
Summary statistics: mean
  by categories of: cl5

     cl5 |       lep      read      math      lang
---------+----------------------------------------
       1 |  .2457143  651.0429  648.8429  645.1143
       2 |       .09     639.9     640.3       636
       3 |      .195    637.55    631.35   628.075
       4 |      .285     620.6     609.3    605.75
       5 |  .0866667  676.9667  675.1333  668.0333
---------+----------------------------------------
   Total |  .2011765  648.2059  644.2118  639.9824
--------------------------------------------------
tabulate district cl5
| cl5
district | 1 2 3 4 5 | Total
-----------+-------------------------------------------------------+----------
abc | 1 0 0 0 0 | 1
bhu | 0 0 0 0 1 | 1
bur | 1 0 0 0 0 | 1
ccu | 1 0 0 0 0 | 1
com | 0 0 0 1 0 | 1
dow | 1 0 0 0 0 | 1
gln | 1 0 0 0 0 | 1
ing | 0 1 0 0 0 | 1
lan | 0 0 1 0 0 | 1
lau | 0 0 0 1 0 | 1
lbu | 0 0 1 0 0 | 1
pas | 0 0 1 0 0 | 1
plm | 0 0 1 0 0 | 1
pvu | 0 0 0 0 1 | 1
sgu | 1 0 0 0 0 | 1
smm | 0 0 0 0 1 | 1
tor | 1 0 0 0 0 | 1
-----------+-------------------------------------------------------+----------
Total | 7 1 4 2 3 | 17
xi: mvreg lep read math lang = i.cl5
i.cl5 _Icl5_1-5 (naturally coded; _Icl5_1 omitted)
Equation Obs Parms RMSE "R-sq" F P
----------------------------------------------------------------------
lep 17 5 .1045578 0.3782 1.824595 0.1890
read 17 5 5.469259 0.9274 38.3217 0.0000
math 17 5 6.22692 0.9295 39.54416 0.0000
lang 17 5 4.02521 0.9657 84.42677 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
lep |
_Icl5_2 | -.1557143 .111777 -1.39 0.189 -.3992555 .0878269
_Icl5_3 | -.0507143 .0655351 -0.77 0.454 -.193503 .0920744
_Icl5_4 | .0392857 .0838328 0.47 0.648 -.1433702 .2219416
_Icl5_5 | -.1590476 .0721518 -2.20 0.048 -.3162528 -.0018424
_cons | .2457143 .0395191 6.22 0.000 .1596095 .3318191
-------------+----------------------------------------------------------------
read |
_Icl5_2 | -11.14284 5.846884 -1.91 0.081 -23.88211 1.596427
_Icl5_3 | -13.49288 3.42804 -3.94 0.002 -20.96193 -6.023819
_Icl5_4 | -30.44286 4.385163 -6.94 0.000 -39.99731 -20.88841
_Icl5_5 | 25.92381 3.774148 6.87 0.000 17.70065 34.14697
_cons | 651.0429 2.067186 314.94 0.000 646.5389 655.5469
-------------+----------------------------------------------------------------
math |
_Icl5_2 | -8.542864 6.656858 -1.28 0.224 -23.04691 5.961183
_Icl5_3 | -17.49285 3.902929 -4.48 0.001 -25.9966 -8.989095
_Icl5_4 | -39.54286 4.992643 -7.92 0.000 -50.4209 -28.66483
_Icl5_5 | 26.29049 4.296983 6.12 0.000 16.92817 35.65281
_cons | 648.8429 2.353555 275.69 0.000 643.7149 653.9708
-------------+----------------------------------------------------------------
lang |
_Icl5_2 | -9.114284 4.30313 -2.12 0.056 -18.49 .261431
_Icl5_3 | -17.03929 2.522934 -6.75 0.000 -22.53629 -11.54229
_Icl5_4 | -39.36428 3.227348 -12.20 0.000 -46.39607 -32.3325
_Icl5_5 | 22.91904 2.777659 8.25 0.000 16.86704 28.97104
_cons | 645.1143 1.521386 424.03 0.000 641.7995 648.4291
------------------------------------------------------------------------------
Example Using Fisher's Iris Data
use http://www.gseis.ucla.edu/courses/data/iris, clear
cluster kmeans sl sw pl pw, k(3) name(c2) euc start(prandom(4343434343))
tab c2 type
           |          type of iris
        c2 |    setosa  versicolo  virginica |     Total
-----------+---------------------------------+----------
         1 |         0          3         36 |        39
         2 |         0         47         14 |        61
         3 |        50          0          0 |        50
-----------+---------------------------------+----------
     Total |        50         50         50 |       150
Kmedians Cluster Analysis in Stata
Kmedians clustering is a variation on the kmeans method. The same process is followed, except that the cluster centers are the medians of each variable instead of the means. Kmedians is appropriate when you need a more stable measure of the group centers, since medians are less affected by extreme observations than means.
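A short Python sketch shows why the median can be the more stable center. The cluster below is hypothetical (made-up read and math scores, with one extreme district added on purpose), and the helper name center is an assumption for the illustration.

```python
from statistics import mean, median

def center(members, stat):
    """Apply stat to each variable (column) of a cluster's members."""
    return tuple(stat(col) for col in zip(*members))

# Hypothetical (read, math) scores; the last row is an extreme observation.
cluster = [
    (650.0, 645.0),
    (651.0, 647.0),
    (649.0, 646.0),
    (652.0, 648.0),
    (700.0, 600.0),
]

print(center(cluster, mean))    # (660.4, 637.2): pulled toward the extreme row
print(center(cluster, median))  # (651.0, 646.0): stays with the bulk of the data
```

With the mean, a single unusual district shifts the center by about ten points on each variable; the median center barely moves.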
cluster kmedians lep read math lang, k(5) name(med5) start(prandom(777444))
tabulate district med5
| med5
district | 1 2 3 4 5 | Total
-----------+-------------------------------------------------------+----------
abc | 0 0 1 0 0 | 1
bhu | 1 0 0 0 0 | 1
bur | 0 0 1 0 0 | 1
ccu | 0 0 1 0 0 | 1
com | 0 0 0 1 0 | 1
dow | 0 0 1 0 0 | 1
gln | 0 0 1 0 0 | 1
ing | 0 1 0 0 0 | 1
lan | 0 0 0 0 1 | 1
lau | 0 0 0 0 1 | 1
lbu | 0 0 0 1 0 | 1
pas | 0 0 0 1 0 | 1
plm | 0 0 0 0 1 | 1
pvu | 1 0 0 0 0 | 1
sgu | 0 0 1 0 0 | 1
smm | 1 0 0 0 0 | 1
tor | 0 0 1 0 0 | 1
-----------+-------------------------------------------------------+----------
Total | 3 1 7 3 3 | 17
Example Using Fisher's Iris Data
use http://www.gseis.ucla.edu/courses/data/iris, clear
cluster kmedians sl sw pl pw, k(3) name(c3) euc start(prandom(666565656))
tab c3 type
           |          type of iris
        c3 |    setosa  versicolo  virginica |     Total
-----------+---------------------------------+----------
         1 |         0         10         47 |        57
         2 |        50          0          0 |        50
         3 |         0         40          3 |        43
-----------+---------------------------------+----------
     Total |        50         50         50 |       150
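One way to summarize how well the clusters recover the three iris types is the share of flowers that fall in the most common type of their cluster, sometimes called purity. This summary is not part of the Stata output; the Python sketch below is an added illustration that uses only the counts from the two iris cross-tabulations in this unit, and the function name purity is an assumption.

```python
def purity(crosstab):
    """Share of observations in the most common class of their cluster."""
    total = sum(sum(row) for row in crosstab)
    return sum(max(row) for row in crosstab) / total

# Rows are clusters; columns are setosa, versicolor, virginica.
kmeans_tab = [[0, 3, 36], [0, 47, 14], [50, 0, 0]]
kmedians_tab = [[0, 10, 47], [50, 0, 0], [0, 40, 3]]

print(round(purity(kmeans_tab), 3))    # 0.887  (133 of 150)
print(round(purity(kmedians_tab), 3))  # 0.913  (137 of 150)
```

In both solutions setosa is recovered perfectly, and the disagreements are all between versicolor and virginica.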
Phil Ender, 5jan05, 24apr00