This unit will cover a number of Stata commands that you have not seen before. Do not panic, this unit is primarily conceptual in nature. You do not have to learn all of the different procedures.
We begin with a fairly typical OLS regression analysis regressing api04 on meals, el, avg_ed and emer. This dataset has complete data on 4,702 schools from 834 school districts.
use http://www.philender.com/courses/data/api04, clear
keep if stype==5 /* elementary schools only */
regress api04 meals el avg_ed emer
Source | SS df MS Number of obs = 4702
-------------+------------------------------ F( 4, 4697) = 3086.43
Model | 28801025.3 4 7200256.32 Prob > F = 0.0000
Residual | 10957513.8 4697 2332.87499 R-squared = 0.7244
-------------+------------------------------ Adj R-squared = 0.7242
Total | 39758539.1 4701 8457.46418 Root MSE = 48.3
------------------------------------------------------------------------------
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.31768 .0469087 -28.09 0.000 -1.409643 -1.225717
el | .1810992 .0485153 3.73 0.000 .0859866 .2762119
avg_ed | 56.02456 1.825136 30.70 0.000 52.44643 59.60268
emer | -1.403778 .1232983 -11.39 0.000 -1.6455 -1.162055
_cons | 652.1849 7.106252 91.78 0.000 638.2533 666.1165
------------------------------------------------------------------------------
keep if e(sample)
(1017 observations deleted)
The problem with this analysis
is that we are treating these data as if the api scores were completely independent
from one school to the next. This is unlikely to be true since scores within school districts
are likely to be similar to one another.We can see how much of the variability is within district versus how much is between district by computing an intraclass correlation using the loneway command in Stata.
loneway api04 dnum
One-way Analysis of Variance for api04:
Number of obs = 4702
R-squared = 0.5563
Source SS df MS F Prob > F
-------------------------------------------------------------------------
Between dnum 22116036 833 26549.863 5.82 0.0000
Within dnum 17642503 3868 4561.1434
-------------------------------------------------------------------------
Total 39758539 4701 8457.4642
Intraclass Asy.
correlation S.E. [95% Conf. Interval]
------------------------------------------------
0.46321 0.03589 0.39287 0.53355
Estimated SD of dnum effect 62.73711
Estimated SD within dnum 67.53624
Est. reliability of a dnum mean 0.82820
(evaluated at n=5.59)
The intraclass correlation of .46 is fairly substantial and puts into question whether the
coefficients and standard errors from our original regression model are correct. The first
thing that we can try is to rerun the analysis using the cluster option. The cluster
option yields the same regression coefficients but allows for differences in the
variance/standard errors due to arbitrary intra-group correlation.
regress api04 meals el avg_ed emer, vce(cluster dnum)
Linear regression Number of obs = 4702
F( 4, 833) = 454.13
Prob > F = 0.0000
R-squared = 0.7244
Root MSE = 48.3
(Std. Err. adjusted for 834 clusters in dnum)
------------------------------------------------------------------------------
| Robust
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.31768 .1287663 -10.23 0.000 -1.570424 -1.064935
el | .1810992 .1077171 1.68 0.093 -.0303297 .3925281
avg_ed | 56.02456 5.358536 10.46 0.000 45.50674 66.54238
emer | -1.403778 .2983924 -4.70 0.000 -1.989467 -.8180886
_cons | 652.1849 21.312 30.60 0.000 610.3533 694.0164
------------------------------------------------------------------------------
Note that the variable el is nolonger significant at the .05 level.The analysis using the cluster option works but it is kind a quick-and-dirty solution that would benefit from a more precise solution.
An alternative to using the cluster option is to include dummy coded variables for school district. The advantage of dummy coding district is that it allows for differences in the average level of across across districts in addition to adjusting the standard errors taking into account the specific intra-group correlation. However, regression with 833 dummy variables for school districts is both slow and memory intensive (it requires Stata SE). The alternative is to use the areg command which is logicaly equivalent to the dummy variable approach.
areg api04 meals el avg_ed emer, absorb(dnum)
Linear regression, absorbing indicators Number of obs = 4702
F( 4, 3864) = 1793.95
Prob > F = 0.0000
R-squared = 0.8447
Adj R-squared = 0.8110
Root MSE = 39.976
------------------------------------------------------------------------------
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.622885 .059068 -27.47 0.000 -1.738693 -1.507078
el | -.1496362 .0648568 -2.31 0.021 -.2767931 -.0224794
avg_ed | 35.48807 2.231773 15.90 0.000 31.1125 39.86363
emer | -1.563359 .139788 -11.18 0.000 -1.837424 -1.289293
_cons | 734.2111 8.672639 84.66 0.000 717.2077 751.2145
-------------+----------------------------------------------------------------
dnum | F(833, 3864) = 3.593 0.000 (834 categories)
In this analysis both the coefficients and the standard errors differ from the original
regression model.It is also possible to run the areg coomand with the robust option.
areg api04 meals el avg_ed emer, absorb(dnum) vce(robust)
Linear regression, absorbing indicators Number of obs = 4702
F( 4, 3864) = 1382.06
Prob > F = 0.0000
R-squared = 0.8447
Adj R-squared = 0.8110
Root MSE = 39.976
------------------------------------------------------------------------------
| Robust
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.622885 .1022788 -15.87 0.000 -1.823411 -1.42236
el | -.1496362 .1168033 -1.28 0.200 -.3786383 .0793658
avg_ed | 35.48807 3.802746 9.33 0.000 28.03249 42.94365
emer | -1.563359 .2104234 -7.43 0.000 -1.97591 -1.150807
_cons | 734.2111 14.99503 48.96 0.000 704.8121 763.61
-------------+----------------------------------------------------------------
dnum | absorbed (834 categories)
As you can see the coefficients are the same as for the first areg model but the standard
errors have been adjusted.We will follow this analysis with fixed-effects (within) cross-sectional time-series model using xtreg.
xtreg api04 meals el avg_ed emer, i(dnum) fe
Fixed-effects (within) regression Number of obs = 4702
Group variable: dnum Number of groups = 834
R-sq: within = 0.6500 Obs per group: min = 1
between = 0.7103 avg = 5.6
overall = 0.7149 max = 373
F(4,3864) = 1793.95
corr(u_i, Xb) = -0.0378 Prob > F = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.622885 .059068 -27.47 0.000 -1.738693 -1.507078
el | -.1496362 .0648568 -2.31 0.021 -.2767931 -.0224794
avg_ed | 35.48807 2.231773 15.90 0.000 31.1125 39.86363
emer | -1.563359 .139788 -11.18 0.000 -1.837424 -1.289293
_cons | 734.2111 8.672639 84.66 0.000 717.2077 751.2145
-------------+----------------------------------------------------------------
sigma_u | 44.098947
sigma_e | 39.975999
rho | .54892132 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(833, 3864) = 3.59 Prob > F = 0.0000
This analysis is exactly the same as the areg without the robust option.We will follow this up with a between-effects xtreg model.
xtreg api04 meals el avg_ed emer, i(dnum) be
Between regression (regression on group means) Number of obs = 4702
Group variable: dnum Number of groups = 834
R-sq: within = 0.6412 Obs per group: min = 1
between = 0.7188 avg = 5.6
overall = 0.7200 max = 373
F(4,829) = 529.73
sd(u_i + avg(e_i.))= 43.54434 Prob > F = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.330267 .1039559 -12.80 0.000 -1.534315 -1.126219
el | .1249702 .1039732 1.20 0.230 -.0791115 .3290518
avg_ed | 50.28665 4.137794 12.15 0.000 42.16487 58.40844
emer | -2.345164 .2617322 -8.96 0.000 -2.8589 -1.831429
_cons | 663.9947 15.9463 41.64 0.000 632.6948 695.2946
------------------------------------------------------------------------------
The regression coefficients, standard errors and the R-squared between can also be obtained by
generating a mean score for each variable for each district and then running an OLS regression
with one observation per district.
sort dnum id
by dnum: generate i = _n
egen a = mean(api04), by(dnum)
egen b = mean(meals), by(dnum)
egen c = mean(el), by(dnum)
egen d = mean(avg_ed), by(dnum)
egen e = mean(emer), by(dnum)
regress a b c d e if i==1
Source | SS df MS Number of obs = 834
-------------+------------------------------ F( 4, 829) = 529.73
Model | 4017708.46 4 1004427.11 Prob > F = 0.0000
Residual | 1571874.5 829 1896.10916 R-squared = 0.7188
-------------+------------------------------ Adj R-squared = 0.7174
Total | 5589582.96 833 6710.18362 Root MSE = 43.544
------------------------------------------------------------------------------
a | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
b | -1.330267 .1039559 -12.80 0.000 -1.534315 -1.126219
c | .1249702 .1039732 1.20 0.230 -.0791115 .3290519
d | 50.28666 4.137794 12.15 0.000 42.16487 58.40844
e | -2.345164 .2617322 -8.96 0.000 -2.8589 -1.831429
_cons | 663.9947 15.9463 41.64 0.000 632.6948 695.2946
------------------------------------------------------------------------------
Since we have the mean values for each district let's subtract the district mean and add back in
the grand mean and then do OLS regression with these adjusted scores.
sum api04, meanonly
generate a2 = api04 - a + r(mean)
sum meals, meanonly
generate b2 = meals - b + r(mean)
sum el, meanonly
generate c2 = el - c + r(mean)
sum avg_ed, meanonly
generate d2 = avg_ed - d + r(mean)
sum emer, meanonly
generate e2 = emer - e + r(mean)
regress a2 b2 c2 d2 e2
Source | SS df MS Number of obs = 4702
-------------+------------------------------ F( 4, 4697) = 2180.69
Model | 11467519.9 4 2866879.98 Prob > F = 0.0000
Residual | 6174982.85 4697 1314.66529 R-squared = 0.6500
-------------+------------------------------ Adj R-squared = 0.6497
Total | 17642502.8 4701 3752.9255 Root MSE = 36.258
------------------------------------------------------------------------------
a2 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
b2 | -1.622885 .0535748 -30.29 0.000 -1.727917 -1.517853
c2 | -.1496362 .0588253 -2.54 0.011 -.2649613 -.0343111
d2 | 35.48807 2.024223 17.53 0.000 31.51964 39.45649
e2 | -1.563359 .126788 -12.33 0.000 -1.811923 -1.314795
_cons | 734.2111 7.866102 93.34 0.000 718.7898 749.6323
------------------------------------------------------------------------------
This analysis gives us the same coefficients as the areg and fixed-effects xtreg but
not the same standard errors. We also get the same R-squared as within for the fixed-effect
model. The reason that we don't get the same standard errors is that we haven't adjusted
the degrees of freedom of the variances and covariances to account for the 834
district means we have estimated.Next, we will run a random-effects xtreg model. The random-effects model provides a gls solution giving a matrix weighted average of the between-effects and within-effects models.
xtreg api04 meals el avg_ed emer, i(dnum) re
Random-effects GLS regression Number of obs = 4702
Group variable: dnum Number of groups = 834
R-sq: within = 0.6495 Obs per group: min = 1
between = 0.7139 avg = 5.6
overall = 0.7181 max = 373
Random effects u_i ~ Gaussian Wald chi2(4) = 9561.58
corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.556664 .0510487 -30.49 0.000 -1.656718 -1.456611
el | -.0764894 .0542108 -1.41 0.158 -.1827405 .0297618
avg_ed | 39.97183 1.956283 20.43 0.000 36.13759 43.80608
emer | -1.715903 .1242845 -13.81 0.000 -1.959496 -1.47231
_cons | 708.6933 7.603373 93.21 0.000 693.791 723.5957
-------------+----------------------------------------------------------------
sigma_u | 31.435229
sigma_e | 39.975999
rho | .38208683 (fraction of variance due to u_i)
------------------------------------------------------------------------------
Continuing with the random-effects model, let's try a maximum likelihood solution.
xtreg api04 meals el avg_ed emer, i(dnum) mle nolog
Random-effects ML regression Number of obs = 4702
Group variable: dnum Number of groups = 834
Random effects u_i ~ Gaussian Obs per group: min = 1
avg = 5.6
max = 373
LR chi2(4) = 5189.39
Log likelihood = -24451.401 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.548644 .0509281 -30.41 0.000 -1.648461 -1.448827
el | -.0686962 .0538775 -1.28 0.202 -.1742942 .0369018
avg_ed | 40.52672 1.966052 20.61 0.000 36.67333 44.38011
emer | -1.729005 .1242775 -13.91 0.000 -1.972584 -1.485425
_cons | 706.9804 7.610685 92.89 0.000 692.0637 721.8971
-------------+----------------------------------------------------------------
/sigma_u | 28.50507 1.300925 26.06602 31.17235
/sigma_e | 40.24634 .4572658 39.36002 41.15261
rho | .3340611 .0219029 .2922954 .3779909
------------------------------------------------------------------------------
Likelihood-ratio test of sigma_u=0: chibar2(01)= 899.23 Prob>=chibar2 = 0.000
Using the xtmixed command wit restricted maximum likelihood estimator.
xtmixed api04 meals el avg_ed emer || dnum:
Performing EM optimization:
Performing gradient-based optimization:
Iteration 0: log restricted-likelihood = -24454.551
Iteration 1: log restricted-likelihood = -24454.551
Computing standard errors:
Mixed-effects REML regression Number of obs = 4702
Group variable: dnum Number of groups = 834
Obs per group: min = 1
avg = 5.6
max = 373
Wald chi2(4) = 9692.94
Log restricted-likelihood = -24454.551 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.548778 .0508131 -30.48 0.000 -1.64837 -1.449186
el | -.0688246 .0537867 -1.28 0.201 -.1742446 .0365954
avg_ed | 40.51744 1.949019 20.79 0.000 36.69743 44.33744
emer | -1.728796 .1241998 -13.92 0.000 -1.972223 -1.485369
_cons | 707.0091 7.570401 93.39 0.000 692.1714 721.8468
------------------------------------------------------------------------------
------------------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
-----------------------------+------------------------------------------------
dnum: Identity |
sd(_cons) | 28.5644 1.30312 26.12118 31.23613
-----------------------------+------------------------------------------------
sd(Residual) | 40.26107 .4575715 39.37416 41.16796
------------------------------------------------------------------------------
LR test vs. linear regression: chibar2(01) = 901.11 Prob >= chibar2 = 0.0000
And next, a population averaged GEE random-effects model.
xtreg api04 meals el avg_ed emer, i(dnum) pa nolog
/* same as: xtgee api04 meals el avg_ed emer, i(dnum) nolog */
GEE population-averaged model Number of obs = 4702
Group variable: dnum Number of groups = 834
Link: identity Obs per group: min = 1
Family: Gaussian avg = 5.6
Correlation: exchangeable max = 373
Wald chi2(4) = 17223.17
Scale parameter: 2506.259 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.596325 .0378911 -42.13 0.000 -1.67059 -1.52206
el | -.1187029 .0409814 -2.90 0.004 -.199025 -.0383808
avg_ed | 37.24855 1.442306 25.83 0.000 34.42169 40.07542
emer | -1.63356 .0905914 -18.03 0.000 -1.811116 -1.456004
_cons | 717.0666 5.704247 125.71 0.000 705.8865 728.2467
------------------------------------------------------------------------------
We can also run the population averaged model with the robust option.
xtreg api04 meals el avg_ed emer, i(dnum) pa robust nolog
/* same as: xtgee api04 meals el avg_ed emer, i(dnum) robust nolog */
GEE population-averaged model Number of obs = 4702
Group variable: dnum Number of groups = 834
Link: identity Obs per group: min = 1
Family: Gaussian avg = 5.6
Correlation: exchangeable max = 373
Wald chi2(4) = 4058.27
Scale parameter: 2506.259 Prob > chi2 = 0.0000
(Std. Err. adjusted for clustering on dnum)
------------------------------------------------------------------------------
| Semirobust
api04 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.596325 .1943932 -8.21 0.000 -1.977329 -1.215322
el | -.1187029 .1347415 -0.88 0.378 -.3827913 .1453856
avg_ed | 37.24855 10.14065 3.67 0.000 17.37326 57.12385
emer | -1.63356 .2055431 -7.95 0.000 -2.036417 -1.230703
_cons | 717.0666 38.79005 18.49 0.000 641.0395 793.0937
------------------------------------------------------------------------------
Finally, just for the fun of it we will use svyreg with clusting even though the data
are not from a complex survey sampling design.
svyset, psu(dnum)
pweight:
VCE: linearized
Single unit: missing
Strata 1:
SU 1: dnum
FPC 1:
svy: regress api04 meals el avg_ed emer
(running regress on estimation sample)
Survey: Linear regression
Number of strata = 1 Number of obs = 4702
Number of PSUs = 834 Population size = 4702
Design df = 833
F( 4, 830) = 452.88
Prob > F = 0.0000
R-squared = 0.7244
------------------------------------------------------------------------------
| Linearized
api04 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.31768 .1287115 -10.24 0.000 -1.570317 -1.065043
el | .1810992 .1076713 1.68 0.093 -.0302397 .3924382
avg_ed | 56.02456 5.356256 10.46 0.000 45.51121 66.5379
emer | -1.403778 .2982654 -4.71 0.000 -1.989218 -.8183379
_cons | 652.1849 21.30293 30.61 0.000 610.3711 693.9986
------------------------------------------------------------------------------
This analysis is the same as the OLS regression with the cluster option.Collectively, these analyses provide a range of options for analyzing clustered data in Stata. There is no need to use a multilevel data analysis program for these data since all of the data are collected at the school level and no cross level hypotheses are being tested. Results identical to xtreg with the mle option were obtained using SAS proc mixed.
Linear Statistical Models Course
Phil Ender, 17sep10, 11nov04