When analyzing data from a survey, you must take into account features of the sampling design such as stratification and cluster sampling. If you ignore these features, you may end up with biased coefficients and will almost certainly get incorrect standard errors. The next example demonstrates a logistic regression analysis using a stratified random sampling design.
Survey Logit with Stratified Random Sampling
Using API (Academic Performance Index) data provided by the California State Department of Education, we will take a stratified random sample of 100 elementary schools, 50 middle schools and 50 high schools. These are drawn from a total of 4,421 elementary schools, 1,018 middle schools and 755 high schools.
The file apistrat.dta contains the data for the stratified random sample.
use http://www.ats.ucla.edu/stat/stata/stat130/apistrat, clear
svyset [pw=pw], strata(stype) fpc(fpc)
pweight is pw
strata is stype
fpc is fpc
tabulate pw
         pw |      Freq.     Percent        Cum.
------------+-----------------------------------
       15.1 |         50       25.00       25.00
      20.36 |         50       25.00       50.00
      44.21 |        100       50.00      100.00
------------+-----------------------------------
      Total |        200      100.00
tabulate stype
      stype |      Freq.     Percent        Cum.
------------+-----------------------------------
          E |        100       50.00       50.00
          H |         50       25.00       75.00
          M |         50       25.00      100.00
------------+-----------------------------------
      Total |        200      100.00
tabulate fpc
        fpc |      Freq.     Percent        Cum.
------------+-----------------------------------
        755 |         50       25.00       25.00
       1018 |         50       25.00       50.00
       4421 |        100       50.00      100.00
------------+-----------------------------------
      Total |        200      100.00
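The three tabulations above fit together: each sampling weight is the number of schools in the stratum (fpc) divided by the number sampled from it, for example 4421/100 = 44.21 for elementary schools. A minimal sketch of that check, assuming apistrat.dta is in memory; pw_check and pw_diff are hypothetical variable names introduced only for this illustration.

```stata
* Within each stratum, _N is the number of sampled schools
bysort stype: generate double pw_check = fpc/_N
* Differences should be zero up to rounding in the stored weights
generate double pw_diff = pw - pw_check
summarize pw_diff
```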
codebook awards
-------------------------------------------------------------------------------
awards                                                      eligible for awards
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  awards
                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/200
            tabulation:  Freq.   Numeric  Label
                            87         1  No
                           113         2  Yes
generate award=awards==2
tabulate award
      award |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         87       43.50       43.50
          1 |        113       56.50      100.00
------------+-----------------------------------
      Total |        200      100.00
logit award meals ell yr_rnd avg_ed full enroll, nolog
Logit estimates                                   Number of obs   =        200
                                                  LR chi2(6)      =      25.56
                                                  Prob > chi2     =     0.0003
Log likelihood = -124.15328                       Pseudo R2       =     0.0933
------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .0144493   .0111761     1.29   0.196    -.0074555    .0363541
         ell |  -.0076087   .0120865    -0.63   0.529    -.0312978    .0160803
      yr_rnd |   1.396157   .6172752     2.26   0.024     .1863201    2.605995
      avg_ed |   .4774699   .4116758     1.16   0.246    -.3293999     1.28434
        full |   .0233389   .0131167     1.78   0.075    -.0023694    .0490471
      enroll |  -.0010137   .0003046    -3.33   0.001    -.0016107   -.0004167
       _cons |  -4.358013   2.137156    -2.04   0.041    -8.546761    -.169265
------------------------------------------------------------------------------
svylogit award meals ell yr_rnd avg_ed full enroll, nolog
Survey logistic regression
pweight:  pw                                      Number of obs    =       200
Strata:   stype                                   Number of strata =         3
PSU:      <observations>                          Number of PSUs   =       200
FPC:      fpc                                     Population size  =      6194
                                                  F(   6,    192)  =      2.97
                                                  Prob > F         =    0.0086
------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .0028622   .0116124     0.25   0.806    -.0200384    .0257628
         ell |  -.0043035   .0117412    -0.37   0.714    -.0274581    .0188512
      yr_rnd |   1.333261   .6838513     1.95   0.053    -.0153479     2.68187
      avg_ed |   .0238367   .4246062     0.06   0.955    -.8135203    .8611936
        full |   .0206382   .0137685     1.50   0.135    -.0065144    .0477907
      enroll |  -.0011205   .0003004    -3.73   0.000    -.0017129   -.0005281
       _cons |  -2.133523   2.386846    -0.89   0.372    -6.840573    2.573526
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
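The degrees of freedom in this output come from the design rather than from the usual residual count. The design degrees of freedom are the number of PSUs minus the number of strata, and the adjusted Wald F test on k coefficients has k and (design df - k + 1) degrees of freedom. A quick arithmetic check in Stata, using the counts reported in the header above:

```stata
* Design df = number of PSUs - number of strata
display 200 - 3
* Denominator df for the adjusted Wald test on 6 coefficients
display (200 - 3) - 6 + 1
```

The two commands display 197 and 192, which matches the F(6, 192) reported by svylogit.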
svylogit, or
Survey logistic regression
pweight:  pw                                      Number of obs    =       200
Strata:   stype                                   Number of strata =         3
PSU:      <observations>                          Number of PSUs   =       200
FPC:      fpc                                     Population size  =      6194
                                                  F(   6,    192)  =      2.97
                                                  Prob > F         =    0.0086
------------------------------------------------------------------------------
       award | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   1.002866   .0116457     0.25   0.806      .980161    1.026098
         ell |   .9957058   .0116908    -0.37   0.714     .9729154     1.01903
      yr_rnd |   3.793393   2.594117     1.95   0.053     .9847692    14.61239
      avg_ed |   1.024123    .434849     0.06   0.955     .4432948    2.365983
        full |   1.020853   .0140556     1.50   0.135     .9935068    1.048951
      enroll |   .9988801   .0003001    -3.73   0.000     .9982885     .999472
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
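The odds ratios reported by the or option are the exponentiated coefficients from the previous table. The confidence limits are the exponentiated limits of the coefficient, and the standard error is the odds ratio times the coefficient's standard error. A small sketch using the yr_rnd row, with the numbers copied from the output above:

```stata
* Odds ratio for yr_rnd: exp of the coefficient (about 3.79)
display exp(1.333261)
* Confidence limits: exp of the coefficient's limits
display exp(-.0153479)
display exp(2.68187)
* Standard error of the odds ratio (about 2.59)
display exp(1.333261)*.6838513
* With the svylogit results still in memory, the stored
* coefficient can be used instead of a typed value
display exp(_b[yr_rnd])
```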
Survey Logit with One-Stage Cluster Sampling
Another type of sampling design is cluster sampling. In this example we will use school districts as the clusters, or primary sampling units (PSUs). We will take a random sample of 15 of the 757 school districts in the state and look at all of the schools in each sampled district.
The file apiclus1.dta contains the data for the one-stage cluster sampling design.
use http://www.ats.ucla.edu/stat/stata/stat130/apiclus1, clear
svyset [pw=pw], psu(dnum) fpc(fpc)
pweight is pw
psu is dnum
fpc is fpc
tabulate pw
       pw = |
   6194/183 |      Freq.     Percent        Cum.
------------+-----------------------------------
     33.847 |        183      100.00      100.00
------------+-----------------------------------
      Total |        183      100.00
tabulate dnum
   district |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
         61 |         13        7.10        7.10
        135 |         34       18.58       25.68
        178 |          4        2.19       27.87
        197 |         13        7.10       34.97
        255 |         16        8.74       43.72
        406 |          2        1.09       44.81
        413 |          1        0.55       45.36
        437 |          4        2.19       47.54
        448 |         12        6.56       54.10
        510 |         21       11.48       65.57
        568 |          9        4.92       70.49
        637 |         11        6.01       76.50
        716 |         37       20.22       96.72
        778 |          2        1.09       97.81
        815 |          4        2.19      100.00
------------+-----------------------------------
      Total |        183      100.00
tabulate fpc
        fpc |      Freq.     Percent        Cum.
------------+-----------------------------------
        757 |        183      100.00      100.00
------------+-----------------------------------
      Total |        183      100.00
codebook awards
-------------------------------------------------------------------------------
awards                                                      eligible for awards
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  awards
                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/183
            tabulation:  Freq.   Numeric  Label
                            53         1  No
                           130         2  Yes
generate award=awards==2
tabulate award
      award |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         53       28.96       28.96
          1 |        130       71.04      100.00
------------+-----------------------------------
      Total |        183      100.00
logit award meals ell yr_rnd avg_ed full enroll, nolog
Logit estimates                                   Number of obs   =        157
                                                  LR chi2(6)      =      13.44
                                                  Prob > chi2     =     0.0366
Log likelihood = -88.235274                       Pseudo R2       =     0.0708
------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -.0202277   .0111922    -1.81   0.071     -.042164    .0017087
         ell |   .0424034   .0162426     2.61   0.009     .0105685    .0742383
      yr_rnd |   1.522462   1.123789     1.35   0.175    -.6801239    3.725049
      avg_ed |   .0805997   .4109408     0.20   0.845    -.7248294    .8860289
        full |  -.0041249   .0183279    -0.23   0.822    -.0400468    .0317971
      enroll |  -.0007729   .0004401    -1.76   0.079    -.0016356    .0000898
       _cons |  -.1568744   2.670404    -0.06   0.953    -5.390769     5.07702
------------------------------------------------------------------------------
svylogit award meals ell yr_rnd avg_ed full enroll, nolog
Survey logistic regression
pweight:  pw                                      Number of obs    =       157
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =        15
FPC:      fpc                                     Population size  = 5313.9784
                                                  F(   6,      9)  =     11.60
                                                  Prob > F         =    0.0009
------------------------------------------------------------------------------
       award |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -.0202277   .0076712    -2.64   0.020    -.0366808   -.0037745
         ell |   .0424034   .0164367     2.58   0.022     .0071501    .0776567
      yr_rnd |   1.522462   .1944188     7.83   0.000     1.105475    1.939449
      avg_ed |   .0805997   .3235535     0.25   0.807    -.6133535    .7745529
        full |  -.0041249   .0130164    -0.32   0.756    -.0320422    .0237924
      enroll |  -.0007729   .0004856    -1.59   0.134    -.0018144    .0002687
       _cons |  -.1568744   1.696472    -0.09   0.928    -3.795446    3.481697
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
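Two details of this output are worth checking. Only 157 of the 183 schools are used, presumably because the other 26 have a missing value on at least one model variable, and the reported population size is the sum of the weights over the schools actually used. Note also that the coefficients match the unweighted logit exactly: every school carries the same weight, so only the standard errors change. A sketch of the checks, assuming apiclus1.dta is in memory and a Stata release that provides the egen function rowmiss() (older releases called it rmiss()); nmiss is a hypothetical variable name.

```stata
* Number of model variables missing for each school
egen nmiss = rowmiss(award meals ell yr_rnd avg_ed full enroll)
* Schools with at least one missing model variable
count if nmiss > 0
* Schools used times the common weight, about 5313.98
display 157*6194/183
```

If missing values are the reason for the smaller sample, the count should be 26.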
svylogit, or
Survey logistic regression
pweight:  pw                                      Number of obs    =       157
Strata:   <one>                                   Number of strata =         1
PSU:      dnum                                    Number of PSUs   =        15
FPC:      fpc                                     Population size  = 5313.9784
                                                  F(   6,      9)  =     11.60
                                                  Prob > F         =    0.0009
------------------------------------------------------------------------------
       award | Odds Ratio   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |   .9799755   .0075176    -2.64   0.020     .9639838    .9962326
         ell |   1.043315   .0171487     2.58   0.022     1.007176    1.080752
      yr_rnd |   4.583498   .8911183     7.83   0.000      3.02066     6.95492
      avg_ed |   1.083937   .3507116     0.25   0.807     .5415318    2.169622
        full |   .9958836   .0129628    -0.32   0.756     .9684657    1.024078
      enroll |   .9992274   .0004852    -1.59   0.134     .9981872    1.000269
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
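The svylogit command and the svyset syntax used in these examples come from older Stata releases. From Stata 9 on, survey estimation is done with the svy: prefix, and svyset takes the PSU variable as its first argument. A sketch of the equivalent commands for both designs, assuming local copies of the two data files in the working directory; it has not been run against these files.

```stata
* Stratified random sample: each observation is its own PSU (_n)
use apistrat, clear
generate award = awards==2
svyset _n [pweight=pw], strata(stype) fpc(fpc)
svy: logit award meals ell yr_rnd avg_ed full enroll, or

* One-stage cluster sample: districts (dnum) are the PSUs
use apiclus1, clear
generate award = awards==2
svyset dnum [pweight=pw], fpc(fpc)
svy: logit award meals ell yr_rnd avg_ed full enroll, or
```

Dropping the or option displays coefficients instead of odds ratios, as in the first svylogit call of each example.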
Categorical Data Analysis Course
Phil Ender