The way we use Stata in class is not the way one should analyze "real" research data.
Consider the following hypothetical scenario. You published a research paper two years ago based on a data analysis you did four years ago. A question comes up about one of the analyses, suggesting a modification to improve it. You go back into your data archives and find 10 different versions of the data file, all slightly different. You cannot tell which data file you used for which analysis. Further, you recall that the analysis was very tricky, requiring several obscure options used simultaneously, and you cannot recall the exact combination. You cannot answer the question about your prior analysis because you cannot reproduce the published results, let alone run the suggested modification.
The way to avoid this situation is to perform the data analysis using a series of do-files. These do-files preserve all the steps taken to clean and modify the data along with all of the commands used to analyze the data. The goal is to be able to perfectly reproduce your entire data analysis process from the beginning all the way to the final analysis.
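One way to tie the series together is a single master do-file that runs every step in order, so the whole analysis can be rebuilt from the raw data with one command. The sketch below is only an illustration: the file name master.do is hypothetical and the version number is an assumption. The do-files it calls are the ones listed in the steps that follow.

```stata
/* begin master.do (hypothetical file name) */

/* assumed release; replace with the version the analysis was written under */
version 8
clear
set more off

/* read hsberr.raw and save hsberr.dta */
do hsbinsheet

/* fix the known errors and save hsbclean.dta */
do hsbclean

/* confirm that the cleaned data pass every check; */
/* a false assertion stops the run here            */
do hsbtest hsbclean

/* create the new variables and save hsbmod.dta */
do hsbmod

/* run the analyses; each one writes its own log file */
do analysis1
do analysis2

/* end master.do */
```

Typing `do master` then reruns everything from the raw file to the final logs. If any step fails, the run stops at that step, which shows where the chain is broken.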
The data analysis process has four basic steps: reading the raw data, cleaning the data, modifying the data, and analyzing the data.
Step 1: Reading the Raw Data
In this example we are given a comma separated file called hsberr.raw.
type hsberr.raw
70,1,4,1,1,1,57,52,41,47,57
121,2,4,2,1,3,68,59,53,63,61
86,1,4,3,1,1,44,33,54,58,31
141,1,4,3,1,3,63,44,47,53,56
172,1,4,2,1,2,47,52,57,53,61
113,1,4,2,1,2,44,52,51,63,61
50,1,3,2,1,1,50,59,42,53,61
11,1,1,2,1,2,34,46,45,39,36
84,1,4,2,1,1,63,57,54,.,51
48,1,3,2,1,2,57,55,52,50,51
[ output omitted]

/* begin hsbinsheet.do */
clear
insheet id gender race ses schtyp prog read write math science socst using hsberr.raw
label def gl 1 "male" 2 "female"
label def rl 1 "hispanic" 2 "asian" 3 "african-amer" 4 "white"
label def sl 1 "low" 2 "middle" 3 "high"
label def scl 1 "public" 2 "private"
label def sel 1 "general" 2 "academic" 3 "vocation"
label val gender gl
label val race rl
label val ses sl
label val schtyp scl
label val prog sel
label var schtyp "type of school"
label var prog "type of program"
label var read "reading score"
label var write "writing score"
label var math "math score"
label var science "science score"
label var socst "social studies score"
save hsberr, replace
/* end hsbinsheet.do */

Inspecting the Data
use http://www.gseis.ucla.edu/courses/data/hsberr
describe
Contains data from hsberr.dta
  obs:           200                          highschool and beyond (200 cases)
 vars:            11                          25 Mar 2003 13:51
 size:         9,600 (99.7% of memory free)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
id float %9.0g
gender float %9.0g gl
race float %12.0g rl
ses float %9.0g sl
schtyp float %9.0g scl type of school
prog float %9.0g sel type of program
read float %9.0g reading score
write float %9.0g writing score
math float %9.0g math score
science float %9.0g science score
socst float %9.0g social studies score
-------------------------------------------------------------------------------
summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
id | 200 105.5 96.33093 1 1193
gender | 198 1.560606 .5554197 1 5
race | 200 3.44 1.068696 0 5
ses | 199 2.055276 .7261075 1 3
schtyp | 199 1.170854 .3904863 1 3
-------------+--------------------------------------------------------
prog | 200 2.015 .7051676 0 3
read | 200 52.73 12.24199 28 147
write | 200 52.775 9.478586 31 67
math | 200 52.645 9.368448 33 75
science | 195 51.0359 12.72483 -61 74
-------------+--------------------------------------------------------
socst | 200 52.405 10.73579 26 71
codebook
---------------------------------------------------------------------------------------------------------------
id (unlabeled)
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [1,1193] units: 1
unique values: 200 missing .: 0/200
mean: 105.5
std. dev: 96.3309
percentiles: 10% 25% 50% 75% 90%
20.5 50.5 100.5 150.5 180.5
---------------------------------------------------------------------------------------------------------------
gender (unlabeled)
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
label: gl, but 1 nonmissing value is not labeled
range: [1,5] units: 1
unique values: 3 missing .: 2/200
tabulation: Freq. Numeric Label
90 1 male
107 2 female
1 5
2 .
---------------------------------------------------------------------------------------------------------------
race (unlabeled)
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
label: rl, but 2 nonmissing values are not labeled
range: [0,5] units: 1
unique values: 6 missing .: 0/200
tabulation: Freq. Numeric Label
1 0
23 1 hispanic
11 2 asian
20 3 african-amer
142 4 white
3 5
---------------------------------------------------------------------------------------------------------------
ses (unlabeled)
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
label: sl
range: [1,3] units: 1
unique values: 3 missing .: 1/200
tabulation: Freq. Numeric Label
47 1 low
94 2 middle
58 3 high
1 .
---------------------------------------------------------------------------------------------------------------
schtyp type of school
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
label: scl, but 1 nonmissing value is not labeled
range: [1,3] units: 1
unique values: 3 missing .: 1/200
tabulation: Freq. Numeric Label
166 1 public
32 2 private
1 3
1 .
---------------------------------------------------------------------------------------------------------------
prog type of program
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
label: sel, but 1 nonmissing value is not labeled
range: [0,3] units: 1
unique values: 4 missing .: 0/200
tabulation: Freq. Numeric Label
1 0
45 1 general
104 2 academic
50 3 vocation
---------------------------------------------------------------------------------------------------------------
read reading score
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [28,147] units: 1
unique values: 31 missing .: 0/200
mean: 52.73
std. dev: 12.242
percentiles: 10% 25% 50% 75% 90%
39 44 51 60 68
---------------------------------------------------------------------------------------------------------------
write writing score
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [31,67] units: 1
unique values: 29 missing .: 0/200
mean: 52.775
std. dev: 9.47859
percentiles: 10% 25% 50% 75% 90%
39 45.5 54 60 65
---------------------------------------------------------------------------------------------------------------
math math score
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [33,75] units: 1
unique values: 40 missing .: 0/200
mean: 52.645
std. dev: 9.36845
percentiles: 10% 25% 50% 75% 90%
40 45 52 59 65.5
---------------------------------------------------------------------------------------------------------------
science science score
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [-61,74] units: 1
unique values: 34 missing .: 5/200
mean: 51.0359
std. dev: 12.7248
percentiles: 10% 25% 50% 75% 90%
39 44 53 58 63
---------------------------------------------------------------------------------------------------------------
socst social studies score
---------------------------------------------------------------------------------------------------------------
type: numeric (float)
range: [26,71] units: 1
unique values: 22 missing .: 0/200
mean: 52.405
std. dev: 10.7358
percentiles: 10% 25% 50% 75% 90%
36 46 52 61 66
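The summaries above flag impossible values (an id of 1193, a reading score of 147, a science score of -61, and category codes with no label), but they do not show which cases carry them. A minimal sketch for locating those cases follows. It uses only built-in commands and assumes the ranges implied by the value labels, ids from 1 to 200, and a 1 to 100 scale for the test scores.

```stata
use hsberr, clear

/* Stata treats missing (.) as larger than any number, so     */
/* !missing() keeps missing values out of the "too large"     */
/* listings; only real out-of-range codes are shown           */
list id if id < 1 | id > 200
list id gender if gender > 2 & !missing(gender)
list id race if race < 1 | (race > 4 & !missing(race))
list id schtyp if schtyp > 2 & !missing(schtyp)
list id prog if prog < 1 | (prog > 3 & !missing(prog))
list id read if read > 100 & !missing(read)
list id science if science < 1 | (science > 100 & !missing(science))

/* show every category code, including missing, with its label */
tabulate race, missing
```

Knowing which cases are affected makes it possible to go back to the original source and decide whether a value should be corrected or set to missing.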
Step 2: Data Cleaning
In this step we will use two do-files, one to check the data (hsbtest.do) and one to do the actual data cleaning (hsbclean.do).
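Two mechanics are worth knowing before reading the do-files. First, assert stops a do-file at the first false assertion, so a single run of a checking do-file on dirty data reports only the first problem. Second, a do-file can take arguments: the first word typed after the do-file name is available inside it as a numbered macro. The sketch below is an illustration and not part of the course files. The name checkall.do is hypothetical, and it uses the rc0 option of assert so that every failing check is reported in one pass.

```stata
/* begin checkall.do (hypothetical file name) */
/* run as:  do checkall hsberr                */

/* open the dataset named as the first argument */
use `1', clear

/* rc0 still reports a false assertion but returns code 0, */
/* so the do-file continues on to the next check           */
assert id>=1 & id<=200, rc0
assert (gender>=1 & gender<=2) | gender==., rc0
assert (read>=1 & read<=100) | read==., rc0
assert (science>=1 & science<=100) | science==., rc0

/* end checkall.do */
```

For the final confirmation run on the cleaned data, the plain form without rc0 is the safer choice, because a failure then halts everything that follows.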
/* begin hsbtest.do */
use `1',clear
nmissing
summarize
assert id>=1 & id<=200
assert (gender>=1 & gender<=2) | gender==.
assert (race>=1 & race<=4) | race==.
assert (ses>=1 & ses<=3) | ses==.
assert (schtyp>=1 & schtyp<=2) | schtyp==.
assert (prog>=1 & prog<=3) | prog==.
assert (read>=1 & read<=100) | read==.
assert (write>=1 & write<=100) | write==.
assert (math>=1 & math<=100) | math==.
assert (science>=1 & science<=100) | science==.
assert (socst>=1 & socst<=100) | socst==.
/* end hsbtest.do */

/* begin hsbclean */
use hsberr, clear
replace id=193 if id==1193
replace read=47 if read==147
replace science=61 if science==-61
replace gender=. if gender==5
replace race=. if race<1 | race>4
replace ses=. if ses<1 | ses>3
replace schtyp=. if schtyp<1 | schtyp>2
replace prog=. if prog<1 | prog>3
label data "hsb clean data using hsberr.do"
save hsbclean, replace
/* end hsbclean */

do hsbtest hsberr
do hsbclean
do hsbtest hsbclean

Step 3: Modifying the Data
In this step we will modify the clean data, creating new variables or modifying existing ones.
/* begin hsbmod.do */
use hsbclean, clear
generate female = gender
recode female 1=0 2=1
generate public=schtyp
recode public 2=0
label define fem 0 "male" 1 "female"
label value female fem
generate read10=read/10
generate soc10=socst/10
label var read10 "(read score)/10"
label var soc10 "(social studies score)/10"
note: data modified using hsbmod.do
note: read and soc rescaled -- divided by 10
save hsbmod, replace
/* end hsbmod.do */

do hsbmod

Step 4: Analyzing the Data
/* begin analysis1.do */
log using analysis1.log, text replace
use hsbmod, clear
set more off
display "hsb analysis #1 using hsbmod.dta"
summarize read write math science
univar read write math science
tabstat read10, stat(n mean sd p25 p50 p75) by(female)
tabstat read10, stat(n mean sd p25 p50 p75) by(prog)
kdensity read10, normal
more
kdensity soc10, normal
more
mlogit prog read10 soc10 female
estimates store M1
xi: mlogit prog i.female*read10 i.female*soc10
lrtest M1
mlogit prog read10 soc10 female
mlogtest, lr wald lrcomb combine
listcoef
fitstat
prchange, x(female=0)
prchange, x(female=1)
set more on
log close
/* end analysis1.do */

/* begin analysis2.do */
log using analysis2.log, text replace
use hsbmod, clear
set more off
display "hsb analysis #2 using hsbmod.dta"
tabstat read10, stat(n mean sd p25 p50 p75) by(ses)
ologit ses read10 soc10 female
estimates store M1
xi: ologit ses i.female*read10 i.female*soc10
lrtest M1
ologit ses read10 soc10 female
listcoef
fitstat
linktest
prchange, x(female=0)
prchange, x(female=1)
set more on
log close
/* end analysis2.do */

do analysis1
type analysis1.log
do analysis2
type analysis2.log
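Several commands used in these do-files (nmissing, univar, mlogtest, listcoef, fitstat, and prchange) appear to be user-written additions rather than part of official Stata, so a do-file that runs on one machine may fail on another. A minimal sketch of a guard that could be run before the analyses is given below. It relies only on the built-in which command, and the wording of the message is an assumption.

```stata
/* check that each user-written command can be found on this machine */
foreach cmd in nmissing univar mlogtest listcoef fitstat prchange {
    /* which returns a nonzero code when the command is not found */
    capture which `cmd'
    if _rc != 0 {
        display as error "`cmd' is not installed; try: findit `cmd'"
    }
}
```

Recording which added commands an analysis depends on is part of being able to reproduce it years later.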
Categorical Data Analysis Course
Phil Ender