Applied Categorical & Nonnormal Data Analysis

The Process of Analyzing Data

The way we use Stata in class is not the way you should analyze "real" research data.

Consider the following hypothetical scenario. You published a research paper two years ago based on a data analysis you did four years ago. A question comes up about one of the analyses, suggesting a modification that would improve it. You go back into your data archives and find 10 different versions of the data file, all slightly different. You cannot tell which data file you used for which analysis. Further, you recall that the analysis was very tricky, requiring several obscure options used simultaneously, and you cannot recall the exact combination. You cannot answer the question about your prior analysis because you cannot reproduce the published results, let alone run the suggested modification.

The way to avoid this situation is to perform the data analysis using a series of do-files. These do-files preserve all the steps taken to clean and modify the data along with all of the commands used to analyze the data. The goal is to be able to perfectly reproduce your entire data analysis process from the beginning all the way to the final analysis.

Basically, there are four steps in the data analysis process:

Reading the raw data
Cleaning the data
Modifying the data
Analyzing the data
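
One way to tie the four steps together is a single master do-file that runs every step in order, so the whole analysis can be rebuilt from the raw data with one command. The sketch below is a suggestion rather than part of the course files: the name hsbmaster.do is hypothetical, and it assumes the do-files developed in the steps that follow are all in the current directory.

/* begin hsbmaster.do (hypothetical name) */
  do hsbinsheet          /* reading:   read hsberr.raw, save hsberr.dta  */
  do hsbclean            /* cleaning:  fix bad values, save hsbclean.dta */
  do hsbtest hsbclean    /* cleaning:  confirm the cleaned data pass     */
  do hsbmod              /* modifying: create new variables, save hsbmod */
  do analysis1           /* analyzing: multinomial logit models          */
  do analysis2           /* analyzing: ordered logit models              */
/* end hsbmaster.do */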

One important note: never change your original data file. In this case, never change hsberr.raw. If you change your original data file and make a mistake, you may never be able to completely reproduce your data analysis.
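
One possible safeguard, offered as a suggestion and not as part of the course materials, is to record a checksum of the raw file when you first receive it. Stata's checksum command reports the file's length and checksum:

checksum hsberr.raw

If you keep the reported values in your project notes, rerunning the command later shows whether the file has been altered since.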

Step 1: Reading the Raw Data

In this example we are given a comma-separated file called hsberr.raw.

type hsberr.raw

70,1,4,1,1,1,57,52,41,47,57
121,2,4,2,1,3,68,59,53,63,61
86,1,4,3,1,1,44,33,54,58,31
141,1,4,3,1,3,63,44,47,53,56
172,1,4,2,1,2,47,52,57,53,61
113,1,4,2,1,2,44,52,51,63,61
50,1,3,2,1,1,50,59,42,53,61
11,1,1,2,1,2,34,46,45,39,36
84,1,4,2,1,1,63,57,54,.,51
48,1,3,2,1,2,57,55,52,50,51
[output omitted]

/* begin hsbinsheet.do */
  clear
  insheet id gender race ses schtyp prog read write math science socst using hsberr.raw
  
  label def gl 1 "male" 2 "female"
  label def rl 1 "hispanic" 2 "asian" 3 "african-amer" 4 "white"
  label def sl 1 "low" 2 "middle" 3 "high"
  label def scl 1 "public" 2 "private"
  label def sel 1 "general" 2 "academic" 3 "vocation"
  label val gender gl
  label val race rl
  label val ses sl
  label val schtyp scl
  label val prog sel
  label var schtyp "type of school"
  label var prog "type of program"
  label var read "reading score"
  label var write "writing score"
  label var math "math score"
  label var science "science score"
  label var socst "social studies score"
  
  save hsberr, replace
/* end hsbinsheet.do */
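
The insheet command used above still runs, but as of Stata 13 it has been superseded by import delimited. For readers on a newer release, the reading step might look like the sketch below. It assumes Stata 13 or later; the variable names are the ones from hsbinsheet.do, and v1 through v11 are the default names Stata assigns when the file has no header row. The label and save commands from hsbinsheet.do would follow unchanged.

  clear
  import delimited using hsberr.raw, varnames(nonames)
  rename (v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11) ///
         (id gender race ses schtyp prog read write math science socst)
  describe

Use the describe output to confirm that all eleven variables were read as numeric, in particular science, which contains a "." for missing in the raw file. The storage types may differ from the float shown in the listing below.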

Inspecting the Data

use http://www.gseis.ucla.edu/courses/data/hsberr

describe

Contains data from hsberr.dta
  obs:           200                          highschool and beyond (200
                                                cases)
 vars:            11                          25 Mar 2003 13:51
 size:         9,600 (99.7% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g                  
gender          float  %9.0g       gl         
race            float  %12.0g      rl         
ses             float  %9.0g       sl         
schtyp          float  %9.0g       scl        type of school
prog            float  %9.0g       sel        type of program
read            float  %9.0g                  reading score
write           float  %9.0g                  writing score
math            float  %9.0g                  math score
science         float  %9.0g                  science score
socst           float  %9.0g                  social studies score
-------------------------------------------------------------------------------


summarize


    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       200       105.5    96.33093          1       1193
      gender |       198    1.560606    .5554197          1          5
        race |       200        3.44    1.068696          0          5
         ses |       199    2.055276    .7261075          1          3
      schtyp |       199    1.170854    .3904863          1          3
-------------+--------------------------------------------------------
        prog |       200       2.015    .7051676          0          3
        read |       200       52.73    12.24199         28        147
       write |       200      52.775    9.478586         31         67
        math |       200      52.645    9.368448         33         75
     science |       195     51.0359    12.72483        -61         74
-------------+--------------------------------------------------------
       socst |       200      52.405    10.73579         26         71

codebook

---------------------------------------------------------------------------------------------------------------
id                                                                                                  (unlabeled)
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [1,1193]                     units:  1
         unique values:  200                      missing .:  0/200

                  mean:     105.5
              std. dev:   96.3309

           percentiles:        10%       25%       50%       75%       90%
                              20.5      50.5     100.5     150.5     180.5

---------------------------------------------------------------------------------------------------------------
gender                                                                                              (unlabeled)
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  gl, but 1 nonmissing value is not labeled

                 range:  [1,5]                        units:  1
         unique values:  3                        missing .:  2/200

            tabulation:  Freq.   Numeric  Label
                            90         1  male
                           107         2  female
                             1         5  
                             2         .  

---------------------------------------------------------------------------------------------------------------
race                                                                                                (unlabeled)
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  rl, but 2 nonmissing values are not labeled

                 range:  [0,5]                        units:  1
         unique values:  6                        missing .:  0/200

            tabulation:  Freq.   Numeric  Label
                             1         0  
                            23         1  hispanic
                            11         2  asian
                            20         3  african-amer
                           142         4  white
                             3         5  

---------------------------------------------------------------------------------------------------------------
ses                                                                                                 (unlabeled)
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  sl

                 range:  [1,3]                        units:  1
         unique values:  3                        missing .:  1/200

            tabulation:  Freq.   Numeric  Label
                            47         1  low
                            94         2  middle
                            58         3  high
                             1         .  

---------------------------------------------------------------------------------------------------------------
schtyp                                                                                           type of school
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  scl, but 1 nonmissing value is not labeled

                 range:  [1,3]                        units:  1
         unique values:  3                        missing .:  1/200

            tabulation:  Freq.   Numeric  Label
                           166         1  public
                            32         2  private
                             1         3  
                             1         .  

---------------------------------------------------------------------------------------------------------------
prog                                                                                            type of program
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  sel, but 1 nonmissing value is not labeled

                 range:  [0,3]                        units:  1
         unique values:  4                        missing .:  0/200

            tabulation:  Freq.   Numeric  Label
                             1         0  
                            45         1  general
                           104         2  academic
                            50         3  vocation

---------------------------------------------------------------------------------------------------------------
read                                                                                              reading score
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [28,147]                     units:  1
         unique values:  31                       missing .:  0/200

                  mean:     52.73
              std. dev:    12.242

           percentiles:        10%       25%       50%       75%       90%
                                39        44        51        60        68

---------------------------------------------------------------------------------------------------------------
write                                                                                             writing score
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [31,67]                      units:  1
         unique values:  29                       missing .:  0/200

                  mean:    52.775
              std. dev:   9.47859

           percentiles:        10%       25%       50%       75%       90%
                                39      45.5        54        60        65

---------------------------------------------------------------------------------------------------------------
math                                                                                                 math score
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [33,75]                      units:  1
         unique values:  40                       missing .:  0/200

                  mean:    52.645
              std. dev:   9.36845

           percentiles:        10%       25%       50%       75%       90%
                                40        45        52        59      65.5

---------------------------------------------------------------------------------------------------------------
science                                                                                           science score
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [-61,74]                     units:  1
         unique values:  34                       missing .:  5/200

                  mean:   51.0359
              std. dev:   12.7248

           percentiles:        10%       25%       50%       75%       90%
                                39        44        53        58        63

---------------------------------------------------------------------------------------------------------------
socst                                                                                      social studies score
---------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [26,71]                      units:  1
         unique values:  22                       missing .:  0/200

                  mean:    52.405
              std. dev:   10.7358

           percentiles:        10%       25%       50%       75%       90%
                                36        46        52        61        66
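
The summarize and codebook output shows that out-of-range values exist, but not which observations carry them. A few list commands, sketched here using the valid ranges implied by the value labels and test scores, will locate them. Because Stata treats a missing value as larger than any number, the upper-bound checks exclude missing values explicitly.

list id if id>200
list id gender if gender==5
list id race if race<1 | (race>4 & !missing(race))
list id read if read>100 & !missing(read)
list id science if science<1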

Step 2: Data Cleaning

In this step we will use two do-files, one to check the data (hsbtest.do) and one to do the actual data cleaning (hsbclean.do).

/* begin hsbtest.do */
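  use `1',clear     /* the first argument to the do-file names the dataset */
  nmissing          /* user-written command; install it before use */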
  summarize
  assert id>=1 & id<=200
  assert (gender>=1 & gender<=2)     | gender==.
  assert (race>=1 & race<=4)         | race==.
  assert (ses>=1 & ses<=3)           | ses==.
  assert (schtyp>=1 & schtyp<=2)     | schtyp==.
  assert (prog>=1 & prog<=3)         | prog==.
  assert (read>=1   & read<=100)     | read==.
  assert (write>=1   & write<=100)   | write==.
  assert (math>=1    & math<=100)    | math==.
  assert (science>=1 & science<=100) | science==.
  assert (socst>=1   & socst<=100)   | socst==. 
/* end hsbtest.do */
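
Range checks do not catch duplicated identifiers or lost observations. Two further checks that could be appended to hsbtest.do are sketched here; they are a suggestion, not part of the course file, and they assume the data should contain exactly 200 cases, each with its own id.

  isid id            /* error if id does not uniquely identify the cases */
  assert _N==200     /* error if the number of observations has changed  */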

/* begin hsbclean.do */
  use hsberr, clear

  replace id=193     if id==1193
  replace read=47    if read==147
  replace science=61 if science==-61
  replace gender=.   if gender==5
  replace race=.     if race<1 | race>4
  replace ses=.      if ses<1 | ses>3
  replace schtyp=.   if schtyp<1 | schtyp>2
  replace prog=.     if prog<1 | prog>3
  label data "hsb clean data using hsbclean.do"

  save hsbclean, replace 
/* end hsbclean.do */

do hsbtest hsberr

do hsbclean

do hsbtest hsbclean
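
The first of these commands, do hsbtest hsberr, is expected to stop with an error: id has a maximum of 1193, so the very first assert fails and Stata halts the do-file there. To see every failed check in one pass, the nostop option of do can be added, as in this sketch:

do hsbtest hsberr, nostop

After hsbclean.do has been run, do hsbtest hsbclean should run to the end without an error if the cleaning was complete.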

Step 3: Modifying the Data

In this step we will modify the clean data, creating new variables or modifying existing ones.

/* begin hsbmod.do */
  use hsbclean, clear

  generate female = gender
  recode female 1=0 2=1
  generate public=schtyp
  recode public 2=0
  label define fem 0 "male" 1 "female"
  label value female fem
  generate read10=read/10
  generate soc10=socst/10
  label var read10 "(read score)/10"
  label var soc10 "(social studies score)/10"
  note: data modified using hsbmod.do
  note: read and soc rescaled -- divided by 10

  save hsbmod, replace
/* end hsbmod.do */

do hsbmod
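
It is worth confirming that the new variables came out as intended before analyzing them. The commands below are a sketch of such a check and not part of the course files. The tabulations compare each recoded variable with its source, and float() is used in the assert because generate stores read10 and soc10 as floats by default.

use hsbmod, clear
tabulate gender female, missing
tabulate schtyp public, missing
assert read10==float(read/10)
assert soc10==float(socst/10)
notes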

Step 4: Analyzing the Data

/* begin analysis1.do */
  log using analysis1.log, text replace
  use hsbmod, clear 
  set more off

  display "hsb analysis #1 using hsbmod.dta"
  summarize read write math science 
  univar read write math science 
  tabstat read10, stat(n mean sd p25 p50 p75) by(female)
  tabstat read10, stat(n mean sd p25 p50 p75) by(prog)
  kdensity read10, normal
  more
  kdensity soc10, normal
  more
  mlogit prog read10 soc10 female
  estimates store M1
  xi: mlogit prog i.female*read10 i.female*soc10
  lrtest M1
  mlogit prog read10 soc10 female
  /* mlogtest, listcoef, fitstat, and prchange are user-written SPost commands */
  mlogtest, lr wald lrcomb combine
  listcoef
  fitstat
  prchange, x(female=0)
  prchange, x(female=1)

  set more on
  log close
/* end analysis1.do */ 
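
The xi: prefix in analysis1.do creates the interaction terms. In Stata 11 and later the same comparison can be written with factor-variable notation and no prefix. The lines below are a sketch under that assumption, not a replacement for the course file; ## requests main effects plus the interaction, and c. marks read10 and soc10 as continuous.

  mlogit prog read10 soc10 female
  estimates store M1
  mlogit prog i.female##c.read10 i.female##c.soc10
  lrtest M1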


/* begin analysis2.do */
  log using analysis2.log, text replace
  use hsbmod, clear 
  set more off

  display "hsb analysis #2 using hsbmod.dta"
  tabstat read10, stat(n mean sd p25 p50 p75) by(ses)
  ologit ses read10 soc10 female
  estimates store M1
  xi: ologit ses i.female*read10 i.female*soc10
  lrtest M1
  ologit ses read10 soc10 female
  listcoef
  fitstat
  linktest
  prchange, x(female=0)
  prchange, x(female=1)

  set more on
  log close
/* end analysis2.do */ 
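
Two additions to the opening lines of each analysis do-file can make it easier to rerun years later. They are shown here for analysis1.do as a sketch, not as part of the course files. The version command asks Stata to interpret the do-file as the stated release did, and capture log close closes any log left open by an earlier run that stopped partway. The version number below is an assumption; use the release you are actually running.

/* begin analysis1.do (opening lines only) */
  version 8            /* assumption: replace 8 with your Stata release */
  capture log close    /* no error if no log is open */
  log using analysis1.log, text replace
  use hsbmod, clear
  set more off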

do analysis1

type analysis1.log

do analysis2

type analysis2.log

Categorical Data Analysis Course

Phil Ender