* Introduction to Stata
* prepared for Development Economics 2017 at LMU
* by Vojtech Bartos

* Examine the Stata interface: review, results, command, variables, do-file, menu

* Use .do files

* Store log files; this creates womenswage.smcl in the current working directory
* (the file is not of much help on its own; we'll get to reading the log file into Stata at the end)
* use the "replace" option if you need to run the do-file multiple times
* you can also use the "text" option to save the log as plain text readable in Notepad/TextEdit
log using womenswage, replace

* so that running long code does not get interrupted by "more" prompts
set more off

* Typing commands
display "Hello world!"

* Use Stata as a calculator
display 2+2

* Stata is case sensitive! "Display" is not a valid command;
* "capture noisily" shows the error message but lets the do-file keep running
capture noisily Display 2+2

* Now show the review window - we can go back to the command history.

/* Commonly used functions
abs(x)              the absolute value of x
exp(x)              the exponential function of x
int(x)              the integer obtained by truncating x towards zero
ln(x) or log(x)     the natural logarithm of x if x>0
log10(x)            the log base 10 of x (for x>0)
logit(x)            the log of the odds for probability x: logit(x) = ln(x/(1-x))
max(x1,x2,...,xn)   the maximum of x1, x2, ..., xn, ignoring missing values
min(x1,x2,...,xn)   the minimum of x1, x2, ..., xn, ignoring missing values
round(x)            x rounded to the nearest whole number
sqrt(x)             the square root of x if x >= 0
*/

* Search: you can search the internet (Google; Stata also has full PDF documentation)
* or, if you know the command but are not sure about the syntax (the Viewer opens):
help regress

* use a working directory for every project you start, keep your files well organized
* Stata accepts either \ or / to separate directories
* (/ is the safer choice, since \ can interfere with macro expansion)
cd "/Users/vojtabartos/Documents/4 Teaching/2017 Development Economics LMU/Tutorials/Exercise 0 - Stata Intro"

/* we have data on labor market outcomes of several hundred women, separately
for rural and urban areas; we want to connect these two datasets;
we also want to merge the dataset with another variable stating their
marital status */

/* Stata has other commands for interacting with the operating system, including:
- mkdir to create a directory
- dir to list the names of the files in a directory
- type to list their contents
- copy to copy files
- erase to delete a file
*/

* loading data
use womenwage_rural.dta, clear

* append the urban data
append using womenwage_nonrural.dta

*********************
* want to keep your existing data for later and work on some other dataset in between?
* -> use "preserve"; "restore" needs to follow to get back to the pre-preserve data
preserve

* we only have access to a .csv version of the marital status data
* let's convert it to Stata .dta
* (insheet still works; newer versions of Stata recommend import delimited instead)
clear
insheet using womenwage_nev_mar.csv, delimiter(";") names

* use save for your preferred version of Stata; I use the saveold command as some of you may have older licences
* differences across versions are mainly in the encoding of string variables, from what I understand
saveold womenwage_nev_mar, replace

restore
*********************

* merge with the marital status data; id is the identifying variable we use
merge 1:1 id using womenwage_nev_mar

* first check of what is in the data
describe
* alternatively, look at Data/Variables Manager

* summary statistics
summarize age nev_mar rural school tenure wage wagecat
sum school, detail
sum wage, d
bysort rural: sum wage

* generate new variable
generate log_wage=log(wage*1000)
generate age2=age^2

* generate dummies
generate some_experience=tenure>0 & tenure!=.

/* More operators and expressions
Arithmetic                Logical            Relational
+  add                    !  not (also ~)    == equal
-  subtract               |  or              != not equal (also ~=)
*  multiply               &  and             <  less than
/  divide                                    <= less than or equal
^  raise to power                            >  greater than
+  string concatenation                      >= greater than or equal
*/

* plotting data
histogram wage

* help function
/* for online help see either the Stata manual (available in PDF online) or look at the online forums:
- http://www.statalist.org
- http://www.stata.com/support/faqs/
*/
help histogram

histogram wage, by(rural, total) percent

* scatterplots
graph twoway scatter tenure wage
graph twoway (scatter tenure wage) (lfit tenure wage)

* use continuation lines for longer graph commands (well, it can get much more complicated than that)
graph twoway (scatter tenure wage) ///
    (lfit tenure wage), ///
    by(rural)

* save and export the graph
graph save wagetenure, replace
graph export wagetenure.png, replace

* regression
reg wage tenure

* but if we want to run more regressions, we want to save results in a nice form
* uncomment the following to install a package not native to Stata
* ssc install outreg2
outreg2 using "table1.xls" , replace dec(2) se

* how to read this regression?
reg log_wage tenure
* see, you can just leave this without any additional specifications and
* it takes specifications from the previously used outreg2 command
outreg2

* let's add more covariates
reg wage school tenure
outreg2

reg wage age age2 school tenure
outreg2

* see, now the sample is smaller (there are some missing observations in rural)
reg wage rural age age2 school tenure
outreg2

* so let's make the results comparable
reg wage tenure if rural!=.
outreg2

* let's create a variable that we can use to define our working sample
gen subsample=rural!=.
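* a quick sanity check you may want to run here (a minimal sketch using only
* the variables created above): subsample should be 1 exactly where rural is
* non-missing; the "missing" option makes tabulate show missing values of
* rural as their own row
tabulate rural subsample, missing
* number of observations in the working sample
count if subsample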
* see that you can write just "subsample", which is equivalent to subsample==1
* the results are identical to the previous regression, so we don't have to save them using outreg2
reg wage tenure if subsample

* remember, the returns to tenure might differ for rural and urban areas, let's see if trends differ
xi: reg wage i.rural*tenure if subsample
outreg2

* recall the graph split by rural, there was one point clearly standing out:
* there are two observations but one of them has a missing value for rural
* (Stata treats missing values as larger than any number, hence the wage!=. condition)
graph use wagetenure
list if wage>70 & wage!=.
xi: reg wage i.rural*tenure if subsample & wage<70
outreg2

* alternatively, we might want to place an upper limit on the wage variable
gen wage_lim=wage
replace wage_lim=70 if wage>70 & wage!=.
xi: reg wage_lim i.rural*tenure if subsample

* now, run the full specification again (no need to save it, we already have it)
* since we don't need to see the output again, we run it "quietly"
qui: reg wage rural age age2 school tenure

* hypothesis testing
test age=0
ttest wage, by(rural)

* non-linear hypothesis testing
* (makes no sense but good enough for illustrative purposes)
nlcom _b[school]/_b[age]

* post-estimation commands
* store predicted values
predict wage_pr
* generate residuals
gen residuals=wage-wage_pr
* alternatively, let predict compute them (the "residuals" option is required;
* without it predict returns the fitted values)
predict residuals2, residuals
* (see, they are equivalent)
browse residuals residuals2

* testing for heteroskedasticity
* plot the residuals against the predicted values
twoway scatter residuals wage_pr
/* estat hettest runs the Breusch-Pagan / Cook-Weisberg test: the null hypothesis is
constant error variance; the alternative hypothesis states that the error variances
increase (or decrease) as the predicted values of Y increase, e.g. the bigger the
predicted value of Y, the bigger the error variance is.
A large chi-square would indicate that heteroskedasticity is present.
*/
estat hettest

* use Huber-White robust standard errors, which relax the assumption of homoskedasticity
reg wage rural age age2 school tenure, robust
test age=0

* save the dataset with the new variables, use a new name!
saveold womenswage_new, replace

// just see how I have been using comments throughout the file
* it is good practice to use comments as a reference for your future self
/* as well as if you want someone else to work with the .do file */

log close
type womenswage.smcl

/* For reference or further guides:
http://data.princeton.edu/stata

+ Some comments on the structure of commands in Stata by German Rodriguez

Having used a few Stata commands it may be time to comment briefly on their structure,
which usually follows the following syntax, where bold indicates keywords and square
brackets indicate optional elements:

[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [using filename] [,options]

We now describe each syntax element:

command: The only required element is the command itself, which is usually (but not
always) an action verb, and is often followed by the names of one or more variables.
Stata commands are case-sensitive. The commands describe and Describe are different,
and only the former will work. Commands can usually be abbreviated as noted earlier.
When we introduce a command we underline the letters that are required. For example
regress indicates that the regress command can be abbreviated to reg.

varlist: The command is often followed by the names of one or more variables, for
example describe lexp or regress lexp loggnppc. Variable names are case sensitive.
lexp and LEXP are different variables. A variable name can be abbreviated to the
minimum number of letters that makes it unique in a dataset. For example in our quick
tour we could refer to loggnppc as log because it is the only variable that begins with
those three letters, but this is a really bad idea. Abbreviations that are unique may
become ambiguous as you create new variables, so you have to be very careful. You can
also use wildcards such as v* or name ranges, such as v101-v105 to refer to several
variables. Type help varlist to learn more about variable lists.

=exp: Commands used to generate new variables, such as generate log_gnp = log(gnp),
include an arithmetic expression, basically a formula using the standard operators
(+ - * and / for the four basic operations and ^ for exponentiation, so 3^2 is three
squared), functions, and parentheses. We discuss expressions in Section 2.

if exp and in range: As we have seen, a command's action can be restricted to a subset
of the data by specifying a logical condition that evaluates to true or false, such as
lexp < 55. Relational operators are <, <=, ==, >= and >, and logical negation is
expressed using ! or ~, as we will see in Section 2. Alternatively, you can specify a
range of the data, for example in 1/10 will restrict the command's action to the first
10 observations. Type help numlist to learn more about lists of numbers.

weight: Some commands allow the use of weights, type help weights to learn more.

using filename: The keyword using introduces a file name; this can be a file in your
computer, on the network, or on the internet, as you will see when we discuss data
input in Section 2.

options: Most commands have options that are specified following a comma. To obtain a
list of the options available with a command type help command where command is the
actual command name.

by varlist: A very powerful feature, it instructs Stata to repeat the command for each
group of observations defined by distinct values of the variables in the list. For this
to work the command must be "byable" (as noted on the online help) and the data must be
sorted by the grouping variable(s) (or use bysort instead).
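
A worked example with this tutorial's own variables (an illustration, not part of the
guide quoted above) may help to see how the pieces of the syntax line fit together:

    bysort rural: summarize wage if tenure > 0 & tenure != ., detail

Here "bysort rural:" is the by prefix, "summarize" is the command, "wage" is the
varlist, "if tenure > 0 & tenure != ." is the if condition, and "detail" is an option
given after the comma. No =exp, in range, weight, or using element is needed here.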