* Introduction to Stata
* prepared for Development Economics at LMU
* by Vojtech Bartos

* Examine Stata interface: review, results, command, variables, do-file, menu

* Use .do files

* Store log files; this creates womenswage.smcl in the directory that is current when the line runs
* (the file is not of much help on its own; we'll display the log in Stata at the end)
* use the replace option if you need to run the do-file multiple times
* you can also use the text option to write the log as plain text readable in Notepad/TextEdit
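* A common safeguard (an addition, not part of the original workflow): close any log
* left open by an earlier, interrupted run; capture suppresses the error if none is open
capture log close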
log using womenswage, replace

* stop Stata from pausing long output with --more--, so a long do-file runs without interruption
set more off

* Typing commands
display "Hello world!"
* Use Stata as a calculator
display 2+2 

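* Stata is case sensitive! Display is not a recognized command, so the next line fails;
* capture noisily shows the error but lets the do-file continue
capture noisily Display 2+2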

* Now show the review window - we can go back to the command history.

/*
Commonly used functions
abs(x)	the absolute value of x
exp(x)	the exponential function of x
int(x)	the integer obtained by truncating x towards zero
ln(x) or log(x)	the natural logarithm of x if x>0
log10(x)	the log base 10 of x (for x>0)
logit(x)	the log of the odds for probability x: logit(x) = ln(x/(1-x))
max(x1,x2,...,xn)	the maximum of x1, x2, ..., xn, ignoring missing values
min(x1,x2,...,xn)	the minimum of x1, x2, ..., xn, ignoring missing values
round(x)	x rounded to the nearest whole number
sqrt(x)	the square root of x if x >= 0
*/
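
* A few of the functions above in action (the numbers are arbitrary illustrations)
display abs(-3)          // 3
display int(-2.7)        // -2, truncation goes towards zero
display sqrt(16)         // 4
display max(1, ., 5)     // 5, the missing value is ignored
display logit(0.5)       // 0, since ln(0.5/0.5) = ln(1)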

* Search: look online (Google; Stata also comes with full PDF documentation)
* or, if you know the command but are not sure about its syntax, use help (the Viewer opens)
help regress

* use a working directory for every project you start, keep your files well organized
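* Stata accepts the forward slash as a directory separator on every platform (\ also works on Windows);
* the forward slash is the safer choice, because \ directly before a macro reference acts as an escape character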
cd "/Users/vojtabartos/Documents/4 Teaching/2021 Development Economics/Tutorials/Exercise 0 - Stata Intro"

/* we have data on labor market outcomes of several hundred women, stored separately for 
rural and urban areas; we want to append these two datasets; we also want to merge the result
with another dataset containing their marital status 
*/

/*
Stata has other commands for interacting with the operating system, including:
- mkdir to create a directory
- dir to list the names of the files in a directory
- type to list their contents
- copy to copy files
- erase to delete a file
*/
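
* A short sketch of some of these commands; the folder name "output" is a made-up example
pwd                      // print the current working directory
dir *.dta                // list the Stata datasets stored there
capture mkdir output     // create a folder; capture suppresses the error if it already exists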

* loading data
use womenwage_rural.dta, clear

* append using the urban data
append using womenwage_nonrural.dta
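
* A quick check that the append worked (a sketch): count reports the total number of
* observations, and tabulate with the missing option also shows observations
* for which rural is not recorded
count
tabulate rural, missing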

*********************
* want to keep your existing data for later and work on some other dataset in between? ->
* use "preserve" now; "restore" needs to follow to get back to the preserved data
preserve

* we only have access to a .csv version of the marital status data
* let's convert it to Stata .dta
clear
insheet using womenwage_nev_mar.csv, delimiter(";") names
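* In Stata 13 and newer, insheet is superseded by import delimited; an equivalent call
* (a sketch, left commented out so the data are not imported twice) would be:
* import delimited using womenwage_nev_mar.csv, delimiters(";") varnames(1) clear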

* use save to store data in your own version's format; I use saveold because some of you may have older licences
* as far as I understand, the formats differ across versions mainly in how string variables are encoded
saveold womenwage_nev_mar, replace

restore
*********************
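
* merge 1:1 requires id to identify observations uniquely in both datasets;
* a non-halting way to check the data in memory (a sketch) is:
duplicates report id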

* merge with the marital status data; id is the identifying variable we use
merge 1:1 id using womenwage_nev_mar
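
* merge creates the variable _merge, which records where each observation came from:
* 1 = only in the data in memory, 2 = only in the using file, 3 = matched in both
tabulate _merge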

* first check of what is in the data
describe
* alternatively, look at Data/Variables Manager

* summary statistics
summarize age nev_mar rural school tenure wage wagecat
sum school, detail
sum wage, d
bysort rural: sum wage
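
* A compact alternative (a sketch): tabstat puts chosen statistics side by side by group
tabstat wage tenure, by(rural) statistics(mean sd min max n)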

* generate new variable
generate log_wage=log(wage*1000)
generate age2=age^2
* generate dummies
generate some_experience=(tenure>0 & tenure!=.)
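
* Checking the new dummy (a sketch): the condition tenure!=. matters because Stata treats
* missing values as larger than any number, so tenure>0 alone would be true for them
tabulate some_experience, missing
count if tenure==.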

/*
More operators and expressions
Arithmetic			Logical				Relational
+ add				! not (also ~)		== equal
- subtract			| or				!= not equal (also ~=)
* multiply			& and				< less than
/ divide						 		<= less than or equal
^ raise to power		 				> greater than
+ string concatenation	 				>= greater than or equal
*/
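
* The operators combined in if conditions (a sketch; assumes rural is coded 0/1,
* and the cutoffs are arbitrary)
count if rural==1 & tenure>5 & tenure<.
count if age<25 | (age>40 & age<.)
display "Stata" + " " + "intro"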

* plotting data
histogram wage

* help function
/* for online help see either Stata manual (available in PDF online)
or look at the online forums:
- http://www.statalist.org
- http://www.stata.com/support/faqs/
*/
help histogram
histogram wage, by(rural, total) percent

* scatterplots
graph twoway scatter tenure wage
graph twoway (scatter tenure wage) (lfit tenure wage)
* use continuation lines for longer graphs (well, it can get much more complicated than that)
graph twoway (scatter tenure wage) ///
	(lfit tenure wage), ///
	by(rural)
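
* An alternative to continuation lines (a sketch): switch the command delimiter to a
* semicolon, so a command may span several lines until the semicolon is reached
#delimit ;
graph twoway (scatter tenure wage)
	(lfit tenure wage),
	by(rural) ;
#delimit cr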

* save and export the graph
graph save wagetenure, replace
graph export wagetenure.png, replace

* regression
reg wage tenure
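
* Estimation results stay in memory after reg and can be reused (a sketch):
* _b[] and _se[] hold coefficients and standard errors, e() holds model statistics
display "slope on tenure: " _b[tenure]
display "standard error:  " _se[tenure]
display "observations:    " e(N)
display "R-squared:       " e(r2)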

* but if we want to run more regressions, we want to save results in a nice form
* outreg2 is a user-written package; uncomment the following line to install it
* ssc install outreg2
outreg2 using "table1.xls" , replace dec(2) se

* how to read this regression?
reg log_wage tenure
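
* Reading the log-level model (a sketch; assumes tenure is measured in years):
* one more year of tenure is associated with a change in wage of about 100 times b percent,
* where b is the coefficient on tenure; the exact figure is 100 times (exp(b)-1)
display "approximate % change: " 100*_b[tenure]
display "exact % change:       " 100*(exp(_b[tenure])-1)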

* see, you can just leave this without any additional specifications and 
* it takes specifications from the previously used outreg2 command
outreg2

* let's add more covariates
reg wage school tenure
outreg2
reg wage age age2 school tenure
outreg2

* see, now sample is smaller (there are some missing observations in rural)
reg wage rural age age2 school tenure
outreg2

* so let's make the results comparable
reg wage tenure if rural!=.
outreg2

* let's create a variable that we can use to define our working sample
gen subsample=rural!=.

* note that you can write just subsample, which here is equivalent to subsample==1 (any nonzero value counts as true)
* the results are equivalent to the previous regression, so we don't have to save it using outreg2
reg wage tenure if subsample

* remember, the returns to tenure might differ for rural and urban areas, let's see if trends differ
xi: reg wage i.rural*tenure if subsample
outreg2
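
* xi with i.rural*tenure is the older syntax; in Stata 11 and newer, factor-variable
* notation fits the same model without creating extra dummy variables (a sketch):
reg wage i.rural##c.tenure if subsample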

* recall the graph split by rural, there was one point clearly standing out:
* there are two observations but one of them has a missing value for rural
graph use wagetenure
* (missing values count as larger than any number, so exclude them explicitly)
list if wage>70 & wage<.
xi: reg wage i.rural*tenure if subsample & wage<70
outreg2

* alternatively, we might want to place an upper limit on the wage variable
gen wage_lim=wage
* wage<. keeps missing wages missing (Stata treats missing as larger than any number)
replace wage_lim=70 if wage>70 & wage<.
xi: reg wage_lim i.rural*tenure if subsample


* now, run again the full specification (no need to save it, we already have it)
* for this reason we don't need to see it again, we run it "quietly"
qui: reg wage rural age age2 school tenure

* hypothesis testing
test age=0
test age=tenure
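* a joint test (a sketch): are the coefficients on age and age2 both zero?
test age age2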
ttest wage, by(rural)

* non-linear hypothesis testing
* (makes no sense but good enough for illustrative purposes)
nlcom _b[school]/_b[age]
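* a more meaningful non-linear combination (a sketch): the age at which the estimated
* quadratic age profile turns, i.e. minus the age coefficient over twice the age2 coefficient
nlcom (turning_age: -_b[age]/(2*_b[age2]))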

* post-estimation commands
* store predicted values
predict wage_pr
* generate residuals
gen residuals=wage-wage_pr
* alternatively 
predict residuals2, residual
* (see, they are equivalent)
browse residuals residuals2
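* a check that does not rely on eyeballing (a sketch): the difference should be zero
* up to rounding, since both variables are stored with float precision
gen double resid_diff = residuals - residuals2
summarize resid_diff
drop resid_diff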
* checking for heteroskedasticity visually: plot the residuals against the predicted values
twoway scatter residuals wage_pr

/* estat hettest runs the Breusch-Pagan / Cook-Weisberg test. The null hypothesis is constant 
error variance; the alternative states that the error variance increases (or decreases) 
as the predicted values of Y increase, e.g. the bigger the predicted value of Y, the 
bigger the error variance. A large chi-square (small p-value) indicates that heteroskedasticity is present. */
estat hettest
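
* White's general test is an alternative that does not assume a particular form of
* heteroskedasticity (a sketch):
estat imtest, white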

* use Huber-White robust standard errors, which relax the assumption of homoskedasticity
reg wage rural age age2 school tenure, robust
test age=0

* save the dataset with the new variables, use a new name!
saveold womenswage_new, replace // note how I have been using comments throughout the file
* it is good practice to write comments that serve as a reference for the future
/* and that help anyone else who needs to work with the .do file */

log close
type womenswage.smcl
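* The SMCL log can also be converted to plain text (a sketch; the output
* filename womenswage.log is an assumption):
translate womenswage.smcl womenswage.log, replace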
/*
For reference or further guides: http://data.princeton.edu/stata

+ Some comments on structure of commands in Stata by German Rodriguez

Having used a few Stata commands it may be time to comment briefly on their structure, which usually follows the syntax below, where square brackets indicate optional elements:

[by varlist:] command [varlist] [=exp] [if exp] [in range] [weight] [using filename] [,options]

We now describe each syntax element:
command:
The only required element is the command itself, which is usually (but not always) an action 
verb, and is often followed by the names of one or more variables. Stata commands are 
case-sensitive. The commands describe and Describe are different, and only the former will 
work. Commands can usually be abbreviated to a minimum set of required letters. For example, 
the regress command can be abbreviated to reg.

varlist:
The command is often followed by the names of one or more variables, for example describe 
lexp or regress lexp loggnppc. Variable names are case sensitive. lexp and LEXP are different 
variables. A variable name can be abbreviated to the minimum number of letters that makes it 
unique in a dataset. For example in our quick tour we could refer to loggnppc as log because 
it is the only variable that begins with those three letters, but this is a really bad idea. 
Abbreviations that are unique may become ambiguous as you create new variables, so you have 
to be very careful. You can also use wildcards such as v* or name ranges, such as v101-v105 
to refer to several variables. Type help varlist to learn more about variable lists.

=exp:
Commands used to generate new variables, such as generate log_gnp = log(gnp), include an 
arithmetic expression, basically a formula using the standard operators (+ - * and / for 
the four basic operations and ^ for exponentiation, so 3^2 is three squared), functions, 
and parentheses. We discuss expressions in Section 2.

if exp and in range:
As we have seen, a command's action can be restricted to a subset of the data by specifying a 
logical condition that evaluates to true or false, such as lexp < 55. Relational operators 
are <, <=, ==, >= and >, and logical negation is expressed using ! or ~, as we will see in 
Section 2. Alternatively, you can specify a range of the data, for example in 1/10 will 
restrict the command's action to the first 10 observations. Type help numlist to learn more 
about lists of numbers.

weight:
Some commands allow the use of weights, type help weights to learn more.

using filename:
The keyword using introduces a file name; this can be a file in your computer, on the 
network, or on the internet, as you will see when we discuss data input in Section 2.

options:
Most commands have options that are specified following a comma. To obtain a list of the 
options available with a command type help command where command is the actual command name.

by varlist:
A very powerful feature, it instructs Stata to repeat the command for each group of 
observations defined by distinct values of the variables in the list. For this to work 
the command must be "byable" (as noted on the online help) and the data must be sorted 
by the grouping variable(s) (or use bysort instead).
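
Example (an illustration using this tutorial's variables, not part of the original guide):

bysort rural: summarize wage if tenure>0 & tenure<., detail

Here bysort rural: is the by prefix, summarize is the command, wage is the varlist, 
if tenure>0 & tenure<. is the if condition, and detail is an option.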