Introduction to commands in Stata
In this note, I would like to share some common Stata commands provided by Stata Training. I hope this note will be helpful for those who are new to Stata.
Preliminary
In this note, I will use the example dataset auto, which is provided by Stata. You can load the dataset by typing:
sysuse auto, clear
Set the global environment
We can use the global command to define global macros, which stay available for the rest of the Stata session. For example, we can store a directory path in the global macro dir as follows:
global dir "~/example"
Then, we can use $dir to refer to the directory in the following commands.
cd "$dir" // change the working directory to the path stored in dir (the quotes protect paths that contain spaces)
We can also define a global macro subset that holds a list of variables, here length, price, and mpg, as follows:
global subset length price mpg
Then, we can produce summary statistics for the variables length, price, and mpg as follows:
summarize $subset
It will show the summary statistics for the variables length, price, and mpg:
. summarize $subset
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      length |         74    187.9324    22.26634        142        233
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
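To make the macro mechanics a little more concrete, here is a minimal sketch that inspects the global macro and reuses it in a second command. It assumes only the auto dataset and the subset macro defined above; the choice of correlate as the second command is mine, for illustration.

```stata
sysuse auto, clear

* Define the global macro: it simply stores the text "length price mpg"
global subset length price mpg

* Inspect what the macro currently holds
display "$subset"
macro list subset

* Stata substitutes the text for $subset before running the command,
* so the macro works anywhere a variable list is expected
correlate $subset

* Remove the macro once it is no longer needed
macro drop subset
```

Because global macros persist for the whole session, dropping them when you are done helps avoid surprises in later do-files.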
Import data
Excel files
Like R and Python, Stata can import several types of data. The most common are .csv and .xlsx files: the import delimited command reads a .csv file, and the import excel command reads an .xlsx file.
import delimited "~/example/data.csv", clear // csv file
import excel "~/example/data.xlsx", firstrow clear // excel file
Stata files
Sometimes the data are already stored in Stata's own .dta format. We can use the use command to load such files.
use "~/example/data.dta", clear
URL data
We can also load data directly from a URL. For example, we can load the dataset at https://www.stata-press.com/data/r16/auto.dta as follows:
use "https://www.stata-press.com/data/r16/auto.dta", clear
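As a small illustration of loading from the web, the sketch below reads only a few variables from the same URL and then shows the webuse shortcut. It assumes an internet connection; the choice of make, price, and mpg is mine, and which release of the example data webuse fetches depends on your Stata version.

```stata
* Load only three variables from the remote file instead of the whole dataset
use make price mpg using "https://www.stata-press.com/data/r16/auto.dta", clear
describe

* webuse is a shortcut for Stata's own example datasets on stata-press.com
webuse auto, clear
describe
```

Loading a subset of variables with use ... using can save time and memory when the remote file is large.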
Save data
Saving data is similar to importing data. We can save the file in .dta, .csv, or .xlsx format.
save "~/example/data.dta", replace // save as dta file
export delimited "~/example/data.csv", delimiter(",") replace // save as csv file
export excel "~/example/data.xlsx", firstrow(variables) replace // save as excel file
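To check that an export worked, one option is a quick round trip: write a file, read it back, and look at the result. The sketch below does this with a subset of the auto dataset. The file name auto_subset.csv is hypothetical and is written to the current working directory.

```stata
sysuse auto, clear

* Keep a few variables so the exported file is easy to inspect
keep make price mpg

* Write the data to a csv file (hypothetical name, current working directory)
export delimited using "auto_subset.csv", replace

* Read the file back, taking variable names from the first row
import delimited "auto_subset.csv", varnames(1) clear

* Confirm the variables and their storage types after the round trip
describe
```

Note that a csv file keeps only the values, so variable labels and value labels from the original dataset are not restored on import. Saving as .dta preserves them.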
Data manipulation
Data cleaning
Once we import the data, we can start working with the variables, for example by creating new variables or by sorting, merging, and reshaping the data.
generate lpkm = 235.21 / mpg // create a new variable: litres per 100 km = 235.21 / mpg
gsort lpkm // sort the data by lpkm in ascending order
gen id = _n // create a running id variable (gen is short for generate)
g n_obs = _N // create a constant equal to the total number of observations (g is short for generate)
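The difference between _n and _N is easiest to see right after sorting. Here is a minimal sketch, assuming the auto dataset; the variable name rank is hypothetical.

```stata
sysuse auto, clear
generate lpkm = 235.21 / mpg

* A minus sign in gsort sorts in descending order (highest consumption first)
gsort -lpkm

* _n is the current observation number, so after sorting it acts as a rank
generate rank = _n

* _N is the total number of observations and is the same in every row
display _N

* Show the five cars with the highest fuel consumption
list make mpg lpkm rank in 1/5
```

Plain sort only sorts in ascending order, which is why gsort is used here.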
If you want to restrict a command to observations that meet a condition, you can use the if qualifier. For example, to create a new variable lpkm_10 based on the condition lpkm > 10, you can use the following command:
generate lpkm_10 = lpkm if lpkm > 10
If the condition is satisfied for an observation, the value of lpkm_10 will be the value of lpkm. Otherwise, the value of lpkm_10 will be missing.
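The if qualifier works with most commands, not only generate. The sketch below counts the matching observations and builds a 0/1 indicator. It assumes the auto dataset, and the variable name thirsty is hypothetical.

```stata
sysuse auto, clear
generate lpkm = 235.21 / mpg

* Count the observations that satisfy the condition
count if lpkm > 10

* The expression lpkm > 10 evaluates to 1 or 0, giving an indicator variable.
* Stata treats missing values as larger than any number, so the extra
* !missing() check keeps missing lpkm from being coded as 1.
generate byte thirsty = lpkm > 10 if !missing(lpkm)
tabulate thirsty, missing

* Restrict another command to the flagged observations
summarize price if thirsty == 1
```

In the auto dataset mpg has no missing values, so the !missing() guard changes nothing here, but it is a useful habit for real data.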
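Sometimes, we need to group the data based on a condition. We can use the by prefix to group the data (similar to group_by in dplyr). Because by requires the data to be sorted by the grouping variable, bysort, which sorts first, is usually more convenient. For example, to calculate the mean of lpkm for each group of make and save the result in a new variable mean_lpkm, we can use either of the following commands (run only one of them, since both create mean_lpkm):
bysort make: egen mean_lpkm = mean(lpkm) // sort by make, then compute the group mean
egen mean_lpkm = mean(lpkm), by(make) // equivalent form using egen's by() option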
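In the auto dataset, make appears to identify individual cars, so a per-make mean just reproduces lpkm. A grouping variable with only a few levels shows the idea more clearly. Here is a minimal sketch using foreign as the group; the variable name mean_lpkm_origin is hypothetical.

```stata
sysuse auto, clear
generate lpkm = 235.21 / mpg

* Attach the group mean to every observation in the group
bysort foreign: egen mean_lpkm_origin = mean(lpkm)

* Check the result: one mean and one count per group
tabstat lpkm, by(foreign) statistics(mean n)

* Each car now carries the mean of its own group
list make foreign lpkm mean_lpkm_origin in 1/5
```

Unlike collapse, egen keeps all observations and only adds a column, which is the closer analogue of group_by followed by mutate in dplyr.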
We can also create dummy variables for a categorical variable. For example, to create one dummy variable for each distinct value of the variable var, named d1, d2, and so on, we can use the following command:
tab var, mi gen(d) // mi also creates a dummy for missing values
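Applied to the auto dataset, the same pattern might look like the sketch below, which uses rep78 (repair record) as the categorical variable. The stub rep_ is my choice, and the regression is only there to show the factor-variable alternative.

```stata
sysuse auto, clear

* Create one indicator per category of rep78; the missing option adds
* an extra indicator for observations where rep78 is missing
tabulate rep78, missing generate(rep_)

* The new variables are named from the stub plus a running number
describe rep_*

* For estimation, factor-variable notation often avoids creating dummies at all
regress price mpg i.rep78
```

With i.rep78, Stata builds the indicators on the fly and drops a base category automatically.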
Data summarization
To explore the basic statistics of the data, we can use the following commands:
summarize var1 var2, detail // summary statistics; detail adds percentiles, skewness, and kurtosis
univar var1 var2, boxplot // univariate summary (community-contributed command; if it is missing, try ssc install univar)
ci mean var1, level(95) // confidence interval for the mean
corr var1 var2 // correlation
These commands help us understand the basic statistics of the variables.
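As a concrete version of the commands above, here is a minimal sketch on the auto dataset. The variable choices are mine. The ci means syntax assumes a reasonably recent Stata (as far as I know, older releases used ci varname instead).

```stata
sysuse auto, clear

* Detailed summary: percentiles, variance, skewness, and kurtosis
summarize price mpg, detail

* 95% confidence interval for the mean of mpg
ci means mpg, level(95)

* Correlation matrix using only observations complete on all listed variables
correlate price mpg weight

* Pairwise correlations with significance levels
pwcorr price mpg weight, sig
```

The difference between correlate and pwcorr only matters when some variables have missing values: correlate drops an observation if any listed variable is missing, while pwcorr uses all available pairs.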
Data visualization
After we summarize the data or estimate a model, we often want to visualize the results. We can use the following commands to visualize the data.
histogram var1 // histogram
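For a runnable version, here is a minimal sketch with the auto dataset. The titles, the file name mpg_hist.png, and the extra plot types are my own choices for illustration.

```stata
sysuse auto, clear

* Histogram of mpg with counts on the y-axis instead of densities
histogram mpg, frequency title("Distribution of mpg")

* Save the graph currently in memory (hypothetical file name, working directory)
graph export "mpg_hist.png", replace

* Scatter plot of price against mpg
twoway scatter price mpg

* Box plots of mpg, one per group of foreign
graph box mpg, over(foreign)
```

Each graph command replaces the graph in memory, so graph export has to come right after the plot you want to keep. Depending on the platform, some export formats may not be available in console (non-GUI) Stata.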