Instruction to commands in Stata

In this note, I would like to share some common commands in Stata provided by Stata Training. I hope this note will be helpful for those who are new to Stata.

Preliminary

In this note, I will use the example dataset auto, which ships with Stata.

You can load the dataset by typing sysuse auto, clear.

Set the global environment

We can use the global command to define global macros, which store text such as a path or a list of variable names and stay available for the whole Stata session.

For example, we can store a directory in a global macro named dir as follows:

global dir "~/example"

Then, we can write $dir to refer to the directory in the following commands.

cd "$dir" // change the working directory to the path stored in dir
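As a minimal sketch of how a path macro is typically reused, the lines below build a file path from $dir. To keep the sketch runnable anywhere, it assumes the current working directory as the project folder instead of ~/example, and the file name auto_copy.dta is hypothetical.

```stata
* Assumption: use the current working directory as the project folder
global dir "`c(pwd)'"
display "$dir"                          // show what the macro expands to
cd "$dir"                               // quotes protect paths that contain spaces

* Braces mark where the macro name ends when more text follows it
global datafile "${dir}/auto_copy.dta"  // hypothetical file name
display "$datafile"
```

Writing ${dir} with braces is optional here, but it avoids ambiguity whenever the macro name is followed directly by other characters.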

We can also define a global macro subset that holds a list of variables, here length, price, and mpg:

global subset length price mpg

Then, we can compute summary statistics for these variables as follows:

summarize $subset

It will show the summary statistics for the variables length, price, and mpg.

. summarize $subset

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      length |         74    187.9324    22.26634        142        233
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
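The same variable-list macro can be passed to any command that accepts a list of variables. The sketch below reuses $subset with describe and correlate, and also shows a local macro as a shorter-lived alternative. The local name subset2 and the variables chosen for it are illustrative assumptions, not part of the original note.

```stata
sysuse auto, clear
global subset length price mpg

describe $subset          // storage type and label of each variable in the list
correlate $subset         // correlation matrix for the same three variables

* A local macro works the same way but only exists within the current do-file or program
local subset2 weight turn // hypothetical second variable list
summarize `subset2'
```

Local macros are often preferred inside do-files because they cannot clash with macros defined elsewhere in the session.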

Import data

CSV and Excel files

As in R and Python, we can import different types of data in Stata.

The most common file types are .csv and .xlsx. We can use the import delimited command to import a .csv file and the import excel command to import an .xlsx file.

import delimited "~/example/data.csv", clear // csv file
import excel "~/example/data.xlsx", firstrow clear // excel file

Stata files

Sometimes the data are already stored in Stata's own .dta format. We can use the use command to load such files.

use "~/example/data.dta", clear

URL data

We can also load data directly from a URL. For example, we can load the data at https://www.stata-press.com/data/r16/auto.dta as follows:

use "https://www.stata-press.com/data/r16/auto.dta", clear
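An alternative for datasets hosted on a web server is the webuse command, which loads a file by name from a URL prefix. The sketch below points webuse at the same r16 folder used above. It needs an internet connection, and the prefix is reset afterwards so later webuse calls behave as usual.

```stata
webuse set https://www.stata-press.com/data/r16   // set the URL prefix that webuse reads from
webuse auto, clear                                 // loads auto.dta from that prefix
webuse set                                         // reset the prefix to Stata's default
describe, short                                    // confirm what was loaded
```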

Save data

Saving data is similar to importing data. We can save the file in .dta, .csv, or .xlsx format.

save "~/example/data.dta", replace // save as dta file
export delimited "~/example/data.csv", delimiter(",") replace // save as csv file
export excel "~/example/data.xlsx", firstrow(variables) replace // save as excel file
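To check that a file was written as expected, one option is a quick round trip: save the data, read it back, and count the observations. The sketch below assumes Stata's temporary directory as the target folder, and the base file name auto_roundtrip is hypothetical.

```stata
sysuse auto, clear

* Hypothetical base file name inside Stata's temporary directory
local base "`c(tmpdir)'/auto_roundtrip"

save "`base'.dta", replace                                  // Stata format
export delimited using "`base'.csv", replace                // comma-separated text
export excel using "`base'.xlsx", firstrow(variables) replace

* Read the .csv back in; varnames(1) takes variable names from the first row
import delimited using "`base'.csv", varnames(1) clear
count

* Read the .dta back in
use "`base'.dta", clear
count
```

Note that the .csv round trip does not preserve value labels or variable labels, whereas the .dta file keeps them.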

Data manipulation

Data cleaning

Once we import the data, we can start to work with the variables, for example to create new variables or to sort, merge, or reshape the data.

generate lpkm = 235.21 / mpg // create a new variable: litres per 100 km = 235.21 / mpg
gsort lpkm // sort the data by lpkm in ascending order (gsort -lpkm sorts descending)
gen id = _n // create a running id variable (gen is short for generate)
g n_obs = _N // create a constant equal to the total number of observations (g is short for generate)
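The system variables _n and _N become more useful when combined with a by prefix, where they count within each group instead of across the whole dataset. The sketch below illustrates this on auto. Grouping by foreign and the new variable names are illustrative choices.

```stata
sysuse auto, clear

gsort -price                                     // most expensive car first
generate rank_price = _n                         // overall rank by price

* Within each origin group, sorted by price: _n is the position, _N the group size
bysort foreign (price): generate within_id = _n  // 1 = cheapest car in its group
by foreign: generate group_size = _N             // number of cars in the group

list make foreign price within_id group_size in 1/5
```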

If you want to restrict a command to the observations that meet a condition, you can use the if qualifier. For example, to create a new variable lpkm_10 based on the condition lpkm > 10, you can use the following command:

generate lpkm_10 = lpkm if lpkm > 10

If the condition is satisfied for the observation, the value of lpkm_10 will be the value of lpkm. Otherwise, the value of lpkm_10 will be missing.
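One detail worth keeping in mind with if conditions is that Stata treats numeric missing values as larger than any number, so a condition such as x > 4 is also true when x is missing. The sketch below shows this with rep78, which has missing values in auto. The variable names thirsty and good_rep are hypothetical.

```stata
sysuse auto, clear
generate lpkm = 235.21 / mpg

* A logical expression returns 1 (true) or 0 (false), which gives an indicator variable
generate byte thirsty = lpkm > 10

* Guard against missing values so they stay missing instead of being coded as 1
generate byte good_rep = rep78 >= 4 if !missing(rep78)

count if rep78 > 4                      // also counts cars with missing rep78
count if rep78 > 4 & !missing(rep78)    // counts only cars with an observed rep78 above 4
```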

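Sometimes we need to compute something within groups. The by prefix runs a command separately for each group (similar to group_by in dplyr), but it requires the data to be sorted by the grouping variable; bysort sorts and groups in one step. For example, to calculate the mean of lpkm for each group of make and save the result in a new variable mean_lpkm, we can use either of the following commands (they are alternatives, so run only one):

bysort make: egen mean_lpkm = mean(lpkm) // sort by make, then take the mean within each group
egen mean_lpkm = mean(lpkm), by(make) // same result using the by() option of egen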
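In the auto dataset every car has its own make, so a group mean by make simply returns each car's own value. As a sketch with a more informative grouping, the lines below use foreign (domestic versus foreign cars) instead; that choice of variable is an assumption made for illustration.

```stata
sysuse auto, clear
generate lpkm = 235.21 / mpg

* Group mean stored as a new variable, repeated on every row of the group
bysort foreign: egen mean_lpkm = mean(lpkm)

* The same group means shown as a table, without creating a variable
tabstat lpkm, by(foreign) statistics(mean n)
```

The egen approach is convenient when the group mean is needed in later calculations, while tabstat is enough when the goal is only to inspect it.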

We can also create dummy variables for a categorical variable. For example, to create one dummy variable for each distinct value of the variable var (named d1, d2, and so on, in sorted order of the values), we can use the following command:

tab var, mi gen(d) // tab = tabulate, gen = generate; mi (missing) also creates a dummy for missing values
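Applied to the auto dataset, the same pattern might look like the sketch below, which uses rep78 as the categorical variable. The stub name rep_ is a hypothetical choice; with the missing option, the last dummy flags the cars whose rep78 is missing.

```stata
sysuse auto, clear

* One dummy per distinct value of rep78, plus one for missing values
tabulate rep78, missing generate(rep_)

describe rep_*        // the variable labels show which value each dummy stands for
summarize rep_*       // the mean of each dummy is the share of cars in that category
```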

Data summarization

To explore the basic statistics of the data, we can use the following commands:

summarize var1 var2, detail // summary statistics; detail adds percentiles, skewness, and kurtosis
univar var1 var2, boxplot // univariate summary (community-contributed command; install it with ssc install univar)
ci means var1, level(95) // 95% confidence interval for the mean (Stata 14.1 or later)
corr var1 var2 // correlation matrix

The above commands help us understand the basic statistics of the variables.
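Most of these commands also leave their results in memory, so the numbers can be reused instead of being copied from the output. The sketch below runs the built-in commands on price and mpg from auto and reads back a few stored results. The choice of variables is illustrative.

```stata
sysuse auto, clear

summarize price, detail
return list                                   // show everything summarize stored in r()
display "mean = " r(mean) ", median = " r(p50)

ci means price, level(95)                     // requires Stata 14.1 or later

correlate price mpg
display "correlation = " r(rho)
```

Stored results are overwritten by the next command that saves results in r(), so copy a value into a macro or scalar if it is needed later.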

Data visualization

After we summarize the data or estimate a model, we often want to visualize the results. We can use the following commands to visualize the data.

histogram var1 // histogram
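As a sketch of what this might look like on the auto dataset, the lines below draw a histogram with a few common options, followed by a scatter plot and a box plot. The graph names and the title text are hypothetical; naming each graph keeps all three in memory at once.

```stata
sysuse auto, clear

* Histogram of price with 10 bins and counts on the vertical axis
histogram price, frequency bin(10) title("Distribution of price") name(hist_price, replace)

* Scatter plot of price against mileage
scatter price mpg, name(scatter_price_mpg, replace)

* Box plot of mileage, split by car origin
graph box mpg, over(foreign) name(box_mpg, replace)
```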