STATA DATA TRANSFORMATION COMMANDS
This page aims to summarize several different data transformation commands available in STATA and to explain the context where each of those could be used. The motivation for this post lies in the fact that researches are often not sure about what command they should use, or they do their work the hard way for being unaware of other more practical commands.
Some, but not all, of STATA data transformation commands are merge, append, joinby, cross, and reshape. The post will focus on the first four while reshape will be addressed in a separate post as it deserves separate attention.
MERGE
We start with the most commonly used STATA data management command – merge. This is a command you should use when combining two files that have common identifiers into one single file. Merge has four variations: 1:1 (one-to-one), m:1 (m-to-one), 1:m (one-to-m), and m:m (m-to-m).
1:1 MERGE
1:1 merge is the simplest variation of the merge command. It should be used when the merge variable(s) uniquely identify both datasets that you want to merge. Referring to the example below where merging is done using the name variable, this means that all values of this variable should be unique, i.e. there are no repeating names in either of the files.
Assuming the first file is called height and the second weight, this is the code you would use for merging:
use height, clear merge 1:1 name using weight
1:1 MERGE (merge on multiple variables)
Now imagine a case where we observe height and weight of different individuals at different points in time. Again, we want to merge those two files. Now individual’s name does not uniquely identify observations in the two files, but name year variables combination does, so it is still a one-to-one merge.
Assuming the first file is called height and the second weight, this is the code you would use for merging:
use height, clear merge 1:1 name year using weight
m:1 MERGE
M-to-one merge is a variation of this command that you should use when one of the two files you are aiming to combine is not uniquely identified by the merge variable. As depicted in the example below, in the first file one observation corresponds to an individual at a certain point in time, while in the second file one observation corresponds to an individual. In other words, while one file contains data at individual/year level, the other contains data at individual level. So M observations in the first file correspond to 1 (one) observations in the second file, hence the m:1 merge.
Assuming the first file is called height and the second gender, this is the code you would use for merging:
use height, clear merge m:1 name using gender
1:m MERGE
One-to-m merge is basically an inverse of m-to-one merge. Here we can use the same example as in the previous step, but the difference will be in what is the master file (the first data) and what is the using file (the second data). i.e. we will start with gender data which is unique at individual level, and merge it with height data which is unique at individual/year level.
Assuming the first file is called gender and the second height, this is the code you would use for merging:
use gender, clear merge 1:m name using height
m:m MERGE
Lastly, m-to-m merge is a variation of merge command which should be used when neither of the files is unique at merge variables level. m-to-m merge is hard to use and not recommended by many STATA users. My personal opinion is also against usage of this subcommand, and I instead recommend using the joinby command for context described here.
APPEND
Append is a STATA command that you should use when you want to ‘stack’ two or more files so they come ‘on top’ of each other. The files should have identical or similar variables (columns) but different in terms of content that is in the rows of the files. The example below shows appending two files containing height data of different individuals into one single file.
Assuming the first file is called height1 and the second height2, the code below is what you would use for appending. Please note that using the append command does not include specifying any variables!
use height1, clear append using height2
JOINBY
Joinby is a very useful STATA command which makes pairwise combinations of two files using a specified variable. In the example below, the data on the left shows sales of different branches of a firm over time. The second file shows data on firm shareholders. The two files are very different in terms of content – the first one is at firm/branch/month level and the second is at firm/shareholder level. The only common variable for the two files is firm. Now imagine that we want to combine all this info into a single file. Joinby will create a pairwise combination of the two files so we have sales and shareholder data collected in a single file.
The applicability of joinby may not be obvious, but trust me when saying that the only limit is your creativity!
Assuming the first file is called sales and the second shareholders, the code below is what you would use for making pairwise combinations of the two files. Please note that using the joinby command does require specifying joinby variable(s)!
use sales, clear joinby firm using shareholders
CROSS
Cross is another command that one can use for creating a pairwise combinations of two files, so in that sense it is similar to joinby. However, one important difference between the two is that cross does not require you to specify any variables. So while in the previous (joinby related) example we interacted data that belongs to a certain firm, with cross command we form combinations of all data in the two datasets. The example below refers to an event study context where the cross command is particularly handy.
Assuming the first file is called event_list and the second returns, the code below is what you would use for making pairwise combinations of the two files. Please note that using the cross command does not require specifying any variable(s)!
use event_list, clear cross using returns