Naming variables in Stata and other statistical packages is a balance of art and science. Researchers and statisticians need variable names that are informative, but not so long or elaborate that it becomes easy to specify the wrong variable and produce a messy, problematic analysis. J. Scott Long’s The Workflow of Data Analysis Using Stata describes three types of naming systems that people commonly use for variables.
Sequential names use a stub followed by sequential digits, like v7, v11, v013 or something more complicated like R0002203, R002205, etc. The numbers might correspond to the order in which the data were collected or the questions were asked. Because the names carry no meaning, it is easy to use the wrong variable in a set of analyses and hard to remember the name of the variable you need. Because this risk is very real, and tracking down where the wheels came off your analysis wastes a lot of time, some researchers keep a printed list of variable names, descriptive statistics, and variable labels, like the one produced by the command “codebook, compact”.
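As a minimal sketch of that kind of reference list, the do-file fragment below builds a tiny dataset with sequential names and prints the compact codebook. The variables, values, and labels are all made up for illustration; they do not come from Long's book or from any real survey.

```stata
* Build a tiny dataset with sequential names (made-up values, illustration only)
clear
set obs 3
generate v1 = _n
generate v2 = 20 + 5 * _n
generate v3 = mod(_n, 2)

* Variable labels carry the meaning that the names lack (hypothetical labels)
label variable v1 "Respondent ID"
label variable v2 "Age in years"
label variable v3 "Female (1 = yes)"

* One row per variable: name, observations, unique values, mean, min, max, label
codebook, compact

* Search variable names and labels for a keyword
lookfor age
```

With sequential names, the labels do all the work: lookfor finds v2 here only because its label mentions age.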
Source names use information about where the variable came from as part of the name. You might see this in a survey where you have questions q1, q2, q3. A question that comes in multiple parts might get named something like q4a, q4b, q4c. This would be Question 4, parts a, b, and c. With source names, some variables may not fit into the scheme because they pertain to the data collection itself rather than to a question, like demographic variables or information about the time and site of data collection. If your dataset includes variables like these, decide how they will be named before you add them. Source names can be more useful than sequential names, but they are still hard to interpret when you are looking at a complicated model built from them.
Mnemonic naming systems use abbreviations that convey the content of the variable (e.g. female, educ, id, state, etc). These can be much more useful because the name itself gives a clue about what the variable measures. Even so, some planning is necessary because of the limitations of statistical packages. For example, Stata allows names up to 32 characters long, but will truncate names when listing results. The default in Stata tends to truncate variables like familyincome_1990, familyincome_2000, and familyincome_2010 to familyincome, familyincome, familyincome in analyses and results tables. It is best to aim for variable names of no more than 12 characters, so that if your statistical package does truncate variable names, you can still tell which variable is which. This might look like: faminc1990, faminc2000, faminc2010.
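A rough sketch of the 12-character guideline in practice: the fragment below flags any variable whose name is longer than 12 characters, then renames the family income variables to the shorter forms suggested above. The dataset is a one-row placeholder with made-up values, and the 12-character threshold is simply the rule of thumb from this post, not a limit Stata enforces.

```stata
* Placeholder data with long names (values are made up)
clear
set obs 1
generate familyincome_1990 = 1
generate familyincome_2000 = 2
generate familyincome_2010 = 3

* Flag every variable whose name exceeds 12 characters
foreach v of varlist _all {
    if strlen("`v'") > 12 {
        display as text "`v' is longer than 12 characters"
    }
}

* Rename the group in one step so the year stays visible in short output
rename (familyincome_1990 familyincome_2000 familyincome_2010) ///
       (faminc1990 faminc2000 faminc2010)

describe
```

Running the loop again after the rename should print nothing, which makes it a cheap check to keep near the top of a data-cleaning do-file.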
Long also suggests that it can be useful to include an indicator of the variable's structure in the name. You might use b=binary, i=indicator, n=negatively coded scale, and p=positively coded scale. That way, when you see a variable like bdepres_cesd, you know without referring to a codebook that it is a binary item indicating depression based on the CESD.
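To illustrate the prefix convention, here is a minimal sketch that derives a binary depression indicator from a CESD score and names it with the b prefix. The variable cesd, its values, and the cutoff of 16 are assumptions made for this example; use whatever scoring rule your own study specifies.

```stata
* Hypothetical CESD scores (made-up values)
clear
set obs 4
generate cesd = 10 * _n

* b prefix = binary; the cutoff of 16 is an assumption for illustration
generate byte bdepres_cesd = (cesd >= 16) if !missing(cesd)
label variable bdepres_cesd "Binary: CESD score at or above assumed cutoff of 16"

* The prefix also makes it easy to list every binary variable at once
describe b*
```

One side benefit of a consistent prefix is that wildcards like b* pick up the whole family of variables, provided no unrelated variable names happen to start with the same letter.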
One final note: be careful with capitalization. Statistical packages handle capital letters in different ways. Stata is case sensitive, so Educ, EDUC, and educ are three different variables. This might not be much of a problem for you, but consider what happens when you convert from one file format (Stata) to another (like Excel), which may drop extra information (like capitalization). When this happens and you have three variables like Educ, educ, and EDUC, the second and third variable names will get converted to something like varNUM, which can be confusing when you are trying to work with your data after a file conversion.
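One way to catch this before converting a file is to look for names that become identical once case is ignored. The fragment below is a sketch of that check on a placeholder dataset; the macro names lc, seen, and dup are arbitrary, and the values are made up.

```stata
* Three variables that differ only by capitalization (made-up values)
clear
set obs 1
generate Educ = 1
generate EDUC = 2
generate educ = 3

* Report any name that repeats an earlier one when case is ignored
local seen
foreach v of varlist _all {
    local lc = lower("`v'")
    local dup : list lc in seen
    if `dup' {
        display as text "`v' matches an earlier variable name when case is ignored"
    }
    local seen `seen' `lc'
}
```

Any variable the loop reports is a candidate for renaming to something distinct before the data leave Stata.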