One of the trickiest parts of data analysis at the beginning of a project is file management. I have spoken before about how to get it right. Today I discuss the benefits of using a MASTER do file when working with Stata. These principles apply generally to data analysis in other packages, assuming that you work with files of commands rather than point-and-click menus.
In The Workflow of Data Analysis Using Stata, J. Scott Long illustrates the conceptual split between what you need for creating datasets and what you need for analysis when starting a data analysis project. You start with a do file (data01.do) that sets up the data and saves your first dataset. This might mean bringing your data into Stata from an Excel or ASCII file. You might then have a do file (data02.do) that creates items for analysis. Because these items are integral to your analysis, you might save that data as dataset data02.dta. From there, you might start some exploratory statistical analysis, producing do files stat01a.do, stat01b.do, and stat01c.do.
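As a rough sketch of what that first do file might contain, assuming the raw data arrive as an Excel workbook with variable names in the first row (the folder, workbook name, and output name data01.dta are placeholders I made up, not taken from Long's book):

```stata
* data01.do: bring the raw data into Stata and save the first dataset
* NOTE: "raw/survey.xlsx" and "data01.dta" are hypothetical names
clear all

* firstrow treats the first row of the sheet as variable names
import excel using "raw/survey.xlsx", firstrow clear

* replace lets the do file be rerun without a "file already exists" error
save "data01.dta", replace
```

Nothing in this file analyzes anything. Its only job is to turn the raw file into a Stata dataset that later do files can open.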
But then you might make more changes to your dataset, and eventually end up with data03.dta. The wheels can come off pretty quickly if you create variables in analysis files, because analysis files serve a different purpose than data files. For example, let’s say that you got the idea to create a new educational variable, based on an existing variable. Doing so in an analysis file means that sometime in the future, when you decide to use the dataset for another project, you will be forced to 1) remember that you created that awesome new educational variable in an analysis file, 2) find that analysis file (was it stat01a or stat02a?) and 3) execute the commands that created that variable (which are nestled snugly in with other lines of code that ran analyses). All in all, an absolute nightmare. So don’t do that. Create variables in the do files where you are doing data management and item creation.
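To make that concrete, here is a minimal sketch of a data management do file that builds the new educational variable and saves it with the dataset. The variable names educ_years and educ_cat, the cut points, and the input file data01.dta are all assumptions for illustration:

```stata
* data02.do: item creation lives here, not in an analysis file
* NOTE: educ_years, educ_cat and the cut points are hypothetical
use "data01.dta", clear

* collapse years of schooling into three labeled categories
recode educ_years (0/11 = 1 "Less than high school")   ///
                  (12 = 2 "High school")               ///
                  (13/max = 3 "More than high school"), ///
                  generate(educ_cat)
label variable educ_cat "Education, 3 categories"

save "data02.dta", replace
```

An analysis file such as stat01a.do would then begin with use "data02.dta", clear and never create variables of its own.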
It might also be useful to name your files so that they sort in the order they should be run. The best way to do this is with numbered names like 01_datamerge, 02_dataclean, 03_itemcreation. This way, they sit in your project folder in the order they were meant to be run, so you can recreate datasets and analyses.
You could also use a master do file, which runs your do files in the order that you specify.
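A master do file along these lines might look like the sketch below. It reuses the file names mentioned above; the project folder and log file name are hypothetical, so adjust them to your own setup:

```stata
* master.do: run every step of the project in order
* NOTE: the folder path and log name are hypothetical
clear all
set more off
capture log close

cd "C:/projects/myproject"
log using "master.log", replace text

* data management and item creation
do "01_datamerge.do"
do "02_dataclean.do"
do "03_itemcreation.do"

* analysis (these files only read the datasets created above)
do "stat01a.do"
do "stat01b.do"
do "stat01c.do"

log close
```

If one of the do files stops with an error, the master file stops too, which tells you exactly which step needs attention before anything downstream is rerun.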
With all these commands in a do file, you are deliberately imposing structure and order on your research process in a way that allows you to replicate datasets and results. The worst case is a file name that includes the date, like data01_02012016.do, which you then accidentally saved over on Feb 03, 2016. Good for you, you saved your file, but -10 points because the structure you originally imposed (which was to have each day’s work represented and kept up to date with a do file dated that day) is now somewhat less informative.