A few nights ago, a rather strange bout of insomnia had me returning to J. Scott Long’s classic Stata Press book, “The Workflow of Data Analysis Using Stata.” Unfortunately, it was not required reading during my doctoral program. I say unfortunately because I believe it should be part of The Canon for students getting started with statistical analysis in Stata. The reason is simple, and it’s something you might recognize from your boy Aristotle: “We are what we repeatedly do. Excellence, then, is not an act, but a habit.”
What do I mean? Your habits are developed in what you do often. And when you are in grad school, just getting your statistical analysis wheels under you, it’s easy to fall into bad habits and shortcuts. You’re living in the moment, just trying to survive, not fretting too much about tomorrow, thinking, “This is just a seminar paper, no need to take it too seriously; I am not going to try to publish these results.” Next thing you know, you’re a sloppy researcher, inattentive to detail and unable to reproduce results consistently.
J. Scott Long’s book is a step-by-step exercise in the HOW of practicing good research workflow, but more importantly, the WHY. Allow me to share some of the more important points below.
1) Acronyms Are Where It’s At
Long makes the case for developing an acronym that you use to refer to each project. The acronym is important because you will include it in a lot of files. In this age, where we work on multiple machines (office machine, home machine) and collaborate by sending do files and such, including your project’s acronym in file names allows you and your colleagues to find files easily across locations. This is VITAL for collaborative research, but also for working with something like Stata or SPSS, where analyses are saved as executable commands/lines of code in a .do file that is separate from the data file.
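As a sketch of what this looks like in practice (the acronym and all file names here are hypothetical, not examples from Long’s book), a do-file might carry the project acronym both in its name and in a comment header, so a simple file search for the acronym turns up everything belonging to the project on any machine:

```stata
// wfh-data01-merge.do
// Project acronym: wfh (a made-up "work from home" study)
// Task: merge the survey extract with the county-level file
capture log close
log using wfh-data01-merge, replace text

use wfh-survey2019.dta, clear
merge m:1 county using wfh-county.dta
save wfh-merged01.dta, replace

log close
```

Note that the acronym prefixes the do-file, the log, and every .dta file it touches, so nothing from this project can hide among files from another one.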
2) Distinguish Between Files by Their Functions
Typically, research projects require a fair amount of data massage/manipulation/cleaning before analyses take place. Long notes that it’s best to keep these functions in separate files, because you don’t want to be digging through multiple analysis files in order to figure out how a dataset was prepared.
Let’s say that you have moved on from the data prep stage to analysis, but decide, on the fly, to recode a key variable. Maybe something like education, where you’re changing what happens to people with 9–11 years of education: you’ve decided to bust those down into one larger category of “Less than High School Degree.” Because you’re like me (and slightly prone to laziness), you might be inclined to just shove that code into the logistic regression .do file you are working on (which is only one of the 9 analysis .do files you have for this project). Big mistake, friend. Big mistake.
Because six months down the road, when you decide to let your dependent variable be free and move to a multinomial logit model, you’re going to have problems with education: you’re not even using your logit .do file anymore, and you can’t quite remember what it was that you did with education that one time.
So don’t do that: keep those lines of code in your “data cleaning/massage” files. That way you don’t have to dig manically through a ton of files on your machine. And it isn’t hard to do. Stata allows multiple .do files to be open simultaneously, so a smart researcher might keep the appropriate data cleaning .do file open, just to have it there in the event that she wants to let a variable go rogue.
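To make that concrete, here is a sketch of the kind of recode that belongs in the data cleaning file rather than the analysis file (the variable name, cutpoints, and file name are my own invention, not from Long’s book):

```stata
// lives in the data cleaning file (e.g., proj-data02-recodes.do),
// NOT buried inside the logit analysis file
// bust 9-11 years of schooling down into "Less than HS degree"
recode educ (0/11  = 1 "Less than HS degree") ///
            (12    = 2 "HS degree")           ///
            (13/15 = 3 "Some college")        ///
            (16/max = 4 "College degree+"), gen(educ4)
label variable educ4 "Education in 4 categories"
tab educ educ4, missing   // always eyeball a recode before using it
```

Because the recode lives in one place and generates a new variable (educ4) instead of overwriting educ, every analysis file — logit, mlogit, whatever comes next — gets the same education measure, and Future You can see exactly how it was built.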
3) Name. That. File.
Long also advocates for sensible and strategic naming conventions when it comes to naming .dta and .do files. Over sixty percent of researchers keep their files in a jumble of non-uniformly named files whose order and purpose are not apparent, according to statistics I just fabricated for the purpose of this blog post. They think they are living the hygienic data life just because they have their data and .do files backed up somewhere in “the cloud.”
He suggests differentiating between data files (those used to set up your data) and statistical files (those used for conducting analyses). But he goes further: these files should be numbered in the order of their use, along with an indication of what they are used for. The files data01-merge.do and data02-addvar.do sort really nicely in their project folder, because data01 will appear before data02 (when you sort on file name, obviously).
Think about the alternative here: the way you currently do things, which is maybe to include the date of the last time you worked with the file. Who does that help? No one. Definitely not you, six months from now, when you are trying to address the overly critical constructive comments from Reviewer 1 on your Magnum Opus. So save Future You some heartache and name your files in a way that keeps them in order.
Long suggests establishing a template for how you name .do files. One that works for him combines project, task, step, description, and version. So a .do file named els-clean04-CheckLabelsV2.do is the second version (V2) of the CheckLabels file, which is the fourth step (04) of data cleaning (clean) on this els project.
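Put together, a project folder following this template might sort like so (these particular file names are invented for illustration; only the els acronym comes from Long’s example):

```
els-clean01-MissingData.do
els-clean02-Recodes.do
els-clean04-CheckLabelsV2.do
els-stat01-Descriptives.do
els-stat02-Logit.do
```

The task prefix groups the cleaning files apart from the analysis files, and the step numbers keep each group in the order it should be run, with no remembering required.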
It is never too late to start living that hygienic data life. Take the five seconds to name your files like an adult. It will save you hours later.
Don’t be like cable Jaleel White.