Stats and Methods Mini-Workshops

The Social Science Research Center will present a series of short statistics and methods workshops, beginning in February 2017.  Senior Research Methodologist Jessica Bishop-Royse will present on topics of interest to the DePaul Research Community.  The first of these workshops will be on Stata File and Data Management.


In this session, Jessica will discuss various methods for getting data into Stata, as well as proper file management in order to reproduce results for publication.  This workshop will take place at noon on Thursday February 23, 2017 in the conference room in Suite 3100 of 990 W. Fullerton.


The Case for Using a Master Do File

One of the trickiest parts of data analysis at the beginning of a project is file management.  I have spoken before about how to get it right.  Today I discuss the benefits of utilizing a MASTER do file when working with Stata.  These principles apply generally to doing data analysis in other packages, as long as you are working with files of executable commands.

In The Workflow of Data Analysis Using Stata, J. Scott Long illustrates what is happening conceptually between what you need for analysis and what you need for creating datasets when starting a data analysis project.  You start with a do file (data01.do) that sets up the data and saves it as your first dataset (data01.dta).  This might be bringing your data into Stata from an Excel or ASCII file.  You might then have a do file (data02.do) that creates items for analysis.  Because these items are integral to your analysis, you might save the result as dataset data02.dta.  From there, you might start some exploratory statistical analysis, producing do files stat01a.do, stat01b.do, and stat01c.do.
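To make this concrete, here is a minimal sketch of what a file like data01.do might contain; the raw file name (rawdata.xlsx) is invented for illustration:

* data01.do: bring the raw data into Stata and save the first dataset
version 14
clear all
import excel using "rawdata.xlsx", firstrow clear
save "data01.dta", replace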

But then you might make more changes to your dataset, and eventually end up with data03.dta.  The wheels can come off pretty quickly if you create variables in analysis files, because analysis files are used for different purposes than data files.  For example, let's say that you got the idea to create a new educational variable, based on a current variable.  Doing so in an analysis file means that sometime in the future, when you decide to use the dataset for another project, you will be forced to 1) remember that you created that awesome new educational variable in an analysis file, 2) find that analysis file (was it stat01a or stat02a?), and 3) execute the commands that created that variable (which are nestled snugly in with the lines of code that ran your analyses).  All in all, an absolute nightmare.  So don't do that.  Create variables in the do files where you are doing data management and item creation.
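For example, a sketch of what item creation in data02.do might look like; the variable names (educyrs, hsgrad) are hypothetical:

* data02.do: create items for analysis, starting from the saved dataset
use "data01.dta", clear

* create the new educational variable here, not in an analysis file
generate hsgrad = (educyrs >= 12) if !missing(educyrs)
label variable hsgrad "Completed high school"

save "data02.dta", replace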


It might also be useful to number your file names, so that they sort in the order they should be run.  The best way to do this is with names like 01_datamerge, 02_dataclean, and 03_itemcreation.  This way, they sit in your project folder in the order they were meant to run, so you can recreate datasets and analyses.

You could also utilize a master do file, which will run the do files in the order that you specify.  Like so:

do data01.do
do data02.do
do data03.do

With all these commands in a do file, you are deliberately imposing structure and order on your research process in a way that allows you to replicate datasets and results.  The worst is when you are trying to deal with file names that include the date, like data01_02012016.do, but which you accidentally saved over on Feb 03, 2016.  Good for you, you saved your file, but -10 points because the structure you originally imposed (which was to have each day's work represented and kept up to date with a do file dated that day) is now somewhat less informative.
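A slightly fuller master do file might also clear memory and keep a log, so that a single run reproduces everything; the log file name here is just an example:

* master.do: run the project's do files in order
version 14
clear all
capture log close
log using "project_master.log", replace

do data01.do
do data02.do
do data03.do

log close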

Stata’s datasignature command

As we have discussed before, Stata file management can be tricky.  There is the incessant and iterative updating of the files.  Different kinds of files do different things… data cleaning, data management, item creation, descriptive analyses, regression analyses.  And then there is version management.


There is a handy little trick in Stata called datasignature.  Long story short, datasignature protects the integrity of your data.  When the command is executed, Stata generates a signature string based on characteristics (checksums) that describe your dataset, including the number of cases, the number of variables, the names and order of the variables, and the values of the variables.

The next time you load the dataset, you can use the datasignature confirm command, and Stata will report whether or not the data have changed since you last signed the dataset.  If they haven't changed, Stata will report "data unchanged since" followed by the date the signature was set.  If your data have changed, datasignature confirm will instead return an error telling you the data have changed since that date.
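In practice, the workflow looks something like this; the dataset name is just carried over from the earlier example:

* after cleaning the data, store a signature in the dataset and re-save it
use "data02.dta", clear
datasignature set
save "data02.dta", replace

* later (or on a collaborator's machine), reload and verify nothing changed
use "data02.dta", clear
datasignature confirm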

Why might this be important?

This might be crucial for teams of researchers collaborating on a large analysis project, particularly if multiple people are working with the cleaning, management, and analysis files, and if those people don't all have similar levels of concern for hygienic data management.  It can become a problem if someone comes back to a dataset without realizing it has changed.  The datasignature command can help eliminate this problem.

Smooth Stata File Management

A few nights ago, a rather strange bout of insomnia had me returning to J. Scott Long's classic Stata Press book, "The Workflow of Data Analysis Using Stata."  Unfortunately, it was not required reading for me during my doctoral program.  I say unfortunately because I believe that it should be part of The Canon for students getting started with statistical analysis in Stata.  The reason is simple, something you might recognize from your boy Aristotle: "We are what we repeatedly do.  Excellence, then, is not an act, but a habit."

What do I mean?  Your habits are developed in what you do often.  And when you are in grad school and just getting your statistical analysis wheels, it's easy to fall into bad habits by using shortcuts.  You're living in the moment, just trying to survive, not fretting too much about tomorrow, thinking, "This is just a seminar paper, no need to take this too seriously, I am not going to try to publish these results."  Next thing you know, you're a sloppy researcher, inattentive to detail and unable to reproduce results consistently.

J. Scott Long’s book is a step-by-step exercise in the HOW of practicing good research workflow, but more importantly, the WHY.  Allow me to share some of the more important points below.

1) Acronyms Are Where It's At

Long makes the case for developing an acronym that you use to refer to each project.  The acronym is important because you will include it in a lot of files.  In this age, where we work on multiple machines (office machine, home machine) and where we collaborate by sending do files and such, including the acronym for your project allows you and colleagues to find files easily across locations.  This is VITAL for collaborative research, but also for working with something like Stata or SPSS, where analyses are saved as executable commands/lines of code in a syntax file that is separate from the data file.

2) Distinguish Between Files by Their Functions

Typically, research projects require a fair amount of data massage/manipulation/cleaning before analyses take place.  Long notes that it's best to keep these functions separate because you don't want to be digging through multiple files in order to prepare a dataset for analysis.

Let's say that you have moved on from the data prep stage to analysis, but decide to recode a key variable on the fly.  Maybe something like education, where you're changing what happens to people with 9-11 years of education.  You've decided to bust those down into one larger category of "Less than High School Degree".  Because you're like me (and slightly prone to laziness), you might be inclined to just shove that code into the logistic regression analysis .do file you are working on (which is only one of the 9 analysis .do files you have for this project).  Big mistake, friend.  Big mistake.

Because 6 months down the road, when you decide to let your dependent variable be free and work with a multinomial logit model, now you're going to have problems with education, because you're not even using your logit .do file anymore.  But you can't quite remember what it was that you did with education that one time.

So don't do that: keep those lines of code in your "data cleaning/massage" files.  That way you don't have to dig manically through a ton of files on your machine.  It is also not that hard to do this.  Stata allows multiple .do files to be open simultaneously.  So a smart researcher might keep the appropriate data cleaning .do file open just to have it, in the event that she wants to let a variable go rogue.
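As a sketch, the education recode from above might live in the data cleaning file like this; the variable name educyrs is invented for illustration:

* keep this in the data cleaning .do file, not in an analysis .do file
* collapse 9-11 years of education into "Less than High School Degree"
recode educyrs (0/11 = 1 "Less than High School Degree") ///
    (12 = 2 "High School Degree") ///
    (13/max = 3 "More than High School"), gen(educ3)
label variable educ3 "Educational attainment, 3 categories"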

3) Name.  That.  File.

Long also advocates for sensible and strategic naming conventions when it comes to naming .dta and .do files.  Over sixty percent of researchers keep their files in a jumble of non-uniformly named files whose order and purpose are not apparent, according to statistics I just fabricated for the purpose of this blog post.  They think they are living the hygienic data life just because they have their data and .do files backed up somewhere in "the cloud."

He suggests differentiating between data files (those used to set up your data) and statistical files (those used for conducting analyses).  But he goes further: these files should be numbered in the order of their use, along with what they are used for.  The files data01-merge.do and data02-addvar.do sort really nicely in their project folder, because data01 will appear before data02 (when you sort by file name, obviously).

Think about the alternative here: the way you currently do things, which is maybe to include the date of the last time you worked with the file.  Who does that help?  No one.  Definitely not you six months from now, when you are trying to address the overly critical constructive comments from Reviewer 1 on your Magnum Opus.  So, save Future You some heartache and name your files in a way that keeps them in order.

Long suggests establishing a template for how you name .do files.  One that works for him is project-task-step-Version-description.  So, a .do file named els-clean04-CheckLabelsV2.do is the 2nd version of the CheckLabels do file, which is the fourth file used for data cleaning on this els project.

It is never too late to start living that hygienic data life.  Take the five seconds it takes to name your files like an adult.  It will save you hours later.


Don’t be like cable Jaleel White.

Video Resource for Learning Stata

In my forays into YouTube for help on a command earlier this month, I stumbled across a great set of videos for working with data in Stata.  The videos are done by Alan Neustadtl, who teaches the class "Statistical Programming Using Stata" at the University of Maryland.  He uses data from the General Social Survey to walk through the steps of examining the data, creating/recoding variables, and cleaning data.

Examining a Dataset and Creating and Recoding Variables

Creating New Variables Using Stata

Cleaning Data in Stata

Creating Additive Indices Using Stata

The videos are shortish, like 15-30 minutes each.  They could be a great tool for faculty members who are working with grad students on projects, but don't have time to sit down and walk them through the basics of Stata.