At the end of the summer, I attended a training at the library conducted by DePaul’s Data Services Librarian Kindra Morelock. The training was on OpenRefine and was weird for me because she was teaching an incredibly basic but necessary data skill using a very powerful tool. She showed us how to use OpenRefine (OR) for cleaning messy data.
Messy data, the kind generated in public documents, are the worst kind. Till now, there hasn’t been a great way to deal with it. Usually it meant sorting through row upon row of data, manually changing values until they sort of match up in a way that doesn’t make you want to vomit. For example, if you have lots of people entering data on NFL sports teams, it is likely that New England Patriots gets entered as: “Pats”, “Patriots”, “NE Pats”, “New England Patriots”, “NEpatriots”, etc. If you want to do any type of meaningful analysis, you have to first start with cleaning up this dataset and getting all of those entries into a uniform value.
This is a totally fine process if your datasets aren’t terribly large. But if you’re dealing with a couple of thousand observations, it can get tedious quick. That is where OpenRefine comes in.
Originally developed by Google, OpenRefine is a tool to manipulate and work with data. While you do so through a Chrome browser window, the data actually live on your machine.
Over the summer, the SSRC conducted a Faculty Needs assessment survey, where faculty were asked about their current research projects and needs. One of the questions asked respondents to report their departmental affiliation in an open text response. Because it was open ended and people refer to their departments in different ways, it was necessary to clean this item up and standardize how departments are named in the dataset. See how these responses varied? In OpenRefine, the window on the left shows the various responses. The blue arrows show some categories and names that need to be renamed.
This window shows how different faculty members refer to their departments. Some use abbreviations, some don’t. Some typed “and” and others used the “&”.
In that window on the left, I can modify the data on the fly. This is particularly useful if you only have a handful of variations. As you modify the data, changing “ENGLISH” to “English”, you can see the number next to the entry change (which reflects the changes and the updates you’ve made to the group. So in this example, when I change ENGLISH to English, the number beside English will increase to 11 and ENGLISH will disappear.
Even more powerful is the Cluster and Edit feature, which will show you a listing of all all the categories that OR thinks should go together. See below- how Sociology, SOCIOLOGY, and sociology all look like they are part of the same category. In cluster and refine, you can not only cluster all of these together, but also change their cell value. If you were so inclined, you could change the label to “Soc. Dept” where it says “Sociology” under New Cell Value.
OR includes some other editing/cleaning features. Additional and unnecessary spaces can cause problems when doing data analysis. Some programs ignore them, some programs can’t and others input an underscore for that space. In OR, you can trim leading and trailing spaces. Or change cases from title case to lower case or upper case.
In other cases, you can work with the values of a column and systematically deal with quirks in how people do data entry. Take for example the following values:
Contreras, Wilson E.
When you think about how names get entered into a dataset, the three examples above are the three most likely you’ll see. Of course, the distribution of these is likely influenced by a lot of extraneous factors, including organizational characteristics. But let’s assume that you have reasonably intelligent people participating in data entry and somehow end up with this mess above. Even for a relatively small dataset of a couple thousand observations, it would take someone a couple of days to standardize all the names. The best approach would be to create three additional columns (for first name, last name, and middle initial) and then to go row by row and manually input that data.
If you use OR though, you can write JSON language script that will do this automatically. Long story short, you basically tell OR:
- Every time there is a comma, to treat everything that comes before it as last name and to put that value into a new cell in column LAST NAME.
- Every time there is a period, to treat everything that comes immediately before it as a middle initial and put that value into new cell in column MIDDLE INITIAL.
- In the absence of a comma and or a period, the text that comes before a space is a first name, put that information into a new cell in the column FIRST NAME.
Because you’re using scripting language (JSON) it is fairly painless to take a couple of passes through a dataset, building out columns and populating cell values in a matter of minutes. Moreover, when you’re using OR, your actions are kept as a running record or script, that you can copy and paste and keep for next time. This means that if you’re working on a project that requires frequent data updates and downloading a new dataset from the same resource, you can use the script OR generates and get results with updated data, without having to go through the *painful* process of manually data cleaning. Essentially, once you’ve done it once, you can reuse that script to clean your dataset.
In the workshop, Kindra shared with the group a cheat sheet of sorts to havk JSON programming language in OR. I have scanned it and included it here: OpenRefine_cheatsheet_KindraMorelock. In all, this was an incredibly useful workshop. While I don’t often have to work with super messy data, I have decided that Open Refine is my new go-to when I do.
Some resources for learning how to work with Open Refine:
- Using OpenRefine (e-book)
- Introduction to OpenRefine on YouTube Part 1, Part 2, and Part 3.