OpenRefine: A Useful Tool for Working with Messy Data

At the end of the summer, I attended a training at the library conducted by DePaul’s Data Services Librarian Kindra Morelock.  The training was on OpenRefine, and it was an unusual session for me: she was teaching an incredibly basic but necessary data skill using a very powerful tool.  She showed us how to use OpenRefine (OR) to clean messy data.

Messy data, the kind generated in public documents, are the worst kind.  Until now, there hasn’t been a great way to deal with them.  Cleaning usually meant sorting through row upon row of data, manually changing values until they sort of match up in a way that doesn’t make you want to vomit.  For example, if you have lots of people entering data on NFL teams, the New England Patriots are likely to get entered as “Pats”, “Patriots”, “NE Pats”, “New England Patriots”, “NEpatriots”, and so on.  If you want to do any kind of meaningful analysis, you first have to clean up the dataset and collapse all of those entries into a single, uniform value.

This is a perfectly fine process if your datasets aren’t terribly large.  But if you’re dealing with a couple of thousand observations, it gets tedious quickly.  That is where OpenRefine comes in.
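To make the problem concrete, here is a minimal sketch (in Python, not something from the workshop) of what that manual standardization amounts to: a hand-built lookup table that maps every variant you happen to catch onto one canonical value.  The team names are just the hypothetical example from above.

```python
# Hand-built lookup table: every variant you notice gets mapped to one canonical value.
variants = {
    "Pats": "New England Patriots",
    "Patriots": "New England Patriots",
    "NE Pats": "New England Patriots",
    "NEpatriots": "New England Patriots",
}

raw_entries = ["Pats", "NE Pats", "New England Patriots", "Bears"]

# Anything not in the table is left alone so it can be reviewed by hand later.
cleaned = [variants.get(entry, entry) for entry in raw_entries]
print(cleaned)
# ['New England Patriots', 'New England Patriots', 'New England Patriots', 'Bears']
```

The catch, of course, is that the table only fixes the variants you have already spotted, and someone still has to build and maintain it by hand.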

Originally developed at Google (as Google Refine), OpenRefine is a free, open-source tool for cleaning and transforming data.  While you work with it through a browser window, the data actually live on your machine.

Over the summer, the SSRC conducted a Faculty Needs Assessment survey, in which faculty were asked about their current research projects and needs.  One of the questions asked respondents to report their departmental affiliation in an open text response.  Because the question was open ended and people refer to their departments in different ways, it was necessary to clean this item up and standardize how departments are named in the dataset.  See how much these responses varied?  In OpenRefine, the window on the left shows the various responses.  The blue arrows point to some categories and names that need to be renamed.

[Screenshot: department responses listed in OpenRefine’s left-hand window]

This window shows how different faculty members refer to their departments.  Some use abbreviations, some don’t.  Some type “and” while others use “&”.

In that window on the left, I can modify the data on the fly, which is particularly useful if there are only a handful of variations.  As you modify the data, changing “ENGLISH” to “English”, the number next to each entry updates to reflect the changes you’ve made to the group.  So in this example, when I change ENGLISH to English, the number beside English increases to 11 and ENGLISH disappears from the list.

[Screenshot: editing a department name in place, with the counts updating]

Even more powerful is the Cluster and Edit feature, which shows you a listing of all the categories that OR thinks belong together.  See below how Sociology, SOCIOLOGY, and sociology all look like they are part of the same category.  In Cluster and Edit, you can not only merge all of these together, but also change their cell value.  If you were so inclined, you could change the label to “Soc. Dept” where it says “Sociology” under New Cell Value.
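Under the hood, OpenRefine’s default clustering method is a key-collision approach: each value is reduced to a normalized “fingerprint,” and values that end up with the same fingerprint are proposed as a cluster.  Here is a rough Python approximation of that idea; it is a simplified sketch of the fingerprinting logic, not OpenRefine’s actual implementation.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Simplified fingerprint key: trim, lowercase, strip punctuation,
    then sort and de-duplicate the remaining tokens."""
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)
    return " ".join(sorted(set(value.split())))

responses = ["Sociology", "SOCIOLOGY", "sociology ", "Dept. of Sociology"]

clusters = defaultdict(list)
for response in responses:
    clusters[fingerprint(response)].append(response)

for key, members in clusters.items():
    print(key, "->", members)
# sociology -> ['Sociology', 'SOCIOLOGY', 'sociology ']
# dept of sociology -> ['Dept. of Sociology']
```

Every value in a proposed cluster can then be rewritten to whatever New Cell Value you choose.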

[Screenshot: the Cluster and Edit dialog grouping Sociology, SOCIOLOGY, and sociology]

OR includes some other editing and cleaning features.  Extra, unnecessary spaces can cause problems when doing data analysis: some programs ignore them, some can’t, and others substitute an underscore for the space.  In OR, you can trim leading and trailing spaces, or change case from title case to lowercase or uppercase.
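In OR these are one-click transforms (and, in its expression language, functions along the lines of value.trim() and value.toTitlecase()); for comparison, here is the same logic spelled out in plain Python with made-up example values.

```python
values = ["  Sociology  ", "ENGLISH", "writing, rhetoric and discourse"]

trimmed = [v.strip() for v in values]   # trim leading and trailing spaces
lowered = [v.lower() for v in trimmed]  # to lowercase
uppered = [v.upper() for v in trimmed]  # to UPPERCASE
titled = [v.title() for v in trimmed]   # to Title Case

print(titled)
# ['Sociology', 'English', 'Writing, Rhetoric And Discourse']
```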

[Screenshot: OpenRefine’s transform menu for trimming whitespace and changing case]

In other cases, you can work with the values of a column and systematically deal with quirks in how people do data entry.  Take, for example, the following values:

Rizzo, Anthony

Kris Bryant

Contreras, Wilson E.

When you think about how names get entered into a dataset, the three formats above are the ones you’re most likely to see.  Of course, their distribution is likely influenced by a lot of extraneous factors, including organizational characteristics.  But let’s assume that you have reasonably intelligent people doing the data entry and you still somehow end up with the mess above.  Even for a relatively small dataset of a couple thousand observations, it would take someone a couple of days to standardize all the names by hand.  The manual approach would be to create three additional columns (for first name, last name, and middle initial) and then go row by row, entering that data yourself.

If you use OR, though, you can write a short expression in its built-in expression language (GREL) that will do this automatically.  Long story short, you basically tell OR:

  1. Every time there is a comma, treat everything that comes before it as the last name and put that value into a new cell in the column LAST NAME.
  2. Every time there is a period, treat everything that comes immediately before it as a middle initial and put that value into a new cell in the column MIDDLE INITIAL.
  3. In the absence of a comma or a period, treat the text that comes before a space as a first name and put that value into a new cell in the column FIRST NAME (a rough sketch of this logic follows the list).
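Here is a minimal Python sketch of those three rules, just to make the logic explicit; inside OR you would accomplish the same thing with column splits or expressions, and this toy function only covers the three formats shown above.

```python
def split_name(raw: str) -> dict:
    """Split a raw name into FIRST NAME, MIDDLE INITIAL, and LAST NAME
    using the three rules above. Illustrative only: it handles the three
    formats in the example, not every way a name can be entered."""
    raw = raw.strip()
    first, middle, last = "", "", ""

    if "," in raw:
        # Rule 1: everything before the comma is the last name.
        last, rest = [part.strip() for part in raw.split(",", 1)]
        remaining = rest.split()
    else:
        # Rule 3: no comma, so the text before the first space is the first name.
        tokens = raw.split()
        first, last = tokens[0], " ".join(tokens[1:])
        remaining = []

    for token in remaining:
        if token.endswith("."):
            # Rule 2: a token ending in a period is the middle initial.
            middle = token.rstrip(".")
        else:
            first = token

    return {"FIRST NAME": first, "MIDDLE INITIAL": middle, "LAST NAME": last}

for name in ["Rizzo, Anthony", "Kris Bryant", "Contreras, Wilson E."]:
    print(split_name(name))
# {'FIRST NAME': 'Anthony', 'MIDDLE INITIAL': '', 'LAST NAME': 'Rizzo'}
# {'FIRST NAME': 'Kris', 'MIDDLE INITIAL': '', 'LAST NAME': 'Bryant'}
# {'FIRST NAME': 'Wilson', 'MIDDLE INITIAL': 'E', 'LAST NAME': 'Contreras'}
```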

[Screenshot: building new name columns in OpenRefine]

Because you’re working with expressions, it is fairly painless to take a couple of passes through a dataset, building out columns and populating cell values in a matter of minutes.  Moreover, as you work, OR keeps a running record of your actions as a script (exported as JSON) that you can copy, paste, and save for next time.  This means that if you’re working on a project that requires frequent data updates, downloading a new dataset from the same source, you can rerun the script OR generated on the updated data without having to go through the *painful* process of manual data cleaning.  Essentially, once you’ve cleaned the data once, you can reuse that script.

In the workshop, Kindra shared with the group a cheat sheet of sorts for hacking the expression language in OR.  I have scanned it and included it here: OpenRefine_cheatsheet_KindraMorelock.  In all, this was an incredibly useful workshop.  While I don’t often have to work with super messy data, I have decided that OpenRefine is my new go-to when I do.

Some resources for learning how to work with OpenRefine:

  1.  http://openrefine.org/
  2. Using OpenRefine (e-book)
  3.  Introduction to OpenRefine on YouTube Part 1, Part 2, and Part 3.

Research at the Year End, 2017

On June 1, the SSRC held its first Research Round Up to commemorate the end of the academic year.  DePaul faculty members who had worked with SSRC staff or resources over the course of the year were invited to present their work.

The event was held in Arts and Letters and was well attended by members of the DePaul community.  After SSRC Director Greg Scott introduced each of the presenters, CDM faculty member Robin Burke gave an update on the Reading Chicago Reading project, an interdisciplinary venture he has been working on over the past year with DePaul English faculty member John Shanahan.  Funded by the National Endowment for the Humanities, Burke and Shanahan started with a well-defined question and problem: is it possible to predict the popularity of One Book, One Chicago selections using library and demographic data?  As the project has advanced, their connections and relationships with other scholars in the DePaul community have allowed them to broaden their interests and start pushing the boundaries of what is possible.  Currently, they are working on text analysis of the One Book selections themselves, as well as of reviews of those books.

Shailja Sharma from International Studies talked about her experience breathing life into a new research area and project.  She described the lengths she went to, cobbling together small grants and relying on Skype interviews, to move her recent book project, Postcolonial Minorities in Britain and France: In the Hyphen of the Nation-State, forward little by little.

Next, Writing, Rhetoric and Discourse faculty member Sarah Read discussed strategies for keeping two separate research agendas going.  In her presentation, she showed a table that laid out her work plan and how she moved each project along, little by little.  For Sarah, maintaining two separate research agendas meant that she had to work on them simultaneously, not one at a time.  In her talk, she discussed the importance of making sure that all of her scholarly activity fit squarely within those agendas.  She also discussed the importance of having a group at DePaul that kept her accountable and productive.  She said that this kept her research on her desk every week, so that when there were breaks in teaching, she was able to spend less time reorienting herself to her research and materials, and more time writing.

Finally, Political Science Assistant Professor Ben Epstein reported on his experience turning his dissertation into a book proposal and how he survived the revise-and-resubmit process before signing the book contract.  One of the biggest issues he grappled with during the revision process was staying true to the spirit of the original work and not letting suggestions from others change the book.  For him, revising came down to three things: 1. Make it better, not different.  2. Agree with a suggestion or defend why you can’t.  3. Don’t underestimate the energy and time it takes to write the response to the editors and reviewers.  He stressed the importance of finding tools that work: some people do better in an analogue environment, writing their to-do lists down, while others do better with an app that helps them manage the process.  He also strongly recommends the book On Writing Well: The Classic Guide to Writing Nonfiction by William Zinsser.

The event closed with a Q&A with the presenters.  In all, it was a great event, with many agreeing that there should be another event in 2018.

Transcription Tools

The Social Science Research Center at DePaul has a micro-lab where researchers (or their graduate students) can access hardware and software to transcribe audio files.  Typically, researchers have used these tools to transcribe interviews and focus groups.  The process is relatively simple: researchers bring their audio files on portable media, which are loaded onto a machine in the micro-lab.  This machine has software called Express Scribe and a foot pedal.  The pedal is used to stop, start, rewind, and fast-forward the audio within Express Scribe, and the speed of the audio playback can be adjusted.  In all, this is a great tool and process for individuals transcribing audio files.  However, it is not without its flaws.  The main one is that it requires users to be in the physical space during business hours.  It also requires that someone spend the time actually typing out the transcription.

In this post, I review two relatively new transcription tools and demonstrate how they might be used to help researchers transcribe spoken language.

The first, oTranscribe, is a web-based transcription tool.  With it, you upload an audio file and control playback from within the web page.  Keep in mind that if a researcher were going to do this on their own (without coming to the SSRC to use our machine and pedal), it would require playing the audio in something like iTunes and typing the text in a text editor (like MS Word).  That’s likely fine if you’re working on a machine with two monitors.  Even so, stopping and restarting the audio file can be quite cumbersome with this approach, even if you have figured out how to use hotkeys and shortcuts.  Remember that hotkeys usually require that you be in the program to use them.  So you’re typing in MS Word, but in order to stop the audio you have to get back to iTunes with the mouse and actually press stop (or click in the iTunes window and use a hotkey to stop the audio file).

oTranscribe allows you to do all of this in the same place.  Even better, when the audio is restarted, it repeats the last bit of where you left off.  This gives you a chance to get your hands in place and makes it much easier to reorient yourself.  In the default setup, the key to stop and start the audio is ESC, but you can change that.  Additionally, the audio can be slowed down quite a lot.  I have demonstrated what the process is like here.

I recorded myself reading the beginning of a chapter in Howard Becker’s Writing for Social Scientists on an iPhone (using the Voice Memos app).  Although it sounds like I might be drunk, I am actually not; I slowed the audio down enough that I could keep up while typing.

Overall, not a terribly onerous process.  I think it beats having to toggle back and forth between different programs.

I also learned about Scribe, a tool that does automatic transcription.  According to Poynter, it was developed by some students working on a school project.  One of the students had to transcribe 12 interviews, and he didn’t want to do it (who does?).  He built a script that uses the Google Speech API to transcribe speech to text.  Based in Europe, the Scribe website asks that a user upload an mp3 and provide an email address.  The cost to have the file transcribed is €0.09 per minute, and as of now there is a limit on how long the audio file can be (80 minutes).  Because the Voice Memos app saves files in MPEG-4 format, I had to convert my audio file to mp3 before it could be uploaded.  Once the transcription was finished, I received an email with a link to my text.
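(For anyone facing the same conversion: here is one way to do it from Python by calling ffmpeg, assuming ffmpeg is installed; the file names below are just placeholders, not the files I actually used.)

```python
import subprocess

# Convert an iPhone Voice Memo (MPEG-4 / .m4a) to mp3 so it can be uploaded.
# Requires ffmpeg on the PATH; file names are placeholders.
subprocess.run(
    ["ffmpeg", "-i", "interview.m4a", "interview.mp3"],
    check=True,
)
```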

Below is the unedited output that I received.  I pasted the text into OneNote so that I could add highlighting and comments.

[Screenshot: Scribe’s unedited output pasted into OneNote, with highlighting and comments]

In all, I am fairly impressed with the output from Scribe.  Obviously, there are some problems with it.  The text is generally right and is organized into paragraphs, though the breaks don’t always fall where they should.  For example, the second paragraph is separated from the first when they should have been kept together.  There were periods at the ends of the paragraphs.  There is also some random capitalization (e.g., “The Chronic”).  Names were capitalized (Kelly and Merten), which I thought was remarkable.  My guess is that mix-ups like chutzpah/hot spot and vaudeville/the wave auto are fairly common with words borrowed from other languages.

Obviously, the text will need a little work.  While I think Scribe works well for interviews, I am not sure how well it would work for focus groups.  The text needs some review and editing, but in the long run I think it would be faster to correct Scribe’s mistakes than to type the transcription manually.  The kicker for me is how cheap it is: at €0.09 per minute, an 80-minute interview comes to about €7.20, or less than $10.00.

I think that both oTranscribe and Scribe lower the barrier to entry for researchers who want to transcribe audio material.

2017 Year End Research Round Table

The SSRC is wrapping up the academic year with a year-end research round table that looks inside the projects and strategies driving the scholarly investigations of four DePaul faculty served by the Social Science Research Center.  Assistant Professor Ben Epstein (Political Science) will discuss his revise-and-resubmit (R&R) process in finishing a manuscript for a book on political communication.  Assistant Professor Sarah Read (Writing, Rhetoric and Discourse) will talk about how she balances two unrelated research agendas.  Assistant Professor Shailja Sharma (International Studies and director of the Refugee and Forced Migration Studies graduate program) will discuss how to lay out a step-by-step research plan.  Finally, CDM Professor Robin Burke (School of Computing) will talk about the new data, tools, and research questions that have come from his current project, Reading Chicago Reading.

The event will take place on Thursday June 1, from 4:30-6:00pm at Arts and Letters #404.  Light refreshments will be served.  Please contact the SSRC at ssrc[AT]depaul.edu for more information.

Through the Glass Darkly

“Hello, darkness, my old friend,” to quote a panelist at the SSRC’s recent event, “Speaking in Light and Dark.” His reference to the opening line of Simon and Garfunkel’s “The Sound of Silence” aptly set the stage for a discussion about light and dark hosted in the late afternoon of January 18 on a stage lit only by natural light coming through the windows of Cortelyou Commons. As the sun set at 4:48 pm and darkness progressively pervaded the room, four DePaul faculty members from different disciplines reflected on how lightness and darkness have informed their work or thinking, either literally or metaphorically.

DePaul’s College of Communication had just begun when Associate Professor Daniel Makagon proposed an addition to the schedule called The City at Night, a class held during the unorthodox hours of 10:00 pm to 1:00 am. To see how people utilized the night, his class visited a social worker, a karaoke expert, a needle exchange site, a CTA routing and operations center, and the Guardian Angels, the self-appointed volunteer safety brigade that once patrolled Chicago subway lines. As an enthusiastic supporter of experiential learning, Daniel fondly recalled one class visit when education became a public event itself. The class was meeting with the Guardian Angels on a subway platform in the Loop when curious onlookers began raising their hands and spontaneously joined in the learning experience themselves. “There was this kind of opening up at night,” he said. He’s still contemplating its meaning.

Daniel has also applied a night/day lens to his research into punk music culture, examining underground performance spaces. Subverting our usual notions of how we use spaces by day and night, these all-ages punk shows often occur in basements, in DIY (do-it-yourself) spaces, during the day. There the basement becomes a “liberatory, temporary, autonomous zone for folks to enact a different kind of economy, a different social experience in terms of how they meet together in the world, and also a different kind of political experience as well, guided by an alternative politics, an alternative economy, to the mainstream music industry as we find it,” he said.

A compilation of night sounds gathered by Daniel’s DePaul students formed an ongoing soundtrack that played throughout the panelists’ presentations. DePaul’s Media Production and Training (MPT) video-taped the event. The results illustrate the significance of light to a technology that depends solely on light to capture and store images.

Field observation has been fundamental to Public Policy Studies Professor Bill Sampson’s academic pursuits. Bill shared with the audience the personal question that has nagged at him throughout his educational and academic life: how was it that he, growing up poor and black in a poor, black neighborhood in Milwaukee, did well in school while others sharing the same outward circumstances did not? The explanation his high school teachers gave him — that he was “an exception” — didn’t sit well with him. He has reached some conclusions based on his analysis of observational data that students in his classes have gathered over the years, chronicling the lives of poor black and Latino families to compare how the children of those families performed in school.

It was not neighborhood, school, or teachers that most affected the results, he found. That leaves him pessimistic about how much of a difference current education policies that shower resources on schools and teachers will ever make. “What mattered most were specific things about the home environment. Kids who did well in school lived in quiet, orderly, structured homes, which is difficult to maintain when you’re poor,” he said. Those students had chores at home, took part in extracurricular activities, were internally controlled, and displayed high self-esteem. All had parents or guardians who showed that they valued education, often by participating in their children’s homework even if they couldn’t do the work themselves.

Acknowledging that “we can’t control families” and that not all families even want the best for their children, he asked: “How do we take what we’ve learned and give it to the families that want it?” Assuming that teachers and schools are doing what they should (not a given, he noted), “for the parents who are willing, we can make a difference.”

Steve Harp, associate professor of Art, Media, and Design, approached lightness and darkness more formally, but also subjectively. Against a backdrop projection of his own striking black and white nighttime photos (including the image accompanying this post), Steve presented what he termed a short “pseudo-theoretical paper” in which he explicated the word dream from the literary and psychological perspectives of a variety of writers. Noting the seeming similarity between the words Träume (dreams in German) and trauma (derived from the Greek word for “wound”), he said it’s hard to believe they’re not related etymologically “while linked in so many ways conceptually and experientially.”

Considering any distinction between dream and nightmare artificial, he discussed the trauma of the nightmare as the experience of waking into consciousness, linking that traumatic aspect of awakening to the act of departure. Inviting the audience to think of dreams spatially, as a path into darkness, he suggested that dreams might be regarded not as wish fulfillment, but as the tension between arrival (or our visions of arrival) and departure. His last words were a lyric from the late Leonard Cohen: “There’s a crack in everything. That’s how the light gets in.”

The panel concluded with Assistant Professor of Philosophy Peter Steeves’ mind-bendingly succinct but sobering 15-step timeline of the birth and death of light. His only prop, a DIY “power point” flashlight beam trained on sheets of white paper carrying dates, effectively underlined his observation that light’s lifespan is a relative blip within the sprawling chronology of the universe. In increasingly bad news, he pegged the lifespan of humans on earth at a mere million years and forecast that our sun will end 6.5 billion years from now, when it swallows up the earth. A hundred trillion years from now, all stars — the manufacturers of light — will have been extinguished. Our own sun, a “Goldilocks” among stars (not big, not small), will succumb with one of the less remarkable star-death displays, he said.

Peter’s interest in the topic is rooted in “the overlap of philosophy and physics,” his twin loves, “and light plays a major role in that,” he said. “Light is not important in any fundamental way,” he concluded. “So I sometimes think, why do we make it so important? Why do we think it’s all about life and why do we think it’s all about light? That’s something I’ve been thinking about recently.”

SSRC Event: Speaking in Light and Dark

On January 18, 2017, the Social Science Research Center is hosting “Speaking in Light and Dark”, a discussion among four DePaul scholars.  The event, which is free and open to the public, features Steve Harp (Art, Media, and Design), Daniel Makagon (Communication), Bill Sampson (Public Policy Studies), and H. Peter Steeves (Humanities Center).  The discussion will focus on notions of lightness and darkness and the ways in which both inform the presenters’ work, figuratively and literally.


A reception will follow the discussion, which will take place from 4:00-6:00 pm in Cortelyou Commons (2324 N. Fremont Ave, Chicago, IL).  Individuals interested in attending the event should RSVP by sending an email to SSRC@depaul.edu.

Good Work Does Get Noticed

Congratulations to Sarah Read, assistant professor in the Department of Writing, Rhetoric, and Discourse, for the award she recently received for a paper she delivered at a professional technical communication conference at the University of Texas in Austin. The James M. Lufkin Award for Best International Professional Communication Conference Paper is given annually by the IEEE Professional Communication Society in recognition of work that supports their mission to promote effective communication within scientific, engineering and technical environments.

In the paper, Sarah and her co-author and fellow award-winner Michael E. Papka propose a more comprehensive model of the document cycling process to capture significant activities not normally found in conventional project management plans. The paper emerged from an ethnographic study she conducted as a guest faculty researcher at Argonne National Laboratory, where she analyzed the technical documentation and reporting processes that went into creating the facility’s 2014 annual report.

Operated by The University of Chicago Argonne LLC for the U.S. Department of Energy, the research lab and its high-powered supercomputer are used by scientists from academia and industry. Each year it produces a lengthy, polished report for the funder, “an extended statement about how the facility has met or exceeded the performance metrics set by the funder based on the previous review process,” as explained in the paper.

Sarah’s interviews with staff and her observations of the lab’s operations revealed hidden activities involved in gathering and generating data that indirectly fed into the annual report. This data-gathering had become incorporated into regular operational activities and fell outside the designated time frames for generating reportable information. These submerged activities not only informed the report but constituted a creative endeavor in their own right. (See a previous SSRC blog about Sarah’s project in which she vividly described the efforts demanded of staff in learning how to “write down the machine.”)

They did not arise sui generis. Papka, a senior scientist at Argonne, is the director of the Leadership Computing Center and an associate professor of computer science at Northern Illinois University. In 2012 he revised the annual report document creation process “from an annual last-minute all-out effort to a well-managed, well-paced drafting and revision process,” according to the paper. Reporting became ongoing and rotating, and it cut across multiple divisions of the facility. Crucially, it entailed the development of processes “to more efficiently and accurately generate” reportable performance data.

The success of those efforts leads the paper’s authors to raise some provocative questions, including whether the staff time and effort required to write an annual report—a full-color, printed and designed document totaling 126 pages in 2014—is warranted when reportable information becomes readily accessible and available. “It is interesting to reflect upon how the imperative to develop a more accurate and efficient annual operational assessment reporting process ended up building processes at the facility…that could make the annually produced report unnecessary,” they point out. And they ask teachers and students of professional and technical writing to recognize and understand that the means of producing reportable information for the periodic reports so common to large organizations “have as much if not more value for the organization than the finished reporting document.”

The SSRC likes to think that our own support of Sarah’s research contributed to this project, from her use of ATLAS.ti, the qualitative data analysis application available in our computer lab, to analyze her data, to her ongoing participation in the SSRC’s Accountability Group in which tenure-track LAS faculty meet twice a month to set and discuss self-imposed professional and research goals. She worked on the paper during spring break at the off-campus faculty research retreat in Wisconsin that the SSRC organized to offer faculty designated writing time away from usual distractions. Sarah plans to develop the epistemic dimensions of the model in another paper.