OpenRefine: A Useful Tool for Working with Messy Data

At the end of the summer, I attended a training at the library conducted by DePaul’s Data Services Librarian Kindra Morelock.  The training was on OpenRefine and was weird for me because she was teaching an incredibly basic but necessary data skill using a very powerful tool.  She showed us how to use OpenRefine (OR) for cleaning messy data.

Messy data, the kind generated in public documents, are the worst kind.  Till now, there hasn’t been a great way to deal with it.  Usually it meant sorting through row upon row of data, manually changing values until they sort of match up in a way that doesn’t make you want to vomit.  For example, if you have lots of people entering data on NFL sports teams, it is likely that New England Patriots gets entered as: “Pats”, “Patriots”, “NE Pats”, “New England Patriots”, “NEpatriots”, etc.  If you want to do any type of meaningful analysis, you have to first start with cleaning up this dataset and getting all of those entries into a uniform value.

This is a totally fine process if your datasets aren’t terribly large.  But if you’re dealing with a couple of thousand observations, it can get tedious quick.  That is where OpenRefine comes in.

Formerly known as Google Refine, OpenRefine is a tool to manipulate and work with data.  While you do so through a web browser window, the data actually live on your machine.

Over the summer, the SSRC conducted a Faculty Needs assessment survey, where faculty were asked about their current research projects and needs.  One of the questions asked respondents to report their departmental affiliation in an open text response.  Because it was open ended and people refer to their departments in different ways, it was necessary to clean this item up and standardize how departments are named in the dataset.  See how these responses varied? In OpenRefine, the window on the left shows the various responses.  The blue arrows show some categories and names that need to be renamed.


This window shows how different faculty members refer to their departments.  Some use abbreviations, some don’t.  Some typed “and” and others used the “&”.

In that window on the left, I can modify the data on the fly.  This is particularly useful if you only have a handful of variations.  As you modify the data, changing “ENGLISH” to “English”, you can see the number next to the entry change (which reflects the changes and updates you’ve made to the group).  So in this example, when I change ENGLISH to English, the number beside English will increase to 11 and ENGLISH will disappear.


Even more powerful is the Cluster and Edit feature, which will show you a listing of all the categories that OR thinks should go together.  Notice how Sociology, SOCIOLOGY, and sociology all look like they are part of the same category.  In Cluster and Edit, you can not only cluster all of these together, but also change their cell value.  If you were so inclined, you could change the label to “Soc. Dept” where it says “Sociology” under New Cell Value.
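For anyone curious about what clustering is doing, here is a rough Python sketch of the general idea: build a normalized "key" for each response and group the responses whose keys collide.  This is my own simplification for illustration, not OR's actual code, and the function names (`fingerprint`, `cluster`, `apply_new_value`) are made up for this example.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a simplified grouping key (an assumption, not OR's exact rules)."""
    # Fold accented characters to plain ASCII, then trim and lowercase.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii").strip().lower()
    # Drop punctuation, then sort the unique words so word order is ignored.
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Return groups of distinct spellings that share the same key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


def apply_new_value(values, group, new_value):
    """Replace every member of one cluster with a single new cell value."""
    members = set(group)
    return [new_value if value in members else value for value in values]


responses = ["Sociology", "SOCIOLOGY", "sociology ", "English"]
clusters = cluster(responses)
cleaned = apply_new_value(responses, clusters[0], "Sociology")
```

Note that a key like this would not treat “Art & Design” and “Art and Design” as the same department, since the word “and” survives in one key and not the other.  Those cases still need a manual edit or a different matching method.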


OR includes some other editing/cleaning features.  Additional and unnecessary spaces can cause problems when doing data analysis.  Some programs ignore them, some programs can’t, and others substitute an underscore for each space.  In OR, you can trim leading and trailing spaces.  You can also change case, for example from title case to lower case or upper case.
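If you ever need the same kind of cleanup outside of OR, the idea is easy to sketch in Python.  This is only an illustration under my own assumptions: the helper name `tidy` is hypothetical, and it also collapses repeated spaces inside a value, which goes a step beyond trimming the ends.

```python
def tidy(value: str, case: str = "title") -> str:
    """Trim and collapse whitespace, then apply the requested case."""
    # split() with no arguments drops leading, trailing, and repeated spaces.
    cleaned = " ".join(value.split())
    if case == "title":
        return cleaned.title()
    if case == "lower":
        return cleaned.lower()
    if case == "upper":
        return cleaned.upper()
    return cleaned  # Unknown case option: leave the casing alone.


departments = ["  political   science ", "ENGLISH", " English "]
tidied = [tidy(name) for name in departments]
```

One caveat: Python's `str.title()` capitalizes the letter after an apostrophe (so “women's” becomes “Women'S”), so check the results before trusting them.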

In other cases, you can work with the values of a column and systematically deal with quirks in how people do data entry.  Take for example the following values:

Rizzo, Anthony

Kris Bryant

Contreras, Wilson E.

When you think about how names get entered into a dataset, the three examples above are the three most likely you’ll see.  Of course, the distribution of these is likely influenced by a lot of extraneous factors, including organizational characteristics.  But let’s assume that you have reasonably intelligent people participating in data entry and somehow end up with this mess above.  Even for a relatively small dataset of a couple thousand observations, it would take someone a couple of days to standardize all the names.  The best approach would be to create three additional columns (for first name, last name, and middle initial) and then to go row by row and manually input that data.

If you use OR though, you can write a short script in GREL (General Refine Expression Language), OR’s expression language, that will do this automatically.  Long story short, you basically tell OR:

  1. Every time there is a comma, treat everything that comes before it as the last name and put that value into a new cell in the column LAST NAME.
  2. Every time there is a period, treat whatever comes immediately before it as a middle initial and put that value into a new cell in the column MIDDLE INITIAL.
  3. In the absence of a comma or a period, treat the text that comes before a space as the first name and put that value into a new cell in the column FIRST NAME.
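To make those three rules concrete, here is a minimal sketch of the same logic in Python rather than in OR's own expression language.  The function name `split_name` and the dictionary keys are hypothetical, and I am assuming that when there is no comma, whatever follows the first name is the last name (the rules above don't spell that out).

```python
def split_name(raw: str) -> dict:
    """Split one name cell into first, middle-initial, and last-name parts."""
    first = middle = last = ""
    text = raw.strip()

    # Rule 1: everything before a comma is the last name.
    if "," in text:
        last, _, rest = text.partition(",")
        last = last.strip()
    else:
        rest = text

    # Rule 2: a token ending in a period is treated as the middle initial.
    remaining = []
    for token in rest.split():
        if token.endswith(".") and not middle:
            middle = token.rstrip(".")
        else:
            remaining.append(token)

    # Rule 3: the text before the first space is the first name.
    if remaining:
        first = remaining[0]
        # Assumption: with no comma, the leftover words are the last name.
        if not last and len(remaining) > 1:
            last = " ".join(remaining[1:])

    return {"first": first, "middle": middle, "last": last}


names = ["Rizzo, Anthony", "Kris Bryant", "Contreras, Wilson E."]
rows = [split_name(name) for name in names]
```

Each entry in `rows` holds the three pieces that would go into the FIRST NAME, MIDDLE INITIAL, and LAST NAME columns.  Real name data has plenty of cases this sketch ignores (suffixes like “Jr.”, for instance), so treat it as a starting point.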

Because you’re using a scripting language (GREL), it is fairly painless to take a couple of passes through a dataset, building out columns and populating cell values in a matter of minutes.  Moreover, when you’re using OR, your actions are kept as a running record or script (which OR exports as JSON) that you can copy and paste and keep for next time.  This means that if you’re working on a project that requires frequent data updates and downloading a new dataset from the same resource, you can use the script OR generates and get results with updated data, without having to go through the *painful* process of manual data cleaning.  Essentially, once you’ve done it once, you can reuse that script to clean your dataset.

In the workshop, Kindra shared with the group a cheat sheet of sorts for hacking OR’s scripting language.  I have scanned it and included it here: OpenRefine_cheatsheet_KindraMorelock.  In all, this was an incredibly useful workshop.  While I don’t often have to work with super messy data, I have decided that OpenRefine is my new go-to when I do.

Some resources for learning how to work with OpenRefine:

  1. http://openrefine.org/
  2. Using OpenRefine (e-book)
  3. Introduction to OpenRefine on YouTube: Part 1, Part 2, and Part 3.

Through the Glass Darkly

“Hello, darkness, my old friend,” to quote a panelist at the SSRC’s recent event, “Speaking in Light and Dark.” His reference to the opening line of Simon and Garfunkel’s The Sound of Silence aptly set the stage for a discussion about light and dark hosted in the late afternoon of January 18 on a stage lit only by natural light coming through the windows of Cortelyou Commons. As the sun set at 4:48 pm and darkness progressively pervaded the room, four DePaul faculty members from different disciplines reflected on how lightness and darkness have informed their work or thinking, either literally or metaphorically.

DePaul’s College of Communication had just begun when Associate Professor Daniel Makagon proposed an addition to the schedule called The City at Night, a class held during the unorthodox hours of 10:00 pm to 1:00 am. To see how people utilized the night, his class visited a social worker, a karaoke expert, a needle exchange site, a CTA routing and operations center, and the Guardian Angels, the self-appointed, volunteer safety brigade that once patrolled Chicago subway lines. As an enthusiastic supporter of experiential learning, Daniel fondly recalled one class visit when education became a public event itself. The class was meeting with the Guardian Angels on a subway platform in the Loop when curious onlookers began raising their hands and spontaneously joined in the learning experience themselves. “There was this kind of opening up at night,” he said. He’s still contemplating its meaning.

Daniel has also applied a night/day lens to his research into the punk music culture to examine underground performance spaces. Subverting our usual notions of how we use spaces by day and night, these all-age punk shows often occur in basements in DIY (do-it-yourself) spaces, during the day. There the basement space becomes a “liberatory, temporary, autonomous zone for folks to enact a different kind of economy, a different social experience in terms of how they meet together in the world, and also a different kind of political experience as well, guided by an alternative politics, an alternative economy, to the mainstream music industry as we find it,” he said.

A compilation of night sounds gathered by Daniel’s DePaul students formed an ongoing soundtrack that played throughout the panelists’ presentations. DePaul’s Media Production and Training (MPT) video-taped the event. The results illustrate the significance of light to a technology that depends solely on light to capture and store images.

Field observation has been fundamental to Public Policy Studies Professor Bill Sampson’s academic pursuits. Bill shared with the audience the personal question that has nagged at him throughout his educational and academic life. How was it that he, growing up poor and black in a poor, black neighborhood in Milwaukee did well in school while others sharing the same outward circumstances did not? The explanation his high school teachers gave him — that he was “an exception” — didn’t sit well with him. He has reached some conclusions based on his analysis of observational data students in his classes have gathered over the years, chronicling the lives of poor black and Latino families for comparisons of how the children of those families performed in school.

Not neighborhood, not school, not teachers most affect the results, he found. That leaves him pessimistic about how much of a difference current education policies that shower resources on schools and teachers will ever make. “What mattered most were specific things about the home environment. Kids who did well in school lived in quiet, orderly, structured homes, which is difficult to maintain when you’re poor,” he said. Those students had chores at home, took part in extracurricular activities, were internally controlled, and displayed high self-esteem. All had parents or guardians who showed that they valued education, often by participating in their children’s homework even if they couldn’t do the work themselves.

Acknowledging that “we can’t control families” and that not all families even want the best for their children, he asked: “How do we take what we’ve learned and give it to the families that want it?” Assuming that teachers and schools are doing what they should (not a given, he noted), “for the parents who are willing, we can make a difference.”

Steve Harp, associate professor of Art, Media, and Design, approached lightness and darkness more formally, but also subjectively. Against a backdrop projection of his own striking, black and white nighttime photos (including the image accompanying this post), Steve presented what he termed a short “pseudo theoretical paper” in which he explicated the word dream from the literary and psychological perspectives of a variety of writers. Noting the seeming similarity between the words Träume (“dreams” in German) and trauma (derived from the Greek word for “wound”), he said it’s hard to believe they’re not related etymologically “while linked in so many ways conceptually and experientially.”

Considering any distinction between dream and nightmare as artificial, he discussed the trauma of the nightmare as the experience of waking into consciousness. He linked the traumatic aspect of awakening to the act of departure, or awakening. Inviting the audience to think of dreams spatially, as a path into darkness, he suggested that dreams might be regarded not as wish fulfillment, but as the tension between arrival (or our visions of arrival) and departure. His last words were a lyric from the late Leonard Cohen: “There’s a crack in everything. That’s how the light gets in.”

The panel concluded with Assistant Professor of Philosophy Peter Steeves’ mind-bendingly succinct but sobering 15-step timeline of the birth and death of light. His only prompt, a DIY “power point” flashlight beam trained on sheets of white paper carrying dates, effectively underlined his observation that light’s lifespan is a relative blip within the sprawling chronology of the universe. In increasingly bad news, he pegged the lifespan of humans on earth at a mere million years and forecast our sun to end 6.5 billion years from now, when it will swallow up the earth. A hundred trillion years from now, all stars, the manufacturers of light, will have been extinguished. Our sun too, which ranks as a “Goldilocks of stars” (not big, not small), will succumb with one of the less remarkable star-death displays, he said.

Peter’s interest in the topic is rooted in “the overlap of philosophy and physics,” his twin loves, “and light plays a major role in that,” he said. “Light is not important in any fundamental way,” he concluded. “So I sometimes think, why do we make it so important? Why do we think it’s all about life and why do we think it’s all about light? That’s something I’ve been thinking about recently.”

DePaul Professor Steve Harp’s Project “In Sleep’s Dark Kingdom”

There is a crack in everything,

That’s how the light gets in.

Anthem, Leonard Cohen 

In Sleep’s Dark Kingdom, by DePaul faculty member Steve Harp, is an artist’s book created in response to the SSRC’s call for proposals to celebrate the UNESCO-designated International Year of Light.

My approach takes as its starting point the notion that conceptions of light are meaningless without framing notions of darkness. Light only enters the realm of perception out of a darkness.

In “The Hollow Men” (1925), T. S. Eliot writes of “death’s dream kingdom,” a place of disguises, with “eyes I dare not meet.” It is a kind of limbo, a twilight kingdom – a place between. The dream kingdom is also, of course, the place of sleep – itself a liminal zone between the clear consciousness of the light of day and the obscure darkness of unconsciousness.  If light is a metaphor for clarity or understanding, sleep has its own light emerging from darkness: the cold, crystalline clarity Freud posits residing in the dream continually hidden by layers of resistances obscuring it in metaphor, symbol, displacement.  Yet centrally, what Freud suggests is that the light of the dream (the latent content) can only become visible emerging from a darkness (the manifest content – always only known through its telling or representation, never through direct access to the dream “itself” – a kind of double cloaking or darkness).

My project touches on or suggests four “realms” or kingdoms of darkness, terrestrial and extraterrestrial, conscious and unconscious, in which light’s emergence from darkness and obscurity is to be celebrated all the more for its rarity and brevity. What I have attempted to do in this project – itself obscurely explained thus far – is to suggest darkness as an opportunity for light, darkness as the necessary frame allowing glimmers of light – of clarity, of understanding, of meaning, of hope – to break through and become manifest themselves.


Map Customizer

This summer I discovered a sweet mapping tool.  For a lot of researchers and writers, it can be tricky to get places plotted on a map.  Don’t get me wrong, I adore Google Maps, but it gets fairly tedious manually adding cities to the map.

Map Customizer allows you to enter a list of locations either manually (by typing) or by copying and pasting a list from a text editor or spreadsheet.  It uses Google Maps for mapping.


Once you’ve created your map, you can save it and point back to your map’s address.


I think it is super helpful if you have a lot of addresses, cities, or locations to enter and would really just like to do so by copying and pasting a list.

Are Chicago’s Safe Passage Routes Located in the Highest Risk Areas?

Safe passage routes to school not only provide a sense of safety for Chicago students from pre-K through high school, but also reduce crime involving students and help increase school attendance. Chicago’s Safe Passage program was introduced in 2009 after the beating death by gangs of 16-year-old Fenger High School honors student Derrion Albert, which was captured on cell phone video. His death and its circumstances received national attention, along with a series of other incidents involving CPS students caught in gang violence. Since then, the program has expanded to include schools, parents, residents, law enforcement officials and even local businesses in efforts to provide students with a safe environment. The 51 safe route programs currently available include several types: safe haven programs, in which students who fear for their safety can find refuge at the local police station, fire house, library and even convenience stores, barbershops and restaurants; patrols along school routes by veterans, parents and local residents; and walking-to-school programs, in which parents and local residents create a presence to help deter unlawful incidents.

The map below shows the number of all crimes committed in the city of Chicago during the current school year, and the locations of schools and safe routes among those communities that have safe routes. Currently, there are 517 Chicago public schools, of which only 136 (26.3% of all schools) fall within the 51 safe routes. Although the safe routes are generally located in 37 of the high-crime communities (the south, west and northeast sides of Chicago), they do not exist in the pockets with the highest number of crime incidents (1,500+, highlighted in burgundy), where children are the most vulnerable. Of the 47 schools that fall within the extreme crime areas (1,500+ incidents a year), only 6 have safe routes; the others offer no safe passage options. A list of the schools appears at the end of this post.


Schools located in extremely high-crime areas of Chicago (Schools highlighted in green have safe passage routes):
Bennett, Bowen HS, Bradwell, Camelot Safe – Garfield Park, Camelot Safe Academy, Clark HS, Coles, Community, Ericson, Frazier Charter, Frazier Prospective, Galapagos Charter, Great Lakes Charter, Gregory, Harlan HS, Hefferan, Heroes, Herzl, Hirsch HS, Hubbard HS, Learn Charter – Butler, Leland, Mann, Mireles, Noble Charter – Academy, Noble Charter – Baker College Prep, Noble Charter – DRW, Noble Charter – Muchin, Noble Charter – Rowe Clark, Oglesby, Plato, Polaris Charter, Powell, Schmid, Shabazz Charter – Shabazz, Smith, South Shore Intl HS, Webster, Westcott, Winnie Mandela HS, YCCS Charter – Association House, YCCS Charter – CCA Academy, YCCS Charter – Community Service, YCCS Charter – Innovations, YCCS Charter – Olive Harvey, YCCS Charter – Sullivan, YCCS Charter – Youth Development

 

Implementing visualization techniques in faculty research
The image of the map reflects the different visualization techniques that might be used to effectively convey data or research conclusions to different types of audiences in various disciplines or industries. Visualizations can help identify existing or emerging trends, spot irregularities or obscure patterns, and even address or solve issues.

Ask us how to visualize your research
For help visualizing your own research findings or seeing if your research lends itself to similar techniques including data acquisition and pre-processing of both quantitative and qualitative data, contact Nandhini Gulasingam at mgulasin@depaul.edu.

Economic Inequality According to Adam Smith

Eliminate poverty and economic inequality disappears.  Not so, says DePaul Political Science Professor David Lay Williams, who treated a recent Mess Hall audience at the SSRC to a preview chapter from ‘The Greatest of All Plagues’: Economic Inequality in Western Political Thought, a book he’s writing for Princeton University Press.


Returning to an examination of seminal free-marketeer Adam Smith, Williams traces the recurring theme of economic inequality throughout Smith’s writings, particularly in his less celebrated book, The Theory of Moral Sentiments.  And while he finds Smith’s solutions for alleviating desperate poverty stronger than those addressing economic inequality, he points out that Smith was quick to recognize potential pitfalls of inequality at the nascent roots of capitalism.

Smith, whose own 18th-century Scotland was marked by great economic inequality, ascribed its development to a combination of people’s tendencies to base their actions on self-interest, the desire for rank and distinction, and an appetite for both superiority and domination over others.  In commercial societies where people are considered responsible for their station in life, and where success is measured by wealth and poverty equals failure, two separate moral codes can evolve, observed Smith.  People’s inclination to worship the rich allows the rich to indulge in a very lax moral code, one that tolerates their foibles while subjecting the poor to life-long punishment for theirs.  Likewise, greater wealth will also enjoy greater political authority, continues Smith’s critique.

To Williams, relieving poverty wouldn’t address the pathologies Smith identified or control badly performing political institutions.  What Smith described as the “natural selfishness and rapacity” of the rich has both individual and societal implications.  Pitted against the morally corrupting effects on individual character that Smith warned of, the interests of the poor barely register on the radar of the rich, Williams said.  The more disproportionate the wealth, the more violently and unjustly the rich will treat the poor, a Smithian observation not generally remarked on, Williams noted.

In other chapters of his book, Williams will examine the issue of economic inequality through the lens of Plato, St. Augustine, Hobbes, Rousseau, Mill, and Marx.