OpenRefine: A Power Tool for Working with Messy Data
Links for today’s (February 12, 2025) class
- Download & install OpenRefine
- The URL for today’s dataset is:
https://media.githubusercontent.com/media/MuseumofModernArt/collection/refs/heads/main/Artists.csv - Wikidata home
- an example of Wikidata as computer gobbledygook
- the same example formatted for humans (sort of)
- The URL for the Wikidata reconciliation service is:
https://wikidata.reconci.link/en/api
About
Home page for OpenRefine: http://openrefine.org
OpenRefine is a powerful tool for cleaning many kinds of data – numeric and textual, and can import from and also export to several useful formats. In this workshop we will focus on textual and numeric data that originates in tabular form, and will go through some basic views and transformations. After this introduction to OpenRefine, you can learn more from the many tutorials and examples on the web. “Cleaning data” sounds like some sort of janitorial activity — a more productive way to describe what we are doing is to say that we are modeling and remodeling information. Changes in content and structure are important
- they reflect a research method
- they require knowledge of the data and research context
- need to be documented so they can be reproduced.
Finally, this will just scratch the surface of what OpenRefine can do. We are happy to meet with you and work out further solutions if you decide that this will be a useful tool. Interesting reading: Katie Rawson and Trevor Muñoz, “Against Cleaning” http://trevormunoz.com/notebook/2016/07/07/against-cleaning-curating-menus.html
Install:
- OpenRefine is a desktop application that runs in a local web server. You treat it like any other desktop application, but note that it uses a web browser as a user interface.
- Details for installing, starting and quitting the application are here: http://openrefine.org/download.html
- OpenRefine won’t work on older versions of many browsers.
Documentation
- OpenRefine home page: http://openrefine.org
- OpenRefine Documentation wiki: https://github.com/OpenRefine/OpenRefine/wiki This is where the most definitive version of the documentation can be found. Good starting points:
- 3 videos that provide a great introduction
- OpenRefine 101 course – haven’t tried it yet.
Examples and Techniques:
- Example of use: http://dataist.wordpress.com/2012/04/10/tutorial-using-google-refine-to-clean-mortgage-data/
- in depth tutorial: http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning
- http://programminghistorian.org/lessons/cleaning-data-with-openrefine Basically the first few chapters of the OpenRefine book.
- Miriam Posner’s OpenRefine tutorial: http://miriamposner.com/classes/dh101f17/tutorials-guides/data-manipulation/get-started-with-openrefine/
- Grel reference: https://openrefine.org/docs/manual/grelfunctions
Tasks
- Import file and types of source file
- Create a new project
- Get source file from URL http://cds.library.brown.edu/projects/OpenRefine/Menu.csv Dataset is from the NY Public Library’s What’s on the Menu. (http://menus.nypl.org/data)
- Examine the import preview and settings for managing imported data
- Explore the dataset using Filtering and make some simple changes – OpenRefine works primarily on columns. Use dropdown arrow to available actions.
- In the Notes column, what can you find out about which languages common? (Hint: use some filters)
- Make the Notes column more readable by converting the cells to mixed case. (Hint: From column dropdown, see Edit Cells>Common Transforms)
- Look at all the other cell options
- History and Undo (Tab on the left)
- Facet / Clustering
- Try applying a text facet to the Events column.
- Cluster the facet, explore the histograms on the rights.
- Try to normalize the values of the Events column by merging facets.
- (Note: if you want to know more about the clustering algorithms, see https://openrefine.org/docs/technical-reference/clustering-in-depth)
- Backtrack, make new column and try again… Don’t delete information!
- Apply a Timeline facet to the Date column
- Dates could be viewed as a timeline – try making a timeline facet from the date column.
- Why isn’t this working? Convert the column to actual dates. and try the timeline facet again.
- Clean up outliners using the timeline sliders on the left.
- Try making a new column based this value that only shows the year so it’s easier to read.
value.datePart("years")
- Scatterplots (just to show that you can)
- Make sure date is a number and select the scatterplot facet from the date column dropdown.
- Try applying a text facet to the Events column.
- Column Menus (run through them)
- Rows and Records
- Notes column has lots of information in it. Try to normalize and sort
- Column dropdown: Edit Cells>Split multi-valued cell on the “;”.
- Facet and cluster
- Look at Dimensions: Split multivalued cell as before
- Remove numeric values: View as rows, select numeric values via regex. Filter on
^[\d\. X]+$ - Delete matching rows from “All” menu
- View again as records.
- Remove numeric values: View as rows, select numeric values via regex. Filter on
- Notes column has lots of information in it. Try to normalize and sort
- GREL
- Some expressions
Place: edit cells value.replace(/;$/,"")
Dimension: value.replace(/^(?:[^\d]*)(\d+\.\d+)?/,'$1')
Various spots: value:replace(/;//)
Note: value.match(/.*?(German|French|Hungarian).*?/)[0]
- Some expressions
- Reconciling values
- Export