Skip to page navigation menu Skip entire header
Brown University
Skip 14 subheader links

Center for Digital Scholarship

OpenRefine: A Power Tool for Working with Messy Data

Links for today’s (February 12, 2025) class

About

Home page for OpenRefine: http://openrefine.org

OpenRefine is a powerful tool for cleaning many kinds of data – numeric and textual, and can import from and also export to several useful formats. In this workshop we will focus on textual and numeric data that originates in tabular form, and will go through some basic views and transformations. After this introduction to OpenRefine, you can learn more from the many tutorials and examples on the web. “Cleaning data” sounds like some sort of janitorial activity — a more productive way to describe what we are doing is to say that we  are modeling and remodeling information. Changes in content and structure are important

  • they reflect a research  method
  • they require knowledge of the data and research context
  • need to be documented so they can be reproduced.

Finally, this will just scratch the surface of what OpenRefine can do. We are happy to meet with you and work out further solutions if you decide that this will be a useful tool. Interesting reading: Katie Rawson and Trevor Muñoz, “Against Cleaning” http://trevormunoz.com/notebook/2016/07/07/against-cleaning-curating-menus.html 

Install:

  • OpenRefine is a desktop application that runs in a local web server. You treat it like any other desktop application, but note that it uses a web browser as a user interface.
  • Details for installing, starting and quitting the application are here:  http://openrefine.org/download.html
  • OpenRefine won’t work on older versions of many browsers.

Documentation

Examples and Techniques:

Tasks

  1. Import file and types of source file
  2. Explore the dataset using Filtering and make some simple changes – OpenRefine works primarily on columns.  Use dropdown arrow to available actions.
    1. In the Notes column, what can you find out about which languages common? (Hint:  use some filters)
    2. Make the Notes column more readable by converting the cells to mixed case. (Hint:  From column dropdown, see Edit Cells>Common Transforms)
  3. Look at all the other cell options
  4. History and Undo (Tab on the left)
  5. Facet / Clustering
    1. Try applying a text facet to the Events column.
      1. Cluster the facet, explore the histograms on the rights.
      2. Try to normalize the values of the Events column by merging facets.
      3. (Note: if you want to know more about the clustering algorithms, see https://openrefine.org/docs/technical-reference/clustering-in-depth)
      4. Backtrack, make new column and try again… Don’t delete information!
    2. Apply a Timeline facet to the Date column
      1. Dates could be viewed as a timeline – try making a timeline facet from the date column.
      2. Why isn’t this working? Convert the column to actual dates. and try the timeline facet again.
      3. Clean up outliners using the timeline sliders on the left.
      4. Try making a new column based this value that only shows the year so it’s easier to read.
        value.datePart("years")
    1. Scatterplots (just to show that you can)
      1. Make sure date is a number and select the scatterplot facet from the date column dropdown.
  6. Column Menus (run through them)
  7. Rows and Records
    1. Notes column has lots of information in it. Try to normalize and sort
      1. Column dropdown: Edit Cells>Split multi-valued cell on the “;”.
      2. Facet and cluster
    2. Look at Dimensions: Split multivalued cell as before
      1. Remove numeric values: View as rows, select numeric values via regex. Filter on
        ^[\d\. X]+$
      2. Delete matching rows from “All” menu
      3. View again as records.
  8. GREL
    • Some expressions
      Place: edit cells value.replace(/;$/,"")
      Dimension: value.replace(/^(?:[^\d]*)(\d+\.\d+)?/,'$1')
      Various spots: value:replace(/;//)
      Note: value.match(/.*?(German|French|Hungarian).*?/)[0]
  9. Reconciling values
  10. Export