Research Guides: Research Data Management Guide: Cleaning Data

What is Data Cleaning?

Cleaning your data involves taking steps to ensure that the compiled data points are complete, consistent, and correct. Data should conform to all rules in your data dictionary. In many cases, "clean" will also indicate that the data has been de-identified.

In short, "clean" means you have critically examined all the data as it was entered by human or machine, and you have verified that it is ready to be analyzed and to produce valid results.
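As a rough illustration of checking data against a data dictionary, the Python sketch below flags values that are missing, of the wrong type, outside an allowed range, or not in a permitted set. The field names, rules, and sample records are all hypothetical and are not taken from the tutorial; a real data dictionary would likely need more rule types.

```python
# Hypothetical data dictionary: each field has a type, a required flag,
# and optionally a set of allowed values or a numeric range.
DATA_DICTIONARY = {
    "participant_id": {"type": str, "required": True},
    "age": {"type": int, "required": True, "min": 18, "max": 99},
    "site": {"type": str, "required": True, "allowed": {"north", "south"}},
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, rule in dictionary.items():
        value = record.get(field)
        # Completeness: required fields must not be empty.
        if value is None or value == "":
            if rule.get("required"):
                problems.append(f"{field}: missing value")
            continue
        # Correctness: the value must have the documented type.
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
            continue
        # Consistency: the value must use the documented codes and ranges.
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{field}: {value!r} not in allowed values")
        if "min" in rule and value < rule["min"]:
            problems.append(f"{field}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{field}: {value} above maximum {rule['max']}")
    # Fields that the dictionary does not describe are also worth flagging.
    for field in record:
        if field not in dictionary:
            problems.append(f"{field}: not defined in data dictionary")
    return problems


# Made-up sample records for illustration only.
records = [
    {"participant_id": "P001", "age": 34, "site": "north"},
    {"participant_id": "P002", "age": 140, "site": "North"},
    {"participant_id": "", "age": 27, "site": "south"},
]

for row_number, record in enumerate(records, start=1):
    for problem in check_record(record):
        print(f"row {row_number}: {problem}")
```

A check like this only reports problems; deciding how to resolve each one (correct it, set it to missing, or go back to the source) remains a judgment call that should be documented.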

The link below, Part 1 of a 3-part tutorial, provides a more detailed discussion of what "clean" data entails and which aspects to think about.

Tutorial Part 1: Data cleaning for data sharing

Workflow for Data Cleaning

To clean data effectively and efficiently, establish a basic workflow to follow rather than approaching the problem haphazardly. Use reproducible methods as much as possible: for example, clean the data with code rather than by hand, and keep robust documentation and change logs.
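One way to combine code with a change log is sketched below in Python. It is a minimal example under stated assumptions, not the workflow from the tutorial: the column names, the two cleaning rules, and the inline sample data are hypothetical. The raw rows are never modified, and every change is recorded with the rule that caused it.

```python
import csv
import io


def clean_rows(rows):
    """Apply simple cleaning rules; return (cleaned_rows, change_log)."""
    cleaned, change_log = [], []

    def log(row_number, field, old, new, rule):
        change_log.append(
            {"row": row_number, "field": field, "old": old, "new": new, "rule": rule}
        )

    for row_number, row in enumerate(rows, start=1):
        new_row = dict(row)  # work on a copy so the raw data stays untouched
        for field, value in row.items():
            # Rule 1 (hypothetical): remove leading and trailing whitespace.
            stripped = value.strip()
            if stripped != value:
                log(row_number, field, value, stripped, "strip_whitespace")
            # Rule 2 (hypothetical): site codes are stored in lowercase.
            final = stripped.lower() if field == "site" else stripped
            if final != stripped:
                log(row_number, field, stripped, final, "lowercase_site")
            new_row[field] = final
        cleaned.append(new_row)
    return cleaned, change_log


# Inline sample data stands in for a raw file, for illustration only.
RAW_CSV = "participant_id,site\nP001, North\nP002,south \n"
raw_rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

cleaned, change_log = clean_rows(raw_rows)

for entry in change_log:
    print(entry)
```

Because the rules live in code, rerunning the script on the same raw file reproduces both the cleaned data and the log, which can be saved alongside the dataset as part of its documentation.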

The link below, Part 2 of a 3-part tutorial, suggests workflow steps and documentation to consider.

Tutorial Part 2: Creating a data cleaning workflow

Finally, the link below -- Part 3 of the 3-part tutorial -- walks through a real-world example to illustrate how to follow a data cleaning workflow.

Tutorial Part 3: Cleaning sample data in standardized way

Bad Data Handbook

Bad Data Handbook by Q. Ethan McCallum
Publication Date: 2012

What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it. Among the many topics covered, you'll discover how to: Test drive your data to see if it's ready for analysis; Work spreadsheet data into a usable form; Handle encoding problems that lurk in text data; Develop a successful web-scraping effort; Use NLP tools to reveal the real sentiment of online reviews; Address cloud computing issues that can impact your analysis effort; Avoid policies that create data analysis roadblocks; Take a systematic approach to data quality analysis

Data Cleaning Tools

OpenRefine
OpenRefine (previously Google Refine) is a powerful tool for cleaning and manipulating messy data. Data remains on your computer, not in the cloud, so privacy is maintained.
Data Wrangler
Wrangler allows interactive transformation of messy, real-world data into the data tables analysis tools expect. Export data for use in Excel, R, Tableau, Protovis, ...
Trifacta Wrangler
This free cloud service helps clean and prepare messy data quickly and accurately. As soon as you import datasets to Wrangler, it begins to organize and structure your data automatically, then suggests common transformations and aggregations.