Chapter 8
By Marit Brademann and Dyanna Gregory
Now that you have a prepped dataset, you’re ready to get it clean. What does that mean though? What exactly are clean data and what do we have to do get them that way?
Well, when we clean data, we’re going through and identifying incorrect information — wrong numbers, misspellings, etc. — and deciding whether to correct them (if they are correctable) or to remove them altogether. Like in data preparation, many data cleaning tasks are a combination of computerized tasks and manual work, since it is important for you to review the potential errors the computer identifies to see if they are, in fact, errors. Some of the items your computer flags as problems may turn out to just be extreme observations so it’s critical that you remain involved in the process. Don’t just automate it all or you risk the possibility of deleting valid data!
Let’s start with some of the most basic data cleaning procedures. We’re going to use Excel for many of these these examples, but you can use any spreadsheet or data manipulation software to perform these procedures. A list of programs is available in the Appendix.
Range checks are a very straightforward procedure that we use on numeric fields to see if any values in the dataset are above or below the most extreme acceptable values for that variable. Let’s use an example of homework scores. Pretend that you’re a college professor and your teaching assistants entered the first set of homework scores for the semester. You want to make sure they entered everything correctly, so you go into the dataset and sort by the column that contains the scores for the first homework, graded on a scale of 0-100. You see the first few rows:
Student ID | HW 1 Score |
---|---|
679372531 | 980 |
673540288 | 99 |
674082892 | 97 |
673923590 | 96 |
There is a score of 980, so one of the TAs probably accidentally typed a zero after a score that should have been 98. You would want to flag the record and ask the TAs what the actual score should have been.
Visual scans of data during range checks can also reveal other potential problems even within the official bounds of the data:
Student ID | HW 1 Score |
---|---|
674472019 | 78 |
679029425 | 75 |
671822390 | 74 |
671278927 | 9 |
Here, one score is much lower than the others. In this situation, it is possible that the score should have been entered as a 90 instead. This is another record that would make sense to flag and check with the source record. This is a good example of why it is important to be personally involved in the data checking process for variables that carry a lot of weight, like student grades or primary outcome variables for research projects. However, hand-checking all the variables in a dataset can be very time-consuming, especially if the dataset is large. Additionally, not all variables are created equal: it is up to you to decide which variables require involved personal attention and which can be checked automatically.
Range checks will work no matter what your range is. Maybe you’re expecting a decimal between 0 and 1, or your variable is normal body temperature in Fahrenheit so you’re expecting mostly values between 97.0 and 99.0, allowing for people that run cold and hot. You can always look and see if there are values stored in that variable group that are too low or high.
You can also use “Filter” commands to check for all values that are outside the acceptable range of a variable. However, this doesn’t catch values that are inside your range but that look “off” based on the rest of your data. Here, a basic range filter would detect the “980” score but not the “9” score. If you use filtering to do your range checks, it is a good idea to also use another method to look at the overall distribution of your data to catch any values that might seem “strange” in comparison to your other values.
Spell Check is another basic check that you can use to find problems in your dataset. We suggest doing this field-by-field rather than trying to do it across the whole dataset at once. The reason for this is that a word that might be considered a misspelling in one variable could be a valid word for another variable. A good example of this is the first name field. If you have a dataset with a first name field, many of those entries could trigger a spell check alert even though they are legitimate names. If instead you focus on a single field at a time, you can more quickly work through the dataset. In the example from the data preparation chapter where students were listing their major on a survey, say one of the students had just pulled an all-nighter and accidentally typed “Mtahmeitcs” instead of “Mathematics.” A spell check on the “Major” field in your dataset would quickly identify the misspelling and you could change it to “Math” or “Mathematics,” depending on which controlled vocabulary term you chose.
Another slightly more advanced type of data check involves pattern matching. This is the sort of check that you can use, for example, to make sure all the entries in a field are an email address. This involves something called regular expressions (often shortened to regex), which give you a way of telling the computer, “I only want things that look like {this} to be stored in that variable. Tell me if something in there doesn’t look like {this}.” The way that you indicate what {this} should be varies from program to program and can look a little complicated if you’ve never worked with it before. If you have ever used an asterisk (*) as a wildcard for searching, that’s actually part of a regex expression, so you already know a piece of it!
There are also pattern matching options in Excel and some advanced filter options that sometimes work even better. Check the resources section for links to more on regex and Excel filters and pattern matching.
You can also use combinations of fields for data checking. This is sometimes actually necessary because you have to look at all the fields together to tell if one or more of the fields are incorrect. If you do any of your banking online, you do this all the time without even realizing it. Your online bank record shows you several different fields that all have to make sense together, and if they don’t, red flags immediately go up in your mind. You have the source of the transaction, the amount of the transaction, the unit of currency that it’s in, if it’s a credit or a debit, the date the transaction occurred and the total balance of your account afterwards. All of these items are part of a dataset, and you’re doing a check on that dataset every time you pull up your account online to make sure everything looks okay. If the amount of the transaction was different from what you were expecting or the total of your account after the transaction was off, you would mentally flag it and call the bank to see what was up.
It’s the same with any other dataset. There may be fields that have to make sense together. Imagine that you’re working on a medical study of hospital patients and you’re tracking the medications they take daily by using three separate fields for medication type, the amount of the quantity of medication being administered, and the unit of the medication. So, for example, if the dataset read, “Aspirin, 500, mg” that would mean the patient took 500 mg of aspirin each day. Now imagine that you received a record that said, “Morphine, 200, lbs.” What would your reaction be? It’s logical that a hospital patient would be taking morphine and 200mg is a reasonable dosage, so the number alone wouldn’t raise flags, but even 1lb of morphine would kill someone so there’s definitely a problem there. You would probably want to go back to the patient’s record or to whoever entered the data to get the correct units.
If any of these fields are free response, there are an infinite number of combinations that you can receive. As such, you should go through your dataset early and identify groups of variables like this that need to work together so you can check records periodically as they come in for problems. Again, since this can be a time-consuming process, you need to decide how critical the cleanliness of these particular fields is to your end result, be it a specific visualization, a statistical analysis, or a general report.
As many of these cleaning procedures can be time-intensive, you will often have to decide which variables are worth cleaning, and which procedures you can justify using. But what if we just skip the data cleaning process altogether and leave the data “dirty”? The answer to that isn’t easy because it all depends how dirty your data are in the first place. At best, you’ll get lucky and your data will be minimally dirty and you won’t have any real impact on your end report. At worst, your results will be incorrect due to errors in your dataset that you could have potentially corrected if you had gone through data cleaning procedures.
Of course, your data may be in relatively good shape to begin with. If that’s the case, you might be able to ignore the cleaning process with little impact on your end product. However, until you at least go through the basic checks involved in data cleaning, there’s no real way for you to know how clean or dirty your data are. That said...
Data cleaning is just like house cleaning—you won’t ever catch everything. As hard as you may try, you can’t force people to input 100% correct data, and we make errors ourselves as we work with data. You want your data to be as accurate as possible, but there will always be a little dust in the corners so it’s important to accept that datasets are never perfect and to develop a sense for what is “good enough” for your purposes.
For example, you have a list of 1,000 contacts from your database of 100,000 contacts and you notice that 2 of the 1,000 have the first and last names together in one field. Do you take on a project of combing through and correcting all 100,000 records?
It depends.
You may make a formal decision that the data are either clean enough for your purposes or too dirty for what you want to accomplish. This really depends on the variable in question and what your intended end output is. If you’re responsible for checking that students correctly reported their majors for an internal report and you think that there was a 0.01% error rate, that’s probably of a much lower concern than if you’re checking a variable that is of critical importance to a safety report and you think there’s a possibility of a 5% error rate. Over time, you’ll get a better sense of how clean or dirty a dataset is, relatively speaking, and how labor-intensive the cleaning process will be for a particular group of variables. At that point, it’s good to consult the key stakeholders for the end product to see how much cleaning they agree is sensible to pursue. You should always aim to have your data as clean as possible but always remember that it won’t be 100% perfect.
Data preparation and cleaning have costs. If you hire someone to do this work for you, the cost is financial. If you’re going to do it yourself, it costs you or someone on your team time (and maybe a little sanity). So if you’ll never use a phone or fax number or need to refer to someone as Reverend, you may make the decision to delete those variables, stop collecting them, or just not worry about cleaning them in the future.
Once we’ve cleaned our data, we’re left with a brand new problem: how can we (and others!) verify that what we’ve done is correct and that we haven’t corrupted the data by making these changes? After all, the processed data may look vastly different from the raw data we started out with.
The simple answer is to document everything, particularly if you think you might want to share your data later on with a statistician or other researchers. When you’re cleaning your data, it’s always a good idea to save any changes as an entirely separate file: this way you’re always able to go back and look at what changed between the raw and processed data, what rows and columns were dropped, etc. It also ensures that you can go back to the unprocessed data if you ever want to slice things up a different way that might involve different cleaning procedures.
You should be careful to write a set of instructions as you go, documenting exactly what was done in each step to identify bad data and which data points were removed. It’s crucial to write this while you’re actually cleaning your data: it’s always easier to document as you go than it is to try and remember every step that you took after all is said and done. If you’re using point-and-click software to manage your data (like Excel), you should take special care to record exactly what steps were taken in cleaning the data since everything is done by hand, rather than by computer code that can be easily re-run later on. A good rule of thumb is that if you aren’t able to easily follow the instructions you wrote and end up with the same results a second time, you shouldn’t expect anyone else to be able to.
Regardless of how you choose to do it, good documentation of your cleaning procedures ensures that you can always justify why certain data points were removed and others weren’t and that others are able to verify that the data were cleaned competently and correctly. Additionally, if you think you might want to share the data with a statistician later on for further analysis, being able to see both the raw data and what operations were done on those data will make the statistician’s job much easier and quicker.