What is 'Clean Data', Anyway?

Clean data is consistent data. In order for Sheets or Excel to process and summarize your data effectively, they need the data to be a predictable arrangement and format. In some ways, it's easier to provide examples of the inverse; what does dirty data look like?

  • You have a column of names, but they don't have consistent capitalization or number of spaces between first and last names:
  • You copy a list of phone numbers from a website table into a Google Sheet. Now, though, you notice that most of your entries have space characters after each phone number.

  • You have a list of websites or email addresses, but the person doing the data entry made some typos, so some of the email addresses aren't actually valid:

You get the idea. Of course, you could go through your data set, one row at a time, and try and fix some of these problems by hand, but that's a time-consuming, error-prone way of attacking these problems. Sheets has some functions that you can use to solve some of these issues quickly and efficiently.

The TRIM and CLEAN Functions

Believe it or not, Sheets does have a CLEAN function, built right in. Before you get too excited, however, it's not really that useful. All it does is remove non-printable ASCII characters from your cells. Woooo! While this could be a lifesaver if you're copying-and-pasting data from a corrupted source document or a webpage, I personally haven't found it to be that useful.

What's much more useful is the TRIM function. If you point this function at another cell, it will remove extra leading and trailing space characters from the cell. This IS actually useful, and here's what it looks like:

See what happens when you apply the TRIM function to a column of names with inconsistent spacing. Best of all, TRIM will also remove hidden spaces AFTER words, which are invisible to the naked eye, but will skrillex your PivotTables and Filters.

Of course, you'll notice that the capitalization is still pretty wacky for this list of names. Sheets has some functions that will fix that too, namely...

The PROPER, UPPER, and LOWER Functions

These three functions all deal with capitalization. The first one, PROPER, will capitalize the first letter of each word in the target cell. Here's an example, where it turns "elle Amberson" into "Elle Amberson":

And here's what happens when you set it loose on our column of poorly-capitalized names:

Of course, maybe you want all lowercase letters. Or all uppercase letters. Well, the functions LOWER and UPPER can take care of that for you:

These may not seem like a big deal, but they can save you hours of manual data entry if you're unfortunate enough to have a dataset with inconsistent capitalization. Sheets does recognize capitalization in cells and it will sort and summarize words differently based on whether they're capitalized or not.

The ISEMAIL and ISURL Functions

These two functions are useful for checking a dataset for typos. They each simply look at a cell's contents and tell you if the contents are formatted properly as a valid website URL or email address. These formulas will just tell you, TRUE or FALSE, if a cell's contents look valid. For instance, for ISURL will look to see if a cell contains a URL that looks like it will actually work. If there's no top-level domain (like .com or .org) or if the top-level domain isn't recognized by Google (.c or .faceplant), the formula will return FALSE:

Likewise, ISEMAIL does pretty much the same thing, just with email addresses:

It's important to note that these two formulas don't actually try to visit the website in question or send email to the address provided; they don't actually test if the address is valid. All they do is look at the formatting and provide a judgement if the link or email address at least could possibly work. ISURL and ISEMAIL won't protect you against all typos, but they can help snag some of the low hanging fruit.