First, let’s get it out of the way: Data cleaning is invariably dull. It’s repetitive work and you need an eye for detail. But, I think we’ll all agree, it is essential.
A lot of our clients struggle with the concept and execution of clean data when we talk about it. Let’s break it down: What do we mean by clean data and how do you do it?
What is clean data?
Essentially, for us, there are two parts. One part is clean according to 7Sheep and the other is clean according to ‘humans’. The first part can be the harder one to communicate. Part of an IBM definition of a clean data set is:
All data types are accurately identified.
All columns contain the same data type. For example, in a numeric type column, there are no alpha-numeric strings.
Essentially, that’s it! Simple! Or is it? Something as simple as a telephone number can be stored as the correct number but still be ‘unclean’ or invalid according to the database. In 7Sheep, a telephone field is a numeric field option that allows a few special characters, such as + and parentheses. However, different cultures write telephone numbers in a variety of ways, so someone could be forcing an invalid result by separating digits with a full stop (period). The phone number may be correct, but the way it is stored is not, so your data isn’t clean. When all’s said and done, 7Sheep is a program: it responds to your commands and does what you tell it to. Clean data is entirely in your control.
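To make the telephone example concrete, here is a minimal Python sketch of that kind of check. It is not 7Sheep’s actual validation: the allowed character set (digits, an optional leading +, parentheses and spaces) and the function names are assumptions made purely for illustration.

```python
import re

# Assumed rule set, for illustration only: digits, an optional leading "+",
# parentheses and spaces. The real rules in any given database may differ.
ALLOWED_PHONE = re.compile(r"^\+?[0-9() ]+$")


def is_valid_phone(value: str) -> bool:
    """Return True if the value uses only the assumed allowed characters
    and contains at least one digit."""
    value = value.strip()
    has_digit = any(ch.isdigit() for ch in value)
    return bool(ALLOWED_PHONE.match(value)) and has_digit


def split_valid_invalid(numbers):
    """Split a list of phone strings into (valid, invalid) lists."""
    valid = [n for n in numbers if is_valid_phone(n)]
    invalid = [n for n in numbers if not is_valid_phone(n)]
    return valid, invalid
```

Under these assumed rules, a number written with full stops between the digits lands in the invalid list even though the digits themselves may be correct, which is exactly the situation described above.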
How can you clean data?
In 7Sheep it’s a multi-step process, each part of which is pretty straightforward. Here are a few ways to ensure that your data is as clean as possible:
Setting up your database
If your database is being filled in by your team, make sure that you have training and/or a manual on how the data should be filled in. Use tool tips in 7Sheep and guide your staff. If you know that people will use more than one e-mail address, include 2 fields (this avoids a common problem of unexpected results in the e-mail field). Plan it to be as simple, concise and foolproof as possible (note! this is just a convenient saying, we’re not calling a member of your team a fool!).
Valid vs Invalid Checking
Your results are divided into ‘valid’ and ‘invalid’ results. The ‘invalid’ results can be filtered and then checked for problems such as typos.
You can, of course, set fields to be unique in 7Sheep, so a unique primary e-mail address will usually cut down on the number of duplicates you have in the database. However, things slip through the net when people use other addresses, for example. 7Sheep makes deduplication easy.
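As a rough illustration of how duplicates that slip past a unique e-mail field might be surfaced, here is a minimal Python sketch. It is not how 7Sheep deduplicates: the record layout (dictionaries with hypothetical "name" and "email" keys) and the idea of matching on a normalised name are assumptions for the example.

```python
from collections import defaultdict


def normalise_name(name: str) -> str:
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def candidate_duplicates(records):
    """Group records by normalised name and return only the groups that
    contain more than one record. These are candidates for a manual
    check, not confirmed duplicates."""
    groups = defaultdict(list)
    for record in records:
        groups[normalise_name(record["name"])].append(record)
    return {name: group for name, group in groups.items() if len(group) > 1}
```

Two people can genuinely share a name, so a match like this should only prompt a human to look, never trigger an automatic merge.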
Outside of 7Sheep, data is commonly considered unclean when it contains factual errors, even if the database does not recognise them as such (e.g. a phone number can be filled in perfectly, but the number changed last year). Aged or old data is particularly susceptible to this. No one knows your users’ data better than the users themselves, so ask them. Build into your system a process of checking data with your users on a regular basis; every 6 to 12 months would be our recommendation. 7Sheep allows users to access their own record in your system (should you allow them) and edit it themselves. This means that users can make regular updates themselves, keeping you up to date and free of old data.
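The regular check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a 7Sheep feature: each record is assumed to carry a hypothetical "last_confirmed" date, and the review interval is set to 365 days, the upper end of the 6 to 12 months suggested above.

```python
from datetime import date, timedelta

# Assumed interval: the upper end of the 6 to 12 month recommendation.
REVIEW_INTERVAL = timedelta(days=365)


def records_due_for_review(records, today=None):
    """Return the records whose 'last_confirmed' date is older than
    REVIEW_INTERVAL. 'last_confirmed' is a hypothetical field name."""
    today = today or date.today()
    return [r for r in records if today - r["last_confirmed"] > REVIEW_INTERVAL]
```

The resulting list could feed a reminder e-mail asking each user to confirm or update their own record.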
Best Practice Guide
To receive our best practice guide on cleaning data, keeping it clean and processing it, fill in the form below: